13 datasets found
  1. Supplementary material: Transitioning from file-based HPC workflows to...

    • data.niaid.nih.gov
    Updated Jun 8, 2021
    Cite
    Davis, Philip E. (2021). Supplementary material: Transitioning from file-based HPC workflows to streaming data pipelines with openPMD and ADIOS2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4906275
    Explore at:
    Dataset updated
    Jun 8, 2021
    Dataset provided by
    Bussmann, Michael
    Huebl, Axel
    Klasky, Scott
    Podhorszki, Norbert
    Gu, Junmin
    Eisenhauer, Greg
    E, Juncheng
    Gainaru, Ana
    Poeschel, Franz
    Godoy, William F.
    Wan, Lipeng
    Davis, Philip E.
    Widera, René
    Koller, Fabian
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Used software versions

    Self-built:

    • PIConGPU: https://github.com/franzpoeschel/picongpu/tree/smc2021-paper
    • GAPD: closed-source software, Git tag smc2021-paper in a private repository
    • openPMD-api: https://github.com/franzpoeschel/openPMD-api/tree/smc2021-paper
    • ADIOS2: https://github.com/ornladios/ADIOS2, Git hash bf25ad59b8b15b9f48ddabad65a41f2050d3bd7f
    • libfabric: 1.6.3a1

    Summit modules:

    1) gcc/8.1.1
    2) spectrum-mpi/10.3.1.2-20200121
    3) cmake/3.18.2
    4) git/2.20.1
    5) cuda/10.1.243
    6) boost/1.66.0
    7) zlib/1.2.11
    8) libpng/1.6.34
    9) freetype/2.9.1
    10) python/3.7.0-anaconda3-5.3.0

  2. Replication Data for: Hardware Attack detectoR via Performance counters...

    • b2find.eudat.eu
    Updated Apr 15, 2025
    + more versions
    Cite
    (2025). Replication Data for: Hardware Attack detectoR via Performance counters analYsis Dataset (HARPY Dataset) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/d9254f6d-a98e-5d09-a066-64763d33adb4
    Explore at:
    Dataset updated
    Apr 15, 2025
    Description

    A dataset containing the monitoring of several hardware performance counters (HPCs) associated with 7 cache side-channel attacks (Spectre V1, V2, V4; Meltdown, ZombieLoad, Fallout, and Crosstalk), along with data obtained for 7 benign/benchmark programs (matrix multiplier, stress -c, stress -m, MiBench, STREAM, bzip2, and ffmpeg). All programs are run on Intel x86 architectures. The selection of the hardware attacks used to collect the data was done by analyzing the characteristics of the computer, as well as the available mitigations, to determine whether the machine was vulnerable to each of them. The selection of benign programs was mainly based on benchmark sets that offered reliable and reproducible execution behavior, allowing for effective comparison with the attack workloads. Different benchmark sets with varied approaches were selected to ensure optimal coverage of the dataset. Finally, the selection of activity counters was based on a detailed analysis of the exploited vulnerabilities, prior work, and, later, data analysis to ensure their validity. From this study, the following hardware counters were selected: branch-misses, branch-instructions, LLC-load-misses, L1-dcache-load-misses, and instructions. Each file corresponds to one of the 14 programs executed to generate the values of the analyzed hardware counters and is identified by the name of the program associated with its execution.

    For the data collection, it was necessary to identify and acquire the binary codes of the selected programs (benign and attacks). The source from which each code was obtained is listed below.

    Malicious codes:
    1) Meltdown GitHub: Institute of Applied Information Processing and Communications (IAIK), Meltdown, https://github.com/IAIK/meltdown
    2) Spectre V1 GitHub: R. C. (crozone), SpectrePoC, https://github.com/crozone/SpectrePoC
    3) Spectre V2 GitHub: A. C. (Anton-Cao), Spectrev2-poc, https://github.com/Anton-Cao/spectrev2-poc
    4) Spectre V4 GitHub: Y. S. (mmxsrup), CVE-2018-3639, https://github.com/mmxsrup/CVE-2018-3639
    5) ZombieLoad GitHub: Institute of Applied Information Processing and Communications (IAIK), ZombieLoad, https://github.com/IAIK/ZombieLoad
    6) Fallout GitHub: T. H. (tristan-hornetz), Fallout, https://github.com/tristan-hornetz/fallout
    7) Crosstalk GitHub: T. H. (tristan-hornetz), Crosstalk, https://github.com/tristan-hornetz/crosstalk

    Benign codes:
    1) Matrix Multiplier: own code
    2) stress -c UNIX tool: Resurrecting Open Source Projects, Stress, https://github.com/resurrecting-open-source-projects/stress
    3) stress -m UNIX tool: Resurrecting Open Source Projects, Stress, https://github.com/resurrecting-open-source-projects/stress
    4) MiBench Bitcount GitHub: Embecosm, MiBench, https://github.com/embecosm/mibench
    5) STREAM GitHub: J. H. (jeffhammond), STREAM, https://github.com/jeffhammond/STREAM
    6) bzip2 UNIX tool: https://sourceware.org/bzip2/
    7) ffmpeg UNIX package: https://ffmpeg.org/
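    The dataset's own collection tooling is not included in this listing. Purely as a hypothetical sketch of how such counter traces can be gathered on Linux, the snippet below drives `perf stat` with the event names from the description above; the `monitor` helper, the sampling interval, and the `stress` invocation are illustrative assumptions, not part of the published dataset:

    ```python
    # Hypothetical collection sketch; not part of the HARPY release.
    import subprocess

    EVENTS = [
        "branch-misses",
        "branch-instructions",
        "LLC-load-misses",
        "L1-dcache-load-misses",
        "instructions",
    ]

    def monitor(cmd, interval_ms=100):
        """Run `cmd` under perf stat, reading the selected counters every interval_ms ms."""
        perf_cmd = [
            "perf", "stat",
            "-e", ",".join(EVENTS),  # the counters named in the dataset description
            "-I", str(interval_ms),  # periodic counter read-out
            "-x", ",",               # machine-readable, comma-separated output
            "--",
        ] + cmd
        result = subprocess.run(perf_cmd, capture_output=True, text=True)
        return result.stderr         # perf stat writes its counts to stderr

    if __name__ == "__main__":
        # Example benign workload from the list above (assumes the stress tool is installed).
        print(monitor(["stress", "-c", "1", "--timeout", "5"]))
    ```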

  3. Music Informatics for Radio Across the GlobE (MIRAGE) MetaCorpus (v0.2)

    • zenodo.org
    csv
    Updated Nov 7, 2024
    Cite
    David R.W. Sears; David R.W. Sears (2024). Music Informatics for Radio Across the GlobE (MIRAGE) MetaCorpus (v0.2) [Dataset]. http://doi.org/10.5281/zenodo.12786202
    Explore at:
    csv (Available download formats)
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David R.W. Sears; David R.W. Sears
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 19, 2024
    Description

    Overview

    Welcome to the Music Informatics for Radio Across the GlobE (MIRAGE) MetaCorpus. The current (v0.2) development release consists of metadata (e.g., artist name, track title) and musicological features (e.g., instrument list, voice type, tempo) for 1 million events streaming on 10,000 internet radio stations across the globe, with 100 events from each station.

    Users who wish to access, interact with, and/or export metadata from the MIRAGE-MetaCorpus may also visit the MIRAGE online dashboard at the following url:

    Attribution

    The current MIRAGE-MetaCorpus is available under a CC BY 4.0 license. Users may cite the dataset as follows:

    Sears, David R.W. “Music Informatics for Radio Across the Globe (MIRAGE) Metacorpus -- 2024”. Zenodo, July 19, 2024. https://doi.org/10.5281/zenodo.12786202.

    Users accessing the MIRAGE-MetaCorpus using the online dashboard should also cite the following ISMIR paper:

    Ngan V.T. Nguyen, Elizabeth A.M. Acosta, Tommy Dang, and David R.W. Sears. "Exploring Internet Radio Across the Globe with the MIRAGE Online Dashboard," in Proceedings of the 25th International Society for Music Information Retrieval Conference (San Francisco, CA, 2024).

    Data Sources

    This repository of the MIRAGE-MetaCorpus contains 81 metadata variables from the following open-access sources:

    Each event also includes attribution metadata from the following commercial sources:

    Data Sets

    The metadata reflect information about each event's location (e.g., city, country), station (name, format, url), event (id, local time at station, etc.), artist (name, voice type, etc.), and track (e.g., title, year of release, etc.). For that reason, the MIRAGE-MetaCorpus includes the following datasets:

    • MIRAGE.csv -- the complete metacorpus (1 million)
    • events.csv -- all event-level metadata (1 million)
    • tracks.csv -- all track-level metadata (414,886)
    • artists.csv -- all artist-level metadata (259,783)
    • stations.csv -- all station-level metadata (10,000)
    • locations.csv -- all location-level metadata (4,324)

    A subset of the MIRAGE-MetaCorpus is also available for events with metadata from online music libraries that reliably matched the event's description in the radio station's stream encoder:

    • MIRAGE_reliable.csv (473,850)
    • events_reliable.csv (473,850)
    • tracks_reliable.csv (204,969)
    • artists_reliable.csv (80,005)
    • stations_reliable.csv (9,284)
    • locations_reliable.csv (4,142)
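    As a hedged illustration of how these CSVs might be combined, the sketch below joins event-level and station-level metadata with pandas. The file names come from the list above, but the join key ('station_id') and any other column names are hypothetical and should be checked against the corpus documentation or the online dashboard:

    ```python
    # Illustrative only; 'station_id' is an assumed join key, not a documented column name.
    import pandas as pd

    events = pd.read_csv("events.csv")        # 1 million rows, one per radio event
    stations = pd.read_csv("stations.csv")    # 10,000 rows, one per station

    # Attach station-level metadata (name, format, url) to each event.
    merged = events.merge(stations, on="station_id", how="left")
    print(merged.head())
    ```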

    Contact

    If you are a copyright owner for any of the metadata that appears in the MIRAGE-MetaCorpus and would like us to remove your metadata, please contact the developer team at the following email address: miragedashboard@gmail.com

  4. ‘COVID-19 Coronavirus Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 14, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19 Coronavirus Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-coronavirus-dataset-4bcc/6a53de38/?iid=022-156&v=presentation
    Explore at:
    Dataset updated
    Feb 14, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘COVID-19 Coronavirus Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vignesh1694/covid19-coronavirus on 14 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    A SARS-like virus outbreak originating in Wuhan, China, is spreading into neighboring Asian countries, and as far afield as Australia, the US, and Europe.

    On 31 December 2019, the Chinese authorities reported a case of pneumonia with an unknown cause in Wuhan, Hubei province, to the World Health Organisation (WHO)’s China Office. As more and more cases emerged, totaling 44 by 3 January, the country’s National Health Commission isolated the virus causing fever and flu-like symptoms and identified it as a novel coronavirus, now known to the WHO as 2019-nCoV.

    The following dataset tracks the numbers of coronavirus cases spreading across the globe.

    Content

    • Sno - Serial number
    • Date - Date of the observation
    • Province / State - Province or state of the observation
    • Country - Country of observation
    • Last Update - Recent update (not accurate in terms of time)
    • Confirmed - Number of confirmed cases
    • Deaths - Number of death cases
    • Recovered - Number of recovered cases
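    As a small, hypothetical example of working with these columns (the file name is assumed; the column names follow the list above), the latest cumulative totals per country can be extracted with pandas:

    ```python
    # Sketch only; 'covid_19_data.csv' is an assumed file name.
    import pandas as pd

    df = pd.read_csv("covid_19_data.csv", parse_dates=["Date"])

    # Confirmed/Deaths/Recovered are cumulative, so take the most recent row per country.
    latest = df.sort_values("Date").groupby("Country").tail(1)
    print(latest[["Country", "Confirmed", "Deaths", "Recovered"]]
          .sort_values("Confirmed", ascending=False)
          .head(10))
    ```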

    Acknowledgements

    Thanks to Johns Hopkins CSSE for the live updates on Coronavirus and data streaming. Source: https://github.com/CSSEGISandData/COVID-19 Dashboard: https://public.tableau.com/profile/vignesh.coumarane#!/vizhome/DashboardToupload/Dashboard12

    Inspiration

    Inspired by the following work: https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

    --- Original source retains full ownership of the source dataset ---

  5. LCZO-Stream Water Chemistry, Streamflow / Discharge, Hysteretic response of...

    • search.dataone.org
    • hydroshare.org
    • +1more
    Updated Aug 5, 2022
    Cite
    Adam Wymore; James B Shanley; William H McDowell; Miguel C Leon (2022). LCZO-Stream Water Chemistry, Streamflow / Discharge, Hysteretic response of solutes and turbidity at the event scale across forested tropical montane watersheds - Luquillo Experimental Forest (2016-2017) [Dataset]. http://doi.org/10.4211/hs.f8420c1447fe440eb93e656b2db0b64d
    Explore at:
    Dataset updated
    Aug 5, 2022
    Dataset provided by
    Hydroshare
    Authors
    Adam Wymore; James B Shanley; William H McDowell; Miguel C Leon
    Time period covered
    Aug 6, 2016 - Sep 21, 2017
    Area covered
    Description

    Concentration-discharge relationships are a key tool for understanding the sourcing and transport of material from watersheds to fluvial networks. Storm events in particular provide insight into variability in the sources of solutes and sediment within watersheds, and the hydrologic pathways that connect hillslope to stream channel. Here we examine high-frequency sensor-based specific conductance and turbidity data from multiple storm events across two watersheds (Quebrada Sonadora and Rio Icacos) with different lithology in the Luquillo Mountains of Puerto Rico, a forested tropical ecosystem. Our analyses include Hurricane Maria, a category 5 hurricane. To analyze hysteresis, we used a recently developed set of metrics to describe and quantify storm events including the hysteresis index (HI), which describes the directionality of hysteresis loops, and the flushing index (FI), which describes whether the mobilization of material is source or transport limited. We also examine the role of antecedent discharge to predict hysteretic behavior during storms. Overall, specific conductance and turbidity showed contrasting responses to storms. The hysteretic behavior of specific conductance was very similar across sites, displaying clockwise hysteresis and a negative flushing index indicating proximal sources of solutes and consistent source limitation. In contrast, the directionality of turbidity hysteresis was significantly different between watersheds, although both had strong flushing behavior indicative of transport limitation. Overall, models that included antecedent discharge did not perform any better than models with peak discharge alone, suggesting that the magnitude and trajectory of an individual event was the strongest driver of material flux and hysteretic behavior. Hurricane Maria produced unique hysteresis metrics within both watersheds, indicating a distinctive response to this major hydrological event. The similarity in response of specific conductance to storms suggests that solute sources and pathways are similar in the two watersheds. The divergence in behavior for turbidity suggests that sources and pathways of particulate matter vary between the two watersheds. The use of high-frequency sensor data allows the quantification of storm events while index-based metrics of hysteresis allow for the direct comparison of complex storm events across a heterogeneous landscape and variable flow conditions.

    Additional scripts for hysteresis analysis are available here in the 'python scripts for analysis' folder and at https://github.com/miguelcleon/HysteresisAnalysis/
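    For readers unfamiliar with these metrics, the sketch below shows one common formulation of the flushing and hysteresis indices on normalized concentration-discharge data. It is a rough approximation based on the general literature (e.g. Lloyd et al., 2016); the authors' actual implementation lives in the linked HysteresisAnalysis repository and may differ in detail:

    ```python
    # Rough sketch of FI/HI for a single storm event; not the authors' code.
    import numpy as np

    def _normalize(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min())

    def flushing_and_hysteresis(discharge, concentration):
        """Return (flushing index, hysteresis index) for one event.
        Assumes roughly monotonic rising and falling limbs."""
        qn, cn = _normalize(discharge), _normalize(concentration)
        peak = int(qn.argmax())

        # Flushing index: normalized concentration at peak flow minus at event start.
        fi = cn[peak] - cn[0]

        # Hysteresis index: rising- minus falling-limb concentration at matched discharge levels.
        levels = np.linspace(0.1, 0.9, 9)
        rising = np.interp(levels, qn[:peak + 1], cn[:peak + 1])
        falling = np.interp(levels, qn[peak:][::-1], cn[peak:][::-1])
        hi = float(np.mean(rising - falling))
        return fi, hi
    ```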

  6. SLF4Web - MPEG-DASH datasets of static light fields

    • data.niaid.nih.gov
    Updated Nov 29, 2021
    Cite
    Michiels, Nick (2021). SLF4Web - MPEG-DASH datasets of static light fields [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5730525
    Explore at:
    Dataset updated
    Nov 29, 2021
    Dataset provided by
    Lamotte, Wim
    Put, Jeroen
    Lievens, Hendrik
    Michiels, Nick
    Quax, Peter
    Wijnants, Maarten
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MPEG-DASH datasets for the SLF4Web research project. SLF4Web is a Web-based implementation of a static light field consumption system; it allows SLF datasets to be adaptively streamed over the network (via MPEG-DASH) and then visualized in a vanilla Web browser. The datasets are encoded using the H.264/AVC video codec. A subset of the datasets is available in multiple qualities to allow for adaptive network streaming.

    The SLF4Web source code is available on GitHub (https://github.com/EDM-Research/SLF4Web) and as a bundle at https://zenodo.org/badge/latestdoi/432214902.

  7. Data from: National-scale biogeography and function of river and stream...

    • zenodo.org
    Updated Mar 11, 2025
    Cite
    Amy Thorpe; Amy Thorpe; Susheel Bhanu Busi; Susheel Bhanu Busi; Jonathan Warren; Jonathan Warren; Laura Hunt; Kerry Walsh; Daniel Read; Daniel Read; Laura Hunt; Kerry Walsh (2025). National-scale biogeography and function of river and stream bacterial biofilm communities [Dataset]. http://doi.org/10.5281/zenodo.14947235
    Explore at:
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amy Thorpe; Amy Thorpe; Susheel Bhanu Busi; Susheel Bhanu Busi; Jonathan Warren; Jonathan Warren; Laura Hunt; Kerry Walsh; Daniel Read; Daniel Read; Laura Hunt; Kerry Walsh
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary data associated with 'National-scale biogeography and function of river and stream bacterial biofilm communities'. Preprint is available at: https://doi.org/10.1101/2025.03.05.641783.

    R scripts for data analysis and visualisation of this dataset are available on GitHub at: https://github.com/amycthorpe/biofilm_MAG_analysis.

    Snakemake workflows to generate the results are available on GitHub at: https://github.com/amycthorpe/metag_analysis_EA and https://github.com/amycthorpe/EA_metag_post_analysis.

    Environmental metadata:

    Metagenome assembled genomes (MAGs):

    • finalbins_coverage.csv - coverage of MAGs per sample
    • checkm_gtdb.csv - statistics calculated with CheckM2 for each MAG and MAG taxonomy with the GTDB-tk database
    • levins_median.csv - Levins' niche breadth index (Bn) calculated for each MAG, the associated P value (P.val) and adjusted P value (P.adj), N denoting above threshold of quantification (Below.NOQ), and identification as generalist or specialist (category, Bn > median Bn = generalist, Bn < median Bn = specialist)
    • singlem_results.csv - proportion of metagenomic reads assigned to bacteria, archaea and eukaryotes calculated with SingleM
    • env_with_seq_accessions.csv - ENA accessions for metagenomic reads
    • mag_accessions.csv - ENA accessions for dereplicated MAGs
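    As a quick, hedged example of using these tables (column names taken from the descriptions above; the 0.05 significance threshold is purely illustrative), generalist and specialist MAGs can be tallied from levins_median.csv with pandas:

    ```python
    # Sketch only; adjust the path and threshold to your setup.
    import pandas as pd

    levins = pd.read_csv("levins_median.csv")

    # 'category' labels MAGs as generalist (Bn > median Bn) or specialist (Bn < median Bn).
    significant = levins[levins["P.adj"] < 0.05]   # illustrative threshold
    print(significant["category"].value_counts())
    ```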

    Metabolic and functional traits:

    • metabolic_results.csv - presence of metabolic pathways in the MAGs generated using METABOLIC
    • metabolishmm_results.csv - presence of metabolic pathways in the MAGs generated using metabolisHMM
    • microtrait_results.csv - presence of functional traits identified in the MAGs using microTrait

    Environmental drivers:

    • varPart.csv - results of variance partitioning between MAGs and environmental metadata
    • correlations.csv - Pearson correlation coefficients (r_value, p_value and significance level) between environmental metadata and bacterial phyla.

  8. CO - Coal Creek - Distinct Source Water Chemistry Shapes Contrasting...

    • hydroshare.org
    • beta.hydroshare.org
    • +1more
    zip
    Updated Oct 9, 2023
    + more versions
    Cite
    Wei Zhi; Li Li; Wenming Dong; Wendy Brown; Jason Kaye; Carl Steefel; Kenneth Williams (2023). CO - Coal Creek - Distinct Source Water Chemistry Shapes Contrasting Concentration Discharge Patterns [Dataset]. https://www.hydroshare.org/resource/24b834aab72743db899b99404b48cb68
    Explore at:
    zip, 136 bytes (Available download formats)
    Dataset updated
    Oct 9, 2023
    Dataset provided by
    HydroShare
    Authors
    Wei Zhi; Li Li; Wenming Dong; Wendy Brown; Jason Kaye; Carl Steefel; Kenneth Williams
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 31, 2015 - Jun 27, 2018
    Area covered
    Description

    This data package contains discharge and water quality data and model results at Coal Creek Watershed in the central Rocky Mountains of Colorado, USA. Files include high-frequency stream chemistry data collected during the period of Dec 2015 to Jun 2018, and model results of water storage and flux. The dataset also includes dissolved organic carbon and sodium stream chemistry data for 2016. Our model also incorporates the USGS datasets of discharge and stream chemistry, for which data and citations are provided in the dataset files and the related-references field. The model used, BioRT-Flux-PIHM, is a biogeochemical reactive transport model in the PIHM family of watershed codes (MM-PIHM); it is detailed in the reference paper (doi.org/10.1029/2018WR024257) and on GitHub (https://github.com/PSUmodeling/BioRT-Flux-PIHM).

  9. Open Australian Legal Embeddings

    • kaggle.com
    Updated Nov 15, 2023
    + more versions
    Cite
    Umar Butler (2023). Open Australian Legal Embeddings [Dataset]. https://www.kaggle.com/datasets/umarbutler/open-australian-legal-embeddings
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle
    Authors
    Umar Butler
    Area covered
    Australia
    Description

    Open Australian Legal Embeddings ‍⚖️

    The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents.

    Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5.

    The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of document classifiers, search engines and chatbots.

    To ensure their accessibility to as wide an audience as possible, the Embeddings are distributed under the same licence as the Open Australian Legal Corpus.

    Usage 👩‍💻

    The below code snippet illustrates how the Embeddings may be loaded and queried via the Hugging Face Datasets Python library:

    ```python
    import itertools
    import sklearn.metrics.pairwise

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('BAAI/bge-small-en-v1.5')
    instruction = 'Represent this sentence for searching relevant passages: '

    # Set streaming to False if you wish to load the entire dataset into memory
    # (unadvised unless you have at least 64 GB of RAM).
    oale = load_dataset('umarbutler/open_australian_legal_embeddings', split='train', streaming=True)

    # Sample the first 100,000 embeddings.
    sample = list(itertools.islice(oale, 100000))

    # Embed a query.
    query = model.encode(instruction + 'Who is the Governor-General of Australia?', normalize_embeddings=True)

    # Identify the most similar embedding to the query.
    similarities = sklearn.metrics.pairwise.cosine_similarity([query], [embedding['embedding'] for embedding in sample])
    most_similar_index = similarities.argmax()
    most_similar = sample[most_similar_index]

    # Print the most similar text.
    print(most_similar['text'])
    ```

    To speed up the loading of the Embeddings, you may wish to install orjson.

    Structure 🗂️

    The Embeddings are stored in data/embeddings.jsonl, a json lines file where each line is a list of 384 32-bit floating point numbers. Associated metadata is stored in data/metadatas.jsonl and the corresponding texts are located in data/texts.jsonl.

    The metadata fields are the same as those used for the Open Australian Legal Corpus, barring the text field, which was removed, and with the addition of the is_last_chunk key, which is a boolean flag for whether a text is the last chunk of a document (used to detect and remove corrupted documents when creating and updating the Embeddings).
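    For users who download the raw files rather than streaming them, a minimal sketch of reading the first embedding/text pair is shown below. It assumes the files have been fetched locally and that each line of data/texts.jsonl is a JSON-encoded string, mirroring the structure described above:

    ```python
    # Sketch only; paths follow the Structure section, parsing assumptions are noted above.
    import orjson

    with open("data/embeddings.jsonl", "rb") as emb_f, open("data/texts.jsonl", "rb") as txt_f:
        for emb_line, txt_line in zip(emb_f, txt_f):
            embedding = orjson.loads(emb_line)  # list of 384 32-bit floats
            text = orjson.loads(txt_line)       # corresponding chunk text
            break                               # inspect only the first pair

    print(len(embedding), text[:80])
    ```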

    Creation 🧪

    All documents in the Open Australian Legal Corpus were split into semantically meaningful chunks up to 512-tokens-long (as determined by bge-small-en-v1.5's tokeniser) with the semchunk Python library. These chunks included a header embedding documents' titles, jurisdictions and types in the following format:

    ```perl
    Title: {title}
    Jurisdiction: {jurisdiction}
    Type: {type}
    {text}
    ```

    The chunks were then vectorised by bge-small-en-v1.5 on a single GeForce RTX 2080 Ti with a batch size of 32 via the SentenceTransformers library.

    The resulting embeddings were serialised as json-encoded lists of floats by orjson and stored in data/embeddings.jsonl. The corresponding metadata and texts (with their headers removed) were saved to data/metadatas.jsonl and data/texts.jsonl, respectively.

    The code used to create and update the Embeddings may be found [here](https://github.com/umarbutler/open-australian-legal-embeddings-...

  10. Brook Trout Probability of Occurrence, Plus 4 degrees C, Northeast U.S.

    • data.amerigeoss.org
    xml
    Updated Aug 20, 2022
    Cite
    United States (2022). Brook Trout Probability of Occurrence, Plus 4 degrees C, Northeast U.S. [Dataset]. https://data.amerigeoss.org/sl/dataset/brook-trout-probability-of-occurrence-plus-4-degrees-c-northeast-u-s-be6a2
    Explore at:
    xml (Available download formats)
    Dataset updated
    Aug 20, 2022
    Dataset provided by
    United States
    Area covered
    Northeastern United States, United States
    Description

    This dataset is one of a suite of products from the Nature’s Network project (naturesnetwork.org). Nature’s Network is a collaborative effort to identify shared priorities for conservation in the Northeast, considering the value of fish and wildlife species and the natural areas they inhabit. Brook Trout probability of occurrence is intended to provide predictions of occupancy (probability of presence) for catchments smaller than 200 km2 in the Northeast and Mid-Atlantic region from Virginia to Maine. The dataset provides predictions under current environmental conditions and for future increases in stream temperature. Brook Trout probability of occurrence (under current climate) is one input used in developing “Lotic Core Areas, Stratified by Watershed, Northeast U.S.” that is also part of Nature’s Network. Lotic core areas represent intact, well-connected rivers and stream reaches in the Northeast and Mid-Atlantic region that, if protected as part of stream networks and watersheds, will continue to support a broad diversity of aquatic species and the ecosystems on which they depend. The combination of lotic core areas, lentic (lake and pond) core areas, and aquatic buffers constitute the “aquatic core networks” of Nature’s Network. These and other datasets that augment or complement aquatic core networks are available in the Nature’s Network gallery: https://nalcc.databasin.org/galleries/8f4dfe780c444634a45ee4acc930a055.

    Intended Uses

    In the context of Nature’s Network, this dataset is primarily intended to be used in conjunction with the product “Lotic Core Areas, Stratified by Watershed, Northeast U.S.” to better understand the importance of core areas to Brook Trout. It also can be used on its own to identify priority watersheds for Brook Trout.

    The dataset was originally developed for and is part of the Interactive Catchment Explorer (ICE). ICE (http://ice.ecosheds.org/) is a dynamic visualization interface for exploring catchment characteristics and environmental model predictions. ICE was created for resource managers and researchers to explore complex, multivariate environmental datasets and model results, to identify spatial patterns related to ecological conditions, and to prioritize locations for restoration or further study. ICE is part of the Spatial Hydro-Ecological Decision System (SHEDS).

    Description and Derivation

    The dataset provides predictions under current environmental conditions and for future increases in stream temperature of 2, 4, and 6 degrees Celsius. It employs a logistic mixed effects model to include the effects of landscape, land-use, and climate variables on the probability of Brook Trout occupancy in stream reaches (confluence to confluence). It includes random effects of HUC10 (watershed) to allow for the chance that the probability of occupancy and the effect of covariates were likely to be similar within a watershed. The fish data came primarily from state and federal agencies that sample streams for Brook Trout as part of regular monitoring. A stream is considered occupied if any Brook Trout were ever caught during an electrofishing survey between 1991 and 2010. The results are based on more than 15,000 samples from more than 13,000 catchments from all 13 Northeast states.

    Factors that had a strong positive effect on Brook Trout occupancy included percent forest cover and summer precipitation. Factors that had a strong negative effect on occupancy included July stream temperature, percent agriculture, drainage area, and percent upstream impounded area.

    Estimates of the probability of occupancy for each catchment with increases in stream temperature of 2, 4, or 6 degrees C are also provided. To produce these estimates, the input values for mean July stream temperature were simply increased by 2, 4, or 6 degrees C and the estimated occupancies recorded.

    More technical details about the Brook Trout probability of occurrence product are available at: http://conte-ecology.github.io/Northeast_Bkt_Occupancy/. Technical details about the regional stream temperature model, which is used in predicting Brook Trout occupancy, are available at: http://conte-ecology.github.io/conteStreamTemperature_northeast/.

    Known Issues and Uncertainties

    As with any project carried out across such a large area, this dataset is subject to limitations. The results by themselves are not a prescription for on-the-ground action; users are encouraged to verify, with field visits and site-specific knowledge, the value of any areas identified in the project. Known issues and uncertainties include the following:

    • Users are cautioned against using the data on too small an area (for example, a small segment of stream), as the data may not be sufficiently accurate at that level of resolution.

    • Uncertainties in predictions of stream temperature also result in uncertainties in Brook Trout occupancy estimates. Local effects of groundwater (which may provide cold-water refugia for Brook Trout) cannot be well accounted for in regional stream temperature models at this time. Catchments near waterbodies with water control structures such as dams may also have unreliable temperature predictions because the temperature model does not include information on release schedules or strategies.

    • Catchments with any Brook Trout occurrences reported in the past 30 years have been presumed to be occupied for purposes of the model. If local extirpations have occurred, this could lead to overprediction of the probability of Brook Trout occupancy.

    • Projections of effects of future temperature changes to Brook Trout occupancy are intended to convey a sense of the resilience of the species to changing temperatures. In reality, stream temperatures will not change at the same rate or uniformly, as some streams are more buffered against changing air temperatures than others.

    • Brook Trout occupancy predictions are not available in certain areas where surficial soil coarseness data were absent. These areas include the White Mountains of NH and mountainous areas in NY such as the Adirondacks.

    • As with any regional GIS data, errors in mapping and alignment of hydrography, development, agriculture, and a number of other data layers can affect the model results.

      Attribute definitions

      Source = data source

      FEATUREID = unique identifier

      NextDownID = unique identifier of catchment immediately downstream (-1 = none)

      Shape_Leng = length of catchment in meters

      Shape_Area = area of catchment in square meters

      AreaSqKm = area of catchment in square kilometers

      huc12 = 12 digit Hydrologic Unit Code for the watershed

      stusps = state in which the catchment is located

      agricultur = the percentage of the catchment that is covered by agricultural land (e.g. cultivated crops, orchards, and pasture) including fallow land.

      elevation = mean elevation of catchment (m)

      forest = the percentage of the catchment that is forested

      summer_prc = mean precipitation per month in summer (mm)

      UpAreaSqKM = drainage area upstream of catchment in square kilometers

      occ_curren = probability of Brook Trout occupancy (current climate)

      plus2 = probability of Brook Trout occupancy if stream temperature were to warm by 2 degrees C, relative to current climate

      plus4 = probability of Brook Trout occupancy if stream temperature were to warm by 4 degrees C, relative to current climate

      plus6 = probability of Brook Trout occupancy if stream temperature were to warm by 6 degrees C, relative to current climate

      max_temp_0 = the maximum additional stream temperature (degrees C), on top of the current mean summer temperature for the catchment, that would be predicted to result in a 30% probability of occupancy for Brook Trout

      max_temp_1 = the maximum additional stream temperature (degrees C), on top of the current mean summer temperature for the catchment, that would be predicted to result in a 50% probability of occupancy for Brook Trout

      max_temp_2 = the maximum additional stream temperature (degrees C), on top of the current mean summer temperature for the catchment, that would be predicted to result in a 70% probability of occupancy for Brook Trout

      meanSumme = mean summer stream temperature (C)

      meanDays_1 = mean days per year that stream temperature exceeds 18 degrees C

      meanDays_2 = mean days per year that stream temperature exceeds 22 degrees C
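      As a hypothetical illustration of how the attribute table might be queried once exported to CSV (the file name and the 0.5 cut-off are assumptions; the field names are those defined above):

      ```python
      # Sketch only; 'brook_trout_occupancy.csv' is an assumed export of the attribute table.
      import pandas as pd

      catchments = pd.read_csv("brook_trout_occupancy.csv")

      # Catchments predicted to stay above 50% occupancy probability even at +4 degrees C.
      resilient = catchments[(catchments["occ_curren"] >= 0.5) & (catchments["plus4"] >= 0.5)]
      print(f"{len(resilient)} of {len(catchments)} catchments remain above 50% at +4 C")
      ```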

  11. CLM - Richmond stream gauge data

    • researchdata.edu.au
    • cloud.csiss.gmu.edu
    • +2more
    Updated Mar 30, 2016
    + more versions
    Cite
    Bioregional Assessment Program (2016). CLM - Richmond stream gauge data [Dataset]. https://researchdata.edu.au/clm-richmond-stream-gauge/1434675
    Explore at:
    Dataset updated
    Mar 30, 2016
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Bioregional Assessment Program
    License

    Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Abstract

    This dataset was supplied to the Bioregional Assessment Programme by a third party and is presented here as originally supplied. Metadata was not provided and has been compiled by the Bioregional Assessment Programme based on the known details at the time of acquisition.

    The data includes level, salinity and temperature from gauges 203450 and 203470 in the Richmond catchment. The data are plotted against time for water quality analysis purposes.

    This is a download from the open access NSW database at http://realtimedata.water.nsw.gov.au/water.stm

    Dataset History

    This data is a download from the open access NSW database

    http://realtimedata.water.nsw.gov.au/water.stm

    The data includes level, salinity and temperature from gauges 203450 and 203470 in the Richmond catchment.

    Data was downloaded on 18/3/2015.

    Dataset Citation

    NSW Office of Water (2015) CLM - Richmond stream gauge data. Bioregional Assessment Source Dataset. Viewed 07 April 2016, http://data.bioregionalassessments.gov.au/dataset/03f59f6b-8d06-4513-b662-db7c4c2d2909.

  12. Data from: CaImAn: An open source tool for scalable Calcium Imaging data...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Farzaneh Najafi (2020). CaImAn: An open source tool for scalable Calcium Imaging data Analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1659148
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Pat Gunn
    Eftychios A. Pnevmatikakis
    Brandon L. Brown
    Jiannis Taxidis
    David W. Tank
    Andrea Giovannucci
    Dmitri Chklovskii
    Baljit S. Khakh
    Sue Ann Koay
    Farzaneh Najafi
    Jeffrey L. Gauthier
    Johannes Friedrich
    Pengcheng Zhou
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advances in fluorescence microscopy enable monitoring larger brain areas in-vivo with finer time resolution. The resulting data rates require reproducible analysis pipelines that are reliable, fully automated, and scalable to datasets generated over the course of months. We present CaImAn, an open-source library for calcium imaging data analysis. CaImAn provides automatic and scalable methods to address problems common to preprocessing, including motion correction, neural activity identification, and registration across different sessions of data collection. It does this while requiring minimal user intervention, with good scalability on computers ranging from laptops to high-performance computing clusters. CaImAn is suitable for two-photon and one-photon imaging, and also enables real-time analysis on streaming data.

    To benchmark the performance of CaImAn we collected and combined a corpus of manual annotations from multiple labelers on nine mouse two-photon datasets, which are contained in this open-access repository. We demonstrate that CaImAn achieves near-human performance in detecting locations of active neurons.

    In order to reproduce the results of the paper or download the annotations and the raw movies, please refer to the readme.md at:

    https://github.com/flatironinstitute/CaImAn/blob/master/use_cases/eLife_scripts/README.md

  13. Data from: Russian Financial Statements Database: A firm-level collection of...

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Cite
    Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Ledenev, Victor
    Bondarkov, Sergey
    Skougarevskiy, Dmitriy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Russia
    Description

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    • 🔓 First open data set with information on every active firm in Russia.

    • 🗂️ First open financial statements data set that includes non-filing firms.

    • 🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    • 📅 Covers 2011-2023 initially, will be continuously updated.

    • 🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

    The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format (Apache Parquet) with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.

    The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

    Here we present the instructions for importing the data in an R or Python environment. Please consult the project repository for more information: http://github.com/irlcode/RFSD.

    Importing The Data

    You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.

    Python

    🤗 Hugging Face Datasets

    It is as easy as:

    ```python
    from datasets import load_dataset
    import polars as pl

    # This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder
    RFSD = load_dataset('irlspbru/RFSD')

    # Alternatively, this will download ~540MB with all financial statements for 2023
    # to a Polars DataFrame (requires about 8GB of RAM)
    RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
    ```

    Please note that the data is not shuffled within each year, meaning that streaming the first n rows will not yield a random sample.

    Local File Import

    Importing in Python requires the pyarrow package to be installed.

    ```python
    import pyarrow.dataset as ds
    import polars as pl

    # Read RFSD metadata from local file
    RFSD = ds.dataset("local/path/to/RFSD")

    # Use RFSD.schema to glimpse the data structure and columns' classes
    print(RFSD.schema)

    # Load full dataset into memory
    RFSD_full = pl.from_arrow(RFSD.to_table())

    # Load only 2019 data into memory
    RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

    # Load only revenue for firms in 2019, identified by taxpayer id
    RFSD_2019_revenue = pl.from_arrow(
        RFSD.to_table(
            filter=ds.field('year') == 2019,
            columns=['inn', 'line_2110']
        )
    )

    # Give suggested descriptive names to variables
    renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
    RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
    ```

    R

    Local File Import

    Importing in R requires the arrow package to be installed.

    ```r
    library(arrow)
    library(data.table)

    # Read RFSD metadata from local file
    RFSD <- open_dataset("local/path/to/RFSD")

    # Use schema() to glimpse into the data structure and column classes
    schema(RFSD)

    # Load full dataset into memory
    scanner <- Scanner$create(RFSD)
    RFSD_full <- as.data.table(scanner$ToTable())

    # Load only 2019 data into memory
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scanner <- scan_builder$Finish()
    RFSD_2019 <- as.data.table(scanner$ToTable())

    # Load only revenue for firms in 2019, identified by taxpayer id
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scan_builder$Project(cols = c("inn", "line_2110"))
    scanner <- scan_builder$Finish()
    RFSD_2019_revenue <- as.data.table(scanner$ToTable())

    # Give suggested descriptive names to variables
    renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
    setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
    ```

    Use Cases

    🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

    🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

    🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

    FAQ

    Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

    To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

    What is the data period?

    We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

    Why are there no data for firm X in year Y?

    Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

    We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

    Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek — in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.

    A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), was not active in that year. We remove those filings.

    Why is the geolocation of firm X incorrect?

    We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded to the house level in 2014 and 2021-2023, but only to the street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, the address of incorporation may not correspond to plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

    Why is the data for firm X different from https://bo.nalog.ru/?

    Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.

    Why is the data for firm X unrealistic?

    We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

    Why is the data for groups of companies different from their IFRS statements?

    We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

    Why is the data not in CSV?

    The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

    Version and Update Policy

    Version (SemVer): 1.0.0.

    We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous-year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and the timely availability of a new version. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.

    Licence

    Creative Commons License Attribution 4.0 International (CC BY 4.0).

    Copyright © the respective contributors.

    Citation

    Please cite as:

    @unpublished{bondarkov2025rfsd,
      title={{R}ussian {F}inancial {S}tatements {D}atabase},
      author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
      note={arXiv preprint arXiv:2501.05841},
      doi={https://doi.org/10.48550/arXiv.2501.05841},
      year={2025}}

    Acknowledgments and Contacts

    Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

    Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
