Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Self-built:
PIConGPU: https://github.com/franzpoeschel/picongpu/tree/smc2021-paper
GAPD: closed-source software, Git tag smc2021-paper in private repository
openPMD-api: https://github.com/franzpoeschel/openPMD-api/tree/smc2021-paper
ADIOS2: https://github.com/ornladios/ADIOS2, Git hash bf25ad59b8b15b9f48ddabad65a41f2050d3bd7f
libfabric: 1.6.3a1
Summit modules:
1) gcc/8.1.1
2) spectrum-mpi/10.3.1.2-20200121
3) cmake/3.18.2
4) git/2.20.1
5) cuda/10.1.243
6) boost/1.66.0
7) zlib/1.2.11
8) libpng/1.6.34
9) freetype/2.9.1
10) python/3.7.0-anaconda3-5.3.0
A dataset containing the monitoring of several hardware performance counters (HPCs) associated with 7 cache side-channel attacks (Spectre V1, V2, V4; Meltdown, ZombieLoad, Fallout, and Crosstalk), along with data obtained for 7 benign/benchmark programs (matrix multiplier, stress -c, stress -m, MiBench, STREAM, bzip2, and ffmpeg). All programs are run on Intel x86 architectures.

The attacks used to collect the data were selected by analyzing the characteristics of the machine and the available mitigations to determine whether it was vulnerable to each of them. The benign programs were selected mainly from benchmark suites that offer reliable and reproducible execution behavior, allowing for effective comparison with the attack workloads; benchmark suites with varied approaches were chosen to ensure good coverage of the dataset. Finally, the activity counters were selected based on a detailed analysis of the exploited vulnerabilities, prior work, and subsequent data analysis to confirm their validity. From this study, the following hardware counters were selected: branch-misses, branch-instructions, LLC-load-misses, L1-dcache-load-misses, and instructions.

Each file corresponds to one of the 14 programs executed to generate the values of the analyzed hardware counters and is identified by the name of the program associated with its execution.

For the data collection, it was necessary to identify and acquire the binary codes of the selected programs (benign and attacks). The source from which each code was obtained is listed below.

Malicious codes:
1) Meltdown GitHub: Institute of Applied Information Processing and Communications (IAIK), Meltdown, https://github.com/IAIK/meltdown
2) Spectre V1 GitHub: R. C. (crozone), SpectrePoC, https://github.com/crozone/SpectrePoC
3) Spectre V2 GitHub: A. C. (Anton-Cao), Spectrev2-poc, https://github.com/Anton-Cao/spectrev2-poc
4) Spectre V4 GitHub: Y. S. (mmxsrup), CVE-2018-3639, https://github.com/mmxsrup/CVE-2018-3639
5) ZombieLoad GitHub: Institute of Applied Information Processing and Communications (IAIK), ZombieLoad, https://github.com/IAIK/ZombieLoad
6) Fallout GitHub: T. H. (tristan-hornetz), Fallout, https://github.com/tristan-hornetz/fallout
7) Crosstalk GitHub: T. H. (tristan-hornetz), Crosstalk, https://github.com/tristan-hornetz/crosstalk

Benign codes:
1) Matrix multiplier: own code
2) stress -c UNIX tool: R. O. S. Projects, Stress, https://github.com/resurrecting-open-source-projects/stress
3) stress -m UNIX tool: R. O. S. Projects, Stress, https://github.com/resurrecting-open-source-projects/stress
4) MiBench Bitcount GitHub: Embecosm, MiBench, https://github.com/embecosm/mibench
5) STREAM GitHub: J. H. (jeffhammond), STREAM, https://github.com/jeffhammond/STREAM
6) bzip2 UNIX tool: https://sourceware.org/bzip2/
7) ffmpeg UNIX package: https://ffmpeg.org/
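The counters listed above are standard Linux perf events, so a comparable trace can be collected with perf stat. The sketch below is a minimal illustration only; the sampling interval, output file, helper function, and target command are assumptions, not the authors' actual collection setup.

```python
# Hedged sketch: sample the five counters used in this dataset while a target
# program runs, via Linux `perf stat`. Interval, output path, and the target
# command are illustrative assumptions, not the original collection procedure.
import subprocess

EVENTS = ("branch-misses,branch-instructions,"
          "LLC-load-misses,L1-dcache-load-misses,instructions")

def monitor(target_cmd, csv_path="counters.csv", interval_ms=100):
    """Run target_cmd under perf stat, writing interval counter readings to csv_path."""
    cmd = [
        "perf", "stat",
        "-e", EVENTS,            # the five counters named in the description
        "-I", str(interval_ms),  # print counts every interval_ms milliseconds
        "-x", ",",               # machine-readable, comma-separated output
        "-o", csv_path,          # write readings to a file
        "--",
    ] + list(target_cmd)
    subprocess.run(cmd, check=True)

# Example: monitor one of the benign workloads from the dataset
monitor(["stress", "-c", "1", "--timeout", "10"])
```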
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Welcome to the Music Informatics for Radio Across the GlobE (MIRAGE) MetaCorpus. The current (v0.2) development release consists of metadata (e.g., artist name, track title) and musicological features (e.g., instrument list, voice type, tempo) for 1 million events streaming on 10,000 internet radio stations across the globe, with 100 events from each station.
Users who wish to access, interact with, and/or export metadata from the MIRAGE-MetaCorpus may also visit the MIRAGE online dashboard at the following url:
The current MIRAGE-MetaCorpus is available under a CC BY 4.0 license. Users may cite the dataset as follows:
Sears, David R.W. “Music Informatics for Radio Across the Globe (MIRAGE) Metacorpus -- 2024”. Zenodo, July 19, 2024. https://doi.org/10.5281/zenodo.12786202.
Users accessing the MIRAGE-MetaCorpus using the online dashboard should also cite the following ISMIR paper:
Ngan V.T. Nguyen, Elizabeth A.M. Acosta, Tommy Dang, and David R.W. Sears. "Exploring Internet Radio Across the Globe with the MIRAGE Online Dashboard," in Proceedings of the 25th International Society for Music Information Retrieval Conference (San Francisco, CA, 2024).
This repository of the MIRAGE-MetaCorpus contains 81 metadata variables from the following open-access sources:
Each event also includes attribution metadata from the following commercial sources:
The metadata reflect information about each event's location (e.g., city, country), station (name, format, url), event (id, local time at station, etc.), artist (name, voice type, etc.), and track (e.g., title, year of release, etc.). For that reason, the MIRAGE-MetaCorpus includes the following datasets:
A subset of the MIRAGE-MetaCorpus is also available for events with metadata from online music libraries that reliably matched the event's description in the radio station's stream encoder:
If you are a copyright owner for any of the metadata that appears in the MIRAGE-MetaCorpus and would like us to remove your metadata, please contact the developer team at the following email address: miragedashboard@gmail.com
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID-19 Coronavirus Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vignesh1694/covid19-coronavirus on 14 February 2022.
--- Dataset description provided by original source is as follows ---
A SARS-like virus outbreak originating in Wuhan, China, is spreading into neighboring Asian countries, and as far afield as Australia, the US, and Europe.
On 31 December 2019, the Chinese authorities reported a case of pneumonia with an unknown cause in Wuhan, Hubei province, to the World Health Organisation (WHO)’s China Office. As more and more cases emerged, totaling 44 by 3 January, the country’s National Health Commission isolated the virus causing fever and flu-like symptoms and identified it as a novel coronavirus, now known to the WHO as 2019-nCoV.
The following dataset shows the numbers of coronavirus cases spreading across the globe.
Sno - Serial number
Date - Date of the observation
Province / State - Province or state of the observation
Country - Country of observation
Last Update - Recent update (not accurate in terms of time)
Confirmed - Number of confirmed cases
Deaths - Number of death cases
Recovered - Number of recovered cases
Thanks to Johns Hopkins CSSE for the live updates on Coronavirus and data streaming. Source: https://github.com/CSSEGISandData/COVID-19 Dashboard: https://public.tableau.com/profile/vignesh.coumarane#!/vizhome/DashboardToupload/Dashboard12
Inspired by the following work: https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6
--- Original source retains full ownership of the source dataset ---
Concentration-discharge relationships are a key tool for understanding the sourcing and transport of material from watersheds to fluvial networks. Storm events in particular provide insight into variability in the sources of solutes and sediment within watersheds, and the hydrologic pathways that connect hillslope to stream channel. Here we examine high-frequency sensor-based specific conductance and turbidity data from multiple storm events across two watersheds (Quebrada Sonadora and Rio Icacos) with different lithology in the Luquillo Mountains of Puerto Rico, a forested tropical ecosystem. Our analyses include Hurricane Maria, a category 5 hurricane. To analyze hysteresis, we used a recently developed set of metrics to describe and quantify storm events including the hysteresis index (HI), which describes the directionality of hysteresis loops, and the flushing index (FI), which describes whether the mobilization of material is source or transport limited. We also examine the role of antecedent discharge to predict hysteretic behavior during storms. Overall, specific conductance and turbidity showed contrasting responses to storms. The hysteretic behavior of specific conductance was very similar across sites, displaying clockwise hysteresis and a negative flushing index indicating proximal sources of solutes and consistent source limitation. In contrast, the directionality of turbidity hysteresis was significantly different between watersheds, although both had strong flushing behavior indicative of transport limitation. Overall, models that included antecedent discharge did not perform any better than models with peak discharge alone, suggesting that the magnitude and trajectory of an individual event was the strongest driver of material flux and hysteretic behavior. Hurricane Maria produced unique hysteresis metrics within both watersheds, indicating a distinctive response to this major hydrological event. The similarity in response of specific conductance to storms suggests that solute sources and pathways are similar in the two watersheds. The divergence in behavior for turbidity suggests that sources and pathways of particulate matter vary between the two watersheds. The use of high-frequency sensor data allows the quantification of storm events while index-based metrics of hysteresis allow for the direct comparison of complex storm events across a heterogeneous landscape and variable flow conditions.
Additional scripts for hysteresis analysis are available here in the 'python scripts for analysis' folder and at https://github.com/miguelcleon/HysteresisAnalysis/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MPEG-DASH datasets for the SLF4Web research project. SLF4Web is a Web-based implementation of a static light field consumption system; it allows SLF datasets to be adaptively streamed over the network (via MPEG-DASH) and then to be visualized in a vanilla Web browser. The datasets are encoded using the H.264/AVC video codec. A subset of the datasets are available in multiple qualities to allow for adaptive network streaming.
The SLF4Web source code is available on GitHub (https://github.com/EDM-Research/SLF4Web) and as a bundle at https://zenodo.org/badge/latestdoi/432214902.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary data associated with 'National-scale biogeography and function of river and stream bacterial biofilm communities'. Preprint is available at: https://doi.org/10.1101/2025.03.05.641783.
R scripts for data analysis and visualisation of this dataset are available on GitHub at: https://github.com/amycthorpe/biofilm_MAG_analysis.
Snakemake workflows to generate the results are available on GitHub at: https://github.com/amycthorpe/metag_analysis_EA and https://github.com/amycthorpe/EA_metag_post_analysis.
Environmental metadata:
Metagenome assembled genomes (MAGs):
Metabolic and functional traits:
Environmental drivers:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data package contains discharge and water quality data and model results for the Coal Creek Watershed in the central Rocky Mountains of Colorado, USA. Files include high-frequency stream chemistry data collected during the period of Dec 2015 to Jun 2018, and model results of water storage and flux. The dataset also includes dissolved organic carbon and sodium stream chemistry data for 2016. The model also incorporates USGS datasets of discharge and stream chemistry, for which data and citations are provided in the dataset files and the related-reference field. The model used, BioRT-Flux-PIHM, is a biogeochemical reactive transport model in the PIHM family of watershed codes (MM-PIHM); it is detailed in the reference paper (doi.org/10.1029/2018WR024257) and on GitHub (https://github.com/PSUmodeling/BioRT-Flux-PIHM).
The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents.
Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5.
The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of document classifiers, search engines and chatbots.
To ensure their accessibility to as wide an audience as possible, the Embeddings are distributed under the same licence as the Open Australian Legal Corpus.
The below code snippet illustrates how the Embeddings may be loaded and queried via the Hugging Face Datasets Python library:

```python
import itertools

import sklearn.metrics.pairwise
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
instruction = 'Represent this sentence for searching relevant passages: '

# Set streaming to False if you wish to load the entire dataset into memory
# (unadvised unless you have at least 64 GB of RAM).
oale = load_dataset('umarbutler/open_australian_legal_embeddings', split='train', streaming=True)
sample = list(itertools.islice(oale, 100000))

query = model.encode(instruction + 'Who is the Governor-General of Australia?', normalize_embeddings=True)

similarities = sklearn.metrics.pairwise.cosine_similarity([query], [embedding['embedding'] for embedding in sample])
most_similar_index = similarities.argmax()
most_similar = sample[most_similar_index]

print(most_similar['text'])
```
To speed up the loading of the Embeddings, you may wish to install orjson.

The Embeddings are stored in data/embeddings.jsonl, a json lines file where each line is a list of 384 32-bit floating point numbers. Associated metadata is stored in data/metadatas.jsonl and the corresponding texts are located in data/texts.jsonl.
The metadata fields are the same as those used for the Open Australian Legal Corpus, barring the text field, which was removed, and with the addition of the is_last_chunk key, which is a boolean flag for whether a text is the last chunk of a document (used to detect and remove corrupted documents when creating and updating the Embeddings).
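As a rough illustration of that layout, the three json lines files can be read in parallel and zipped back together. This is only a sketch of one possible approach: the read_jsonl helper is hypothetical, and it assumes line i of each file describes the same chunk (which follows from the description above) and that the files have already been downloaded locally.

```python
# Minimal sketch: stream the three parallel json lines files described above.
# Assumes local copies of the files and aligned line ordering across them.
import orjson

def read_jsonl(path):  # hypothetical helper, not part of the official tooling
    with open(path, 'rb') as f:
        for line in f:
            yield orjson.loads(line)

embeddings = read_jsonl('data/embeddings.jsonl')  # each line: list of 384 floats
metadatas = read_jsonl('data/metadatas.jsonl')    # each line: dict of metadata fields
texts = read_jsonl('data/texts.jsonl')            # each line: the chunk's text

for embedding, metadata, text in zip(embeddings, metadatas, texts):
    print(len(embedding), metadata.get('is_last_chunk'), str(text)[:80])
    break
```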
All documents in the Open Australian Legal Corpus were split into semantically meaningful chunks up to 512 tokens long (as determined by bge-small-en-v1.5's tokeniser) with the semchunk Python library. These chunks included a header embedding documents' titles, jurisdictions and types in the following format:

```
Title: {title}
Jurisdiction: {jurisdiction}
Type: {type}
{text}
```
The chunks were then vectorised by bge-small-en-v1.5 on a single GeForce RTX 2080 Ti with a batch size of 32 via the SentenceTransformers library.

The resulting embeddings were serialised as json-encoded lists of floats by orjson and stored in data/embeddings.jsonl. The corresponding metadata and texts (with their headers removed) were saved to data/metadatas.jsonl and data/texts.jsonl, respectively.
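A hedged sketch of that vectorise-and-serialise step is shown below. The chunks list is a made-up placeholder and the exact arguments are assumptions for illustration; the actual build code is linked in the next paragraph.

```python
# Illustrative sketch of the vectorise-then-serialise step described above.
# `chunks` is a placeholder list of header-prefixed chunk texts; batching and
# file handling are assumptions rather than the authors' actual build script.
import orjson
from sentence_transformers import SentenceTransformer

chunks = [
    'Title: Example Act 2000\nJurisdiction: commonwealth\nType: primary_legislation\nExample text ...',
]

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(chunks, batch_size=32, normalize_embeddings=True)

with open('data/embeddings.jsonl', 'wb') as f:
    for embedding in embeddings:
        f.write(orjson.dumps(embedding.tolist()) + b'\n')
```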
The code used to create and update the Embeddings may be found [here](https://github.com/umarbutler/open-australian-legal-embeddings-...
This dataset is one of a suite of products from the Nature’s Network project (naturesnetwork.org). Nature’s Network is a collaborative effort to identify shared priorities for conservation in the Northeast, considering the value of fish and wildlife species and the natural areas they inhabit. Brook Trout probability of occurrence is intended to provide predictions of occupancy (probability of presence) for catchments smaller than 200 km2 in the Northeast and Mid-Atlantic region from Virginia to Maine. The dataset provides predictions under current environmental conditions and for future increases in stream temperature. Brook Trout probability of occurrence (under current climate) is one input used in developing “Lotic Core Areas, Stratified by Watershed, Northeast U.S.” that is also part of Nature’s Network. Lotic core areas represent intact, well-connected rivers and stream reaches in the Northeast and Mid-Atlantic region that, if protected as part of stream networks and watersheds, will continue to support a broad diversity of aquatic species and the ecosystems on which they depend. The combination of lotic core areas, lentic (lake and pond) core areas, and aquatic buffers constitute the “aquatic core networks” of Nature’s Network. These and other datasets that augment or complement aquatic core networks are available in the Nature’s Network gallery: https://nalcc.databasin.org/galleries/8f4dfe780c444634a45ee4acc930a055.
Intended Uses
In the context of Nature’s Network, this dataset is primarily intended to be used in conjunction with the product “Lotic Core Areas, Stratified by Watershed, Northeast U.S.” to better understand the importance of core areas to Brook Trout. It also can be used on its own to identify priority watersheds for Brook Trout.
The dataset was originally developed for and is part of the Interactive Catchment Explorer (ICE). ICE (http://ice.ecosheds.org/) is a dynamic visualization interface for exploring catchment characteristics and environmental model predictions. ICE was created for resource managers and researchers to explore complex, multivariate environmental datasets and model results, to identify spatial patterns related to ecological conditions, and to prioritize locations for restoration or further study. ICE is part of the Spatial Hydro-Ecological Decision System (SHEDS).
Description and Derivation
The dataset provides predictions under current environmental conditions and for future increases in stream temperature of 2, 4, and 6 degrees Celsius. It employs a logistic mixed effects model to include the effects of landscape, land-use, and climate variables on the probability of Brook Trout occupancy in stream reaches (confluence to confluence). It includes random effects of HUC10 (watershed) to allow for the chance that the probability of occupancy and the effect of covariates were likely to be similar within a watershed. The fish data came primarily from state and federal agencies that sample streams for Brook Trout as part of regular monitoring. A stream is considered occupied if any Brook Trout were ever caught during an electrofishing survey between 1991 and 2010. The results are based on more than 15,000 samples from more than 13,000 catchments from all 13 Northeast states.
Factors that had a strong positive effect on Brook Trout occupancy included percent forest cover and summer precipitation. Factors that had a strong negative effect on occupancy included July stream temperature, percent agriculture, drainage area, and percent upstream impounded area.
Estimates of the probability of occupancy for each catchment with increases in stream temperature of 2, 4, or 6 degrees C are also provided. To produce these estimates, the input values for mean July stream temperature were simply increased by 2, 4, or 6 degrees C and the estimated occupancies recorded.
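Stated compactly, the occupancy model described above takes the standard logistic mixed-effects form. The notation below is only a paraphrase for orientation, not the authors' published specification (the original model also allowed covariate effects, not just the intercept, to vary by watershed):

$$\operatorname{logit}(p_{ij}) = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_{ij} + b_j, \qquad b_j \sim \mathcal{N}\!\left(0,\ \sigma^{2}_{\mathrm{HUC10}}\right)$$

where p_ij is the probability of Brook Trout occupancy in stream reach i of HUC10 watershed j, x_ij collects the covariates noted above (July stream temperature, percent forest, percent agriculture, drainage area, summer precipitation, percent upstream impounded area), and b_j is the watershed random effect. The warming scenarios re-evaluate this expression with mean July stream temperature increased by 2, 4, or 6 degrees C.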
More technical details about the Brook Trout probability of occurrence product are available at: http://conte-ecology.github.io/Northeast_Bkt_Occupancy/. Technical details about the regional stream temperature model, which is used in predicting Brook Trout occupancy, are available at: http://conte-ecology.github.io/conteStreamTemperature_northeast/.
Known Issues and Uncertainties
As with any project carried out across such a large area, this dataset is subject to limitations. The results by themselves are not a prescription for on-the-ground action; users are encouraged to verify, with field visits and site-specific knowledge, the value of any areas identified in the project. Known issues and uncertainties include the following:
Users are cautioned against using the data on too small an area (for example, a small segment of stream), as the data may not be sufficiently accurate at that level of resolution.
Uncertainties in predictions of stream temperature also result in uncertainties in Brook Trout occupancy estimates. Local effects of groundwater (which may provide cold-water refugia for Brook Trout) cannot be well accounted for in regional stream temperature models at this time. Catchments near waterbodies with water control structures such as dams may also have unreliable temperature predictions because the temperature model does not include information on release schedules or strategies.
Catchments with any Brook Trout occurrences reported in the past 30 years have been presumed to be occupied for purposes of the model. If local extirpations have occurred, this could lead to overprediction of the probability of Brook Trout occupancy.
Projections of effects of future temperature changes to Brook Trout occupancy are intended to convey a sense of the resilience of the species to changing temperatures. In reality, stream temperatures will not change at the same rate or uniformly, as some streams are more buffered against changing air temperatures than others.
Brook Trout occupancy predictions are not available in certain areas where surficial soil coarseness data were absent. These areas include the White Mountains of NH and mountainous areas in NY such as the Adirondacks.
As with any regional GIS data, errors in mapping and alignment of hydrography, development, agriculture, and a number of other data layers can affect the model results.
Attribute definitions
Source = data source
FEATUREID = unique identifier
NextDownID = unique identifier of catchment immediately downstream (-1 = none)
Shape_Leng = length of catchment in meters
Shape_Area = area of catchment in square meters
AreaSqKm = area of catchment in square kilometers
huc12 = 12 digit Hydrologic Unit Code for the watershed
stusps = state in which the catchment is located
agricultur = the percentage of the catchment that is covered by agricultural land (e.g. cultivated crops, orchards, and pasture) including fallow land.
elevation = mean elevation of catchment (m)
forest = the percentage of the catchment that is forested
summer_prc = mean precipitation per month in summer (mm)
UpAreaSqKM = drainage area upstream of catchment in square kilometers
occ_curren = probability of Brook Trout occupancy (current climate)
plus2 = probability of Brook Trout occupancy if stream temperature were to warm by 2 degrees C, relative to current climate
plus4 = probability of Brook Trout occupancy if stream temperature were to warm by 4 degrees C, relative to current climate
plus6 = probability of Brook Trout occupancy if stream temperature were to warm by 6 degrees C, relative to current climate
max_temp_0 = the maximum additional stream temperature (degrees C), on top of the current mean summer temperature for the catchment, that would be predicted to result in a 30% probability of occupancy for Brook Trout
max_temp_1 = the maximum additional stream temperature (degrees C), on top of the current mean summer temperature for the catchment, that would be predicted to result in a 50% probability of occupancy for Brook Trout
max_temp_2 = the maximum additional stream temperature (degrees C), on top of the current mean summer temperature for the catchment, that would be predicted to result in a 70% probability of occupancy for Brook Trout
meanSumme = mean summer stream temperature (C)
meanDays_1 = mean days per year that stream temperature exceeds 18 degrees C
meanDays_2 = mean days per year that stream temperature exceeds 22 degrees C
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset was supplied to the Bioregional Assessment Programme by a third party and is presented here as originally supplied. Metadata was not provided and has been compiled by the Bioregional Assessment Programme based on the known details at the time of acquisition.
The data includes level, salinity and temperature from gauges 203450 and 203470 in the Richmond catchment. This data is plotted against time for water quality analysis purposes.
This is a download from the open access NSW database at http://realtimedata.water.nsw.gov.au/water.stm
This data is a download from the open access NSW database
http://realtimedata.water.nsw.gov.au/water.stm
The data includes level, salinity and temperature from gauges 203450 and 203470 in the Richmond catchment.
Data was downloaded on 18/3/2015.
NSW Office of Water (2015) CLM - Richmond stream gauge data. Bioregional Assessment Source Dataset. Viewed 07 April 2016, http://data.bioregionalassessments.gov.au/dataset/03f59f6b-8d06-4513-b662-db7c4c2d2909.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advances in fluorescence microscopy enable monitoring larger brain areas in-vivo with finer time resolution. The resulting data rates require reproducible analysis pipelines that are reliable, fully automated, and scalable to datasets generated over the course of months. We present CaImAn, an open-source library for calcium imaging data analysis. CaImAn provides automatic and scalable methods to address problems common to preprocessing, including motion correction, neural activity identification, and registration across different sessions of data collection. It does this while requiring minimal user intervention, with good scalability on computers ranging from laptops to high-performance computing clusters. CaImAn is suitable for two-photon and one-photon imaging, and also enables real-time analysis on streaming data.
To benchmark the performance of CaImAn, we collected and combined a corpus of manual annotations from multiple labelers on nine mouse two-photon datasets, which are contained in this open access repository. We demonstrate that CaImAn achieves near-human performance in detecting the locations of active neurons.
In order to reproduce the results of the paper or download the annotations and the raw movies, please refer to the readme.md at:
https://github.com/flatironinstitute/CaImAn/blob/master/use_cases/eLife_scripts/README.md
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
```python
from datasets import load_dataset
import polars as pl

# Load the entire dataset via the Hugging Face Datasets library
RFSD = load_dataset('irlspbru/RFSD')

# Or read a single year of interest directly into polars
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
```
Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.
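If an approximately random sample is needed while streaming, one workaround (not part of the official instructions; the split name, seed, and buffer size below are assumptions) is the Datasets library's buffered shuffle:

```python
# Hedged sketch: approximate random sampling from the streamed dataset via a
# buffered shuffle. Split name, seed, and buffer size are illustrative choices.
import itertools
from datasets import load_dataset

rfsd_stream = load_dataset('irlspbru/RFSD', split='train', streaming=True)
shuffled = rfsd_stream.shuffle(seed=42, buffer_size=10_000)
sample = list(itertools.islice(shuffled, 1_000))
```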
Local File Import
Importing in Python requires the pyarrow package to be installed.

```python
import pyarrow.dataset as ds
import polars as pl

# Open the dataset from the local Parquet files
RFSD = ds.dataset("local/path/to/RFSD")

# Check the schema
print(RFSD.schema)

# Load the entire dataset into memory
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load only the 2019 data
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only the 2019 revenue information
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# Give the variables descriptive names using the supplied dictionary
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
```
R
Local File Import
Importing in R requires the arrow package to be installed.

```r
library(arrow)
library(data.table)

# Open the dataset from the local Parquet files
RFSD <- open_dataset("local/path/to/RFSD")

# Check the schema
schema(RFSD)

# Load the entire dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load only the 2019 data
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only the 2019 revenue information
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Give the variables descriptive names using the supplied dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
```
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023; Novatek filed only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove those filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode the structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to the house level in 2014 and 2021-2023, but only at the street level for 2015-2020 due to improper handling of the house number by Nominatim; in that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, the address of incorporation may not correspond to plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
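As a hedged illustration of that conditional subsetting, the sketch below mirrors the pyarrow pattern from the import examples to pull a single firm's revenue line across years; the tax ID value is a made-up placeholder and the column's stored type (string vs. integer) depends on the actual schema.

```python
# Sketch: read only one firm's revenue line (line_2110) across all years
# without loading the full dataset. The INN below is a placeholder.
import pyarrow.dataset as ds
import polars as pl

RFSD = ds.dataset('local/path/to/RFSD')
one_firm = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('inn') == '0000000000',  # replace with the firm's INN
        columns=['inn', 'year', 'line_2110'],    # keep only variables of interest
    )
)
print(one_firm)
```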
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous-year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,