Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This archive contains the dataset of properly-licensed Jupyter notebooks from the MSR'22 paper "A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts". The dataset contains 847,881 notebooks stored in a PostgreSQL dump file. You can find the details about the database in the README file. To transform the notebooks into this convenient format and to calculate the structural metrics, we used our library called Matroskin, which can be found here: https://github.com/JetBrains-Research/Matroskin.
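For example, after restoring the dump into a local PostgreSQL instance, the notebooks can be queried from Python. The snippet below is a minimal sketch using psycopg2; the database, table, and column names are placeholders, since the actual schema is documented in the dataset's README.

```python
# Hypothetical sketch: query the restored notebook database with psycopg2.
# The database, table, and column names ("notebooks", "id", "source") are
# placeholders; see the dataset README for the real schema.
import psycopg2

conn = psycopg2.connect(dbname="notebooks", user="postgres", host="localhost")
cur = conn.cursor()
cur.execute("SELECT id, source FROM notebooks LIMIT 5;")
for notebook_id, source in cur.fetchall():
    print(notebook_id, len(source))
cur.close()
conn.close()
```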
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A test dataset for MUSCLE (MUltiplexed Single-molecule Characterization at the Library scalE) data analysis. See "\Python codes for MUSCLE data analysis\README.txt" for instructions on running the data analysis code. Use the files in the "Test MUSCLE dataset" folder as input for the code. "Test MUSCLE dataset\Output_tile1" contains the code output for the test dataset. The example dataset corresponds to one MiSeq tile in an experiment analyzing dCas9-induced R-loop formation for a library of 256 different target sequences. The latest version of the Python code for matching single-molecule FRET traces with sequenced clusters is available at https://github.com/deindllab/MUSCLE/.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Multiplexed imaging technologies provide insights into complex tissue architectures. However, challenges arise due to software fragmentation with cumbersome data handoffs, inefficiencies in processing large images (8 to 40 gigabytes per image), and limited spatial analysis capabilities. To efficiently analyze multiplexed imaging data, we developed SPACEc, a scalable end-to-end Python solution that handles image extraction, cell segmentation, and data preprocessing, and incorporates machine-learning-enabled, multi-scale spatial analysis, operated through a user-friendly and interactive interface. The demonstration dataset was derived from a previous analysis and contains TMA cores from a human tonsil and tonsillitis sample that were acquired with the Akoya PhenoCycler-Fusion platform. The dataset can be used to test the workflow and establish it on a user's system, or to familiarize oneself with the pipeline.

Methods

Tissue samples: Tonsil cores were extracted from a larger multi-tumor tissue microarray (TMA), which included a total of 66 unique tissues (51 malignant and semi-malignant tissues, as well as 15 non-malignant tissues). Representative tissue regions were annotated on corresponding hematoxylin and eosin (H&E)-stained sections by a board-certified surgical pathologist (S.Z.). Annotations were used to generate the 66 cores, each 1 mm in diameter. FFPE tissue blocks were retrieved from the tissue archives of the Institute of Pathology, University Medical Center Mainz, Germany, and the Department of Dermatology, University Medical Center Mainz, Germany. The multi-tumor-TMA block was sectioned at 3 µm thickness onto SuperFrost Plus microscopy slides before being processed for CODEX multiplex imaging as previously described.

CODEX multiplexed imaging and processing: To run the CODEX machine, the slide was taken from the storage buffer and placed in PBS for 10 minutes to equilibrate. After drying the PBS with a tissue, a flow cell was sealed onto the tissue slide. The assembled slide and flow cell were then placed in a PhenoCycler Buffer made from 10X PhenoCycler Buffer & Additive for at least 10 minutes before starting the experiment. A 96-well reporter plate was prepared with each reporter corresponding to the correct barcoded antibody for each cycle, with up to 3 reporters per cycle per well. The fluorescence reporters were mixed with 1X PhenoCycler Buffer, Additive, nuclear-staining reagent, and assay reagent according to the manufacturer's instructions. With the reporter plate and assembled slide and flow cell placed into the CODEX machine, the automated multiplexed imaging experiment was initiated. Each imaging cycle included steps for reporter binding, imaging of three fluorescent channels, and reporter stripping to prepare for the next cycle and set of markers. This was repeated until all markers were imaged. After the experiment, a .qptiff image file containing individual antibody channels and the DAPI channel was obtained. Image stitching, drift compensation, deconvolution, and cycle concatenation are performed within the Akoya PhenoCycler software. The raw imaging data output (tiff, 377.442 nm per pixel for 20x CODEX) is first examined with QuPath software (https://qupath.github.io/) for inspection of staining quality. Any markers that produce unexpected patterns or low signal-to-noise ratios should be excluded from the ensuing analysis. The qptiff files must be converted into tiff files for input into SPACEc.
Data preprocessing includes image stitching, drift compensation, deconvolution, and cycle concatenation, performed using the Akoya PhenoCycler software. The raw imaging data (qptiff, 377.442 nm/pixel for 20x CODEX) files from the Akoya PhenoCycler technology were first examined with QuPath software (https://qupath.github.io/) to inspect staining quality. Markers with untenable patterns or low signal-to-noise ratios were excluded from further analysis. A custom CODEX analysis pipeline was used to process all acquired CODEX data (scripts available upon request). The qptiff files were converted into tiff files for tissue detection (watershed algorithm) and cell segmentation.
This repository contains all of the data used in the manuscript "Linearizing the vertical scale of an interferometric microscope and its effect on step-height measurement," by Thomas A. Germer, T. Brian Renegar, Ulf Griesmann, and Johannes A. Soons, which has been published in Surface Topography: Metrology and Properties volume 12, number 2, article 025012 on 8 May 2024. The repository also contains a Python Jupyter notebook that performs the analysis of the data and generates the figures in the manuscript.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset includes two JupyterLab notebook Python scripts and a folder of compressed data.
Due to the size of the dataset, we are unable to include the raw simulation output, but we have included all datasets required to plot the figures.
The Python script "original_analysis_script.ipynb" contains the code that was used to perform the analysis on the outputs from the simulations. It is not intended for public use.
The Python script "figure_plotting_scripts.ipynb" is a cut-down version of the original code that can be used to plot the figures as presented in the manuscript. All necessary data to achieve this can be found in the compressed folder "data_arrays".
The scripts will run in JupyterLab. Provided the user has the necessary Python modules available, the plotting script can be run by amending the paths for the figure output directory and the uncompressed "data_arrays" directory. These paths can be found at the top of the script.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

Annotation
The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation, and test sets follows the split of the original datasets.

Installation
pip install pandas pyarrow

Example
```python
import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
```
Example output:

```
dataset                                          AudioSet
filename                            train/---2_BBVHAA.mp3
captions_visual       [a man in a black hat and glasses.]
captions_auditory        [a man speaks and dishes clank.]
tags                                             [Speech]
```

Description
The annotation file consists of the following fields:
- filename: Name of the corresponding file (video or audio file)
- dataset: Source dataset associated with the data point
- captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
- captions_auditory: A list of captions related to the auditory content of the video
- tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided

Data files
The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, in the case of missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Description This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborated models.
Data usage The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494
Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.
Sample code As part of the data release, we are also including the sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for Machine Learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.
Units All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.
Missing data The string "NAN" indicates missing data.
File formats All time series data files are in CSV (comma-separated values) format. Images are given in tar.bz2 files.
Files
Folsom_irradiance.csv Primary One-minute GHI, DNI, and DHI data.
Folsom_weather.csv Primary One-minute weather data.
Folsom_sky_images_{YEAR}.tar.bz2 Primary Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2.
Folsom_NAM_lat{LAT}_lon{LON}.csv Primary NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node’s coordinates listed in Table I in the paper.
Folsom_sky_image_features.csv Secondary Features derived from the sky images.
Folsom_satellite.csv Secondary 10 pixel by 10 pixel GOES-15 images centered in the target location.
Irradiance_features_{horizon}.csv Secondary Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).
Sky_image_features_intra-hour.csv Secondary Sky image features for the intra-hour forecasting issuing times.
Sat_image_features_intra-day.csv Secondary Satellite image features for the intra-day forecasting issuing times.
NAM_nearest_node_day-ahead.csv Secondary NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location prepared for day-ahead forecasting.
Target_{horizon}.csv Secondary Target data for the different forecasting horizons.
Forecast_{horizon}.py Code Python script used to create the forecasts for the different horizons.
Postprocess.py Code Python script used to compute the error metric for all the forecasts.
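As a sketch of how the conventions above come together, the irradiance time series can be loaded with pandas, treating the string "NAN" as missing data and parsing the UTC timestamps. The timestamp column name used here is an assumption; check the CSV header.

```python
# Sketch: load Folsom_irradiance.csv honoring the stated conventions.
# "timeStamp" is an assumed column name; adjust to the actual header.
import pandas as pd

df = pd.read_csv(
    "Folsom_irradiance.csv",
    na_values="NAN",            # the string "NAN" marks missing data
    parse_dates=["timeStamp"],  # timestamps are UTC (YYYY-MM-DD HH:MM:SS)
    index_col="timeStamp",
)
print(df.describe())
```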
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Dataset necessary to reproduce the results and analyses presented in the paper:
Gourgue, O., van Belzen, J., Schwarz, C., Bouma, T.J., van de Koppel, J. & Temmerman, S. (2020) A convolution method to assess subgrid-scale interactions between flow and patchy vegetation in biogeomorphic models, Journal of Advances in Modeling Earth Systems, submitted.
The dataset contains:
Process-based model simulations, including their input files and the Python scripts to generate them (pre-processing), as well as the output files and Python scripts to post-process them.
Flume experiment data processed for the model calibration.
Python scripts to generate the figures and tables of the manuscript.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains full-scale visualizations as well as original data and code (in R and Python) to reproduce the figures and tables for "Critical Search." The data includes full-text data for the Hansard debates, and the code employs keyword search, topic modeling, and KL measurement.
Custom license: https://www.sodha.be/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34934/DVN/4WYRN9
This dataset comes from a study conducted in Poland with 44 participants. The goal of the study was to measure the personality traits known as the Dark Triad. The Dark Triad consists of three key traits that influence how people think and behave towards others: Machiavellianism, Narcissism, and Psychopathy. Machiavellianism refers to a person's tendency to manipulate others and be strategic in their actions. People with high Machiavellianism scores often believe that deception is necessary to achieve their goals. Narcissism is related to self-importance and the need for admiration. Individuals with high narcissism scores may see themselves as special and expect others to recognize their greatness. Psychopathy is linked to impulsive behavior and a lack of empathy. People with high psychopathy scores tend to be less concerned about the feelings of others and may take risks without worrying about consequences.

Each participant answered 30 questions, divided into three sections with 10 questions per trait. The answers were recorded using a Likert scale from 1 to 5, where 1 means "Strongly Disagree", 2 means "Disagree", 3 means "Neutral", 4 means "Agree", and 5 means "Strongly Agree". This scale measures how much a person agrees with statements related to each of the three traits.

The dataset also includes basic demographic information. Each participant has a unique ID (such as P001, P002, etc.) to keep their identity anonymous. The dataset records their age, which ranges from 18 to 60 years, and their gender, categorized as "Male," "Female," or "Other."

The responses in the dataset are realistic, with small variations to reflect natural differences in personality. On average, participants scored around 3.2 for Machiavellianism, meaning most people showed a moderate tendency to be strategic or manipulative. The average Narcissism score was 3.5, indicating that some participants valued themselves highly and sought admiration. The average Psychopathy score was 2.8, showing that most participants did not strongly exhibit impulsive or reckless behaviors.

This dataset can be useful for many purposes. Researchers can use it to analyze personality traits and compare them across different groups. The data can also be used for cross-cultural comparisons, allowing researchers to study how personality traits in Poland differ from those in other countries. Additionally, psychologists can use this data to understand how Dark Triad traits influence behavior in everyday life. The dataset is saved in CSV format, which makes it easy to open in programs like Excel, SPSS, or Python for further analysis. Because the data is structured and anonymized, it can be used safely for research without revealing personal information.

In summary, this dataset provides valuable insights into personality traits among people in Poland. It allows researchers to explore how Machiavellianism, Narcissism, and Psychopathy vary among individuals. By studying these traits, psychologists can better understand human behavior and how it affects relationships, decision-making, and personal success.
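Since the file is plain CSV, a minimal sketch of loading it and summarizing the per-trait scores with pandas might look as follows; the filename and the question-column naming scheme are illustrative assumptions, not part of the dataset documentation.

```python
# Hypothetical sketch: compute the mean score per Dark Triad trait.
# The filename and column prefixes are assumptions; adapt to the real CSV.
import pandas as pd

df = pd.read_csv("dark_triad.csv")  # assumed filename
for trait in ["Machiavellianism", "Narcissism", "Psychopathy"]:
    cols = [c for c in df.columns if c.startswith(trait)]
    print(trait, round(df[cols].mean(axis=1).mean(), 2))
```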
BuGL is a large-scale cross-language dataset for bug localization in code. BuGL consists of more than 10,000 bug reports drawn from open-source projects written in four programming languages, namely C, C++, Java, and Python. The dataset consists of information including bug reports and pull requests. BuGL aims to unfold new research opportunities in the area of bug localization.
Wake Vision is a large, high-quality dataset featuring over 6 million images, significantly exceeding the scale and diversity of current tinyML datasets (100x). This dataset includes images with annotations of whether each image contains a person. Additionally, it incorporates a comprehensive fine-grained benchmark to assess fairness and robustness, covering perceived gender, perceived age, subject distance, lighting conditions, and depictions. The Wake Vision labels are derived from Open Images' annotations, which are licensed by Google LLC under the CC BY 4.0 license. The images are listed as having a CC BY 2.0 license. Note from Open Images: "while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself."
To use this dataset:
```python
import tensorflow_datasets as tfds

ds = tfds.load('wake_vision', split='train')
for ex in ds.take(4):
  print(ex)
```
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/wake_vision-1.0.0.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
A dataset of 12-lead ECGs with annotations. The dataset contains 345 779 exams from 233 770 patients. It was obtained through stratified sampling from the CODE dataset (15% of the patients). The data was collected by the Telehealth Network of Minas Gerais in the period between 2010 and 2016.
This repository contains the file `exams.csv` and the files `exams_part{i}.zip` for i = 0, 1, 2, ..., 17.
In Python, one can read these files using h5py.
```python
import h5py
import numpy as np

# path_to_file points at one of the extracted HDF5 files
f = h5py.File(path_to_file, 'r')
# Get the exam ids
traces_ids = np.array(f['id_exam'])
# Handle to the signal dataset (not loaded into memory yet)
x = f['signal']
```
The `signal` dataset is too large to fit in memory, so don't convert it to a numpy array all at once.
It is possible to access a chunk of it using: ``x[start:end, :, :]``.
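For instance, one can iterate over the exams in fixed-size batches so that only one slice is resident in memory at a time (a sketch continuing the snippet above; the batch size is arbitrary):

```python
# Process the signal dataset in chunks instead of loading it whole.
batch_size = 1024  # arbitrary; tune to available memory
for start in range(0, x.shape[0], batch_size):
    batch = x[start:start + batch_size, :, :]
    # ... analyze `batch` here ...
```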
The CODE dataset was collected by the Telehealth Network of Minas Gerais (TNMG) in the period between 2010 and 2016. TNMG is a public telehealth system assisting 811 out of the 853 municipalities in the state of Minas Gerais, Brazil. The dataset is described in:
Ribeiro, Antônio H., Manoel Horta Ribeiro, Gabriela M. M. Paixão, Derick M. Oliveira, Paulo R. Gomes, Jéssica A. Canazart, Milton P. S. Ferreira, et al. “Automatic Diagnosis of the 12-Lead ECG Using a Deep Neural Network.” Nature Communications 11, no. 1 (2020): 1760. https://doi.org/10.1038/s41467-020-15432-4
The CODE 15% dataset is obtained through stratified sampling from the CODE dataset. This subset is described in, and used for assessing model performance in:
"Deep neural network estimated electrocardiographic-age as a mortality predictor"
Emilly M Lima, Antônio H Ribeiro, Gabriela MM Paixão, Manoel Horta Ribeiro, Marcelo M Pinto Filho, Paulo R Gomes, Derick M Oliveira, Ester C Sabino, Bruce B Duncan, Luana Giatti, Sandhi M Barreto, Wagner Meira Jr, Thomas B Schön, Antonio Luiz P Ribeiro. MedRXiv (2021) https://www.doi.org/10.1101/2021.02.19.21251232
The companion code for reproducing the experiments in the two papers described above can be found, respectively, in:
- https://github.com/antonior92/automatic-ecg-diagnosis; and in,
- https://github.com/antonior92/ecg-age-prediction.
Note about authorship: Antônio H. Ribeiro, Emilly M. Lima and Gabriela M.M. Paixão contributed equally to this work.
Project CodeNet is a large-scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. The code samples are written in over 50 programming languages (although the dominant languages are C++, C, Python, and Java) and they are annotated with a rich set of information, such as code size, memory footprint, CPU run time, and status, which indicates acceptance or error types. The dataset is accompanied by a repository, where we provide a set of tools to aggregate code samples based on user criteria and to transform code samples into token sequences, simplified parse trees, and other code graphs. A detailed discussion of Project CodeNet is available in this paper.
The rich annotation of Project CodeNet enables research in code search, code completion, code-code translation, and a myriad of other use cases. We also extracted several benchmarks in Python, Java and C++ to drive innovation in deep learning and machine learning models in code classification and code similarity.
Citation

```
@inproceedings{puri2021codenet,
  author = {Ruchir Puri and David Kung and Geert Janssen and Wei Zhang and Giacomo Domeniconi and Vladimir Zolotov and Julian Dolby and Jie Chen and Mihir Choudhury and Lindsey Decker and Veronika Thost and Luca Buratti and Saurabh Pujar and Ulrich Finkler},
  title = {Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks},
  year = {2021},
}
```
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.
The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv), and competitions meta-information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (markup_data.csv). As this table contains the numeric id of each code block's semantic type, we also provide a mapping from the id to the semantic class and subclass (vertices.csv).
Snippets information (code_blocks.csv) can be mapped to kernels metadata via kernel_id. Kernels metadata is linked to Kaggle competitions information through comp_name. To ensure the quality of the data, kernels_meta.csv includes only notebooks with an available Kaggle score.
Automatic classifications of code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to the code_blocks indices.
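A minimal sketch of linking the tables as described above, assuming the CSV files sit in the working directory:

```python
# Sketch: join code blocks to kernel metadata via kernel_id, then to
# competition metadata via comp_name, as described above.
import pandas as pd

code_blocks = pd.read_csv("code_blocks.csv")
kernels = pd.read_csv("kernels_meta.csv")
competitions = pd.read_csv("competitions_meta.csv")

merged = (
    code_blocks
    .merge(kernels, on="kernel_id", how="left")
    .merge(competitions, on="comp_name", how="left")
)
print(merged.head())
```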
The updated Code4ML 2.0 corpus includes kernels retrieved from Code Kaggle Meta. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
kernels_meta2.csv may contain kernels without a Kaggle score but with a leaderboard position (rank).
The Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.
WIDER FACE dataset is a face detection benchmark dataset, of which images are selected from the publicly available WIDER dataset. We choose 32,203 images and label 393,703 faces with a high degree of variability in scale, pose and occlusion as depicted in the sample images. WIDER FACE dataset is organized based on 61 event classes. For each event class, we randomly select 40%/10%/50% data as training, validation and testing sets. We adopt the same evaluation metric employed in the PASCAL VOC dataset. Similar to MALF and Caltech datasets, we do not release bounding box ground truth for the test images. Users are required to submit final prediction files, which we shall proceed to evaluate.
To use this dataset:
```python
import tensorflow_datasets as tfds

ds = tfds.load('wider_face', split='train')
for ex in ds.take(4):
  print(ex)
```
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/wider_face-0.1.0.png
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Rapidata Video Generation Runway Alpha Human Preference
If you get value from this dataset and would like to see more in the future, please consider liking it.
This dataset was collected in ~1 hour total using the Rapidata Python API, accessible to anyone and ideal for large-scale data annotation.
Overview
In this dataset, ~30'000 human annotations were collected to evaluate Runway's Alpha video generation model on our benchmark. The up-to-date benchmark can… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-video-human-preferences-runway-alpha.
See original article: Englund, O., Mola-Yudego, B., Börjesson, P., Cederberg, C., Dimitriou, I., Scarlat, N., Berndes, G. Large-scale deployment of grass in crop rotations as a multifunctional climate mitigation strategy. GCB Bioenergy.
Preprint: Englund, O., Mola-Yudego, B., Börjesson, P., Cederberg, C., Dimitriou, I., Scarlat, N., Berndes, G. (2022). Large-scale deployment of grass in crop rotations as a multifunctional climate mitigation strategy. EarthArXiv. Sept. 23. https://doi.org/10.31223/X5KW5J
This dataset is released along with the paper: “A Large Scale Benchmark for Uplift Modeling” Eustache Diemert, Artem Betlei, Christophe Renaudin; (Criteo AI Lab), Massih-Reza Amini (LIG, Grenoble INP)
This work was published in: AdKDD 2018 Workshop, in conjunction with KDD 2018.
This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising. It consists of 25M rows, each one representing a user with 11 features, a treatment indicator, and 2 labels (visits and conversions).
Here is a detailed description of the fields (they are comma-separated in the file):
The dataset was collected and prepared with uplift prediction in mind as the main task. Additionally, we can foresee related usages such as, but not limited to:
To use this dataset:
```python
import tensorflow_datasets as tfds

ds = tfds.load('criteo', split='train')
for ex in ds.take(4):
  print(ex)
```
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Hydrological and meteorological information can help characterize the conditions and risk factors affecting an environment and its inhabitants. Where observation sampling is limited, gridded datasets provide modeled information for areas in which data collection is infeasible, based on collected observations and known process relations. Although such datasets are available, users face barriers to their use: challenges in accessing, acquiring, and then analyzing data for small watershed areas, when these datasets were produced for large, continental-scale processes. In this tutorial, we introduce the Observatory for Gridded Hydrometeorology (OGH) to resolve such hurdles, in a use-case that incorporates NetCDF gridded datasets with processes developed to interpret the findings and apply secondary modeling frameworks (landlab).
LEARNING OBJECTIVES
- Familiarize with data management, metadata management, and analyses with gridded data
- Inspect and problem-solve with Python libraries
- Explore data architecture and processes
- Learn about the OGH Python library
- Discuss conceptual data engineering and science operations
Use-case operations:
1. Prepare computing environment
2. Get list of grid cells
3. NetCDF retrieval and clipping to a spatial extent (see the sketch after this list)
4. Extract NetCDF metadata and convert NetCDFs to 1D ASCII time-series files
5. Visualize the average monthly total precipitation
6. Apply summary values as modeling inputs
7. Visualize modeling outputs
8. Save results in a new HydroShare resource
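As an illustration of step 3, clipping a gridded NetCDF product to a watershed's bounding box can be done with xarray. This is a generic sketch, not the OGH API itself, and the file, variable, and coordinate names are assumptions about the gridded product.

```python
# Generic sketch of NetCDF clipping (not the OGH API). The file name and
# the lat/lon dimension names are assumptions about the gridded product.
import xarray as xr

ds = xr.open_dataset("gridded_hydromet.nc")  # hypothetical file name
# Note: the slice order for lat depends on whether the coordinate is
# stored ascending or descending in the file.
subset = ds.sel(lat=slice(47.0, 48.5), lon=slice(-123.0, -121.0))
subset.to_netcdf("watershed_subset.nc")
```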
For inquiries, issues, or to contribute to development, please refer to https://github.com/freshwater-initiative/Observatory