24 datasets found
  1. Data Visualization of Weight Sensor and Event Detection of Aifi Store

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Adeola Bannis (2024). Data Visualization of Weight Sensor and Event Detection of Aifi Store [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4292483
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Pei Zhang
    Carlos Ruiz
    Adeola Bannis
    Rahul S Hoskeri
    Hae Young Noh
    João Diogo Falcão
    Shijia Pan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Aifi Store is an autonomous store offering a cashier-less shopping experience, achieved by multi-modal sensing (vision, weight, and location modalities). The Aifi Nano store layout is shown in Fig 1 (image credits: AIM3S research paper).

    Overview: The store is organized into gondolas; each gondola has shelves that hold the products, and each shelf carries weight sensor plates. Data from these weight sensor plates is used to detect the event trigger (pick up, put down, or no event), from which the weight of the picked product can be determined.

    A gondola is similar to the vertical fixture of horizontal shelves found in any ordinary store; here each gondola has 5 to 6 shelves. Every shelf is in turn composed of weight-sensing plates, with around 12 plates on each shelf.

    Every plate has a sampling rate of 60 Hz, so 60 samples are collected every second from each plate.

    A pick-up event on a plate can be observed and marked when the weight sensor reading decreases over time; the reading increases over time when a put-down event happens.

    Event Detection:

    An event is detected if the moving variance calculated from the raw weight sensor reading exceeds a set threshold of 10,000 g² (0.01 kg²) over a sliding window of 0.5 seconds, i.e. half a second of samples (30 samples at 60 Hz).
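
    The check described above can be sketched with pandas as follows. This is a minimal illustration, not the provided Plot.py script; the file name and column names are assumptions based on the 5-column format described below and must be adjusted to the actual .csv headers.

    import pandas as pd

    THRESHOLD = 10000   # g^2, the moving-variance threshold stated above
    WINDOW = 30         # 0.5 s of samples at 60 Hz

    # Hypothetical file and column names (timestamp, reading, gondola, shelf, plate)
    df = pd.read_csv("weight_example.csv")
    plate = df[(df["gondola"] == 1) & (df["shelf"] == 1) & (df["plate"] == 1)]

    moving_var = plate["reading"].rolling(WINDOW).var()
    events = moving_var > THRESHOLD     # True where a pick-up/put-down event is flagged
    print(plate.loc[events, "timestamp"].head())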

    There are 3 types of events:

    Pick Up Event (Fig 2) = object taken by the customer from that particular gondola and shelf

    Put Down Event (Fig 3) = object placed back by the customer on that particular gondola and shelf

    No Event (Fig 4) = no object picked up from that shelf

    NOTE:

    1. The Python script must be in the same folder as the weight .csv files; the .csv files should not be placed in other subdirectories.

    2. The videos for the corresponding weight sensor data can be found in the "Videos" folder of the repository and are named similarly to their corresponding .csv files.

    3. Each video file consists of video data from 13 different camera angles.

    Details of the weight sensor files:

    These weight.csv files (baseline cases and team-specific cases) are from the AIFI CPS IoT 2020 week. There are over 50 cases in total, and each file has 5 columns (Fig 5): timestamp, reading (in grams), gondola, shelf, and plate number.

    Each file contains roughly 2-5 minutes of data indexed by timestamp. To unpack the date and time from a timestamp, use Python's datetime module.
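
    For example, a minimal sketch of converting one timestamp (this assumes the timestamps are Unix epoch seconds, which should be verified against the actual files):

    from datetime import datetime

    ts = 1585353133.25                   # hypothetical timestamp value from a weight.csv row
    print(datetime.fromtimestamp(ts))    # unpacked date and time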

    Details of the product.csv files:

    There are product.csv files for each test case. These files provide detailed information about the product name, the product location in the store (gondola number, shelf number, and plate number), the product weight (in grams), and a link to an image of the product.

    Instructions to run the script:

    To start analysing the weight.csv files using the Python script and plot the timeseries for the corresponding files:

    Download the dataset.

    Make sure the Python / Jupyter notebook file is in the same directory as the .csv files.

    Install the requirements: $ pip3 install -r requirements.txt

    Run the Python script Plot.py: $ python3 Plot.py

    After the script has run successfully, you will find a folder for each weight.csv file containing the figures (weight vs. timestamp), named in the format gondola_number,shelf_number.png - for example, 1,1.png (Fig 4) (timeseries graph).

    Instructions to run the Jupyter Notebook:

    Run the Plot.ipynb file using Jupyter Notebook, placing the .csv files in the same directory as Plot.ipynb.
    
  2. Machine Learning Majorite barometer - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Feb 6, 2021
    Cite
    (2021). Machine Learning Majorite barometer - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/1a523db9-b8d3-508d-9d69-3efed2629d00
    Dataset updated
    Feb 6, 2021
    Description

    A machine learning barometer (using Random Forest Regression) to calculate equilibration pressure for majoritic garnets. Updated 04/02/21 (previously 21/01/21 and 10/12/20).

    The barometer code

    The barometer is provided as Python scripts (.py) and Jupyter Notebook (.ipynb) files. These are completely equivalent to one another, and which is used depends on the user's preference. Separate instructions are provided for each.

    Data files included in this repository are:

    • "Majorite_database_04022021.xlsm" (Excel sheet of literature majoritic garnet compositions - inclusions (up to date as of 04/02/2021) and experiments (up to date as of 03/07/2020). This data includes all compositions that are close to majoritic, but some are borderline. Filtering as described in the paper accompanying this barometer is performed in the Python script prior to any data analysis or fitting.)

    • "lit_maj_nat_030720.txt" (Python script input file of experimental literature majoritic garnet compositions - taken from the dataset above)

    • "di_incs_040221.txt" (Python script input file of a literature compilation of majoritic garnet inclusions observed in natural diamonds - taken from the dataset above)

    The barometer as Jupyter Notebooks - including integrated Caret validation (added 21/01/2021)

    For those less familiar with Python, running the barometer as a Notebook is somewhat more intuitive than running the scripts below. It also has the benefit of including the RFR validation using Caret within a single integrated notebook. The Jupyter Notebook requires a suitable Python 3 environment (with the pandas, numpy, matplotlib, sklearn, rpy2 and pickle packages plus dependencies). We recommend installing the latest Anaconda Python distribution (found here: https://docs.anaconda.com/anaconda/install/) and creating a custom environment containing the required packages to run the Jupyter Notebook (as both Python 3 and R must be active in the environment). Instructions on this procedure can be found here (https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html); to assist, we have also provided a copy of the environment used to produce the scripts (barom-spec-file.txt). An identical conda environment (called myenv) can be created and used by:

    1) copying barom-spec-file.txt to a suitable location (i.e. your home directory)

    2) running the command: conda create --name myenv --file barom-spec-file.txt

    3) entering this environment: conda activate myenv

    4) running an instance of Jupyter Notebook by typing: jupyter notebook

    Two Notebooks are provided:

    • calculate_pressures_notebook.ipynb (equivalent to calculate_pressures.py described below)

    • rfr_majbar_10122020_notebook.ipynb (equivalent to rfr_majbar_10122020.py described below, but also including the integrated Caret validation performed using the rpy2 package in a single notebook environment)

    The barometer as scripts (10/12/2020)

    The scripts below need to be run in a suitable Python 3 environment (with the pandas, numpy, matplotlib, sklearn and pickle packages plus dependencies). For inexperienced users we recommend installing the latest Anaconda Python distribution (found here: https://docs.anaconda.com/anaconda/install/) and running in Spyder (a GUI scripting environment provided with Anaconda).

    Note - if running Python 3.7 (or earlier) you will need to install the pickle5 package to use the provided barometer files and comment/uncomment the appropriate lines in the "calculate_pressures.py" (lines 16/17) and "rfr_majbar_10122020.py" (lines 26/27) scripts.

    The user may additionally need to download and install the required packages if they are not provided with the Anaconda distribution (pandas, numpy, matplotlib, scikit-learn and pickle). This will be obvious because, when run, the script will return an error similar to "No module named XXXX". Packages can be installed either using the Anaconda package manager or on the command line / terminal via commands such as: conda install -c conda-forge pickle5. Appropriate command-line installation commands can be obtained by searching the Anaconda cloud at anaconda.org for each required package.

    A Python script (.py) is provided to calculate pressures for any majoritic garnet using the barometer calibrated in Thomson et al. (2021):

    • calculate_pressures.py takes an input file of any majoritic garnet compositions (an example input file is provided, "example_test_data.txt", containing inclusion compositions reported by Zedgenizov et al., 2014, Chemical Geology, 363, pp 114-124).

    • It employs the published RFR model and scaler, both provided as pickle files (pickle_model_20201210.pkl, scaler_20201210.pkl).

    The user simply edits the input file name in the provided .py script and then runs the script in a suitable Python 3 environment (requires the pandas, numpy, sklearn and pickle packages). The script initially filters the data for majoritic compositions (according to the criteria used for barometer calibration) and predicts pressures for these compositions. It writes out pressures and 2 x std_dev in the pressure estimates alongside the input data into "out_pressures_test.txt".

    If this script produces any errors or warnings, it is likely because the serialised pickle files provided are not compatible with the Python build being used (this is a common issue with serialised ML models). Please first try installing the pickle5 package and commenting/uncommenting lines 16/17. If this is unsuccessful, run the full barometer calibration script below (using the same input files as in Thomson et al. (2021), which are provided) to produce pickle files compatible with the Python build on the local machine (action 5 of the script below). Subsequently edit the filenames called in the "calculate_pressures.py" script (lines 22 & 27) to match the new barometer calibration files and re-run the calculate-pressures script.

    The output (predicted pressures) for the test dataset provided (using the published calibration) should be similar to the following results:

    P (GPa)  error (GPa)
    17.0   0.4
    16.6   0.3
    19.5   1.3
    21.8   1.3
    12.8   0.3
    14.3   0.4
    14.7   0.4
    14.4   0.6
    12.1   0.6
    14.6   0.5
    17.0   1.0
    14.6   0.6
    11.9   0.7
    14.0   0.5
    16.8   0.8

    Full RFR barometer calibration script - rfr_majbar_10122020.py

    The RFR barometer calibration script used and described in Thomson et al. (2021). This script performs the following actions:

    1) filters the input data and outputs the filtered data as a .txt file (which is the input expected by the RFR validation script using the R package Caret)

    2) fits 1000 RFR models, each using a randomly selected training dataset (70% of the input data)

    3) performs leave-one-out validation

    4) plots figure 5 from Thomson et al. (2021)

    5) fits one single RFR barometer using all input data (saves this and the scaler as .pkl files with a datestamp, for use in the calculate_pressures.py script)

    6) calculates the pressure for all literature inclusion compositions over 100 iterations with randomly distributed compositional uncertainties added, and writes the mean pressure and 2 standard deviations alongside the input inclusion compositions to the .txt output file "diout.txt"

    7) plots the global distribution of majoritic inclusion pressures

    The RFR barometer can easily be updated to include (or exclude) additional experimental compositions by modifying the literature data input files provided.

    RFR validation using Caret in R (script titled "RFR_validation_03072020.R")

    Additional validation tests of the RFR barometer are completed using the Caret package in R. This requires the filtered experimental dataset file "data_filteredforvalidation.txt" (which is generated by the rfr_majbar_10122020.py script if required for a new dataset). It performs bootstrap, K-fold and leave-one-out validation, and outputs validation statistics for 5, 7 and 9 input variables (elements).

    Please email Andrew Thomson (a.r.thomson@ucl.ac.uk) if you have any questions or queries.
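
    The pressure-calculation workflow described above can be sketched as follows. This is a minimal illustration, not the authors' calculate_pressures.py; the pickle file names match those listed in the record, but the whitespace-delimited input layout and the expected feature columns are assumptions that must match what the published script uses.

    import pickle
    import pandas as pd

    # Load the published scaler and Random Forest Regression model (pickle files
    # provided with the repository).
    with open("scaler_20201210.pkl", "rb") as f:
        scaler = pickle.load(f)
    with open("pickle_model_20201210.pkl", "rb") as f:
        model = pickle.load(f)

    # Read garnet compositions (example input file provided with the repository).
    compositions = pd.read_csv("example_test_data.txt", sep=r"\s+")

    X = scaler.transform(compositions.values)   # scale inputs as during calibration
    pressures_gpa = model.predict(X)            # predicted equilibration pressures (GPa)
    print(pressures_gpa[:5])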

  3. Post-processed neural and behavioral data class with reward-relative cells identified

    • plus.figshare.com
    bin
    Updated May 17, 2025
    Cite
    Marielena Sosa; Mark Plitt; Lisa Giocomo (2025). Post-processed neural and behavioral data class with reward-relative cells identified [Dataset]. http://doi.org/10.25452/figshare.plus.27138633.v1
    Available download formats: bin
    Dataset updated
    May 17, 2025
    Dataset provided by
    Figshare+
    Authors
    Marielena Sosa; Mark Plitt; Lisa Giocomo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains 2 pickled Python files containing the post-processed data used for: Sosa, Plitt, Giocomo, 2025. A flexible hippocampal population code for experience relative to reward. Nature Neuroscience. https://doi.org/10.1038/s41593-025-01985-4

    Hippocampal neurons were recorded in dorsal CA1 using 2-photon calcium imaging and synchronized with virtual reality behavior in head-fixed mice. This data is referred to as "post-processed" because shuffles have already been run to identify cells as "reward-relative" or not; therefore, the cell ID labels in this dataset will allow exact replication of any figures in the paper that do not require an additional shuffle of the data. The pickle or dill package in Python is required to load this dataset. See the accompanying dataset: Pre-processed neural and behavioral data, including raw fluorescence. See the code base on GitHub for usage and additional documentation.

    File name format: m[mouse-number-range]_expdays[list-of-day-numbers]_multiDayData_dff_[date saved, yyyymm]

    Each pickle file is a Python dictionary, either for (1) only the experimental days where a reward zone location was switched on a virtual linear track (3-5-7-8-10-12-14) or (2) for all days, including days where the reward zone remained in the same location (1...14). In pickle (2), the data on the switch days are identical to pickle (1) -- we have provided both options to allow users to download a smaller file size if they are only interested in the "switch" days. Each entry of the dictionary corresponds to a class object for a given experimental day, indexed as [3, 5, 7, 8, 10, 12, 14], for example, corresponding to the day number.

    Below are the most relevant attributes of the class for analyses in the paper. Additional attributes are explained in the dayData.py docstring on the GitHub. Values before the '--' are defaults.

    self.anim_list: list of mouse IDs included in this day
    self.place_cell_logical: 'or' -- cells were classified as place cells by having significant spatial information in the trials before OR after the reward switch
    self.force_two_sets: True -- trials were split into "set 0" before the reward switch, and "set 1" after the reward switch. In animals without a reward switch, "set 0" and "set 1" correspond to the 1st and 2nd half of trials, respectively
    self.ts_key: 'dff' -- timeseries data type (dF/F) used to find place cell peaks
    self.use_speed_thr: True -- whether a running speed threshold was used to quantify neural activity
    self.speed_thr: 2 -- the speed threshold used, in cm/s
    self.exclude_int: True -- whether putative interneurons were excluded from analyses
    self.int_thresh: 0.5 -- speed correlation threshold to identify putative interneurons
    self.int_method: 'speed' -- method of finding putative interneurons
    self.reward_dist_exclusive: 50 -- distance in cm to exclude cells "near" reward
    self.reward_dist_inclusive: 50 -- distance in cm to include cells as "near" reward
    self.bin_size: 10 -- linear bin size (cm) for quantifying spatial activity
    self.sigma: 1 -- Gaussian s.d. in bins for smoothing
    self.smooth: False -- whether to smooth binned data for finding place cell peaks
    self.impute_NaNs: True -- whether to impute NaN bins in spatial activity matrices
    self.sim_method: 'correlation' -- trial-by-trial similarity matrix method: 'cosine_sim' or 'correlation'
    self.lick_correction_thr: 0.35 -- threshold to detect capacitive sensor errors and set trial licking to NaN
    self.is_switch: whether each animal had a reward switch
    self.anim_tag: string of animal ID numbers
    self.trial_dict: dictionary of booleans identifying each trial as in "set 0" or "set 1"
    self.rzone_pos: [start, stop] position of each reward zone (cm)
    self.rzone_by_trial: same as above but for each trial
    self.rzone_label: label of each reward zone (e.g. 'A', 'B')
    self.activity_matrix: spatially-binned neural activity of type self.ts_key (trials x position bins x neurons)
    self.events: original spatially-binned deconvolved events (trials x position bins x neurons) (no speed threshold applied)
    self.place_cell_masks: booleans identifying which cells are place cells in each trial set
    self.SI: spatial information for each cell in each trial set
    self.overall_place_cell_masks: single boolean identifying which cells are place cells according to self.place_cell_logical
    self.peaks: spatial bin center of peak activity for each cell in each trial set
    self.field_dict: dictionary of place field properties for each cell
    self.plane_per_cell: imaging plane of each cell (all zeros if only a single plane was imaged, otherwise 0 or 1 if two planes were imaged)
    self.is_int: boolean, whether each cell is a putative interneuron
    self.is_reward_cell: boolean, whether each cell has a peak within 50 cm of both reward zone starts
    self.is_end_cell: boolean, whether each cell has a peak in the first or last spatial bin of the track
    self.is_track_cell: boolean, whether each cell's peak stays within 50 cm of itself from trial set 0 to trial set 1
    self.sim_mat: trial-by-trial similarity matrix for place cells, licking, and speed
    self.in_vs_out_lickratio: ratio of lick rate in the anticipatory zone vs. everywhere outside the anticipatory and reward zones
    self.lickpos_std: standard deviation of licking position
    self.lick_mat: matrix of lick rate in each spatial bin (trials x position bins)
    self.cell_class: dictionary containing booleans of which cells have remapping types classified as "track", "disappear", "appear", "reward", or "nonreward_remap", where 'track' = track-relative, 'disappear' = disappearing, 'appear' = appearing, 'reward' = remap near reward (firing peak ≤50 cm from both reward zone starts), including reward-relative, and 'nonreward_remap' = remap far from reward (>50 cm from reward zone start), including reward-relative. See the Fig. 2 notebook and code docstrings for more details.
    self.pos_bin_centers: position bin centers
    self.dist_btwn_rel_null: distance between spatial firing peaks relative to reward before the switch and the "random remapping" shuffle after the switch (radians)
    self.dist_btwn_rel_peaks: distance between spatial firing peaks relative to reward before vs. after the switch (radians)
    self.reward_rel_cell_ids: integer cell indices that were identified as reward-relative after application of all criteria
    self.xcorr_above_shuf: lag, in spatial bins, of the above-shuffle maximum of the cross-correlation used to confirm cells as reward-relative (computed for all cells; NaNs indicate that the xcorr did not exceed shuffle)
    self.reward_rel_dist_along_unity: circular mean of pre-switch and post-switch spatial firing peak position relative to reward (radians)
    self.rel_peaks: spatial firing peak position relative to reward in each trial set (radians)
    self.rel_null: spatial firing peak position relative to reward, for the random-remapping shuffle post-switch (radians)
    self.circ_licks: spatially-binned licking, in circular coordinates relative to reward (trials x position bins)
    self.circ_speed: spatially-binned speed, in circular coordinates relative to reward (trials x position bins)
    self.circ_map: mean spatially-binned neural activity within each trial set, of type self.ts_key, in circular coordinates relative to reward
    self.circ_trial_matrix: spatially-binned neural activity of type self.ts_key, in circular coordinates relative to reward (trials x position bins x neurons)
    self.circ_rel_stats_across_an: metadata across the "switch" animals: 'include_ans' (list of "switch" animal names), 'rdist_to_rad_inc' (self.reward_dist_inclusive converted to radians), 'rdist_to_rad_exc' (self.reward_dist_exclusive converted to radians), 'min_pos' (minimum position bin used), 'max_pos' (maximum position bin used), 'hist_bin_centers' (bin centers used for spatial binning)
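
    A minimal sketch of loading one of these files (the file name below is a placeholder for the actual pickle in the download; either dill or pickle can be used, as noted above):

    import dill

    # Load the dictionary of per-day class objects.
    with open("m_all_expdays_multiDayData_dff.pkl", "rb") as f:   # hypothetical file name
        multi_day_data = dill.load(f)

    day3 = multi_day_data[3]              # class object for experimental day 3
    print(day3.anim_list)                 # mouse IDs included on this day
    print(day3.activity_matrix.shape)     # trials x position bins x neurons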

  4. Data from: Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 25, 2024
    Cite
    Ibrahimzada, Ali Reza (2024). Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8190051
    Dataset updated
    Jan 25, 2024
    Dataset authored and provided by
    Ibrahimzada, Ali Reza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artifact repository for the paper Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code, accepted at ICSE 2024, Lisbon, Portugal. Authors are Rangeet Pan*, Ali Reza Ibrahimzada*, Rahul Krishna, Divya Sankar, Lambert Pougeum Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand.

    Install

    This repository contains the source code for reproducing the results in our paper. Please start by cloning this repository:

    git clone https://github.com/Intelligent-CAT-Lab/PLTranslationEmpirical

    We recommend using a virtual environment for running the scripts. Please download conda 23.11.0 from this link. You can create a virtual environment using the following command:

    conda create -n plempirical python=3.10.13

    After creating the virtual environment, you can activate it using the following command:

    conda activate plempirical

    You can run the following command to make sure that you are using the correct version of Python:

    python3 --version && pip3 --version

    Dependencies

    To install all software dependencies, please execute the following command:

    pip3 install -r requirements.txt

    As for hardware dependencies, we used 16 NVIDIA A100 GPUs with 80 GB of memory each for model inference. The models can be inferenced on any combination of GPUs as long as the reader can properly distribute the model weights across the GPUs. We did not perform weight distribution since we had enough memory (80 GB) per GPU.

    Moreover, for compiling and testing the generated translations, we used Python 3.10, g++ 11, GCC Clang 14.0, Java 11, Go 1.20, Rust 1.73, and .Net 7.0.14 for Python, C++, C, Java, Go, Rust, and C#, respectively. Overall, we recommend using a machine with Linux OS and at least 32GB of RAM for running the scripts.

    For running scripts of alternative approaches, you need to make sure you have installed C2Rust, CxGO, and Java2C# on your machine. Please refer to their repositories for installation instructions. For Java2C#, you need to create a .csproj file like below (a standard net7.0 executable project file):

    <Project Sdk="Microsoft.NET.Sdk">
      <PropertyGroup>
        <OutputType>Exe</OutputType>
        <TargetFramework>net7.0</TargetFramework>
        <ImplicitUsings>enable</ImplicitUsings>
        <Nullable>enable</Nullable>
      </PropertyGroup>
    </Project>

    Dataset

    We uploaded the dataset we used in our empirical study to Zenodo. The dataset is organized as follows:

    CodeNet

    AVATAR

    Evalplus

    Apache Commons-CLI

    Click

    Please download and unzip the dataset.zip file from Zenodo. After unzipping, you should see the following directory structure:

    PLTranslationEmpirical
    ├── dataset
      ├── codenet
      ├── avatar
      ├── evalplus
      ├── real-life-cli
    ├── ...

    The structure of each dataset is as follows:

    1. CodeNet & Avatar: Each directory in these datasets corresponds to a source language, and each includes two directories, Code and TestCases, for code snippets and test cases, respectively. Each code snippet has an id in the filename, and the id is used as a prefix for the test I/O files.

    2. Evalplus: The source language code snippets follow a similar structure to CodeNet and Avatar. However, as a one-time effort, we manually created the test cases in the target Java language inside a maven project, evalplus_java. To evaluate the translations from an LLM, we recommend moving the generated Java code snippets to the src/main/java directory of the maven project and then running the command mvn clean test surefire-report:report -Dmaven.test.failure.ignore=true to compile, test, and generate reports for the translations.

    3. Real-life Projects: The real-life-cli directory represents two real-life CLI projects from Java and Python. These datasets only contain code snippets as files and no test cases. As mentioned in the paper, the authors manually evaluated the translations for these datasets.

    Scripts

    We provide bash scripts for reproducing our results in this work. First, we discuss the translation script. To perform translation with a model and dataset, you first need to create a .env file in the repository and add the following:

    OPENAI_API_KEY=
    LLAMA2_AUTH_TOKEN=
    STARCODER_AUTH_TOKEN=

    1. Translation with GPT-4: You can run the following command to translate all Python -> Java code snippets in the codenet dataset with GPT-4, with top-k sampling k=50, top-p sampling p=0.95, and temperature=0.7:

    bash scripts/translate.sh GPT-4 codenet Python Java 50 0.95 0.7 0

    2. Translation with CodeGeeX: Prior to running the script, you need to clone the CodeGeeX repository from here and use the instructions from their artifacts to download their model weights. After cloning it inside PLTranslationEmpirical and downloading the model weights, your directory structure should be like the following:

    PLTranslationEmpirical
    ├── dataset
      ├── codenet
      ├── avatar
      ├── evalplus
      ├── real-life-cli
    ├── CodeGeeX
      ├── codegeex
      ├── codegeex_13b.pt # this file is the model weight
      ├── ...
    ├── ...

    You can run the following command to translate all Python -> Java code snippets in the codenet dataset with CodeGeeX, with top-k sampling k=50, top-p sampling p=0.95, and temperature=0.2 on GPU gpu_id=0:

    bash scripts/translate.sh CodeGeeX codenet Python Java 50 0.95 0.2 0

    3. For all other models (StarCoder, CodeGen, LLaMa, TB-Airoboros, TB-Vicuna), you can execute the following command to translate all Python -> Java code snippets in the codenet dataset with StarCoder|CodeGen|LLaMa|TB-Airoboros|TB-Vicuna, with top-k sampling k=50, top-p sampling p=0.95, and temperature=0.2 on GPU gpu_id=0:

    bash scripts/translate.sh StarCoder codenet Python Java 50 0.95 0.2 0

    4. For translating and testing pairs with traditional techniques (i.e., C2Rust, CxGO, Java2C#), you can run the following commands:

    bash scripts/translate_transpiler.sh codenet C Rust c2rust fix_report
    bash scripts/translate_transpiler.sh codenet C Go cxgo fix_reports
    bash scripts/translate_transpiler.sh codenet Java C# java2c# fix_reports
    bash scripts/translate_transpiler.sh avatar Java C# java2c# fix_reports

    5. For compiling and testing CodeNet, AVATAR, and Evalplus (Python to Java) translations from GPT-4, and generating fix reports, you can run the following commands:

    bash scripts/test_avatar.sh Python Java GPT-4 fix_reports 1
    bash scripts/test_codenet.sh Python Java GPT-4 fix_reports 1
    bash scripts/test_evalplus.sh Python Java GPT-4 fix_reports 1

    6. For repairing unsuccessful translations of Java -> Python in the CodeNet dataset with GPT-4, you can run the following commands:

    bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 compile
    bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 runtime
    bash scripts/repair.sh GPT-4 codenet Python Java 50 0.95 0.7 0 1 incorrect

    7. For cleaning translations of open-source LLMs (i.e., StarCoder) in codenet, you can run the following command:

    bash scripts/clean_generations.sh StarCoder codenet

    Please note that for the above commands, you can change the dataset and model name to execute the same thing for other datasets and models. Moreover, you can refer to /prompts for different vanilla and repair prompts used in our study.

    Artifacts

    Please download the artifacts.zip file from our Zenodo repository. We have organized the artifacts as follows:

    RQ1 - Translations: This directory contains the translations from all LLMs and for all datasets. We have added an Excel file to show a detailed breakdown of the translation results.

    RQ2 - Manual Labeling: This directory contains an Excel file which includes the manual labeling results for all translation bugs.

    RQ3 - Alternative Approaches: This directory contains the translations from all alternative approaches (i.e., C2Rust, CxGO, Java2C#). We have added an Excel file to show a detailed breakdown of the translation results.

    RQ4 - Mitigating Translation Bugs: This directory contains the fix results of GPT-4, StarCoder, CodeGen, and Llama 2. We have added an Excel file to show a detailed breakdown of the fix results.

    Contact

    We look forward to hearing your feedback. Please contact Rangeet Pan or Ali Reza Ibrahimzada for any questions or comments 🙏.

  5. Metadata supporting data files in the published article:

    • figshare.com
    jpeg
    Updated Jan 30, 2024
    Cite
    chang liu (2024). Metadata supporting data files in the published article: [Dataset]. http://doi.org/10.6084/m9.figshare.25054619.v2
    Available download formats: jpeg
    Dataset updated
    Jan 30, 2024
    Dataset provided by
    figshare
    Authors
    chang liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Figure posted on 2024-01-24, 12:30, authored by Chang Liu, Qiange Wan, Zhi Zheng, Yue Zhao.

    Our paper proposes a Chinese character font generation method based on font contour information (EdgeStyleGAN). To verify the effectiveness of the proposed method, the authors designed a series of generation tasks and comparative experiments. The generation tasks involve learning and generating seven common font styles using our model and comparative models. The experiments consisted of both qualitative and quantitative evaluations. The specific results for generating the 7 font styles can be found in Figure 4 - Figure 10.

    In constructing our dataset, we drew inspiration from the well-established design process of the industry-leading company Fangzheng Font Library. We opted for a 500-character set as our training dataset, as depicted in Figure 2. These 500 characters are built upon the foundational 50 characters used to establish font styles. These structures encompass not only the 31 strokes of Chinese characters and the morphological variations of each stroke in different positions but also the fundamental structural forms of Chinese characters (e.g., left-right structure, top-bottom structure, enclosure structure, and semi-enclosure structure). Additionally, this dataset includes the majority of individual characters, compound characters, and most of the radical components. The selection of these 500 characters is pivotal in determining the overall consistency requirements for the entire font library.

    We utilized data from seven font styles commonly employed in font design. These styles include 1 Song style (Song), 1 Hei style (Black), 1 Kai style (Kai), 1 Fang Song style (FangSong), 1 Yuan style (Yuan), 1 Handwriting style (Handwriting), and 1 Decoration style (Decoration). Following the proposed 500-character dataset in this paper, datasets for these seven fonts were individually created. The fonts were downloaded in ttf or otf format from the respective font manufacturers' official websites. Python code was then used to preprocess the images into 256px*256px white-background black-font RGB images, forming seven sets of paired training datasets, each containing 500 characters. These datasets served as the source and target fonts in the model's comparative experiments to assess the model's style transfer capabilities. Additionally, 30 characters were randomly selected from the GB2312 character set, excluding the initial 500 characters, to form a test dataset for model inference and generation evaluation.

    The training was conducted on a cloud processor utilizing an NVIDIA RTX 3090 graphics processing unit (GPU), 60 GB of memory, and a 6-core Intel Xeon Gold 6142 processor (CPU). The experiments were run using PyCharm with remote connectivity to the cloud server. The experimental environment was configured with Python 3.10 and the torch 2.0 framework.

    Font sources:

    • The Song style font is downloaded from https://source.typekit.com/source-han-serif/cn/
    • The Hei style font is downloaded from https://www.hanyi.com.cn/productdetail?id=831
    • The Kai style font is downloaded from https://www.foundertype.com/index.php/FontInfo/index/id/241
    • The Fang Song style font is downloaded from https://www.hanyi.com.cn/productdetail?id=10726
    • The Yuan style font is downloaded from https://www.foundertype.com/index.php/FontInfo/index/id/219
    • The Handwriting style font is downloaded from https://www.hanyi.com.cn/productdetail.php?id=9053&type=0
    • The Decoration style font is downloaded from https://www.hanyi.com.cn/productdetail.php?id=608&type=0

    In common tasks involving character generation in seven typical Chinese character styles, the method and dataset employed in this paper demonstrate excellent generation results, demonstrating the superiority of the proposed method over the baseline method. For qualitative assessment, we visualized the results generated by the proposed model in Figures 2 to 9 and incorporated them into a questionnaire. As shown in Table 1, the final results indicate that the images generated by the proposed model surpass those generated by the other models for all 5 criteria. In particular, the model achieved the highest score in edge clarity evaluation, providing preliminary evidence that the proposed model optimizes character contour edges and enhances font generation quality. Because the models used in this study include ground-truth font images for their respective tasks, the quantitative evaluation was conducted using image similarity metrics. Four image similarity evaluation metrics were employed to ensure the objectivity and reliability of the results: the structural similarity index (SSIM), feature similarity index (FSIM), peak signal-to-noise ratio (PSNR), and root mean square error (RMSE). A higher SSIM, FSIM, or PSNR indicates greater similarity between the generated and target images. Conversely, a smaller RMSE indicates lower dissimilarity and greater similarity between the generated and target images. As shown in Table 2, the proposed model obtained superior image similarity results compared to those of the comparative models in the 7 different style font generation tasks. On at least 6 out of the 7 experimental datasets, our proposed method achieved larger SSIM, FSIM, and PSNR values and smaller RMSE values, more objectively demonstrating the effectiveness of EdgeStyleGAN in optimizing character contour edges and font style transfer capabilities.

    For more details on the methodology, please read the related published article. The code is available from the corresponding author by request.
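
    The image preprocessing step described above can be sketched with Pillow as follows. This is an illustrative sketch, not the authors' code; the font file, character, and glyph placement are placeholder assumptions.

    from PIL import Image, ImageDraw, ImageFont

    # Render one character as a 256x256 white-background, black-glyph RGB image.
    font = ImageFont.truetype("SourceHanSerifCN.ttf", 200)  # hypothetical font file
    img = Image.new("RGB", (256, 256), "white")
    draw = ImageDraw.Draw(img)
    draw.text((128, 128), "永", font=font, fill="black", anchor="mm")  # center the glyph
    img.save("yong_256.png")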

  6. Frédéric Chopin - Mazurkas (A corpus of annotated scores)

    • zenodo.org
    zip
    Updated Sep 15, 2023
    Cite
    Johannes Hentschel; Yannis Rammos; Markus Neuwirth; Martin Rohrmeier (2023). Frédéric Chopin - Mazurkas (A corpus of annotated scores) [Dataset]. http://doi.org/10.5281/zenodo.8329063
    Available download formats: zip
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johannes Hentschel; Yannis Rammos; Markus Neuwirth; Martin Rohrmeier
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This corpus of annotated MuseScore files has been created within the DCML corpus initiative and employs the DCML harmony annotation standard. It is one of nine similar corpora that have been grouped together into An Annotated Corpus of Tonal Piano Music from the Long 19th Century, which comes with a data report that is currently under review.

    The dataset lives on GitHub (link under "Related identifiers") and is stored on Zenodo purely for conservation and automatic DOI generation for new GitHub releases. For technical reasons, we include only brief, generic instructions on how to use the data. For more detailed documentation, please refer to the dataset's GitHub page.

    What is included

    The dataset includes annotated MuseScore (.mscx) files that have been created with MuseScore 3.6.2 and can be opened with MuseScore 3 or any later version. Apart from that, the score information (measures, notes, harmony labels) has been extracted in the form of TSV files, which can be found in the folders measures, notes, and harmonies, respectively. They were extracted with the Python library ms3, whose documentation has a column glossary for looking up the meaning of each column.
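
    For example, a minimal sketch of loading one of the extracted TSV files with pandas (the file name is a placeholder; consult the ms3 column glossary for the meaning of each column):

    import pandas as pd

    # Read the harmony annotations of one mazurka as a tab-separated table.
    harmonies = pd.read_csv("harmonies/some_mazurka.tsv", sep="\t")
    print(harmonies.columns.tolist())   # inspect the available columns
    print(harmonies.head())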

    Getting the data

    You can download the dataset as a ZIP file from Zenodo or GitHub. Please note that these automatically generated ZIP files do not include submodules, which would appear as empty folders. If you need ZIP files, you will need to find the submodule repositories (e.g. via GitHub) and download them individually.

    Apart from that, you can git-clone the GitHub repository to your disk. This has the advantage of letting you version-control any changes you want to make to the dataset and ask for your changes to be included ("merged") in a future version.

  7. MBC Groundwater model

    • cloud.csiss.gmu.edu
    • researchdata.edu.au
    • +2more
    zip
    Updated Dec 13, 2019
    Cite
    Australia (2019). MBC Groundwater model [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/6fe25546-a6ca-44fc-a101-51b1758e2890
    Available download formats: zip (32125461941 bytes)
    Dataset updated
    Dec 13, 2019
    Dataset provided by
    Australia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    This data set comprises the inputs, parameters and outputs of the OGIA groundwater models as used for BA groundwater modelling reported in product 2.6.2 of the Maranoa-Balonne-Condamine subregion. Only one input file (the MODFLOW drain package input file) was modified for the BA-specific model runs. This input file is provided within the folder named NIC_MBC_Model_Input_files. All other model inputs are the same as those used in the OGIA model runs for the Underground Water Impact Report 2012. These inputs may be requested from the Office of Groundwater Impact Assessment (OGIA) and are not published in the BA repository, as per the agreement with OGIA.

    The OGIA drain package has been modified to include the boundary conditions caused by 7 mines in the MBC subregion. For this, a Python script was written. The script reads the original drain package file (an ASCII file) of the OGIA model and then writes additional drain cells to the file at the locations of the 7 mines - Commodore, New Acland stage 2, Wilkie Creek, Kogan Creek, Cameby Downs, New Acland stage 3 and The Range. The script reads the locations of these mines from input files (.dat) with names similar to the mine names. The locations of the mines are specified as rows and columns of the model grid in these input files. These files also contain the elevations of the drain cells (the bottom/centre of the cells in the Walloon Coal Measures in which the coal mine pit is present). The location and elevation data were obtained from OGIA and are registered as a separate data set.

    Two copies of the python script are present in the folder corresponding to the baseline and CRDP drain packages.
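
    A conceptual sketch of the modification described above is given below. This is not the CSIRO script; the .dat column layout, the layer number, and the conductance value are placeholder assumptions, and a real edit would also need to update the per-stress-period drain counts in the MODFLOW drain package header.

    import pandas as pd

    # Mine-pit cells: model grid row, column, and drain elevation (hypothetical layout).
    mine_cells = pd.read_csv("commodore.dat", sep=r"\s+",
                             names=["row", "col", "elevation"])

    with open("ogia_drain_package.drn") as src, open("modified_drain_package.drn", "w") as dst:
        dst.write(src.read())   # copy the original drain records
        for cell in mine_cells.itertuples():
            # Append one drain cell per mine-pit location
            # (layer and conductance values here are placeholders).
            dst.write(f" 1 {cell.row} {cell.col} {cell.elevation} 1000.0\n")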

    Dataset History

    All the model input data files were obtained from OGIA with the condition that they will not be published. The drain package input file and all output files were generated by CSIRO Land and Water using CSIRO-generated scripts and by running the model with these inputs. The drain package has inputs from the original OGIA model drain package plus additional data obtained from OGIA and the coal mine footprint data.

    The coal mine footprint data is registered as a separate data set. The OGIA model is not registered as a separate source data set as it is not publishable, as per the agreement.

    Dataset Citation

    Bioregional Assessment Programme (2015) MBC Groundwater model. Bioregional Assessment Derived Dataset. Viewed 25 October 2017, http://data.bioregionalassessments.gov.au/dataset/6fe25546-a6ca-44fc-a101-51b1758e2890.

    Dataset Ancestors

  8. cardiotox

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). cardiotox [Dataset]. https://www.tensorflow.org/datasets/catalog/cardiotox
    Dataset updated
    Dec 6, 2022
    Description

    The Drug Cardiotoxicity dataset [1-2] is a molecule classification task to detect cardiotoxicity caused by binding the hERG target, a protein associated with heart beat rhythm. The data covers over 9000 molecules with hERG activity.

    Note:

    1. The data is split into four splits: train, test-iid, test-ood1, test-ood2.

    2. Each molecule in the dataset has 2D graph annotations which are designed to facilitate graph neural network modeling. Nodes are the atoms of the molecule and edges are the bonds. Each atom is represented as a vector encoding basic atom information such as atom type. Similar logic applies to bonds.

    3. We include Tanimoto fingerprint distance (to training data) for each molecule in the test sets to facilitate research on distributional shift in graph domain.

    For each example, the features include:

    • atoms: a 2D tensor with shape (60, 27) storing node features. Molecules with fewer than 60 atoms are padded with zeros. Each atom has 27 atom features.
    • pairs: a 3D tensor with shape (60, 60, 12) storing edge features. Each edge has 12 edge features.
    • atom_mask: a 1D tensor with shape (60,) storing node masks. 1 indicates the corresponding atom is real, otherwise it is a padded one.
    • pair_mask: a 2D tensor with shape (60, 60) storing edge masks. 1 indicates the corresponding edge is real, otherwise it is a padded one.
    • active: a one-hot vector indicating whether the molecule is toxic. [0, 1] indicates toxic, [1, 0] non-toxic.
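
    To inspect the feature tensors described above, a minimal sketch (using the feature names listed here and the train split):

    import tensorflow_datasets as tfds

    ds = tfds.load('cardiotox', split='train')
    for ex in ds.take(1):
        print(ex['atoms'].shape)      # (60, 27) node features
        print(ex['pairs'].shape)      # (60, 60, 12) edge features
        print(ex['atom_mask'].shape)  # (60,) node mask
        print(ex['pair_mask'].shape)  # (60, 60) edge mask
        print(ex['active'])           # one-hot toxicity label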

    References

    [1]: V. B. Siramshetty et al. Critical Assessment of Artificial Intelligence Methods for Prediction of hERG Channel Inhibition in the Big Data Era. JCIM, 2020. https://pubs.acs.org/doi/10.1021/acs.jcim.0c00884

    [2]: K. Han et al. Reliable Graph Neural Networks for Drug Discovery Under Distributional Shift. NeurIPS DistShift Workshop 2021. https://arxiv.org/abs/2111.12951

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('cardiotox', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  9. Daily website visitors (time series regression)

    • kaggle.com
    Updated Aug 20, 2020
    Cite
    Bob Nau (2020). Daily website visitors (time series regression) [Dataset]. https://www.kaggle.com/bobnau/daily-website-visitors/code
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 20, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bob Nau
    Description

    Context

    This file contains 5 years of daily time series data for several measures of traffic on a statistical forecasting teaching notes website whose alias is statforecasting.com. The variables have complex seasonality that is keyed to the day of the week and to the academic calendar. The patterns you see here are similar in principle to what you would see in other daily data with day-of-week and time-of-year effects. Some good exercises are to develop a 1-day-ahead forecasting model, a 7-day-ahead forecasting model, and an entire-next-week forecasting model (i.e., next 7 days) for unique visitors.

    Content

    The variables are daily counts of page loads, unique visitors, first-time visitors, and returning visitors to an academic teaching notes website. There are 2167 rows of data spanning the date range from September 14, 2014, to August 19, 2020. A visit is defined as a stream of hits on one or more pages on the site on a given day by the same user, as identified by IP address. Multiple individuals with a shared IP address (e.g., in a computer lab) are considered as a single user, so real users may be undercounted to some extent. A visit is classified as "unique" if a hit from the same IP address has not come within the last 6 hours. Returning visitors are identified by cookies if those are accepted. All others are classified as first-time visitors, so the count of unique visitors is the sum of the counts of returning and first-time visitors by definition. The data was collected through a traffic monitoring service known as StatCounter.
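
    For example, a minimal sketch of checking the day-of-week effect for unique visitors with pandas (the file name and column names are assumptions and should be adjusted to the actual CSV headers):

    import pandas as pd

    df = pd.read_csv("daily-website-visitors.csv", thousands=",")  # hypothetical file name
    df["Date"] = pd.to_datetime(df["Date"])                        # hypothetical column names
    weekday_means = df.groupby(df["Date"].dt.day_name())["Unique.Visits"].mean()
    print(weekday_means.sort_values(ascending=False))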

    Inspiration

    This file and a number of other sample datasets can also be found on the website of RegressIt, a free Excel add-in for linear and logistic regression which I originally developed for use in the course whose website generated the traffic data given here. If you use Excel to some extent as well as Python or R, you might want to try it out on this dataset.

  10. ESA CCI SM PASSIVE Daily Gap-filled Root-Zone Soil Moisture from merged multi-satellite observations

    • researchdata.tuwien.ac.at
    zip
    Updated May 5, 2025
    Cite
    Wolfgang Preimesberger; Johanna Lems; Martin Hirschi; Wouter Arnoud Dorigo (2025). ESA CCI SM PASSIVE Daily Gap-filled Root-Zone Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/8dda4-xne96
    Available download formats: zip
    Dataset updated
    May 5, 2025
    Dataset provided by
    TU Wien
    Authors
    Wolfgang Preimesberger; Johanna Lems; Martin Hirschi; Wouter Arnoud Dorigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides global daily estimates of Root-Zone Soil Moisture (RZSM) content at 0.25° spatial grid resolution, derived from gap-filled merged satellite observations of 14 passive satellites sensors operating in the microwave domain of the electromagnetic spectrum. Data is provided from January 1991 to December 2023.

    This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/. Operational implementation is supported by the Copernicus Climate Change Service implemented by ECMWF through C3S2 312a/313c.

    Studies using this dataset

    This dataset is used by Hirschi et al. (2025) to assess recent summer drought trends in Switzerland.

    Abstract

    ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations from various microwave satellite remote sensing sensors (Dorigo et al., 2017, 2024; Gruber et al., 2019). This version of the dataset uses the PASSIVE record as input, which contains only observations from passive (radiometer) measurements (scaling reference AMSR-E). The surface observations are gap-filled using a univariate interpolation algorithm (Preimesberger et al., 2025). The gap-filled passive observations serve as input for an exponential filter based method to assess soil moisture in different layers of the root-zone of soil (0-200 cm) following the approach by Pasik et al. (2023). The final gap-free root-zone soil moisture estimates based on passive surface input data are provided here at 4 separate depth layers (0-10, 10-40, 40-100, 100-200 cm) over the period 1991-2023.

    Summary

    • Gap-free root-zone soil moisture estimates from 1991-2023 at 0.25° spatial sampling from passive measurements
    • Fields of application include: climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, agriculture and meteorology
    • More information: See Dorigo et al. (2017, 2024) and Gruber et al. (2019) for a description of the satellite base product and uncertainty estimates, Preimesberger et al. (2025) for the gap-filling, and Pasik et al. (2023) for the root-zone soil moisture and uncertainty propagation algorithm.

    Programmatic Download

    You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.

    #!/bin/bash

    # Set download directory
    DOWNLOAD_DIR=~/Downloads

    base_url="https://researchdata.tuwien.ac.at/records/8dda4-xne96/files"

    # Loop through years 1991 to 2023 and download & extract data
    for year in {1991..2023}; do
      echo "Downloading $year.zip..."
      wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
      unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
      rm "$DOWNLOAD_DIR/$year.zip"
    done

    Data details

    The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

    ESA_CCI_PASSIVERZSM-YYYYMMDD000000-fv09.1.nc

    Data Variables

    Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

    • rzsm_1: (float) Root Zone Soil Moisture at 0-10 cm. Given in volumetric units [m3/m3].
    • rzsm_2: (float) Root Zone Soil Moisture at 10-40 cm. Given in volumetric units [m3/m3].
    • rzsm_3: (float) Root Zone Soil Moisture at 40-100 cm. Given in volumetric units [m3/m3].
    • rzsm_4: (float) Root Zone Soil Moisture at 100-200 cm. Given in volumetric units [m3/m3].
    • uncertainty_1: (float) Root Zone Soil Moisture uncertainty at 0-10 cm from propagated surface uncertainties [m3/m3].
    • uncertainty_2: (float) Root Zone Soil Moisture uncertainty at 10-40 cm from propagated surface uncertainties [m3/m3].
    • uncertainty_3: (float) Root Zone Soil Moisture uncertainty at 40-100 cm from propagated surface uncertainties [m3/m3].
    • uncertainty_4: (float) Root Zone Soil Moisture uncertainty at 100-200 cm from propagated surface uncertainties [m3/m3].

    Additional information for each variable is given in the netCDF attributes.

    Version Changelog

    • v9.1
      • Initial version based on PASSIVE input data from ESA CCI SM v09.1 as used by Hirschi et al. (2025).

    Software to open netCDF files

    These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:
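
    For example, a minimal sketch using the xarray Python package (the file name follows the convention stated above, with a hypothetical date):

    import xarray as xr

    # Open one daily file and inspect the 0-10 cm root-zone soil moisture field.
    ds = xr.open_dataset("ESA_CCI_PASSIVERZSM-19910105000000-fv09.1.nc")
    print(ds)                          # lon, lat, time coordinates; rzsm_1..4 and uncertainty_1..4
    print(float(ds["rzsm_1"].mean()))  # global mean RZSM at 0-10 cm [m3/m3]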

    References

    • Dorigo, W., Wagner, W., Albergel, C., Albrecht, F., Balsamo, G., Brocca, L., Chung, D., Ertl, M., Forkel, M., Gruber, A., Haas, E., Hamer, P. D., Hirschi, M., Ikonen, J., de Jeu, R., Kidd, R., Lahoz, W., Liu, Y. Y., Miralles, D., Mistelbauer, T., Nicolai-Shaw, N., Parinussa, R., Pratola, C., Reimer, C., van der Schalie, R., Seneviratne, S. I., Smolander, T., and Lecomte, P.: ESA CCI Soil Moisture for improved Earth system understanding: State-of-the art and future directions, Remote Sensing of Environment, 203, 185-215, 10.1016/j.rse.2017.07.001, 2017
    • Dorigo, W., Stradiotti, P., Preimesberger, W., Kidd, R., van der Schalie, R., Frederikse, T., Rodriguez-Fernandez, N., & Baghdadi, N. (2024). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 09.0. Zenodo. https://doi.org/10.5281/zenodo.13860922
    • Gruber, A., Scanlon, T., van der Schalie, R., Wagner, W., and Dorigo, W.: Evolution of the ESA CCI Soil Moisture climate data records and their underlying merging methodology, Earth Syst. Sci. Data, 11, 717–739, https://doi.org/10.5194/essd-11-717-2019, 2019.
    • Hirschi, M., Michel, D., Schumacher, D. L., Preimesberger, W., Seneviratne, S. I.: Recent summer soil moisture drying in Switzerland based on the SwissSMEX network, 2025 (paper submitted)
    • Pasik, A., Gruber, A., Preimesberger, W., De Santis, D., and Dorigo, W.: Uncertainty estimation for a new exponential-filter-based long-term root-zone soil moisture dataset from Copernicus Climate Change Service (C3S) surface observations, Geosci. Model Dev., 16, 4957–4976, https://doi.org/10.5194/gmd-16-4957-2023, 2023
    • Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

    Related Records

    Please see the ESA CCI Soil Moisture science data records community for more records based on ESA CCI SM.

  11. Most used programming languages among developers worldwide 2024

    • statista.com
    Updated Feb 6, 2025
    Cite
    Statista (2025). Most used programming languages among developers worldwide 2024 [Dataset]. https://www.statista.com/statistics/793628/worldwide-developer-survey-most-used-languages/
    Dataset updated
    Feb 6, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 19, 2024 - Jun 20, 2024
    Area covered
    Worldwide
    Description

    As of 2024, JavaScript and HTML/CSS were the most commonly used programming languages among software developers around the world, with more than 62 percent of respondents stating that they used JavaScript and just around 53 percent using HTML/CSS. Python, SQL, and TypeScript rounded out the top five most widely used programming languages around the world. Programming languages At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to learn to add to their skills. Furthermore, programming knowledge is becoming an important skill to possess within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find their knowledge to be among the most highly desirable data science skills and likely assist in their search for employment.

  12. Bitcoin Blockchain Historical Data

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Bitcoin Blockchain Historical Data [Dataset]. https://www.kaggle.com/datasets/bigquery/bitcoin-blockchain
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. Its usage allows secure peer-to-peer communication by linking blocks containing hash pointers to a previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) which leverages the Blockchain to store transactions in a distributed manner in order to mitigate against flaws in the financial industry.

    Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.

    Content

    In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset. It is updated every 10 minutes. The data can be joined with historical prices in kernels. See available similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.
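    For instance, a minimal sketch using the google-cloud-bigquery client (the table and column names are assumptions about the public crypto_bitcoin dataset and may differ from the current schema):

    from google.cloud import bigquery

    # Count Bitcoin transactions per day for one month (assumes a `transactions` table
    # with a `block_timestamp` column).
    client = bigquery.Client()
    query = """
        SELECT DATE(block_timestamp) AS day, COUNT(*) AS n_transactions
        FROM `bigquery-public-data.crypto_bitcoin.transactions`
        WHERE block_timestamp >= '2019-01-01' AND block_timestamp < '2019-02-01'
        GROUP BY day
        ORDER BY day
    """
    df = client.query(query).to_dataframe()
    print(df.head())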

    Method & Acknowledgements

    Allen Day (Twitter | Medium), Google Cloud Developer Advocate, and Colin Bookman, Google Cloud Customer Engineer, retrieved data from the Bitcoin network using a custom client available on GitHub that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk to two BigQuery tables, blocks_raw and transactions. These tables contain fresh data, as they are now appended when new blocks are broadcast to the Bitcoin network. For additional information visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".

    Photo by Andre Francois on Unsplash.

    Inspiration

    • How many bitcoins are sent each day?
    • How many addresses receive bitcoin each day?
    • Compare transaction volume to historical prices by joining with other available data sources
  13. Idealized Planar Array study for Quantifying Spatial heterogeneity (IPAQS) - Numerical Simulations

    • data.niaid.nih.gov
    Updated Mar 31, 2022
    Cite
    Marc Calaf (2022). Idealized Planar Array study for Quantifying Spatial heterogeneity (IPAQS) - Numerical Simulations [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6342277
    Explore at:
    Dataset updated
    Mar 31, 2022
    Dataset provided by
    Marc Calaf
    Eric Pardyjak
    Fabien Margairaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically
    License information was derived automatically

    Description

    The Idealized Planar Array study for Quantifying Spatial heterogeneity (IPAQS) is the result of a National Science Foundation (US) funded project that aims at studying the effect of surface thermal heterogeneities of different length-scales on the atmospheric boundary layer. This project consisted of a computational effort (dataset here included) and an experimental effort (dataset being prepared for publication).

    Overview of the numerical (Large Eddy) simulations:

    The simulations are separated into two sets to study the differences between heterogeneous and homogeneous surfaces. In the first set, a total of seven configurations are considered, all with a homogeneous surface temperature fixed at a value of (T_s) = 290 K, and for which the geostrophic wind speed has been increased from 1 to 15 m s-1 (i.e., Ug = 1, 2, 3, 4, 6, 9, 15 m s−1 ). These homogeneous cases are referred to as Homog-X, where X indicates the geostrophic wind speed corresponding case (see Margairaz et al. 2020a). In the second set, the surface temperature is distributed amongst square patches, where the temperature of each patch is determined by sampling a Gaussian distribution with a mean temperature of 290 K and a standard deviation of 5 K. In this case, three different patch sizes were considered (i.e., lh = 800, 400, and 200 m). The sizes of the heterogeneities were chosen to be of similar size (lh /ld ≈ 1), half the size (lh /ld ≈ 1/2), and about a quarter of the size (lh /ld ≈ 1/4) of the largest flow motions within the represented thermal boundary layer, assuming that this is of the order of the boundary-layer height (ld ∼ z i ). These heterogeneities are typically not resolved in NWP models. These cases have been studied for the same geostrophic wind speeds indicated above, and hereafter are referred to as PYYY-X-, where X indicates the corresponding geostrophic wind speed, and YYY refers to the size of the patches (e.g., P800_Ug1_ would be the heterogeneous case with patches of 800 m, and forced with Ug = 1 m s−1 ). Additionally, for the case with larger patches, three different random distributions of the patches were considered to evaluate the potential effect of a given surface distribution for all geostrophic wind speeds. In this dataset we only include case v3. The LES imposed surface temperature distributions emulate the surface thermal conditions observed in Morrison et al. (2017 QJRMS, 2021 BLM, 2022 BLM), where measurements of the surface temperature were taken with a thermal camera at the SLTEST site of the US Army Dugway Proving Ground in Utah, USA. This is an ideal site with uniform roughness and a large unperturbed fetch, where surface thermal heterogeneities are naturally created by differences in surface salinity. In all studied cases, the surface roughness is assumed homogeneous, with z0 = 0.1 m, and representative of a surface with sparse forest or farmland with many hedges (Brutsaert 1982; Stull 1988). The initial boundary-layer height is set to zi = 1000 m. The temperature profile is initialized with a mean air temperature of 285 K. At the top of the initial boundary layer, a capping inversion of 1000 m is used to limit its growth. The strength of this inversion is fixed at Γ = 0.012 K m−1. The atmospheric boundary layer (ABL) is considered dry and the latent heat flux is neglected in all cases. Further, in all simulations, the surface heat flux is computed using MOST, as explained in Margairaz et al. 2020a, where the surface temperature is kept constant in time throughout the simulations. Thus, there is no feedback from the atmosphere to the surface as the surface temperature does not cool down or warm up with local changes in velocities. As a consequence, the ABL gradually warms up as the simulations progress, and hence becomes less convective over time. However, the runs are not long enough for this to be significant. 
In addition, to ensure a degree of homogeneity within each patch and a certain degree of validity of MOST, note that even for the heterogeneous cases with the fewest grid points per patch, a minimum of eight grid points is guaranteed in each horizontal direction. The domain size is set to (Lx, Ly, Lz) = (2π, 2π, 2) km at a grid size of (Nx, Ny, Nz) = (256, 256, 256), resulting in a horizontal resolution of Δx = Δy = 24.5 m and a vertical grid spacing of Δz = 7.8 m. A timestep of Δt = 0.1 s is used to ensure the stability of the time integration. The two sets of simulations span a large range of geostrophic forcing conditions, allowing the study of the effect on the structure of the convective boundary layer (CBL) above a patchy surface compared to a homogeneous surface. The procedure used to spin up the simulations is the following: a spinup phase of four hours of real time is used to achieve converged turbulent statistics, which is then followed by an evaluation phase. During the latter, running averages are computed for the next hour of real time (the dataset published here). Statistics have been computed for averaging times of 5 min to 1 h, showing statistical convergence at 30-min averages with negligible changes between the 30-min and the 60-min averages. The simulations cover a wide range of atmospheric stability regimes, ranging from −zi/L < 5 to −zi/L > 700, and hence spanning from near-neutral to highly convective scenarios.

    Description of the Dataset as included in the NetCDF files:

    Data for each study case is included in two files, one for momentum related variables, and one for temperature related variables. For example, the following files "P200_Ug1_Momentum.nc" and "P200_Ug1_Scalar.nc", include the 1h averaged variables for momentum and temperature for the case of 200 m surface patches with 1 m/s geostrophic winds.

    Each corresponding momentum file "PXXX_UgX_Momentum.nc" includes the following variables in a Python Xarray structure:

    'avgU' = mean streamwise wind speed; 'avgV' = mean spanwise wind speed; 'avgW' = mean vertical wind speed; 'avgP' = mean dynamic modified pressure field (\(p^*\), see Margairaz et al. 2020a).

    'avgU2', 'avgV2', 'avgW2' = correspond to \(\overline{UU}\), \(\overline{VV}\), and \(\overline{WW}\), where the capital letter indicates the LES-filtered variable.

    'avgUV', 'avgUW', 'avgVW' = correspond to \(\overline{UV}\), \(\overline{UW}\), and \(\overline{VW}\). Together with the variables above, these are used to compute the Reynolds stress components (e.g. \(R_{xz} = \overline{U}\,\overline{W} - \overline{UW}\)).

    'avgU3', 'avgV3', 'avgW3', 'avgU4', 'avgV4', 'avgW4' = the equivalent third- and fourth-order moments (i.e. cubed and to the 4th power instead of squared).

    'avgtxx','avgtyy','avgtzz','avgtxy','avgtxz','avgtyz' = These represent the corresponding averaged subgrid scale (SGS) stress.

    'avgdudz','avgdvdz','avgNut','avgCs' = Represent the averaged vertical derivatives, an averaged subgrid Nusselt number, and the Cs coefficient computed in the SGS model.

    Overall, there are a total of 26 variables related to the momentum field. Alternatively, the temperature fields are included in the "PXXX_UgX_Scalar.nc" files. These files include 10 variables,

    'avgT' = mean temperature field; 'avgT2' = corresponds to \(\overline{TT}\); 'avgUT', 'avgVT', 'avgWT' = correspond to \(\overline{UT}\), \(\overline{VT}\), and \(\overline{WT}\). One can use these terms to compute the corresponding Reynolds-averaged turbulent fluxes, as is done for momentum.

    'avgUT_sgs', 'avgVT_sgs', 'avgWT_sgs' = the corresponding subgrid-scale fluxes.

    'avg_nus', 'avg_ds' = averaged subgrid Nusselt number, and the Ds coefficient computed in the scalar SGS model.

    All variables output from the LES are normalized: by Tscale = 290 K for quantities with dimensions of temperature, by u_scale = 0.45 m/s for velocity-related quantities, and by zi = 1000 m for length scales.

    The only output variables expressed in dimensional form are the surface temperatures included in the files "SurfTemp_DXXX.nc".

    Together with the data files we include a Python script that loads the data and includes it in two Xarray structures that one can then use to work with the datasets.
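    For example, a minimal sketch (independent of the provided script) that opens one momentum/scalar file pair with xarray:

    import xarray as xr

    # 1-h averaged fields for the case with 200 m patches and Ug = 1 m/s.
    mom = xr.open_dataset("P200_Ug1_Momentum.nc")
    sca = xr.open_dataset("P200_Ug1_Scalar.nc")

    avgU = mom["avgU"]   # mean streamwise wind speed (normalized by u_scale = 0.45 m/s)
    avgT = sca["avgT"]   # mean temperature (normalized by Tscale = 290 K)
    print(avgU.dims, avgT.dims)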

  14. MCCN Case Study 5 - Produce farm zone map

    • adelaide.figshare.com
    zip
    Updated May 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Donald Hobern; Hoang Son Le; Alisha Aneja; Rakesh David; Lili Andres Hernandez (2025). MCCN Case Study 5 - Produce farm zone map [Dataset]. http://doi.org/10.25909/29176640.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 29, 2025
    Dataset provided by
    The University of Adelaide
    Authors
    Donald Hobern; Hoang Son Le; Alisha Aneja; Rakesh David; Lili Andres Hernandez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MCCN project delivers tools to assist the agricultural sector to understand crop-environment relationships, specifically by facilitating generation of data cubes for spatiotemporal data. This repository contains Jupyter notebooks to demonstrate the functionality of the MCCN data cube components. The dataset contains input files for the case study (data), RO-Crate metadata (ro-crate-metadata.json), results from the case study (result), and the Jupyter Notebook (MCCN-CASE 5.ipynb).

    Research Activity Identifier (RAiD): https://doi.org/10.26292/8679d473

    Case Studies

    This repository contains code and sample data for the following case studies. Note that the analyses here are to demonstrate the software, and results should not be considered scientifically or statistically meaningful. No effort has been made to address bias in samples, and sample data may not be available at sufficient density to warrant analysis. All case studies end with generation of an RO-Crate data package including the source data, the notebook and generated outputs, including netCDF exports of the datacubes themselves.

    Case Study 5 - Produce farm zone map

    Description: Use soil sample data and crop yield data to develop a zone map for a farm. This study demonstrates: 1) loading heterogeneous data sources into a cube, and 2) analysis and visualisation using pykrige and KMeans (see the sketch below).

    Data Sources: Use Llara-Campey data, including yield values and soil maps, to develop a classification of the farm area into contiguous zones of relatively self-similar productivity. Variables should include the minimum zone area and the maximum number of zone classes to return. This notebook can be delivered as a tool into which the user can load their own data in the form of spreadsheets containing points and associated values for the variables to take into account in the analysis. The requirement is either for comprehensive (raster) coverage of the area or for a set of point-based measurements for each variable (in which case a simple kriging or mesh interpolation will be applied).

    Dependencies: This notebook requires Python 3.10 or higher. Install the relevant Python libraries with: pip install mccn-engine rocrate pykrige scikit-learn. Installing mccn-engine will install the other dependencies.
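    A minimal sketch of the kriging-plus-clustering step described above (synthetic point data; locations, grid spacing, and parameters are illustrative and not taken from the notebook):

    import numpy as np
    from pykrige.ok import OrdinaryKriging
    from sklearn.cluster import KMeans

    # Synthetic yield samples at scattered points (stand-ins for the farm spreadsheet data).
    rng = np.random.default_rng(0)
    x, y = rng.uniform(0, 1000, 200), rng.uniform(0, 1000, 200)   # sample locations [m]
    yield_vals = 3.0 + 0.002 * x + rng.normal(0.0, 0.3, 200)      # synthetic yield values

    # Krige the point samples onto a regular grid.
    gridx = np.arange(0.0, 1000.0, 25.0)
    gridy = np.arange(0.0, 1000.0, 25.0)
    ok = OrdinaryKriging(x, y, yield_vals, variogram_model="spherical")
    z_interp, _ = ok.execute("grid", gridx, gridy)

    # Cluster the interpolated surface into a small number of zone classes.
    n_zones = 3
    features = np.asarray(z_interp).reshape(-1, 1)
    zone_map = KMeans(n_clusters=n_zones, n_init=10, random_state=0).fit_predict(features)
    zone_map = zone_map.reshape(z_interp.shape)
    print(zone_map.shape)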

  15. Student Performance Data Set

    • kaggle.com
    Updated Mar 27, 2020
    + more versions
    Cite
    Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 27, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data-Science Sean
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    If this data set is useful, an upvote is appreciated. The data approach student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks (a sketch of such a regression is given below). Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st- and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the paper source for more details).
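    A hedged sketch of that regression task (the file name "student-mat.csv" and the ";" separator are assumptions about how the files are distributed, not confirmed here):

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Predict the final grade G3 from the period grades G1 and G2.
    df = pd.read_csv("student-mat.csv", sep=";")          # assumed file name and separator
    X, y = df[["G1", "G2"]], df["G3"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out data:", model.score(X_test, y_test))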

  16. Study data for "Accounting for seasonal retrieval errors in the merging of multi-sensor satellite soil moisture products"

    • researchdata.tuwien.at
    zip
    Updated Feb 11, 2025
    + more versions
    Cite
    Pietro Stradiotti; Pietro Stradiotti; Alexander Gruber; Alexander Gruber; Wolfgang Preimesberger; Wolfgang Preimesberger; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo (2025). Study data for "Accounting for seasonal retrieval errors in the merging of multi-sensor satellite soil moisture products" [Dataset]. http://doi.org/10.48436/z0zzp-f4j39
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 11, 2025
    Dataset provided by
    TU Wien
    Authors
    Pietro Stradiotti; Pietro Stradiotti; Alexander Gruber; Alexander Gruber; Wolfgang Preimesberger; Wolfgang Preimesberger; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data repository contains the accompanying data for the study by Stradiotti et al. (2025), developed as part of the ESA Climate Change Initiative (CCI) Soil Moisture project. Project website: https://climate.esa.int/en/projects/soil-moisture/

    Summary

    This repository contains the final, merged soil moisture and uncertainty values from Stradiotti et al. (2025), derived using a novel uncertainty quantification and merging scheme. In the accompanying study, we present a method to quantify the seasonal component of satellite soil moisture observations, based on Triple Collocation Analysis. Data from three independent satellite missions are used (from ASCAT, AMSR2, and SMAP). We observe consistent intra-annual variations in measurement uncertainties across all products (primarily caused by dynamics on the land surface such as seasonal vegetation changes), which affect the quality of the received signals. We then use these estimates to merge data from the three missions into a single consistent record, following the approach described by Dorigo et al. (2017). The new (seasonal) uncertainty estimates are propagated through the merging scheme, to enhance the uncertainty characterization of the final merged product provided here.

    Evaluation against in situ data suggests that the estimated uncertainties of the new product are more representative of their true seasonal behaviour, compared to the previously used static approach. Based on these findings, we conclude that using a seasonal TCA approach can provide a more realistic characterization of dataset uncertainty, in particular its temporal variation. However, improvements in the merged soil moisture values are constrained, primarily due to correlated uncertainties among the sensors.

    Technical details

    The dataset provides global daily gridded soil moisture estimates for the 2012-2023 period at 0.25° (~25 km) resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). All file names follow the naming convention:

    L3S-SSMS-MERGED-SOILMOISTURE-YYYYMMDD000000-fv0.1.nc

    Data Variables

    Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

    • sm: (float) The Soil Moisture variable contains the daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree). Based on (merged) observations from ASCAT, AMSR2 and SMAP using the new merging scheme described in our study.
    • sm_uncertainty: (float) The Soil Moisture Uncertainty variable contains the uncertainty estimates (random error) for the ‘sm’ field. Based on the uncertainty estimation and propagation scheme described in our study.
    • dnflag: (int) Indicator for satellite orbit(s) used in the retrieval (day/nighttime). 1=day, 2=night, 3=both
    • flag: (int) Indicator for data quality / missing data indicator. For more details, see netcdf attributes.
    • freqbandID: (int) Indicator for frequency band(s) used in the retrieval. For more details, see netcdf attributes.
    • mode: (int) Indicator for satellite orbit(s) used in the retrieval (ascending, descending)
    • sensor: (int) Indicator for satellite sensor(s) used in the retrieval. For more details, see netcdf attributes.
    • t0: (float) Representative time stamp, based on overpass times of all merged satellites.

    Software to open netCDF files

    After extracting the .nc files from the downloaded zip archives, they can be read by any software that supports Climate and Forecast (CF) conformant netCDF files, such as:

    • Xarray (python)
    • netCDF4 (python)
    • esa_cci_sm (python)
    • Similar tools exist for other programming languages (Matlab, R, etc.)
    • GIS and netCDF tools such as CDO, NCO, QGIS, ArcGIS.
    • You can also use the GUI software Panoply to view the contents of each file.
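    For example, a minimal xarray sketch (the date in the file name is illustrative):

    import xarray as xr

    # Open one daily image of the merged product and read the soil moisture and its uncertainty.
    ds = xr.open_dataset("L3S-SSMS-MERGED-SOILMOISTURE-20150701000000-fv0.1.nc")
    sm = ds["sm"]                          # volumetric soil moisture [m3/m3]
    sm_unc = ds["sm_uncertainty"]          # random-error estimate for 'sm' [m3/m3]
    print(sm.dims, float(sm.mean()))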

    Funding

    This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

  17. Data from: Spatial and temporal variation in the value of solar power across United States electricity markets

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Brown, Patrick R. (2020). Spatial and temporal variation in the value of solar power across United States electricity markets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3562895
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Brown, Patrick R.
    Area covered
    United States
    Description

    This repository includes python scripts and input/output data associated with the following publication:

    [1] Brown, P.R.; O'Sullivan, F. "Spatial and temporal variation in the value of solar power across United States Electricity Markets". Renewable & Sustainable Energy Reviews 2019. https://doi.org/10.1016/j.rser.2019.109594

    Please cite reference [1] for full documentation if the contents of this repository are used for subsequent work.

    Many of the scripts, data, and descriptive text in this repository are shared with the following publication:

    [2] Brown, P.R.; O'Sullivan, F. "Shaping photovoltaic array output to align with changing wholesale electricity price profiles". Applied Energy 2019, 256, 113734. https://doi.org/10.1016/j.apenergy.2019.113734

    All code is in python 3 and relies on a number of dependencies that can be installed using pip or conda.

    Contents

    pvvm/*.py : Python module with functions for modeling PV generation and calculating PV energy revenue, capacity value, and emissions offset.

    notebooks/*.ipynb : Jupyter notebooks, including:

    pvvm-vos-data.ipynb: Example scripts used to download and clean input LMP data, determine LMP node locations, assign nodes to capacity zones, download NSRDB input data, and reproduce some figures in [1]

    pvvm-example-generation.ipynb: Example scripts demonstrating the use of the PV generation model and a sensitivity analysis of PV generator assumptions

    pvvm-example-plots.ipynb: Example scripts demonstrating different plotting functions

    validate-pv-monthly-eia.ipynb: Scripts and plots for comparing modeled PV generation with monthly generation reported in EIA forms 860 and 923, as discussed in SI Note 3 of [1]

    validate-pv-hourly-pvdaq.ipynb: Scripts and plots for comparing modeled PV generation with hourly generation reported in NREL PVDAQ database, as discussed in SI Note 3 of [1]

    pvvm-energyvalue.ipynb: Scripts for calculating the wholesale energy market revenues of PV and reproducing some figures in [1]

    pvvm-capacityvalue.ipynb: Scripts for calculating the capacity credit and capacity revenues of PV and reproducing some figures in [1]

    pvvm-emissionsvalue.ipynb: Scripts for calculating the emissions offset of PV and reproducing some figures in [1]

    pvvm-breakeven.ipynb: Scripts for calculating the breakeven upfront cost and carbon price for PV and reproducing some figures in [1]

    html/*.html : Static images of the above Jupyter notebooks for viewing without a python kernel

    data/lmp/*.gz : Day-ahead nodal locational marginal prices (LMPs) and marginal costs of energy (MCE), congestion (MCC), and losses (MCL) for CAISO, ERCOT, MISO, NYISO, and ISONE.

    At the time of publication of this repository, permission had not been received from PJM to republish their LMP data. If permission is received in the future, a new version of this repository will be linked here with the complete dataset.

    results/*.csv.gz : Simulation results associated with [1], including modeled energy revenue, capacity credit and revenue, emissions offsets, and breakeven costs for PV systems at all LMP nodes

    Data notes

    ISO LMP data are used with permission from the different ISOs. Adapting the MIT License (https://opensource.org/licenses/MIT), "The data are provided 'as is', without warranty of any kind, express or implied, including but not limited to the warranties of merchantibility, fitness for a particular purpose and noninfringement. In no event shall the authors or sources be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the data or other dealings with the data." Copyright and usage permissions for the LMP data are available on the ISO websites, linked below.

    ISO-specific notes on LMP data:

    CAISO data from http://oasis.caiso.com/mrioasis/logon.do are used pursuant to the terms at http://www.caiso.com/Pages/PrivacyPolicy.aspx#TermsOfUse.

    ERCOT data are from http://www.ercot.com/mktinfo/prices.

    MISO data are from https://www.misoenergy.org/markets-and-operations/real-time--market-data/market-reports/ and https://www.misoenergy.org/markets-and-operations/real-time--market-data/market-reports/market-report-archives/.

    PJM data were originally downloaded from https://www.pjm.com/markets-and-operations/energy/day-ahead/lmpda.aspx and https://www.pjm.com/markets-and-operations/energy/real-time/lmp.aspx. At the time of this writing these data are currently hosted at https://dataminer2.pjm.com/feed/da_hrl_lmps and https://dataminer2.pjm.com/feed/rt_hrl_lmps.

    NYISO data from http://mis.nyiso.com/public/ are used subject to the disclaimer at https://www.nyiso.com/legal-notice.

    ISONE data are from https://www.iso-ne.com/isoexpress/web/reports/pricing/-/tree/lmps-da-hourly and https://www.iso-ne.com/isoexpress/web/reports/pricing/-/tree/lmps-rt-hourly-final. The Material is provided on an "as is" basis. ISO New England Inc., to the fullest extent permitted by law, disclaims all warranties, either express or implied, statutory or otherwise, including but not limited to the implied warranties of merchantability, non-infringement of third parties' rights, and fitness for particular purpose. Without limiting the foregoing, ISO New England Inc. makes no representations or warranties about the accuracy, reliability, completeness, date, or timeliness of the Material. ISO New England Inc. shall have no liability to you, your employer or any other third party based on your use of or reliance on the Material.

    Data workup: LMP data were downloaded directly from the ISOs using scripts similar to the pvvm.data.download_lmps() function (see below for caveats), then repackaged into single-node single-year files using the pvvm.data.nodalize() function. These single-node single-year files were then combined into the dataframes included in this repository, using the procedure shown in the pvvm-vos-data.ipynb notebook for MISO. We provide these yearly dataframes, rather than the long-form data, to minimize file size and number. These dataframes can be unpacked into the single-node files used in the analysis using the pvvm.data.copylmps() function.

    Usage notes

    Code is provided under the MIT License, as specified in the pvvm/LICENSE file and at the top of each *.py file.

    Updates to the code, if any, will be posted in the non-static repository at https://github.com/patrickbrown4/pvvm_vos. The code in the present repository has the following version-specific dependencies:

    • matplotlib: 3.0.3
    • numpy: 1.16.2
    • pandas: 0.24.2
    • pvlib: 0.6.1
    • scipy: 1.2.1
    • tqdm: 4.31.1
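    If you want to recreate this environment, the pins above can be written to a requirements.txt (a sketch; such a file is not claimed to ship with the repository) and installed with pip install -r requirements.txt:

    matplotlib==3.0.3
    numpy==1.16.2
    pandas==0.24.2
    pvlib==0.6.1
    scipy==1.2.1
    tqdm==4.31.1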

    To use the NSRDB download functions, you will need to modify the "settings.py" file to insert a valid NSRDB API key, which can be requested from https://developer.nrel.gov/signup/. Locations can be specified by passing (latitude, longitude) floats to pvvm.data.downloadNSRDBfile(), or by passing a string googlemaps query to pvvm.io.queryNSRDBfile(). To use the googlemaps functionality, you will need to request a googlemaps API key (https://developers.google.com/maps/documentation/javascript/get-api-key) and insert it in the "settings.py" file.

    Note that many of the ISO websites have changed in the time since the functions in the pvvm.data module were written and the LMP data used in the above papers were downloaded. As such, the pvvm.data.download_lmps() function no longer works for all ISOs and years. We provide this function to illustrate the general procedure used, and do not intend to maintain it or keep it up to date with the changing ISO websites. For up-to-date functions for accessing ISO data, the following repository (no connection to the present work) may be helpful: https://github.com/catalyst-cooperative/pudl.

  18. Hyperreal Talk (Polish clear web message board) messages data

    • data.niaid.nih.gov
    Updated Mar 18, 2024
    Cite
    Świeca, Leszek (2024). Hyperreal Talk (Polish clear web message board) messages data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10810250
    Explore at:
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Siuda, Piotr
    Shi, Haitao
    Świeca, Leszek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Information

    1. Title of Dataset

    Hyperreal Talk (Polish clear web message board) messages data.

    2. Data Collectors

    Haitao Shi (The University of Edinburgh, UK); Leszek Świeca (Kazimierz Wielki University in Bydgoszcz, Poland).

    3. Funding Information

    The dataset is part of the research supported by the Polish National Science Centre (Narodowe Centrum Nauki) grant 2021/43/B/HS6/00710.

    Project title: “Rhizomatic networks, circulation of meanings and contents, and offline contexts of online drug trade” (2022-2025; PLN 956 620; funding institution: Polish National Science Centre [NCN], call: OPUS 22; Principal Investigator: Piotr Siuda [Kazimierz Wielki University in Bydgoszcz, Poland]).

    Data Collection Context

    1. Data Source

    Polish clear web message board called Hyperreal Talk (https://hyperreal.info/talk/).

    2. Purpose

    This dataset was developed within the abovementioned project. The project delves into internet dynamics within disruptive activities, specifically focusing on the online drug trade in Poland. It aims to (1) examine the utilization of the open internet, including social media, in the drug trade; (2) delineate the role of darknet environments in narcotics distribution; and (3) uncover the intricate flow of drug trade-related content and its meanings between the open web and the darknet, and how these meanings are shaped within the so-called drug subculture.

    The Hyperreal Talk forum emerges as a pivotal online space on the Polish internet, serving as a hub for discussions and the exchange of knowledge and experiences concerning drug use. It plays a crucial role in investigating the narratives and discourses that shape the drug subculture and the broader societal perceptions of drug consumption. The dataset has been instrumental in conducting analyses pertinent to the earlier project goals.

    3. Collection Method

    The dataset was compiled using the Scrapy framework, a web crawling and scraping library for Python. This tool facilitated systematic content extraction from the targeted message board.
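    A minimal illustrative Scrapy spider of this kind (a generic sketch only; the selectors and settings of the project's actual crawler are in the GitHub repositories linked below):

    import scrapy


    class ForumSpider(scrapy.Spider):
        # Generic forum spider sketch; CSS selectors are placeholders, not Hyperreal Talk's real markup.
        name = "forum_example"
        start_urls = ["https://hyperreal.info/talk/"]

        def parse(self, response):
            # Yield the text of each post on the page.
            for post in response.css("div.post"):
                yield {"text": " ".join(post.css("::text").getall()).strip()}
            # Follow pagination links, if present.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)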

    4. Collection Date

    The data was collected in two periods, i.e., in September 2023 and November 2023.

    Data Content

    1. Data Description

    The dataset comprises all messages posted on the Polish-language Hyperreal Talk message board from its inception until November 2023. These messages include the initial posts that start each thread and the subsequent posts (replies) within those threads. The dataset is organized into two directories: “hyperreal” and “hyperreal_hidden.” The “hyperreal” directory contains accessible posts without needing to log in to Hyperreal Talk, while the “hyperreal_hidden” directory holds posts that can only be viewed by logged-in users. For each directory, a .txt file has been prepared detailing the structure of the message board folders from which the posts were extracted. The dataset includes 6,248,842 posts.

    2. Data Cleaning, Processing, and Anonymization

    The data has been cleaned and processed using regular expressions in Python. Additionally, all personal information was removed through regular expressions. The data has been hashed to exclude all identifiers related to instant messaging apps and email addresses. Furthermore, all usernames appearing in messages have been eliminated.
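    A sketch of the kind of regex-based cleaning and hashing described (the pattern is illustrative; the actual expressions used in the project are documented on GitHub, linked below):

    import hashlib
    import re

    # Replace e-mail-like identifiers with a short irreversible hash.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def anonymize(text: str) -> str:
        return EMAIL_RE.sub(
            lambda m: hashlib.sha256(m.group(0).encode("utf-8")).hexdigest()[:12],
            text,
        )

    print(anonymize("contact me at someone@example.com"))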

    3. File Formats and Variables/Fields

    The dataset consists of the following files:

    Zipped .txt files (hyperreal.zip) containing messages that are visible without logging into Hyperreal Talk. These files are organized into individual directories that mirror the folder structure found on the Hyperreal Talk message board.

    Zipped .txt files (hyperreal_hidden.zip) containing messages that are visible only after logging into Hyperreal Talk. Similar to the first type, these files are organized into directories corresponding to the website’s folder structure.

    A .csv file that lists all the messages, including file names and the content of each post.

    Accessibility and Usage

    1. Access Conditions

    The data can be accessed without any restrictions.

    2. Related Documentation

    Attached are .txt files detailing the tree of folders for “hyperreal.zip” and “hyperreal_hidden.zip.”

    Documentation on the Python regular expressions used for scraping, cleaning, processing, and anonymizing the data can be found on GitHub at the following URLs:

    https://github.com/LeszekSwieca/Project_2021-43-B-HS6-00710

    https://github.com/HaitaoShi/Scrapy_hyperreal

    Ethical Considerations

    1. Ethics Statement

    A set of data handling policies aimed at ensuring safety and ethics has been outlined in the following paper:

    Harviainen, J.T., Haasio, A., Ruokolainen, T., Hassan, L., Siuda, P., Hamari, J. (2021). Information Protection in Dark Web Drug Markets Research [in:] Proceedings of the 54th Hawaii International Conference on System Sciences, HICSS 2021, Grand Hyatt Kauai, Hawaii, USA, 4-8 January 2021, Maui, Hawaii, (ed.) Tung X. Bui, Honolulu, HI, pp. 4673-4680.

    The primary safeguard was the early-stage hashing of usernames and identifiers from the messages, utilizing automated systems for irreversible hashing. Recognizing that scraping and automatic name removal might not catch all identifiers, the data underwent manual review to ensure compliance with research ethics and thorough anonymization.

  19. Texas Synthetic Power System Test Case (TX-123BT).zip

    • figshare.com
    zip
    Updated Mar 8, 2024
    Cite
    Jin Lu; Xingpeng Li (2024). Texas Synthetic Power System Test Case (TX-123BT).zip [Dataset]. http://doi.org/10.6084/m9.figshare.22144616.v6
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 8, 2024
    Dataset provided by
    figshare
    Authors
    Jin Lu; Xingpeng Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Texas
    Description

    The dataset of the synthetic Texas 123-bus backbone transmission (TX-123BT) system.

    The procedures and details to create the TX-123BT system are described in the paper below: Jin Lu, Xingpeng Li et al., "A Synthetic Texas Backbone Power System with Climate-Dependent Spatio-Temporal Correlated Profiles". If you use this dataset in your work, please cite the paper above.

    Introduction:
    The TX-123BT system has similar temporal and spatial characteristics as the actual Electric Reliability Council of Texas (ERCOT) system. The TX-123BT system has a backbone network consisting of only high-voltage transmission lines distributed in the Texas territory. It includes time series profiles of renewable generation, electrical load, and transmission thermal limits for 5 years, from 2017 to 2021. The North American Land Data Assimilation System (NLDAS) climate data is extracted and used to create the climate-dependent time series profiles mentioned above. Two sets of climate-dependent dynamic line rating (DLR) profiles are created: (i) daily DLR and (ii) hourly DLR.

    Power system configuration data:
    'Bus_data.csv': Bus data including bus name and location (longitude & latitude, weather zone).
    'Line_data.csv': Line capacity and terminal bus information.
    'Generator_data.xlsx': 'Gen_data' sheet: Generator parameters including active/reactive capacity, fuel type, cost and ramping rate. 'Solar Plant Number' sheet: Correspondence between the solar plant number and generator number. 'Wind Plant Number' sheet: Correspondence between the wind plant number and generator number.

    Time series profiles:
    'Climate_5y' folder: Each day's climate data for solar radiation, air temperature, and wind speed near the surface at 10 m height. Each file in the folder includes the hourly temperature, longwave & shortwave solar radiation, and zonal & meridional wind speed data of a day in 2019.
    'Hourly_line_rating_5y' folder: The hourly dynamic line rating for each day in the year. Each file includes the hourly line rating (MW) of a line for all hours in the year; columns represent hours 1-24 in a day, rows represent days 1-365 in the year.
    'Daily_line_rating_5y' folder: The daily dynamic line rating (MW) for all lines and all days in the year.
    'solar_5y' folder: Solar production for all the solar farms in TX-123BT for all days in the year. Each file includes the hourly solar production (MW) of all the solar plants for a day in the year; columns represent hours 1-24 in a day, rows represent solar plants 1-72.
    'wind_5y' folder: Wind production for all the wind farms in the case for all days in the year. Each file includes the hourly wind production (MW) of all the wind plants for a day in the year; columns represent hours 1-24 in a day, rows represent wind plants 1-82.
    'load_5y' folder: Each day's hourly load data on all the buses. Each file includes the hourly nodal loads (MW) of all the buses in a day in the year; columns represent buses 1-123, rows represent hours 1-24 in a day.

    Python codes to run security-constrained unit commitment (SCUC) for TX-123BT profiles:
    Recommended Python version: Python 3.11. Required packages: numpy, pyomo, pypower, pickle. A solver that can be called by pyomo is required to solve the SCUC optimization problem.

    'Sample_Codes_SCUC' folder: A standard SCUC model. The load, solar generation, and wind generation profiles are provided by the 'load_annual', 'solar_annual', and 'wind_annual' folders. The daily line rating profiles are provided by 'Line_annual_Dmin.txt'.
    'power_mod.py': defines the Python class for the power system.
    'UC_function.py': defines functions to build, solve, and save results for the pyomo SCUC model.
    'formpyomo_UC': defines the function to create the input file for the pyomo model.
    'Run_SCUC_annual': run this file to perform SCUC simulation on the selected days of the TX-123BT profiles.

    Steps to run SCUC simulation:
    1) Set up the Python environment.
    2) Set the solver location: in 'UC_function.py' => 'solve_UC' function => UC_solver=SolverFactory('solver_name', executable='solver_location') (see the illustrative snippet after this description).
    3) Set the days you want to run SCUC: 'Run_SCUC_annual.py' => last row: run_annual_UC(case_inst, start_day, end_day). For example, to run SCUC simulations for the 125th-146th days in 2019, the last row of the file is 'run_annual_UC(case_inst,125,146)'. You can also run a single day's SCUC simulation by using 'run_annual_UC(case_inst,single_day,single_day)'.

    'Sample_Codes_SCUC_HourlyDLR' folder: The SCUC model considering hourly dynamic line rating (DLR) profiles. The load, solar generation, and wind generation profiles are provided by the 'load_annual', 'solar_annual', and 'wind_annual' folders. The hourly line rating profiles in 2019 are provided by the 'dynamic_rating_result' folder.
    'power_mod.py': defines the Python class for the power system.
    'UC_function_DLR.py': defines functions to build, solve, and save results for the pyomo SCUC model (with hourly DLR).
    'formpyomo_UC': defines the function to create the input file for the pyomo model.
    'RunUC_annual_dlr': run this file to perform SCUC simulation (with hourly DLR) on the selected days of the TX-123BT profiles.

    Steps to run SCUC simulation (with hourly DLR):
    1) Set up the Python environment.
    2) Set the solver location: in 'UC_function_DLR.py' => 'solve_UC' function => UC_solver=SolverFactory('solver_name', executable='solver_location').
    3) Set the daily profiles for SCUC simulation: 'RunUC_annual_dlr.py' => last row: run_annual_UC_dlr(case_inst, start_day, end_day). For example, to run SCUC simulations (with hourly DLR) for the 125th-146th days in 2019, the last row of the file is 'run_annual_UC_dlr(case_inst,125,146)'. You can also run a single day's SCUC simulation (with hourly DLR) by using 'run_annual_UC_dlr(case_inst,single_day,single_day)'.

    The SCUC / SCUC-with-DLR simulation results are saved in the 'UC_results' folders under the corresponding folder. Under the 'UC_results' folder:
    'UCcase_Opcost.txt': total operational cost ($).
    'UCcase_pf.txt': the power flow results (MW). Rows represent lines, columns represent hours.
    'UCcase_pfpct.txt': the percentage of the power flow to the line capacity (%). Rows represent lines, columns represent hours.
    'UCcase_pgt.txt': the generator output power (MW). Rows represent conventional generators, columns represent hours.
    'UCcase_lmp.txt': the locational marginal price ($/MWh). Rows represent buses, columns represent hours.

    Geographic information system (GIS) data:
    'Texas_GIS_Data' folder: includes the GIS data of the TX-123BT system configurations and ERCOT weather zones. The GIS data can be viewed and edited using GIS software: ArcGIS. The subfolders are:
    'Bus' folder: the shapefile of bus data for the TX-123BT system.
    'Line' folder: the shapefile of line data for the TX-123BT system.
    'Weather Zone' folder: the shapefile of the weather zones in the Electric Reliability Council of Texas (ERCOT).

    Maps (pictures) of the TX-123BT & ERCOT weather zones:
    'Maps_TX123BT_WeatherZone' folder:
    1) 'TX123BT_Noted.jpg': The map of the TX-123BT transmission network. Buses are in blue and lines are in green.
    2) 'Area_Houston_Noted.jpg', 'Area_Dallas_Noted.jpg', 'Area_Austin_SanAntonio_Noted.jpg': Maps for different areas including Houston, Dallas, and Austin-San Antonio are also provided.
    3) 'Weather_Zone.jpg': The map of ERCOT weather zones. It is plotted by the author and may be slightly different from the actual ERCOT weather zones.

    Funding:
    This project is supported by the Alfred P. Sloan Foundation.

    License:
    This work is licensed under the terms of the Creative Commons Attribution 4.0 (CC BY 4.0) license.

    Disclaimer:
    The author does not make any warranty for the accuracy, completeness, or usefulness of any information disclosed, and the author assumes no liability or responsibility for any errors or omissions in the information (data/code/results etc.) disclosed.

    Contributions:
    Jin Lu created this dataset. Xingpeng Li supervised this work. Hongyi Li and Taher Chegini provided the raw historical climate data (extracted from an open-access dataset - NLDAS).
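    For example, step 2 in the instructions above amounts to a line like the following inside the solve_UC function (the solver name and executable path are placeholders for whatever pyomo-compatible solver you have installed):

    from pyomo.opt import SolverFactory

    # Placeholder solver configuration; substitute e.g. 'gurobi', 'cplex' or 'cbc' and its path.
    UC_solver = SolverFactory('cbc', executable='/usr/local/bin/cbc')

    # Step 3 is then, e.g., SCUC for days 125-146 of 2019:
    # run_annual_UC(case_inst, 125, 146)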

  20. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, zip
    Updated Dec 24, 2022
    Cite
    Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. http://doi.org/10.5281/zenodo.6965147
    Explore at:
    Available download formats: bin, zip, csv
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alexander R. Hartloper; Alexander R. Hartloper; Selimcan Ozden; Albano de Castro e Sousa; Dimitrios G. Lignos; Dimitrios G. Lignos; Selimcan Ozden; Albano de Castro e Sousa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

    This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    • The data is licensed through the Creative Commons Attribution 4.0 International.
    • If you have used our data and are publishing your work, we ask that you please reference both:
      1. this database through its DOI, and
      2. any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    • Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
    • Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
    • Unreduced_Data-#_v1-0-0.zip: contain the original (not downsampled) data
      • Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
      • We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory similar to the "Clean_Data"
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Clean_Data_v1-0-0.zip: contains all the downsampled data
      • The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
      • There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
      • The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
    • Database_References_v1-0-0.bib
      • Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_

    • The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
    • Time[s]: time in seconds since the start of the test
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

    import pandas
    # data_file: path to one of the per-test .csv files described above
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
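    As a quick usage sketch (assuming matplotlib is installed and 'data' was loaded as above):

    import matplotlib.pyplot as plt

    # Plot the true stress-strain response of one coupon test.
    plt.plot(data["e_true"], data["Sigma_true"])
    plt.xlabel("True strain, e_true [-]")
    plt.ylabel("True stress, Sigma_true [MPa]")
    plt.show()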

    File Format: Unreduced Data

    These are the "LP_

    • The first column is the index of each data point
    • S/No: sample number recorded by the DAQ
    • System Date: Date and time of sample
    • Time[s]: time in seconds since the start of the test
    • C_1_Force[kN]: load cell force
    • C_1_Déform1[mm]: extensometer displacement
    • C_1_Déplacement[mm]: cross-head displacement
    • Eng_Stress[MPa]: engineering stress
    • Eng_Strain[]: engineering strain
    • e_true: true strain
    • Sigma_true: true stress in MPa
    • (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database. The columns include:

    • hidden_index: internal reference ID
    • grade: material grade
    • spec: specifications for the material
    • source: base material for the test specimen
    • id: internal name for the specimen
    • lp: load protocol
    • size: type of specimen (M8, M12, M20)
    • gage_length_mm_: unreduced section length in mm
    • avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
    • avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
    • avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
    • fy_n_mpa_: nominal yield stress
    • fu_n_mpa_: nominal ultimate stress
    • t_a_deg_c_: ambient temperature in degC
    • date: date of test
    • investigator: person(s) who conducted the test
    • location: laboratory where test was conducted
    • machine: setup used to conduct test
    • pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
    • pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
    • pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
    • citekey: reference corresponding to the Database_References.bib file
    • yield_stress_mpa_: computed yield stress in MPa
    • elastic_modulus_mpa_: computed elastic modulus in MPa
    • fracture_strain: computed average true strain across the fracture surface
    • c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
    • file: file name of corresponding clean (downsampled) stress-strain data

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

    import pandas as pd

    # e.g. date = '2022-08-25_', version = 'v1-0-0' for the file included in this release
    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
              index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
              keep_default_na=False, na_values='')
    • citekey: reference in "Campaign_References.bib".
    • Grade: material grade.
    • Spec.: specifications (e.g., J2+N).
    • Yield Stress [MPa]: initial yield stress in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
    • Elastic Modulus [MPa]: initial elastic modulus in MPa
      • size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Caveats

    • The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:
      • A500
      • A992_Gr50
      • BCP325
      • BCR295
      • HYP400
      • S460NL
      • S690QL/25mm
      • S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm

Data Visualization of Weight Sensor and Event Detection of Aifi Store


Instructions to run the script:

To analyse the weight.csv files with the Python script and plot the time series for the corresponding files:

Download the dataset.

Make sure the Python/Jupyter notebook file is in the same directory as the .csv files.

Install the requirements: $ pip3 install -r requirements.txt

Run the Python script Plot.py: $ python3 Plot.py

After the script has run successfully you will find, for each weight.csv file, a corresponding folder containing the figures (weight vs timestamp) named in the format gondola_number,shelf_number.png, e.g. 1,1.png (Fig 4) (Timeseries Graph).

Instructions to run the Jupyter Notebook:

Run the Plot.ipynb file using Jupyter Notebook, placing the .csv files in the same directory as the Plot.ipynb script.
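A minimal sketch of the kind of per-plate time series plot the script produces (the column names 'timestamp' and 'reading' and the UNIX-seconds timestamp format are assumptions; adjust to the actual weight.csv layout):

import pandas as pd
import matplotlib.pyplot as plt

# Plot the weight readings of one weight.csv file over time.
df = pd.read_csv("weight.csv")                                  # placeholder file name
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")     # assumption: UNIX seconds
plt.plot(df["timestamp"], df["reading"])
plt.xlabel("timestamp")
plt.ylabel("weight reading [g]")
plt.savefig("1,1.png")                                          # e.g. gondola 1, shelf 1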