Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Standardized data from Mobilise-D participants (YAR dataset) and pre-existing datasets (ICICLE, MSIPC2, Gait in Lab and real-life settings, MS project, UNISS-UNIGE) are provided in the shared folder as an example of the procedures proposed in the publication "Mobility recorded by wearable devices and gold standards: the Mobilise-D procedure for data standardization", currently under review at Scientific Data. Please refer to that publication for further information, and please cite it if using these data.
The code to standardize an example subject (for the ICICLE dataset) and to open the standardized Matlab files in other languages (Python, R) is available on GitHub (https://github.com/luca-palmerini/Procedure-wearable-data-standardization-Mobilise-D).
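For quick inspection without the repository code, a standardized Matlab file can also be opened directly with scipy; this is a minimal sketch, and the file name is a placeholder rather than part of the published folder structure:
from scipy.io import loadmat

# Minimal sketch, assuming scipy >= 1.5; "data.mat" is a placeholder name.
data = loadmat("data.mat", simplify_cells=True)  # nested structs become dicts
print(data.keys())  # inspect the top-level variables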
Burmese pythons are an invasive species in the Greater Everglades Ecosystem. Burmese pythons captured in the ecosystem are euthanized, and in an effort to learn about this invasive species, all euthanized pythons are necropsied, during which time samples are collected. We analyzed the stable isotope ratios of carbon and nitrogen in muscle samples from 423 Burmese pythons euthanized and necropsied between 2003-05-01 and 2012-09-02, and after processing and QA/QC, we were left with isotope ratios for 412 samples, which we reported here. We used these data to estimate the size of the isotopic niche of the Burmese python, commonly measured using standard ellipse areas, or SEAs. To put these SEAs in context, we conducted a thorough literature review to find published sizes of other isotopic niches, beginning in 2017-01 and finalized on 2018-06-12. We reported the papers we found during this literature review here. We then reviewed each paper and recorded any SEAs presented in the paper or its supporting information, and we presented those results here.
This dataset was created by abhishek
Mass spectrometry-based proteomics is increasingly employed in biology and medicine. To generate reliable information from large data sets and to ensure comparability of results, it is crucial to implement and standardize the quality control of the raw data, the data processing steps, and the statistical analyses. The MSPypeline provides a platform for the import of MaxQuant output tables, the generation of quality control reports, the preprocessing of data (including normalization), and exploratory analyses via statistical inference plots. These standardized steps assess data quality, provide customizable figures, and enable the identification of differentially expressed proteins to reach biologically relevant conclusions.
The dataset combines the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2023 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor's database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 12/08/2023 from data collected in 2022-2023. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.
CAMA Notes:
The CAMA data underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.
CAMA data was provided by the towns.
Canaan parcels are viewable, but no additional information is available since no CAMA data was submitted.
Spatial Data Notes:
Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,247,506 parcels.
No alteration has been made to the spatial geometry of the data.
Fields associated with the CAMA data were provided by the towns and sourced from the towns' CAMA submissions.
If a town did not provide a field linking the parcels back to the CAMA, a field from the original data with a match rate above 50% to the CAMA was selected instead.
Linking fields were renamed to "Link".
All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.
Only the fields for town name, location, editor, edit date, and the link field associated with the town's CAMA were used in creating this dataset; any other field provided in the original data was deleted or not used.
Field names for town (Muni, Municipality) were renamed to "Town Name".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal of this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborate models.
Data usage
The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494
Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.
Sample code
As part of the data release, we also include sample code written in Python 3; the preprocessed data used in the scripts are provided as well. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for machine learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip. A minimal loading sketch follows the file list below.
Units
All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.
Missing data
The string "NAN" indicates missing data.
File formats
All time series data files are in CSV (comma-separated values) format. Images are given in tar.bz2 archives.
Files
Folsom_irradiance.csv (Primary): One-minute GHI, DNI, and DHI data.
Folsom_weather.csv (Primary): One-minute weather data.
Folsom_sky_images_{YEAR}.tar.bz2 (Primary): Tar archives (bz2-compressed) with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016.
Folsom_NAM_lat{LAT}_lon{LON}.csv (Primary): NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node's coordinates listed in Table I of the paper.
Folsom_sky_image_features.csv (Secondary): Features derived from the sky images.
Folsom_satellite.csv (Secondary): 10-pixel-by-10-pixel GOES-15 images centered on the target location.
Irradiance_features_{horizon}.csv (Secondary): Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).
Sky_image_features_intra-hour.csv (Secondary): Sky image features for the intra-hour forecast issuing times.
Sat_image_features_intra-day.csv (Secondary): Satellite image features for the intra-day forecast issuing times.
NAM_nearest_node_day-ahead.csv (Secondary): NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the node nearest the target location, prepared for day-ahead forecasting.
Target_{horizon}.csv (Secondary): Target data for the different forecasting horizons.
Forecast_{horizon}.py (Code): Python script used to create the forecasts for the different horizons.
Postprocess.py (Code): Python script used to compute the error metric for all the forecasts.
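As a minimal loading sketch, assuming only pandas (the position of the timestamp column is an assumption to be checked against the CSV header):
import pandas as pd

# Timestamps are UTC and missing values are stored as the string "NAN".
irr = pd.read_csv("Folsom_irradiance.csv", index_col=0, parse_dates=True,
                  na_values="NAN")
print(irr.head())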
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To standardize metabolomics data analysis and facilitate future computational developments, it is essential to have a set of well-defined templates for common data structures. Here we describe a collection of data structures involved in metabolomics data processing and illustrate how they are utilized in a full-featured Python-centric pipeline. We demonstrate the performance of the pipeline, and the details in annotation and quality control using large-scale LC-MS metabolomics and lipidomics data and LC-MS/MS data. Multiple previously published datasets are also reanalyzed to showcase its utility in biological data analysis. This pipeline allows users to streamline data processing, quality control, annotation, and standardization in an efficient and transparent manner. This work fills a major gap in the Python ecosystem for computational metabolomics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension.
All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling.
The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability.
All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically relevant dataset.
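A minimal sketch of the described pipeline, assuming the Kaggle CSV layout ("id" and "diagnosis" columns plus 30 feature columns); it illustrates the preprocessing and models named above, not the accompanying breast_cancer_classification_models.py script itself:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("data.csv")  # assumed Kaggle file name
X = df.drop(columns=["id", "diagnosis"])
y = (df["diagnosis"] == "M").astype(int)  # 1 = malignant, 0 = benign

# Stratified 80/20 split preserves the class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# z-score standardization, fit on the training set only.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Perceptron": Perceptron(random_state=42),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f}, "
          f"F1={f1_score(y_te, pred):.3f}")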
The analysis of electrophysiology data typically comprises multiple steps. These often consist of several scripts executed in a specific temporal order, which take different parameter sets and use distinct data files. As the researcher adjusts the individual analysis steps to accommodate new hypotheses or additional data, the resulting workflows may become increasingly complex and undergo frequent changes. Although it is possible to use workflow management systems to organize the execution of the scripts and capture provenance information at the level of the script (i.e., which script file was executed and in which environment) and data file (i.e., which input and output files were supplied to that script), the resulting provenance track does not automatically provide details about the actual analysis carried out inside each script. Therefore, the final analysis results can only be understood by source code inspection or reliance on any accompanying documentation. We focus on two open-source tools for the analysis of electrophysiology data developed in EBRAINS. The Neo (RRID:SCR_000634) framework provides an object model to standardize neural activity data acquired from distinct sources. Elephant (RRID:SCR_003833) is a Python toolbox that provides several functions for the analysis of electrophysiology data. We set out to improve these tools by implementing a data model that captures detailed provenance information and by representing the analysis results in a systematic and formalized manner. Ultimately, these developments aim to improve reproducibility, interoperability, findability, and re-use of analysis results.
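As a hedged illustration of how the two tools fit together (not the provenance data model under development), a Neo spike train can be passed directly to an Elephant analysis function:
import quantities as pq
from neo import SpikeTrain
from elephant.statistics import mean_firing_rate

# A spike train with three spikes in a 2-second recording.
st = SpikeTrain([0.3, 0.9, 1.5] * pq.s, t_stop=2.0 * pq.s)
print(mean_firing_rate(st))  # -> 1.5 1/s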
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is the data from the tournament described here: http://axelrod-tournament.readthedocs.org/en/latest/standard/strategies.html
A description of the format is available here: http://axelrod.readthedocs.org/en/latest/tutorials/further_topics/reading_and_writing_interactions.html
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Samples in this benchmark were generated by RELAI using the following data source(s):
Data Source Name: langchain Documentation
Data Source Link: https://python.langchain.com/docs/introduction/
Data Source License: https://github.com/langchain-ai/langchain/blob/master/LICENSE
Data Source Authors: Observable
AI Benchmarks by Data Agents © 2025 RELAI.AI. Licensed under CC BY 4.0. Source: https://relai.ai
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code [GitHub] | Publication [Nature Scientific Data'23 / ISBI'21] | Preprint [arXiv]
Abstract
We introduce MedMNIST, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.
Disclaimer: The only official distribution link for the MedMNIST dataset is Zenodo. We kindly request users to refer to this original dataset link for accurate and up-to-date data.
Update: We are thrilled to release MedMNIST+ with larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D. As a complement to the previous 28-size MedMNIST, the large-size version could serve as a standardized benchmark for medical foundation models. Install the latest API to try it out!
Python Usage
We recommend our official code to download, parse and use the MedMNIST dataset:
% pip install medmnist
% python
To use the standard 28-size (MNIST-like) version utilizing the downloaded files:
from medmnist import PathMNIST
train_dataset = PathMNIST(split="train")
To enable automatic downloading, set download=True:
from medmnist import NoduleMNIST3D
val_dataset = NoduleMNIST3D(split="val", download=True)
Alternatively, you can access MedMNIST+ with larger image sizes by specifying the size parameter:
from medmnist import ChestMNIST
test_dataset = ChestMNIST(split="test", download=True, size=224)
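As a hedged usage sketch (assuming PyTorch and torchvision are installed), the dataset objects follow the torch Dataset protocol and can be fed to a standard DataLoader once images are converted to tensors:
from torch.utils.data import DataLoader
from torchvision import transforms
from medmnist import PathMNIST

train_dataset = PathMNIST(split="train", download=True,
                          transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
images, labels = next(iter(train_loader))
print(images.shape)  # e.g., torch.Size([128, 3, 28, 28])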
Citation
If you find this project useful, please cite both the v1 and v2 papers:
Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D biomedical image classification." Scientific Data, 2023.
Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.
or using bibtex:
@article{medmnistv2,
  title={MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D biomedical image classification},
  author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
  journal={Scientific Data},
  volume={10},
  number={1},
  pages={41},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
@inproceedings{medmnistv1,
  title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
  author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
  booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
  pages={191--195},
  year={2021}
}
Please also cite the corresponding paper(s) of source data if you use any subset of MedMNIST as per the description on the project website.
License
The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), except DermaMNIST under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
The code is under Apache-2.0 License.
Changelog
v3.0 (this repository): Released MedMNIST+ featuring larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D.
v2.2: Removed a small number of mistakenly included blank samples in OrganAMNIST, OrganCMNIST, OrganSMNIST, OrganMNIST3D, and VesselMNIST3D.
v2.1: Addressed an issue in the NoduleMNIST3D file (i.e., nodulemnist3d.npz). Further details can be found in this issue.
v2.0: Launched the initial repository of MedMNIST v2, adding 6 datasets for 3D and 2 for 2D.
v1.0: Established the initial repository (in a separate repository) of MedMNIST v1, featuring 10 datasets for 2D.
Note: This dataset is NOT intended for clinical use.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Non-ribosomal peptide synthetase (NRPS) is a diverse family of biosynthetic enzymes for the assembly of bioactive peptides. Despite advances in microbial sequencing, the lack of a consistent standard for annotating NRPS domains and modules has made data-driven discoveries challenging. To address this, we introduced a standardized architecture for NRPS, by using known conserved motifs to partition typical domains. This motif-and-intermotif standardization allowed for systematic evaluations of sequence properties from a large number of NRPS pathways, resulting in the most comprehensive cross-kingdom C domain subtype classifications to date, as well as the discovery and experimental validation of novel conserved motifs with functional significance. Furthermore, our coevolution analysis revealed important barriers associated with re-engineering NRPSs and uncovered the entanglement between phylogeny and substrate specificity in NRPS sequences. Our findings provide a comprehensive and statistically insightful analysis of NRPS sequences, opening avenues for future data-driven discoveries.
1. Framework overview.
This paper proposes a pipeline to construct high-quality datasets for text mining in materials science. First, we utilize a traceable automatic literature-acquisition scheme to ensure the traceability of textual data. Then, a data processing method driven by downstream tasks is applied to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.
2. Dataset information.
The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Herein, we mainly introduce the details of the NASICON entity recognition dataset.
2.1 Data collection and preprocessing.
First, 55 materials science publications related to the NASICON system were collected through Crystallographic Information Files (CIF), which contain a wealth of structure-activity relationship information. Note that materials science literature is mostly stored in portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of the literature, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token
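A minimal sketch of this extraction-and-cleanup step, assuming pdfminer.six; the regular-expression rules shown are illustrative stand-ins, not the actual rules used in the paper:
import re
from pdfminer.high_level import extract_text

text = extract_text("paper.pdf")             # PDF -> plain text
text = re.sub(r"-\n(?=\w)", "", text)        # rejoin words hyphenated at line breaks
text = re.sub(r"[ \t]*\n[ \t]*", " ", text)  # collapse layout line breaks
with open("paper.txt", "w", encoding="utf-8") as f:
    f.write(text)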
Coordinate System Update:
This dataset is provided in the NAD 83 Connecticut State Plane (2011) projection (EPSG 2234) instead of WGS 1984 Web Mercator Auxiliary Sphere (EPSG 3857), the coordinate system of the 2023 dataset, and will remain in Connecticut State Plane moving forward.
Ownership Suppression and Data Access:
The updated dataset now includes parcel data for all towns across the state, with some towns featuring fully suppressed ownership information. In these instances, the owner's name is replaced with the label "Current Owner," the co-owner's name is listed as "Current Co-Owner," and the mailing address appears as the property address itself. For towns with suppressed ownership data, users should be aware that the submissions contained no "Suppression" field against which specific details could be verified; this labeling was implemented this year to help verify compliance with suppression requirements.
New Data Fields:
The new dataset introduces the "Land Acres" field, which will display the total acreage for each parcel. This additional field allows for more detailed analysis and better supports planning, zoning, and property valuation tasks. An important new addition is the FIPS code field, which provides the Federal Information Processing Standards (FIPS) code for each parcel’s corresponding block. This allows users to easily identify which block the parcel is in.
Updated Service URL:
The new parcel service URL includes all the updates mentioned above, such as the improved coordinate system, new data fields, and additional geospatial information. Users are strongly encouraged to transition to the new service as soon as possible to ensure that their workflows remain uninterrupted. The URL for this service will remain persistent moving forward, ensuring long-term stability.
For a limited time, the old service will continue to be available, but it will eventually be retired. Users should plan to switch to the new service well before this cutoff to avoid any disruptions in data access.
The dataset combines the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2024 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor's database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually.
These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 10/31/2024 from data collected in 2023-2024. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.
CAMA Notes:
The CAMA data underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.
CAMA data was provided by the towns.
Spatial Data Notes:
Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,290,196 parcels.
No alteration has been made to the spatial geometry of the data.
Fields associated with the CAMA data were provided by the towns and sourced from the towns' CAMA submissions.
If a town did not provide a field linking the parcels back to the CAMA, a field from the original data with a match rate above 50% to the CAMA was selected instead.
Linking fields were renamed to "Link".
All linking fields had a census town code added to the beginning of the value to create a unique identifier per town (see the sketch after this list).
Only the fields for town name, location, editor, edit date, and the link field associated with the town's CAMA were used in creating this dataset; any other field provided in the original data was deleted or not used.
Field names for town (Muni, Municipality) were renamed to "Town Name".
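A minimal pandas sketch of the linking-field standardization described above; the file name, column name, and town code are hypothetical examples, not values from the actual submissions:
import pandas as pd

cama = pd.read_csv("town_cama.csv")              # one town's CAMA report (hypothetical file)
cama = cama.rename(columns={"PID": "Link"})      # town-specific linking field (hypothetical name)
cama["Link"] = "093" + cama["Link"].astype(str)  # prefix a census town code (example value)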
The attributes included in the data:
Town Name
Owner
Co-Owner
Link
Editor
Edit Date
Collection year – year the parcels were submitted
Location
Mailing Address
Mailing City
Mailing State
Assessed Total
Assessed Land
Assessed Building
Pre-Year Assessed Total
Appraised Land
Appraised Building
Appraised Outbuilding
Condition
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
pip install -r requirements.txt
Model configurations are set in the config.py file. For DeepSeek-Coder and CodeLlama, follow the instructions on the Ollama website to deploy the models locally.
python src/Baseline_Standard.py --output_path "./dataset/Standard.json"
python src/Defense_Standard.py --output_path "./dataset/Standard.json"
The CyberSecEval benchmark is located in the CyberSecEval directory.
cd CyberSecEval/CybersecurityBenchmarks
python -m CybersecurityBenchmarks.benchmark.run \
  --benchmark=instruct \
  --prompt-path="./datasets/instruct/Standard.json" \
  --llm-under-test="OPENAI::{MODEL_NAME}::{MODEL_KEY}::{BASE_URL}"
Replace {MODEL_NAME}, {MODEL_KEY}, and {BASE_URL} with the appropriate values for your LLM. For example, if using OpenAI's GPT-4, BASE_URL is the API endpoint. For DeepSeek, use https://api.deepseek.com/v1.
For other scenarios, modify the corresponding Baseline_*.py and Defense_*.py files. All generated data will be saved in the dataset directory.
Security Evaluation
Evaluation results are stored in the result directory. For instance, to evaluate the effectiveness of CodeGuarder under the Standard scenario with DeepSeek-Coder, run:
python src/sec_eval.py --result_path "./output/std_def_DS-Coder.json"
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of processed data in NetCDF (.nc) files used in our study. We used the SPI to determine meteorological drought conditions in the study area, calculated using the open-source Python module Climate and Drought Indices.
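The files can be opened with any NetCDF reader; a minimal sketch using xarray (the file and variable names are placeholders, not the dataset's actual names):
import xarray as xr

ds = xr.open_dataset("spi_monthly.nc")  # placeholder file name
print(ds)                               # inspect dimensions and variables
spi = ds["spi"]                         # placeholder variable name
print(spi.sel(time="2015-08"))          # e.g., SPI values for one month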
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data is part of the Monthly aggregated Water Vapor MODIS MCD19A2 (1 km) dataset. Check the related identifiers section on the Zenodo side panel to access other parts of the dataset.
General Description
The monthly aggregated water vapor dataset is derived from MCD19A2 v061. The water vapor data measure the column above ground retrieved from MODIS near-IR bands at 0.94 μm. The dataset spans from 2000 to 2022 and covers the entire globe. It can be used in many applications like water cycle modeling, vegetation mapping, and soil mapping. This dataset includes:
Monthly time-series: Derived from MCD19A2 v061, this data provides a monthly aggregated mean and standard deviation of the daily water vapor time-series data from 2000 to 2022. Only positive non-cloudy pixels were considered valid observations to derive the mean and the standard deviation. The remaining no-data values were filled using the TMWM algorithm. This dataset also includes smoothed mean and standard deviation values using the Whittaker method. The quality assessment layers and the number of valid observations for each month can provide an indication of the reliability of the monthly mean and standard deviation values.
Yearly time-series: Derived from the monthly time-series, this data provides yearly aggregated statistics of the monthly time-series data.
Long-term data (2000-2022): Derived from the monthly time-series, this data provides long-term aggregated statistics for the whole series of monthly observations.
Data Details
Time period: 2000–2022
Type of data: Water vapor column above the ground (0.001 cm)
How the data was collected or derived: Derived from MCD19A2 v061 using Google Earth Engine. Cloudy pixels were removed and only positive values of water vapor were considered to compute the statistics. The time-series gap-filling and smoothing were computed using the scikit-map Python package.
Statistical methods used: Four statistics were derived: standard deviation and percentiles 25, 50, and 75.
Limitations or exclusions in the data: The dataset does not include data for Antarctica.
Coordinate reference system: EPSG:4326
Bounding box (Xmin, Ymin, Xmax, Ymax): (-180.00000, -62.00081, 179.99994, 87.37000)
Spatial resolution: 1/120 d.d. = 0.008333333 (1 km)
Image size: 43,200 x 17,924
File format: Cloud Optimized GeoTIFF (COG)
Support
If you discover a bug, artifact, or inconsistency, or if you have a question, please use one of the following channels:
Technical issues and questions about the code: GitLab Issues
General questions and comments: LandGIS Forum
Name convention
To ensure consistency and ease of use across and within the projects, we follow the standard Open-Earth-Monitor file-naming convention. The convention uses 10 fields that describe important properties of the data; in this way users can search files, prepare data analyses, etc., without needing to open them. The fields are listed below, followed by a short parsing sketch:
generic variable name: wv = water vapor
variable procedure combination: mcd19a2v061.seasconv = MCD19A2 v061 with gap-filling algorithm
position in the probability distribution / variable type: m = mean | sd = standard deviation | n = number of observations | qa = quality assessment
spatial support: 1km
depth reference: s = surface
time reference begin: 20000101 = 2000-01-01
time reference end: 20221231 = 2022-12-31
bounding box: go = global (without Antarctica)
EPSG code: epsg.4326 = EPSG:4326
version code: v20230619 = 2023-06-19 (creation date)
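A minimal sketch of reading the 10 fields back from a file name; the example name is assembled from the field values listed above, not a verified file in the archive:
name = "wv_mcd19a2v061.seasconv_m_1km_s_20000101_20221231_go_epsg.4326_v20230619"
fields = ["variable", "procedure", "type", "spatial_support", "depth",
          "time_begin", "time_end", "bbox", "epsg", "version"]
# The underscore-separated fields map one-to-one onto the convention.
print(dict(zip(fields, name.split("_"))))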
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Population surveys are vital for wildlife management, yet traditional methods are typically effort-intensive, leading to data gaps. Modern technologies — such as drones — facilitate field surveys but increase the data analysis burden. Citizen Science (CS) can alleviate this issue by engaging non-specialists in data collection and analysis. We evaluated this approach for population monitoring using the endangered Galápagos marine iguana as a case study, assessing citizen scientists' ability to detect and count animals in aerial images. Comparing against a Gold Standard dataset of expert counts in 4345 images, we explored optimal aggregation methods from CS inputs and evaluated the accuracy of CS counts. During three phases of our project — hosted on Zooniverse.org — over 13,000 volunteers made 1,375,201 classifications from 57,838 images; each image was independently classified up to 30 times. Volunteers achieved 68% to 94% accuracy in detecting iguanas, with more false negatives than false positives. Image quality strongly influenced accuracy; by excluding data from suboptimal pilot-phase images, volunteers counted with 91% to 92% accuracy. For detecting iguanas, the standard 'majority vote' aggregation approach (where the answer selected is that given by the majority of individual inputs) produced less accurate results than when a minimum threshold of five (from the total independent classifications) was used. For counting iguanas, HDBSCAN clustering yielded the best results. We conclude that CS can accurately identify and count marine iguanas from drone images, though there is a tendency to underestimate. CS-based data analysis is still resource-intensive, underscoring the need to develop a Machine Learning approach.
Methods
We created a citizen science project, named Iguanas from Above, on Zooniverse.org. There, we uploaded 'sliced' images from drone imagery of several colonies of the Galápagos marine iguana. Citizen scientists (CS) were asked to classify the images in two tasks: first, to say yes or no to iguana presence in the image, and second, to count the individuals when present. Each image was classified by 20 or 30 volunteers. Once all the images from the three launched phases were classified, we downloaded the data from the Zooniverse portal and used the Panoptes Aggregation Python package to extract and aggregate CS data (source code: https://github.com/cwinkelmann/iguanas-from-above-zooniverse).
We randomly selected 5–10% of all the images to create a Gold Standard (GS) dataset. Three experts from the research team identified presence and absence of marine iguanas in the images and counted them. The consensus answers are presented in this dataset and are referred to as expert data. The aggregated CS data from Task 1 (a total number of yes and no answers per image) was accepted as indicating iguana presence when 5 or more volunteers (of the 20–30) selected yes (a minimum threshold rule); otherwise absence was accepted. Then, we compared all CS accepted answers against the expert data, as correct or incorrect, and calculated a percentage of CS accuracy for marine iguana detection.
For Task 2, we selected all the images identified by the volunteers as having iguanas under this minimum threshold rule and aggregated (summarized) all classifications into one value (count) per image, using the statistical metrics median and mode and the spatial clustering methods DBSCAN and HDBSCAN. The remaining images received counts of 0.
CS data was incorporated into this dataset. We then compared the total counts in this GS dataset as calculated by the experts and by each of the aggregation methods, expressed as percentages of agreement with the expert data. These percentages show CS accuracy for marine iguana counting. We also investigated the numbers of marine iguanas under- and overestimated by each aggregation method. Finally, using generalized linear models, we explored statistical differences among the methods used to count marine iguanas (expert, median, mode, and HDBSCAN) and how the factors phase analyzed, image quality (assessed by the experts), and number of marine iguanas present in the image could affect CS accuracy.
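For the counting task, a minimal sketch of the HDBSCAN aggregation idea, assuming scikit-learn >= 1.3 and invented click coordinates (the project's actual aggregation code is linked above):
import numpy as np
from sklearn.cluster import HDBSCAN

# Volunteer marks (x, y) on one image; three clicks fall on one iguana,
# two on another (coordinates are invented for illustration).
clicks = np.array([[102, 340], [104, 338], [99, 342],
                   [410, 120], [412, 118]])
labels = HDBSCAN(min_cluster_size=2).fit_predict(clicks)
n_iguanas = len(set(labels) - {-1})  # clusters, excluding noise (-1)
print(n_iguanas)  # -> 2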