Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Standardized data from Mobilise-D participants (YAR dataset) and pre-existing datasets (ICICLE, MSIPC2, Gait in Lab and real-life settings, MS project, UNISS-UNIGE) are provided in the shared folder as an example of the procedures proposed in the publication "Mobility recorded by wearable devices and gold standards: the Mobilise-D procedure for data standardization", currently under review at Scientific Data. Please refer to that publication for further information, and please cite it if using these data.
The code to standardize an example subject (for the ICICLE dataset) and to open the standardized Matlab files in other languages (Python, R) is available on GitHub (https://github.com/luca-palmerini/Procedure-wearable-data-standardization-Mobilise-D).
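For quick inspection without the repository code, a standardized Matlab file can also be opened directly with scipy; this is a minimal sketch, and the file name is a placeholder rather than part of the published folder structure:
from scipy.io import loadmat

# Minimal sketch, assuming scipy >= 1.5; "data.mat" is a placeholder name.
data = loadmat("data.mat", simplify_cells=True)  # nested structs become dicts
print(data.keys())  # inspect the top-level variables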
Burmese pythons are an invasive species in the Greater Everglades Ecosystem. Burmese pythons captured in the ecosystem are euthanized, and in an effort to learn about this invasive species, all euthanized pythons are necropsied, during which time samples are collected. We analyzed the stable isotope ratios of carbon and nitrogen in muscle samples from 423 Burmese pythons euthanized and necropsied between 2003-05-01 and 2012-09-02, and after processing and QA/QC, we were left with isotope ratios for 412 samples, which we reported here. We used these data to estimate the size of the isotopic niche of the Burmese python, commonly measured using standard ellipse areas, or SEAs. To put these SEAs in context, we conducted a thorough literature review to find published sizes of other isotopic niches, beginning in 2017-01 and finalized on 2018-06-12. We reported the papers we found during this literature review here. We then reviewed each paper and recorded any SEAs presented in the paper or its supporting information, and we presented those results here.
This dataset was created by abhishek
Mass spectrometry-based proteomics is increasingly employed in biology and medicine. To generate reliable information from large data sets and to ensure comparability of results, it is crucial to implement and standardize the quality control of the raw data, the data processing steps, and the statistical analyses. The MSPypeline provides a platform for the import of MaxQuant output tables, the generation of quality control reports, the preprocessing of data (including normalization), and exploratory analyses via statistical inference plots. These standardized steps assess data quality, provide customizable figures, and enable the identification of differentially expressed proteins to reach biologically relevant conclusions.
The dataset combines the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2023 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor's database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 12/08/2023 from data collected in 2022-2023. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.
CAMA Notes:
The CAMA data underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.
CAMA data was provided by the towns.
Canaan parcels are viewable, but no additional information is available since no CAMA data was submitted.
Spatial Data Notes:
Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,247,506 parcels.
No alteration has been made to the spatial geometry of the data.
Fields associated with the CAMA data were provided by the towns and sourced from the towns' CAMA submissions.
If a town did not provide a field linking the parcels back to the CAMA, a field from the original data with a match rate above 50% to the CAMA was selected instead.
Linking fields were renamed to "Link".
All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.
Only the fields for town name, location, editor, edit date, and the link field associated with the town's CAMA were used in creating this dataset; any other field provided in the original data was deleted or not used.
Field names for town (Muni, Municipality) were renamed to "Town Name".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal of this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborate models.
Data usage
The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494
Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.
Sample code
As part of the data release, we also include sample code written in Python 3; the preprocessed data used in the scripts are provided as well. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for machine learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip. A minimal loading sketch follows the file list below.
Units
All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.
Missing data
The string "NAN" indicates missing data.
File formats
All time series data files are in CSV (comma-separated values) format. Images are given in tar.bz2 archives.
Files
Folsom_irradiance.csv (Primary): One-minute GHI, DNI, and DHI data.
Folsom_weather.csv (Primary): One-minute weather data.
Folsom_sky_images_{YEAR}.tar.bz2 (Primary): Tar archives (bz2-compressed) with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016.
Folsom_NAM_lat{LAT}_lon{LON}.csv (Primary): NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node's coordinates listed in Table I of the paper.
Folsom_sky_image_features.csv (Secondary): Features derived from the sky images.
Folsom_satellite.csv (Secondary): 10-pixel-by-10-pixel GOES-15 images centered on the target location.
Irradiance_features_{horizon}.csv (Secondary): Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).
Sky_image_features_intra-hour.csv (Secondary): Sky image features for the intra-hour forecast issuing times.
Sat_image_features_intra-day.csv (Secondary): Satellite image features for the intra-day forecast issuing times.
NAM_nearest_node_day-ahead.csv (Secondary): NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the node nearest the target location, prepared for day-ahead forecasting.
Target_{horizon}.csv (Secondary): Target data for the different forecasting horizons.
Forecast_{horizon}.py (Code): Python script used to create the forecasts for the different horizons.
Postprocess.py (Code): Python script used to compute the error metric for all the forecasts.
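As a minimal loading sketch, assuming only pandas (the position of the timestamp column is an assumption to be checked against the CSV header):
import pandas as pd

# Timestamps are UTC and missing values are stored as the string "NAN".
irr = pd.read_csv("Folsom_irradiance.csv", index_col=0, parse_dates=True,
                  na_values="NAN")
print(irr.head())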
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To standardize metabolomics data analysis and facilitate future computational developments, it is essential to have a set of well-defined templates for common data structures. Here we describe a collection of data structures involved in metabolomics data processing and illustrate how they are utilized in a full-featured Python-centric pipeline. We demonstrate the performance of the pipeline, and the details in annotation and quality control using large-scale LC-MS metabolomics and lipidomics data and LC-MS/MS data. Multiple previously published datasets are also reanalyzed to showcase its utility in biological data analysis. This pipeline allows users to streamline data processing, quality control, annotation, and standardization in an efficient and transparent manner. This work fills a major gap in the Python ecosystem for computational metabolomics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension.
All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling.
The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability.
All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically relevant dataset.
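A minimal sketch of the described pipeline, assuming the Kaggle CSV layout ("id" and "diagnosis" columns plus 30 feature columns); it illustrates the preprocessing and models named above, not the accompanying breast_cancer_classification_models.py script itself:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("data.csv")  # assumed Kaggle file name
X = df.drop(columns=["id", "diagnosis"])
y = (df["diagnosis"] == "M").astype(int)  # 1 = malignant, 0 = benign

# Stratified 80/20 split preserves the class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# z-score standardization, fit on the training set only.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Perceptron": Perceptron(random_state=42),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f}, "
          f"F1={f1_score(y_te, pred):.3f}")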
The analysis of electrophysiology data typically comprises multiple steps. These often consist of several scripts executed in a specific temporal order, which take different parameter sets and use distinct data files. As the researcher adjusts the individual analysis steps to accommodate new hypotheses or additional data, the resulting workflows may become increasingly complex and undergo frequent changes. Although it is possible to use workflow management systems to organize the execution of the scripts and capture provenance information at the level of the script (i.e., which script file was executed and in which environment) and data file (i.e., which input and output files were supplied to that script), the resulting provenance track does not automatically provide details about the actual analysis carried out inside each script. Therefore, the final analysis results can only be understood by source code inspection or reliance on any accompanying documentation. We focus on two open-source tools for the analysis of electrophysiology data developed in EBRAINS. The Neo (RRID:SCR_000634) framework provides an object model to standardize neural activity data acquired from distinct sources. Elephant (RRID:SCR_003833) is a Python toolbox that provides several functions for the analysis of electrophysiology data. We set out to improve these tools by implementing a data model that captures detailed provenance information and by representing the analysis results in a systematic and formalized manner. Ultimately, these developments aim to improve reproducibility, interoperability, findability, and re-use of analysis results.
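As a hedged illustration of how the two tools fit together (not the provenance data model under development), a Neo spike train can be passed directly to an Elephant analysis function:
import quantities as pq
from neo import SpikeTrain
from elephant.statistics import mean_firing_rate

# A spike train with three spikes in a 2-second recording.
st = SpikeTrain([0.3, 0.9, 1.5] * pq.s, t_stop=2.0 * pq.s)
print(mean_firing_rate(st))  # -> 1.5 1/s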
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is the data from the tournament described here: http://axelrod-tournament.readthedocs.org/en/latest/standard/strategies.html
A description of the format is available here: http://axelrod.readthedocs.org/en/latest/tutorials/further_topics/reading_and_writing_interactions.html
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Samples in this benchmark were generated by RELAI using the following data source(s):
Data Source Name: langchain Documentation
Data Source Link: https://python.langchain.com/docs/introduction/
Data Source License: https://github.com/langchain-ai/langchain/blob/master/LICENSE
Data Source Authors: Observable
AI Benchmarks by Data Agents © 2025 RELAI.AI. Licensed under CC BY 4.0. Source: https://relai.ai
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code [GitHub] | Publication [Nature Scientific Data'23 / ISBI'21] | Preprint [arXiv]
Abstract
We introduce MedMNIST, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.
Disclaimer: The only official distribution link for the MedMNIST dataset is Zenodo. We kindly request users to refer to this original dataset link for accurate and up-to-date data.
Update: We are thrilled to release MedMNIST+ with larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D. As a complement to the previous 28-size MedMNIST, the large-size version could serve as a standardized benchmark for medical foundation models. Install the latest API to try it out!
Python Usage
We recommend our official code to download, parse and use the MedMNIST dataset:
% pip install medmnist
% python
To use the standard 28-size (MNIST-like) version utilizing the downloaded files:
from medmnist import PathMNIST
train_dataset = PathMNIST(split="train")
To enable automatic downloading, set download=True:
from medmnist import NoduleMNIST3D
val_dataset = NoduleMNIST3D(split="val", download=True)
Alternatively, you can access MedMNIST+ with larger image sizes by specifying the size parameter:
from medmnist import ChestMNIST
test_dataset = ChestMNIST(split="test", download=True, size=224)
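As a hedged usage sketch (assuming PyTorch and torchvision are installed), the dataset objects follow the torch Dataset protocol and can be fed to a standard DataLoader once images are converted to tensors:
from torch.utils.data import DataLoader
from torchvision import transforms
from medmnist import PathMNIST

train_dataset = PathMNIST(split="train", download=True,
                          transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
images, labels = next(iter(train_loader))
print(images.shape)  # e.g., torch.Size([128, 3, 28, 28])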
Citation
If you find this project useful, please cite both the v1 and v2 papers:
Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D biomedical image classification." Scientific Data, 2023.
Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.
or using bibtex:
@article{medmnistv2,
  title={MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D biomedical image classification},
  author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
  journal={Scientific Data},
  volume={10},
  number={1},
  pages={41},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
@inproceedings{medmnistv1,
  title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
  author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
  booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
  pages={191--195},
  year={2021}
}
Please also cite the corresponding paper(s) of source data if you use any subset of MedMNIST as per the description on the project website.
License
The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), except DermaMNIST under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
The code is under Apache-2.0 License.
Changelog
v3.0 (this repository): Released MedMNIST+ featuring larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D.
v2.2: Removed a small number of mistakenly included blank samples in OrganAMNIST, OrganCMNIST, OrganSMNIST, OrganMNIST3D, and VesselMNIST3D.
v2.1: Addressed an issue in the NoduleMNIST3D file (i.e., nodulemnist3d.npz). Further details can be found in this issue.
v2.0: Launched the initial repository of MedMNIST v2, adding 6 datasets for 3D and 2 for 2D.
v1.0: Established the initial repository (in a separate repository) of MedMNIST v1, featuring 10 datasets for 2D.
Note: This dataset is NOT intended for clinical use.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Non-ribosomal peptide synthetase (NRPS) is a diverse family of biosynthetic enzymes for the assembly of bioactive peptides. Despite advances in microbial sequencing, the lack of a consistent standard for annotating NRPS domains and modules has made data-driven discoveries challenging. To address this, we introduced a standardized architecture for NRPS, by using known conserved motifs to partition typical domains. This motif-and-intermotif standardization allowed for systematic evaluations of sequence properties from a large number of NRPS pathways, resulting in the most comprehensive cross-kingdom C domain subtype classifications to date, as well as the discovery and experimental validation of novel conserved motifs with functional significance. Furthermore, our coevolution analysis revealed important barriers associated with re-engineering NRPSs and uncovered the entanglement between phylogeny and substrate specificity in NRPS sequences. Our findings provide a comprehensive and statistically insightful analysis of NRPS sequences, opening avenues for future data-driven discoveries.
1. Framework overview.
This paper proposes a pipeline to construct high-quality datasets for text mining in materials science. First, we utilize a traceable automatic literature-acquisition scheme to ensure the traceability of textual data. Then, a data processing method driven by downstream tasks is applied to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.
2. Dataset information.
The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Herein, we mainly introduce the details of the NASICON entity recognition dataset.
2.1 Data collection and preprocessing.
First, 55 materials science publications related to the NASICON system were collected through Crystallographic Information Files (CIF), which contain a wealth of structure-activity relationship information. Note that materials science literature is mostly stored in portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of the literature, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token
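A minimal sketch of this extraction-and-cleanup step, assuming pdfminer.six; the regular-expression rules shown are illustrative stand-ins, not the actual rules used in the paper:
import re
from pdfminer.high_level import extract_text

text = extract_text("paper.pdf")             # PDF -> plain text
text = re.sub(r"-\n(?=\w)", "", text)        # rejoin words hyphenated at line breaks
text = re.sub(r"[ \t]*\n[ \t]*", " ", text)  # collapse layout line breaks
with open("paper.txt", "w", encoding="utf-8") as f:
    f.write(text)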
Coordinate System Update:
This dataset is provided in the NAD 83 Connecticut State Plane (2011) projection (EPSG 2234) instead of WGS 1984 Web Mercator Auxiliary Sphere (EPSG 3857), the coordinate system of the 2023 dataset, and will remain in Connecticut State Plane moving forward.
Ownership Suppression and Data Access:
The updated dataset now includes parcel data for all towns across the state, with some towns featuring fully suppressed ownership information. In these instances, the owner's name is replaced with the label "Current Owner," the co-owner's name is listed as "Current Co-Owner," and the mailing address appears as the property address itself. For towns with suppressed ownership data, users should be aware that the submissions contained no "Suppression" field against which specific details could be verified; this labeling was implemented this year to help verify compliance with suppression requirements.
New Data Fields:
The new dataset introduces the "Land Acres" field, which will display the total acreage for each parcel. This additional field allows for more detailed analysis and better supports planning, zoning, and property valuation tasks. An important new addition is the FIPS code field, which provides the Federal Information Processing Standards (FIPS) code for each parcel’s corresponding block. This allows users to easily identify which block the parcel is in.
Updated Service URL:
The new parcel service URL includes all the updates mentioned above, such as the improved coordinate system, new data fields, and additional geospatial information. Users are strongly encouraged to transition to the new service as soon as possible to ensure that their workflows remain uninterrupted. The URL for this service will remain persistent moving forward, ensuring long-term stability.
For a limited time, the old service will continue to be available, but it will eventually be retired. Users should plan to switch to the new service well before this cutoff to avoid any disruptions in data access.
The dataset combines the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2024 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor's database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually.
These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 10/31/2024 from data collected in 2023-2024. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.
CAMA Notes:
The CAMA data underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.
CAMA data was provided by the towns.
Spatial Data Notes:
Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,290,196 parcels.
No alteration has been made to the spatial geometry of the data.
Fields associated with the CAMA data were provided by the towns and sourced from the towns' CAMA submissions.
If a town did not provide a field linking the parcels back to the CAMA, a field from the original data with a match rate above 50% to the CAMA was selected instead.
Linking fields were renamed to "Link".
All linking fields had a census town code added to the beginning of the value to create a unique identifier per town (see the sketch after this list).
Only the fields for town name, location, editor, edit date, and the link field associated with the town's CAMA were used in creating this dataset; any other field provided in the original data was deleted or not used.
Field names for town (Muni, Municipality) were renamed to "Town Name".
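A minimal pandas sketch of the linking-field standardization described above; the file name, column name, and town code are hypothetical examples, not values from the actual submissions:
import pandas as pd

cama = pd.read_csv("town_cama.csv")              # one town's CAMA report (hypothetical file)
cama = cama.rename(columns={"PID": "Link"})      # town-specific linking field (hypothetical name)
cama["Link"] = "093" + cama["Link"].astype(str)  # prefix a census town code (example value)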
The attributes included in the data:
Town Name
Owner
Co-Owner
Link
Editor
Edit Date
Collection year – year the parcels were submitted
Location
Mailing Address
Mailing City
Mailing State
Assessed Total
Assessed Land
Assessed Building
Pre-Year Assessed Total
Appraised Land
Appraised Building
Appraised Outbuilding
Condition
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
pip install -r requirements.txt
Model configurations are set in the config.py file. For DeepSeek-Coder and CodeLlama, follow the instructions on the Ollama website to deploy the models locally.
python src/Baseline_Standard.py --output_path "./dataset/Standard.json"
python src/Defense_Standard.py --output_path "./dataset/Standard.json"
The CyberSecEval benchmark is located in the CyberSecEval directory.
cd CyberSecEval/CybersecurityBenchmarks
python -m CybersecurityBenchmarks.benchmark.run \
  --benchmark=instruct \
  --prompt-path="./datasets/instruct/Standard.json" \
  --llm-under-test="OPENAI::{MODEL_NAME}::{MODEL_KEY}::{BASE_URL}"
Replace {MODEL_NAME}, {MODEL_KEY}, and {BASE_URL} with the appropriate values for your LLM. For example, if using OpenAI's GPT-4, BASE_URL is the API endpoint. For DeepSeek, use https://api.deepseek.com/v1.
For other scenarios, modify the corresponding Baseline_*.py and Defense_*.py files. All generated data will be saved in the dataset directory.
Security Evaluation
Evaluation results are stored in the result directory. For instance, to evaluate the effectiveness of CodeGuarder under the Standard scenario with DeepSeek-Coder, run:
python src/sec_eval.py --result_path "./output/std_def_DS-Coder.json"
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of processed data in NetCDF (.nc) files used in our study. We used the SPI to determine meteorological drought conditions in the study area, calculated using the open-source Python module Climate and Drought Indices.
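The files can be opened with any NetCDF reader; a minimal sketch using xarray (the file and variable names are placeholders, not the dataset's actual names):
import xarray as xr

ds = xr.open_dataset("spi_monthly.nc")  # placeholder file name
print(ds)                               # inspect dimensions and variables
spi = ds["spi"]                         # placeholder variable name
print(spi.sel(time="2015-08"))          # e.g., SPI values for one month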
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data is part of the Monthly aggregated Water Vapor MODIS MCD19A2 (1 km) dataset. Check the related identifiers section on the Zenodo side panel to access other parts of the dataset.
General Description
The monthly aggregated water vapor dataset is derived from MCD19A2 v061. The water vapor data measure the column above ground retrieved from MODIS near-IR bands at 0.94 μm. The dataset spans from 2000 to 2022 and covers the entire globe. It can be used in many applications like water cycle modeling, vegetation mapping, and soil mapping. This dataset includes:
Monthly time-series: Derived from MCD19A2 v061, this data provides a monthly aggregated mean and standard deviation of the daily water vapor time-series data from 2000 to 2022. Only positive non-cloudy pixels were considered valid observations to derive the mean and the standard deviation. The remaining no-data values were filled using the TMWM algorithm. This dataset also includes smoothed mean and standard deviation values using the Whittaker method. The quality assessment layers and the number of valid observations for each month can provide an indication of the reliability of the monthly mean and standard deviation values.
Yearly time-series: Derived from the monthly time-series, this data provides yearly aggregated statistics of the monthly time-series data.
Long-term data (2000-2022): Derived from the monthly time-series, this data provides long-term aggregated statistics for the whole series of monthly observations.
Data Details
Time period: 2000–2022
Type of data: Water vapor column above the ground (0.001 cm)
How the data was collected or derived: Derived from MCD19A2 v061 using Google Earth Engine. Cloudy pixels were removed and only positive values of water vapor were considered to compute the statistics. The time-series gap-filling and smoothing were computed using the scikit-map Python package.
Statistical methods used: Four statistics were derived: standard deviation and percentiles 25, 50, and 75.
Limitations or exclusions in the data: The dataset does not include data for Antarctica.
Coordinate reference system: EPSG:4326
Bounding box (Xmin, Ymin, Xmax, Ymax): (-180.00000, -62.00081, 179.99994, 87.37000)
Spatial resolution: 1/120 d.d. = 0.008333333 (1 km)
Image size: 43,200 x 17,924
File format: Cloud Optimized GeoTIFF (COG)
Support
If you discover a bug, artifact, or inconsistency, or if you have a question, please use one of the following channels:
Technical issues and questions about the code: GitLab Issues
General questions and comments: LandGIS Forum
Name convention
To ensure consistency and ease of use across and within the projects, we follow the standard Open-Earth-Monitor file-naming convention. The convention uses 10 fields that describe important properties of the data; in this way users can search files, prepare data analyses, etc., without needing to open them. The fields are listed below, followed by a short parsing sketch:
generic variable name: wv = water vapor
variable procedure combination: mcd19a2v061.seasconv = MCD19A2 v061 with gap-filling algorithm
position in the probability distribution / variable type: m = mean | sd = standard deviation | n = number of observations | qa = quality assessment
spatial support: 1km
depth reference: s = surface
time reference begin: 20000101 = 2000-01-01
time reference end: 20221231 = 2022-12-31
bounding box: go = global (without Antarctica)
EPSG code: epsg.4326 = EPSG:4326
version code: v20230619 = 2023-06-19 (creation date)
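A minimal sketch of reading the 10 fields back from a file name; the example name is assembled from the field values listed above, not a verified file in the archive:
name = "wv_mcd19a2v061.seasconv_m_1km_s_20000101_20221231_go_epsg.4326_v20230619"
fields = ["variable", "procedure", "type", "spatial_support", "depth",
          "time_begin", "time_end", "bbox", "epsg", "version"]
# The underscore-separated fields map one-to-one onto the convention.
print(dict(zip(fields, name.split("_"))))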
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Population surveys are vital for wildlife management, yet traditional methods are typically effort-intensive, leading to data gaps. Modern technologies — such as drones — facilitate field surveys but increase the data analysis burden. Citizen Science (CS) can alleviate this issue by engaging non-specialists in data collection and analysis. We evaluated this approach for population monitoring using the endangered Galápagos marine iguana as a case study, assessing citizen scientists' ability to detect and count animals in aerial images. Comparing against a Gold Standard dataset of expert counts in 4345 images, we explored optimal aggregation methods from CS inputs and evaluated the accuracy of CS counts. During three phases of our project — hosted on Zooniverse.org — over 13,000 volunteers made 1,375,201 classifications from 57,838 images; each image was independently classified up to 30 times. Volunteers achieved 68% to 94% accuracy in detecting iguanas, with more false negatives than false positives. Image quality strongly influenced accuracy; by excluding data from suboptimal pilot-phase images, volunteers counted with 91% to 92% accuracy. For detecting iguanas, the standard 'majority vote' aggregation approach (where the answer selected is that given by the majority of individual inputs) produced less accurate results than when a minimum threshold of five (from the total independent classifications) was used. For counting iguanas, HDBSCAN clustering yielded the best results. We conclude that CS can accurately identify and count marine iguanas from drone images, though there is a tendency to underestimate. CS-based data analysis is still resource-intensive, underscoring the need to develop a Machine Learning approach.
Methods
We created a citizen science project, named Iguanas from Above, on Zooniverse.org. There, we uploaded 'sliced' images from drone imagery of several colonies of the Galápagos marine iguana. Citizen scientists (CS) were asked to classify the images in two tasks: first, to say yes or no to iguana presence in the image, and second, to count the individuals when present. Each image was classified by 20 or 30 volunteers. Once all the images from the three launched phases were classified, we downloaded the data from the Zooniverse portal and used the Panoptes Aggregation Python package to extract and aggregate CS data (source code: https://github.com/cwinkelmann/iguanas-from-above-zooniverse).
We randomly selected 5–10% of all the images to create a Gold Standard (GS) dataset. Three experts from the research team identified presence and absence of marine iguanas in the images and counted them. The consensus answers are presented in this dataset and are referred to as expert data. The aggregated CS data from Task 1 (a total number of yes and no answers per image) was accepted as indicating iguana presence when 5 or more volunteers (of the 20–30) selected yes (a minimum threshold rule); otherwise absence was accepted. Then, we compared all CS accepted answers against the expert data, as correct or incorrect, and calculated a percentage of CS accuracy for marine iguana detection.
For Task 2, we selected all the images identified by the volunteers as having iguanas under this minimum threshold rule and aggregated (summarized) all classifications into one value (count) per image, using the statistical metrics median and mode and the spatial clustering methods DBSCAN and HDBSCAN. The remaining images received counts of 0.
CS data was incorporated into this dataset. We then compared the total counts in this GS dataset as calculated by the experts and by each of the aggregation methods, expressed as percentages of agreement with the expert data. These percentages show CS accuracy for marine iguana counting. We also investigated the numbers of marine iguanas under- and overestimated by each aggregation method. Finally, using generalized linear models, we explored statistical differences among the methods used to count marine iguanas (expert, median, mode, and HDBSCAN) and how the factors phase analyzed, image quality (assessed by the experts), and number of marine iguanas present in the image could affect CS accuracy.
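For the counting task, a minimal sketch of the HDBSCAN aggregation idea, assuming scikit-learn >= 1.3 and invented click coordinates (the project's actual aggregation code is linked above):
import numpy as np
from sklearn.cluster import HDBSCAN

# Volunteer marks (x, y) on one image; three clicks fall on one iguana,
# two on another (coordinates are invented for illustration).
clicks = np.array([[102, 340], [104, 338], [99, 342],
                   [410, 120], [412, 118]])
labels = HDBSCAN(min_cluster_size=2).fit_predict(clicks)
n_iguanas = len(set(labels) - {-1})  # clusters, excluding noise (-1)
print(n_iguanas)  # -> 2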