18 datasets found

f
Data from: S1 Data -
plos.figshare.com
zip
Updated Oct 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0292466.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.s001
Dataset updated
Oct 11, 2023
Dataset provided by
PLOS ONE
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analyzing customers’ characteristics and giving the early warning of customer churn based on machine learning algorithms, can help enterprises provide targeted marketing strategies and personalized services, and save a lot of operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations are done on 900,000 telecom customer personal characteristics and historical behavior data set based on Python language. Appropriate model parameters were selected to build BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, the two classic ensemble learning models were introduced, and the Adaboost dual-ensemble learning model with RF as the base learner was put forward. The four models and the other four classical machine learning models-decision tree, naive Bayes, K-Nearest Neighbor (KNN), Support Vector Machine (SVM) were utilized respectively to analyze the customer churn data. The results show that the four models have better performance in terms of recall rate, precision rate, F1 score and other indicators, and the RF-Adaboost dual-ensemble model has the best performance. Among them, the recall rates of BPNN, RF, Adaboost and RF-Adaboost dual-ensemble model on positive samples are respectively 79%, 90%, 89%,93%, the precision rates are 97%, 99%, 98%, 99%, and the F1 scores are 87%, 95%, 94%, 96%. The RF-Adaboost dual-ensemble model has the best performance, and the three indicators are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
h
python-code-instructions-18k-alpaca-standardized
huggingface.co
Updated Sep 2, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
HydraLM (2023). python-code-instructions-18k-alpaca-standardized [Dataset]. https://huggingface.co/datasets/HydraLM/python-code-instructions-18k-alpaca-standardized
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 2, 2023
Dataset authored and provided by
HydraLM
Description
Dataset Card for "python-code-instructions-18k-alpaca-standardized"

More Information needed
Z
Example subjects for Mobilise-D data standardization
data.niaid.nih.gov
zenodo.org
Updated Oct 11, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Del Din, Silvia (2022). Example subjects for Mobilise-D data standardization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7185428
Explore at:
Dataset updated
Oct 11, 2022
Dataset provided by
Kluge, Felix
Soltani, Abolfazl
Micó-Amigo, Encarna
Rochester, Lynn
on behalf of the Mobilise-D consortium
Paraschiv-Ionescu, Anisoara
Caruso, Marco
Salis, Francesca
Hansen, Clint
Gazit, Eran
Bonci, Tecla
Palmerini, Luca
Reggi, Luca
Bertuletti, Stefano
Del Din, Silvia
Kirk, Cameron
Ullrich, Martin
Cereatti, Andrea
Mazzà, Claudia
Küderle, Arne
Hiden, Hugo
Chiari, Lorenzo
D'Ascanio, Ilaria
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Standardized data from Mobilise-D participants (YAR dataset) and pre-existing datasets (ICICLE, MSIPC2, Gait in Lab and real-life settings, MS project, UNISS-UNIGE) are provided in the shared folder, as an example of the procedures proposed in the publication "Mobility recorded by wearable devices and gold standards: the Mobilise-D procedure for data standardization" that is currently under review in Scientific data. Please refer to that publication for further information. Please cite that publication if using these data.

The code to standardize an example subject (for the ICICLE dataset) and to open the standardized Matlab files in other languages (Python, R) is available in github (https://github.com/luca-palmerini/Procedure-wearable-data-standardization-Mobilise-D).
Z
Data from: A comprehensive dataset for the accelerated development and...
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carreira Pedro, Hugo (2020). A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2826938
Explore at:
Dataset updated
Jan 24, 2020
Dataset provided by
Larson, David
Carreira Pedro, Hugo
Coimbra, Carlos
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborated models.

Data usage The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494

Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.

Sample code As part of the data release, we are also including the sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for Machine Learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.

Units All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.

Missing data The string "NAN" indicates missing data

File formats All time series data files as in CSV (comma separated values) Images are given in tar.bz2 files

Files

Folsom_irradiance.csv Primary One-minute GHI, DNI, and DHI data.

Folsom_weather.csv Primary One-minute weather data.

Folsom_sky_images_{YEAR}.tar.bz2 Primary Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2.

Folsom_NAM_lat{LAT}_lon{LON}.csv Primary NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node’s coordinates listed in Table I in the paper.

Folsom_sky_image_features.csv Secondary Features derived from the sky images.

Folsom_satellite.csv Secondary 10 pixel by 10 pixel GOES-15 images centered in the target location.

Irradiance_features_{horizon}.csv Secondary Irradiance features for the different forecasting horizons ({horizon} 1⁄4 {intra-hour, intra-day, day-ahead}).

Sky_image_features_intra-hour.csv Secondary Sky image features for the intra-hour forecasting issuing times.

Sat_image_features_intra-day.csv Secondary Satellite image features for the intra-day forecasting issuing times.

NAM_nearest_node_day-ahead.csv Secondary NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location prepared for day-ahead forecasting.

Target_{horizon}.csv Secondary Target data for the different forecasting horizons.

Forecast_{horizon}.py Code Python script used to create the forecasts for the different horizons.

Postprocess.py Code Python script used to compute the error metric for all the forecasts.
c
Connecticut State Parcel Layer 2023
deepmaps.ct.gov
data.ct.gov
+4more
Updated Feb 15, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
State of Connecticut (2024). Connecticut State Parcel Layer 2023 [Dataset]. https://deepmaps.ct.gov/items/b8d8b38c83c74a23a6595cfb7d4c839b
Explore at:
Dataset updated
Feb 15, 2024
Dataset authored and provided by
State of Connecticut
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered

Description
The dataset has combined the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2023 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report), to its respective regional council of governments (COG) by May 1 annually. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 12/08/2023 from data collected in 2022-2023. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.CAMA Notes:The CAMA underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.CAMA was provided by the towns.Canaan parcels are viewable, but no additional information is available since no CAMA data was submitted.Spatial Data Notes:Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,247,506 parcels.No alteration has been made to the spatial geometry of the data.Fields that are associated with CAMA data were provided by towns.The data fields that have information from the CAMA were sourced from the towns’ CAMA data.If no field for the parcels was provided for linking back to the CAMA by the town a new field within the original data was selected if it had a match rate above 50%, that joined back to the CAMA.Linking fields were renamed to "Link".All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.Any field that was not town name, Location, Editor, Edit Date, or a field associated back to the CAMA, was not used in the creation of this Dataset.Only the fields related to town name, location, editor, edit date, and link fields associated with the towns’ CAMA were included in the creation of this dataset. Any other field provided in the original data was deleted or not used.Field names for town (Muni, Municipality) were renamed to "Town Name".The attributes included in the data: Town Name OwnerCo-OwnerLinkEditorEdit DateCollection year – year the parcels were submittedLocationMailing AddressMailing CityMailing StateAssessed TotalAssessed LandAssessed BuildingPre-Year Assessed Total Appraised LandAppraised BuildingAppraised OutbuildingConditionModelValuationZoneState UseState Use DescriptionLiving AreaEffective AreaTotal roomsNumber of bedroomsNumber of BathsNumber of Half-BathsSale PriceSale DateQualifiedOccupancyPrior Sale PricePrior Sale DatePrior Book and PagePlanning Region*Please note that not all parcels have a link to a CAMA entry.*If any discrepancies are discovered within the data, whether pertaining to geographical inaccuracies or attribute inaccuracy, please directly contact the respective municipalities to request any necessary amendmentsAs of 2/15/2023 - Occupancy, State Use, State Use Description, and Mailing State added to datasetAdditional information about the specifics of data availability and compliance will be coming soon.
c
Connecticut CAMA and Parcel Layer
geodata.ct.gov
hub.arcgis.com
Updated Nov 20, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
State of Connecticut (2024). Connecticut CAMA and Parcel Layer [Dataset]. https://geodata.ct.gov/datasets/ctmaps::connecticut-cama-and-parcel-layer
Explore at:
Dataset updated
Nov 20, 2024
Dataset authored and provided by
State of Connecticut
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered

Description
Coordinate system Update:Notably, this dataset will be provided in NAD 83 Connecticut State Plane (2011) (EPSG 6434) projection, instead of WGS 1984 Web Mercator Auxiliary Sphere (EPSG 3857) which is the coordinate system of the 2023 dataset and will remain in Connecticut State Plane moving forward.Ownership Suppression and Data Access:The updated dataset now includes parcel data for all towns across the state, with some towns featuring fully suppressed ownership information. In these instances, the owner’s name will be replaced with the label "Current Owner," the co-owner’s name will be listed as "Current Co-Owner," and the mailing address will appear as the property address itself. For towns with suppressed ownership data, users should be aware that there was no "Suppression" field in the submission to verify specific details. This measure was implemented this year to help verify compliance with Suppression.New Data Fields:The new dataset introduces the "Land Acres" field, which will display the total acreage for each parcel. This additional field allows for more detailed analysis and better supports planning, zoning, and property valuation tasks. An important new addition is the FIPS code field, which provides the Federal Information Processing Standards (FIPS) code for each parcel’s corresponding block. This allows users to easily identify which block the parcel is in.Updated Service URL:The new parcel service URL includes all the updates mentioned above, such as the improved coordinate system, new data fields, and additional geospatial information. Users are strongly encouraged to transition to the new service as soon as possible to ensure that their workflows remain uninterrupted. The URL for this service will remain persistent moving forward. Once you have transitioned to the new service, the URL will remain constant, ensuring long term stability.For a limited time, the old service will continue to be available, but it will eventually be retired. Users should plan to switch to the new service well before this cutoff to avoid any disruptions in data access.The dataset has combined the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2024 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report), to its respective regional council of governments (COG) by May 1 annually. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 10/31/2024 from data collected in 2023-2024. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.CAMA Notes:The CAMA underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.CAMA was provided by the towns.Spatial Data Notes:Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,290,196 parcels.No alteration has been made to the spatial geometry of the data.Fields that are associated with CAMA data were provided by towns.The data fields that have information from the CAMA were sourced from the towns’ CAMA data.If no field for the parcels was provided for linking back to the CAMA by the town a new field within the original data was selected if it had a match rate above 50%, that joined back to the CAMA.Linking fields were renamed to "Link".All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.Any field that was not town name, Location, Editor, Edit Date, or a field associated back to the CAMA, was not used in the creation of this Dataset.Only the fields related to town name, location, editor, edit date, and link fields associated with the towns’ CAMA were included in the creation of this dataset. Any other field provided in the original data was deleted or not used.Field names for town (Muni, Municipality) were renamed to "Town Name".The attributes included in the data: Town Name OwnerCo-OwnerLinkEditorEdit DateCollection year – year the parcels were submittedLocationMailing AddressMailing CityMailing StateAssessed TotalAssessed LandAssessed BuildingPre-Year Assessed Total Appraised LandAppraised BuildingAppraised OutbuildingConditionModelValuationZoneState UseState Use DescriptionLand Acre Living AreaEffective AreaTotal roomsNumber of bedroomsNumber of BathsNumber of Half-BathsSale PriceSale DateQualifiedOccupancyPrior Sale PricePrior Sale DatePrior Book and PagePlanning RegionFIPS Code *Please note that not all parcels have a link to a CAMA entry.*If any discrepancies are discovered within the data, whether pertaining to geographical inaccuracies or attribute inaccuracy, please directly contact the respective municipalities to request any necessary amendmentsAdditional information about the specifics of data availability and compliance will be coming soon.If you need a WFS service for use in specific applications : Please Click Here
P
ATOM3D Dataset
paperswithcode.com
opendatalab.com
Updated Jul 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Raphael J. L. Townshend; Martin Vögele; Patricia Suriana; Alexander Derry; Alexander Powers; Yianni Laloudakis; Sidhika Balachandar; Bowen Jing; Brandon Anderson; Stephan Eismann; Risi Kondor; Russ B. Altman; Ron O. Dror (2023). ATOM3D Dataset [Dataset]. https://paperswithcode.com/dataset/atom3d
Explore at:
Dataset updated
Jul 16, 2023
Authors
Raphael J. L. Townshend; Martin Vögele; Patricia Suriana; Alexander Derry; Alexander Powers; Yianni Laloudakis; Sidhika Balachandar; Bowen Jing; Brandon Anderson; Stephan Eismann; Risi Kondor; Russ B. Altman; Ron O. Dror
Description
ATOM3D is a unified collection of datasets concerning the three-dimensional structure of biomolecules, including proteins, small molecules, and nucleic acids. These datasets are specifically designed to provide a benchmark for machine learning methods which operate on 3D molecular structure, and represent a variety of important structural, functional, and engineering tasks. All datasets are provided in a standardized format along with a Python package containing processing code, utilities, models, and dataloaders for common machine learning frameworks such as PyTorch. ATOM3D is designed to be a living database, where datasets are updated and tasks are added as the field progresses.

Description from: https://www.atom3d.ai/
SDNist v1.3: Temporal Map Challenge Environment
datasets.ai
data.nist.gov
0, 23, 5, 8
Updated Aug 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2024). SDNist v1.3: Temporal Map Challenge Environment [Dataset]. https://datasets.ai/datasets/sdnist-benchmark-data-and-evaluation-tools-for-data-synthesizers
Explore at:
5, 23, 8, 0Available download formats
Dataset updated
Aug 6, 2024
Dataset authored and provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
SDNist (v1.3) is a set of benchmark data and metrics for the evaluation of synthetic data generators on structured tabular data. This version (1.3) reproduces the challenge environment from Sprints 2 and 3 of the Temporal Map Challenge. These benchmarks are distributed as a simple open-source python package to allow standardized and reproducible comparison of synthetic generator models on real world data and use cases. These data and metrics were developed for and vetted through the NIST PSCR Differential Privacy Temporal Map Challenge, where the evaluation tools, k-marginal and Higher Order Conjunction, proved effective in distinguishing competing models in the competition environment.SDNist is available via pip install: pip install sdnist==1.2.8 for Python >=3.6 or on the USNIST/Github. The sdnist Python module will download data from NIST as necessary, and users are not required to download data manually.
d
Data from: Tidal Resource Data from Sequim Bay Inlet, WA, August 2020
datasets.ai
data.openei.org
+3more
21, 32, 34
Updated Aug 26, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department of Energy (2020). Tidal Resource Data from Sequim Bay Inlet, WA, August 2020 [Dataset]. https://datasets.ai/datasets/tidal-resource-data-from-sequim-bay-inlet-wa-august-2020
Explore at:
32, 34, 21Available download formats
Dataset updated
Aug 26, 2020
Dataset authored and provided by
Department of Energy
Area covered
Sequim, Sequim Bay
Description
Data from a Nortek Signature1000 deployed on a lander for 14 days in Aug 2020 in the entrance to Sequim Bay, WA. Raw data were processed using the DOLfYN python package and standardized using the ME Data Pipeline python package, tsdat version 0.2.12. Processed data were partitioned into 24 hour increments and saved in the NETCDF file format.
d
Analyzed Data for The Impact of COVID-19 on Technical Services Units Survey...
search.dataone.org
figshare.com
Updated Nov 12, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Szkirpan, Elizabeth (2023). Analyzed Data for The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.7910/DVN/DGBUV7
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/DGBUV7
Dataset updated
Nov 12, 2023
Dataset provided by
Harvard Dataverse
Authors
Szkirpan, Elizabeth
Description
These datasets contain cleaned data survey results from the October 2021-January 2022 survey titled "The Impact of COVID-19 on Technical Services Units". This data was gathered from a Qualtrics survey, which was anonymized to prevent Qualtrics from gathering identifiable information from respondents. These specific iterations of data reflect cleaning and standardization so that data can be analyzed using Python. Ultimately, the three files reflect the removal of survey begin/end times, other data auto-recorded by Qualtrics, blank rows, blank responses after question four (the first section of the survey), and non-United States responses. Note that State names for "What state is your library located in?" (Q36) were also standardized beginning in Impact_of_COVID_on_Tech_Services_Clean_3.csv to aid in data analysis. In this step, state abbreviations were spelled out and spelling errors were corrected.
d
gravitools - A collection of tools to analyze gravimeter data - Dataset -...
b2find.dkrz.de
Updated Mar 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). gravitools - A collection of tools to analyze gravimeter data - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/fafb0909-205c-5425-b117-d1f4c3269d70
Explore at:
Dataset updated
Mar 15, 2025
Description
This Python package is a collaborative effort by the gravity Metrology group at the German Federal Agency for Carthography and Geoesy (BKG) and the Hydrology section at GFZ Helmholtz Centre for Geosciences. It comprises functionalities and features around the respectively new instrument type of a Quantum Gravimeter (here AQG). New (standardized) instrument data format additional to new measurement and processing concepts lead to the first collection of scripts and now complete python package for a fully-featured analysis of AQG data. This encompasses live-monitoring while the instrument is actually measuring (with enhanced functionality than what is provided by the manufacturer), data processing, visualizations as well as archiving data, fulfilling the idea of reproducible data within FAIR principles. Many of these functionalities and concepts also apply to other gravimeter types. It is thus planned to include also access and processing of data for these other devices (starting in the near future with CG-6 relative gravimeters). This package is actively maintained and developed. If you are interested in contributing, please do not hesitate to contact us. Please find instructions for its installation and usage in the documentation or git repository, linked in the left panel. gravitools is listed in the python standard repository database "PyPi".
f
NRPS Motif Finder online version code (Python).
figshare.com
zip
Updated Jun 2, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruolin He; Jinyu Zhang; Yuanzhe Shao; Shaohua Gu; Chen Song; Long Qian; Wen-Bing Yin; Zhiyuan Li (2023). NRPS Motif Finder online version code (Python). [Dataset]. http://doi.org/10.1371/journal.pcbi.1011100.s059
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pcbi.1011100.s059
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS Computational Biology
Authors
Ruolin He; Jinyu Zhang; Yuanzhe Shao; Shaohua Gu; Chen Song; Long Qian; Wen-Bing Yin; Zhiyuan Li
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Non-ribosomal peptide synthetase (NRPS) is a diverse family of biosynthetic enzymes for the assembly of bioactive peptides. Despite advances in microbial sequencing, the lack of a consistent standard for annotating NRPS domains and modules has made data-driven discoveries challenging. To address this, we introduced a standardized architecture for NRPS, by using known conserved motifs to partition typical domains. This motif-and-intermotif standardization allowed for systematic evaluations of sequence properties from a large number of NRPS pathways, resulting in the most comprehensive cross-kingdom C domain subtype classifications to date, as well as the discovery and experimental validation of novel conserved motifs with functional significance. Furthermore, our coevolution analysis revealed important barriers associated with re-engineering NRPSs and uncovered the entanglement between phylogeny and substrate specificity in NRPS sequences. Our findings provide a comprehensive and statistically insightful analysis of NRPS sequences, opening avenues for future data-driven discoveries.
T
protein_net
tensorflow.org
Updated Dec 16, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). protein_net [Dataset]. https://www.tensorflow.org/datasets/catalog/protein_net
Explore at:
Dataset updated
Dec 16, 2022
Description
ProteinNet is a standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training / validation / test splits. ProteinNet builds on the biennial CASP assessments, which carry out blind predictions of recently solved but publicly unavailable protein structures, to provide test sets that push the frontiers of computational methodology. It is organized as a series of data sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a range of data set sizes that enable assessment of new methods in relatively data poor and data rich regimes.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('protein_net', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
WoSIS snapshot - December 2023
data.isric.org
repository.soilwise-he.eu
Updated Dec 20, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ISRIC - World Soil Information (2023). WoSIS snapshot - December 2023 [Dataset]. https://data.isric.org/geonetwork/srv/api/records/e50f84e1-aa5b-49cb-bd6b-cd581232a2ec
Explore at:
www:link-1.0-http--related, www:link-1.0-http--link, www:download-1.0-ftp--downloadAvailable download formats
Dataset updated
Dec 20, 2023
Dataset provided by
International Soil Reference and Information Centre
Authors
ISRIC - World Soil Information
Time period covered
Jan 1, 1918 - Dec 1, 2022
Area covered
Pacific Ocean, North Pacific Ocean
Description
ABSTRACT: The World Soil Information Service (WoSIS) provides quality-assessed and standardized soil profile data to support digital soil mapping and environmental applications at broad scale levels. Since the release of the ‘WoSIS snapshot 2019’ many new soil data were shared with us, registered in the ISRIC data repository, and subsequently standardized in accordance with the licenses specified by the data providers. The source data were contributed by a wide range of data providers, therefore special attention was paid to the standardization of soil property definitions, soil analytical procedures and soil property values (and units of measurement). We presently consider the following soil chemical properties (organic carbon, total carbon, total carbonate equivalent, total Nitrogen, Phosphorus (extractable-P, total-P, and P-retention), soil pH, cation exchange capacity, and electrical conductivity) and physical properties (soil texture (sand, silt, and clay), bulk density, coarse fragments, and water retention), grouped according to analytical procedures (aggregates) that are operationally comparable. For each profile we provide the original soil classification (FAO, WRB, USDA, and version) and horizon designations as far as these have been specified in the source databases. Three measures for 'fitness-for-intended-use' are provided: positional uncertainty (for site locations), time of sampling/description, and a first approximation for the uncertainty associated with the operationally defined analytical methods. These measures should be considered during digital soil mapping and subsequent earth system modelling that use the present set of soil data. DATA SET DESCRIPTION: The 'WoSIS 2023 snapshot' comprises data for 228k profiles from 217k geo-referenced sites that originate from 174 countries. The profiles represent over 900k soil layers (or horizons) and over 6 million records. The actual number of measurements for each property varies (greatly) between proﬁles and with depth, this generally depending on the objectives of the initial soil sampling programmes. The data are provided in TSV (tab separated values) format and as GeoPackage. The zip-file (446 Mb) contains the following files: - Readme_WoSIS_202312_v2.pdf: Provides a short description of the dataset, file structure, column names, units and category values (this file is also available directly under 'online resources'). The pdf includes links to tutorials for downloading the TSV files into R respectively Excel. See also 'HOW TO READ TSV FILES INTO R AND PYTHON' in the next section. - wosis_202312_observations.tsv: This file lists the four to six letter codes for each observation, whether the observation is for a site/profile or layer (horizon), the unit of measurement and the number of profiles respectively layers represented in the snapshot. It also provides an estimate for the inferred accuracy for the laboratory measurements. - wosis_202312_sites.tsv: This file characterizes the site location where profiles were sampled. - wosis_2023112_profiles: Presents the unique profile ID (i.e. primary key), site_id, source of the data, country ISO code and name, positional uncertainty, latitude and longitude (WGS 1984), maximum depth of soil described and sampled, as well as information on the soil classification system and edition. Depending on the soil classification system used, the number of fields will vary . - wosis_202312_layers: This file characterises the layers (or horizons) per profile, and lists their upper and lower depths (cm). - wosis_202312_xxxx.tsv : This type of file presents results for each observation (e.g. “xxxx” = “BDFIOD” ), as defined under “code” in file wosis_202312_observation.tsv. (e.g. wosis_202311_bdfiod.tsv). - wosis_202312.gpkg: Contains the above datafiles in GeoPackage format (which stores the files within an SQLite database). HOW TO READ TSV FILES INTO R AND PYTHON: A) To read the data in R, please uncompress the ZIP file and specify the uncompressed folder. setwd("/YourFolder/WoSIS_2023_December/") ## For example: setwd('D:/WoSIS_2023_December/') Then use read_tsv to read the TSV files, specifying the data types for each column (c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time). observations = readr::read_tsv('wosis_202312_observations.tsv', col_types='cccciid') observations ## show columns and first 10 rows sites = readr::read_tsv('wosis_202312_sites.tsv', col_types='iddcccc') sites profiles = readr::read_tsv('wosis_202312_profiles.tsv', col_types='icciccddcccccciccccicccci') profiles layers = readr::read_tsv('wosis_202312_layers.tsv', col_types='iiciciiilcc') layers ## Do this for each observation 'XXXX', e.g. file 'Wosis_202312_orgc.tsv': orgc = readr::read_tsv('wosis_202312_orgc.tsv', col_types='iicciilccdccddccccc') orgc Note: One may also use the following R code (example is for file 'observations.tsv'): observations <- read.table("wosis_202312_observations.tsv", sep = "\t", header = TRUE, quote = "", comment.char = "", stringsAsFactors = FALSE ) B) To read the files into python first decompress the files to your selected folder. Then in python: # import the required library import pandas as pd # Read the observations data observations = pd.read_csv("wosis_202312_observations.tsv", sep="\t") # print the data frame header and some rows observations.head() # Read the sites data sites = pd.read_csv("wosis_202312_sites.tsv", sep="\t") # Read the profiles data profiles = pd.read_csv("wosis_202312_profiles.tsv", sep="\t") # Read the layers data layers = pd.read_csv("wosis_202312_layers.tsv", sep="\t") # Read the soil property data, e.g. 'cfvo' (do this for each observation) cfvo = pd.read_csv("wosis_202312_cfvo.tsv", sep="\t") CITATION: Calisto, L., de Sousa, L.M., Batjes, N.H., 2023. Standardised soil profile data for the world (WoSIS snapshot – December 2023), https://doi.org/10.17027/isric-wdcsoils-20231130 Supplement to: Batjes N.H., Calisto, L. and de Sousa L.M., 2023. Providing quality-assessed and standardised soil data to support global mapping and modelling (WoSIS snapshot 2023). Earth System Science Data, https://doi.org/10.5194/essd-16-4735-2024.
f
DataSheet_2_tidytcells: standardizer for TR/MH nomenclature.zip
frontiersin.figshare.com
zip
Updated Oct 25, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yuta Nagano; Benjamin Chain (2023). DataSheet_2_tidytcells: standardizer for TR/MH nomenclature.zip [Dataset]. http://doi.org/10.3389/fimmu.2023.1276106.s002
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.3389/fimmu.2023.1276106.s002
Dataset updated
Oct 25, 2023
Dataset provided by
Frontiers
Authors
Yuta Nagano; Benjamin Chain
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
T cell receptors (TR) underpin the diversity and specificity of T cell activity. As such, TR repertoire data is valuable both as an adaptive immune biomarker, and as a way to identify candidate therapeutic TR. Analysis of TR repertoires relies heavily on computational analysis, and therefore it is of vital importance that the data is standardized and computer-readable. However in practice, the usage of different abbreviations and non-standard nomenclature in different datasets makes this data pre-processing non-trivial. tidytcells is a lightweight, platform-independent Python package that provides easy-to-use standardization tools specifically designed for TR nomenclature. The software is open-sourced under the MIT license and is available to install from the Python Package Index (PyPI). At the time of publishing, tidytcells is on version 2.0.0.
f
zip of MetDataModel v0.6.1.
plos.figshare.com
zip
Updated Jun 18, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joshua M. Mitchell; Yuanye Chi; Maheshwor Thapa; Zhiqiang Pang; Jianguo Xia; Shuzhao Li (2024). zip of MetDataModel v0.6.1. [Dataset]. http://doi.org/10.1371/journal.pcbi.1011912.s007
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pcbi.1011912.s007
Dataset updated
Jun 18, 2024
Dataset provided by
PLOS Computational Biology
Authors
Joshua M. Mitchell; Yuanye Chi; Maheshwor Thapa; Zhiqiang Pang; Jianguo Xia; Shuzhao Li
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
To standardize metabolomics data analysis and facilitate future computational developments, it is essential to have a set of well-defined templates for common data structures. Here we describe a collection of data structures involved in metabolomics data processing and illustrate how they are utilized in a full-featured Python-centric pipeline. We demonstrate the performance of the pipeline, and the details in annotation and quality control using large-scale LC-MS metabolomics and lipidomics data and LC-MS/MS data. Multiple previously published datasets are also reanalyzed to showcase its utility in biological data analysis. This pipeline allows users to streamline data processing, quality control, annotation, and standardization in an efficient and transparent manner. This work fills a major gap in the Python ecosystem for computational metabolomics.
RF models’ parameters and experimental results comparison table.
plos.figshare.com
figshare.com
xls
Updated Oct 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). RF models’ parameters and experimental results comparison table. [Dataset]. http://doi.org/10.1371/journal.pone.0292466.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.t002
Dataset updated
Oct 11, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
RF models’ parameters and experimental results comparison table.
Adaboost models’ parameters and experimental results comparison.
plos.figshare.com
xls
Updated Oct 11, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). Adaboost models’ parameters and experimental results comparison. [Dataset]. http://doi.org/10.1371/journal.pone.0292466.t003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0292466.t003
Dataset updated
Oct 11, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Adaboost models’ parameters and experimental results comparison.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0292466.s001

Data from: S1 Data -

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.1371/journal.pone.0292466.s001

Dataset updated

Oct 11, 2023

Dataset provided by

PLOS ONE

Authors

Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Analyzing customers’ characteristics and giving the early warning of customer churn based on machine learning algorithms, can help enterprises provide targeted marketing strategies and personalized services, and save a lot of operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations are done on 900,000 telecom customer personal characteristics and historical behavior data set based on Python language. Appropriate model parameters were selected to build BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, the two classic ensemble learning models were introduced, and the Adaboost dual-ensemble learning model with RF as the base learner was put forward. The four models and the other four classical machine learning models-decision tree, naive Bayes, K-Nearest Neighbor (KNN), Support Vector Machine (SVM) were utilized respectively to analyze the customer churn data. The results show that the four models have better performance in terms of recall rate, precision rate, F1 score and other indicators, and the RF-Adaboost dual-ensemble model has the best performance. Among them, the recall rates of BPNN, RF, Adaboost and RF-Adaboost dual-ensemble model on positive samples are respectively 79%, 90%, 89%,93%, the precision rates are 97%, 99%, 98%, 99%, and the F1 scores are 87%, 95%, 94%, 96%. The RF-Adaboost dual-ensemble model has the best performance, and the three indicators are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.

Clear search

Close search

Google apps

Main menu

Data from: S1 Data -

python-code-instructions-18k-alpaca-standardized

Example subjects for Mobilise-D data standardization

Data from: A comprehensive dataset for the accelerated development and...

Connecticut State Parcel Layer 2023

Connecticut CAMA and Parcel Layer

ATOM3D Dataset

SDNist v1.3: Temporal Map Challenge Environment

Data from: Tidal Resource Data from Sequim Bay Inlet, WA, August 2020

Analyzed Data for The Impact of COVID-19 on Technical Services Units Survey...

gravitools - A collection of tools to analyze gravimeter data - Dataset -...

NRPS Motif Finder online version code (Python).

protein_net

WoSIS snapshot - December 2023

DataSheet_2_tidytcells: standardizer for TR/MH nomenclature.zip

zip of MetDataModel v0.6.1.

RF models’ parameters and experimental results comparison table.

Adaboost models’ parameters and experimental results comparison.

Data from: S1 Data -