Dataset Card for "python-code-instructions-18k-alpaca-standardized"
More Information needed
R and Python libraries for the standardization of data extraction and analysis from NHANES.
Dataset Card for "instruct-python-500k-standardized"
More Information needed
https://creativecommons.org/publicdomain/zero/1.0/
Hosted by: Walsoft Computer Institute (dataset available on Kaggle)
Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.
As part of an internal review, the leadership team has hired you, a Data Science Consultant, to analyze this dataset and provide clear, evidence-based recommendations on how to improve the program.
Answer this central question:
“Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”
You are required to analyze and provide actionable insights for the following three areas:
Should entry exams remain the primary admissions filter?
Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.
✅ Deliverables:
Are there at-risk student groups who need extra support?
Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.
✅ Deliverables:
How can we allocate resources for maximum student success?
Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.
✅ Deliverables:
| Column | Description |
|---|---|
| fNAME, lNAME | Student first and last name |
| Age | Student age (21–71 years) |
| gender | Gender (standardized as "Male"/"Female") |
| country | Student’s country of origin |
| residence | Student housing/residence type |
| entryEXAM | Entry test score (28–98) |
| prevEducation | Prior education (High School, Diploma, etc.) |
| studyHOURS | Total study hours logged |
| Python | Final Python exam score |
| DB | Final Database exam score |
You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.
Download: bi.csv
This dataset includes common data quality challenges:
Country name inconsistencies
e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom
Residence type variations
e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence
Education level typos and casing issues
e.g. Barrrchelors → Bachelor, DIPLOMA, Diplomaaa → Diploma
Gender value noise
e.g. M, F, female → standardize to Male / Female
Missing scores in Python subject
Fill NaN values using the column mean or another suitable imputation strategy
Participants using this dataset are expected to apply data cleaning techniques such as:
- String standardization
- Null value imputation
- Type correction (e.g., scores as float)
- Validation and visual verification
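A minimal pandas sketch of these steps is shown below; the column names follow the data dictionary above, while the exact set of raw spellings in bi.csv is assumed from the examples given:

import pandas as pd

df = pd.read_csv("bi.csv")

# Standardize country names (mappings taken from the examples above)
df["country"] = df["country"].replace({"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})

# Unify residence spellings such as BI-Residence / BIResidence / BI_Residence
df["residence"] = df["residence"].str.replace(r"BI[-_]?Residence", "BI Residence", regex=True)

# Fix education casing and typos (e.g. DIPLOMA, Diplomaaa, Barrrchelors)
df["prevEducation"] = df["prevEducation"].str.strip().str.title().replace({"Barrrchelors": "Bachelor", "Diplomaaa": "Diploma"})

# Standardize gender values (M, F, female -> Male / Female)
df["gender"] = df["gender"].str.strip().str.lower().map({"m": "Male", "male": "Male", "f": "Female", "female": "Female"})

# Impute missing Python scores with the column mean and enforce a float dtype
df["Python"] = pd.to_numeric(df["Python"], errors="coerce")
df["Python"] = df["Python"].fillna(df["Python"].mean()).astype(float)

# Quick visual verification of the cleaned values
print(df["gender"].unique(), df["country"].unique())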
✅ Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.
Download: cleaned_bi.csv
This version has been fully standardized and preprocessed:
- All fields cleaned and renamed consistently
- Missing Python scores filled with the column mean
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Standardized data from Mobilise-D participants (YAR dataset) and pre-existing datasets (ICICLE, MSIPC2, Gait in Lab and real-life settings, MS project, UNISS-UNIGE) are provided in the shared folder as an example of the procedures proposed in the publication "Mobility recorded by wearable devices and gold standards: the Mobilise-D procedure for data standardization", currently under review at Scientific Data. Please refer to that publication for further information, and cite it if using these data.
The code to standardize an example subject (for the ICICLE dataset) and to open the standardized Matlab files in other languages (Python, R) is available on GitHub (https://github.com/luca-palmerini/Procedure-wearable-data-standardization-Mobilise-D).
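As a hedged illustration, not the repository's own code, a standardized Matlab file of this kind can typically be inspected in Python with scipy; the file name and internal structure below are assumptions:

import scipy.io as sio

# Open a standardized .mat file and list its top-level variables (file name is illustrative).
# Files saved in Matlab v7.3 format are HDF5 containers and would need h5py instead.
mat = sio.loadmat("standardized_subject.mat", squeeze_me=True, struct_as_record=False)
print([k for k in mat if not k.startswith("__")])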
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analyzing customers’ characteristics and providing early warning of customer churn with machine learning algorithms can help enterprises deliver targeted marketing strategies and personalized services, and save substantial operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations were performed in Python on a data set of 900,000 telecom customers’ personal characteristics and historical behavior. Appropriate model parameters were selected to build a Back Propagation Neural Network (BPNN). Random Forest (RF) and Adaboost, two classic ensemble learning models, were introduced, and an Adaboost dual-ensemble learning model with RF as the base learner was proposed. These four models, together with four classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM)), were applied to the customer churn data. The results show that the four models perform better in terms of recall, precision, F1 score and other indicators, with the RF-Adaboost dual-ensemble model performing best. The recall rates of the BPNN, RF, Adaboost and RF-Adaboost dual-ensemble models on positive samples are 79%, 90%, 89% and 93%, respectively; the precision rates are 97%, 99%, 98% and 99%; and the F1 scores are 87%, 95%, 94% and 96%. The three indicators of the RF-Adaboost dual-ensemble model are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
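The RF-as-base-learner idea can be sketched with scikit-learn roughly as follows. This is a minimal illustration on synthetic data, not the study's code; the hyperparameters, class weights, and the estimator keyword (base_estimator in scikit-learn < 1.2) are assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (preprocessed, oversampled) churn data
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Standardize features, mirroring the preprocessing described above
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# "Dual-ensemble": AdaBoost whose weak learner is itself a (small) Random Forest
rf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
model = AdaBoostClassifier(estimator=rf, n_estimators=20, random_state=0)  # scikit-learn >= 1.2
model.fit(X_train, y_train)

# Recall, precision, and F1 on the positive (churn) class, as reported in the study
print(classification_report(y_test, model.predict(X_test)))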
The dataset has combined the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2023 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report), to its respective regional council of governments (COG) by May 1 annually.
These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 12/08/2023 from data collected in 2022-2023. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.
CAMA Notes:
The CAMA underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.
CAMA was provided by the towns.
Canaan parcels are viewable, but no additional information is available since no CAMA data was submitted.
Spatial Data Notes:
Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,247,506 parcels.
No alteration has been made to the spatial geometry of the data.
Fields that are associated with CAMA data were provided by towns.
The data fields that have information from the CAMA were sourced from the towns’ CAMA data.
If a town did not provide a field for linking the parcels back to the CAMA, a field from the original data with a match rate above 50% that joined back to the CAMA was selected instead.
Linking fields were renamed to "Link".
All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.
Only the fields for town name, Location, Editor, Edit Date, and the link fields associated with the towns’ CAMA were used in creating this dataset; any other field provided in the original data was deleted or not used.
Field names for town (Muni, Municipality) were renamed to "Town Name".
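As a hedged illustration of the linking-field step described in the notes above (the column names and town code value are placeholders, not the actual CAMA schema):

import pandas as pd

# Toy CAMA table; the real submissions contain many more fields
cama = pd.DataFrame({"census_town_code": ["08070", "08070"], "Link": ["001234", "005678"]})

# Prepend the census town code to the linking value to make it unique per town
cama["Link"] = cama["census_town_code"] + cama["Link"]
print(cama["Link"].tolist())  # ['08070001234', '08070005678']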
The attributes included in the data:
Town Name
Owner
Co-Owner
Link
Editor
Edit Date
Collection year – year the parcels were submitted
Location
Mailing Address
Mailing City
Mailing State
Assessed Total
Assessed Land
Assessed Building
Pre-Year Assessed Total
Appraised Land
Appraised Building
Appraised Outbuilding
Condition
Model
Valuation
Zone
State Use
State Use Description
Living Area
Effective Area
Total rooms
Number of bedrooms
Number of Baths
Number of Half-Baths
Sale Price
Sale Date
Qualified
Occupancy
Prior Sale Price
Prior Sale Date
Prior Book and Page
Planning Region
*Please note that not all parcels have a link to a CAMA entry.
*If any discrepancies are discovered within the data, whether pertaining to geographical inaccuracies or attribute inaccuracies, please directly contact the respective municipalities to request any necessary amendments.
As of 2/15/2023 - Occupancy, State Use, State Use Description, and Mailing State added to dataset
Additional information about the specifics of data availability and compliance will be coming soon.
Additional file 2: List of Embryophyta genomes analyzed in this study, including their taxonomic classification and accession numbers
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code [GitHub] | Publication [Nature Scientific Data'23 / ISBI'21] | Preprint [arXiv]
Abstract
We introduce MedMNIST, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.
Disclaimer: The only official distribution link for the MedMNIST dataset is Zenodo. We kindly request users to refer to this original dataset link for accurate and up-to-date data.
Update: We are thrilled to release MedMNIST+ with larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D. As a complement to the previous 28-size MedMNIST, the large-size version could serve as a standardized benchmark for medical foundation models. Install the latest API to try it out!
Python Usage
We recommend our official code to download, parse and use the MedMNIST dataset:
% pip install medmnist
% python
To use the standard 28-size (MNIST-like) version utilizing the downloaded files:
from medmnist import PathMNIST
train_dataset = PathMNIST(split="train")
To enable automatic downloading, set download=True:
from medmnist import NoduleMNIST3D
val_dataset = NoduleMNIST3D(split="val", download=True)
Alternatively, you can access MedMNIST+ with larger image sizes by specifying the size parameter:
from medmnist import ChestMNIST
test_dataset = ChestMNIST(split="test", download=True, size=224)
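For training, the dataset objects can then be wrapped in a standard PyTorch DataLoader once a tensor transform is supplied. A minimal sketch, assuming a torchvision-style transform passed via the transform argument (as in the project's examples):

from medmnist import PathMNIST
from torch.utils.data import DataLoader
from torchvision import transforms

# Convert PIL images to normalized tensors so batches can be drawn with a DataLoader
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=[0.5], std=[0.5])])
train_dataset = PathMNIST(split="train", transform=transform, download=True)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

images, labels = next(iter(train_loader))  # e.g. images: (128, 3, 28, 28), labels: (128, 1)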
Citation
If you find this project useful, please cite both the v1 and v2 papers:
Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D biomedical image classification." Scientific Data, 2023.
Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.
or using bibtex:
@article{medmnistv2,
  title={MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D biomedical image classification},
  author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
  journal={Scientific Data},
  volume={10},
  number={1},
  pages={41},
  year={2023},
  publisher={Nature Publishing Group UK London}
}

@inproceedings{medmnistv1,
  title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
  author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
  booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
  pages={191--195},
  year={2021}
}
Please also cite the corresponding paper(s) of source data if you use any subset of MedMNIST as per the description on the project website.
License
The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), except DermaMNIST under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
The code is under Apache-2.0 License.
Changelog
v3.0 (this repository): Released MedMNIST+ featuring larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D.
v2.2: Removed a small number of mistakenly included blank samples in OrganAMNIST, OrganCMNIST, OrganSMNIST, OrganMNIST3D, and VesselMNIST3D.
v2.1: Addressed an issue in the NoduleMNIST3D file (i.e., nodulemnist3d.npz). Further details can be found in this issue.
v2.0: Launched the initial repository of MedMNIST v2, adding 6 datasets for 3D and 2 for 2D.
v1.0: Established the initial repository (in a separate repository) of MedMNIST v1, featuring 10 datasets for 2D.
Note: This dataset is NOT intended for clinical use.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually. The dataset has combined the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2025 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created in September 2025 from data collected in 2024-2025. Data was processed using Python scripts and ArcGIS Pro for standardization and integration of the data. To learn more about Parcel and CAMA in CT, visit our Parcels Page in the Geodata Portal.
Coordinate system: This dataset is provided in NAD 83 Connecticut State Plane (2011) (EPSG 2234) projection, as it was for 2024. Prior versions were provided in WGS 1984 Web Mercator Auxiliary Sphere (EPSG 3857).
Ownership Suppression: The updated dataset includes parcel data for all towns across the state, with some towns featuring fully suppressed ownership information. In these instances, the owner’s name was replaced with the label "Current Owner," the co-owner’s name is listed as "Current Co-Owner," and the mailing address appears as the property address itself. For towns with fully suppressed ownership data, please note that no "Suppression" field was included in the submission to confirm these details; this labeling approach was implemented as the solution.
New Data Fields: The new dataset introduces the "Property Zip" and "Mailing Zip" fields, which display the zip codes for the owner and property.
Service URL: In 2024, we implemented a stable URL to maintain public access to the most up-to-date data layer. Users are strongly encouraged to transition to the new service as soon as possible to ensure uninterrupted workflows. This URL will remain persistent, providing long-term stability for your applications and integrations. Once you’ve transitioned to the new service, no further URL changes will be necessary.
CAMA Notes:
The CAMA underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,354,720 entries and information on property assessments and other relevant attributes.
CAMA was provided by the towns.
Spatial Data Notes:
Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,282,833 parcels.
No alteration has been made to the spatial geometry of the data.
Fields that are associated with CAMA data were provided by towns.
The data fields that have information from the CAMA were sourced from the towns’ CAMA data.
If a town did not provide a field for linking the parcels back to the CAMA, a field from the original data with a match rate above 50% that joined back to the CAMA was selected instead.
Linking fields were renamed to "Link".
All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.
Only the fields for town name, Location, Editor, Edit Date, and the link fields associated with the towns’ CAMA were used in creating this dataset; any other field provided in the original data was deleted or not used.
Field names for town (Muni, Municipality) were renamed to "Town Name".
Attributes included in the data:
Town Name
Owner
Co-Owner
Link
Editor
Edit Date
Collection year – year the parcels were submitted
Location
Property Zip
Mailing Address
Mailing City
Mailing State
Mailing Zip
Assessed Total
Assessed Land
Assessed Building
Pre-Year Assessed Total
Appraised Land
Appraised Building
Appraised Outbuilding
Condition
Model
Valuation
Zone
State Use
State Use Description
Land Acre
Living Area
Effective Area
Total rooms
Number of bedrooms
Number of Baths
Number of Half-Baths
Sale Price
Sale Date
Qualified
Occupancy
Prior Sale Price
Prior Sale Date
Prior Book and Page
Planning Region
FIPS Code
*Please note that not all parcels have a link to a CAMA entry.
*If any discrepancies are discovered within the data, whether pertaining to geographical inaccuracies or attribute inaccuracies, please directly contact the respective municipalities to request any necessary amendments.
Additional information about the specifics of data availability and compliance will be coming soon.
If you need a WFS service for use in specific applications: Please Click Here
Contact: opm.giso@ct.gov
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Effective pavement maintenance is vital for the economy, optimal performance, and safety, necessitating a thorough evaluation of pavement conditions such as strength, roughness, and surface distress. Pavement performance indicators significantly influence vehicle safety and ride quality. Recent advancements have focused on leveraging data-driven models to predict pavement performance, aiming to optimize fund allocation and enhance Maintenance and Rehabilitation (M&R) strategies through precise assessment of pavement conditions and defects. A critical prerequisite for these models is access to standardized, high-quality datasets to enhance prediction accuracy in pavement infrastructure management. This data article presents a comprehensive dataset compiled to support pavement performance prediction research, focusing on Southeast Texas, particularly the flood-prone region of Beaumont. The dataset includes pavement and traffic data, meteorological records, flood maps, ground deformation, and topographic indices to assess the impact of load-associated and non-load-associated pavement degradation. Data preprocessing was conducted using ArcGIS Pro, Microsoft Excel, and Python, ensuring the dataset is formatted for direct application in data-driven modeling approaches, including Machine Learning methods. Key contributions of this dataset include facilitating the analysis of climatic and environmental impacts on pavement conditions, enabling the identification of critical features influencing pavement performance, and allowing comprehensive data analysis to explore correlations and trends among input variables. By addressing gaps in input variable selection studies, this dataset supports the development of predictive tools for estimating future maintenance needs and improving the resilience of pavement infrastructure in flood-affected areas. This work highlights the importance of standardized datasets in advancing pavement management systems and provides a foundation for future research to enhance pavement performance prediction accuracy.
Mass spectrometry-based proteomics is increasingly employed in biology and medicine. To generate reliable information from large data sets and ensure comparability of results it is crucial to implement and standardize the quality control of the raw data, the data processing steps and the statistical analyses. The MSPypeline provides a platform for the import of MaxQuant output tables, the generation of quality control reports, the preprocessing of data including normalization and exploratory analyses by statistical inference plots. These standardized steps assess data quality, provide customizable figures and enable the identification of differentially expressed proteins to reach biologically relevant conclusions.
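As a generic illustration of the kind of preprocessing described, not the MSPypeline API itself, log transformation and per-sample median normalization of a MaxQuant-style intensity matrix might look like this (the column layout and values are invented):

import numpy as np
import pandas as pd

# proteinGroups-style intensity matrix: rows = proteins, columns = samples (invented values)
intensities = pd.DataFrame({"sample_A": [1.2e7, 3.4e6, 0.0], "sample_B": [9.8e6, 4.1e6, 2.2e6]})

log_int = np.log2(intensities.replace(0, np.nan))  # treat zero intensities as missing before log2
normalized = log_int - log_int.median() + log_int.median().mean()  # align per-sample medians
print(normalized)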
This dataset contains 2017 national Commercial RCRA-defined Hazardous Waste by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended InputOutput Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.
This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:
ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)
ESM_2.py – Python script to calculate Z-scores from raw financial ratios
ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios
ESM_4.py – Python script for generating the correlation heatmap of the Z-scores
ESM_5.xlsx – Mahalanobis distance values for each firm
ESM_6.py – Python script to compute Mahalanobis distances
ESM_7.py – Python script to visualize Mahalanobis distances
ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)
ESM_9.py – Python script to compute mean Z-scores
ESM_10.xlsx – Re-standardized Z-scores based on firm-level means
ESM_11.py – Python script to re-standardize mean Z-scores
ESM_12.py – Python script to generate the hierarchical clustering dendrogram
All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
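For orientation, a minimal sketch of the Mahalanobis-distance step, the idea behind ESM_5.xlsx and ESM_6.py rather than a copy of those scripts, using a synthetic Z-score matrix:

import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((18, 5))  # synthetic stand-in for 18 firms x 5 Z-scored ratios

mean_vec = Z.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(Z, rowvar=False))
diff = Z - mean_vec

# Mahalanobis distance of each firm from the multivariate centroid
d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
print(np.round(d, 2))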
1. Framework overview. This paper proposed a pipeline to construct high-quality datasets for text mining in materials science. Firstly, we utilize a traceable automatic literature acquisition scheme to ensure the traceability of textual data. Then, a data processing method driven by downstream tasks is performed to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.
2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Herein, we mainly introduce the details of the NASICON entity recognition dataset.
2.1 Data collection and preprocessing. Firstly, 55 materials science articles related to the NASICON system are collected through Crystallographic Information Files (CIF), which contain a wealth of structure-activity relationship information. Note that materials science literature is mostly stored in portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of each article, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token
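The PDF-to-text and regular-expression cleanup stage described above can be sketched as follows; a minimal illustration with pdfminer.six, where the file name and cleanup rules are placeholders rather than the authors' exact pipeline:

import re
from pdfminer.high_level import extract_text

# Convert one PDF article to plain text (file name is illustrative)
text = extract_text("nasicon_article.pdf")

# Re-join words hyphenated across line breaks, then collapse stray whitespace
text = re.sub(r"-\n", "", text)
text = re.sub(r"\s+", " ", text).strip()

with open("nasicon_article.txt", "w", encoding="utf-8") as f:
    f.write(text)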
This dataset contains 2017 national employment by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended InputOutput Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).
This dataset contains 2012 national level land occupation totals by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended InputOutput Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).
This dataset contains 2017 national point source releases to ground by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended InputOutput Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of processed data in NetCDF (.nc) files used in our study. We used the Standardized Precipitation Index (SPI) to determine meteorological drought conditions in the study area, calculated using the open-source Climate and Drought Indices module in Python.
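A minimal sketch for inspecting one of the processed NetCDF files with xarray (the file name and variable layout are assumptions):

import xarray as xr

ds = xr.open_dataset("spi_output.nc")  # illustrative file name
print(ds)  # dimensions, coordinates, and data variables (e.g. the SPI field)

spi = ds[list(ds.data_vars)[0]]  # grab the first data variable, presumably the SPI
print(float(spi.mean()))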
ProteinNet is a standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training / validation / test splits. ProteinNet builds on the biennial CASP assessments, which carry out blind predictions of recently solved but publicly unavailable protein structures, to provide test sets that push the frontiers of computational methodology. It is organized as a series of data sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a range of data set sizes that enable assessment of new methods in relatively data poor and data rich regimes.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('protein_net', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.