24 datasets found
  1. python-code-instructions-18k-alpaca-standardized

    • huggingface.co
    Updated Sep 2, 2023
    Cite
    HydraLM (2023). python-code-instructions-18k-alpaca-standardized [Dataset]. https://huggingface.co/datasets/HydraLM/python-code-instructions-18k-alpaca-standardized
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2023
    Dataset authored and provided by
    HydraLM
    Description

    Dataset Card for "python-code-instructions-18k-alpaca-standardized"

    More Information needed
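    Since the card itself is empty, here is a minimal sketch of loading the dataset with the Hugging Face datasets library; the pattern is standard, though the split name is an assumption about this particular repository:

    from datasets import load_dataset

    # Load the standardized instruction dataset from the Hugging Face Hub
    ds = load_dataset("HydraLM/python-code-instructions-18k-alpaca-standardized", split="train")
    print(ds[0])  # inspect one instruction/response record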

  2. R and Python libraries for the standardization of data extraction and...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated May 8, 2025
    Cite
    Zwiggelaar, Reyer; Spick, Matt; Harrison, Charlie; Suchak, Tulsi; Aliu, Anietie E.; Geifman, Nophar (2025). R and Python libraries for the standardization of data extraction and analysis from NHANES. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002102076
    Explore at:
    Dataset updated
    May 8, 2025
    Authors
    Zwiggelaar, Reyer; Spick, Matt; Harrison, Charlie; Suchak, Tulsi; Aliu, Anietie E.; Geifman, Nophar
    Description

    R and Python libraries for the standardization of data extraction and analysis from NHANES.

  3. instruct-python-500k-standardized

    • huggingface.co
    Updated Sep 3, 2023
    Cite
    HydraLM (2023). instruct-python-500k-standardized [Dataset]. https://huggingface.co/datasets/HydraLM/instruct-python-500k-standardized
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 3, 2023
    Dataset authored and provided by
    HydraLM
    Description

    Dataset Card for "instruct-python-500k-standardized"

    More Information needed

  4. BI intro to data cleaning eda and machine learning

    • kaggle.com
    zip
    Updated Nov 17, 2025
    Cite
    Walekhwa Tambiti Leo Philip (2025). BI intro to data cleaning eda and machine learning [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/intro-to-data-cleaning-eda-and-machine-learning/suggestions
    Explore at:
    Available download formats: zip (9961 bytes)
    Dataset updated
    Nov 17, 2025
    Authors
    Walekhwa Tambiti Leo Philip
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Real-World Data Science Challenge

    Business Intelligence Program Strategy — Student Success Optimization

    Hosted by: Walsoft Computer Institute

    Background

    Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.

    As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve:

    • Admissions decision-making
    • Academic support strategies
    • Overall program impact and ROI

    Your Mission

    Answer this central question:

    “Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”

    Key Strategic Areas

    You are required to analyze and provide actionable insights for the following three areas:

    1. Admissions Optimization

    Should entry exams remain the primary admissions filter?

    Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.

    ✅ Deliverables:

    • Feature importance ranking for predicting Python and DB scores
    • Admission policy recommendation (e.g., retain exams, add screening tools, adjust thresholds)
    • Business rationale and risk analysis

    2. Curriculum Support Strategy

    Are there at-risk student groups who need extra support?

    Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.

    ✅ Deliverables:

    • At-risk segment identification
    • Support program design (e.g., prep course, mentoring)
    • Expected outcomes, costs, and KPIs

    3. Resource Allocation & Program ROI

    How can we allocate resources for maximum student success?

    Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.

    ✅ Deliverables:

    • Performance drivers
    • Student segmentation
    • Resource allocation plan and ROI projection

    🛠️ Dataset Overview

    fNAME, lNAME: Student first and last name
    Age: Student age (21–71 years)
    gender: Gender (standardized as "Male"/"Female")
    country: Student’s country of origin
    residence: Student housing/residence type
    entryEXAM: Entry test score (28–98)
    prevEducation: Prior education (High School, Diploma, etc.)
    studyHOURS: Total study hours logged
    Python: Final Python exam score
    DB: Final Database exam score

    📊 Dataset

    You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.

    Raw Dataset (Recommended for Full Project)

    Download: bi.csv

    This dataset includes common data quality challenges:

    • Country name inconsistencies
      e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom

    • Residence type variations
      e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence

    • Education level typos and casing issues
      e.g. Barrrchelors → Bachelor; DIPLOMA, Diplomaaa → Diploma

    • Gender value noise
      e.g. M, F, female → standardize to Male / Female

    • Missing scores in Python subject
      Fill NaN values using column mean or suitable imputation strategy

    Participants using this dataset are expected to apply data cleaning techniques such as:
    • String standardization
    • Null value imputation
    • Type correction (e.g., scores as float)
    • Validation and visual verification

    A minimal pandas cleaning sketch is shown after the bonus note below.

    Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.
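    The sketch below walks through those steps, assuming the column names from the dataset overview; the mapping dictionaries are illustrative, not an official answer key:

    import pandas as pd

    df = pd.read_csv("bi.csv")

    # String standardization: unify country and residence spellings (mappings are illustrative)
    df["country"] = df["country"].str.strip().replace({"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})
    df["residence"] = df["residence"].str.replace(r"BI[-_]?Residence", "BI Residence", regex=True)

    # Education typos and casing: title-case first, then fix known misspellings
    df["prevEducation"] = df["prevEducation"].str.strip().str.title().replace({"Barrrchelors": "Bachelor", "Diplomaaa": "Diploma"})

    # Gender value noise: map single letters and lowercase variants to Male/Female
    df["gender"] = df["gender"].str.strip().str.lower().map({"m": "Male", "male": "Male", "f": "Female", "female": "Female"})

    # Type correction and null imputation: scores as float, NaN filled with the column mean
    df["Python"] = pd.to_numeric(df["Python"], errors="coerce")
    df["Python"] = df["Python"].fillna(df["Python"].mean())

    # Validation: quick visual check of the cleaned categorical values
    print(df["country"].value_counts())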

    Cleaned Dataset (Optional Shortcut)

    Download: cleaned_bi.csv

    This version has been fully standardized and preprocessed:
    • All fields cleaned and renamed consistently
    • Missing Python scores filled with th...

  5. Example subjects for Mobilise-D data standardization

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 11, 2022
    Cite
    Palmerini, Luca; Reggi, Luca; Bonci, Tecla; Del Din, Silvia; Micó-Amigo, Encarna; Salis, Francesca; Bertuletti, Stefano; Caruso, Marco; Cereatti, Andrea; Gazit, Eran; Paraschiv-Ionescu, Anisoara; Soltani, Abolfazl; Kluge, Felix; Küderle, Arne; Ullrich, Martin; Kirk, Cameron; Hiden, Hugo; D'Ascanio, Ilaria; Hansen, Clint; Rochester, Lynn; Mazzà, Claudia; Chiari, Lorenzo; on behalf of the Mobilise-D consortium (2022). Example subjects for Mobilise-D data standardization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7185428
    Explore at:
    Dataset updated
    Oct 11, 2022
    Dataset provided by
    Neurogeriatrics Kiel, Department of Neurology, University Hospital Schleswig-Holstein, Germany.
    Laboratory of Movement Analysis and Measurement, Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland.
    The University of Sheffield, INSIGNEO Institute for in silico Medicine, UK. The University of Sheffield, Department of Mechanical Engineering, UK
    Newcastle University, School of Computing, UK.
    University of Bologna, Health Sciences and Technologies—Interdepartmental Center for Industrial Research (CIRI-SDV), Italy
    University of Bologna, Department of Electrical, Electronic and Information Engineering 'Guglielmo Marconi', Italy.
    Politecnico di Torino, Department of Electronics and Telecommunications, Italy.
    Politecnico di Torino, Department of Electronics and Telecommunications, Italy. Politecnico di Torino, PolitoBIOMed Lab – Biomedical Engineering Lab, Italy.
    Machine Learning and Data Analytics Lab, Department of Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-University Erlangen-Nürnberg, Germany.
    Newcastle University, Translational and Clinical Research Institute, Faculty of Medical Sciences, UK.
    University of Bologna, Department of Electrical, Electronic and Information Engineering 'Guglielmo Marconi', Italy. University of Bologna, Health Sciences and Technologies—Interdepartmental Center for Industrial Research (CIRI-SDV), Italy
    Newcastle University, Translational and Clinical Research Institute, Faculty of Medical Sciences, UK. The Newcastle upon Tyne NHS Foundation Trust, UK.
    University of Sassari, Department of Biomedical Sciences, Italy.
    Tel Aviv Sourasky Medical Center, Center for the Study of Movement, Cognition and Mobility, Neurological Institute, Israel.
    https://www.mobilise-d.eu/partners
    Authors
    Palmerini, Luca; Reggi, Luca; Bonci, Tecla; Del Din, Silvia; Micó-Amigo, Encarna; Salis, Francesca; Bertuletti, Stefano; Caruso, Marco; Cereatti, Andrea; Gazit, Eran; Paraschiv-Ionescu, Anisoara; Soltani, Abolfazl; Kluge, Felix; Küderle, Arne; Ullrich, Martin; Kirk, Cameron; Hiden, Hugo; D'Ascanio, Ilaria; Hansen, Clint; Rochester, Lynn; Mazzà, Claudia; Chiari, Lorenzo; on behalf of the Mobilise-D consortium
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Standardized data from Mobilise-D participants (YAR dataset) and pre-existing datasets (ICICLE, MSIPC2, Gait in Lab and real-life settings, MS project, UNISS-UNIGE) are provided in the shared folder as an example of the procedures proposed in the publication "Mobility recorded by wearable devices and gold standards: the Mobilise-D procedure for data standardization", currently under review at Scientific Data. Please refer to that publication for further information, and cite it if using these data.

    The code to standardize an example subject (for the ICICLE dataset) and to open the standardized MATLAB files in other languages (Python, R) is available on GitHub (https://github.com/luca-palmerini/Procedure-wearable-data-standardization-Mobilise-D).
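    As a quick illustration of that last point, standardized MATLAB files can be opened from Python with scipy; this is a generic sketch (the file name is a placeholder, and the official loaders live in the repository above):

    from scipy.io import loadmat

    # Load a standardized file; simplify_cells flattens MATLAB structs into nested dicts
    data = loadmat("data.mat", simplify_cells=True)
    print(data.keys())

    Note that files saved in the MATLAB v7.3 format are HDF5 containers and would need h5py instead of scipy.io.loadmat.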

  6. S1 Data -

    • plos.figshare.com
    zip
    Updated Oct 11, 2023
    Cite
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0292466.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 11, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analyzing customers’ characteristics and giving early warning of customer churn based on machine learning algorithms can help enterprises provide targeted marketing strategies and personalized services, and save substantial operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations were performed in Python on a data set of 900,000 telecom customers’ personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, two classic ensemble learning models, were introduced, and an Adaboost dual-ensemble learning model with RF as the base learner was put forward. These four models and four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM)) were each used to analyze the customer churn data.

    The results show that the four models perform better in terms of recall rate, precision rate, F1 score and other indicators, and that the RF-Adaboost dual-ensemble model performs best. The recall rates of the BPNN, RF, Adaboost and RF-Adaboost dual-ensemble models on positive samples are 79%, 90%, 89%, and 93% respectively; the precision rates are 97%, 99%, 98%, and 99%; and the F1 scores are 87%, 95%, 94%, and 96%. The RF-Adaboost dual-ensemble model has the best performance, with the three indicators 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
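    To make the dual-ensemble idea concrete, here is a minimal scikit-learn sketch of AdaBoost with a random forest base learner; synthetic data stands in for the telecom records, which are not redistributed here, and the hyperparameters are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Synthetic, imbalanced stand-in for the 900,000-record churn data
    X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # AdaBoost with random forests as base learners (the "RF-Adaboost dual-ensemble")
    # Note: the argument is named estimator in scikit-learn >= 1.2 (base_estimator before)
    model = AdaBoostClassifier(estimator=RandomForestClassifier(n_estimators=50, max_depth=5), n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))  # recall, precision, F1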

  7. Connecticut State Parcel Layer 2023

    • data.ct.gov
    • s.cnmilf.com
    • +3 more
    csv, xlsx, xml
    Updated Jan 29, 2025
    Cite
    Office of Policy and Management (2025). Connecticut State Parcel Layer 2023 [Dataset]. https://data.ct.gov/Environment-and-Natural-Resources/Connecticut-State-Parcel-Layer-2023/v875-mr5r/data
    Explore at:
    Available download formats: xml, csv, xlsx
    Dataset updated
    Jan 29, 2025
    Dataset authored and provided by
    Office of Policy and Management
    Area covered
    Connecticut
    Description

    The dataset has combined the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2023 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report), to its respective regional council of governments (COG) by May 1 annually.

    These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 12/08/2023 from data collected in 2022-2023. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.

    CAMA Notes:

    The CAMA underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.

    • CAMA was provided by the towns.

    • Canaan parcels are viewable, but no additional information is available since no CAMA data was submitted.

    Spatial Data Notes:

    Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,247,506 parcels.

    • No alteration has been made to the spatial geometry of the data.

    • Fields that are associated with CAMA data were provided by towns.

    • The data fields that have information from the CAMA were sourced from the towns’ CAMA data.

    • If a town did not provide a field linking parcels back to the CAMA, a field from the original data was selected as the link if it joined back to the CAMA with a match rate above 50%.

    • Linking fields were renamed to "Link".

    • All linking fields had a census town code added to the beginning of the value to create a unique identifier per town (see the pandas sketch after this list).

    • Only the fields related to town name, location, editor, edit date, and link fields associated with the towns’ CAMA were included in the creation of this dataset; any other field provided in the original data was deleted or not used.

    • Field names for town (Muni, Municipality) were renamed to "Town Name".
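
    To illustrate the linking step, here is a small pandas sketch of how such a unique identifier could be built; the column names and town codes are hypothetical, since OPM’s actual scripts are not published with the dataset:

    import pandas as pd

    # Hypothetical parcel table with a town name and a raw linking field
    parcels = pd.DataFrame({"Town Name": ["Andover", "Ansonia"], "Link": ["12-4", "0007"]})
    town_codes = {"Andover": "01080", "Ansonia": "01220"}  # census town codes (illustrative values)

    # Prepend the census town code so each link value is unique statewide
    parcels["Link"] = parcels["Town Name"].map(town_codes) + parcels["Link"].astype(str)
    print(parcels)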

    The attributes included in the data:

    • Town Name

    • Owner

    • Co-Owner

    • Link

    • Editor

    • Edit Date

    • Collection year – year the parcels were submitted

    • Location

    • Mailing Address

    • Mailing City

    • Mailing State

    • Assessed Total

    • Assessed Land

    • Assessed Building

    • Pre-Year Assessed Total

    • Appraised Land

    • Appraised Building

    • Appraised Outbuilding

    • Condition

    • Model

    • Valuation

    • Zone

    • State Use

    • State Use Description

    • Living Area

    • Effective Area

    • Total rooms

    • Number of bedrooms

    • Number of Baths

    • Number of Half-Baths

    • Sale Price

    • Sale Date

    • Qualified

    • Occupancy

    • Prior Sale Price

    • Prior Sale Date

    • Prior Book and Page

    • Planning Region

    *Please note that not all parcels have a link to a CAMA entry.

    *If any discrepancies are discovered within the data, whether pertaining to geographical or attribute inaccuracies, please contact the respective municipality directly to request any necessary amendments.

    As of 2/15/2023 - Occupancy, State Use, State Use Description, and Mailing State added to dataset

    Additional information about the specifics of data availability and compliance will be coming soon.

  8. Additional file 2 of SynGenes: a Python class for standardizing...

    • datasetcatalog.nlm.nih.gov
    • springernature.figshare.com
    Updated Aug 15, 2024
    Cite
    Sampaio, Iracilda; Rabelo, Luan Pinto; Sodré, Davidson; de Sousa, Rodrigo Petry Corrêa; Gomes, Grazielle; Watanabe, Luciana; Vallinoto, Marcelo (2024). Additional file 2 of SynGenes: a Python class for standardizing nomenclatures of mitochondrial and chloroplast genes and a web form for enhancing searches for evolutionary analyses [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001403967
    Explore at:
    Dataset updated
    Aug 15, 2024
    Authors
    Sampaio, Iracilda; Rabelo, Luan Pinto; Sodré, Davidson; de Sousa, Rodrigo Petry Corrêa; Gomes, Grazielle; Watanabe, Luciana; Vallinoto, Marcelo
    Description

    Additional file 2: List of Embryophyta genomes analyzed in this study, including their taxonomic classification and accession numbers

  9. [MedMNIST+] 18x Standardized Datasets for 2D and 3D Biomedical Image...

    • data.niaid.nih.gov
    Updated Nov 28, 2024
    Cite
    Jiancheng Yang; Rui Shi; Donglai Wei; Zequan Liu; Lin Zhao; Bilian Ke; Hanspeter Pfister; Bingbing Ni (2024). [MedMNIST+] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification with Multiple Size Options: 28 (MNIST-Like), 64, 128, and 224 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5208229
    Explore at:
    Dataset updated
    Nov 28, 2024
    Dataset provided by
    Harvard University
    Zhongshan Hospital Affiliated to Fudan University
    Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine
    Shanghai Jiao Tong University
    RWTH Aachen University
    Authors
    Jiancheng Yang; Rui Shi; Donglai Wei; Zequan Liu; Lin Zhao; Bilian Ke; Hanspeter Pfister; Bingbing Ni
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code [GitHub] | Publication [Nature Scientific Data'23 / ISBI'21] | Preprint [arXiv]

    Abstract

    We introduce MedMNIST, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.

    Disclaimer: The only official distribution link for the MedMNIST dataset is Zenodo. We kindly request users to refer to this original dataset link for accurate and up-to-date data.

    Update: We are thrilled to release MedMNIST+ with larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D. As a complement to the previous 28-size MedMNIST, the large-size version could serve as a standardized benchmark for medical foundation models. Install the latest API to try it out!

    Python Usage

    We recommend our official code to download, parse and use the MedMNIST dataset:

    % pip install medmnist
    % python

    To use the standard 28-size (MNIST-like) version utilizing the downloaded files:

    from medmnist import PathMNIST

    train_dataset = PathMNIST(split="train")

    To enable automatic downloading, set download=True:

    from medmnist import NoduleMNIST3D

    val_dataset = NoduleMNIST3D(split="val", download=True)

    Alternatively, you can access MedMNIST+ with larger image sizes by specifying the size parameter:

    from medmnist import ChestMNIST

    test_dataset = ChestMNIST(split="test", download=True, size=224)
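
    Because each MedMNIST class implements the standard PyTorch dataset protocol, it can be wrapped directly in a DataLoader; a brief sketch, assuming PyTorch and torchvision are installed alongside medmnist:

    from torch.utils.data import DataLoader
    from torchvision import transforms
    from medmnist import PathMNIST

    # ToTensor converts the PIL images MedMNIST yields into CxHxW float tensors
    train_dataset = PathMNIST(split="train", download=True, transform=transforms.ToTensor())
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
    images, labels = next(iter(train_loader))  # one mini-batch of 3x28x28 pathology patches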

    Citation

    If you find this project useful, please cite both the v1 and v2 papers as:

    Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification." Scientific Data, 2023.

    Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

    or using bibtex:

    @article{medmnistv2,
      title={MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification},
      author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
      journal={Scientific Data},
      volume={10},
      number={1},
      pages={41},
      year={2023},
      publisher={Nature Publishing Group UK London}
    }

    @inproceedings{medmnistv1,
      title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
      author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
      booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
      pages={191--195},
      year={2021}
    }

    Please also cite the corresponding paper(s) of source data if you use any subset of MedMNIST as per the description on the project website.

    License

    The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), except DermaMNIST under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

    The code is under Apache-2.0 License.

    Changelog

    v3.0 (this repository): Released MedMNIST+ featuring larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D.

    v2.2: Removed a small number of mistakenly included blank samples in OrganAMNIST, OrganCMNIST, OrganSMNIST, OrganMNIST3D, and VesselMNIST3D.

    v2.1: Addressed an issue in the NoduleMNIST3D file (i.e., nodulemnist3d.npz). Further details can be found in this issue.

    v2.0: Launched the initial repository of MedMNIST v2, adding 6 datasets for 3D and 2 for 2D.

    v1.0: Established the initial repository (in a separate repository) of MedMNIST v1, featuring 10 datasets for 2D.

    Note: This dataset is NOT intended for clinical use.

  10. Connecticut CAMA and Parcel Layer

    • geodata.ct.gov
    • data.ct.gov
    • +1 more
    Updated Nov 20, 2024
    Cite
    State of Connecticut (2024). Connecticut CAMA and Parcel Layer [Dataset]. https://geodata.ct.gov/datasets/ctmaps::connecticut-cama-and-parcel-layer
    Explore at:
    Dataset updated
    Nov 20, 2024
    Dataset authored and provided by
    State of Connecticut
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually. The dataset has combined the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2025 into a single dataset, designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created in September 2025 from data collected in 2024-2025. Data was processed using Python scripts and ArcGIS Pro for standardization and integration. To learn more about Parcels and CAMA in CT, visit the Parcels Page in the Geodata Portal.

    Coordinate system: This dataset is provided in the NAD 83 Connecticut State Plane (2011) projection (EPSG 2234), as it was for 2024. Prior versions were provided in WGS 1984 Web Mercator Auxiliary Sphere (EPSG 3857).

    Ownership Suppression: The updated dataset includes parcel data for all towns across the state, with some towns featuring fully suppressed ownership information. In these instances, the owner’s name is replaced with the label "Current Owner," the co-owner’s name is listed as "Current Co-Owner," and the mailing address appears as the property address itself. For towns with fully suppressed ownership data, please note that no "Suppression" field was included in the submission to confirm these details; this labeling approach was implemented as the solution.

    New Data Fields: The new dataset introduces the “Property Zip” and “Mailing Zip” fields, which display the zip codes for the owner and the property.

    Service URL: In 2024, a stable URL was implemented to maintain public access to the most up-to-date data layer. Users are strongly encouraged to transition to the new service as soon as possible to ensure uninterrupted workflows. The URL will remain persistent, providing long-term stability for applications and integrations; once you have transitioned to the new service, no further URL changes will be necessary.

    CAMA Notes: The CAMA data underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,354,720 entries with information on property assessments and other relevant attributes. CAMA was provided by the towns.

    Spatial Data Notes: Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,282,833 parcels.

    • No alteration has been made to the spatial geometry of the data.

    • Fields that are associated with CAMA data were provided by towns.

    • The data fields that have information from the CAMA were sourced from the towns’ CAMA data.

    • If a town did not provide a field linking parcels back to the CAMA, a field from the original data was selected as the link if it joined back to the CAMA with a match rate above 50%.

    • Linking fields were renamed to "Link".

    • All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.

    • Only the fields related to town name, location, editor, edit date, and link fields associated with the towns’ CAMA were included in the creation of this dataset; any other field provided in the original data was deleted or not used.

    • Field names for town (Muni, Municipality) were renamed to "Town Name".

    Attributes included in the data: Town Name, Owner, Co-Owner, Link, Editor, Edit Date, Collection year (year the parcels were submitted), Location, Property Zip, Mailing Address, Mailing City, Mailing State, Mailing Zip, Assessed Total, Assessed Land, Assessed Building, Pre-Year Assessed Total, Appraised Land, Appraised Building, Appraised Outbuilding, Condition, Model, Valuation, Zone, State Use, State Use Description, Land Acre, Living Area, Effective Area, Total rooms, Number of bedrooms, Number of Baths, Number of Half-Baths, Sale Price, Sale Date, Qualified, Occupancy, Prior Sale Price, Prior Sale Date, Prior Book and Page, Planning Region, FIPS Code.

    *Please note that not all parcels have a link to a CAMA entry.

    *If any discrepancies are discovered within the data, whether pertaining to geographical or attribute inaccuracies, please contact the respective municipality directly to request any necessary amendments.

    Additional information about the specifics of data availability and compliance will be coming soon. If you need a WFS service for use in specific applications, please click here. Contact: opm.giso@ct.gov

  11. Comprehensive Dataset for Data-Driven Pavement Performance Prediction and...

    • data.mendeley.com
    Updated Mar 19, 2025
    Cite
    Hossein Hariri Asli (2025). Comprehensive Dataset for Data-Driven Pavement Performance Prediction and Analysis in Flood-Prone Beaumont, Southeast Texas [Dataset]. http://doi.org/10.17632/p6vg4v7f9k.2
    Explore at:
    Dataset updated
    Mar 19, 2025
    Authors
    Hossein Hariri Asli
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    Beaumont, Southeast Texas, Texas
    Description

    Effective pavement maintenance is vital for the economy, optimal performance, and safety, necessitating a thorough evaluation of pavement conditions such as strength, roughness, and surface distress. Pavement performance indicators significantly influence vehicle safety and ride quality. Recent advancements have focused on leveraging data-driven models to predict pavement performance, aiming to optimize fund allocation and enhance Maintenance and Rehabilitation (M&R) strategies through precise assessment of pavement conditions and defects. A critical prerequisite for these models is access to standardized, high-quality datasets to enhance prediction accuracy in pavement infrastructure management. This data article presents a comprehensive dataset compiled to support pavement performance prediction research, focusing on Southeast Texas, particularly the flood-prone region of Beaumont. The dataset includes pavement and traffic data, meteorological records, flood maps, ground deformation, and topographic indices to assess the impact of load-associated and non-load-associated pavement degradation. Data preprocessing was conducted using ArcGIS Pro, Microsoft Excel, and Python, ensuring the dataset is formatted for direct application in data-driven modeling approaches, including Machine Learning methods. Key contributions of this dataset include facilitating the analysis of climatic and environmental impacts on pavement conditions, enabling the identification of critical features influencing pavement performance, and allowing comprehensive data analysis to explore correlations and trends among input variables. By addressing gaps in input variable selection studies, this dataset supports the development of predictive tools for estimating future maintenance needs and improving the resilience of pavement infrastructure in flood-affected areas. This work highlights the importance of standardized datasets in advancing pavement management systems and provides a foundation for future research to enhance pavement performance prediction accuracy.

  12. Benchmark data set for MSPypeline, a python package for streamlined mass...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    xml
    Updated Jul 22, 2021
    Cite
    Alexander Held; Ursula Klingmüller (2021). Benchmark data set for MSPypeline, a python package for streamlined mass spectrometry-based proteomics data analysis [Dataset]. https://data-staging.niaid.nih.gov/resources?id=pxd025792
    Explore at:
    Available download formats: xml
    Dataset updated
    Jul 22, 2021
    Dataset provided by
    DKFZ Heidelberg
    Division Systems Biology of Signal Transduction, German Cancer Research Center (DKFZ), Heidelberg, 69120, Germany
    Authors
    Alexander Held; Ursula Klingmüller
    Variables measured
    Proteomics
    Description

    Mass spectrometry-based proteomics is increasingly employed in biology and medicine. To generate reliable information from large data sets and to ensure comparability of results, it is crucial to implement and standardize the quality control of the raw data, the data processing steps, and the statistical analyses. The MSPypeline provides a platform for the import of MaxQuant output tables, the generation of quality control reports, the preprocessing of data including normalization, and exploratory analyses by statistical inference plots. These standardized steps assess data quality, provide customizable figures, and enable the identification of differentially expressed proteins to reach biologically relevant conclusions.
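    As a rough illustration of the kind of preprocessing the package standardizes, here is a generic pandas sketch (not the MSPypeline API) that log-transforms and median-normalizes intensity columns from a MaxQuant proteinGroups.txt table; the file name and column prefix follow MaxQuant’s usual conventions:

    import numpy as np
    import pandas as pd

    # Load a MaxQuant output table; zero intensities are treated as missing
    proteins = pd.read_csv("proteinGroups.txt", sep="\t")
    intensity = proteins.filter(like="Intensity ").replace(0, np.nan)

    # Log-transform, then shift each sample to a common median so samples are comparable
    log_int = np.log2(intensity)
    normalized = log_int - log_int.median() + log_int.median().mean()
    print(normalized.describe())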

  13. National Commercial Hazardous Waste Totals by Industry 2017

    • s.cnmilf.com
    • datasets.ai
    • +1 more
    Updated Jul 25, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). National Commercial Hazardous Waste Totals by Industry 2017 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/national-commercial-hazardous-waste-totals-by-industry-2017
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains 2017 national Commercial RCRA-defined Hazardous Waste totals by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available Python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended Input-Output Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).
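    FLOWSA is distributed on PyPI; a brief sketch of pulling a flow-by-activity table with it (the datasource key for the RCRAInfo-based data is an assumption; the FLOWSA documentation lists the supported sources):

    import flowsa

    # Datasource key is an assumption; see FLOWSA's documentation for the current list
    hazardous_waste = flowsa.getFlowByActivity(datasource="EPA_RCRAInfo", year=2017)
    print(hazardous_waste.head())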

  14. Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    o; o (2025). Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of U.S. Tech Firms [Dataset]. http://doi.org/10.5281/zenodo.15337959
    Explore at:
    Available download formats: zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    o; o
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 4, 2025
    Description

    Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.

    This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:

    • ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)

    • ESM_2.py – Python script to calculate Z-scores from raw financial ratios

    • ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios

    • ESM_4.py – Python script for generating the correlation heatmap of the Z-scores

    • ESM_5.xlsx – Mahalanobis distance values for each firm

    • ESM_6.py – Python script to compute Mahalanobis distances

    • ESM_7.py – Python script to visualize Mahalanobis distances

    • ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)

    • ESM_9.py – Python script to compute mean Z-scores

    • ESM_10.xlsx – Re-standardized Z-scores based on firm-level means

    • ESM_11.py – Python script to re-standardize mean Z-scores

    • ESM_12.py – Python script to generate the hierarchical clustering dendrogram

    All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
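
    To make the distance stage concrete, here is a compact numpy/scipy sketch of the Mahalanobis computation; synthetic ratios stand in for the ESM_1.xlsx contents, and ESM_6.py holds the authors’ actual script:

    import numpy as np
    from scipy.spatial.distance import mahalanobis

    # Synthetic stand-in: 18 firms x 5 financial ratios
    rng = np.random.default_rng(0)
    ratios = rng.normal(size=(18, 5))

    mean = ratios.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(ratios, rowvar=False))

    # Distance of each firm from the multivariate center; large values flag outliers
    distances = np.array([mahalanobis(row, mean, inv_cov) for row in ratios])
    print(distances.round(2))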

  15. NASICON-type solid electrolyte materials named entity recognition dataset

    • scidb.cn
    Updated Apr 27, 2023
    Cite
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi (2023). NASICON-type solid electrolyte materials named entity recognition dataset [Dataset]. http://doi.org/10.57760/sciencedb.j00213.00001
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 27, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi
    Description

    1. Framework overview. This paper proposes a pipeline to construct high-quality datasets for text mining in materials science. First, we utilize a traceable automatic acquisition scheme for literature to ensure the traceability of textual data. Then, a data processing method driven by downstream tasks is performed to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.

    2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Herein, we mainly introduce the details of the NASICON entity recognition dataset.

    2.1 Data collection and preprocessing. First, 55 materials science papers related to the NASICON system are collected through Crystallographic Information Files (CIF), which contain a wealth of structure-activity relationship information. Note that materials science literature is mostly stored as portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of each paper, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token
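    A minimal sketch of that PDF-to-text step using pdfminer.six with a regular-expression cleanup pass; the file name and the specific patterns are illustrative, as the authors’ exact rules are not reproduced in this record:

    import re
    from pdfminer.high_level import extract_text

    # Convert one PDF paper to plain text (file name is a placeholder)
    raw = extract_text("paper.pdf")

    # Illustrative cleanup rules for line-break and spacing artifacts
    text = re.sub(r"-\n(?=\w)", "", raw)         # re-join words hyphenated across line breaks
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)  # flatten hard line wraps into spaces
    text = re.sub(r" {2,}", " ", text).strip()   # squeeze repeated spaces

    with open("paper.txt", "w", encoding="utf-8") as f:
        f.write(text)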

  16. National Employment Totals by Industry 2017

    • s.cnmilf.com
    • catalog.data.gov
    Updated Jul 25, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). National Employment Totals by Industry 2017 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/national-employment-totals-by-industry-2017
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains 2017 national employment totals by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available Python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended Input-Output Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).

  17. National Land Occupation Totals By Industry 2012

    • s.cnmilf.com
    • catalog.data.gov
    Updated Jul 25, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). National Land Occupation Totals By Industry 2012 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/national-land-occupation-totals-by-industry-2012
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains 2012 national-level land occupation totals by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available Python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended Input-Output Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).

  18. National Point Source Releases to Ground Totals by Industry 2017

    • catalog.data.gov
    Updated Jul 25, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). National Point Source Releases to Ground Totals by Industry 2017 [Dataset]. https://catalog.data.gov/dataset/national-point-source-releases-to-ground-totals-by-industry-2017
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains 2017 national point source releases to ground by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available Python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended Input-Output Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).

  19. PROCESSED DATA .nc (NetCDF Files)

    • figshare.com
    hdf
    Updated Apr 24, 2022
    Cite
    Kartika Wardani (2022). PROCESSED DATA .nc (NetCDF Files) [Dataset]. http://doi.org/10.6084/m9.figshare.19641777.v1
    Explore at:
    Available download formats: hdf
    Dataset updated
    Apr 24, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kartika Wardani
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the processed data, in NetCDF (.nc) files, used in our study. We used the SPI (Standardized Precipitation Index) to determine meteorological drought conditions in the study area, calculated with the open-source Climate and Drought Indices module in Python.
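    NetCDF files like these can be inspected from Python with xarray; a minimal sketch in which the file and variable names are placeholders, since the record does not list them:

    import xarray as xr

    # Open one processed NetCDF file (file name is a placeholder)
    ds = xr.open_dataset("spi_processed.nc")
    print(ds)  # lists dimensions, coordinates, and data variables

    # Example: time-mean of a hypothetical "spi" variable
    print(ds["spi"].mean(dim="time"))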

  20. protein_net

    • tensorflow.org
    Updated Dec 16, 2022
    Cite
    (2022). protein_net [Dataset]. https://www.tensorflow.org/datasets/catalog/protein_net
    Explore at:
    Dataset updated
    Dec 16, 2022
    Description

    ProteinNet is a standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training / validation / test splits. ProteinNet builds on the biennial CASP assessments, which carry out blind predictions of recently solved but publicly unavailable protein structures, to provide test sets that push the frontiers of computational methodology. It is organized as a series of data sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a range of data set sizes that enable assessment of new methods in relatively data poor and data rich regimes.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('protein_net', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.
