24 datasets found
  1. python-code-instructions-18k-alpaca-standardized

    • huggingface.co
    Updated Sep 2, 2023
    Cite
    HydraLM (2023). python-code-instructions-18k-alpaca-standardized [Dataset]. https://huggingface.co/datasets/HydraLM/python-code-instructions-18k-alpaca-standardized
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2023
    Dataset authored and provided by
    HydraLM
    Description

    Dataset Card for "python-code-instructions-18k-alpaca-standardized"

    More Information needed
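    Since the card itself is empty, here is a minimal sketch of loading the dataset with the Hugging Face datasets library; the pattern is standard, though the split name is an assumption about this particular repository:

    from datasets import load_dataset

    # Load the standardized instruction dataset from the Hugging Face Hub
    ds = load_dataset("HydraLM/python-code-instructions-18k-alpaca-standardized", split="train")
    print(ds[0])  # inspect one instruction/response record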

  2. R and Python libraries for the standardization of data extraction and...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated May 8, 2025
    Cite
    Zwiggelaar, Reyer; Spick, Matt; Harrison, Charlie; Suchak, Tulsi; Aliu, Anietie E.; Geifman, Nophar (2025). R and Python libraries for the standardization of data extraction and analysis from NHANES. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002102076
    Explore at:
    Dataset updated
    May 8, 2025
    Authors
    Zwiggelaar, Reyer; Spick, Matt; Harrison, Charlie; Suchak, Tulsi; Aliu, Anietie E.; Geifman, Nophar
    Description

    R and Python libraries for the standardization of data extraction and analysis from NHANES.

  3. instruct-python-500k-standardized

    • huggingface.co
    Updated Sep 3, 2023
    Cite
    HydraLM (2023). instruct-python-500k-standardized [Dataset]. https://huggingface.co/datasets/HydraLM/instruct-python-500k-standardized
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 3, 2023
    Dataset authored and provided by
    HydraLM
    Description

    Dataset Card for "instruct-python-500k-standardized"

    More Information needed

  4. BI intro to data cleaning eda and machine learning

    • kaggle.com
    zip
    Updated Nov 17, 2025
    Cite
    Walekhwa Tambiti Leo Philip (2025). BI intro to data cleaning eda and machine learning [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/intro-to-data-cleaning-eda-and-machine-learning/suggestions
    Explore at:
    Available download formats: zip (9961 bytes)
    Dataset updated
    Nov 17, 2025
    Authors
    Walekhwa Tambiti Leo Philip
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Real-World Data Science Challenge

    Business Intelligence Program Strategy — Student Success Optimization

    Hosted by: Walsoft Computer Institute

    Background

    Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.

    As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve:

    • Admissions decision-making
    • Academic support strategies
    • Overall program impact and ROI

    Your Mission

    Answer this central question:

    “Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”

    Key Strategic Areas

    You are required to analyze and provide actionable insights for the following three areas:

    1. Admissions Optimization

    Should entry exams remain the primary admissions filter?

    Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.

    ✅ Deliverables:

    • Feature importance ranking for predicting Python and DB scores
    • Admission policy recommendation (e.g., retain exams, add screening tools, adjust thresholds)
    • Business rationale and risk analysis

    2. Curriculum Support Strategy

    Are there at-risk student groups who need extra support?

    Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.

    ✅ Deliverables:

    • At-risk segment identification
    • Support program design (e.g., prep course, mentoring)
    • Expected outcomes, costs, and KPIs

    3. Resource Allocation & Program ROI

    How can we allocate resources for maximum student success?

    Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.

    ✅ Deliverables:

    • Performance drivers
    • Student segmentation
    • Resource allocation plan and ROI projection

    🛠️ Dataset Overview

    fNAME, lNAME: Student first and last name
    Age: Student age (21–71 years)
    gender: Gender (standardized as "Male"/"Female")
    country: Student’s country of origin
    residence: Student housing/residence type
    entryEXAM: Entry test score (28–98)
    prevEducation: Prior education (High School, Diploma, etc.)
    studyHOURS: Total study hours logged
    Python: Final Python exam score
    DB: Final Database exam score

    📊 Dataset

    You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.

    Raw Dataset (Recommended for Full Project)

    Download: bi.csv

    This dataset includes common data quality challenges:

    • Country name inconsistencies
      e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom

    • Residence type variations
      e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence

    • Education level typos and casing issues
      e.g. Barrrchelors → Bachelor; DIPLOMA, Diplomaaa → Diploma

    • Gender value noise
      e.g. M, F, female → standardize to Male / Female

    • Missing scores in Python subject
      Fill NaN values using column mean or suitable imputation strategy

    Participants using this dataset are expected to apply data cleaning techniques such as:
    • String standardization
    • Null value imputation
    • Type correction (e.g., scores as float)
    • Validation and visual verification

    A minimal pandas cleaning sketch is shown after the bonus note below.

    Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.
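    The sketch below walks through those steps, assuming the column names from the dataset overview; the mapping dictionaries are illustrative, not an official answer key:

    import pandas as pd

    df = pd.read_csv("bi.csv")

    # String standardization: unify country and residence spellings (mappings are illustrative)
    df["country"] = df["country"].str.strip().replace({"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})
    df["residence"] = df["residence"].str.replace(r"BI[-_]?Residence", "BI Residence", regex=True)

    # Education typos and casing: title-case first, then fix known misspellings
    df["prevEducation"] = df["prevEducation"].str.strip().str.title().replace({"Barrrchelors": "Bachelor", "Diplomaaa": "Diploma"})

    # Gender value noise: map single letters and lowercase variants to Male/Female
    df["gender"] = df["gender"].str.strip().str.lower().map({"m": "Male", "male": "Male", "f": "Female", "female": "Female"})

    # Type correction and null imputation: scores as float, NaN filled with the column mean
    df["Python"] = pd.to_numeric(df["Python"], errors="coerce")
    df["Python"] = df["Python"].fillna(df["Python"].mean())

    # Validation: quick visual check of the cleaned categorical values
    print(df["country"].value_counts())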

    Cleaned Dataset (Optional Shortcut)

    Download: cleaned_bi.csv

    This version has been fully standardized and preprocessed:
    • All fields cleaned and renamed consistently
    • Missing Python scores filled with th...

  5. Example subjects for Mobilise-D data standardization

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 11, 2022
    Cite
    Palmerini, Luca; Reggi, Luca; Bonci, Tecla; Del Din, Silvia; Micó-Amigo, Encarna; Salis, Francesca; Bertuletti, Stefano; Caruso, Marco; Cereatti, Andrea; Gazit, Eran; Paraschiv-Ionescu, Anisoara; Soltani, Abolfazl; Kluge, Felix; Küderle, Arne; Ullrich, Martin; Kirk, Cameron; Hiden, Hugo; D'Ascanio, Ilaria; Hansen, Clint; Rochester, Lynn; Mazzà, Claudia; Chiari, Lorenzo; on behalf of the Mobilise-D consortium (2022). Example subjects for Mobilise-D data standardization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7185428
    Explore at:
    Dataset updated
    Oct 11, 2022
    Dataset provided by
    Neurogeriatrics Kiel, Department of Neurology, University Hospital Schleswig-Holstein, Germany.
    Laboratory of Movement Analysis and Measurement, Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland.
    The University of Sheffield, INSIGNEO Institute for in silico Medicine, UK. The University of Sheffield, Department of Mechanical Engineering, UK
    Newcastle University, School of Computing, UK.
    University of Bologna, Health Sciences and Technologies—Interdepartmental Center for Industrial Research (CIRI-SDV), Italy
    University of Bologna, Department of Electrical, Electronic and Information Engineering 'Guglielmo Marconi', Italy.
    Politecnico di Torino, Department of Electronics and Telecommunications, Italy.
    Politecnico di Torino, Department of Electronics and Telecommunications, Italy. Politecnico di Torino, PolitoBIOMed Lab – Biomedical Engineering Lab, Italy.
    Machine Learning and Data Analytics Lab, Department of Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-University Erlangen-Nürnberg, Germany.
    Newcastle University, Translational and Clinical Research Institute, Faculty of Medical Sciences, UK.
    University of Bologna, Department of Electrical, Electronic and Information Engineering 'Guglielmo Marconi', Italy. University of Bologna, Health Sciences and Technologies—Interdepartmental Center for Industrial Research (CIRI-SDV), Italy
    Newcastle University, Translational and Clinical Research Institute, Faculty of Medical Sciences, UK. The Newcastle upon Tyne NHS Foundation Trust, UK.
    University of Sassari, Department of Biomedical Sciences, Italy.
    Tel Aviv Sourasky Medical Center, Center for the Study of Movement, Cognition and Mobility, Neurological Institute, Israel.
    https://www.mobilise-d.eu/partners
    Authors
    Palmerini, Luca; Reggi, Luca; Bonci, Tecla; Del Din, Silvia; Micó-Amigo, Encarna; Salis, Francesca; Bertuletti, Stefano; Caruso, Marco; Cereatti, Andrea; Gazit, Eran; Paraschiv-Ionescu, Anisoara; Soltani, Abolfazl; Kluge, Felix; Küderle, Arne; Ullrich, Martin; Kirk, Cameron; Hiden, Hugo; D'Ascanio, Ilaria; Hansen, Clint; Rochester, Lynn; Mazzà, Claudia; Chiari, Lorenzo; on behalf of the Mobilise-D consortium
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Standardized data from Mobilise-D participants (YAR dataset) and pre-existing datasets (ICICLE, MSIPC2, Gait in Lab and real-life settings, MS project, UNISS-UNIGE) are provided in the shared folder as an example of the procedures proposed in the publication "Mobility recorded by wearable devices and gold standards: the Mobilise-D procedure for data standardization", currently under review at Scientific Data. Please refer to that publication for further information, and cite it if using these data.

    The code to standardize an example subject (for the ICICLE dataset) and to open the standardized MATLAB files in other languages (Python, R) is available on GitHub (https://github.com/luca-palmerini/Procedure-wearable-data-standardization-Mobilise-D).
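    As a quick illustration of that last point, standardized MATLAB files can be opened from Python with scipy; this is a generic sketch (the file name is a placeholder, and the official loaders live in the repository above):

    from scipy.io import loadmat

    # Load a standardized file; simplify_cells flattens MATLAB structs into nested dicts
    data = loadmat("data.mat", simplify_cells=True)
    print(data.keys())

    Note that files saved in the MATLAB v7.3 format are HDF5 containers and would need h5py instead of scipy.io.loadmat.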

  6. S1 Data -

    • plos.figshare.com
    zip
    Updated Oct 11, 2023
    Cite
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0292466.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 11, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analyzing customers’ characteristics and giving early warning of customer churn based on machine learning algorithms can help enterprises provide targeted marketing strategies and personalized services, and save substantial operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations were performed in Python on a data set of 900,000 telecom customers’ personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Random Forest (RF) and Adaboost, two classic ensemble learning models, were introduced, and an Adaboost dual-ensemble learning model with RF as the base learner was put forward. These four models and four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM)) were each used to analyze the customer churn data.

    The results show that the four models perform better in terms of recall rate, precision rate, F1 score and other indicators, and that the RF-Adaboost dual-ensemble model performs best. The recall rates of the BPNN, RF, Adaboost and RF-Adaboost dual-ensemble models on positive samples are 79%, 90%, 89%, and 93% respectively; the precision rates are 97%, 99%, 98%, and 99%; and the F1 scores are 87%, 95%, 94%, and 96%. The RF-Adaboost dual-ensemble model has the best performance, with the three indicators 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
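    To make the dual-ensemble idea concrete, here is a minimal scikit-learn sketch of AdaBoost with a random forest base learner; synthetic data stands in for the telecom records, which are not redistributed here, and the hyperparameters are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Synthetic, imbalanced stand-in for the 900,000-record churn data
    X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # AdaBoost with random forests as base learners (the "RF-Adaboost dual-ensemble")
    # Note: the argument is named estimator in scikit-learn >= 1.2 (base_estimator before)
    model = AdaBoostClassifier(estimator=RandomForestClassifier(n_estimators=50, max_depth=5), n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))  # recall, precision, F1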

  7. Connecticut State Parcel Layer 2023

    • data.ct.gov
    • s.cnmilf.com
    • +3 more
    csv, xlsx, xml
    Updated Jan 29, 2025
    Cite
    Office of Policy and Management (2025). Connecticut State Parcel Layer 2023 [Dataset]. https://data.ct.gov/Environment-and-Natural-Resources/Connecticut-State-Parcel-Layer-2023/v875-mr5r/data
    Explore at:
    Available download formats: xml, csv, xlsx
    Dataset updated
    Jan 29, 2025
    Dataset authored and provided by
    Office of Policy and Management
    Area covered
    Connecticut
    Description

    The dataset has combined the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2023 into a single dataset. This dataset is designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report), to its respective regional council of governments (COG) by May 1 annually.

    These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created on 12/08/2023 from data collected in 2022-2023. Data was processed using Python scripts and ArcGIS Pro, ensuring standardization and integration of the data.

    CAMA Notes:

    The CAMA underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,353,595 entries and information on property assessments and other relevant attributes.

    • CAMA was provided by the towns.

    • Canaan parcels are viewable, but no additional information is available since no CAMA data was submitted.

    Spatial Data Notes:

    Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,247,506 parcels.

    • No alteration has been made to the spatial geometry of the data.

    • Fields that are associated with CAMA data were provided by towns.

    • The data fields that have information from the CAMA were sourced from the towns’ CAMA data.

    • If a town did not provide a field linking parcels back to the CAMA, a field from the original data was selected as the link if it joined back to the CAMA with a match rate above 50%.

    • Linking fields were renamed to "Link".

    • All linking fields had a census town code added to the beginning of the value to create a unique identifier per town (see the pandas sketch after this list).

    • Only the fields related to town name, location, editor, edit date, and link fields associated with the towns’ CAMA were included in the creation of this dataset; any other field provided in the original data was deleted or not used.

    • Field names for town (Muni, Municipality) were renamed to "Town Name".
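
    To illustrate the linking step, here is a small pandas sketch of how such a unique identifier could be built; the column names and town codes are hypothetical, since OPM’s actual scripts are not published with the dataset:

    import pandas as pd

    # Hypothetical parcel table with a town name and a raw linking field
    parcels = pd.DataFrame({"Town Name": ["Andover", "Ansonia"], "Link": ["12-4", "0007"]})
    town_codes = {"Andover": "01080", "Ansonia": "01220"}  # census town codes (illustrative values)

    # Prepend the census town code so each link value is unique statewide
    parcels["Link"] = parcels["Town Name"].map(town_codes) + parcels["Link"].astype(str)
    print(parcels)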

    The attributes included in the data:

    • Town Name

    • Owner

    • Co-Owner

    • Link

    • Editor

    • Edit Date

    • Collection year – year the parcels were submitted

    • Location

    • Mailing Address

    • Mailing City

    • Mailing State

    • Assessed Total

    • Assessed Land

    • Assessed Building

    • Pre-Year Assessed Total

    • Appraised Land

    • Appraised Building

    • Appraised Outbuilding

    • Condition

    • Model

    • Valuation

    • Zone

    • State Use

    • State Use Description

    • Living Area

    • Effective Area

    • Total rooms

    • Number of bedrooms

    • Number of Baths

    • Number of Half-Baths

    • Sale Price

    • Sale Date

    • Qualified

    • Occupancy

    • Prior Sale Price

    • Prior Sale Date

    • Prior Book and Page

    • Planning Region

    *Please note that not all parcels have a link to a CAMA entry.

    *If any discrepancies are discovered within the data, whether pertaining to geographical or attribute inaccuracies, please contact the respective municipality directly to request any necessary amendments.

    As of 2/15/2023 - Occupancy, State Use, State Use Description, and Mailing State added to dataset

    Additional information about the specifics of data availability and compliance will be coming soon.

  8. Additional file 2 of SynGenes: a Python class for standardizing...

    • datasetcatalog.nlm.nih.gov
    • springernature.figshare.com
    Updated Aug 15, 2024
    Cite
    Sampaio, Iracilda; Rabelo, Luan Pinto; Sodré, Davidson; de Sousa, Rodrigo Petry Corrêa; Gomes, Grazielle; Watanabe, Luciana; Vallinoto, Marcelo (2024). Additional file 2 of SynGenes: a Python class for standardizing nomenclatures of mitochondrial and chloroplast genes and a web form for enhancing searches for evolutionary analyses [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001403967
    Explore at:
    Dataset updated
    Aug 15, 2024
    Authors
    Sampaio, Iracilda; Rabelo, Luan Pinto; Sodré, Davidson; de Sousa, Rodrigo Petry Corrêa; Gomes, Grazielle; Watanabe, Luciana; Vallinoto, Marcelo
    Description

    Additional file 2: List of Embryophyta genomes analyzed in this study, including their taxonomic classification and accession numbers

  9. [MedMNIST+] 18x Standardized Datasets for 2D and 3D Biomedical Image...

    • data.niaid.nih.gov
    Updated Nov 28, 2024
    Cite
    Jiancheng Yang; Rui Shi; Donglai Wei; Zequan Liu; Lin Zhao; Bilian Ke; Hanspeter Pfister; Bingbing Ni (2024). [MedMNIST+] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification with Multiple Size Options: 28 (MNIST-Like), 64, 128, and 224 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5208229
    Explore at:
    Dataset updated
    Nov 28, 2024
    Dataset provided by
    Harvard University
    Zhongshan Hospital Affiliated to Fudan University
    Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine
    Shanghai Jiao Tong University
    RWTH Aachen University
    Authors
    Jiancheng Yang; Rui Shi; Donglai Wei; Zequan Liu; Lin Zhao; Bilian Ke; Hanspeter Pfister; Bingbing Ni
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code [GitHub] | Publication [Nature Scientific Data'23 / ISBI'21] | Preprint [arXiv]

    Abstract

    We introduce MedMNIST, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.

    Disclaimer: The only official distribution link for the MedMNIST dataset is Zenodo. We kindly request users to refer to this original dataset link for accurate and up-to-date data.

    Update: We are thrilled to release MedMNIST+ with larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D. As a complement to the previous 28-size MedMNIST, the large-size version could serve as a standardized benchmark for medical foundation models. Install the latest API to try it out!

    Python Usage

    We recommend our official code to download, parse and use the MedMNIST dataset:

    % pip install medmnist
    % python

    To use the standard 28-size (MNIST-like) version utilizing the downloaded files:

    from medmnist import PathMNIST

    train_dataset = PathMNIST(split="train")

    To enable automatic downloading, set download=True:

    from medmnist import NoduleMNIST3D

    val_dataset = NoduleMNIST3D(split="val", download=True)

    Alternatively, you can access MedMNIST+ with larger image sizes by specifying the size parameter:

    from medmnist import ChestMNIST

    test_dataset = ChestMNIST(split="test", download=True, size=224)
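
    Because each MedMNIST class implements the standard PyTorch dataset protocol, it can be wrapped directly in a DataLoader; a brief sketch, assuming PyTorch and torchvision are installed alongside medmnist:

    from torch.utils.data import DataLoader
    from torchvision import transforms
    from medmnist import PathMNIST

    # ToTensor converts the PIL images MedMNIST yields into CxHxW float tensors
    train_dataset = PathMNIST(split="train", download=True, transform=transforms.ToTensor())
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
    images, labels = next(iter(train_loader))  # one mini-batch of 3x28x28 pathology patches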

    Citation

    If you find this project useful, please cite both the v1 and v2 papers as:

    Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification." Scientific Data, 2023.

    Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

    or using bibtex:

    @article{medmnistv2,
      title={MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification},
      author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
      journal={Scientific Data},
      volume={10},
      number={1},
      pages={41},
      year={2023},
      publisher={Nature Publishing Group UK London}
    }

    @inproceedings{medmnistv1,
      title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
      author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
      booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
      pages={191--195},
      year={2021}
    }

    Please also cite the corresponding paper(s) of source data if you use any subset of MedMNIST as per the description on the project website.

    License

    The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), except DermaMNIST under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

    The code is under Apache-2.0 License.

    Changelog

    v3.0 (this repository): Released MedMNIST+ featuring larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D.

    v2.2: Removed a small number of mistakenly included blank samples in OrganAMNIST, OrganCMNIST, OrganSMNIST, OrganMNIST3D, and VesselMNIST3D.

    v2.1: Addressed an issue in the NoduleMNIST3D file (i.e., nodulemnist3d.npz). Further details can be found in this issue.

    v2.0: Launched the initial repository of MedMNIST v2, adding 6 datasets for 3D and 2 for 2D.

    v1.0: Established the initial repository (in a separate repository) of MedMNIST v1, featuring 10 datasets for 2D.

    Note: This dataset is NOT intended for clinical use.

  10. Connecticut CAMA and Parcel Layer

    • geodata.ct.gov
    • data.ct.gov
    • +1 more
    Updated Nov 20, 2024
    Cite
    State of Connecticut (2024). Connecticut CAMA and Parcel Layer [Dataset]. https://geodata.ct.gov/datasets/ctmaps::connecticut-cama-and-parcel-layer
    Explore at:
    Dataset updated
    Nov 20, 2024
    Dataset authored and provided by
    State of Connecticut
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Pursuant to Section 7-100l of the Connecticut General Statutes, each municipality is required to transmit a digital parcel file and an accompanying assessor’s database file (known as a CAMA report) to its respective regional council of governments (COG) by May 1 annually. The dataset has combined the Parcels and Computer-Assisted Mass Appraisal (CAMA) data for 2025 into a single dataset, designed to make it easier for stakeholders and the GIS community to use and access the information as a geospatial dataset. Included in this dataset are geometries for all 169 municipalities and attribution from the CAMA data for all but one municipality. These data were gathered from the CT municipalities by the COGs and then submitted to CT OPM. This dataset was created in September 2025 from data collected in 2024-2025. Data was processed using Python scripts and ArcGIS Pro for standardization and integration. To learn more about Parcels and CAMA in CT, visit the Parcels Page in the Geodata Portal.

    Coordinate system: This dataset is provided in the NAD 83 Connecticut State Plane (2011) projection (EPSG 2234), as it was for 2024. Prior versions were provided in WGS 1984 Web Mercator Auxiliary Sphere (EPSG 3857).

    Ownership Suppression: The updated dataset includes parcel data for all towns across the state, with some towns featuring fully suppressed ownership information. In these instances, the owner’s name is replaced with the label "Current Owner," the co-owner’s name is listed as "Current Co-Owner," and the mailing address appears as the property address itself. For towns with fully suppressed ownership data, please note that no "Suppression" field was included in the submission to confirm these details; this labeling approach was implemented as the solution.

    New Data Fields: The new dataset introduces the “Property Zip” and “Mailing Zip” fields, which display the zip codes for the owner and the property.

    Service URL: In 2024, a stable URL was implemented to maintain public access to the most up-to-date data layer. Users are strongly encouraged to transition to the new service as soon as possible to ensure uninterrupted workflows. The URL will remain persistent, providing long-term stability for applications and integrations; once you have transitioned to the new service, no further URL changes will be necessary.

    CAMA Notes: The CAMA data underwent several steps to standardize and consolidate the information. Python scripts were used to concatenate fields and create a unique identifier for each entry. The resulting dataset contains 1,354,720 entries with information on property assessments and other relevant attributes. CAMA was provided by the towns.

    Spatial Data Notes: Data processing involved merging the parcels from different municipalities using ArcGIS Pro and Python. The resulting dataset contains 1,282,833 parcels.

    • No alteration has been made to the spatial geometry of the data.

    • Fields that are associated with CAMA data were provided by towns.

    • The data fields that have information from the CAMA were sourced from the towns’ CAMA data.

    • If a town did not provide a field linking parcels back to the CAMA, a field from the original data was selected as the link if it joined back to the CAMA with a match rate above 50%.

    • Linking fields were renamed to "Link".

    • All linking fields had a census town code added to the beginning of the value to create a unique identifier per town.

    • Only the fields related to town name, location, editor, edit date, and link fields associated with the towns’ CAMA were included in the creation of this dataset; any other field provided in the original data was deleted or not used.

    • Field names for town (Muni, Municipality) were renamed to "Town Name".

    Attributes included in the data: Town Name, Owner, Co-Owner, Link, Editor, Edit Date, Collection year (year the parcels were submitted), Location, Property Zip, Mailing Address, Mailing City, Mailing State, Mailing Zip, Assessed Total, Assessed Land, Assessed Building, Pre-Year Assessed Total, Appraised Land, Appraised Building, Appraised Outbuilding, Condition, Model, Valuation, Zone, State Use, State Use Description, Land Acre, Living Area, Effective Area, Total rooms, Number of bedrooms, Number of Baths, Number of Half-Baths, Sale Price, Sale Date, Qualified, Occupancy, Prior Sale Price, Prior Sale Date, Prior Book and Page, Planning Region, FIPS Code.

    *Please note that not all parcels have a link to a CAMA entry.

    *If any discrepancies are discovered within the data, whether pertaining to geographical or attribute inaccuracies, please contact the respective municipality directly to request any necessary amendments.

    Additional information about the specifics of data availability and compliance will be coming soon. If you need a WFS service for use in specific applications, please click here. Contact: opm.giso@ct.gov

  11. Comprehensive Dataset for Data-Driven Pavement Performance Prediction and...

    • data.mendeley.com
    Updated Mar 19, 2025
    Cite
    Hossein Hariri Asli (2025). Comprehensive Dataset for Data-Driven Pavement Performance Prediction and Analysis in Flood-Prone Beaumont, Southeast Texas [Dataset]. http://doi.org/10.17632/p6vg4v7f9k.2
    Explore at:
    Dataset updated
    Mar 19, 2025
    Authors
    Hossein Hariri Asli
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    Beaumont, Southeast Texas, Texas
    Description

    Effective pavement maintenance is vital for the economy, optimal performance, and safety, necessitating a thorough evaluation of pavement conditions such as strength, roughness, and surface distress. Pavement performance indicators significantly influence vehicle safety and ride quality. Recent advancements have focused on leveraging data-driven models to predict pavement performance, aiming to optimize fund allocation and enhance Maintenance and Rehabilitation (M&R) strategies through precise assessment of pavement conditions and defects. A critical prerequisite for these models is access to standardized, high-quality datasets to enhance prediction accuracy in pavement infrastructure management. This data article presents a comprehensive dataset compiled to support pavement performance prediction research, focusing on Southeast Texas, particularly the flood-prone region of Beaumont. The dataset includes pavement and traffic data, meteorological records, flood maps, ground deformation, and topographic indices to assess the impact of load-associated and non-load-associated pavement degradation. Data preprocessing was conducted using ArcGIS Pro, Microsoft Excel, and Python, ensuring the dataset is formatted for direct application in data-driven modeling approaches, including Machine Learning methods. Key contributions of this dataset include facilitating the analysis of climatic and environmental impacts on pavement conditions, enabling the identification of critical features influencing pavement performance, and allowing comprehensive data analysis to explore correlations and trends among input variables. By addressing gaps in input variable selection studies, this dataset supports the development of predictive tools for estimating future maintenance needs and improving the resilience of pavement infrastructure in flood-affected areas. This work highlights the importance of standardized datasets in advancing pavement management systems and provides a foundation for future research to enhance pavement performance prediction accuracy.

  12. Benchmark data set for MSPypeline, a python package for streamlined mass...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    xml
    Updated Jul 22, 2021
    Cite
    Alexander Held; Ursula Klingmüller (2021). Benchmark data set for MSPypeline, a python package for streamlined mass spectrometry-based proteomics data analysis [Dataset]. https://data-staging.niaid.nih.gov/resources?id=pxd025792
    Explore at:
    Available download formats: xml
    Dataset updated
    Jul 22, 2021
    Dataset provided by
    DKFZ Heidelberg
    Division Systems Biology of Signal Transduction, German Cancer Research Center (DKFZ), Heidelberg, 69120, Germany
    Authors
    Alexander Held; Ursula Klingmüller
    Variables measured
    Proteomics
    Description

    Mass spectrometry-based proteomics is increasingly employed in biology and medicine. To generate reliable information from large data sets and to ensure comparability of results, it is crucial to implement and standardize the quality control of the raw data, the data processing steps, and the statistical analyses. The MSPypeline provides a platform for the import of MaxQuant output tables, the generation of quality control reports, the preprocessing of data including normalization, and exploratory analyses by statistical inference plots. These standardized steps assess data quality, provide customizable figures, and enable the identification of differentially expressed proteins to reach biologically relevant conclusions.
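    As a rough illustration of the kind of preprocessing the package standardizes, here is a generic pandas sketch (not the MSPypeline API) that log-transforms and median-normalizes intensity columns from a MaxQuant proteinGroups.txt table; the file name and column prefix follow MaxQuant’s usual conventions:

    import numpy as np
    import pandas as pd

    # Load a MaxQuant output table; zero intensities are treated as missing
    proteins = pd.read_csv("proteinGroups.txt", sep="\t")
    intensity = proteins.filter(like="Intensity ").replace(0, np.nan)

    # Log-transform, then shift each sample to a common median so samples are comparable
    log_int = np.log2(intensity)
    normalized = log_int - log_int.median() + log_int.median().mean()
    print(normalized.describe())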

  13. National Commercial Hazardous Waste Totals by Industry 2017

    • s.cnmilf.com
    • datasets.ai
    • +1 more
    Updated Jul 25, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). National Commercial Hazardous Waste Totals by Industry 2017 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/national-commercial-hazardous-waste-totals-by-industry-2017
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains 2017 national Commercial RCRA-defined Hazardous Waste totals by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available Python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended Input-Output Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).
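    FLOWSA is distributed on PyPI; a brief sketch of pulling a flow-by-activity table with it (the datasource key for the RCRAInfo-based data is an assumption; the FLOWSA documentation lists the supported sources):

    import flowsa

    # Datasource key is an assumption; see FLOWSA's documentation for the current list
    hazardous_waste = flowsa.getFlowByActivity(datasource="EPA_RCRAInfo", year=2017)
    print(hazardous_waste.head())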

  14. Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    o; o (2025). Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of U.S. Tech Firms [Dataset]. http://doi.org/10.5281/zenodo.15337959
    Explore at:
    Available download formats: zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    o; o
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 4, 2025
    Description

    Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.

    This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:

    • ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)

    • ESM_2.py – Python script to calculate Z-scores from raw financial ratios

    • ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios

    • ESM_4.py – Python script for generating the correlation heatmap of the Z-scores

    • ESM_5.xlsx – Mahalanobis distance values for each firm

    • ESM_6.py – Python script to compute Mahalanobis distances

    • ESM_7.py – Python script to visualize Mahalanobis distances

    • ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)

    • ESM_9.py – Python script to compute mean Z-scores

    • ESM_10.xlsx – Re-standardized Z-scores based on firm-level means

    • ESM_11.py – Python script to re-standardize mean Z-scores

    • ESM_12.py – Python script to generate the hierarchical clustering dendrogram

    All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
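
    To make the distance stage concrete, here is a compact numpy/scipy sketch of the Mahalanobis computation; synthetic ratios stand in for the ESM_1.xlsx contents, and ESM_6.py holds the authors’ actual script:

    import numpy as np
    from scipy.spatial.distance import mahalanobis

    # Synthetic stand-in: 18 firms x 5 financial ratios
    rng = np.random.default_rng(0)
    ratios = rng.normal(size=(18, 5))

    mean = ratios.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(ratios, rowvar=False))

    # Distance of each firm from the multivariate center; large values flag outliers
    distances = np.array([mahalanobis(row, mean, inv_cov) for row in ratios])
    print(distances.round(2))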

  15. NASICON-type solid electrolyte materials named entity recognition dataset

    • scidb.cn
    Updated Apr 27, 2023
    Cite
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi (2023). NASICON-type solid electrolyte materials named entity recognition dataset [Dataset]. http://doi.org/10.57760/sciencedb.j00213.00001
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 27, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi
    Description

    1. Framework overview. This paper proposes a pipeline to construct high-quality datasets for text mining in materials science. First, we utilize a traceable automatic acquisition scheme for literature to ensure the traceability of textual data. Then, a data processing method driven by downstream tasks is performed to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.

    2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Herein, we mainly introduce the details of the NASICON entity recognition dataset.

    2.1 Data collection and preprocessing. First, 55 materials science papers related to the NASICON system are collected through Crystallographic Information Files (CIF), which contain a wealth of structure-activity relationship information. Note that materials science literature is mostly stored as portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of each paper, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token
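    A minimal sketch of that PDF-to-text step using pdfminer.six with a regular-expression cleanup pass; the file name and the specific patterns are illustrative, as the authors’ exact rules are not reproduced in this record:

    import re
    from pdfminer.high_level import extract_text

    # Convert one PDF paper to plain text (file name is a placeholder)
    raw = extract_text("paper.pdf")

    # Illustrative cleanup rules for line-break and spacing artifacts
    text = re.sub(r"-\n(?=\w)", "", raw)         # re-join words hyphenated across line breaks
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)  # flatten hard line wraps into spaces
    text = re.sub(r" {2,}", " ", text).strip()   # squeeze repeated spaces

    with open("paper.txt", "w", encoding="utf-8") as f:
        f.write(text)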

  16. National Employment Totals by Industry 2017

    • s.cnmilf.com
    • catalog.data.gov
    Updated Jul 25, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). National Employment Totals by Industry 2017 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/national-employment-totals-by-industry-2017
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains 2017 national employment totals by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available Python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended Input-Output Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).

  17. National Land Occupation Totals By Industry 2012

    • s.cnmilf.com
    • catalog.data.gov
    Updated Jul 25, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). National Land Occupation Totals By Industry 2012 [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/national-land-occupation-totals-by-industry-2012
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains 2012 national-level land occupation totals by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available Python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended Input-Output Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).

  18. National Point Source Releases to Ground Totals by Industry 2017

    • catalog.data.gov
    Updated Jul 25, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). National Point Source Releases to Ground Totals by Industry 2017 [Dataset]. https://catalog.data.gov/dataset/national-point-source-releases-to-ground-totals-by-industry-2017
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset contains 2017 national point source releases to ground by North American Industry Classification System (NAICS) 2012 6-digit codes. This dataset was created in FLOWSA, a publicly available Python package that generates standardized environmental flows by industry. This dataset is associated with the following publication: Ingwersen, W.W., M. Li, B. Young, J. Vendries, and C. Birney. USEEIO v2.0, The US Environmentally-Extended Input-Output Model v2.0. Scientific Data. Springer Nature Group, New York, NY, 194, (2022).

  19. PROCESSED DATA .nc (NetCDF Files)

    • figshare.com
    hdf
    Updated Apr 24, 2022
    Cite
    Kartika Wardani (2022). PROCESSED DATA .nc (NetCDF Files) [Dataset]. http://doi.org/10.6084/m9.figshare.19641777.v1
    Explore at:
    Available download formats: hdf
    Dataset updated
    Apr 24, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kartika Wardani
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the processed data, in NetCDF (.nc) files, used in our study. We used the SPI (Standardized Precipitation Index) to determine meteorological drought conditions in the study area, calculated with the open-source Climate and Drought Indices module in Python.
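    NetCDF files like these can be inspected from Python with xarray; a minimal sketch in which the file and variable names are placeholders, since the record does not list them:

    import xarray as xr

    # Open one processed NetCDF file (file name is a placeholder)
    ds = xr.open_dataset("spi_processed.nc")
    print(ds)  # lists dimensions, coordinates, and data variables

    # Example: time-mean of a hypothetical "spi" variable
    print(ds["spi"].mean(dim="time"))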

  20. protein_net

    • tensorflow.org
    Updated Dec 16, 2022
    Cite
    (2022). protein_net [Dataset]. https://www.tensorflow.org/datasets/catalog/protein_net
    Explore at:
    Dataset updated
    Dec 16, 2022
    Description

    ProteinNet is a standardized data set for machine learning of protein structure. It provides protein sequences, structures (secondary and tertiary), multiple sequence alignments (MSAs), position-specific scoring matrices (PSSMs), and standardized training / validation / test splits. ProteinNet builds on the biennial CASP assessments, which carry out blind predictions of recently solved but publicly unavailable protein structures, to provide test sets that push the frontiers of computational methodology. It is organized as a series of data sets, spanning CASP 7 through 12 (covering a ten-year period), to provide a range of data set sizes that enable assessment of new methods in relatively data poor and data rich regimes.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('protein_net', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.
