50 datasets found
  1. DataSheet_1_Automated data preparation for in vivo tumor characterization...

    • frontiersin.figshare.com
    docx
    Updated Jun 13, 2023
    Cite
    Denis Krajnc; Clemens P. Spielvogel; Marko Grahovac; Boglarka Ecsedi; Sazan Rasul; Nina Poetsch; Tatjana Traub-Weidinger; Alexander R. Haug; Zsombor Ritter; Hussain Alizadeh; Marcus Hacker; Thomas Beyer; Laszlo Papp (2023). DataSheet_1_Automated data preparation for in vivo tumor characterization with machine learning.docx [Dataset]. http://doi.org/10.3389/fonc.2022.1017911.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Denis Krajnc; Clemens P. Spielvogel; Marko Grahovac; Boglarka Ecsedi; Sazan Rasul; Nina Poetsch; Tatjana Traub-Weidinger; Alexander R. Haug; Zsombor Ritter; Hussain Alizadeh; Marcus Hacker; Thomas Beyer; Laszlo Papp
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: This study proposes machine learning-driven data preparation (MLDP) for optimal data preparation (DP) prior to building prediction models for cancer cohorts.
    Methods: A collection of well-established DP methods was incorporated to build the DP pipelines for various clinical cohorts prior to machine learning. Evolutionary algorithm principles combined with hyperparameter optimization were employed to iteratively select the best-fitting subset of data preparation algorithms for the given dataset. The proposed method was validated for glioma and prostate single-center cohorts by a 100-fold Monte Carlo (MC) cross-validation scheme with an 80-20% training-validation split ratio. In addition, a dual-center diffuse large B-cell lymphoma (DLBCL) cohort was utilized, with Center 1 as training and Center 2 as independent validation datasets, to predict cohort-specific clinical endpoints. Five machine learning (ML) classifiers were employed for building prediction models across all analyzed cohorts. Predictive performance was estimated by confusion matrix analytics over the validation sets of each cohort. The performance of each model with and without MLDP, as well as with manually defined DP, was compared in each of the four cohorts.
    Results: Sixteen of twenty established predictive models demonstrated an increase in area under the receiver operating characteristic curve (AUC) when utilizing the MLDP. The MLDP resulted in the highest performance increase for the random forest (RF) (+0.16 AUC) and support vector machine (SVM) (+0.13 AUC) model schemes for predicting 36-month survival in the glioma cohort. Single-center cohorts resulted in complex (6-7 DP steps) DP pipelines, with a high occurrence of outlier detection, feature selection and synthetic minority oversampling technique (SMOTE). In contrast, the optimal DP pipeline for the dual-center DLBCL cohort included only outlier detection and SMOTE DP steps.
    Conclusions: This study demonstrates that data preparation prior to ML prediction model building in cancer cohorts should itself be ML-driven, yielding optimal prediction models in both single- and multi-centric settings.
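    The validation scheme described here is repeated random splitting; below is a minimal sketch of a 100-fold Monte Carlo cross-validation with an 80-20% split and AUC scoring. The classifier and data handling are placeholders, not the study's MLDP pipeline.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def monte_carlo_auc(X, y, n_repeats=100, seed=0):
        """Mean/SD of validation AUC over repeated random 80-20% splits."""
        aucs = []
        for i in range(n_repeats):
            X_tr, X_val, y_tr, y_val = train_test_split(
                X, y, test_size=0.2, stratify=y, random_state=seed + i)
            clf = RandomForestClassifier(n_estimators=200, random_state=seed + i)
            clf.fit(X_tr, y_tr)  # stand-in for one of the five ML classifiers
            aucs.append(roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
        return float(np.mean(aucs)), float(np.std(aucs))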

  2. Ground Validation Data Used to Map Benthic Habitats of the Republic of Palau...

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated May 22, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). Ground Validation Data Used to Map Benthic Habitats of the Republic of Palau [Dataset]. https://catalog.data.gov/dataset/ground-validation-data-used-to-map-benthic-habitats-of-the-republic-of-palau1
    Explore at:
    Dataset updated
    May 22, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Area covered
    Palau
    Description

    This project is a cooperative effort among the National Ocean Service, National Centers for Coastal Ocean Science, Center for Coastal Monitoring and Assessment; the University of Hawaii; and Analytical Laboratories of Hawaii, LLC. The goal of the work was to incorporate previously developed mapping methods to produce benthic habitat maps generated by photo interpreting georeferenced IKONOS satellite imagery. These point data were generated to conduct ground validation during map preparation.

  3. Oahu Ground Validation Point Data for Benthic Habitats of the Main Hawaiian...

    • catalog.data.gov
    • datasets.ai
    • +3 more
    Updated May 22, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). Oahu Ground Validation Point Data for Benthic Habitats of the Main Hawaiian Islands Prepared by Visual Interpretation from Remote Sensing Imagery Collected by NOAA Year 2000 [Dataset]. https://catalog.data.gov/dataset/oahu-ground-validation-point-data-for-benthic-habitats-of-the-main-hawaiian-islands-prepar-20005
    Explore at:
    Dataset updated
    May 22, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Area covered
    Hawaiian Islands, O‘ahu, Hawaii
    Description

    This project is a cooperative effort among the National Ocean Service, National Centers for Coastal Ocean Science, Center for Coastal Monitoring and Assessment; the University of Hawaii; and Analytical Laboratories of Hawaii, LLC. The goal of the work was to develop coral reef mapping methods and compare benthic habitat maps generated by photointerpreting georeferenced color aerial photography, hyperspectral and IKONOS satellite imagery. These point data were generated to conduct ground validation during map preparation.

  4. Molokai Ground Validation Point Data for Benthic Habitats of the Main...

    • catalog.data.gov
    • fisheries.noaa.gov
    • +1 more
    Updated May 22, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). Molokai Ground Validation Point Data for Benthic Habitats of the Main Hawaiian Islands Prepared by Visual Interpretation from Remote Sensing Imagery Collected by NOAA Year 2000 [Dataset]. https://catalog.data.gov/dataset/molokai-ground-validation-point-data-for-benthic-habitats-of-the-main-hawaiian-islands-pre-20005
    Explore at:
    Dataset updated
    May 22, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Area covered
    Moloka‘i, Hawaiian Islands, Hawaii
    Description

    This project is a cooperative effort among the National Ocean Service, National Centers for Coastal Ocean Science, Center for Coastal Monitoring and Assessment; the University of Hawaii; and Analytical Laboratories of Hawaii, LLC. The goal of the work was to develop coral reef mapping methods and compare benthic habitat maps generated by photointerpreting georeferenced color aerial photography, hyperspectral and IKONOS satellite imagery. These point data were generated to conduct ground validation during map preparation.

  5. 2.1 Drug-pair Representation and the Associated ADRs

    • figshare.com
    txt
    Updated Jun 30, 2020
    Cite
    Susmitha Shankar (2020). 2.1 Drug-pair Representation and the Associated ADRs [Dataset]. http://doi.org/10.6084/m9.figshare.12579704.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 30, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Susmitha Shankar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Drug pair data representation required for the project can be directly extracted from the repository. Along with the main dataset, subsets used for cross validation are also presented.

  6. Test Data Management Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    Cite
    Technavio, Test Data Management Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, Italy, and UK), APAC (Australia, China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/test-data-management-market-industry-analysis
    Explore at:
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    United States, Global
    Description


    Test Data Management Market Size 2025-2029

    The test data management market size is forecast to increase by USD 727.3 million, at a CAGR of 10.5% between 2024 and 2029.

    The market is experiencing significant growth, driven by the increasing adoption of automation by enterprises to streamline their testing processes. The automation trend is fueled by the growing consumer spending on technological solutions, as businesses seek to improve efficiency and reduce costs. However, the market faces challenges, including the lack of awareness and standardization in test data management practices. This obstacle hinders the effective implementation of test data management solutions, requiring companies to invest in education and training to ensure successful integration. To capitalize on market opportunities and navigate challenges effectively, businesses must stay informed about emerging trends and best practices in test data management. By doing so, they can optimize their testing processes, reduce risks, and enhance overall quality.

    What will be the Size of the Test Data Management Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    The market continues to evolve, driven by the ever-increasing volume and complexity of data. Data exploration and analysis are at the forefront of this dynamic landscape, with data ethics and governance frameworks ensuring data transparency and integrity. Data masking, cleansing, and validation are crucial components of data management, enabling data warehousing, orchestration, and pipeline development. Data security and privacy remain paramount, with encryption, access control, and anonymization key strategies. Data governance, lineage, and cataloging facilitate data management software automation and reporting. Hybrid data management solutions, including artificial intelligence and machine learning, are transforming data insights and analytics. Data regulations and compliance are shaping the market, driving the need for data accountability and stewardship. Data visualization, mining, and reporting provide valuable insights, while data quality management, archiving, and backup ensure data availability and recovery. Data modeling, data integrity, and data transformation are essential for data warehousing and data lake implementations. Data management platforms are seamlessly integrated into these evolving patterns, enabling organizations to effectively manage their data assets and gain valuable insights. Data management services, cloud and on-premise, are essential for organizations to adapt to the continuous changes in the market and effectively leverage their data resources.

    How is this Test Data Management Industry segmented?

    The test data management industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023, for the following segments.
    Application: On-premises, Cloud-based
    Component: Solutions, Services
    End-user: Information technology, Telecom, BFSI, Healthcare and life sciences, Others
    Sector: Large enterprise, SMEs
    Geography: North America (US, Canada), Europe (France, Germany, Italy, UK), APAC (Australia, China, India, Japan), Rest of World (ROW)

    By Application Insights

    The on-premises segment is estimated to witness significant growth during the forecast period. In the realm of data management, on-premises testing represents a popular approach for businesses seeking control over their infrastructure and testing process. This approach involves establishing testing facilities within an office or data center, necessitating a dedicated team with the necessary skills. The benefits of on-premises testing extend beyond control, as it enables organizations to upgrade and configure hardware and software at their discretion, providing opportunities for exploration testing. Furthermore, data security is a significant concern for many businesses, and on-premises testing alleviates the risk of compromising sensitive information to third-party companies. Data exploration, a crucial aspect of data analysis, can be carried out more effectively with on-premises testing, ensuring data integrity and security. Data masking, cleansing, and validation are essential data preparation techniques that can be executed efficiently in an on-premises environment. Data warehousing, data pipelines, and data orchestration are integral components of data management, and on-premises testing allows for seamless integration and management of these elements. Data governance frameworks, lineage, catalogs, and metadata are essential for maintaining data transparency and compliance. Data security, encryption, and access control are paramount, and on-premises testing offers greater control over these aspects. Data reporting

  7. Development and validation of Food and Nutrition Literacy Survey 2024 FANSY...

    • b2find.eudat.eu
    Updated Jul 23, 2025
    Cite
    (2025). Development and validation of Food and Nutrition Literacy Survey 2024 FANSY - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3405ac76-dad9-5b8b-abab-d06d8dcaeb5f
    Explore at:
    Dataset updated
    Jul 23, 2025
    Description

    The research aimed to develop and validate a general food and nutrition literacy (FNL) assessment tool, which will be used to measure FNL in the adult population: the Food and Nutrition Literacy Survey (FANSy). To validate the tool, the preliminary version of the questionnaire was administered to a representative sample of adults living in the United Kingdom. These data were used for validation and preparation of the final version of the survey. Primary data from the research will be made available after the end of the SYRI project.

  8. Data from: Validation of Methods to Assess the Immunoglobulin Gene...

    • catalog.data.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1 more
    Updated Apr 11, 2025
    + more versions
    Cite
    Open Science Data Repository (2025). Validation of Methods to Assess the Immunoglobulin Gene Repertoire in Tissues Obtained from Mice on the International Space Station [Dataset]. https://catalog.data.gov/dataset/validation-of-methods-to-assess-the-immunoglobulin-gene-repertoire-in-tissues-obtained-fro-80f34
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Open Science Data Repository
    Description

    Spaceflight is known to affect immune cell populations. In particular, splenic B-cell numbers decrease during spaceflight and in ground-based physiological models. Although antibody isotype changes have been assessed during and after spaceflight, an extensive characterization of the impact of spaceflight on antibody composition has not been conducted in mice. Next Generation Sequencing and bioinformatic tools are now available to assess antibody repertoires. We can now identify immunoglobulin gene-segment usage, junctional regions, and modifications that contribute to specificity and diversity. Due to limitations on the International Space Station, alternate sample collection and storage methods must be employed. Our group compared Illumina MiSeq sequencing data from multiple sample preparation methods in normal C57Bl/6J mice to validate that sample preparation and storage would not bias the outcome of antibody repertoire characterization. In this report, we also compared sequencing techniques and a bioinformatic workflow on the data output when we assessed the IgH and Igκ variable gene usage. Our bioinformatic workflow has been optimized for Illumina HiSeq and MiSeq datasets, and is designed specifically to reduce bias, capture the most information from Ig sequences, and produce a data set that provides other data mining options.

  9. Data Annotation Tools Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 30, 2025
    Cite
    Growth Market Reports (2025). Data Annotation Tools Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-annotation-tools-market-global-geographical-industry-analysis
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Annotation Tools Market Outlook



    According to our latest research, the global Data Annotation Tools market size reached USD 2.1 billion in 2024. The market is set to expand at a robust CAGR of 26.7% from 2025 to 2033, projecting a remarkable value of USD 18.1 billion by 2033. The primary growth driver for this market is the escalating adoption of artificial intelligence (AI) and machine learning (ML) across various industries, which necessitates high-quality labeled data for model training and validation.




    One of the most significant growth factors propelling the data annotation tools market is the exponential rise in AI-powered applications across sectors such as healthcare, automotive, retail, and BFSI. As organizations increasingly integrate AI and ML into their core operations, the demand for accurately annotated data has surged. Data annotation tools play a crucial role in transforming raw, unstructured data into structured, labeled datasets that can be efficiently used to train sophisticated algorithms. The proliferation of deep learning and natural language processing technologies further amplifies the need for comprehensive data labeling solutions. This trend is particularly evident in industries like healthcare, where annotated medical images are vital for diagnostic algorithms, and in automotive, where labeled sensor data supports the evolution of autonomous vehicles.




    Another prominent driver is the shift toward automation and digital transformation, which has accelerated the deployment of data annotation tools. Enterprises are increasingly adopting automated and semi-automated annotation platforms to enhance productivity, reduce manual errors, and streamline the data preparation process. The emergence of cloud-based annotation solutions has also contributed to market growth by enabling remote collaboration, scalability, and integration with advanced AI development pipelines. Furthermore, the growing complexity and variety of data types, including text, audio, image, and video, necessitate versatile annotation tools capable of handling multimodal datasets, thus broadening the market's scope and applications.




    The market is also benefiting from a surge in government and private investments aimed at fostering AI innovation and digital infrastructure. Several governments across North America, Europe, and Asia Pacific have launched initiatives and funding programs to support AI research and development, including the creation of high-quality, annotated datasets. These efforts are complemented by strategic partnerships between technology vendors, research institutions, and enterprises, which are collectively advancing the capabilities of data annotation tools. As regulatory standards for data privacy and security become more stringent, there is an increasing emphasis on secure, compliant annotation solutions, further driving innovation and market demand.




    From a regional perspective, North America currently dominates the data annotation tools market, driven by the presence of major technology companies, well-established AI research ecosystems, and significant investments in digital transformation. However, Asia Pacific is emerging as the fastest-growing region, fueled by rapid industrialization, expanding IT infrastructure, and a burgeoning startup ecosystem focused on AI and data science. Europe also holds a substantial market share, supported by robust regulatory frameworks and active participation in AI research. Latin America and the Middle East & Africa are gradually catching up, with increasing adoption in sectors such as retail, automotive, and government. The global landscape is characterized by dynamic regional trends, with each market contributing uniquely to the overall growth trajectory.





    Component Analysis



    The data annotation tools market is segmented by component into software and services, each playing a pivotal role in the market's overall ecosystem. Software solutions form the backbone of the market, providing the technical infrastructure for auto

  10. Histopathology data of bone marrow biopsies (HistBMP or HistMNIST)

    • zenodo.org
    application/gzip
    Updated Jan 24, 2020
    Cite
    Jakub Tomczak; Jakub Tomczak (2020). Histopathology data of bone marrow biopsies (HistBMP or HistMNIST) [Dataset]. http://doi.org/10.5281/zenodo.1205024
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jakub Tomczak; Jakub Tomczak
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Data information

    We prepared a dataset based on histopathological images freely available online (http://www.enjoypath.com/). We selected 16 patients (patient IDs: 272, 274, 283, 289, 290, 291, 292, 295, 297, 298, 299). Each histopathological image represents a bone marrow biopsy. Diagnoses of the chosen cases were associated with different kinds of cancer (e.g., lymphoma, leukemia) or anemia. All original images were taken using HE, 40×, and each image was of size 336 × 448.

    Data preparation

    The original RGB representation was transformed to grayscale. Further, we divided each image into small patches of size 28 × 28. Eventually, we picked 10 patients for training, 3 patients for validation and 3 patients for testing, which resulted in 6,800 training images, 2,000 validation images and 2,000 test images. The selection of patients was performed in such a fashion that each dataset contained representative images with different diagnoses and amounts of fat.

    Since the small patches resemble MNIST, a widely used benchmark in the machine learning/AI community, the dataset is referred to as HistMNIST.
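    A rough sketch of the preparation described above (grayscale conversion, then tiling each 336 × 448 image into 28 × 28 patches); the function and file path are illustrative, not part of the published dataset.

    import numpy as np
    from PIL import Image

    def image_to_patches(path, patch=28):
        """Convert an RGB image to grayscale and cut it into 28 x 28 tiles."""
        gray = np.asarray(Image.open(path).convert("L"))
        h, w = gray.shape  # 336 x 448 yields 12 x 16 = 192 patches per image
        return [gray[r:r + patch, c:c + patch]
                for r in range(0, h - patch + 1, patch)
                for c in range(0, w - patch + 1, patch)]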

    First usage

    The dataset was used to train deep generative models (VAEs):

    • Tomczak, J. M., & Welling, M. (2016). Improving variational auto-encoders using Householder flow. arXiv preprint arXiv:1611.09630.
  11. Data from: Applying machine learning to predict bowel preparation adequacy...

    • tandf.figshare.com
    jpeg
    Updated Mar 11, 2025
    Cite
    Jianying Liu; Wei Jiang; Yahong Yu; Jiali Gong; Guie Chen; Yuxing Yang; Chao Wang; Dalong Sun; Xuefeng Lu (2025). Applying machine learning to predict bowel preparation adequacy in elderly patients for colonoscopy: development and validation of a web-based prediction tool [Dataset]. http://doi.org/10.6084/m9.figshare.28573083.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Jianying Liu; Wei Jiang; Yahong Yu; Jiali Gong; Guie Chen; Yuxing Yang; Chao Wang; Dalong Sun; Xuefeng Lu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adequate bowel preparation is crucial for effective colonoscopy, especially in elderly patients who face a high risk of inadequate preparation. This study develops and validates a machine learning model to predict bowel preparation adequacy in elderly patients before colonoscopy. The study adhered to the TRIPOD AI guidelines. Clinical data from 471 elderly patients collected between February and December 2023 were utilized for developing and internally validating the model, while data from 221 patients collected from March to June 2024 were used for external validation. The Boruta algorithm was applied for feature selection. Models including logistic regression, light gradient boosting machines, support vector machines (SVM), decision trees, random forests, and extreme gradient boosting were evaluated using metrics such as AUC, accuracy, sensitivity, and specificity. The SHAP algorithm helped rank feature importance. A web-based application was developed using the Streamlit framework to enhance clinical usability. The Boruta algorithm identified 7 key features. The SVM model excelled with an AUC of 0.895 (95% CI: 0.822–0.969), and high accuracy, sensitivity, and specificity. In external validation, the SVM model maintained robust performance with an AUC of 0.889. The SHAP algorithm further explained the contribution of each feature to model predictions. The study developed an interpretable and practical machine learning model for predicting bowel preparation adequacy in elderly patients, facilitating early interventions to improve outcomes and reduce resource wastage. SHAP analysis enhanced the interpretability of the model by identifying key predictive factors, making it a reliable and transparent tool for clinical use. The predictive model was integrated into a user-friendly web application, enabling healthcare providers to identify high-risk patients early and enhance the quality of bowel preparation interventions.
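    A hedged sketch of the modelling recipe the abstract outlines: Boruta feature selection followed by an SVM evaluated by AUC. It assumes the boruta and scikit-learn packages and numpy-array inputs; the data handling and parameters are illustrative, not the authors' code.

    from boruta import BorutaPy
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.svm import SVC

    def boruta_svm_auc(X_train, y_train, X_test, y_test):
        """Select features with Boruta, fit an SVM, report validation AUC."""
        rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
        selector = BorutaPy(rf, n_estimators="auto", random_state=42)
        selector.fit(X_train, y_train)  # BorutaPy expects numpy arrays
        X_tr = X_train[:, selector.support_]
        X_te = X_test[:, selector.support_]
        svm = SVC(probability=True, random_state=42).fit(X_tr, y_train)
        return roc_auc_score(y_test, svm.predict_proba(X_te)[:, 1])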

  12. Phase Field Raw Data

    • data.dtu.dk
    bin
    Updated Jul 23, 2024
    + more versions
    Cite
    Laura Rieger; Klemen Zelič; Igor Mele; Tomaž Katrašnik; Arghya Bhowmik (2024). Phase Field Raw Data [Dataset]. http://doi.org/10.11583/DTU.26325274.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Technical University of Denmark
    Authors
    Laura Rieger; Klemen Zelič; Igor Mele; Tomaž Katrašnik; Arghya Bhowmik
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the raw data for the dataset found at https://data.dtu.dk/articles/dataset/Phase_field_data/25562364. It comprises tailored phase field prediction data generated using an automated workflow designed to offer insights into complex phenomena while minimizing computational expense. The dataset aims to facilitate benchmarking of new algorithms in phase field prediction, emphasizing accessibility and utility for researchers. The data creation process is detailed, focusing on streamlining data collection and preparation. Validation of the dataset's effectiveness is conducted through a benchmark experiment utilizing U-Net regression, a widely adopted neural network architecture. Results showcase competitive performance of the U-Net model, akin to previous state-of-the-art methods. This dataset not only serves as a valuable resource for the phase field prediction community but also highlights the potential of U-Net regression, fostering further advancements in the field. The accompanying code at https://github.com/laura-rieger/phase_field_benchmark describes in detail how the dataset is to be used.
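    For orientation, a compact U-Net-style regression model of the kind the benchmark experiment mentions, written in PyTorch; this is a generic sketch, not the architecture used for the published baseline.

    import torch
    import torch.nn as nn

    def block(c_in, c_out):
        return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

    class TinyUNet(nn.Module):
        """One-skip encoder-decoder mapping a 1-channel field to a 1-channel field."""
        def __init__(self, width=32):
            super().__init__()
            self.enc, self.down = block(1, width), nn.MaxPool2d(2)
            self.mid = block(width, width * 2)
            self.up = nn.ConvTranspose2d(width * 2, width, 2, stride=2)
            self.dec, self.head = block(width * 2, width), nn.Conv2d(width, 1, 1)

        def forward(self, x):
            e = self.enc(x)
            m = self.mid(self.down(e))
            return self.head(self.dec(torch.cat([self.up(m), e], dim=1)))

    # One MSE training step on a stand-in batch (real fields come from the dataset):
    model, loss_fn = TinyUNet(), nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = torch.randn(4, 1, 64, 64), torch.randn(4, 1, 64, 64)
    opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()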

  13. Ground Validation Data Used to Map Benthic Habitats of the Republic of...

    • datadiscoverystudio.org
    esri shapefile
    Updated 2007
    Cite
    (2007). Ground Validation Data Used to Map Benthic Habitats of the Republic of Palau, NOAA/NMFS/EDM [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/77175bc649c043b4a990f4f0fbbe5148/html
    Explore at:
    Available download formats: esri shapefile
    Dataset updated
    2007
    Description

    This project is a cooperative effort among the National Ocean Service, National Centers for Coastal Ocean Science, Center for Coastal Monitoring and Assessment; the University of Hawaii; and Analytical Laboratories of Hawaii, LLC. The goal of the work was to incorporate previously developed mapping methods to produce benthic habitat maps generated by photo interpreting georeferenced IKONOS satellite imagery. These point data were generated to conduct ground validation during map preparation.

  14. Validation dataset for Land Cover Map of Europe 2017 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated May 9, 2023
    Cite
    (2023). Validation dataset for Land Cover Map of Europe 2017 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/e2fd3d68-bd50-5118-886c-aab2d9c31c49
    Explore at:
    Dataset updated
    May 9, 2023
    Area covered
    Europe
    Description

    Thematic accuracy assessment of land cover/use products requires reliable reference data that enable their qualitative and quantitative evaluation. Such a dataset with up-to-date information on a predefined class composition and spatial distribution is rarely available, and its preparation requires an appropriate methodological approach adjusted to a specific product. Development of a new pan-European land cover/use map, generated from Copernicus Sentinel-2 data 2017 within the Sentinel-2 Global Land Cover (S2GLC) project carried out under a programme of and funded by the European Space Agency, provided an opportunity to design and develop a unique dataset dedicated to validation of this product. The dataset was prepared by twofold stratified random sampling. The first selection designated validation sites represented by Sentinel-2 image tiles and was performed on a country level with county borders used as a stratum. In the second selection, validation samples were chosen randomly within the validation sites with stratification based on classes of the CORINE Land Cover database. The final dataset, composed of samples visually checked by experienced image interpreters, consists of a total of 52,024 samples spread over the European countries. The samples represent 13 land cover/use classes including artificial surfaces, natural material surfaces (consolidated and un-consolidated), broadleaf tree cover, coniferous tree cover, herbaceous vegetation, moors and heathland, sclerophyllous vegetation, cultivated areas, vineyards, marshes, peatbogs, water bodies and permanent snow cover. Each sample provides information about the occurrence of one of the predefined land cover or land use classes within an area of 100 m² represented by a single pixel (10 m size) of Sentinel-2 imagery for the year 2017. The described dataset was used for the accuracy assessment process of the product Land Cover Map of Europe 2017 resulting from the S2GLC project and provided an estimate of the overall accuracy at the level of 86.1%.

    S2GLC - Land Cover Map of Europe 2017 reference data: CBK PAN, http://s2glc.cbk.waw.pl/extension

    Attribute table fields:
    'S2GLC' – a land cover/use class symbol according to the S2GLC classification system
    'TILE' – a symbol of the Sentinel-2 granule (a tile of the Military Grid Reference System)
    'NAME_ENG' – a country English name (valid for inland and coastal areas). Data source of country names and administrative boundaries: 'Countries, 2020 - Administrative Units - Dataset' of European Commission, Eurostat (ESTAT), GISCO, https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units/countries

    Classification system:
    111 - Artificial surfaces
    211 - Cultivated areas
    221 - Vineyards
    231 - Herbaceous vegetation
    311 - Broadleaf tree cover
    312 - Coniferous tree cover
    322 - Moors and heathland
    323 - Sclerophyllous vegetation
    331 - Natural material surfaces
    335 - Permanent snow cover
    411 - Marshes
    412 - Peatbogs
    511 - Water bodies

    Data projection: Lambert Azimuthal Equal Area (LAEA), EPSG: 3035
    For more technical information on this dataset please refer to Malinowski et al. (2020).
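    A small sketch of how the attribute table and classification system above could be consumed, assuming geopandas and a hypothetical shapefile name.

    import geopandas as gpd

    # Code-to-name mapping taken from the classification system listed above.
    S2GLC_CLASSES = {
        111: "Artificial surfaces", 211: "Cultivated areas", 221: "Vineyards",
        231: "Herbaceous vegetation", 311: "Broadleaf tree cover",
        312: "Coniferous tree cover", 322: "Moors and heathland",
        323: "Sclerophyllous vegetation", 331: "Natural material surfaces",
        335: "Permanent snow cover", 411: "Marshes", 412: "Peatbogs",
        511: "Water bodies",
    }

    samples = gpd.read_file("s2glc_validation_samples.shp")  # hypothetical file name
    samples["class_name"] = samples["S2GLC"].astype(int).map(S2GLC_CLASSES)
    print(samples["class_name"].value_counts())  # per-class counts of the 52,024 samples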

  15. visual_haystacks_v0

    • huggingface.co
    Updated Jul 27, 2024
    + more versions
    Cite
    Patrick (Tsung-Han) Wu (2024). visual_haystacks_v0 [Dataset]. https://huggingface.co/datasets/tsunghanwu/visual_haystacks_v0
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 27, 2024
    Authors
    Patrick (Tsung-Han) Wu
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Visual Haystacks Dataset Card

      Dataset details
    

    Dataset type: Visual Haystacks (VHs) is a benchmark dataset specifically designed to evaluate a Large Multimodal Model's (LMM's) capability to handle long-context visual information. It can also be viewed as the first visual-centric Needle-In-A-Haystack (NIAH) benchmark dataset. Please also download COCO-2017's training and validation sets.

    Data Preparation and Benchmarking

    Download the VQA questions: huggingface-cli… See the full description on the dataset page: https://huggingface.co/datasets/tsunghanwu/visual_haystacks_v0.
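    The download command above is truncated in the card; as a minimal sketch, the repository can typically be fetched with the huggingface_hub Python API instead (the local directory is an arbitrary choice).

    from huggingface_hub import snapshot_download

    # Fetch the dataset repository; repo_id is taken from the dataset page above.
    snapshot_download(repo_id="tsunghanwu/visual_haystacks_v0",
                      repo_type="dataset", local_dir="./visual_haystacks_v0")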

  16. UVP5 data sorted with EcoTaxa and MorphoCluster

    • seanoe.org
    • pigma.org
    image/*
    Updated 2020
    + more versions
    Cite
    Rainer Kiko; Simon-Martin Schröder (2020). UVP5 data sorted with EcoTaxa and MorphoCluster [Dataset]. http://doi.org/10.17882/73002
    Explore at:
    Available download formats: image/*
    Dataset updated
    2020
    Dataset provided by
    SEANOE
    Authors
    Rainer Kiko; Simon-Martin Schröder
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Oct 23, 2012 - Aug 7, 2017
    Description

    Here, we provide plankton image data that was sorted with the web applications EcoTaxa and MorphoCluster. The dataset was used for image classification tasks as described in Schröder et al. (in preparation) and does not include any geospatial or temporal metadata. Plankton was imaged using the Underwater Vision Profiler 5 (Picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08. This data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using EcoTaxa), "validation.csv" (list of 196k validation images for classification, validated using EcoTaxa), "unlabeld.csv" (list of 1M unlabeled images), "morphocluster.csv" (1.2M objects validated using MorphoCluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. The CSV files each contain the columns "object_id" (a unique id), "image_fn" (the relative filename), and "label" (the assigned name). The training and validation sets were sorted into 65 classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). This data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects and the class sizes span four orders of magnitude. The validation set and a set of an additional 1M unlabeled images were sorted during the first trial of MorphoCluster (https://github.com/morphocluster). The images in this data set were sampled during RV Meteor cruises M92, M93, M96, M97, M98, M105, M106, M107, M108, M116, M119, M121, M130, M131, M135, M136, M137 and M138, during RV Maria S. Merian cruises MSM22, MSM23, MSM40 and MSM49, during the RV Polarstern cruise PS88b and during the FLUXES1 experiment with RV Sarmiento de Gamboa. The following people have contributed to the sorting of the image data on EcoTaxa: Rainer Kiko, Tristan Biard, Benjamin Blanc, Svenja Christiansen, Justine Courboules, Charlotte Eich, Jannik Faustmann, Christine Gawinski, Augustin Lafond, Aakash Panchal, Marc Picheral, Akanksha Singh and Helena Hauss. In Schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. The classification using MorphoCluster was conducted by Rainer Kiko. Used labels are operational and not yet matched to respective EcoTaxa classes.
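    A quick sketch of reading the published CSVs (columns: object_id, image_fn, label) and checking the class imbalance noted above; the file is assumed to sit in the working directory.

    import pandas as pd

    train = pd.read_csv("training.csv")           # 392k rows, 65 classes
    counts = train["label"].value_counts()
    top = counts.head(max(1, len(counts) // 10))  # the 10% most populated classes
    print(f"{len(counts)} classes; top 10% hold {top.sum() / counts.sum():.0%} of objects")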

  17. Data from: Validation of Innovative Exploration Technologies for Newberry...

    • catalog.data.gov
    • data.openei.org
    • +2 more
    Updated Jan 20, 2025
    + more versions
    Cite
    Davenport Newberry Holdings, LLC (2025). Validation of Innovative Exploration Technologies for Newberry Volcano: Raw Gravity Data [Dataset]. https://catalog.data.gov/dataset/validation-of-innovative-exploration-technologies-for-newberry-volcano-raw-gravity-data-88282
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Davenport Newberry Holdings LLC
    Area covered
    Newberry Volcano
    Description

    Validation of Innovative Exploration Technologies for Newberry Volcano: raw data used to prepare the Gravity Report by Zonge (2012).

  18. Table_2_Operational Challenges in the Use of Structured Secondary Data for...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Kelsy N. Areco; Tulio Konstantyner; Paulo Bandiera-Paiva; Rita C. X. Balda; Daniela T. Costa-Nobre; Adriana Sanudo; Carlos Roberto V. Kiffer; Mandira D. Kawakami; Milton H. Miyoshi; Ana Sílvia Scavacini Marinonio; Rosa M. V. Freitas; Liliam C. C. Morais; Monica L. P. Teixeira; Bernadette Waldvogel; Maria Fernanda B. Almeida; Ruth Guinsburg (2023). Table_2_Operational Challenges in the Use of Structured Secondary Data for Health Research.DOCX [Dataset]. http://doi.org/10.3389/fpubh.2021.642163.s002
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Kelsy N. Areco; Tulio Konstantyner; Paulo Bandiera-Paiva; Rita C. X. Balda; Daniela T. Costa-Nobre; Adriana Sanudo; Carlos Roberto V. Kiffer; Mandira D. Kawakami; Milton H. Miyoshi; Ana Sílvia Scavacini Marinonio; Rosa M. V. Freitas; Liliam C. C. Morais; Monica L. P. Teixeira; Bernadette Waldvogel; Maria Fernanda B. Almeida; Ruth Guinsburg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: In Brazil, secondary data for epidemiology are largely available. However, they are insufficiently prepared for use in research, even when it comes to structured data, since they were often designed for other purposes. To date, few publications focus on the process of preparing secondary data. The present findings can help in orienting future research projects that are based on secondary data.
    Objective: Describe the steps in the process of ensuring the adequacy of a secondary data set for a specific use, and identify the challenges of this process.
    Methods: The present study is qualitative and reports methodological issues about secondary data use. The study material comprised 6,059,454 live births and 73,735 infant death records from 2004 to 2013 of children whose mothers resided in the State of São Paulo, Brazil. The challenges and the description of the procedures to ensure data adequacy were undertaken in 6 steps: (1) problem understanding, (2) resource planning, (3) data understanding, (4) data preparation, (5) data validation and (6) data distribution. For each step, the procedures, the challenges encountered, the actions to cope with them and partial results were described. To identify the most labor-intensive tasks in this process, the steps were assessed by adding the number of procedures, challenges, and coping actions. The highest values were assumed to indicate the most critical steps.
    Results: In total, 22 procedures and 23 actions were needed to deal with the 27 challenges encountered along the process of ensuring the adequacy of the study material for the intended use. The final product was an organized database for a historical cohort study suitable for the intended use. Data understanding and data preparation were identified as the most critical steps, accounting for about 70% of the challenges observed in data use.
    Conclusion: Significant challenges were encountered in the process of ensuring the adequacy of secondary health data for research use, mainly in the data understanding and data preparation steps. The use of the described steps to approach structured secondary data, and knowledge of the potential challenges along the process, may contribute to planning health research.

  19. Machine learning pipeline to train toxicity prediction model of...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Jan Ewald; Jan Ewald (2020). Machine learning pipeline to train toxicity prediction model of FunTox-Networks [Dataset]. http://doi.org/10.5281/zenodo.3529162
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Ewald; Jan Ewald
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine Learning pipeline used to provide toxicity prediction in FunTox-Networks

    01_DATA # preprocessing and filtering of raw activity data from ChEMBL
    - Chembl_v25 # latest activity assay data set from ChEMBL (retrieved Nov 2019)
    - filt_stats.R # Filtering and preparation of raw data
    - Filtered # output data sets from filt_stats.R
    - toxicity_direction.csv # table of toxicity measurements and their proportionality to toxicity

    02_MolDesc # Calculation of molecular descriptors for all compounds within the filtered ChEMBL data set
    - datastore # files with all compounds and their calculated molecular descriptors based on SMILES
    - scripts
    - calc_molDesc.py # calculates the molecular descriptors for all compounds based on their SMILES
    - chemopy-1.1 # Python package used for descriptor calculation, as described in: https://doi.org/10.1093/bioinformatics/btt105

    03_Averages # Calculation of moving averages for levels and organisms as required for calculation of Z-scores
    - datastore # output files with statistics calculated by make_Z.R
    - scripts
    - make_Z.R # script to compute the statistics needed to calculate the Z-scores used by the regression models

    04_ZScores # Calculation of Z-scores and preparation of table to fit regression models
    - datastore # Z-normalized activity data and molecular descriptors in the form as used for fitting regression models
    - scripts
    - calc_Ztable.py # based on activity data, molecular descriptors and Z-statistics, the learning data is calculated

    05_Regression # Performing regression. Preparation of data by removing outliers based on a linear regression model. Learning of random forest regression models. Validation of the learning process by cross-validation and tuning of hyperparameters.

    - datastore # storage of all random forest regression models and average level of Z output value per level and organism (zexp_*.tsv)
    - scripts
    - data_preperation.R # set up of regression data set, removal of outliers and optional removal of fields and descriptors
    - Rforest_CV.R # analysis of machine learning by cross validation, importance of regression variables and tuning of hyperparameters (number of trees, split of variables)
    - Rforest.R # based on analysis of Rforest_CV.R learning of final models

    rregrs_output # early analysis of regression model performance with the package RRegrs, as described in: https://doi.org/10.1186/s13321-015-0094-2
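    The Z-normalization at the heart of steps 03-04 standardizes each activity value against the statistics of its level and organism group; a minimal pandas sketch with illustrative column names follows (the pipeline's own scripts are make_Z.R and calc_Ztable.py).

    import pandas as pd

    def add_z_scores(df: pd.DataFrame) -> pd.DataFrame:
        """Z-score each activity value within its (level, organism) group."""
        grp = df.groupby(["level", "organism"])["activity"]
        df["z"] = (df["activity"] - grp.transform("mean")) / grp.transform("std")
        return df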

  20. Data from: Validation of Methods to Assess the Immunoglobulin Gene...

    • data.amerigeoss.org
    html
    Updated Jan 29, 2020
    Cite
    United States (2020). Validation of Methods to Assess the Immunoglobulin Gene Repertoire in Tissues Obtained from Mice on the International Space Station [Dataset]. https://data.amerigeoss.org/dataset/validation-of-methods-to-assess-the-immunoglobulin-gene-repertoire-in-tissues-obtained-fro1
    Explore at:
    Available download formats: html
    Dataset updated
    Jan 29, 2020
    Dataset provided by
    United States
    Description

    Spaceflight is known to affect immune cell populations. In particular, splenic B-cell numbers decrease during spaceflight and in ground-based physiological models. Although antibody isotype changes have been assessed during and after spaceflight, an extensive characterization of the impact of spaceflight on antibody composition has not been conducted in mice. Next Generation Sequencing and bioinformatic tools are now available to assess antibody repertoires. We can now identify immunoglobulin gene-segment usage, junctional regions and modifications that contribute to specificity and diversity. Due to limitations on the International Space Station, alternate sample collection and storage methods must be employed. Our group compared Illumina MiSeq® sequencing data from multiple sample preparation methods in normal C57Bl/6J mice to validate that sample preparation and storage would not bias the outcome of antibody repertoire characterization. In this report, we also compared sequencing techniques and a bioinformatic workflow on the data output when we assessed the IgH and Igκ variable gene usage. Our bioinformatic workflow has been optimized for Illumina HiSeq® and MiSeq® datasets, and is designed specifically to reduce bias, capture the most information from Ig sequences, and produce a data set that provides other data mining options.
