50 datasets found
  1. DataSheet_1_Automated data preparation for in vivo tumor characterization...

    • frontiersin.figshare.com
    docx
    Updated Jun 13, 2023
    Cite
    Denis Krajnc; Clemens P. Spielvogel; Marko Grahovac; Boglarka Ecsedi; Sazan Rasul; Nina Poetsch; Tatjana Traub-Weidinger; Alexander R. Haug; Zsombor Ritter; Hussain Alizadeh; Marcus Hacker; Thomas Beyer; Laszlo Papp (2023). DataSheet_1_Automated data preparation for in vivo tumor characterization with machine learning.docx [Dataset]. http://doi.org/10.3389/fonc.2022.1017911.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Denis Krajnc; Clemens P. Spielvogel; Marko Grahovac; Boglarka Ecsedi; Sazan Rasul; Nina Poetsch; Tatjana Traub-Weidinger; Alexander R. Haug; Zsombor Ritter; Hussain Alizadeh; Marcus Hacker; Thomas Beyer; Laszlo Papp
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: This study proposes machine learning-driven data preparation (MLDP) for optimal data preparation (DP) prior to building prediction models for cancer cohorts.
    Methods: A collection of well-established DP methods was incorporated to build the DP pipelines for various clinical cohorts prior to machine learning. Evolutionary algorithm principles combined with hyperparameter optimization were employed to iteratively select the best-fitting subset of data preparation algorithms for the given dataset. The proposed method was validated for glioma and prostate single-center cohorts by a 100-fold Monte Carlo (MC) cross-validation scheme with an 80-20% training-validation split ratio. In addition, a dual-center diffuse large B-cell lymphoma (DLBCL) cohort was utilized, with Center 1 as training and Center 2 as independent validation datasets, to predict cohort-specific clinical endpoints. Five machine learning (ML) classifiers were employed for building prediction models across all analyzed cohorts. Predictive performance was estimated by confusion matrix analytics over the validation sets of each cohort. The performance of each model with and without MLDP, as well as with manually defined DP, was compared in each of the four cohorts.
    Results: Sixteen of twenty established predictive models demonstrated an increase in area under the receiver operating characteristic curve (AUC) when utilizing the MLDP. The MLDP resulted in the highest performance increase for the random forest (RF) (+0.16 AUC) and support vector machine (SVM) (+0.13 AUC) model schemes for predicting 36-month survival in the glioma cohort. Single-center cohorts resulted in complex (6-7 DP steps) DP pipelines, with a high occurrence of outlier detection, feature selection and synthetic minority oversampling technique (SMOTE). In contrast, the optimal DP pipeline for the dual-center DLBCL cohort included only outlier detection and SMOTE DP steps.
    Conclusions: This study demonstrates that data preparation prior to ML prediction model building in cancer cohorts should itself be ML-driven, yielding optimal prediction models in both single- and multi-centric settings.
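    The validation scheme described here is repeated random splitting; below is a minimal sketch of a 100-fold Monte Carlo cross-validation with an 80-20% split and AUC scoring. The classifier and data handling are placeholders, not the study's MLDP pipeline.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    def monte_carlo_auc(X, y, n_repeats=100, seed=0):
        """Mean/SD of validation AUC over repeated random 80-20% splits."""
        aucs = []
        for i in range(n_repeats):
            X_tr, X_val, y_tr, y_val = train_test_split(
                X, y, test_size=0.2, stratify=y, random_state=seed + i)
            clf = RandomForestClassifier(n_estimators=200, random_state=seed + i)
            clf.fit(X_tr, y_tr)  # stand-in for one of the five ML classifiers
            aucs.append(roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
        return float(np.mean(aucs)), float(np.std(aucs))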

  2. Ground Validation Data Used to Map Benthic Habitats of the Republic of Palau...

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated May 22, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). Ground Validation Data Used to Map Benthic Habitats of the Republic of Palau [Dataset]. https://catalog.data.gov/dataset/ground-validation-data-used-to-map-benthic-habitats-of-the-republic-of-palau1
    Explore at:
    Dataset updated
    May 22, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Area covered
    Palau
    Description

    This project is a cooperative effort among the National Ocean Service, National Centers for Coastal Ocean Science, Center for Coastal Monitoring and Assessment; the University of Hawaii; and Analytical Laboratories of Hawaii, LLC. The goal of the work was to incorporate previously developed mapping methods to produce benthic habitat maps generated by photo interpreting georeferenced IKONOS satellite imagery. These point data were generated to conduct ground validation during map preparation.

  3. Oahu Ground Validation Point Data for Benthic Habitats of the Main Hawaiian...

    • catalog.data.gov
    • datasets.ai
    • +3 more
    Updated May 22, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). Oahu Ground Validation Point Data for Benthic Habitats of the Main Hawaiian Islands Prepared by Visual Interpretation from Remote Sensing Imagery Collected by NOAA Year 2000 [Dataset]. https://catalog.data.gov/dataset/oahu-ground-validation-point-data-for-benthic-habitats-of-the-main-hawaiian-islands-prepar-20005
    Explore at:
    Dataset updated
    May 22, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Area covered
    Hawaiian Islands, O‘ahu, Hawaii
    Description

    This project is a cooperative effort among the National Ocean Service, National Centers for Coastal Ocean Science, Center for Coastal Monitoring and Assessment; the University of Hawaii; and Analytical Laboratories of Hawaii, LLC. The goal of the work was to develop coral reef mapping methods and compare benthic habitat maps generated by photointerpreting georeferenced color aerial photography, hyperspectral and IKONOS satellite imagery. These point data were generated to conduct ground validation during map preparation.

  4. Molokai Ground Validation Point Data for Benthic Habitats of the Main...

    • catalog.data.gov
    • fisheries.noaa.gov
    • +1 more
    Updated May 22, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). Molokai Ground Validation Point Data for Benthic Habitats of the Main Hawaiian Islands Prepared by Visual Interpretation from Remote Sensing Imagery Collected by NOAA Year 2000 [Dataset]. https://catalog.data.gov/dataset/molokai-ground-validation-point-data-for-benthic-habitats-of-the-main-hawaiian-islands-pre-20005
    Explore at:
    Dataset updated
    May 22, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Area covered
    Moloka‘i, Hawaiian Islands, Hawaii
    Description

    This project is a cooperative effort among the National Ocean Service, National Centers for Coastal Ocean Science, Center for Coastal Monitoring and Assessment; the University of Hawaii; and Analytical Laboratories of Hawaii, LLC. The goal of the work was to develop coral reef mapping methods and compare benthic habitat maps generated by photointerpreting georeferenced color aerial photography, hyperspectral and IKONOS satellite imagery. These point data were generated to conduct ground validation during map preparation.

  5. 2.1 Drug-pair Representation and the Associated ADRs

    • figshare.com
    txt
    Updated Jun 30, 2020
    Cite
    Susmitha Shankar (2020). 2.1 Drug-pair Representation and the Associated ADRs [Dataset]. http://doi.org/10.6084/m9.figshare.12579704.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 30, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Susmitha Shankar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Drug pair data representation required for the project can be directly extracted from the repository. Along with the main dataset, subsets used for cross validation are also presented.

  6. Test Data Management Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    Cite
    Technavio, Test Data Management Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, Italy, and UK), APAC (Australia, China, India, and Japan), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/test-data-management-market-industry-analysis
    Explore at:
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    United States, Global
    Description


    Test Data Management Market Size 2025-2029

    The test data management market size is forecast to increase by USD 727.3 million, at a CAGR of 10.5% between 2024 and 2029.

    The market is experiencing significant growth, driven by the increasing adoption of automation by enterprises to streamline their testing processes. The automation trend is fueled by the growing consumer spending on technological solutions, as businesses seek to improve efficiency and reduce costs. However, the market faces challenges, including the lack of awareness and standardization in test data management practices. This obstacle hinders the effective implementation of test data management solutions, requiring companies to invest in education and training to ensure successful integration. To capitalize on market opportunities and navigate challenges effectively, businesses must stay informed about emerging trends and best practices in test data management. By doing so, they can optimize their testing processes, reduce risks, and enhance overall quality.

    What will be the Size of the Test Data Management Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    The market continues to evolve, driven by the ever-increasing volume and complexity of data. Data exploration and analysis are at the forefront of this dynamic landscape, with data ethics and governance frameworks ensuring data transparency and integrity. Data masking, cleansing, and validation are crucial components of data management, enabling data warehousing, orchestration, and pipeline development. Data security and privacy remain paramount, with encryption, access control, and anonymization key strategies. Data governance, lineage, and cataloging facilitate data management software automation and reporting. Hybrid data management solutions, including artificial intelligence and machine learning, are transforming data insights and analytics. Data regulations and compliance are shaping the market, driving the need for data accountability and stewardship. Data visualization, mining, and reporting provide valuable insights, while data quality management, archiving, and backup ensure data availability and recovery. Data modeling, data integrity, and data transformation are essential for data warehousing and data lake implementations. Data management platforms are seamlessly integrated into these evolving patterns, enabling organizations to effectively manage their data assets and gain valuable insights. Data management services, cloud and on-premise, are essential for organizations to adapt to the continuous changes in the market and effectively leverage their data resources.

    How is this Test Data Management Industry segmented?

    The test data management industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023, for the following segments.
    Application: On-premises, Cloud-based
    Component: Solutions, Services
    End-user: Information technology, Telecom, BFSI, Healthcare and life sciences, Others
    Sector: Large enterprise, SMEs
    Geography: North America (US, Canada), Europe (France, Germany, Italy, UK), APAC (Australia, China, India, Japan), Rest of World (ROW)

    By Application Insights

    The on-premises segment is estimated to witness significant growth during the forecast period. In the realm of data management, on-premises testing represents a popular approach for businesses seeking control over their infrastructure and testing process. This approach involves establishing testing facilities within an office or data center, necessitating a dedicated team with the necessary skills. The benefits of on-premises testing extend beyond control, as it enables organizations to upgrade and configure hardware and software at their discretion, providing opportunities for exploration testing. Furthermore, data security is a significant concern for many businesses, and on-premises testing alleviates the risk of compromising sensitive information to third-party companies. Data exploration, a crucial aspect of data analysis, can be carried out more effectively with on-premises testing, ensuring data integrity and security. Data masking, cleansing, and validation are essential data preparation techniques that can be executed efficiently in an on-premises environment. Data warehousing, data pipelines, and data orchestration are integral components of data management, and on-premises testing allows for seamless integration and management of these elements. Data governance frameworks, lineage, catalogs, and metadata are essential for maintaining data transparency and compliance. Data security, encryption, and access control are paramount, and on-premises testing offers greater control over these aspects. Data reporting

  7. Development and validation of Food and Nutrition Literacy Survey 2024 FANSY...

    • b2find.eudat.eu
    Updated Jul 23, 2025
    Cite
    (2025). Development and validation of Food and Nutrition Literacy Survey 2024 FANSY - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3405ac76-dad9-5b8b-abab-d06d8dcaeb5f
    Explore at:
    Dataset updated
    Jul 23, 2025
    Description

    The research aimed to develop and validate a general food and nutrition literacy (FNL) assessment tool, which will be used to measure FNL in the adult population: the Food and Nutrition Literacy Survey (FANSy). To validate the tool, the preliminary version of the questionnaire was administered to a representative sample of adults living in the United Kingdom. These data were used for validation and preparation of the final version of the survey. Primary data from the research will be made available after the end of the SYRI project.

  8. Data from: Validation of Methods to Assess the Immunoglobulin Gene...

    • catalog.data.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1 more
    Updated Apr 11, 2025
    + more versions
    Cite
    Open Science Data Repository (2025). Validation of Methods to Assess the Immunoglobulin Gene Repertoire in Tissues Obtained from Mice on the International Space Station [Dataset]. https://catalog.data.gov/dataset/validation-of-methods-to-assess-the-immunoglobulin-gene-repertoire-in-tissues-obtained-fro-80f34
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Open Science Data Repository
    Description

    Spaceflight is known to affect immune cell populations. In particular, splenic B-cell numbers decrease during spaceflight and in ground-based physiological models. Although antibody isotype changes have been assessed during and after spaceflight, an extensive characterization of the impact of spaceflight on antibody composition has not been conducted in mice. Next Generation Sequencing and bioinformatic tools are now available to assess antibody repertoires. We can now identify immunoglobulin gene-segment usage, junctional regions, and modifications that contribute to specificity and diversity. Due to limitations on the International Space Station, alternate sample collection and storage methods must be employed. Our group compared Illumina MiSeq sequencing data from multiple sample preparation methods in normal C57Bl/6J mice to validate that sample preparation and storage would not bias the outcome of antibody repertoire characterization. In this report, we also compared sequencing techniques and a bioinformatic workflow on the data output when we assessed the IgH and Igκ variable gene usage. Our bioinformatic workflow has been optimized for Illumina HiSeq and MiSeq datasets, and is designed specifically to reduce bias, capture the most information from Ig sequences, and produce a data set that provides other data mining options.

  9. Data Annotation Tools Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 30, 2025
    Cite
    Growth Market Reports (2025). Data Annotation Tools Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-annotation-tools-market-global-geographical-industry-analysis
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Annotation Tools Market Outlook



    According to our latest research, the global Data Annotation Tools market size reached USD 2.1 billion in 2024. The market is set to expand at a robust CAGR of 26.7% from 2025 to 2033, projecting a remarkable value of USD 18.1 billion by 2033. The primary growth driver for this market is the escalating adoption of artificial intelligence (AI) and machine learning (ML) across various industries, which necessitates high-quality labeled data for model training and validation.




    One of the most significant growth factors propelling the data annotation tools market is the exponential rise in AI-powered applications across sectors such as healthcare, automotive, retail, and BFSI. As organizations increasingly integrate AI and ML into their core operations, the demand for accurately annotated data has surged. Data annotation tools play a crucial role in transforming raw, unstructured data into structured, labeled datasets that can be efficiently used to train sophisticated algorithms. The proliferation of deep learning and natural language processing technologies further amplifies the need for comprehensive data labeling solutions. This trend is particularly evident in industries like healthcare, where annotated medical images are vital for diagnostic algorithms, and in automotive, where labeled sensor data supports the evolution of autonomous vehicles.




    Another prominent driver is the shift toward automation and digital transformation, which has accelerated the deployment of data annotation tools. Enterprises are increasingly adopting automated and semi-automated annotation platforms to enhance productivity, reduce manual errors, and streamline the data preparation process. The emergence of cloud-based annotation solutions has also contributed to market growth by enabling remote collaboration, scalability, and integration with advanced AI development pipelines. Furthermore, the growing complexity and variety of data types, including text, audio, image, and video, necessitate versatile annotation tools capable of handling multimodal datasets, thus broadening the market's scope and applications.




    The market is also benefiting from a surge in government and private investments aimed at fostering AI innovation and digital infrastructure. Several governments across North America, Europe, and Asia Pacific have launched initiatives and funding programs to support AI research and development, including the creation of high-quality, annotated datasets. These efforts are complemented by strategic partnerships between technology vendors, research institutions, and enterprises, which are collectively advancing the capabilities of data annotation tools. As regulatory standards for data privacy and security become more stringent, there is an increasing emphasis on secure, compliant annotation solutions, further driving innovation and market demand.




    From a regional perspective, North America currently dominates the data annotation tools market, driven by the presence of major technology companies, well-established AI research ecosystems, and significant investments in digital transformation. However, Asia Pacific is emerging as the fastest-growing region, fueled by rapid industrialization, expanding IT infrastructure, and a burgeoning startup ecosystem focused on AI and data science. Europe also holds a substantial market share, supported by robust regulatory frameworks and active participation in AI research. Latin America and the Middle East & Africa are gradually catching up, with increasing adoption in sectors such as retail, automotive, and government. The global landscape is characterized by dynamic regional trends, with each market contributing uniquely to the overall growth trajectory.





    Component Analysis



    The data annotation tools market is segmented by component into software and services, each playing a pivotal role in the market's overall ecosystem. Software solutions form the backbone of the market, providing the technical infrastructure for auto

  10. Histopathology data of bone marrow biopsies (HistBMP or HistMNIST)

    • zenodo.org
    application/gzip
    Updated Jan 24, 2020
    Cite
    Jakub Tomczak; Jakub Tomczak (2020). Histopathology data of bone marrow biopsies (HistBMP or HistMNIST) [Dataset]. http://doi.org/10.5281/zenodo.1205024
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jakub Tomczak; Jakub Tomczak
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Data information

    We prepared a dataset based on histopathological images freely available online (http://www.enjoypath.com/). We selected 16 patients (patient IDs: 272, 274, 283, 289, 290, 291, 292, 295, 297, 298, 299). Each histopathological image represents a bone marrow biopsy. Diagnoses of the chosen cases were associated with different kinds of cancer (e.g., lymphoma, leukemia) or anemia. All original images were taken using HE, 40×, and each image was of size 336 × 448.

    Data preparation

    The original RGB representation was transformed to grayscale. Further, we divided each image into small patches of size 28 × 28. Eventually, we picked 10 patients for training, 3 patients for validation and 3 patients for testing, which resulted in 6,800 training images, 2,000 validation images and 2,000 test images. The selection of patients was performed in such a fashion that each dataset contained representative images with different diagnoses and amounts of fat.

    Since the small patches resemble MNIST, a widely used benchmark in the machine learning/AI community, the dataset is referred to as HistMNIST.
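    A rough sketch of the preparation described above (grayscale conversion, then tiling each 336 × 448 image into 28 × 28 patches); the function and file path are illustrative, not part of the published dataset.

    import numpy as np
    from PIL import Image

    def image_to_patches(path, patch=28):
        """Convert an RGB image to grayscale and cut it into 28 x 28 tiles."""
        gray = np.asarray(Image.open(path).convert("L"))
        h, w = gray.shape  # 336 x 448 yields 12 x 16 = 192 patches per image
        return [gray[r:r + patch, c:c + patch]
                for r in range(0, h - patch + 1, patch)
                for c in range(0, w - patch + 1, patch)]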

    First usage

    The dataset was used to train deep generative models (VAEs):

    • Tomczak, J. M., & Welling, M. (2016). Improving variational auto-encoders using Householder flow. arXiv preprint arXiv:1611.09630.
  11. Data from: Applying machine learning to predict bowel preparation adequacy...

    • tandf.figshare.com
    jpeg
    Updated Mar 11, 2025
    Cite
    Jianying Liu; Wei Jiang; Yahong Yu; Jiali Gong; Guie Chen; Yuxing Yang; Chao Wang; Dalong Sun; Xuefeng Lu (2025). Applying machine learning to predict bowel preparation adequacy in elderly patients for colonoscopy: development and validation of a web-based prediction tool [Dataset]. http://doi.org/10.6084/m9.figshare.28573083.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Jianying Liu; Wei Jiang; Yahong Yu; Jiali Gong; Guie Chen; Yuxing Yang; Chao Wang; Dalong Sun; Xuefeng Lu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adequate bowel preparation is crucial for effective colonoscopy, especially in elderly patients who face a high risk of inadequate preparation. This study develops and validates a machine learning model to predict bowel preparation adequacy in elderly patients before colonoscopy. The study adhered to the TRIPOD AI guidelines. Clinical data from 471 elderly patients collected between February and December 2023 were utilized for developing and internally validating the model, while data from 221 patients collected from March to June 2024 were used for external validation. The Boruta algorithm was applied for feature selection. Models including logistic regression, light gradient boosting machines, support vector machines (SVM), decision trees, random forests, and extreme gradient boosting were evaluated using metrics such as AUC, accuracy, sensitivity, and specificity. The SHAP algorithm helped rank feature importance. A web-based application was developed using the Streamlit framework to enhance clinical usability. The Boruta algorithm identified 7 key features. The SVM model excelled with an AUC of 0.895 (95% CI: 0.822–0.969), and high accuracy, sensitivity, and specificity. In external validation, the SVM model maintained robust performance with an AUC of 0.889. The SHAP algorithm further explained the contribution of each feature to model predictions. The study developed an interpretable and practical machine learning model for predicting bowel preparation adequacy in elderly patients, facilitating early interventions to improve outcomes and reduce resource wastage. SHAP analysis enhanced the interpretability of the model by identifying key predictive factors, making it a reliable and transparent tool for clinical use. The predictive model was integrated into a user-friendly web application, enabling healthcare providers to identify high-risk patients early and enhance the quality of bowel preparation interventions.
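    A hedged sketch of the modelling recipe the abstract outlines: Boruta feature selection followed by an SVM evaluated by AUC. It assumes the boruta and scikit-learn packages and numpy-array inputs; the data handling and parameters are illustrative, not the authors' code.

    from boruta import BorutaPy
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.svm import SVC

    def boruta_svm_auc(X_train, y_train, X_test, y_test):
        """Select features with Boruta, fit an SVM, report validation AUC."""
        rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
        selector = BorutaPy(rf, n_estimators="auto", random_state=42)
        selector.fit(X_train, y_train)  # BorutaPy expects numpy arrays
        X_tr = X_train[:, selector.support_]
        X_te = X_test[:, selector.support_]
        svm = SVC(probability=True, random_state=42).fit(X_tr, y_train)
        return roc_auc_score(y_test, svm.predict_proba(X_te)[:, 1])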

  12. Phase Field Raw Data

    • data.dtu.dk
    bin
    Updated Jul 23, 2024
    + more versions
    Cite
    Laura Rieger; Klemen Zelič; Igor Mele; Tomaž Katrašnik; Arghya Bhowmik (2024). Phase Field Raw Data [Dataset]. http://doi.org/10.11583/DTU.26325274.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 23, 2024
    Dataset provided by
    Technical University of Denmark
    Authors
    Laura Rieger; Klemen Zelič; Igor Mele; Tomaž Katrašnik; Arghya Bhowmik
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the raw data for the dataset found at https://data.dtu.dk/articles/dataset/Phase_field_data/25562364. It comprises tailored phase field prediction data generated using an automated workflow designed to offer insights into complex phenomena while minimizing computational expense. The dataset aims to facilitate benchmarking of new algorithms in phase field prediction, emphasizing accessibility and utility for researchers. The data creation process is detailed, focusing on streamlining data collection and preparation. Validation of the dataset's effectiveness is conducted through a benchmark experiment utilizing U-Net regression, a widely adopted neural network architecture. Results showcase competitive performance of the U-Net model, akin to previous state-of-the-art methods. This dataset not only serves as a valuable resource for the phase field prediction community but also highlights the potential of U-Net regression, fostering further advancements in the field. The accompanying code at https://github.com/laura-rieger/phase_field_benchmark describes in detail how the dataset is to be used.
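    For orientation, a compact U-Net-style regression model of the kind the benchmark experiment mentions, written in PyTorch; this is a generic sketch, not the architecture used for the published baseline.

    import torch
    import torch.nn as nn

    def block(c_in, c_out):
        return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

    class TinyUNet(nn.Module):
        """One-skip encoder-decoder mapping a 1-channel field to a 1-channel field."""
        def __init__(self, width=32):
            super().__init__()
            self.enc, self.down = block(1, width), nn.MaxPool2d(2)
            self.mid = block(width, width * 2)
            self.up = nn.ConvTranspose2d(width * 2, width, 2, stride=2)
            self.dec, self.head = block(width * 2, width), nn.Conv2d(width, 1, 1)

        def forward(self, x):
            e = self.enc(x)
            m = self.mid(self.down(e))
            return self.head(self.dec(torch.cat([self.up(m), e], dim=1)))

    # One MSE training step on a stand-in batch (real fields come from the dataset):
    model, loss_fn = TinyUNet(), nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = torch.randn(4, 1, 64, 64), torch.randn(4, 1, 64, 64)
    opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()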

  13. Ground Validation Data Used to Map Benthic Habitats of the Republic of...

    • datadiscoverystudio.org
    esri shapefile
    Updated 2007
    Cite
    (2007). Ground Validation Data Used to Map Benthic Habitats of the Republic of Palau, NOAA/NMFS/EDM [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/77175bc649c043b4a990f4f0fbbe5148/html
    Explore at:
    Available download formats: esri shapefile
    Dataset updated
    2007
    Description

    This project is a cooperative effort among the National Ocean Service, National Centers for Coastal Ocean Science, Center for Coastal Monitoring and Assessment; the University of Hawaii; and Analytical Laboratories of Hawaii, LLC. The goal of the work was to incorporate previously developed mapping methods to produce benthic habitat maps generated by photo interpreting georeferenced IKONOS satellite imagery. These point data were generated to conduct ground validation during map preparation.

  14. Validation dataset for Land Cover Map of Europe 2017 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated May 9, 2023
    Cite
    (2023). Validation dataset for Land Cover Map of Europe 2017 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/e2fd3d68-bd50-5118-886c-aab2d9c31c49
    Explore at:
    Dataset updated
    May 9, 2023
    Area covered
    Europe
    Description

    Thematic accuracy assessment of land cover/use products requires reliable reference data that enable their qualitative and quantitative evaluation. Such a dataset with up-to-date information on a predefined class composition and spatial distribution is rarely available, and its preparation requires an appropriate methodological approach adjusted to a specific product. Development of a new pan-European land cover/use map, generated from Copernicus Sentinel-2 data 2017 within the Sentinel-2 Global Land Cover (S2GLC) project carried out under a programme of and funded by the European Space Agency, provided an opportunity to design and develop a unique dataset dedicated to validation of this product. The dataset was prepared by twofold stratified random sampling. The first selection designated validation sites represented by Sentinel-2 image tiles and was performed on a country level with county borders used as a stratum. In the second selection, validation samples were chosen randomly within the validation sites with stratification based on classes of the CORINE Land Cover database. The final dataset, composed of samples visually checked by experienced image interpreters, consists of a total of 52,024 samples spread over the European countries. The samples represent 13 land cover/use classes including artificial surfaces, natural material surfaces (consolidated and un-consolidated), broadleaf tree cover, coniferous tree cover, herbaceous vegetation, moors and heathland, sclerophyllous vegetation, cultivated areas, vineyards, marshes, peatbogs, water bodies and permanent snow cover. Each sample provides information about the occurrence of one of the predefined land cover or land use classes within an area of 100 m² represented by a single pixel (10 m size) of Sentinel-2 imagery for the year 2017. The described dataset was used for the accuracy assessment process of the product Land Cover Map of Europe 2017 resulting from the S2GLC project and provided an estimate of the overall accuracy at the level of 86.1%.

    S2GLC - Land Cover Map of Europe 2017 reference data: CBK PAN, http://s2glc.cbk.waw.pl/extension

    Attribute table fields:
    'S2GLC' – a land cover/use class symbol according to the S2GLC classification system
    'TILE' – a symbol of the Sentinel-2 granule (a tile of the Military Grid Reference System)
    'NAME_ENG' – a country English name (valid for inland and coastal areas). Data source of country names and administrative boundaries: 'Countries, 2020 - Administrative Units - Dataset' of European Commission, Eurostat (ESTAT), GISCO, https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/administrative-units-statistical-units/countries

    Classification system:
    111 - Artificial surfaces
    211 - Cultivated areas
    221 - Vineyards
    231 - Herbaceous vegetation
    311 - Broadleaf tree cover
    312 - Coniferous tree cover
    322 - Moors and heathland
    323 - Sclerophyllous vegetation
    331 - Natural material surfaces
    335 - Permanent snow cover
    411 - Marshes
    412 - Peatbogs
    511 - Water bodies

    Data projection: Lambert Azimuthal Equal Area (LAEA), EPSG: 3035
    For more technical information on this dataset please refer to Malinowski et al. (2020).
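    A small sketch of how the attribute table and classification system above could be consumed, assuming geopandas and a hypothetical shapefile name.

    import geopandas as gpd

    # Code-to-name mapping taken from the classification system listed above.
    S2GLC_CLASSES = {
        111: "Artificial surfaces", 211: "Cultivated areas", 221: "Vineyards",
        231: "Herbaceous vegetation", 311: "Broadleaf tree cover",
        312: "Coniferous tree cover", 322: "Moors and heathland",
        323: "Sclerophyllous vegetation", 331: "Natural material surfaces",
        335: "Permanent snow cover", 411: "Marshes", 412: "Peatbogs",
        511: "Water bodies",
    }

    samples = gpd.read_file("s2glc_validation_samples.shp")  # hypothetical file name
    samples["class_name"] = samples["S2GLC"].astype(int).map(S2GLC_CLASSES)
    print(samples["class_name"].value_counts())  # per-class counts of the 52,024 samples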

  15. visual_haystacks_v0

    • huggingface.co
    Updated Jul 27, 2024
    + more versions
    Cite
    Patrick (Tsung-Han) Wu (2024). visual_haystacks_v0 [Dataset]. https://huggingface.co/datasets/tsunghanwu/visual_haystacks_v0
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 27, 2024
    Authors
    Patrick (Tsung-Han) Wu
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Visual Haystacks Dataset Card

      Dataset details
    

    Dataset type: Visual Haystacks (VHs) is a benchmark dataset specifically designed to evaluate a Large Multimodal Model's (LMM's) capability to handle long-context visual information. It can also be viewed as the first visual-centric Needle-In-A-Haystack (NIAH) benchmark dataset. Please also download COCO-2017's training and validation sets.

    Data Preparation and Benchmarking

    Download the VQA questions: huggingface-cli… See the full description on the dataset page: https://huggingface.co/datasets/tsunghanwu/visual_haystacks_v0.
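    The download command above is truncated in the card; as a minimal sketch, the repository can typically be fetched with the huggingface_hub Python API instead (the local directory is an arbitrary choice).

    from huggingface_hub import snapshot_download

    # Fetch the dataset repository; repo_id is taken from the dataset page above.
    snapshot_download(repo_id="tsunghanwu/visual_haystacks_v0",
                      repo_type="dataset", local_dir="./visual_haystacks_v0")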

  16. UVP5 data sorted with EcoTaxa and MorphoCluster

    • seanoe.org
    • pigma.org
    image/*
    Updated 2020
    + more versions
    Cite
    Rainer Kiko; Simon-Martin Schröder (2020). UVP5 data sorted with EcoTaxa and MorphoCluster [Dataset]. http://doi.org/10.17882/73002
    Explore at:
    Available download formats: image/*
    Dataset updated
    2020
    Dataset provided by
    SEANOE
    Authors
    Rainer Kiko; Simon-Martin Schröder
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Oct 23, 2012 - Aug 7, 2017
    Description

    Here, we provide plankton image data that was sorted with the web applications EcoTaxa and MorphoCluster. The dataset was used for image classification tasks as described in Schröder et al. (in preparation) and does not include any geospatial or temporal metadata. Plankton was imaged using the Underwater Vision Profiler 5 (Picheral et al. 2010) in various regions of the world's oceans between 2012-10-24 and 2017-08-08. This data publication consists of an archive containing "training.csv" (list of 392k training images for classification, validated using EcoTaxa), "validation.csv" (list of 196k validation images for classification, validated using EcoTaxa), "unlabeld.csv" (list of 1M unlabeled images), "morphocluster.csv" (1.2M objects validated using MorphoCluster, a subset of "unlabeled.csv" and "validation.csv") and the image files themselves. The CSV files each contain the columns "object_id" (a unique id), "image_fn" (the relative filename), and "label" (the assigned name). The training and validation sets were sorted into 65 classes using the web application EcoTaxa (http://ecotaxa.obs-vlfr.fr). This data shows a severe class imbalance; the 10% most populated classes contain more than 80% of the objects and the class sizes span four orders of magnitude. The validation set and a set of an additional 1M unlabeled images were sorted during the first trial of MorphoCluster (https://github.com/morphocluster). The images in this data set were sampled during RV Meteor cruises M92, M93, M96, M97, M98, M105, M106, M107, M108, M116, M119, M121, M130, M131, M135, M136, M137 and M138, during RV Maria S. Merian cruises MSM22, MSM23, MSM40 and MSM49, during the RV Polarstern cruise PS88b and during the FLUXES1 experiment with RV Sarmiento de Gamboa. The following people have contributed to the sorting of the image data on EcoTaxa: Rainer Kiko, Tristan Biard, Benjamin Blanc, Svenja Christiansen, Justine Courboules, Charlotte Eich, Jannik Faustmann, Christine Gawinski, Augustin Lafond, Aakash Panchal, Marc Picheral, Akanksha Singh and Helena Hauss. In Schröder et al. (in preparation), the training set serves as a source for knowledge transfer in the training of the feature extractor. The classification using MorphoCluster was conducted by Rainer Kiko. Used labels are operational and not yet matched to respective EcoTaxa classes.
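    A quick sketch of reading the published CSVs (columns: object_id, image_fn, label) and checking the class imbalance noted above; the file is assumed to sit in the working directory.

    import pandas as pd

    train = pd.read_csv("training.csv")           # 392k rows, 65 classes
    counts = train["label"].value_counts()
    top = counts.head(max(1, len(counts) // 10))  # the 10% most populated classes
    print(f"{len(counts)} classes; top 10% hold {top.sum() / counts.sum():.0%} of objects")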

  17. Data from: Validation of Innovative Exploration Technologies for Newberry...

    • catalog.data.gov
    • data.openei.org
    • +2 more
    Updated Jan 20, 2025
    + more versions
    Cite
    Davenport Newberry Holdings, LLC (2025). Validation of Innovative Exploration Technologies for Newberry Volcano: Raw Gravity Data [Dataset]. https://catalog.data.gov/dataset/validation-of-innovative-exploration-technologies-for-newberry-volcano-raw-gravity-data-88282
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Davenport Newberry Holdings LLC
    Area covered
    Newberry Volcano
    Description

    Validation of Innovative Exploration Technologies for Newberry Volcano: raw data used to prepare the Gravity Report by Zonge (2012).

  18. Table_2_Operational Challenges in the Use of Structured Secondary Data for...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Kelsy N. Areco; Tulio Konstantyner; Paulo Bandiera-Paiva; Rita C. X. Balda; Daniela T. Costa-Nobre; Adriana Sanudo; Carlos Roberto V. Kiffer; Mandira D. Kawakami; Milton H. Miyoshi; Ana Sílvia Scavacini Marinonio; Rosa M. V. Freitas; Liliam C. C. Morais; Monica L. P. Teixeira; Bernadette Waldvogel; Maria Fernanda B. Almeida; Ruth Guinsburg (2023). Table_2_Operational Challenges in the Use of Structured Secondary Data for Health Research.DOCX [Dataset]. http://doi.org/10.3389/fpubh.2021.642163.s002
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Kelsy N. Areco; Tulio Konstantyner; Paulo Bandiera-Paiva; Rita C. X. Balda; Daniela T. Costa-Nobre; Adriana Sanudo; Carlos Roberto V. Kiffer; Mandira D. Kawakami; Milton H. Miyoshi; Ana Sílvia Scavacini Marinonio; Rosa M. V. Freitas; Liliam C. C. Morais; Monica L. P. Teixeira; Bernadette Waldvogel; Maria Fernanda B. Almeida; Ruth Guinsburg
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: In Brazil, secondary data for epidemiology are largely available. However, they are insufficiently prepared for use in research, even when it comes to structured data, since they were often designed for other purposes. To date, few publications focus on the process of preparing secondary data. The present findings can help in orienting future research projects that are based on secondary data.
    Objective: Describe the steps in the process of ensuring the adequacy of a secondary data set for a specific use, and identify the challenges of this process.
    Methods: The present study is qualitative and reports methodological issues about secondary data use. The study material comprised 6,059,454 live births and 73,735 infant death records from 2004 to 2013 of children whose mothers resided in the State of São Paulo, Brazil. The challenges and the description of the procedures to ensure data adequacy were undertaken in 6 steps: (1) problem understanding, (2) resource planning, (3) data understanding, (4) data preparation, (5) data validation and (6) data distribution. For each step, the procedures, the challenges encountered, the actions to cope with them and partial results were described. To identify the most labor-intensive tasks in this process, the steps were assessed by adding the number of procedures, challenges, and coping actions. The highest values were assumed to indicate the most critical steps.
    Results: In total, 22 procedures and 23 actions were needed to deal with the 27 challenges encountered along the process of ensuring the adequacy of the study material for the intended use. The final product was an organized database for a historical cohort study suitable for the intended use. Data understanding and data preparation were identified as the most critical steps, accounting for about 70% of the challenges observed in data use.
    Conclusion: Significant challenges were encountered in the process of ensuring the adequacy of secondary health data for research use, mainly in the data understanding and data preparation steps. The use of the described steps to approach structured secondary data, and knowledge of the potential challenges along the process, may contribute to planning health research.

  19. Machine learning pipeline to train toxicity prediction model of...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Jan Ewald; Jan Ewald (2020). Machine learning pipeline to train toxicity prediction model of FunTox-Networks [Dataset]. http://doi.org/10.5281/zenodo.3529162
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Ewald; Jan Ewald
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine Learning pipeline used to provide toxicity prediction in FunTox-Networks

    01_DATA # preprocessing and filtering of raw activity data from ChEMBL
    - Chembl_v25 # latest activity assay data set from ChEMBL (retrieved Nov 2019)
    - filt_stats.R # Filtering and preparation of raw data
    - Filtered # output data sets from filt_stats.R
    - toxicity_direction.csv # table of toxicity measurements and their proportionality to toxicity

    02_MolDesc # Calculation of molecular descriptors for all compounds within the filtered ChEMBL data set
    - datastore # files with all compounds and their calculated molecular descriptors based on SMILES
    - scripts
    - calc_molDesc.py # calculates the molecular descriptors for all compounds based on their SMILES
    - chemopy-1.1 # Python package used for descriptor calculation, as described in: https://doi.org/10.1093/bioinformatics/btt105

    03_Averages # Calculation of moving averages for levels and organisms as required for calculation of Z-scores
    - datastore # output files with statistics calculated by make_Z.R
    - scripts
    - make_Z.R # script to compute the statistics needed to calculate the Z-scores used by the regression models

    04_ZScores # Calculation of Z-scores and preparation of table to fit regression models
    - datastore # Z-normalized activity data and molecular descriptors in the form as used for fitting regression models
    - scripts
    - calc_Ztable.py # based on activity data, molecular descriptors and Z-statistics, the learning data is calculated

    05_Regression # Performing regression. Preparation of data by removing outliers based on a linear regression model. Learning of random forest regression models. Validation of the learning process by cross-validation and tuning of hyperparameters.

    - datastore # storage of all random forest regression models and average level of Z output value per level and organism (zexp_*.tsv)
    - scripts
    - data_preperation.R # set up of regression data set, removal of outliers and optional removal of fields and descriptors
    - Rforest_CV.R # analysis of machine learning by cross validation, importance of regression variables and tuning of hyperparameters (number of trees, split of variables)
    - Rforest.R # based on analysis of Rforest_CV.R learning of final models

    rregrs_output # early analysis of regression model performance with the package RRegrs, as described in: https://doi.org/10.1186/s13321-015-0094-2
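    The Z-normalization at the heart of steps 03-04 standardizes each activity value against the statistics of its level and organism group; a minimal pandas sketch with illustrative column names follows (the pipeline's own scripts are make_Z.R and calc_Ztable.py).

    import pandas as pd

    def add_z_scores(df: pd.DataFrame) -> pd.DataFrame:
        """Z-score each activity value within its (level, organism) group."""
        grp = df.groupby(["level", "organism"])["activity"]
        df["z"] = (df["activity"] - grp.transform("mean")) / grp.transform("std")
        return df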

  20. Data from: Validation of Methods to Assess the Immunoglobulin Gene...

    • data.amerigeoss.org
    html
    Updated Jan 29, 2020
    Cite
    United States (2020). Validation of Methods to Assess the Immunoglobulin Gene Repertoire in Tissues Obtained from Mice on the International Space Station [Dataset]. https://data.amerigeoss.org/dataset/validation-of-methods-to-assess-the-immunoglobulin-gene-repertoire-in-tissues-obtained-fro1
    Explore at:
    Available download formats: html
    Dataset updated
    Jan 29, 2020
    Dataset provided by
    United States
    Description

    Spaceflight is known to affect immune cell populations. In particular, splenic B-cell numbers decrease during spaceflight and in ground-based physiological models. Although antibody isotype changes have been assessed during and after spaceflight, an extensive characterization of the impact of spaceflight on antibody composition has not been conducted in mice. Next Generation Sequencing and bioinformatic tools are now available to assess antibody repertoires. We can now identify immunoglobulin gene-segment usage, junctional regions and modifications that contribute to specificity and diversity. Due to limitations on the International Space Station, alternate sample collection and storage methods must be employed. Our group compared Illumina MiSeq® sequencing data from multiple sample preparation methods in normal C57Bl/6J mice to validate that sample preparation and storage would not bias the outcome of antibody repertoire characterization. In this report, we also compared sequencing techniques and a bioinformatic workflow on the data output when we assessed the IgH and Igκ variable gene usage. Our bioinformatic workflow has been optimized for Illumina HiSeq® and MiSeq® datasets, and is designed specifically to reduce bias, capture the most information from Ig sequences, and produce a data set that provides other data mining options.
