100+ datasets found
  1. d

    Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

    • datarade.ai
    .json, .csv
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
    Explore at:
    .json, .csvAvailable download formats
    Dataset provided by
    Xverum LLC
    Authors
    Xverum
    Area covered
    Dominican Republic, Sint Maarten (Dutch part), Cook Islands, Jordan, Oman, Barbados, Western Sahara, United Kingdom, India, Norway
    Description

    Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

    What Makes Our Data Unique?

    Scale and Coverage: - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies. - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

    Rich Attributes for Training Models: - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights. - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

    Compliance and Quality: - Fully GDPR and CCPA compliant, providing secure and ethically sourced data. - Extensive data cleaning and validation processes ensure reliability and accuracy.

    Annotation-Ready: - Pre-structured and formatted datasets that are easily ingestible into AI workflows. - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

    How Is the Data Sourced? - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques. - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets. This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

    Primary Use Cases and Verticals

    Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

    Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

    B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

    HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

    How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

    Why Choose Xverum? - Experience and Expertise: A trusted name in structured web data with a proven track record. - Flexibility: Datasets can be tailored for any AI/ML application. - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data. - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

    Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

    Contact us for sample datasets or to discuss your specific needs.

  2. Z

    Data for training, validation and testing of methods in the thesis:...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 1, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Lucia Hajduková (2021). Data for training, validation and testing of methods in the thesis: Camera-based Accuracy Improvement of Indoor Localization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4730337
    Explore at:
    Dataset updated
    May 1, 2021
    Dataset authored and provided by
    Lucia Hajduková
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The package contains files for two modules designed to improve the accuracy of the indoor positioning system, namely the following:

    door detection

    videos_test - videos used to demonstrate the application of door detector

    videos_res - videos from videos_test directory with detected doors marked

    parts detection

    frames_train_val - images generated from videos used for training and validation of VGG16 neural network model

    frames_test - images generated from videos used for testing of the trained model

    videos_test - videos used to demonstrate the application of parts detector

    videos_res - videos from videos_test directory with detected parts marked

  3. CNN models and training, validation and test datasets for "PlotMI:...

    • zenodo.org
    application/gzip
    Updated Sep 15, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tuomo Hartonen; Tuomo Hartonen; Teemu Kivioja; Jussi Taipale; Teemu Kivioja; Jussi Taipale (2021). CNN models and training, validation and test datasets for "PlotMI: interpretation of pairwise interactions and positional preferences learned by a deep learning model from sequence data" [Dataset]. http://doi.org/10.5281/zenodo.5508698
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Sep 15, 2021
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Tuomo Hartonen; Tuomo Hartonen; Teemu Kivioja; Jussi Taipale; Teemu Kivioja; Jussi Taipale
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Convolutional neural network (CNN) models and their respective training, validation and test datasets used in manuscript:

    Tuomo Hartonen, Teemu Kivioja and Jussi Taipale, "PlotMI: interpretation of pairwise interactions and positional preferences learned by a deep learning model from sequence data"

  4. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Dec 23, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Explore at:
    .json, .csv, .xlsxAvailable download formats
    Dataset updated
    Dec 23, 2024
    Dataset authored and provided by
    Bright Datahttps://brightdata.com/
    License

    https://brightdata.com/licensehttps://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  5. v

    Training dataset for NABat Machine Learning V1.0

    • res1catalogd-o-tdatad-o-tgov.vcapture.xyz
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://res1catalogd-o-tdatad-o-tgov.vcapture.xyz/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://res1doid-o-torg.vcapture.xyz/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://res1doid-o-torg.vcapture.xyz/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N =3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N =11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reach 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.

  6. f

    Data from: Robust Validation: Confident Predictions Even When Distributions...

    • tandf.figshare.com
    bin
    Updated Dec 26, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi (2023). Robust Validation: Confident Predictions Even When Distributions Shift* [Dataset]. http://doi.org/10.6084/m9.figshare.24904721.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    Dec 26, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Maxime Cauchois; Suyash Gupta; Alnur Ali; John C. Duchi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy—coming from robust statistics and optimization—is thus to build a model robust to distributional perturbations. In this paper, we take a different approach to describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions. We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an f-divergence ball around the training population. The method, based on conformal inference, achieves (nearly) valid coverage in finite samples, under only the condition that the training data be exchangeable. An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it; we develop estimators and prove their consistency for protection and validity of uncertainty estimates under shifts. By experimenting on several large-scale benchmark datasets, including Recht et al.’s CIFAR-v4 and ImageNet-V2 datasets, we provide complementary empirical results that highlight the importance of robust predictive validity.

  7. d

    Factori Machine Learning (ML) Data | 247 Countries Coverage | 5.2 B Event...

    • datarade.ai
    .csv
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Factori, Factori Machine Learning (ML) Data | 247 Countries Coverage | 5.2 B Event per Day [Dataset]. https://datarade.ai/data-products/factori-ai-ml-training-data-web-data-machine-learning-d-factori
    Explore at:
    .csvAvailable download formats
    Dataset authored and provided by
    Factori
    Area covered
    Austria, Sweden, Japan, Palestine, Cameroon, Turks and Caicos Islands, Uzbekistan, Egypt, Faroe Islands, Taiwan
    Description

    Factori's AI & ML training data is thoroughly tested and reviewed to ensure that what you receive on your end is of the best quality.

    Integrate the comprehensive AI & ML training data provided by Grepsr and develop a superior AI & ML model.

    Whether you're training algorithms for natural language processing, sentiment analysis, or any other AI application, we can deliver comprehensive datasets tailored to fuel your machine learning initiatives.

    Enhanced Data Quality: We have rigorous data validation processes and also conduct quality assurance checks to guarantee the integrity and reliability of the training data for you to develop the AI & ML models.

    Gain a competitive edge, drive innovation, and unlock new opportunities by leveraging the power of tailored Artificial Intelligence and Machine Learning training data with Factori.

    We offer web activity data of users that are browsing popular websites around the world. This data can be used to analyze web behavior across the web and build highly accurate audience segments based on web activity for targeting ads based on interest categories and search/browsing intent.

    Web Data Reach: Our reach data represents the total number of data counts available within various categories and comprises attributes such as Country, Anonymous ID, IP addresses, Search Query, and so on.

    Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method at a suitable interval (daily/weekly/monthly).

    Data Attributes: Anonymous_id IDType Timestamp Estid Ip userAgent browserFamily deviceType Os Url_metadata_canonical_url Url_metadata_raw_query_params refDomain mappedEvent Channel searchQuery Ttd_id Adnxs_id Keywords Categories Entities Concepts

  8. 4

    Train, validation, test data sets and confusion matrices underlying...

    • data.4tu.nl
    zip
    Updated Sep 7, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen (2023). Train, validation, test data sets and confusion matrices underlying publication: "Automated cell counting for Trypan blue stained cell cultures using machine learning" [Dataset]. http://doi.org/10.4121/21695819.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Louis Kuijpers; Nynke Dekker; Belen Solano Hermosilla; Edo van Veen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Annotated test and train data sets. Both images and annotations are provided separately.


    Validation data set for Hi5, Sf9 and HEK cells.


    Confusion matrices for the determination of performance parameters

  9. c

    Research data supporting "Unveil the unseen: exploit information hidden in...

    • repository.cam.ac.uk
    bin
    Updated Aug 25, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zviazhynski, Bahdan; Conduit, Gareth (2022). Research data supporting "Unveil the unseen: exploit information hidden in noise" [Dataset]. http://doi.org/10.17863/CAM.87875
    Explore at:
    bin(256 bytes), bin(1957 bytes), bin(10308 bytes), bin(722 bytes), bin(705 bytes)Available download formats
    Dataset updated
    Aug 25, 2022
    Dataset provided by
    University of Cambridge
    Apollo
    Authors
    Zviazhynski, Bahdan; Conduit, Gareth
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets are training sets (file names ending with _train.csv) and validation sets (file names ending with _validate.csv). The machine learning algorithm is trained on the data in a _train.csv file and validated against the data in the corresponding _validate.csv file. The first column of each file is the input variable. The second column is the available values for the intermediate target variable. The last column of each file is the values of the final target variable.

    The extrapolation_using_Y training and validation sets are used to validate the ability of the machine learning algorithm to perform extrapolation using intermediate target variable. The sinosoidal_noise training and validation sets are used to validate the ability of the machine learning algorithm to perform extrapolation using uncertainty in the intermediate target variable. The phase_transition and droplet_diffraction training and validation files contain the experimental data for the real-world physical examples that we applied the methodology to.

  10. f

    Training and validation datasets for training probabilistic machine learning...

    • figshare.com
    • data.4tu.nl
    txt
    Updated Jun 11, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Deepali Singh (2023). Training and validation datasets for training probabilistic machine learning models on NREL's 10-MW reference wind turbine [Dataset]. http://doi.org/10.4121/21939995.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Deepali Singh
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository consists of two databases- CASE-ONSHORE and CASE-OFFSHORE, generated using OpenFAST v2.4 on NREL's 10-MW reference wind turbine for training data-driven probabilistic load surrogate models. The data is to be used for mapping 10-minute average environmental conditions to the corresponding 10-minute load statistics such as load average, fatigue and range at various locations on the tower and blades.

  11. i

    CT TRAINING AND VALIDATION SERIES FOR 3D AUTOMATED SEGMENTATION OF INNER EAR...

    • ieee-dataport.org
    Updated Oct 19, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jonathan Lim (2023). CT TRAINING AND VALIDATION SERIES FOR 3D AUTOMATED SEGMENTATION OF INNER EAR USING U-NET ARCHITECTURE DEEP-LEARNING MODEL [Dataset]. https://ieee-dataport.org/documents/ct-training-and-validation-series-3d-automated-segmentation-inner-ear-using-u-net
    Explore at:
    Dataset updated
    Oct 19, 2023
    Authors
    Jonathan Lim
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains: - Training dataset: 271 CT-scans of inner ears used for optimization and training of the model. - Validation dataset: 70 CT-scans of inner ears used for external validation. - U-net architecture deep-learning model's weight after optimized training. - All manual segmentations performed for both datasets. - All post-processed automated segmentations performed by the model for bothd atasets.

  12. f

    DataSheet_1_Automated data preparation for in vivo tumor characterization...

    • frontiersin.figshare.com
    docx
    Updated Jun 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Denis Krajnc; Clemens P. Spielvogel; Marko Grahovac; Boglarka Ecsedi; Sazan Rasul; Nina Poetsch; Tatjana Traub-Weidinger; Alexander R. Haug; Zsombor Ritter; Hussain Alizadeh; Marcus Hacker; Thomas Beyer; Laszlo Papp (2023). DataSheet_1_Automated data preparation for in vivo tumor characterization with machine learning.docx [Dataset]. http://doi.org/10.3389/fonc.2022.1017911.s001
    Explore at:
    docxAvailable download formats
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Denis Krajnc; Clemens P. Spielvogel; Marko Grahovac; Boglarka Ecsedi; Sazan Rasul; Nina Poetsch; Tatjana Traub-Weidinger; Alexander R. Haug; Zsombor Ritter; Hussain Alizadeh; Marcus Hacker; Thomas Beyer; Laszlo Papp
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BackgroundThis study proposes machine learning-driven data preparation (MLDP) for optimal data preparation (DP) prior to building prediction models for cancer cohorts.MethodsA collection of well-established DP methods were incorporated for building the DP pipelines for various clinical cohorts prior to machine learning. Evolutionary algorithm principles combined with hyperparameter optimization were employed to iteratively select the best fitting subset of data preparation algorithms for the given dataset. The proposed method was validated for glioma and prostate single center cohorts by 100-fold Monte Carlo (MC) cross-validation scheme with 80-20% training-validation split ratio. In addition, a dual-center diffuse large B-cell lymphoma (DLBCL) cohort was utilized with Center 1 as training and Center 2 as independent validation datasets to predict cohort-specific clinical endpoints. Five machine learning (ML) classifiers were employed for building prediction models across all analyzed cohorts. Predictive performance was estimated by confusion matrix analytics over the validation sets of each cohort. The performance of each model with and without MLDP, as well as with manually-defined DP were compared in each of the four cohorts.ResultsSixteen of twenty established predictive models demonstrated area under the receiver operator characteristics curve (AUC) performance increase utilizing the MLDP. The MLDP resulted in the highest performance increase for random forest (RF) (+0.16 AUC) and support vector machine (SVM) (+0.13 AUC) model schemes for predicting 36-months survival in the glioma cohort. Single center cohorts resulted in complex (6-7 DP steps) DP pipelines, with a high occurrence of outlier detection, feature selection and synthetic majority oversampling technique (SMOTE). In contrast, the optimal DP pipeline for the dual-center DLBCL cohort only included outlier detection and SMOTE DP steps.ConclusionsThis study demonstrates that data preparation prior to ML prediction model building in cancer cohorts shall be ML-driven itself, yielding optimal prediction models in both single and multi-centric settings.

  13. e

    3D Microvascular Image Data and Labels for Machine Learning - Dataset -...

    • b2find.eudat.eu
    Updated May 18, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). 3D Microvascular Image Data and Labels for Machine Learning - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/fc51c530-c314-5fd4-8979-1032b6f798cf
    Explore at:
    Dataset updated
    May 18, 2024
    Description

    These images and associated binary labels were collected from collaborators across multiple universities to serve as a diverse representation of biomedical images of vessel structures, for use in the training and validation of machine learning tools for vessel segmentation. The dataset contains images from a variety of imaging modalities, at different resolutions, using difference sources of contrast and featuring different organs/ pathologies. This data was use to train, test and validated a foundational model for 3D vessel segmentation, tUbeNet, which can be found on github. The paper descripting the training and validation of the model can be found here. Filenames are structured as follows: Data - [Modality][species Organ][resolution].tif Labels - [Modality][species Organ][resolution]labels.tif Sub-volumes of larger dataset - [Modality][species Organ]_subvolume[dimensions in pixels].tif Manual labelling of blood vessels was carried out using Amira (2020.2, Thermo-Fisher, UK). Training data: opticalHREM_murineLiver_2.26x2.26x1.75um.tif: A high resolution episcopic microscopy (HREM) dataset, acquired in house by staining a healthy mouse liver with Eosin B and imaged using a standard HREM protocol. NB: 25% of this image volume was withheld from training, for use as test data. CT_murineTumour_20x20x20um.tif: X-ray microCT images of a microvascular cast, taken from a subcutaneous mouse model of colorectal cancer (acquired in house). NB: 25% of this image volume was withheld from training, for use as test data. RSOM_murineTumour_20x20um.tif: Raster-Scanning Optoacoustic Mesoscopy (RSOM) data from a subcutaneous tumour model (provided by Emma Brown, Bohndiek Group, University of Cambridge). The image data has undergone filtering to reduce the background ​(Brown et al., 2019)​. OCTA_humanRetina_24x24um.tif: retinal angiography data obtained using Optical Coherence Tomography Angiography (OCT-A) (provided by Dr Ranjan Rajendram, Moorfields Eye Hospital). Test data: MRI_porcineLiver_0.9x0.9x5mm.tif: T1-weighted Balanced Turbo Field Echo Magnetic Resonance Imaging (MRI) data from a machine-perfused porcine liver, acquired in-house. Test Data MFHREM_murineTumourLectin_2.76x2.76x2.61um.tif: a subcutaneous colorectal tumour mouse model was imaged in house using Multi-fluorescence HREM in house, with Dylight 647 conjugated lectin staining the vasculature ​(Walsh et al., 2021)​. The image data has been processed using an asymmetric deconvolution algorithm described by ​Walsh et al., 2020​. NB: A sub-volume of 480x480x640 voxels was manually labelled (MFHREM_murineTumourLectin_subvolume480x480x640.tif). MFHREM_murineBrainLectin_0.85x0.85x0.86um.tif: an MF-HREM image of the cortex of a mouse brain, stained with Dylight-647 conjugated lectin, was acquired in house ​(Walsh et al., 2021)​. The image data has been downsampled and processed using an asymmetric deconvolution algorithm described by ​Walsh et al., 2020​. NB: A sub-volume of 1000x1000x99 voxels was manually labelled. This sub-volume is provided at full resolution and without preprocessing (MFHREM_murineBrainLectin_subvol_0.57x0.57x0.86um.tif). 2Photon_murineOlfactoryBulbLectin_0.2x0.46x5.2um.tif: two-photon data of mouse olfactory bulb blood vessels, labelled with sulforhodamine 101, was kindly provided by Yuxin Zhang at the Sensory Circuits and Neurotechnology Lab, the Francis Crick Institute ​(Bosch et al., 2022)​. NB: A sub-volume of 500x500x79 voxel was manually labelled (2Photon_murineOlfactoryBulbLectin_subvolume500x500x79.tif). References: ​​Bosch, C., Ackels, T., Pacureanu, A., Zhang, Y., Peddie, C. J., Berning, M., Rzepka, N., Zdora, M. C., Whiteley, I., Storm, M., Bonnin, A., Rau, C., Margrie, T., Collinson, L., & Schaefer, A. T. (2022). Functional and multiscale 3D structural investigation of brain tissue through correlative in vivo physiology, synchrotron microtomography and volume electron microscopy. Nature Communications 2022 13:1, 13(1), 1–16. https://doi.org/10.1038/s41467-022-30199-6 ​Brown, E., Brunker, J., & Bohndiek, S. E. (2019). Photoacoustic imaging as a tool to probe the tumour microenvironment. DMM Disease Models and Mechanisms, 12(7). https://doi.org/10.1242/DMM.039636 ​Walsh, C., Holroyd, N. A., Finnerty, E., Ryan, S. G., Sweeney, P. W., Shipley, R. J., & Walker-Samuel, S. (2021). Multifluorescence High-Resolution Episcopic Microscopy for 3D Imaging of Adult Murine Organs. Advanced Photonics Research, 2(10), 2100110. https://doi.org/10.1002/ADPR.202100110 ​Walsh, C., Holroyd, N., Shipley, R., & Walker-Samuel, S. (2020). Asymmetric Point Spread Function Estimation and Deconvolution for Serial-Sectioning Block-Face Imaging. Communications in Computer and Information Science, 1248 CCIS, 235–249. https://doi.org/10.1007/978-3-030-52791-4_19

  14. Training and validation data used to produce the pre-trained model for the...

    • zenodo.org
    application/gzip
    Updated Jun 22, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Thorsten Wagner; Thorsten Wagner; Gavin Rice; Gavin Rice; Markus Stabrin; Markus Stabrin; Stefan Raunser; Stefan Raunser (2022). Training and validation data used to produce the pre-trained model for the TomoTwin paper. [Dataset]. http://doi.org/10.5281/zenodo.6637456
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Jun 22, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Thorsten Wagner; Thorsten Wagner; Gavin Rice; Gavin Rice; Markus Stabrin; Markus Stabrin; Stefan Raunser; Stefan Raunser
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This datasets represents the training and validation data that was used to produce the pre-trained model for the TomoTwin paper. Please see 10.5281/zenodo.6637357 for the raw tomograms.

  15. A

    Artificial Intelligence Synthetic Data Service Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 8, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Data Insights Market (2025). Artificial Intelligence Synthetic Data Service Report [Dataset]. https://www.datainsightsmarket.com/reports/artificial-intelligence-synthetic-data-service-525726
    Explore at:
    pdf, ppt, docAvailable download formats
    Dataset updated
    Jun 8, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Artificial Intelligence (AI) Synthetic Data Service market is experiencing rapid growth, driven by the increasing need for high-quality data to train and validate AI models, especially in sectors with data scarcity or privacy concerns. The market, estimated at $2 billion in 2025, is projected to expand significantly over the next decade, achieving a Compound Annual Growth Rate (CAGR) of approximately 30% from 2025 to 2033. This robust growth is fueled by several key factors: the escalating adoption of AI across various industries, the rising demand for robust and unbiased AI models, and the growing awareness of data privacy regulations like GDPR, which restrict the use of real-world data. Furthermore, advancements in synthetic data generation techniques, enabling the creation of more realistic and diverse datasets, are accelerating market expansion. Major players like Synthesis, Datagen, Rendered, Parallel Domain, Anyverse, and Cognata are actively shaping the market landscape through innovative solutions and strategic partnerships. The market is segmented by data type (image, text, time-series, etc.), application (autonomous driving, healthcare, finance, etc.), and deployment model (cloud, on-premise). Despite the significant growth potential, certain restraints exist. The high cost of developing and deploying synthetic data generation solutions can be a barrier to entry for smaller companies. Additionally, ensuring the quality and realism of synthetic data remains a crucial challenge, requiring continuous improvement in algorithms and validation techniques. Overcoming these limitations and fostering wider adoption will be key to unlocking the full potential of the AI Synthetic Data Service market. The historical period (2019-2024) likely saw a lower CAGR due to initial market development and technology maturation, before experiencing the accelerated growth projected for the forecast period (2025-2033). Future growth will heavily depend on further technological advancements, decreasing costs, and increasing industry awareness of the benefits of synthetic data.

  16. Z

    Data for the manuscript "Spatially resolved uncertainties for machine...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 1, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Schörghuber, Johannes (2024). Data for the manuscript "Spatially resolved uncertainties for machine learning potentials" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11086346
    Explore at:
    Dataset updated
    Aug 1, 2024
    Dataset provided by
    Wanzenböck, Ralf
    Madsen, Georg K. H.
    Heid, Esther
    Schörghuber, Johannes
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:

    mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).

    aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.

    data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to the reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].

    data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.

    data_h2o.tar.gz contains:

    full_db.extxyz: The full dataset of 1.5k structures.

    iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation set for the active learning cycle.

    the subfolders in the folders random, and uncertain, and atomic contain the training and validation sets for the random and uncertainty-based (local or atomic) active learning loops.

  17. e

    CA-9, a dataset of carbon allotropes for training and testing of neural...

    • b2find.eudat.eu
    Updated Aug 30, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2022). CA-9, a dataset of carbon allotropes for training and testing of neural network potentials - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/a3ef8472-dc40-583f-898e-88f295d624ea
    Explore at:
    Dataset updated
    Aug 30, 2022
    Description

    The use of machine learning to accelerate computer simulations is on the rise. In atomistic simulations, the use of machine learning interatomic potentials (ML-IAPs) can significantly reduce computational costs while maintaining accuracy close to that of ab initio methods. To achieve this, ML-IAPs are trained on large datasets of images, meaning atomistic configurations labeled with data from ab initio calculations. Focusing on carbon, we have created a dataset, CA-9, consisting of 48000 images labeled with energies, forces and stress tensors obtained via ab initio molecular dynamics (AIMD). We use deep learning to train state-of-the-art neural network potentials (NNPs), a form of ML-IAP, on the CA-9 dataset and investigate how training and validation data can affect the performance of the NNPs. Our results show that image generation with AIMD causes a high degree of similarity between the generated images, which has a detrimental effect on the NNPs. However, by carefully choosing which images from the dataset are included in the training and validation data, this effect can be mitigated. We end by benchmarking our trained NNPs in real-world applications and show we can reproduce results from ab initio calculations with an accuracy higher than previously published ML- or classic IAPs.

  18. t

    FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    csv, json, bin, pngAvailable download formats
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  19. d

    Code, imagery, and annotations for training a deep learning model to detect...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2024). Code, imagery, and annotations for training a deep learning model to detect wildlife in aerial imagery [Dataset]. https://catalog.data.gov/dataset/code-imagery-and-annotations-for-training-a-deep-learning-model-to-detect-wildlife-in-aeri
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    There are 3 child zip files included in this data release. 01_Codebase.zip contains a codebase for using deep learning to filter images based on the probability of any bird occurrence. It includes instructions and files necessary for training, validating, and testing a machine learning detection algorithm. 02_Imagery.zip contains imagery that were collected using a Partenavia P68 fixed-wing airplane using a PhaseOne iXU-R 180 forward motion compensating 80-megapixel digital frame camera with a 70 mm Rodenstock lens. The imagery were cropped into smaller patches of 720x720 pixels for training and 1440x1440 pixels for validation and test datasets. These data were collected for developing machine learning algorithms for the detection and classification of avian targets in aerial imagery. These data can be paired with annotation values to train and evaluate object detection and classification models. 03_Annotations.zip contains a collection of bounding boxes around avian targets in aerial imagery formatted as COCO JSON file. The data are nested under evaluation and test folders and contain both ground truth targets and predicted targets.These data were collected for two main functions. The ground truth avian targets were manually annotated and can be used to train avian detection algorithms using machine learning methods. The predicted targets can be used to evaluate model performance while referencing the larger work associated with these data.

  20. d

    Kieli NLP Data - Fully-labelled Audio & Text Dataset for Machine Learning &...

    • datarade.ai
    Updated Mar 20, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kieli (2021). Kieli NLP Data - Fully-labelled Audio & Text Dataset for Machine Learning & AI platforms [Dataset]. https://datarade.ai/data-products/a-fully-labelled-dataset-for-machine-learning-and-ai-platforms-kieli
    Explore at:
    .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
    Dataset updated
    Mar 20, 2021
    Dataset authored and provided by
    Kieli
    Area covered
    Mauritius, Uruguay, Tajikistan, Anguilla, Ethiopia, Denmark, Djibouti, Fiji, Venezuela (Bolivarian Republic of), Antigua and Barbuda
    Description

    Kieli labels audio speech, Image, Video & Text Data including semantic segmentation, named entity recognition (NER) and POS tagging. Kieli transforms unstructured data into high quality training data for the refinement of Artificial Intelligence and Machine Learning platforms. For over a decade, hundreds of organizations have relied on Kieli to deliver secure, high-quality training data and model validation for machine learning. At Kieli, we believe that accurate data is the most important factor in production learning models. We are committed to delivering the best quality data for the most enterprising organizations and helping you make strides in Artificial Intelligence. At Kieli, we're passionately dedicated to serving the Arabic, English and French markets. We work in all areas of industry: healthcare, technology and retail.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum

Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training

Explore at:
.json, .csvAvailable download formats
Dataset provided by
Xverum LLC
Authors
Xverum
Area covered
Dominican Republic, Sint Maarten (Dutch part), Cook Islands, Jordan, Oman, Barbados, Western Sahara, United Kingdom, India, Norway
Description

Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

What Makes Our Data Unique?

Scale and Coverage: - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies. - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

Rich Attributes for Training Models: - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights. - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

Compliance and Quality: - Fully GDPR and CCPA compliant, providing secure and ethically sourced data. - Extensive data cleaning and validation processes ensure reliability and accuracy.

Annotation-Ready: - Pre-structured and formatted datasets that are easily ingestible into AI workflows. - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

How Is the Data Sourced? - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques. - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets. This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

Primary Use Cases and Verticals

Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

Why Choose Xverum? - Experience and Expertise: A trusted name in structured web data with a proven track record. - Flexibility: Datasets can be tailored for any AI/ML application. - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data. - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

Contact us for sample datasets or to discuss your specific needs.

Search
Clear search
Close search
Google apps
Main menu