Factori's AI & ML training data is thoroughly tested and reviewed to ensure that what you receive on your end is of the best quality.
Integrate the comprehensive AI & ML training data provided by Grepsr to develop superior AI & ML models.
Whether you're training algorithms for natural language processing, sentiment analysis, or any other AI application, we can deliver comprehensive datasets tailored to fuel your machine learning initiatives.
Enhanced Data Quality: Rigorous data validation processes and quality assurance checks guarantee the integrity and reliability of the training data you use to develop AI & ML models.
Gain a competitive edge, drive innovation, and unlock new opportunities by leveraging the power of tailored Artificial Intelligence and Machine Learning training data with Factori.
We offer web activity data of users browsing popular websites around the world. This data can be used to analyze browsing behavior across the web and build highly accurate audience segments for targeting ads based on interest categories and search/browsing intent.
Web Data Reach: Our reach data represents the total number of records available within various categories and comprises attributes such as Country, Anonymous ID, IP addresses, Search Query, and so on.
Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method at a suitable interval (daily/weekly/monthly).
Data Attributes: Anonymous_id IDType Timestamp Estid Ip userAgent browserFamily deviceType Os Url_metadata_canonical_url Url_metadata_raw_query_params refDomain mappedEvent Channel searchQuery Ttd_id Adnxs_id Keywords Categories Entities Concepts
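To make these attributes concrete, here is a minimal sketch of what one web-activity event might look like once parsed in Python. All values are invented for illustration, and the exact delivery format (JSON, CSV, etc.) depends on the export method chosen.

```python
# Hypothetical web-activity event built from the attributes listed above.
# Field values are illustrative only; real exports follow your delivery format.
event = {
    "anonymous_id": "a1b2c3d4-0000-0000-0000-000000000000",
    "id_type": "cookie",
    "timestamp": "2024-01-15T09:30:00Z",
    "ip": "203.0.113.7",
    "browser_family": "Chrome",
    "device_type": "mobile",
    "os": "Android",
    "url_metadata_canonical_url": "https://example.com/products/shoes",
    "ref_domain": "google.com",
    "search_query": "running shoes",
    "categories": ["sports", "footwear"],
}

# Example: route the event into an interest-based audience segment.
interested_in_sports = "sports" in event["categories"]
```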
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this project, we aim to annotate car images captured on highways. The annotated data will be used to train machine learning models for various computer vision tasks, such as object detection and classification.
For this project, we will be using Roboflow, a powerful platform for data annotation and preprocessing. Roboflow simplifies the annotation process and provides tools for data augmentation and transformation.
Roboflow offers data augmentation capabilities, such as rotation, flipping, and resizing. These augmentations can help improve the model's robustness.
Once the data is annotated and augmented, Roboflow allows us to export the dataset in various formats suitable for training machine learning models, such as YOLO, COCO, or TensorFlow Record.
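As a concrete illustration of one export format: a YOLO-style export stores one plain-text label file per image, with one `class x_center y_center width height` row per box and all coordinates normalised to [0, 1]. A minimal parser might look like the following sketch (the file name and image size are hypothetical):

```python
def load_yolo_labels(path, img_w, img_h):
    """Parse a YOLO-format label file into pixel-space (class, x, y, w, h) boxes."""
    boxes = []
    with open(path) as fh:
        for line in fh:
            cls, xc, yc, w, h = line.split()
            bw, bh = float(w) * img_w, float(h) * img_h
            x = float(xc) * img_w - bw / 2  # convert box centre to top-left corner
            y = float(yc) * img_h - bh / 2
            boxes.append((int(cls), x, y, bw, bh))
    return boxes

# e.g. boxes = load_yolo_labels("highway_0001.txt", img_w=1280, img_h=720)
```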
By completing this project, we will have a well-annotated dataset ready for training machine learning models. This dataset can be used for a wide range of applications in computer vision, including car detection and tracking on highways.
Our consumer data is gathered and aggregated via surveys, digital services, and public data sources. We use powerful profiling algorithms to collect and ingest only fresh and reliable data points.
Our comprehensive data enrichment solution includes a variety of data sets that can help you address gaps in your customer data, gain a deeper understanding of your customers, and power superior client experiences.
Consumer Graph Schema & Reach: Our data reach represents the total number of counts available within various categories and comprises attributes such as country location, MAU, DAU, and monthly location pings.
Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited method on a suitable interval (daily/weekly/monthly).
Consumer Graph Use Cases:
- 360-Degree Customer View: Get a comprehensive picture of customers by aggregating internal and external data.
- Data Enrichment: Leverage online-to-offline consumer profiles to build holistic audience segments and improve campaign targeting.
- Fraud Detection: Use multiple digital (web and mobile) identities to verify real users and detect anomalies or fraudulent activity.
- Advertising & Marketing: Understand audience demographics, interests, lifestyles, hobbies, and behaviors to build targeted marketing campaigns.
Here's the schema of Consumer Data:
person_id
first_name
last_name
age
gender
linkedin_url
twitter_url
facebook_url
city
state
address
zip
zip4
country
delivery_point_bar_code
carrier_route
walk_sequence_code
fips_state_code
fips_county_code
county_name
latitude
longitude
address_type
metropolitan_statistical_area
core_based_statistical_area
census_tract
census_block_group
census_block
primary_address
pre_address
street
post_address
address_suffix
address_secondline
address_abrev
census_median_home_value
home_market_value
property_build_year
property_with_ac
property_with_pool
property_with_water
property_with_sewer
general_home_value
property_fuel_type
year
month
household_id
census_median_household_income
household_size
marital_status
length_of_residence
number_of_kids
pre_school_kids
single_parents
working_women_in_household
homeowner
children
adults
generations
net_worth
education_level
occupation
education_history
credit_lines
credit_card_user
newly_issued_credit_card_user
credit_range_new
credit_cards
loan_to_value
mortgage_loan2_amount
mortgage_loan_type
mortgage_loan2_type
mortgage_lender_code
mortgage_loan2_lender_code
mortgage_lender
mortgage_loan2_lender
mortgage_loan2_ratetype
mortgage_rate
mortgage_loan2_rate
donor
investor
interest
buyer
hobby
personal_email
work_email
devices
phone
employee_title
employee_department
employee_job_function
skills
recent_job_change
company_id
company_name
company_description
technologies_used
office_address
office_city
office_country
office_state
office_zip5
office_zip4
office_carrier_route
office_latitude
office_longitude
office_cbsa_code
office_census_block_group
office_census_tract
office_county_code
company_phone
company_credit_score
company_csa_code
company_dpbc
company_franchiseflag
company_facebookurl
company_linkedinurl
company_twitterurl
company_website
company_fortune_rank
company_government_type
company_headquarters_branch
company_home_business
company_industry
company_num_pcs_used
company_num_employees
company_firm_individual
company_msa
company_msa_name
company_naics_code
company_naics_description
company_naics_code2
company_naics_description2
company_sic_code2
company_sic_code2_description
company_sic_code4
company_sic_code4_description
company_sic_code6
company_sic_code6_description
company_sic_code8
company_sic_code8_description
company_parent_company
company_parent_company_location
company_public_private
company_subsidiary_company
company_residential_business_code
company_revenue_at_side_code
company_revenue_range
company_revenue
company_sales_volume
company_small_business
company_stock_ticker
company_year_founded
company_minorityowned
company_female_owned_or_operated
company_franchise_code
company_dma
company_dma_name
company_hq_address
company_hq_city
company_hq_duns
company_hq_state
company_hq_zip5
company_hq_zip4
c...
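Given the breadth of this schema, downstream consumers typically materialise only the fields they need. A minimal sketch, assuming a CSV delivery and using just a handful of the attributes above (the file name is a placeholder):

```python
import pandas as pd

# Load only the columns needed for a simple audience segmentation.
cols = ["person_id", "age", "gender", "city", "state", "net_worth", "homeowner"]
df = pd.read_csv("consumer_graph_export.csv", usecols=cols)

# Example: homeowners aged 30-45 for a home-improvement campaign.
# (The homeowner column's encoding - boolean vs. Y/N flag - depends on the export.)
segment = df[(df["homeowner"] == True) & df["age"].between(30, 45)]
```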
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Zenodo's published open access records and communities metadata, including entries marked by the Zenodo staff as spam and deleted.
The datasets are gzipped compressed JSON-lines files, where each line is a JSON object representation of a Zenodo record or community.
Records dataset
Filename: zenodo_open_metadata_{ date of export }.jsonl.gz
Each object contains the terms: part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date
which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.
In addition, some terms have been altered.
Communities dataset
Filename: zenodo_community_metadata_{ date of export }.jsonl.gz
Each object contains the terms: id, title, description, curation_policy, page
which correspond to the fields with the same name available in Zenodo's community creation form.
Notes for all datasets
For each object the term spam contains a boolean value, determining whether a given record/community was marked as spam content by Zenodo staff.
Top-level terms whose values were missing in the metadata may contain a null value.
A smaller uncompressed random sample of 200 JSON lines is also included for each dataset to test and get familiar with the format without having to download the entire dataset.
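Because each line is a standalone JSON object, the exports can be streamed without loading a whole file into memory. A minimal sketch (the export date in the file name is a placeholder):

```python
import gzip
import json

def iter_records(path):
    # Stream one JSON object per line from the gzipped JSON-lines export.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

# Example: count records flagged as spam by Zenodo staff.
spam_count = sum(
    1 for rec in iter_records("zenodo_open_metadata_2020-10-19.jsonl.gz")
    if rec.get("spam")
)
```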
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the detection and tracking experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data were created with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data was generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.
Abstract:
Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.
Benchmark data
Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).
The field dataset consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm and 40 cm. All video recordings were well exposed, and captured at 23.976 fps.
Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory and field dataset, respectively: each visible individual was assigned a constant-size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi-animal tracker, which leverages Blender's internal contrast-based motion tracker and also includes track refinement options and CSV export functionality. Comprehensive documentation of this tool, together with Jupyter notebooks for track visualisation and benchmarking, is provided on the replicAnt and BlenderMotionExport GitHub repositories.
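One plausible reading of that border adjustment is a simple clamp: the constant-size box is centred on the thorax and then shifted so it remains fully inside the frame. The following sketch illustrates the idea; it is not the authors' actual Add-on code.

```python
def fixed_size_box(cx, cy, size, img_w, img_h):
    """Centre a constant-size square box on (cx, cy), shifting it to stay in-frame."""
    x0 = min(max(cx - size / 2, 0), img_w - size)
    y0 = min(max(cy - size / 2, 0), img_h - size)
    return x0, y0, size, size  # top-left corner plus width and height
```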
Synthetic data generation
Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A “single” population was generated using the major model only, with 90% scale variation, but equal material variation settings.
A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.
Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).
Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets varied from 10/1 to 1/100.
Funding
This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Mobility/Location data is gathered from location-aware mobile apps using an SDK-based implementation. All users explicitly consent to allow location data sharing using a clear opt-in process for our use cases and are given clear opt-out options. Factori ingests, cleans, validates, and exports all location data signals to ensure only the highest quality of data is made available for analysis.
Record Count: 90 Billion+
Capturing Frequency: Once per Event
Delivering Frequency: Once per Day
Updated: Daily
Mobility Data Reach: Our data reach represents the total number of counts available within various categories and comprises attributes such as country location, MAU, DAU & Monthly Location Pings.
Data Export Methodology: Since we collect data dynamically, we provide the most updated data and insights via a best-suited interval (daily/weekly/monthly/quarterly).
Use Cases:
- Consumer Insight: Gain a comprehensive 360-degree perspective of the customer to spot behavioral changes, analyze trends, and predict business outcomes.
- Market Intelligence: Study various market areas, the proximity of points of interest, and the competitive landscape.
- Advertising: Create campaigns and customize your messaging depending on your target audience's online and offline activity.
- Retail Analytics: Analyze footfall trends in various locations and gain an understanding of customer personas.
Here are the data attributes: maid latitude longitude horizontal_accuracy timestamp id_type ipv4 ipv6 user_agent country state_hasc city_hasc postcode geohash hex8 hex9 carrier
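For illustration, a single location ping under this schema might parse into the Python dict below. All values are invented; geohash is a standard geohash of the coordinates, and hex8/hex9 presumably refer to hexagonal spatial indexes (e.g. H3) at two resolutions.

```python
# Hypothetical mobility ping; values are illustrative only.
ping = {
    "maid": "38400000-8cf0-11bd-b23e-10b96e40000d",  # mobile advertising ID
    "latitude": 40.7128,
    "longitude": -74.0060,
    "horizontal_accuracy": 12.5,   # assumed to be metres
    "timestamp": 1705312200,       # assumed to be epoch seconds
    "id_type": "GAID",
    "ipv4": "198.51.100.23",
    "country": "US",
    "geohash": "dr5regw3",
    "hex8": "882a100d25fffff",     # presumably an H3-style index, resolution 8
    "carrier": "ExampleCarrier",
}
```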
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data and deep learning segmentation model deposited here are derived from 3D multicoloured intravital microscopy of mammary epithelial cells during development. We aimed to study in vivo cell shape dynamics in real time in an unbiased way. This robust and deep analysis revealed that hormone-responsive breast cells are unexpectedly elongated and motile at a high frequency during duct growth. The data is associated with our publication Dawson, Milevskiy et al., Cell Reports 2024, "Hormone-responsive progenitors have a unique identity and exhibit high motility during mammary morphogenesis". https://doi.org/10.1016/j.celrep.2024.115073
Deposited data
- Single channel intravital movie maximum projections (File: MaSCOT-AI Max projections). These are up to 5 hours long, with timepoints every 10 minutes.
- Extracted 5th time points from each movie that we used for model training (File: MaSCOT-AI t5 training)
- Segmentation files generated by Cellpose 2.2.2 (File: MaSCOT-AI t5 segmentation files)
Analysis scripts:
The TrackMate-Cellpose Python script, R data processing scripts, and an example Excel data sheet are on GitHub at https://github.com/cadaws/MaSCOT-AI
Example analysis and data export:
A small set of example data and the resulting TrackMate-Cellpose output will be uploaded at a later date.
Methods
27 4D movies were acquired by multiphoton microscopy of anaesthetised, cell-type-specific Confetti mice at different stages of development, with one timepoint every 10 minutes. 350 single-channel, single-cell-thick layers (10-30 µm sections) were isolated by 3D cropping, then flattened by maximum projection. The 5th time point from all movies was taken for model training in Cellpose 2.2.2, which was achieved after manual correction of segmentation for 150 images (MaSCOT-AI model).
The MaSCOT-AI model was used in a high throughput Trackmate-Cellpose script in ImageJ to track mammary cell shape over time.
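For orientation, running a custom-trained Cellpose 2.x model on a single projection looks roughly like the sketch below; the model and image paths are placeholders, and TrackMate's Cellpose integration wraps an equivalent call for every frame.

```python
from cellpose import models, io

# Load the custom-trained model (path is a placeholder for the MaSCOT-AI weights).
model = models.CellposeModel(gpu=True, pretrained_model="models/MaSCOT-AI")

img = io.imread("max_projection_t5.tif")  # single-channel max projection
masks, flows, styles = model.eval(img, diameter=None, channels=[0, 0])
print(f"{masks.max()} cells segmented")   # masks is a labelled integer array
```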
Software versions:
Cellpose 2.2.2 GUI with GPU was installed according to https://pypi.org/project/cellpose/ (March 2024).
TrackMate v7.11.1
File name structure
Date_mouse-model_developmental-stage_fluorescent-protein_z-span
Mouse models:
K5: K5-rtTA/tetoCre/Confetti
Elf5: Elf5-rtTA/tetoCre/Confetti
Pr: PR-Cre/Confetti
Developmental stage:
no label = Terminal end bud at 5 weeks
duct/notpreg = duct at 6 or 9 weeks
6dPreg/6dplug = 6 days pregnancy
6d MPA = 6 days MPA treatment
MPAveh = 6 days MPA vehicle treatment
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study introduces Popnet, a deep learning model for forecasting 1 km-gridded populations that integrates U-Net, ConvLSTM, a spatial autocorrelation module, and deep ensemble methods. Using spatial variables and population data from 2000 to 2020, Popnet predicts South Korea's population trends by age group (under 14, 15-64, over 65) up to 2040. In validation, it outperforms traditional machine learning and state-of-the-art computer vision models. The model's output reveals significant polarization: population growth in urban areas, especially the capital region, and severe depopulation in rural areas. Popnet is a robust tool for offering policymakers and related stakeholders detailed insight into future population, allowing them to establish detailed, localised planning and resource allocation.

*Due to the export restrictions on grid data imposed by the National Geographic Information Institute of Korea, the training data has been replaced with data from Tennessee. However, the Korean version of the future prediction data remains unchanged. Please take this into consideration.
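For readers unfamiliar with the ConvLSTM component, the sketch below shows the generic building block in PyTorch: an LSTM cell whose gate transformations are convolutions, so the hidden state retains spatial structure. This illustrates the technique only, not the authors' Popnet implementation; grid sizes and channel counts are placeholders.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: LSTM gating with convolutions instead of matmuls."""

    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates at once from [input, hidden].
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g      # update cell state
        h = o * c.tanh()       # new hidden state is a spatial feature map
        return h, c

# Roll the cell over a sequence of population grids (placeholder shapes).
cell = ConvLSTMCell(in_ch=1, hid_ch=16)
h = torch.zeros(1, 16, 64, 64)
c = torch.zeros(1, 16, 64, 64)
for t in range(5):                      # e.g. five annual grids
    frame = torch.randn(1, 1, 64, 64)   # stand-in for a 1 km population grid
    h, c = cell(frame, (h, c))
```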
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Zenodo record contains two test datasets (Birds and Littorina) used in the paper:
PhenoLearn: A user-friendly Toolkit for Image Annotation and Deep Learning-Based Phenotyping for Biological Datasets
Authors: Yichen He, Christopher R. Cooney, Steve Maddock, Gavin H. Thomas
PhenoLearn is a graphical and script-based toolkit designed to help biologists annotate and analyse biological images using deep learning. This dataset includes two test cases: one of bird specimen images for semantic segmentation, and another of marine snail (Littorina) images for landmark detection. These datasets are used to demonstrate the PhenoLearn workflow in the accompanying paper.
1. Download the dataset folders.
2. Use PhenoLearn to load seg_train.csv (segmentation) or pts_train.csv (landmark) to view and edit annotations.
3. Train segmentation or landmark prediction models directly via PhenoLearn's training module, or export data for external tools.
4. Use name_file_pred to match predictions with ground truth for evaluation.
See the full tutorial and usage guide at https://github.com/EchanHe/PhenoLearn.
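If you want to inspect the annotation files before opening them in PhenoLearn, they are plain CSVs and load directly with pandas. The exact column layout is defined by PhenoLearn, so this sketch only peeks at it:

```python
import pandas as pd

ann = pd.read_csv("seg_train.csv")  # or pts_train.csv for the landmark dataset
print(len(ann), "rows")
print(ann.columns.tolist())         # inspect PhenoLearn's column layout
print(ann.head())
```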
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Welcome to the Universal Roblox Character Detection Dataset (URCDD). This dataset is a comprehensive collection of images extracted from various games on the Roblox platform. Our primary objective is to offer a diverse and extensive dataset that encompasses the wide array of characters found in Roblox games.
Versions
We have created a unique tag for each game that we have collected data from. Refer to the list below:
1. baseplate
- https://www.roblox.com/games/4483381587
2. da-hood
- https://www.roblox.com/games/2788229376
3. arsenal
- https://www.roblox.com/games/286090429
4. aimblox
- https://www.roblox.com/games/6808416928
5. hood-customs
- https://www.roblox.com/games/9825515356
6. counter-blox
- https://www.roblox.com/games/301549746/
7. hood-testing
- https://www.roblox.com/games/12673840215
8. phantom-forces
- https://www.roblox.com/games/292439477
9. entrenched
- https://www.roblox.com/games/3678761576
When you need to analyze crypto market history, batch processing often beats streaming APIs. That's why we built the Flat Files S3 API - giving analysts and researchers direct access to structured historical cryptocurrency data without the integration complexity of traditional APIs.
Pull comprehensive historical data across 800+ cryptocurrencies and their trading pairs, delivered in clean, ready-to-use CSV formats that drop straight into your analysis tools. Whether you're building backtest environments, training machine learning models, or running complex market studies, our flat file approach gives you the flexibility to work with massive datasets efficiently.
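In practice, pulling a flat file into an analysis session is a download-and-parse two-step, as in the sketch below. The bucket name and object key are placeholders; use the values from your CoinAPI onboarding materials, and note that column layouts vary by data type.

```python
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Placeholder bucket/key - your actual paths come from CoinAPI onboarding.
s3.download_file("example-coinapi-flatfiles",
                 "trades/BTC-USD/2024-01-01.csv",
                 "trades.csv")

trades = pd.read_csv("trades.csv")
print(trades.head())  # column names depend on the data type you chose
```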
Why work with us?
Market Coverage & Data Types:
- Comprehensive historical data since 2010 (for chosen assets)
- Comprehensive order book snapshots and updates
- Trade-by-trade data
Technical Excellence:
- 99.9% uptime guarantee
- Standardized data format across exchanges
- Flexible integration
- Detailed documentation
- Scalable architecture
CoinAPI serves hundreds of institutions worldwide, from trading firms and hedge funds to research organizations and technology providers. Our S3 delivery method integrates easily with your existing workflows, offering familiar access patterns, reliable downloads, and straightforward automation for your data team. Our commitment to data quality and technical excellence, combined with accessible delivery options, makes us the trusted choice for institutions that demand both comprehensive historical data and real-time market intelligence.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a collection of images and video frames of cheetahs at the Omaha Henry Doorly Zoo taken in October 2020. The capture device was a SEEK Thermal Compact XR connected to an iPhone 11 Pro. Video frames were sampled and labeled by hand with bounding boxes for object detection using Roboflow.
We have provided the dataset for download under a creative commons by-attribution license. You may use this dataset in any project (including for commercial use) but must cite Roboflow as the source.
This dataset could be used for conservation of endangered species, cataloging animals with a trail camera, gathering statistics on wildlife behavior, or experimenting with other thermal and infrared imagery.
Roboflow creates tools that make computer vision easy to use for any developer, even if you're not a machine learning expert. You can use it to organize, label, inspect, convert, and export your image datasets. And even to train and deploy computer vision models with no code required.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains more than 100K textual descriptions of cultural items from Cultura Italia (http://www.culturaitalia.it/opencms/index.jsp?language=en), the Italian national cultural aggregator. Each description is labeled either HIGH or LOW quality, according to its adherence to the standard cataloguing guidelines provided by the Istituto Centrale per il Catalogo e la Documentazione (ICCD). More precisely, each description is labeled as HIGH quality if the object and subject of the item (for which the description is provided) are both described according to the ICCD guidelines, and as LOW quality in all other cases. Most of the dataset was manually annotated, with ~30K descriptions automatically labeled as LOW quality due to their length (fewer than 3 tokens) or their provenance from old (pre-2012), non-curated collections.

The dataset was developed to support the training and testing of ML text classification approaches for automatically assessing the quality of textual descriptions in digital Cultural Heritage repositories. It is provided as a CSV file, where each row corresponds to an item from Cultura Italia and contains the textual description of the item, the domain of the item (OpereArteVisiva/RepertoArcheologico/Architettura), and the quality label (Low_Quality/High_Quality).

The textual descriptions in the dataset are provided by Cultura Italia under a "Public Domain" license (cf. http://www.culturaitalia.it/opencms/export/sites/culturaitalia/attachments/linked_open_data/Licenza_CulturaItalia_CC0.pdf). The whole dataset, including the annotations, is openly distributed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) licence.
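As a baseline for the kind of ML text classification this dataset supports, a TF-IDF plus linear SVM pipeline in scikit-learn is a reasonable starting point. This is a generic sketch, not the approach from any associated paper; the file and column names are assumptions about the CSV layout described above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

df = pd.read_csv("cultura_italia_descriptions.csv")  # file name assumed

X_train, X_test, y_train, y_test = train_test_split(
    df["description"], df["quality"],                # column names assumed
    test_size=0.2, random_state=0)

clf = make_pipeline(TfidfVectorizer(max_features=50_000), LinearSVC())
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```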
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The License Plates dataset is an object detection dataset of different vehicles (e.g., cars, vans) and their respective license plates. Annotations include examples of "vehicle" and "license-plate". This dataset has a train/validation/test split of 245/70/35 images respectively.
Dataset example image: https://i.imgur.com/JmRgjBq.png
This dataset could be used to create a vehicle and license plate detection object detection model. Roboflow provides a great guide on creating a license plate and vehicle object detection model.
This dataset is a subset of the Open Images Dataset. The annotations are licensed by Google LLC under CC BY 4.0 license. Some annotations have been combined or removed using Roboflow's annotation management tools to better align the annotations with the purpose of the dataset. The images have a CC BY 2.0 license.
Roboflow creates tools that make computer vision easy to use for any developer, even if you're not a machine learning expert. You can use it to organize, label, inspect, convert, and export your image datasets. And even to train and deploy computer vision models with no code required.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Title:
Dark-Field Microscopy Images of Evaporated Mezcal Droplets for Agave Species Classification
Description:
This dataset contains dark-field microscopy images of mezcal samples produced from four agave species: Agave salmiana (salmiana), Agave marmorata (tepeztate), Agave rhodacantha (cuishe), and Agave angustifolia (espadin), as well as an aged salmiana. Each 1 μL droplet of diluted mezcal (20% ABV) was deposited on a cleaned glass slide and allowed to evaporate under ambient conditions to form distinct microstructures. The resulting images were acquired at 4× magnification and used to train and validate a Support Vector Machine (SVM) classifier to distinguish between the first two varietals. The dataset supports research in agave-based spirit authentication, chemometric image analysis, and low-cost classification of artisanal products.
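For a sense of how such an SVM classifier can be trained on these images, the sketch below flattens greyscale pixels into feature vectors and fits a scikit-learn SVC on two of the classes. This is a generic illustration under assumed file layouts and extensions, not the dataset's bundled scripts, which may use different features.

```python
import numpy as np
from pathlib import Path
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def load_folder(folder, label):
    """Load 224x224 greyscale images from a class folder as flat feature vectors."""
    X, y = [], []
    for p in Path(folder).glob("*.png"):  # file extension assumed
        img = np.asarray(Image.open(p).convert("L"), dtype=np.float32) / 255.0
        X.append(img.ravel())
        y.append(label)
    return X, y

Xs, ys = load_folder("salmiana", 0)
Xt, yt = load_folder("tepeztate", 1)
X, y = np.array(Xs + Xt), np.array(ys + yt)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```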
Contents:
- Image folders: /salmiana/, /tepeztate/, /espadin/, and /cuishe/
- Python scripts and Jupyter Notebooks for training, evaluation, and model export
- Pretrained SVM model and label encoder files
Format:
Images (224×224 pixels), Notebooks (.ipynb)
Intended Use:
Research in chemometrics, machine learning, and food authentication. May also serve as a benchmark dataset for image-based classification of fermented or distilled products.
License:
Creative Commons Attribution 4.0 International (CC BY 4.0)