100+ datasets found

SVG Code Generation Sample Training Data

kaggle.com

zip

Updated May 3, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Vinothkumar Sekar (2025). SVG Code Generation Sample Training Data [Dataset]. https://www.kaggle.com/datasets/vinothkumarsekar89/svg-generation-sample-training-data

Explore at:

zip(193477 bytes)Available download formats

Dataset updated

May 3, 2025

Authors

Vinothkumar Sekar

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.

The dataset is generated in two steps using the GPT-4o model. - In the first step, topic descriptions relevant to the competition are generated using a specific prompt. By running this prompt multiple times, over 3,000 descriptions were collected.

 
prompt=f""" I am participating in an SVG code generation competition.
  
   The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
  
   - Descriptions are generic and do not contain brand names, trademarks, or personal names.
   - No descriptions include people, even in generic terms.
   - Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
   - Categories cover various domains, with some overlap between public and private test sets.
  
   To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
  
   Requirements:
   - Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
   - Ensure **diversity and creativity** across topics.
   - **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
   - Avoid duplication or overly similar phrasing.
  
   Example topics:
                 a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid,  purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet,  a snowy plain, black and white checkered pants,  a starlit night over snow-covered peaks, khaki triangles and azure crescents,  a maroon dodecahedron interwoven with teal threads.
  
   Please return the 100 topics in csv format.
   """

In the second step, SVG code is generated by prompting the GPT-4o model. The following prompt is used to query the model to generate svg.

 
  prompt = f"""
      Generate SVG code to visually represent the following text description, while respecting the given constraints.
      
      Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
      Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
      

      Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. 
      Focus on a clear and concise representation of the input description within the given limitations. 
      Always give the complete SVG code with nothing omitted. Never use an ellipsis.

      The code is scored based on similarity to the description, Visual question anwering and aesthetic components.
      Please generate a detailed svg code accordingly.

      input description: {text}
      """

The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.

A dataset with ~50,000 samples for SVG code generation is publicly available at: https://huggingface.co/datasets/vinoku89/svg-code-generation

c
The global AI Training Dataset Market size will be USD 2962.4 million in...
cognitivemarketresearch.com
pdf,excel,csv,ppt
Updated Aug 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cognitive Market Research (2025). The global AI Training Dataset Market size will be USD 2962.4 million in 2025. [Dataset]. https://www.cognitivemarketresearch.com/ai-training-dataset-market-report
Explore at:
pdf,excel,csv,pptAvailable download formats
Dataset updated
Aug 15, 2025
Dataset authored and provided by
Cognitive Market Research
License
https://www.cognitivemarketresearch.com/privacy-policyhttps://www.cognitivemarketresearch.com/privacy-policy
Time period covered
2021 - 2033
Area covered
Global
Description
According to Cognitive Market Research, the global AI Training Dataset Market size will be USD 2962.4 million in 2025. It will expand at a compound annual growth rate (CAGR) of 28.60% from 2025 to 2033.

North America held the major market share for more than 37% of the global revenue with a market size of USD 1096.09 million in 2025 and will grow at a compound annual growth rate (CAGR) of 26.4% from 2025 to 2033. Europe accounted for a market share of over 29% of the global revenue, with a market size of USD 859.10 million. APAC held a market share of around 24% of the global revenue with a market size of USD 710.98 million in 2025 and will grow at a compound annual growth rate (CAGR) of 30.6% from 2025 to 2033. South America has a market share of more than 3.8% of the global revenue, with a market size of USD 112.57 million in 2025 and will grow at a compound annual growth rate (CAGR) of 27.6% from 2025 to 2033. Middle East had a market share of around 4% of the global revenue and was estimated at a market size of USD 118.50 million in 2025 and will grow at a compound annual growth rate (CAGR) of 27.9% from 2025 to 2033. Africa had a market share of around 2.20% of the global revenue and was estimated at a market size of USD 65.17 million in 2025 and will grow at a compound annual growth rate (CAGR) of 28.3% from 2025 to 2033. Data Annotation category is the fastest growing segment of the AI Training Dataset Market

Market Dynamics of AI Training Dataset Market

Key Drivers for AI Training Dataset Market

Government-Led Open Data Initiatives Fueling AI Training Dataset Market Growth

In recent years, Government-initiated open data efforts have strongly driven the development of the AI Training Dataset Market through offering affordable, high-quality datasets that are vital in training sound AI models. For instance, the U.S. government's drive for openness and innovation can be seen through portals such as Data.gov, which provides an enormous collection of datasets from many industries, ranging from healthcare, finance, and transportation. Such datasets are basic building blocks in constructing AI applications and training models using real-world data. In the same way, the platform data.gov.uk, run by the U.K. government, offers ample datasets to aid AI research and development, creating an environment that is supportive of technological growth. By releasing such information into the public domain, governments not only enhance transparency but also encourage innovation in the AI industry, resulting in greater demand for training datasets and helping to drive the market's growth.

India's IndiaAI Datasets Platform Accelerates AI Training Dataset Market Growth

India's upcoming launch of the IndiaAI Datasets Platform in January 2025 is likely to greatly increase the AI Training Dataset Market. The project, which is part of the government's ?10,000 crore IndiaAI Mission, will establish an open-source repository similar to platforms such as HuggingFace to enable developers to create, train, and deploy AI models. The platform will collect datasets from central and state governments and private sector organizations to provide a wide and rich data pool. Through improved access to high-quality, non-personal data, the platform is filling an important requirement for high-quality datasets for training AI models, thus driving innovation and development in the AI industry. This public initiative reflects India's determination to become a global AI hub, offering the infrastructure required to facilitate startups, researchers, and businesses in creating cutting-edge AI solutions. The initiative not only simplifies data access but also creates a model for public-private partnerships in AI development.

Restraint Factor for the AI Training Dataset Market

Data Privacy Regulations Impeding AI Training Dataset Market Growth

Strict data privacy laws are coming up as a major constraint in the AI Training Dataset Market since governments across the globe are establishing legislation to safeguard personal data. In the European Union, explicit consent for using personal data is required under the General Data Protection Regulation (GDPR), reducing the availability of datasets for training AI. Likewise, the data protection regulator in Brazil ordered Meta and others to stop the use of Brazilian personal data in training AI models due to dangers to individuals' funda...
Make Data Count training labels
kaggle.com
zip
Updated Aug 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
RodericD.M.Page (2025). Make Data Count training labels [Dataset]. https://www.kaggle.com/datasets/rdmpage/new-training-labels
Explore at:
zip(14355 bytes)Available download formats
Dataset updated
Aug 4, 2025
Authors
RodericD.M.Page
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset identifiers extracted from training PDFs for Make Data Count competition. This data is based on my interpretation of what consititutes a "data citation" and may not conform to what the competetion organisers think is data citation. There is a GitHub repo to track fixes and updates.
Machine Learning Dataset
brightdata.com
.json, .csv, .xlsx
Updated Dec 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Dec 23, 2024
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
R
Trainings Dataset
universe.roboflow.com
zip
Updated Sep 16, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
training images (2024). Trainings Dataset [Dataset]. https://universe.roboflow.com/training-images/training-datasets
Explore at:
zipAvailable download formats
Dataset updated
Sep 16, 2024
Dataset authored and provided by
training images
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Cloths Bounding Boxes
Description
Training Datasets

## Overview Training Datasets is a dataset for object detection tasks - it contains Cloths annotations for 1,498 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Zzzs: Lightweight training dataset + target
kaggle.com
zip
Updated Sep 20, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carl McBride Ellis (2023). Zzzs: Lightweight training dataset + target [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/zzzs-lightweight-training-dataset-target
Explore at:
zip(193667957 bytes)Available download formats
Dataset updated
Sep 20, 2023
Authors
Carl McBride Ellis
License
https://cdla.io/permissive-1-0/https://cdla.io/permissive-1-0/
Description
Updated: with the corrected Offset Events (18 September 2023)

Updated: with the corrected Offset Events (20 September 2023)

Zzzs_train.parquet

This is a lightweight training dataset + binary target (awake) for the kaggle "Child Mind Institute - Detect Sleep States" competition.

It consists of 35 series selected from the original 277 series. This dataset was created using the notebook:

Zzzs: Make small starter datasets + target

This lightweight dataset is in parquet format (180M).

Zzzs_train_multi.parquet

The dataset Zzzs_train_multi.parquet also has an extra 8 non_wear (?) series assigned the awake=2 class. These extra series are:

['0f9e60a8e56d', '390b487231ce', '2fc653ca75c7', 'c7b1283bb7eb', '89c7daa72eee', 'e11b9d69f856', 'c5d08fc3e040', 'a3e59c2ce3f6']
Machine learning algorithm validation with a limited sample size
plos.figshare.com
text/x-python
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson (2023). Machine learning algorithm validation with a limited sample size [Dataset]. http://doi.org/10.1371/journal.pone.0224365
Explore at:
text/x-pythonAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0224365
Dataset updated
May 30, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Andrius Vabalas; Emma Gowen; Ellen Poliakoff; Alexander J. Casson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High dimensional data with a small number of samples is of critical importance for identifying biomarkers and conducting feasibility and pilot work, however it can lead to biased machine learning (ML) performance estimates. Our review of studies which have applied ML to predict autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. Thus, we have investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection if performed on pooled training and testing data is contributing to bias considerably more than parameter tuning. In addition, the contribution to bias by data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
R
Training Data For Paper Dataset
universe.roboflow.com
zip
Updated Jun 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pulak Deb Roy (2025). Training Data For Paper Dataset [Dataset]. https://universe.roboflow.com/pulak-deb-roy/training-data-for-paper/dataset/1
Explore at:
zipAvailable download formats
Dataset updated
Jun 21, 2025
Dataset authored and provided by
Pulak Deb Roy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
YOLOv8 Bounding Boxes
Description
Training Data For Paper

## Overview Training Data For Paper is a dataset for object detection tasks - it contains YOLOv8 annotations for 1,335 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
TREC 2022 Deep Learning test collection
catalog.data.gov
gimi9.com
+1more
Updated May 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
Explore at:
Dataset updated
May 9, 2023
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).Certain machine learning based methods, such as methods based on deep learning are known to require very large datasets for training. Lack of such large scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large scale datasets to TREC, and create a focused research effort with a rigorous blind evaluation of ranker for the passage ranking and document ranking tasks.Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought in to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
R
Training Data (add Augmented) Dataset
universe.roboflow.com
zip
Updated Dec 7, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
car plate license (2022). Training Data (add Augmented) Dataset [Dataset]. https://universe.roboflow.com/car-plate-license/training-data-add-augmented
Explore at:
zipAvailable download formats
Dataset updated
Dec 7, 2022
Dataset authored and provided by
car plate license
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Plate Bounding Boxes
Description
Training Data (add Augmented)

## Overview Training Data (add Augmented) is a dataset for object detection tasks - it contains Plate annotations for 825 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Training dataset
figshare.com
txt
Updated Jun 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
MOHAMMED ABDALSALAM (2024). Training dataset [Dataset]. http://doi.org/10.6084/m9.figshare.26129863.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.26129863.v1
Dataset updated
Jun 29, 2024
Dataset provided by
Figsharehttp://figshare.com/
Authors
MOHAMMED ABDALSALAM
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This file contains the training dataset used to train the machine learning model. It includes a wide range of features combined to create a robust training set for model development.
d
Customer Service Call Dataset [Multisector] – Annotated support transcripts...
datarade.ai
Updated Apr 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
WiserBrand.com (2025). Customer Service Call Dataset [Multisector] – Annotated support transcripts for training AI and improving CX [Dataset]. https://datarade.ai/data-products/customer-service-call-dataset-multisector-annotated-suppo-wiserbrand-com
Explore at:
.json, .csv, .xls, .txtAvailable download formats
Dataset updated
Apr 11, 2025
Dataset provided by
WiserBrand
Area covered
United States of America
Description
"This dataset contains transcribed customer support calls from companies in over 160 industries, offering a high-quality foundation for developing customer-aware AI systems and improving service operations. It captures how real people express concerns, frustrations, and requests — and how support teams respond.

Included in each record:

Full call transcription with labeled speakers (system, agent, customer)

Concise human-written summary of the conversation

Sentiment tag for the overall interaction: positive, neutral, or negative

Company name, duration, and geographic location of the caller

Call context includes industries such as eCommerce, banking, telecom, and streaming services

Common use cases:

Train NLP models to understand support calls and detect churn risk

Power complaint detection engines for customer success and support teams

Create high-quality LLM training sets with real support narratives

Build summarization and topic tagging pipelines for CX dashboards

Analyze tone shifts and resolution language in customer-agent interaction

This dataset is structured, high-signal, and ready for use in AI pipelines, CX design, and quality assurance systems. It brings full transparency to what actually happens during customer service moments — from routine fixes to emotional escalations."

The more you purchase, the lower the price will be.
h
Astral-1.5-Post-Training-Dataset-SFT
huggingface.co
Updated Oct 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lucidity AI (2025). Astral-1.5-Post-Training-Dataset-SFT [Dataset]. https://huggingface.co/datasets/LucidityAI/Astral-1.5-Post-Training-Dataset-SFT
Explore at:
Dataset updated
Oct 10, 2025
Dataset authored and provided by
Lucidity AI
Description
Astral 1.5 Post-Training Dataset

A albeit smaller, yet higher-quality reasoning dataset combining mathematics, code, and general stem used in the training of the Astral 1.5 model family.

Dataset Description

This dataset merges four datasets to create a high quality 25 thousand example dataset. With the size of the dataset, we rely on the principle that quality > quantity leads to better model performance.

Dataset Composition Setup

General STEM:… See the full description on the dataset page: https://huggingface.co/datasets/LucidityAI/Astral-1.5-Post-Training-Dataset-SFT.
h
Astral-Post-Training-Dataset
huggingface.co
Updated Aug 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lucidity AI (2025). Astral-Post-Training-Dataset [Dataset]. https://huggingface.co/datasets/LucidityAI/Astral-Post-Training-Dataset
Explore at:
Dataset updated
Aug 31, 2025
Dataset authored and provided by
Lucidity AI
Description
Astral Post-Training Dataset

A high-quality reasoning dataset combining mathematics, code, and science problems used in the training of the Astral model family.

Dataset Description

This dataset merges three premium reasoning datasets to create a balanced training corpus for improving model performance across mathematical reasoning, competitive programming, and scientific problem-solving. The dataset contains 100,000 samples with equal representation from each domain.… See the full description on the dataset page: https://huggingface.co/datasets/LucidityAI/Astral-Post-Training-Dataset.

Images used for training, validation, and testing.

kaggle.com

Updated Mar 15, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Chrysthian Chrisley (2024). Images used for training, validation, and testing. [Dataset]. https://www.kaggle.com/datasets/chrysthian/images-used-for-training-validation-and-testing

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Mar 15, 2024

Dataset provided by

Kaggle

Authors

Chrysthian Chrisley

License

Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically

Description

Imports:

# All Imports
import os
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.calibration import LabelEncoder
import seaborn as sns
import matplotlib.image as mpimg
import cv2
import numpy as np
import pickle

# Tensflor and Keras Layer and Model and Optimize and Loss
import tensorflow as tf
from tensorflow import keras
from keras import Sequential
from keras.layers import *

#Kernel Intilizer 
from keras.optimizers import Adamax

# PreTrained Model
from keras.applications import *

#Early Stopping
from keras.callbacks import EarlyStopping
import warnings

Warnings Suppression | Configuration

# Warnings Remove 
warnings.filterwarnings("ignore")

# Define the base path for the training folder
base_path = 'jaguar_cheetah/train'

# Weights file
weights_file = 'Model_train_weights.weights.h5'

# Path to the saved or to save the model:
model_file = 'Model-cheetah_jaguar_Treined.keras'

# Model history
history_path = 'training_history_cheetah_jaguar.pkl'

# Initialize lists to store file paths and labels
filepaths = []
labels = []

# Iterate over folders and files within the training directory
for folder in ['Cheetah', 'Jaguar']:
  folder_path = os.path.join(base_path, folder)
  for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)
    filepaths.append(file_path)
    labels.append(folder)

# Create the TRAINING dataframe
file_path_series = pd.Series(filepaths , name= 'filepath')
Label_path_series = pd.Series(labels , name = 'label')
df_train = pd.concat([file_path_series ,Label_path_series ] , axis = 1)


# Define the base path for the test folder
directory = "jaguar_cheetah/test"

filepath =[]
label = []

folds = os.listdir(directory)

for fold in folds:
  f_path = os.path.join(directory , fold)
  
  imgs = os.listdir(f_path)
  
  for img in imgs:
    
    img_path = os.path.join(f_path , img)
    filepath.append(img_path)
    label.append(fold)
    
# Create the TEST dataframe
file_path_series = pd.Series(filepath , name= 'filepath')
Label_path_series = pd.Series(label , name = 'label')
df_test = pd.concat([file_path_series ,Label_path_series ] , axis = 1)

# Display the first rows of the dataframe for verification
#print(df_train)

# Folders with Training and Test files
data_dir = 'jaguar_cheetah/train'
test_dir = 'jaguar_cheetah/test'

# Image size 256x256
IMAGE_SIZE = (256,256)

Tain | Test

#print('Training Images:')

# Create the TRAIN dataframe
train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.1,
  subset='training',
  seed=123,
  image_size=IMAGE_SIZE,
  batch_size=32)

#Testing Data
#print('Validation Images:')
validation_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir, 
  validation_split=0.1,
  subset='validation',
  seed=123,
  image_size=IMAGE_SIZE,
  batch_size=32)

print('Testing Images:')
test_ds = tf.keras.utils.image_dataset_from_directory(
  test_dir, 
  seed=123,
  image_size=IMAGE_SIZE,
  batch_size=32)

# Extract labels
train_labels = train_ds.class_names
test_labels = test_ds.class_names
validation_labels = validation_ds.class_names

# Encode labels
# Defining the class labels
class_labels = ['CHEETAH', 'JAGUAR'] 

# Instantiate (encoder) LabelEncoder
label_encoder = LabelEncoder()

# Fit the label encoder on the class labels
label_encoder.fit(class_labels)

# Transform the labels for the training dataset
train_labels_encoded = label_encoder.transform(train_labels)

# Transform the labels for the validation dataset
validation_labels_encoded = label_encoder.transform(validation_labels)

# Transform the labels for the testing dataset
test_labels_encoded = label_encoder.transform(test_labels)

# Normalize the pixel values

# Train files 
train_ds = train_ds.map(lambda x, y: (x / 255.0, y))
# Validate files
validation_ds = validation_ds.map(lambda x, y: (x / 255.0, y))
# Test files
test_ds = test_ds.map(lambda x, y: (x / 255.0, y))

#TRAINING VISUALIZATION
#Count the occurrences of each category in the column
count = df_train['label'].value_counts()

# Create a figure with 2 subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 6), facecolor='white')

# Plot a pie chart on the first subplot
palette = sns.color_palette("viridis")
sns.set_palette(palette)
axs[0].pie(count, labels=count.index, autopct='%1.1f%%', startangle=140)
axs[0].set_title('Distribution of Training Categories')

# Plot a bar chart on the second subplot
sns.barplot(x=count.index, y=count.values, ax=axs[1], palette="viridis")
axs[1].set_title('Count of Training Categories')

# Adjust the layout
plt.tight_layout()

# Visualize
plt.show()

# TEST VISUALIZATION
count = df_test['label'].value_counts()

# Create a figure with 2 subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 6), facec...

G
Synthetic Training Data Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Synthetic Training Data Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/synthetic-training-data-market
Explore at:
csv, pdf, pptxAvailable download formats
Dataset updated
Aug 29, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Synthetic Training Data Market Outlook

According to our latest research, the global synthetic training data market size in 2024 is valued at USD 1.45 billion, demonstrating robust momentum as organizations increasingly adopt artificial intelligence and machine learning solutions. The market is projected to grow at a remarkable CAGR of 38.7% from 2025 to 2033, reaching an estimated USD 22.46 billion by 2033. This exponential growth is primarily driven by the rising demand for high-quality, diverse, and privacy-compliant datasets that fuel advanced AI models, as well as the escalating need for scalable data solutions across various industries.

One of the primary growth factors propelling the synthetic training data market is the escalating complexity and diversity of AI and machine learning applications. As organizations strive to develop more accurate and robust AI models, the need for vast amounts of annotated and high-quality training data has surged. Traditional data collection methods are often hampered by privacy concerns, high costs, and time-consuming processes. Synthetic training data, generated through advanced algorithms and simulation tools, offers a compelling alternative by providing scalable, customizable, and bias-mitigated datasets. This enables organizations to accelerate model development, improve performance, and comply with evolving data privacy regulations such as GDPR and CCPA, thus driving widespread adoption across sectors like healthcare, finance, autonomous vehicles, and robotics.

Another significant driver is the increasing adoption of synthetic data for data augmentation and rare event simulation. In sectors such as autonomous vehicles, manufacturing, and robotics, real-world data for edge-case scenarios or rare events is often scarce or difficult to capture. Synthetic training data allows for the generation of these critical scenarios at scale, enabling AI systems to learn and adapt to complex, unpredictable environments. This not only enhances model robustness but also reduces the risk associated with deploying AI in safety-critical applications. The flexibility to generate diverse data types, including images, text, audio, video, and tabular data, further expands the applicability of synthetic data solutions, making them indispensable tools for innovation and competitive advantage.

The synthetic training data market is also experiencing rapid growth due to the heightened focus on data privacy and regulatory compliance. As data protection regulations become more stringent worldwide, organizations face increasing challenges in accessing and utilizing real-world data for AI training without violating user privacy. Synthetic data addresses this challenge by creating realistic yet entirely artificial datasets that preserve the statistical properties of original data without exposing sensitive information. This capability is particularly valuable for industries such as BFSI, healthcare, and government, where data sensitivity and compliance requirements are paramount. As a result, the adoption of synthetic training data is expected to accelerate further as organizations seek to balance innovation with ethical and legal responsibilities.

From a regional perspective, North America currently leads the synthetic training data market, driven by the presence of major technology companies, robust R&D investments, and early adoption of AI technologies. However, the Asia Pacific region is anticipated to witness the highest growth rate during the forecast period, fueled by expanding AI initiatives, government support, and the rapid digital transformation of industries. Europe is also emerging as a key market, particularly in sectors where data privacy and regulatory compliance are critical. Latin America and the Middle East & Africa are gradually increasing their market share as awareness and adoption of synthetic data solutions grow. Overall, the global landscape is characterized by dynamic regional trends, with each region contributing uniquely to the marketÂ’s expansion.

The introduction of a Synthetic Data Generation Engine has revolutionized the way organizations approach data creation and management. This engine leverages cutting-edge algorithms to produce high-quality synthetic datasets that mirror real-world data without compromising privacy. By sim
w
Dataset of books called 101 ways to make training active
workwithdata.com
Updated Apr 17, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called 101 ways to make training active [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=101+ways+to+make+training+active
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is 101 ways to make training active. It features 7 columns including author, publication date, language, and book publisher.
w
Dataset of book subjects that contain Training in motion : how to use...
workwithdata.com
Updated Nov 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2024). Dataset of book subjects that contain Training in motion : how to use movement to create engaging and effective learning [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Training+in+motion+:+how+to+use+movement+to+create+engaging+and+effective+learning&j=1&j0=books
Explore at:
Dataset updated
Nov 7, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about book subjects. It has 3 rows and is filtered where the books is Training in motion : how to use movement to create engaging and effective learning. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
F
Japanese Chain of Thought Prompt & Response Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Japanese Chain of Thought Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/japanese-chain-of-thought-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Welcome to the Japanese Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.
Dataset Content
This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Japanese language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more.
Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Japanese people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references.
Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.
Prompt Diversity
To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others.
These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.
Response Formats
To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale aids the language model in building reasoning process for complex questions.
These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
Data Format and Annotation Details
This fully labeled Japanese Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.
Quality and Accuracy
Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
The Japanese version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.
License
The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Japanese Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
Rescaled Fashion-MNIST dataset
zenodo.org
Updated Jun 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15187793
Dataset updated
Jun 27, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
Time period covered
Apr 10, 2025
Description
Motivation

The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

The Rescaled Fashion-MNIST dataset was introduced in the paper:

[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

with a pre-print available at arXiv:

[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

Access and rights

The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

and also for this new rescaled version, using the reference [1] above.

The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

The dataset

The Rescaled FashionMNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original FashionMNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72x72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

The h5 files containing the dataset

The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

Additionally, for the Rescaled FashionMNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^k/4, with k being integers in the range [-4, 4]:

fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5

These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].

Instructions for loading the data set

The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

The training dataset can be loaded in Python as:

with h5py.File(`

x_train = np.array( f["/x_train"], dtype=np.float32)
x_val = np.array( f["/x_val"], dtype=np.float32)
x_test = np.array( f["/x_test"], dtype=np.float32)
y_train = np.array( f["/y_train"], dtype=np.int32)
y_val = np.array( f["/y_val"], dtype=np.int32)
y_test = np.array( f["/y_test"], dtype=np.int32)

We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))

The test datasets can be loaded in Python as:

with h5py.File(`

x_test = np.array( f["/x_test"], dtype=np.float32)
y_test = np.array( f["/y_test"], dtype=np.int32)

The test datasets can be loaded in Matlab as:

x_test = h5read(`

The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.

Facebook

Twitter

Click to copy link

Link copied

Cite

Vinothkumar Sekar (2025). SVG Code Generation Sample Training Data [Dataset]. https://www.kaggle.com/datasets/vinothkumarsekar89/svg-generation-sample-training-data

SVG Code Generation Sample Training Data

Sample Training Data for SVG Code Generation

Explore at:

zip(193477 bytes)Available download formats

Dataset updated

May 3, 2025

Authors

Vinothkumar Sekar

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

 
prompt=f""" I am participating in an SVG code generation competition.
  
   The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
  
   - Descriptions are generic and do not contain brand names, trademarks, or personal names.
   - No descriptions include people, even in generic terms.
   - Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
   - Categories cover various domains, with some overlap between public and private test sets.
  
   To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
  
   Requirements:
   - Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
   - Ensure **diversity and creativity** across topics.
   - **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
   - Avoid duplication or overly similar phrasing.
  
   Example topics:
                 a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid,  purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet,  a snowy plain, black and white checkered pants,  a starlit night over snow-covered peaks, khaki triangles and azure crescents,  a maroon dodecahedron interwoven with teal threads.
  
   Please return the 100 topics in csv format.
   """

In the second step, SVG code is generated by prompting the GPT-4o model. The following prompt is used to query the model to generate svg.

 
  prompt = f"""
      Generate SVG code to visually represent the following text description, while respecting the given constraints.
      
      Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
      Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
      

      Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. 
      Focus on a clear and concise representation of the input description within the given limitations. 
      Always give the complete SVG code with nothing omitted. Never use an ellipsis.

      The code is scored based on similarity to the description, Visual question anwering and aesthetic components.
      Please generate a detailed svg code accordingly.

      input description: {text}
      """

A dataset with ~50,000 samples for SVG code generation is publicly available at: https://huggingface.co/datasets/vinoku89/svg-code-generation

Clear search

Close search

Google apps

Main menu

SVG Code Generation Sample Training Data

The global AI Training Dataset Market size will be USD 2962.4 million in...

Make Data Count training labels

Machine Learning Dataset

Trainings Dataset

Training Datasets

Zzzs: Lightweight training dataset + target

Updated: with the corrected Offset Events (18 September 2023)

Updated: with the corrected Offset Events (20 September 2023)

`Zzzs_train.parquet`

`Zzzs_train_multi.parquet`

Machine learning algorithm validation with a limited sample size

Training Data For Paper Dataset

Training Data For Paper

TREC 2022 Deep Learning test collection

Training Data (add Augmented) Dataset

Training Data (add Augmented)

Training dataset

Customer Service Call Dataset [Multisector] – Annotated support transcripts...

Astral-1.5-Post-Training-Dataset-SFT

Astral-Post-Training-Dataset

Images used for training, validation, and testing.

Synthetic Training Data Market Research Report 2033

Synthetic Training Data Market Outlook

Dataset of books called 101 ways to make training active

Dataset of book subjects that contain Training in motion : how to use...

Japanese Chain of Thought Prompt & Response Dataset

Dataset Content

Prompt Diversity

Response Formats

Data Format and Annotation Details

Rescaled Fashion-MNIST dataset

Motivation

Access and rights

The dataset

The h5 files containing the dataset

Instructions for loading the data set

SVG Code Generation Sample Training Data

Sample Training Data for SVG Code Generation