100+ datasets found

h
dataset-card-example
huggingface.co
Updated Sep 28, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Templates (2023). dataset-card-example [Dataset]. https://huggingface.co/datasets/templates/dataset-card-example
Explore at:
Dataset updated
Sep 28, 2023
Dataset authored and provided by
Templates
Description
Dataset Card for Dataset Name

This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

Dataset Details Dataset Description

Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed]

Dataset Sources [optional]… See the full description on the dataset page: https://huggingface.co/datasets/templates/dataset-card-example.
Sample Leads Dataset
kaggle.com
Updated Jun 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ThatSean (2022). Sample Leads Dataset [Dataset]. https://www.kaggle.com/datasets/thatsean/sample-leads-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 24, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
ThatSean
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is based on the Sample Leads Dataset and is intended to allow some simple filtering by lead source. I had modified this dataset to support an upcoming Towards Data Science article walking through the process. Link to be shared once published.
Language Generation Dataset: 200M Samples
kaggle.com
zip
Updated Sep 7, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abhishek Chatterjee (2019). Language Generation Dataset: 200M Samples [Dataset]. https://www.kaggle.com/datasets/imdeepmind/language-generation-dataset-200m-samples
Explore at:
zip(3416608411 bytes)Available download formats
Dataset updated
Sep 7, 2019
Authors
Abhishek Chatterjee
Description
Context

Amazon Customer Reviews Dataset is a dataset of user-generated product reviews on the shopping website Amazon. It contains over 130 million product reviews.

This dataset contains a tiny fraction of that dataset processed and prepared specifically for language generation.

To know how the dataset is prepared, then please check the GitHub repository for this dataset. https://github.com/imdeepmind/AmazonReview-LanguageGenerationDataset

Content

The dataset is stored in an SQLite database. The database contains one table called reviews. This table contains two columns sequence and next.

The sequence column contains sequences of characters. In this dataset, each sequence of 40 characters long.

The next column contains the next character after the sequence.

There are about 200 million samples are in the dataset.

Acknowledgements

Thanks to Amazon for making this awesome dataset. Here is the link for the dataset: https://s3.amazonaws.com/amazon-reviews-pds/readme.html

Inspiration

This dataset can be used for Language Generation. As it contains 200 million samples, complex Deep Learning models can be trained on this data.
h
cot-example-dataset
huggingface.co
Updated Nov 24, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daniel Vila (2024). cot-example-dataset [Dataset]. https://huggingface.co/datasets/dvilasuero/cot-example-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 24, 2024
Authors
Daniel Vila
Description
Dataset Card for cot-example-dataset

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dvilasuero/cot-example-dataset/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/dvilasuero/cot-example-dataset.
Best Books Ever Dataset
zenodo.org
csv
Updated Nov 10, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4265096
Dataset updated
Nov 10, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The dataset has been collected in the frame of the Prac1 of the subject Tipology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the larges list on the site).

Original code used to retrieve the dataset can be found on github repository: github.com/scostap/goodreads_bbe_dataset

The data was retrieved in two sets, the first 30000 books and then the remainig 22478. Dates were not parsed and reformated on the second chunk so publishDate and firstPublishDate are representet in a mm/dd/yyyy format for the first 30000 records and Month Day Year for the rest.

Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.

The 25 fields of the dataset are:

| Attributes | Definition | Completeness | | ------------- | ------------- | ------------- | | bookId | Book Identifier as in goodreads.com | 100 | | title | Book title | 100 | | series | Series Name | 45 | | author | Book's Author | 100 | | rating | Global goodreads rating | 100 | | description | Book's description | 97 | | language | Book's language | 93 | | isbn | Book's ISBN | 92 | | genres | Book's genres | 91 | | characters | Main characters | 26 | | bookFormat | Type of binding | 97 | | edition | Type of edition (ex. Anniversary Edition) | 9 | | pages | Number of pages | 96 | | publisher | Editorial | 93 | | publishDate | publication date | 98 | | firstPublishDate | Publication date of first edition | 59 | | awards | List of awards | 20 | | numRatings | Number of total ratings | 100 | | ratingsByStars | Number of ratings by stars | 97 | | likedPercent | Derived field, percent of ratings over 2 starts (as in GoodReads) | 99 | | setting | Story setting | 22 | | coverImg | URL to cover image | 99 | | bbeScore | Score in Best Books Ever list | 100 | | bbeVotes | Number of votes in Best Books Ever list | 100 | | price | Book's price (extracted from Iberlibro) | 73 |
w
Synthetic Data for an Imaginary Country, Sample, 2023 - World
microdata.worldbank.org
Updated Jul 7, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
Explore at:
Dataset updated
Jul 7, 2023
Dataset authored and provided by
Development Data Group, Data Analytics Unit
Time period covered
2023
Area covered
World, World
Description
Abstract

The dataset is a relational dataset of 8,000 households households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other one with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.

The full-population dataset (with about 10 million individuals) is also distributed as open data.

Geographic coverage

The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

Analysis unit

Household, Individual

Universe

The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

Kind of data

ssd

Sampling procedure

The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.

Mode of data collection

other

Research instrument

The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

Cleaning operations

The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observation were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to result in the distributed data files.

Response rate

This is a synthetic dataset; the "response rate" is 100%.
c
Netflix movies and tv shows sample dataset
crawlfeeds.com
csv, zip
Updated Apr 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Netflix movies and tv shows sample dataset [Dataset]. https://crawlfeeds.com/datasets/netflix-movies-and-tv-shows-sample-dataset
Explore at:
zip, csvAvailable download formats
Dataset updated
Apr 27, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
Netflix is a streaming service and production company. Crawl feeds team extracted more than 100 records from netflix for quality analysis purposes. Get in touch with crawl feeds team for complete dataset. Last extracted on 5 mar 2022
h
Data from: example-dataset
huggingface.co
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shalini Sundaram, example-dataset [Dataset]. https://huggingface.co/datasets/CoffeeDoodle/example-dataset
Explore at:
Authors
Shalini Sundaram
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for example-dataset

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/CoffeeDoodle/example-dataset/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/CoffeeDoodle/example-dataset.
P
RealNews Dataset
paperswithcode.com
opendatalab.com
Updated Jan 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rowan Zellers; Ari Holtzman; Hannah Rashkin; Yonatan Bisk; Ali Farhadi; Franziska Roesner; Yejin Choi (2023). RealNews Dataset [Dataset]. https://paperswithcode.com/dataset/realnews
Explore at:
Dataset updated
Jan 30, 2023
Authors
Rowan Zellers; Ari Holtzman; Hannah Rashkin; Yonatan Bisk; Ali Farhadi; Franziska Roesner; Yejin Choi
Description
RealNews is a large corpus of news articles from Common Crawl. Data is scraped from Common Crawl, limited to the 5000 news domains indexed by Google News. The authors used the Newspaper Python library to extract the body and metadata from each article. News from Common Crawl dumps from December 2016 through March 2019 were used as training data; articles published in April 2019 from the April 2019 dump were used for evaluation. After deduplication, RealNews is 120 gigabytes without compression.
Single Layer Perceptron Dataset(Small)
kaggle.com
Updated Apr 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ABIR HASAN 1703100 (2023). Single Layer Perceptron Dataset(Small) [Dataset]. http://doi.org/10.34740/kaggle/ds/3154953
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/ds/3154953
Dataset updated
Apr 19, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
ABIR HASAN 1703100
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
We have chosen a simple numpy array to implement the single layer perceptron algorithm. We have considered a total of 13 samples with three features and one class label. The class label is defined in binary 0 and 1. The training dataset contains eight data samples, while the validation dataset contains five. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F9905947%2F7dc95405d7b0696adeb1c90f1cf8682b%2Ftraining%20data.jpg?generation=1681929479850322&alt=media" alt=""> Fig 1.1: Train Data https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F9905947%2Fe83b9677df9780414f25471c72ead9ca%2Ftest%20data.jpg?generation=1681929512768929&alt=media" alt=""> Fig 1.2: Test Data Here the first value for every sample is considered 1, as the algorithm says the value of x0 should always be 1. But even without this characteristic, our code will give the correct output.
P
Meta-Dataset Dataset
paperswithcode.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eleni Triantafillou; Tyler Zhu; Vincent Dumoulin; Pascal Lamblin; Utku Evci; Kelvin Xu; Ross Goroshin; Carles Gelada; Kevin Swersky; Pierre-Antoine Manzagol; Hugo Larochelle, Meta-Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/meta-dataset
Explore at:
Authors
Eleni Triantafillou; Tyler Zhu; Vincent Dumoulin; Pascal Lamblin; Utku Evci; Kelvin Xu; Ross Goroshin; Carles Gelada; Kevin Swersky; Pierre-Antoine Manzagol; Hugo Larochelle
Description
The Meta-Dataset benchmark is a large few-shot learning benchmark and consists of multiple datasets of different data distributions. It does not restrict few-shot tasks to have fixed ways and shots, thus representing a more realistic scenario. It consists of 10 datasets from diverse domains:

ILSVRC-2012 (the ImageNet dataset, consisting of natural images with 1000 categories) Omniglot (hand-written characters, 1623 classes) Aircraft (dataset of aircraft images, 100 classes) CUB-200-2011 (dataset of Birds, 200 classes) Describable Textures (different kinds of texture images with 43 categories) Quick Draw (black and white sketches of 345 different categories) Fungi (a large dataset of mushrooms with 1500 categories) VGG Flower (dataset of flower images with 102 categories), Traffic Signs (German traffic sign images with 43 classes) MSCOCO (images collected from Flickr, 80 classes).

All datasets except Traffic signs and MSCOCO have a training, validation and test split (proportioned roughly into 70%, 15%, 15%). The datasets Traffic Signs and MSCOCO are reserved for testing only.
h
AirfRANS_clipped
huggingface.co
Updated May 5, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PLAID-datasets (2025). AirfRANS_clipped [Dataset]. https://huggingface.co/datasets/PLAID-datasets/AirfRANS_clipped
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 5, 2025
Dataset authored and provided by
PLAID-datasets
License
https://choosealicense.com/licenses/odbl/https://choosealicense.com/licenses/odbl/
Description
Dataset Card

This dataset contains a single huggingface split, named 'all_samples'. The samples contains a single huggingface feature, named called "sample". Samples are instances of plaid.containers.sample.Sample. Mesh objects included in samples follow the CGNS standard, and can be converted in Muscat.Containers.Mesh.Mesh. Example of commands: import pickle from datasets import load_dataset from plaid.containers.sample import Sample

Load the dataset

dataset =… See the full description on the dataset page: https://huggingface.co/datasets/PLAID-datasets/AirfRANS_clipped.
i
Network dataset
ieee-dataport.org
Updated Jul 21, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Theyazn Aldhyani (2020). Network dataset [Dataset]. https://ieee-dataport.org/documents/network-dataset
Explore at:
Dataset updated
Jul 21, 2020
Authors
Theyazn Aldhyani
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Intelligent Hybrid model to Enhance Time Series Models for Predicting Network Traffic
LinkedIn Datasets
brightdata.com
.json, .csv, .xlsx
Updated Dec 17, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2021). LinkedIn Datasets [Dataset]. https://brightdata.com/products/datasets/linkedin
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Dec 17, 2021
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions. Dataset Features

Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month. Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records. Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and the job market dynamics.

Customizable Subsets for Specific Needs Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications. Popular Use Cases

Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data. Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities. Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies. Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis. AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.

Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.

SVG Code Generation Sample Training Data

kaggle.com

Updated May 3, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Vinothkumar Sekar (2025). SVG Code Generation Sample Training Data [Dataset]. https://www.kaggle.com/datasets/vinothkumarsekar89/svg-generation-sample-training-data

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

May 3, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Vinothkumar Sekar

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.

The dataset is generated in two steps using the GPT-4o model. - In the first step, topic descriptions relevant to the competition are generated using a specific prompt. By running this prompt multiple times, over 3,000 descriptions were collected.

 
prompt=f""" I am participating in an SVG code generation competition.
  
   The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
  
   - Descriptions are generic and do not contain brand names, trademarks, or personal names.
   - No descriptions include people, even in generic terms.
   - Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
   - Categories cover various domains, with some overlap between public and private test sets.
  
   To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
  
   Requirements:
   - Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
   - Ensure **diversity and creativity** across topics.
   - **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
   - Avoid duplication or overly similar phrasing.
  
   Example topics:
                 a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid,  purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet,  a snowy plain, black and white checkered pants,  a starlit night over snow-covered peaks, khaki triangles and azure crescents,  a maroon dodecahedron interwoven with teal threads.
  
   Please return the 100 topics in csv format.
   """

In the second step, SVG code is generated by prompting the GPT-4o model. The following prompt is used to query the model to generate svg.

 
  prompt = f"""
      Generate SVG code to visually represent the following text description, while respecting the given constraints.
      
      Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
      Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
      

      Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. 
      Focus on a clear and concise representation of the input description within the given limitations. 
      Always give the complete SVG code with nothing omitted. Never use an ellipsis.

      The code is scored based on similarity to the description, Visual question anwering and aesthetic components.
      Please generate a detailed svg code accordingly.

      input description: {text}
      """

The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.

P
DataSeeds.AI-Sample-Dataset-DSD Dataset
paperswithcode.com
Updated Jun 5, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). DataSeeds.AI-Sample-Dataset-DSD Dataset [Dataset]. https://paperswithcode.com/dataset/dataseeds-ai-sample-dataset-dsd
Explore at:
Dataset updated
Jun 5, 2025
Description
Dataset Summary The DataSeeds.AI Sample Dataset (DSD) is a high-fidelity, human-curated computer vision-ready dataset comprised of 7,772 peer-ranked, fully annotated photographic images, 350,000+ words of descriptive text, and comprehensive metadata. While the DSD is being released under an open source license, a sister dataset of over 10,000 fully annotated and segmented images is available for immediate commercial licensing, and the broader GuruShots ecosystem contains over 100 million images in its catalog.

Each image includes multi-tier human annotations and semantic segmentation masks. Generously contributed to the community by the GuruShots photography platform, where users engage in themed competitions, the DSD uniquely captures aesthetic preference signals and high-quality technical metadata (EXIF) across an expansive diversity of photographic styles, camera types, and subject matter. The dataset is optimized for fine-tuning and evaluating multimodal vision-language models, especially in scene description and stylistic comprehension tasks.

Technical Report - Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery Github Repo - Access the complete weights and code which were used to evaluate the DSD -- https://github.com/DataSeeds-ai/DSD-finetune-blip-llava This dataset is ready for commercial/non-commercial use. Dataset Structure Size: 7,772 images (7,010 train, 762 validation) Format: Apache Parquet files for metadata, with images in JPG format Total Size: ~4.1GB Languages: English (annotations) Annotation Quality: All annotations were verified through a multi-tier human-in-the-loop process
c
Walmart Dataset
crawlfeeds.com
csv, zip
Updated Apr 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Walmart Dataset [Dataset]. https://crawlfeeds.com/datasets/walmart-dataset
Explore at:
csv, zipAvailable download formats
Dataset updated
Apr 26, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
Walmart products sample dataset having 1000+ records in CSV format. Download monthly dataset for walmart data and it having around 100K+ records.

Get 50% discount for all datasets. Link
Iris Species
kaggle.com
zip
Updated Sep 27, 2016
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UCI Machine Learning (2016). Iris Species [Dataset]. https://www.kaggle.com/datasets/uciml/iris
Explore at:
zip(3687 bytes)Available download formats
Dataset updated
Sep 27, 2016
Dataset authored and provided by
UCI Machine Learning
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this dataset are:

Id

SepalLengthCm

SepalWidthCm

PetalLengthCm

PetalWidthCm

Species
d
SAMPLE DATASET with lots of files
staging-elsevier.digitalcommonsdata.com
Updated Oct 3, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FirstName+36125284 LastName+36125284 (2019). SAMPLE DATASET with lots of files [Dataset]. http://doi.org/10.1234/tgpfnk7zyt.37
Explore at:
Unique identifier
https://doi.org/10.1234/tgpfnk7zyt.37
Dataset updated
Oct 3, 2019
Authors
FirstName+36125284 LastName+36125284
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This is the description of a dataset. The description can be quite long and this can look strange in the public dataset page. In the drafts page there is a scrollbar in the scrollbar, why not in the public page? Well, the public page needs to support viewing on a mobile phone and this can make scroll bars within scrollbars within scrollbars a little difficult. So maybe it’ll be better to try using ellipses. Additionally only adding a description does not make it a new version. .. This is the description of a dataset. The description can be quite long and this can look strange in the public dataset page. In the drafts page there is a scrollbar in the scrollbar, why not in the public page? Well, the public page needs to support viewing on a mobile phone and this can make scroll bars within scrollbars within scrollbars a little difficult. So maybe it’ll be better to try using ellipses. Additionally only adding a description does not make it a new version.

This is the description of a dataset. The description can be quite long and this can look strange in the public dataset page. In the drafts page there is a scrollbar in the scrollbar, why not in the public page? Well, the public page needs to support viewing on a mobile phone and this can make scroll bars within scrollbars within scrollbars a little difficult. So maybe it’ll be better to try using ellipses. Additionally only adding a description does not make it a new version. This is the description of a dataset. The description can be quite long and this can look strange in the public dataset page. In the drafts page there is a scrollbar in the scrollbar, why not in the public page? Well, the public page needs to support viewing on a mobile phone and this can make scroll bars within scrollbars within scrollbars a little difficult. So maybe it’ll be better to try using ellipses. Additionally only adding a description does not make it a new version.
Z
Architectural interior styles sample Dataset
data.niaid.nih.gov
Updated Sep 20, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marcin Kostrzewski (2023). Architectural interior styles sample Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8360664
Explore at:
Dataset updated
Sep 20, 2023
Dataset provided by
Michał Ulaniuk
Adam Wojdyła
Marcin Kostrzewski
Description
The dataset contains around 1600 images depicting a particular interior style. The photos belong to one of eight classes: rustic, industrial, classic, vintage, modernist, art-deco, scandinavian, glamour.

The source of the dataset is Houzz.com. The images were downloaded from the website and grouped into folders.

You may use the dataset under the following terms:

Research and Development Purposes Only: Access to the dataset hosted on Zenodo is granted exclusively for research and development purposes. Users are required to clearly state their intention for using the dataset in this context.

Acknowledgment and Citation: Users must commit to providing proper acknowledgment and citation of the dataset in their research or development work. They should include the dataset's DOI and a reference to the original source in all publications, presentations, or reports derived from the dataset.

No Commercial Use: The dataset is not to be used for any commercial, for-profit, or financially exploitative purposes. Users must refrain from any activities that generate direct monetary gains from the dataset.

Ethical Use: Users are required to use the dataset in a manner consistent with ethical research practices. This includes respecting privacy, complying with relevant laws and regulations, and ensuring that the use of the data does not harm individuals, groups, or communities.

No Redistribution: Users are strictly prohibited from redistributing the dataset to third parties without prior written consent from the dataset owner. Any sharing of the dataset should be done solely for collaboration within the context of the research or development project.

Non-Discrimination: Access to the dataset should not be denied or granted based on factors such as race, ethnicity, gender, religion, nationality, or any other discriminatory criteria. All requests for access will be evaluated solely based on the justification provided by the user.

No Charge for Access: Users will not be charged any fees for accessing the data hosted on Zenodo. Access is provided free of charge, and users should not be required to make any payments to obtain or use the dataset.

Compliance with Zenodo's Terms of Use: Users are expected to comply with Zenodo's terms of use, including any additional terms or policies specific to the platform

Facebook

Twitter

Click to copy link

Link copied

Cite

Templates (2023). dataset-card-example [Dataset]. https://huggingface.co/datasets/templates/dataset-card-example

dataset-card-example

templates/dataset-card-example

Explore at:

Dataset updated

Sep 28, 2023

Dataset authored and provided by

Templates

Description

Dataset Card for Dataset Name

This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

  Dataset Details







  Dataset Description

Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed]

  Dataset Sources [optional]… See the full description on the dataset page: https://huggingface.co/datasets/templates/dataset-card-example.

Clear search

Close search

Google apps

Main menu

dataset-card-example

Sample Leads Dataset

Language Generation Dataset: 200M Samples

Context

Content

Acknowledgements

Inspiration

cot-example-dataset

Best Books Ever Dataset

Synthetic Data for an Imaginary Country, Sample, 2023 - World

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Mode of data collection

Research instrument

Cleaning operations

Response rate

Netflix movies and tv shows sample dataset

Data from: example-dataset

RealNews Dataset

Single Layer Perceptron Dataset(Small)

Meta-Dataset Dataset

AirfRANS_clipped

Load the dataset

Network dataset

LinkedIn Datasets

SVG Code Generation Sample Training Data

DataSeeds.AI-Sample-Dataset-DSD Dataset

Walmart Dataset

Iris Species

SAMPLE DATASET with lots of files

Architectural interior styles sample Dataset

dataset-card-exampleSee More Versions

templates/dataset-card-example

dataset-card-example