100+ datasets found
  1. example-generate-preference-dataset

    • huggingface.co
    Updated Aug 23, 2024
    + more versions
    Cite
    distilabel-internal-testing (2024). example-generate-preference-dataset [Dataset]. https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 23, 2024
    Dataset authored and provided by
    distilabel-internal-testing
    Description

    Dataset Card for example-preference-dataset

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/example-preference-dataset/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/distilabel-internal-testing/example-generate-preference-dataset.
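    For a quick look at the records, the dataset can also be loaded directly from the Hub with the Hugging Face datasets library; a minimal sketch (the "train" split name is an assumption):

      # Minimal sketch: load the preference dataset from the Hugging Face Hub.
      # Assumes the `datasets` library is installed and that a "train" split exists.
      from datasets import load_dataset

      ds = load_dataset(
          "distilabel-internal-testing/example-generate-preference-dataset",
          split="train",
      )
      print(ds)        # number of rows and column names
      print(ds[0])     # first preference record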

  2. SVG Code Generation Sample Training Data

    • kaggle.com
    zip
    Updated May 3, 2025
    Cite
    Vinothkumar Sekar (2025). SVG Code Generation Sample Training Data [Dataset]. https://www.kaggle.com/datasets/vinothkumarsekar89/svg-generation-sample-training-data
    Explore at:
    Available download formats: zip (193477 bytes)
    Dataset updated
    May 3, 2025
    Authors
    Vinothkumar Sekar
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.

    The dataset is generated in two steps using the GPT-4o model.

    • In the first step, topic descriptions relevant to the competition are generated using a specific prompt. By running this prompt multiple times, over 3,000 descriptions were collected.

     
    prompt=f""" I am participating in an SVG code generation competition.
      
       The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
      
       - Descriptions are generic and do not contain brand names, trademarks, or personal names.
       - No descriptions include people, even in generic terms.
       - Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
       - Categories cover various domains, with some overlap between public and private test sets.
      
       To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
      
       Requirements:
       - Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
       - Ensure **diversity and creativity** across topics.
       - **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
       - Avoid duplication or overly similar phrasing.
      
       Example topics:
                     a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid,  purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet,  a snowy plain, black and white checkered pants,  a starlit night over snow-covered peaks, khaki triangles and azure crescents,  a maroon dodecahedron interwoven with teal threads.
      
       Please return the 100 topics in csv format.
       """
     
    • In the second step, SVG code is generated by prompting the GPT-4o model. The following prompt is used to query the model to generate the SVG.
     
      prompt = f"""
          Generate SVG code to visually represent the following text description, while respecting the given constraints.
          
          Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
          Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
          
    
          Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. 
          Focus on a clear and concise representation of the input description within the given limitations. 
          Always give the complete SVG code with nothing omitted. Never use an ellipsis.
    
          The code is scored based on similarity to the description, visual question answering, and aesthetic components.
          Please generate a detailed svg code accordingly.
    
          input description: {text}
          """
     

    The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.

    A dataset with ~50,000 samples for SVG code generation is publicly available at: https://huggingface.co/datasets/vinoku89/svg-code-generation
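    The scoring-and-filtering step described above can be sketched as follows; this is an illustrative reconstruction that assumes cairosvg for rasterisation and the public google/siglip-base-patch16-224 checkpoint, not the author's exact sanitization class or model choice:

      # Sketch: text-to-SVG similarity filtering with SigLIP (threshold 0.5).
      import io

      import cairosvg
      import torch
      from PIL import Image
      from transformers import AutoModel, AutoProcessor

      processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
      model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

      def siglip_score(description: str, svg_code: str) -> float:
          """Return the SigLIP image-text matching probability for one SVG."""
          png_bytes = cairosvg.svg2png(bytestring=svg_code.encode("utf-8"))
          image = Image.open(io.BytesIO(png_bytes)).convert("RGB")
          inputs = processor(text=[description], images=image,
                             padding="max_length", return_tensors="pt")
          with torch.no_grad():
              logits = model(**inputs).logits_per_image   # shape (1, 1)
          return torch.sigmoid(logits)[0, 0].item()

      def keep_svg(description: str, svg_code: str, threshold: float = 0.5) -> bool:
          # Only SVGs scoring above the threshold are kept, as described above.
          return siglip_score(description, svg_code) > threshold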

  3. VegeNet - Image datasets and Codes

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 27, 2022
    Cite
    Jo Yen Tan; Jo Yen Tan (2022). VegeNet - Image datasets and Codes [Dataset]. http://doi.org/10.5281/zenodo.7254508
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 27, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jo Yen Tan; Jo Yen Tan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files); a minimal loading and splitting sketch follows the list below.

    Image datasets:

    1. vege_original : Images of vegetables captured manually in data acquisition stage
    2. vege_cropped_renamed : Images in (1) cropped to remove background areas and image labels renamed
    3. non-vege images : Images of non-vegetable foods for CNN network to recognize other-than-vegetable foods
    4. food_image_dataset : Complete set of vege (2) and non-vege (3) images for architecture building.
    5. food_image_dataset_split : Image dataset (4) split into train and test sets
    6. process : Images created when cropping (pre-processing step) to create dataset (2).
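    A minimal sketch of loading such an image folder and splitting it into train and test sets (the folder name, 80/20 ratio, and use of torchvision are assumptions; the authors ship their own preprocessing code):

      # Sketch: load the food_image_dataset folder and split it 80/20,
      # mirroring items 4 and 5 above. Paths and ratios are placeholders.
      import torch
      from torchvision import datasets, transforms

      transform = transforms.Compose([
          transforms.Resize((224, 224)),
          transforms.ToTensor(),
      ])

      full_ds = datasets.ImageFolder("food_image_dataset", transform=transform)
      n_train = int(0.8 * len(full_ds))
      train_ds, test_ds = torch.utils.data.random_split(
          full_ds, [n_train, len(full_ds) - n_train],
          generator=torch.Generator().manual_seed(0),
      )
      print(len(train_ds), len(test_ds), full_ds.classes)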
  4. sample-create-dataset

    • huggingface.co
    Updated Nov 5, 2025
    + more versions
    Cite
    Noel JP (2025). sample-create-dataset [Dataset]. https://huggingface.co/datasets/Noel-997/sample-create-dataset
    Explore at:
    Dataset updated
    Nov 5, 2025
    Authors
    Noel JP
    Description

    Noel-997/sample-create-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. Acoustic Country Blues Dataset for AI-Generated Music (Machine Learning (ML) Data)

    • datarade.ai
    .json, .csv, .xls
    Updated Mar 19, 2024
    Cite
    Rightsify (2024). Acoustic Country Blues Dataset for AI-Generated Music (Machine Learning (ML) Data) [Dataset]. https://datarade.ai/data-products/acoustic-country-blues-dataset-for-ai-generated-music-rightsify
    Explore at:
    Available download formats: .json, .csv, .xls
    Dataset updated
    Mar 19, 2024
    Dataset authored and provided by
    Rightsify
    Area covered
    Madagascar, Senegal, Thailand, Colombia, Grenada, Afghanistan, Russian Federation, Uganda, Falkland Islands (Malvinas), Paraguay
    Description

    "Acoustic Country Blues" beckons you to the heart of a rustic musical landscape, where soulful storytelling meets the raw authenticity of country blues. This meticulously curated AI-generated music dataset captures the essence of a bygone era, encapsulating the heartfelt strums, intricate fingerpicking, and emotive chord progressions that define Acoustic Country Blues.

    With an array of carefully crafted samples, this dataset provides an expansive canvas for machine learning applications, allowing the exploration and reimagining of the timeless allure of this genre through a modern, digital lens.

    Dive into the timeless beauty of acoustic guitar strums, the haunting notes of slide guitars, and the resonant warmth of fingerpicked strings.

    This exceptional AI Music Dataset encompasses an array of vital data categories, contributing to its excellence. It includes Machine Learning (ML) Data, serving as the foundation for training intricate algorithms that generate musical pieces, and Music Data, offering a rich collection of melodies, harmonies, and rhythms that fuel the AI's creative process. AI & ML Training Data continuously hones the dataset's capabilities through iterative learning. Copyright Data ensures the dataset's compliance with legal standards, while Intellectual Property Data safeguards the innovative techniques embedded within, fostering a harmonious blend of technological advancement and artistic innovation.

    This dataset can also be useful as Advertising Data to generate music tailored to resonate with specific target audiences, enhancing the effectiveness of advertisements by evoking emotions and capturing attention. It can be a valuable source of Social Media Data as well. Users can post, share, and interact with the music, leading to increased user engagement and virality. The music's novelty and uniqueness can spark discussions, debates, and trends across social media communities, amplifying its reach and impact.

  6. Trap Dataset for AI-Generated Music (Machine Learning (ML) Data)

    • datarade.ai
    .json, .csv, .xls
    Updated Feb 15, 2024
    Cite
    Rightsify (2024). Trap Dataset for AI-Generated Music (Machine Learning (ML) Data) [Dataset]. https://datarade.ai/data-products/trap-dataset-for-ai-generated-music-machine-learning-ml-data-rightsify
    Explore at:
    Available download formats: .json, .csv, .xls
    Dataset updated
    Feb 15, 2024
    Dataset authored and provided by
    Rightsify
    Area covered
    British Indian Ocean Territory, Oman, Niger, Peru, Ireland, Portugal, Malta, Switzerland, Paraguay, Philippines
    Description

    The Trap dataset is a structured collection of audio files with rich metadata, designed for a variety of machine learning applications. This dataset captures the evolution of trap music, which began in the late 1990s in Southern US hip-hop culture. Trap, defined by powerful bass, fast hi-hats, and gritty storylines based on street life, has grown into a global craze.

    The dataset contains a wide range of information, including chords, instrumentation, key, tempo, and timestamps, allowing for subtle exploration in generative AI music, Music Information Retrieval (MIR), and source separation applications. This resource provides a unique opportunity to train models with a thorough understanding of the trap's distinguishing features. Notably, the drum and bass instrumentation in trap is critical to its trademark sound. The genre's rhythmic foundation is defined by its unrelenting, booming bass and complicated hi-hat patterns, which have left an indelible influence on current music.

    Explore the intricate elements of trap music, and use our dataset to improve your machine learning applications. Whether you're creating generative compositions or fine-tuning source separation methods, this dataset provides the foundation for an intensive investigation of the genre's machine-readable details. Understand the rhythmic complexity of trap's drum and bass instrumentation, taking your studies to the heart of one of today's most influential musical genres.

  7. Synthetic Design-Related Data Generated by LLMs

    • figshare.com
    txt
    Updated Aug 24, 2024
    + more versions
    Cite
    Yunjian Qiu (2024). Synthetic Design-Related Data Generated by LLMs [Dataset]. http://doi.org/10.6084/m9.figshare.26122543.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 24, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yunjian Qiu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To produce a domain-specific dataset, GPT-4 is assigned the role of an engineering design expert. Furthermore, the ontology, which signifies the design process and design entities, is integrated into the prompts to label the synthetic dataset and enhance the GPT model's grasp of the conceptual design process and domain-specific knowledge. Additionally, the CoT prompting technique compels the GPT models to clarify their reasoning process, thereby fostering a deeper understanding of the tasks.

  8. ai generated faces

    • kaggle.com
    zip
    Updated Sep 20, 2022
    Cite
    misteick (2022). ai generated faces [Dataset]. https://www.kaggle.com/datasets/chelove4draste/ai-generated-faces
    Explore at:
    Available download formats: zip (105847789285 bytes)
    Dataset updated
    Sep 20, 2022
    Authors
    misteick
    Description

    Fully AI-generated human faces. See the dataset's GitHub page for more details.

  9. Synthetic Fruit Dataset

    • universe.roboflow.com
    zip
    Updated Aug 11, 2021
    + more versions
    Cite
    Brad Dwyer (2021). Synthetic Fruit Dataset [Dataset]. https://universe.roboflow.com/brad-dwyer/synthetic-fruit/model/10
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 11, 2021
    Dataset authored and provided by
    Brad Dwyer
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Fruits Bounding Boxes
    Description

    About this dataset

    This dataset contains 6,000 example images generated with the process described in Roboflow's How to Create a Synthetic Dataset tutorial.

    The images are composed of a background (randomly selected from Google's Open Images dataset) and a number of fruits (from Horea94's Fruit Classification Dataset) superimposed on top with a random orientation, scale, and color transformation. All images are 416x550 to simulate a smartphone aspect ratio.

    To generate your own images, follow our tutorial or download the code.

    Example image: https://blog.roboflow.ai/content/images/2020/04/synthetic-fruit-examples.jpg
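    The compositing procedure described above can be sketched roughly as follows; paths, parameter ranges, and the use of Pillow are assumptions, not Roboflow's actual generation script:

      # Sketch: paste a fruit cutout onto a background with a random
      # orientation, scale, and colour transformation (416x550 output).
      import random
      from PIL import Image, ImageEnhance

      def composite(background_path: str, fruit_path: str) -> Image.Image:
          bg = Image.open(background_path).convert("RGB").resize((416, 550))
          fruit = Image.open(fruit_path).convert("RGBA")

          # Random orientation, scale, and colour jitter.
          fruit = fruit.rotate(random.uniform(0, 360), expand=True)
          scale = random.uniform(0.3, 1.0)
          fruit = fruit.resize((int(fruit.width * scale), int(fruit.height * scale)))
          fruit = ImageEnhance.Color(fruit).enhance(random.uniform(0.7, 1.3))

          # Paste at a random position, using the alpha channel as the mask.
          x = random.randint(0, max(0, bg.width - fruit.width))
          y = random.randint(0, max(0, bg.height - fruit.height))
          bg.paste(fruit, (x, y), fruit)
          return bg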

  10. Synthetic Data Platform Service Liability Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 7, 2025
    Cite
    Growth Market Reports (2025). Synthetic Data Platform Service Liability Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/synthetic-data-platform-service-liability-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Synthetic Data Platform Service Liability Market Outlook




    As per our latest research, the global Synthetic Data Platform Service Liability market size in 2024 stands at USD 1.82 billion, with a projected CAGR of 34.5% from 2025 to 2033. By the end of 2033, the market is expected to reach approximately USD 22.43 billion. This impressive growth trajectory is primarily fueled by the increasing adoption of AI and machine learning technologies across diverse industries, which demand high-quality, privacy-compliant data for training robust models.




    One of the primary growth factors for the Synthetic Data Platform Service Liability market is the growing emphasis on data privacy and compliance with stringent regulations such as GDPR, HIPAA, and CCPA. Organizations across sectors are facing mounting pressure to protect sensitive customer information while leveraging data-driven insights. Synthetic data platforms offer a solution by generating realistic but entirely artificial datasets, effectively mitigating privacy risks and reducing the liabilities associated with data breaches. This capability is particularly valuable in industries like healthcare and finance, where the repercussions of data misuse or exposure can be severe both legally and reputationally. As regulatory frameworks evolve globally, the demand for synthetic data solutions that ensure compliance and minimize liability is expected to surge, further propelling market expansion.




    Another significant driver is the rapid advancement and deployment of artificial intelligence and machine learning applications. These technologies require vast quantities of high-quality, unbiased, and diverse datasets for optimal performance. However, acquiring such data from real-world sources is often fraught with challenges, including privacy concerns, high costs, and potential biases. Synthetic data platforms address these obstacles by enabling organizations to create tailored datasets that closely mimic real-world scenarios without compromising sensitive information. This not only accelerates innovation but also reduces the risk of liability arising from the misuse of personal data. Consequently, industries such as automotive, IT & telecommunications, and retail are increasingly integrating synthetic data solutions to enhance model accuracy and operational efficiency while minimizing legal exposure.




    The proliferation of digital transformation initiatives across enterprises of all sizes is also contributing to the robust growth of the synthetic data platform service liability market. As organizations strive to modernize their operations and leverage data-driven decision-making, the need for scalable, secure, and flexible data solutions becomes paramount. Synthetic data platforms, available in both cloud and on-premises deployment modes, offer the agility required to support these digital initiatives. Moreover, the ability to generate synthetic datasets on-demand empowers businesses to test, validate, and refine their AI models without incurring the liabilities associated with handling sensitive real-world data. This trend is especially pronounced among small and medium enterprises (SMEs), which often lack the resources to invest heavily in data security infrastructure and rely on synthetic data to level the playing field with larger competitors.




    From a regional perspective, North America currently leads the synthetic data platform service liability market, driven by the presence of major technology providers, early adoption of AI technologies, and stringent regulatory requirements. Europe is also witnessing substantial growth, fueled by robust data protection laws and a strong focus on digital innovation. Meanwhile, the Asia Pacific region is emerging as a lucrative market due to rapid industrialization, increasing investments in AI and machine learning, and growing awareness of data privacy issues. These regional dynamics are expected to shape the competitive landscape and influence market trends over the forecast period.





    Component Analysis…

  11. generated-usa-passeports-dataset

    • huggingface.co
    Updated Jul 15, 2023
    Cite
    Unique Data (2023). generated-usa-passeports-dataset [Dataset]. https://huggingface.co/datasets/UniqueData/generated-usa-passeports-dataset
    Explore at:
    Dataset updated
    Jul 15, 2023
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Data generation in machine learning involves creating or manipulating data to train and evaluate machine learning models. The purpose of data generation is to provide diverse and representative examples that cover a wide range of scenarios, ensuring the model's robustness and generalization.

    Data augmentation techniques involve applying various transformations to existing data samples to create new ones. These transformations include random rotations, translations, scaling, flips, and more. Augmentation helps in increasing the dataset size, introducing natural variations, and improving model performance by making it more invariant to specific transformations.

    The dataset contains GENERATED USA passports, which are replicas of official passports but with randomly generated details, such as name, date of birth, etc. The primary intention of generating these fake passports is to demonstrate the structure and content of a typical passport document and to train the neural network to identify this type of document. Generated passports can assist in conducting research without accessing or compromising real user data that is often sensitive and subject to privacy regulations. Synthetic data generation allows researchers to develop and refine models using simulated passport data without risking privacy leaks.
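    The augmentation operations mentioned above can be expressed, for illustration, as a torchvision pipeline; the specific transforms and parameter ranges are assumptions, not those used by the dataset authors:

      # Sketch: random rotations, translations, scaling, flips, and colour
      # jitter as a torchvision transform pipeline. Parameters are illustrative.
      from torchvision import transforms

      augment = transforms.Compose([
          transforms.RandomAffine(degrees=10, translate=(0.05, 0.05),
                                  scale=(0.9, 1.1)),
          transforms.RandomHorizontalFlip(p=0.5),
          transforms.ColorJitter(brightness=0.2, contrast=0.2),
          transforms.ToTensor(),
      ])
      # Usage: augmented = augment(pil_image)   # pil_image is a PIL.Image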

  12. Film Dataset for AI-Generated Music (Machine Learning (ML) Data)

    • datarade.ai
    .json, .csv, .xls
    Updated Feb 14, 2024
    + more versions
    Cite
    Rightsify (2024). Film Dataset for AI-Generated Music (Machine Learning (ML) Data) [Dataset]. https://datarade.ai/data-products/film-dataset-for-ai-generated-music-machine-learning-ml-data-rightsify
    Explore at:
    Available download formats: .json, .csv, .xls
    Dataset updated
    Feb 14, 2024
    Dataset authored and provided by
    Rightsify
    Area covered
    Kuwait, Cuba, Guinea, Tokelau, Moldova (Republic of), Luxembourg, Falkland Islands (Malvinas), Bermuda, Antarctica, Denmark
    Description

    The Film dataset is a large collection of audio files with full metadata, including chords, instrumentation, key, tempo, and timestamps. This dataset is designed for machine learning applications and serves as a reliable resource for generative AI music, Music Information Retrieval (MIR), and source separation. With an emphasis on expanding machine learning attempts, the dataset allows researchers to delve into the complexities of film music, enabling the development of algorithms capable of generating creative compositions that genuinely represent the emotive nuances of various genres.

    Film music, an essential component of cinematic storytelling, plays an important role in increasing spectator engagement and emotional resonance. Composers work collaboratively with filmmakers to create music that enhances visual aspects, sets the tone, and reinforces story themes.

    Training models on this cinema dataset allows researchers to better grasp and mimic these artistic details, extending the bounds of AI-generated music and contributing to advances in MIR and source separation.

  13. German Open Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). German Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/german-open-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The German Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the German language, advancing the field of artificial intelligence.

    Dataset Content:

    This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in German. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native German speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-type answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled German Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
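    The field list above suggests records along the following lines; this is a hypothetical illustration of the schema, not an actual sample from the dataset:

      # Hypothetical record illustrating the annotation fields listed above.
      # All values are invented; the real dataset ships as JSON and CSV.
      example_record = {
          "id": "de-qa-000001",
          "language": "de",
          "domain": "geography",
          "question_length": 54,
          "prompt_type": "instruction",
          "question_category": "fact-based",
          "question_type": "direct",
          "complexity": "easy",
          "answer_type": "single_sentence",
          "rich_text": False,
      }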

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    Both the questions and answers in German are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy German Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  14. replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Apr 21, 2023
    Cite
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David (2023). replicAnt - Plum2023 - Detection & Tracking Datasets and Trained Networks [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7849416
    Explore at:
    Dataset updated
    Apr 21, 2023
    Dataset provided by
    The Pocket Dimension, Munich
    Imperial College London
    Authors
    Plum, Fabian; Bulla, René; Beck, Hendrik; Imirzian, Natalie; Labonte, David
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all recorded and hand-annotated data, all synthetically generated data, and representative trained networks used for the detection and tracking experiments in the manuscript "replicAnt - generating annotated images of animals in complex environments using Unreal Engine". Unless stated otherwise, all 3D animal models used in the synthetically generated data were created with the open-source photogrammetry platform scAnt (peerj.com/articles/11155/). All synthetic data was generated with the associated replicAnt project, available from https://github.com/evo-biomech/replicAnt.

    Abstract:

    Deep learning-based computer vision methods are transforming animal behavioural research. Transfer learning has enabled work in non-model species, but still requires hand-annotation of example footage, and is only performant in well-defined conditions. To overcome these limitations, we created replicAnt, a configurable pipeline implemented in Unreal Engine 5 and Python, designed to generate large and variable training datasets on consumer-grade hardware instead. replicAnt places 3D animal models into complex, procedurally generated environments, from which automatically annotated images can be exported. We demonstrate that synthetic data generated with replicAnt can significantly reduce the hand-annotation required to achieve benchmark performance in common applications such as animal detection, tracking, pose-estimation, and semantic segmentation; and that it increases the subject-specificity and domain-invariance of the trained networks, so conferring robustness. In some applications, replicAnt may even remove the need for hand-annotation altogether. It thus represents a significant step towards porting deep learning-based computer vision tools to the field.

    Benchmark data

    Two video datasets were curated to quantify detection performance; one in laboratory and one in field conditions. The laboratory dataset consists of top-down recordings of foraging trails of Atta vollenweideri (Forel 1893) leaf-cutter ants. The colony was collected in Uruguay in 2014, and housed in a climate chamber at 25°C and 60% humidity. A recording box was built from clear acrylic, and placed between the colony nest and a box external to the climate chamber, which functioned as feeding site. Bramble leaves were placed in the feeding area prior to each recording session, and ants had access to the recording area at will. The recorded area was 104 mm wide and 200 mm long. An OAK-D camera (OpenCV AI Kit: OAK-D, Luxonis Holding Corporation) was positioned centrally 195 mm above the ground. While keeping the camera position constant, lighting, exposure, and background conditions were varied to create recordings with variable appearance: The “base” case is an evenly lit and well exposed scene with scattered leaf fragments on an otherwise plain white backdrop. A “bright” and “dark” case are characterised by systematic over- or underexposure, respectively, which introduces motion blur, colour-clipped appendages, and extensive flickering and compression artefacts. In a separate well exposed recording, the clear acrylic backdrop was substituted with a printout of a highly textured forest ground to create a “noisy” case. Last, we decreased the camera distance to 100 mm at constant focal distance, effectively doubling the magnification, and yielding a “close” case, distinguished by out-of-focus workers. All recordings were captured at 25 frames per second (fps).

    The field dataset consists of video recordings of Gnathamitermes sp. desert termites, filmed close to the nest entrance in the desert of Maricopa County, Arizona, using a Nikon D850 and a Nikkor 18-105 mm lens on a tripod at camera distances between 20 cm and 40 cm. All video recordings were well exposed, and captured at 23.976 fps.

    Each video was trimmed to the first 1000 frames, and contains between 36 and 103 individuals. In total, 5000 and 1000 frames were hand-annotated for the laboratory- and field-dataset, respectively: each visible individual was assigned a constant size bounding box, with a centre coinciding approximately with the geometric centre of the thorax in top-down view. The size of the bounding boxes was chosen such that they were large enough to completely enclose the largest individuals, and was automatically adjusted near the image borders. A custom-written Blender Add-on aided hand-annotation: the Add-on is a semi-automated multi animal tracker, which leverages blender’s internal contrast-based motion tracker, but also include track refinement options, and CSV export functionality. Comprehensive documentation of this tool and Jupyter notebooks for track visualisation and benchmarking is provided on the replicAnt and BlenderMotionExport GitHub repositories.

    Synthetic data generation

    Two synthetic datasets, each with a population size of 100, were generated from 3D models of Atta vollenweideri leaf-cutter ants. All 3D models were created with the scAnt photogrammetry workflow. A “group” population was based on three distinct 3D models of an ant minor (1.1 mg), a media (9.8 mg), and a major (50.1 mg) (see 10.5281/zenodo.7849059). To approximately simulate the size distribution of A. vollenweideri colonies, these models make up 20%, 60%, and 20% of the simulated population, respectively. A 33% within-class scale variation, with default hue, contrast, and brightness subject material variation, was used. A “single” population was generated using the major model only, with 90% scale variation, but equal material variation settings.

    A Gnathamitermes sp. synthetic dataset was generated from two hand-sculpted models; a worker and a soldier made up 80% and 20% of the simulated population of 100 individuals, respectively with default hue, contrast, and brightness subject material variation. Both 3D models were created in Blender v3.1, using reference photographs.

    Each of the three synthetic datasets contains 10,000 images, rendered at a resolution of 1024 by 1024 px, using the default generator settings as documented in the Generator_example level file (see documentation on GitHub). To assess how the training dataset size affects performance, we trained networks on 100 (“small”), 1,000 (“medium”), and 10,000 (“large”) subsets of the “group” dataset. Generating 10,000 samples at the specified resolution took approximately 10 hours per dataset on a consumer-grade laptop (6 Core 4 GHz CPU, 16 GB RAM, RTX 2070 Super).

    Additionally, five datasets which contain both real and synthetic images were curated. These “mixed” datasets combine image samples from the synthetic “group” dataset with image samples from the real “base” case. The ratio between real and synthetic images across the five datasets varied from 10/1 to 1/100.
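    A small sketch of how such real/synthetic mixes could be assembled at a given ratio (directory names, file extension, and sampling logic are assumptions, not the authors' pipeline):

      # Sketch: build a "mixed" image list with roughly ratio[0] real images
      # per ratio[1] synthetic images, e.g. ratio=(1, 10).
      import random
      from pathlib import Path

      def mixed_dataset(real_dir: str, synth_dir: str, ratio=(1, 10), seed=0):
          rng = random.Random(seed)
          real = sorted(Path(real_dir).glob("*.png"))
          synth = sorted(Path(synth_dir).glob("*.png"))
          n_real = min(len(real), max(1, len(synth) * ratio[0] // ratio[1]))
          mixed = rng.sample(real, n_real) + list(synth)
          rng.shuffle(mixed)
          return mixed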

    Funding

    This study received funding from Imperial College’s President’s PhD Scholarship (to Fabian Plum), and is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (Grant agreement No. 851705, to David Labonte). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

  15. ccPDB - Compilation and Creation of datasets from PDB

    • neuinfo.org
    • scicrunch.org
    Updated Jan 29, 2022
    Cite
    (2022). ccPDB - Compilation and Creation of datasets from PDB [Dataset]. http://identifiers.org/RRID:SCR_005870
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    ccPDB (Compilation and Creation of datasets from PDB) is designed to provide a service to the scientific community working in the field of function or structure annotation of proteins. This database of datasets is based on the Protein Data Bank (PDB), from which all datasets were derived. ccPDB has four modules: i) compilation of datasets, ii) creation of datasets, iii) web services, and iv) important links.

    * Compilation of Datasets: Datasets at ccPDB fall into two categories: i) datasets collected from the literature and ii) datasets compiled from PDB. We are in the process of collecting PDB datasets from the literature and maintaining them at ccPDB, and we also invite the community to suggest datasets. In addition, we generate datasets from PDB using commonly used standard protocols, such as non-redundant chains and structures solved at high resolution.

    * Creation of Datasets: This module was developed for creating customized datasets, where users can create a dataset from PDB using their own conditions. It will be useful for users who wish to create a new dataset to their own requirements. The module has six steps, which are described in the help page.

    * Web Services: The following web services are integrated in ccPDB: i) the Analyze PDB ID service allows users to submit their PDB entry to around 40 servers from a single point; ii) BLAST search allows users to run a BLAST search of their protein against PDB; iii) the Structural Information service annotates a protein structure from a PDB ID; iv) Search in PDB helps users search for structures in PDB; v) the Generate Patterns service generates different types of patterns required for machine learning techniques; and vi) Download Useful Information allows users to download various types of information for a given set of proteins (PDB IDs).

    * Important Links: One of the major objectives of this website is to provide links to web servers related to functional annotation of proteins. In the first phase, these links have been collected and compiled into different categories; in the future, an attempt will be made to collect as many links as possible.

  16. Data used to produce figures and tables

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated May 15, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Data used to produce figures and tables [Dataset]. https://catalog.data.gov/dataset/data-used-to-produce-figures-and-tables-c6864
    Explore at:
    Dataset updated
    May 15, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The data set was used to produce the tables and figures in the associated papers. This dataset is associated with the following publications: Lytle, D., S. Pfaller, C. Muhlen, I. Struewing, S. Triantafyllidou, C. White, S. Hayes, D. King, and J. Lu. A Comprehensive Evaluation of Monochloramine Disinfection on Water Quality, Legionella and Other Important Microorganisms in a Hospital. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 189: 116656, (2021). Lytle, D., C. Formal, K. Cahalan, C. Muhlen, and S. Triantafyllidou. The Impact of Sampling Approach and Daily Water Usage on Lead Levels Measured at the Tap. WATER RESEARCH. Elsevier Science Ltd, New York, NY, USA, 197: 117071, (2021).

  17. my_dataset

    • huggingface.co
    Updated Nov 21, 2024
    + more versions
    Cite
    Lucain Pouget (2024). my_dataset [Dataset]. https://huggingface.co/datasets/Wauplin/my_dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 21, 2024
    Authors
    Lucain Pouget
    Description

    my_dataset

    Note: This is an AI-generated dataset, so its content may be inaccurate or false. Source of the data: the dataset was generated using the Fastdata library and claude-3-haiku-20240307 with the following input:

      System Prompt
    

    You are a helpful assistant.

      Prompt Template
    

    Generate English and Spanish translations on the following topic:

      Sample Input
    

    [{'topic': 'I am going to the beach this weekend'}, {'topic': 'I am going… See the full description on the dataset page: https://huggingface.co/datasets/Wauplin/my_dataset.

  18. Synthetic Data For Security Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    + more versions
    Cite
    Dataintelo (2025). Synthetic Data For Security Market Research Report 2033 [Dataset]. https://dataintelo.com/report/synthetic-data-for-security-market
    Explore at:
    Available download formats: pptx, pdf, csv
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Synthetic Data for Security Market Outlook



    According to our latest research, the synthetic data for security market size reached $1.42 billion globally in 2024, reflecting a rapidly expanding adoption curve across industries. The market is projected to grow at a robust CAGR of 36.7% from 2025 to 2033, setting the stage for an impressive forecasted market size of $19.6 billion by 2033. This exponential growth is primarily driven by the increasing sophistication of cyber threats, the need for advanced data privacy solutions, and the accelerating pace of digital transformation initiatives. As organizations worldwide prioritize secure data environments and compliance, synthetic data is emerging as a critical enabler for secure innovation and risk mitigation in the digital era.




    One of the pivotal growth factors propelling the synthetic data for security market is the escalating demand for robust data privacy and compliance solutions. With regulatory frameworks such as GDPR, CCPA, and HIPAA imposing stringent requirements on data handling, organizations are under immense pressure to ensure that sensitive information is protected at every stage of processing. Synthetic data, by its very nature, eliminates direct exposure of real personal or confidential data, offering a highly effective means to conduct analytics, test security protocols, and train machine learning models without risking privacy breaches. This capability is especially valuable in sectors like BFSI, healthcare, and government, where data sensitivity is paramount. As a result, enterprises are increasingly integrating synthetic data solutions into their security architecture to address compliance mandates while maintaining operational agility.




    Another significant driver for the synthetic data for security market is the surge in cyberattacks and fraudulent activities targeting digital assets across industries. Traditional security testing with real data can inadvertently expose vulnerabilities or lead to data leaks, making synthetic data an attractive alternative for simulating diverse threat scenarios and validating security controls. Organizations are leveraging synthetic data to enhance their fraud detection, threat intelligence, and identity management systems by generating realistic yet non-sensitive datasets for rigorous testing and training. This not only strengthens the overall cybersecurity posture but also accelerates the deployment of AI-driven security solutions by providing abundant, high-quality training data without regulatory or ethical constraints. The ability to rapidly generate tailored datasets for evolving threat landscapes gives organizations a decisive edge in proactive risk management.




    The proliferation of digital transformation initiatives and the adoption of cloud-based security solutions are further catalyzing the growth of the synthetic data for security market. As enterprises migrate critical workloads to cloud environments, the need for scalable, secure, and compliant data management becomes paramount. Synthetic data seamlessly fits into cloud-native security architectures, enabling secure DevOps, sandbox testing, and continuous integration/continuous deployment (CI/CD) pipelines. The flexibility to generate synthetic datasets on demand supports agile development cycles and reduces the time-to-market for new security applications. Additionally, the rise of AI and machine learning in security operations is amplifying the demand for synthetic data, as it provides the diverse, balanced, and unbiased datasets needed to train advanced detection and response systems. This convergence of cloud, AI, and synthetic data is reshaping the future of secure digital innovation.




    From a regional perspective, North America currently dominates the synthetic data for security market, accounting for the largest revenue share in 2024. This leadership is attributed to the region's mature cybersecurity ecosystem, high technology adoption rates, and stringent regulatory environment. Europe follows closely, driven by robust data protection regulations and a strong focus on privacy-centric security solutions. The Asia Pacific region is witnessing the fastest growth, fueled by rapid digitalization, increasing cyber threats, and growing investments in advanced security infrastructure. Latin America and the Middle East & Africa are also experiencing steady adoption, albeit at a slower pace, as organizations in these regions recognize the strategic value of synthetic data in mitigating security risks and ensuring regulatory compliance. Overall, the global landscape is characterized…

  19. 10k Random Shapes with Random Operations

    • kaggle.com
    zip
    Updated May 16, 2023
    + more versions
    Cite
    makra (2023). 10k Random Shapes with Random Operations [Dataset]. https://www.kaggle.com/datasets/makra2077/10000-random-shapes-with-random-operations
    Explore at:
    Available download formats: zip (7965036 bytes)
    Dataset updated
    May 16, 2023
    Authors
    makra
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0, http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    About

    The dataset contains 10000 images, each with 2 random shapes (of 17 possible shapes) combined through random operations (of 3 possible operations). This dataset was generated using the 3D Shapes Dataset Generator I've developed, which you are free to use.


    Label

    Column Name                | Info
    filename                   | Name of the image file
    shape                      | Shape index
    operation                  | Operation index
    a,b,c,d,e,f,g,h,i,j,k,l    | Dimensional parameters
    hue, sat, val              | HSV values of the color
    rot_x, rot_y, rot_z        | Euler angles
    pos_x, pos_y, pos_z        | Position vector

    Each row describes one shape in an image of the dataset.
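    Assuming the labels ship as a single CSV with the columns above, a minimal pandas sketch for looking up the shapes in one image (the file names "labels.csv" and "00001.png" are assumptions):

      # Sketch: read the label table and select the rows describing one image.
      import pandas as pd

      labels = pd.read_csv("labels.csv")
      shapes_in_image = labels[labels["filename"] == "00001.png"]
      print(shapes_in_image[["shape", "operation", "hue", "sat", "val"]])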

    Seed

    The seed value of the dataset is stored in a txt file and can be used to re-generate the dataset using the tool.

  20. ScatteringSplatting Dataset: Synthetic dataset for 3D reconstruction in scattering medium

    • dataverse.no
    • dataverse.azure.uit.no
    • +1more
    txt, zip
    Updated Apr 10, 2025
    Cite
    Anurag Dalal; Anurag Dalal (2025). ScatteringSplatting Dataset: Synthetic dataset for 3D reconstruction in scattering medium [Dataset]. http://doi.org/10.18710/8AS0US
    Explore at:
    Available download formats: zip (64201895), txt (4565)
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    DataverseNO
    Authors
    Anurag Dalal; Anurag Dalal
    License

    https://dataverse.no/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18710/8AS0US

    Description

    We create a synthetic dataset using Unreal Engine 5 to evaluate 3D reconstruction under scattering media like fog and underwater conditions. It includes two scenes—an outdoor foggy environment and a realistic underwater setting—with images captured from a hemispherical camera layout. Each scene provides separate training and evaluation views, and COLMAP is used to generate sparse reconstructions and ground-truth poses for benchmarking.
