96 datasets found
  1. PASTA Data

    • kaggle.com
    Updated Dec 10, 2024
    Cite
    Google Research (2024). PASTA Data [Dataset]. https://www.kaggle.com/datasets/googleai/pasta-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Google Research
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains the human rater trajectories used in the paper "Preference Adaptive and Sequential Text-to-Image Generation".

    We use human raters to gather sequential user preferences data for personalized T2I generation. Participants are tasked with interacting with an LMM agent for five turns. Throughout our rater study we use a Gemini 1.5 Flash Model as our base LMM, which acts as an agent. At each turn, the system presents 16 images, arranged in four columns, each representing a different prompt expansion derived from the user's initial prompt and prior interactions. Raters are shown only the generated images, not the prompt expansions themselves.

    At session start, raters are instructed to provide an initial prompt of at most 12 words, encapsulating a specific visual concept. They are encouraged to provide descriptive prompts that avoid generic terms (e.g., "an ancient Egyptian temple with hieroglyphs" instead of "a temple"). At each turn, raters then select the column of images they prefer most; they are instructed to choose a column based on the quality of the best image in that column with respect to their original intent. Raters may optionally provide a free-text critique (up to 12 words) to guide subsequent prompt expansions, though most raters did not use this facility.

    See our paper for a comprehensive description of the rater study.

    Citation

    Please cite our paper if you use this dataset in your work.

  2. United States Age Group Population Dataset: A complete breakdown of United States age demographics from 0 to 85 years, distributed across 18 age groups

    • neilsberg.com
    csv, json
    Updated Sep 16, 2023
    + more versions
    Cite
    Neilsberg Research (2023). United States Age Group Population Dataset: A complete breakdown of United States age demographics from 0 to 85 years, distributed across 18 age groups [Dataset]. https://www.neilsberg.com/research/datasets/5fd2b2bb-3d85-11ee-9abe-0aa64bf2eeb2/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Sep 16, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Variables measured
    Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we analyzed and categorized the data for each age group. Ages between 0 and 85 were divided into roughly five-year buckets, while ages over 85 were aggregated into a single group. For further information regarding these estimates, please reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the United States population distribution across 18 age groups. It lists the population in each age group along with that group's percentage of the total United States population. The dataset can be utilized to understand the population distribution of the United States by age. For example, using this dataset, we can identify the largest age group in the United States.

    Key observations

    The largest age group in the United States was 25-29 years, with a population of 22,854,328 (6.93%), according to the 2021 American Community Survey. At the same time, the smallest age group was 80-84 years, with a population of 5,932,196 (1.80%). Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Variables / Data Columns

    • Age Group: This column displays the age group in consideration
    • Population: The population for the specific age group in the United States is shown in this column.
    • % of Total Population: This column displays the population of each age group as a proportion of the total United States population. Please note that the percentages may not sum to exactly 100% due to rounding.
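    As a hedged sketch of how these three columns relate, the snippet below recomputes the percentage column from a tiny inline sample built from the two population figures quoted in "Key observations"; the real Neilsberg CSV uses the column names listed above, but its path and full contents are not reproduced here.

```python
import csv
import io

# Tiny inline stand-in for the real CSV; the two population figures are the
# ones quoted in "Key observations" above. Column names follow the dataset.
sample = io.StringIO(
    "Age Group,Population\n"
    "25 to 29 years,22854328\n"
    "80 to 84 years,5932196\n"
)

rows = list(csv.DictReader(sample))
total = sum(int(r["Population"]) for r in rows)

# Recompute "% of Total Population"; rounding is why the published
# percentages may not sum to exactly 100%.
for r in rows:
    r["% of Total Population"] = round(100 * int(r["Population"]) / total, 2)

# Identify the largest age group, as suggested in "Context" above.
largest = max(rows, key=lambda r: int(r["Population"]))
```

    The same two-line pattern (sum, then divide) generalizes to the full 18-group file.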

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are therefore subject to sampling variability and a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for United States Population by Age. You can refer to it here.

  3. SVG Code Generation Sample Training Data

    • kaggle.com
    Updated May 3, 2025
    Cite
    Vinothkumar Sekar (2025). SVG Code Generation Sample Training Data [Dataset]. https://www.kaggle.com/datasets/vinothkumarsekar89/svg-generation-sample-training-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 3, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vinothkumar Sekar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.

    The dataset was generated in two steps using the GPT-4o model.

    • In the first step, topic descriptions relevant to the competition are generated using the following prompt. By running this prompt multiple times, over 3,000 descriptions were collected.

     
    prompt=f""" I am participating in an SVG code generation competition.
      
       The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
      
       - Descriptions are generic and do not contain brand names, trademarks, or personal names.
       - No descriptions include people, even in generic terms.
       - Descriptions are concise—each is no more than 200 characters, with an average length of about 50 characters.
       - Categories cover various domains, with some overlap between public and private test sets.
      
       To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
      
       Requirements:
       - Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
       - Ensure **diversity and creativity** across topics.
       - **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
       - Avoid duplication or overly similar phrasing.
      
       Example topics:
                     a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid,  purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet,  a snowy plain, black and white checkered pants,  a starlit night over snow-covered peaks, khaki triangles and azure crescents,  a maroon dodecahedron interwoven with teal threads.
      
       Please return the 100 topics in csv format.
       """
     
    • In the second step, SVG code is generated by querying the GPT-4o model with the following prompt for each description:
     
      prompt = f"""
          Generate SVG code to visually represent the following text description, while respecting the given constraints.
          
          Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
          Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
          
    
          Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. 
          Focus on a clear and concise representation of the input description within the given limitations. 
          Always give the complete SVG code with nothing omitted. Never use an ellipsis.
    
          The code is scored based on similarity to the description, visual question answering, and aesthetic components.
          Please generate a detailed svg code accordingly.
    
          input description: {text}
          """
     

    The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
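    The acceptance step described above amounts to a simple threshold filter. In the sketch below, `score_fn` is a stand-in for the actual SigLIP text-to-SVG similarity call, which is not reproduced here; only the 0.5 threshold logic mirrors the dataset's construction.

```python
# Sketch of the quality gate described above: keep only SVGs whose
# text-image similarity score exceeds 0.5. `score_fn` is a placeholder
# for the real SigLIP scoring call.
def filter_svgs(candidates, score_fn, threshold=0.5):
    """candidates: iterable of (description, svg_code) pairs."""
    return [(desc, svg) for desc, svg in candidates
            if score_fn(desc, svg) > threshold]

# Toy usage with dummy scores: one of three generations clears the bar,
# matching the roughly one-in-three acceptance rate reported above.
dummy_scores = {"svg_a": 0.62, "svg_b": 0.31, "svg_c": 0.48}
candidates = [("a purple forest at dusk", svg) for svg in dummy_scores]
kept = filter_svgs(candidates, lambda desc, svg: dummy_scores[svg])
```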

  4. NYC Population by Generation Demographics Map

    • nyc-open-data-statelocalps.hub.arcgis.com
    • hub.arcgis.com
    • +2more
    Updated Mar 19, 2020
    + more versions
    Cite
    pkunduNYC (2020). NYC Population by Generation Demographics Map [Dataset]. https://nyc-open-data-statelocalps.hub.arcgis.com/maps/62dad0e61f534b3fa97c6950c07b5007
    Explore at:
    Dataset updated
    Mar 19, 2020
    Dataset authored and provided by
    pkunduNYC
    Description

    This map contains NYC administrative boundaries enriched with various demographics datasets. Learn more about Esri's Enrich Layer / Geoenrichment analysis tool. Learn more about Esri's Demographics, Psychographic, and Socioeconomic datasets. Search for a specific location or site using the search bar. Toggle layer visibility with the layer list. Click on a layer to see more information about the feature.

  5. The Extended Global Lake area, Climate, and Population Dataset (GLCP)

    • data.usgs.gov
    • portal.edirepository.org
    • +1more
    Updated Feb 13, 2025
    + more versions
    Cite
    Michael Meyer; Matthew Brousil; Salvatore Virdis; Xiao Yang; Alli Cramer; Ryan McClure; Stephen Katz; Stephanie Hampton (2025). The Extended Global Lake area, Climate, and Population Dataset (GLCP) [Dataset]. http://doi.org/10.6073/pasta/e0bf4571ca6cbfb81c3ed7caefc85fc6
    Explore at:
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Michael Meyer; Matthew Brousil; Salvatore Virdis; Xiao Yang; Alli Cramer; Ryan McClure; Stephen Katz; Stephanie Hampton
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Jan 1, 1995 - Dec 31, 2020
    Description

    A changing climate and increasing human population necessitate understanding global freshwater availability and temporal variability. To examine lake freshwater availability from local-to-global and monthly-to-decadal scales, we created the Global Lake area, Climate, and Population (GLCP) dataset, which contains annual lake surface area for 1.42 million lakes with paired annual basin-level climate and population data. Building off an existing data product infrastructure, the next generation of the GLCP includes monthly lake ice area, snow basin area, and more climate variables including specific humidity, longwave and shortwave radiation, as well as cloud cover. The new generation of the GLCP continues previous FAIR data efforts by expanding its scripting repository and maintaining unique relational keys for merging with external data products. Compared to the original version, the new GLCP contains an even richer suite of variables capable of addressing disparate analyses of lake ...

  6. Third Generation Simulation Data (TGSIM) I-90/I-94 Moving Trajectories

    • catalog.data.gov
    • data.virginia.gov
    • +2more
    Updated Jan 24, 2025
    + more versions
    Cite
    Federal Highway Administration (2025). Third Generation Simulation Data (TGSIM) I-90/I-94 Moving Trajectories [Dataset]. https://catalog.data.gov/dataset/third-generation-simulation-data-tgsim-i-90-i-94-moving-trajectories
    Explore at:
    Dataset updated
    Jan 24, 2025
    Dataset provided by
    Federal Highway Administration (https://highways.dot.gov/)
    Area covered
    Interstate 90
    Description

    The main dataset is a 130 MB file of trajectory data (I90_94_moving_final.csv) that contains position, speed, and acceleration data for small and large automated (L2) and non-automated vehicles on a highway in an urban environment. Supporting files include aerial reference images for four distinct data collection “Runs” (I90_94_moving_RunX_with_lanes.png, where X equals 1, 2, 3, and 4). Associated centerline files are also provided for each “Run” (I-90-moving-Run_X-geometry-with-ramps.csv). In each centerline file, x and y coordinates (in meters) marking each lane centerline are provided. The origin point of the reference image is located at the top left corner. Additionally, in each centerline file, an indicator variable is used for each lane to define the following types of road sections: 0=no ramp, 1=on-ramps, 2=off-ramps, and 3=weaving segments. The number attached to each column header is the numerical ID assigned for the specific lane (see “TGSIM – Centerline Data Dictionary – I90_94moving.csv” for more details). The dataset defines six northbound lanes using these centerline files. Images that map the lanes of interest to the numerical lane IDs referenced in the trajectory dataset are stored in the folder titled “Annotation on Regions.zip”. The northbound lanes are shown visually from left to right in I90_94_moving_lane1.png through I90_94_moving_lane6.png. This dataset was collected as part of the Third Generation Simulation Data (TGSIM): A Closer Look at the Impacts of Automated Driving Systems on Human Behavior project. During the project, six trajectory datasets capable of characterizing human-automated vehicle interactions under a diverse set of scenarios in highway and city environments were collected and processed. For more information, see the project report found here: https://rosap.ntl.bts.gov/view/dot/74647. 
    This dataset, which is one of the six collected as part of the TGSIM project, contains data collected using one high-resolution 8K camera mounted on a helicopter that followed three SAE Level 2 ADAS-equipped vehicles (one at a time) northbound through the 4 km long segment at an altitude of 200 meters. Once a vehicle finished the segment, the helicopter would return to the beginning of the segment to follow the next SAE Level 2 ADAS-equipped vehicle to ensure continuous data collection. The segment was selected to study mandatory and discretionary lane changing and last-minute, forced lane-changing maneuvers. The segment has five off-ramps and three on-ramps to the right and one off-ramp and one on-ramp to the left. All roads have 88 kph (55 mph) speed limits. The camera captured footage during the evening rush hour (3:00 PM-5:00 PM CT) on a cloudy day. As part of this dataset, the following files were provided:

    • I90_94_moving_final.csv contains the numerical data to be used for analysis, including vehicle-level trajectory data at every 0.1 second. Vehicle size (small or large), width, length, and whether the vehicle was one of the automated test vehicles ("yes" or "no") are provided with instantaneous location, speed, and acceleration data. All distance measurements (width, length, location) were converted from pixels to meters using the conversion factor 1 pixel = 0.3 meters.
    • I90_94_moving_RunX_with_lanes.png are the aerial reference images that define the geographic region and associated roadway segments of interest (see bounding boxes on northbound lanes) for each run X.
    • I-90-moving-Run_X-geometry-with-ramps.csv contain the coordinates that define the lane centerlines for each Run X. The "x" and "y" columns represent the horizontal and vertical locations in the reference image, respectively. The "ramp" columns define the type of roadway segment (0=no ramp, 1=on-ramps, 2=off-ramps, and 3=weaving segments).
In total, the centerline files define six northbound lanes. Annotation on Regions.zip, which includes images that visually map lanes (I90_9
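    A minimal sketch of the stated unit conversion (1 pixel = 0.3 meters) as it would apply to the distance columns; the helper name below is illustrative, not an actual CSV header or project utility.

```python
# Illustrative helper for the dataset's stated conversion factor:
# 1 pixel = 0.3 meters, applied to width, length, and location columns.
PIXEL_TO_METER = 0.3

def px_to_m(value_px: float) -> float:
    """Convert a distance measured in image pixels to meters."""
    return value_px * PIXEL_TO_METER

# e.g. a vehicle measured at 15 px in the imagery corresponds to 4.5 m.
length_m = px_to_m(15)
```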

  7. NLUCat

    • zenodo.org
    • huggingface.co
    • +1more
    zip
    Updated Mar 4, 2024
    Cite
    Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10721193
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NLUCat

    Dataset Description

    Dataset Summary

    NLUCat is a dataset for natural language understanding (NLU) in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is also accompanied by the instructions the annotator received when writing it.

    The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

    The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.

    The examples are not only written in Catalan, but they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).

    This dataset can be used to train models for intent classification, spans identification and examples generation.

    This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

    In this repository you'll find the following items:

    • NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
    • NLUCat_dataset.json: the completed NLUCat dataset
    • NLUCat_stats.tsv: statistics about the NLUCat dataset
    • dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers
    • reports: folder with the reports done as feedback to the annotators during the annotation process

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0 license. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Supported Tasks and Leaderboards

    Intent classification, spans identification and examples generation.

    Languages

    The dataset is in Catalan (ca-ES).

    Dataset Structure

    Data Instances

    Three JSON files, one for each split.

    Data Fields

    • example: `str`. The example text
    • annotation: `dict`. Annotation of the example
      • intent: `str`. Intent tag
      • slots: `list`. List of slots
        • Tag: `str`. Tag of the slot
        • Text: `str`. Text of the slot
        • Start_char: `int`. First character of the span
        • End_char: `int`. Last character of the span

    Example


    An example looks as follows:

    {
      "example": "Demana una ambulància; la meva dona està de part.",
      "annotation": {
        "intent": "call_emergency",
        "slots": [
          {
            "Tag": "service",
            "Text": "ambulància",
            "Start_char": 11,
            "End_char": 21
          },
          {
            "Tag": "situation",
            "Text": "la meva dona està de part",
            "Start_char": 23,
            "End_char": 48
          }
        ]
      }
    }
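    As a small sketch of how the annotation fields index into the example string, the snippet below parses the record shown above and slices out each slot's span; in this record the Start_char/End_char offsets behave as end-exclusive character indices.

```python
import json

# The record shown above; slicing the example string with
# Start_char/End_char recovers each slot's Text.
record = json.loads("""
{
  "example": "Demana una ambulància; la meva dona està de part.",
  "annotation": {
    "intent": "call_emergency",
    "slots": [
      {"Tag": "service", "Text": "ambulància",
       "Start_char": 11, "End_char": 21},
      {"Tag": "situation", "Text": "la meva dona està de part",
       "Start_char": 23, "End_char": 48}
    ]
  }
}
""")

spans = [
    record["example"][slot["Start_char"]:slot["End_char"]]
    for slot in record["annotation"]["slots"]
]
```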


    Data Splits

    • NLUCat.train: 9128 examples
    • NLUCat.dev: 1441 examples
    • NLUCat.test: 1441 examples

    Dataset Creation

    Curation Rationale

    We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

    When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

    Source Data

    Initial Data Collection and Normalization

    We commissioned a company to create fictitious examples for the creation of this dataset.

    Who are the source language producers?

    We commissioned the writing of the examples to the company m47 labs.

    Annotations

    Annotation process

    The elaboration of this dataset has been done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
    * First step: translation or elaboration of the instructions given to the annotators to write the examples.
    * Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
    * Third step: annotating the intents and the slots of each example. In this step, some modifications were made to the annotation guidelines to adjust them to real situations.

    Who are the annotators?

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

    Personal and Sensitive Information

    No personal or sensitive information is included.

    The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.

    Considerations for Using the Data

    Social Impact of Dataset

    We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

    Discussion of Biases

    When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
    Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0.
    Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI

    Contributions

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

  8. NYC Benefits Platform: Benefits and Programs Dataset

    • catalog.data.gov
    • data.cityofnewyork.us
    • +1more
    Updated Jun 29, 2025
    + more versions
    Cite
    data.cityofnewyork.us (2025). NYC Benefits Platform: Benefits and Programs Dataset [Dataset]. https://catalog.data.gov/dataset/benefits-and-programs-api
    Explore at:
    Dataset updated
    Jun 29, 2025
    Dataset provided by
    data.cityofnewyork.us
    Area covered
    New York
    Description

    This dataset provides benefit, program, and resource information for over 80 health and human services available to NYC residents, in all eleven local law languages. The data is kept up to date, including the most recent applications, eligibility requirements, and application dates. Information in this dataset is used on ACCESS NYC, Generation NYC, and Growing Up NYC. Reach out to products@nycopportunity.nyc.gov if you have any questions about this dataset. This data makes it easier for NYC residents to discover and be aware of the multiple benefits they may be eligible for. The NYC Opportunity Product team works with 15+ government agencies to collect and update this data. Each record in the dataset represents a benefit or program. Blank fields are NULL values in this dataset. The data can be used to develop new websites or directory resources to help residents discover benefits they need. For access to the multilingual version of this dataset, please follow this link: https://data.cityofnewyork.us/City-Government/Benefits-and-Programs-Multilingual-Dataset/yjpx-srhp

  9. internal-datasets

    • huggingface.co
    Updated Jun 1, 2023
    + more versions
    Cite
    Ivan Rivaldo Marbun (2023). internal-datasets [Dataset]. https://huggingface.co/datasets/Marbyun/internal-datasets
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 1, 2023
    Authors
    Ivan Rivaldo Marbun
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    SynQA is a Reading Comprehension dataset created in the work "Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation" (https://aclanthology.org/2021.emnlp-main.696/). It consists of 314,811 synthetically generated questions on the passages in the SQuAD v1.1 (https://arxiv.org/abs/1606.05250) training set.

    In this work, we use synthetic adversarial data generation to make QA models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset to a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA (https://adversarialqa.github.io/) dataset by 3.7 F1 and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.

    For full details on how the dataset was created, kindly refer to the paper.

  10. human-style-preferences-images

    • ollama.hf-mirror.com
    • huggingface.co
    Updated Nov 9, 2024
    Cite
    Rapidata (2024). human-style-preferences-images [Dataset]. https://ollama.hf-mirror.com/datasets/Rapidata/human-style-preferences-images
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 9, 2024
    Dataset authored and provided by
    Rapidata
    License

    CDLA-Permissive-2.0: https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    Rapidata Image Generation Preference Dataset

    This dataset was collected in roughly four days using the Rapidata Python API, which is accessible to anyone and ideal for large-scale data annotation. Explore our latest model rankings on our website. If you get value from this dataset and would like to see more in the future, please consider liking it.

      Overview
    

    One of the largest human preference datasets for text-to-image models, this release contains over 1,200,000 human preference… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/human-style-preferences-images.

  11. JuICe Dataset

    • paperswithcode.com
    Updated Oct 4, 2019
    + more versions
    Cite
    Rajas Agashe; Srinivasan Iyer; Luke Zettlemoyer (2019). JuICe Dataset [Dataset]. https://paperswithcode.com/dataset/juice
    Explore at:
    Dataset updated
    Oct 4, 2019
    Authors
    Rajas Agashe; Srinivasan Iyer; Luke Zettlemoyer
    Description

    JuICe is a corpus of 1.5 million examples with a curated test set of 3.7K instances based on online programming assignments. Compared with existing contextual code generation datasets, JuICe provides refined human-curated data, open-domain code, and an order of magnitude more training data.

  12. text-2-image-Rich-Human-Feedback

    • huggingface.co
    Updated Jan 7, 2025
    Rapidata (2025). text-2-image-Rich-Human-Feedback [Dataset]. https://huggingface.co/datasets/Rapidata/text-2-image-Rich-Human-Feedback
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Rapidata
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Building upon Google's research "Rich Human Feedback for Text-to-Image Generation", we have collected over 1.5 million responses from 152,684 individual humans using Rapidata via the Python API. Collection took roughly 5 days. If you get value from this dataset and would like to see more in the future, please consider liking it.

      Overview
    

    We asked humans to evaluate AI-generated images on style, coherence, and prompt alignment. For images that contained flaws, participants were… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/text-2-image-Rich-Human-Feedback.

  13. Databricks Dolly 15K Dataset

    • opendatabay.com
    .csv
    Updated Jun 15, 2025
    + more versions
    Datasimple (2025). Databricks Dolly 15K Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/78cf60f8-b078-411f-aa41-bc5794f3121c
    Explore at:
    .csvAvailable download formats
    Dataset updated
    Jun 15, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Data Science and Analytics
    Description

    Dataset Overview databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

    Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and only select questions they could be reasonably expected to answer correctly.

    For certain categories contributors were asked to provide reference texts copied from Wikipedia. Reference text (indicated by the context field in the actual dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]) which we recommend users remove for downstream applications.
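    As the description recommends, the bracketed citation numbers can be stripped from the `context` field before downstream use. A minimal sketch with the standard library (the example text is made up for illustration):

    ```python
    import re

    def strip_citations(context: str) -> str:
        """Remove bracketed Wikipedia citation numbers such as [42] from a reference text."""
        return re.sub(r"\[\d+\]", "", context)

    cleaned = strip_citations("Paris is the capital of France.[42] It is on the Seine.[7]")
    # cleaned -> "Paris is the capital of France. It is on the Seine."
    ```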

    Intended Uses While immediately valuable for instruction fine-tuning large language models, as a corpus of human-generated instruction prompts, this dataset also presents a valuable opportunity for synthetic data generation using the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.

    Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short responses, with the resulting text associated to the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.

    Dataset Purpose of Collection As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.

    Sources

    • Human-generated data: Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
    • Wikipedia: For instruction categories that require an annotator to consult a reference text (information extraction, closed QA, summarization), contributors selected passages from Wikipedia. No guidance was given to annotators as to how to select the target passages.

    Annotator Guidelines

    To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous compliance to an annotation rubric that concretely and reliably operationalizes the specific task. Caveat emptor.

    The annotation guidelines for each of the categories are as follows:

    • Creative Writing: Write a question or instruction that requires a creative, open-ended written response. The instruction should be reasonable to ask of a person with general world knowledge and should not require searching. In this task, your prompt should give very specific instructions to follow. Constraints, instructions, guidelines, or requirements all work, and the more of them the better.
    • Closed QA: Write a question or instruction that requires a factually correct response based on a passage of text from Wikipedia. The question can be complex and can involve human-level reasoning capabilities, but should not require special knowledge. To create a question for this task, include both the text of the question as well as the reference text in the form.
    • Open QA: Write a question that can be answered using general world knowledge or at most a single search. This task asks for opinions and facts about the world at large and does not provide any reference text for consultation.
    • Summarization: Give a summary of a paragraph from Wikipedia. Please don't ask questions that will require more than 3-5 minutes to answer. To create a question for this task, include both the text of the question as well as the reference text in the form.
    • Information Extraction: T

  14. GEM Dataset

    • paperswithcode.com
    Updated Jun 8, 2022
    Sebastian Gehrmann; Tosin Adewumi; Karmanya Aggarwal; Pawan Sasanka Ammanamanchi; Aremu Anuoluwapo; Antoine Bosselut; Khyathi Raghavi Chandu; Miruna Clinciu; Dipanjan Das; Kaustubh D. Dhole; Wanyu Du; Esin Durmus; Ondřej Dušek; Chris Emezue; Varun Gangal; Cristina Garbacea; Tatsunori Hashimoto; Yufang Hou; Yacine Jernite; Harsh Jhamtani; Yangfeng Ji; Shailza Jolly; Mihir Kale; Dhruv Kumar; Faisal Ladhak; Aman Madaan; Mounica Maddela; Khyati Mahajan; Saad Mahamood; Bodhisattwa Prasad Majumder; Pedro Henrique Martins; Angelina McMillan-Major; Simon Mille; Emiel van Miltenburg; Moin Nadeem; Shashi Narayan; Vitaly Nikolaev; Rubungo Andre Niyongabo; Salomey Osei; Ankur Parikh; Laura Perez-Beltrachini; Niranjan Ramesh Rao; Vikas Raunak; Juan Diego Rodriguez; Sashank Santhanam; João Sedoc; Thibault Sellam; Samira Shaikh; Anastasia Shimorina; Marco Antonio Sobrevilla Cabezudo; Hendrik Strobelt; Nishant Subramani; Wei Xu; Diyi Yang; Akhila Yerukola; Jiawei Zhou (2022). GEM Dataset [Dataset]. https://paperswithcode.com/dataset/gem
    Explore at:
    Dataset updated
    Jun 8, 2022
    Authors
    Sebastian Gehrmann; Tosin Adewumi; Karmanya Aggarwal; Pawan Sasanka Ammanamanchi; Aremu Anuoluwapo; Antoine Bosselut; Khyathi Raghavi Chandu; Miruna Clinciu; Dipanjan Das; Kaustubh D. Dhole; Wanyu Du; Esin Durmus; Ondřej Dušek; Chris Emezue; Varun Gangal; Cristina Garbacea; Tatsunori Hashimoto; Yufang Hou; Yacine Jernite; Harsh Jhamtani; Yangfeng Ji; Shailza Jolly; Mihir Kale; Dhruv Kumar; Faisal Ladhak; Aman Madaan; Mounica Maddela; Khyati Mahajan; Saad Mahamood; Bodhisattwa Prasad Majumder; Pedro Henrique Martins; Angelina McMillan-Major; Simon Mille; Emiel van Miltenburg; Moin Nadeem; Shashi Narayan; Vitaly Nikolaev; Rubungo Andre Niyongabo; Salomey Osei; Ankur Parikh; Laura Perez-Beltrachini; Niranjan Ramesh Rao; Vikas Raunak; Juan Diego Rodriguez; Sashank Santhanam; João Sedoc; Thibault Sellam; Samira Shaikh; Anastasia Shimorina; Marco Antonio Sobrevilla Cabezudo; Hendrik Strobelt; Nishant Subramani; Wei Xu; Diyi Yang; Akhila Yerukola; Jiawei Zhou
    Description

    Generation, Evaluation, and Metrics (GEM) is a benchmark environment for Natural Language Generation with a focus on its Evaluation, both through human annotations and automated Metrics.

    GEM aims to:

    • measure NLG progress across 13 datasets spanning many NLG tasks and languages.
    • provide an in-depth analysis of data and models presented via data statements and challenge sets.
    • develop standards for evaluation of generated text using both automated and human metrics.

    It is our goal to regularly update GEM and to encourage more inclusive practices in dataset development by extending existing data or developing datasets for additional languages.

  15. RACE Dataset

    • paperswithcode.com
    Updated Jan 27, 2022
    + more versions
    Guokun Lai; Qizhe Xie; Hanxiao Liu; Yiming Yang; Eduard Hovy (2022). RACE Dataset [Dataset]. https://paperswithcode.com/dataset/race
    Explore at:
    Dataset updated
    Jan 27, 2022
    Authors
    Guokun Lai; Qizhe Xie; Hanxiao Liu; Yiming Yang; Eduard Hovy
    Description

    The ReAding Comprehension dataset from Examinations (RACE) is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 questions from English exams, targeting Chinese students aged 12-18. RACE consists of two subsets, RACE-M and RACE-H, from middle school and high school exams, respectively. RACE-M has 28,293 questions and RACE-H has 69,574. Each question is associated with 4 candidate answers, one of which is correct. The data generation process of RACE differs from most machine reading comprehension datasets - instead of generating questions and answers by heuristics or crowd-sourcing, questions in RACE are specifically designed for testing human reading skills, and are created by domain experts.
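    Each RACE item pairs a passage with a question and four candidate answers, one of which is correct. A minimal sketch of that structure; the field names and example content here are hypothetical illustrations, not the dataset's official schema:

    ```python
    # Hypothetical RACE-style item; field names and content are illustrative only.
    item = {
        "passage": "The library opens at nine and closes at five on weekdays.",
        "question": "When does the library open on weekdays?",
        "options": ["At five", "At nine", "At noon", "It never opens"],
        "answer": "B",  # exactly one of the 4 candidates is correct
    }

    def is_correct(item: dict, choice: str) -> bool:
        """Check a model's answer letter against the gold label."""
        return choice == item["answer"]
    ```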

  16. The Complete Pokédex Dataset

    • kaggle.com
    zip
    Updated Dec 19, 2020
    Cristobal Mitchell (2020). The Complete Pokédex Dataset [Dataset]. https://www.kaggle.com/cristobalmitchell/pokedex
    Explore at:
    zip(63002 bytes)Available download formats
    Dataset updated
    Dec 19, 2020
    Authors
    Cristobal Mitchell
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Context

    This dataset contains information on all 898 Pokémon from all eight generations of Pokémon. The information contained in this dataset includes Base Stats, Height, Weight, Classification, Abilities, etc.

    Content

    • national_number: The entry number of the Pokémon in the National Pokédex
    • gen: The numbered generation which the Pokémon was first introduced
    • english_name: The English name of the Pokémon
    • japanese_name: The Original Japanese name of the Pokémon
    • primary_type: The Primary Type of the Pokémon
    • secondary_type: The Secondary Type of the Pokémon
    • classification: The Classification of the Pokémon as described by the Sun and Moon or Sword and Shield Pokédex
    • percent_male: The percentage of the species that are male (Blank if the Pokémon is genderless)
    • percent_female: The percentage of the species that are female (Blank if the Pokémon is genderless)
    • height_m: Height of the Pokémon in metres
    • weight_kg: The Weight of the Pokémon in kilograms
    • capture_rate: Capture Rate of the Pokémon
    • base_egg_steps: The number of steps required to hatch an egg of the Pokémon
    • hp: The Base HP of the Pokémon
    • attack: The Base Attack of the Pokémon
    • defense: The Base Defense of the Pokémon
    • sp_attack: The Base Special Attack of the Pokémon
    • sp_defense: The Base Special Defense of the Pokémon
    • speed: The Base Speed of the Pokémon
    • abilities: A list of abilities that the Pokémon is capable of having
    • against_*: Eighteen features that denote the amount of damage taken against an attack of a particular type
    • is_sublegendary: Denotes if the Pokémon is sublegendary
    • is_legendary: Denotes if the Pokémon is legendary
    • is_mythical: Denotes if the Pokémon is mythical
    • evochain_*: Seven features that indicate the evolutionary chain and triggers

    For more information on the legendary status of Pokémon see https://www.serebii.net/pokemon/legendary.shtml


    Acknowledgements

    The dataset was scraped from http://serebii.net/

    Inspiration

    How do the height and weight of a Pokémon correlate with its various base stats? What are the general distributions for the various Pokémon segments? What factors influence Experience Growth and Egg Steps? Are these quantities correlated? Which type is the strongest overall? Which is the weakest?
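    Questions like the first one can be explored directly with pandas, using the column names listed above. A minimal sketch, assuming pandas is installed; the numeric values below are made up for illustration, not actual Pokédex entries:

    ```python
    import pandas as pd

    # Toy rows using the real column names from the dataset description;
    # the values here are illustrative, not actual Pokédex data.
    df = pd.DataFrame({
        "height_m":  [0.4, 1.7, 2.0, 0.6, 1.1],
        "weight_kg": [6.0, 90.5, 210.0, 8.5, 32.0],
        "hp":        [35, 78, 106, 40, 60],
        "attack":    [55, 84, 130, 45, 62],
    })

    # Pairwise Pearson correlations between size features and base stats
    corr = df[["height_m", "weight_kg", "hp", "attack"]].corr()
    print(corr.loc["height_m", "hp"])
    ```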

  17. Synthetic Integrated Services Data

    • data.wprdc.org
    csv, html, pdf, zip
    Updated Jun 25, 2024
    Allegheny County (2024). Synthetic Integrated Services Data [Dataset]. https://data.wprdc.org/dataset/synthetic-integrated-services-data
    Explore at:
    html, csv(1375554033), zip(39231637), pdfAvailable download formats
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Allegheny County
    Description

    Motivation

    This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.
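    As an illustration of the core idea only (not the county's actual pipeline, which is described in the technical report linked below): fit a simple statistical model to a sensitive column, then release samples drawn from the model instead of the real values.

    ```python
    import random
    import statistics

    # Confidential values (made up for illustration)
    real_ages = [23, 35, 41, 29, 52, 47, 38, 31, 44, 36]

    # Fit a simple parametric model: here, a normal distribution
    mu = statistics.mean(real_ages)
    sigma = statistics.stdev(real_ages)

    # Release synthetic draws from the model rather than the real records;
    # aggregate properties are roughly preserved, individual records are not.
    random.seed(0)
    synthetic_ages = [round(random.gauss(mu, sigma)) for _ in real_ages]
    ```

    Real synthesizers model joint distributions across many columns and evaluate candidate datasets for utility and privacy, as the user guide describes; this sketch shows only the substitution principle.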

    This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.

    Collection

    The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.

    Preprocessing

    Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.

    For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.

    Recommended Uses

    This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of this data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.

    Known Limitations/Biases

    Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.

    Feedback

    Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).

    Further Documentation and Resources

    1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
    2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
    3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
    4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.

  18. Female Faces - Image Dataset

    • kaggle.com
    Updated Apr 26, 2024
    + more versions
    Training Data (2024). Female Faces - Image Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/female-selfie-image-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 26, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Face Recognition, Face Detection, Female Photo Dataset 👩

    If you are interested in biometric data - visit our website to learn more and buy the dataset :)

    90,000+ photos of 46,000+ women from 141 countries. The dataset includes photos of people's faces. All people presented in the dataset are women. The dataset contains a variety of images capturing individuals from diverse backgrounds and age groups.

    Our dataset will diversify your data by adding more photos of women of different ages and ethnic groups, enhancing the quality of your model.

    People in the dataset


    The dataset can be utilized for a wide range of tasks, including face recognition, age estimation, image feature extraction, or any problem related to human image analysis.

    💴 For Commercial Usage: Full version of the dataset includes 90,000+ photos of people, leave a request on TrainingData to buy the dataset

    Metadata for the dataset:

    • id - unique identifier of the media file
    • photo - link to access the photo
    • age - age of the person
    • gender - gender of the person
    • country - country of the person
    • ethnicity - ethnicity of the person
    • photo_extension - photo extension
    • photo_resolution - photo resolution

    Statistics for the dataset


    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to learn about the price and buy the dataset

    Content

    The dataset consists of:
    • files - includes 20 images corresponding to each person in the sample
    • .csv file - contains information about the images and people in the dataset

    File with the extension .csv

    • id - id of the person
    • age - age of the person
    • country - country of the person
    • ethnicity - ethnicity of the person
    • photo_extension - extension of the photo
    • photo_resolution - resolution of the photo
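    The .csv metadata can be read with the standard library. A small sketch using the documented field names; the row values are made up for illustration:

    ```python
    import csv
    import io

    # A made-up row matching the documented fields (values are illustrative).
    raw = io.StringIO(
        "id,age,country,ethnicity,photo_extension,photo_resolution\n"
        "1042,34,Brazil,Latino,jpg,1920x1080\n"
    )

    rows = list(csv.DictReader(raw))
    # Example filter: keep only adults
    adults = [r for r in rows if int(r["age"]) >= 18]
    ```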

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: biometric system, biometric dataset, face recognition database, face recognition dataset, face detection dataset, facial analysis, object detection dataset, deep learning datasets, computer vision dataset, human images dataset, human faces dataset, machine learning, image-to-image, verification models, digital photo-identification, women images, females dataset, female selfie, female face recognition

  19. TikTok Dataset

    • paperswithcode.com
    Updated Jul 22, 2024
    Yasamin Jafarian; Hyun Soo Park (2024). TikTok Dataset [Dataset]. https://paperswithcode.com/dataset/tiktok-dataset
    Explore at:
    Dataset updated
    Jul 22, 2024
    Authors
    Yasamin Jafarian; Hyun Soo Park
    Description

    We learn high-fidelity human depths by leveraging a collection of social media dance videos scraped from the TikTok mobile social networking application. It is by far one of the most popular video sharing applications across generations, and it includes short videos (10-15 seconds) of diverse dance challenges. We manually selected more than 300 dance videos from monthly TikTok dance challenge compilations, each capturing a single person performing dance moves, chosen for variety and type of dance and for moderate movements that do not generate excessive motion blur. For each video, we extract RGB images at 30 frames per second, resulting in more than 100K images. We segmented these images using the Removebg application, and computed the UV coordinates from DensePose.
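    The 30 fps frame extraction described above can be reproduced with ffmpeg. A hedged sketch that only builds the command line (running it requires ffmpeg on your PATH; the file names are placeholders, not part of the dataset):

    ```python
    import subprocess

    def ffmpeg_frames_cmd(video_path: str, out_dir: str, fps: int = 30) -> list[str]:
        """Build an ffmpeg command that extracts frames at the given rate."""
        return [
            "ffmpeg", "-i", video_path,
            "-vf", f"fps={fps}",      # sample at 30 fps, as in the dataset
            f"{out_dir}/%06d.png",    # zero-padded frame file names
        ]

    cmd = ffmpeg_frames_cmd("dance_clip.mp4", "frames")
    # subprocess.run(cmd, check=True)  # uncomment to actually extract frames
    ```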

    Download TikTok Dataset:

    Please use the dataset for research purposes only.

    The dataset can be viewed and downloaded from the Kaggle page (you need to create a Kaggle account to download the data; it is free!).

    The dataset can also be downloaded from here (42 GB). The dataset resolution is 1080 x 604.

    The original YouTube videos corresponding to each sequence, along with the dance names, can be downloaded from here (2.6 GB).

  20. [Dataset] Data for the course "Population Genomics" at Aarhus University

    • zenodo.org
    application/gzip, bin
    Updated Jan 8, 2025
    + more versions
    Samuele Soraggi; Samuele Soraggi; Kasper Munch; Kasper Munch (2025). [Dataset] Data for the course "Population Genomics" at Aarhus University [Dataset]. http://doi.org/10.5281/zenodo.7670839
    Explore at:
    application/gzip, binAvailable download formats
    Dataset updated
    Jan 8, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Samuele Soraggi; Samuele Soraggi; Kasper Munch; Kasper Munch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets, conda environments, and software for the course "Population Genomics" of Prof. Kasper Munch. This course material is maintained by the health data science sandbox. This webpage shows the latest version of the course material.

    1. Data.tar.gz contains the datasets and executable files for some of the software.
      You can unpack it by simply doing
      tar -zxf Data.tar.gz -C ./
      This will create a folder called Data with the uncompressed material inside.
    2. Course_Env.packed.tar.gz contains the conda environment used for the course. It needs to be unpacked to adjust all the prefixes (note: this environment was created on Ubuntu 22.10). You do this on the command line by
      1. creating the folder Course_Env: mkdir Course_Env
      2. untarring the file: tar -zxf Course_Env.packed.tar.gz -C Course_Env
      3. activating the environment: conda activate ./Course_Env
      4. running the unpacking script (it can take quite some time): conda-unpack
    3. Course_Env.unpacked.tar.gz is the same environment as above, but will work only if untarred into the folder /usr/Material - so use the version above if you are working in another folder. This file is mostly used to execute the course in our own cloud environment.
    4. environment_with_args.yml is the file needed to generate the conda environment. Create and activate the environment with the following commands:
      1. conda env create -f environment_with_args.yml -p ./Course_Env
      2. conda activate ./Course_Env

    The data is connected to the following repository: https://github.com/hds-sandbox/Popgen_course_aarhus. The original course material from Prof Kasper Munch is at https://github.com/kaspermunch/PopulationGenomicsCourse.

    Description

    After the course, the participants will have detailed knowledge of the methods and applications required to perform a typical population genomic study.

    At the end of the course, the participants must be able to:

    • Identify an experimental platform relevant to a population genomic analysis.
    • Apply commonly used population genomic methods.
    • Explain the theory behind common population genomic methods.
    • Reflect on strengths and limitations of population genomic methods.
    • Interpret and analyze results of population genomic inference.
    • Formulate population genetics hypotheses based on data.

    The course introduces key concepts in population genomics, from generation of population genetic data sets to the most common population genetic analyses and association studies. The first part of the course focuses on generation of population genetic data sets. The second part introduces the most common population genetic analyses and their theoretical background. Here topics include analysis of demography, population structure, recombination and selection. The last part of the course focuses on applications of population genetic data sets for association studies in relation to human health.

    Curriculum

    The curriculum for each week is listed below. "Coop" refers to a set of lecture notes by Graham Coop that we will use throughout the course.

    Course plan

    1. Course intro and overview:
    2. Drift and the coalescent:
    3. Recombination:
    4. Population structure and incomplete lineage sorting:
    5. Hidden Markov models:
    6. Ancestral recombination graphs:
    7. Past population demography:
    8. Direct and linked selection:
    9. Admixture:
    10. Genome-wide association study (GWAS):
    11. Heritability:
      • Lecture: Coop Lecture notes Sec. 2.2 (p23-36) + Chap. 7 (p119-142)
      • Exercise: Association testing
    12. Evolution and disease:
      • Lecture: Coop Lecture notes Sec. 11.0.1 (p217-221)
      • Exercise: Estimating heritability