54 datasets found
  1. United States Age Group Population Dataset: A complete breakdown of United...

    • neilsberg.com
    csv, json
    Updated Sep 16, 2023
    + more versions
    Cite
    Neilsberg Research (2023). United States Age Group Population Dataset: A complete breakdown of United States age demographics from 0 to 85 years, distributed across 18 age groups [Dataset]. https://www.neilsberg.com/research/datasets/5fd2b2bb-3d85-11ee-9abe-0aa64bf2eeb2/
    Available download formats: json, csv
    Dataset updated
    Sep 16, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Variables measured
    Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we first analyzed and categorized the data for each age group. We divided ages between 0 and 84 into 5-year buckets and aggregated all ages of 85 and over into a single group. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the United States population distribution across 18 age groups. It lists the population in each age group along with its percentage of the total population of the United States. The dataset can be used to understand the population distribution of the United States by age. For example, using this dataset, we can identify the largest age group in the United States.

    Key observations

    The largest age group in the United States was 25-29 years, with a population of 22,854,328 (6.93%), according to the 2021 American Community Survey. The smallest age group was 80-84 years, with a population of 5,932,196 (1.80%). Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over
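
    The 18 groups listed above can be reproduced with a small bucketing function. This is a minimal sketch, not Neilsberg's code: the function name `age_group` and the use of whole-year integer ages are assumptions made for illustration.

```python
def age_group(age):
    """Map an age in whole years to one of the 18 age-group labels.

    Hypothetical helper: the label strings follow the list above,
    but the function itself is not part of the dataset.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 5:
        return "Under 5 years"
    if age >= 85:
        return "85 years and over"
    lower = (age // 5) * 5  # first age of the 5-year bucket
    return f"{lower} to {lower + 4} years"


# Collect the distinct labels, in order, for ages 0 through 100.
labels = list(dict.fromkeys(age_group(a) for a in range(101)))
print(len(labels))
print(age_group(27))
```

    Ages of 85 and over fall into a single open-ended group, which is why the bucket arithmetic is only applied to ages 5 through 84.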

    Variables / Data Columns

    • Age Group: This column displays the age group in consideration
    • Population: The population for the specific age group in the United States is shown in this column.
    • % of Total Population: This column displays the population of each age group as a percentage of the United States total population. Please note that the percentages may not sum to exactly 100% due to rounding.
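
    The relationship between the Population and % of Total Population columns, and the rounding effect mentioned above, can be sketched as follows. The function name `percent_shares` and the toy counts are hypothetical and are not taken from the dataset.

```python
def percent_shares(populations, digits=2):
    """Return each group's share of the total, in percent, rounded to `digits`."""
    total = sum(populations.values())
    if total <= 0:
        raise ValueError("total population must be positive")
    return {group: round(100 * count / total, digits)
            for group, count in populations.items()}


# Toy counts (hypothetical) chosen so that rounding is visible.
toy = {"Under 5 years": 1, "5 to 9 years": 1, "10 to 14 years": 1}
shares = percent_shares(toy)
print(shares)
print(round(sum(shares.values()), 2))
```

    With three equal groups, each share rounds to 33.33, so the rounded shares add up to slightly less than 100.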

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for United States Population by Age.

  2. Dataset of books called Boomer nation : the largest and richest generation...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Boomer nation : the largest and richest generation ever, and how it changed America [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Boomer+nation+%3A+the+largest+and+richest+generation+ever%2C+and+how+it+changed+America
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    This dataset is about books. It has 1 row and is filtered where the book is Boomer nation : the largest and richest generation ever, and how it changed America. It features 7 columns including author, publication date, language, and book publisher.

  3. Lizard dataset

    • kaggle.com
    Updated Dec 6, 2021
    Cite
    Aadam (2021). Lizard dataset [Dataset]. https://www.kaggle.com/datasets/aadimator/lizard-dataset/data
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 6, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aadam
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The development of deep segmentation models for computational pathology (CPath) can help foster the investigation of interpretable morphological biomarkers. Yet, there is a major bottleneck in the success of such approaches because supervised deep learning models require an abundance of accurately labelled data. This issue is exacerbated in the field of CPath because the generation of detailed annotations usually demands the input of a pathologist to be able to distinguish between different tissue constructs and nuclei. Manually labelling nuclei may not be a feasible approach for collecting large-scale annotated datasets, especially when a single image region can contain thousands of different cells. Yet, solely relying on automatic generation of annotations will limit the accuracy and reliability of ground truth. Therefore, to help overcome the above challenges, we propose a multi-stage annotation pipeline to enable the collection of large-scale datasets for histology image analysis, with pathologist-in-the-loop refinement steps. Using this pipeline, we generate the largest known nuclear instance segmentation and classification dataset, containing nearly half a million labelled nuclei in H&E stained colon tissue. We will publish the dataset and encourage the research community to utilise it to drive forward the development of downstream cell-based models in CPath.

    Link to the dataset paper.

    Citation

    @inproceedings{graham2021lizard,
     title={Lizard: A Large-Scale Dataset for Colonic Nuclear Instance Segmentation and Classification},
     author={Graham, Simon and Jahanifar, Mostafa and Azam, Ayesha and Nimir, Mohammed and Tsang, Yee-Wah and Dodd, Katherine and Hero, Emily and Sahota, Harvir and Tank, Atisha and Benes, Ksenija and others},
     booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
     pages={684--693},
     year={2021}
    }
    

    Acknowledgements

    We would like to acknowledge the following institutions, from which the images in this dataset originated:

    • University Hospitals Coventry and Warwickshire, United Kingdom
    • Histo Pathology Diagnostic Center, Shanghai, China
    • Ruijin Hospital, Shanghai, China
    • Xijing Hospital, Xi'an, China
    • Shanghai Songjiang District Central Hospital, Shanghai, China
    • The National Cancer Institute (NCI), United States of America
  4. Top 20 Countries Wind Power Generation Capacity

    • datasource.kapsarc.org
    • data.kapsarc.org
    Updated Dec 26, 2017
    Cite
    (2017). Top 20 Countries Wind Power Generation Capacity [Dataset]. https://datasource.kapsarc.org/explore/dataset/top-20-countries-wind-power-generation-capacity/
    Dataset updated
    Dec 26, 2017
    Description

    Source: BP, World Energy Statistics 2017, June 2017.

  5. Employee Data | The Largest Dataset Of Active Profiles | Global / 1B Records...

    • datarade.ai
    .json
    Updated Apr 19, 2025
    Cite
    Avanteer (2025). Employee Data | The Largest Dataset Of Active Profiles | Global / 1B Records / Updated Daily [Dataset]. https://datarade.ai/data-products/employee-data-the-largest-dataset-of-active-profiles-glob-avanteer
    Available download formats: .json
    Dataset updated
    Apr 19, 2025
    Dataset authored and provided by
    Avanteer
    Area covered
    Fiji, Maldives, State of, Anguilla, Pitcairn, United Arab Emirates, Gambia, Nicaragua, Tunisia, Bulgaria
    Description

    //// 🌍 Avanteer Employee Data ////

    The Largest Dataset of Active Global Profiles 1B+ Records | Updated Daily | Built for Scale & Accuracy

    Avanteer’s Employee Data offers unparalleled access to the world’s most comprehensive dataset of active professional profiles. Designed for companies building data-driven products or workflows, this resource supports recruitment, lead generation, enrichment, and investment intelligence — with unmatched scale and update frequency.

    //// 🔧 What You Get ////

    1B+ active profiles across industries, roles, and geographies

    Work history, education history, languages, skills and multiple additional datapoints.

    AI-enriched datapoints include: gender, age, normalized seniority, normalized department, normalized skillset, and MBTI assessment.

    Daily updates, with change-tracking fields to capture job changes, promotions, and new entries.

    Flexible delivery via API, S3, or flat file.

    Choice of formats: raw, cleaned, or AI-enriched.

    Built-in compliance aligned with GDPR and CCPA.

    //// 💡 Key Use Cases ////

    ✅ Smarter Talent Acquisition: Identify, enrich, and engage high-potential candidates using up-to-date global profiles.

    ✅ B2B Lead Generation at Scale: Build prospecting lists with confidence using job-related and firmographic filters to target decision-makers across verticals.

    ✅ Data Enrichment for SaaS & Platforms: Supercharge ATS, CRMs, or HR tech products by syncing enriched, structured employee data through real-time or batch delivery.

    ✅ Investor & Market Intelligence: Analyze team structures, hiring trends, and senior leadership signals to discover early-stage investment opportunities or evaluate portfolio companies.

    //// 🧰 Built for Top-Tier Teams Who Move Fast ////

    Zero duplicates, by design

    <300ms API response time

    99.99% guaranteed API uptime

    Onboarding support including data samples, test credits, and consultations

    Advanced data quality checks

    //// ✅ Why Companies Choose Avanteer ////

    ➔ The largest daily-updated dataset of global professional profiles

    ➔ Trusted by sales, HR, and data teams building at enterprise scale

    ➔ Transparent, compliant data collection with opt-out infrastructure baked in

    ➔ Dedicated support with fast onboarding and hands-on implementation help

    ////////////////////////////////

    Empower your team with reliable, current, and scalable employee data — all from a single source.

  6. AGENDA Dataset

    • paperswithcode.com
    Updated Jul 8, 2021
    Cite
    Rik Koncel-Kedziorski; Dhanush Bekal; Yi Luan; Mirella Lapata; Hannaneh Hajishirzi (2021). AGENDA Dataset [Dataset]. https://paperswithcode.com/dataset/agenda
    Dataset updated
    Jul 8, 2021
    Authors
    Rik Koncel-Kedziorski; Dhanush Bekal; Yi Luan; Mirella Lapata; Hannaneh Hajishirzi
    Description

    Abstract GENeration DAtaset (AGENDA) is a dataset of knowledge graphs paired with scientific abstracts. The dataset consists of 40k paper titles and abstracts from the Semantic Scholar Corpus taken from the proceedings of 12 top AI conferences.

  7. playground-popular

    • huggingface.co
    Updated Aug 6, 2024
    Cite
    BIG data (2024). playground-popular [Dataset]. https://huggingface.co/datasets/bigdata-pw/playground-popular
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    BIG data
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Dataset Card for Playground Popular

    Most popular image generations by number of likes.

    Dataset Details

    Dataset Description

    A subset of bigdata-pw/playground filtered to the most popular 1 million images by number of likes. Entries include generation details such as prompts and model used, anonymized user information, creation date, and URL to the image.

    Curated by: hlky
    License: Open Data Commons Attribution License (ODC-By) v1.0

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/bigdata-pw/playground-popular.
    
  8. LAS&T: Large Shape And Texture Dataset

    • zenodo.org
    jpeg, zip
    Updated May 26, 2025
    Cite
    Sagi Eppel (2025). LAS&T: Large Shape And Texture Dataset [Dataset]. http://doi.org/10.5281/zenodo.15453634
    Available download formats: jpeg, zip
    Dataset updated
    May 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sagi Eppel
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Large Shape And Texture Dataset (LAS&T)

    LAS&T is the largest and most diverse dataset for shape, texture and material recognition and retrieval in 2D and 3D with 650,000 images, based on real world shapes and textures.

    Overview

    The LAS&T Dataset aims to test the most basic aspect of vision in the most general way: the ability to identify any shape, texture, and material in any setting and environment, without being limited to specific types or classes of objects, materials, and environments. For shapes, this means identifying and retrieving any shape in 2D or 3D with every element of the shape changed between images, including the shape material and texture, orientation, size, and environment. For textures and materials, the goal is to recognize the same texture or material when appearing on different objects, environments, and light conditions. The dataset relies on shapes, textures, and materials extracted from real-world images, leading to an almost unlimited quantity and diversity of real-world natural patterns. Each section of the dataset (shapes and textures) contains 3D parts that rely on physics-based scenes with realistic light, material, and object simulation, as well as abstract 2D parts. In addition, there is a real-world benchmark for 3D shapes.

    Main Dataset webpage

    The dataset contains four parts:

    3D shape recognition and retrieval.

    2D shape recognition and retrieval.

    3D Materials recognition and retrieval.

    2D Texture recognition and retrieval.

    Each can be used independently for training and testing.

    Additional assets are a set of 350,000 natural 2D shapes extracted from real-world images (SHAPES_COLLECTION_350k.zip)

    3D shape recognition real-world images benchmark

    The scripts used to generate and test the dataset are supplied in the SCRIPT** files.

    Shapes Recognition and Retrieval:

    For shape recognition, the goal is to identify the same shape in different images, where the material/texture/color of the shape is changed, the shape is rotated, and the background is replaced. Hence, only the shape remains the same in both images. All files with 3D shapes contain samples of the 3D shape dataset, which tests 3D shapes/objects with realistic light simulation. All files with 2D shapes contain samples of the 2D shape dataset. Example files contain images with examples for each set.

    Main files:

    Real_Images_3D_shape_matching_Benchmarks.zip contains real-world image benchmarks for 3D shapes.

    3D_Shape_Recognition_Synthethic_GENERAL_LARGE_SET_76k.zip A large number of synthetic examples of 3D shapes with maximum variability, which can be used for training/testing 3D shape/object recognition/retrieval.

    2D_Shapes_Recognition_Textured_Synthetic_Resize2_GENERAL_LARGE_SET_61k.zip A large number of synthetic examples of 2D shapes with maximum variability, which can be used for training/testing 2D shape recognition/retrieval.

    SHAPES_2D_365k.zip 365,000 2D shapes extracted from real-world images saved as black and white .png image files.

    File structure:

    All jpg images that are in the exact same subfolder contain the exact same shape (but with different texture/color/background/orientation).
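
    Because every subfolder holds views of one shape, matching pairs can be built directly from the directory layout. The following is a minimal sketch under that assumption; the helper names `group_by_subfolder` and `positive_pairs` are hypothetical and are not part of the dataset's own scripts.

```python
from collections import defaultdict
from itertools import combinations
from pathlib import Path


def group_by_subfolder(root):
    """Group all .jpg files under `root` by their immediate parent folder."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*.jpg")):
        groups[path.parent].append(path)
    return dict(groups)


def positive_pairs(groups):
    """Return every pair of images that share a subfolder (same shape)."""
    pairs = []
    for images in groups.values():
        pairs.extend(combinations(images, 2))
    return pairs
```

    Calling `positive_pairs(group_by_subfolder("path/to/unzipped/set"))` (the path is a placeholder) yields candidate same-shape pairs; images taken from different subfolders can serve as negatives.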

    Textures and Materials Recognition and Retrieval

    For textures and materials, the goal is to identify and match images containing the same material or texture, while the shape/object on which the material or texture is applied is different, and so are the background and light.

    This is done for physics-based material in 3D and abstract 2D textures.

    3D_Materials_PBR_Synthetic_GENERAL_LARGE_SET_80K.zip A large number of examples of physically grounded 3D materials, which can be used for training or testing material recognition/retrieval.

    2D_Textures_Recogition_GENERAL_LARGE_SET_Synthetic_53K.zip A large number of images of 2D textures in a maximum variety of settings, which can be used for training/testing 2D texture recognition/retrieval.

    File structure:

    All jpg images that are in the exact same subfolder contain the exact same texture/material (but overlaid on different objects with different background, illumination, and orientation).

    Data Generation:

    The images in the synthetic part of the dataset were created by automatically extracting shapes and textures from natural images and combining them in synthetic images. This created synthetic images that completely rely on real-world patterns, making extremely diverse and complex shapes and textures. As far as we know, this is the largest and most diverse shape and texture recognition/retrieval dataset. 3D data was generated using physics-based material and rendering (Blender), making the images physically grounded and enabling the data to be used to train for real-world examples. The scripts for generating the data are supplied in files with the word SCRIPTS* in them.

    Real-world image data:

    For 3D shape recognition and retrieval, we also supply a real-world natural image benchmark, with a variety of natural images containing the exact same 3D shape but made of or coated with different materials and in different environments and orientations. The goal is again to identify the same shape in different images. The benchmark is available at: Real_Images_3D_shape_matching_Benchmarks.zip

    File structure:

    Files containing the word 'GENERAL_LARGE_SET' contain synthetic images that can be used for training or testing; the type of data (2D shapes, 3D shapes, 2D textures, 3D materials) appears in the file name, as well as the number of images. Files containing 'MultiTests' contain a number of different tests in which only a single aspect of the instance is changed (for example, only the background). Files containing 'SCRIPTS' contain data generation and testing scripts. Images containing 'examples' are examples of each test.
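
    The naming convention described above can be read mechanically. The sketch below is an illustration only: the `describe` helper, the prefix table, and the assumption that the image count always appears as a trailing `<number>k` or `<number>K` before `.zip` are inferred from the four file names listed on this page.

```python
import re

# Prefix-to-type table inferred from the file names listed above (assumption).
KINDS = {
    "3D_Shape": "3D shapes",
    "2D_Shapes": "2D shapes",
    "3D_Materials": "3D materials",
    "2D_Textures": "2D textures",
}

# Trailing image count such as "_76k.zip" or "_80K.zip".
COUNT_RE = re.compile(r"_(\d+)[kK]\.zip$")


def describe(file_name):
    """Extract data type, approximate image count, and set flag from a file name."""
    kind = next((label for prefix, label in KINDS.items()
                 if file_name.startswith(prefix)), None)
    match = COUNT_RE.search(file_name)
    images = int(match.group(1)) * 1000 if match else None
    return {
        "kind": kind,
        "images": images,
        "general_large_set": "GENERAL_LARGE_SET" in file_name,
    }


print(describe("3D_Materials_PBR_Synthetic_GENERAL_LARGE_SET_80K.zip"))
```

    File names that do not follow this pattern simply come back with `None` for the fields that could not be parsed.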

    Shapes Collections

    The file SHAPES_COLLECTION_350k.zip contains 350,000 2D shapes extracted from natural images and used for the dataset generation.

    Evaluating and Testing

    For evaluating and testing see: SCRIPTS_Testing_LVLM_ON_LAST_VQA.zip
    This can be used to test leading LVLMs using an API, create human tests, and in general turn the dataset into multiple-choice question images similar to the ones in the paper.

  9. Amount of data created, consumed, and stored 2010-2023, with forecasts to...

    • statista.com
    Updated Jun 30, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    May 2024
    Area covered
    Worldwide
    Description

    The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching *** zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than *** zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.

    Storage capacity also growing

    Only a small percentage of this newly created data is kept though, as just * percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of **** percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached *** zettabytes.

  10. GLAMI-1M: A Multilingual Image-Text Fashion Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 19, 2023
    + more versions
    Cite
    Vaclav Kosar; Antonín Hoskovec; Milan Šulc; Radek Bartyzal (2023). GLAMI-1M: A Multilingual Image-Text Fashion Dataset [Dataset]. http://doi.org/10.5281/zenodo.7326406
    Available download formats: zip
    Dataset updated
    Jan 19, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Vaclav Kosar; Antonín Hoskovec; Milan Šulc; Radek Bartyzal
    License

    http://www.apache.org/licenses/LICENSE-2.0

    Description

    We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages. Categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification showing that the dataset presents a challenging fine-grained classification problem: The best scoring EmbraceNet model using both visual and textual features achieves 69.7% accuracy. Experiments with a modified Imagen model show the dataset is also suitable for image generation conditioned on text. The dataset, source code and model checkpoints are published at: https://github.com/glami/glami-1m.

  11. Illiterate Population - Dataset OD Mekong Datahub

    • data.opendevelopmentmekong.net
    Updated Mar 8, 2018
    Cite
    (2018). Illiterate Population - Dataset OD Mekong Datahub [Dataset]. https://data.opendevelopmentmekong.net/dataset/illiterate-population
    Dataset updated
    Mar 8, 2018
    License

    Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Despite the steady rise in literacy rates over the past 50 years, there are still 750 million illiterate adults around the world, most of whom are women. These numbers produced by the UIS are a stark reminder of the work ahead to meet the Sustainable Development Goals (SDGs), especially Target 4.6 to ensure that all youth and most adults achieve literacy and numeracy by 2030. Current literacy data are generally collected through population censuses or household surveys in which the respondent or head of the household declares whether they can read and write with understanding a short, simple statement about one's everyday life in any written language. Some surveys require respondents to take a quick test in which they are asked to read a simple passage or write a sentence, yet clearly literacy is a far more complex issue that requires more information. For the UIS, the existing dataset serves as a placeholder for a new generation of indicators being developed with countries and partners under the umbrella of the Global Alliance to Monitor Learning (GAML). GAML is developing the methodologies needed to gather more nuanced data and the tools required for their standardisation. In particular, the Alliance is finding ways to link existing large-scale assessments to produce comparable data to monitor the literacy skills of children, youth and adults. This involves close collaboration with a wide range of partners.

  12. PV Generation and Consumption Dataset of an Estonian Residential Dwelling

    • data.taltech.ee
    Updated Mar 22, 2025
    Cite
    Sayeed Hasan; Andrei Blinov; Andrii Chub; Dmitri Vinnikov (2025). PV Generation and Consumption Dataset of an Estonian Residential Dwelling [Dataset]. http://doi.org/10.48726/6hayh-x0h25
    Dataset updated
    Mar 22, 2025
    Dataset provided by
    TalTech Data Repository
    Authors
    Sayeed Hasan; Andrei Blinov; Andrii Chub; Dmitri Vinnikov
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Estonia
    Description

    This is a Residential PV generation and consumption data set from an Estonian house. At the time of submission, one year (2023) of data was available. The data was logged at a 10-second resolution. The untouched dataset can be found in the raw data folder, which is separated month-wise. A few missing points in the dataset were filled with a simple KNN algorithm. However, improved data imputation methods based on machine learning are also possible. To carry out the imputing, run the scripts in the script folder one by one in the numerical serial order (SC1..py, SC2..py, etc.).

    Data Descriptor (Scientific Data): https://doi.org/10.1038/s41597-025-04747-w

    General Information:

    Duration: January 2023 – December 2023

    Resolution: 10 seconds

    Dataset Type: Aggregated consumption and PV generation data

    Logging Device: Camile Bauer PQ1000 (×2)

    Load/Appliance Information:

    • 5 kW Rooftop PV array connected to AC Bus via 4.2 kW 3-ϕ Inverter
    • Air conditioner: 0.44 kW (Cooling), 0.62 kW (Heating)
    • Air to Water (ATW) Heat Pump: 2.5 kW (Cooling), 2.6 kW (Heating)
    • ATW Cylinder unit: 0.21 kW (Controller), 9 kW (Booster Heater)
    • Microwave oven: 0.9 kW
    • Coffee Maker: 1 kW
    • Cooktop Hot Plate: 4.6 kW
    • TV: 0.103 kW
    • Vacuum Cleaner: 1.5 kW
    • Ventilation: 0.1 kW
    • Washing Machine: 2.2 kW
    • Electric Sauna: 10 kW
    • Lighting: 0.25 kW
    • EV charger: 2.4 kW 1-ϕ

    Measurement Points:

    1. PV converter-side current transformer, potential transformer (Measurement of PV generation).
    2. Utility meter-side current transformer, potential transformer (Measurement of power exchange with the grid).

    Measured Parameters:

    • Per-phase mean power recorded within the sampling period
    • Per-phase minimum power recorded within the sampling period
    • Per-phase maximum power recorded within the sampling period
    • Quadrant-wise mean power recorded within the sampling period (1st + 3rd), (2nd + 4th)
    • Quadrant-wise minimum power recorded within the sampling period (1st + 3rd), (2nd + 4th)
    • Quadrant-wise maximum power recorded within the sampling period (1st + 3rd), (2nd + 4th)
    • Mean power factor recorded within the sampling period
    • Minimum power factor recorded within the sampling period
    • Maximum power factor recorded within the sampling period
    • System Voltage
    • Minimum system Voltage
    • Maximum system Voltage
    • Mean Voltage between phase and neutral
    • Minimum voltage between phase and neutral
    • Maximum voltage between phase and neutral
    • Zero displacement voltage 4-wire systems (mean, min, max)

    Script Description:

    SC1_PV_auto_sort.py : This fixes timestamp continuity by resampling at the original sampling rate for PV generation data.

    SC2_L2_auto_sort.py : This fixes timestamp continuity by resampling at the original sampling rate for meter-side measurement data.

    SC3_PV_KNN_impute.py : Filling missing data points by simple KNN for PV generation data.

    SC4_L2_KNN_impute.py : Filling missing data points by simple KNN for meter-side measurement data.

    SC5_Final_data_gen.py : Merge PV and meter-side measurement data, and calculate load consumption.

    The dataset provides all the outcomes (CSV files) from the scripts. All processed variables (PV generation, load, power import, and export) are expressed in kW units.
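
    The timestamp-continuity step described for SC1 and SC2 can be illustrated with a regular 10-second grid on which missing samples become visible. This is only a sketch of the idea, not the authors' scripts: the function name `reindex_to_grid`, the dictionary input format, and the assumption that logged timestamps already fall exactly on the grid are all hypothetical.

```python
from datetime import datetime, timedelta

STEP = timedelta(seconds=10)  # logging resolution stated in the description


def reindex_to_grid(samples, step=STEP):
    """Place samples on a regular time grid; gaps are returned as None.

    `samples` maps datetime -> value and is assumed to be aligned to the grid.
    """
    if not samples:
        return []
    start, end = min(samples), max(samples)
    grid = []
    t = start
    while t <= end:
        grid.append((t, samples.get(t)))  # None marks a missing point
        t += step
    return grid


# Hypothetical readings in kW with two missing points.
t0 = datetime(2023, 1, 1, 0, 0, 0)
readings = {t0: 1.0, t0 + timedelta(seconds=10): 1.2, t0 + timedelta(seconds=40): 0.9}
for stamp, value in reindex_to_grid(readings):
    print(stamp.time(), value)
```

    Once the gaps are explicit, they can be counted, inspected, or passed to an imputation step.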

    Update: 'SC1_PV_auto_sort.py' & 'SC2_L2_auto_sort.py' are adequate for cleaning up data and making the missing point visible. 'SC3_PV_KNN_impute.py' & 'SC4_L2_KNN_impute.py' work fine for short-range missing data points; however, these two scripts won't help much for missing data points for a longer period. They are provided as examples of one method of processing data. Future updates will include proper ML-based forecasting to predict missing data points.
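
    To illustrate why nearest-neighbour filling suits short gaps better than long ones, here is a minimal sketch that fills each missing point with the mean of its k nearest known samples, where "nearest" is measured by position in the series. This is an assumption made for illustration; the name `knn_impute` is hypothetical, and the supplied SC3 and SC4 scripts may define neighbours differently.

```python
def knn_impute(values, k=2):
    """Fill None entries with the mean of the k nearest known values.

    Distance is the difference in index, i.e. in sampling steps (assumption).
    """
    known = [(i, v) for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one known value is required")
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            nearest = sorted(known, key=lambda item: abs(item[0] - i))[:k]
            filled[i] = sum(val for _, val in nearest) / len(nearest)
    return filled


print(knn_impute([1.0, None, 3.0], k=2))
```

    For a long gap, most missing points end up with the same few neighbours at the gap's edges, so the filled values carry little information about what happened in between.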


    Funding Agency and Grant Number:

    1. European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement no. 955614.
    2. Estonian Research Council under Grant PRG1086.
    3. Estonian Centre of Excellence in Energy Efficiency, ENER, funded by the Estonian Ministry of Education and Research under Grant TK230.
  13. Test Data Generation Tools Market Report | Global Forecast From 2025 To 2033...

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Test Data Generation Tools Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-test-data-generation-tools-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Test Data Generation Tools Market Outlook



    The global market size for Test Data Generation Tools was valued at USD 800 million in 2023 and is projected to reach USD 2.2 billion by 2032, growing at a CAGR of 12.1% during the forecast period. The surge in the adoption of agile and DevOps practices, along with the increasing complexity of software applications, is driving the growth of this market.
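
The headline figures can be sanity-checked with the standard compound-growth relation, end = start × (1 + r)^n. The sketch below assumes nine compounding years (2023 to 2032); that period count and the helper names are assumptions, not something the report states.

```python
def project(start_value, rate, years):
    """Compound start_value at a constant annual rate for the given number of years."""
    return start_value * (1 + rate) ** years


def implied_cagr(start_value, end_value, years):
    """Constant annual growth rate that takes start_value to end_value in the given years."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the report, in USD million; the 9-year span is an assumption.
projected_2032 = project(800, 0.121, 9)
rate_800_to_2200 = implied_cagr(800, 2200, 9)
```

Under that assumption the projection lands slightly above USD 2.2 billion and the implied rate comes out just under 12%, so the quoted numbers agree with each other to rounding.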



    One of the primary growth factors for the Test Data Generation Tools market is the increasing need for high-quality test data in software development. As businesses shift towards more agile and DevOps methodologies, the demand for automated and efficient test data generation solutions has surged. These tools help in reducing the time required for test data creation, thereby accelerating the overall software development lifecycle. Additionally, the rise in digital transformation across various industries has necessitated the need for robust testing frameworks, further propelling the market growth.



    The proliferation of big data and the growing emphasis on data privacy and security are also significant contributors to market expansion. With the introduction of stringent regulations like GDPR and CCPA, organizations are compelled to ensure that their test data is compliant with these laws. Test Data Generation Tools that offer features like data masking and data subsetting are increasingly being adopted to address these compliance requirements. Furthermore, the increasing instances of data breaches have underscored the importance of using synthetic data for testing purposes, thereby driving the demand for these tools.



    Another critical growth factor is the technological advancements in artificial intelligence and machine learning. These technologies have revolutionized the field of test data generation by enabling the creation of more realistic and comprehensive test data sets. Machine learning algorithms can analyze large datasets to generate synthetic data that closely mimics real-world data, thus enhancing the effectiveness of software testing. This aspect has made AI and ML-powered test data generation tools highly sought after in the market.



    Regional outlook for the Test Data Generation Tools market shows promising growth across various regions. North America is expected to hold the largest market share due to the early adoption of advanced technologies and the presence of major software companies. Europe is also anticipated to witness significant growth owing to strict regulatory requirements and increased focus on data security. The Asia Pacific region is projected to grow at the highest CAGR, driven by rapid industrialization and the growing IT sector in countries like India and China.



    Synthetic Data Generation has emerged as a pivotal component in the realm of test data generation tools. This process involves creating artificial data that closely resembles real-world data, without compromising on privacy or security. The ability to generate synthetic data is particularly beneficial in scenarios where access to real data is restricted due to privacy concerns or regulatory constraints. By leveraging synthetic data, organizations can perform comprehensive testing without the risk of exposing sensitive information. This not only ensures compliance with data protection regulations but also enhances the overall quality and reliability of software applications. As the demand for privacy-compliant testing solutions grows, synthetic data generation is becoming an indispensable tool in the software development lifecycle.



    Component Analysis



    The Test Data Generation Tools market is segmented into software and services. The software segment is expected to dominate the market throughout the forecast period. This dominance can be attributed to the increasing adoption of automated testing tools and the growing need for robust test data management solutions. Software tools offer a wide range of functionalities, including data profiling, data masking, and data subsetting, which are essential for effective software testing. The continuous advancements in software capabilities also contribute to the growth of this segment.



    In contrast, the services segment, although smaller in market share, is expected to grow at a substantial rate. Services include consulting, implementation, and support services, which are crucial for the successful deployment and management of test data generation tools. The increasing complexity of IT inf

  14. Data from: Ensembl TSS dataset for GRCh38

    • zenodo.org
    • portalcienciaytecnologia.jcyl.es
    • +2more
    bin
    Updated Aug 26, 2024
    Cite
    José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio (2024). Ensembl TSS dataset for GRCh38 [Dataset]. http://doi.org/10.5281/zenodo.7147597
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We used the human genome reference sequence in its GRCh38.p13 version as a reliable source of data for our experiments. We chose this version because it is the most recent one available in Ensembl at the moment. However, the DNA sequence by itself is not enough: the specific TSS position of each transcript is also needed. In this section, we explain the steps followed to generate the final dataset: raw data gathering, positive instance processing, negative instance generation, and data splitting by chromosome.

    First, we need an interface to download the raw data, which is composed of every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to get large amounts of data easily. It also enables us to select a wide variety of fields, including the transcription start and end sites. After filtering out instances that present null values in any relevant field, this combination of the sequence and its flanks forms our raw dataset. Once the sequences are available, we find the TSS position (given by Ensembl) and take it together with the 2 following bases, treating the three as a codon. After that, the 700 bases before this codon and the 300 bases after it are concatenated with it, giving the final sequence of 1003 nucleotides used in our models. These window values were used in (Bhandari et al., 2021), and we have kept them for comparison purposes. One of the most sensitive parts of this dataset is the generation of negative instances. This kind of data cannot be obtained in a straightforward manner, so we need to generate it synthetically. To get examples of negative instances, i.e. sequences that do not represent a transcript start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once a position is selected, we take the 700 bases before and the 300 bases after it, as we did with the positive instances.
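
The window arithmetic above (700 + 3 + 300 = 1003) can be illustrated with a short sketch. It is not taken from the authors' repository: the function name, the 0-based indexing, and the choice to return None when the window does not fit are assumptions, and strand orientation is ignored.

```python
def extract_window(sequence, tss_index, upstream=700, downstream=300, codon_len=3):
    """Return upstream + codon + downstream bases around tss_index.

    tss_index is the 0-based position of the first base of the 3-base "codon".
    Returns None when the window would run past either end of the sequence.
    """
    start = tss_index - upstream
    end = tss_index + codon_len + downstream
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


# Toy sequence, purely for illustration.
toy_sequence = "ACGT" * 500           # 2000 bases
window = extract_window(toy_sequence, 1000)
too_close_to_edge = extract_window(toy_sequence, 100)
```

The same function serves for negative instances, since they use the same 700/300 flanks around a non-TSS position.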

    Regarding the positive-to-negative ratio, in a similar problem, but studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this split by chromosome, as is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 for validation because it is a good example of a chromosome with average characteristics. We then selected samples from chromosomes 1, 3, 13, 19 and 21 to form the test set and used the rest to train our models. Every step of this process can be replicated using the scripts available at https://github.com/JoseBarbero/EnsemblTSSPrediction.
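
A minimal sketch of the negative sampling and the chromosome-based split follows. The chromosome assignments come from the description above; the function names, the string chromosome labels, and sampling without replacement are assumptions, and the published scripts may differ.

```python
import random

# Chromosome assignments as described in the text.
VALIDATION_CHROMS = {"16"}
TEST_CHROMS = {"1", "3", "13", "19", "21"}


def assign_split(chrom):
    """Map a chromosome name to train, validation, or test."""
    if chrom in VALIDATION_CHROMS:
        return "validation"
    if chrom in TEST_CHROMS:
        return "test"
    return "train"


def sample_negative_positions(transcript_len, tss_positions, n=10, rng=None):
    """Pick up to n distinct positions in the transcript that are not a TSS."""
    rng = rng or random.Random()
    candidates = [p for p in range(transcript_len) if p not in tss_positions]
    return rng.sample(candidates, min(n, len(candidates)))


# Hypothetical transcript of 1000 bases with a TSS at position 0.
negatives = sample_negative_positions(1000, {0}, n=10, rng=random.Random(0))
```

Splitting by chromosome rather than by instance keeps windows drawn from the same genomic region out of both the training and evaluation sets.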

  15. GARD: Gustavo’s Awesome Runway Dataset (2025)

    • kaggle.com
    Updated Mar 30, 2025
    Cite
    Gustavo de Paula (2025). GARD: Gustavo’s Awesome Runway Dataset (2025) [Dataset]. https://www.kaggle.com/datasets/depaulagu/gard2025
    Explore at:
    Dataset updated
    Mar 30, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gustavo de Paula
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    GARD (Gustavo’s Awesome Runway Dataset) is the largest publicly available synthetic runway image dataset, built to support machine learning tasks in vision-based aircraft landing systems. It contains over 45,000 high-resolution (1024×1024) labeled images.

    This dataset was created using Canny2Concrete, a modular open-source data augmentation pipeline leveraging ControlNet and Stable Diffusion XL. The generation process conditions on edge maps extracted from real-world template images and applies multiple stages of variation including weather, lighting, and occlusion effects.

    Models trained with GARD have been shown to outperform or match those trained on existing synthetic datasets like LARD, especially in challenging segmentation tasks.

    🚀 What’s Inside:

    • BaseImages: Direct and diverse generations from runway edge maps (Canny).
    • VariantImages: Geometric augmentations (rotations, translations, etc).
    • VariantImagesWithOcclusion: Added weather occlusion effects (rain, fog, snow, night).

    Each image includes:
    - 📷 .png image file
    - 🏷 .txt YOLO-format label
    - 🧩 .mask.png segmentation mask
    - 📄 .json full metadata, designed for full reproducibility (prompt, seed, label points, effects applied)

    📂 Resources:

    🏁 Built For:

    • Runway segmentation and detection
    • Computer vision research in aviation
    • Synthetic dataset generation at scale
    • Researchers working on UAV and autonomous landing
  16. AI Training Dataset Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). AI Training Dataset Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-ai-training-dataset-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI Training Dataset Market Outlook



    The global AI training dataset market size was valued at approximately USD 1.2 billion in 2023 and is projected to reach USD 6.5 billion by 2032, growing at a compound annual growth rate (CAGR) of 20.5% from 2024 to 2032. This substantial growth is driven by the increasing adoption of artificial intelligence across various industries, the necessity for large-scale and high-quality datasets to train AI models, and the ongoing advancements in AI and machine learning technologies.



    One of the primary growth factors in the AI training dataset market is the exponential increase in data generation across multiple sectors. With the proliferation of internet usage, the expansion of IoT devices, and the digitalization of industries, there is an unprecedented volume of data being generated daily. This data is invaluable for training AI models, enabling them to learn and make more accurate predictions and decisions. Moreover, the need for diverse and comprehensive datasets to improve AI accuracy and reliability is further propelling market growth.



    Another significant factor driving the market is the rising investment in AI and machine learning by both public and private sectors. Governments around the world are recognizing the potential of AI to transform economies and improve public services, leading to increased funding for AI research and development. Simultaneously, private enterprises are investing heavily in AI technologies to gain a competitive edge, enhance operational efficiency, and innovate new products and services. These investments necessitate high-quality training datasets, thereby boosting the market.



    The proliferation of AI applications in various industries, such as healthcare, automotive, retail, and finance, is also a major contributor to the growth of the AI training dataset market. In healthcare, AI is being used for predictive analytics, personalized medicine, and diagnostic automation, all of which require extensive datasets for training. The automotive industry leverages AI for autonomous driving and vehicle safety systems, while the retail sector uses AI for personalized shopping experiences and inventory management. In finance, AI assists in fraud detection and risk management. The diverse applications across these sectors underline the critical need for robust AI training datasets.



    As the demand for AI applications continues to grow, the role of AI data resource services becomes increasingly vital. These services provide the infrastructure and tools needed to manage, curate, and distribute datasets efficiently. By using them, organizations can ensure that their AI models are trained on high-quality and relevant data, which is crucial for achieving accurate and reliable outcomes. Such services act as a bridge between raw data and AI applications, streamlining data acquisition, annotation, and validation. This not only enhances the performance of AI systems but also accelerates the development cycle, enabling faster deployment of AI-driven solutions across various sectors.



    Regionally, North America currently dominates the AI training dataset market due to the presence of major technology companies and extensive R&D activities in the region. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by rapid technological advancements, increasing investments in AI, and the growing adoption of AI technologies across various industries in countries like China, India, and Japan. Europe and Latin America are also anticipated to experience significant growth, supported by favorable government policies and the increasing use of AI in various sectors.



    Data Type Analysis



    The data type segment of the AI training dataset market encompasses text, image, audio, video, and others. Each data type plays a crucial role in training different types of AI models, and the demand for specific data types varies based on the application. Text data is extensively used in natural language processing (NLP) applications such as chatbots, sentiment analysis, and language translation. As the use of NLP is becoming more widespread, the demand for high-quality text datasets is continually rising. Companies are investing in curated text datasets that encompass diverse languages and dialects to improve the accuracy and efficiency of NLP models.



    Image data is critical for computer vision application

  17. Data from: Tango Spacecraft Wireframe Dataset Model for Line Segments...

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated May 23, 2023
    + more versions
    Cite
    Michele Bechini; Paolo Lunghi; Michèle Lavagna (2023). Tango Spacecraft Wireframe Dataset Model for Line Segments Detection [Dataset]. http://doi.org/10.5281/zenodo.6383001
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michele Bechini; Paolo Lunghi; Michèle Lavagna
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Reference Paper:

    M. Bechini, M. Lavagna, P. Lunghi, Dataset generation and validation for spacecraft pose estimation via monocular images processing, Acta Astronautica 204 (2023) 358–369

    M. Bechini, P. Lunghi, M. Lavagna. "Spacecraft Pose Estimation via Monocular Image Processing: Dataset Generation and Validation". In 9th European Conference for Aeronautics and Aerospace Sciences (EUCASS)

    General Description:

    The "Tango Spacecraft Wireframe Dataset Model for Line Segments Detection" dataset published here is intended for line detection and segmentation tasks. It is split into 30002 train images and 3002 test images representing the Tango spacecraft from the Prisma mission and is, to our knowledge, the only publicly available dataset of synthetic space-borne images tailored to line detection tasks. The label of each image gives the reprojection of a simplified wireframe model of Tango on the image plane, split into lines. The labels are written following the Wireframe Model format. The "Tango Spacecraft Wireframe Dataset Model for Line Segments Detection" is also the largest dataset with wireframe annotations available to date. More information on the dataset split and on the label format is reported below.

    Images Information:

    The train set comprises 30002 synthetic grayscale images of the Tango spacecraft from the Prisma mission, while the test set is formed by 3002 synthetic grayscale images of the Tango spacecraft from the Prisma mission in PNG format. About 1/6 of the images in both the train and the test set have a non-black background, obtained by rendering an Earth-like model in the raytracing process used to generate the images. The images are noise-free to increase the flexibility of the dataset. The illumination direction of the spacecraft in the scene is uniformly distributed in 3D space, in agreement with the Sun position constraints.


    Labels Information:

    Labels in the Wireframe dataset format are provided in separate JSON files. For each image, the file is formatted as in the following example:

    • width : 98 # width in pixels (int) of the current image
    • height : 176 # height in pixels (int) of the current image
    • lines : [[line1], [line2], ..., [lineN]] # list of lines in each image
    • filename : tango_img_866.png # string with image name and format

    For each line (line1, ... , lineN) in lines, the format is [x0, y0, x1, y1].

    (x0, y0) are the coordinates (float) of the line starting point in the image reference frame (x pointing right and y pointing down with origin located in the top-left corner of the image).
    (x1, y1) are the coordinates (float) of the line ending point in the image reference frame (x pointing right and y pointing down with origin located in the top-left corner of the image).

    Note that the starting point is assumed to be the left-most endpoint (lower x coordinate in image reference frame) of each line. In the case of vertical lines, the starting point is the upper-most endpoint (lower y coordinate in image reference frame) of each line.
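
The label layout and endpoint convention can be illustrated with a short sketch. It assumes one JSON object per image with the four keys listed above; the line coordinates in the example are invented for illustration, and `normalize_line` is a hypothetical helper, not part of the dataset tooling.

```python
import json


def normalize_line(line):
    """Order endpoints so the start is left-most; for vertical lines, upper-most (smaller y)."""
    x0, y0, x1, y1 = line
    # Tuple comparison checks x first and falls back to y when the x values are equal.
    if (x1, y1) < (x0, y0):
        return [x1, y1, x0, y0]
    return [x0, y0, x1, y1]


# Width, height and filename follow the example in the text; the lines are made up.
label_text = (
    '{"width": 98, "height": 176, '
    '"lines": [[50.0, 10.0, 12.5, 40.0], [30.0, 90.0, 30.0, 20.0]], '
    '"filename": "tango_img_866.png"}'
)
label = json.loads(label_text)
lines = [normalize_line(line) for line in label["lines"]]
```

Labels that already follow the convention pass through unchanged, so the helper can double as a consistency check when combining annotations from other sources.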

    VERSION CONTROL

    • v1.0: All the images (both for train and test) have different resolutions, with Tango always centered in the image. The height of the images is in the range 19 - 352 pixels, while the width is in the range 16 - 336 pixels. The height over width ratio spans from 0.34 to 3.25.
    • v2.0: This version contains all the images of v1.0 in the .zip folder named Tango_WF.zip, while the .zip folder named Tango_WF_fullscale.zip holds the dataset (both train and test) of full-scale images. These images have width=height=1024 pixels. The position of Tango with respect to the camera is randomly selected from a uniform distribution, with full visibility ensured in all the images. The labels for the wireframe are in the same format as in v1.0.

    Note: the dataset in v1.0 is obtained by cropping the fullscale images in v2.0 and by properly rescaling the wireframe annotations.

    Note: this dataset contains the same images as the "Tango Spacecraft Dataset for Region of Interest Estimation and Semantic Segmentation" v1.0 (DOI: https://doi.org/10.5281/zenodo.6507863) and the "Tango Spacecraft Dataset for Monocular Pose Estimation" v1.0 (DOI: https://doi.org/10.5281/zenodo.6499007). They can be used together by combining the annotations of the relative pose, of the reprojected wireframe model of Tango, and of the ROI. To our knowledge, these three datasets together form the most comprehensive dataset of space-borne synthetic images ever published.

  18. Datasets for EMMA: A New Method for Computing Multiple Sequence Alignments...

    • databank.illinois.edu
    Updated Aug 8, 2022
    + more versions
    Cite
    Chengze Shen; Baqiao Liu; Kelly P. Williams; Tandy Warnow (2022). Datasets for EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment [Dataset]. http://doi.org/10.13012/B2IDB-2567453_V1
    Explore at:
    Dataset updated
    Aug 8, 2022
    Authors
    Chengze Shen; Baqiao Liu; Kelly P. Williams; Tandy Warnow
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    U.S. Department of Energy (DOE)
    U.S. National Science Foundation (NSF)
    Description

    This upload contains all datasets used in Experiment 2 of the EMMA paper (appeared in WABI 2023): Shen, Chengze, Baqiao Liu, Kelly P. Williams, and Tandy Warnow. "EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment".

    The zip file has the following structure (presented as an example):

    salma_paper_datasets/
    |_README.md
    |_10aa/
    |_crw/
    |_homfam/
      |_aat/
      | |_...
      |_...
    |_het/
      |_5000M2-het/
      | |_...
      |_5000M3-het/
      ...
    |_rec_res/

    Generally, the structure can be viewed as: [category]/[dataset]/[replicate]/[alignment files]

    Categories:

    1. 10aa: There are 10 small biological protein datasets within the 10aa directory, each with just one replicate.
    2. crw: There are 5 selected CRW datasets, namely 5S.3, 5S.E, 5S.T, 16S.3, and 16S.T, each with one replicate. These are the cleaned versions from Shen et al. 2022 (MAGUS+eHMM).
    3. homfam: There are the 10 largest Homfam datasets, each with one replicate.
    4. het: There are three newly simulated nucleotide datasets from this study, 5000M2-het, 5000M3-het, and 5000M4-het, each with 10 replicates.
    5. rec_res: It contains the Rec and Res datasets. Detailed dataset generation can be found in the supplementary materials of the paper.

    Alignment files:

    There are at most 6 .fasta files in each sub-directory:

    1. all.unaln.fasta: All unaligned sequences.
    2. all.aln.fasta: Reference alignments of all sequences. If not all sequences have reference alignments, only the sequences that have one are included.
    3. all-queries.unaln.fasta: All unaligned query sequences. Query sequences are sequences that do not have lengths within 25% of the median length (i.e., not full-length sequences).
    4. all-queries.aln.fasta: Reference alignments of query sequences. If not all queries have reference alignments, only the sequences that have one are included.
    5. backbone.unaln.fasta: All unaligned backbone sequences. Backbone sequences are sequences that have lengths within 25% of the median length (i.e., full-length sequences).
    6. backbone.aln.fasta: Reference alignments of backbone sequences. If not all backbone sequences have reference alignments, only the sequences that have one are included.

    • If all sequences are full-length sequences, then all-queries.unaln.fasta will be missing.
    • If fewer than two query sequences have reference alignments, then all-queries.aln.fasta will be missing.
    • If fewer than two backbone sequences have reference alignments, then backbone.aln.fasta will be missing.

    Additional file(s):

    1. 350378genomes.txt: the file contains all 350,378 bacterial and archaeal genome names that were used by Prodigal (Hyatt et al. 2010) to search for protein sequences.
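
The backbone/query distinction can be sketched as follows. This is an illustrative reading of "within 25% of the median length", taken here to mean |length - median| <= 0.25 × median; that interpretation, the function name, and the toy sequences are assumptions rather than the authors' code.

```python
from statistics import median


def split_backbone_query(seqs, tolerance=0.25):
    """Split {name: sequence} into (backbone, queries) by length relative to the median."""
    if not seqs:
        return {}, {}
    med = median(len(s) for s in seqs.values())
    backbone, queries = {}, {}
    for name, seq in seqs.items():
        # Full-length (backbone) sequences lie within the tolerance band around the median.
        if abs(len(seq) - med) <= tolerance * med:
            backbone[name] = seq
        else:
            queries[name] = seq
    return backbone, queries


# Toy unaligned sequences with lengths 100, 110, 95, 40 and 200 (median 100).
toy = {"a": "A" * 100, "b": "A" * 110, "c": "A" * 95, "d": "A" * 40, "e": "A" * 200}
backbone, queries = split_backbone_query(toy)
```

When every sequence falls inside the band, the query set is empty, which matches the case where all-queries.unaln.fasta is absent.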

  19. CSQ: A Chinese Elementary Science Question Dataset with Rich Discipline...

    • scidb.cn
    Updated Apr 7, 2025
    + more versions
    Cite
    Zhi Liu; Dong Li; Taotao Long; Chongdong Wen; Peng Xian; Jiaxin Guo (2025). CSQ: A Chinese Elementary Science Question Dataset with Rich Discipline Properties in Adaptive Problem-Solving Process Generation [Dataset]. http://doi.org/10.57760/sciencedb.22816
    Explore at:
    Dataset updated
    Apr 7, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Zhi Liu; Dong Li; Taotao Long; Chongdong Wen; Peng Xian; Jiaxin Guo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is currently the world's largest Chinese Science Question (CSQ) dataset, which includes benchmarks and training sets and is designed to evaluate and improve the scientific problem-solving ability of LLMs. CSQ consists of 12,000 high-quality samples with a variety of question types and different subject attributes, covering four subjects and multiple topics in Chinese primary schools. It is deeply coupled with the Science Curriculum Standards for Compulsory Education of China (2022), providing a new way for large language models to empower science education, and also providing a research foundation for science curriculum ITS based on LLMs.

  20. LinkedIn Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Updated Dec 17, 2021
    Cite
    Bright Data (2021). LinkedIn Datasets [Dataset]. https://brightdata.com/products/datasets/linkedin
    Explore at:
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Dec 17, 2021
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions.

    Dataset Features

    • Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month.
    • Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records.
    • Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and the job market dynamics.

    Customizable Subsets for Specific Needs

    Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications.

    Popular Use Cases

    • Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data.
    • Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities.
    • Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies.
    • Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis.
    • AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.

    Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.

Cite
Neilsberg Research (2023). United States Age Group Population Dataset: A complete breakdown of United States age demographics from 0 to 85 years, distributed across 18 age groups [Dataset]. https://www.neilsberg.com/research/datasets/5fd2b2bb-3d85-11ee-9abe-0aa64bf2eeb2/

United States Age Group Population Dataset: A complete breakdown of United States age demographics from 0 to 85 years, distributed across 18 age groups

Available download formats: json, csv
Dataset updated
Sep 16, 2023
Dataset authored and provided by
Neilsberg Research
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered
United States
Variables measured
Population Under 5 Years, Population over 85 years, Population Between 5 and 9 years, Population Between 10 and 14 years, Population Between 15 and 19 years, Population Between 20 and 24 years, Population Between 25 and 29 years, Population Between 30 and 34 years, Population Between 35 and 39 years, Population Between 40 and 44 years, and 9 more
Measurement technique
The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we first analyzed and categorized the data for each of the age groups. For ages between 0 and 85, we divided the data into roughly 5-year buckets. For ages over 85, we aggregated the data into a single group. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the United States population distribution across 18 age groups. It lists the population in each age group along with that group's percentage of the total population of the United States. The dataset can be used to understand the population distribution of the United States by age. For example, using this dataset, we can identify the largest age group in the United States.

Key observations

The largest age group in the United States was 25-29 years, with a population of 22,854,328 (6.93%), according to the 2021 American Community Survey. The smallest age group in the United States was 80-84 years, with a population of 5,932,196 (1.80%). Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

Age groups:

  • Under 5 years
  • 5 to 9 years
  • 10 to 14 years
  • 15 to 19 years
  • 20 to 24 years
  • 25 to 29 years
  • 30 to 34 years
  • 35 to 39 years
  • 40 to 44 years
  • 45 to 49 years
  • 50 to 54 years
  • 55 to 59 years
  • 60 to 64 years
  • 65 to 69 years
  • 70 to 74 years
  • 75 to 79 years
  • 80 to 84 years
  • 85 years and over

Variables / Data Columns

  • Age Group: This column displays the age group under consideration.
  • Population: This column shows the population for the specific age group in the United States.
  • % of Total Population: This column displays the population of each age group as a proportion of the United States total population. Please note that the sum of all percentages may not equal one due to rounding of values.
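
As an illustration of how the columns above relate, the following is a minimal Python sketch, not the publisher's method. It assumes a CSV whose header names match the column labels listed above (an assumption about the downloadable file), and it uses made-up toy counts rather than real ACS figures. The helper names `load_rows`, `percent_shares`, and `largest_group` are hypothetical.

```python
import csv
import io

# Hypothetical CSV text; header names are assumed to match the column
# labels described above. The counts are made-up toy values, NOT ACS data.
SAMPLE_CSV = """Age Group,Population
Under 5 years,1000
5 to 9 years,1000
10 to 14 years,4000
"""


def load_rows(text):
    """Parse CSV text into a list of (age_group, population) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["Age Group"], int(row["Population"])) for row in reader]


def percent_shares(rows, digits=2):
    """Return each group's share of the total population, in percent,
    rounded to `digits` decimal places."""
    total = sum(population for _, population in rows)
    return {group: round(100 * population / total, digits)
            for group, population in rows}


def largest_group(rows):
    """Return the name of the age group with the highest population."""
    return max(rows, key=lambda row: row[1])[0]


rows = load_rows(SAMPLE_CSV)
shares = percent_shares(rows)
print(largest_group(rows))
print(shares)
# The rounded shares need not add up to exactly 100.
print(round(sum(shares.values()), 2))
```

With these toy counts the rounded shares add up to slightly more than 100, which is the rounding effect the note on the % of Total Population column describes.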

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for United States Population by Age. You can refer to the same here
