10 datasets found
  1. Natural Language Inference Evaluation Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Datasimple (2025). Natural Language Inference Evaluation Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/abcd24c8-a1a1-4724-83b2-ea07314b8d13
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    The HellaSwag dataset is a highly valuable resource for assessing a machine's sentence completion abilities based on commonsense natural language inference (NLI). It was initially introduced in a paper published at ACL2019. This dataset enables researchers and machine learning practitioners to train, validate, and evaluate models designed to understand and predict plausible sentence completions using common sense knowledge. It is useful for understanding the limitations of current NLI systems and for developing algorithms that reason with common sense.

    Columns

    The dataset includes several key columns:

    • ind: The index of the data point. (Integer)
    • activity_label: The label indicating the activity or event described in the sentence. (String)
    • ctx_a: The first context sentence, providing background information. (String)
    • ctx_b: The second context sentence, providing further background information. (String)
    • endings: A list of possible sentence completions for the given context. (List of Strings)
    • split: The dataset split, such as 'train', 'dev', or 'test'. (String)
    • split_type: The type of split used for dividing the dataset, such as 'indomain' or 'zeroshot'. (String)
    • source_id: An identifier for the source.
    • label: A label associated with the data point.

    Distribution

    The dataset is typically provided in CSV format and consists of three primary files: train.csv, validation.csv, and test.csv. The train.csv file facilitates the learning process for machine learning models, validation.csv is used to validate model performance, and test.csv enables thorough evaluation of models in completing sentences with common sense. While exact total row counts for the entire dataset are not specified in the provided information, insights into unique values for fields such as activity_label (9965 unique values), source_id (8173 unique values), and split_type (e.g., 'indomain' and 'zeroshot' each accounting for 50%) are available.
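
The per-column summaries mentioned above (unique values and split_type shares) could be reproduced along the following lines. This is only a sketch: the three sample rows are invented stand-ins for train.csv, the endings column is left out for brevity, and the exact CSV header layout is an assumption.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row stand-in for train.csv; the real file is far larger
# and also carries an `endings` column, omitted here for brevity.
SAMPLE_CSV = """ind,activity_label,ctx_a,ctx_b,split,split_type,source_id,label
0,Removing ice from car,A man is outside.,He,train,indomain,src_1,3
1,Baking cookies,A woman is in a kitchen.,She,train,zeroshot,src_2,1
2,Baking cookies,Dough is on a tray.,Someone,train,indomain,src_3,0
"""

def summarize(csv_text):
    """Return row count, unique-value counts, and split_type shares."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    split_counts = Counter(row["split_type"] for row in rows)
    return {
        "rows": len(rows),
        "unique_activity_labels": len({row["activity_label"] for row in rows}),
        "unique_source_ids": len({row["source_id"] for row in rows}),
        "split_type_share": {k: v / len(rows) for k, v in split_counts.items()},
    }

summary = summarize(SAMPLE_CSV)
print(summary)
```

To run this on a downloaded file, the text of train.csv can be passed to summarize in place of SAMPLE_CSV.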

    Usage

    This dataset is ideal for a variety of applications and use cases:

    • Language Modelling: Training language models to better understand common sense knowledge and improve sentence completion tasks.
    • Common Sense Reasoning: Developing and studying algorithms that can reason and make inferences based on common sense.
    • Machine Performance Evaluation: Assessing the effectiveness of machine learning models in generating appropriate sentence endings given specific contexts and activity labels.
    • Natural Language Inference (NLI): Benchmarking and improving NLI systems by evaluating their ability to predict plausible sentence completions.

    Coverage

    The dataset has a global region scope. It was listed on 17/06/2025. Specific time ranges for the data collection itself or detailed demographic scopes are not provided. The dataset includes various splits (train, dev, test) and split types (indomain, zeroshot) to ensure diversity for generalisation testing and fairness evaluation during model development.

    License

    CC0

    Who Can Use It

    The HellaSwag dataset is intended for researchers and machine learning practitioners. They can utilise it to:

    • Train, validate, and evaluate machine learning models for tasks requiring common sense knowledge.
    • Develop and refine algorithms for common sense reasoning.
    • Benchmark and assess the performance and limitations of current natural language inference systems.

    Dataset Name Suggestions

    • HellaSwag: Commonsense NLI
    • Commonsense Sentence Completion Data
    • Natural Language Inference Evaluation Dataset
    • AI Common Sense Benchmark

    Attributes

    Original Data Source: HellaSwag: Commonsense NLI

  2. Fire statistics data tables

    • gov.uk
    • s3.amazonaws.com
    Updated Jul 10, 2025
    Ministry of Housing, Communities and Local Government (2025). Fire statistics data tables [Dataset]. https://www.gov.uk/government/statistical-data-sets/fire-statistics-data-tables
    Dataset updated
    Jul 10, 2025
    Dataset provided by
    GOV.UK
    Authors
    Ministry of Housing, Communities and Local Government
    Description

    On 1 April 2025 responsibility for fire and rescue transferred from the Home Office to the Ministry of Housing, Communities and Local Government.

    This information covers fires, false alarms and other incidents attended by fire crews, and the statistics include the numbers of incidents, fires, fatalities and casualties as well as information on response times to fires. The Ministry of Housing, Communities and Local Government (MHCLG) also collect information on the workforce, fire prevention work, health and safety and firefighter pensions. All data tables on fire statistics are below.

    MHCLG has responsibility for fire services in England. The vast majority of data tables produced by the Ministry of Housing, Communities and Local Government are for England, but some tables (0101, 0103, 0201, 0501, 1401) are for Great Britain split by nation. In the past the Department for Communities and Local Government (which previously had responsibility for fire services in England) produced data tables for Great Britain and at times the UK. Similar information for the devolved administrations is available at Scotland: Fire and Rescue Statistics (https://www.firescotland.gov.uk/about/statistics/), Wales: Community safety (https://statswales.gov.wales/Catalogue/Community-Safety-and-Social-Inclusion/Community-Safety) and Northern Ireland: Fire and Rescue Statistics (https://www.nifrs.org/home/about-us/publications/).

    If you use assistive technology (for example, a screen reader) and need a version of any of these documents in a more accessible format, please email alternativeformats@communities.gov.uk. Please tell us what format you need. It will help us if you say what assistive technology you use.

    Related content

    Fire statistics guidance
    Fire statistics incident level datasets

    Incidents attended

    FIRE0101: Incidents attended by fire and rescue services by nation and population (MS Excel Spreadsheet, 153 KB), https://assets.publishing.service.gov.uk/media/686d2aa22557debd867cbe14/FIRE0101.xlsx (see also: Previous FIRE0101 tables)

    FIRE0102: Incidents attended by fire and rescue services in England, by incident type and fire and rescue authority (MS Excel Spreadsheet, 2.19 MB), https://assets.publishing.service.gov.uk/media/686d2ab52557debd867cbe15/FIRE0102.xlsx (see also: Previous FIRE0102 tables)

    FIRE0103: Fires attended by fire and rescue services by nation and population (MS Excel Spreadsheet, 201 KB), https://assets.publishing.service.gov.uk/media/686d2aca10d550c668de3c69/FIRE0103.xlsx (see also: Previous FIRE0103 tables)

    FIRE0104: Fire false alarms by reason for false alarm, England (MS Excel Spreadsheet, 492 KB), https://assets.publishing.service.gov.uk/media/686d2ad92557debd867cbe16/FIRE0104.xlsx (see also: Previous FIRE0104 tables)

    Dwelling fires attended

    FIRE0201: Dwelling fires attended by fire and rescue services by motive, population and nation (MS Excel Spreadsheet), https://assets.publishing.service.gov.uk/media/686d2af42cfe301b5fb6789f/FIRE0201.xlsx

  3. InsectSet459: A large dataset for automatic acoustic identification of...

    • zenodo.org
    csv, txt, zip
    Updated Apr 25, 2025
    Marius Faiß; Dan Stowell (2025). InsectSet459: A large dataset for automatic acoustic identification of insects (Orthoptera and Cicadidae) [Dataset]. http://doi.org/10.5281/zenodo.14056458
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marius Faiß; Dan Stowell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    *Version 0.1: In this initial release of the dataset some duplicate files were not discovered during our checksum tests. We fixed this in version 0.2 by removing 101 duplicates. Many thanks to Jaromir Kunzelmann for pointing this out to us. Please select the latest version of this dataset in the sidebar.

    This dataset will be used in the upcoming 2025 BioDCASE data challenge. Therefore, the test set is being held back until the challenge has concluded in 2025. The full version of the dataset will then be published as version 1.0 here.

    Background

    In 2024, the public animal sound database xeno-canto saw a dramatic increase in insect sound recordings. This is due to the publication of several large collections of field and laboratory recordings from insect sound experts, as well as a growing number of citizen scientists uploading their insect sound observations to the website. We used this opportunity to expand our previously published datasets (InsectSet32, InsectSet47 & InsectSet66) and compile the first large-scale dataset of insect sounds that is easy to use for training deep learning methods to detect and classify insect sounds in the wild. A short pre-print describing the dataset curation and characteristics in more detail, as well as results from two baseline classifiers trained on the datasets, is available and will be submitted for publication in a journal.

    Data curation

    Recordings from xeno-canto (Orthoptera), iNaturalist (Orthoptera & Cicadidae) and BioAcoustica (Cicadidae) were downloaded and pooled together. Several selection steps were applied to compile the final selection of recordings. From iNaturalist, only research-grade observations were downloaded. For observations with multiple audio files attached, only one file was downloaded. If users uploaded to both iNaturalist and xeno-canto, only the files from one of the platforms were used. To further avoid duplicate uploads, a checksum test was applied to the entire source dataset. Another common occurrence is serial uploads from one location and time period split into separate observations (especially common on xeno-canto), which could include the same individual animals vocalizing. This problem was addressed by pooling all recordings by username, species, geographic location, date and time, and selecting only one recording from each one-hour period.
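
The checksum and one-hour pooling steps described above might look roughly like the sketch below. It is not the authors' pipeline: the record fields (user, species, location, time, audio), the use of SHA-256, and the choice of calendar-hour buckets to approximate "one recording from a one-hour period" are all assumptions made for illustration.

```python
import hashlib
from datetime import datetime

def checksum(data: bytes) -> str:
    """Hash the raw audio bytes so byte-identical uploads can be detected."""
    return hashlib.sha256(data).hexdigest()

def filter_recordings(recordings):
    """Drop byte-identical files, then keep one recording per
    (user, species, location, calendar hour) group.

    `recordings` is a list of dicts with the hypothetical keys
    user, species, location, time (datetime) and audio (bytes).
    """
    seen_hashes = set()
    seen_slots = set()
    kept = []
    for rec in sorted(recordings, key=lambda r: r["time"]):
        digest = checksum(rec["audio"])
        if digest in seen_hashes:
            continue  # exact duplicate of a file already seen
        seen_hashes.add(digest)
        # Assumption: calendar-hour buckets stand in for the one-hour rule.
        hour = rec["time"].replace(minute=0, second=0, microsecond=0)
        slot = (rec["user"], rec["species"], rec["location"], hour)
        if slot in seen_slots:
            continue  # serial upload from the same place and hour
        seen_slots.add(slot)
        kept.append(rec)
    return kept

sample = [
    {"user": "u1", "species": "A", "location": "X", "time": datetime(2024, 6, 1, 10, 5), "audio": b"a"},
    {"user": "u2", "species": "A", "location": "Y", "time": datetime(2024, 6, 1, 12, 0), "audio": b"a"},
    {"user": "u1", "species": "A", "location": "X", "time": datetime(2024, 6, 1, 10, 40), "audio": b"b"},
    {"user": "u1", "species": "A", "location": "X", "time": datetime(2024, 6, 1, 11, 10), "audio": b"c"},
]
kept = filter_recordings(sample)
print([r["time"].strftime("%H:%M") for r in kept])
```

In this toy input the second record is dropped as a byte-identical duplicate and the third as a serial upload within the same hour.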

    After these filtering steps, all files from species with at least 10 sound examples were selected for the final dataset. All stereo files were converted to mono, file formats were standardized to wav and mp3. Recordings of a length longer than two minutes were automatically trimmed. Species nomenclature was unified to COL24.4 2024-04-26 [294826] using checklistbank.

    This new dataset greatly increases the number of species included, from 66 in InsectSet66 to 459 unique species from the groups Orthoptera and Cicadidae, while also strongly increasing the geographic coverage of recording locations. The total duration and number of sound examples are also greatly expanded, to 26399 files containing 9.5 days of audio material, with sample rates ranging from 8 to 500 kHz.

    Dataset Usage

    All recordings are licensed under Creative Commons licenses 4.0 or 0. We excluded no-derivatives licenses to simplify further usage of this dataset. For machine-learning purposes, the dataset was split into training, validation and test sets while ensuring a roughly equal distribution of audio files and audio material for every species in all three subsets. This resulted in a 60/20/20 split (train/validation/test) by file number and file length.
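
A per-species 60/20/20 split of this kind can be sketched as follows. The sketch balances file counts only, whereas the dataset description also balances audio length; the function name, the (filename, species) input shape, and the fixed seed are illustrative assumptions rather than the authors' procedure.

```python
import random
from collections import defaultdict

def stratified_split(files, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split (filename, species) pairs so each species is divided
    roughly 60/20/20 across train/validation/test by file count."""
    by_species = defaultdict(list)
    for name, species in files:
        by_species[species].append(name)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "validation": [], "test": []}
    for species in sorted(by_species):
        names = sorted(by_species[species])
        rng.shuffle(names)
        n_train = round(len(names) * ratios[0])
        n_val = round(len(names) * ratios[1])
        splits["train"] += names[:n_train]
        splits["validation"] += names[n_train:n_train + n_val]
        splits["test"] += names[n_train + n_val:]  # remainder goes to test
    return splits

# Two hypothetical species with 10 files each (the dataset's minimum per species).
files = [(f"{sp}_{i}.wav", sp) for sp in ("sp_a", "sp_b") for i in range(10)]
splits = stratified_split(files)
print({k: len(v) for k, v in splits.items()})
```

Splitting within each species, rather than over the pooled file list, is what keeps rare species represented in all three subsets.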


  4. Contextual Language Comprehension Dataset

    • opendatabay.com
    Updated Jul 5, 2025
    Datasimple (2025). Contextual Language Comprehension Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/87666f40-537f-4c7b-97eb-cd9e55d284b0
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset, known as HellaSwag (Commonsense NLI), is designed to evaluate a machine's ability to complete sentences in a logically coherent and sensible manner. It provides over 10,000 examples of sentence completion tasks, each featuring an initial sentence segment followed by four potential endings. The primary challenge for an artificial intelligence system is to identify and select the most appropriate ending that best completes the given sentence. This task is particularly demanding for machines because it necessitates an understanding that extends beyond mere word recognition to encompass deeper meaning and contextual nuances. While humans typically find this task straightforward due to their inherent grasp of language and common sense, it presents a significant hurdle for machines. The HellaSwag dataset represents a vital step towards the development of AI systems capable of communicating similarly to humans, offering a benchmark to assess current machine capabilities in language comprehension and generation, and highlighting areas requiring further advancement.

    Columns

    The dataset typically includes the following columns:

    • ind: An integer representing the index of the sentence.
    • activity_label: A string indicating the label of the activity.
    • ctx_a: A string containing the first context sentence.
    • ctx_b: A string containing the second context sentence.
    • endings: A string that holds the potential endings for the sentence.
    • split: A string denoting the division of the dataset (e.g., training or test set).
    • split_type: A string specifying the type of split, such as 'indomain' or 'zeroshot'.
    • label: The label indicating which of the possible endings is the correct one for the sentence completion.
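
Because endings is described as a string here, a row usually has to be parsed before the labelled completion can be read out. The sketch below assumes the string is a Python-style list literal and that label is a zero-based index; both are assumptions about this CSV export, and the sample row is invented.

```python
import ast

def correct_completion(row):
    """Join the two context fields with the ending selected by `label`."""
    # Assumption: `endings` is serialized as a Python-style list literal.
    endings = ast.literal_eval(row["endings"])
    context = f'{row["ctx_a"]} {row["ctx_b"]}'.strip()
    # Assumption: `label` is a zero-based index into `endings`.
    return f'{context} {endings[int(row["label"])]}'

# Invented example row, shaped like the columns listed above.
row = {
    "ctx_a": "A man is standing on a ladder.",
    "ctx_b": "He",
    "endings": "['paints the wall.', 'eats the ladder.', 'flies away.', 'sings to a fish.']",
    "label": "0",
}
completion = correct_completion(row)
print(completion)
```

ast.literal_eval is used instead of eval so that only literal values are accepted from the file.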

    Distribution

    The dataset is primarily available in a data file format, commonly CSV. It comprises over 10,000 examples of sentence completion, each structured as context sentences with multiple-choice endings; exact row counts are not provided. The dataset can be readily split into training and test sets, for instance using an 80/20 ratio for model development. The 'split_type' column helps categorise the data, with the 'indomain' and 'zeroshot' types each accounting for 50% of the examples.

    Usage

    This dataset is ideally suited for various machine learning and natural language processing applications, including:

    • Training models to generate novel sentence endings that mimic human-like creativity and coherence.
    • Developing models that enhance their understanding of sentence context, enabling them to select the most appropriate ending based on the given context.
    • Building models capable of evaluating two sentences with different endings and determining which one is more probable, drawing upon common-sense knowledge.
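
Selecting the most probable ending and measuring accuracy can be sketched with a pluggable scoring function. The word-overlap scorer below is a deliberately naive stand-in for a real model likelihood, and the two examples are invented; the second one shows how surface overlap can pick an implausible ending.

```python
def predict(context, endings, score):
    """Return the index of the ending that `score` rates highest."""
    return max(range(len(endings)), key=lambda i: score(context, endings[i]))

def accuracy(examples, score):
    """Fraction of examples where the top-scored ending matches the label."""
    hits = sum(
        predict(ex["context"], ex["endings"], score) == ex["label"]
        for ex in examples
    )
    return hits / len(examples)

def overlap_score(context, ending):
    """Toy scorer: number of lowercase words shared with the context."""
    return len(set(context.lower().split()) & set(ending.lower().split()))

# Invented examples; `label` is the index of the intended ending.
examples = [
    {"context": "She pours water into the kettle and",
     "endings": ["turns the kettle on.", "throws a ball.", "rides a horse.", "opens an umbrella."],
     "label": 0},
    {"context": "He picks up the guitar and",
     "endings": ["eats the guitar and the case.", "starts to play a song.", "swims away.", "bakes a cake."],
     "label": 1},
]
acc = accuracy(examples, overlap_score)
print(acc)
```

A language model would be plugged in by replacing overlap_score with a function that returns, for instance, the log-likelihood of the ending given the context.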

    Coverage

    The dataset is listed with a GLOBAL region scope. No specific geographical, temporal, or demographic coverage details regarding the content of the data itself are provided in the available information. The listing date for the dataset is noted as 17/06/2025.

    License

    CC0

    Who Can Use It

    This dataset is invaluable for:

    • Data scientists and machine learning engineers working on natural language understanding and generation tasks.
    • AI researchers focused on advancing the ability of artificial intelligence systems to interact and communicate in a more human-like way.
    • Anyone involved in building models for sentence completion, contextual reasoning, and common-sense knowledge integration in AI.

    Dataset Name Suggestions

    • HellaSwag (Commonsense NLI)
    • AI Sentence Completion Challenge
    • Contextual Language Comprehension Dataset
    • Commonsense Language Understanding Benchmark

    Attributes

    Original Data Source: HellaSwag (Commonsense NLI)

  5. Improved River Slope Datasets for the United States Hydrofabrics

    • hydroshare.org
    • search.dataone.org
    zip
    Updated Apr 18, 2025
    Yixian Chen; Anupal Baruah; Dipsikha Devi; Sagy Cohen (2025). Improved River Slope Datasets for the United States Hydrofabrics [Dataset]. http://doi.org/10.4211/hs.1532f4cb360244f9a6ba772ebd428180
    Available download formats: zip (129.2 MB)
    Dataset updated
    Apr 18, 2025
    Dataset provided by
    HydroShare
    Authors
    Yixian Chen; Anupal Baruah; Dipsikha Devi; Sagy Cohen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The CONtiguous United States (CONUS) “Flood Inundation Mapping Hydrofabric - ICESat-2 River Surface Slope” (FIM HF IRIS) dataset integrates river slopes from the global IRIS dataset for 117,357 spatially corresponding main-stream reaches within NOAA’s Office of Water Prediction operational FIM forecasting system, which utilizes the Height Above Nearest Drainage approach (OWP HAND-FIM) to help warn communities of floods. To achieve this, a spatial joining approach was developed to align FIM HF reaches with IRIS reaches, accounting for differences in reach flowline sources. When applied to OWP HAND-FIM, FIM HF IRIS improved flood map accuracy by an average of 31% (CSI) across eight flood events compared to the original FIM HF slopes. Using a common attribute, IRIS data were also transferred from FIM HF IRIS to the CONUS-scale Next Generation Water Resources Modeling Framework Hydrofabric (NextGen HF), creating the NextGen HF IRIS dataset. By referencing another common attribute, SWOT vector data (e.g., water surface elevation, slope, discharge) can be leveraged by OWP HAND-FIM and NextGen through the two resulting datasets. The spatial joining approach, which enables the integration of FIM HF with other hydrologic datasets via flowlines, is provided alongside the two resulting datasets.

    The slope_iris_sword in FIM HF IRIS can be used with the Recalculate_Discharge_in_Hydrotable_useFIMHFIRIS.py script to regenerate the hydrotable for OWP HAND-FIM, where the discharge will be recalculated using slope_iris_sword. Consequently, the synthetic rating curves (SRCs) will be updated based on the new discharges (see more details in https://github.com/NOAA-OWP/inundation-mapping/wiki/3.-HAND-Methodology). The script can also be used to regenerate hydrotables using river slopes from other sources, such as NextGen HF, provided they are linked to the FIM HF flowlines.
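
To illustrate why a new slope changes the discharge, the sketch below assumes the recalculation follows Manning's equation, Q = (1/n) · A · R^(2/3) · S^(1/2), which is the usual basis for synthetic rating curves. It is not the Recalculate_Discharge_in_Hydrotable_useFIMHFIRIS.py script itself, and the function name, parameter names, and sample values are hypothetical.

```python
def manning_discharge(area_m2, hydraulic_radius_m, slope, roughness_n):
    """Discharge in m^3/s from Manning's equation (SI units).

    Q = (1 / n) * A * R^(2/3) * S^(1/2)
    """
    if slope <= 0:
        return 0.0  # assumption: non-positive slopes produce no flow here
    return area_m2 * hydraulic_radius_m ** (2 / 3) * slope ** 0.5 / roughness_n

# Hypothetical reach geometry at one stage; only the slope differs.
q_original = manning_discharge(10.0, 1.0, 0.0001, 0.05)
q_updated = manning_discharge(10.0, 1.0, 0.0004, 0.05)
print(q_original, q_updated)
```

Since discharge scales with the square root of slope, quadrupling the slope doubles the discharge at a given stage, which is why replacing the slope shifts the whole rating curve.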

    The feature classes for FIMHF_IRIS and NextGenHF_IRIS are provided in GeoPackage (.gpkg) and geodatabase (.gdb) formats, which can be opened in ArcGIS, QGIS, or relevant Python packages for inspection, visualization, or spatial analysis of slope_iris_sword.

    More information can be found at: Chen, Y., Baruah, A., Devi, D., & Cohen, S. (2025). Improved River Slope Datasets for the United States Hydrofabrics [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15099149

  6. MOMIJI

    • huggingface.co
    Updated May 12, 2025
    Turing Inc. (2025). MOMIJI [Dataset]. https://huggingface.co/datasets/turing-motors/MOMIJI
    Dataset updated
    May 12, 2025
    Dataset authored and provided by
    Turing Inc.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for MOMIJI

    MOMIJI (Modern Open Multimodal Japanese filtered Dataset) is a large-scale, carefully curated public dataset of image-text–interleaved web documents. The dataset was extracted from Common Crawl dumps covering February 2024 – January 2025 and contains roughly 56M Japanese documents, 110B characters, and 249M images. Details of the collection and filtering pipeline will be described in a forthcoming paper. Image-text–interleaved data is generally used to train… See the full description on the dataset page: https://huggingface.co/datasets/turing-motors/MOMIJI.

  7. A Curated List of Image Deblurring Datasets

    • kaggle.com
    Updated Mar 28, 2023
    Jishnu Parayil Shibu (2023). A Curated List of Image Deblurring Datasets [Dataset]. https://www.kaggle.com/datasets/jishnuparayilshibu/a-curated-list-of-image-deblurring-datasets
    Dataset updated
    Mar 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jishnu Parayil Shibu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Given a blurred image, image deblurring aims to produce a clear, high-quality image that accurately represents the original scene. Blurring can be caused by various factors such as camera shake, fast motion, out-of-focus objects, etc. making it a particularly challenging computer vision problem. This has led to the recent development of a large spectrum of deblurring models and unique datasets.

    Despite the rapid advancement in image deblurring, finding and pre-processing datasets for training and testing has been both time-consuming and unnecessarily complicated for experts and non-experts alike. Moreover, there is a serious lack of ready-to-use domain-specific datasets such as face and text deblurring datasets.

    To this end, the following card contains a curated list of ready-to-use image deblurring datasets for training and testing various deblurring models. Additionally, we have created an extensive, highly customizable python package for single image deblurring called DBlur that can be used to train and test various SOTA models on the given datasets just with 2-3 lines of code.

    The following datasets are currently provided:

    - GoPro: The GoPro dataset for deblurring consists of 3,214 blurred images with a size of 1,280×720 that are divided into 2,103 training images and 1,111 test images.
    - HIDE: HIDE is a motion-blurred dataset that includes 2025 blurred images for testing. It mainly focuses on pedestrians and street scenes.
    - RealBlur: The RealBlur testing dataset consists of two subsets. The first is RealBlur-J, consisting of 1900 camera JPEG outputs. The second is RealBlur-R, consisting of 1900 RAW images. The RAW images are generated by using white balance, demosaicking, and denoising operations.
    - CelebA: A face deblurring dataset created using the CelebA dataset which consists of 2 000 000 training images, 1299 validation images, and 1300 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - Helen: A face deblurring dataset created using the Helen dataset which consists of 2 000 training images, 155 validation images, and 155 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - Wider-Face: A face deblurring dataset created using the Wider-Face dataset which consists of 4080 training images, 567 validation images, and 567 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - TextOCR: A text deblurring dataset created using the TextOCR dataset which consists of 5000 training images, 500 validation images, and 500 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.

  8. Instagram accounts with the most followers worldwide 2024

    • statista.com
    • davegsmith.com
    Updated Jun 17, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Stacy Jo Dixon (2025). Instagram accounts with the most followers worldwide 2024 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Stacy Jo Dixon
    Description

    Cristiano Ronaldo has one of the most popular Instagram accounts as of April 2024.

    The Portuguese footballer is the most-followed person on the photo-sharing platform with 628 million followers. Instagram's own account was ranked first with roughly 672 million followers.

    How popular is Instagram?

    Instagram is a photo-sharing social networking service that enables users to take pictures and edit them with filters. The platform allows users to post and share their images online and directly with their friends and followers on the social network. The cross-platform app reached one billion monthly active users in mid-2018. In 2020, there were over 114 million Instagram users in the United States and experts project this figure to surpass 127 million users in 2023.

    Who uses Instagram?

    Instagram audiences are predominantly young: recent data states that almost 60 percent of U.S. Instagram users are aged 34 years or younger. Fall 2020 data reveals that Instagram is also one of the most popular social media platforms for teens and one of the social networks with the biggest reach among teens in the United States.

    Celebrity influencers on Instagram

    Many celebrities and athletes are brand spokespeople and generate additional income with social media advertising and sponsored content. Unsurprisingly, Ronaldo ranked first again, as the average media value of one of his Instagram posts was 985,441 U.S. dollars.
    
  9. Instagram: distribution of global audiences 2024, by gender

    • statista.com
    • davegsmith.com
    Updated Jun 17, 2025
    Stacy Jo Dixon (2025). Instagram: distribution of global audiences 2024, by gender [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Stacy Jo Dixon
    Description

    As of January 2024, Instagram was slightly more popular with men than women, with men accounting for 50.6 percent of the platform’s global users. Additionally, the social media app was most popular amongst younger audiences, with almost 32 percent of users aged between 18 and 24 years.

    Instagram’s Global Audience

    As of January 2024, Instagram was the fourth most popular social media platform globally, reaching two billion monthly active users (MAU). This number is projected to keep growing with no signs of slowing down, which is not a surprise as the global online social penetration rate across all regions is constantly increasing.

    As of January 2024, the country with the largest Instagram audience was India with 362.9 million users, followed by the United States with 169.7 million users.

    Who is winning over the generations?

    Even though Instagram’s audience is almost twice the size of TikTok’s on a global scale, TikTok has shown itself to be a fierce competitor, particularly amongst younger audiences. TikTok was the most downloaded mobile app globally in 2022, generating 672 million downloads. As of 2022, Generation Z in the United States spent more time on TikTok than on Instagram monthly.
    
  10. Facebook: countries with the highest Facebook reach 2024

    • statista.com
    • davegsmith.com
    Updated Jun 17, 2025
    Stacy Jo Dixon (2025). Facebook: countries with the highest Facebook reach 2024 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Stacy Jo Dixon
    Description

    As of April 2024, Facebook had an addressable ad audience reach of 131.1 percent in Libya, followed by the United Arab Emirates with 120.5 percent and Mongolia with 116 percent. Additionally, the Philippines and Qatar had addressable ad audiences of 114.5 percent and 111.7 percent.

2 scholarly articles cite the Natural Language Inference Evaluation Dataset (per Google Scholar).