CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The HellaSwag dataset is a highly valuable resource for assessing a machine's sentence completion abilities based on commonsense natural language inference (NLI). It was initially introduced in a paper published at ACL 2019. This dataset enables researchers and machine learning practitioners to train, validate, and evaluate models designed to understand and predict plausible sentence completions using common sense knowledge. It is useful for understanding the limitations of current NLI systems and for developing algorithms that reason with common sense.
The dataset includes several key columns:
- ind: The index of the data point. (Integer)
- activity_label: The label indicating the activity or event described in the sentence. (String)
- ctx_a: The first context sentence, providing background information. (String)
- ctx_b: The second context sentence, providing further background information. (String)
- endings: A list of possible sentence completions for the given context. (List of Strings)
- split: The dataset split, such as 'train', 'dev', or 'test'. (String)
- split_type: The type of split used for dividing the dataset, such as 'indomain' or 'zeroshot'. (String)
- source_id: An identifier for the source of the data point. (String)
- label: The index of the correct ending for the data point. (Integer)
The dataset is typically provided in CSV format and consists of three primary files: train.csv, validation.csv, and test.csv. The train.csv file facilitates the learning process for machine learning models, validation.csv is used to validate model performance, and test.csv enables thorough evaluation of models in completing sentences with common sense. While exact total row counts for the entire dataset are not specified in the provided information, insights into unique values for fields such as activity_label (9,965 unique values), source_id (8,173 unique values), and split_type (e.g., 'indomain' and 'zeroshot' each accounting for 50%) are available.
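As a sketch of how the columns described above might be explored with pandas, using invented placeholder rows rather than real HellaSwag data (the actual files would be read with pd.read_csv("train.csv") and so on):

```python
import pandas as pd

# Hypothetical miniature sample mirroring the columns described above.
rows = [
    {"ind": 0, "activity_label": "Baking cookies",
     "ctx_a": "A woman preheats the oven.", "ctx_b": "She mixes the dough.",
     "endings": ["She bakes the cookies.", "She mows the lawn.",
                 "She paints a fence.", "She swims a lap."],
     "split": "train", "split_type": "indomain",
     "source_id": "act_0", "label": 0},
    {"ind": 1, "activity_label": "Playing guitar",
     "ctx_a": "A man picks up a guitar.", "ctx_b": "He tunes the strings.",
     "endings": ["He starts to strum a chord.", "He eats the guitar.",
                 "He files his taxes.", "He boils water."],
     "split": "dev", "split_type": "zeroshot",
     "source_id": "act_1", "label": 0},
]
df = pd.DataFrame(rows)

# Each example offers four candidate endings; `label` indexes the correct one.
assert all(len(e) == 4 for e in df["endings"])
n_activities = df["activity_label"].nunique()
print(n_activities)
```

The same nunique() call is how the unique-value counts quoted above (9,965 activity labels, 8,173 source ids) would be reproduced on the full files.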
This dataset is ideal for a variety of applications and use cases:
- Language Modelling: Training language models to better understand common sense knowledge and improve sentence completion tasks.
- Common Sense Reasoning: Developing and studying algorithms that can reason and make inferences based on common sense.
- Machine Performance Evaluation: Assessing the effectiveness of machine learning models in generating appropriate sentence endings given specific contexts and activity labels.
- Natural Language Inference (NLI): Benchmarking and improving NLI systems by evaluating their ability to predict plausible sentence completions.
The dataset has a global region scope. It was listed on 17/06/2025. Specific time ranges for the data collection itself or detailed demographic scopes are not provided. The dataset includes various splits (train, dev, test) and split types (indomain, zeroshot) to ensure diversity for generalisation testing and fairness evaluation during model development.
CC0
The HellaSwag dataset is intended for researchers and machine learning practitioners, who can use it to:
- Train, validate, and evaluate machine learning models for tasks requiring common sense knowledge.
- Develop and refine algorithms for common sense reasoning.
- Benchmark and assess the performance and limitations of current natural language inference systems.
Original Data Source: HellaSwag: Commonsense NLI
On 1 April 2025 responsibility for fire and rescue transferred from the Home Office to the Ministry of Housing, Communities and Local Government.
This information covers fires, false alarms and other incidents attended by fire crews, and the statistics include the numbers of incidents, fires, fatalities and casualties as well as information on response times to fires. The Ministry of Housing, Communities and Local Government (MHCLG) also collect information on the workforce, fire prevention work, health and safety and firefighter pensions. All data tables on fire statistics are below.
MHCLG has responsibility for fire services in England. The vast majority of data tables produced by the Ministry of Housing, Communities and Local Government are for England, but some tables (0101, 0103, 0201, 0501, 1401) cover Great Britain split by nation. In the past, the Department for Communities and Local Government (which previously had responsibility for fire services in England) produced data tables for Great Britain and at times the UK. Similar information for the devolved administrations is available at Scotland: Fire and Rescue Statistics (https://www.firescotland.gov.uk/about/statistics/), Wales: Community safety (https://statswales.gov.wales/Catalogue/Community-Safety-and-Social-Inclusion/Community-Safety) and Northern Ireland: Fire and Rescue Statistics (https://www.nifrs.org/home/about-us/publications/).
If you use assistive technology (for example, a screen reader) and need a version of any of these documents in a more accessible format, please email alternativeformats@communities.gov.uk. Please tell us what format you need. It will help us if you say what assistive technology you use.
Fire statistics guidance
Fire statistics incident level datasets
FIRE0101: Incidents attended by fire and rescue services by nation and population (MS Excel Spreadsheet, 153 KB) - https://assets.publishing.service.gov.uk/media/686d2aa22557debd867cbe14/FIRE0101.xlsx (Previous FIRE0101 tables also available)
FIRE0102: Incidents attended by fire and rescue services in England, by incident type and fire and rescue authority (MS Excel Spreadsheet, 2.19 MB) - https://assets.publishing.service.gov.uk/media/686d2ab52557debd867cbe15/FIRE0102.xlsx (Previous FIRE0102 tables also available)
FIRE0103: Fires attended by fire and rescue services by nation and population (MS Excel Spreadsheet, 201 KB) - https://assets.publishing.service.gov.uk/media/686d2aca10d550c668de3c69/FIRE0103.xlsx (Previous FIRE0103 tables also available)
FIRE0104: Fire false alarms by reason for false alarm, England (MS Excel Spreadsheet, 492 KB) - https://assets.publishing.service.gov.uk/media/686d2ad92557debd867cbe16/FIRE0104.xlsx (Previous FIRE0104 tables also available)
FIRE0201: Dwelling fires attended by fire and rescue services by motive, population and nation (MS Excel Spreadsheet) - https://assets.publishing.service.gov.uk/media/686d2af42cfe301b5fb6789f/FIRE0201.xlsx
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
*Version 0.1: In this initial release of the dataset some duplicate files were not discovered during our checksum tests. We fixed this in version 0.2 by removing 101 duplicates. Many thanks to Jaromir Kunzelmann for pointing this out to us. Please select the latest version of this dataset in the sidebar.
This dataset will be used in the upcoming 2025 BioDCASE data challenge. Therefore, the test set is being held back until the challenge has concluded in 2025. The full version of the dataset will then be published as version 1.0 here.
Background
In 2024, the public animal sound database xeno-canto saw a dramatic increase in insect sound recordings. This is due to the publication of several large collections of field and laboratory recordings from insect sound experts, as well as increased adoption by citizen scientists uploading their insect sound observations to the website. We used this opportunity to expand our previously published datasets (InsectSet32, InsectSet47 and InsectSet66) to compile the first large-scale dataset of insect sounds that is easy to use for training deep learning methods to detect and classify insect sounds in the wild. A short pre-print describing the dataset curation and characteristics in more detail, as well as results from two baseline classifiers trained on the dataset, is accessible here and will be submitted for publication in a journal.
Data curation
Recordings from xeno-canto (Orthoptera), iNaturalist (Orthoptera & Cicadidae) and BioAcoustica (Cicadidae) were downloaded and pooled together. Several selection steps were applied to compile the final set of recordings. From iNaturalist, only research-grade observations were downloaded. For observations with multiple audio files attached, only one file was downloaded. If users had uploaded to both iNaturalist and xeno-canto, only the files from one of the platforms were used. To further avoid duplicate uploads, a checksum test was applied to the entire source dataset. Another common occurrence is serial uploads from one location and time period split into separate observations (especially common on xeno-canto), which could include the same individual animals vocalizing. This problem was addressed by pooling all recordings by username, species, geographic location, date and time, and selecting only one recording per one-hour period.
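A checksum-based duplicate test of the kind described above can be sketched as follows (a minimal illustration, not the authors' actual pipeline; the file names and contents are invented):

```python
import hashlib
import tempfile
from pathlib import Path

def file_checksum(path: Path) -> str:
    """MD5 of a file's bytes; byte-identical uploads collapse to one checksum."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def drop_duplicates(paths):
    """Keep only the first file seen for each checksum."""
    seen, unique = set(), []
    for p in paths:
        digest = file_checksum(p)
        if digest not in seen:
            seen.add(digest)
            unique.append(p)
    return unique

# Demo: three files, two of which are byte-identical duplicates.
tmp = Path(tempfile.mkdtemp())
(a := tmp / "a.wav").write_bytes(b"\x00\x01\x02")
(b := tmp / "b.wav").write_bytes(b"\x00\x01\x02")  # duplicate of a.wav
(c := tmp / "c.wav").write_bytes(b"\x09\x08")
kept = drop_duplicates([a, b, c])
print(len(kept))  # 2
```

Hashing catches only byte-identical files, which is why the separate pooling-by-metadata step is still needed for re-recordings of the same individuals.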
After these filtering steps, all files from species with at least 10 sound examples were selected for the final dataset. All stereo files were converted to mono, and file formats were standardized to wav and mp3. Recordings longer than two minutes were automatically trimmed. Species nomenclature was unified to COL24.4 2024-04-26 [294826] using checklistbank.
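The mono conversion and two-minute trim can be sketched like this (a simplified stand-in operating on a NumPy array; the real pipeline worked on wav and mp3 files):

```python
import numpy as np

MAX_SECONDS = 120  # recordings longer than two minutes were trimmed

def standardize(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Mix stereo (2, n) down to mono and trim to MAX_SECONDS."""
    if audio.ndim == 2:          # stereo: average the two channels
        audio = audio.mean(axis=0)
    max_samples = MAX_SECONDS * sample_rate
    return audio[:max_samples]

# Three minutes of stereo noise at 8 kHz, the dataset's lowest sample rate
sr = 8000
stereo = np.random.default_rng(0).standard_normal((2, 180 * sr))
mono = standardize(stereo, sr)
print(mono.shape)  # (960000,) == 120 s at 8 kHz
```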
This new dataset greatly increases the number of species included, from 66 in InsectSet66 to 459 unique species from the groups Orthoptera and Cicadidae, while also strongly increasing the geographic coverage of recording locations. The total duration and number of sound examples are heavily expanded: 26,399 files containing 9.5 days of audio material, with sample rates ranging from 8 to 500 kHz.
Dataset Usage
All recordings are licensed under Creative Commons 4.0 or CC0 licenses; no-derivatives licenses were excluded to simplify further use of this dataset. For machine-learning purposes, the dataset was split into training, validation and test sets while ensuring a roughly equal distribution of audio files and audio material for every species in all three subsets. This resulted in a 60/20/20 split (train/validation/test) by file number and file length.
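A per-species 60/20/20 split of the kind described above can be sketched as follows (by file count only; the published split also balanced total audio duration per species, which this toy version ignores):

```python
import random
from collections import defaultdict

def split_by_species(files, seed=0):
    """Roughly 60/20/20 train/val/test split applied within each species,
    so every species appears in all three subsets."""
    by_species = defaultdict(list)
    for path, species in files:
        by_species[species].append(path)
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for species, paths in by_species.items():
        rng.shuffle(paths)
        n = len(paths)
        n_train, n_val = int(n * 0.6), int(n * 0.2)
        splits["train"] += paths[:n_train]
        splits["val"] += paths[n_train:n_train + n_val]
        splits["test"] += paths[n_train + n_val:]
    return splits

# 30 invented recordings, 10 per species, as a stand-in for the real files
files = [(f"rec_{i}.wav", f"sp_{i % 3}") for i in range(30)]
splits = split_by_species(files)
print(len(splits["train"]), len(splits["val"]), len(splits["test"]))  # 18 6 6
```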
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset, known as HellaSwag (Commonsense NLI), is designed to evaluate a machine's ability to complete sentences in a logically coherent and sensible manner. It provides over 10,000 examples of sentence completion tasks, each featuring an initial sentence segment followed by four potential endings. The primary challenge for an artificial intelligence system is to identify and select the most appropriate ending that best completes the given sentence. This task is particularly demanding for machines because it necessitates an understanding that extends beyond mere word recognition to encompass deeper meaning and contextual nuances. While humans typically find this task straightforward due to their inherent grasp of language and common sense, it presents a significant hurdle for machines. The HellaSwag dataset represents a vital step towards the development of AI systems capable of communicating similarly to humans, offering a benchmark to assess current machine capabilities in language comprehension and generation, and highlighting areas requiring further advancement.
The dataset typically includes the following columns: ind, activity_label, ctx_a, ctx_b, endings, split, split_type, source_id, and label.
The dataset is primarily available in CSV format. It comprises over 10,000 examples of sentence completion. While specific row or record counts for the entire dataset are not explicitly provided, it is structured with context sentences and multiple-choice endings. The dataset can be readily split into training and test sets, for instance using an 80/20 ratio, for model development. The split_type column helps categorise the data, with 'indomain' and 'zeroshot' types each accounting for 50% of the examples.
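The 80/20 split mentioned above might look like this with pandas (a toy frame stands in for the real CSV, which would be loaded with pd.read_csv):

```python
import pandas as pd

# Toy frame with the same role as the HellaSwag CSV rows
df = pd.DataFrame({"ind": range(100), "label": [i % 4 for i in range(100)]})

# 80% sampled for training; the remaining 20% become the test set
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
print(len(train), len(test))  # 80 20
```

Fixing random_state makes the split reproducible across runs.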
This dataset is ideally suited for various machine learning and natural language processing applications, including language modelling, common sense reasoning, machine performance evaluation, and natural language inference (NLI) benchmarking.
The dataset is listed with a GLOBAL region scope. No specific geographical, temporal, or demographic coverage details regarding the content of the data itself are provided in the available information. The listing date for the dataset is noted as 17/06/2025.
CC0
This dataset is invaluable for researchers and machine learning practitioners who want to train, validate, and evaluate models that require common sense knowledge, develop common sense reasoning algorithms, or benchmark current NLI systems.
Original Data Source: HellaSwag (Commonsense NLI)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CONtiguous United States (CONUS) “Flood Inundation Mapping Hydrofabric - ICESat-2 River Surface Slope” (FIM HF IRIS) dataset integrates river slopes from the global IRIS dataset for 117,357 spatially corresponding main-stream reaches within NOAA’s Office of Water Prediction operational FIM forecasting system, which utilizes the Height Above Nearest Drainage approach (OWP HAND-FIM) to help warn communities of floods. To achieve this, a spatial joining approach was developed to align FIM HF reaches with IRIS reaches, accounting for differences in reach flowline sources. When applied to OWP HAND-FIM, FIM HF IRIS improved flood map accuracy by an average of 31% (CSI) across eight flood events compared to the original FIM HF slopes. Using a common attribute, IRIS data were also transferred from FIM HF IRIS to the CONUS-scale Next Generation Water Resources Modeling Framework Hydrofabric (NextGen HF), creating the NextGen HF IRIS dataset. By referencing another common attribute, SWOT vector data (e.g., water surface elevation, slope, discharge) can be leveraged by OWP HAND-FIM and NextGen through the two resulting datasets. The spatial joining approach, which enables the integration of FIM HF with other hydrologic datasets via flowlines, is provided alongside the two resulting datasets.
The slope_iris_sword attribute in FIM HF IRIS can be used with the Recalculate_Discharge_in_Hydrotable_useFIMHFIRIS.py script to regenerate the hydrotable for OWP HAND-FIM, where the discharge will be recalculated using slope_iris_sword. Consequently, the synthetic rating curves (SRCs) will be updated based on the new discharges (see more details in https://github.com/NOAA-OWP/inundation-mapping/wiki/3.-HAND-Methodology). The script can also be used to regenerate hydrotables using river slopes from other sources, such as NextGen HF, provided they are linked to the FIM HF flowlines.
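The HAND synthetic rating curves derive discharge from Manning's equation, so swapping in an IRIS slope rescales each discharge by the square root of the slope ratio. A minimal sketch with illustrative values only (the actual script operates on the hydrotable files, and the geometry and roughness numbers here are invented):

```python
import math

def manning_discharge(area_m2, hydraulic_radius_m, slope, roughness_n):
    """Manning's equation: Q = (1/n) * A * R^(2/3) * sqrt(S)."""
    return (1.0 / roughness_n) * area_m2 * hydraulic_radius_m ** (2 / 3) * math.sqrt(slope)

# Recomputing discharge after replacing the original slope with slope_iris_sword
q_old = manning_discharge(50.0, 1.2, 0.0010, 0.06)  # original FIM HF slope
q_new = manning_discharge(50.0, 1.2, 0.0016, 0.06)  # hypothetical IRIS slope
print(q_new / q_old)  # sqrt(0.0016 / 0.0010), about 1.26
```

Because only the slope term changes, the ratio of new to old discharge depends solely on the two slopes, which is why improved slope data propagates directly into the rating curves.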
The feature classes for FIMHF_IRIS and NextGenHF_IRIS are provided in GeoPackage (.gpkg) and file geodatabase (.gdb) formats, which can be accessed using ArcGIS, QGIS, or relevant Python packages for inspection, visualization, or spatial analysis of slope_iris_sword.
More information can be found at: Chen, Y., Baruah, A., Devi, D., & Cohen, S. (2025). Improved River Slope Datasets for the United States Hydrofabrics [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15099149
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for MOMIJI
MOMIJI (Modern Open Multimodal Japanese filtered Dataset) is a large-scale, carefully curated public dataset of image-text–interleaved web documents. The dataset was extracted from Common Crawl dumps covering February 2024 – January 2025 and contains roughly 56M Japanese documents, 110B characters, and 249M images. Details of the collection and filtering pipeline will be described in a forthcoming paper. Image-text–interleaved data is generally used to train… See the full description on the dataset page: https://huggingface.co/datasets/turing-motors/MOMIJI.
https://creativecommons.org/publicdomain/zero/1.0/
Given a blurred image, image deblurring aims to produce a clear, high-quality image that accurately represents the original scene. Blurring can be caused by various factors such as camera shake, fast motion, or out-of-focus objects, making it a particularly challenging computer vision problem. This has led to the recent development of a large spectrum of deblurring models and unique datasets.
Despite the rapid advancement in image deblurring, the process of finding and pre-processing a number of datasets for training and testing purposes has been both time-consuming and unnecessarily complicated for experts and non-experts alike. Moreover, there is a serious lack of ready-to-use domain-specific datasets, such as face and text deblurring datasets.
To this end, the following card contains a curated list of ready-to-use image deblurring datasets for training and testing various deblurring models. Additionally, we have created an extensive, highly customizable Python package for single image deblurring called DBlur that can be used to train and test various SOTA models on the given datasets with just 2-3 lines of code.
Following is a list of the datasets that are currently provided:
- GoPro: The GoPro dataset for deblurring consists of 3,214 blurred images with a size of 1,280×720 that are divided into 2,103 training images and 1,111 test images.
- HIDE: HIDE is a motion-blurred dataset that includes 2,025 blurred images for testing. It mainly focuses on pedestrians and street scenes.
- RealBlur: The RealBlur testing dataset consists of two subsets. The first is RealBlur-J, consisting of 1,900 camera JPEG outputs. The second is RealBlur-R, consisting of 1,900 RAW images. The RAW images are generated using white balance, demosaicking, and denoising operations.
- CelebA: A face deblurring dataset created from the CelebA dataset, consisting of 2,000,000 training images, 1,299 validation images, and 1,300 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- Helen: A face deblurring dataset created from the Helen dataset, consisting of 2,000 training images, 155 validation images, and 155 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- Wider-Face: A face deblurring dataset created from the Wider-Face dataset, consisting of 4,080 training images, 567 validation images, and 567 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- TextOCR: A text deblurring dataset created from the TextOCR dataset, consisting of 5,000 training images, 500 validation images, and 500 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
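Creating blurred images with blur kernels, as in the face and text datasets above, amounts to convolving each sharp image with a point-spread function. A toy version with a box kernel (not the actual kernels from the cited paper):

```python
import numpy as np

def apply_blur_kernel(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'valid' 2-D convolution: slide the kernel over the image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

# A normalized 3x3 box kernel stands in for the learned motion-blur kernels.
kernel = np.ones((3, 3)) / 9.0
image = np.zeros((5, 5))
image[2, 2] = 9.0  # a single bright pixel
blurred = apply_blur_kernel(image, kernel)
print(blurred)  # the point spreads into a 3x3 patch of ones
```

Real pipelines would use an FFT-based or library convolution for speed; the loop form just makes the kernel's averaging effect explicit.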
Cristiano Ronaldo has one of the most popular Instagram accounts as of April 2024.
The Portuguese footballer is the most-followed person on the photo-sharing platform with 628 million followers. Instagram's own account ranked first overall with roughly 672 million followers.
How popular is Instagram?
Instagram is a photo-sharing social networking service that enables users to take pictures and edit them with filters. The platform allows users to post and share their images online and directly with their friends and followers on the social network. The cross-platform app reached one billion monthly active users in mid-2018. In 2020, there were over 114 million Instagram users in the United States and experts project this figure to surpass 127 million users in 2023.
Who uses Instagram?
Instagram audiences are predominantly young – recent data states that almost 60 percent of U.S. Instagram users are aged 34 years or younger. Fall 2020 data reveals that Instagram is also one of the most popular social media for teens and one of the social networks with the biggest reach among teens in the United States.
Celebrity influencers on Instagram
Many celebrities and athletes are brand spokespeople and generate additional income with social media advertising and sponsored content. Unsurprisingly, Ronaldo ranked first again, as the average media value of one of his Instagram posts was 985,441 U.S. dollars.
As of January 2024, Instagram was slightly more popular with men than women, with men accounting for 50.6 percent of the platform’s global users. Additionally, the social media app was most popular amongst younger audiences, with almost 32 percent of users aged between 18 and 24 years.
Instagram’s Global Audience
As of January 2024, Instagram was the fourth most popular social media platform globally, reaching two billion monthly active users (MAU). This number is projected to keep growing with no signs of slowing down, which is not a surprise as the global online social penetration rate across all regions is constantly increasing.
As of January 2024, the country with the largest Instagram audience was India with 362.9 million users, followed by the United States with 169.7 million users.
Who is winning over the generations?
Even though Instagram’s audience is almost twice the size of TikTok’s on a global scale, TikTok has shown itself to be a fierce competitor, particularly amongst younger audiences. TikTok was the most downloaded mobile app globally in 2022, generating 672 million downloads. As of 2022, Generation Z in the United States spent more time on TikTok than on Instagram monthly.
As of April 2024, Facebook had an addressable ad audience reach of 131.1 percent in Libya, followed by the United Arab Emirates with 120.5 percent and Mongolia with 116 percent. Additionally, the Philippines and Qatar had addressable ad audiences of 114.5 percent and 111.7 percent, respectively.