CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The HellaSwag dataset is a highly valuable resource for assessing a machine's sentence completion abilities based on commonsense natural language inference (NLI). It was initially introduced in a paper published at ACL 2019. This dataset enables researchers and machine learning practitioners to train, validate, and evaluate models designed to understand and predict plausible sentence completions using common sense knowledge. It is useful for understanding the limitations of current NLI systems and for developing algorithms that reason with common sense.
The dataset includes several key columns:
- ind: The index of the data point. (Integer)
- activity_label: The label indicating the activity or event described in the sentence. (String)
- ctx_a: The first context sentence, providing background information. (String)
- ctx_b: The second context sentence, providing further background information. (String)
- endings: A list of possible sentence completions for the given context. (List of Strings)
- split: The dataset split, such as 'train', 'dev', or 'test'. (String)
- split_type: The type of split used for dividing the dataset, such as 'indomain' or 'zeroshot'. (String)
- source_id: An identifier for the source of the data point. (String)
- label: The index of the correct ending for the data point. (Integer)
The dataset is typically provided in CSV format and consists of three primary files: train.csv, validation.csv, and test.csv. The train.csv file facilitates the learning process for machine learning models, validation.csv is used to validate model performance, and test.csv enables thorough evaluation of models in completing sentences with common sense. While exact total row counts for the entire dataset are not specified in the provided information, insights into unique values for fields such as activity_label (9,965 unique values), source_id (8,173 unique values), and split_type (e.g., 'indomain' and 'zeroshot' each accounting for 50%) are available.
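As a sketch of how the columns described above might be explored with pandas, using invented placeholder rows rather than real HellaSwag data (the actual files would be read with pd.read_csv("train.csv") and so on):

```python
import pandas as pd

# Hypothetical miniature sample mirroring the columns described above.
rows = [
    {"ind": 0, "activity_label": "Baking cookies",
     "ctx_a": "A woman preheats the oven.", "ctx_b": "She mixes the dough.",
     "endings": ["She bakes the cookies.", "She mows the lawn.",
                 "She paints a fence.", "She swims a lap."],
     "split": "train", "split_type": "indomain",
     "source_id": "act_0", "label": 0},
    {"ind": 1, "activity_label": "Playing guitar",
     "ctx_a": "A man picks up a guitar.", "ctx_b": "He tunes the strings.",
     "endings": ["He starts to strum a chord.", "He eats the guitar.",
                 "He files his taxes.", "He boils water."],
     "split": "dev", "split_type": "zeroshot",
     "source_id": "act_1", "label": 0},
]
df = pd.DataFrame(rows)

# Each example offers four candidate endings; `label` indexes the correct one.
assert all(len(e) == 4 for e in df["endings"])
n_activities = df["activity_label"].nunique()
print(n_activities)
```

The same nunique() call is how the unique-value counts quoted above (9,965 activity labels, 8,173 source ids) would be reproduced on the full files.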
This dataset is ideal for a variety of applications and use cases:
- Language Modelling: Training language models to better understand common sense knowledge and improve sentence completion tasks.
- Common Sense Reasoning: Developing and studying algorithms that can reason and make inferences based on common sense.
- Machine Performance Evaluation: Assessing the effectiveness of machine learning models in generating appropriate sentence endings given specific contexts and activity labels.
- Natural Language Inference (NLI): Benchmarking and improving NLI systems by evaluating their ability to predict plausible sentence completions.
The dataset has a global region scope. It was listed on 17/06/2025. Specific time ranges for the data collection itself or detailed demographic scopes are not provided. The dataset includes various splits (train, dev, test) and split types (indomain, zeroshot) to ensure diversity for generalisation testing and fairness evaluation during model development.
CC0
The HellaSwag dataset is intended for researchers and machine learning practitioners, who can use it to:
- Train, validate, and evaluate machine learning models for tasks requiring common sense knowledge.
- Develop and refine algorithms for common sense reasoning.
- Benchmark and assess the performance and limitations of current natural language inference systems.
Original Data Source: HellaSwag: Commonsense NLI
On 1 April 2025 responsibility for fire and rescue transferred from the Home Office to the Ministry of Housing, Communities and Local Government.
This information covers fires, false alarms and other incidents attended by fire crews, and the statistics include the numbers of incidents, fires, fatalities and casualties as well as information on response times to fires. The Ministry of Housing, Communities and Local Government (MHCLG) also collect information on the workforce, fire prevention work, health and safety and firefighter pensions. All data tables on fire statistics are below.
MHCLG has responsibility for fire services in England. The vast majority of data tables produced by the Ministry of Housing, Communities and Local Government are for England, but some tables (0101, 0103, 0201, 0501, 1401) cover Great Britain split by nation. In the past, the Department for Communities and Local Government (which previously had responsibility for fire services in England) produced data tables for Great Britain and at times the UK. Similar information for the devolved administrations is available at Scotland: Fire and Rescue Statistics (https://www.firescotland.gov.uk/about/statistics/), Wales: Community safety (https://statswales.gov.wales/Catalogue/Community-Safety-and-Social-Inclusion/Community-Safety) and Northern Ireland: Fire and Rescue Statistics (https://www.nifrs.org/home/about-us/publications/).
If you use assistive technology (for example, a screen reader) and need a version of any of these documents in a more accessible format, please email alternativeformats@communities.gov.uk. Please tell us what format you need. It will help us if you say what assistive technology you use.
Fire statistics guidance
Fire statistics incident level datasets
FIRE0101: Incidents attended by fire and rescue services by nation and population (MS Excel Spreadsheet, 153 KB) - https://assets.publishing.service.gov.uk/media/686d2aa22557debd867cbe14/FIRE0101.xlsx (Previous FIRE0101 tables also available)
FIRE0102: Incidents attended by fire and rescue services in England, by incident type and fire and rescue authority (MS Excel Spreadsheet, 2.19 MB) - https://assets.publishing.service.gov.uk/media/686d2ab52557debd867cbe15/FIRE0102.xlsx (Previous FIRE0102 tables also available)
FIRE0103: Fires attended by fire and rescue services by nation and population (MS Excel Spreadsheet, 201 KB) - https://assets.publishing.service.gov.uk/media/686d2aca10d550c668de3c69/FIRE0103.xlsx (Previous FIRE0103 tables also available)
FIRE0104: Fire false alarms by reason for false alarm, England (MS Excel Spreadsheet, 492 KB) - https://assets.publishing.service.gov.uk/media/686d2ad92557debd867cbe16/FIRE0104.xlsx (Previous FIRE0104 tables also available)
FIRE0201: Dwelling fires attended by fire and rescue services by motive, population and nation (MS Excel Spreadsheet) - https://assets.publishing.service.gov.uk/media/686d2af42cfe301b5fb6789f/FIRE0201.xlsx
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
*Version 0.1: In this initial release of the dataset some duplicate files were not discovered during our checksum tests. We fixed this in version 0.2 by removing 101 duplicates. Many thanks to Jaromir Kunzelmann for pointing this out to us. Please select the latest version of this dataset in the sidebar.
This dataset will be used in the upcoming 2025 BioDCASE data challenge. Therefore, the test set is being held back until the challenge has concluded in 2025. The full version of the dataset will then be published as version 1.0 here.
Background
In 2024, the public animal sound database xeno-canto saw a dramatic increase in insect sound recordings. This is due to the publication of several large collections of field and laboratory recordings from insect sound experts, as well as increased adoption by citizen scientists uploading their insect sound observations to the website. We used this opportunity to expand our previously published datasets (InsectSet32, InsectSet47 and InsectSet66) to compile the first large-scale dataset of insect sounds that is easy to use for training deep learning methods to detect and classify insect sounds in the wild. A short pre-print describing the dataset curation and characteristics in more detail, as well as results from two baseline classifiers trained on the dataset, is accessible here and will be submitted for publication in a journal.
Data curation
Recordings from xeno-canto (Orthoptera), iNaturalist (Orthoptera & Cicadidae) and BioAcoustica (Cicadidae) were downloaded and pooled together. Several selection steps were applied to compile the final set of recordings. From iNaturalist, only research-grade observations were downloaded. For observations with multiple audio files attached, only one file was downloaded. If users had uploaded to both iNaturalist and xeno-canto, only the files from one of the platforms were used. To further avoid duplicate uploads, a checksum test was applied to the entire source dataset. Another common occurrence is serial uploads from one location and time period split into separate observations (especially common on xeno-canto), which could include the same individual animals vocalizing. This problem was addressed by pooling all recordings by username, species, geographic location, date and time, and selecting only one recording per one-hour period.
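A checksum-based duplicate test of the kind described above can be sketched as follows (a minimal illustration, not the authors' actual pipeline; the file names and contents are invented):

```python
import hashlib
import tempfile
from pathlib import Path

def file_checksum(path: Path) -> str:
    """MD5 of a file's bytes; byte-identical uploads collapse to one checksum."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def drop_duplicates(paths):
    """Keep only the first file seen for each checksum."""
    seen, unique = set(), []
    for p in paths:
        digest = file_checksum(p)
        if digest not in seen:
            seen.add(digest)
            unique.append(p)
    return unique

# Demo: three files, two of which are byte-identical duplicates.
tmp = Path(tempfile.mkdtemp())
(a := tmp / "a.wav").write_bytes(b"\x00\x01\x02")
(b := tmp / "b.wav").write_bytes(b"\x00\x01\x02")  # duplicate of a.wav
(c := tmp / "c.wav").write_bytes(b"\x09\x08")
kept = drop_duplicates([a, b, c])
print(len(kept))  # 2
```

Hashing catches only byte-identical files, which is why the separate pooling-by-metadata step is still needed for re-recordings of the same individuals.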
After these filtering steps, all files from species with at least 10 sound examples were selected for the final dataset. All stereo files were converted to mono, and file formats were standardized to wav and mp3. Recordings longer than two minutes were automatically trimmed. Species nomenclature was unified to COL24.4 2024-04-26 [294826] using checklistbank.
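The mono conversion and two-minute trim can be sketched like this (a simplified stand-in operating on a NumPy array; the real pipeline worked on wav and mp3 files):

```python
import numpy as np

MAX_SECONDS = 120  # recordings longer than two minutes were trimmed

def standardize(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Mix stereo (2, n) down to mono and trim to MAX_SECONDS."""
    if audio.ndim == 2:          # stereo: average the two channels
        audio = audio.mean(axis=0)
    max_samples = MAX_SECONDS * sample_rate
    return audio[:max_samples]

# Three minutes of stereo noise at 8 kHz, the dataset's lowest sample rate
sr = 8000
stereo = np.random.default_rng(0).standard_normal((2, 180 * sr))
mono = standardize(stereo, sr)
print(mono.shape)  # (960000,) == 120 s at 8 kHz
```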
This new dataset greatly increases the number of species included, from 66 in InsectSet66 to 459 unique species from the groups Orthoptera and Cicadidae, while also strongly increasing the geographic coverage of recording locations. The total duration and number of sound examples are heavily expanded: 26,399 files containing 9.5 days of audio material, with sample rates ranging from 8 to 500 kHz.
Dataset Usage
All recordings are licensed under Creative Commons 4.0 or CC0 licenses; no-derivatives licenses were excluded to simplify further use of this dataset. For machine-learning purposes, the dataset was split into training, validation and test sets while ensuring a roughly equal distribution of audio files and audio material for every species in all three subsets. This resulted in a 60/20/20 split (train/validation/test) by file number and file length.
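A per-species 60/20/20 split of the kind described above can be sketched as follows (by file count only; the published split also balanced total audio duration per species, which this toy version ignores):

```python
import random
from collections import defaultdict

def split_by_species(files, seed=0):
    """Roughly 60/20/20 train/val/test split applied within each species,
    so every species appears in all three subsets."""
    by_species = defaultdict(list)
    for path, species in files:
        by_species[species].append(path)
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for species, paths in by_species.items():
        rng.shuffle(paths)
        n = len(paths)
        n_train, n_val = int(n * 0.6), int(n * 0.2)
        splits["train"] += paths[:n_train]
        splits["val"] += paths[n_train:n_train + n_val]
        splits["test"] += paths[n_train + n_val:]
    return splits

# 30 invented recordings, 10 per species, as a stand-in for the real files
files = [(f"rec_{i}.wav", f"sp_{i % 3}") for i in range(30)]
splits = split_by_species(files)
print(len(splits["train"]), len(splits["val"]), len(splits["test"]))  # 18 6 6
```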
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset, known as HellaSwag (Commonsense NLI), is designed to evaluate a machine's ability to complete sentences in a logically coherent and sensible manner. It provides over 10,000 examples of sentence completion tasks, each featuring an initial sentence segment followed by four potential endings. The primary challenge for an artificial intelligence system is to identify and select the most appropriate ending that best completes the given sentence. This task is particularly demanding for machines because it necessitates an understanding that extends beyond mere word recognition to encompass deeper meaning and contextual nuances. While humans typically find this task straightforward due to their inherent grasp of language and common sense, it presents a significant hurdle for machines. The HellaSwag dataset represents a vital step towards the development of AI systems capable of communicating similarly to humans, offering a benchmark to assess current machine capabilities in language comprehension and generation, and highlighting areas requiring further advancement.
The dataset typically includes the following columns: ind, activity_label, ctx_a, ctx_b, endings, split, split_type, source_id, and label.
The dataset is primarily available in CSV format. It comprises over 10,000 examples of sentence completion. While specific row or record counts for the entire dataset are not explicitly provided, it is structured with context sentences and multiple-choice endings. The dataset can be readily split into training and test sets, for instance using an 80/20 ratio, for model development. The split_type column helps categorise the data, with 'indomain' and 'zeroshot' types each accounting for 50% of the examples.
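The 80/20 split mentioned above might look like this with pandas (a toy frame stands in for the real CSV, which would be loaded with pd.read_csv):

```python
import pandas as pd

# Toy frame with the same role as the HellaSwag CSV rows
df = pd.DataFrame({"ind": range(100), "label": [i % 4 for i in range(100)]})

# 80% sampled for training; the remaining 20% become the test set
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
print(len(train), len(test))  # 80 20
```

Fixing random_state makes the split reproducible across runs.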
This dataset is ideally suited for various machine learning and natural language processing applications, including language modelling, common sense reasoning, machine performance evaluation, and natural language inference (NLI) benchmarking.
The dataset is listed with a GLOBAL region scope. No specific geographical, temporal, or demographic coverage details regarding the content of the data itself are provided in the available information. The listing date for the dataset is noted as 17/06/2025.
CC0
This dataset is invaluable for researchers and machine learning practitioners who want to train, validate, and evaluate models that require common sense knowledge, develop common sense reasoning algorithms, or benchmark current NLI systems.
Original Data Source: HellaSwag (Commonsense NLI)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CONtiguous United States (CONUS) “Flood Inundation Mapping Hydrofabric - ICESat-2 River Surface Slope” (FIM HF IRIS) dataset integrates river slopes from the global IRIS dataset for 117,357 spatially corresponding main-stream reaches within NOAA’s Office of Water Prediction operational FIM forecasting system, which utilizes the Height Above Nearest Drainage approach (OWP HAND-FIM) to help warn communities of floods. To achieve this, a spatial joining approach was developed to align FIM HF reaches with IRIS reaches, accounting for differences in reach flowline sources. When applied to OWP HAND-FIM, FIM HF IRIS improved flood map accuracy by an average of 31% (CSI) across eight flood events compared to the original FIM HF slopes. Using a common attribute, IRIS data were also transferred from FIM HF IRIS to the CONUS-scale Next Generation Water Resources Modeling Framework Hydrofabric (NextGen HF), creating the NextGen HF IRIS dataset. By referencing another common attribute, SWOT vector data (e.g., water surface elevation, slope, discharge) can be leveraged by OWP HAND-FIM and NextGen through the two resulting datasets. The spatial joining approach, which enables the integration of FIM HF with other hydrologic datasets via flowlines, is provided alongside the two resulting datasets.
The slope_iris_sword attribute in FIM HF IRIS can be used with the Recalculate_Discharge_in_Hydrotable_useFIMHFIRIS.py script to regenerate the hydrotable for OWP HAND-FIM, where the discharge will be recalculated using slope_iris_sword. Consequently, the synthetic rating curves (SRCs) will be updated based on the new discharges (see more details in https://github.com/NOAA-OWP/inundation-mapping/wiki/3.-HAND-Methodology). The script can also be used to regenerate hydrotables using river slopes from other sources, such as NextGen HF, provided they are linked to the FIM HF flowlines.
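The HAND synthetic rating curves derive discharge from Manning's equation, so swapping in an IRIS slope rescales each discharge by the square root of the slope ratio. A minimal sketch with illustrative values only (the actual script operates on the hydrotable files, and the geometry and roughness numbers here are invented):

```python
import math

def manning_discharge(area_m2, hydraulic_radius_m, slope, roughness_n):
    """Manning's equation: Q = (1/n) * A * R^(2/3) * sqrt(S)."""
    return (1.0 / roughness_n) * area_m2 * hydraulic_radius_m ** (2 / 3) * math.sqrt(slope)

# Recomputing discharge after replacing the original slope with slope_iris_sword
q_old = manning_discharge(50.0, 1.2, 0.0010, 0.06)  # original FIM HF slope
q_new = manning_discharge(50.0, 1.2, 0.0016, 0.06)  # hypothetical IRIS slope
print(q_new / q_old)  # sqrt(0.0016 / 0.0010), about 1.26
```

Because only the slope term changes, the ratio of new to old discharge depends solely on the two slopes, which is why improved slope data propagates directly into the rating curves.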
The feature classes for FIMHF_IRIS and NextGenHF_IRIS are provided in GeoPackage (.gpkg) and file geodatabase (.gdb) formats, which can be accessed using ArcGIS, QGIS, or relevant Python packages for inspection, visualization, or spatial analysis of slope_iris_sword.
More information can be found at: Chen, Y., Baruah, A., Devi, D., & Cohen, S. (2025). Improved River Slope Datasets for the United States Hydrofabrics [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15099149
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for MOMIJI
MOMIJI (Modern Open Multimodal Japanese filtered Dataset) is a large-scale, carefully curated public dataset of image-text–interleaved web documents. The dataset was extracted from Common Crawl dumps covering February 2024 – January 2025 and contains roughly 56M Japanese documents, 110B characters, and 249M images. Details of the collection and filtering pipeline will be described in a forthcoming paper. Image-text–interleaved data is generally used to train… See the full description on the dataset page: https://huggingface.co/datasets/turing-motors/MOMIJI.
https://creativecommons.org/publicdomain/zero/1.0/
Given a blurred image, image deblurring aims to produce a clear, high-quality image that accurately represents the original scene. Blurring can be caused by various factors such as camera shake, fast motion, or out-of-focus objects, making it a particularly challenging computer vision problem. This has led to the recent development of a large spectrum of deblurring models and unique datasets.
Despite the rapid advancement in image deblurring, the process of finding and pre-processing a number of datasets for training and testing purposes has been both time-consuming and unnecessarily complicated for experts and non-experts alike. Moreover, there is a serious lack of ready-to-use domain-specific datasets, such as face and text deblurring datasets.
To this end, the following card contains a curated list of ready-to-use image deblurring datasets for training and testing various deblurring models. Additionally, we have created an extensive, highly customizable Python package for single image deblurring called DBlur that can be used to train and test various SOTA models on the given datasets with just 2-3 lines of code.
Following is a list of the datasets that are currently provided:
- GoPro: The GoPro dataset for deblurring consists of 3,214 blurred images with a size of 1,280×720 that are divided into 2,103 training images and 1,111 test images.
- HIDE: HIDE is a motion-blurred dataset that includes 2,025 blurred images for testing. It mainly focuses on pedestrians and street scenes.
- RealBlur: The RealBlur testing dataset consists of two subsets. The first is RealBlur-J, consisting of 1,900 camera JPEG outputs. The second is RealBlur-R, consisting of 1,900 RAW images. The RAW images are generated using white balance, demosaicking, and denoising operations.
- CelebA: A face deblurring dataset created from the CelebA dataset, consisting of 2,000,000 training images, 1,299 validation images, and 1,300 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- Helen: A face deblurring dataset created from the Helen dataset, consisting of 2,000 training images, 155 validation images, and 155 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- Wider-Face: A face deblurring dataset created from the Wider-Face dataset, consisting of 4,080 training images, 567 validation images, and 567 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
- TextOCR: A text deblurring dataset created from the TextOCR dataset, consisting of 5,000 training images, 500 validation images, and 500 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
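Creating blurred images with blur kernels, as in the face and text datasets above, amounts to convolving each sharp image with a point-spread function. A toy version with a box kernel (not the actual kernels from the cited paper):

```python
import numpy as np

def apply_blur_kernel(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'valid' 2-D convolution: slide the kernel over the image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

# A normalized 3x3 box kernel stands in for the learned motion-blur kernels.
kernel = np.ones((3, 3)) / 9.0
image = np.zeros((5, 5))
image[2, 2] = 9.0  # a single bright pixel
blurred = apply_blur_kernel(image, kernel)
print(blurred)  # the point spreads into a 3x3 patch of ones
```

Real pipelines would use an FFT-based or library convolution for speed; the loop form just makes the kernel's averaging effect explicit.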
Cristiano Ronaldo has one of the most popular Instagram accounts as of April 2024.
The Portuguese footballer is the most-followed person on the photo-sharing platform with 628 million followers. Instagram's own account ranked first overall with roughly 672 million followers.
How popular is Instagram?
Instagram is a photo-sharing social networking service that enables users to take pictures and edit them with filters. The platform allows users to post and share their images online and directly with their friends and followers on the social network. The cross-platform app reached one billion monthly active users in mid-2018. In 2020, there were over 114 million Instagram users in the United States and experts project this figure to surpass 127 million users in 2023.
Who uses Instagram?
Instagram audiences are predominantly young – recent data states that almost 60 percent of U.S. Instagram users are aged 34 years or younger. Fall 2020 data reveals that Instagram is also one of the most popular social media for teens and one of the social networks with the biggest reach among teens in the United States.
Celebrity influencers on Instagram
Many celebrities and athletes are brand spokespeople and generate additional income with social media advertising and sponsored content. Unsurprisingly, Ronaldo ranked first again, as the average media value of one of his Instagram posts was 985,441 U.S. dollars.
As of January 2024, Instagram was slightly more popular with men than women, with men accounting for 50.6 percent of the platform’s global users. Additionally, the social media app was most popular amongst younger audiences, with almost 32 percent of users aged between 18 and 24 years.
Instagram’s Global Audience
As of January 2024, Instagram was the fourth most popular social media platform globally, reaching two billion monthly active users (MAU). This number is projected to keep growing with no signs of slowing down, which is not a surprise as the global online social penetration rate across all regions is constantly increasing.
As of January 2024, the country with the largest Instagram audience was India with 362.9 million users, followed by the United States with 169.7 million users.
Who is winning over the generations?
Even though Instagram’s audience is almost twice the size of TikTok’s on a global scale, TikTok has shown itself to be a fierce competitor, particularly amongst younger audiences. TikTok was the most downloaded mobile app globally in 2022, generating 672 million downloads. As of 2022, Generation Z in the United States spent more time on TikTok than on Instagram monthly.
As of April 2024, Facebook had an addressable ad audience reach of 131.1 percent in Libya, followed by the United Arab Emirates with 120.5 percent and Mongolia with 116 percent. Additionally, the Philippines and Qatar had addressable ad audiences of 114.5 percent and 111.7 percent, respectively.