19 datasets found

vqa
huggingface.co
Updated Oct 9, 2024
+ more versions
Cite
World Cuisines (2024). vqa [Dataset]. https://huggingface.co/datasets/worldcuisines/vqa
Dataset updated
Oct 9, 2024
Dataset authored and provided by
World Cuisines
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

WorldCuisines is a massive-scale visual question answering (VQA) benchmark for multilingual and multicultural understanding through global cuisines. The dataset contains text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark as of 17 October 2024.… See the full description on the dataset page: https://huggingface.co/datasets/worldcuisines/vqa.
Data from: Knowledge from non-English-language studies broadens...
search.dataone.org
data.niaid.nih.gov
+1 more
Updated May 20, 2025
Cite
Filipe Serrano; Valentina Marconi; Stefanie Deinet; Hannah Puleston; Helga Correa; Juan C. Díaz-Ricaurte; Carolina Farhat; Ricardo Luria-Manzano; Marcio Martins; Eletra Souza; Sergio Souza; Joao Vieira-Alencar; Paula Valdujo; Robin Freeman; Louise McRae (2025). Knowledge from non-English-language studies broadens contributions to conservation policy and helps to tackle bias in biodiversity data [Dataset]. http://doi.org/10.5061/dryad.ngf1vhj68
Unique identifier
https://doi.org/10.5061/dryad.ngf1vhj68
Dataset updated
May 20, 2025
Dataset provided by
Dryad Digital Repository
Authors
Filipe Serrano; Valentina Marconi; Stefanie Deinet; Hannah Puleston; Helga Correa; Juan C. Díaz-Ricaurte; Carolina Farhat; Ricardo Luria-Manzano; Marcio Martins; Eletra Souza; Sergio Souza; Joao Vieira-Alencar; Paula Valdujo; Robin Freeman; Louise McRae
Description
Local ecological evidence is key to informing conservation. However, many global biodiversity indicators often neglect local ecological evidence published in languages other than English, potentially biassing our understanding of biodiversity trends in areas where English is not the dominant language. Brazil is a megadiverse country with a thriving national scientific publishing landscape. Here, using Brazil and a species abundance indicator as examples, we assess how well bilingual literature searches can both improve data coverage for a country where English is not the primary language and help tackle biases in biodiversity datasets. We conducted a comprehensive screening of articles containing abundance data for vertebrates published in 59 Brazilian journals (articles in Portuguese or English) and 79 international English-only journals. These were grouped into three datasets according to journal origin and article language (Brazilian-Portuguese, Brazilian-English and International). ...

Data collection

We collected time-series of vertebrate population abundance suitable for entry into the LPD (livingplanetindex.org), which provides the repository for one of the indicators in the GBF, the Living Planet Index (LPI, Ledger et al., 2023). Despite the continuous addition of new data, LPI coverage remains incomplete for some regions (Living Planet Report 2024 – A System in Peril, 2024). We collected data from three sets of sources: a) Portuguese-language articles from Brazilian journals (hereafter “Brazilian-Portuguese” dataset), b) English-language articles from Brazilian journals (“Brazilian-English” dataset) and c) English-language articles from non-Brazilian journals (“International” dataset). For a) and b), we first compiled a list of Brazilian biodiversity-related journals using the list of non-English-language journals in ecology and conservation published by the translatE project (www.translatesciences.com) as a starting point.
The International dataset was obtained ...

Knowledge from non-English-language studies broadens contributions to conservation policy and helps to tackle bias in biodiversity data

Dataset DOI: 10.5061/dryad.ngf1vhj68

Description of the data and file structure

We collected time-series of vertebrate population abundance suitable for entry into the LPD (livingplanetindex.org), which provides the repository for one of the indicators in the GBF, the Living Planet Index (LPI, Ledger et al., 2023).

We collected data from three sets of sources: a) Portuguese-language articles from Brazilian journals (hereafter “Brazilian-Portuguese” dataset), b) English-language articles from Brazilian journals (“Brazilian-English” dataset) and c) English-language articles from non-Brazilian journals (“International” dataset). For a) and b), we first compiled a list of Brazilian biodiversity-related journals using the list of non-English-language journals in ecology and conservat...,
Most popular database management systems worldwide 2024
statista.com
Updated Jun 30, 2025
Cite
Statista (2025). Most popular database management systems worldwide 2024 [Dataset]. https://www.statista.com/statistics/809750/worldwide-popularity-ranking-database-management-systems/
Dataset updated
Jun 30, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jun 2024
Area covered
Worldwide
Description
As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of *******; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive.

Database Management Systems

As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world’s growing focus on big data and data analytics, knowledge of SQL has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way that consumers access information through applications, which further illustrates the importance of the software.
Data from: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of...
figshare.com
xlsx
Updated Oct 12, 2024
+ more versions
Cite
Nirmalya Thakur (2024). Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.27072247.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.27072247.v1
Dataset updated
Oct 12, 2024
Dataset provided by
figshare
Authors
Nirmalya Thakur
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please cite this paper when using this dataset: N. Thakur, “Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis,” arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292

Abstract: The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets focusing on different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in this paper (stated above) aims to address this research gap. It presents this multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024. This dataset contains Instagram posts about mpox in 52 languages.

For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post into:
one of the fine-grain sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral
hate or not hate
anxiety/stress detected or no anxiety/stress detected

These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.

The 52 distinct languages in which Instagram posts are present in the dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.

The following is a description of the attributes present in this dataset:
Post ID: Unique ID of each Instagram post
Post Description: Complete description of each post in the language in which it was originally published
Date: Date of publication in MM/DD/YYYY format
Language: Language of the post as detected using the Google Translate API
Translated Post Description: Translated version of the post description. All posts which were not in English were translated into English using the Google Translate API. No language translation was performed for English posts.
Sentiment: Results of sentiment analysis (using the preprocessed version of the translated Post Description) where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, and neutral
Hate: Results of hate speech detection (using the preprocessed version of the translated Post Description) where each post was classified as hate or not hate
Anxiety or Stress: Results of anxiety or stress detection (using the preprocessed version of the translated Post Description) where each post was classified as stress/anxiety detected or no stress/anxiety detected.

All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
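The attributes described above could be read along the following lines. This is only a minimal sketch: it assumes the xlsx file has been exported to CSV with the attribute names as column headers, and the three rows in `SAMPLE_CSV` are made-up placeholders rather than records from the dataset. The exact label strings stored in the Sentiment, Hate, and Anxiety or Stress columns are also assumptions.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Placeholder rows (NOT real data); column names follow the attribute list.
SAMPLE_CSV = """Post ID,Post Description,Date,Language,Translated Post Description,Sentiment,Hate,Anxiety or Stress
p1,placeholder text,07/23/2022,English,placeholder text,fear,not hate,anxiety/stress detected
p2,texto de exemplo,09/05/2024,Portuguese,sample text,neutral,not hate,no anxiety/stress detected
p3,placeholder text,01/15/2023,English,placeholder text,fear,not hate,no anxiety/stress detected
"""


def load_posts(handle):
    """Read rows and convert the MM/DD/YYYY Date column to date objects."""
    rows = []
    for row in csv.DictReader(handle):
        row["Date"] = datetime.strptime(row["Date"], "%m/%d/%Y").date()
        rows.append(row)
    return rows


posts = load_posts(io.StringIO(SAMPLE_CSV))

# Count how many posts fall into each sentiment class.
sentiment_counts = Counter(post["Sentiment"] for post in posts)

# Find the earliest publication date among the loaded posts.
earliest = min(post["Date"] for post in posts)

print(sentiment_counts)
print(earliest.isoformat())
```

Parsing the date explicitly with `%m/%d/%Y` avoids ambiguity between month-first and day-first conventions when the file is opened in tools with a different locale.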
MultiFin
huggingface.co
Cite
Ashwin Mathur, MultiFin [Dataset]. https://huggingface.co/datasets/awinml/MultiFin
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Ashwin Mathur
Description
MultiFin

MultiFin – a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset has a hierarchical label structure that provides two classification tasks: multi-label and multi-class.

Dataset Description

The MULTIFIN dataset is a multilingual corpus, consisting of real-world article headlines covering 15 languages. The corpus is annotated using hierarchical… See the full description on the dataset page: https://huggingface.co/datasets/awinml/MultiFin.
aime_2024_multilingual
huggingface.co
Updated Jun 4, 2025
+ more versions
Cite
Shan Chen (2025). aime_2024_multilingual [Dataset]. https://huggingface.co/datasets/shanchen/aime_2024_multilingual
Dataset updated
Jun 4, 2025
Authors
Shan Chen
Description
When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy
https://arxiv.org/abs/2505.22888
Jirui Qi, Shan Chen, Zidi Xiong, Raquel Fernández, Danielle S. Bitterman, Arianna Bisazza

Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real world applications because… See the full description on the dataset page: https://huggingface.co/datasets/shanchen/aime_2024_multilingual.
MoreFixes: Largest CVE dataset with fixes
data.niaid.nih.gov
explore.openaire.eu
+1 more
Updated Oct 23, 2024
+ more versions
Cite
Akhoundali, Jafar (2024). MoreFixes: Largest CVE dataset with fixes [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11199119
Dataset updated
Oct 23, 2024
Dataset provided by
Rietveld, Kristian F. D.
Rahim Nouri, Sajad
Akhoundali, Jafar
GADYATSKAYA, Olga
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
In our work, we have designed and implemented a novel workflow with several heuristic methods to combine state-of-the-art methods related to CVE fix commits gathering. As a consequence of our improvements, we have been able to gather the largest programming language-independent real-world dataset of CVE vulnerabilities with the associated fix commits. Our dataset, containing 29,203 unique CVEs coming from 7,238 unique GitHub projects, is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 35,276 unique commits stored as SQL and 39,931 patch commit files that fixed those vulnerabilities (some patch files can't be saved as SQL for several technical reasons). Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security. We used the NVD (nvd.nist.gov) and the GitHub Security Advisory Database as the main sources of our pipeline.

We release to the community a 16GB PostgreSQL database that contains information on CVEs up to 2024-09-26, CWEs of each CVE, files and methods changed by each commit, and repository metadata. Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we make our dataset collection tool also available to the community.

The cvedataset-patches.zip file contains fix patches, and postgrescvedumper.sql.zip contains a PostgreSQL dump of fixes, together with several other fields such as CVEs, CWEs, repository metadata, commit data, file changes, methods changed, etc.

The MoreFixes data-storage strategy is based on CVEFixes to store CVE commit fixes from open-source repositories, and uses a modified version of Prospector (part of ProjectKB from SAP) as a module to detect commit fixes of a CVE. Our full methodology is presented in the paper, with the title of "MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery", which will be published in the PROMISE conference (2024).

For more information about usage and sample queries, visit the GitHub repository: https://github.com/JafarAkhondali/Morefixes

If you are using this dataset, please be aware that the repositories that we mined contain different licenses and you are responsible for handling any licensing issues. The same applies to CVEFixes.

This product uses the NVD API but is not endorsed or certified by the NVD.

This research was partially supported by the Dutch Research Council (NWO) under the project NWA.1215.18.008 Cyber Security by Integrated Design (C-SIDe).

To restore the dataset, you can use the docker-compose file available at the GitHub repository. Dataset default credentials after restoring the dump:

POSTGRES_USER=postgrescvedumper POSTGRES_DB=postgrescvedumper POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d
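
Once the dump has been restored, the default credentials above could be turned into a PostgreSQL connection URI roughly as follows. This is a sketch rather than part of the MoreFixes tooling: the host `localhost` and port `5432` are assumptions about a local docker-compose setup, and the helper names are hypothetical.

```python
from urllib.parse import quote


def parse_env_line(line: str) -> dict:
    """Split 'KEY=value KEY2=value2' into a dictionary."""
    return dict(pair.split("=", 1) for pair in line.split())


def build_postgres_uri(env: dict, host: str = "localhost", port: int = 5432) -> str:
    """Build a postgresql:// URI; host and port are assumed defaults."""
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    database = quote(env["POSTGRES_DB"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Default credentials as listed in the dataset description.
credentials = (
    "POSTGRES_USER=postgrescvedumper "
    "POSTGRES_DB=postgrescvedumper "
    "POSTGRES_PASSWORD=a42a18537d74c3b7e584c769152c3d"
)

uri = build_postgres_uri(parse_env_line(credentials))
print(uri)
```

The resulting URI can be passed to any PostgreSQL client that accepts connection strings; percent-encoding the fields keeps the URI valid if the credentials are later changed to include special characters.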

Please use this for citation:

title={MoreFixes: A large-scale dataset of CVE fix commits mined through enhanced repository discovery}, author={Akhoundali, Jafar and Nouri, Sajad Rahim and Rietveld, Kristian and Gadyatskaya, Olga}, booktitle={Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering}, pages={42--51}, year={2024} }
Lohmann, Aaron, Békés, Gábor, Hinz, Julian, Koren, Miklós (2024). Dataset:...
service.tib.eu
Updated Nov 28, 2024
+ more versions
Cite
(2024). Lohmann, Aaron, Békés, Gábor, Hinz, Julian, Koren, Miklós (2024). Dataset: Open source software input output tables (ossio). https://doi.org/10.22000/SaNahyIFpqpJVFbb [Dataset]. https://service.tib.eu/ldmservice/dataset/rdr-doi-10-22000-sanahyifpqpjvfbb
Dataset updated
Nov 28, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract: The global Open-Source Software Input Output (OSSIO) tables were built including five different programming languages and 15 countries. The researchers used knowledge of the geographical location of software developers and linkages between software projects (dependencies) to aggregate these to flows between countries. The OSSIO tables were built as part of the EU-funded research project 'Rethinking Global Supply Chains: Measurement, Impact and Policy' (RETHINK-GSC; https://rethink-gsc.eu/), which captures the impact of knowledge flows and service inputs in global supply chains (GSCs).
THAR Dataset Dataset
paperswithcode.com
Updated Mar 22, 2024
Cite
(2024). THAR Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/thar-dataset
Dataset updated
Mar 22, 2024
Description
The increase in religiously motivated hate on social media is clear and ongoing. These platforms have become fertile ground for the dissemination of hate speech directed at religious communities, resulting in tangible repercussions in the real world. Much of the current research concerning the automated identification of hateful content on social media focuses on English-language content. There is comparatively less exploration in low-resource languages such as Hindi. As social media users increasingly utilize their regional languages for expression, it becomes crucial to dedicate appropriate research efforts to hate speech detection in these languages.

Hence, this work aims to fill this research void by introducing a meticulously curated and annotated dataset of YouTube comments in Hindi-English code-mixed language, specifically designed to identify instances of religious hate.

Citation: Sharma, D., Singh, A., & Singh, V. K. (2024). THAR-Targeted Hate Speech Against Religion: A high-quality Hindi-English code-mixed Dataset with the Application of Deep Learning Models for Automatic Detection. ACM Transactions on Asian and Low-Resource Language Information Processing. (https://doi.org/10.1145/3653017)
The global cloud database and DBaaS market size is USD 21.9 billion in 2024...
cognitivemarketresearch.com
pdf,excel,csv,ppt
Updated May 24, 2024
Cite
Cognitive Market Research (2024). The global cloud database and DBaaS market size is USD 21.9 billion in 2024 and will grow at a compound annual growth rate (CAGR) of 21.6% from 2024 to 2031. [Dataset]. https://www.cognitivemarketresearch.com/cloud-database-and-dbaas-market-report
Available download formats: pdf, excel, csv, ppt
Dataset updated
May 24, 2024
Dataset authored and provided by
Cognitive Market Research
License
https://www.cognitivemarketresearch.com/privacy-policy
Time period covered
2021 - 2033
Area covered
Global
Description
According to Cognitive Market Research, the global cloud database and DBaaS market size will be USD 21.9 billion in 2024 and will increase at a compound annual growth rate (CAGR) of 21.6% from 2024 to 2031.

Market Dynamics of Cloud Database and DBaaS Market

Key Drivers for Cloud Database and DBaaS Market

Mobile and IoT Adoption - The rise of mobile and IoT technologies fuels demand for cloud databases and DBaaS solutions. Data generation surges as mobile usage skyrockets and IoT devices flourish, necessitating scalable, accessible storage options. Cloud databases offer flexibility and scalability to accommodate these dynamic workloads while enabling seamless integration with mobile and IoT applications. The shift towards digital transformation initiatives also amplifies the need for agile, cloud-native database solutions to support modernization efforts across industries. Automated administration reduces operational complexity, which drives the cloud database and DBaaS market's expansion in the years ahead.

Key Restraints for Cloud Database and DBaaS Market

Compatibility issues with existing systems hinder the adoption of the cloud database and DBaaS in the industry. The market also faces significant difficulties related to data migration challenges that hinder adoption and scalability.

Introduction of the Cloud Database and DBaaS Market

Cloud databases and Database-as-a-Service (DBaaS) offer scalable and managed storage solutions where data is hosted and accessed over the internet. Market drivers for these services include the imperative for scalability to accommodate growing data volumes, cost efficiencies achieved through a shift from capital to operational expenditure, enhanced accessibility enabling collaboration and innovation from any location, heightened demand for robust security features to address data privacy concerns, simplified management through automated administration, and elasticity to handle fluctuating workloads seamlessly. These drivers collectively address modern business needs for flexibility, cost-effectiveness, security, and performance. As organizations increasingly depend on data as a strategic asset, cloud databases and DBaaS solutions provide the agility and efficiency required to meet evolving demands while leveraging the benefits of cloud computing infrastructure.
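The growth figures quoted in this description (USD 21.9 billion in 2024, a 21.6% CAGR through 2031) follow the standard compound-growth formula, value = start × (1 + rate)^years. The sketch below applies that formula; the function names are illustrative, and the projected figure is a back-of-the-envelope calculation, not a number published in the report.

```python
def project_value(start_value: float, annual_rate: float, years: int) -> float:
    """Compound a starting value forward at a constant annual growth rate."""
    return start_value * (1 + annual_rate) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Recover the constant annual growth rate linking two values."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the description: USD 21.9 billion in 2024, 21.6% CAGR to 2031.
projected_2031 = project_value(21.9, 0.216, 2031 - 2024)
print(f"Projected 2031 market size: USD {projected_2031:.1f} billion")

# Round trip: the implied rate should match the quoted CAGR.
print(f"Implied CAGR: {implied_cagr(21.9, projected_2031, 7):.3f}")
```

Under these assumptions the market would roughly quadruple over the seven-year window, which is a useful sanity check when comparing forecasts from different vendors.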
Dataset - CORE-MD Post-Market Surveillance Tool
zenodo.org
data.niaid.nih.gov
csv
Updated Apr 24, 2024
Cite
Yijun Ren; Enrico Gianluca Caiani (2024). Dataset - CORE-MD Post-Market Surveillance Tool [Dataset]. http://doi.org/10.5281/zenodo.10864069
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.10864069
Dataset updated
Apr 24, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Yijun Ren; Enrico Gianluca Caiani
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Mar 25, 2024
Description
WP3 of CORE-MD investigated how to aggregate and extract maximal value for post-market surveillance from medical device registries, big data, clinical practices and experience, and the internet. This data collection was created by Task 3.2 of the CORE-MD project, as the result of the proposed methodological framework to transform unstructured and dispersed publicly available safety information (Field Safety Notices, recalls, alerts) into a standardized and harmonized database. The database includes 137,720 historical safety notices (updated to February 2024) published by different national competent authorities (16 EU Member States and 5 extra-EU jurisdictions).
xcopa
huggingface.co
Updated Jun 20, 2024
+ more versions
Cite
SEACrowd (2024). xcopa [Dataset]. https://huggingface.co/datasets/SEACrowd/xcopa
Dataset updated
Jun 20, 2024
Dataset authored and provided by
SEACrowd
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning

The Cross-lingual Choice of Plausible Alternatives dataset is a benchmark to evaluate the ability of machine learning models to transfer commonsense reasoning across languages. The dataset is the translation and reannotation of the English COPA (Roemmele et al. 2011) and covers 11 languages from 11 families and several areas around the globe. The dataset is challenging as it requires both the command of world knowledge and the ability to generalise to new languages. All the details about the creation of XCOPA and the implementation of the baselines are available in the paper.
STEM Dataset
paperswithcode.com
Updated May 15, 2025
Cite
(2025). STEM Dataset [Dataset]. https://paperswithcode.com/dataset/stem
Dataset updated
May 15, 2025
Description
This dataset is proposed in the ICLR 2024 paper: Measuring Vision-Language STEM Skills of Neural Models. Real-world problems often require solutions that combine knowledge from STEM (science, technology, engineering, and math). Unlike existing datasets, our dataset requires the understanding of multimodal vision-language information of STEM. It is one of the largest and most comprehensive datasets for this challenge, with 448 skills and 1,073,146 questions spanning all STEM subjects. Compared to existing datasets that often focus on examining expert-level ability, our dataset includes fundamental skills and questions designed based on the K-12 curriculum. We also add state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo to our benchmark.
THINGS-MEG
openneuro.org
Updated May 29, 2025
+ more versions
Cite
Martin N. Hebart; Oliver Contier; Lina Teichmann; Adam H. Rockter; Charles Zheng; Alexis Kidder; Anna Corriveau; Maryam Vaziri-Pashkam; Chris I. Baker (2025). THINGS-MEG [Dataset]. http://doi.org/10.18112/openneuro.ds004212.v3.0.0
Unique identifier
https://doi.org/10.18112/openneuro.ds004212.v3.0.0
Dataset updated
May 29, 2025
Dataset provided by
OpenNeuro (https://openneuro.org/)
Authors
Martin N. Hebart; Oliver Contier; Lina Teichmann; Adam H. Rockter; Charles Zheng; Alexis Kidder; Anna Corriveau; Maryam Vaziri-Pashkam; Chris I. Baker
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
THINGS-MEG

Understanding object representations and the visual and semantic processing of objects requires a broad, comprehensive sampling of the objects in our visual world with dense measurements of brain activity and behavior. This densely sampled MEG dataset is part of THINGS-data, a multimodal collection of large-scale datasets comprising functional MRI, magnetoencephalographic recordings, and 4.70 million behavioral judgments in response to thousands of photographic images for up to 1,854 object concepts. THINGS-data is unique in its breadth of richly-annotated objects, allowing for testing countless novel hypotheses at scale while assessing the reproducibility of previous findings. The multimodal data allows for studying both the temporal and spatial dynamics of object representations and their relationship to behavior and additionally provides the means for combining these datasets for novel insights into object processing. THINGS-data constitutes the core release of the THINGS initiative for bridging the gap between disciplines and the advancement of cognitive neuroscience.

Dataset overview

We collected extensively sampled object representations using magnetoencephalography (MEG). To this end, we drew on the THINGS database (Hebart et al., 2019), a richly-annotated database of 1,854 object concepts representative of the American English language which contains 26,107 manually-curated naturalistic object images.

During the MEG experiment, participants were shown a representative subset of THINGS images, spread across 12 separate sessions (N=4, 22,448 unique images of 1,854 objects). Images were shown in fast succession (1.5±0.2s), and participants were instructed to maintain central fixation. To ensure engagement, participants performed an oddball detection task responding to occasional artificially-generated images. A subset of images (n=200) was shown repeatedly in each session.

Beyond the core functional imaging data in response to THINGS images, we acquired T1-weighted MRI scans to allow for cortical source localization. Eye movements were monitored in the MEG to ensure participants maintained central fixation.
Artificial Intelligence (AI) Text Generator Market Analysis North America,...
technavio.com
Updated Jul 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Technavio (2024). Artificial Intelligence (AI) Text Generator Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, UK, China, India, Germany - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/ai-text-generator-market-analysis
Explore at:
Dataset updated
Jul 15, 2024
Dataset provided by
TechNavio
Authors
Technavio
Time period covered
2021 - 2025
Area covered
United States, Global
Description

Artificial Intelligence Text Generator Market Size 2024-2028

The artificial intelligence (AI) text generator market size is forecast to increase by USD 908.2 million at a CAGR of 21.22% between 2023 and 2028.

The market is experiencing significant growth due to several key trends. One of these trends is the increasing popularity of AI generators in various sectors, including education for e-learning applications. Another trend is the growing importance of speech-to-text technology, which is becoming increasingly essential for improving productivity and accessibility. However, data privacy and security concerns remain a challenge for the market, as generators process and store vast amounts of sensitive information. It is crucial for market participants to address these concerns through strong data security measures and transparent data handling practices to ensure customer trust and compliance with regulations. Overall, the AI generator market is poised for continued growth as it offers significant benefits in terms of efficiency, accuracy, and accessibility.

What will be the Size of the Artificial Intelligence (AI) Text Generator Market During the Forecast Period?


The market is experiencing significant growth as businesses and organizations seek to automate content creation across various industries. Driven by technological advancements in machine learning (ML) and natural language processing, AI generators are increasingly being adopted for downstream applications in sectors such as education, manufacturing, and e-commerce. Moreover, these systems enable the creation of personalized content for global audiences in multiple languages, providing a competitive edge for businesses in an interconnected Internet economy. However, responsible AI practices are crucial to mitigate risks associated with biased content, misinformation, misuse, and potential misrepresentation.

How is this Artificial Intelligence (AI) Text Generator Industry segmented and which is the largest segment?

The artificial intelligence (AI) text generator industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.

Component: Solution, Service
Application: Text to text, Speech to text, Image/video to text
Geography: North America (US), Europe (Germany, UK), APAC (China, India), South America, Middle East and Africa

By Component Insights

The solution segment is estimated to witness significant growth during the forecast period.

Artificial Intelligence (AI) text generators have gained significant traction in various industries due to their efficiency and cost-effectiveness in content creation. These solutions utilize machine learning algorithms, such as deep neural networks, to analyze and learn from vast datasets of human-written text. By predicting the most probable word or sequence of words based on patterns and relationships identified in the training data, AI generators produce personalized content for multiple languages and global audiences. Applications span industries including education, manufacturing, e-commerce, and entertainment and media. In the education industry, AI generators assist in creating personalized learning materials.
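The idea of predicting the most probable next word from patterns in training text can be illustrated with a toy model. The sketch below uses simple bigram counts as a stand-in; it is not a deep neural network and not the method of any product covered by the report. The corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for each word, which words follow it in the training sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def most_probable_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy training data (illustrative only).
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigrams(corpus)
print(most_probable_next(model, "the"))
print(most_probable_next(model, "sat"))
```

Production text generators replace the count table with a learned model that conditions on much longer context, but the prediction step, choosing a likely continuation given what came before, follows the same principle.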


The solution segment was valued at USD 184.50 million in 2018 and showed a gradual increase during the forecast period.

Regional Analysis

North America is estimated to contribute 33% to the growth of the global market during the forecast period.

Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.


The North American market holds the largest share in the market, driven by the region's technological advancements and increasing adoption of AI in various industries. AI text generators are increasingly utilized for content creation, customer service, virtual assistants, and chatbots, catering to the growing demand for high-quality, personalized content in sectors such as e-commerce and digital marketing. Moreover, the presence of tech giants like Google, Microsoft, and Amazon in North America, who are investing significantly in AI and machine learning, further fuels market growth. AI generators employ Machine Learning algorithms, Deep Neural Networks, and Natural Language Processing to generate content in multiple languages for global audiences.

Market Dynamics

Our researchers analyzed the data with 2023 as the base year, along with the key drivers, trends, and c
x-fact
huggingface.co
NLP at University of Utah, x-fact [Dataset]. https://huggingface.co/datasets/utahnlp/x-fact
Dataset authored and provided by
NLP at University of Utah
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset Card for "x-fact"

Dataset Description

Dataset Summary

X-FACT is a multilingual dataset for fact-checking with real-world claims. The dataset contains short statements in 25 languages, each with the top five evidence documents retrieved by performing a Google search with the claim statement. The dataset contains two additional evaluation splits (in addition to a traditional test set): ood and zeroshot. ood measures out-of-domain generalization where while the language… See the full description on the dataset page: https://huggingface.co/datasets/utahnlp/x-fact.
tydiqa
huggingface.co
Updated Jun 20, 2024
SEACrowd (2024). tydiqa [Dataset]. https://huggingface.co/datasets/SEACrowd/tydiqa
Dataset updated
Jun 20, 2024
Dataset authored and provided by
SEACrowd
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology (the set of linguistic features that each language expresses), such that we expect models performing well on this set to generalize across a large number of the languages in the world. It contains language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer but don't know it yet (unlike SQuAD and its descendants), and the data is collected directly in each language without the use of translation (unlike MLQA and XQuAD).
danish-citizen-global-exams
huggingface.co
Updated Apr 1, 2025
Mike Zhang (2025). danish-citizen-global-exams [Dataset]. https://huggingface.co/datasets/jjzha/danish-citizen-global-exams
Dataset updated
Apr 1, 2025
Authors
Mike Zhang
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Dataset Card for "danish-citizen-tests"

Original point of contact: Dan Saattrup Nielsen from The Alexandra Institute
Processor for Global Exams: Mike Zhang from Aalborg University

Dataset Summary

This dataset contains tests for citizenship ("indfødsretsprøven") and permanent residence ("medborgerskabsprøven") in Denmark, from the years 2016-2023.

Languages

The dataset is available in Danish (da).

Dataset Structure

An example from… See the full description on the dataset page: https://huggingface.co/datasets/jjzha/danish-citizen-global-exams.
Indic-subtitler-audio_evals
huggingface.co
Updated Feb 19, 2024
Kurian Benoy (2024). Indic-subtitler-audio_evals [Dataset]. https://huggingface.co/datasets/kurianbenoy/Indic-subtitler-audio_evals
Dataset updated
Feb 19, 2024
Authors
Kurian Benoy
License
https://choosealicense.com/licenses/gpl-2.0/
Description
Indic_audio_evals

As part of this project, we are evaluating the performance of various ASR models on a benchmarking dataset that we have created in various languages. This benchmarking dataset is more aligned with real-world use cases than academic datasets are.

About Dataset

Dataset Link in HuggingFace: kurianbenoy/Indic-subtitler-audio_evals

This dataset contains audio files in .wav format and video files in .mp4 format. The respective ground truth will be… See the full description on the dataset page: https://huggingface.co/datasets/kurianbenoy/Indic-subtitler-audio_evals.

