Description:
We are pleased to present a unique and valuable dataset collected from ancient stone inscriptions located at the historic Venkatesa Perumal Temple in Kanchipuram, Tamil Nadu, India. The temple, which dates back over a millennium, is adorned with equally ancient stone inscriptions that offer insight into the language, culture, and history of the Tamil people. This dataset is a critical resource for those interested in studying Tamil epigraphy, paleography, and the preservation of heritage through 3D reconstruction techniques.
For each inscription, we meticulously captured 15 to 25 digital images from multiple angles using high-resolution cameras. These images are used to reconstruct 3D models of the inscriptions, enabling detailed and comprehensive analysis. Additionally, we employed LiDAR technology to create high-accuracy 3D models, preserving these invaluable inscriptions in digital form for future generations.
Download Dataset
Inscribed Inscriptions: These are inscriptions that remain intact and are carved deeply into the stone. They are relatively well preserved and legible, providing clear information.
Eroded Inscriptions: Over time, weathering has eroded parts of these inscriptions. While some characters may still be discernible, much of the original text has faded.
Projected Inscriptions: These inscriptions are carved in such a way that the text projects outward from the stone surface. The unique style often makes them easier to read but also more susceptible to damage.
Rural Paleographic Inscriptions: These inscriptions represent a specific style of writing used in rural areas. They often reflect the simpler, everyday language and script of the common people of the time.
Urban Paleographic Inscriptions: In contrast, urban paleographic inscriptions were typically created in more sophisticated scripts. These were often commissioned by rulers or temple authorities and represent higher levels of literacy and formal language use.
3D Model Construction
The images captured for each inscription are used to construct 3D models of the stones, providing a digital preservation of these historical records. The number of input images plays a significant role in the clarity and detail of the final model. Our experiments show that with a minimum of 10 input images, a clear 3D model can be constructed. However, when 15-25 images are used, the resulting models are much sharper, offering greater detail and accuracy for further study and analysis.
The inclusion of LiDAR technology further enhances the accuracy of the 3D models by capturing intricate details in the inscriptions’ surfaces. This technology is especially useful for eroded or projected inscriptions, where traditional photography may struggle to capture the full depth and nuance of the text.
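To make the image-based pipeline concrete, the following minimal sketch runs an incremental structure-from-motion reconstruction with the open-source COLMAP command-line tool. This is an assumption offered for illustration only; the exact software used for our reconstructions is not specified here, and the folder layout is hypothetical.

import subprocess
from pathlib import Path

# Hypothetical layout: the 15-25 photos of one inscription sit in one folder.
images = Path("inscription_01/images")
work = Path("inscription_01/colmap")
work.mkdir(parents=True, exist_ok=True)
db = str(work / "database.db")

# Detect features in every photo, match them across views,
# then run incremental structure-from-motion to get a sparse 3D model.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", db, "--image_path", str(images)], check=True)
subprocess.run(["colmap", "exhaustive_matcher", "--database_path", db], check=True)
sparse = work / "sparse"
sparse.mkdir(exist_ok=True)
subprocess.run(["colmap", "mapper", "--database_path", db,
                "--image_path", str(images), "--output_path", str(sparse)], check=True)

Adding more input photographs mainly improves feature-matching coverage between views, which is consistent with our observation that 15-25 image models come out sharper than those built from the 10-image minimum.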
List of Inscriptions
Inscription 1 – Eroded Inscriptions: Despite the wear and tear, some segments of the text are still visible, offering fragments of the original narrative.
Inscription 2 – Urban Paleographic Inscriptions: Displaying the formal urban writing style, these inscriptions are a testament to the sophistication of the era’s linguistic and cultural practices.
Inscription 3 – Projected Inscriptions: These inscriptions protrude from the surface of the stone, creating a unique visual experience and allowing easier identification of certain characters.
Inscription 4 – Inscribed Inscriptions: These deeply carved inscriptions remain remarkably preserved, providing a clearer representation of the historical texts.
Inscription 5 – Rural Paleographic Inscriptions: These inscriptions offer valuable insight into the language and script of rural Tamil Nadu, reflecting a simpler style.
Inscription 6 – Eroded Inscriptions: As with Inscription 1, this inscription has suffered from the ravages of time, but still offers partial data for epigraphic studies.
This dataset is sourced from Kaggle.
https://www.futurebeeai.com/data-license-agreement
Welcome to the Portuguese Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.
Dataset Content: This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Portuguese language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more. Each prompt is accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Portuguese speakers, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references. Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.
Prompt Diversity: To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others. These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.
Response Formats: To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale aids the language model in building a reasoning process for complex questions. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled Portuguese Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.
Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance. The Portuguese version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain-of-thought prompt-completion data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Portuguese Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
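As a quick illustration of working with the JSON release, the snippet below loads the records and groups them by complexity. The filename is hypothetical, and the field names (prompt_complexity, prompt, rationale) are assumptions inferred from the annotation details listed above.

import json
from collections import Counter

# Hypothetical filename; assumes the file holds a JSON list of records.
with open("pt_cot_dataset.json", encoding="utf-8") as f:
    records = json.load(f)

# Distribution of prompts across the easy/medium/hard levels described above.
print(Counter(r.get("prompt_complexity") for r in records))

hard = [r for r in records if r.get("prompt_complexity") == "hard"]
print(hard[0]["prompt"], "->", hard[0]["rationale"][:100])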
Wirestock's AI/ML Image Training Data, 4.5M Files with Metadata: This data product is a unique offering in the realm of AI/ML training data. What sets it apart is the sheer volume and diversity of the dataset, which includes 4.5 million files spanning across 20 different categories. These categories range from Animals/Wildlife and The Arts to Technology and Transportation, providing a rich and varied dataset for AI/ML applications.
The data is sourced from Wirestock's platform, where creators upload and sell their photos, videos, and AI art online. This means that the data is not only vast but also constantly updated, ensuring a fresh and relevant dataset for your AI/ML needs. The data is collected in a GDPR-compliant manner, ensuring the privacy and rights of the creators are respected.
The primary use-cases for this data product are numerous. It is ideal for training machine learning models for image recognition, improving computer vision algorithms, and enhancing AI applications in various industries such as retail, healthcare, and transportation. The diversity of the dataset also means it can be used for more niche applications, such as training AI to recognize specific objects or scenes.
This data product fits into Wirestock's broader data offering as a key resource for AI/ML training. Wirestock is a platform for creators to sell their work, and this dataset is a collection of that work. It represents the breadth and depth of content available on Wirestock, making it a valuable resource for any company working with AI/ML.
The core benefits of this dataset are its volume, diversity, and quality. With 4.5 million files, it provides a vast resource for AI training. The diversity of the dataset, spanning 20 categories, ensures a wide range of images for training purposes. The quality of the images is also high, as they are sourced from creators selling their work on Wirestock.
In terms of how the data is collected, creators upload their work to Wirestock, where it is then sold on various marketplaces. This means the data is sourced directly from creators, ensuring a diverse and unique dataset. The data includes both the images themselves and associated metadata, providing additional context for each image.
The different image categories included in this dataset are Animals/Wildlife, The Arts, Backgrounds/Textures, Beauty/Fashion, Buildings/Landmarks, Business/Finance, Celebrities, Education, Emotions, Food/Drinks, Holidays, Industrial, Interiors, Nature Parks/Outdoor, People, Religion, Science, Signs/Symbols, Sports/Recreation, Technology, Transportation, Vintage, Healthcare/Medical, Objects, and Miscellaneous. This wide range of categories ensures a diverse dataset that can cater to a variety of AI/ML applications.
https://www.futurebeeai.com/data-license-agreement
The Hindi Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Hindi language, advancing the field of artificial intelligence.
Dataset Content: This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Hindi. There is no context paragraph given to choose an answer from; each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Hindi speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph types of answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled Hindi Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, and rich_text.
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination. Both the questions and answers in Hindi are grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.
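As an example of how the CSV release might be inspected, the sketch below uses pandas with the annotation fields listed above; the filename and the exact label values are assumptions.

import pandas as pd

# Hypothetical filename; column names follow the annotation details above.
df = pd.read_csv("hindi_open_ended_qa.csv")

print(df["complexity"].value_counts())         # easy / medium / hard distribution
print(df["question_category"].value_counts())  # e.g. fact-based vs. opinion-based
science_qa = df[df["domain"] == "science"]     # the label string is an assumption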
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Hindi Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
We need to build more efficient language models that are able to generate fun pieces of text.
This dataset contains 3259 animal names. These names are useful for training a character-level language model to generate new possible names. Character-level language models are less used because they are expensive to train, but using them, you can avoid having out-of-vocabulary tokens.
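A character-level model over such a list can be as small as a bigram transition table. The sketch below, which assumes a hypothetical file with one animal name per line, counts character transitions and samples new names from them.

import random
from collections import defaultdict

# Hypothetical file: one animal name per line.
with open("animal_names.txt", encoding="utf-8") as f:
    names = [line.strip().lower() for line in f if line.strip()]

# Count bigram transitions, using "^" and "$" as start/end markers.
counts = defaultdict(lambda: defaultdict(int))
for name in names:
    chars = ["^"] + list(name) + ["$"]
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1

def sample_name(max_len=12):
    out, ch = [], "^"
    while len(out) < max_len:
        nexts = counts[ch]
        ch = random.choices(list(nexts), weights=list(nexts.values()))[0]
        if ch == "$":
            break
        out.append(ch)
    return "".join(out)

print([sample_name() for _ in range(5)])

A neural character-level model would generalize better, but even this table already produces pronounceable, name-like strings without any fixed word vocabulary.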
https://www.futurebeeai.com/data-license-agreement
Welcome to the English Language General Conversation Speech Dataset, a comprehensive and diverse collection of voice data specifically curated to advance the development of English language speech recognition models, with a particular focus on Canadian accents and dialects.
With high-quality audio recordings, detailed metadata, and accurate transcriptions, it empowers researchers and developers to enhance natural language processing, conversational AI, and Generative Voice AI algorithms. Moreover, it facilitates the creation of sophisticated voice assistants and voice bots tailored to the unique linguistic nuances found in the English language spoken in Canada.
Speech Data: This training dataset comprises 30 hours of audio recordings covering a wide range of topics and scenarios, ensuring robustness and accuracy in speech technology applications. To achieve this, we collaborated with a diverse network of 40 native English speakers from different states/provinces of Canada. This collaborative effort guarantees a balanced representation of Canadian accents, dialects, and demographics, reducing biases and promoting inclusivity.
Each audio recording captures the essence of spontaneous, unscripted conversations between two individuals, with an average duration ranging from 15 to 60 minutes. The speech data is available in WAV format, with stereo channel files having a bit depth of 16 bits and a sample rate of 8 kHz. The recording environment is generally quiet, without background noise and echo.
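The stated audio format can be checked programmatically with Python's standard wave module; the filename below is hypothetical.

import wave

# Hypothetical filename from the corpus.
with wave.open("conversation_001.wav", "rb") as wf:
    assert wf.getnchannels() == 2      # stereo
    assert wf.getsampwidth() == 2      # 16-bit samples = 2 bytes
    assert wf.getframerate() == 8000   # 8 kHz sample rate
    minutes = wf.getnframes() / wf.getframerate() / 60
    print(f"duration: {minutes:.1f} min")   # expected: roughly 15 to 60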
Metadata: In addition to the audio recordings, our dataset provides comprehensive metadata for each participant. This metadata includes the participant's age, gender, country, state, and dialect. Furthermore, additional metadata such as recording device details, topic of recording, bit depth, and sample rate will be provided.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of English language speech recognition models.
Transcription: This dataset provides a manual verbatim transcription of each audio file to enhance your workflow efficiency. The transcriptions are available in JSON format. The transcriptions capture speaker-wise transcription with time-coded segmentation along with non-speech labels and tags.
Our goal is to expedite the deployment of English language conversational AI and NLP models by offering ready-to-use transcriptions, ultimately saving valuable time and resources in the development process.
Updates and Customization: We understand the importance of collecting data in various environments to build robust ASR models. Therefore, our voice dataset is regularly updated with new audio data captured in diverse real-world conditions.
If you require a custom training dataset with specific environmental conditions such as in-car, busy street, restaurant, or any other scenario, we can accommodate your request. We can provide voice data with customized sample rates ranging from 8kHz to 48kHz, allowing you to fine-tune your models for different audio recording setups. Additionally, we can also customize the transcription following your specific guidelines and requirements, to further support your ASR development process.
License: This audio dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Whether you are training or fine-tuning speech recognition models, advancing NLP algorithms, exploring generative voice AI, or building cutting-edge voice assistants and bots, our dataset serves as a reliable and valuable resource.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
There are several works based on Natural Language Processing on newspaper reports. Mining opinions from headlines [1] using Stanford NLP and SVM by Rameshbhai et al. compared several algorithms on a small and a large dataset. Rubin et al., in their paper [2], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types. The purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. in [3] implemented LDA, a topic modeling approach, to study bias present in online news media.
However, not much NLP research has been invested in studying COVID-19. Most applications include classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.
2 Data-set Introduction
2.1 Data Collection
We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We have named this collection “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named them “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and most widely read in their respective countries. The collection was done manually by 10 human data collectors of age group 23- with university degrees. This approach was more suitable than automation for ensuring the news was highly relevant to the subject. The newspapers' online sites had dynamic content with advertisements in no particular order, so there was a high chance that online scrapers would collect inaccurate news reports. One of the challenges while collecting the data was the requirement of a subscription: each newspaper required $1 per subscription. Some criteria for collecting the news reports, provided as guidelines to the human data collectors, were as follows:
To collect these data we used a Google Form for the USA and BD. We had two human editors go through each entry to check for any spam or troll entries.
2.2 Data Pre-processing and Statistics
Some pre-processing steps performed on the newspaper report dataset are as follows:
While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for any presence of the above-mentioned criteria.
The primary data statistics of the two datasets are shown in Tables 1 and 2.
Table 1: Covid-News-USA-NNK data statistics
No. of words per headline: 7 to 20
No. of words per body content: 150 to 2100
Table 2: Covid-News-BD-NNK data statistics
No. of words per headline: 10 to 20
No. of words per body content: 100 to 1500
2.3 Dataset Repository
We used GitHub as our primary data repository under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We are regularly updating the CSV files and regenerating the JSON using a Python script. We also provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.
3 Literature Review
Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.
Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and the machine translation used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms solely for Natural Language Processing tasks. The two most trending ones are BERT [14], which uses a bidirectional encoder architecture to create the transformer model that can perform near-perfect classification tasks and next-word prediction, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. However, these are all pre-trained models, since they carry a huge computation cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc. [16]. Information extraction from text could be identifying named entities and locations or other essential data. Topic modeling is a sub-task of NLP and also a process of information extraction: it clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic modeling method is Latent Dirichlet Allocation, or LDA [17].
Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and picks the words with more weight as keywords.
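A minimal TextRank-style extractor can be sketched with networkx: build a co-occurrence graph over a sliding window of tokens and rank the nodes with PageRank. This is a simplified illustration of the idea, not the exact formulation of [18].

import networkx as nx

def textrank_keywords(tokens, window=4, top_k=10):
    # Connect words that co-occur within a sliding window.
    graph = nx.Graph()
    for i, w in enumerate(tokens):
        for u in tokens[i + 1:i + window]:
            if u != w:
                graph.add_edge(w, u)
    # Words with higher PageRank weight are better keyword candidates.
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

tokens = ("the rise in covid cases from february to march "
          "was slower than the rise from march to april").split()
print(textrank_keywords(tokens))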
Word clouds are a great visualization technique for understanding the overall 'talk of the topic'. The clustered words give us a quick understanding of the content.
4 Our Experiments and Result Analysis
We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2, and 3, we can point out the following:
We used a script to extract all numbers related to certain keywords like 'Deaths', 'Infected', 'Died', 'Infections', 'Quarantined', 'Lock-down', 'Diagnosed', etc. from the news reports and created case counts for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, as counts gradually rose from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows a positive response against the attack. We used VADER sentiment analysis to extract the sentiment of the headlines and the bodies. On average, the sentiments ranged from -0.5 to -0.9. The VADER sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact it caused. Moreover, sentiment analysis can also provide us with information about how a state or country is reacting to the pandemic. We used the PageRank algorithm to extract
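The VADER scoring used above can be reproduced with the vaderSentiment package, whose compound score lies on the same -1 to 1 scale; the headline here is a made-up example, not one from the dataset.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
headline = "Deaths rise sharply as hospitals overflow"  # made-up example
scores = analyzer.polarity_scores(headline)
print(scores["compound"])  # compound score in [-1, 1]; strongly negative here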
https://www.futurebeeai.com/data-license-agreement
Introducing the Bahasa Product Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Bahasa language.
Dataset Content & Diversity: Containing a total of 2000 images, this Bahasa OCR dataset offers a diverse distribution across different types of product front images. In this dataset, you'll find a variety of text that includes product names, taglines, logos, company names, addresses, product content, etc. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.
To ensure the diversity of the dataset and to build a robust text recognition model, we allow only a limited number (fewer than five) of unique images from a single resource. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Bahasa text.
Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, to build a balanced OCR dataset. The collection features images in portrait and landscape modes.
All these images were captured by native Bahasa speakers to ensure text quality and to avoid toxic content and PII text. We used the latest iOS and Android mobile devices with cameras above 5 MP to capture all these images, maintaining the image quality. In this training dataset, images are available in both JPEG and HEIC formats.
Metadata: Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, it includes metadata like image orientation, country, language, and device information. Each image is properly renamed to correspond with the metadata.
The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Bahasa text recognition models.
Update & Custom Collection: We're committed to expanding this dataset by continuously adding more images with the assistance of our native Bahasa crowd community.
If you require a custom product image OCR dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.
Furthermore, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your specific project requirements using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage the power of this product image OCR dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Bahasa language. Your journey to enhanced language understanding and processing starts here.
Description: 10,000 People - Human Pose Recognition Data. This dataset includes indoor and outdoor scenes. It covers males and females, with an age distribution ranging from teenagers to the elderly; middle-aged and young people are the majority. The data diversity includes different shooting heights, ages, light conditions, collection environments, clothes for different seasons, and multiple human poses. For each subject, labels for gender, race, age, collection environment, and clothing were annotated. The data can be used for human pose recognition and other tasks.
Data size: 10,000 people
Race distribution: Asian (Chinese)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NLUCat is a dataset for NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. In addition, each instruction is accompanied by the instructions received by the annotator who wrote it.
The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).
The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.
The examples are not only written in Catalan, but they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).
This dataset can be used to train models for intent classification, spans identification and examples generation.
This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.
In this repository you'll find the following items:
This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Intent classification, spans identification and examples generation.
The dataset is in Catalan (ca-ES).
Three JSON files, one for each split.
Example
An example looks as follows:
{
"example": "Demana una ambulància; la meva dona està de part.",
"annotation": {
"intent": "call_emergency",
"slots": [
{
"Tag": "service",
"Text": "ambulància",
"Start_char": 11,
"End_char": 21
},
{
"Tag": "situation",
"Text": "la meva dona està de part",
"Start_char": 23,
"End_char": 48
}
]
}
}
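Assuming each split file holds a JSON list of such objects, a few lines are enough to tally the annotated intents and span tags; the filename is hypothetical.

import json
from collections import Counter

# Hypothetical filename; assumes the split is a JSON list of examples.
with open("nlucat_train.json", encoding="utf-8") as f:
    examples = json.load(f)

intents = Counter(ex["annotation"]["intent"] for ex in examples)
print(intents.most_common(10))

tags = Counter(s["Tag"] for ex in examples for s in ex["annotation"]["slots"])
print(tags.most_common(10))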
We created this dataset to contribute to the development of language models in Catalan, a low-resource language.
When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.
Initial Data Collection and Normalization
We commissioned a company to create fictitious examples for the creation of this dataset.
Who are the source language producers?
We commissioned the writing of the examples to the company m47 labs.
Annotation process
The elaboration of this dataset has been done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
* First step: translation or elaboration of the instructions given to the annotators to write the examples.
* Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
* Third step: recording the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.
Who are the annotators?
The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
No personal or sensitive information included.
The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.
We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.
When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.
[N/A]
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0.
Give appropriate credit, provide a link to the license, and indicate if changes were made.
The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
1,998 People - Lip Language Video Data. The data diversity includes multiple scenes, multiple ages, and multiple time periods. In each video, the lip movements for an 8-digit Arabic-numeral sequence were collected. In this dataset, there are 41,866 videos, and the total duration is 86 hours, 56 minutes, and 1.52 seconds. This dataset can be used in tasks such as face anti-spoofing recognition, lip language recognition, etc.
https://www.futurebeeai.com/data-license-agreement
The Bahasa Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Bahasa language, advancing the field of artificial intelligence.
Dataset Content: This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Bahasa. A context paragraph is given for each question, from which the answer is derived. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Bahasa speakers, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph types of answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled Bahasa Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination. The Bahasa version is grammatically accurate without any spelling or grammatical errors. No toxic or harmful content is used while building this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Bahasa Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Education Tools: The "SIGN LANGUAGE RECOGNITION" model can be used in educational platforms or apps designed for learning sign language. The model can provide real-time feedback on users' signing accuracy and help them improve their signing skills.
Communication Aid for the Hearing-Impaired: This model can be implemented in applications that assist hearing-impaired individuals in communicating with others who do not understand sign language. By converting signed gestures into written or spoken language, the system could facilitate more seamless communication.
Real-Time Sign Language Translator: Utilization of this tool can be done in video conferencing platforms to provide real-time translation of sign language, which can make online meetings, webinars, or classes accessible to those who use sign language.
Accessibility in Digital Media: It could be used in platforms like YouTube or streaming services to provide sign language translations for videos that don't already have them, thereby making content more accessible.
Interactive Entertainment: For game developers, the tool can be used to create interactive experiences or games that use sign language. This would not only provide a fun, immersive experience, but also promote sign language learning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three corpora in different domains extracted from Wikipedia.
For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections.
The article structure, and particularly the sub-titles and paragraphs, is kept in these datasets.
Wines
The Wikipedia wines dataset consists of 1635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are
Movies
The Wikipedia movies dataset consists of 100385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more.
For this dataset, we have extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies.
Examples for ground-truth expert-based recommendations are
Video games
The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples for ground-truth expert-based recommendations are:
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
Splits: The first version of MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015 additional test set of 81K images was released, including all the previous test images and 40K new images.
Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.
Annotations: The dataset has annotations for
* object detection: bounding boxes and per-instance segmentation masks with 80 object categories
* captioning: natural language descriptions of the images (see MS COCO Captions)
* keypoints detection: more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle)
* stuff image segmentation: per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff)
* panoptic: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road)
* dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations – each labeled person is annotated with an instance id and a mapping between image pixels that belong to that person's body and a template 3D model
The annotations are publicly available only for training and validation images.
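These annotations are usually consumed through the official pycocotools package; the sketch below loads the standard 2017 detection annotations (adjust the path to your local copy).

from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")

cat_ids = coco.getCatIds(catNms=["person"])
img_ids = coco.getImgIds(catIds=cat_ids)
ann_ids = coco.getAnnIds(imgIds=img_ids[:1], catIds=cat_ids)
anns = coco.loadAnns(ann_ids)  # bounding boxes and segmentation masks
print(len(img_ids), "images contain people;", len(anns), "annotations in the first")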
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Summary
databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported… See the full description on the dataset page: https://huggingface.co/datasets/databricks/databricks-dolly-15k.
This dataset is described in the ALTA 2023 Shared Task and associated CodaLab competition.
The goal of this task is to build automatic detection systems that can discriminate between human-authored and synthetic text generated by Large Language Models (LLMs). The generated synthetic text will come from a variety of sources, including different domains (e.g., law, medical) and different LLMs (e.g., T5, GPT-X). The performance of the models will be evaluated based on their accuracy and robustness in detecting synthetic text.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description
Dataset Summary
These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).
The annotators were initially tasked with evaluating all recordings associated with the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.
The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.
See annotations for more details.
Supported Tasks and Leaderboards
Gender classification, Accent classification.
Languages
The dataset is in Catalan (ca).
Dataset Structure
Instances
Two xlsx documents are published, one for each round of annotations.
The following information is available in each of the documents:
{
  'speaker ID': '1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b',
  'idx': '31',
  'same speaker': {'AN1': 'SI', 'AN2': 'SI', 'AN3': 'SI', 'agreed': 'SI', 'percentage': '100'},
  'gender': {'AN1': 'H', 'AN2': 'H', 'AN3': 'H', 'agreed': 'H', 'percentage': '100'},
  'accent': {'AN1': 'Central', 'AN2': 'Central', 'AN3': 'Central', 'agreed': 'Central', 'percentage': '100'},
  'audio quality': {'AN1': '4.0', 'AN2': '3.0', 'AN3': '3.0', 'agreed': '3.0', 'percentage': '66', 'mean quality': '3.33', 'stdev quality': '0.58'},
  'comments': {'AN1': '', 'AN2': 'pujades i baixades de volum', 'AN3': "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"},
}
We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators received.
Data Fields
speaker ID (string): An id for which client (voice) made the recording in the Common Voice corpus
idx (int): Id in this corpus
AN1 (string): Annotations from Annotator 1
AN2 (string): Annotations from Annotator 2
AN3 (string): Annotations from Annotator 3
agreed (string): Annotation from the majority of the annotators
percentage (int): Percentage of annotators that agree with the agreed annotation
mean quality (float): Mean of the quality annotation
stdev quality (float): Standard deviation of the mean quality
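As an illustration, the agreed label and agreement percentage could be recomputed from the three annotator columns as below. The filename and the exact spreadsheet layout are assumptions; the column names follow the fields above.

import pandas as pd

# Hypothetical filename; assumes one column per annotator for a given attribute.
df = pd.read_excel("round1_annotations.xlsx")

def majority(row, cols=("AN1", "AN2", "AN3")):
    votes = row[list(cols)].value_counts()
    return pd.Series({"agreed": votes.idxmax(),
                      "percentage": round(100 * votes.max() / len(cols))})

recomputed = df.apply(majority, axis=1)
print(recomputed.head())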
Data Splits
The corpus is not divided into splits, as its purpose does not involve training models.
Dataset Creation
Curation Rationale
During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.
In order to obtain a balanced corpus with reliable information, we saw the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Source Data
The original data comes from the Catalan sentences of the Common Voice corpus.
Initial Data Collection and Normalization
We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.
Who are the source language producers?
The original data comes from the Catalan sentences of the Common Voice corpus.
Annotations
Annotation process
Starting with version 13 of the Common Voice corpus we identified the speakers (273) who have recorded more than 1200 seconds of speech.
A team of three annotators was tasked with annotating:
if all the recordings correspond to the same person
the gender of the speaker
the accent of the speaker
the quality of the recording
They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Who are the annotators?
The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.
The annotation team was composed of:
Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.
Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.
1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.
To do the annotation, they used a Google Drive spreadsheet.
Personal and Sensitive Information
The Common Voice dataset consists of people who have donated their voice online. We don't share here their voices, but their gender and accent. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
Considerations for Using the Data
Social Impact of Dataset
The IDs come from the Common Voice dataset, which consists of people who have donated their voices online.
You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
The information from this corpus will allow us to train and evaluate well balanced Catalan ASR models. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Discussion of Biases
Most of the voices in the Catalan Common Voice correspond to men with a central accent between 40 and 60 years old. The aim of this dataset is to provide information that makes it possible to minimize the biases this could cause.
For the gender annotation, we have only considered "H" (male) and "D" (female).
Other Known Limitations
[N/A]
Additional Information
Dataset Curators
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
Licensing Information
This dataset is licensed under a CC BY 4.0 license.
It can be used for any purpose, whether academic or commercial, under the terms of the license. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Citation Information
DOI
Contributions
The annotation was entrusted to the STeL team from the University of Barcelona.
By downloading the data, you agree with the terms & conditions mentioned below:
Data Access: The data in the research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes.
Summaries, analyses and interpretations of the linguistic properties of the information may be derived and published, provided it is impossible to reconstruct the information from these summaries. You may not try identifying the individuals whose texts are included in this dataset. You may not try to identify the original entry on the fact-checking site. You are not permitted to publish any portion of the dataset besides summary statistics or share it with anyone else.
We grant you the right to access the collection's content as described in this agreement. You may not otherwise make unauthorised commercial use of, reproduce, prepare derivative works, distribute copies, perform, or publicly display the collection or parts of it. You are responsible for keeping and storing the data in a way that others cannot access. The data is provided free of charge.
Citation
Please cite our work as
@InProceedings{clef-checkthat:2022:task3,
  author    = {K{\"o}hler, Juliane and Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Wiegand, Michael and Siegel, Melanie and Mandl, Thomas},
  title     = "Overview of the {CLEF}-2022 {CheckThat}! Lab Task 3 on Fake News Detection",
  year      = {2022},
  booktitle = "Working Notes of CLEF 2022---Conference and Labs of the Evaluation Forum",
  series    = {CLEF~'2022},
  address   = {Bologna, Italy},
}
@article{shahi2021overview,
  title   = {Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author  = {Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal = {Working Notes of CLEF},
  year    = {2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English and German.
Task 3: Multi-class fake news detection of news articles (English). Sub-task A detects fake news, designed as a four-class classification problem. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. The training data will be released in batches and consists of roughly 1264 articles with their respective labels in the English language. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Cross-Lingual Task (German)
Along with the multi-class task for the English language, we have introduced a task for a low-resourced language. We will provide the test data in the German language. The idea of the task is to use the English data and the concept of transfer learning to build a classification model for the German language.
Input Data
The data will be provided in the format of ID, title, text, rating, and domain; the description of the columns is as follows:
ID- Unique identifier of the news article
Title- Title of the news article
text- Text mentioned inside the news article
our rating - class of the news article as false, partially false, true, other
Output data format
public_id- Unique identifier of the news article
predicted_rating- predicted class
Sample File
public_id, predicted_rating
1, false
2, true
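A submission file in this format can be produced with a few lines of Python; the predictions below are placeholders, not real model outputs.

import csv

predictions = {1: "false", 2: "true"}  # placeholder model outputs

with open("predictions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in sorted(predictions.items()):
        writer.writerow([public_id, rating])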
IMPORTANT!
We have used data from 2010 to 2022, and the fake news content covers several topics, such as elections and COVID-19.
Baseline: For this task, we have created a baseline system, which can be found at https://zenodo.org/record/6362498
Related Work
Shahi, G. K. (2020). AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. https://arxiv.org/pdf/2010.00502.pdf
Shahi, G. K., & Nandini, D. (2020). FakeCovid – a multilingual cross-domain fact check news dataset for COVID-19. In Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of COVID-19 misinformation on Twitter. Online Social Networks and Media, 22, 100104. doi:10.1016/j.osnem.2020.100104
Shahi, G. K., Struß, J. M., & Mandl, T. (2021). Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection. Working Notes of CLEF.
Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Mandl, T. (2021). The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In European Conference on Information Retrieval (pp. 639-649). Springer, Cham.
Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Míguez, R., Shaar, S., ... & Kartal, Y. S. (2021). Overview of the CLEF-2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 264-291). Springer, Cham.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
Sign languages are natural, gestural languages that use the visual channel to communicate. Deaf people developed them to overcome their inability to communicate orally. Sign language interpreters bridge the gap that deaf people face in society and give them an equal opportunity to thrive in all environments. However, deaf people often struggle to communicate on a daily basis, especially in public service spaces such as hospitals, post offices, and municipal buildings, and it is difficult to provide full-time interpreters in every public service and administration. A tool for automatic sign language recognition is therefore essential to the autonomy of deaf people.
Although surface electromyography (sEMG) is a promising technology for detecting hand gestures, research on its use in automatic sign language (SL) recognition remains limited. To date, most works have focused on recognising hand gestures from images, videos, or instrumented gloves. The work of Ben Haj Amor et al. on EMG signals has shown that these multichannel signals contain rich, detailed information that can be exploited, in particular for handshape recognition and prosthesis control. These successes represent a significant step towards the recognition of sign language gestures.
We built a large database of EMG data recorded while signing the 28 characters of the Arabic sign language alphabet. It provides a valuable resource for research into how the muscles involved in signing produce the handshapes needed to form the letters of the alphabet.
Instructions: The data for this project is provided as zipped NumPy arrays (.npz files) with custom headers. To load these files, you will need to have the NumPy package installed.
NumPy's load function (np.load) allows for straightforward loading of the datasets. The data is organized as follows:
The data for each label (handshape) is stored in a separate folder. Each folder contains .npz files; each .npz file holds the data for one record (an 8x400 matrix). A minimal loading sketch is given below.
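For illustration, a minimal loading sketch, assuming one folder per handshape label under a hypothetical root directory and a single array per .npz archive:

import glob
import os
import numpy as np

DATA_ROOT = "emg_dataset"  # hypothetical root directory

records, labels = [], []
for label_dir in sorted(glob.glob(os.path.join(DATA_ROOT, "*"))):
    label = os.path.basename(label_dir)  # folder name serves as the handshape label
    for npz_path in sorted(glob.glob(os.path.join(label_dir, "*.npz"))):
        with np.load(npz_path) as archive:
            # Each record is a single 8x400 matrix; the array key name
            # inside the archive is an assumption, so take the first entry.
            records.append(archive[archive.files[0]])
        labels.append(label)

X = np.stack(records)  # shape: (num_records, 8, 400)
print(X.shape, len(labels))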
For more details, please refer to the paper.