https://creativecommons.org/publicdomain/zero/1.0/
By SetFit (From Huggingface) [source]
The SetFit/mnli dataset is a comprehensive collection of textual entailment data designed to facilitate the development and evaluation of models for natural language understanding tasks. This dataset includes three distinct files: validation.csv, train.csv, and test.csv, each containing valuable information for training and evaluating textual entailment models.
In these files, users will find various columns providing important details about the text pairs. The text1 and text2 columns indicate the first and second texts in each pair respectively, allowing researchers to analyze the relationships between these texts. Additionally, the label column provides a categorical value indicating the specific relationship between text1 and text2.
To further aid in understanding the relationships expressed by these labels, there is an accompanying label_text column that offers a human-readable representation of each categorical label. This allows practitioners to interpret and analyze the labeled data more easily.
Moreover, all three files in this dataset contain an additional index column called idx, which assists in organizing and referencing specific samples within the dataset during analysis or model development.
It's worth noting that this SetFit/mnli dataset has been carefully prepared for textual entailment tasks specifically. To ensure accurate evaluation of model performance on such tasks, researchers can leverage validation.csv as a dedicated set of samples specifically reserved for validating their models' performance during training. The train.csv file contains ample training data with corresponding labels that can be utilized to effectively train reliable textual entailment models. Lastly, test.csv includes test samples designed for evaluating model performance on textual entailment tasks.
By utilizing this extensive collection of high-quality data provided by the SetFit/mnli dataset, researchers can develop powerful models capable of accurately understanding natural language relationships expressed within text pairs across various domains.
- text1: This column contains the first text in a pair.
- text2: This column contains the second text in a pair.
- label: The label column indicates the relationship between text1 and text2 using categorical values.
- label_text: The label_text column provides the text representation of the labels.
To effectively use this dataset for your textual entailment task, follow these steps:
1. Understanding the Columns
Start by familiarizing yourself with the different columns present in each file of this dataset:
- text1: The first text in a pair that needs to be evaluated for textual entailment.
- text2: The second text in a pair that needs to be compared with text1 to determine its logical relationship.
- label: This categorical field represents predefined relationships or categories between texts based on their meaning or logical inference.
- label_text: A human-readable representation of each label category that helps understand their real-world implications.
2. Data Exploration
Before building models or applying any algorithms, it's essential to explore and understand your data thoroughly:
- Analyze sample data points from each file (validation.csv, train.csv).
- Identify any class imbalances within different labels present in your data distribution.
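As a starting point, the label distribution can be inspected with pandas; this is a minimal sketch and assumes the CSV files are in the working directory:
import pandas as pd

train = pd.read_csv("train.csv")
validation = pd.read_csv("validation.csv")

print(train.columns.tolist())              # expected: text1, text2, label, label_text, idx
print(train["label_text"].value_counts())  # check for class imbalance
print(train[["text1", "text2", "label_text"]].head())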
3. Preprocessing Steps
- Handle missing values: Check if there are any missing values (NaNs) within any columns and decide how to handle them.
- Text cleaning: Depending on the nature of your task, implement appropriate text cleaning techniques like removing stop words, lowercasing, punctuation removal, etc.
- Tokenization: Break down the text into individual tokens or words to facilitate further processing steps.
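For illustration, a dependency-free version of the cleaning and tokenization steps might look as follows; the exact choices (lowercasing, punctuation removal) are assumptions that depend on the model you plan to use:
import re

def clean(text: str) -> str:
    text = text.lower()                       # lowercasing
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def tokenize(text: str) -> list[str]:
    return clean(text).split()

print(tokenize("The cat sat on the mat, didn't it?"))
# ['the', 'cat', 'sat', 'on', 'the', 'mat', 'didn', 't', 'it']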
4. Model Training and Evaluation
Once your dataset is ready for modeling:
- Split your data into training and testing sets using the train.csv and test.csv files. This division allows you to train models on a subset of data while evaluating their performance on an unseen portion.
- Utilize machine learning or deep learning algorithms suitable for textual entailment tasks (e.g., BERT).
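The sketch below trains a simple TF-IDF plus logistic-regression baseline instead of a BERT-style model, purely to illustrate the train/test workflow; it assumes train.csv and test.csv are available locally and that the test split carries usable labels (if it does not, hold out part of train.csv for evaluation instead).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

def pair_to_text(df):
    # Concatenate premise and hypothesis into a single string per pair.
    return df["text1"].fillna("") + " ||| " + df["text2"].fillna("")

vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(pair_to_text(train))
X_test = vectorizer.transform(pair_to_text(test))

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["label"])
print(classification_report(test["label"], clf.predict(X_test)))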
- Natural Language Understanding: The dataset can be used for training and evaluating models that perform natural language understanding tasks, such as text classification, ...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Within the central repository, there are subfolders of different categories. Each of these subfolders contains both images and their corresponding transcriptions, saved as .txt files. As an example, the folder 'summary-based-0001-0055' encompasses 55 handwritten image documents pertaining to the summary task, with the images ranging from 0001 to 0055 within this category. In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.
Moreover, there exists a document detailing the transcription rules utilized for transcribing the dataset. Following these guidelines will enable the seamless addition of more images.
We have incorporated contributions from more than 500 students to construct the dataset. Handwritten examination papers are a primary source used by academic institutes to assess student learning. In our experience as academics, we have found that student examination papers tend to be messy, with all kinds of insertions and corrections, and would thus be a great source of documents for investigating HTR in the wild. Unfortunately, student examination papers are not available due to ethical considerations, so we created an exam-like situation to collect handwritten samples from students. The corpus of the collected data is academic-based. Usually, in academia, handwritten papers have ruled lines, so we drew lines in light colours on white paper; the drawn lines are 1.5 pt thick with 40 pt spacing between two lines. The filled handwritten documents were scanned at a resolution of 300 dpi with a grey-level depth of 8 bits.
In the second exercise, we asked participants to write an essay from a given list of topics, or on any topic of their choice; we call this the essay-based dataset. It was collected from 250 high school students, who were given 30 minutes to think about the topic and write.
In the third exercise, we selected participants from different subjects and asked them to write on a topic from their current studies; we call this the subject-based dataset. For this study, we used undergraduate students from different subjects, including 33 students from Mathematics, 71 from Biological Sciences, 24 from Environmental Sciences, 17 from Physics, and more than 84 from English studies.
Finally, for the class-notes dataset, we collected class notes from 31 students on the same topic. We asked the students to take notes of every possible sentence the speaker delivered during the lecture. After the roughly 10-minute lesson finished, we asked the students to recheck their notes and compare them with those of their classmates. We did not impose any time restriction on rechecking. We observed more cross-outs and corrections in the class notes than in the summary-based and academic-based collections.
In all four exercises, we did not impose any rules on the writers regarding, for example, spacing or the type of pen used. We asked them to cross out text that seemed inappropriate to them. Although writers usually made corrections during a second read, we also gave an extra 5 minutes for corrections.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the line version of the Student Messy Handwritten Dataset (SMHD) (Nisa, Hiqmat; Thom, James; Ciesielski, Vic; Tennakoon, Ruwan (2023). Student Messy Handwritten Dataset (SMHD). RMIT University. Dataset. https://doi.org/10.25439/rmt.24312715.v1).
Within the central repository, there are subfolders for each document converted into lines. All images are in .png format. The main folder contains three .txt files (a parsing sketch is given below):
1) SMHD.txt contains all the line-level transcriptions, one per line, in the form: image name, threshold value, label. For example:
0001-000,178 Bombay Phenotype :-
2) SMHD-Cross-outsandInsertions.txt contains all line images from the dataset that have crossed-out or inserted text.
3) Class_Notes_SMHD.txt contains more complex cases with cross-outs, insertions and overwriting; it can be used as a test set. The images listed in this file are not included in SMHD.txt.
In the transcription files, any crossed-out content is denoted by the '#' symbol, facilitating the easy identification of files with or without such modifications.
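Given the format shown for SMHD.txt, a line such as "0001-000,178 Bombay Phenotype :-" splits into image name, threshold value and label; the sketch below infers the split rule from that single example, so it may need adjusting.
def parse_line(line):
    name, rest = line.rstrip("\n").split(",", 1)   # image name before the first comma
    threshold, _, label = rest.partition(" ")      # threshold, then the transcription label
    return name, int(threshold), label

with open("SMHD.txt", encoding="utf-8") as f:
    records = [parse_line(line) for line in f if line.strip()]

print(records[0])   # e.g. ('0001-000', 178, 'Bombay Phenotype :-')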
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in
Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of the ACM Web Science Conference (WebSci'23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.
Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)
The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. The AllSides balanced news feature presents three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin, bias, slant and other forms of non-neutral reporting on political news. All articles are tagged with a bias label (left, right, or neutral) by four expert annotators based on the expressed political partisanship. The AllSides balanced news feature aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. The collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.
Dataset 2: Search Query Suggestions (suggestions.csv)
The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions were retrieved from Google and 353,484 from Bing.
The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their positions as returned by the search engines at the time of the search ("datetime"). We scraped our data from a US server, which is recorded in "location".
We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
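To make the extension scheme concrete, the sketch below reconstructs the query inputs for a single root term (the root itself plus the root extended by each letter a to z); it only reproduces the construction of the inputs, not the scraping of the autocomplete systems.
import string

def query_inputs(root):
    return [root] + [f"{root} {letter}" for letter in string.ascii_lowercase]

inputs = query_inputs("democrats")
print(len(inputs))   # 27 inputs; with up to 10 suggestions each -> up to 270 per engine
print(inputs[:3])    # ['democrats', 'democrats a', 'democrats b']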
AllSides Scraper
At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all articles available in the AllSides balanced news headlines.
We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available, so we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of 120 Balinese story texts (also known as Satua Bali) which have been annotated for narrative text analysis purposes, including character identification, alias clustering, and character classification into protagonist or antagonist. The labeling involved two Balinese native speakers who are fluent in understanding Balinese story texts; one of them is an expert in the fields of sociolinguistics and macrolinguistics. Reliability and level of agreement in the dataset are measured by Cohen's kappa coefficient, Jaccard similarity coefficient, and F1-score, all of which show almost perfect agreement (>0.81). There are four main folders, each used for a different narrative text analysis purpose:
1. First Dataset (charsNamedEntity): 89,917 annotated tokens with five character named entity labels (ANM, ADJ, PNAME, GODS, OBJ) for character named entity recognition
2. Second Dataset (charsExtraction): 6,634 annotated sentences for character identification at the sentence level
3. Third Dataset (charsAliasClustering): 930 lists of character groups from 120 story texts for alias clustering
4. Fourth Dataset (charsClassification): 848 lists of character groups classified into two groups (Protagonist and Antagonist)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description
Dataset Summary
The Historical Danish handwriting dataset is a Danish-language dataset containing more than 11,000 pages of transcribed and proofread handwritten text. The dataset currently consists of the published minutes from a number of City and Parish Council meetings, all dated between 1841 and 1939.
Languages
All the text is in Danish. The BCP-47 code for Danish is da.
Dataset Structure
Data Instances
Each data… See the full description on the dataset page: https://huggingface.co/datasets/aarhus-city-archives/historical-danish-handwriting.
Attribution-NonCommercial-ShareAlike 2.5 (CC BY-NC-SA 2.5) https://creativecommons.org/licenses/by-nc-sa/2.5/
License information was derived automatically
The present is a manually labeled dataset for the task of Event Detection (ED). The task of ED consists of identifying event triggers, i.e., the words that most clearly indicate the occurrence of an event. The dataset consists of 2,200 news extracts from The New York Times (NYT) Annotated Corpus, separated into training (2,000) and testing (200) sets. Each news extract contains the plain text with the labels (event mentions), along with two pieces of metadata (publication date and an identifier).
Labels description: We consider as an event any ongoing real-world event or situation reported in the news articles. It is important to distinguish those events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are simply brought back, future events, hypothetical events, or events that will not take place. In our dataset we only labeled the first type as an event. Based on this criterion, some words that are typically considered events are labeled as non-event triggers if they do not refer to ongoing events at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit", because it is the only ongoing event referred to in the news. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where the context of use of these words is different, but not in the given example.
Further information: For a more detailed description of the dataset and the data collection process, please visit: https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data
Data format: The dataset is split into two folders: training and testing. The first folder contains 2,000 XML files and the second contains 200 XML files. The first three tags in each XML file (pubdate, file-id and sent-idx) contain metadata information: pubdate holds the publication date of the news article that contained the text extract (in YYYYMMDDTHHMMSS format), file-id uniquely identifies a news article (which can hold several text extracts), and sent-idx is the index that identifies the text extract inside the full article. The last tag (sentence) delimits the beginning and end of the text extract; inside that text are further tags, each of which surrounds one word that was manually labeled as an event trigger.
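Under that description, a single annotated extract can be read roughly as in the sketch below. This is only a sketch: the metadata tag names (pubdate, file-id, sent-idx, sentence) come from the description above, but the name of the per-word event-trigger tag is not stated, so <event> is assumed here purely for illustration.
import xml.etree.ElementTree as ET

def read_extract(path):
    root = ET.parse(path).getroot()
    record = {
        "pubdate": root.findtext("pubdate"),    # publication date, YYYYMMDDTHHMMSS
        "file_id": root.findtext("file-id"),    # identifies the source article
        "sent_idx": root.findtext("sent-idx"),  # index of the extract within the article
    }
    sentence = root.find("sentence")
    record["text"] = "".join(sentence.itertext())                   # plain text, tags stripped
    record["triggers"] = [e.text for e in sentence.iter("event")]   # assumed tag name
    return record

print(read_extract("training/example.xml"))   # path is a placeholder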
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The two datasets provided here were used to provide inter-rater reliability statistics for the application of a metaphor identification procedure to texts written in English. Three experienced metaphor researchers applied the Metaphor Identification Procedure Vrije Universiteit (MIPVU) to approximately 1500 words of text from two English-language newspaper articles. The dataset Eng1 contains each researcher’s independent analysis of the lexical demarcation and metaphorical status of each word in the sample. The dataset Eng2 contains a second analysis of the same texts by the same three researchers, carried out after a comparison of our responses in Eng1 and a troubleshooting session where we discussed our differences. The accompanying R code was used to produce the three-way and pairwise inter-rater reliability data reported in Section 3.2 of the chapter: How do I determine what comprises a lexical unit? The headings in both datasets are identical, although the order of the columns differs in the two files. In both datasets, each line corresponds to one orthographic word from the newspaper texts.
Chapter Abstract: The first part of this chapter discusses various ‘nitty-gritty’ practical aspects about the original MIPVU intended for the English language. Our focus in these first three sections is on common pitfalls for novice MIPVU users that we have encountered when teaching the procedure. First, we discuss how to determine what comprises a lexical unit (section 3.2). We then move on to how to determine a more basic meaning of a lexical unit (section 3.3), and subsequently discuss how to compare and contrast contextual and basic senses (section 3.4). We illustrate our points with actual examples taken from some of our teaching sessions, as well as with our own study into inter-rater reliability, conducted for the purposes of this new volume about MIPVU in multiple languages. Section 3.5 shifts to another topic that new MIPVU users ask about – namely, which practical tools they can use to annotate their data in an efficient way. Here we discuss some tools that we find useful, illustrating how we utilized them in our inter-rater reliability study. We close this part with section 3.6, a brief discussion about reliability testing.
The second part of this chapter adopts more of a bird’s-eye view. Here we leave behind the more technical questions of how to operationalize MIPVU and its steps, and instead respond more directly to the question posed above: Do we really have to identify every metaphor in every bit of our data? We discuss possible approaches for research projects involving metaphor identification, by exploring a number of important questions that all researchers need to ask themselves (preferably before they embark on a major piece of research). Section 3.7 weighs some of the differences between quantitative and qualitative approaches in metaphor research projects, while section 3.8 talks about considerations when it comes to choosing which texts to investigate, as well as possible research areas where metaphor identification can play a useful role. We close this chapter in section 3.9 with a recap of our ‘take-away’ points – that is, a summary of the highlights from our entire discussion.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and Python scripts for the extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.
1. Datasets
The data from English Google Ngrams and the BNC is available in two formats: as a plain-text CSV file and as a SQLite3 database.
1.1 CSV format
The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
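As a quick illustration, the sketch below reads one of the ".csv" files using the column order above; the standard csv module is enough since the files are plain tab-separated text (the file name follows the naming used in section 2.4 below).
import csv

COLUMNS = ["isogramy", "length", "word", "source_pos", "count",
           "vol_count", "count_per_million", "vol_count_as_percent",
           "is_palindrome", "is_tautonym"]

with open("ngrams-isograms.csv", newline="", encoding="utf-8") as f:
    rows = [dict(zip(COLUMNS, row)) for row in csv.reader(f, delimiter="\t")]

palindromes = [r["word"] for r in rows if r["is_palindrome"] == "1"]
print(len(rows), "entries,", len(palindromes), "palindromes")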
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.
1.2 SQLite database format
On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.
2. Scripts
There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).
2.1 Source data
The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.
2.2 Data preparation
Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:
python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE
Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.
2.3 Isogram extraction
After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:
python isograms.py --batch --infile=INFILE --outfile=OUTFILE
Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.
2.4 Creating a SQLite3 database
The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.
2.5 Statistical processing
The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
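For Python users, the finished database can also be queried directly with the standard sqlite3 module; the sketch below is only illustrative, since the table names inside isograms.db are not listed here and are discovered at run time.
import sqlite3

conn = sqlite3.connect("isograms.db")
cur = conn.cursor()

# Discover the table names before querying (they are not documented above).
tables = [name for (name,) in cur.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)

# Example: ten most frequent second-order isograms that are also palindromes.
table = tables[0]   # pick the table you want from the list printed above
for word, length, cpm in cur.execute(
        f"SELECT word, length, count_per_million FROM {table} "
        "WHERE isogramy = 2 AND is_palindrome = 1 "
        "ORDER BY count_per_million DESC LIMIT 10"):
    print(word, length, cpm)
conn.close()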
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NLUCat is a dataset of NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is accompanied, in addition, by the instructions received by the annotator who wrote it.
The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).
The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.
The examples are not only written in Catalan; they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).
This dataset can be used to train models for intent classification, spans identification and examples generation.
This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.
In this repository you'll find the following items:
This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Intent classification, spans identification and examples generation.
The dataset is in Catalan (ca-ES).
Three JSON files, one for each split.
Example
An example looks as follows:
{
  "example": "Demana una ambulància; la meva dona està de part.",
  "annotation": {
    "intent": "call_emergency",
    "slots": [
      {
        "Tag": "service",
        "Text": "ambulància",
        "Start_char": 11,
        "End_char": 21
      },
      {
        "Tag": "situation",
        "Text": "la meva dona està de part",
        "Start_char": 23,
        "End_char": 48
      }
    ]
  }
}
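A small loading sketch follows; the split file name ("train.json") and the assumption that each file holds a list of records shaped like the example above are illustrative only.
import json
from collections import Counter

with open("train.json", encoding="utf-8") as f:   # file name assumed
    examples = json.load(f)                        # assumed: a list of records as above

# Distribution of intents across the split.
print(Counter(ex["annotation"]["intent"] for ex in examples).most_common(10))

# Slots carry character offsets into the "example" string (End_char exclusive).
first = examples[0]
for slot in first["annotation"]["slots"]:
    print(slot["Tag"], "->", first["example"][slot["Start_char"]:slot["End_char"]])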
We created this dataset to contribute to the development of language models in Catalan, a low-resource language.
When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.
Initial Data Collection and Normalization
We commissioned a company to create fictitious examples for the creation of this dataset.
Who are the source language producers?
We commissioned the writing of the examples to the company m47 labs.
Annotation process
The elaboration of this dataset has been done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
* First step: translation or elaboration of the instructions given to the annotators to write the examples.
* Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
* Third step: annotating the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.
Who are the annotators?
The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
No personal or sensitive information is included.
The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.
We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.
When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.
[N/A]
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0.
Give appropriate credit, provide a link to the license, and indicate if changes were made.
The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We provide instructions, codes and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for any researchers or practitioners to apply A Topic-based Segmentation Model with Unstructured Texts (latent class regression with group variable selection) to their datasets.
First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note that, due to the dataset terms of use by Yelp and the restriction on data size, we provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file.
[A guide on how to use the code to reproduce each study in the paper]
1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships, you will get dendrograms of the selected groups of variables shown in Figure 2. Computing time is approximately 20 to 30 minutes.
3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.
3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 3 to 4 hours.
4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.
4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 10 to 12 hours.
[Guidelines for running Benchmark models in Table 6]
Unsupervised topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors, and conduct prediction with regression.
Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).
Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.
Aggregate regression: 'lm' default function in R.
Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of the dependent variable per segment.
Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of the dependent variable per segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home
5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings reviews study. Computing time is approximately 10 hours.
[A list of the versions of R, packages, and computer...
Attribution-NonCommercial 2.0 (CC BY-NC 2.0) https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
OVERVIEW: The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts. The podcast audio recordings, sourced from SoundCloud (www.soundcloud.com), are CC-licensed, gender-balanced, and total 145 hours of audio from over 350 speakers. The annotations are provided under a non-commercial license and consist of 85,803 manually annotated audio events including approximately 35,000 filler words (“uh” and “um”) and 50,000 non-filler events such as breaths, music, laughter, repeated words, and noise. The annotated events are also provided as pre-processed 1-second audio clips. The dataset also includes automatically generated speech transcripts from a speech-to-text system. A detailed description is provided below.
The PodcastFillers dataset homepage: PodcastFillers.github.io
The preprocessing utility functions and code repository for reproducing our experimental results: PodcastFillersUtils
LICENSE:
The PodcastFillers dataset has separate licenses for the audio data and for the metadata. The metadata includes all annotations, speech-to-text transcriptions, and model outputs including VAD activations and FillerNet classification predictions.
Note: PodcastFillers is provided for research purposes only. The metadata license prohibits commercial use, which in turn prohibits deploying technology developed using the PodcastFillers metadata (such as the CSV annotations or audio clips extracted based on these annotations) in commercial applications.
This license agreement (the “License”) between Adobe Inc., having a place of business at 345 Park Avenue, San Jose, California 95110-2704 (“Adobe”), and you, the individual or entity exercising rights under this License (“you” or “your”), sets forth the terms for your use of certain research materials that are owned by Adobe (the “Licensed Materials”). By exercising rights under this License, you accept and agree to be bound by its terms. If you are exercising rights under this License on behalf of an entity, then “you” means you and such entity, and you (personally) represent and warrant that you (personally) have all necessary authority to bind that entity to the terms of this License.
All of the podcast episode audio files come from SoundCloud. Please see podcast_episode_license.csv (included in the dataset) for detailed license information for each episode. The licenses include CC-BY-3.0, CC-BY-SA 3.0 and CC-BY-ND-3.0.
ACKNOWLEDGEMENT: Please cite the following paper in work that makes use of this dataset:
Filler Word Detection and Classification: A Dataset and Benchmark Ge Zhu, Juan-Pablo Caceres and Justin Salamon In 23rd Annual Cong. of the Int. Speech Communication Association (INTERSPEECH), Incheon, Korea, Sep. 2022.
Bibtex
@inproceedings{Zhu:FillerWords:INTERSPEECH:22,
  title = {Filler Word Detection and Classification: A Dataset and Benchmark},
  booktitle = {23rd Annual Cong.~of the Int.~Speech Communication Association (INTERSPEECH)},
  address = {Incheon, Korea},
  month = {Sep.},
  url = {https://arxiv.org/abs/2203.15135},
  author = {Zhu, Ge and Caceres, Juan-Pablo and Salamon, Justin},
  year = {2022},
}
ANNOTATIONS: The annotations include 85,803 manually annotated audio events covering common English filler-word and non-filler-word events. We also provide automatically generated speech transcripts from a speech-to-text system, which do not contain the manually annotated events.
Full label vocabulary: Each of the 85,803 manually annotated events is labeled as one of 5 filler classes or 8 non-filler classes (label: number of events).
Fillers - Uh: 17,907 - Um: 17,078 - You know: 668 - Other: 315 - Like: 157
Non-fillers - Words: 12,709 - Repetitions: 9,024 - Breath: 8,288 - Laughter: 6,623 - Music : 5,060 - Agree (agreement sounds, e.g., “mm-hmm”, “ah-ha”): 3,755 - Noise : 2,735 - Overlap (overlapping speakers): 1,484
Total: 85,803
Consolidated label vocabulary: 76,689 of the audio events are also labeled with a smaller, consolidated vocabulary of 6 classes. The consolidated vocabulary was obtained by removing classes with fewer than 5,000 annotations (like, you know, other, agreement sounds, overlapping speakers, noise) and grouping "repetitions" and "words" into "words". The resulting classes (label: number of events) are:
- Uh: 17,907 - Um: 17,078 - Words (words + repetitions): 21,733 - Breath: 8,288 - Laughter: 6,623 - Music: 5,060
Total: 76,689
The consolidated vocabulary was used to train FillerNet.
For a detailed description of how the dataset was created, please see our paper. Data Split for Machine Learning: To facilitate machine learning experiments, the audio data in this dataset (full-length recordings and preprocessed 1-sec clips) are pre-arranged into “train”, “validation”, and “test” folders. This split ensures that episodes from the same podcast show are always in the same subset (train, validation, or test), to prevent speaker leakage. We also ensured that each subset in this split remains gender balanced, same as the complete dataset.
We strongly recommend using this split in your experiments. It will ensure your results are not inflated due to overfitting, and that they are comparable to the results published in the FillerNet paper.
AUDIO FILES:
Full-length podcast episodes (MP3) 199 audio files of the full-length podcast episode recordings in mp3 format, stereo channels, 44.1 kHz sample rate and 32 bit depth. Filename format: [show name]_[episode name].mp3.
Pre-processed full-length podcast episodes (WAV) 199 audio files of the full-length podcast episode recordings in wav format, mono channel, 16 kHz sample rate and 32 bit depth. The files are split into train, validation and test partitions (folders), see Data Split for Machine Learning above. Filename format: [show name]_[episode name].wav
Pre-processed WAV clips Pre-processed 1-second audio clips of the annotated events, where each clip is centered on the center of the event. For annotated events longer than 1 second, we truncate them around the center to 1 second. The clips are in the same format as the pre-processed full-length podcast episodes: wav format, mono channel, 16 kHz sample rate and 32 bit depth.
The clips that have consolidated vocabulary labels (76,689) are split into “train”, “validation” and “test” partitions (folders), see Data Split for Machine Learning above. The remainder of the clips (9,114) are placed in an “extra” folder.
Filename format: [pfID].wav where:
[pfID] = the PodcastFillers ID of the audio clip (see metadata below)
METADATA:
Each word in the transcript is annotated as a dictionary: {"confidence": (float), "duration": (int), "offset": (int), "text": (string)}, where "confidence" indicates the STT confidence in the prediction, "duration" (unit: microseconds, i.e. 1e-6 seconds) is the duration of the transcribed word, and "offset" (unit: microseconds) is the start time of the transcribed word in the full-length recording.
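As an illustration of working with these units, the sketch below converts the microsecond offsets and durations of transcribed words into seconds; the transcript file name and its top-level structure (a list of word dictionaries) are assumptions, since only the per-word dictionary format is documented above.
import json

with open("transcript.json", encoding="utf-8") as f:   # file name assumed
    words = json.load(f)                                # assumed: a list of word dictionaries

for w in words[:5]:
    start_s = w["offset"] / 1e6                         # microseconds -> seconds
    end_s = (w["offset"] + w["duration"]) / 1e6
    print(f'{w["text"]:15s} {start_s:8.2f}s - {end_s:8.2f}s (conf={w["confidence"]:.2f})')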
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for Sketch Scene Descriptions
Dataset used to train the Sketch Scene text-to-image model. We advance sketch research to scenes with the first dataset of freehand scene sketches, FS-COCO. With practical applications in mind, we collect sketches that convey scene content well but can be sketched within a few minutes by a person with any sketching skill. Our dataset comprises around 10,000 freehand scene vector sketches with per-point space-time information by 100 non-expert… See the full description on the dataset page: https://huggingface.co/datasets/mozci/tinysketch.
Advanced Diagnostics and Prognostics Testbed (ADAPT)
Project Lead: Scott Poll
Subject: Fault diagnosis in electrical power systems
Description: The Advanced Diagnostics and Prognostics Testbed (ADAPT) lab at the NASA Ames Research Center aims to provide a means to assess the effectiveness of diagnostic algorithms at detecting faults in power systems. The algorithms are evaluated using data from the Electrical Power System (EPS), which simulates the functions of a typical aerospace vehicle power system. The EPS allows for the controlled insertion of faults in repeatable failure scenarios to test whether diagnostic algorithms can detect and isolate these faults.
How Data Was Acquired: This dataset was generated from the EPS in the ADAPT lab. Each data file corresponds to one experimental run of the testbed. During an experiment, a data acquisition system commands the testbed into different configurations and records data from sensors that measure system variables such as voltages, currents, temperatures and switch positions. Faults were injected in some of the experimental runs.
Sample Rates and Parameter Descriptions: Data was sampled at a rate of 2 Hz and saved into a tab-delimited plain text file. There are a total of 128 sensors, and typical experimental runs last approximately five minutes. The text files have also been converted into a MATLAB environment file containing equivalent data that may be imported for viewing or computation.
Faults and Anomalies: Faults were injected into the EPS using physical or software means. Physical faults include disconnecting sources, sinks or circuit breakers. For software faults, user commands are passed through an Antagonist function before being received by the EPS, and sensor data is filtered through the same function before being seen by the user. The Antagonist function was able to block user commands, send spurious commands and alter sensor data.
External Links: Additional data from the ADAPT EPS testbed can be found at the DXC competition page - https://dashlink.arc.nasa.gov/topic/diagnostic-challenge-competition/
Other Notes: The HTML diagrams can be viewed in any browser, but their active content is best run on Internet Explorer.
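A minimal loading sketch for one experimental run follows, assuming a tab-delimited text file of sensor readings sampled at 2 Hz; the file names and the presence of a header row are assumptions, not documented facts.
import pandas as pd

run = pd.read_csv("run_001.txt", sep="\t")   # file name assumed for illustration
print(run.shape)                             # rows = samples (2 per second), columns = recorded variables
print(run.columns[:10])                      # first few sensor/command channels

# The MATLAB version of the same run could be read with SciPy, e.g.:
# from scipy.io import loadmat
# mat = loadmat("run_001.mat")               # file name assumed for illustration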
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diverse learning theories have been constructed to understand learners' internal states through various tangible predictors. We focus on self-regulatory actions: subconscious, habitual actions triggered by a behavior agent's 'awareness' of their attention loss. We hypothesize that self-regulatory behaviors (i.e., attention regulation behaviors) also occur in e-reading as 'regulators', as found in other behavior models (Ekman, P., & Friesen, W. V., 1969). In this work, we try to define the types and frequencies of attention regulation behaviors in e-reading. We collected various cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand learners' attention in e-reading.
The text 'How to make the most of your day at Disneyland Resort Paris' has been implemented on a screen-based e-reader, which we developed in a pdf-reader format. An informative, entertaining text was adopted to capture learners' attentional shifts during knowledge acquisition. The text has 2685 words, distributed over ten pages, with one subtopic on each page. A built-in webcam on Mac Pro and a mouse have been used for the data collection, aiming for real-world implementation only with essential computational devices. A height-adjustable laptop stand has been used to compensate for participants' eye levels.
Thirty learners in higher education have been invited for a screen-based e-reading task (M=16.2, SD=5.2 minutes). A pre-test questionnaire with ten multiple-choice questions was given before the reading to check their prior knowledge level about the topic. There was no specific time limit to finish the questionnaire.
We collected cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand the learners' attention in e-reading. Learners were asked to report their distractions on two levels during the reading: 1) In-text distraction (e.g., still reading the text with low attentiveness) or 2) out-of-text distraction (e.g., thinking of something else while not reading the text anymore). We implemented two noticeably-designed buttons on the right-hand side of the screen interface to minimize possible distraction from the reporting task.
After triggering a new page, we implemented blur stimuli on the text in the random range of 20 seconds. It ensures that the blur stimuli occur at least once on each page. Participants were asked to click the de-blur button on the text area of the screen to proceed with the reading. The button has been implemented in the whole text area, so participants can minimize the effort to find and click the button. Reaction time for de-blur has been measured, too, to grasp the arousal of learners during the reading.
We asked participants to answer pre-test and post-test questionnaires about the reading material. Participants were given ten multiple-choice questions before the session, while the same set of questions was given after the reading session (i.e., formative questions) with added subtopic summarization questions (i.e., summative questions). It can provide insights into the quantitative and qualitative knowledge gained through the session and different learning outcomes based on individual differences.
A video dataset of 931,440 frames has been annotated with the attention regulator behaviors using an annotation tool that plays the long sequence clip by clip, which contains 30 frames. Two annotators (doctoral students) have done two stages of labeling. In the first stage, the annotators were trained on the labeling criteria and annotated the attention regulator behaviors separately based on their judgments. The labels were summarized and cross-checked in the second round to address the inconsistent cases, resulting in five attention regulation behaviors and one neutral state. See WEDAR_readme.csv for detailed descriptions of features.
The dataset has been uploaded in two forms: 1) raw data, in the form in which it was collected, and 2) preprocessed data, from which we extracted useful features for further learning analytics based on real-time and post-hoc data.
Reference
Ekman, P., & Friesen, W. V. (1969). The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1), 49-98.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The raw data are stored in txt (ASCII) files generated once a minute with a reading every second. Text files contain ASCII variables separated by tabs. These files may be read by virtually any text editor or spreadsheet program. When interpreted as tabular/spreadsheet data, tabs are equivalent to column divisions, and newline characters are row divisions. For additional information please refer to http://spiddal.marine.ie/data.html
Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Recently published datasets have been increasingly comprehensive with respect to their variety of simultaneously used sensors, traffic scenarios, environmental conditions, and provided annotations. However, these datasets typically only consider data collected by one independent vehicle. Hence, there is currently a lack of comprehensive, real-world, multi-vehicle datasets fostering research on cooperative applications such as object detection, urban navigation, or multi-agent SLAM. In this paper, we aim to fill this gap by introducing the novel LUCOOP dataset, which provides time-synchronized multi-modal data collected by three interacting measurement vehicles. The driving scenario corresponds to a follow-up setup of multiple rounds in an inner city triangular trajectory. Each vehicle was equipped with a broad sensor suite including at least one LiDAR sensor, one GNSS antenna, and up to three IMUs. Additionally, Ultra-Wide-Band (UWB) sensors were mounted on each vehicle, as well as statically placed along the trajectory enabling both V2V and V2X range measurements. Furthermore, a part of the trajectory was monitored by a total station resulting in a highly accurate reference trajectory. The LUCOOP dataset also includes a precise, dense 3D map point cloud, acquired simultaneously by a mobile mapping system, as well as an LOD2 city model of the measurement area. We provide sensor measurements in a multi-vehicle setup for a trajectory of more than 4 km and a time interval of more than 26 minutes, respectively. Overall, our dataset includes more than 54,000 LiDAR frames, approximately 700,000 IMU measurements, and more than 2.5 hours of 10 Hz GNSS raw measurements along with 1 Hz data from a reference station. Furthermore, we provide more than 6,000 total station measurements over a trajectory of more than 1 km and 1,874 V2V and 267 V2X UWB measurements. Additionally, we offer 3D bounding box annotations for evaluating object detection approaches, as well as highly accurate ground truth poses for each vehicle throughout the measurement campaign.
Important: Before downloading and using the data, please check the Updates.zip archive in the "Data and Resources" section at the bottom of this web page. There you will find updated files and annotations as well as update notes.
Source LOD2 city model: excerpt from the geodata of the Landesamt für Geoinformation und Landesvermessung Niedersachsen, ©2023, www.lgln.de
[Figures provided with the dataset: LGLN logo; sensor setup of the three measurement vehicles; 3D map point cloud; measurement scenario; number of annotations per class; data structure; data format; measurement vehicles; and one untitled figure (ts_uwb_mms.png).]
This measurement campaign could not have been carried out without the help of many contributors. At this point, we thank Yuehan Jiang (Institute for Autonomous Cyber-Physical Systems, Hamburg), Franziska Altemeier, Ingo Neumann, Sören Vogel, Frederic Hake (all Geodetic Institute, Hannover), Colin Fischer (Institute of Cartography and Geoinformatics, Hannover), Thomas Maschke, Tobias Kersten, Nina Fletling (all Institut für Erdmessung, Hannover), Jörg Blankenbach (Geodetic Institute, Aachen), Florian Alpen (Hydromapper GmbH), Allison Kealy (Victorian Department of Environment, Land, Water and Planning, Melbourne), Günther Retscher, Jelena Gabela (both Department of Geodesy and Geoinformation, Wien), Wenchao Li (Solinnov Pty Ltd), Adrian Bingham (Applied Artificial Intelligence Institute,
This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.
Metadata is provided in transcripts.csv. This file consists of one record per line, delimited by the pipe character (0x7c). The fields are:
- ID: the name of the corresponding .wav file
- Transcription: words spoken by the reader (UTF-8)
- Normalized Transcription: transcription with numbers, ordinals, and monetary units expanded into full words (UTF-8)
Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz (approximately 22 kHz).
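As a quick illustration of the two formats just described, the sketch below parses the pipe-delimited transcripts.csv with Python's csv module and checks one clip's WAV header with the standard wave module. The wavs/ directory name is an assumption about the archive layout, not something stated above.

```python
# Minimal sketch, assuming the clips live in a wavs/ directory next to
# transcripts.csv (that layout is an assumption, not stated in the description).
import csv
import wave

# transcripts.csv: ID | Transcription | Normalized Transcription, pipe-delimited.
with open("transcripts.csv", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE))

clip_id, transcription, normalized = rows[0]
print(clip_id, "->", normalized[:60])

# Each clip is a single-channel, 16-bit PCM WAV sampled at 22050 Hz.
with wave.open(f"wavs/{clip_id}.wav", "rb") as w:
    assert w.getnchannels() == 1        # mono
    assert w.getsampwidth() == 2        # 16-bit samples (2 bytes)
    assert w.getframerate() == 22050    # ~22 kHz
    print(f"{clip_id}: {w.getnframes() / w.getframerate():.2f} s")
```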
The audio clips range in length from approximately 1 second to 10 seconds. They were segmented automatically based on silences in the recording. Clip boundaries generally align with sentence or clause boundaries, but not always. The text was matched to the audio manually, and a QA pass was done to ensure that the text accurately matched the words spoken in the audio. The original LibriVox recordings were distributed as 128 kbps MP3 files. As a result, they may contain artifacts introduced by the MP3 encoding. The following abbreviations appear in the text. They may be expanded as follows:
Abbreviation Expansion
Mr. Mister
Mrs. Misess (*)
Dr. Doctor
No. Number
St. Saint
Co. Company
Jr. Junior
Maj. Major
Gen. General
Drs. Doctors
Rev. Reverend
Lt. Lieutenant
Hon. Honorable
Sgt. Sergeant
Capt. Captain
Esq. Esquire
Ltd. Limited
Col. Colonel
Ft. Fort
(*) There is no standard expansion for "Mrs." Nineteen of the transcriptions contain non-ASCII characters (for example, LJ016-0257 contains "raison d'être"). Example code using this dataset to train a speech synthesis model can be found at github.com/keithito/tacotron. For more information or to report errors, please email kito@kito.us.
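If you need to apply these expansions yourself, a simple lookup plus a regex pass is usually sufficient. The sketch below mirrors the table above and is not the dataset's official normalization code.

```python
# Minimal sketch of applying the abbreviation expansions listed above.
import re

ABBREVIATIONS = {
    "Mr.": "Mister", "Mrs.": "Misess", "Dr.": "Doctor", "No.": "Number",
    "St.": "Saint", "Co.": "Company", "Jr.": "Junior", "Maj.": "Major",
    "Gen.": "General", "Drs.": "Doctors", "Rev.": "Reverend",
    "Lt.": "Lieutenant", "Hon.": "Honorable", "Sgt.": "Sergeant",
    "Capt.": "Captain", "Esq.": "Esquire", "Ltd.": "Limited",
    "Col.": "Colonel", "Ft.": "Fort",
}

# Match any abbreviation as a whole word followed by its period.
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(a[:-1]) for a in ABBREVIATIONS) + r")\.")

def expand_abbreviations(text: str) -> str:
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1) + "."], text)

print(expand_abbreviations("Dr. Smith visited Ft. Worth with Mrs. Jones."))
# -> "Doctor Smith visited Fort Worth with Misess Jones."
```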
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SmartBay Observatory in Galway Bay is an underwater observatory which uses cameras, probes and sensors to permit continuous and remote live underwater monitoring. It was installed in 2015 on the seafloor 1.5 km off the coast of Spiddal, Co. Galway, Ireland, at a depth of 20-25 m. Underwater observatories give ocean researchers unique real-time access for monitoring ongoing changes in the marine environment. The Galway Bay Observatory is an important contribution by Ireland to the growing global network of real-time data capture systems deployed in the ocean. Data relating to the marine environment at the Galway Observatory site is transferred in real time through a fibre-optic telecommunications cable to the Marine Institute headquarters and then made publicly available on the internet. The data includes a live video stream, the depth of the observatory node, the water temperature and salinity, and estimates of the chlorophyll and turbidity levels in the water, which give an indication of the volume of phytoplankton and other particles, such as sediment, in the water. Maintenance takes place on the observatory every 18 to 24 months.
This CTD (Conductivity, Temperature, Depth) and Oxygen dataset comprises the raw data collected from the Galway Observatory site using an Idronaut Ocean-Seven 304 Plus Conductivity-Temperature-Depth (CTD) sensor probe. The sensor measures the temperature and conductivity of the seawater; the conductivity is used to calculate an estimate of the salinity. The pressure exerted by the seawater above is used to calculate the depth of the sensor, and these parameters are also used to estimate the speed of sound within the sea. The Ocean-Seven 304 Plus CTD is also equipped with a polarographic IDRONAUT dissolved-oxygen sensor, which measures the dissolved oxygen concentration of the seawater. From 26 August 2021, a new SeaBird CTD 16CT plus and dissolved-oxygen sensor was deployed to replace the Idronaut. The sensor is deployed on the EMSO Smartbay Cable End Equipment Node in Galway Bay in approximately 25 m depth of water. The raw data are stored in txt (ASCII) files generated once a minute with a reading every second. Text files contain ASCII variables separated by tabs. These files may be read by virtually any text editor or spreadsheet program. When interpreted as tabular/spreadsheet data, tabs are equivalent to column divisions, and newline characters are row divisions. For additional information please refer to http://spiddal.marine.ie/data.html
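Since the raw CTD files are plain tab-separated ASCII with one file per minute and one reading per second, a generic reader is enough to get started. The sketch below is an illustration only; the directory layout and file names are assumptions, and whether each file carries a column-header row should be checked against the documentation at spiddal.marine.ie.

```python
# Minimal sketch (not Marine Institute code): load a day's worth of one-minute,
# tab-separated ASCII CTD files into a single table. Paths are assumptions, and
# pd.read_csv's default header handling assumes each file starts with column names.
import glob

import pandas as pd

files = sorted(glob.glob("ctd_raw/2021-08-26/*.txt"))           # assumed layout
frames = [pd.read_csv(path, sep="\t") for path in files]        # tabs = column divisions
day = pd.concat(frames, ignore_index=True)

print(len(files), "files,", len(day), "rows")   # roughly 60 one-second rows per file
print(day.head())
```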
The Observatory in Galway Bay is an important contribution by Ireland to the growing global network of real-time data capture systems deployed in the ocean. Installed on the seafloor 1.5 km off the coast of Spiddal, the observatory uses cameras, probes and sensors to permit continuous and remote live underwater monitoring. Data relating to the marine environment at the site is transferred in real time from the SmartBay Observatory through a fibre-optic telecommunications cable to the Marine Institute headquarters and onwards to the internet. This dataset comprises processed acoustic data collected from the Galway Observatory site using an icListen HF Smart Hydrophone, a digital hydrophone that processes and stores acoustic data. It covers processed data collected from the Galway Bay subsea cabled observatory since its installation in 2015. The wide-frequency-range hydrophone is installed on a separate lander approximately 30 m away from the EMSO Smartbay Cable End Equipment Node in Galway Bay, in approximately 25 m depth of water at 53° 13.640'N, 9° 15.979'W. The processed FFT files are stored in a txt file generated once a minute with a reading every second. The metadata is included in the header of the text file. TXT files contain ASCII variables separated by tabs. These files may be read by virtually any text editor or spreadsheet program. When interpreted as tabular/spreadsheet data, tabs are equivalent to column divisions, and newline characters are row divisions. All TXT files generated by icListen/Lucy contain several rows of header information at the start of the file, followed by rows of either FFT data or time series data. icListenHF stores only FFT data in TXT format. Practical users of this dataset include, but are not limited to, scientists, researchers and marine technologists involved in the areas of marine mammal monitoring, real-time noise measurement, environmental assessment and improving compliance with the Marine Strategy Framework Directive. The processed files are available in real time via http://smartbay.marine.ie/data/hydrophones/
Suggested Citation: Gaughan, Paul. (2019) SmartBay Observatory Hydrophone Data (Processed). Marine Institute, Ireland. doi:10/c3jm.
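For the processed hydrophone TXT files, the only extra step compared with the CTD files is skipping the metadata header at the top before reading the tab-separated FFT rows. The sketch below is an assumption-laden illustration: the file name and the number of header rows are placeholders, not documented values, so inspect a real file before relying on them.

```python
# Minimal sketch: read an icListen/Lucy-style TXT file whose first rows hold
# header metadata and whose remaining rows hold tab-separated FFT data.
# N_HEADER and PATH are placeholders; inspect a real file to set them correctly.
import pandas as pd

N_HEADER = 8                                        # placeholder header-row count
PATH = "hydrophone/example_fft.txt"                 # assumed file name

with open(PATH, encoding="ascii", errors="replace") as f:
    header = [next(f).rstrip("\n") for _ in range(N_HEADER)]   # metadata rows

fft = pd.read_csv(PATH, sep="\t", skiprows=N_HEADER, header=None)
print("\n".join(header))
print(fft.shape)   # one FFT row per second within the minute the file covers
```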