Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Abstract: Granting agencies invest millions of dollars in the generation and analysis of data, making these products extremely valuable. However, without sufficient annotation of the methods used to collect and analyze the data, the ability to reproduce and reuse those products suffers. This lack of assurance of the quality and credibility of the data at the different stages of the research process wastes much of the investment of time and funding and keeps research from reaching the level that would be possible if everything were effectively annotated and disseminated to the wider research community. To address this issue for the Hawaiʻi Established Program to Stimulate Competitive Research (EPSCoR) project, a water science gateway was developed at the University of Hawaiʻi (UH), called the ʻIke Wai Gateway. In Hawaiian, ʻIke means knowledge and Wai means water. The gateway supports research in hydrology and water management by providing tools to address questions of water sustainability in Hawaiʻi. The gateway provides a framework for data acquisition, analysis, model integration, and display of data products. It is intended to complement and integrate with the capabilities of the Consortium of Universities for the Advancement of Hydrologic Science's (CUAHSI) HydroShare by providing sound data and metadata management capabilities for multi-domain field observations, analytical lab actions, and modeling outputs. Functionality provided by the gateway is supported by a subset of CUAHSI's Observations Data Model (ODM), delivered as centralized web-based user interfaces and APIs supporting multi-domain data management, computation, analysis, and visualization tools to support reproducible science, modeling, data discovery, and decision support for the Hawaiʻi EPSCoR ʻIke Wai research team and the wider Hawaiʻi hydrology community. By leveraging the Tapis platform, UH has constructed a gateway that ties data and advanced computing resources together to support diverse research domains including microbiology, geochemistry, geophysics, economics, and the humanities, coupled with computational and modeling workflows delivered in a user-friendly web interface with workflows for effectively annotating the project data and products. Disseminating results for the ʻIke Wai project through the ʻIke Wai data gateway and HydroShare makes the research products accessible and reusable.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 35 of 39 taxonomies that were the result of a systematic review. The systematic review was conducted with the goal of identifying taxonomies suitable for semantically annotating research data. A special focus was set on research data from the hybrid societies domain.
The following taxonomies were identified as part of the systematic review:
| Filename | Taxonomy Title |
| --- | --- |
| acm_ccs | ACM Computing Classification System [1] |
| amec | A Taxonomy of Evaluation Towards Standards [2] |
| bibo | A BIBO Ontology Extension for Evaluation of Scientific Research Results [3] |
| cdt | Cross-Device Taxonomy [4] |
| cso | Computer Science Ontology [5] |
| ddbm | What Makes a Data-driven Business Model? A Consolidated Taxonomy [6] |
| ddi_am | DDI Aggregation Method [7] |
| ddi_moc | DDI Mode of Collection [8] |
| n/a | DemoVoc [9] |
| discretization | Building a New Taxonomy for Data Discretization Techniques [10] |
| dp | Demopaedia [11] |
| dsg | Data Science Glossary [12] |
| ease | A Taxonomy of Evaluation Approaches in Software Engineering [13] |
| eco | Evidence & Conclusion Ontology [14] |
| edam | EDAM: The Bioscientific Data Analysis Ontology [15] |
| n/a | European Language Social Science Thesaurus [16] |
| et | Evaluation Thesaurus [17] |
| glos_hci | The Glossary of Human Computer Interaction [18] |
| n/a | Humanities and Social Science Electronic Thesaurus [19] |
| hcio | A Core Ontology on the Human-Computer Interaction Phenomenon [20] |
| hft | Human-Factors Taxonomy [21] |
| hri | A Taxonomy to Structure and Analyze Human–Robot Interaction [22] |
| iim | A Taxonomy of Interaction for Instructional Multimedia [23] |
| interrogation | A Taxonomy of Interrogation Methods [24] |
| iot | Design Vocabulary for Human–IoT Systems Communication [25] |
| kinect | Understanding Movement and Interaction: An Ontology for Kinect-Based 3D Depth Sensors [26] |
| maco | Thesaurus Mass Communication [27] |
| n/a | Thesaurus Cognitive Psychology of Human Memory [28] |
| mixed_initiative | Mixed-Initiative Human-Robot Interaction: Definition, Taxonomy, and Survey [29] |
| qos_qoe | A Taxonomy of Quality of Service and Quality of Experience of Multimodal Human-Machine Interaction [30] |
| ro | The Research Object Ontology [31] |
| senses_sensors | A Human-Centered Taxonomy of Interaction Modalities and Devices [32] |
| sipat | A Taxonomy of Spatial Interaction Patterns and Techniques [33] |
| social_errors | A Taxonomy of Social Errors in Human-Robot Interaction [34] |
| sosa | Semantic Sensor Network Ontology [35] |
| swo | The Software Ontology [36] |
| tadirah | Taxonomy of Digital Research Activities in the Humanities [37] |
| vrs | Virtual Reality and the CAVE: Taxonomy, Interaction Challenges and Research Directions [38] |
| xdi | Cross-Device Interaction [39] |
We converted the taxonomies into SKOS (Simple Knowledge Organisation System) representation. The following 4 taxonomies were not converted as they were already available in SKOS and were for this reason excluded from this dataset:
1) DemoVoc, cf. http://thesaurus.web.ined.fr/navigateur/ available at https://thesaurus.web.ined.fr/exports/demovoc/demovoc.rdf
2) European Language Social Science Thesaurus, cf. https://thesauri.cessda.eu/elsst/en/ available at https://zenodo.org/record/5506929
3) Humanities and Social Science Electronic Thesaurus, cf. https://hasset.ukdataservice.ac.uk/hasset/en/ available at https://zenodo.org/record/7568355
4) Thesaurus Cognitive Psychology of Human Memory, cf. https://www.loterre.fr/presentation/ available at https://skosmos.loterre.fr/P66/en/
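For orientation, below is a minimal, hypothetical sketch (using the Python library rdflib) of the kind of SKOS representation the remaining 35 taxonomies were converted into. The scheme URI and concept label are invented for illustration and are not taken from the published files.

```python
# Minimal SKOS example with rdflib: one concept scheme and one concept.
# The URIs and labels below are placeholders, not dataset content.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/taxonomy/")

g = Graph()
g.bind("skos", SKOS)

scheme = EX["scheme"]
g.add((scheme, RDF.type, SKOS.ConceptScheme))

concept = EX["interaction-modality"]
g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.prefLabel, Literal("Interaction modality", lang="en")))
g.add((concept, SKOS.inScheme, scheme))

print(g.serialize(format="turtle"))
```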
References
[1] "The 2012 ACM Computing Classification System," ACM Digital Library, 2012. https://dl.acm.org/ccs (accessed May 08, 2023).
[2] AMEC, "A Taxonomy of Evaluation Towards Standards." Aug. 31, 2016. Accessed: May 08, 2023. [Online]. Available: https://amecorg.com/amecframework/home/supporting-material/taxonomy/
[3] B. Dimić Surla, M. Segedinac, and D. Ivanović, "A BIBO ontology extension for evaluation of scientific research results," in Proceedings of the Fifth Balkan Conference in Informatics, BCI '12. New York, NY, USA: Association for Computing Machinery, Sep. 2012, pp. 275–278. doi: 10.1145/2371316.2371376.
[4] F. Brudy et al., "Cross-Device Taxonomy: Survey, Opportunities and Challenges of Interactions Spanning Across Multiple Devices," in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19. New York, NY, USA: Association for Computing Machinery, May 2019, pp. 1–28. doi: 10.1145/3290605.3300792.
[5] A. A. Salatino, T. Thanapalasingam, A. Mannocci, F. Osborne, and E. Motta, "The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas," in Lecture Notes in Computer Science 1137, D. Vrandečić, K. Bontcheva, M. C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, L.-A. Kaffee, and E. Simperl, Eds., Monterey, California, USA: Springer, Oct. 2018, pp. 187–205. Accessed: May 08, 2023. [Online]. Available: http://oro.open.ac.uk/55484/
[6] M. Dehnert, A. Gleiss, and F. Reiss, "What makes a data-driven business model? A consolidated taxonomy," presented at the European Conference on Information Systems, 2021.
[7] DDI Alliance, "DDI Controlled Vocabulary for Aggregation Method," 2014. https://ddialliance.org/Specification/DDI-CV/AggregationMethod_1.0.html (accessed May 08, 2023).
[8] DDI Alliance, "DDI Controlled Vocabulary for Mode Of Collection," 2015. https://ddialliance.org/Specification/DDI-CV/ModeOfCollection_2.0.html (accessed May 08, 2023).
[9] INED - French Institute for Demographic Studies, "Thésaurus DemoVoc," Feb. 26, 2020. https://thesaurus.web.ined.fr/navigateur/en/about (accessed May 08, 2023).
[10] A. A. Bakar, Z. A. Othman, and N. L. M. Shuib, "Building a new taxonomy for data discretization techniques," in 2009 2nd Conference on Data Mining and Optimization, Oct. 2009, pp. 132–140. doi: 10.1109/DMO.2009.5341896.
[11] N. Brouard and C. Giudici, "Unified second edition of the Multilingual Demographic Dictionary (Demopaedia.org project)," presented at the 2017 International Population Conference, IUSSP, Oct. 2017. Accessed: May 08, 2023. [Online]. Available: https://iussp.confex.com/iussp/ipc2017/meetingapp.cgi/Paper/5713
[12] DuCharme, Bob, "Data Science Glossary." https://www.datascienceglossary.org/ (accessed May 08, 2023).
[13] A. Chatzigeorgiou, T. Chaikalis, G. Paschalidou, N. Vesyropoulos, C. K. Georgiadis, and E. Stiakakis, "A Taxonomy of Evaluation Approaches in Software Engineering," in Proceedings of the 7th Balkan Conference on Informatics Conference, BCI '15. New York, NY, USA: Association for Computing Machinery, Sep. 2015, pp. 1–8. doi: 10.1145/2801081.2801084.
[14] M. C. Chibucos, D. A. Siegele, J. C. Hu, and M. Giglio, "The Evidence and Conclusion Ontology (ECO): Supporting GO Annotations," in The Gene Ontology Handbook, C. Dessimoz and N. Škunca, Eds., Methods in Molecular Biology. New York, NY: Springer, 2017, pp. 245–259. doi: 10.1007/978-1-4939-3743-1_18.
[15] M. Black et al., "EDAM: the bioscientific data analysis ontology," F1000Research, vol. 11, Jan. 2021, doi: 10.7490/f1000research.1118900.1.
[16] Council of European Social Science Data Archives (CESSDA), "European Language Social Science Thesaurus ELSST," 2021. https://thesauri.cessda.eu/en/ (accessed May 08, 2023).
[17] M. Scriven, Evaluation Thesaurus, 3rd Edition. Edgepress, 1981. Accessed: May 08, 2023. [Online]. Available: https://us.sagepub.com/en-us/nam/evaluation-thesaurus/book3562
[18] Papantoniou, Bill et al., The Glossary of Human Computer Interaction. Interaction Design Foundation. Accessed: May 08, 2023. [Online]. Available: https://www.interaction-design.org/literature/book/the-glossary-of-human-computer-interaction
[19] "UK Data Service Vocabularies: HASSET Thesaurus." https://hasset.ukdataservice.ac.uk/hasset/en/ (accessed May 08, 2023).
[20] S. D. Costa, M. P. Barcellos, R. de A. Falbo, T. Conte, and K. M. de Oliveira, "A core ontology on the Human–Computer Interaction phenomenon," Data Knowl. Eng., vol. 138, p. 101977, Mar. 2022, doi: 10.1016/j.datak.2021.101977.
[21] V. J. Gawron et al., "Human Factors Taxonomy," Proc. Hum. Factors Soc. Annu. Meet., vol. 35, no. 18, pp. 1284–1287, Sep. 1991, doi: 10.1177/154193129103501807.
[22] L. Onnasch and E. Roesler, "A Taxonomy to Structure and Analyze Human–Robot Interaction," Int. J. Soc. Robot., vol. 13, no. 4, pp. 833–849, Jul. 2021, doi: 10.1007/s12369-020-00666-5.
[23] R. A. Schwier, "A Taxonomy of Interaction for Instructional Multimedia." Sep. 28, 1992. Accessed: May 09, 2023. [Online]. Available: https://eric.ed.gov/?id=ED352044
[24] C. Kelly, J. Miller, A. Redlich, and S. Kleinman, "A Taxonomy of Interrogation Methods,"
In this study, we explore to what extent language users agree about what kind of stances are expressed in natural language use or whether their interpretations diverge. In order to perform this task, a comprehensive cognitive-functional framework of ten stance categories was developed based on previous work on speaker stance in the literature. A corpus of opinionated texts, where speakers take stance and position themselves, was compiled, the Brexit Blog Corpus (BBC). An analytical interface for the annotations was set up and the data were annotated independently by two annotators. The annotation procedure, the annotation agreement and the co-occurrence of more than one stance category in the utterances are described and discussed. The careful, analytical annotation process has by and large returned satisfactory inter- and intra-annotation agreement scores, resulting in a gold standard corpus, the final version of the BBC.
Purpose:
The aim of this study is to explore the possibility of identifying speaker stance in discourse, to provide an analytical resource for it, and to evaluate the level of agreement across speakers in the area of stance-taking in discourse.
The BBC is a collection of texts from blog sources. The corpus texts are thematically related to the 2016 UK referendum concerning whether the UK should remain a member of the European Union or not. The texts were extracted from the Internet from June to August 2015. With the Gavagai API (https://developer.gavagai.se), the texts were detected using seed words, such as Brexit, EU referendum, pro-Europe, europhiles, eurosceptics, United States of Europe, David Cameron, or Downing Street. The retrieved URLs were filtered so that only entries described as blogs in English were selected. Each downloaded document was split into sentential utterances, from which 2,200 utterances were randomly selected as the analysis data set. The final size of the corpus is 1,682 utterances, 35,492 words (169,762 characters without spaces). Each utterance contains from 3 to 40 words, with a mean length of 21 words.
For the data annotation process the Active Learning and Visual Analytics (ALVA) system (https://doi.org/10.1145/3132169 and https://doi.org/10.2312/eurp.20161139) was used. Two annotators, one who is a professional translator with a Licentiate degree in English Linguistics and the other one with a PhD in Computational Linguistics, carried out the annotations independently of one another.
The data set can be downloaded in two different formats: a standard Microsoft Excel format and a raw data format (ZIP archive) which can be useful for analytical and machine learning purposes, for example, with the Python library scikit-learn. The Excel file includes one additional variable (utterance word length). The ZIP archive contains a set of directories (e.g., "contrariety" and "prediction") corresponding to the stance categories. Inside of each such directory, there are two nested directories corresponding to annotations which assign or not assign the respective category to utterances (e.g., inside the top-level category "prediction" there are two directories, "prediction" with utterances which were labeled with this category, and "no" with the rest of the utterances). Inside of the nested directories, there are textual files containing individual utterances.
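Because the ZIP archive follows a one-directory-per-class convention, scikit-learn's load_files helper can read it directly. The sketch below is a hypothetical example only: the extraction path ("bbc_raw/") and the TF-IDF plus logistic regression baseline are assumptions, not part of the published data.

```python
# Load one stance category ("prediction" vs. "no") as a binary text
# classification task from the extracted ZIP archive.
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = load_files("bbc_raw/prediction", encoding="utf-8")  # hypothetical path

# Simple bag-of-words baseline.
X = TfidfVectorizer(min_df=2).fit_transform(data.data)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, data.target, cv=5)
print(f"5-fold cross-validated accuracy: {scores.mean():.3f}")
```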
When using data from this study, the primary researcher requests that citation also be made to the following publication: Vasiliki Simaki, Carita Paradis, Maria Skeppstedt, Magnus Sahlgren, Kostiantyn Kucher, and Andreas Kerren. Annotating speaker stance in discourse: the Brexit Blog Corpus. In Corpus Linguistics and Linguistic Theory, 2017. De Gruyter, published electronically before print. https://doi.org/10.1515/cllt-2016-0060
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Musical prosody is characterized by the acoustic variations that make music expressive. However, few systematic and scalable studies exist on the function it serves or on effective tools to carry out such studies. To address this gap, we introduce a novel approach to capturing information about prosodic functions through a citizen science paradigm. In typical bottom-up approaches to studying musical prosody, acoustic properties in performed music and basic musical structures such as accents and phrases are mapped to prosodic functions, namely segmentation and prominence. In contrast, our top-down, human-centered method puts listener annotations of musical prosodic functions first, to analyze the connection between these functions, the underlying musical structures, and acoustic properties. The method is applied primarily to exploring segmentation and prominence in performed solo piano music. These prosodic functions are marked by means of four annotation types (boundaries, regions, note groups, and comments) in the CosmoNote web-based citizen science platform, which presents the music signal or MIDI data and related acoustic features in information layers that can be toggled on and off. Various annotation strategies are discussed and appraised: intuitive vs. analytical; real-time vs. retrospective; and audio-based vs. visual. The end-to-end process of the data collection is described, from providing prosodic examples, to structuring and formatting the annotation data for analysis, to techniques for preventing precision errors. The aim is to obtain reliable and coherent annotations that can be applied to theoretical and data-driven models of musical prosody. The outcomes include a growing library of prosodic examples with the goal of achieving an annotation convention for studying musical prosody in performed music.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
DESCRIPTION
For this task, we use a subset of the MIRFLICKR (http://mirflickr.liacs.nl) collection. The entire collection contains 1 million images from the social photo sharing website Flickr and was formed by downloading up to a thousand photos per day that were deemed to be the most interesting according to Flickr. All photos in this collection were released by their users under a Creative Commons license, allowing them to be freely used for research purposes. Of the entire collection, 25 thousand images were manually annotated with a limited number of concepts, and many of these annotations have been further refined and expanded over the lifetime of the ImageCLEF photo annotation task. This year we used crowdsourcing to annotate all of these 25 thousand images with the concepts.
On this page we provide you with more information about the textual features, visual features and concept features we supply with each image in the collection we use for this year's task.
TEXTUAL FEATURES
All images are accompanied by the following textual features:
- Flickr user tags
These are the tags that the users assigned to the photos they uploaded to Flickr. The 'raw' tags are the original tags, while the 'clean' tags are those collapsed to lowercase and condensed to remove spaces (see the small normalization sketch after this list).
- EXIF metadata
If available, the EXIF metadata contains information about the camera that took the photo and the parameters used. The 'raw' exif is the original camera data, while the 'clean' exif reduces the verbosity.
- User information and Creative Commons license information
This contains information about the user that took the photo and the license associated with it.
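As a small illustration of the raw-to-clean tag normalization described above, here is a hedged sketch assuming the cleaning amounts to lowercasing and removing spaces (the exact procedure used for the distributed files may differ).

```python
# Hypothetical raw tags and a naive normalization mirroring the
# description above (lowercase, remove spaces).
raw_tags = ["Golden Gate Bridge", "SanFrancisco", "Long Exposure"]

clean_tags = [tag.lower().replace(" ", "") for tag in raw_tags]
print(clean_tags)  # ['goldengatebridge', 'sanfrancisco', 'longexposure']
```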
VISUAL FEATURES
Over the previous years of the photo annotation task we noticed that the same types of visual features are often used by participants; in particular, features based on interest points and bag-of-words are popular. To assist you, we have extracted several features that you may want to use, so you can focus on concept detection instead. We additionally give you some pointers to easy-to-use toolkits that will help you extract other features, or the same features with different default settings.
- SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT
We used the ISIS Color Descriptors (http://www.colordescriptors.com) toolkit to extract these descriptors. This package provides you with many different types of features based on interest points, mostly using SIFT. It furthermore assists you with building codebooks for bag-of-words. The toolkit is available for Windows, Linux and Mac OS X.
- SURF
We used the OpenSURF (http://www.chrisevansdev.com/computer-vision-opensurf.html) toolkit to extract this descriptor. The open source code is available in C++, C#, Java and many more languages.
- TOP-SURF
We used the TOP-SURF (http://press.liacs.nl/researchdownloads/topsurf) toolkit to extract this descriptor, which represents images with SURF-based bag-of-words. The website provides codebooks of several different sizes that were created using a combination of images from the MIR-FLICKR collection and from the internet. The toolkit also offers the ability to create custom codebooks from your own image collection. The code is open source, written in C++ and available for Windows, Linux and Mac OS X.
- GIST
We used the LabelMe (http://labelme.csail.mit.edu) toolkit to extract this descriptor. The MATLAB-based library offers a comprehensive set of tools for annotating images.
For the interest point-based features above we used a Fast Hessian-based technique to detect the interest points in each image. This detector is built into the OpenSURF library. In comparison with the Hessian-Laplace technique built into the ColorDescriptors toolkit it detects fewer points, resulting in a considerably reduced memory footprint. We therefore also provide you with the interest point locations in each image that the Fast Hessian-based technique detected, so when you would like to recalculate some features you can use them as a starting point for the extraction. The ColorDescriptors toolkit for instance accepts these locations as a separate parameter. Please go to http://www.imageclef.org/2012/photo-flickr/descriptors for more information on the file format of the visual features and how you can extract them yourself if you want to change the default settings.
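If you prefer to experiment locally before recomputing the provided descriptors, a rough OpenCV-based sketch is shown below. Note that this is not the ColorDescriptors or OpenSURF toolkit used for the official features, so the resulting interest points and descriptors will not match them exactly.

```python
# Detect SIFT interest points and descriptors with OpenCV (illustrative
# substitute for the toolkits mentioned above; results will differ).
import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical image file

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(f"{len(keypoints)} interest points, descriptor array shape: {descriptors.shape}")
```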
CONCEPT FEATURES
We have solicited the help of workers on the Amazon Mechanical Turk platform to perform the concept annotation for us. To ensure a high standard of annotation we used the CrowdFlower platform, which acts as a quality control layer by removing the judgments of workers that fail to annotate properly. We reused several concepts of last year's task, and for most of these we annotated the remaining photos of the MIRFLICKR-25K collection that had not been used in the previous task; for some concepts we reannotated all 25,000 images to boost their quality. For the new concepts we naturally had to annotate all of the images.
- Concepts
For each concept we indicate in which images it is present. The 'raw' concepts contain the judgments of all annotators for each image, where a '1' means an annotator indicated the concept was present whereas a '0' means the concept was not present, while the 'clean' concepts only contain the images for which the majority of annotators indicated the concept was present. Some images in the raw data for which we reused last year's annotations only have one judgment for a concept, whereas the other images have between three and five judgments; the single judgment does not mean only one annotator looked at it, as it is the result of a majority vote amongst last year's annotators.
- Annotations
For each image we indicate which concepts are present, so this is the reverse version of the data above. The 'raw' annotations contain the average agreement of the annotators on the presence of each concept, while the 'clean' annotations only include those for which there was a majority agreement amongst the annotators.
You will notice that the annotations are not perfect. Especially when the concepts are more subjective or abstract, the annotators tend to disagree more with each other. The raw versions of the concept annotations should help you get an understanding of the exact judgments given by the annotators.
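To make the relation between 'raw' and 'clean' concept annotations concrete, here is a minimal sketch of a majority vote over per-annotator judgments. The in-memory representation is an assumption for illustration; the distributed files use their own format.

```python
# Hypothetical raw judgments (1 = concept present, 0 = not present).
raw_judgments = {
    "im1234.jpg": [1, 1, 0, 1, 0],
    "im5678.jpg": [0, 0, 1],
}

# 'Clean' label: 1 only if a strict majority of annotators said present.
clean = {
    image: int(2 * sum(votes) > len(votes))
    for image, votes in raw_judgments.items()
}
print(clean)  # {'im1234.jpg': 1, 'im5678.jpg': 0}
```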
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description
Dataset Summary
These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).
The annotators were initially tasked with evaluating all recordings associated with the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.
The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.
See annotations for more details.
Supported Tasks and Leaderboards
Gender classification, Accent classification.
Languages
The dataset is in Catalan (ca).
Dataset Structure
Instances
Two xlsx documents are published, one for each round of annotations.
The following information is available in each of the documents:
{
  "speaker ID": "1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b",
  "idx": "31",
  "same speaker": {"AN1": "SI", "AN2": "SI", "AN3": "SI", "agreed": "SI", "percentage": "100"},
  "gender": {"AN1": "H", "AN2": "H", "AN3": "H", "agreed": "H", "percentage": "100"},
  "accent": {"AN1": "Central", "AN2": "Central", "AN3": "Central", "agreed": "Central", "percentage": "100"},
  "audio quality": {"AN1": "4.0", "AN2": "3.0", "AN3": "3.0", "agreed": "3.0", "percentage": "66", "mean quality": "3.33", "stdev quality": "0.58"},
  "comments": {"AN1": "", "AN2": "pujades i baixades de volum", "AN3": "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"}
}
We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators received.
Data Fields
speaker ID (string): An ID identifying the client (voice) that made the recording in the Common Voice corpus
idx (int): Id in this corpus
AN1 (string): Annotations from Annotator 1
AN2 (string): Annotations from Annotator 2
AN3 (string): Annotations from Annotator 3
agreed (string): Annotation from the majority of the annotators
percentage (int): Percentage of annotators that agree with the agreed annotation
mean quality (float): Mean of the quality annotation
stdev quality (float): Standard deviation of the mean quality
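A rough sketch for working with one of the published xlsx documents using pandas follows. The file name and the assumption that the column headers form a two-level hierarchy (field name over AN1/AN2/AN3/agreed/percentage) are guesses and should be adjusted to the actual files.

```python
# Load one annotation round and keep speakers with unanimous gender labels.
import pandas as pd

df = pd.read_excel("annotations_round1.xlsx", header=[0, 1])  # hypothetical file name

unanimous_gender = df[df[("gender", "percentage")] == 100]
print(len(unanimous_gender), "speakers with unanimous gender annotation")
```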
Data Splits
The corpus is not divided into splits, as its purpose does not involve training models.
Dataset Creation
Curation Rationale
During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.
In order to obtain a balanced corpus with reliable information, we have seen the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Source Data
The original data comes from the Catalan sentences of the Common Voice corpus.
Initial Data Collection and Normalization
We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.
Who are the source language producers?
The original data comes from the Catalan sentences of the Common Voice corpus.
Annotations
Annotation process
Starting with version 13 of the Common Voice corpus we identified the speakers (273) who have recorded more than 1200 seconds of speech.
A team of three annotators was tasked with annotating:
if all the recordings correspond to the same person
the gender of the speaker
the accent of the speaker
the quality of the recording
They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Who are the annotators?
The annotation was entrusted to the CLiC (Centre de Llenguatge i Computació) team from the University of Barcelona. They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.
The annotation team was composed of:
Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.
Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.
1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.
To do the annotation they used a Google Drive spreadsheet.
Personal and Sensitive Information
The Common Voice dataset consists of people who have donated their voice online. We do not share their voices here, only their gender and accent. You agree not to attempt to determine the identity of speakers in the Common Voice dataset.
Considerations for Using the Data
Social Impact of Dataset
The IDs come from the Common Voice dataset, which consists of people who have donated their voice online.
You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
The information from this corpus will allow us to train and evaluate well-balanced Catalan ASR models. Furthermore, we believe it holds philological value for studying dialectal and gender variants.
Discussion of Biases
Most of the voices in the Catalan Common Voice correspond to men between 40 and 60 years old with a central accent. The aim of this dataset is to provide information that makes it possible to minimize the biases this could cause.
For the gender annotation, we have only considered "H" (male) and "D" (female).
Other Known Limitations
[N/A]
Additional Information
Dataset Curators
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
Licensing Information
This dataset is licensed under a CC BY 4.0 license.
It can be used for any purpose, whether academic or commercial, under the terms of the license. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Citation Information
DOI
Contributions
The annotation was entrusted to the STeL team from the University of Barcelona.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Data
The dataset consists of 5,538 images of public spaces, annotated with steps, stairs, ramps and grab bars for stairs and ramps. The dataset has 3,564 annotations of steps, 1,492 of stairs, 143 of ramps and 922 of grab bars.
Each step annotation is attributed with an estimate of the height of the step, as falling into one of three categories: less than 3cm, 3cm to 7cm or more than 7cm. Additionally it is attributed with a 'type', with the possibilities 'doorstep', 'curb' or 'other'.
Stair annotations are attributed with the number of steps in the stair.
Ramps are attributed with an estimate of their width, also falling into three categories: less than 50cm, 50cm to 100cm and more than 100cm.
In order to preserve all additional attributes of the labels, the data is published in the CVAT XML format for images.
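A minimal sketch for reading the CVAT XML (for images) export with the Python standard library is shown below. The element and attribute names follow CVAT's image annotation dump format; the exact label and attribute value strings used in this dataset should be checked against the published file.

```python
# Parse bounding boxes and their extra attributes from a CVAT XML dump.
import xml.etree.ElementTree as ET

root = ET.parse("annotations.xml").getroot()  # hypothetical file name

boxes = []
for image in root.iter("image"):
    for box in image.iter("box"):
        boxes.append({
            "image": image.get("name"),
            "label": box.get("label"),
            "bbox": tuple(float(box.get(k)) for k in ("xtl", "ytl", "xbr", "ybr")),
            # e.g. step height, step type, ramp width, number of stair steps
            "attributes": {a.get("name"): a.text for a in box.iter("attribute")},
        })

print(len(boxes), "bounding boxes loaded")
```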
Annotating Process
The labelling has been done using bounding boxes around the objects. This format is compatible with many popular object detection models, e.g. YOLO models. A bounding box is placed so that it contains exactly the visible part of the respective object. This implies that only objects that are visible in the photo are annotated. In particular, this means that a photo of a stair or step from above, where the object cannot be seen, has not been annotated, even when a human viewer can possibly infer that there is a stair or a step from other features in the photo.
Steps
A step is annotated when there is a vertical increment that functions as a passage between two surface areas intended for human or vehicle traffic. This means that we have not included:
In particular, the bounding box of a step object contains exactly the incremental part of the step, but does not extend into the top or bottom horizontal surface any more than necessary to entirely enclose the incremental part. This has been chosen for consistency reasons, as including parts of the horizontal surfaces would imply a non-trivial choice of how much to include, which we deemed would most likely lead to more inconsistent annotations.
The heights of the steps are estimated by the annotators and are therefore not guaranteed to be accurate.
The types of the steps typically fall into the category 'doorstep' or 'curb'. Steps that are in a doorway, entrance or similar are attributed as doorsteps. We also include in this category steps that are immediately leading to a doorway within a proximity of 1-2m. Steps between different types of pathways, e.g. between streets and sidewalks, are annotated as curbs. Any other type of step is annotated as 'other'. Many of the 'other' steps are, for example, steps to terraces.
Stairs
The stair label is used whenever two or more steps directly follow each other in a consistent pattern. All vertical increments are enclosed in the bounding box, as well as the intermediate surfaces of the steps. However, the top and bottom surfaces are not included more than necessary, for the same reason as for steps, described in the previous section.
The annotator counts the number of steps and attributes this number to the stair object label.
Ramps
Ramps have been annotated when a sloped passageway has been placed or built to connect two surface areas intended for human or vehicle traffic. This implies the same considerations as with steps. Likewise, only the sloped part of a ramp is annotated, not including the bottom or top surface area.
For each ramp, the annotator makes an assessment of the width of the ramp in three categories: less than 50cm, 50cm to 100cm and more than 100cm. This parameter is visually hard to assess, and sometimes impossible due to the view of the ramp.
Grab Bars
Grab bars are annotated for handrails and similar objects that are in direct connection with a stair or a ramp. While horizontal grab bars could also have been included, this was omitted due to the implied ambiguity with fences and similar objects. As the grab bar was originally intended as attribute information for stairs and ramps, we chose to keep this focus. The bounding box encloses the part of the grab bar that functions as a handrail for the stair or ramp.
Usage
As is often the case when annotating data, much information depends on the subjective assessment of the annotator. As each data point in this dataset has been annotated by only one person, caution should be taken when the data is applied.
Generally speaking, the mindset and usage guiding the annotations have been wheelchair accessibility. While we have strived to annotate at an object level, hopefully making the data more widely applicable than this, we state this explicitly as it may have swayed non-trivial annotation choices.
The attribute data, such as step height or ramp width, are highly subjective estimations. We still provide this data to give a post-hoc method to adjust which annotations to use. For example, for some purposes one may be interested in detecting only steps that are indeed more than 3cm. The attribute data make it possible to sort away the steps less than 3cm, so a machine learning algorithm can be trained on a more appropriate dataset for that use case. We stress, however, that one cannot expect to train accurate machine learning algorithms to infer the attribute data, as this is not accurate data in the first place.
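As a small follow-up to the CVAT parsing sketch above, the attribute data can be used to filter annotations post hoc, for instance keeping only steps estimated to be higher than 3cm. The attribute name and value strings below are assumptions and should be checked against the published XML.

```python
# Keep only step annotations whose (subjective) height estimate is above 3 cm.
tall_steps = [
    b for b in boxes
    if b["label"] == "step" and b["attributes"].get("height") != "less than 3cm"
]
print(len(tall_steps), "step annotations estimated to be above 3 cm")
```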
We hope this dataset will be a useful building block in the endeavour to automate barrier detection and documentation.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Proteomics data-dependent acquisition data sets collected with high-resolution mass-spectrometry (MS) can achieve very high-quality results, but nearly every analysis yields results that are thresholded at some accepted false discovery rate, meaning that a substantial number of results are incorrect. For study conclusions that rely on a small number of peptide-spectrum matches being correct, it is thus important to examine at least some crucial spectra to ensure that they are not one of the incorrect identifications. We present Quetzal, a peptide fragment ion spectrum annotation tool to assist researchers in annotating and examining such spectra to ensure that they correctly support study conclusions. We describe how Quetzal annotates spectra using the new Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) mzPAF standard for fragment ion peak annotation, including the Python-based code, a web-service end point that provides annotation services, and a web-based application for annotating spectra and producing publication-quality figures. We illustrate its functionality with several annotated spectra of varying complexity. Quetzal provides easily accessible functionality that can assist in the effort to ensure and demonstrate that crucial spectra support study conclusions. Quetzal is publicly available at https://proteomecentral.proteomexchange.org/quetzal/.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Predict "too tired to work" events from PPG data continuously monitored throughout their shift
Workers' cognitive fatigue is the cause of 2/3 of accidents in the mining industry, and there are similar dangers in road and air transportation and in any shift-work environment where people are working 12-hour shifts (e.g., hospitals).
The gold-standard for neuroscientists studying fatigue is an electroencephalogram (EEG) but it is hard to get good EEG measurements outside of a clinical setting. There has recently been more research using electrocardiogram (ECG or EKG) measurements, particularly using Heart Rate Variability (HRV) as a factor. Good ECG measurements are also not practical on people who are moving around in their working environment.
Photoplethysmograph (PPG) measurement on the ear is another technique to measure HRV and this can be packaged in a way that is accurate, comfortable and unobtrusive for people to wear all the way through their shift and on their drive home afterwards.
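As an illustration of the kind of HRV factor that can be derived from the heartbeat timestamps, here is a minimal sketch of RMSSD (root mean square of successive differences), a common time-domain HRV measure. The nanosecond timestamp unit and the example values are assumptions for illustration.

```python
# Compute RMSSD from heartbeat timestamps (assumed to be in nanoseconds).
import numpy as np

beat_times_ns = np.array([0.0, 812e6, 1640e6, 2438e6, 3265e6])  # hypothetical beats

ibi_ms = np.diff(beat_times_ns) / 1e6            # inter-beat intervals in milliseconds
rmssd = np.sqrt(np.mean(np.diff(ibi_ms) ** 2))   # root mean square of successive differences
print(f"RMSSD: {rmssd:.1f} ms")
```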
There were 5 participants, each attempting a 22 hour "shift" of computer gaming. For each participant there is:
Factors to derive could include:
All in order to estimate:
Consider:
For each gamer:
-> new time-series file of experimental attempts of time-to-"too tired to work"-epoch (F)
Breath (respiration) detection to obtain nanosecond timestamps for every breath from the PPG data (A) and/or the heartbeat timestamps (C) -> new breath timestamps file (E)
Calculate respiration rate from (E) -> append Respiratory Rate column to (D)
Revisit 5-7 with extra respiration factor
Eyes closed and/or "nodding dog" detection from .WMV webcam videos -> append new micro-sleep annotations to gamer's annotation file (B)
Revisit 5-7 for correlations in (D) with the new micro-sleep annotations
Don't worry if 7 is rubbish at the moment because there are too few events to make any sense of. Still try!
For new anomalies you find in 5, I can review webcam video of the gamer at these timestamps to see what they are indications of and then add appropriate "ground truth" annotations to (B) with accurate timestamps. Just contact me through Kaggle to do this. Also if anyone knows how to automatically detect "eyes closed" and/or "nodding dog" micro-sleep events from .WMV webcam videos please let me know.
A great place to start is this flawed but nevertheless interesting 2013 article from Korea:
Data consists of annotations of music in terms of moods music may express and activities that music might fit. The data structures are related to different kinds of annotation tasks, which addressed these questions: 1) annotations of 9 activities that fit a wide range of moods related to music, 2) nominations of music tracks that best fit a particular mood and annotations of the activities that fit them, and 3) annotations of these nominated tracks in terms of mood and activities. Users are anonymised, but their background information (gender, music preferences, age, etc.) is also available. The dataset consists of a relational database that is linked together by means of common ids (tracks, users, activities, moods, genres, expertise, language skill).
Current approaches to the tagging of music in online databases predominantly rely on music genre and artist name, with music tags often being ambiguous and inexact. Yet possibly the most salient feature of musical experiences is emotion. The few attempts so far undertaken to tag music for mood or emotion lack a scientific foundation in emotion research. The current project proposes to incorporate recent research on music-evoked emotion into the growing number of online musical databases and catalogues, notably the Geneva Emotional Music Scale (GEMS) - a rating measure for describing emotional effects of music recently developed by our group. Specifically, the aim here is to develop the GEMS into an innovative conceptual and technical tool for the tagging of online musical content for emotion. To this end, three studies are proposed. In Study 1, we will examine whether the GEMS labels and their grouping hold up against a much wider range of musical genres than those that were originally used for its development. In Study 2, we will use advanced data reduction techniques to select the most recurrent and important labels for describing music-evoked emotion. In Study 3, we will examine the added benefit of the new GEMS compared to conventional approaches to the tagging of music. The anticipated impact of the findings is threefold. First, the research described here will advance our understanding of the nature and structure of emotions evoked by music. Developing a valid model of music-evoked emotion is crucial for meaningful research in the social sciences and in the neurosciences. Second, music information organization and retrieval can benefit from a scientifically sound and parsimonious taxonomy for describing the emotional effects of music. Thus, searches of online music databases need no longer be confined to genre or artist, but can also incorporate emotion as a key experiential dimension of music. Third, a valid tagging scheme for emotion can assist both researchers and professionals in the choice of music to induce specific emotions. For example, psychologists, behavioural economists, and neuroscientists often need to induce emotion in their experiments to understand how behaviour or performance is modulated by emotion. Music is an obvious choice for emotion induction in controlled settings because it is a universal language that lends itself to comparisons across cultures and because it is ethically unproblematic.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Annotating terms referring to aspects of disability in historical texts is crucial for understanding how societies in different periods conceptualized and treated disability. Such annotations help modern readers grasp the evolving language, cultural attitudes, and social structures surrounding disability, shedding light on both marginalization and inclusion throughout history. This is important because evolving societal attitudes can influence the perpetuation of harmful language that reinforces stereotypes and discrimination. However, this task presents significant challenges. Terminology often reflects outdated, offensive, or ambiguous concepts that require sensitive interpretation. The meaning of terms may have shifted over time, making it difficult to align historical terms with contemporary understandings of disability. Additionally, contextual nuances and the lack of standardized language in historical records demand careful scholarly judgment to avoid anachronism or misrepresentation. In this paper we introduce an annotation protocol for analysing and describing semantic shifts in the discourse on disabilities in historical texts, reporting on how our protocol's design evolved to address these specific challenges and on issues around annotators' agreement. For designing the annotation protocol for measuring semantic change in the disability domain, we selected texts for annotation from Gale's History of Disabilities: Disabilities in Society, Seventeenth to Twentieth Century (https://www.gale.com/intl/c/history-of-disabilities-disabilities-in-society-seventeenth-to-twentieth-century).
Nitisha Jain, Chiara Di Bonaventura, Albert Meroño-Peñuela, Barbara McGillivray. 2025. An Annotation Protocol for Diachronic Evaluation of Semantic Drift in Disability Sources. In Proceedings of the 19th Linguistic Annotation Workshop (LAW 2025). Association for Computational Linguistics.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Disco-Annotation is a collection of training and test sets with manually annotated discourse relations for 8 discourse connectives in Europarl texts.
The 8 connectives with their annotated relations are:
although (contrast|concession)
as (prep|causal|temporal|comparison|concession)
however (contrast|concession)
meanwhile (contrast|temporal)
since (causal|temporal|temporal-causal)
though (contrast|concession)
while (contrast|concession|temporal|temporal-contrast|temporal-causal)
yet (adv|contrast|concession)
For each connective there is a training set and a test set. The relations were annotated by two trained annotators with a translation-spotting method. The division into training and test sets also allows for comparison if you train your own models.
If you need software for the latter, have a look at: https://github.com/idiap/DiscoConn-Classifier
Citation
Please cite the following papers if you make use of these datasets (and to know more about the annotation method):
@INPROCEEDINGS{Popescu-Belis-LREC-2012, author = {Popescu-Belis, Andrei and Meyer, Thomas and Liyanapathirana, Jeevanthi and Cartoni, Bruno and Zufferey, Sandrine}, title = {{D}iscourse-level {A}nnotation over {E}uroparl for {M}achine {T}ranslation: {C}onnectives and {P}ronouns}, booktitle = {Proceedings of the eighth international conference on Language Resources and Evaluation ({LREC})}, year = {2012}, address = {Istanbul, Turkey} }
@Article{Cartoni-DD-2013, Author = {Cartoni, Bruno and Zufferey, Sandrine and Meyer, Thomas}, Title = {{Annotating the meaning of discourse connectives by looking at their translation: The translation-spotting technique}}, Journal = {Dialogue & Discourse}, Volume = {4}, Number = {2}, pages = {65--86}, year = {2013} }
@ARTICLE{Meyer-TSLP-submitted, author = {Meyer, Thomas and Hajlaoui, Najeh and Popescu-Belis, Andrei}, title = {{Disambiguating Discourse Connectives for Statistical Machine Translation in Several Languages}}, journal = {IEEE/ACM Transactions of Audio, Speech, and Language Processing}, year = {submitted}, volume = {}, pages = {}, number = {} }
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The first release (v1.0.0) of the digitised, computer-readable Enggano word list in Modigliani (1894), with comparison from Nias, Toba-Batak, and Malay. The Enggano word list is included in the EnoLEX database. The data-source directory contains the original data in an .xlsx file that the author hand-digitised from the original source (Modigliani 1894). The light annotation included reflects the content of the original source, covering several aspects. First, annotating the string component that is printed in italics in the original source; the marking is indicated by the XML tag so it can be traced computationally. Second, there is also annotation concerning remark (
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Data collection is perhaps the most crucial part of any machine learning model: without it being done properly, not enough information is present for the model to learn from the patterns leading to one output or another. Data collection is however a very complex endeavor, time-consuming due to the volume of data that needs to be acquired and annotated. Annotation is an especially problematic step, due to its difficulty, length, and vulnerability to human error and inaccuracies when annotating complex data.
With high processing power becoming ever more accessible, synthetic dataset generation is becoming a viable option when looking to generate large volumes of accurately annotated data. With the help of photorealistic renderers, it is for example possible now to generate immense amounts of data, annotated with pixel-perfect precision and whose content is virtually indistinguishable from real-world pictures.
As an exercise in synthetic dataset generation, the data offered here was generated using the Python API of Blender, with the images rendered through the Cycles raycaster. It consists of plausible images representing pictures of a chessboard and pieces. The goal is, from those pictures and their annotations, to build a model capable of recognizing the pieces, as well as their positions on the board.
The dataset contains a large number of synthetic, randomly generated images of chess boards, taken at an angle overlooking the board and its pieces. Each image is associated with a .json file containing its annotations. The naming convention is that each render is associated with a number X, and that the image and annotations associated with that render are respectively named X.jpg and X.json.
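Given the X.jpg / X.json naming convention, pairing renders with their annotations is straightforward; a minimal sketch follows. The directory name and the structure of the JSON annotations are assumptions, not part of the published specification.

```python
# Iterate over render/annotation pairs following the X.jpg / X.json convention.
import json
from pathlib import Path

data_dir = Path("renders")  # hypothetical location of the extracted dataset

for image_path in sorted(data_dir.glob("*.jpg")):
    annotation_path = image_path.with_suffix(".json")
    with open(annotation_path) as f:
        annotation = json.load(f)
    print(image_path.name, "->", type(annotation).__name__)
```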
The data has been generated using the Python scripts and .blend file present in this repository. The chess board and pieces models that have been used for those renders are not provided with the code.
Data characteristics:
No distinction between training, validation, and testing data has been hard-built; this is left completely up to the users. A proposed pipeline for the extraction, recognition, and placement of chess pieces is provided in a notebook added with this dataset.
I would like to express my gratitude for the efforts of the Blender Foundation and all its participants, for their incredible open-source tool which once again has allowed me to conduct interesting projects with great ease.
Two interesting papers on the generation and use of synthetic data, which have inspired me to conduct this project:
Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt (2021). Fake It Till You Make It: Face analysis in the wild using synthetic data alone. https://arxiv.org/abs/2109.15102
Salehe Erfanian Ebadi, You-Cyuan Jhang, Alex Zook (2021). PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision. https://arxiv.org/abs/2112.09290
The Snow Height Classification dataset provides manually annotated snow height data that can be used for the development and evaluation of automatic snow height classification approaches. A subset of 20 IMIS stations, which span different locations and elevations and vary in underlying surface (e.g., vegetation, bare ground, glacier), was selected and manually annotated with binary two-class ground truth information regarding snow height data:
* Class 0 - Snow - the surface is covered by snow
* Class 1 - No Snow - the surface is snow-free (e.g., vegetation, soil, rocks)
Data has been annotated with the help of domain experts. It should be mentioned that annotating historical data is problematic, as there is no way of checking whether there really was snow at the station or not. This means that assessing the presence of snow with the help of information from other sensors should be considered a best-effort approach. Besides the annotated snow height data, the dataset also contains additional data which are relevant to reproduce the results in the following repository: https://gitlabext.wsl.ch/jan.svoboda/snow-height-classification
Any further use of the data has to comply with the CC BY-NC license: https://creativecommons.org/licenses/by-nc/4.0/
License information: https://dataverse.no/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.18710/HZKYL4
This data set is the appendix for Flick (2020): Die Entwicklung des Definitartikels im Althochdeutschen. Eine kognitiv-linguistische Korpusuntersuchung (Empirically oriented theoretical morphology and syntax 6). Berlin: Language Science Press. English abstract (the publication is in German): The German definite article originated from the adnominally used demonstrative determiner dër ('this'). It is known that the categorical shift from demonstrative to definite article took place during the Old High German (OHG) period (750-1050 AD). The genuine demonstrative loses its demonstrative force and appears in contexts in which referents can be identified independently of the speech situation (the so-called semantic definite contexts). In contrast to former investigations, the present study examines this development by means of a broad corpus study. The data consist of the five largest texts of the OHG period, which are accessible via the Old German Reference Corpus. The investigation takes a usage-based and constructional approach to language and shows how animacy interacts with the development of the definite article. The data set contains: (i) annotation guidelines for the following categories (particularly with regard to Old High German data): animacy, individuation, semantic roles, different types of definiteness, nominal features of the noun phrase; (ii) corpus data which was extracted from a pre-version of the "Referenzkorpus Altdeutsch" (0.1) plus all the annotations that were made during the investigation; (iii) R code that helped transform, annotate and statistically analyze the data. Please change to tree view to see the folder structure of the data set.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used for our paper "WormSwin: Instance Segmentation of C. elegans using Vision Transformer". This publication is divided into three parts:
CSB-1 Dataset
Synthetic Images Dataset
MD Dataset
The CSB-1 Dataset consists of frames extracted from videos of Caenorhabditis elegans (C. elegans) annotated with binary masks. Each C. elegans is separately annotated, providing accurate annotations even for overlapping instances. All annotations are provided in binary mask format and as COCO Annotation JSON files (see COCO website).
The videos are named after the following pattern:
<"worm age in hours"_"mutation"_"irradiated (binary)"_"video index (zero based)">
For mutation the following values are possible:
wild type
csb-1 mutant
csb-1 with rescue mutation
An example video name would be 24_1_1_2, meaning the video shows 24-hour-old C. elegans with the csb-1 mutation which got irradiated (video index 2).
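A small sketch for parsing this naming pattern is given below. The numeric coding of the mutation field (0, 1, 2 in the order listed above) is an assumption inferred from the example; check it against the dataset documentation.

```python
# Parse <age>_<mutation>_<irradiated>_<video index> video names.
def parse_video_name(name: str) -> dict:
    age, mutation, irradiated, index = name.split("_")
    mutations = {"0": "wild type", "1": "csb-1 mutant", "2": "csb-1 with rescue mutation"}
    return {
        "age_hours": int(age),
        "mutation": mutations.get(mutation, mutation),  # assumed coding
        "irradiated": irradiated == "1",
        "video_index": int(index),
    }

print(parse_video_name("24_1_1_2"))
# {'age_hours': 24, 'mutation': 'csb-1 mutant', 'irradiated': True, 'video_index': 2}
```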
Video data was provided by M. Rieckher; Instance Segmentation Annotations were created under supervision of K. Bozek and M. Deserno.
The Synthetic Images Dataset was created by cutting out C. elegans (foreground objects) from the CSB-1 Dataset and placing them randomly on background images also taken from the CSB-1 Dataset. Foreground objects were flipped, rotated and slightly blurred before being placed on the background images. The same was done with the binary mask annotations taken from the CSB-1 Dataset so that they match the foreground objects in the synthetic images. Additionally, we added rings of random color, size, thickness and position to the background images to simulate petri-dish edges.
This synthetic dataset was generated by M. Deserno.
The Mating Dataset (MD) consists of 450 grayscale image patches of 1,012 x 1,012 px showing C. elegans with high overlap, crawling on a petri dish. We took the patches from a 10 min. long video of size 3,036 x 3,036 px. The video was downsampled from 25 fps to 5 fps before selecting 50 random frames for annotating and patching. Like the other datasets, worms were annotated with binary masks and annotations are provided as COCO Annotation JSON files.
The video data was provided by X.-L. Chu; Instance Segmentation Annotations were created under supervision of K. Bozek and M. Deserno.
Further details about the datasets can be found in our paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide text metadata, image frames, and thumbnails of YouTube videos classified as harmful or harmless by domain experts, GPT-4-Turbo, and crowdworkers. Harmful videos are categorized into one or more of six harm categories: Information harms (IH), Hate and Harassment harms (HH), Clickbait harms (CB), Addictive harms (ADD), Sexual harms (SXL), and Physical harms (PH).
This repository includes the text metadata and a link to external cloud storage for the image data.
| Folder | Subfolder | #Videos |
| --- | --- | --- |
| Ground Truth | Harmful_full_agreement (classified as harmful by all three actors) | 5,109 |
| Ground Truth | Harmful_subset_agreement (classified as harmful by at least two of the three actors) | 14,019 |
| Domain Experts | Harmful | 15,115 |
| Domain Experts | Harmless | 3,303 |
| GPT-4-Turbo | Harmful | 10,495 |
| GPT-4-Turbo | Harmless | 7,818 |
| Crowdworkers (Workers from Amazon Mechanical Turk) | Harmful | 12,668 |
| Crowdworkers (Workers from Amazon Mechanical Turk) | Harmless | 4,390 |
| Unannotated large pool | - | 60,906 |
For details about the harm classification taxonomy and the performance comparison between crowdworkers, GPT-4-Turbo, and domain experts, please see https://arxiv.org/abs/2411.05854.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Experimental data can broadly be divided in discrete or continuous data. Continuous data are obtained from measurements that are performed as a function of another quantitative variable, e.g., time, length, concentration, or wavelength. The results from these types of experiments are often used to generate plots that visualize the measured variable on a continuous, quantitative scale. To simplify state-of-the-art data visualization and annotation of data from such experiments, an open-source tool was created with R/shiny that does not require coding skills to operate it. The freely available web app accepts wide (spreadsheet) and tidy data and offers a range of options to normalize the data. The data from individual objects can be shown in 3 different ways: (1) lines with unique colors, (2) small multiples, and (3) heatmap-style display. Next to this, the mean can be displayed with a 95% confidence interval for the visual comparison of different conditions. Several color-blind-friendly palettes are available to label the data and/or statistics. The plots can be annotated with graphical features and/or text to indicate any perturbations that are relevant. All user-defined settings can be stored for reproducibility of the data visualization. The app is dubbed PlotTwist and runs locally or online: https://huygens.science.uva.nl/PlotTwist
This is a safety annotation set for ImageNet. It uses the LlavaGuard-13B model for annotating. The annotations entail a safety category (image-category), an explanation (assessment), and a safety rating (decision). Furthermore, it contains the unique ImageNet id class_sampleId, e.g. n04542943_1754. These annotations allow you to train your model on only safety-aligned data. Plus, you can define yourself what safety-aligned means, e.g. discard all images where decision=="Review Needed" or… See the full description on the dataset page: https://huggingface.co/datasets/AIML-TUDA/imagenet_safety_annotated.
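A rough sketch of filtering the annotations with the Hugging Face datasets library is shown below. The split name is an assumption, and the field name ("decision") and the "Review Needed" value are taken from the description above; consult the dataset card for the authoritative schema.

```python
# Keep only samples with a clear safety rating (drop "Review Needed").
from datasets import load_dataset

ds = load_dataset("AIML-TUDA/imagenet_safety_annotated", split="train")  # split assumed
safe = ds.filter(lambda row: row["decision"] != "Review Needed")
print(len(safe), "of", len(ds), "samples kept")
```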