Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of voice commands for a smart speaker, each beginning with the common wake-word "Hey Alexa." The commands cover a range of tasks such as music control, smart home management, information requests, reminders, shopping, entertainment, and communication. The dataset reflects natural language usage from a diverse group of speakers, capturing various phrasings, inflections, and contexts. It includes contributions from both male and female voices and features speakers with different native languages. If you plan to download this dataset, we would appreciate it very much if you could fill out the Google form at https://forms.gle/dixQ4mkZ4xbXtXRDA. This will help us understand the usage and impacts of this dataset. Your feedback will also help us improve any future extensions of this work.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Alexa Answers from alexaanswers.amazon.com
The Alexa Answers community helps improve Alexa's knowledge by answering questions asked by Alexa users. It contains some very quirky and hard questions, for example: Q: what percent of the population has blackhair A: The most common hair color in the world is black, and it is found in a wide array of backgrounds and ethnicities. About 75 to 85% of the global population has either black hair or the deepest brown shade. Q: what was the world population… See the full description on the dataset page: https://huggingface.co/datasets/theblackcat102/alexa-qa.
Conversational AI (e.g., Google Assistant or Amazon Alexa) is present in many people’s everyday life and, at the same time, becomes more and more capable of solving more complex tasks. However, it is unclear how the growing capabilities of conversational AI affect people’s disclosure towards the system as previous research has revealed mixed effects of technology competence. To address this research question, we propose a framework systematically disentangling conversational AI competencies along the lines of the dimensions of human competencies suggested by the action regulation theory. Across two correlational studies and three experiments (N total = 1453), we investigated how these competencies differentially affect users’ and non-users’ disclosure towards conversational AI. Results indicate that intellectual competencies (e.g., planning actions and anticipating problems) in a conversational AI heighten users’ willingness to disclose and reduce their privacy concerns. In contrast, meta-cognitive heuristics (e.g., deriving universal strategies based on previous interactions) raise privacy concerns for users and, even more so, for non-users but reduce willingness to disclose only for non-users. Thus, the present research suggests that not all competencies of a conversational AI are seen as merely positive, and the proposed differentiation of competencies is informative to explain effects on disclosure.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
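As a minimal sketch, the file can be read with Python's standard csv module. The column names follow the description above; the default comma delimiter is an assumption, so it is left configurable in case a distribution of the file uses a different separator:

```python
import csv

def load_sts_gold(path, delimiter=","):
    """Load STS-Gold rows as dicts with keys id, polarity, tweet.

    The delimiter is configurable because distributions of the file
    may use a separator other than ','.
    """
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for rec in csv.DictReader(f, delimiter=delimiter):
            rows.append({
                "id": rec["id"],
                "polarity": int(rec["polarity"]),  # polarity index of the text
                "tweet": rec["tweet"],
            })
    return rows
```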
----------- Amazon Sales Dataset ----------------
This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the site in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
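Because the rows are ordered by label (all negatives first, then all positives), a seeded shuffle before splitting avoids a train or test set dominated by one class. A minimal sketch:

```python
import random

def shuffled_split(rows, test_frac=0.2, seed=42):
    """Shuffle the label-ordered rows before splitting, so both classes
    appear in each split. Returns (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # deterministic shuffle for reproducibility
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]
```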
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
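The division column is described as a categorical label generated from the TextBlob polarity score. The exact thresholds used to build EcoPreprocessed.csv are not documented; a hypothetical mapping might look like:

```python
def division_from_polarity(score, pos_threshold=0.05, neg_threshold=-0.05):
    """Map a TextBlob-style polarity score in [-1, 1] to a categorical label.

    The threshold values here are illustrative assumptions, not the ones
    actually used to generate the dataset's division column.
    """
    if score > pos_threshold:
        return "positive"
    if score < neg_threshold:
        return "negative"
    return "neutral"
```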
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine-learning models for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
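The division column is described as manually added from the ReviewStar score. The actual cut-offs are not documented; a plausible (assumed) mapping from a 1-5 star rating to a categorical label:

```python
def division_from_stars(stars):
    """Map a 1-5 review star rating to a categorical sentiment label.

    The cut-offs below are an illustrative assumption, not the rule
    actually used to create the dataset's division column.
    """
    if stars >= 4:
        return "positive"
    if stars == 3:
        return "neutral"
    return "negative"
```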
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review, raw) and division (manually added - categorical label generated using the overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement
The German Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How can a smart home system control a connected device to be in a desired state? Recent developments in the Internet of Things (IoT) technology enable people to control various devices with the smart home system rather than physical contact. Furthermore, smart home systems cooperate with voice assistants such as Bixby or Alexa allowing users to control their devices through voice. In this process, a user’s query clarifies the target state of the device rather than the actions to perform. Thus, the smart home system needs to plan a sequence of actions to fulfill the user’s needs. However, it is challenging to perform action planning because it needs to handle a large-scale state transition graph of a real-world device, and the complex dependence relationships between capabilities. In this work, we propose SmartAid (Smart Home Action Planning in awareness of Dependency), an action planning method for smart home systems. To represent the state transition graph, SmartAid learns models that represent the prerequisite conditions and operations of actions. Then, SmartAid generates an action plan considering the dependencies between capabilities and actions. Extensive experiments demonstrate that SmartAid successfully represents a real-world device based on a state transition log and generates an accurate action sequence for a given query.
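The planning problem the abstract describes can be illustrated with a toy breadth-first search over a hand-written state-transition table. SmartAid itself learns prerequisite and operation models from state-transition logs; the sketch below only illustrates the task of turning a target state into an action sequence, not the paper's method, and the device states and action names are invented:

```python
from collections import deque

def plan_actions(transitions, start, goal):
    """Breadth-first search for a shortest action sequence taking a device
    from `start` to `goal`. `transitions` maps (state, action) -> next state.

    Toy illustration only: real devices have large state graphs and
    dependencies between capabilities, which SmartAid models explicitly.
    """
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, actions = frontier.popleft()
        if state == goal:
            return actions
        for (s, action), nxt in transitions.items():
            if s == state and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [action]))
    return None  # goal unreachable from start
```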
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of three different privacy policy corpora (in English and Italian) composed of 81 unique privacy policy texts spanning the period 2018-2021. This dataset makes available an example of three corpora of privacy policies. The first corpus is the English-language corpus, the original used in the study by Tang et al. [2]. The other two are cross-language corpora built (one, the source corpus, in English, and the other, the replication corpus, in Italian, which is the language of a potential replication study) from the first corpus.
The policies were collected from:
We manually analyzed the Alexa top 10 Italy websites as of November 2021. Analogously, we analyzed selected apps that, in the same period, ranked highest in the "most profitable games" category of the Play Store for Italy.
All the privacy policies are ANSI-encoded text files and have been manually read and verified.
The dataset is helpful as a starting point for building comparable cross-language privacy policies corpora. The availability of these comparable cross-language privacy policies corpora helps replicate studies in different languages.
Details on the methodology can be found in the accompanying paper.
The available files are as follows:
This dataset is the original dataset used in the publication [1]. The original English U.S. corpus is described in the publication [2].
[1] F. Ciclosi, S. Vidor and F. Massacci. "Building cross-language corpora for human understanding of privacy policies." Workshop on Digital Sovereignty in Cyber Security: New Challenges in Future Vision. Communications in Computer and Information Science. Springer International Publishing, 2023, In press.
[2] J. Tang, H. Shoemaker, A. Lerner, and E. Birrell. Defining Privacy: How Users Interpret Technical Terms in Privacy Policies. Proceedings on Privacy Enhancing Technologies, 3:70–94, 2021.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The COVID-19 pandemic has expedited the integration of Smart Voice Assistants (SVA) among older people. The qualitative data derived from user commands on SVA are pivotal for elucidating the engagement patterns of older individuals with such systems. However, the sheer volume of user-generated voice interaction data presents a formidable challenge for manual coding. Compounding this issue, age-related cognitive decline and alterations in speech patterns further complicate the interpretation of older users' SVA voice interactions. Conventional dictionary-based textual analysis tools, which count word frequencies, are inadequate for capturing the evolving and communicative essence of these interactions, which unfold over a series of dialogues and change over time. To address these challenges, our study introduces a novel, modified rule-based Natural Language Processing (MR-NLP) model augmented with human input. This reproducible approach capitalizes on human-derived insights to establish a lexicon of critical keywords and to formulate rules for the iterative refinement of the NLP model. English speakers, aged 50 or older and residing alone, were enlisted to engage with Amazon Alexa™ via predefined daily routines for a minimum of 30 minutes daily spanning three months (N = 35, mean age = 77). We amassed time-stamped, textual data comprising participants' user commands and responses from Alexa™. Initially, a subset constituting 20% of the data (1,020 instances) underwent manual coding by a human coder, predicated on keywords and commands. Separately, a rule-based Natural Language Processing (NLP) methodology was employed to code the identical subset. Discrepancies arising between the human coder and the NLP model programmer were deliberated upon and reconciled to refine the rule-based NLP coding framework for the entire dataset.
The modified rule-based NLP approach demonstrated notable enhancements in efficiency and scalability and reduced susceptibility to inadvertent errors in comparison to manual coding. Furthermore, human input was instrumental in augmenting the NLP model, yielding insights germane to the aging adult demographic, such as recurring speech patterns or ambiguities. By disseminating this innovative software solution to the scientific community, we endeavor to advance research and innovation in NLP model formulation, subsequently contributing to the understanding of older people's interactions with SVA and other AI-powered systems.
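The keyword-lexicon coding step described above can be sketched as a simple first-match rule set. The categories and keywords below are illustrative assumptions, not the study's actual lexicon:

```python
def code_command(command, rules):
    """Assign the first category whose keyword list matches the command
    text; 'uncoded' otherwise. `rules` maps category -> list of keywords.

    The lexicon passed in by the caller is where human-derived insight
    enters: keywords are added or refined as discrepancies are reviewed.
    """
    text = command.lower()
    for category, keywords in rules.items():
        if any(k in text for k in keywords):
            return category
    return "uncoded"
```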
This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as a whitelist for the Newstracker research project, in which we monitored the online web behaviour of a group of respondents. The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'.
For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (please find the code in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist of the websites that were the most popular in 2015. We manually compiled this list using data from DDMM, Alexa, and our own research.
The dataset consists of 5 columns:
- the URL
- the type of website: we created a list of website types, and each website has been manually labeled with 1 category
- Nieuws-regio: when the category was 'News', we subdivided these websites by regional focus: International, National or Local
- Nieuws-onderwerp: furthermore, each website under the category News was further subdivided by type of news website. For this we created our own list of news categories and manually coded each website
- Bron: for each website we noted which source we used to find it.
The full description of the research design of the Newstracker, including the set-up of this whitelist, is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
YouTube is an American online video-sharing platform headquartered in San Bruno, California. The service, created in February 2005 by three former PayPal employees—Chad Hurley, Steve Chen, and Jawed Karim—was bought by Google in November 2006 for US$1.65 billion and now operates as one of the company's subsidiaries. YouTube is the second most-visited website after Google Search, according to Alexa Internet rankings.
YouTube allows users to upload, view, rate, share, add to playlists, report, comment on videos, and subscribe to other users. Available content includes video clips, TV show clips, music videos, short and documentary films, audio recordings, movie trailers, live streams, video blogging, short original videos, and educational videos.
YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments, and likes). Note that they’re not the most-viewed videos overall for the calendar year”. Top performers on the YouTube trending list are music videos (such as the famously viral “Gangnam Style”), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well-known for.
This dataset is a daily record of the top trending YouTube videos.
Note that this dataset is a structurally improved version of this dataset.
This dataset was collected using the YouTube API. Parts of this description are quoted from Wikipedia.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes observations of trackers present on the top 500 pages popular among Finnish web users as per Alexa. The data collection was conducted using TrackerTracker in five separate requests for five subsets of 100 sites each between 19.8.2017 and 20.8.2017. The tool used a tracker database from March 24, 2017. More methodology details are described in the associated journal article https://doi.org/10.23978/inf.87841
This dataset is composed of the URLs of the top 1 million websites. The domains are ranked using the Alexa traffic ranking, which is determined using a combination of the browsing behavior of users on the website, the number of unique visitors, and the number of pageviews. In more detail, unique visitors are the number of unique users who visit a website on a given day, and pageviews are the total number of user URL requests for the website. However, multiple requests for the same website on the same day are counted as a single pageview. The website with the highest combination of unique visitors and pageviews is ranked the highest.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The dataset consists of three different data sources:
The capture of web browser data was made using the Selenium framework, which simulated classical user browsing. The browsers received commands to visit domains taken from Alexa's top 10K most visited websites. The capturing was performed on the host by listening to the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Mozilla Firefox and 1,000 pages visited by Chrome.
The Cloudflared DoH proxy was installed on a Raspberry Pi, and the IP address of the Raspberry Pi was set as the default DNS resolver in two separate offices in our university. It continuously captured the DNS/DoH traffic created by up to 20 devices for around three months.
The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide raw pcap data, a CSV with flow data, and a CSV file with extracted features.
The CSV with extracted features has the following data fields:
- Label (1 - Doh, 0 - regular HTTPS)
- Data source
- Duration
- Minimal Inter-Packet Delay
- Maximal Inter-Packet Delay
- Average Inter-Packet Delay
- Variance of Incoming Packet Sizes
- Variance of Outgoing Packet Sizes
- Ratio of the number of Incoming and Outgoing bytes
- Ratio of the number of Incoming and Outgoing packets
- Average of Incoming Packet sizes
- Average of Outgoing Packet sizes
- The median value of Incoming Packet sizes
- The median value of outgoing Packet sizes
- The ratio of bursts and pauses
- Number of bursts
- Number of pauses
- Autocorrelation
- Transmission symmetry in the 1st third of connection
- Transmission symmetry in the 2nd third of connection
- Transmission symmetry in the last third of connection
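The inter-packet delay features in the list above can be computed from a flow's packet timestamps; a minimal sketch:

```python
def delay_features(timestamps):
    """Compute the minimal/maximal/average inter-packet delay features
    from a chronologically sorted sequence of packet timestamps (seconds).

    Returns zeros for flows with fewer than two packets, an assumption
    about edge-case handling that the dataset description does not specify.
    """
    delays = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not delays:
        return {"min": 0.0, "max": 0.0, "avg": 0.0}
    return {
        "min": min(delays),
        "max": max(delays),
        "avg": sum(delays) / len(delays),
    }
```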
The observed network traffic does not contain privacy-sensitive information.
The zip file structure is:
|-- data
| |-- extracted-features...extracted features used in ML for DoH recognition
| | |-- chrome
| | |-- cloudflared
| | `-- firefox
| |-- flows...............................................exported flow data
| | |-- chrome
| | |-- cloudflared
| | `-- firefox
| `-- pcaps....................................................raw PCAP data
| |-- chrome
| |-- cloudflared
| `-- firefox
|-- LICENSE
`-- README.md
When using this dataset, please cite the original work as follows:
@inproceedings{vekshin2020,
author = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas},
title = {DoH Insight: Detecting DNS over HTTPS by Machine Learning},
year = {2020},
isbn = {9781450388337},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3407023.3409192},
doi = {10.1145/3407023.3409192},
booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security},
articleno = {87},
numpages = {8},
keywords = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets},
location = {Virtual Event, Ireland},
series = {ARES '20}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data accompanies work from the paper entitled:
Object Detection Networks and Augmented Reality for Cellular Detection in Fluorescence Microscopy Acquisition and Analysis.
Waithe D1*,2, Brown JM3, Reglinski K4,6,7, Diez-Sevilla I5, Roberts D5, Christian Eggeling1,4,6,8
1 Wolfson Imaging Centre Oxford and 2 MRC WIMM Centre for Computational Biology and 3 MRC Molecular Haematology Unit and 4 MRC Human Immunology Unit, Weatherall Institute of Molecular Medicine, University of Oxford, OX3 9DS, Oxford, United Kingdom. 5 Nuffield Division of Clinical Laboratory Sciences, Radcliffe Department of Medicine, John Radcliffe Hospital, University of Oxford, Headley Way, Oxford, OX3 9DU.
6 Institute of Applied Optics and Biophysics, Friedrich-Schiller-University Jena, Max-Wien Platz 4, 07743 Jena, Germany.
7 University Hospital Jena (UKJ), Bachstraße 18, 07743 Jena, Germany.
8 Leibniz Institute of Photonic Technology e.V., Albert-Einstein-Straße 9, 07745 Jena, Germany.
Further details of these datasets can be found in the methods section of the above paper.
Erythroblast DAPI (+glycophorin A): erythroblast cells were stained with DAPI and for glycophorin A protein (CD235a antibody, JC159 clone, Dako) and with Alexa Fluor 488 secondary antibody (Invitrogen). DAPI staining was performed through using VectaShield Hard Set mounting solution with DAPI (Vector Lab). Num. of images used for training: 80 and testing: 80. Average number of cells per image: 4.5.
Neuroblastoma phalloidin (+DAPI): images of neuroblastoma cells (N1E115) stained with phalloidin and DAPI were acquired from the Cell Image Library [26]. Cell images in the original dataset were acquired with a larger field of view than our system and so we divided each image into four sub-images and also created ROI bounding boxes for each of the cells in the image. The images were stained for FITC-phalloidin and DAPI. Num. of images used for training: 180, testing: 180. Average number of cells per image: 11.7.
Fibroblast nucleopore: fibroblast (GM5756T) cells were stained for a nucleopore protein (anti-Nup153 mouse antibody, Abcam) and detected with anti-mouse Alexa Fluor 488. Num. of images for training: 26 and testing: 20. Average number of cells per image: 4.8.
Eukaryote DAPI: eukaryote cells were stained with DAPI and fixed and mounted in Vectashield (Vector Lab). Num. of images for training: 40 and testing: 40. Average number of cells per image: 8.9.
C127 DAPI: C127 cells were initially treated with a technique called RASER-FISH[27], stained with DAPI and fixed and mounted in Vectashield (Vector Lab). Num. of images for training: 30 and testing: 30. Average number of cells per image: 7.1.
HEK peroxisome All: HEK-293 cells expressing peroxisome-localized GFP-SCP2 protein. Cells were transfected with GFP-SCP2 protein, which contains the PTS-1 localization signal, which redirects the fluorescently tagged protein to the actively importing peroxisomes[28]. Cells were fixed and mounted. Num. of images for training: 55 and testing: 55. Additionally we sub-categorised the cells as ‘punctuate’ and ‘non-punctuate’, where ‘punctuate’ would represent cells that have staining where the peroxisomes are discretely visible and ‘non-punctuate’ would be diffuse staining within the cell. The ‘HEK peroxisome All’ dataset contains ROI for all the cells: average number of cells per image: 7.9. The ‘HEK peroxisome’ dataset contains only those cells with punctuate fluorescence: average number of punctuate cells per image: 3.9.
Erythroid DAPI All: Murine embryoid body-derived erythroid cells, differentiated from mES cells. Stained with DAPI and fixed and mounted in Vectashield (Vector Lab). Num. of images for training: 51 and testing: 50. Multinucleate cells are seen with this differentiation procedure. There is a variation in size of the nuclei (nuclei become smaller as differentiation proceeds). The smaller, 'late erythroid' nuclei contain heavily condensed DNA and often have heavy ‘blobs’ of heterochromatin visible. Apoptopic cells are also present, with apoptotic bodies clearly present. The ‘Erythroid DAPI All’ dataset contains ROI for all the cells in the image. Average number of cells per image: 21.5. The subset ‘Erythroid DAPI’ contains non-apoptotic cells only: average number of cells per image: 11.9
COS-7 nucleopore. Slides were acquired from GATTAquant. GATTA-Cells 1C are single color COS-7 cells stained for Nuclear pore complexes (Anti-Nup) and with an Alexa Fluor 555 F(ab')2 secondary stain. GATTA-Cells are embedded in ProLong Diamond. Num. of images for training: 50 and testing: 50. Average number of cells per image: 13.2.
COS-7 nucleopore 40x. Same GATTA-Cells 1C slides (GATTAquant) as above but imaged on Nikon microscope, with 40x NA 0.6 objective. Num. of images for testing: 11. Average number of cells per image: 31.6.
COS-7 nucleopore 10x. Same GATTA-Cells 1C slides (GATTAquant) as above but imaged on Nikon microscope, with 10x NA 0.25 objective. Num. of images for testing: 20. Average number of cells per image: 24.6.
Dataset Annotation
Datasets were annotated by a skilled user. These annotations represent the ground-truth of each image with bounding boxes (regions) drawn around each cell present within the staining. Annotations were produced using Fiji/ImageJ [29] ROI Manager and also through using the OMERO [30] ROI drawing interface (https://www.openmicroscopy.org/omero/). The dataset labels were then converted into a format compatible with Faster-RCNN (Pascal), YOLOv2, YOLOv3 and also RetinaNet. The scripts used to perform this conversion are documented in the repository (https://github.com/dwaithe/amca/scripts/).
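Converting a Pascal VOC-style bounding box into the YOLO format mentioned above (pixel corner coordinates to normalised centre/size) can be sketched as follows; this is a generic illustration of the two formats, not the repository's actual conversion script:

```python
def pascal_to_yolo(box, img_w, img_h):
    """Convert a Pascal VOC box (xmin, ymin, xmax, ymax, in pixels) to the
    YOLO format (x_center, y_center, width, height, normalised to [0, 1])."""
    xmin, ymin, xmax, ymax = box
    return (
        (xmin + xmax) / 2 / img_w,  # x centre, normalised by image width
        (ymin + ymax) / 2 / img_h,  # y centre, normalised by image height
        (xmax - xmin) / img_w,      # box width, normalised
        (ymax - ymin) / img_h,      # box height, normalised
    )
```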
License / reuse data agreement: https://guides.library.uq.edu.au/deposit-your-data/license-reuse-data-agreement
UQ-AAS21 is a comprehensive dataset of the Amazon VPA service, i.e., Alexa, which is the most popular VPA service. It includes 65,195 Alexa applications (or skills) and comprehensive information about them, including invocation names and user reviews, among 16 attributes overall.
AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement
The Tamil Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.