Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a collection of voice commands for a smart speaker, each beginning with the common wake-word "Hey Alexa." The commands cover a range of tasks such as music control, smart home management, information requests, reminders, shopping, entertainment, and communication. The dataset reflects natural language usage from a diverse group of speakers, capturing various phrasings, inflections, and contexts. It includes contributions from both male and female voices and features speakers with different native languages. If you plan to download this dataset, we would appreciate it very much if you could fill out the Google form at https://forms.gle/dixQ4mkZ4xbXtXRDA. This will help us understand the usage and impacts of this dataset. Your feedback will also help us improve any future extensions of this work.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Alexa Answers from alexaanswers.amazon.com
The Alexa Answers community helps improve Alexa's knowledge by answering questions asked by Alexa users. It contains some very quirky and hard questions, for example: Q: what percent of the population has blackhair A: The most common hair color in the world is black, and it is found in a wide array of backgrounds and ethnicities. About 75 to 85% of the global population has either black hair or the deepest brown shade. Q: what was the world population… See the full description on the dataset page: https://huggingface.co/datasets/theblackcat102/alexa-qa.
Conversational AI (e.g., Google Assistant or Amazon Alexa) is present in many people’s everyday life and, at the same time, becomes more and more capable of solving more complex tasks. However, it is unclear how the growing capabilities of conversational AI affect people’s disclosure towards the system as previous research has revealed mixed effects of technology competence. To address this research question, we propose a framework systematically disentangling conversational AI competencies along the lines of the dimensions of human competencies suggested by the action regulation theory. Across two correlational studies and three experiments (N total = 1453), we investigated how these competencies differentially affect users’ and non-users’ disclosure towards conversational AI. Results indicate that intellectual competencies (e.g., planning actions and anticipating problems) in a conversational AI heighten users’ willingness to disclose and reduce their privacy concerns. In contrast, meta-cognitive heuristics (e.g., deriving universal strategies based on previous interactions) raise privacy concerns for users and, even more so, for non-users but reduce willingness to disclose only for non-users. Thus, the present research suggests that not all competencies of a conversational AI are seen as merely positive, and the proposed differentiation of competencies is informative to explain effects on disclosure.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.
----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
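As a minimal sketch, the file can be read with Python's standard csv module. The column names follow the description above; the default comma delimiter is an assumption, so it is left configurable in case a distribution of the file uses a different separator:

```python
import csv

def load_sts_gold(path, delimiter=","):
    """Load STS-Gold rows as dicts with keys id, polarity, tweet.

    The delimiter is configurable because distributions of the file
    may use a separator other than ','.
    """
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for rec in csv.DictReader(f, delimiter=delimiter):
            rows.append({
                "id": rec["id"],
                "polarity": int(rec["polarity"]),  # polarity index of the text
                "tweet": rec["tweet"],
            })
    return rows
```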
----------- Amazon Sales Dataset ----------------
This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official Amazon website. The data was scraped from the site in January 2023.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
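Because the rows are ordered by label (all negatives first, then all positives), a seeded shuffle before splitting avoids a train or test set dominated by one class. A minimal sketch:

```python
import random

def shuffled_split(rows, test_frac=0.2, seed=42):
    """Shuffle the label-ordered rows before splitting, so both classes
    appear in each split. Returns (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # deterministic shuffle for reproducibility
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]
```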
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
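The division column is described as a categorical label generated from the TextBlob polarity score. The exact thresholds used to build EcoPreprocessed.csv are not documented; a hypothetical mapping might look like:

```python
def division_from_polarity(score, pos_threshold=0.05, neg_threshold=-0.05):
    """Map a TextBlob-style polarity score in [-1, 1] to a categorical label.

    The threshold values here are illustrative assumptions, not the ones
    actually used to generate the dataset's division column.
    """
    if score > pos_threshold:
        return "positive"
    if score < neg_threshold:
        return "negative"
    return "neutral"
```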
----------- Amazon Earphones Reviews ----------------
This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine-learning models for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
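The division column is described as manually added from the ReviewStar score. The actual cut-offs are not documented; a plausible (assumed) mapping from a 1-5 star rating to a categorical label:

```python
def division_from_stars(stars):
    """Map a 1-5 review star rating to a categorical sentiment label.

    The cut-offs below are an illustrative assumption, not the rule
    actually used to create the dataset's division column.
    """
    if stars >= 4:
        return "positive"
    if stars == 3:
        return "neutral"
    return "negative"
```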
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review, raw) and division (manually added - categorical label generated using the overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement
The German Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How can a smart home system control a connected device to be in a desired state? Recent developments in the Internet of Things (IoT) technology enable people to control various devices with the smart home system rather than physical contact. Furthermore, smart home systems cooperate with voice assistants such as Bixby or Alexa allowing users to control their devices through voice. In this process, a user’s query clarifies the target state of the device rather than the actions to perform. Thus, the smart home system needs to plan a sequence of actions to fulfill the user’s needs. However, it is challenging to perform action planning because it needs to handle a large-scale state transition graph of a real-world device, and the complex dependence relationships between capabilities. In this work, we propose SmartAid (Smart Home Action Planning in awareness of Dependency), an action planning method for smart home systems. To represent the state transition graph, SmartAid learns models that represent the prerequisite conditions and operations of actions. Then, SmartAid generates an action plan considering the dependencies between capabilities and actions. Extensive experiments demonstrate that SmartAid successfully represents a real-world device based on a state transition log and generates an accurate action sequence for a given query.
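The planning problem the abstract describes can be illustrated with a toy breadth-first search over a hand-written state-transition table. SmartAid itself learns prerequisite and operation models from state-transition logs; the sketch below only illustrates the task of turning a target state into an action sequence, not the paper's method, and the device states and action names are invented:

```python
from collections import deque

def plan_actions(transitions, start, goal):
    """Breadth-first search for a shortest action sequence taking a device
    from `start` to `goal`. `transitions` maps (state, action) -> next state.

    Toy illustration only: real devices have large state graphs and
    dependencies between capabilities, which SmartAid models explicitly.
    """
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, actions = frontier.popleft()
        if state == goal:
            return actions
        for (s, action), nxt in transitions.items():
            if s == state and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [action]))
    return None  # goal unreachable from start
```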
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of three different privacy policy corpora (in English and Italian) composed of 81 unique privacy policy texts spanning the period 2018-2021. This dataset makes available an example of three corpora of privacy policies. The first corpus is the English-language corpus, the original used in the study by Tang et al. [2]. The other two are cross-language corpora built (one, the source corpus, in English, and the other, the replication corpus, in Italian, which is the language of a potential replication study) from the first corpus.
The policies were collected from:
We manually analyzed the Alexa top 10 Italy websites as of November 2021. Analogously, we analyzed selected apps that, in the same period, ranked highest in the "most profitable games" category of the Play Store for Italy.
All the privacy policies are ANSI-encoded text files and have been manually read and verified.
The dataset is helpful as a starting point for building comparable cross-language privacy policies corpora. The availability of these comparable cross-language privacy policies corpora helps replicate studies in different languages.
Details on the methodology can be found in the accompanying paper.
The available files are as follows:
This dataset is the original dataset used in the publication [1]. The original English U.S. corpus is described in the publication [2].
[1] F. Ciclosi, S. Vidor and F. Massacci. "Building cross-language corpora for human understanding of privacy policies." Workshop on Digital Sovereignty in Cyber Security: New Challenges in Future Vision. Communications in Computer and Information Science. Springer International Publishing, 2023, In press.
[2] J. Tang, H. Shoemaker, A. Lerner, and E. Birrell. Defining Privacy: How Users Interpret Technical Terms in Privacy Policies. Proceedings on Privacy Enhancing Technologies, 3:70–94, 2021.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The COVID-19 pandemic has expedited the integration of Smart Voice Assistants (SVA) among older people. The qualitative data derived from user commands on SVA are pivotal for elucidating the engagement patterns of older individuals with such systems. However, the sheer volume of user-generated voice interaction data presents a formidable challenge for manual coding. Compounding this issue, age-related cognitive decline and alterations in speech patterns further complicate the interpretation of older users' SVA voice interactions. Conventional dictionary-based textual analysis tools, which count word frequencies, are inadequate for capturing the evolving and communicative essence of these interactions, which unfold over a series of dialogues and change over time. To address these challenges, our study introduces a novel, modified rule-based Natural Language Processing (MR-NLP) model augmented with human input. This reproducible approach capitalizes on human-derived insights to establish a lexicon of critical keywords and to formulate rules for the iterative refinement of the NLP model. English speakers, aged 50 or older and residing alone, were enlisted to engage with Amazon Alexa™ via predefined daily routines for a minimum of 30 minutes daily spanning three months (N = 35, mean age = 77). We amassed time-stamped, textual data comprising participants' user commands and responses from Alexa™. Initially, a subset constituting 20% of the data (1,020 instances) underwent manual coding by a human coder, predicated on keywords and commands. Separately, a rule-based Natural Language Processing (NLP) methodology was employed to code the identical subset. Discrepancies arising between the human coder and the NLP model programmer were deliberated upon and reconciled to refine the rule-based NLP coding framework for the entire dataset.
The modified rule-based NLP approach demonstrated notable enhancements in efficiency and scalability and reduced susceptibility to inadvertent errors in comparison to manual coding. Furthermore, human input was instrumental in augmenting the NLP model, yielding insights germane to the aging adult demographic, such as recurring speech patterns or ambiguities. By disseminating this innovative software solution to the scientific community, we endeavor to advance research and innovation in NLP model formulation, subsequently contributing to the understanding of older people's interactions with SVA and other AI-powered systems.
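The keyword-lexicon coding step described above can be sketched as a simple first-match rule set. The categories and keywords below are illustrative assumptions, not the study's actual lexicon:

```python
def code_command(command, rules):
    """Assign the first category whose keyword list matches the command
    text; 'uncoded' otherwise. `rules` maps category -> list of keywords.

    The lexicon passed in by the caller is where human-derived insight
    enters: keywords are added or refined as discrepancies are reviewed.
    """
    text = command.lower()
    for category, keywords in rules.items():
        if any(k in text for k in keywords):
            return category
    return "uncoded"
```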
This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as a whitelist for the Newstracker research project, in which we monitored the online web behaviour of a group of respondents. The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'.
For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (please find the code in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist of the websites that were the most popular in 2015. We manually compiled this list using data from DDMM, Alexa, and our own research.
The dataset consists of 5 columns:
- the URL
- the type of website: we created a list of website types, and each website has been manually labeled with 1 category
- Nieuws-regio: when the category was 'News', we subdivided these websites by regional focus: International, National or Local
- Nieuws-onderwerp: furthermore, each website under the category News was further subdivided by type of news website. For this we created our own list of news categories and manually coded each website
- Bron: for each website we noted which source we used to find it.
The full description of the research design of the Newstracker, including the set-up of this whitelist, is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
YouTube is an American online video-sharing platform headquartered in San Bruno, California. The service, created in February 2005 by three former PayPal employees—Chad Hurley, Steve Chen, and Jawed Karim—was bought by Google in November 2006 for US$1.65 billion and now operates as one of the company's subsidiaries. YouTube is the second most-visited website after Google Search, according to Alexa Internet rankings.
YouTube allows users to upload, view, rate, share, add to playlists, report, comment on videos, and subscribe to other users. Available content includes video clips, TV show clips, music videos, short and documentary films, audio recordings, movie trailers, live streams, video blogging, short original videos, and educational videos.
YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments, and likes). Note that they’re not the most-viewed videos overall for the calendar year”. Top performers on the YouTube trending list are music videos (such as the famously viral “Gangnam Style”), celebrity and/or reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well-known for.
This dataset is a daily record of the top trending YouTube videos.
Note that this dataset is a structurally improved version of this dataset.
This dataset was collected using the YouTube API. Parts of this description are quoted from Wikipedia.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes observations of trackers present on the top 500 pages popular among Finnish web users as per Alexa. The data collection was conducted using TrackerTracker in five separate requests for five subsets of 100 sites each between 19.8.2017 and 20.8.2017. The tool used a tracker database from March 24, 2017. More methodology details are described in the associated journal article https://doi.org/10.23978/inf.87841
This dataset is composed of the URLs of the top 1 million websites. The domains are ranked using the Alexa traffic ranking, which is determined using a combination of the browsing behavior of users on the website, the number of unique visitors, and the number of pageviews. In more detail, unique visitors are the number of unique users who visit a website on a given day, and pageviews are the total number of user URL requests for the website. However, multiple requests for the same website on the same day are counted as a single pageview. The website with the highest combination of unique visitors and pageviews is ranked the highest.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The dataset consists of three different data sources:
The capture of web browser data was made using the Selenium framework, which simulated classical user browsing. The browsers received commands to visit domains taken from Alexa's top 10K most visited websites. The capturing was performed on the host by listening to the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Mozilla Firefox and 1,000 pages visited by Chrome.
The Cloudflared DoH proxy was installed on a Raspberry Pi, and the IP address of the Raspberry Pi was set as the default DNS resolver in two separate offices in our university. It continuously captured the DNS/DoH traffic created by up to 20 devices for around three months.
The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide raw pcap data, a CSV with flow data, and a CSV file with extracted features.
The CSV with extracted features has the following data fields:
- Label (1 - Doh, 0 - regular HTTPS)
- Data source
- Duration
- Minimal Inter-Packet Delay
- Maximal Inter-Packet Delay
- Average Inter-Packet Delay
- Variance of Incoming Packet Sizes
- Variance of Outgoing Packet Sizes
- Ratio of the number of Incoming and Outgoing bytes
- Ratio of the number of Incoming and Outgoing packets
- Average of Incoming Packet sizes
- Average of Outgoing Packet sizes
- The median value of Incoming Packet sizes
- The median value of outgoing Packet sizes
- The ratio of bursts and pauses
- Number of bursts
- Number of pauses
- Autocorrelation
- Transmission symmetry in the 1st third of connection
- Transmission symmetry in the 2nd third of connection
- Transmission symmetry in the last third of connection
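The inter-packet delay features in the list above can be computed from a flow's packet timestamps; a minimal sketch:

```python
def delay_features(timestamps):
    """Compute the minimal/maximal/average inter-packet delay features
    from a chronologically sorted sequence of packet timestamps (seconds).

    Returns zeros for flows with fewer than two packets, an assumption
    about edge-case handling that the dataset description does not specify.
    """
    delays = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not delays:
        return {"min": 0.0, "max": 0.0, "avg": 0.0}
    return {
        "min": min(delays),
        "max": max(delays),
        "avg": sum(delays) / len(delays),
    }
```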
The observed network traffic does not contain privacy-sensitive information.
The zip file structure is:
|-- data
| |-- extracted-features...extracted features used in ML for DoH recognition
| | |-- chrome
| | |-- cloudflared
| | `-- firefox
| |-- flows...............................................exported flow data
| | |-- chrome
| | |-- cloudflared
| | `-- firefox
| `-- pcaps....................................................raw PCAP data
| |-- chrome
| |-- cloudflared
| `-- firefox
|-- LICENSE
`-- README.md
When using this dataset, please cite the original work as follows:
@inproceedings{vekshin2020,
author = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas},
title = {DoH Insight: Detecting DNS over HTTPS by Machine Learning},
year = {2020},
isbn = {9781450388337},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3407023.3409192},
doi = {10.1145/3407023.3409192},
booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security},
articleno = {87},
numpages = {8},
keywords = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets},
location = {Virtual Event, Ireland},
series = {ARES '20}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between output represents use of different types of Fitbit trackers and individual tracking behaviors / preferences.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data accompanies work from the paper entitled:
Object Detection Networks and Augmented Reality for Cellular Detection in Fluorescence Microscopy Acquisition and Analysis.
Waithe D1*,2, Brown JM3, Reglinski K4,6,7, Diez-Sevilla I5, Roberts D5, Christian Eggeling1,4,6,8
1 Wolfson Imaging Centre Oxford and 2 MRC WIMM Centre for Computational Biology and 3 MRC Molecular Haematology Unit and 4 MRC Human Immunology Unit, Weatherall Institute of Molecular Medicine, University of Oxford, OX3 9DS, Oxford, United Kingdom. 5 Nuffield Division of Clinical Laboratory Sciences, Radcliffe Department of Medicine, John Radcliffe Hospital, University of Oxford, Headley Way, Oxford, OX3 9DU.
6 Institute of Applied Optics and Biophysics, Friedrich-Schiller-University Jena, Max-Wien Platz 4, 07743 Jena, Germany.
7 University Hospital Jena (UKJ), Bachstraße 18, 07743 Jena, Germany.
8 Leibniz Institute of Photonic Technology e.V., Albert-Einstein-Straße 9, 07745 Jena, Germany.
Further details of these datasets can be found in the methods section of the above paper.
Erythroblast DAPI (+glycophorin A): erythroblast cells were stained with DAPI and for glycophorin A protein (CD235a antibody, JC159 clone, Dako) and with Alexa Fluor 488 secondary antibody (Invitrogen). DAPI staining was performed through using VectaShield Hard Set mounting solution with DAPI (Vector Lab). Num. of images used for training: 80 and testing: 80. Average number of cells per image: 4.5.
Neuroblastoma phalloidin (+DAPI): images of neuroblastoma cells (N1E115) stained with phalloidin and DAPI were acquired from the Cell Image Library [26]. Cell images in the original dataset were acquired with a larger field of view than our system and so we divided each image into four sub-images and also created ROI bounding boxes for each of the cells in the image. The images were stained for FITC-phalloidin and DAPI. Num. of images used for training: 180, testing: 180. Average number of cells per image: 11.7.
Fibroblast nucleopore: fibroblast (GM5756T) cells were stained for a nucleopore protein (anti-Nup153 mouse antibody, Abcam) and detected with anti-mouse Alexa Fluor 488. Num. of images for training: 26 and testing: 20. Average number of cells per image: 4.8.
Eukaryote DAPI: eukaryote cells were stained with DAPI and fixed and mounted in Vectashield (Vector Lab). Num. of images for training: 40 and testing: 40. Average number of cells per image: 8.9.
C127 DAPI: C127 cells were initially treated with a technique called RASER-FISH[27], stained with DAPI and fixed and mounted in Vectashield (Vector Lab). Num. of images for training: 30 and testing: 30. Average number of cells per image: 7.1.
HEK peroxisome All: HEK-293 cells expressing peroxisome-localized GFP-SCP2 protein. Cells were transfected with GFP-SCP2 protein, which contains the PTS-1 localization signal, which redirects the fluorescently tagged protein to the actively importing peroxisomes[28]. Cells were fixed and mounted. Num. of images for training: 55 and testing: 55. Additionally we sub-categorised the cells as ‘punctuate’ and ‘non-punctuate’, where ‘punctuate’ would represent cells that have staining where the peroxisomes are discretely visible and ‘non-punctuate’ would be diffuse staining within the cell. The ‘HEK peroxisome All’ dataset contains ROI for all the cells: average number of cells per image: 7.9. The ‘HEK peroxisome’ dataset contains only those cells with punctuate fluorescence: average number of punctuate cells per image: 3.9.
Erythroid DAPI All: Murine embryoid body-derived erythroid cells, differentiated from mES cells. Stained with DAPI and fixed and mounted in Vectashield (Vector Lab). Num. of images for training: 51 and testing: 50. Multinucleate cells are seen with this differentiation procedure. There is a variation in size of the nuclei (nuclei become smaller as differentiation proceeds). The smaller, 'late erythroid' nuclei contain heavily condensed DNA and often have heavy ‘blobs’ of heterochromatin visible. Apoptopic cells are also present, with apoptotic bodies clearly present. The ‘Erythroid DAPI All’ dataset contains ROI for all the cells in the image. Average number of cells per image: 21.5. The subset ‘Erythroid DAPI’ contains non-apoptotic cells only: average number of cells per image: 11.9
COS-7 nucleopore. Slides were acquired from GATTAquant. GATTA-Cells 1C are single color COS-7 cells stained for Nuclear pore complexes (Anti-Nup) and with an Alexa Fluor 555 F(ab')2 secondary stain. GATTA-Cells are embedded in ProLong Diamond. Num. of images for training: 50 and testing: 50. Average number of cells per image: 13.2.
COS-7 nucleopore 40x. Same GATTA-Cells 1C slides (GATTAquant) as above but imaged on Nikon microscope, with 40x NA 0.6 objective. Num. of images for testing: 11. Average number of cells per image: 31.6.
COS-7 nucleopore 10x. Same GATTA-Cells 1C slides (GATTAquant) as above but imaged on Nikon microscope, with 10x NA 0.25 objective. Num. of images for testing: 20. Average number of cells per image: 24.6.
Dataset Annotation
Datasets were annotated by a skilled user. These annotations represent the ground-truth of each image with bounding boxes (regions) drawn around each cell present within the staining. Annotations were produced using Fiji/ImageJ [29] ROI Manager and also through using the OMERO [30] ROI drawing interface (https://www.openmicroscopy.org/omero/). The dataset labels were then converted into a format compatible with Faster-RCNN (Pascal), YOLOv2, YOLOv3 and also RetinaNet. The scripts used to perform this conversion are documented in the repository (https://github.com/dwaithe/amca/scripts/).
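Converting a Pascal VOC-style bounding box into the YOLO format mentioned above (pixel corner coordinates to normalised centre/size) can be sketched as follows; this is a generic illustration of the two formats, not the repository's actual conversion script:

```python
def pascal_to_yolo(box, img_w, img_h):
    """Convert a Pascal VOC box (xmin, ymin, xmax, ymax, in pixels) to the
    YOLO format (x_center, y_center, width, height, normalised to [0, 1])."""
    xmin, ymin, xmax, ymax = box
    return (
        (xmin + xmax) / 2 / img_w,  # x centre, normalised by image width
        (ymin + ymax) / 2 / img_h,  # y centre, normalised by image height
        (xmax - xmin) / img_w,      # box width, normalised
        (ymax - ymin) / img_h,      # box height, normalised
    )
```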
License / reuse data agreement: https://guides.library.uq.edu.au/deposit-your-data/license-reuse-data-agreement
UQ-AAS21 is a comprehensive dataset of the Amazon VPA service, i.e., Alexa, which is the most popular VPA service. It includes 65,195 Alexa applications (or skills) and comprehensive information about them, including invocation names and user reviews, among 16 attributes overall.
AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement
The Tamil Wake Word & Voice Command Dataset is expertly curated to support the training and development of voice-activated systems. This dataset includes a large collection of wake words and command phrases, essential for enabling seamless user interaction with voice assistants and other speech-enabled technologies. It’s designed to ensure accurate wake word detection and voice command recognition, enhancing overall system performance and user experience.
This dataset includes 20,000+ audio recordings of wake words and command phrases. Each participant contributed 400 recordings, captured under varied environmental conditions and speaking speeds. The data covers:
This diversity ensures robust training for real-world voice assistant applications.
Each audio file is accompanied by detailed metadata to support advanced filtering and training needs.