This dataset contains a list of 3,654 Dutch websites that we considered the most popular websites in 2015. The list served as a whitelist for the Newstracker research project, in which we monitored the online web behaviour of a group of respondents. The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'. For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (the code is available in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist of the websites that were most popular in 2015. We compiled this list manually using data from DDMM, Alexa and our own research. The dataset consists of 5 columns:
- URL: the address of the website
- Type of website: we created a list of website types and manually labelled each website with 1 category
- Nieuws-regio: when the category was 'News', we subdivided these websites by regional focus: International, National or Local
- Nieuws-onderwerp: each website under the category News was further subdivided by type of news website; for this we created our own list of news categories and manually coded each website
- Bron: for each website we noted which source we used to find it
The full description of the research design of the Newstracker, including the set-up of this whitelist, is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
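A minimal sketch of loading and filtering this whitelist with pandas, assuming the list is exported as a CSV named `newstracker_whitelist.csv` with the five columns described above (the filename and exact column headers are assumptions):

```python
import pandas as pd

# Load the whitelist; filename and column headers are assumed for illustration.
whitelist = pd.read_csv("newstracker_whitelist.csv")

# Count websites per manually assigned category.
print(whitelist["Type"].value_counts())

# Select news websites with a national focus.
national_news = whitelist[
    (whitelist["Type"] == "News") & (whitelist["Nieuws-regio"] == "National")
]
print(national_news[["URL", "Nieuws-onderwerp", "Bron"]].head())
```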
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.
We'll extract any data from any website on the Internet. You don't have to worry about buying and maintaining complex and expensive software, or hiring developers.
Some common use cases our customers use the data for: • Data Analysis • Market Research • Price Monitoring • Sales Leads • Competitor Analysis • Recruitment
We can get data from websites with pagination or scroll, with captchas, and even from behind logins. Text, images, videos, documents.
Receive data in any format you need: Excel, CSV, JSON, or any other.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is available on Brisbane City Council’s open data website – data.brisbane.qld.gov.au. The site provides additional features for viewing and interacting with the data and for downloading the data in various formats.
Monthly analytics reports for the Brisbane City Council website
Information regarding the sessions for Brisbane City Council website during the month including page views and unique page views.
https://creativecommons.org/publicdomain/zero/1.0/
Context
The data presented here was obtained on a Kali Linux machine at the University of Cincinnati, Cincinnati, Ohio, by carrying out packet captures for 1 hour during the evening of Oct 9th, 2023, using Wireshark. The dataset consists of 394,137 instances stored in a CSV (Comma-Separated Values) file. This large dataset can be used for different machine learning applications, for instance classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection and anomaly detection.
The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.
Content:
This network traffic dataset consists of 7 features. Each instance contains the source and destination IP addresses. The majority of the properties are numeric in nature; however, there are also nominal and date types due to the Timestamp.
The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).
Dataset Columns:
No: Number of the instance.
Timestamp: Timestamp of the network traffic instance.
Source IP: IP address of the source.
Destination IP: IP address of the destination.
Protocol: Protocol used by the instance.
Length: Length of the instance.
Info: Information about the traffic instance.
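A minimal sketch of loading this capture and deriving simple per-protocol statistics with pandas, assuming the Wireshark export is named `capture.csv` and uses the column headers listed above (both the filename and the exact headers are assumptions):

```python
import pandas as pd

# Load the Wireshark CSV export; filename and exact headers are assumed.
df = pd.read_csv("capture.csv")

# Basic sanity checks on the capture.
print(len(df))                         # expected: 394,137 instances
print(df["Protocol"].value_counts())   # traffic volume per protocol

# Example feature for traffic classification or anomaly detection:
# mean frame length per protocol.
print(df.groupby("Protocol")["Length"].mean().sort_values(ascending=False))
```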
Acknowledgements:
I would like to thank the University of Cincinnati for providing the infrastructure used to generate this network traffic dataset.
Ravikumar Gattu, Susmitha Choppadandi
Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it supports machine learning models that can identify specific applications (such as TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).
**Dataset License:** CC0: Public Domain
Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection and anomaly detection.
ML techniques that benefit from this dataset:
This dataset is highly useful because it consists of 394,137 instances of network traffic data obtained by using 25 applications on public, private and enterprise networks. The dataset also contains important features that can be used for most machine learning applications in cybersecurity. A few of the machine learning applications that could benefit from this dataset are:
1. Network Performance Monitoring: This large network traffic dataset can be used to analyse traffic and identify patterns in the network. This helps in designing network security algorithms that minimise network problems.
2. Anomaly Detection: A large network traffic dataset can be used to train machine learning models to find irregularities in the traffic, which could help identify cyber attacks.
3. Network Intrusion Detection: This large dataset could be used to train machine learning algorithms and design models for detecting traffic issues, malicious traffic, network attacks and DoS attacks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are publishing a dataset we created for HTTPS traffic classification.
Since the data were captured mainly in a real backbone network, we omitted IP addresses and ports. The dataset consists of features calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
During our research, we divided HTTPS traffic into the following categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, and W -- Website and other traffic.
We have chosen the service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. We also used several popular websites that primarily focus on the audience in our country. The identified traffic classes and their representatives are provided below:
Traffic class | Representatives |
---|---|
Live Video Stream | Twitch, Czech TV, YouTube Live |
Video Player | DailyMotion, Stream.cz, Vimeo, YouTube |
Music Player | AppleMusic, Spotify, SoundCloud |
File Upload/Download | FileSender, OwnCloud, OneDrive, Google Drive |
Website and Other Traffic | Websites from Alexa Top 1M list |
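A minimal sketch of the label mapping implied by the category codes above, useful when reading the class column of the CSVs (the exact code values stored in the files and the column they appear in are assumptions):

```python
# Map the single-letter class codes described above to readable category names.
# How the codes are stored in the CSVs is an assumption for illustration.
LABELS = {
    "L": "Live Video Streaming",
    "P": "Video Player",
    "M": "Music Player",
    "U": "File Upload",
    "D": "File Download",
    "W": "Website and other traffic",
}

def decode_label(code: str) -> str:
    """Return the human-readable traffic category for a class code."""
    return LABELS.get(code, "Unknown")

print(decode_label("P"))  # -> "Video Player"
```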
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Roboflow Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1,000 of the world's top websites. They have been automatically annotated to label the following classes:
* `button` - navigation links, tabs, etc.
* `heading` - text that was enclosed in `<h1>` to `<h6>` tags.
* `link` - inline, textual `<a>` tags.
* `label` - text labeling form fields.
* `text` - all other text.
* `image` - `<img>`, `<svg>`, or `<video>` tags, and icons.
* `iframe` - ads and 3rd party content.
This is an example image and annotation from the dataset:
![Wikipedia Screenshot](https://i.imgur.com/mOG3u3Z.png)
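As a quick illustration, a minimal sketch of counting annotations per class, assuming the dataset has been exported from Roboflow in COCO JSON format to a file named `_annotations.coco.json` (the export format and filename are assumptions):

```python
import json
from collections import Counter

# Load a COCO-format export; format and filename are assumed for illustration.
with open("_annotations.coco.json") as f:
    coco = json.load(f)

# Map category ids to names (button, heading, link, label, text, image, iframe).
id_to_name = {c["id"]: c["name"] for c in coco["categories"]}

# Count how many bounding boxes were generated for each UI element class.
counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
print(counts.most_common())
```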
Annotated screenshots are very useful in Robotic Process Automation. But they can be expensive to label. This dataset would cost over $4000 for humans to label on popular labeling services. We hope this dataset provides a good starting point for your project. Try it with a model from our model library.
Roboflow is happy to provide a custom screenshots dataset to meet your particular needs. We can crawl public or internal web applications. Just reach out and we'll be happy to provide a quote!
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless. Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility.
A. Market Research and Analysis: Utilize the Tripadvisor dataset to conduct in-depth market research and analysis in the travel and hospitality industry. Identify emerging trends, popular destinations, and customer preferences. Gain a competitive edge by understanding your target audience's needs and expectations.
B. Competitor Analysis: Compare and contrast your hotel or travel services with competitors on Tripadvisor. Analyze their ratings, customer reviews, and performance metrics to identify strengths and weaknesses. Use these insights to enhance your offerings and stand out in the market.
C. Reputation Management: Monitor and manage your hotel's online reputation effectively. Track and analyze customer reviews and ratings on Tripadvisor to identify improvement areas and promptly address negative feedback. Positive reviews can be leveraged for marketing and branding purposes.
D. Pricing and Revenue Optimization: Leverage the Tripadvisor dataset to analyze pricing strategies and revenue trends in the hospitality sector. Understand seasonal demand fluctuations, pricing patterns, and revenue optimization opportunities to maximize your hotel's profitability.
E. Customer Sentiment Analysis: Conduct sentiment analysis on Tripadvisor reviews to gauge customer satisfaction and sentiment towards your hotel or travel service. Use this information to improve guest experiences, address pain points, and enhance overall customer satisfaction.
F. Content Marketing and SEO: Create compelling content for your hotel or travel website based on the popular keywords, topics, and interests identified in the Tripadvisor dataset. Optimize your content to improve search engine rankings and attract more potential guests.
G. Personalized Marketing Campaigns: Use the data to segment your target audience based on preferences, travel habits, and demographics. Develop personalized marketing campaigns that resonate with different customer segments, resulting in higher engagement and conversions.
H. Investment and Expansion Decisions: Access historical and real-time data on hotel performance and market dynamics from Tripadvisor. Utilize this information to make data-driven investment decisions, identify potential areas for expansion, and assess the feasibility of new ventures.
I. Predictive Analytics: Utilize the dataset to build predictive models that forecast future trends in the travel industry. Anticipate demand fluctuations, understand customer behavior, and make proactive decisions to stay ahead of the competition.
J. Business Intelligence Dashboards: Create interactive and insightful dashboards that visualize key performance metrics from the Tripadvisor dataset. These dashboards can help executives and stakeholders get a quick overview of the hotel's performance and make data-driven decisions.
Incorporating the Tripadvisor dataset into your business processes will enhance your understanding of the travel market, facilitate data-driven decision-making, and provide valuable insights to drive success in the competitive hospitality industry.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 101,219 URLs and 18 features (including the label). The class distribution is as follows:
Phishing (0): 63,678 URLs
Legitimate (1): 37,540 URLs
These URLs have been sourced from the URLHaus database and scraped from many sites and other well-known repositories of malicious websites actively used in phishing attacks. Each entry in this subset has been manually verified and is labeled as a phishing URL, making this dataset highly reliable for identifying harmful web content.
The legitimate URLs have been collected from reputable sources such as Wikipedia and Stack Overflow. These websites are known for hosting user-generated content and community discussions, ensuring that the URLs represent safe, legitimate web addresses. The URLs were randomly scraped to ensure diversity in the types of legitimate sites included. Dataset Features:
URL: The full web address of each entry, providing the primary feature for analysis.
Label: A binary label indicating whether the URL is legitimate (1) or phishing (0).
Applications:
This dataset is suitable for training and evaluating machine learning models aimed at distinguishing between phishing and legitimate websites. It can be used in a variety of cybersecurity research projects, including URL-based phishing detection, web content analysis, and the development of real-time protection systems.
Usage:
Researchers can leverage this dataset to develop and test algorithms for identifying phishing websites with high accuracy, using features such as the URL structure together with the class label. The inclusion of both phishing and legitimate URLs provides a comprehensive basis for creating robust models capable of detecting phishing attempts in diverse online environments.
Feature Name | Description |
---|---|
URL | The full URL string. |
url_length | Total number of characters in the URL. |
has_ip_address | Binary flag (1/0): whether the URL contains an IP address. |
dot_count | Number of `.` characters in the URL. |
https_flag | Binary flag (1/0): whether the URL uses HTTPS. |
url_entropy | Shannon entropy of the URL string; higher values indicate more randomness. |
token_count | Number of tokens/words in the URL. |
subdomain_count | Number of subdomains in the URL. |
query_param_count | Number of query parameters (after `?`). |
tld_length | Length of the top-level domain (e.g., "com" = 3). |
path_length | Length of the path part after the domain. |
has_hyphen_in_domain | Binary flag (1/0): whether the domain contains a hyphen (`-`). |
number_of_digits | Total number of numeric characters in the URL. |
tld_popularity | Binary flag (1/0): whether the TLD is popular. |
suspicious_file_extension | Binary flag (1/0): whether the URL ends with a suspicious extension (e.g., .exe, .zip). |
domain_name_length | Length of the domain name. |
percentage_numeric_chars | Percentage of numeric characters in the URL. |
ClassLabel | Target label: 1 = Legitimate, 0 = Phishing. |
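A minimal sketch of how a few of these features could be recomputed for a raw URL in Python; the function names are our own, and the exact tokenization and feature definitions used by the dataset authors are assumptions:

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, as used for the url_entropy feature."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

def url_features(url: str) -> dict:
    """Recompute a handful of the URL features described above (approximation)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.netloc.split(":")[0]
    return {
        "url_length": len(url),
        "dot_count": url.count("."),
        "https_flag": int(parsed.scheme == "https"),
        "url_entropy": shannon_entropy(url),
        "query_param_count": len([p for p in parsed.query.split("&") if p]),
        "number_of_digits": sum(ch.isdigit() for ch in url),
        "has_hyphen_in_domain": int("-" in host),
        "has_ip_address": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
    }

print(url_features("https://example.com/login?user=1&id=42"))
```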
Google Chrome is a popular web browser developed by Google.
The Chrome User Experience Report is a public dataset of key user experience metrics for popular origins on the web, as experienced by Chrome users under real-world conditions.
https://bigquery.cloud.google.com/dataset/chrome-ux-report:all
For more info, see the documentation at https://developers.google.com/web/tools/chrome-user-experience-report/
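As an illustration, a minimal sketch of querying the public BigQuery dataset with the google-cloud-bigquery Python client. The monthly snapshot table name and the histogram schema follow the standard CrUX query examples, but treat them as assumptions and verify against the documentation linked above:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # requires a GCP project with BigQuery enabled

# First-contentful-paint density histogram for one origin in one monthly snapshot.
# The snapshot table (202401) and the origin value are assumptions for illustration.
query = """
    SELECT bin.start AS fcp_ms, SUM(bin.density) AS density
    FROM `chrome-ux-report.all.202401`,
         UNNEST(first_contentful_paint.histogram.bin) AS bin
    WHERE origin = 'https://www.wikipedia.org'
    GROUP BY bin.start
    ORDER BY bin.start
"""
for row in client.query(query).result():
    print(row.fcp_ms, row.density)
```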
License: CC BY 4.0
Photo by Edho Pratama on Unsplash
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset supplements publication "Multilingual Scraper of Privacy Policies and Terms of Service" at ACM CSLAW’25, March 25–27, 2025, München, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites, see concrete numbers below.
The following table lists the amount of websites visited per month:
Month | Number of websites |
---|---|
2024-01 | 551'148 |
2024-02 | 792'921 |
2024-03 | 844'537 |
2024-04 | 802'169 |
2024-05 | 805'878 |
2024-06 | 809'518 |
2024-07 | 811'418 |
2024-08 | 813'534 |
2024-09 | 814'321 |
2024-10 | 817'586 |
2024-11 | 828'662 |
2024-12 | 827'101 |
The number of websites visited should always be higher than the number of jobs (Table 1 of the paper), because a website may redirect, resulting in two websites being scraped, or a job may have to be retried.
To simplify access, we release the data as large CSVs: one file for policies and another for terms, per month. All of these files contain all metadata usable for the analysis. If your favourite CSV parser reports the same numbers as above, the dataset is being parsed correctly. We use ',' as the separator, the first row is the header, and strings are quoted.
Our scraper sometimes collects documents other than policies and terms (for how often this happens, see the evaluation in Sec. 4 of the publication), and these might contain personal data such as addresses of website authors maintained only for a selected audience. We therefore decided to reduce the risks for websites by anonymizing the data using Presidio, which substitutes personal data with tokens. If your personal data has not been effectively anonymized from the database and you wish for it to be deleted, please contact us.
The uncompressed dataset is about 125 GB in size, so you will need sufficient storage. This also means that you likely cannot process all the data at once in your memory, so we split the data in months and in files for policies and terms.
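A minimal sketch of processing one monthly policy CSV without loading the full 125 GB into memory, using pandas chunked reading; the filename is an assumption, while the separator, header row, and quoting follow the description above:

```python
import pandas as pd

# The monthly filename is an assumption; separator, header row and quoting
# follow the dataset description (',' separator, first row as header, quoted strings).
chunks = pd.read_csv(
    "policies-2024-01.csv",
    sep=",",
    quotechar='"',
    chunksize=100_000,   # stream the file instead of loading it at once
)

total_rows = 0
langs: dict[str, int] = {}
for chunk in chunks:
    total_rows += len(chunk)
    # Tally detected policy languages per chunk (column name from the metadata list below).
    for lang, count in chunk["policy_lang"].value_counts().items():
        langs[lang] = langs.get(lang, 0) + count

print(total_rows, sorted(langs.items(), key=lambda kv: -kv[1])[:10])
```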
The files have the following names:
Both files contain the following metadata columns:
- `website_month_id` - identification of the crawled website
- `job_id` - one website can have multiple jobs in case of redirects (but most commonly has only one)
- `website_index_status` - network state of loading the index page, as resolved by the Chrome DevTools Protocol. Possible values:
  - `DNS_ERROR` - domain cannot be resolved
  - `OK` - all fine
  - `REDIRECT` - domain redirects to somewhere else
  - `TIMEOUT` - the request timed out
  - `BAD_CONTENT_TYPE` - 415 Unsupported Media Type
  - `HTTP_ERROR` - 404 error
  - `TCP_ERROR` - error in the network connection
  - `UNKNOWN_ERROR` - unknown error
- `website_lang` - language of the index page, detected with the langdetect library
- `website_url` - the URL of the website sampled from the CrUX list (may contain subdomains, etc.). Use this as a unique identifier for connecting data between months.
- `job_domain_status` - indicates the status of loading the index page. Can be:
  - `OK` - all works well (at the moment, this should be all entries)
  - `BLACKLISTED` - URL is on our list of blocked URLs
  - `UNSAFE` - website is not safe according to Google's Safe Browsing API
  - `LOCATION_BLOCKED` - country is in the list of blocked countries
- `job_started_at` - when the visit of the website was started
- `job_ended_at` - when the visit of the website was ended
- `job_crux_popularity` - JSON with all popularity ranks of the website this month
- `job_index_redirect` - when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. The `index_redirect` is then the `job.id` corresponding to the redirect target.
- `job_num_starts` - number of crawlers that started this job (counts restarts in case of an unsuccessful crawl; max is 3)
- `job_from_static` - whether this job was included in the static selection (see Sec. 3.3 of the paper)
- `job_from_dynamic` - whether this job was included in the dynamic selection (see Sec. 3.3 of the paper); this is not exclusive with `job_from_static`, both can be true when the lists overlap
- `job_crawl_name` - our name of the crawl, contains year and month (e.g., 'regular-2024-12' for the regular crawl in Dec 2024)
- `policy_url_id` - ID of the URL this policy has
- `policy_keyword_score` - score (higher is better) according to the crawler's keyword list that the given document is a policy
- `policy_ml_probability` - probability assigned by the BERT model that the given document is a policy
- `policy_consideration_basis` - on which basis we decided that this URL is a policy. The following three options are executed by the crawler in this order:
- `policy_url` - full URL to the policy
- `policy_content_hash` - used as an identifier; if the document remained the same between crawls, it does not create a new entry
- `policy_content` - the text of the policy extracted to Markdown using Mozilla's readability library
- `policy_lang` - language of the content, detected with fasttext

The terms columns are analogous to the policy columns; just substitute `policy` with `terms`.
Check this Google Doc for an updated version of this README.md.
The DAPR extension for CKAN integrates with the Digital Analytics Program (DAP) to retrieve and display usage statistics for datasets and resources within a CKAN instance. This extension allows administrators to monitor how frequently datasets are accessed, providing valuable insights into data usage patterns. By tracking download events, DAPR enriches CKAN's functionality, facilitating data-driven decision-making and resource management.

Key Features:
- DAP Download Event Retrieval: Retrieves download events tracked by the Digital Analytics Program and stores them within CKAN for later analysis and display. This ensures that access data is captured and made available alongside other dataset metadata.
- Frequently Accessed Dataset Listing: Enables the creation of lists showcasing the most frequently accessed datasets. This allows administrators to identify popular datasets and prioritize resources accordingly.
- Dataset and Resource Access Counting: Provides a mechanism to display access counts for datasets and individual resources. These counts can be displayed within the CKAN interface, providing users with immediate feedback on dataset popularity.
- DAP-enabled Website Event Tracking: Tracks accesses to resources even when those accesses originate from external DAP-enabled websites, providing a comprehensive view of data usage regardless of the access point.
- Scheduled Data Refresh: Supports a command-line utility for scheduled access data imports, ensuring usage statistics remain up to date with minimal manual intervention. It has options to specify the start and end date or to retrieve all records.

Technical Integration: The DAPR extension integrates with CKAN by adding new database tables to store DAP tracking data. Administrators configure the extension through the CKAN configuration file (ckan.ini) or similar. The extension also provides a command-line interface (CLI) tool for importing DAP tracking events, which can be scheduled using cron or similar task schedulers.

Benefits & Impact: By integrating DAP statistics, the DAPR extension allows CKAN instance owners to improve data visibility, assess the value of available datasets, and make data-informed decisions about resource allocation and data discoverability improvements. Knowing which datasets are frequently used and accessed can help data curators prioritize updates, augment popular datasets with further information, and invest resources into ensuring that the CKAN instance delivers the most relevant and impactful data to its audience.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains a number of computed dynamic analysis metrics related to quality in use for the 5,000 most popular websites.
http://opendatacommons.org/licenses/dbcl/1.0/
The flight booking dataset of various airlines was scraped date-wise from a well-known travel website in a structured format. The dataset contains records of flight travel details between cities in India. Multiple features are present, such as source and destination city, arrival and departure time, duration, and price of the flight.
This data is available as a CSV file. We are going to analyze this data set using the Pandas DataFrame.
This analysis will be helpful for those working in the airline and travel domains.
Using this dataset, we answered multiple questions with Python in our project (a brief pandas sketch follows the list of questions below).
Q.1. What are the airlines in the dataset, accompanied by their frequencies?
Q.2. Show Bar Graphs representing the Departure Time & Arrival Time.
Q.3. Show Bar Graphs representing the Source City & Destination City.
Q.4. Does the price vary with the airline?
Q.5. Does ticket price change based on the departure time and arrival time?
Q.6. How does the price change with the source and destination?
Q.7. How is the price affected when tickets are bought in just 1 or 2 days before departure?
Q.8. How does the ticket price vary between Economy and Business class?
Q.9. What will be the average price of a Vistara flight from Delhi to Hyderabad in Business class?
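A minimal pandas sketch answering a couple of the questions above, assuming the data is stored in a CSV named `flight_bookings.csv` with columns matching the features listed below (filename and exact column names are assumptions):

```python
import pandas as pd

# Filename and column names are assumed; adjust them to the actual CSV.
df = pd.read_csv("flight_bookings.csv")

# Q1: airlines and their frequencies.
print(df["Airline"].value_counts())

# Q4: does the price vary with the airline? Compare average ticket price per airline.
print(df.groupby("Airline")["Price"].mean().sort_values(ascending=False))

# Q9: average Business-class price for Vistara flights from Delhi to Hyderabad.
mask = (
    (df["Airline"] == "Vistara")
    & (df["Source City"] == "Delhi")
    & (df["Destination City"] == "Hyderabad")
    & (df["Class"] == "Business")
)
print(df.loc[mask, "Price"].mean())
```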
These are the main Features/Columns available in the dataset :
1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.
10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date.
11) Price: The target variable, which stores the ticket price.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset used for the online stats training website (https://www.rensvandeschoot.com/tutorials/) and is based on the data used by van de Schoot, van der Velden, Boom, and Brugman (2010).
The dataset is based on a study that investigates the association between popularity status and antisocial behavior in at-risk adolescents (n = 1491), where gender and ethnic background are moderators of the association. The study distinguished subgroups within the popular status group in terms of overt and covert antisocial behavior. For more information on the sample, instruments, methodology, and research context, we refer the interested reader to van de Schoot, van der Velden, Boom, and Brugman (2010).
Variable name Description
Respnr = Respondents’ number
Dutch = Respondents’ ethnic background (0 = Dutch origin, 1 = non-Dutch origin)
gender = Respondents’ gender (0 = boys, 1 = girls)
sd = Adolescents’ socially desirable answering patterns
covert = Covert antisocial behavior
overt = Overt antisocial behavior
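A minimal sketch of loading these variables and comparing covert and overt antisocial behavior across groups with pandas; the filename `popularity.csv` is an assumption, and this simple group comparison is only illustrative, not the moderation analysis from the original paper:

```python
import pandas as pd

# The filename is an assumption; the column names follow the variable list above.
data = pd.read_csv("popularity.csv")

# Mean covert and overt antisocial behavior by gender (0 = boys, 1 = girls).
print(data.groupby("gender")[["covert", "overt"]].mean())

# Same breakdown by ethnic background (0 = Dutch origin, 1 = non-Dutch origin).
print(data.groupby("Dutch")[["covert", "overt"]].mean())
```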
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains global web cache latency measurements collected via RIPE Atlas probes equipped with Starlink terminals across five continents, spanning over 24 hours and resulting in ~2 Million measurements. The measurements aim to evaluate the user-perceived latency of accessing popular websites through low-earth orbit (LEO) satellite networks.
This dataset is a product of Spache, a research project on web caching from space. Please refer to its WWW'25 paper for more details and analysis results.
The dataset includes the following files:
This dataset is intended to support research on web caching, particularly in the context of satellite Internet. Please cite both this dataset and the associated paper if you find this data useful.
https://brightdata.com/license
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset for recommendations collected from ted.com which contains metadata fields for TED talks and user profiles with rating and commenting transactions.
The TED dataset contains all the audio-video recordings of the TED talks downloaded from the official TED website, http://www.ted.com, on April 27th 2012 (first version) and on September 10th 2012 (second version). No processing has been done on any of the metadata fields. The metadata was obtained by crawling the HTML source of the list of talks and users, as well as talk and user webpages using scripts written by Nikolaos Pappas at the Idiap Research Institute, Martigny, Switzerland. The dataset is shared under the Creative Commons license (the same as the content of the TED talks) which is stored in the COPYRIGHT file. The dataset is shared for research purposes which are explained in detail in the following papers. The dataset can be used to benchmark systems that perform two tasks, namely personalized recommendations and generic recommendations. Please check the CBMI 2013 paper for a detailed description of each task.
Nikolaos Pappas, Andrei Popescu-Belis, "Combining Content with User Preferences for TED Lecture Recommendation", 11th International Workshop on Content Based Multimedia Indexing, Veszprém, Hungary, IEEE, 2013 PDF document, Bibtex citation
Nikolaos Pappas, Andrei Popescu-Belis, Sentiment Analysis of User Comments for One-Class Collaborative Filtering over TED Talks, 36th ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, ACM, 2013 PDF document, Bibtex citation
If you use the TED dataset for your research please cite one of the above papers (specifically the 1st paper for the April 2012 version and the 2nd paper for the September 2012 version of the dataset).
TED website
The TED website is a popular online repository of audiovisual recordings of public lectures given by prominent speakers, under a Creative Commons non-commercial license (see www.ted.com). The site provides extended metadata and user-contributed material. The speakers are scientists, writers, journalists, artists, and businesspeople from all over the world who are generally given a maximum of 18 minutes to present their ideas. The talks are given in English and are usually transcribed and then translated into several other languages by volunteer users. The quality of the talks has made TED one of the most popular online lecture repositories, as each talk was viewed on average almost 500,000 times.
Metadata
The dataset contains two main entry types: talks and users. The talks have the following data fields: identifier, title, description, speaker name, TED event at which they were given, transcript, publication date, filming date, number of views. Each talk has a variable number of user comments, organized in threads. In addition, three fields were assigned by TED editorial staff: related tags, related themes, and related talks. Each talk generally has three related talks and 95% of them have a high-quality transcript available. The dataset includes 1,149 talks from 960 speakers and 69,023 registered users that have made about 100,000 favorites and 200,000 comments.
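A minimal sketch of a generic (non-personalized) content-based recommendation baseline over the talk metadata, assuming the talks have been exported to a CSV named `ted_talks.csv` with at least `title` and `description` columns (the filename and column names are assumptions; the dataset ships in its own format):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Filename and column names are assumptions for illustration.
talks = pd.read_csv("ted_talks.csv")

# Represent each talk by the TF-IDF vector of its description.
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(talks["description"].fillna(""))

# Generic recommendation: the talks most similar to talk 0.
similarities = cosine_similarity(matrix[0], matrix).ravel()
top = similarities.argsort()[::-1][1:4]  # skip the talk itself
print(talks.loc[top, "title"].tolist())
```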
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Web Accessibility Analysis: This model can be used to analyze the accessibility of web pages by identifying different elements and ensuring they follow good practices in design and user accessibility standards, such as having appropriate contrast between text and image, or usage of icons and buttons for UI/UX.
Web Page Redesign: By identifying the classes of elements on a webpage, "Reorganized2" could be used by designers and developers to analyze a current website layout and assist in redesigning a more intuitive and user-friendly interface.
UX Research and Testing: The model can be utilized in user experience (UX) research. It can help in identifying which elements (buttons, icons, dropdowns) on a webpage are getting more attention thus allowing UX designers to create more effective webpages.
Web Scraping: In the field of data mining, the model can serve as a smart web scraper, identifying different elements on a page, thus making web scraping more efficient and targeted rather than pulling irrelevant information.
E-commerce Optimization: "Reorganized2" can be used to analyze various e-commerce websites, spotting common design features amongst the most successful ones, especially regarding the usage and placement of 'cart', 'field', and 'dropdown' elements. These insights can be used to optimize other online retail sites.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of 22 data sets of 50+ requirements each, expressed as user stories.
The dataset has been created by gathering data from web sources; we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].
The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light
This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1
The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.
g02-federalspending.txt
(2018) originates from early data in the Federal Spending Transparency project, which pertains to the website that is used to publicly share the spending data for the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.
g03-loudoun.txt
(2018) is a set of requirements extracted from a document by Loudoun County, Virginia, that describes the to-be user stories and use cases for a system for land management readiness assessment called Loudoun County LandMARC. The source document can be found here and is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.
g04-recycling.txt
(2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset has been obtained from a GitHub website and is at the basis of a students' project on web site design; the code is available (no license).
g05-openspending.txt
(2018) is about the OpenSpending project (www), a project of the Open Knowledge foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.
g11-nsf.txt
(2018) is a collection of user stories from the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.
g08-frictionless.txt
(2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories has been collected in 2016 by GitHub user @danfowler and are stored in a Trello board.
g14-datahub.txt
(2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.
g16-mis.txt
(2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.
g17-cask.txt
(2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.
g18-neurohub.txt
(2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.
g22-rdadmp.txt
(2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub's page.
g23-archivesspace.txt
(2012-2013) refers to ArchivesSpace: an open source, web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its