100+ datasets found
  1. Most popular websites in the Netherlands 2015 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Jun 2, 2017
    Cite
    (2017). Most popular websites in the Netherlands 2015 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3edeb59b-b49b-59cb-9757-9127aed7e8af
    Dataset updated
    Jun 2, 2017
    Area covered
    Netherlands
    Description

    This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as a whitelist for the Newstracker research project, in which we monitored the online web behaviour of a group of respondents.

    The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'.

    For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (the code is available in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist of the websites that were the most popular in 2015. We compiled this list manually using data from DDMM, Alexa and our own research.

    The dataset consists of 5 columns:

    • the URL
    • the type of website: we created a list of types of websites, and each website was manually labeled with 1 category
    • Nieuws-regio: when the category was 'News', we subdivided these websites by regional focus: International, National or Local
    • Nieuws-onderwerp: each website in the category News was further subdivided by type of news website; for this we created our own list of news categories and manually coded each website
    • Bron: for each website we noted which source we used to find it

    The full description of the research design of the Newstracker, including the set-up of this whitelist, is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.

  2. LegitPhish Dataset

    • data.mendeley.com
    Updated Apr 7, 2025
    Cite
    Rachana Potpelwar (2025). LegitPhish Dataset [Dataset]. http://doi.org/10.17632/hx4m73v2sf.1
    Dataset updated
    Apr 7, 2025
    Authors
    Rachana Potpelwar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains 101,219 URLs and 18 features (including the label). The class distribution is:

    • Phishing (0): 63,678 URLs
    • Legitimate (1): 37,540 URLs

    The phishing URLs have been sourced from the URLHaus database and scraped from many sites and other well-known repositories of malicious websites actively used in phishing attacks. Each entry in this subset has been manually verified and is labeled as a phishing URL, making this dataset highly reliable for identifying harmful web content.

    The legitimate URLs have been collected from reputable sources such as Wikipedia and Stack Overflow. These websites are known for hosting user-generated content and community discussions, ensuring that the URLs represent safe, legitimate web addresses. The URLs were randomly scraped to ensure diversity in the types of legitimate sites included.

    Dataset Features:

    • URL: The full web address of each entry, providing the primary feature for analysis.
    • Label: A binary label indicating whether the URL is legitimate (1) or phishing (0).

    Applications:

    This dataset is suitable for training and evaluating machine learning models aimed at distinguishing between phishing and legitimate websites. It can be used in a variety of cybersecurity research projects, including URL-based phishing detection, web content analysis, and the development of real-time protection systems.

    Usage:

    Researchers can leverage this balanced dataset to develop and test algorithms for identifying phishing websites with high accuracy, using features such as URL structure, and class label attributes. The inclusion of both phishing and legitimate URLs provides a comprehensive basis for creating robust models capable of detecting phishing attempts in diverse online environments.

    Feature Name: Description

    • URL: The full URL string.
    • url_length: Total number of characters in the URL.
    • has_ip_address: Binary flag (1/0): whether the URL contains an IP address.
    • dot_count: Number of . characters in the URL.
    • https_flag: Binary flag (1/0): whether the URL uses HTTPS.
    • url_entropy: Shannon entropy of the URL string; higher values indicate more randomness.
    • token_count: Number of tokens/words in the URL.
    • subdomain_count: Number of subdomains in the URL.
    • query_param_count: Number of query parameters (after ?).
    • tld_length: Length of the Top-Level Domain (e.g., "com" = 3).
    • path_length: Length of the path part after the domain.
    • has_hyphen_in_domain: Binary flag (1/0): whether the domain contains a hyphen (-).
    • number_of_digits: Total number of numeric characters in the URL.
    • tld_popularity: Binary flag (1/0): whether the TLD is popular.
    • suspicious_file_extension: Binary flag (1/0): indicates if the URL ends with suspicious extensions (e.g., .exe, .zip).
    • domain_name_length: Length of the domain name.
    • percentage_numeric_chars: Percentage of numeric characters in the URL.
    • ClassLabel: Target label: 1 = Legitimate, 0 = Phishing.
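
    Several of the URL features listed above can be derived from the URL string alone. The following is a minimal Python sketch of how a subset of them might be computed. The listing does not give the authors' exact extraction rules, so the function names (`shannon_entropy`, `url_features`), the IPv4-only check, and the handling of edge cases are assumptions made for illustration, not the dataset's actual pipeline.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse, parse_qsl

# Assumption: "contains an IP address" is approximated as "host is a dotted IPv4 literal".
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of a string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_features(url: str) -> dict:
    """Compute a subset of the listed features for one URL (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "has_ip_address": int(bool(IPV4_RE.match(host))),
        "dot_count": url.count("."),
        "https_flag": int(parsed.scheme == "https"),
        "url_entropy": shannon_entropy(url),
        "query_param_count": len(parse_qsl(parsed.query, keep_blank_values=True)),
        "path_length": len(parsed.path),
        "has_hyphen_in_domain": int("-" in host),
        "number_of_digits": sum(ch.isdigit() for ch in url),
    }


if __name__ == "__main__":
    for name, value in url_features("https://example.com/a/b?x=1&y=2").items():
        print(f"{name}: {value}")
```

    Features such as token_count, subdomain_count and tld_popularity are left out here because they depend on tokenization rules and a TLD list that the description does not specify.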

  3. NFL Play Statistics dataset (secondary)

    • kaggle.com
    Updated Apr 27, 2020
    Cite
    Todd Steussie (2020). NFL Play Statistics dataset (secondary) [Dataset]. https://www.kaggle.com/datasets/toddsteussie/nfl-play-statistics-secondary-datasets
    Dataset updated
    Apr 27, 2020
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Todd Steussie
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The NFL is one of the most popular sports leagues in the world. Many of us are stat geeks who want to understand not just what happened but also who and why. This NFL dataset provides a comprehensive view of NFL games, statistics, participation, and much more. The dataset includes NFL play data from 2004 to the present.

    This NFL dataset provides play-by-play data from the 2004 to 2019 seasons. The dataset also includes play and participation information for players, coaches, and game officials. Additional data tables in this file include the NFL Draft from 1989 to present, the NFL Combine from 1999 to present, NFL rosters from 1998 to present, NFL schedules, stadium information and much more. The granularity of NFL statistics varies by NFL season. The current version of NFL statistics has been collected since 2012. All information sources used to create this dataset are from publicly accessible websites and the NFL GSIS dataset.

    All information sources used to create this dataset are from publicly accessible websites and NFL documentation. Although my current life is focused on data science, this project has a special place in my heart, since it links my previous profession in the NFL with my current passion for data analysis.

  4. Website Screenshots Object Detection Dataset - raw

    • public.roboflow.com
    zip
    Updated Aug 19, 2022
    Cite
    Brad Dwyer (2022). Website Screenshots Object Detection Dataset - raw [Dataset]. https://public.roboflow.com/object-detection/website-screenshots/1
    Available download formats: zip
    Dataset updated
    Aug 19, 2022
    Dataset authored and provided by
    Brad Dwyer
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    Bounding Boxes of elements
    Description

    About This Dataset

    The Roboflow Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1000 of the world's top websites. They have been automatically annotated to label the following classes:

    • button - navigation links, tabs, etc.
    • heading - text that was enclosed in <h1> to <h6> tags.
    • link - inline, textual <a> tags.
    • label - text labeling form fields.
    • text - all other text.
    • image - <img>, <svg>, or <video> tags, and icons.
    • iframe - ads and 3rd party content.

    Example

    This is an example image and annotation from the dataset: https://i.imgur.com/mOG3u3Z.png (Wikipedia screenshot)

    Usage

    Annotated screenshots are very useful in Robotic Process Automation. But they can be expensive to label. This dataset would cost over $4000 for humans to label on popular labeling services. We hope this dataset provides a good starting point for your project. Try it with a model from our model library.

    Collecting Custom Data

    Roboflow is happy to provide a custom screenshots dataset to meet your particular needs. We can crawl public or internal web applications. Just reach out and we'll be happy to provide a quote!

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless. Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility.


  5. Freelance Contracts Dataset (1.3 Million Entries)

    • kaggle.com
    Updated Sep 22, 2024
    Cite
    asaniczka (2024). Freelance Contracts Dataset (1.3 Million Entries) [Dataset]. https://www.kaggle.com/datasets/asaniczka/freelance-contracts-dataset-1-3-million-entries
    Dataset updated
    Sep 22, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    asaniczka
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    The Freelance Contracts Dataset is a robust collection of 1.3 million contracts extracted from a leading freelancing platform, offering significant insights into the dynamics of the freelance economy. This dataset is essential for data analysts, researchers, and business strategists looking to explore the gig economy.

    Key Features:

    • Job Details: Each contract includes job ID, title, start, and end dates.
    • Freelancer Information: Identifies freelancers through a unique ID.
    • Financial Data: Includes total hours worked, total amount paid, and hourly rates.

    Potential Applications:

    • Analyze trends in freelance job postings across various industries.
    • Investigate how project duration relates to earnings and freelancer performance.
    • Understand pricing strategies and budget allocations for freelance projects.

  6. Popularity Dataset for Online Stats Training

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 25, 2020
    Cite
    Rens van de Schoot (2020). Popularity Dataset for Online Stats Training [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3962122
    Dataset updated
    Aug 25, 2020
    Dataset authored and provided by
    Rens van de Schoot
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset used for the online stats training website (https://www.rensvandeschoot.com/tutorials/) and is based on the data used by van de Schoot, van der Velden, Boom, and Brugman (2010).

    The dataset is based on a study that investigates the association between popularity status and antisocial behavior among at-risk adolescents (n = 1491), where gender and ethnic background are moderators of the association. The study distinguished subgroups within the popular status group in terms of overt and covert antisocial behavior. For more information on the sample, instruments, methodology, and research context, we refer interested readers to van de Schoot, van der Velden, Boom, and Brugman (2010).

    Variable name Description

    Respnr = Respondents’ number

    Dutch = Respondents’ ethnic background (0 = Dutch origin, 1 = non-Dutch origin)

    gender = Respondents’ gender (0 = boys, 1 = girls)

    sd = Adolescents’ socially desirable answering patterns

    covert = Covert antisocial behavior

    overt = Overt antisocial behavior

  7. Dataset Search WebApp

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Angelo Batista Neves Júnior; Luiz André Portes Paes Leme (2023). Dataset Search WebApp [Dataset]. http://doi.org/10.6084/m9.figshare.5217958.v2
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Angelo Batista Neves Júnior; Luiz André Portes Paes Leme
    License

    https://www.gnu.org/copyleft/gpl.html

    Description

    Although extensive lists of open datasets are available in catalogues, most data publishers still connect their datasets to other popular datasets, such as DBpedia, Freebase and Geonames. Although linkage with popular datasets would allow us to explore external resources, it would fail to cover highly specialized information. Catalogues of linked data describe the content of datasets in terms of update periodicity, authors, SPARQL endpoints, linksets with other datasets, among others, as recommended by the W3C VoID Vocabulary. However, catalogues by themselves do not provide any explicit information to help the URI linkage process.

    Searching techniques can rank available datasets S_i according to the probability that it will be possible to define links between URIs of S_i and a given dataset T to be published, so that most of the links, if not all, could be found by inspecting the most relevant datasets in the ranking. dataset-search is a tool for searching datasets for linkage.

  8. Data from: Tag Recommendation Datasets

    • figshare.com
    txt
    Updated Jan 25, 2016
    Cite
    Fabiano Belem (2016). Tag Recommendation Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.2067183.v4
    Available download formats: txt
    Dataset updated
    Jan 25, 2016
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Fabiano Belem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Associative Tag Recommendation Exploiting Multiple Textual Features

    Fabiano Belem, Eder Martins, Jussara M. Almeida, Marcos Goncalves. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, July 2011.

    Abstract

    This work addresses the task of recommending relevant tags to a target object by jointly exploiting three dimensions of the problem: (i) term co-occurrence with tags preassigned to the target object, (ii) terms extracted from multiple textual features, and (iii) several metrics of tag relevance. In particular, we propose several new heuristic methods, which extend previous, highly effective and efficient, state-of-the-art strategies by including new metrics that try to capture how accurately a candidate term describes the object’s content. We also exploit two learning to rank techniques, namely RankSVM and Genetic Programming, for the task of generating ranking functions that combine multiple metrics to accurately estimate the relevance of a tag to a given object. We evaluate all proposed methods in various scenarios for three popular Web 2.0 applications, namely, LastFM, YouTube and YahooVideo. We found that our new heuristics greatly outperform the methods on which they are based, producing gains in precision of up to 181%, as well as another state-of-the-art technique, with improvements in precision of up to 40% over the best baseline in any scenario. Some further improvements can also be achieved, in some scenarios, with the new learning-to-rank based strategies, which have the additional advantage of being quite flexible and easily extensible to exploit other aspects of the tag recommendation problem.

    Bibtex Citation

    @inproceedings{belem@sigir11,
      author = {Fabiano Bel\'em and Eder Martins and Jussara Almeida and Marcos Gon\c{c}alves},
      title = {Associative Tag Recommendation Exploiting Multiple Textual Features},
      booktitle = {{Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (SIGIR'11)}},
      month = {{July}},
      year = {2011}
    }

  9. UI/UX user interaction dataset across popular digital platforms

    • data.mendeley.com
    Updated Nov 19, 2024
    Cite
    Md Atikur Rahman (2024). UI/UX user interaction dataset across popular digital platforms [Dataset]. http://doi.org/10.17632/dxthxmnkhx.6
    Dataset updated
    Nov 19, 2024
    Authors
    Md Atikur Rahman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset comprises 2,271 entries and provides insights into user interface (UI) and user experience (UX) preferences across various digital platforms. Key information includes user demographics (Name, Age, Gender) and platform preferences (e.g., Twitter, YouTube, Facebook, Website). It captures user experiences and satisfaction levels with various UI/UX elements such as color schemes, visual hierarchy, typography, multimedia usage, and layout design. The dataset also includes evaluations of mobile responsiveness, call-to-action buttons, form usability, feedback/error messages, loading speed, personalization, accessibility, and interactions (like scrolling behavior and gestures). Each UI/UX component is rated on a scale, allowing for quantitative analysis of user preferences and experiences, making this dataset valuable for research in user-centered design and usability optimization.

  10. Corporate Website — Analytics — Popular pages

    • data.qld.gov.au
    • researchdata.edu.au
    html
    Updated Oct 20, 2025
    Cite
    Brisbane City Council (2025). Corporate Website — Analytics — Popular pages [Dataset]. https://www.data.qld.gov.au/dataset/corporate-website-analytics-popular-pages
    Available download formats: html
    Dataset updated
    Oct 20, 2025
    Dataset authored and provided by
    Brisbane City Council
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is available on Brisbane City Council’s open data website – data.brisbane.qld.gov.au. The site provides additional features for viewing and interacting with the data and for downloading the data in various formats.

    Monthly analytics reports for the Brisbane City Council website

    Information regarding the sessions for Brisbane City Council website during the month including page views and unique page views.

  11. Behance Community Art Data

    • cseweb.ucsd.edu
    json
    Cite
    UCSD CSE Research Project, Behance Community Art Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Available download formats: json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    Likes and image data from the community art website Behance. This is a small, anonymized, version of a larger proprietary dataset.

    Metadata includes

    • appreciates (likes)

    • timestamps

    • extracted image features

    Basic Statistics:

    • Users: 63,497

    • Items: 178,788

    • Appreciates (likes): 1,000,000

  12. Dataset used for HTTPS traffic classification using packet burst statistics

    • data.niaid.nih.gov
    Updated Apr 11, 2022
    Cite
    Cejka Tomas (2022). Dataset used for HTTPS traffic classification using packet burst statistics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4911550
    Dataset updated
    Apr 11, 2022
    Dataset provided by
    Hynek Karel
    Cejka Tomas
    Tropkova Zdena
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We are publishing a dataset we created for the HTTPS traffic classification.

    Since the data were captured mainly in the real backbone network, we omitted IP addresses and ports. The datasets consist of statistics calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.

    During our research, we divided HTTPS into five categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, W -- Website, and other traffic.

    We have chosen the service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the most popular 500 websites for each category. We also used several popular websites that primarily focus on the audience in our country. The identified traffic classes and their representatives are provided below:

    • Live Video Stream: Twitch, Czech TV, YouTube Live
    • Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
    • Music Player: AppleMusic, Spotify, SoundCloud
    • File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
    • Website and Other Traffic: Websites from Alexa Top 1M list

  13. G2 Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Cite
    Bright Data, G2 Dataset [Dataset]. https://brightdata.com/products/datasets/g2
    Available download formats: .json, .csv, .xlsx
    Dataset authored and provided by
    Bright Data: https://brightdata.com/
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Use our G2 dataset to collect product descriptions, ratings, reviews, and pricing information from the world's largest tech marketplace. You may purchase a full or partial dataset depending on your business needs. The G2 Software Products Dataset, with a focus on top-rated products, serves as a valuable resource for software buyers, businesses, and technology enthusiasts. This use case highlights products that have received exceptional ratings and positive reviews on the G2 platform, offering insights into customer satisfaction and popularity. For software buyers, this dataset acts as a trusted guide, presenting a curated selection of G2's top-rated software products, ensuring a higher likelihood of satisfaction with purchases. Businesses and technology professionals can leverage this dataset to identify popular and well-reviewed software solutions, optimizing their decision-making process. This use case emphasizes the dataset's utility for those specifically interested in exploring and acquiring top-rated software products from G2's Product Overview.

    The G2 software products and reviews dataset offers a detailed and thorough overview of leading software companies. The dataset includes all major data points:

    • Product descriptions
    • Average rating (1-5)
    • Sellers number of reviews
    • Key features (highest and lowest rated)
    • Competitors
    • Website & social media links
    • and more

  14. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jun 19, 2024
    Cite
    Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Jun 19, 2024
    Dataset authored and provided by
    Bright Data: https://brightdata.com/
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  15. Data from: Activity Sessions datasets

    • figshare.com
    bz2
    Updated Jun 2, 2023
    Cite
    Aaron Halfaker; Os Keyes; Daniel Kluver; Jacob Thebault-Spieker; Tien Nguyen; Kenneth Shores; Anuradha Uduwage; Morten Warncke-Wang (2023). Activity Sessions datasets [Dataset]. http://doi.org/10.6084/m9.figshare.1291033.v1
    Available download formats: bz2
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Aaron Halfaker; Os Keyes; Daniel Kluver; Jacob Thebault-Spieker; Tien Nguyen; Kenneth Shores; Anuradha Uduwage; Morten Warncke-Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This article contains a set of datasets used to demonstrate a strong regularity in inter-activity time.
    See the paper: User Session Identification Based on Strong Regularities in Inter-activity Time (http://arxiv.org/abs/1411.2878)

    Abstract

    Session identification is a common strategy used to develop metrics for web analytics and behavioral analyses of user-facing systems. Past work has argued that session identification strategies based on an inactivity threshold are inherently arbitrary, or advocated that thresholds be set at about 30 minutes. In this work, we demonstrate a strong regularity in the temporal rhythms of user-initiated events across several different domains of online activity (incl. video gaming, search, page views and volunteer contributions). We describe a methodology for identifying clusters of user activity and argue that the regularity with which these activity clusters appear implies a good rule-of-thumb inactivity threshold of about 1 hour. We conclude with implications that these temporal rhythms may have for system design based on our observations and theories of goal-directed human activity.
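
    As an illustration of threshold-based session identification, here is a minimal Python sketch that groups one user's event timestamps into sessions using the 1-hour rule of thumb mentioned in the abstract. The function name `split_sessions`, the input format (a list of `datetime` objects for a single user), and the choice to treat a gap exactly equal to the threshold as part of the same session are assumptions made for this example; the paper's own clustering methodology is not reproduced here.

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, threshold=timedelta(hours=1)):
    """Group event timestamps into sessions.

    A new session starts whenever the gap since the previous event
    exceeds `threshold` (assumed default: 1 hour).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= threshold:
            # Gap is within the inactivity threshold: same session.
            sessions[-1].append(ts)
        else:
            # First event, or gap too long: open a new session.
            sessions.append([ts])
    return sessions


if __name__ == "__main__":
    # Hypothetical events for one user on a single day.
    events = [
        datetime(2020, 1, 1, 9, 0),
        datetime(2020, 1, 1, 9, 20),
        datetime(2020, 1, 1, 10, 10),
        datetime(2020, 1, 1, 12, 0),
        datetime(2020, 1, 1, 12, 30),
    ]
    for i, session in enumerate(split_sessions(events), start=1):
        print(f"session {i}: {len(session)} events, "
              f"{session[0]:%H:%M} to {session[-1]:%H:%M}")
```

    Passing `threshold=timedelta(minutes=30)` instead would apply the 30-minute convention the abstract attributes to past work, which makes it easy to compare how the two cutoffs partition the same activity log.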

  16. Song Features Dataset - Regressing Popularity

    • kaggle.com
    Updated Jan 19, 2023
    Cite
    Ayush Oturkar (2023). Song Features Dataset - Regressing Popularity [Dataset]. https://www.kaggle.com/datasets/ayushnitb/song-features-dataset-regressing-popularity
    Dataset updated
    Jan 19, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Ayush Oturkar
    Description

    Introduction

    Spotify for Developers offers a wide range of possibilities to utilize the extensive catalog of Spotify data. One of them is the set of audio features calculated for each song and made available via the official Spotify Web API.

    This is an attempt to retrieve Spotify data from after the last extraction. I haven't fully tested whether Spotify allowed any other full API request after 2019.

    About

    Each song (row) has values for artist name, track name, track id and the audio features themselves (for more information about the audio features, check out this doc from Spotify).

    Additionally, there is a popularity feature included in this dataset. Please note that Spotify recalculates this value based on the number of plays the track receives, so it might no longer be the correct value when you access the data.

    Key Questions/Hypotheses that can be Answered

    1. Are songs in major mode more popular than ones in minor?
    2. Are songs with high loudness the most popular?
    3. Do most people like listening to songs with shorter duration?

    In addition, more detailed analysis can be done to see what causes a song to be popular.

    Credit

    Entire credit goes to Spotify for providing this data via their Web API.

    https://developer.spotify.com/documentation/web-api/reference/tracks/get-track/

  17. Top 20 Programming Languages 2021

    • kaggle.com
    Updated Jan 31, 2021
    Cite
    Jyot Makadiya (2021). Top 20 Programming Languages 2021 [Dataset]. https://www.kaggle.com/datasets/jyotmakadiya/top-20-programming-languages-2021/discussion
    Dataset updated
    Jan 31, 2021
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Jyot Makadiya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    We all love programming!! This dataset is an attempt to look at the trends of different programming languages in the last 1 year.

    Content

    I web-scraped this dataset from the TIOBE website. It contains the popularity of 20 programming languages and their popularity changes over the past year. The world rank of each language in Jan 2020 and Jan 2021 is also given.

    Acknowledgements

    I want to thank you all for contributing to this trend (remember, a contribution is never insignificant)!!

    Inspiration

    This dataset can be used to predict the future trend of popular programming languages. Because it provides insights into annual change, it can be framed as a regression task to predict by how much one language might overtake another.

    Have a nice day!!

  18. NetForager: Geographically-Distributed Dataset of Traffic Captures for 15...

    • dataverse.tdl.org
    7z, bin
    Updated Jun 1, 2020
    Levent Dane; Deniz Gurkan (2020). NetForager: Geographically-Distributed Dataset of Traffic Captures for 15 Popular Web Applications [Dataset]. http://doi.org/10.18738/T8/OPWBMN
    Explore at:
    Available download formats: bin (2147483648), 7z (1206435356), bin (2095046196)
    Dataset updated
    Jun 1, 2020
    Dataset provided by
    Texas Data Repository
    Authors
    Levent Dane; Deniz Gurkan
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    Aug 1, 2019 - Oct 15, 2019
    Description

    The traffic captures (network packets in pcap format) were collected hourly over 11 weeks during August-October 2019 from a selection of web sites by directly retrieving the content that loads for each site address. The dataset also includes a collection of JSON files extracted from the network capture files after a preliminary analysis of conversations. The data is restricted to pure web site content loads and their associated conversations, with no OS-specific communications in the pcaps. Each captured packet contains at most 200 bytes.

  19. Web Page Object Detection Dataset

    • universe.roboflow.com
    zip
    Updated Aug 2, 2025
    web page summarizer (2025). Web Page Object Detection Dataset [Dataset]. https://universe.roboflow.com/web-page-summarizer/web-page-object-detection/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 2, 2025
    Dataset authored and provided by
    web page summarizer
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Web Page Elements Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Web Accessibility Improvement: The "Web Page Object Detection" model can be used to identify and label various elements on a web page, making it easier for people with visual impairments to navigate and interact with websites using screen readers and other assistive technologies.

    2. Web Design Analysis: The model can be employed to analyze the structure and layout of popular websites, helping web designers understand best practices and trends in web design. This information can inform the creation of new, user-friendly websites or redesigns of existing pages.

    3. Automatic Web Page Summary Generation: By identifying and extracting key elements, such as titles, headings, content blocks, and lists, the model can assist in generating concise summaries of web pages, which can aid users in their search for relevant information.

    4. Web Page Conversion and Optimization: The model can be used to detect redundant or unnecessary elements on a web page and suggest their removal or modification, leading to cleaner designs and faster-loading pages. This can improve user experience and, potentially, search engine rankings.

    5. Assisting Web Developers in Debugging and Testing: By detecting web page elements, the model can help identify inconsistencies or errors in a site's code or design, such as missing or misaligned elements, allowing developers to quickly diagnose and address these issues.

  20. Developer Expertise Dataset on JavaScript Libraries

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Montandon, João Eduardo; Silva, Luciana Lourdes; Valente, Marco Tulio (2020). Developer Expertise Dataset on JavaScript Libraries [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1484497
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    UFMG
    IFMG
    Authors
    Montandon, João Eduardo; Silva, Luciana Lourdes; Valente, Marco Tulio
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains an anonymized list of surveyed developers who provided their expertise level on three popular JavaScript libraries:

    ReactJS, a library for building enriched web interfaces

    MongoDB, a driver for accessing MongoDB databases

    Socket.IO, a library for realtime communication
