https://webtechsurvey.com/terms
A complete list of live websites using the WordPress technology, compiled through global website indexing conducted by WebTechSurvey.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Web Accessibility Improvement: The "Web Page Object Detection" model can be used to identify and label various elements on a web page, making it easier for people with visual impairments to navigate and interact with websites using screen readers and other assistive technologies.
Web Design Analysis: The model can be employed to analyze the structure and layout of popular websites, helping web designers understand best practices and trends in web design. This information can inform the creation of new, user-friendly websites or redesigns of existing pages.
Automatic Web Page Summary Generation: By identifying and extracting key elements, such as titles, headings, content blocks, and lists, the model can assist in generating concise summaries of web pages, which can aid users in their search for relevant information.
Web Page Conversion and Optimization: The model can be used to detect redundant or unnecessary elements on a web page and suggest their removal or modification, leading to cleaner designs and faster-loading pages. This can improve user experience and, potentially, search engine rankings.
Assisting Web Developers in Debugging and Testing: By detecting web page elements, the model can help identify inconsistencies or errors in a site's code or design, such as missing or misaligned elements, allowing developers to quickly diagnose and address these issues.
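The use cases above all assume a detector trained on web-page screenshots. As a minimal sketch only, the snippet below shows how such a model could be applied once trained and exported in YOLO format; the weight file "webpage_elements.pt" and the screenshot path are placeholders, not files shipped with the dataset.

```python
# Minimal sketch, assuming a detector has been trained on the "Web Page Object
# Detection" dataset and exported as YOLO-format weights. The weight file
# "webpage_elements.pt" and the screenshot path are hypothetical placeholders.
from ultralytics import YOLO

model = YOLO("webpage_elements.pt")          # hypothetical trained weights
results = model("page_screenshot.png")       # run detection on a page screenshot

# Print each detected element with its class name and bounding box.
for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"{cls_name}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f})")
```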
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided dataset includes 11430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine-learning-based phishing detection systems. The features come from three different classes: 56 are extracted from the structure and syntax of the URLs, 24 from the content of the corresponding pages, and 7 by querying external services. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs. Alongside the dataset, we provide the Python scripts used to extract the features for potential replication or extension.
dataset_A: contains a list of URLs together with their DOM tree objects, which can be used for replication and for experimenting with new URL- and content-based features, overcoming the short lifespan of phishing web pages.
dataset_B: contains the extracted feature values, which can be used directly as input to classifiers for evaluation. Note that the data in this dataset are indexed by URL, so the index must be removed before experimentation (see the sketch below).
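A minimal sketch of that workflow follows, assuming dataset_B has been saved as a CSV with URLs in the first column; the file name "dataset_B.csv" and the label column name "status" are assumptions, not part of the documented schema.

```python
# Minimal sketch (not the authors' scripts): load dataset_B, drop the URL
# index, and train a baseline classifier. File and column names are assumed.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("dataset_B.csv", index_col=0)   # URLs as index (assumed layout)
df = df.reset_index(drop=True)                   # remove the URL index before training

X = df.drop(columns=["status"])                  # the 87 extracted features
y = df["status"]                                 # phishing vs. legitimate label (assumed name)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```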
The datasets were constructed in May 2020. Due to the large size of dataset_A, only a sample is provided; it will be divided into sample files and uploaded one by one. If you urgently need a full copy, please contact the author directly at: hannousse.abdelhakim@univ-guelma.dz
https://webtechsurvey.com/terms
A complete list of live websites using the Simple File List technology, compiled through global website indexing conducted by WebTechSurvey.
https://webtechsurvey.com/terms
A complete list of live websites using the Web Page Maker technology, compiled through global website indexing conducted by WebTechSurvey.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the list of websites from which TIDE-UPF extracted the CS projects information.
Number and list of central government open websites – 474 as of 13 February 2013.
Information was reported as correct by central government departments as of 13 February 2013.
The Cabinet Office committed to begin quarterly publication of the number of open websites starting in financial year 2011.
The definition used of a website is a user-centric one. Something is counted as a separate website if it is active and either has a separate domain name or, when it is a subdomain, the user cannot move freely between the subsite and the parent site and there is no family likeness in the design. In other words, if the user experiences it as a separate site in their normal uses of browsing, search and interaction, it is counted as one.
A website is considered closed when it ceases to be actively funded, run and managed by central government, either by packaging information and putting it in the right place for the intended audience on another website or digital channel, or by a third party taking and managing it and bearing the cost. Where appropriate, domains stay operational in order to redirect users to the UK Government Website Archive (http://www.nationalarchives.gov.uk/webarchive/).
Since the previous quarterly report of 22 October 2012, an extra 124 sites have been reported. This increase is due to a change in the scope of the audit, as the Government Digital Service (GDS) felt that the previous method of using The National Archives database to source this information was not capturing the required data sufficiently or accurately. The new process and scope have resulted in more websites being included, e.g. Directgov URLs, dot independent sites and national parks. Also, the latest GOV.UK exemption process has brought to our attention many more sites than we were previously aware of.
The GOV.UK exemption process began with a web rationalisation of the government’s Internet estate to reduce the number of obsolete websites and to establish the scale of the websites that the government owns.
Not included in the number or list are websites of public corporations as listed on the Office for National Statistics website, partnerships more than half-funded by the private sector, charities and national museums. Specialist closed-audience functions, such as the BIS Research Councils, BIS Sector Skills Councils and Industrial Training Boards, and the Defra Levy Boards and their websites, are not included in this data. The Ministry of Defence conducted their own rationalisation of MOD and armed forces sites as an integral part of the Website Review; military sites belonging to a particular service are excluded from this dataset. Finally, those public bodies set up by Parliament and reporting directly to the Speaker's Committee, and only reporting through a ministerial government department for the purposes of enactment of legislation, are also excluded (for example, the Electoral Commission and IPSA).
Websites are listed under the department name for which the minister in HMG has responsibility, either directly through their departmental activities, or indirectly through being the minister reporting to Parliament for independent bodies set up by statute.
For re-usability, these are provided as Excel and CSV files.
Following the list of websites in ‘.gouv.fr’ generated in the GitHub repository gouvfrlist, here is a list of websites and web services in ‘.gouv.fr’. It made it possible to produce a graph representation of domains and subdomains by ministry and (deconcentrated) administration. We also relied on the list of the top 250 administrative procedures and the list of ‘.gouv.fr’ sites dating from 2014.
Graphic representation of .gouv.fr websites
### GitHub repository
The project description and dataset are available in the GitHub repository graph-gouv-en by jbledevehat.
### Legend
The objects represented are:
— The President of the French Republic and the Prime Minister, qualified as “Person” (in blue)
— Departments or administrative branches (in yellow)
— Websites (in green)
— The subdomains of these websites (in orange)
— Online services (in red)
— Citizen consultation sites (in pink)
— Websites and services that are either archived or inactive (in black)
— Undefined (incoherent) websites (in grey)
### Web publications
This representation is available in the KUMU application at the following address: https://kumu.io/jbledevehat/sites-web-gouvfr#liste-des-sites-web-en-gouvfr-v1
In November 2024, Google.com was the most popular website worldwide with 136 billion average monthly visits. The online platform has held the top spot as the most popular website since June 2010, when it pulled ahead of Yahoo into first place. Second-ranked YouTube generated more than 72.8 billion monthly visits in the measured period.

The internet leaders: search, social, and e-commerce. Social networks, search engines, and e-commerce websites shape the online experience as we know it. While Google leads the global online search market by far, YouTube and Facebook have become the world’s most popular websites for user-generated content, solidifying Alphabet’s and Meta’s leadership over the online landscape. Meanwhile, websites such as Amazon and eBay generate millions in profits from the sale and distribution of goods, making the e-market sector an integral part of the global retail scene.

What is next for online content? Powering social media and websites like Reddit and Wikipedia, user-generated content keeps moving the internet’s engines. However, the rise of generative artificial intelligence will bring significant changes to how online content is produced and handled. ChatGPT is already transforming how online search is performed, and news of Google's 2024 deal for licensing Reddit content to train large language models (LLMs) signals that the internet is likely to go through a new revolution. While AI's impact on the online market might bring both opportunities and challenges, effective content management will remain crucial for profitability on the web.
The Fermi Gamma-ray Space Telescope (Fermi) Large Area Telescope (LAT) is a successor to EGRET, with greatly improved sensitivity, resolution, and energy range. This web page presents the first full catalog of LAT sources, based on the first eleven months of survey data. For a full explanation about the catalog and its construction see the LAT 1-year Catalog Paper.
Number and list of central government open websites – 455 as at 31 December 2013.
The Cabinet Office committed to begin quarterly publication of the number of open websites starting in the financial year 2011.
The definition used is a user-centric one. Something is counted as a separate website if it is active and either has a separate domain name or, when it is a subdomain, the user cannot move freely between the subsite and the parent site and there is no family likeness in the design. In other words, if the user experiences it as a separate site in their normal uses of browsing, search and interaction, it is counted as one.
A website is considered closed when it ceases to be actively funded, run and managed by central government, either by packaging information and putting it in the right place for the intended audience on another website or digital channel, or by a third party taking and managing it and bearing the cost. Where appropriate, domains stay operational in order to redirect users to the UK Government Website Archive (http://www.nationalarchives.gov.uk/webarchive/).
The GOV.UK exemption process began with a web rationalisation of the government’s internet estate to reduce the number of obsolete websites and to establish the scale of the websites that the government owns.
Not included in the number or list are:
Finally, those public bodies set up by Parliament and reporting directly to the Speaker’s Committee are also excluded (for example, the Electoral Commission and IPSA).
As agreed in the quarterly report of February 2013, the following sites have been included in the list:
Websites are listed under the department name for which the government minister has responsibility, either directly through their departmental activities, or indirectly through being the minister reporting to Parliament for independent bodies set up by statute.
Government website domains have been procured from as early as the 1990s, and at that time there was no requirement upon government departments to retain a formal record of ownership. With staff changes and new departments formed, it became apparent that departments did not have a complete view of all sites in their estate.
The Government Digital Service (GDS) has worked closely with these departments to identify legacy websites which we were not originally aware of, by going through the complete list of gov.uk domains managed by the Cabinet Office under the second-level domain (SLD) gov.uk. A full list of gov.uk domains can be viewed here. As well as websites on the gov.uk SLD, we found that there are a number of legacy websites owned by departments under a .org.uk or .co.uk SLD. Because we do not own these SLDs, information on whether a department has ownership was not so easily accessible, but a strong working relationship with department leads has since helped to identify the majority of these sites.
Previously, the Ministry of Defence conducted their own rationalisation of MOD and the armed forces sites. At the beginning of this report, we agreed to include these sites to ensure a consistent approach.
Since the last report of Oct 2013, 19 websites have closed and 18 have migrated to the government’s website, GOV.UK. As government websites migrate to GOV.UK, the responsibility for reporting a department’s content will become part of overall GOV.UK reporting.
As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, while content in the German language followed, with 5.6 percent.

English as the leading online language. The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

Global internet usage by regions. As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America were leading in terms of internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.
In November 2024, Google.com was the most visited website in the United States, with over 25 billion total visits. YouTube.com came in second with 12 billion total visits. Reddit.com and Amazon.com counted approximately 3.12 billion and 2.89 billion monthly visits, respectively, from U.S. online audiences.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as a whitelist for the Newstracker research project, in which we monitored the online web behaviour of a group of respondents. The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'. For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (the code is available in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist of the websites that were the most popular in 2015. We compiled this list manually using data from DDMM, Alexa and our own research. The dataset consists of 5 columns:
- the URL
- the type of website: we created a list of website types and each website was manually labelled with 1 category
- Nieuws-regio: when the category was 'News', we subdivided these websites by regional focus: International, National or Local
- Nieuws-onderwerp: furthermore, each website under the category News was further subdivided by type of news website; for this we created our own list of news categories and manually coded each website
- Bron: for each website we noted which source we used to find it.
The full description of the research design of the Newstracker, including the set-up of this whitelist, is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
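As a brief illustration of working with the five columns described above, the sketch below loads the whitelist and counts the News websites by regional focus; the file name "whitelist.csv" and the exact column headers are assumptions, not the published schema.

```python
# Minimal sketch, not part of the original Newstracker project. Column names
# follow the description above but are assumptions about the actual file.
import pandas as pd

cols = ["URL", "Type", "Nieuws-regio", "Nieuws-onderwerp", "Bron"]
whitelist = pd.read_csv("whitelist.csv", names=cols, header=0)

# Select the websites labelled 'News' and count them by regional focus.
news = whitelist[whitelist["Type"] == "News"]
print(news["Nieuws-regio"].value_counts())
```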
List of Health and Human Services facilities and available programs, contact information, hours of operation and web-page links. This dataset is updated on an as-needed basis.
A Civil Service List is considered terminated usually four years after the list has been established, unless it is extended at the Commissioner’s discretion. For more information visit DCAS’ “Work for the City” webpage at: https://www1.nyc.gov/site/dcas/employment/take-an-exam.page.
In November 2024, Google.com was the most popular website worldwide with approximately 6.25 billion unique monthly visitors. YouTube.com was ranked second with an estimated 3.64 billion unique monthly visitors. Both websites are among the most visited websites worldwide.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Researchers from the Czech Republic are publishing a dataset for HTTPS traffic classification.
Since the data were captured mainly on a real backbone network, IP addresses and ports were omitted. The datasets consist of features calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
During the research, they divided HTTPS traffic into the following categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, and W -- Website and other traffic.
They chose service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the 500 most popular websites for each category. They also used several popular websites that primarily focus on the Czech audience. The identified traffic classes and their representatives are provided below:
- Live Video Stream: Twitch, Czech TV, YouTube Live
- Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
- Music Player: AppleMusic, Spotify, SoundCloud
- File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
- Website and Other Traffic: Websites from the Alexa Top 1M list
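The flows exported this way can feed a standard multi-class classifier. The sketch below is illustrative only, assuming the flow features have been saved to a CSV with a "label" column holding the class codes; the file name and column layout are assumptions, not the dataset's documented format.

```python
# Minimal sketch for classifying HTTPS flows into the traffic classes above.
# "https_flows.csv" and its column names are assumed, not taken from the dataset docs.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

flows = pd.read_csv("https_flows.csv")
X = flows.drop(columns=["label"])      # e.g. packet-length/time statistics per flow
y = flows["label"]                     # L, P, M, U, D, or W

clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X, y, cv=5)
print("mean cross-validation accuracy:", scores.mean())
```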
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Advertisement identification and filtering in web pages gain significance due to various factors such as accessibility, security, privacy, and obtrusiveness. Current practices in this direction involve maintaining URL-based regular expressions called filter lists. Each URL obtained on a web page is matched against this filter list. While effective, this procedure lacks scalability as it demands regular maintenance of the filter list. To counter these limitations, we devise a machine-learning-based advertisement detection system using a diverse feature set which can distinguish advertisement blocks from non-advertisement blocks. The method can act as a base to provide various accessibility-related features such as smooth browsing and text summarization for persons with visual impairments, cognitive impairments, and photosensitive epilepsy. The results from a classifier trained on the proposed feature set achieve 93.4% accuracy in identifying advertisements.
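The paper's actual feature set and classifier are not reproduced here; as an illustrative sketch only, the snippet below shows the general shape of such a block-level ad detector, where each page block is represented by a numeric feature vector and labelled ad or non-ad. The synthetic data stands in for real extracted blocks.

```python
# Illustrative sketch only -- not the authors' system. Each block is a numeric
# feature vector (e.g. geometry, link density, iframe presence) with an
# ad (1) / non-ad (0) label; synthetic data is used as a placeholder.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((1000, 8))                       # 1000 blocks, 8 hypothetical features
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)       # synthetic ad/non-ad labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("accuracy on synthetic data:", accuracy_score(y_test, clf.predict(X_test)))
```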
https://webtechsurvey.com/terms
A complete list of live websites using the WordPress technology, compiled through global website indexing conducted by WebTechSurvey.