https://webtechsurvey.com/terms
A complete list of live websites using the WordPress technology, compiled through global website indexing conducted by WebTechSurvey.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Web Accessibility Improvement: The "Web Page Object Detection" model can be used to identify and label various elements on a web page, making it easier for people with visual impairments to navigate and interact with websites using screen readers and other assistive technologies.
Web Design Analysis: The model can be employed to analyze the structure and layout of popular websites, helping web designers understand best practices and trends in web design. This information can inform the creation of new, user-friendly websites or redesigns of existing pages.
Automatic Web Page Summary Generation: By identifying and extracting key elements, such as titles, headings, content blocks, and lists, the model can assist in generating concise summaries of web pages, which can aid users in their search for relevant information.
Web Page Conversion and Optimization: The model can be used to detect redundant or unnecessary elements on a web page and suggest their removal or modification, leading to cleaner designs and faster-loading pages. This can improve user experience and, potentially, search engine rankings.
Assisting Web Developers in Debugging and Testing: By detecting web page elements, the model can help identify inconsistencies or errors in a site's code or design, such as missing or misaligned elements, allowing developers to quickly diagnose and address these issues.
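The use cases above all assume a detector trained on web-page screenshots. As a minimal sketch only, the snippet below shows how such a model could be applied once trained and exported in YOLO format; the weight file "webpage_elements.pt" and the screenshot path are placeholders, not files shipped with the dataset.

```python
# Minimal sketch, assuming a detector has been trained on the "Web Page Object
# Detection" dataset and exported as YOLO-format weights. The weight file
# "webpage_elements.pt" and the screenshot path are hypothetical placeholders.
from ultralytics import YOLO

model = YOLO("webpage_elements.pt")          # hypothetical trained weights
results = model("page_screenshot.png")       # run detection on a page screenshot

# Print each detected element with its class name and bounding box.
for box in results[0].boxes:
    cls_name = model.names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"{cls_name}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f})")
```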
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided dataset includes 11430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine-learning-based phishing detection systems. The features come from three different classes: 56 are extracted from the structure and syntax of the URLs, 24 from the content of the corresponding pages, and 7 by querying external services. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs. Alongside the dataset, we provide the Python scripts used to extract the features for potential replication or extension.
dataset_A: contains a list of URLs together with their DOM tree objects, which can be used for replication and for experimenting with new URL- and content-based features, overcoming the short lifespan of phishing web pages.
dataset_B: contains the extracted feature values, which can be used directly as input to classifiers for evaluation. Note that the data in this dataset are indexed by URL, so the index must be removed before experimentation (see the sketch below).
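A minimal sketch of that workflow follows, assuming dataset_B has been saved as a CSV with URLs in the first column; the file name "dataset_B.csv" and the label column name "status" are assumptions, not part of the documented schema.

```python
# Minimal sketch (not the authors' scripts): load dataset_B, drop the URL
# index, and train a baseline classifier. File and column names are assumed.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("dataset_B.csv", index_col=0)   # URLs as index (assumed layout)
df = df.reset_index(drop=True)                   # remove the URL index before training

X = df.drop(columns=["status"])                  # the 87 extracted features
y = df["status"]                                 # phishing vs. legitimate label (assumed name)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```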
The datasets were constructed in May 2020. Due to the large size of dataset_A, only a sample is provided; it will be divided into sample files and uploaded one by one. If you urgently need a full copy, please contact the author directly at: hannousse.abdelhakim@univ-guelma.dz
https://webtechsurvey.com/terms
A complete list of live websites using the Simple File List technology, compiled through global website indexing conducted by WebTechSurvey.
https://webtechsurvey.com/terms
A complete list of live websites using the Web Page Maker technology, compiled through global website indexing conducted by WebTechSurvey.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the list of websites from which TIDE-UPF extracted the CS projects information.
Number and list of central government open websites – 474 as of 13 February 2013.
Information was reported as correct by central government departments as of 13 February 2013.
The Cabinet Office committed to begin quarterly publication of the number of open websites starting in financial year 2011.
The definition used of a website is a user-centric one. Something is counted as a separate website if it is active and either has a separate domain name or, when it is a subdomain, the user cannot move freely between the subsite and the parent site and there is no family likeness in the design. In other words, if the user experiences it as a separate site in their normal uses of browsing, search and interaction, it is counted as one.
A website is considered closed when it ceases to be actively funded, run and managed by central government, either by packaging information and putting it in the right place for the intended audience on another website or digital channel, or by a third party taking and managing it and bearing the cost. Where appropriate, domains stay operational in order to redirect users to the UK Government Website Archive (http://www.nationalarchives.gov.uk/webarchive/).
Since the previous quarterly report of 22 October 2012, an extra 124 sites have been reported. This increase is due to a change in the scope of the audit, as the Government Digital Service (GDS) felt that the previous method of using The National Archives database to source this information was not capturing the required data sufficiently or accurately. The new process and scope have resulted in more websites being included, e.g. Directgov URLs, dot independent sites and national parks. Also, the latest GOV.UK exemption process has brought to our attention many more sites than we were previously aware of.
The GOV.UK exemption process began with a web rationalisation of the government’s Internet estate to reduce the number of obsolete websites and to establish the scale of the websites that the government owns.
Not included in the number or list are websites of public corporations as listed on the Office for National Statistics website, partnerships more than half-funded by the private sector, charities and national museums. Specialist closed-audience functions, such as the BIS Research Councils, BIS Sector Skills Councils and Industrial Training Boards, and the Defra Levy Boards and their websites, are not included in this data. The Ministry of Defence conducted their own rationalisation of MOD and armed forces sites as an integral part of the Website Review; military sites belonging to a particular service are excluded from this dataset. Finally, those public bodies set up by Parliament and reporting directly to the Speaker's Committee, and only reporting through a ministerial government department for the purposes of enactment of legislation, are also excluded (for example, the Electoral Commission and IPSA).
Websites are listed under the department name for which the minister in HMG has responsibility, either directly through their departmental activities, or indirectly through being the minister reporting to Parliament for independent bodies set up by statute.
For re-usability, these are provided as Excel and CSV files.
Following the list of websites in ‘.gouv.fr’ generated in the GitHub repository gouvfrlist, here is a list of websites and web services in ‘.gouv.fr’. It made it possible to produce a graph representation of domains and subdomains by ministry and (deconcentrated) administration. We also relied on the list of the top 250 administrative procedures and the list of ‘.gouv.fr’ sites dating from 2014.
Graphic representation of .gouv.fr websites
### GitHub repository
The project description and dataset are available in the GitHub repository graph-gouv-en by jbledevehat.
### Legend
The objects represented are:
— The President of the French Republic and the Prime Minister, qualified as “Person” (in blue)
— Departments or administrative branches (in yellow)
— Websites (in green)
— The subdomains of these websites (in orange)
— Online services (in red)
— Citizen consultation sites (in pink)
— Websites and services that are either archived or inactive (in black)
— Undefined (incoherent) websites (in grey)
### Web publications
This representation is available in the KUMU application at the following address: https://kumu.io/jbledevehat/sites-web-gouvfr#liste-des-sites-web-en-gouvfr-v1
In November 2024, Google.com was the most popular website worldwide with 136 billion average monthly visits. The online platform has held the top spot as the most popular website since June 2010, when it pulled ahead of Yahoo into first place. Second-ranked YouTube generated more than 72.8 billion monthly visits in the measured period.

The internet leaders: search, social, and e-commerce. Social networks, search engines, and e-commerce websites shape the online experience as we know it. While Google leads the global online search market by far, YouTube and Facebook have become the world’s most popular websites for user-generated content, solidifying Alphabet’s and Meta’s leadership over the online landscape. Meanwhile, websites such as Amazon and eBay generate millions in profits from the sale and distribution of goods, making the e-market sector an integral part of the global retail scene.

What is next for online content? Powering social media and websites like Reddit and Wikipedia, user-generated content keeps moving the internet’s engines. However, the rise of generative artificial intelligence will bring significant changes to how online content is produced and handled. ChatGPT is already transforming how online search is performed, and news of Google's 2024 deal for licensing Reddit content to train large language models (LLMs) signals that the internet is likely to go through a new revolution. While AI's impact on the online market might bring both opportunities and challenges, effective content management will remain crucial for profitability on the web.
The Fermi Gamma-ray Space Telescope (Fermi) Large Area Telescope (LAT) is a successor to EGRET, with greatly improved sensitivity, resolution, and energy range. This web page presents the first full catalog of LAT sources, based on the first eleven months of survey data. For a full explanation about the catalog and its construction see the LAT 1-year Catalog Paper.
Number and list of central government open websites – 455 as at 31 December 2013.
The Cabinet Office committed to begin quarterly publication of the number of open websites starting in the financial year 2011.
The definition used is a user-centric one. Something is counted as a separate website if it is active and either has a separate domain name or, when it is a subdomain, the user cannot move freely between the subsite and the parent site and there is no family likeness in the design. In other words, if the user experiences it as a separate site in their normal uses of browsing, search and interaction, it is counted as one.
A website is considered closed when it ceases to be actively funded, run and managed by central government, either by packaging information and putting it in the right place for the intended audience on another website or digital channel, or by a third party taking and managing it and bearing the cost. Where appropriate, domains stay operational in order to redirect users to the UK Government Website Archive (http://www.nationalarchives.gov.uk/webarchive/).
The GOV.UK exemption process began with a web rationalisation of the government’s internet estate to reduce the number of obsolete websites and to establish the scale of the websites that the government owns.
Not included in the number or list are:
Finally, those public bodies set up by Parliament and reporting directly to the Speaker’s Committee are also excluded (for example, the Electoral Commission and IPSA).
As agreed in the quarterly report of February 2013, the following sites have been included in the list:
Websites are listed under the department name for which the government minister has responsibility, either directly through their departmental activities, or indirectly through being the minister reporting to Parliament for independent bodies set up by statute.
Government website domains have been procured from as early as the 1990s, and at that time there was no requirement upon government departments to retain a formal record of ownership. With staff changes and new departments formed, it became apparent that departments did not have a complete view of all sites in their estate.
The Government Digital Service (GDS) has worked closely with these departments to identify legacy websites which we were not originally aware of, by going through the complete list of gov.uk domains managed by the Cabinet Office under the second-level domain (SLD) gov.uk. A full list of gov.uk domains can be viewed here. As well as websites on the gov.uk SLD, we found that there are a number of legacy websites owned by departments under a .org.uk or .co.uk SLD. Because we do not own these SLDs, information on whether a department has ownership was not so easily accessible, but a strong working relationship with department leads has since helped to identify the majority of these sites.
Previously, the Ministry of Defence conducted their own rationalisation of MOD and the armed forces sites. At the beginning of this report, we agreed to include these sites to ensure a consistent approach.
Since the last report of Oct 2013, 19 websites have closed and 18 have migrated to the government’s website, GOV.UK. As government websites migrate to GOV.UK, the responsibility for reporting a department’s content will become part of overall GOV.UK reporting.
As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, while content in the German language followed, with 5.6 percent.

English as the leading online language. The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.

Global internet usage by regions. As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America were leading in terms of internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.
In November 2024, Google.com was the most visited website in the United States, with over 25 billion total visits. YouTube.com came in second with 12 billion total visits. Reddit.com and Amazon.com counted approximately 3.12 billion and 2.89 billion monthly visits, respectively, from U.S. online audiences.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as a whitelist for the Newstracker research project, in which we monitored the online web behaviour of a group of respondents. The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'. For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (the code is available in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist of the websites that were the most popular in 2015. We compiled this list manually using data from DDMM, Alexa and our own research. The dataset consists of 5 columns:
- the URL
- the type of website: we created a list of website types and each website was manually labelled with 1 category
- Nieuws-regio: when the category was 'News', we subdivided these websites by regional focus: International, National or Local
- Nieuws-onderwerp: furthermore, each website under the category News was further subdivided by type of news website; for this we created our own list of news categories and manually coded each website
- Bron: for each website we noted which source we used to find it.
The full description of the research design of the Newstracker, including the set-up of this whitelist, is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
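As a brief illustration of working with the five columns described above, the sketch below loads the whitelist and counts the News websites by regional focus; the file name "whitelist.csv" and the exact column headers are assumptions, not the published schema.

```python
# Minimal sketch, not part of the original Newstracker project. Column names
# follow the description above but are assumptions about the actual file.
import pandas as pd

cols = ["URL", "Type", "Nieuws-regio", "Nieuws-onderwerp", "Bron"]
whitelist = pd.read_csv("whitelist.csv", names=cols, header=0)

# Select the websites labelled 'News' and count them by regional focus.
news = whitelist[whitelist["Type"] == "News"]
print(news["Nieuws-regio"].value_counts())
```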
List of Health and Human Services facilities and available programs, contact information, hours of operation and web-page links. This dataset is updated on an as-needed basis.
A Civil Service List is considered terminated usually four years after the list has been established, unless it is extended at the Commissioner’s discretion. For more information visit DCAS’ “Work for the City” webpage at: https://www1.nyc.gov/site/dcas/employment/take-an-exam.page.
In November 2024, Google.com was the most popular website worldwide with approximately 6.25 billion unique monthly visitors. YouTube.com was ranked second with an estimated 3.64 billion unique monthly visitors. Both websites are among the most visited websites worldwide.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Researchers from the Czech Republic are publishing a dataset for HTTPS traffic classification.
Since the data were captured mainly on a real backbone network, IP addresses and ports were omitted. The datasets consist of features calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
During the research, they divided HTTPS traffic into the following categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, and W -- Website and other traffic.
They chose service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the 500 most popular websites for each category. They also used several popular websites that primarily focus on the Czech audience. The identified traffic classes and their representatives are provided below:
- Live Video Stream: Twitch, Czech TV, YouTube Live
- Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
- Music Player: AppleMusic, Spotify, SoundCloud
- File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
- Website and Other Traffic: Websites from the Alexa Top 1M list
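The flows exported this way can feed a standard multi-class classifier. The sketch below is illustrative only, assuming the flow features have been saved to a CSV with a "label" column holding the class codes; the file name and column layout are assumptions, not the dataset's documented format.

```python
# Minimal sketch for classifying HTTPS flows into the traffic classes above.
# "https_flows.csv" and its column names are assumed, not taken from the dataset docs.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

flows = pd.read_csv("https_flows.csv")
X = flows.drop(columns=["label"])      # e.g. packet-length/time statistics per flow
y = flows["label"]                     # L, P, M, U, D, or W

clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X, y, cv=5)
print("mean cross-validation accuracy:", scores.mean())
```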
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Advertisement identification and filtering in web pages gain significance due to various factors such as accessibility, security, privacy, and obtrusiveness. Current practices in this direction involve maintaining URL-based regular expressions called filter lists. Each URL obtained on a web page is matched against this filter list. While effective, this procedure lacks scalability as it demands regular maintenance of the filter list. To counter these limitations, we devise a machine-learning-based advertisement detection system using a diverse feature set which can distinguish advertisement blocks from non-advertisement blocks. The method can act as a base to provide various accessibility-related features such as smooth browsing and text summarization for persons with visual impairments, cognitive impairments, and photosensitive epilepsy. The results from a classifier trained on the proposed feature set achieve 93.4% accuracy in identifying advertisements.
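The paper's actual feature set and classifier are not reproduced here; as an illustrative sketch only, the snippet below shows the general shape of such a block-level ad detector, where each page block is represented by a numeric feature vector and labelled ad or non-ad. The synthetic data stands in for real extracted blocks.

```python
# Illustrative sketch only -- not the authors' system. Each block is a numeric
# feature vector (e.g. geometry, link density, iframe presence) with an
# ad (1) / non-ad (0) label; synthetic data is used as a placeholder.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((1000, 8))                       # 1000 blocks, 8 hypothetical features
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)       # synthetic ad/non-ad labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("accuracy on synthetic data:", accuracy_score(y_test, clf.predict(X_test)))
```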
https://webtechsurvey.com/terms
A complete list of live websites using the WordPress technology, compiled through global website indexing conducted by WebTechSurvey.