96 datasets found
  1. Website Traffic

    • kaggle.com
    zip
    Updated Aug 5, 2024
    Cite
    AnthonyTherrien (2024). Website Traffic [Dataset]. https://www.kaggle.com/datasets/anthonytherrien/website-traffic
    Explore at:
    zip (65228 bytes)
    Dataset updated
    Aug 5, 2024
    Authors
    AnthonyTherrien
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Overview

    This dataset provides detailed information on website traffic, including page views, session duration, bounce rate, traffic source, time spent on page, previous visits, and conversion rate.

    Dataset Description

    • Page Views: The number of pages viewed during a session.
    • Session Duration: The total duration of the session in minutes.
    • Bounce Rate: The percentage of visitors who navigate away from the site after viewing only one page.
    • Traffic Source: The origin of the traffic (e.g., Organic, Social, Paid).
    • Time on Page: The amount of time spent on the specific page.
    • Previous Visits: The number of previous visits by the same visitor.
    • Conversion Rate: The percentage of visitors who completed a desired action (e.g., making a purchase).

    Data Summary

    • Total Records: 2000
    • Total Features: 7

    Key Features

    1. Page Views: This feature indicates the engagement level of the visitors by showing how many pages they visit during their session.
    2. Session Duration: This feature measures the length of time a visitor stays on the website, which can indicate the quality of the content.
    3. Bounce Rate: A critical metric for understanding user behavior. A high bounce rate may indicate that visitors are not finding what they are looking for.
    4. Traffic Source: Understanding where your traffic comes from can help in optimizing marketing strategies.
    5. Time on Page: This helps in analyzing which pages are retaining visitors' attention the most.
    6. Previous Visits: This can be used to analyze the loyalty of visitors and the effectiveness of retention strategies.
    7. Conversion Rate: The ultimate metric for measuring the effectiveness of the website in achieving its goals.

    Usage

    This dataset can be used for various analyses (a short starter sketch follows the list), such as:

    • Identifying key drivers of engagement and conversion.
    • Analyzing the effectiveness of different traffic sources.
    • Understanding user behavior patterns and optimizing the website accordingly.
    • Improving marketing strategies based on traffic source performance.
    • Enhancing user experience by analyzing time spent on different pages.
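
    As a starting point for the analyses listed above, the pandas sketch below loads the data and compares traffic sources. The file name and exact column spellings are assumptions; adjust them to the actual Kaggle download.

      import pandas as pd

      # Load the dataset. The file name is an assumption; adjust it and the
      # column spellings below to the actual Kaggle download.
      df = pd.read_csv("website_traffic.csv")

      # Compare engagement and conversion across traffic sources.
      by_source = df.groupby("Traffic Source").agg(
          sessions=("Page Views", "size"),
          avg_page_views=("Page Views", "mean"),
          avg_session_minutes=("Session Duration", "mean"),
          avg_bounce_rate=("Bounce Rate", "mean"),
          avg_conversion_rate=("Conversion Rate", "mean"),
      )
      print(by_source.sort_values("avg_conversion_rate", ascending=False))

      # Quick look at how engagement metrics relate to conversion.
      metrics = ["Page Views", "Session Duration", "Bounce Rate",
                 "Time on Page", "Previous Visits", "Conversion Rate"]
      print(df[metrics].corr()["Conversion Rate"].sort_values(ascending=False))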

    Acknowledgments

    This dataset was generated for educational purposes and is not from a real website. It serves as a tool for learning data analysis and machine learning techniques.

  2. Internet

    • kaggle.com
    zip
    Updated Sep 14, 2022
    Cite
    Aman Chauhan (2022). Internet [Dataset]. https://www.kaggle.com/datasets/whenamancodes/internet
    Explore at:
    zip (94134 bytes)
    Dataset updated
    Sep 14, 2022
    Authors
    Aman Chauhan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For those who are online most days, it is easy to forget how young the internet still is. The timeline below the chart reminds you how recently the websites and technologies that are now integrated into the everyday lives of millions became available: in the 1990s there was no Wikipedia, Twitter launched in 2006, and Our World in Data is only 4 years old (and look how many people have joined since then).

    And while many of us cannot imagine our lives without the services that the internet provides, the key message for me from this overview of the global history of the internet is that we are still in the very early stages of the internet. It was only in 2017 that half of the world population was online; in 2018 it is therefore still the case that close to half of the world population is not using the internet.

    The internet has already changed the world, but the big changes that the internet will bring still lie ahead. Its history has just begun.

    Findings:

    How many Internet users does each country have? What share of people are online?

  3. Popular websites across the globe

    • kaggle.com
    zip
    Updated May 27, 2017
    Cite
    bpali26 (2017). Popular websites across the globe [Dataset]. https://www.kaggle.com/bpali26/popular-websites-across-the-globe
    Explore at:
    zip (639485 bytes)
    Dataset updated
    May 27, 2017
    Authors
    bpali26
    Description

    Context

    This dataset includes some of the basic information about the websites we use daily. While scraping this info, I learned quite a lot about R programming, system speed, memory usage, etc., and developed my niche in web scraping. It took about 4-5 hours to scrape this data on my system (4 GB RAM) and about 4-5 days to work the idea out through this project.

    Content

    The dataset contains the top 50 ranked sites from each of 191 countries along with their global traffic rank. Here, country_rank represents the traffic rank of a site within the country, and traffic_rank represents its global traffic rank.

    Since the meaning of most columns can be derived from their names, the dataset is pretty straightforward to understand. However, there are a few potential points of confusion which I would like to explain here:

    1) Most of the numeric values are in character format and hence contain spaces which you might need to clean up (see the cleaning sketch after this list).

    2) There are multiple instances of the same website. For example, Yahoo.com is present in 179 rows within this dataset, owing to its different country rank in each country.

    3) The information provided in this dataset covers the top 50 websites in 191 countries as of 25th May 2017 and is subject to change over time due to the dynamic nature of the rankings.

    4) The dataset actually contains 9540 rows instead of 9550 (50 × 191 rows). This is due to the unavailability of information for 10 websites.
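
    The snippet below is one way to do the cleaning mentioned in point 1. The file name and website column name are assumptions; country_rank and traffic_rank are the column names given in the description above.

      import pandas as pd

      # File and website column names are assumptions; country_rank and
      # traffic_rank are the column names given in the description.
      df = pd.read_csv("popular_websites.csv")

      # Numeric values arrive as character strings containing spaces
      # (e.g. "1 234 567"): strip the spaces, then coerce to numbers.
      for col in ["country_rank", "traffic_rank"]:
          df[col] = pd.to_numeric(
              df[col].astype(str).str.replace(" ", "", regex=False),
              errors="coerce",  # rows with missing info become NaN
          )

      # Sanity checks against the notes above: ~9540 rows, and the same
      # site repeated once per country it is ranked in.
      print(len(df))
      print(df["website"].value_counts().head())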

    PS: in case there are any more queries, comment on this and I'll add an answer to the list above.

    Acknowledgements

    I wouldn't have done this without the help of others. I scraped this information from publicly available (open to all) websites, namely: 1) http://data.danetsoft.com/ 2) http://www.alexa.com/topsites, for which I'm highly grateful. I truly appreciate and thank the owners of these sites for providing the information that I have included in this dataset.

    Inspiration

    I feel there is a lot of scope for exploring and visualizing this dataset to find trends in the attributes of these websites across countries. One could also try predicting the global traffic rank as a dependent variable of the other website attributes. In any case, this dataset will help you find the popular sites in your area.

  4. Requirements data sets (user stories)

    • zenodo.org
    • data.mendeley.com
    txt
    Updated Jan 13, 2025
    Cite
    Fabiano Dalpiaz; Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
    Explore at:
    txt
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Mendeley Data
    Authors
    Fabiano Dalpiaz; Fabiano Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 22 data sets of 50+ requirements each, expressed as user stories.

    The dataset was created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].

    The data sets were originally used to conduct experiments on ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

    This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1

    Overview of the datasets [data and links added in December 2024]

    The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.

    Public administration and transparency

    g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal-spending-related websites, including many more projects than the one described in the shared collection, can be found here.

    g03-loudoun.txt (2018) is a set of requirements extracted from a document by Loudoun County, Virginia, that describes the to-be user stories and use cases for a land management readiness assessment system called Loudoun County LandMARC. The source document can be found here and is part of the Electronic Land Management System and EPlan Review Project RFP/RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

    g04-recycling.txt (2017) concerns a web application through which recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub website and is the basis of a student project on web site design; the code is available (no license).

    g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge Foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and on how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.

    g11-nsf.txt (2018) is a collection of user stories referring to the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website and can be found as closed Issues.

    (Research) data and meta-data management

    g08-frictionless.txt (2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and the web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.

    g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.

    g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not contain explicit links to projects, it can be inferred that they originate from a project related to the library of Duke University.

    g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework for distributed processing of large datasets. The user stories are extracted from a document of requirements regarding dataset management for Cask 4.0, which includes the scenarios, the user stories and a design for their implementation. The raw data is available in the following environment.

    g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform that enables researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

    g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the project's GitHub page.

    g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source, web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its
  5. Youtube Oldest Videos (2005) Dataset

    • kaggle.com
    zip
    Updated Mar 16, 2022
    Cite
    Demyan Pavlyshenko (2022). Youtube Oldest Videos(2005) Dataset [Dataset]. https://www.kaggle.com/datasets/demko1/youtube-oldest-videos2005-dataset/data
    Explore at:
    zip (63483 bytes)
    Dataset updated
    Mar 16, 2022
    Authors
    Demyan Pavlyshenko
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    YouTube
    Description

    Context

    There's a story behind every dataset and here's your opportunity to share yours.

    Content

    What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  6. SWPS40 (Similar Web Pages) - A Benchmark Dataset for Structure and Vision based Web Page Similarity

    • web.cs.hacettepe.edu.tr
    zip
    Updated Oct 18, 2018
    Cite
    Hacettepe University Department of Computer Engineering (2018). SWPS40 (Similar Web Pages) - A Benchmark Dataset for Structure and Vision based Web Page Similarity [Dataset]. https://web.cs.hacettepe.edu.tr/~selman/swps40dataset/
    Explore at:
    zip
    Dataset updated
    Oct 18, 2018
    Dataset provided by
    Hacettepe University (http://hacettepe.edu.tr/)
    Authors
    Hacettepe University Department of Computer Engineering
    Time period covered
    May 1, 2018 - Jul 18, 2018
    Description

    The SWPS40 (Similar Web PageS) dataset aims to supply researchers with a ground truth dataset against which to verify their ranking results based on web page visual similarity. For this purpose, we have collected screenshots and HTML+CSS+JS files of 40 different web pages from different contexts and sectors. The main goal of this dataset is to provide ground truth for visual-similarity-based rankings collected from many participants. The web page pairs in the dataset were scored by 312 different participants. During the study, each participant scored 100 different page pairs, yielding 31,200 individual scores in total. In this way, 40 votes were collected for each page pair (e.g., P1 and P4), with the aim of generating statistically significant ground truth rankings.

  7. SMS-spam

    • huggingface.co
    Updated Sep 18, 2025
    Cite
    bert van keulen (2025). SMS-spam [Dataset]. https://huggingface.co/datasets/bvk/SMS-spam
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 18, 2025
    Authors
    bert van keulen
    Description

    This dataset can be found on Kaggle, Huggingface and many other websites, but the source is the research paper [Tiago], whose authors contributed it to the Machine Learning repository at [UCI]. It contains 5,574 SMS messages, of which 747 messages are labeled as spam. By nature, the messages are short and, in some cases, quite cryptic and personal. The CSV file is a straightforward representation of the data. References [UCI] https://archive.ics.uci.edu/dataset/228/sms+spam+collection… See the full description on the dataset page: https://huggingface.co/datasets/bvk/SMS-spam.
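
    A minimal loading sketch for the CSV; the file and column names ("label", "text") are assumptions, so check the actual header of the download.

      import pandas as pd

      # The file and column names ("label", "text") are assumptions; check
      # the actual CSV header of the download.
      df = pd.read_csv("sms_spam.csv")

      print(df["label"].value_counts())  # expect 747 spam vs ~4827 ham

      # Message length is a simple first feature to inspect.
      df["n_chars"] = df["text"].str.len()
      print(df.groupby("label")["n_chars"].describe())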

  8. Passive Operating System Fingerprinting Revisited - Network Flows Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Feb 14, 2023
    Cite
    Martin Laštovička; Martin Laštovička; Martin Husák; Martin Husák; Petr Velan; Petr Velan; Tomáš Jirsík; Tomáš Jirsík; Pavel Čeleda; Pavel Čeleda (2023). Passive Operating System Fingerprinting Revisited - Network Flows Dataset [Dataset]. http://doi.org/10.5281/zenodo.7635138
    Explore at:
    zip
    Dataset updated
    Feb 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martin Laštovička; Martin Laštovička; Martin Husák; Martin Husák; Petr Velan; Petr Velan; Tomáš Jirsík; Tomáš Jirsík; Pavel Čeleda; Pavel Čeleda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For the evaluation of OS fingerprinting methods, we need a dataset with the following requirements:

    • First, the dataset needs to be big enough to capture the variability of the data. In this case, we need many connections from different operating systems.
    • Second, the dataset needs to be annotated, which means that the corresponding operating system needs to be known for each network connection captured in the dataset. Therefore, we cannot just capture any network traffic for our dataset; we need to be able to determine the OS reliably.

    To overcome these issues, we have decided to create the dataset from the traffic of several web servers at our university. This allows us to address the first issue by collecting traces from thousands of devices ranging from user computers and mobile phones to web crawlers and other servers. The ground truth values are obtained from the HTTP User-Agent, which resolves the second of the presented issues. Even though most traffic is encrypted, the User-Agent can be recovered from the web server logs that record every connection’s details. By correlating the IP address and timestamp of each log record to the captured traffic, we can add the ground truth to the dataset.

    For this dataset, we have selected a cluster of five web servers that host 475 unique university domains for public websites. The monitoring point recording the traffic was placed at the backbone network connecting the university to the Internet.

    The dataset used in this paper was collected from approximately 8 hours of university web traffic throughout a single workday. The logs were collected from Microsoft IIS web servers and converted from W3C extended logging format to JSON. The logs are referred to as web logs and are used to annotate the records generated from packet capture obtained by using a network probe tapped into the link to the Internet.

    The entire dataset creation process consists of seven steps:

    1. The packet capture was processed by the Flowmon flow exporter (https://www.flowmon.com) to obtain primary flow data containing information from TLS and HTTP protocols.
    2. Additional statistical features were extracted using GoFlows flow exporter (https://github.com/CN-TU/go-flows).
    3. The primary flows were filtered to remove incomplete records and network scans.
    4. The flows from both exporters were merged together into records containing fields from both sources.
    5. Web logs were filtered to cover the same time frame as the flow records.
    6. Web logs were paired with the flow records based on shared properties (IP address, port, time).
    7. The last step was to convert the User-Agent values into the operating system using a Python version of the open-source tool ua-parser (https://github.com/ua-parser/uap-python). We replaced the unstructured User-Agent string in the records with the resulting OS.
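
    As an illustration of step 7, the sketch below maps a raw User-Agent string to the OS fields kept in the dataset, using the classic uap-python API (pip install ua-parser); the User-Agent string shown is illustrative only.

      # Sketch of step 7 using the classic uap-python API
      # (pip install ua-parser). The User-Agent string is illustrative only.
      from ua_parser import user_agent_parser

      ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0 Safari/537.36")

      # ParseOS returns the OS fields stored in the dataset:
      # UA OS family/major/minor/patch.
      os_info = user_agent_parser.ParseOS(ua)
      print(os_info["family"], os_info["major"])  # e.g. "Windows 10"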

    The collected and enriched flows contain 111 data fields that can be used as features for OS fingerprinting or any other data analyses. The fields grouped by their area are listed below:

    • basic flow properties - flow_ID;start;end;L3 PROTO;L4 PROTO;BYTES A;PACKETS A;SRC IP;DST IP;TCP flags A;SRC port;DST port;packetTotalCountforward;packetTotalCountbackward;flowDirection;flowEndReason;
    • IP parameters - IP ToS;maximumTTLforward;maximumTTLbackward;IPv4DontFragmentforward;IPv4DontFragmentbackward;
    • TCP parameters - TCP SYN Size;TCP Win Size;TCP SYN TTL;tcpTimestampFirstPacketbackward;tcpOptionWindowScaleforward;tcpOptionWindowScalebackward;tcpOptionSelectiveAckPermittedforward;tcpOptionSelectiveAckPermittedbackward;tcpOptionMaximumSegmentSizeforward;tcpOptionMaximumSegmentSizebackward;tcpOptionNoOperationforward;tcpOptionNoOperationbackward;synAckFlag;tcpTimestampFirstPacketforward;
    • HTTP - HTTP Request Host;URL;
    • User-agent - UA OS family;UA OS major;UA OS minor;UA OS patch;UA OS patch minor;
    • TLS - TLS_CONTENT_TYPE;TLS_HANDSHAKE_TYPE;TLS_SETUP_TIME;TLS_SERVER_VERSION;TLS_SERVER_RANDOM;TLS_SERVER_SESSION_ID;TLS_CIPHER_SUITE;TLS_ALPN;TLS_SNI;TLS_SNI_LENGTH;TLS_CLIENT_VERSION;TLS_CIPHER_SUITES;TLS_CLIENT_RANDOM;TLS_CLIENT_SESSION_ID;TLS_EXTENSION_TYPES;TLS_EXTENSION_LENGTHS;TLS_ELLIPTIC_CURVES;TLS_EC_POINT_FORMATS;TLS_CLIENT_KEY_LENGTH;TLS_ISSUER_CN;TLS_SUBJECT_CN;TLS_SUBJECT_ON;TLS_VALIDITY_NOT_BEFORE;TLS_VALIDITY_NOT_AFTER;TLS_SIGNATURE_ALG;TLS_PUBLIC_KEY_ALG;TLS_PUBLIC_KEY_LENGTH;TLS_JA3_FINGERPRINT;
    • Packet timings - NPM_CLIENT_NETWORK_TIME;NPM_SERVER_NETWORK_TIME;NPM_SERVER_RESPONSE_TIME;NPM_ROUND_TRIP_TIME;NPM_RESPONSE_TIMEOUTS_A;NPM_RESPONSE_TIMEOUTS_B;NPM_TCP_RETRANSMISSION_A;NPM_TCP_RETRANSMISSION_B;NPM_TCP_OUT_OF_ORDER_A;NPM_TCP_OUT_OF_ORDER_B;NPM_JITTER_DEV_A;NPM_JITTER_AVG_A;NPM_JITTER_MIN_A;NPM_JITTER_MAX_A;NPM_DELAY_DEV_A;NPM_DELAY_AVG_A;NPM_DELAY_MIN_A;NPM_DELAY_MAX_A;NPM_DELAY_HISTOGRAM_1_A;NPM_DELAY_HISTOGRAM_2_A;NPM_DELAY_HISTOGRAM_3_A;NPM_DELAY_HISTOGRAM_4_A;NPM_DELAY_HISTOGRAM_5_A;NPM_DELAY_HISTOGRAM_6_A;NPM_DELAY_HISTOGRAM_7_A;NPM_JITTER_DEV_B;NPM_JITTER_AVG_B;NPM_JITTER_MIN_B;NPM_JITTER_MAX_B;NPM_DELAY_DEV_B;NPM_DELAY_AVG_B;NPM_DELAY_MIN_B;NPM_DELAY_MAX_B;NPM_DELAY_HISTOGRAM_1_B;NPM_DELAY_HISTOGRAM_2_B;NPM_DELAY_HISTOGRAM_3_B;NPM_DELAY_HISTOGRAM_4_B;NPM_DELAY_HISTOGRAM_5_B;NPM_DELAY_HISTOGRAM_6_B;NPM_DELAY_HISTOGRAM_7_B;
    • ICMP - ICMP TYPE;

    The details of OS distribution grouped by the OS family are summarized in the table below. The Other OS family contains records generated by web crawling bots that do not include OS information in the User-Agent.

    OS Family    Number of flows
    Other        42474
    Windows      40349
    Android      10290
    iOS          8840
    Mac OS X     5324
    Linux        1589
    Ubuntu       653
    Fedora       88
    Chrome OS    53
    Symbian OS   1
    Slackware    1
    Linux Mint   1

  9. Data from: Crowd and community sourcing to update authoritative LULC data in urban areas

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Aug 12, 2023
    Cite
    Zenodo (2023). Crowd and community sourcing to update authoritative LULC data in urban areas [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3691827?locale=en
    Explore at:
    unknown (1533)
    Dataset updated
    Aug 12, 2023
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The French National Mapping Agency (Institut National de l'Information Géographique et Forestière - IGN) is responsible for producing and maintaining the spatial data sets for all of France. At the same time, it must satisfy the needs of different stakeholders who are responsible for decisions at multiple levels, from local to national. IGN produces many different maps, including detailed road networks and land cover/land use maps over time. The information contained in these maps is crucial for many of the decisions made about urban planning, resource management and landscape restoration, as well as other environmental issues in France. Recently, IGN has started the process of creating high-resolution land use land cover (LULC) maps, aimed at developing smart and accurate monitoring services of LULC over time. To help update and validate the French LULC database, citizens and interested stakeholders can contribute using the Paysages mobile and web applications. This approach presents an opportunity to evaluate the integration of citizens in the IGN process of updating and validating LULC data.

    Dataset 1: Change detection validation 2019

    This dataset contains web-based validations of changes detected by time series (2016-2019) analysis of Sentinel-2 satellite imagery. Validation was conducted using two high resolution orthophotos from 2016 and 2019 as reference data. Two tools were used: the Paysages web application and LACO-Wiki. Both tools used the same validation design (blind validation) and the same options. For each detected change, contributors are asked to validate whether there is a change and, if so, to choose a LU or LC class from a pre-defined list of classes. The dataset has the following characteristics:

    • Time period of the change detection: 2016-2019
    • Time period of data collection: February 2019 - December 2019
    • Total number of contributors: 105
    • Number of validated changes: 1048; each change was validated by between 1 and 6 contributors
    • Region of interest: Toulouse and surrounding areas
    • Associated files: 1- Change validation locations.png, 1-Change validation 2019 – Attributes.csv, 1-Change validation 2019.csv, 1-Change validation 2019.geoJSON

    This dataset is licensed under a Creative Commons Attribution 4.0 International license. It is attributed to the LandSense Citizen Observatory, IGN-France, and GeoVille.

    Dataset 2: Land use classification 2019

    The aim of this data collection campaign was to improve the LU classification of authoritative LULC data (OCS-GE 2016 ©IGN) for built-up areas. Using the Paysages web platform, contributors are asked to choose a land use value from a list of pre-defined values for each location. The dataset has the following characteristics:

    • Time period of data collection: August 2019
    • Types of contributors: surveyors from the production department of IGN
    • Total number of contributors: 5
    • Total number of observations: 2711
    • Data specifications of the OCS-GE ©IGN
    • Region of interest: Toulouse and surrounding areas
    • Associated files: 2- LU classification points.png, 2-LU classification 2019 – Attributes.csv, 2-LU classification 2019.csv, 2-LU classification 2019.geoJSON

    This dataset is licensed under a Creative Commons Attribution 4.0 International license. It is attributed to the LandSense Citizen Observatory, IGN-France and the International Institute for Applied Systems Analysis.

    Dataset 3: In-situ validation 2018

    The aim of this data collection campaign was to collect in-situ (ground-based) information, using the Paysages mobile application, to update authoritative LULC data. Contributors visit pre-determined locations, take photographs of the point location and in the four cardinal directions away from the point, and answer a few questions with respect to the task. Two tasks were defined:

    • Classify the point by choosing a LU class among three classes: industrial (US2), commercial (US3) or residential (US5).
    • Validate changes detected by the LandSense Change Detection Service: for each newly detected change, the contributor was requested to validate the change and choose a LU and LC class from a pre-defined list of classes.

    The dataset has the following characteristics:

    • Time period of data collection: June 2018 - October 2018
    • Types of contributors: students from the School of Agricultural and Life Sciences and citizens
    • Total number of contributors: 26
    • Total number of observations: 281
    • Total number of photos: 421
    • Region of interest: Toulouse and surrounding areas
    • Associated files: 3- Insitu locations.png, 3- Insitu validation 2018 – Attributes.csv, 3- Insitu validation 2018.csv, 3- Insitu validation 2018.geoJSON

    This dataset is licensed under a Creative Commons Attribution 4.0 International license. It is attributed to the LandSense Citizen Observatory, IGN-France.

    This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no 689812.

  10. Cora Dataset

    • linkagelibrary.icpsr.umich.edu
    • openicpsr.org
    delimited
    Updated Apr 2, 2019
    Cite
    Mahin Ramezani (2019). Cora Dataset [Dataset]. http://doi.org/10.3886/E109167V2
    Explore at:
    delimited
    Dataset updated
    Apr 2, 2019
    Dataset provided by
    Texas A&M University
    Authors
    Mahin Ramezani
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Cora data contains bibliographic records of machine learning papers that have been manually clustered into groups that refer to the same publication. Originally, Cora was prepared by Andrew McCallum, and his versions of this data set are available on his Data web page. The data is also hosted here. Note that various versions of the Cora data set have been used by many publications in record linkage and entity resolution over the years.

  11. HIFLD OPEN GIS Data Index and Crosswalk

    • datalumos.org
    Updated Dec 20, 2025
    Cite
    United States Department of Homeland Security (2025). HIFLD OPEN GIS Data Index and Crosswalk [Dataset]. http://doi.org/10.3886/E241367V1
    Explore at:
    Dataset updated
    Dec 20, 2025
    Dataset authored and provided by
    United States Department of Homeland Security
    License

    https://creativecommons.org/share-your-work/public-domain/pdm

    Time period covered
    Aug 25, 2025 - Dec 15, 2025
    Area covered
    United States
    Description

    This dataset consists of an inventory of all the HIFLD Open GIS layers that are stored in DataLumos, and a crosswalk published by the US Department of Homeland Security that links to repositories and web mapping services that contain current, updated copies of these datasets.

    HIFLD Open was a GIS data portal that gathered GIS layers from dozens of US federal government agencies into one centralized source, to facilitate access to nationwide data for many purposes, including emergency management and community preparedness. The portal was decommissioned in August of 2025. The Data Rescue Project downloaded and archived over 400 datasets that were in the portal, to create a final snapshot of the data that existed there before it was taken offline. Metadata records were captured whenever possible, and dataset titles, descriptions, and terms were carried over from the original HIFLD Open records.

    The data index was created from the original JSON index file that was associated with the HIFLD Open repository, and contains the file names, titles, dates, descriptions, and key terms associated with each data layer. The Data Rescue Project used this index to create all of the records in DataLumos and to track the progress of the archiving project. Additional fields include file sizes, available formats (file geodatabase, geoJSON, geopackage, and shapefiles), a standardized publisher field, and a link to the final DataLumos record. The archiving project was completed in December 2025.

    The crosswalk was created by the Department of Homeland Security in August 2025, to link individual data layers previously stored in the repository to the websites, repositories, or web mapping services of the federal agencies that originally created each dataset. Users can follow these links to identify and access updated versions of each of the layers.

  12. Download Home Depot products dataset

    • crawlfeeds.com
    csv, zip
    Updated Mar 5, 2026
    Cite
    Crawl Feeds (2026). Download Home Depot products dataset [Dataset]. https://crawlfeeds.com/datasets/download-home-depot-products-dataset
    Explore at:
    csv, zip
    Dataset updated
    Mar 5, 2026
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Access the Home Depot products dataset, a comprehensive collection of web-scraped data featuring home improvement products. Discover trending tools, hardware, appliances, décor, and gardening essentials to enhance your projects. From power tools and building materials to lighting, furniture, and outdoor living items, this dataset provides insights into top-rated products, best-selling brands, and emerging trends.

    Available Home Depot datasets:

    We offer a wide range of categories, including furniture, home décor, painting, plumbing, and many more. Explore all available options here.

    Download now to explore detailed product data for smarter decision-making in home improvement, DIY, and construction projects.

    For a closer look at the product-level data we’ve extracted from Home Depot, including pricing, stock status, and detailed specifications, visit the Home Depot dataset page. You can explore sample records and submit a request for tailored extracts directly from there.

  13. PhishingWebsites

    • openml.org
    Updated Apr 30, 2025
    Cite
    See original data source. (2025). PhishingWebsites [Dataset]. http://doi.org/10.24432/C51W2X
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 30, 2025
    Authors
    See original data source.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was curated for TabArena by the TabArena team as part of the TabArena Tabular ML IID Study. For more details on the study, see our paper.

    Dataset Focus: This dataset shall be used for evaluating predictive machine learning models for independent and identically distributed tabular data. The intended task is classification.

    Dataset Metadata

    • Licence: CC BY 4.0
    • Original Data Source: https://doi.org/10.24432/C51W2X
    • Reference (please cite): Mohammad, Rami M., Fadi Thabtah, and Lee McCluskey. 'An assessment of features related to phishing websites using an automated technique.' 2012 international conference for internet technology and secured transactions. IEEE, 2012. https://ieeexplore.ieee.org/abstract/document/6470857
    • Dataset Year: 2012
    • Dataset Description: see the reference and the original data source for details.

    Curation comments by the TabArena team (for code see the page of the study):

    • Anomaly: all features are categorical, ordinal-encoded variables with at most 3 values.
    • Anomaly: the data has many duplicates (47% of rows); see the deduplication sketch below.
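
    A minimal pandas sketch of what checking the duplicate anomaly could look like; the file name is an assumption.

      import pandas as pd

      # The file name is an assumption; adjust to the actual download.
      df = pd.read_csv("phishing_websites.csv")

      # With roughly 47% duplicated rows, exact duplicates can leak between
      # train and test splits, so count and drop them before evaluation.
      n_dupes = df.duplicated().sum()
      print(f"{n_dupes} duplicate rows ({n_dupes / len(df):.0%})")
      df_unique = df.drop_duplicates()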

  14. Swahili: News Classification Dataset

    • zenodo.org
    • explore.openaire.eu
    csv
    Updated Sep 18, 2021
    Cite
    Davis David; Davis David (2021). Swahili : News Classification Dataset [Dataset]. http://doi.org/10.5281/zenodo.4300294
    Explore at:
    csv
    Dataset updated
    Sep 18, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Davis David; Davis David
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Swahili is spoken by 100-150 million people across East Africa. In Tanzania, it is one of two national languages (the other is English) and it is the official language of instruction in all schools. News in Swahili is an important part of the media sphere in Tanzania.

    News contributes to education, technology, and the economic growth of a country, and news in local languages plays an important cultural role in many African countries. In the modern age, African languages in news and other spheres are at risk of being lost as English becomes the dominant language in online spaces.

    The Swahili news dataset was created to reduce the gap in using the Swahili language to create NLP technologies, and to help AI practitioners in Tanzania and across the African continent practice their NLP skills on problems, in organizations or societies, related to the Swahili language. The Swahili news was collected from different websites that provide news in the Swahili language. I was able to find some websites that provide news in Swahili only, and others in different languages including Swahili.

    The dataset was created for the specific task of text classification: each news article can be categorized into one of six topics (Local News, International News, Finance News, Health News, Sports News, and Entertainment News). The dataset comes with a specified train/test split; the train set contains 75% of the dataset.
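
    A possible baseline for this classification task, assuming the split ships as two CSV files with content and category columns (the file and column names are assumptions) and that the test file is labelled:

      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import classification_report

      # File and column names ("content", "category") are assumptions;
      # adjust to the actual CSV headers of the train/test files.
      train = pd.read_csv("swahili_news_train.csv")
      test = pd.read_csv("swahili_news_test.csv")

      # Bag-of-ngrams features plus a linear classifier as a baseline.
      vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
      X_train = vec.fit_transform(train["content"])
      X_test = vec.transform(test["content"])

      clf = LogisticRegression(max_iter=1000)
      clf.fit(X_train, train["category"])
      print(classification_report(test["category"], clf.predict(X_test)))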

    Acknowledgment: This project was supported by the AI4D language dataset fellowship through K4All and Zindi Africa.

  15. CHOSEN dataset

    • zenodo.org
    • dataon.kisti.re.kr
    bin, nc
    Updated Aug 1, 2021
    Cite
    Liang Zhang; Edom Moges; Liang Zhang; Edom Moges (2021). CHOSEN dataset [Dataset]. http://doi.org/10.5281/zenodo.4060384
    Explore at:
    nc, bin
    Dataset updated
    Aug 1, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Liang Zhang; Edom Moges; Liang Zhang; Edom Moges
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CHOSEN: A synthesis of hydrometeorological data from intensively monitored catchments

    1. Description

    This project develops a pipeline to synthesize publicly available hydro-meteorological time series data from various resources. Using this pipeline, we have compiled data from 30 study areas from websites listed below into the CHOSEN (Comprehensive Hydrologic Observatory SEnsor Network) dataset. In the CHOSEN dataset, the data from different study areas have the same structures and formats, making them convenient to use for comparative hydrological studies.

    Clicking on a study area will direct you to its original website, from which the raw data were downloaded.

    Study areas:

    1. East River
    2. Dry Creek
    3. Sagehen Creek
    4. Andrews Forest
    5. Baltimore
    6. Bonanza Creek
    7. California Current Ecosystem
    8. Central Arizona
    9. Coweeta
    10. Florida Coastal Everglades
    11. Georgia Coastal Ecosystems
    12. Harvard Forest
    13. Hubbard Brook
    14. Jornada Basin
    15. Kellogg
    16. Konza Prairie
    17. Northern Gulf of Alaska
    18. Plum Island
    19. Sevilleta
    20. Boulder
    21. Catalina
    22. Jemez
    23. Christina
    24. Luquillo
    25. Reynolds
    26. Shale Hills
    27. San Joaquin
    28. Providence
    29. Wolverton
    30. Calhoun

    From each website, we downloaded (if available) field measured time-series data of streamflow, precipitation, air temperature, solar radiation, evapotranspiration, relative humidity, wind direction, wind speed, SWE, snow depth, snowmelt, vapor pressure, soil moisture, soil temperature, and water isotopes.

    For more information and tutorials about the Jupyter Notebook data pipeline, please check our GitLab.

    2. Data

    On the Zenodo platform, we provide the data in NetCDF format. Check this link for an introduction to the NetCDF file format.

    To extract data from the NetCDF files, download the Jupyter Notebook (0_Extract_Data_From_NetCDF.ipynb) and the data files (.nc). The Jupyter Notebook is a tutorial on extracting data and information from NetCDF files. Geographical information about the monitoring stations can also be obtained using the Notebook.

    3. Metadata

    The metadata provided include the time range of record, variable units and names, and geographical information of the hydro-meteorological stations. This information can be extracted from the NetCDF files using the Jupyter Notebook (0_Extract_Data_From_NetCDF.ipynb).
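
    If you prefer not to use the provided notebook, a minimal sketch with xarray can read the same files; the file and variable names below are assumptions used for illustration.

      import xarray as xr

      # Requires xarray and netcdf4 (pip install xarray netcdf4). The file
      # and variable names are assumptions; inspect ds.data_vars and
      # ds.attrs to see what a given study-area file actually contains.
      ds = xr.open_dataset("DryCreek.nc")

      print(ds.data_vars)  # available hydro-meteorological variables
      print(ds.attrs)      # study-area metadata

      # Pull one variable into pandas for analysis.
      streamflow = ds["streamflow"].to_dataframe()
      print(streamflow.head())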

    4. Acknowledgements

    This work is supported by the US Geological Survey Powell Center for Analysis and Synthesis, a Gordon and Betty Moore Foundation Data-Driven Discovery Investigator grant to LL, and the Jupyter Meets the Earth project, funded by NSF grant number (UC Berkeley: 1928406, NCAR: 1928374). Partial support for ASW is provided by the National Science Foundation and the Experimental Program to Stimulate Competitive Research (EPSCoR: EPS-1929148; Canary in the Watershed). Much of the data used in this study were available from the U.S. Long-Term Ecological Research Network, the Critical Zone Observatory program, Lawrence Berkeley National Laboratory, and the Dry Creek Experimental Watershed (DCEW); we would like to acknowledge all the staff from these institutions for collecting and publicizing the data. We thank Dr. Adrian Harpold for providing the data from the Sagehen catchment. We would especially like to thank the Powell Center Working Group on Watershed Storage and Controls for their contributions to this project. We also thank Dr. Lindsey Heagy and Dr. Fernando Pérez for their suggestions on data publication and future development.

    5. Collaboration

    Please contact Berkeley ESDL lab or email angelikazhang@berkeley.edu if you have any questions about the CHOSEN dataset or would like to contribute data from another study area.

  16. Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated Jun 1, 2023
    Cite
    Julie A. McMurry; Nick Juty; Niklas Blomberg; Tony Burdett; Tom Conlin; Nathalie Conte; Mélanie Courtot; John Deck; Michel Dumontier; Donal K. Fellows; Alejandra Gonzalez-Beltran; Philipp Gormanns; Jeffrey Grethe; Janna Hastings; Jean-Karim Hériché; Henning Hermjakob; Jon C. Ison; Rafael C. Jimenez; Simon Jupp; John Kunze; Camille Laibe; Nicolas Le Novère; James Malone; Maria Jesus Martin; Johanna R. McEntyre; Chris Morris; Juha Muilu; Wolfgang Müller; Philippe Rocca-Serra; Susanna-Assunta Sansone; Murat Sariyar; Jacky L. Snoep; Stian Soiland-Reyes; Natalie J. Stanford; Neil Swainston; Nicole Washington; Alan R. Williams; Sarala M. Wimalaratne; Lilly M. Winfree; Katherine Wolstencroft; Carole Goble; Christopher J. Mungall; Melissa A. Haendel; Helen Parkinson (2023). Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data [Dataset]. http://doi.org/10.1371/journal.pbio.2001414
    Explore at:
    pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Julie A. McMurry; Nick Juty; Niklas Blomberg; Tony Burdett; Tom Conlin; Nathalie Conte; Mélanie Courtot; John Deck; Michel Dumontier; Donal K. Fellows; Alejandra Gonzalez-Beltran; Philipp Gormanns; Jeffrey Grethe; Janna Hastings; Jean-Karim Hériché; Henning Hermjakob; Jon C. Ison; Rafael C. Jimenez; Simon Jupp; John Kunze; Camille Laibe; Nicolas Le Novère; James Malone; Maria Jesus Martin; Johanna R. McEntyre; Chris Morris; Juha Muilu; Wolfgang Müller; Philippe Rocca-Serra; Susanna-Assunta Sansone; Murat Sariyar; Jacky L. Snoep; Stian Soiland-Reyes; Natalie J. Stanford; Neil Swainston; Nicole Washington; Alan R. Williams; Sarala M. Wimalaratne; Lilly M. Winfree; Katherine Wolstencroft; Carole Goble; Christopher J. Mungall; Melissa A. Haendel; Helen Parkinson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers. We also outline the important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.

  17. The Semantic PASCAL-Part Dataset

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 20, 2022
    Cite
    Donadello, Ivan; Serafini, Luciano (2022). The Semantic PASCAL-Part Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5878772
    Explore at:
    Dataset updated
    Jan 20, 2022
    Dataset provided by
    Fondazione Bruno Kessler
    Free University of Bozen-Bolzano
    Authors
    Donadello, Ivan; Serafini, Luciano
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Semantic PASCAL-Part dataset

    The Semantic PASCAL-Part dataset is the RDF version of the famous PASCAL-Part dataset used for object detection in Computer Vision. Each image is annotated with bounding boxes containing a single object. Pairs of bounding boxes are annotated with the part-whole relationship. For example, the bounding box of a car has the part-whole annotation with the bounding boxes of its wheels.

    This original release joins Computer Vision with Semantic Web as the objects in the dataset are aligned with concepts from:

    the provided supporting ontology;

    the WordNet database through its synsets;

    the Yago ontology.

    The provided Python 3 code (see the GitHub repo) can browse the dataset and convert it into RDF knowledge graph format. This new format facilitates research in both the Semantic Web and Machine Learning fields.

    Structure of the semantic PASCAL-Part Dataset

    This is the folder structure of the dataset:

    semanticPascalPart: it contains the refined images and annotations (e.g., small specific parts are merged into bigger parts) of the PASCAL-Part dataset in PASCAL VOC style.

    Annotations_set: the test set annotations in .xml format. For further information, see the PASCAL VOC format here.

    Annotations_trainval: the train and validation set annotations in .xml format. For further information, see the PASCAL VOC format here.

    JPEGImages_test: the test set images in .jpg format.

    JPEGImages_trainval: the train and validation set images in .jpg format.

    test.txt: the 2416 image filenames in the test set.

    trainval.txt: the 7687 image filenames in the train and validation set.

    The PASCAL-Part Ontology

    The PASCAL-Part OWL ontology formalizes, through logical axioms, the part-of relationship between whole objects (22 classes) and their parts (39 classes). The ontology contains 85 logical axioms in Description Logic of (for example) the following form:

    Every potted_plant has exactly 1 plant AND has exactly 1 pot

    We provide two versions of the ontology: with and without cardinality constraints in order to allow users to experiment with or without them. The WordNet alignment is encoded in the ontology as annotations. We further provide the WordNet_Yago_alignment.csv file with both WordNet and Yago alignments.

    The ontology can be browsed with many Semantic Web tools such as:

    Protégé: a graphical tool for ontology modelling;

    OWLAPI: Java API for manipulating OWL ontologies;

    rdflib: Python API for working with the RDF format.

    RDF stores: databases for storing and semantically retrieving RDF triples. See here for some examples.
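
    For instance, a minimal rdflib sketch for loading and browsing the converted knowledge graph might look as follows; the file name and the hasPart predicate are assumptions, so check the supporting ontology for the actual property IRIs.

      from rdflib import Graph

      # The file name is an assumption; use the RDF output produced by the
      # repository's conversion code.
      g = Graph()
      g.parse("semantic_pascal_part.ttl", format="turtle")
      print(len(g), "triples")

      # Print a few part-whole assertions. The "hasPart" substring match is
      # a guess; check the supporting ontology for the actual property IRI.
      for s, p, o in g:
          if "hasPart" in str(p):
              print(s, "->", o)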

    Citing semantic PASCAL-Part

    If you use semantic PASCAL-Part in your research, please use the following BibTeX entry

    @article{DBLP:journals/ia/DonadelloS16,
      author  = {Ivan Donadello and Luciano Serafini},
      title   = {Integration of numeric and symbolic information for semantic image interpretation},
      journal = {Intelligenza Artificiale},
      volume  = {10},
      number  = {1},
      pages   = {33--47},
      year    = {2016}
    }

  18. Environmental Dataset Gateway (EDG) Search Widget

    • data.wu.ac.at
    bin
    Updated Jan 1, 2014
    Cite
    U.S. Environmental Protection Agency (2014). Environmental Dataset Gateway (EDG) Search Widget [Dataset]. https://data.wu.ac.at/schema/data_gov/NzVlYzkyM2ItMDI0Mi00MThiLTlmYWEtMTMyZjlkNjQ4MDU1
    Explore at:
    bin
    Dataset updated
    Jan 1, 2014
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Use the Environmental Dataset Gateway (EDG) to find and access EPA's environmental resources. Many options are available for easily reusing EDG content in other applications, allowing individuals to provide direct access to EPA's metadata outside the EDG interface. The EDG Search Widget makes it possible to search the EDG from another web page or application. The search widget can be included on your website by simply inserting one or two lines of code. Users can type a search term or Lucene search query in the search field and retrieve a pop-up list of records that match that search.

  19. LScDC Word-Category RIG Matrix

    • figshare.le.ac.uk
    pdf
    Updated Apr 28, 2020
    Cite
    Neslihan Suzen (2020). LScDC Word-Category RIG Matrix [Dataset]. http://doi.org/10.25392/leicester.data.12133431.v2
    Explore at:
    pdf
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LScDC Word-Category RIG MatrixApril 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny MirkesGetting StartedThis file describes the Word-Category RIG Matrix for theLeicester Scientific Corpus (LSC) [1], the procedure to build the matrix and introduces the Leicester Scientific Thesaurus (LScT) with the construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category,word). Its value for the pair shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of Word-Category RIG Matrix in the published archive is presented with two additional columns of the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.This matrix is created to be used in future research on quantifying of meaning in scientific texts under the assumption that words have scientifically specific meanings in subject categories and the meaning can be estimated by information gains from word to categories. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We consider ordering the words of LScDC by the sum of their RIGs in categories. That is, words are arranged in their informativeness in the scientific corpus LSC. Therefore, meaningfulness of words evaluated by words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus. Words as a Vector of Frequencies in WoS CategoriesEach word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category.It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus.The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word,category). This table is build for the LScDC with 252 WoS categories and presented in published archive with this file. The value of each entry in the table shows how many times a word of LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category. 
    Words as a Vector of Relative Information Gains Extracted for Categories

    In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it provides about the categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word occurs in the text and 0 otherwise. Considering the LSC as a probabilistic sample space (the space of equally probable elementary outcomes), the joint probability distribution, the entropy, and the information gains are defined for these Boolean random variables. The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category gained from observing the word in the text [6]. We use the Relative Information Gain (RIG), a normalised measure of the information gain, which makes information gains comparable across categories. The calculations of entropy, information gain, and relative information gain can be found in the README file of the published archive.

    Given a word, we create a vector in which each component corresponds to a category, so each word is represented as a vector of relative information gains whose dimension is the number of categories. The set of these vectors forms the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word, and each entry is the relative information gain from the word to the category. A row vector of the matrix represents the corresponding word as a vector of RIGs in categories; a column vector represents the RIGs of all words for an individual category. For any chosen category, words can therefore be ordered by their RIGs from the most to the least informative for that category. Words can also be ordered by two global criteria, the sum and the maximum of their RIGs over categories; the top n words in such a list can be considered the most informative words in scientific texts. For a given word, the sum and maximum of RIGs are computed from the Word-Category RIG Matrix.

    RIGs in the 252 categories are calculated for each word of the LScDC, and the resulting word vectors form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of its RIGs over categories are calculated and appended as the last two columns of the matrix. The Word-Category RIG Matrix for the LScDC with 252 categories, together with these two columns, can be found in the database.

    Leicester Scientific Thesaurus (LScT)

    The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. The words of the LScDC are sorted in descending order by the sum (S) of their RIGs over categories, and the top 5,000 words are selected for the LScT. We consider these 5,000 words the most meaningful words in the scientific corpus: the meaningfulness of a word is evaluated by its average informativeness over the categories, and the resulting list is treated as a 'thesaurus' for science. The LScT, together with the sum values, is provided as a CSV file in the published archive.
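    The exact calculations are given in the README of the archive; as an illustration only, a minimal sketch of RIG for two Boolean indicators over a space of equally probable texts could look as follows (all names and values are hypothetical):

        import math

        def entropy(p):
            """Shannon entropy (in bits) of a Bernoulli variable with success probability p."""
            if p in (0.0, 1.0):
                return 0.0
            return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

        def relative_information_gain(in_category, contains_word):
            """RIG = (H(C) - H(C|W)) / H(C) for 0/1 indicator lists over equally
            probable texts: in_category[i] and contains_word[i] refer to text i."""
            n = len(in_category)
            p_c = sum(in_category) / n
            h_c = entropy(p_c)
            if h_c == 0.0:
                return 0.0                       # the category carries no uncertainty to reduce
            h_c_given_w = 0.0
            for w in (0, 1):                     # condition on word absent / present
                subset = [c for c, x in zip(in_category, contains_word) if x == w]
                if subset:
                    p_w = len(subset) / n
                    h_c_given_w += p_w * entropy(sum(subset) / len(subset))
            return (h_c - h_c_given_w) / h_c

        # Toy check: the word perfectly predicts the category -> RIG = 1.0
        print(relative_information_gain([1, 1, 0, 0], [1, 1, 0, 0]))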
    The published archive contains the following files:

    1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix whose columns are the 252 WoS categories plus the sum (S) and the maximum (M) of RIGs over categories (the last two columns), and whose rows are the words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
    2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix whose columns are the 252 WoS categories and whose rows are the words of the LScDC. Each entry is the number of texts in the corresponding category that contain the word. Words are ordered as in the LScDC.
    3) LScT.csv: The list of words of the LScT with their sum (S) values.
    4) Text_No_in_Cat.csv: The number of texts in each category.
    5) Categories_in_Documents.csv: The list of WoS categories for each document of the LSC.
    6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix, and the LScT, with the procedures used to form them.
    7) README.pdf: Same as 6, in PDF format.

    References

    [1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    [2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC, a new large scientific dictionary. arXiv preprint arXiv:1912.06858.
    [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
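    Once the archive is downloaded, the LScT selection can be reproduced from the matrix file. A sketch, assuming the first CSV column holds the words and the last two columns are S and M as described above (the actual header names may differ):

        import pandas as pd

        # Load the RIG matrix with words as the index; header names are assumed here,
        # not guaranteed by the archive.
        rig = pd.read_csv("Word_Category_RIG_Matrix.csv", index_col=0)
        s_column = rig.columns[-2]                       # sum of RIGs over categories (S)
        lsct = rig[s_column].sort_values(ascending=False).head(5000)
        print(lsct.head(10))                             # ten most informative words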

  20. MyPyramid Food Raw Data

    • catalog.data.gov
    • healthdata.gov
    • +2more
    Updated Apr 21, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Food and Nutrition Service, Department of Agriculture (2025). MyPyramid Food Raw Data [Dataset]. https://catalog.data.gov/dataset/mypyramid-food-raw-data
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Food and Nutrition Service, Department of Agriculture
    Description

    MyPyramid Food Data provides information on the total calories; calories from solid fats, added sugars, and alcohol (extras); MyPyramid food group and subgroup amounts; and saturated fat content of over 1,000 commonly eaten foods with corresponding commonly used portion amounts. This information helps consumers meet the recommendations of the Dietary Guidelines for Americans and manage their weight by understanding how many calories are consumed from "extras." CNPP has created an interactive tool based on this data set, available on the web at MyFood-a-pedia.gov. A mobile version is coming soon to provide consumers with assistance on the go.
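    As a quick illustration of how the "extras" figures can be used, the share of a food's calories coming from solid fats, added sugars, and alcohol can be computed per item. The column names and calorie values below are hypothetical, not the raw data's actual headers or contents:

        import pandas as pd

        # Hypothetical stand-in for the raw data; real headers and values will differ.
        foods = pd.DataFrame({
            "food": ["glazed doughnut", "plain bagel"],
            "total_calories": [255.0, 289.0],
            "calories_from_extras": [170.0, 20.0],   # solid fats, added sugars, alcohol
        })
        foods["extras_share"] = foods["calories_from_extras"] / foods["total_calories"]
        print(foods.sort_values("extras_share", ascending=False))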
