100+ datasets found
  1. Top Visited Websites

    • kaggle.com
    Updated Nov 19, 2022
    Cite
    The Devastator (2022). Top Visited Websites [Dataset]. https://www.kaggle.com/datasets/thedevastator/the-top-websites-in-the-world/discussion
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 19, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Top Websites in the World

    How They Change Over Time

    About this dataset

    This dataset consists of the top 50 most visited websites in the world, as well as the category and principal country/territory for each site. The data provides insights into which sites are most popular globally, and what type of content is most popular in different parts of the world

    How to use the dataset

    This dataset can be used to track the most popular websites in the world over time. It can also be used to compare website popularity between different countries and categories

    Research Ideas

    • To track the most popular websites in the world over time
    • To see how website popularity changes by region
    • To find out which website categories are most popular

    Acknowledgements

    Dataset by Alexa Internet, Inc. (2019), released on Kaggle under the Open Data Commons Public Domain Dedication and License (ODC-PDDL)

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: df_1.csv

    | Column name                 | Description                                                          |
    |:----------------------------|:---------------------------------------------------------------------|
    | Site                        | The name of the website. (String)                                    |
    | Domain Name                 | The domain name of the website. (String)                             |
    | Category                    | The category of the website. (String)                                |
    | Principal country/territory | The principal country/territory where the website is based. (String) |
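
    As a minimal sketch of the "most popular categories" research idea, the snippet below counts sites per category using the column names listed for df_1.csv. The sample rows are invented placeholders, not values from the file; a real run would open df_1.csv in place of the in-memory string.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the df_1.csv column layout; not real rows from the dataset.
SAMPLE = """Site,Domain Name,Category,Principal country/territory
Example Search,search.example,Search engine,Country A
Example Video,video.example,Video sharing,Country A
Example Portal,portal.example,Search engine,Country B
"""


def sites_per_category(handle):
    """Count rows per value of the Category column."""
    reader = csv.DictReader(handle)
    return Counter(row["Category"] for row in reader)


counts = sites_per_category(io.StringIO(SAMPLE))
print(counts.most_common())
```

    The same function should work on the real file by passing open("df_1.csv", newline="") instead of the StringIO object, assuming the header row matches the column table.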

  2. A web tracking data set of online browsing behavior of 2,148 users

    • zenodo.org
    • explore.openaire.eu
    application/gzip, txt +1
    Updated May 14, 2021
    Cite
    Juhi Kulshrestha; Marcos Oliveira; Orkut Karacalik; Denis Bonnay; Claudia Wagner (2021). A web tracking data set of online browsing behavior of 2,148 users [Dataset]. http://doi.org/10.5281/zenodo.4757574
    Explore at:
    Available download formats: zip, txt, application/gzip
    Dataset updated
    May 14, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juhi Kulshrestha; Marcos Oliveira; Orkut Karacalik; Denis Bonnay; Claudia Wagner
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This anonymized data set consists of one month's (October 2018) web tracking data of 2,148 German users. For each user, the data contains the anonymized URL of each webpage the user visited, the domain of the webpage, and the category of the domain (41 distinct categories in total). Together, these 2,148 users made 9,151,243 URL visits, spanning 49,918 unique domains. For each user in our data set, we have self-reported information (collected via a survey) about their gender and age.

    We acknowledge the support of Respondi AG, which provided the web tracking and survey data free of charge for research purposes, with special thanks to François Erner and Luc Kalaora at Respondi for their insights and help with data extraction.

    The data set is analyzed in the following paper:

    • Kulshrestha, J., Oliveira, M., Karacalik, O., Bonnay, D., Wagner, C. "Web Routineness and Limits of Predictability: Investigating Demographic and Behavioral Differences Using Web Tracking Data." Proceedings of the International AAAI Conference on Web and Social Media. 2021. https://arxiv.org/abs/2012.15112.

    The code used to analyze the data is also available at https://github.com/gesiscss/web_tracking.

    If you use data or code from this repository, please cite the paper above and the Zenodo link.

  3. Statistics Interface Province-Level Data Collection - Datasets - This...

    • store.smartdatahub.io
    Updated Nov 11, 2024
    + more versions
    Cite
    (2024). Statistics Interface Province-Level Data Collection - Datasets - This service has been deprecated - please visit https://www.smartdatahub.io/ to access data. See the About page for details. // [Dataset]. https://store.smartdatahub.io/dataset/fi_tilastokeskus_tilastointialueet_maakunta1000k
    Explore at:
    Dataset updated
    Nov 11, 2024
    Description

    This dataset collection is a compilation of related data tables sourced from the website of Tilastokeskus (Statistics Finland). The data is organized in a tabular format of rows and columns. The collection includes several tables, each representing a different year, which provides a temporal view of the data. The description provided by the data source, Tilastokeskuksen palvelurajapinta (Statistics Finland's service interface), suggests that the data is likely to be statistical in nature and could be related to regional statistics. This dataset is licensed under CC BY 4.0 (Creative Commons Attribution 4.0, https://creativecommons.org/licenses/by/4.0/deed.fi).

  4. Medium-Search-Dataset

    • kaggle.com
    Updated Jun 11, 2021
    Cite
    Keegan Fernandes (2021). Medium-Search-Dataset [Dataset]. https://www.kaggle.com/datasets/aristotle609/mediumsearchdataset/data
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 11, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Keegan Fernandes
    Description

    Context

    Since I started blogging on medium.com (here's a shameless plug), I haven't really had many views (granted, my posts aren't that great and my publishing frequency is low), but I've wondered what differentiates the top Medium data science bloggers from me. So I decided to make a dataset to find out and improve myself (I found a lot to improve upon) 😃

    Content

    The data represents the top 200 Medium articles for each specific query. The data was acquired through web scraping and contains various metadata about each post, barring the blog text data, which I will upload in a separate dataset.

    Acknowledgements

    The thought of web scraping was pretty daunting to me: the coding, the time, and the data required would be a lot. It was then that I discovered ParseHub, which allowed me to scrape websites with ease. They also ran the web scraping on their servers, all for free (with a limit). Web scraping is an important method in data science for collecting data, and I would recommend everyone give ParseHub a try.

    Inspiration

    Hopefully this will give all the struggling bloggers on Kaggle some insight.

  5. Swash User Search and Consumer Journey Data - 1.5M Worldwide Users - GDPR...

    • datarade.ai
    .csv, .xls
    Updated Jun 27, 2023
    + more versions
    Cite
    Swash (2023). Swash User Search and Consumer Journey Data - 1.5M Worldwide Users - GDPR Compliant [Dataset]. https://datarade.ai/data-products/users-searching-data-on-top-search-engines
    Explore at:
    Available download formats: .csv, .xls
    Dataset updated
    Jun 27, 2023
    Dataset authored and provided by
    Swash
    Area covered
    Kuwait, Honduras, Israel, United States of America, Macao, Taiwan, Panama, Japan, Bangladesh, Korea (Republic of)
    Description

    Unlock the Power of Behavioural Data with GDPR-Compliant Clickstream Insights.

    Swash clickstream data offers a comprehensive and GDPR-compliant dataset sourced from users worldwide, encompassing both desktop and mobile browsing behaviour. Here's an in-depth look at what sets us apart and how our data can benefit your organisation.

    User-Centric Approach: Unlike traditional data collection methods, we take a user-centric approach by rewarding users for the data they willingly provide. This unique methodology ensures transparent data collection practices, encourages user participation, and establishes trust between data providers and consumers.

    Wide Coverage and Varied Categories: Our clickstream data covers diverse categories, including search, shopping, and URL visits. Whether you are interested in understanding user preferences in e-commerce, analysing search behaviour across different industries, or tracking website visits, our data provides a rich and multi-dimensional view of user activities.

    GDPR Compliance and Privacy: We prioritise data privacy and strictly adhere to GDPR guidelines. Our data collection methods are fully compliant, ensuring the protection of user identities and personal information. You can confidently leverage our clickstream data without compromising privacy or facing regulatory challenges.

    Market Intelligence and Consumer Behaviour: Gain deep insights into market intelligence and consumer behaviour using our clickstream data. Understand trends, preferences, and user behaviour patterns by analysing the comprehensive user-level, time-stamped raw or processed data feed. Uncover valuable information about user journeys, search funnels, and paths to purchase to enhance your marketing strategies and drive business growth.

    High-Frequency Updates and Consistency: We provide high-frequency updates and consistent user participation, offering both historical data and ongoing daily delivery. This ensures you have access to up-to-date insights and a continuous data feed for comprehensive analysis. Our reliable and consistent data empowers you to make accurate and timely decisions.

    Custom Reporting and Analysis: We understand that every organisation has unique requirements. That's why we offer customisable reporting options, allowing you to tailor the analysis and reporting of clickstream data to your specific needs. Whether you need detailed metrics, visualisations, or in-depth analytics, we provide the flexibility to meet your reporting requirements.

    Data Quality and Credibility: We take data quality seriously. Our data sourcing practices are designed to ensure responsible and reliable data collection. We implement rigorous data cleaning, validation, and verification processes, guaranteeing the accuracy and reliability of our clickstream data. You can confidently rely on our data to drive your decision-making processes.

  6. TTVP Retail Market Spot Check Audit Database

    • fisheries.noaa.gov
    • datasets.ai
    • +1 more
    Updated May 1, 2001
    + more versions
    Cite
    West Coast Regional Office (2001). TTVP Retail Market Spot Check Audit Database [Dataset]. https://www.fisheries.noaa.gov/inport/item/17224
    Explore at:
    Dataset updated
    May 1, 2001
    Dataset provided by
    West Coast Regional Office
    Time period covered
    May 2001 - Jun 29, 2125
    Area covered
    Puerto Rico, United States
    Description

    The data set contains information on retail market spot check audit purchases of tuna in airtight containers. Data are available from May 2001 to present with new data appended annually. Information includes the date, location, product type, store information where random spot check purchases were made throughout the United States and Puerto Rico. Information on purchased product allows the man...

  7. youtube-music-hits

    • huggingface.co
    Updated Nov 14, 2024
    Cite
    Akbar Gherbal (2024). youtube-music-hits [Dataset]. https://huggingface.co/datasets/akbargherbal/youtube-music-hits
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 14, 2024
    Authors
    Akbar Gherbal
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    YouTube
    Description

    YouTube Music Hits Dataset

    A collection of YouTube music video data sourced from Wikidata, focusing on videos with significant viewership metrics.

    Dataset Description

    Overview

    • 24,329 music videos
    • View range: 1M to 5.5B views
    • Temporal range: 1977-2024

    Features

    • youtubeId: YouTube video identifier
    • itemLabel: Video/song title
    • performerLabel: Artist/band name
    • youtubeViews: View count
    • year: Release year
    • genreLabel: Musical genre(s)

    View… See the full description on the dataset page: https://huggingface.co/datasets/akbargherbal/youtube-music-hits.

  8. Hawaii Open Data - Sites - CKAN Ecosystem Catalog

    • catalog.civicdataecosystem.org
    Updated Feb 1, 2001
    Cite
    (2001). Hawaii Open Data - Sites - CKAN Ecosystem Catalog [Dataset]. https://catalog.civicdataecosystem.org/dataset/hawaii-open-data
    Explore at:
    Dataset updated
    Feb 1, 2001
    Area covered
    Hawaii
    Description

    With the launch of the State of Hawaii's Open Data portal, the State of Hawaii has now begun providing residents, analysts, and civic developers with unparalleled access to State data for use in increasing transparency, driving civic innovation, and engaging participants in a more collaborative form of government. Visitors to the site will find over 150 datasets organized by six major topics, with more datasets continuing to be added to the site.

    Data on the portal has been optimized so that users of varying technical ability will find the site easy to navigate and use. Residents, journalists and analysts will find that the data can easily be contextualized for various purposes using intuitive features built directly within the State of Hawaii's Open Data portal. Videos detailing how to sort, filter, and visualize data can be found within the video guide section of the site.

    Developers wishing to use the data for civic innovation will benefit from the CKAN Open Data API, a fully-documented, RESTful Application Programming Interface (API). For more information about the API powering the State of Hawaii's Open Data Portal, please visit the developer's page.

    State-of-the-art social data features enable participants to create a more collaborative form of government by commenting, discussing, and sharing datasets with other participants on the platform or to publish them on other social networks like Twitter or Facebook. Users of the site are encouraged to participate in the development and future direction of the site by suggesting datasets to be added to the platform.

    Training materials for citizens, staff and administrators: https://opendata.hawaii.gov/pages/training
    Documentation on the CKAN API: https://docs.ckan.org/en/2.9/api/index.html
    Open Data site statistics: https://opendata.hawaii.gov/stats
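
    For developers, a request to the CKAN API mentioned above might be built roughly as follows. This is a sketch under assumptions: it assumes the portal exposes the standard CKAN Action API under /api/3/action, which has not been verified for this site, and the search term "energy" is an arbitrary placeholder. The code only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Portal base URL as given in the description above.
BASE_URL = "https://opendata.hawaii.gov"


def package_search_url(query, rows=5):
    """Build a CKAN Action API URL that searches dataset metadata."""
    params = urlencode({"q": query, "rows": rows})
    return f"{BASE_URL}/api/3/action/package_search?{params}"


url = package_search_url("energy")
print(url)
```

    Fetching that URL (for example with urllib.request.urlopen) would normally return JSON on a standard CKAN installation; check the CKAN API documentation linked above for the response layout.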

  9. Chicago Taxi Trips

    • kaggle.com
    • data.cityofchicago.org
    zip
    Updated Apr 18, 2018
    Cite
    City of Chicago (2018). Chicago Taxi Trips [Dataset]. https://www.kaggle.com/datasets/chicago/chicago-taxi-trips-bq
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Apr 18, 2018
    Dataset authored and provided by
    City of Chicago
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    Context

    Taxicabs in Chicago, Illinois, are operated by private companies and licensed by the city. There are about seven thousand licensed cabs operating within the city limits. Licenses are obtained through the purchase or lease of a taxi medallion which is then affixed to the top right hood of the car. Source: https://en.wikipedia.org/wiki/Taxicabs_of_the_United_States#Chicago

    Content

    This dataset includes taxi trips from 2013 to the present, reported to the City of Chicago in its role as a regulatory agency. To protect privacy but allow for aggregate analyses, the Taxi ID is consistent for any given taxi medallion number but does not show the number, Census Tracts are suppressed in some cases, and times are rounded to the nearest 15 minutes. Due to the data reporting process, not all trips are reported but the City believes that most are. See http://digital.cityofchicago.org/index.php/chicago-taxi-data-released for more information about this dataset and how it was created.
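
    To illustrate the privacy rounding described above, here is a small sketch of rounding a timestamp to the nearest 15 minutes. The function name and the tie-breaking rule (exact midpoints round up) are assumptions; the source does not say how the City handles ties or which tool performs the rounding.

```python
from datetime import datetime, timedelta


def round_to_quarter_hour(ts):
    """Round a datetime to the nearest 15 minutes (midpoints round up)."""
    interval = 15 * 60  # seconds in a quarter hour
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = (ts - midnight).total_seconds()
    rounded = int((elapsed + interval / 2) // interval) * interval
    return midnight + timedelta(seconds=rounded)


start = round_to_quarter_hour(datetime(2018, 4, 18, 9, 52, 40))
print(start)
```

    One consequence for analysis: trip durations derived from rounded start and end times are only accurate to within roughly a quarter hour, so short trips cannot be compared precisely.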

    Fork this kernel to get started.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:chicago_taxi_trips

    https://cloud.google.com/bigquery/public-data/chicago-taxi

    Dataset Source: City of Chicago

    This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source —https://data.cityofchicago.org — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by Ferdinand Stohr from Unsplash.

    Inspiration

    What are the maximum, minimum and average fares for rides lasting 10 minutes or more? Which drop-off areas have the highest average tip? How does trip duration affect fare rates for trips lasting less than 90 minutes?

    https://cloud.google.com/bigquery/images/chicago-taxi-fares-by-duration.png

  10. Training and test datasets for the PredictONCO tool

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 14, 2023
    + more versions
    Cite
    Stourac, Jan (2023). Training and test datasets for the PredictONCO tool [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10013763
    Explore at:
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    Bednar, David
    Planas-Iglesias, Joan
    Sterba, Jaroslav
    Pinto, Gaspar
    Damborsky, Jiri
    Szotkowska, Veronika
    Mazurenko, Stanislav
    Stourac, Jan
    Dobias, Adam
    Khan, Rayyan
    Pokorna, Petra
    Slaby, Ondrej
    Borko, Simeon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was used for training and validating the PredictONCO web tool, supporting decision-making in precision oncology by extending the bioinformatics predictions with advanced computing and machine learning. The dataset consists of 1073 single-point mutants of 42 proteins, whose effect was classified as Oncogenic (509 data points) and Benign (564 data points). All mutations were annotated with a clinically verified effect and were compiled from the ClinVar and OncoKB databases. The dataset was manually curated based on the available information in other precision oncology databases (The Clinical Knowledgebase by The Jackson Laboratory, Personalized Cancer Therapy Knowledge Base by MD Anderson Cancer Center, cBioPortal, DoCM database) or in the primary literature. To create the dataset, we also removed any possible overlaps with the data points used in the PredictSNP consensus predictor and its constituents. This was implemented to avoid any test set data leakage due to using the PredictSNP score as one of the features (see below).

    The entire dataset (SEQ) was further annotated by the pipeline of PredictONCO. Briefly, the following six features were calculated regardless of the structural information available: essentiality of the mutated residue (yes/no), the conservation of the position (the conservation grade and score), the domain where the mutation is located (cytoplasmic, extracellular, transmembrane, other), the PredictSNP score, and the number of essential residues in the protein. For approximately half of the data (STR: 377 and 76 oncogenic and benign data points, respectively), the structural information was available, and six more features were calculated: FoldX and Rosetta ddg_monomer scores, whether the residue is in the catalytic pocket (identification of residues forming the ligand-binding pocket was obtained from P2Rank), and the pKa changes (the minimum and maximum changes as well as the number of essential residues whose pKa was changed – all values obtained from PROPKA3). For both STR and SEQ datasets, 20% of the data was held out for testing. The data split was implemented at the position level to ensure that no position from the test data subset appears in the training data subset.
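
    The position-level split described above can be sketched as a grouped hold-out. This is an illustration only, not the PredictONCO authors' code: the record fields (protein, position, mutant), the seed, and the sample data are hypothetical, and the sketch holds out 20% of positions, which only approximates 20% of data points when positions carry different numbers of mutations.

```python
import random
from collections import defaultdict


def split_by_position(mutations, test_fraction=0.2, seed=0):
    """Hold out whole positions so no (protein, position) pair is in both subsets."""
    groups = defaultdict(list)
    for m in mutations:
        groups[(m["protein"], m["position"])].append(m)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test = [m for k in keys[:n_test] for m in groups[k]]
    train = [m for k in keys[n_test:] for m in groups[k]]
    return train, test


# Hypothetical records: 10 positions with 2 substitutions each.
mutations = [
    {"protein": "P1", "position": pos, "mutant": aa}
    for pos in range(1, 11)
    for aa in ("A", "G")
]
train, test = split_by_position(mutations)
print(len(train), len(test))
```

    Splitting by group rather than by row is what prevents two mutations at the same position from landing on opposite sides of the split.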

    For more details about the tool, please visit the help page or get in touch with us.

    14-Dec-2023 update: the file with features PredictONCO-features.txt now includes UniProt IDs, transcripts, PDB codes, and mutations.

  11. Blood pressure checks on NYC HealthMap

    • catalog.data.gov
    • data.cityofnewyork.us
    Updated Nov 29, 2021
    + more versions
    Cite
    data.cityofnewyork.us (2021). Blood pressure checks on NYC HealthMap [Dataset]. https://catalog.data.gov/dataset/blood-pressure-checks-on-nyc-healthmap
    Explore at:
    Dataset updated
    Nov 29, 2021
    Dataset provided by
    data.cityofnewyork.us
    Area covered
    New York
    Description

    The dataset lists locations in NYC that offer free blood pressure checks, either at self-serve blood pressure kiosks or by pharmacy staff. The data was collected manually to promote access to free blood pressure checks throughout NYC, so that users can visit the NYC HealthMap online and find the location nearest to them. Each record represents a location that offers free blood pressure checks. The data can be used by members of the general public seeking places to check their blood pressure. Data may change as sites are added or as sites inform us of updates (e.g., address changes, pharmacy closures).

  12. Data from: Qbias – A Dataset on Media Bias in Search Queries and Query...

    • data.niaid.nih.gov
    Updated Mar 1, 2023
    Cite
    Haak, Fabian (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7682914
    Explore at:
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    Schaer, Philipp
    Haak, Fabian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

    Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

    Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

    The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. The AllSides balanced news roundups feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship: left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

    To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.

    Dataset 2: Search Query Suggestions (suggestions.csv)

    The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides biased news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.

    The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their respective positions, as returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server, whose location is saved in "location".

    We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
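
    The query-input expansion described above can be sketched as follows: each root term is submitted as-is and then with each letter a to z appended, giving 27 inputs, and ten suggestions per input yields the stated maximum of 270 per topic and search engine. The function name is hypothetical, and the snippet does not contact any search engine.

```python
import string


def query_inputs(root_term):
    """Return the root term followed by the root term extended with a..z."""
    return [root_term] + [f"{root_term} {letter}" for letter in string.ascii_lowercase]


inputs = query_inputs("democrats")
max_suggestions = len(inputs) * 10  # ten suggestions per input
print(inputs[:2], len(inputs), max_suggestions)
```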

    AllSides Scraper

    At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles at the AllSides balanced news headlines.

    We want to provide an easy means of retrieving the news and all corresponding information. For many tasks, it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.

  13. Data from: UNESCO World Heritage Sites Dataset

    • kaggle.com
    Updated Dec 19, 2023
    Cite
    The Devastator (2023). UNESCO World Heritage Sites Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/unesco-world-heritage-sites-dataset/suggestions
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Area covered
    World
    Description

    UNESCO World Heritage Sites Dataset

    By Throwback Thursday [source]

    About this dataset

    How to use the dataset

    Here are some tips on how to make the most out of this dataset:

    • Data Exploration:

      • Begin by understanding the structure and contents of the dataset. Evaluate the number of rows (sites) and columns (attributes) available.
      • Check for missing values or inconsistencies in data entry that may impact your analysis.
      • Assess column descriptions to understand what information is included in each attribute.
    • Geographical Analysis:

      • Leverage geographical features such as latitude and longitude coordinates provided in this dataset.
      • Plot these sites on a map using any mapping software or library like Google Maps or Folium for Python. Visualizing their distribution can provide insights into patterns based on location, climate, or cultural factors.
    • Analyzing Attributes:

      • Familiarize yourself with different attributes available for analysis. Possible attributes include Name, Description, Category, Region, Country, etc.
      • Understand each attribute's format and content type (categorical, numerical) for better utilization during data analysis.
    • Exploring Categories & Regions:

      • Look at unique categories mentioned in the Category column (e.g., Cultural Site, Natural Site) to explore specific interests. This could help identify clusters within particular heritage types across countries/regions worldwide.
      • Analyze regions with high concentrations of heritage sites using data visualizations like bar plots or word clouds based on frequency counts.
    • Identify Trends & Patterns:

      • Discover recurring themes across various sites by analyzing descriptive text attributes such as names and descriptions.
      • Identify patterns and correlations between attributes by performing statistical analysis or utilizing machine learning techniques.
    • Comparison:

      • Compare different attributes to gain a deeper understanding of the sites.
      • For example, analyze the number of heritage sites per country/region or compare the distribution between cultural and natural heritage sites.
    • Additional Data Sources:

      • Use this dataset as a foundation to combine it with other datasets for in-depth analysis. There are several sources available that provide additional data on UNESCO World Heritage Sites, such as travel blogs, official tourism websites, or academic research databases.
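
The comparison steps above (sites per country, cultural versus natural split) can be sketched with the Python standard library. This is a minimal illustration, not the dataset author's method: the column names `Country` and `Category` and the sample rows are assumptions, so adjust them to the headers in the file you download.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real file's column names and values may differ.
SAMPLE_CSV = """Name,Category,Country
Site A,Cultural,Italy
Site B,Natural,Italy
Site C,Cultural,France
Site D,Cultural,Italy
"""


def count_by(rows, column):
    """Count rows per distinct value of `column`, skipping blank or missing values."""
    values = ((row.get(column) or "").strip() for row in rows)
    return Counter(value for value in values if value)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
sites_per_country = count_by(rows, "Country")
sites_per_category = count_by(rows, "Category")

print(sites_per_country.most_common())
print(sites_per_category.most_common())
```

To run this against a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle, for example `open(path, newline="", encoding="utf-8")`.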

    Remember to cite this dataset appropriately if you use it in

    Research Ideas

    • Travel Planning: This dataset can be used to identify and plan visits to UNESCO World Heritage sites around the world. It provides information about the location, category, and date of inscription for each site, allowing users to prioritize their travel destinations based on personal interests or preferences.
    • Cultural Preservation: Researchers or organizations interested in cultural preservation can use this dataset to analyze trends in UNESCO World Heritage site listings over time. By studying factors such as geographical distribution, types of sites listed, and inscription dates, they can gain insights into patterns of cultural heritage recognition and protection.
    • Statistical Analysis: The dataset can be used for statistical analysis to explore various aspects related to UNESCO World Heritage sites. For example, it could be used to examine the correlation between a country's economic indicators (such as GDP per capita) and the number or type of World Heritage sites it possesses. This analysis could provide insights into the relationship between economic development and cultural preservation efforts at a global scale.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    See the dataset description for more information.

    Columns

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Throwback Thursday.

  14. Combined Assets Visited - City Data Portals

    • citydata.mesaaz.gov
    • data.mesaaz.gov
    application/rdfxml +5
    Updated Jun 25, 2025
    Cite
    Data & Performance (2025). Combined Assets Visited - City Data Portals [Dataset]. https://citydata.mesaaz.gov/Office-of-Management-and-Budget/Combined-Assets-Visited-City-Data-Portals/kv2n-zmcu
    Explore at:
    csv, xml, application/rdfxml, tsv, json, application/rssxml (available download formats)
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Data & Performance
    Description

    Information about accesses (visits) of city data assets. Combines analytics from both employee (citydata.mesaaz.gov) and public data (data.mesaaz.gov) portals.

    The following usage types are included in the Access Type column:

    • grid view – tabular view of the dataset / filtered view
    • primer page view – dataset / filtered view’s homepage, includes metadata and table preview of the data
    • download – download of the dataset / filtered view to CSV, JSON, etc.
    • api read access – programmatic access of dataset / filtered view, etc.
    • story page view – accessing a story page asset
    • visualization page view – accessing a chart or map asset
    • measure page view – accessing a performance measure asset

    Usage data are segmented into the following user types:

    • site member: users who have logged in and have been granted a role on the domain
    • community user: users who have logged in but do not have a role on the domain
    • anonymous: users who have not logged in to the domain

    Data are updated by a system process at least once a day.

    Please see Site Analytics: Asset Access for more detail.

  15. Council Districts

    • catalog.data.gov
    • datasets.ai
    • +3more
    Updated Jun 21, 2025
    Cite
    Cary (2025). Council Districts [Dataset]. https://catalog.data.gov/dataset/council-districts-3ea59
    Explore at:
    Dataset updated
    Jun 21, 2025
    Dataset provided by
    Cary
    Description

    This dataset contains the location of the Town of Cary’s four Town Council districts. To find out where to vote, check out the North Carolina State Board of Elections voter search website. To find out more information, visit our municipal elections page and our Council members page. This dataset is updated following municipal elections and changes in Town boundaries due to annexations.

  16. 2013 Transportation Data Collection - Datasets - This service has been...

    • store.smartdatahub.io
    Updated Nov 11, 2024
    Cite
    (2024). 2013 Transportation Data Collection - Datasets - This service has been deprecated - please visit https://www.smartdatahub.io/ to access data. See the About page for details. // [Dataset]. https://store.smartdatahub.io/dataset/fi_tilastokeskus_tieliikenne_tieliikenne_2013
    Explore at:
    Dataset updated
    Nov 11, 2024
    Description

    This dataset collection comprises multiple related data tables sourced from the web service interface (WFS) of the 'Tilastokeskus' (Statistics Finland) website in Finland. The data tables are organized in columns and rows, offering a structured format for the data. The information contained within this dataset collection primarily focuses on road traffic data for the year 2013. The data is comprehensive and could serve as a valuable resource for research and analysis related to road traffic patterns and statistics in Finland for the specified year. This dataset is licensed under CC BY 4.0 (Creative Commons Attribution 4.0, https://creativecommons.org/licenses/by/4.0/deed.fi).

  17. GiGL Spaces to Visit

    • data.europa.eu
    unknown
    Cite
    Greenspace Information for Greater London CIC (GiGL), GiGL Spaces to Visit [Dataset]. https://data.europa.eu/88u/dataset/spaces-to-visit
    Explore at:
    unknown (available download formats)
    Dataset provided by
    Greenspace Information for Greater London
    Authors
    Greenspace Information for Greater London CIC (GiGL)
    Description

    Introduction

    The GiGL Spaces to Visit dataset provides locations and boundaries for open space sites in Greater London that are available to the public as destinations for leisure, activities and community engagement. It includes green corridors that provide opportunities for walking and cycling.

    The dataset has been created by Greenspace Information for Greater London CIC (GiGL). As London’s Environmental Records Centre, GiGL mobilises, curates and shares data that underpin our knowledge of London’s natural environment. We provide impartial evidence to support informed discussion and decision making in policy and practice.

    GiGL maps under licence from the Greater London Authority.

    Description

    This dataset is a sub-set of the GiGL Open Space dataset, the most comprehensive dataset available of open spaces in London. Sites are selected for inclusion in Spaces to Visit based on their public accessibility and likelihood that people would be interested in visiting.

    The dataset is a mapped Geographic Information System (GIS) polygon dataset where one polygon (or multi-polygon) represents one space. As well as site boundaries, the dataset includes information about a site’s name, size and type (e.g. park, playing field etc.).

    GiGL developed the Spaces to Visit dataset to support anyone who is interested in London’s open spaces - including community groups, web and app developers, policy makers and researchers - with an open licence data source. More detailed and extensive data are available under GiGL data use licences for GiGL partners, researchers and students. Information services are also available for ecological consultants, biological recorders and community volunteers – please see www.gigl.org.uk for more information.

    Please note that access and opening times are subject to change (particularly at the current time) so if you are planning to visit a site check on the local authority or site website that it is open.

    The dataset is updated on a quarterly basis. If you have questions about this dataset please contact GiGL’s GIS and Data Officer.

    Data sources

    The boundaries and information in this dataset are a combination of data collected during the London Survey Method habitat and open space survey programme (1986 – 2008) and information provided to GiGL from other sources since. These sources include London borough surveys, land use datasets, volunteer surveys, feedback from the public, park friends’ groups, and updates made as part of GiGL’s on-going data validation and verification process.

    Due to data availability, some areas are more up-to-date than others. We are continually working on updating and improving this dataset. If you have any additional information or corrections for sites included in the Spaces to Visit dataset please contact GiGL’s GIS and Data Officer.

    NOTE: The dataset contains OS data © Crown copyright and database rights 2025. The site boundaries are based on Ordnance Survey mapping, and the data are published under Ordnance Survey's 'presumption to publish'. When using these data please acknowledge GiGL and Ordnance Survey as the source of the information using the following citation:

    ‘Dataset created by Greenspace Information for Greater London CIC (GiGL), 2025 – Contains Ordnance Survey and public sector information licensed under the Open Government Licence v3.0’

  18. ckanext-trak

    • catalog.civicdataecosystem.org
    Updated Jun 4, 2025
    Cite
    (2025). ckanext-trak [Dataset]. https://catalog.civicdataecosystem.org/dataset/ckanext-trak
    Explore at:
    Dataset updated
    Jun 4, 2025
    Description

    The trak extension for CKAN enhances the platform's tracking capabilities by providing tools to import Google Analytics data and modify the presentation of page view statistics. It introduces a paster command for importing page view data from exported Google Analytics CSV files, enabling users to supplement CKAN's built-in tracking. The extension also includes template customizations to alter how page view counts are displayed on dataset and resource listing pages.

    Key Features:

    • Google Analytics Data Import: Imports page view data directly from a stripped-down CSV of Google Analytics data using a dedicated paster command (csv2table). The CSV should contain a list of page views, where each row starts with '/'. The PageViews column is expected to be the 3rd column.
    • Customizable Page View Display: Changes the default presentation of page view statistics within CKAN, removing the minimum view count restriction (default is 10) so all views can be seen, and modifies UI elements.
    • Altered Page Tracking Stats: Alters the placement of page tracking statistics, moving them below Package Data (on dataset list pages) and Resource Data (on resource list pages) for better integration of tracking data.
    • UI/UX Enhancements: Replaces the flame icon typically used for page tracking with more subtle background styling to modernize the presentation of tracking data.
    • Backend Data Manipulation: Uses a 'floor date' of 2011-01-01 for page view calculation. Entries are made in the trackingraw table for each view, with a unique UUID.

    Integration with CKAN: The extension integrates into CKAN's core functionalities by introducing a new paster command and modifying existing templates for displaying page view statistics. It relies on CKAN's built-in tracking being enabled, but supplements its capabilities with imported data and presentation adjustments. After importing data using the csv2table paster command, the standard tracking update and search-index rebuild paster tasks need to be run to process the imported data and update the search index.

    Benefits & Impact: By importing data from Google Analytics, the trak extension gives administrators a holistic view of page views. It presents tracking statistics in a more integrated fashion, which allows for a better understanding of the impact and utilization of resources within the CKAN instance, based on Google Analytics data.
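
The CSV layout described above (page rows start with '/', page views in the 3rd column) can be illustrated with a short standalone parser. This is only a sketch under those two stated rules; it is not the extension's csv2table command, and the function name `read_page_views`, the header names, and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical stripped-down export. Only rows whose first field starts
# with '/' are treated as page rows; everything else is skipped.
SAMPLE_EXPORT = """# comment line from the export
Page,Unique Pageviews,Pageviews
/dataset/roads,12,40
/dataset/parks,3,7
Total,15,47
"""


def read_page_views(text):
    """Return {page_path: page_views}, reading the count from the 3rd column."""
    views = {}
    for row in csv.reader(io.StringIO(text)):
        # Skip headers, totals, comments, and short rows.
        if len(row) < 3 or not row[0].startswith("/"):
            continue
        try:
            # Tolerate thousands separators such as "1,234" in quoted fields.
            count = int(row[2].replace(",", ""))
        except ValueError:
            continue
        # Sum duplicates in case the same path appears more than once.
        views[row[0]] = views.get(row[0], 0) + count
    return views


page_views = read_page_views(SAMPLE_EXPORT)
print(page_views)
```

Filtering on the leading '/' rather than on row position means header, comment, and total rows need no special handling.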

  19. Composite Dataset of Lumbar Spine Mid-Sagittal Images with Annotations and...

    • data.mendeley.com
    Updated Mar 2, 2021
    Cite
    Rao Farhat Masood (2021). Composite Dataset of Lumbar Spine Mid-Sagittal Images with Annotations and Clinically Relevant Spinal Measurements [Dataset]. http://doi.org/10.17632/k3b363f3vz.2
    Explore at:
    Dataset updated
    Mar 2, 2021
    Authors
    Rao Farhat Masood
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This composite dataset of mid-sagittal views of the lumbar spine comprises lumbar spine images, duly labelled/annotated ground truth images, and the corresponding spinal measurements. The purpose of creating this dataset was to establish a strong correlation between the images and the clinically relevant spinal measurements. Presently, these measurements are taken either entirely manually or with computer-assisted tools. The spinal measurements are clinically significant for a spinal surgeon before suggesting or shortlisting a suitable surgical intervention. Traditionally, the spinal surgeon evaluates the condition of the patient before a surgical procedure in order to ascertain the usefulness of the adopted procedure. It also helps the surgeon in establishing the effectiveness of the procedure adopted. For example, in the case of a spinal fusion procedure, whether the fusion will be able to restore spinal balance is a question that is answered by making relevant spinal measurements, including the lumbar lordotic curve angle (both segmental and for the whole lumbar spine), lumbosacral angle, spinal heights, dimensions of vertebral bodies, etc.

    The Composite Dataset is acquired in the following steps:

    1. Exporting the mid-sagittal view from the MRI dataset (originally taken from Sudirman, Sud; Al Kafri, Ala; natalia, friska; Meidia, Hira; Afriliana, Nunik; Al-Rashdan, Wasfi; Bashtawi, Mohammad; Al-Jumaily, Mohammed (2019), “Label Image Ground Truth Data for Lumbar Spine MRI Dataset”, Mendeley Data, V2, doi: 10.17632/zbf6b4pttk.2). The original dataset comprises axial views with annotations; however, sagittal views are used instead to assess spinal deformities and analyze spinal balance.
    2. Manual labelling of lumbar vertebral bodies from L1 to L5 and the first sacrum bone. A total of 6 regions were labelled in consultation with expert radiologists, followed by validation by an expert spinal surgeon.
    3. Performing fully automatic spinal measurements, including vertebral body identification and labelling, lumbar height, lumbosacral angle, lumbar lordotic angle, estimation of spinal curve, intervertebral body dimensions, and vertebral body dimensions. All the angular measurements are in degrees, whereas the distance measurements are in millimeters.

    A total of 514 images and annotations with spinal measurements can be downloaded, with a request to please cite our work in your research.

  20. COVID-19 Pandemic Wikipedia Readership

    • figshare.com
    txt
    Updated May 31, 2023
    Cite
    Isaac Johnson; Leila Zia; Joseph Allemandou; Marcel Ruiz Forns; Nuria Ruiz; Fabian Kaelin (2023). COVID-19 Pandemic Wikipedia Readership [Dataset]. http://doi.org/10.6084/m9.figshare.14548032.v3
    Explore at:
    txt (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Isaac Johnson; Leila Zia; Joseph Allemandou; Marcel Ruiz Forns; Nuria Ruiz; Fabian Kaelin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This data release includes two Wikipedia datasets related to the readership of the project as it relates to the early COVID-19 pandemic period. The first dataset is COVID-19 article page views by country; the second dataset is one-hop navigation where one of the two pages is COVID-19 related. The data covers roughly the first six months of the pandemic, more specifically from January 1st 2020 to June 30th 2020. For more background on the pandemic in those months, see English Wikipedia's Timeline of the COVID-19 pandemic.

    Wikipedia articles are considered COVID-19 related according to the methodology described here; the list of COVID-19 articles used for the released datasets is available in covid_articles.tsv. For simplicity and transparency, the same list of articles from 20 April 2020 was used for the entire dataset, though in practice new COVID-19-relevant articles were constantly being created as the pandemic evolved.

    Privacy considerations

    While this data is considered valuable for the insight that it can provide about information-seeking behaviors around the pandemic in its early months across diverse geographies, care must be taken to not inadvertently reveal information about the behavior of individual Wikipedia readers. We put in place a number of filters to release as much data as we can while minimizing the risk to readers.

    The Wikimedia foundation started to release most viewed articles by country from Jan 2021. At the beginning of the COVID-19 pandemic an exemption was made to store reader data about the pandemic with additional privacy protections:

    • exclude the page views from users engaged in an edit session
    • exclude reader data from specific countries (with a few exceptions)
    • the aggregated statistics are based on 50% of reader sessions that involve a pageview to a COVID-19-related article (see covid_pages.tsv). As a control, a 1% random sample of reader sessions that have no pageviews to COVID-19-related articles was kept. In aggregate, we make sure this 1% non-COVID-19 sample and 50% COVID-19 sample represents less than 10% of pageviews for a country for that day. The randomization and filters occur on a daily cadence with all timestamps in UTC.
    • exclude power users - i.e. userhashes with greater than 500 pageviews in a day. This doubles as another form of likely bot removal, protects very heavy users of the project, and also in theory would help reduce the chance of a single user heavily skewing the data.
    • exclude readership from users of the iOS and Android Wikipedia apps.

    In effect, the view counts in this dataset represent comparable trends rather than the total amount of traffic from a given country. For more background on per-country readership data, and the COVID-19 privacy protections in particular, see this phabricator.

    To further minimize privacy risks, a k-anonymity threshold of 100 was applied to the aggregated counts. For example, a page needs to be viewed at least 100 times in a given country and week in order to be included in the dataset. In addition, the view counts are floored to a multiple of 100.

    Datasets

    The datasets published in this release are derived from a reader session dataset generated by the code in this notebook with the filtering described above. The raw reader session data itself will not be publicly available due to privacy considerations. The datasets described below are similar to the pageviews and clickstream data that the Wikimedia foundation publishes already, with the addition of the country-specific counts.

    COVID-19 pageviews

    The file covid_pageviews.tsv contains:

    • pageview counts for COVID-19 related pages, aggregated by week and country
    • k-anonymity threshold of 100
    • example: In the 13th week of 2020 (23 March - 29 March 2020), the page 'Pandémie_de_Covid-19_en_Italie' on French Wikipedia was visited 11700 times from readers in Belgium
    • as a control bucket, we include pageview counts to all pages aggregated by week and country. Due to privacy considerations during the collection of the data, the control bucket was sampled at ~1% of all view traffic. The view counts for the control title are thus proportional to the total number of pageviews to all pages.

    The file is ~8 MB and contains ~134000 data points across the 27 weeks, 108 countries, and 168 projects.

    Covid reader session bigrams

    The file covid_session_bigrams.tsv contains:

    • number of occurrences of visits to pages A -> B, where either A or B is a COVID-19 related article. Note that the bigrams are tuples (from, to) of articles viewed in succession; the underlying mechanism can be clicking on a link in an article, but it may also have been a new search or reading both articles based on links from third source articles. In contrast, the clickstream data is based on referral information only
    • aggregated by month and country
    • k-anonymity threshold of 100
    • example: In March of 2020, there were 1000 occurrences of readers accessing the page es.wikipedia/SARS-CoV-2 followed by es.wikipedia/Orthocoronavirinae from Chile

    The file is ~10 MB and contains ~90000 bigrams across the 6 months, 96 countries, and 56 projects.

    Contact

    Please reach out to research-feedback@wikimedia.org for any questions.
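
The two release rules stated in the description (drop any aggregate below the k-anonymity threshold of 100, then floor the remaining counts to a multiple of 100) can be illustrated with a short sketch. This is not the authors' notebook code: the function name `anonymize_counts`, the key layout, and the raw counts are hypothetical, chosen only to show the arithmetic.

```python
K_THRESHOLD = 100  # minimum count for an aggregate to be released
FLOOR_STEP = 100   # released counts are floored to a multiple of this


def anonymize_counts(raw_counts, k=K_THRESHOLD, step=FLOOR_STEP):
    """Drop aggregates below k, then floor the rest to a multiple of step."""
    released = {}
    for key, count in raw_counts.items():
        if count < k:
            continue  # too few views: omitted from the release entirely
        released[key] = (count // step) * step
    return released


# Hypothetical raw aggregates keyed by (project, page, country, week).
raw = {
    ("fr.wikipedia", "Page_A", "BE", "2020-W13"): 11742,
    ("es.wikipedia", "Page_B", "CL", "2020-W13"): 99,
    ("en.wikipedia", "Page_C", "US", "2020-W13"): 100,
}

released = anonymize_counts(raw)
print(released)
```

Because of the flooring, a published value such as 11700 stands for any true count from 11700 to 11799, which is why the counts are better read as trends than as exact totals.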

Top Visited Websites

A dataset of the top visited websites on the internet

Explore at:
71 scholarly articles cite this dataset (View in Google Scholar)
