89 datasets found
  1. Amount of data created, consumed, and stored 2010-2023, with forecasts to...

    • statista.com
    Updated Nov 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
    Explore at:
    Dataset updated
    Nov 21, 2024
    Dataset authored and provided by
    Statistahttp://statista.com/
    Time period covered
    May 2024
    Area covered
    Worldwide
    Description

    The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 149 zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than 394 zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often. Storage capacity also growing Only a small percentage of this newly created data is kept though, as just two percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.

  2. Internet users in the United States 2020-2029

    • statista.com
    Updated Dec 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2024). Internet users in the United States 2020-2029 [Dataset]. https://www.statista.com/statistics/325645/usa-number-of-internet-users/
    Explore at:
    Dataset updated
    Dec 12, 2024
    Dataset authored and provided by
    Statistahttp://statista.com/
    Area covered
    United States
    Description

    The number of internet users in the United States was forecast to continuously increase between 2024 and 2029 by in total 13.5 million users (+4.16 percent). After the ninth consecutive increasing year, the number of users is estimated to reach 337.67 million users and therefore a new peak in 2029. Notably, the number of internet users of was continuously increasing over the past years.Depicted is the estimated number of individuals in the country or region at hand, that use the internet. As the datasource clarifies, connection quality and usage frequency are distinct aspects, not taken into account here.The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).

  3. NYC STEW-MAP Staten Island organizations' website hyperlink webscrape

    • catalog.data.gov
    • gimi9.com
    Updated Nov 21, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). NYC STEW-MAP Staten Island organizations' website hyperlink webscrape [Dataset]. https://catalog.data.gov/dataset/nyc-stew-map-staten-island-organizations-website-hyperlink-webscrape
    Explore at:
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Area covered
    Staten Island, New York
    Description

    The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis. For dataset 1: Organizations were selected from from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites and that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020). For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See "READ ME" file for further details. References: R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3. Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1. USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/. This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).

  4. s

    Statistical Region Data Collection 2020 - Datasets - This service has been...

    • store.smartdatahub.io
    Updated Nov 11, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). Statistical Region Data Collection 2020 - Datasets - This service has been deprecated - please visit https://www.smartdatahub.io/ to access data. See the About page for details. // [Dataset]. https://store.smartdatahub.io/dataset/fi_tilastokeskus_tilastointialueet_maakunta1000k_2020
    Explore at:
    Dataset updated
    Nov 11, 2024
    Description

    This dataset collection contains tables sourced from the Statistics Finland (Tilastokeskus) website. The tables in the collection encompass a range of related data with respect to statistical areas in Finland. The source describes the data as 'Statistical Service Interface (WFS)', indicating that it pertains to statistical information accessed through a web feature service. The data is organized in an easy-to-understand table format with distinct rows and columns, making it simple to interpret and analyze. This dataset is licensed under CC BY 4.0 (Creative Commons Attribution 4.0, https://creativecommons.org/licenses/by/4.0/deed.fi).

  5. s

    Statistics Bureau Web Service Interface (WFS) Dataset Collection 2020 -...

    • store.smartdatahub.io
    Updated Nov 11, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). Statistics Bureau Web Service Interface (WFS) Dataset Collection 2020 - Datasets - This service has been deprecated - please visit https://www.smartdatahub.io/ to access data. See the About page for details. // [Dataset]. https://store.smartdatahub.io/dataset/fi_tilastokeskus_tilastointialueet_seutukunta4500k_2020
    Explore at:
    Dataset updated
    Nov 11, 2024
    Description

    This dataset collection comprises a series of related data tables sourced from the website of 'Tilastokeskus' (Statistics Finland), based in Finland. The tables within this collection contain data retrieved from the Statistics Finland's service interface (WFS). The content of the tables is organized in a structured format with rows and columns, showcasing a correlation between different sets of data. The collection, while primarily intended for statistical analysis, can be utilized in a variety of ways, depending on the specific needs of the user. This dataset is licensed under CC BY 4.0 (Creative Commons Attribution 4.0, https://creativecommons.org/licenses/by/4.0/deed.fi).

  6. Web Data Commons (December 2020) Property and Datatype Usage Dataset

    • zenodo.org
    application/gzip
    Updated Mar 25, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jan Martin Keil; Jan Martin Keil; Merle Gänßinger; Merle Gänßinger (2022). Web Data Commons (December 2020) Property and Datatype Usage Dataset [Dataset]. http://doi.org/10.5281/zenodo.6205111
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Mar 25, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Jan Martin Keil; Jan Martin Keil; Merle Gänßinger; Merle Gänßinger
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (December 2020) based on the Common Crawl September 2020 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v1.0.0, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.

    Dataset Properties

    • Size: 0.6 GiB compressed, 11.0 GiB uncompressed, 53 641 457 rows plus 1 head line determined using gunzip -c measurements.csv.gz | wc -l
    • Parsing Failures: The scanner failed to parse 45 307 472 triples (~0.1 %) of the source dataset (containing 37 971 812 425 triples). The main reasons for failures were malformed IRIs and illegal character encodings.
    • Content:
      • CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.
      • FILE_URL: The URL of the Web Data Commons file that has been measured.
      • MEASUREMENT: The applied measurement with specific conditions, one of:
        • UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.
        • UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.
        • UsedAsDatatype: The total number of literals with the datatype.
        • UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.
        • ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.
        • ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.
        • ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.
        • ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidInfOrNaNNotation: The number of lexicals that equals either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.
        • ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        • ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.
        • ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.
        • ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, and xsd:integer, xsd:decimal, xsd:float, and xsd:double.
        Note: Lexical representation of xsd:double values in embedded JSON-LD got normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful by drawing conclusions from according Valid… and Unprecise… measures.
      • PROPERTY: The property that has been measured.
      • DATATYPE: The datatype that has been measured.
      • QUANTITY: The count of statements that fulfill the condition specified by the measurement per file, property and datatype.

    Preview

    "CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://schema.org/height","http://www.w3.org/2001/XMLSchema#string","11"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://www.w3.org/ns/csvw#value","http://www.w3.org/2001/XMLSchema#string","27"
    "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://schema.org/saturatedFatContent","http://www.w3.org/2001/XMLSchema#string","1"
    …
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://purl.org/goodrelations/v1#hasMaxValue","http://www.w3.org/2001/XMLSchema#float","2"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://purl.org/goodrelations/v1#hasMinValue","http://www.w3.org/2001/XMLSchema#float","4"
    "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://rdfs.org/sioc/ns#num_replies","http://www.w3.org/2001/XMLSchema#integer","1062"
    

    Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of probably "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions in the original source website.

  7. CT-FAN-21 corpus: A dataset for Fake News Detection

    • zenodo.org
    Updated Oct 23, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
    Explore at:
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com .

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

    Subtask 3A: Multi-class fake news detection of news articles (English) Sub-task A would detect fake news designed as a four-class classification problem. The training data will be released in batches and roughly about 900 articles with the respective label. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other- An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

    Subtask 3B: Topical Domain Classification of News Articles (English) Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article (English). This is a classification problem. The task is to categorise fake news articles into six topical categories like health, election, crime, climate, election, education. This task will be offered for a subset of the data of Subtask 3A.

    Input Data

    The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

    Task 3a

    • ID- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • our rating - class of the news article as false, partially false, true, other

    Task 3b

    • public_id- Unique identifier of the news article
    • Title- Title of the news article
    • text- Text mentioned inside the news article
    • domain - domain of the given news article(applicable only for task B)

    Output data format

    Task 3a

    • public_id- Unique identifier of the news article
    • predicted_rating- predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Task 3b

    • public_id- Unique identifier of the news article
    • predicted_domain- predicted domain

    Sample file

    public_id, predicted_domain
    1, health
    2, crime

    Additional data for Training

    To train your model, the participant can use additional data with a similar format; some datasets are available over the web. We don't provide the background truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible source:

    IMPORTANT!

    1. Fake news article used for task 3b is a subset of task 3a.
    2. We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like election, COVID-19 etc.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.

    Submission Link: https://competitions.codalab.org/competitions/31238

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1.https://arxiv.org/pdf/2010.00502.pdf
    • G. K. Shahi and D. Nandini, “FakeCovid – a multilingualcross-domain fact check news dataset for covid-19,” inWorkshop Proceedings of the 14th International AAAIConference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
  8. d

    U.S. State and Territorial Stay-At-Home Orders: March 15, 2020 – May 31,...

    • catalog.data.gov
    • data.virginia.gov
    • +2more
    Updated Sep 11, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Centers for Disease Control and Prevention (2022). U.S. State and Territorial Stay-At-Home Orders: March 15, 2020 – May 31, 2021 by County by Day [Dataset]. https://catalog.data.gov/dataset/u-s-state-and-territorial-stay-at-home-orders-march-15-2020-may-31-2021-by-county-by-day-6013e
    Explore at:
    Dataset updated
    Sep 11, 2022
    Dataset provided by
    Centers for Disease Control and Prevention
    Area covered
    United States
    Description

    State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded using Microsoft Excel by one coder with one or more additional coders conducting quality assurance. Data were collected to determine when individuals in states and territories were subject to executive orders, administrative orders, resolutions, and proclamations for COVID-19 that require or recommend people stay in their homes. Data consists exclusively of state and territorial orders, many of which apply to specific counties within their respective state or territory; therefore, data is broken down to the county level. These data are derived from the publicly available state and territorial executive orders, administrative orders, resolutions, and proclamations (“orders”) for COVID-19 that expressly require or recommend individuals stay at home found by the CDC, COVID-19 Community Intervention and At-Risk Task Force, Monitoring and Evaluation Team & CDC, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program from March 15, 2020 through May 31, 2021. These data will be updated as new orders are collected. Any orders not available through publicly accessible websites are not included in these data. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are not included in these data. These data do not include mandatory business closures, curfews, or limitations on public or private gatherings. These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.

  9. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated Feb 10, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    binAvailable download formats
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit, a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    Column NameTypeDescription
    subreddit_idTEXTThe unique identifier for the subreddit.
    crosspost_parent_idTEXTThe ID of the original Reddit post if this post is a crosspost.
    post_idTEXTUnique identifier for the Reddit post.
    created_atTIMESTAMPThe timestamp when the post was created.
    updated_atTIMESTAMPThe timestamp when the post was last updated.
    language_codeTEXTThe language code of the post.
    scoreINTEGERThe score (upvotes minus downvotes) of the post.
    upvote_ratioREALThe ratio of upvotes to total votes.
    gildingsINTEGERNumber of awards (gildings) received by the post.
    num_commentsINTEGERNumber of comments on the post.

    Table: comments

    Column NameTypeDescription
    subreddit_idTEXTThe unique identifier for the subreddit.
    post_idTEXTThe ID of the Reddit post the comment belongs to.
    parent_idTEXTThe ID of the parent comment (if a reply).
    comment_idTEXTUnique identifier for the comment.
    created_atTIMESTAMPThe timestamp when the comment was created.
    last_modified_atTIMESTAMPThe timestamp when the comment was last modified.
    scoreINTEGERThe score (upvotes minus downvotes) of the comment.
    upvote_ratioREALThe ratio of upvotes to total votes for the comment.
    gildedINTEGERNumber of awards (gildings) received by the comment.

    Table: postlinks

    Column NameTypeDescription
    post_idTEXTUnique identifier for the Reddit post.
    end_processed_validINTEGERWhether the extracted URL from the post resolves to a valid URL.
    end_processed_urlTEXTThe extracted URL from the Reddit post.
    final_validINTEGERWhether the final URL from the post resolves to a valid URL after redirections.
    final_statusINTEGERHTTP status code of the final URL.
    final_urlTEXTThe final URL after redirections.
    redirectedINTEGERIndicator of whether the posted URL was redirected (1) or not (0).
    in_titleINTEGERIndicator of whether the link appears in the post title (1) or post body (0).

    Table: commentlinks

    Column NameTypeDescription
    comment_idTEXTUnique identifier for the Reddit comment.
    end_processed_validINTEGERWhether the extracted URL from the comment resolves to a valid URL.
    end_processed_urlTEXTThe extracted URL from the comment.
    final_validINTEGERWhether the final URL from the comment resolves to a valid URL after redirections.
    final_statusINTEGERHTTP status code of the final URL.
    final_urlTEXTThe final URL after

  10. Understanding Society: Calendar Year Dataset, 2021: Special Licence Access

    • beta.ukdataservice.ac.uk
    • datacatalogue.cessda.eu
    Updated 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Institute For Social University Of Essex (2024). Understanding Society: Calendar Year Dataset, 2021: Special Licence Access [Dataset]. http://doi.org/10.5255/ukda-sn-9194-1
    Explore at:
    Dataset updated
    2024
    Dataset provided by
    UK Data Servicehttps://ukdataservice.ac.uk/
    datacite
    Authors
    Institute For Social University Of Essex
    Description

    Understanding Society (the UK Household Longitudinal Study), which began in 2009, is conducted by the Institute for Social and Economic Research (ISER) at the University of Essex, and the survey research organisations Verian Group (formerly Kantar Public) and NatCen. It builds on and incorporates, the British Household Panel Survey (BHPS), which began in 1991.

    The Understanding Society: Calendar Year Dataset, 2021, is designed to enable cross-sectional analysis of individuals and households relating specifically to their annual interviews conducted in the year 2021, and, therefore, combine data collected in three waves (Waves 11, 12 and 13). It has been produced from the same data collected in the main Understanding Society study and released in the longitudinal datasets SN 6614 (End User Licence) and SN 6931 (Special Licence). Such cross-sectional analysis can, however, only involve variables that are collected in every wave in order to have data for the full sample panel. The 2021 dataset is the second of a series of planned Calendar Year Datasets to facilitate cross-sectional analysis of specific years. Full details of the Calendar Year Dataset sample structure (including why some individual interviews from 2022 are included), data structure and additional supporting information can be found in the document '9194_calendar_year_dataset_2020_user_guide'.

    As multi-topic studies, the purpose of Understanding Society is to understand the short- and long-term effects of social and economic change in the UK at the household and individual levels. The study has a strong emphasis on domains of family and social ties, employment, education, financial resources, and health. Understanding Society is an annual survey of each adult member of a nationally representative sample. The same individuals are re-interviewed in each wave approximately 12 months apart. When individuals move, they are followed within the UK, and anyone joining their households is also interviewed as long as they are living with them. The fieldwork period for a single wave is 24 months. Data collection uses computer-assisted personal interviewing (CAPI) and web interviews (from wave 7) and includes a telephone mop-up. From March 2020 (the end of wave 10 and 2nd year of wave 11), due to the coronavirus pandemic, face-to-face interviews were suspended, and the survey has been conducted by web and telephone only but otherwise has continued as before. One person completes the household questionnaire. Each person aged 16 or older participates in the individual adult interview and self-completed questionnaire. Youths aged 10 to 15 are asked to respond to a paper self-completion questionnaire. In 2020, an additional frequent web survey was separately issued to sample members to capture data on the rapid changes in people’s lives due to the COVID-19 pandemic (see SN 8644). The COVID-19 Survey data are not included in this dataset.

    Further information may be found on the Understanding Society main stage webpage and links to publications based on the study can be found on the Understanding Society Latest Research webpage.

    Co-funders

    In addition to the Economic and Social Research Council, co-funders for the study included the Department of Work and Pensions, the Department for Education, the Department for Transport, the Department of Culture, Media and Sport, the Department for Community and Local Government, the Department of Health, the Scottish Government, the Welsh Assembly Government, the Northern Ireland Executive, the Department of Environment and Rural Affairs, and the Food Standards Agency.

    End User Licence and Special Licence versions:

    There are two versions of the Calendar Year 2021 data. One is available under the standard End User Licence (EUL) agreement, and the other is a Special Licence (SL) version. The SL version contains month and year of birth variables instead of just age, more detailed country and occupation coding for a number of variables and various income variables have not been top-coded (see xxxx_eul_vs_sl_variable_differences for more details). Users are advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements. The SL data have more restrictive access conditions; prospective users of the SL version will need to complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables in order to get permission to use that version. The main longitudinal versions of the Understanding Society study may be found under SNs 6614 (EUL) and 6931 (SL).

    Low- and Medium-level geographical identifiers produced for the mainstage longitudinal dataset can be used with this Calendar Year 2021 dataset, subject to SL access conditions. See the User Guide for further details.

    Suitable data analysis software

    These data are provided by the depositor in Stata format. Users are strongly advised to analyse them in Stata. Transfer to other formats may result in unforeseen issues. Stata SE or MP software is needed to analyse the larger files, which contain about 1,900 variables.

  11. U.S. State and Territorial Public Mask Mandates From April 10, 2020 through...

    • catalog.data.gov
    • healthdata.gov
    • +2more
    Updated Oct 1, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Centers for Disease Control and Prevention (2022). U.S. State and Territorial Public Mask Mandates From April 10, 2020 through July 20, 2021 by County by Day [Dataset]. https://catalog.data.gov/dataset/u-s-state-and-territorial-public-mask-mandates-from-april-10-2020-through-july-20-2021-by--7e5b8
    Explore at:
    Dataset updated
    Oct 1, 2022
    Dataset provided by
    Centers for Disease Control and Preventionhttp://www.cdc.gov/
    Area covered
    United States
    Description

    State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded using Microsoft Excel by one coder with one or more additional coders conducting quality assurance. Data were collected to determine when members of the public in states and territories were subject to state and territorial executive orders, administrative orders, resolutions, and proclamations for COVID-19 that require them to wear masks in public. “Members of the public” are defined as individuals operating in a personal capacity. “In public” is defined to mean either (1) anywhere outside the home or (2) both in retail businesses and in restaurants/food establishments. Data consists exclusively of state and territorial orders, many of which apply to specific counties within their respective state or territory; therefore, data is broken down to the county level. These data are derived from publicly available state and territorial executive orders, administrative orders, resolutions, and proclamations (“orders”) for COVID-19 that expressly require individuals to wear masks in public found by the CDC, COVID-19 Community Intervention & Critical Populations Task Force, Monitoring & Evaluation Team, Mitigation Policy Analysis Unit, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program, and Max Gakh, Assistant Professor, School of Public Health, University of Nevada, Las Vegas from April 10, 2020 through July 20, 2021. These data will be updated as new orders are collected. Any orders not available through publicly accessible websites are not included in these data. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are not included in these data. Effective and expiration dates were coded using only the dates provided; no distinction was made based on the specific time of the day the order became effective or expired. These data do not include data on counties that have opted out of their state mask mandate pursuant to state law. These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.

  12. m

    Web page phishing detection

    • data.mendeley.com
    • narcis.nl
    Updated Sep 26, 2020
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Abdelhakim Hannousse (2020). Web page phishing detection [Dataset]. http://doi.org/10.17632/c2gw7fy2j4.1
    Explore at:
    Dataset updated
    Sep 26, 2020
    Authors
    Abdelhakim Hannousse
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The provided dataset includes 11430 URLs with 87 extracted features. The dataset are designed to be used as a a benchmark for machine learning based phishing detection systems. Features are from three different classes: 56 extracted from the structure and syntax of URLs, 24 extracted from the content of their correspondent pages and 7 are extracetd by querying external services. The datatset is balanced, it containes exactly 50% phishing and 50% legitimate URLs. Associated to the dataset, we provide Python scripts used for the extraction of the features for potential replication or extension.

    dataset_A: contains a list a URLs together with their DOM tree objects that can be used for replication and experimenting new URL and content-based features overtaking short-time living of phishing web pages.

    dataset_B: containes the extracted feature values that can be used directly as inupt to classifiers for examination. Note that the data in this dataset are indexed with URLs so that one need to remove the index before experimentation.

    Datasets are constructed on May 2020. Due to huge size of dataset A, only a sample of the dataset is provided, I will try to divide into sample files and upload them one by one, for full copy, please contact directly the author at any time at: hannousse.abdelhakim@univ-guelma.dz

  13. d

    U.S. State and Territorial Orders Closing and Reopening Restaurants Issued...

    • catalog.data.gov
    • healthdata.gov
    • +5more
    Updated Sep 21, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Centers for Disease Control and Prevention (2022). U.S. State and Territorial Orders Closing and Reopening Restaurants Issued from March 11, 2020 through May 31, 2021 by County by Day [Dataset]. https://catalog.data.gov/dataset/u-s-state-and-territorial-orders-closing-and-reopening-restaurants-issued-from-march-11-20-5299c
    Explore at:
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Centers for Disease Control and Prevention
    Area covered
    United States
    Description

    State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded using Microsoft Excel by one coder with one or more additional coders conducting quality assurance. Data were collected to determine when restaurants in states and territories were subject to closing and reopening requirements through executive orders, administrative orders, resolutions, and proclamations for COVID-19. Data can be used to determine when restaurants in states and territories were subject to closing and reopening requirements through executive orders, administrative orders, resolutions, and proclamations for COVID-19. Data consists exclusively of state and territorial orders, many of which apply to specific counties within their respective state or territory; therefore, data is broken down to the county level. These data are derived from publicly available state and territorial executive orders, administrative orders, resolutions, and proclamations (“orders”) for COVID-19 that expressly close or reopen restaurants found by the CDC, COVID-19 Community Intervention & Critical Populations Task Force, Monitoring & Evaluation Team, Mitigation Policy Analysis Unit, and the CDC, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program from March 11, 2020 through May 31, 2021. These data will be updated as new orders are collected. Any orders not available through publicly accessible websites are not included in these data. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are not included in these data. Effective and expiration dates were coded using only the date provided; no distinction was made based on the specific time of the day the order became effective or expired. These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.

  14. d

    Annual Population Survey Three-Year Pooled Dataset, January 2018 - December...

    • b2find.dkrz.de
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Annual Population Survey Three-Year Pooled Dataset, January 2018 - December 2020 - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/736a3f2f-243e-5d8a-b345-969c59074c15
    Explore at:
    Description

    Abstract copyright UK Data Service and data collection copyright owner.The Annual Population Survey (APS) is a major survey series, which aims to provide data that can produce reliable estimates at the local authority level. Key topics covered in the survey include education, employment, health and ethnicity. The APS comprises key variables from the Labour Force Survey (LFS), all its associated LFS boosts and the APS boost. The APS aims to provide enhanced annual data for England, covering a target sample of at least 510 economically active persons for each Unitary Authority (UA)/Local Authority District (LAD) and at least 450 in each Greater London Borough. In combination with local LFS boost samples, the survey provides estimates for a range of indicators down to Local Education Authority (LEA) level across the United Kingdom.For further detailed information about methodology, users should consult the Labour Force Survey User Guide, included with the APS documentation. For variable and value labelling and coding frames that are not included either in the data or in the current APS documentation, users are advised to consult the latest versions of the LFS User Guides, which are available from the ONS Labour Force Survey - User Guidance webpages.Occupation data for 2021 and 2022The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. None of ONS' headline statistics, other than those directly sourced from occupational data, are affected and you can continue to rely on their accuracy. The affected datasets have now been updated. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022APS Well-Being DatasetsFrom 2012-2015, the ONS published separate APS datasets aimed at providing initial estimates of subjective well-being, based on the Integrated Household Survey. In 2015 these were discontinued. A separate set of well-being variables and a corresponding weighting variable have been added to the April-March APS person datasets from A11M12 onwards. Further information on the transition can be found in the Personal well-being in the UK: 2015 to 2016 article on the ONS website.APS disability variablesOver time, there have been some updates to disability variables in the APS. An article explaining the quality assurance investigations on these variables that have been conducted so far is available on the ONS Methodology webpage. End User Licence and Secure Access APS dataUsers should note that there are two versions of each APS dataset. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. The EUL version includes Government Office Region geography, banded age, 3-digit SOC and industry sector for main, second and last job. The Secure Access version contains more detailed variables relating to: age: single year of age, year and month of birth, age completed full-time education and age obtained highest qualification, age of oldest dependent child and age of youngest dependent child family unit and household: including a number of variables concerning the number of dependent children in the family according to their ages, relationship to head of household and relationship to head of family nationality and country of origin geography: including county, unitary/local authority, place of work, Nomenclature of Territorial Units for Statistics 2 (NUTS2) and NUTS3 regions, and whether lives and works in same local authority district health: including main health problem, and current and past health problems education and apprenticeship: including numbers and subjects of various qualifications and variables concerning apprenticeships industry: including industry, industry class and industry group for main, second and last job, and industry made redundant from occupation: including 4-digit Standard Occupational Classification (SOC) for main, second and last job and job made redundant from system variables: including week number when interview took place and number of households at address The Secure Access data have more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements. For the third edition (July 2022), the qualification variable QULNOW has been added to the data file. Main Topics:Topics covered include: household composition and relationships, housing tenure, nationality, ethnicity and residential history, employment and training (including government schemes), workplace and location, job hunting, educational background and qualifications. Many of the variables included in the survey are the same as those in the LFS. Multi-stage stratified random sample Face-to-face interview Telephone interview 2018 2020 ADULT EDUCATION AGE APPLICATION FOR EMP... APPOINTMENT TO JOB ATTITUDES BONUS PAYMENTS BUSINESSES CARE OF DEPENDANTS CHRONIC ILLNESS COHABITATION COMMUTING CONDITIONS OF EMPLO... DEBILITATIVE ILLNESS DEGREES Demography population ECONOMIC ACTIVITY EDUCATIONAL BACKGROUND EDUCATIONAL COURSES EMPLOYEES EMPLOYER SPONSORED ... EMPLOYMENT EMPLOYMENT HISTORY EMPLOYMENT PROGRAMMES ETHNIC GROUPS FAMILIES FAMILY BENEFITS FIELDS OF STUDY FULL TIME EMPLOYMENT FURNISHED ACCOMMODA... FURTHER EDUCATION GENDER HEADS OF HOUSEHOLD HEALTH HIGHER EDUCATION HOME OWNERSHIP HOURS OF WORK HOUSEHOLDS HOUSING HOUSING BENEFITS HOUSING TENURE INCOME INDUSTRIES JOB CHANGING JOB HUNTING JOB SEEKER S ALLOWANCE LANDLORDS Labour and employment MANAGERS MARITAL STATUS NATIONAL IDENTITY NATIONALITY OCCUPATIONS OVERTIME PART TIME COURSES PART TIME EMPLOYMENT PLACE OF BIRTH PLACE OF RESIDENCE PRIVATE SECTOR PUBLIC SECTOR QUALIFICATIONS RECRUITMENT REDUNDANCY REDUNDANCY PAY RELIGIOUS AFFILIATION RENTED ACCOMMODATION RESIDENTIAL MOBILITY SELF EMPLOYED SICK LEAVE SICKNESS AND DISABI... SOCIAL HOUSING SOCIAL SECURITY BEN... SOCIO ECONOMIC STATUS STATE RETIREMENT PE... STUDENTS SUBSIDIARY EMPLOYMENT SUPERVISORS SUPERVISORY STATUS TAX RELIEF TEMPORARY EMPLOYMENT TERMINATION OF SERVICE TIED HOUSING TRAINING TRAINING COURSES TRAVELLING TIME UNEMPLOYED UNEMPLOYMENT UNEMPLOYMENT BENEFITS UNFURNISHED ACCOMMO... UNWAGED WORKERS WAGES WELSH LANGUAGE WORKING CONDITIONS WORKPLACE vital statistics an...

  15. c

    Data from: Datasets used to train the Generative Adversarial Networks used...

    • opendata.cern.ch
    Updated 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ATLAS collaboration (2021). Datasets used to train the Generative Adversarial Networks used in ATLFast3 [Dataset]. http://doi.org/10.7483/OPENDATA.ATLAS.UXKX.TXBN
    Explore at:
    Dataset updated
    2021
    Dataset provided by
    CERN Open Data Portal
    Authors
    ATLAS collaboration
    Description

    Three datasets are available, each consisting of 15 csv files. Each file containing the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range (0.2-0.25) simulated in the ATLAS detector. Two datasets contain photons events with different statistics; the larger sample has about 10 times the number of events as the other. The other dataset contains pions. The pion dataset and the photon dataset with the lower statistics were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.

    The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped in voxels and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of the csv file defines both the particle and the energy of the sample used to create the file.

    The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.

    Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.

    Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.

  16. d

    U.S. State and Territorial Orders Closing and Reopening Bars Issued from...

    • catalog.data.gov
    • data.virginia.gov
    • +2more
    Updated Jan 31, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Centers for Disease Control and Prevention (2024). U.S. State and Territorial Orders Closing and Reopening Bars Issued from March 11, 2020 through May 31, 2021 by County by Day [Dataset]. https://catalog.data.gov/dataset/u-s-state-and-territorial-orders-closing-and-reopening-bars-issued-from-march-11-2020-thro-e2512
    Explore at:
    Dataset updated
    Jan 31, 2024
    Dataset provided by
    Centers for Disease Control and Prevention
    Area covered
    United States
    Description

    State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded using Microsoft Excel by one coder with one or more additional coders conducting quality assurance. Data were collected to determine when bars in states and territories were subject to closing and reopening requirements through executive orders, administrative orders, resolutions, and proclamations for COVID-19. Data can be used to determine when bars in states and territories were subject to closing and reopening requirements through executive orders, administrative orders, resolutions, and proclamations for COVID-19. Data consists exclusively of state and territorial orders, many of which apply to specific counties within their respective state or territory; therefore, data is broken down to the county level. These data are derived from publicly available state and territorial executive orders, administrative orders, resolutions, and proclamations (“orders”) for COVID-19 that expressly close or reopen bars found by the CDC, COVID-19 Community Intervention & Critical Populations Task Force, Monitoring & Evaluation Team, Mitigation Policy Analysis Unit, and the CDC, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program from March 11, 2020 through May 31, 2021. These data will be updated as new orders are collected. Any orders not available through publicly accessible websites are not included in these data. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are not included in these data. Effective and expiration dates were coded using only the date provided; no distinction was made based on the specific time of the day the order became effective or expired. These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.

  17. eCommerce events history in electronics store

    • kaggle.com
    Updated Mar 29, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Michael Kechinov (2021). eCommerce events history in electronics store [Dataset]. https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-electronics-store/code
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 29, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Michael Kechinov
    Description

    About

    This file contains behavior data for 5 months (Oct 2019 – Feb 2020) from a large electronics online store.

    Each row in the file represents an event. All events are related to products and users. Each event is like many-to-many relation between products and users.

    Data collected by Open CDP project. Feel free to use open source customer data platform.

    More datasets

    Checkout another datasets:

    1. https://www.kaggle.com/mkechinov/ecommerce-behavior-data-from-multi-category-store
    2. https://www.kaggle.com/mkechinov/ecommerce-purchase-history-from-electronics-store
    3. https://www.kaggle.com/mkechinov/ecommerce-events-history-in-cosmetics-shop
    4. https://www.kaggle.com/mkechinov/ecommerce-purchase-history-from-jewelry-store
    5. https://www.kaggle.com/mkechinov/ecommerce-events-history-in-electronics-store - you're reading it right now
    6. [NEW] https://www.kaggle.com/datasets/mkechinov/direct-messaging

    How to read it

    There are different types of events. See below.

    Semantics (or how to read it):

    User user_id during session user_session added to shopping cart (property event_type is equal cart) product product_id of brand brand of category category_code (category_code) with price price at event_time

    File structure

    PropertyDescription
    event_timeTime when event happened at (in UTC).
    event_typeOnly one kind of event: purchase.
    product_idID of a product
    category_idProduct's category ID
    category_codeProduct's category taxonomy (code name) if it was possible to make it. Usually present for meaningful categories and skipped for different kinds of accessories.
    brandDowncased string of brand name. Can be missed.
    priceFloat price of a product. Present.
    user_idPermanent user ID.
    ** user_session**Temporary user's session ID. Same for each user's session. Is changed every time user come back to online store from a long pause.

    Event types

    Events can be:

    • view - a user viewed a product
    • cart - a user added a product to shopping cart
    • remove_from_cart - a user removed a product from shopping cart
    • purchase - a user purchased a product

    Multiple purchases per session

    A session can have multiple purchase events. It's ok, because it's a single order.

    Many thanks

    Thanks to REES46 Marketing Platform for this dataset.

    Using datasets in your works, books, education materials

    You can use this dataset for free. Just mention the source of it: link to this page and link to REES46 Marketing Platform.

  18. d

    Johns Hopkins COVID-19 Case Tracker

    • data.world
    csv, zip
    Updated Mar 25, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Associated Press (2025). Johns Hopkins COVID-19 Case Tracker [Dataset]. https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker
    Explore at:
    zip, csvAvailable download formats
    Dataset updated
    Mar 25, 2025
    Authors
    The Associated Press
    Time period covered
    Jan 22, 2020 - Mar 9, 2023
    Area covered
    Description

    Updates

    • Notice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.

    • April 9, 2020

      • The population estimate data for New York County, NY has been updated to include all five New York City counties (Kings County, Queens County, Bronx County, Richmond County and New York County). This has been done to match the Johns Hopkins COVID-19 data, which aggregates counts for the five New York City counties to New York County.
    • April 20, 2020

      • Johns Hopkins death totals in the US now include confirmed and probable deaths in accordance with CDC guidelines as of April 14. One significant result of this change was an increase of more than 3,700 deaths in the New York City count. This change will likely result in increases for death counts elsewhere as well. The AP does not alter the Johns Hopkins source data, so probable deaths are included in this dataset as well.
    • April 29, 2020

      • The AP is now providing timeseries data for counts of COVID-19 cases and deaths. The raw counts are provided here unaltered, along with a population column with Census ACS-5 estimates and calculated daily case and death rates per 100,000 people. Please read the updated caveats section for more information.
    • September 1st, 2020

      • Johns Hopkins is now providing counts for the five New York City counties individually.
    • February 12, 2021

      • The Ohio Department of Health recently announced that as many as 4,000 COVID-19 deaths may have been underreported through the state’s reporting system, and that the "daily reported death counts will be high for a two to three-day period."
      • Because deaths data will be anomalous for consecutive days, we have chosen to freeze Ohio's rolling average for daily deaths at the last valid measure until Johns Hopkins is able to back-distribute the data. The raw daily death counts, as reported by Johns Hopkins and including the backlogged death data, will still be present in the new_deaths column.
    • February 16, 2021

      - Johns Hopkins has reconciled Ohio's historical deaths data with the state.

      Overview

    The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.

    The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.

    This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.

    The AP is updating this dataset hourly at 45 minutes past the hour.

    To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.

    Queries

    Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic

    Interactive

    The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.

    @(https://datawrapper.dwcdn.net/nRyaf/15/)

    Interactive Embed Code

    <iframe title="USA counties (2018) choropleth map Mapping COVID-19 cases by county" aria-describedby="" id="datawrapper-chart-nRyaf" src="https://datawrapper.dwcdn.net/nRyaf/10/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important;" height="400"></iframe><script type="text/javascript">(function() {'use strict';window.addEventListener('message', function(event) {if (typeof event.data['datawrapper-height'] !== 'undefined') {for (var chartId in event.data['datawrapper-height']) {var iframe = document.getElementById('datawrapper-chart-' + chartId) || document.querySelector("iframe[src*='" + chartId + "']");if (!iframe) {continue;}iframe.style.height = event.data['datawrapper-height'][chartId] + 'px';}}});})();</script>
    

    Caveats

    • This data represents the number of cases and deaths reported by each state and has been collected by Johns Hopkins from a number of sources cited on their website.
    • In some cases, deaths or cases of people who've crossed state lines -- either to receive treatment or because they became sick and couldn't return home while traveling -- are reported in a state they aren't currently in, because of state reporting rules.
    • In some states, there are a number of cases not assigned to a specific county -- for those cases, the county name is "unassigned to a single county"
    • This data should be credited to Johns Hopkins University's COVID-19 tracking project. The AP is simply making it available here for ease of use for reporters and members.
    • Caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.
    • Population estimates at the county level are drawn from 2014-18 5-year estimates from the American Community Survey.
    • The Urban/Rural classification scheme is from the Center for Disease Control and Preventions's National Center for Health Statistics. It puts each county into one of six categories -- from Large Central Metro to Non-Core -- according to population and other characteristics. More details about the classifications can be found here.

    Johns Hopkins timeseries data - Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count. - Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here

    Attribution

    This data should be credited to Johns Hopkins University COVID-19 tracking project

  19. d

    Elevation data for four sites in the coastal marsh at Grand Bay National...

    • catalog.data.gov
    • data.doi.gov
    Updated Jul 6, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2024). Elevation data for four sites in the coastal marsh at Grand Bay National Estuarine Research Reserve, Mississippi, from July 2018 through January 2020 [Dataset]. https://catalog.data.gov/dataset/elevation-data-for-four-sites-in-the-coastal-marsh-at-grand-bay-national-estuarine-researc
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Description

    To better understand sediment deposition in marsh environments, scientists from the U.S. Geological Survey, St. Petersburg Coastal and Marine Science Center (USGS-SPCMSC) selected four study sites (Sites 5, 6, 7, and 8) along the Point Aux Chenes Bay shoreline of the Grand Bay National Estuarine Research Reserve (GNDNERR), Mississippi. These datasets were collected to serve as baseline data prior to the installation of a living shoreline (a subtidal sill). Each site consisted of five plots located along a transect perpendicular to the marsh-estuary shoreline at 5-meter (m) increments (5, 10, 15, 20, and 25 m from the shoreline). Each plot contained six net sedimentation tiles (NST) that were secured flush to the marsh surface using polyvinyl chloride (PVC) pipe. NST are an inexpensive and simple tool to assess short- and long-term deposition that can be deployed in highly dynamic environments without the compaction associated with traditional coring methods. The NST were deployed for three month sampling periods, measuring sediment deposition from July 2018 to January 2020, with one set of NST being deployed for six months. Sediment deposited on the NST were processed to determine physical characteristics, such as deposition thickness, volume, wet weight/dry weight, grain size, and organic content (loss-on-ignition [LOI]). For select sampling periods, ancillary data (water level, elevation, and wave data) are also provided in this data release. Data were collected during USGS Field Activities Numbers (FAN) 2018-332-FA (18CCT01), 2018-358-FA (18CCT10), 2019-303-FA (19CCT01, 19CCT02, 19CCT03, and 19CCT04, respectively), and 2020-301-FA (20CCT01). Additional survey and data details are available from the U.S. Geological Survey Coastal and Marine Geoscience Data System (CMGDS) at, https://cmgds.marine.usgs.gov/. Data collected between 2016 and 2017 from a related NST study in the GNDNERR (Middle Bay and North Rigolets) can be found at https://doi.org/10.5066/P9BFR2US. Please read the full metadata for details on data collection, dataset variables, and data quality.

  20. d

    Website Analytics

    • catalog.data.gov
    • data.somervillema.gov
    Updated Feb 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    data.somervillema.gov (2025). Website Analytics [Dataset]. https://catalog.data.gov/dataset/somerville-analytics
    Explore at:
    Dataset updated
    Feb 7, 2025
    Dataset provided by
    data.somervillema.gov
    Description

    Contains view count data for the top 20 pages each day on the Somerville MA city website dating back to 2020. Data is used in the City's dashboard which can be found at https://www.somervilledata.farm/.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Statista (2024). Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
Organization logo

Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028

Explore at:
Dataset updated
Nov 21, 2024
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
May 2024
Area covered
Worldwide
Description

The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 149 zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than 394 zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often. Storage capacity also growing Only a small percentage of this newly created data is kept though, as just two percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.

Search
Clear search
Close search
Google apps
Main menu