89 datasets found

Amount of data created, consumed, and stored 2010-2023, with forecasts to...
statista.com
Updated Nov 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2024). Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
Explore at:
Dataset updated
Nov 21, 2024
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
May 2024
Area covered
Worldwide
Description
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 149 zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than 394 zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often. Storage capacity also growing Only a small percentage of this newly created data is kept though, as just two percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.
Internet users in the United States 2020-2029
statista.com
Updated Dec 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2024). Internet users in the United States 2020-2029 [Dataset]. https://www.statista.com/statistics/325645/usa-number-of-internet-users/
Explore at:
Dataset updated
Dec 12, 2024
Dataset authored and provided by
Statistahttp://statista.com/
Area covered
United States
Description
The number of internet users in the United States was forecast to continuously increase between 2024 and 2029 by in total 13.5 million users (+4.16 percent). After the ninth consecutive increasing year, the number of users is estimated to reach 337.67 million users and therefore a new peak in 2029. Notably, the number of internet users of was continuously increasing over the past years.Depicted is the estimated number of individuals in the country or region at hand, that use the internet. As the datasource clarifies, connection quality and usage frequency are distinct aspects, not taken into account here.The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).
NYC STEW-MAP Staten Island organizations' website hyperlink webscrape
catalog.data.gov
gimi9.com
Updated Nov 21, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2022). NYC STEW-MAP Staten Island organizations' website hyperlink webscrape [Dataset]. https://catalog.data.gov/dataset/nyc-stew-map-staten-island-organizations-website-hyperlink-webscrape
Explore at:
Dataset updated
Nov 21, 2022
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Area covered
Staten Island, New York
Description
The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis. For dataset 1: Organizations were selected from from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites and that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020). For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See "READ ME" file for further details. References: R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3. Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1. USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/. This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
s
Statistical Region Data Collection 2020 - Datasets - This service has been...
store.smartdatahub.io
Updated Nov 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Statistical Region Data Collection 2020 - Datasets - This service has been deprecated - please visit https://www.smartdatahub.io/ to access data. See the About page for details. // [Dataset]. https://store.smartdatahub.io/dataset/fi_tilastokeskus_tilastointialueet_maakunta1000k_2020
Explore at:
Dataset updated
Nov 11, 2024
Description
This dataset collection contains tables sourced from the Statistics Finland (Tilastokeskus) website. The tables in the collection encompass a range of related data with respect to statistical areas in Finland. The source describes the data as 'Statistical Service Interface (WFS)', indicating that it pertains to statistical information accessed through a web feature service. The data is organized in an easy-to-understand table format with distinct rows and columns, making it simple to interpret and analyze. This dataset is licensed under CC BY 4.0 (Creative Commons Attribution 4.0, https://creativecommons.org/licenses/by/4.0/deed.fi).
s
Statistics Bureau Web Service Interface (WFS) Dataset Collection 2020 -...
store.smartdatahub.io
Updated Nov 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Statistics Bureau Web Service Interface (WFS) Dataset Collection 2020 - Datasets - This service has been deprecated - please visit https://www.smartdatahub.io/ to access data. See the About page for details. // [Dataset]. https://store.smartdatahub.io/dataset/fi_tilastokeskus_tilastointialueet_seutukunta4500k_2020
Explore at:
Dataset updated
Nov 11, 2024
Description
This dataset collection comprises a series of related data tables sourced from the website of 'Tilastokeskus' (Statistics Finland), based in Finland. The tables within this collection contain data retrieved from the Statistics Finland's service interface (WFS). The content of the tables is organized in a structured format with rows and columns, showcasing a correlation between different sets of data. The collection, while primarily intended for statistical analysis, can be utilized in a variety of ways, depending on the specific needs of the user. This dataset is licensed under CC BY 4.0 (Creative Commons Attribution 4.0, https://creativecommons.org/licenses/by/4.0/deed.fi).
Web Data Commons (December 2020) Property and Datatype Usage Dataset
zenodo.org
application/gzip
Updated Mar 25, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jan Martin Keil; Jan Martin Keil; Merle Gänßinger; Merle Gänßinger (2022). Web Data Commons (December 2020) Property and Datatype Usage Dataset [Dataset]. http://doi.org/10.5281/zenodo.6205111
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6205111
Dataset updated
Mar 25, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jan Martin Keil; Jan Martin Keil; Merle Gänßinger; Merle Gänßinger
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (December 2020) based on the Common Crawl September 2020 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v1.0.0, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.

Dataset Properties

Size: 0.6 GiB compressed, 11.0 GiB uncompressed, 53 641 457 rows plus 1 head line determined using gunzip -c measurements.csv.gz | wc -l

Parsing Failures: The scanner failed to parse 45 307 472 triples (~0.1 %) of the source dataset (containing 37 971 812 425 triples). The main reasons for failures were malformed IRIs and illegal character encodings.

Content:

CATEGORY: The category (html-embedded-jsonld or html-rdfa) of the Web Data Commons file that has been measured.

FILE_URL: The URL of the Web Data Commons file that has been measured.

MEASUREMENT: The applied measurement with specific conditions, one of:

UnpreciseRepresentableInDouble: The number of lexicals that are in the lexical space but not in the value space of xsd:double.

UnpreciseRepresentableInFloat: The number of lexicals that are in the lexical space but not in the value space of xsd:float.

UsedAsDatatype: The total number of literals with the datatype.

UsedAsPropertyRange: The number of statements that specify the datatype as range of the property.

ValidDateNotation: The number of lexicals that are in the lexical space of xsd:date.

ValidDateTimeNotation: The number of lexicals that are in the lexical space of xsd:dateTime.

ValidDecimalNotation: The number of lexicals that represent a number with decimal notation and whose lexical representation is thereby in the lexical space of xsd:decimal, xsd:float, and xsd:double.

ValidExponentialNotation: The number of lexicals that represent a number with exponential notation and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.

ValidInfOrNaNNotation: The number of lexicals that equals either INF, +INF, -INF or NaN and whose lexical representation is thereby in the lexical space of xsd:float, and xsd:double.

ValidIntegerNotation: The number of lexicals that represent an integer number and whose lexical representation is thereby in the lexical space of xsd:integer, xsd:decimal, xsd:float, and xsd:double.

ValidTimeNotation: The number of lexicals that are in the lexical space of xsd:time.

ValidTrueOrFalseNotation: The number of lexicals that equal either true or false and whose lexical representation is thereby in the lexical space of xsd:boolean.

ValidZeroOrOneNotation: The number of lexicals that equal either 0 or 1 and whose lexical representation is thereby in the lexical space of xsd:boolean, and xsd:integer, xsd:decimal, xsd:float, and xsd:double.

Note: Lexical representation of xsd:double values in embedded JSON-LD got normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful by drawing conclusions from according Valid… and Unprecise… measures.

PROPERTY: The property that has been measured.

DATATYPE: The datatype that has been measured.

QUANTITY: The count of statements that fulfill the condition specified by the measurement per file, property and datatype.

Preview

"CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY" "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://schema.org/height","http://www.w3.org/2001/XMLSchema#string","11" "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://www.w3.org/ns/csvw#value","http://www.w3.org/2001/XMLSchema#string","27" "html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://schema.org/saturatedFatContent","http://www.w3.org/2001/XMLSchema#string","1" … "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://purl.org/goodrelations/v1#hasMaxValue","http://www.w3.org/2001/XMLSchema#float","2" "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://purl.org/goodrelations/v1#hasMinValue","http://www.w3.org/2001/XMLSchema#float","4" "html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://rdfs.org/sioc/ns#num_replies","http://www.w3.org/2001/XMLSchema#integer","1062"

Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of probably "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions in the original source website.
CT-FAN-21 corpus: A dataset for Fake News Detection
zenodo.org
Updated Oct 23, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.4714517
Dataset updated
Oct 23, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl; Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
Description
Data Access: The data in the research collection provided may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use it only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com .

Citation

Please cite our work as

@article{shahi2021overview, title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection}, author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas}, journal={Working Notes of CLEF}, year={2021} }

Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

Subtask 3A: Multi-class fake news detection of news articles (English) Sub-task A would detect fake news designed as a four-class classification problem. The training data will be released in batches and roughly about 900 articles with the respective label. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

False - The main claim made in an article is untrue.

Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

True - This rating indicates that the primary elements of the main claim are demonstrably true.

Other- An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.

Subtask 3B: Topical Domain Classification of News Articles (English) Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article (English). This is a classification problem. The task is to categorise fake news articles into six topical categories like health, election, crime, climate, election, education. This task will be offered for a subset of the data of Subtask 3A.

Input Data

The data will be provided in the format of Id, title, text, rating, the domain; the description of the columns is as follows:

Task 3a

ID- Unique identifier of the news article

Title- Title of the news article

text- Text mentioned inside the news article

our rating - class of the news article as false, partially false, true, other

Task 3b

public_id- Unique identifier of the news article

Title- Title of the news article

text- Text mentioned inside the news article

domain - domain of the given news article(applicable only for task B)

Output data format

Task 3a

public_id- Unique identifier of the news article

predicted_rating- predicted class

Sample File

public_id, predicted_rating 1, false 2, true

Task 3b

public_id- Unique identifier of the news article

predicted_domain- predicted domain

Sample file

public_id, predicted_domain 1, health 2, crime

Additional data for Training

To train your model, the participant can use additional data with a similar format; some datasets are available over the web. We don't provide the background truth for those datasets. For testing, we will not use any articles from other datasets. Some of the possible source:

Fakenews Classification Datasets

Fake News Detection Challenge KDD 2020

FakeNewsNet

IMPORTANT!

Fake news article used for task 3b is a subset of task 3a.

We have used the data from 2010 to 2021, and the content of fake news is mixed up with several topics like election, COVID-19 etc.

Evaluation Metrics

This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.

Submission Link: https://competitions.codalab.org/competitions/31238

Related Work

Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1.https://arxiv.org/pdf/2010.00502.pdf

G. K. Shahi and D. Nandini, “FakeCovid – a multilingualcross-domain fact check news dataset for covid-19,” inWorkshop Proceedings of the 14th International AAAIConference on Web and Social Media, 2020. http://workshop-proceedings.icwsm.org/abstract?id=2020_14

Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
d
U.S. State and Territorial Stay-At-Home Orders: March 15, 2020 – May 31,...
catalog.data.gov
data.virginia.gov
+2more
Updated Sep 11, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Centers for Disease Control and Prevention (2022). U.S. State and Territorial Stay-At-Home Orders: March 15, 2020 – May 31, 2021 by County by Day [Dataset]. https://catalog.data.gov/dataset/u-s-state-and-territorial-stay-at-home-orders-march-15-2020-may-31-2021-by-county-by-day-6013e
Explore at:
Dataset updated
Sep 11, 2022
Dataset provided by
Centers for Disease Control and Prevention
Area covered
United States
Description
State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded using Microsoft Excel by one coder with one or more additional coders conducting quality assurance. Data were collected to determine when individuals in states and territories were subject to executive orders, administrative orders, resolutions, and proclamations for COVID-19 that require or recommend people stay in their homes. Data consists exclusively of state and territorial orders, many of which apply to specific counties within their respective state or territory; therefore, data is broken down to the county level. These data are derived from the publicly available state and territorial executive orders, administrative orders, resolutions, and proclamations (“orders”) for COVID-19 that expressly require or recommend individuals stay at home found by the CDC, COVID-19 Community Intervention and At-Risk Task Force, Monitoring and Evaluation Team & CDC, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program from March 15, 2020 through May 31, 2021. These data will be updated as new orders are collected. Any orders not available through publicly accessible websites are not included in these data. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are not included in these data. These data do not include mandatory business closures, curfews, or limitations on public or private gatherings. These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.

Data from: WikiReddit: Tracing Information and Attention Flows Between...

zenodo.org

bin

Updated Feb 10, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265

Explore at:

binAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.14653265

Dataset updated

Feb 10, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Patrick Gildersleve; Patrick Gildersleve; Anna Beers; Anna Beers; Viviane Ito; Viviane Ito; Agustin Orozco; Agustin Orozco; Francesca Tripodi; Francesca Tripodi

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Jan 15, 2025

Description

Preprint

Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942

Abstract

The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

Datasheet

Motivation

The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

Composition

WikiReddit, a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

Collection Process

Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.

Preprocessing/cleaning/labeling

Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.

Uses

We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

Distribution

The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

Maintenance

Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.

SQL Database Schema

Table: `posts`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`crosspost_parent_id`	TEXT	The ID of the original Reddit post if this post is a crosspost.
`post_id`	TEXT	Unique identifier for the Reddit post.
`created_at`	TIMESTAMP	The timestamp when the post was created.
`updated_at`	TIMESTAMP	The timestamp when the post was last updated.
`language_code`	TEXT	The language code of the post.
`score`	INTEGER	The score (upvotes minus downvotes) of the post.
`upvote_ratio`	REAL	The ratio of upvotes to total votes.
`gildings`	INTEGER	Number of awards (gildings) received by the post.
`num_comments`	INTEGER	Number of comments on the post.

Table: `comments`

Column Name	Type	Description
`subreddit_id`	TEXT	The unique identifier for the subreddit.
`post_id`	TEXT	The ID of the Reddit post the comment belongs to.
`parent_id`	TEXT	The ID of the parent comment (if a reply).
`comment_id`	TEXT	Unique identifier for the comment.
`created_at`	TIMESTAMP	The timestamp when the comment was created.
`last_modified_at`	TIMESTAMP	The timestamp when the comment was last modified.
`score`	INTEGER	The score (upvotes minus downvotes) of the comment.
`upvote_ratio`	REAL	The ratio of upvotes to total votes for the comment.
`gilded`	INTEGER	Number of awards (gildings) received by the comment.

Table: `postlinks`

Column Name	Type	Description
`post_id`	TEXT	Unique identifier for the Reddit post.
`end_processed_valid`	INTEGER	Whether the extracted URL from the post resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the Reddit post.
`final_valid`	INTEGER	Whether the final URL from the post resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final URL.
`final_url`	TEXT	The final URL after redirections.
`redirected`	INTEGER	Indicator of whether the posted URL was redirected (1) or not (0).
`in_title`	INTEGER	Indicator of whether the link appears in the post title (1) or post body (0).

Table: `commentlinks`

Column Name	Type	Description
`comment_id`	TEXT	Unique identifier for the Reddit comment.
`end_processed_valid`	INTEGER	Whether the extracted URL from the comment resolves to a valid URL.
`end_processed_url`	TEXT	The extracted URL from the comment.
`final_valid`	INTEGER	Whether the final URL from the comment resolves to a valid URL after redirections.
`final_status`	INTEGER	HTTP status code of the final URL.
`final_url`	TEXT	The final URL after

Understanding Society: Calendar Year Dataset, 2021: Special Licence Access
beta.ukdataservice.ac.uk
datacatalogue.cessda.eu
Updated 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Institute For Social University Of Essex (2024). Understanding Society: Calendar Year Dataset, 2021: Special Licence Access [Dataset]. http://doi.org/10.5255/ukda-sn-9194-1
Explore at:
Unique identifier
https://doi.org/10.5255/ukda-sn-9194-1
Dataset updated
2024
Dataset provided by
UK Data Servicehttps://ukdataservice.ac.uk/
datacite
Authors
Institute For Social University Of Essex
Description
Understanding Society (the UK Household Longitudinal Study), which began in 2009, is conducted by the Institute for Social and Economic Research (ISER) at the University of Essex, and the survey research organisations Verian Group (formerly Kantar Public) and NatCen. It builds on and incorporates, the British Household Panel Survey (BHPS), which began in 1991.
The Understanding Society: Calendar Year Dataset, 2021, is designed to enable cross-sectional analysis of individuals and households relating specifically to their annual interviews conducted in the year 2021, and, therefore, combine data collected in three waves (Waves 11, 12 and 13). It has been produced from the same data collected in the main Understanding Society study and released in the longitudinal datasets SN 6614 (End User Licence) and SN 6931 (Special Licence). Such cross-sectional analysis can, however, only involve variables that are collected in every wave in order to have data for the full sample panel. The 2021 dataset is the second of a series of planned Calendar Year Datasets to facilitate cross-sectional analysis of specific years. Full details of the Calendar Year Dataset sample structure (including why some individual interviews from 2022 are included), data structure and additional supporting information can be found in the document '9194_calendar_year_dataset_2020_user_guide'.

As multi-topic studies, the purpose of Understanding Society is to understand the short- and long-term effects of social and economic change in the UK at the household and individual levels. The study has a strong emphasis on domains of family and social ties, employment, education, financial resources, and health. Understanding Society is an annual survey of each adult member of a nationally representative sample. The same individuals are re-interviewed in each wave approximately 12 months apart. When individuals move, they are followed within the UK, and anyone joining their households is also interviewed as long as they are living with them. The fieldwork period for a single wave is 24 months. Data collection uses computer-assisted personal interviewing (CAPI) and web interviews (from wave 7) and includes a telephone mop-up. From March 2020 (the end of wave 10 and 2nd year of wave 11), due to the coronavirus pandemic, face-to-face interviews were suspended, and the survey has been conducted by web and telephone only but otherwise has continued as before. One person completes the household questionnaire. Each person aged 16 or older participates in the individual adult interview and self-completed questionnaire. Youths aged 10 to 15 are asked to respond to a paper self-completion questionnaire. In 2020, an additional frequent web survey was separately issued to sample members to capture data on the rapid changes in people’s lives due to the COVID-19 pandemic (see SN 8644). The COVID-19 Survey data are not included in this dataset.

Further information may be found on the Understanding Society main stage webpage and links to publications based on the study can be found on the Understanding Society Latest Research webpage.

Co-funders

In addition to the Economic and Social Research Council, co-funders for the study included the Department of Work and Pensions, the Department for Education, the Department for Transport, the Department of Culture, Media and Sport, the Department for Community and Local Government, the Department of Health, the Scottish Government, the Welsh Assembly Government, the Northern Ireland Executive, the Department of Environment and Rural Affairs, and the Food Standards Agency.

End User Licence and Special Licence versions:

There are two versions of the Calendar Year 2021 data. One is available under the standard End User Licence (EUL) agreement, and the other is a Special Licence (SL) version. The SL version contains month and year of birth variables instead of just age, more detailed country and occupation coding for a number of variables and various income variables have not been top-coded (see xxxx_eul_vs_sl_variable_differences for more details). Users are advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements. The SL data have more restrictive access conditions; prospective users of the SL version will need to complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables in order to get permission to use that version. The main longitudinal versions of the Understanding Society study may be found under SNs 6614 (EUL) and 6931 (SL).

Low- and Medium-level geographical identifiers produced for the mainstage longitudinal dataset can be used with this Calendar Year 2021 dataset, subject to SL access conditions. See the User Guide for further details.

Suitable data analysis software

These data are provided by the depositor in Stata format. Users are strongly advised to analyse them in Stata. Transfer to other formats may result in unforeseen issues. Stata SE or MP software is needed to analyse the larger files, which contain about 1,900 variables.
U.S. State and Territorial Public Mask Mandates From April 10, 2020 through...
catalog.data.gov
healthdata.gov
+2more
Updated Oct 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Centers for Disease Control and Prevention (2022). U.S. State and Territorial Public Mask Mandates From April 10, 2020 through July 20, 2021 by County by Day [Dataset]. https://catalog.data.gov/dataset/u-s-state-and-territorial-public-mask-mandates-from-april-10-2020-through-july-20-2021-by--7e5b8
Explore at:
Dataset updated
Oct 1, 2022
Dataset provided by
Centers for Disease Control and Preventionhttp://www.cdc.gov/
Area covered
United States
Description
State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded using Microsoft Excel by one coder with one or more additional coders conducting quality assurance. Data were collected to determine when members of the public in states and territories were subject to state and territorial executive orders, administrative orders, resolutions, and proclamations for COVID-19 that require them to wear masks in public. “Members of the public” are defined as individuals operating in a personal capacity. “In public” is defined to mean either (1) anywhere outside the home or (2) both in retail businesses and in restaurants/food establishments. Data consists exclusively of state and territorial orders, many of which apply to specific counties within their respective state or territory; therefore, data is broken down to the county level. These data are derived from publicly available state and territorial executive orders, administrative orders, resolutions, and proclamations (“orders”) for COVID-19 that expressly require individuals to wear masks in public found by the CDC, COVID-19 Community Intervention & Critical Populations Task Force, Monitoring & Evaluation Team, Mitigation Policy Analysis Unit, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program, and Max Gakh, Assistant Professor, School of Public Health, University of Nevada, Las Vegas from April 10, 2020 through July 20, 2021. These data will be updated as new orders are collected. Any orders not available through publicly accessible websites are not included in these data. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are not included in these data. Effective and expiration dates were coded using only the dates provided; no distinction was made based on the specific time of the day the order became effective or expired. These data do not include data on counties that have opted out of their state mask mandate pursuant to state law. These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.
m
Web page phishing detection
data.mendeley.com
narcis.nl
Updated Sep 26, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abdelhakim Hannousse (2020). Web page phishing detection [Dataset]. http://doi.org/10.17632/c2gw7fy2j4.1
Explore at:
Unique identifier
https://doi.org/10.17632/c2gw7fy2j4.1
Dataset updated
Sep 26, 2020
Authors
Abdelhakim Hannousse
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The provided dataset includes 11430 URLs with 87 extracted features. The dataset are designed to be used as a a benchmark for machine learning based phishing detection systems. Features are from three different classes: 56 extracted from the structure and syntax of URLs, 24 extracted from the content of their correspondent pages and 7 are extracetd by querying external services. The datatset is balanced, it containes exactly 50% phishing and 50% legitimate URLs. Associated to the dataset, we provide Python scripts used for the extraction of the features for potential replication or extension.

dataset_A: contains a list a URLs together with their DOM tree objects that can be used for replication and experimenting new URL and content-based features overtaking short-time living of phishing web pages.

dataset_B: containes the extracted feature values that can be used directly as inupt to classifiers for examination. Note that the data in this dataset are indexed with URLs so that one need to remove the index before experimentation.

Datasets are constructed on May 2020. Due to huge size of dataset A, only a sample of the dataset is provided, I will try to divide into sample files and upload them one by one, for full copy, please contact directly the author at any time at: hannousse.abdelhakim@univ-guelma.dz
d
U.S. State and Territorial Orders Closing and Reopening Restaurants Issued...
catalog.data.gov
healthdata.gov
+5more
Updated Sep 21, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Centers for Disease Control and Prevention (2022). U.S. State and Territorial Orders Closing and Reopening Restaurants Issued from March 11, 2020 through May 31, 2021 by County by Day [Dataset]. https://catalog.data.gov/dataset/u-s-state-and-territorial-orders-closing-and-reopening-restaurants-issued-from-march-11-20-5299c
Explore at:
Dataset updated
Sep 21, 2022
Dataset provided by
Centers for Disease Control and Prevention
Area covered
United States
Description
State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded using Microsoft Excel by one coder with one or more additional coders conducting quality assurance. Data were collected to determine when restaurants in states and territories were subject to closing and reopening requirements through executive orders, administrative orders, resolutions, and proclamations for COVID-19. Data can be used to determine when restaurants in states and territories were subject to closing and reopening requirements through executive orders, administrative orders, resolutions, and proclamations for COVID-19. Data consists exclusively of state and territorial orders, many of which apply to specific counties within their respective state or territory; therefore, data is broken down to the county level. These data are derived from publicly available state and territorial executive orders, administrative orders, resolutions, and proclamations (“orders”) for COVID-19 that expressly close or reopen restaurants found by the CDC, COVID-19 Community Intervention & Critical Populations Task Force, Monitoring & Evaluation Team, Mitigation Policy Analysis Unit, and the CDC, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program from March 11, 2020 through May 31, 2021. These data will be updated as new orders are collected. Any orders not available through publicly accessible websites are not included in these data. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are not included in these data. Effective and expiration dates were coded using only the date provided; no distinction was made based on the specific time of the day the order became effective or expired. These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.
d
Annual Population Survey Three-Year Pooled Dataset, January 2018 - December...
b2find.dkrz.de
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Annual Population Survey Three-Year Pooled Dataset, January 2018 - December 2020 - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/736a3f2f-243e-5d8a-b345-969c59074c15
Explore at:
Description
Abstract copyright UK Data Service and data collection copyright owner.The Annual Population Survey (APS) is a major survey series, which aims to provide data that can produce reliable estimates at the local authority level. Key topics covered in the survey include education, employment, health and ethnicity. The APS comprises key variables from the Labour Force Survey (LFS), all its associated LFS boosts and the APS boost. The APS aims to provide enhanced annual data for England, covering a target sample of at least 510 economically active persons for each Unitary Authority (UA)/Local Authority District (LAD) and at least 450 in each Greater London Borough. In combination with local LFS boost samples, the survey provides estimates for a range of indicators down to Local Education Authority (LEA) level across the United Kingdom.For further detailed information about methodology, users should consult the Labour Force Survey User Guide, included with the APS documentation. For variable and value labelling and coding frames that are not included either in the data or in the current APS documentation, users are advised to consult the latest versions of the LFS User Guides, which are available from the ONS Labour Force Survey - User Guidance webpages.Occupation data for 2021 and 2022The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. None of ONS' headline statistics, other than those directly sourced from occupational data, are affected and you can continue to rely on their accuracy. The affected datasets have now been updated. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022APS Well-Being DatasetsFrom 2012-2015, the ONS published separate APS datasets aimed at providing initial estimates of subjective well-being, based on the Integrated Household Survey. In 2015 these were discontinued. A separate set of well-being variables and a corresponding weighting variable have been added to the April-March APS person datasets from A11M12 onwards. Further information on the transition can be found in the Personal well-being in the UK: 2015 to 2016 article on the ONS website.APS disability variablesOver time, there have been some updates to disability variables in the APS. An article explaining the quality assurance investigations on these variables that have been conducted so far is available on the ONS Methodology webpage. End User Licence and Secure Access APS dataUsers should note that there are two versions of each APS dataset. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. The EUL version includes Government Office Region geography, banded age, 3-digit SOC and industry sector for main, second and last job. The Secure Access version contains more detailed variables relating to: age: single year of age, year and month of birth, age completed full-time education and age obtained highest qualification, age of oldest dependent child and age of youngest dependent child family unit and household: including a number of variables concerning the number of dependent children in the family according to their ages, relationship to head of household and relationship to head of family nationality and country of origin geography: including county, unitary/local authority, place of work, Nomenclature of Territorial Units for Statistics 2 (NUTS2) and NUTS3 regions, and whether lives and works in same local authority district health: including main health problem, and current and past health problems education and apprenticeship: including numbers and subjects of various qualifications and variables concerning apprenticeships industry: including industry, industry class and industry group for main, second and last job, and industry made redundant from occupation: including 4-digit Standard Occupational Classification (SOC) for main, second and last job and job made redundant from system variables: including week number when interview took place and number of households at address The Secure Access data have more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements. For the third edition (July 2022), the qualification variable QULNOW has been added to the data file. Main Topics:Topics covered include: household composition and relationships, housing tenure, nationality, ethnicity and residential history, employment and training (including government schemes), workplace and location, job hunting, educational background and qualifications. Many of the variables included in the survey are the same as those in the LFS. Multi-stage stratified random sample Face-to-face interview Telephone interview 2018 2020 ADULT EDUCATION AGE APPLICATION FOR EMP... APPOINTMENT TO JOB ATTITUDES BONUS PAYMENTS BUSINESSES CARE OF DEPENDANTS CHRONIC ILLNESS COHABITATION COMMUTING CONDITIONS OF EMPLO... DEBILITATIVE ILLNESS DEGREES Demography population ECONOMIC ACTIVITY EDUCATIONAL BACKGROUND EDUCATIONAL COURSES EMPLOYEES EMPLOYER SPONSORED ... EMPLOYMENT EMPLOYMENT HISTORY EMPLOYMENT PROGRAMMES ETHNIC GROUPS FAMILIES FAMILY BENEFITS FIELDS OF STUDY FULL TIME EMPLOYMENT FURNISHED ACCOMMODA... FURTHER EDUCATION GENDER HEADS OF HOUSEHOLD HEALTH HIGHER EDUCATION HOME OWNERSHIP HOURS OF WORK HOUSEHOLDS HOUSING HOUSING BENEFITS HOUSING TENURE INCOME INDUSTRIES JOB CHANGING JOB HUNTING JOB SEEKER S ALLOWANCE LANDLORDS Labour and employment MANAGERS MARITAL STATUS NATIONAL IDENTITY NATIONALITY OCCUPATIONS OVERTIME PART TIME COURSES PART TIME EMPLOYMENT PLACE OF BIRTH PLACE OF RESIDENCE PRIVATE SECTOR PUBLIC SECTOR QUALIFICATIONS RECRUITMENT REDUNDANCY REDUNDANCY PAY RELIGIOUS AFFILIATION RENTED ACCOMMODATION RESIDENTIAL MOBILITY SELF EMPLOYED SICK LEAVE SICKNESS AND DISABI... SOCIAL HOUSING SOCIAL SECURITY BEN... SOCIO ECONOMIC STATUS STATE RETIREMENT PE... STUDENTS SUBSIDIARY EMPLOYMENT SUPERVISORS SUPERVISORY STATUS TAX RELIEF TEMPORARY EMPLOYMENT TERMINATION OF SERVICE TIED HOUSING TRAINING TRAINING COURSES TRAVELLING TIME UNEMPLOYED UNEMPLOYMENT UNEMPLOYMENT BENEFITS UNFURNISHED ACCOMMO... UNWAGED WORKERS WAGES WELSH LANGUAGE WORKING CONDITIONS WORKPLACE vital statistics an...
c
Data from: Datasets used to train the Generative Adversarial Networks used...
opendata.cern.ch
Updated 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ATLAS collaboration (2021). Datasets used to train the Generative Adversarial Networks used in ATLFast3 [Dataset]. http://doi.org/10.7483/OPENDATA.ATLAS.UXKX.TXBN
Explore at:
Unique identifier
https://doi.org/10.7483/OPENDATA.ATLAS.UXKX.TXBN
Dataset updated
2021
Dataset provided by
CERN Open Data Portal
Authors
ATLAS collaboration
Description
Three datasets are available, each consisting of 15 csv files. Each file containing the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range (0.2-0.25) simulated in the ATLAS detector. Two datasets contain photons events with different statistics; the larger sample has about 10 times the number of events as the other. The other dataset contains pions. The pion dataset and the photon dataset with the lower statistics were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.

The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped in voxels and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of the csv file defines both the particle and the energy of the sample used to create the file.

The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.
Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.
Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.
d
U.S. State and Territorial Orders Closing and Reopening Bars Issued from...
catalog.data.gov
data.virginia.gov
+2more
Updated Jan 31, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Centers for Disease Control and Prevention (2024). U.S. State and Territorial Orders Closing and Reopening Bars Issued from March 11, 2020 through May 31, 2021 by County by Day [Dataset]. https://catalog.data.gov/dataset/u-s-state-and-territorial-orders-closing-and-reopening-bars-issued-from-march-11-2020-thro-e2512
Explore at:
Dataset updated
Jan 31, 2024
Dataset provided by
Centers for Disease Control and Prevention
Area covered
United States
Description
State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded using Microsoft Excel by one coder with one or more additional coders conducting quality assurance. Data were collected to determine when bars in states and territories were subject to closing and reopening requirements through executive orders, administrative orders, resolutions, and proclamations for COVID-19. Data can be used to determine when bars in states and territories were subject to closing and reopening requirements through executive orders, administrative orders, resolutions, and proclamations for COVID-19. Data consists exclusively of state and territorial orders, many of which apply to specific counties within their respective state or territory; therefore, data is broken down to the county level. These data are derived from publicly available state and territorial executive orders, administrative orders, resolutions, and proclamations (“orders”) for COVID-19 that expressly close or reopen bars found by the CDC, COVID-19 Community Intervention & Critical Populations Task Force, Monitoring & Evaluation Team, Mitigation Policy Analysis Unit, and the CDC, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program from March 11, 2020 through May 31, 2021. These data will be updated as new orders are collected. Any orders not available through publicly accessible websites are not included in these data. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are not included in these data. Effective and expiration dates were coded using only the date provided; no distinction was made based on the specific time of the day the order became effective or expired. These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.

eCommerce events history in electronics store

kaggle.com

Updated Mar 29, 2021

Facebook

Twitter

Click to copy link

Link copied

Cite

Michael Kechinov (2021). eCommerce events history in electronics store [Dataset]. https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-electronics-store/code

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Mar 29, 2021

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Michael Kechinov

Description

About

This file contains behavior data for 5 months (Oct 2019 – Feb 2020) from a large electronics online store.

Each row in the file represents an event. All events are related to products and users. Each event is like many-to-many relation between products and users.

Data collected by Open CDP project. Feel free to use open source customer data platform.

More datasets

Checkout another datasets:

How to read it

There are different types of events. See below.

Semantics (or how to read it):

User user_id during session user_session added to shopping cart (property event_type is equal cart) product product_id of brand brand of category category_code (category_code) with price price at event_time

File structure

Property	Description
event_time	Time when event happened at (in UTC).
event_type	Only one kind of event: purchase.
product_id	ID of a product
category_id	Product's category ID
category_code	Product's category taxonomy (code name) if it was possible to make it. Usually present for meaningful categories and skipped for different kinds of accessories.
brand	Downcased string of brand name. Can be missed.
price	Float price of a product. Present.
user_id	Permanent user ID.
user_session	Temporary user's session ID. Same for each user's session. Is changed every time user come back to online store from a long pause.

Event types

Events can be:

view - a user viewed a product
cart - a user added a product to shopping cart
remove_from_cart - a user removed a product from shopping cart
purchase - a user purchased a product

Multiple purchases per session

A session can have multiple purchase events. It's ok, because it's a single order.

Many thanks

Thanks to REES46 Marketing Platform for this dataset.

Using datasets in your works, books, education materials

You can use this dataset for free. Just mention the source of it: link to this page and link to REES46 Marketing Platform.

d
Johns Hopkins COVID-19 Case Tracker
data.world
csv, zip
Updated Mar 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Associated Press (2025). Johns Hopkins COVID-19 Case Tracker [Dataset]. https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker
Explore at:
zip, csvAvailable download formats
Dataset updated
Mar 25, 2025
Authors
The Associated Press
Time period covered
Jan 22, 2020 - Mar 9, 2023
Area covered
Description
Updates

Notice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.

CDC Weekly case and death counts (national and state level)

CDC County level cases and deaths

HHS New hospital admissions

CDC NowCast COVID variant proportions (national and regional level)

April 9, 2020

The population estimate data for New York County, NY has been updated to include all five New York City counties (Kings County, Queens County, Bronx County, Richmond County and New York County). This has been done to match the Johns Hopkins COVID-19 data, which aggregates counts for the five New York City counties to New York County.

April 20, 2020

Johns Hopkins death totals in the US now include confirmed and probable deaths in accordance with CDC guidelines as of April 14. One significant result of this change was an increase of more than 3,700 deaths in the New York City count. This change will likely result in increases for death counts elsewhere as well. The AP does not alter the Johns Hopkins source data, so probable deaths are included in this dataset as well.

April 29, 2020

The AP is now providing timeseries data for counts of COVID-19 cases and deaths. The raw counts are provided here unaltered, along with a population column with Census ACS-5 estimates and calculated daily case and death rates per 100,000 people. Please read the updated caveats section for more information.

September 1st, 2020

Johns Hopkins is now providing counts for the five New York City counties individually.

February 12, 2021

The Ohio Department of Health recently announced that as many as 4,000 COVID-19 deaths may have been underreported through the state’s reporting system, and that the "daily reported death counts will be high for a two to three-day period."

Because deaths data will be anomalous for consecutive days, we have chosen to freeze Ohio's rolling average for daily deaths at the last valid measure until Johns Hopkins is able to back-distribute the data. The raw daily death counts, as reported by Johns Hopkins and including the backlogged death data, will still be present in the new_deaths column.

February 16, 2021

- Johns Hopkins has reconciled Ohio's historical deaths data with the state.

Overview

The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.

The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.

This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.

The AP is updating this dataset hourly at 45 minutes past the hour.

To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.

Queries

Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic

Filter cases by state here

Rank states by their status as current hotspots. Calculates the 7-day rolling average of new cases per capita in each state: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=481e82a4-1b2f-41c2-9ea1-d91aa4b3b1ac

Find recent hotspots within your state by running a query to calculate the 7-day rolling average of new cases by capita in each county: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=b566f1db-3231-40fe-8099-311909b7b687&showTemplatePreview=true

Join county-level case data to an earlier dataset released by AP on local hospital capacity here. To find out more about the hospital capacity dataset, see the full details.

Pull the 100 counties with the highest per-capita confirmed cases here

Rank all the counties by the highest per-capita rate of new cases in the past 7 days here. Be aware that because this ranks per-capita caseloads, very small counties may rise to the very top, so take into account raw caseload figures as well.

Interactive

The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.

@(https://datawrapper.dwcdn.net/nRyaf/15/)

Interactive Embed Code

<iframe title="USA counties (2018) choropleth map Mapping COVID-19 cases by county" aria-describedby="" id="datawrapper-chart-nRyaf" src="https://datawrapper.dwcdn.net/nRyaf/10/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important;" height="400"></iframe><script type="text/javascript">(function() {'use strict';window.addEventListener('message', function(event) {if (typeof event.data['datawrapper-height'] !== 'undefined') {for (var chartId in event.data['datawrapper-height']) {var iframe = document.getElementById('datawrapper-chart-' + chartId) || document.querySelector("iframe[src*='" + chartId + "']");if (!iframe) {continue;}iframe.style.height = event.data['datawrapper-height'][chartId] + 'px';}}});})();</script>

Caveats

This data represents the number of cases and deaths reported by each state and has been collected by Johns Hopkins from a number of sources cited on their website.

In some cases, deaths or cases of people who've crossed state lines -- either to receive treatment or because they became sick and couldn't return home while traveling -- are reported in a state they aren't currently in, because of state reporting rules.

In some states, there are a number of cases not assigned to a specific county -- for those cases, the county name is "unassigned to a single county"

This data should be credited to Johns Hopkins University's COVID-19 tracking project. The AP is simply making it available here for ease of use for reporters and members.

Caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.

Population estimates at the county level are drawn from 2014-18 5-year estimates from the American Community Survey.

The Urban/Rural classification scheme is from the Center for Disease Control and Preventions's National Center for Health Statistics. It puts each county into one of six categories -- from Large Central Metro to Non-Core -- according to population and other characteristics. More details about the classifications can be found here.

Johns Hopkins timeseries data - Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count. - Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here

Attribution

This data should be credited to Johns Hopkins University COVID-19 tracking project
d
Elevation data for four sites in the coastal marsh at Grand Bay National...
catalog.data.gov
data.doi.gov
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Elevation data for four sites in the coastal marsh at Grand Bay National Estuarine Research Reserve, Mississippi, from July 2018 through January 2020 [Dataset]. https://catalog.data.gov/dataset/elevation-data-for-four-sites-in-the-coastal-marsh-at-grand-bay-national-estuarine-researc
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
To better understand sediment deposition in marsh environments, scientists from the U.S. Geological Survey, St. Petersburg Coastal and Marine Science Center (USGS-SPCMSC) selected four study sites (Sites 5, 6, 7, and 8) along the Point Aux Chenes Bay shoreline of the Grand Bay National Estuarine Research Reserve (GNDNERR), Mississippi. These datasets were collected to serve as baseline data prior to the installation of a living shoreline (a subtidal sill). Each site consisted of five plots located along a transect perpendicular to the marsh-estuary shoreline at 5-meter (m) increments (5, 10, 15, 20, and 25 m from the shoreline). Each plot contained six net sedimentation tiles (NST) that were secured flush to the marsh surface using polyvinyl chloride (PVC) pipe. NST are an inexpensive and simple tool to assess short- and long-term deposition that can be deployed in highly dynamic environments without the compaction associated with traditional coring methods. The NST were deployed for three month sampling periods, measuring sediment deposition from July 2018 to January 2020, with one set of NST being deployed for six months. Sediment deposited on the NST were processed to determine physical characteristics, such as deposition thickness, volume, wet weight/dry weight, grain size, and organic content (loss-on-ignition [LOI]). For select sampling periods, ancillary data (water level, elevation, and wave data) are also provided in this data release. Data were collected during USGS Field Activities Numbers (FAN) 2018-332-FA (18CCT01), 2018-358-FA (18CCT10), 2019-303-FA (19CCT01, 19CCT02, 19CCT03, and 19CCT04, respectively), and 2020-301-FA (20CCT01). Additional survey and data details are available from the U.S. Geological Survey Coastal and Marine Geoscience Data System (CMGDS) at, https://cmgds.marine.usgs.gov/. Data collected between 2016 and 2017 from a related NST study in the GNDNERR (Middle Bay and North Rigolets) can be found at https://doi.org/10.5066/P9BFR2US. Please read the full metadata for details on data collection, dataset variables, and data quality.
d
Website Analytics
catalog.data.gov
data.somervillema.gov
Updated Feb 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data.somervillema.gov (2025). Website Analytics [Dataset]. https://catalog.data.gov/dataset/somerville-analytics
Explore at:
Dataset updated
Feb 7, 2025
Dataset provided by
data.somervillema.gov
Description
Contains view count data for the top 20 pages each day on the Somerville MA city website dating back to 2020. Data is used in the City's dashboard which can be found at https://www.somervilledata.farm/.

Facebook

Twitter

Click to copy link

Link copied

Cite

Statista (2024). Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/

Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028

Explore at:

Dataset updated

Nov 21, 2024

Dataset authored and provided by

Statistahttp://statista.com/

Time period covered

May 2024

Area covered

Worldwide

Description

The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 149 zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than 394 zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often. Storage capacity also growing Only a small percentage of this newly created data is kept though, as just two percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.

Clear search

Close search

Google apps

Main menu

Amount of data created, consumed, and stored 2010-2023, with forecasts to...

Internet users in the United States 2020-2029

NYC STEW-MAP Staten Island organizations' website hyperlink webscrape

Statistical Region Data Collection 2020 - Datasets - This service has been...

Statistics Bureau Web Service Interface (WFS) Dataset Collection 2020 -...

Web Data Commons (December 2020) Property and Datatype Usage Dataset

CT-FAN-21 corpus: A dataset for Fake News Detection

U.S. State and Territorial Stay-At-Home Orders: March 15, 2020 – May 31,...

Data from: WikiReddit: Tracing Information and Attention Flows Between...

Preprint

Abstract

Datasheet

Motivation

Composition

Collection Process

Preprocessing/cleaning/labeling

Uses

Distribution

Maintenance

SQL Database Schema

Table: posts

Table: comments

Table: postlinks

Table: commentlinks

Understanding Society: Calendar Year Dataset, 2021: Special Licence Access

U.S. State and Territorial Public Mask Mandates From April 10, 2020 through...

Web page phishing detection

U.S. State and Territorial Orders Closing and Reopening Restaurants Issued...

Annual Population Survey Three-Year Pooled Dataset, January 2018 - December...

Data from: Datasets used to train the Generative Adversarial Networks used...

U.S. State and Territorial Orders Closing and Reopening Bars Issued from...

eCommerce events history in electronics store

About

More datasets

How to read it

File structure

Event types

Multiple purchases per session

Many thanks

Using datasets in your works, books, education materials

Johns Hopkins COVID-19 Case Tracker

Updates

- Johns Hopkins has reconciled Ohio's historical deaths data with the state.

Overview

Queries

Interactive

Interactive Embed Code

Caveats

Attribution

Elevation data for four sites in the coastal marsh at Grand Bay National...

Website Analytics

Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028

Table: `posts`

Table: `comments`

Table: `postlinks`

Table: `commentlinks`