The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 149 zettabytes in 2024. Over the next five years, up to 2028, global data creation is projected to grow to more than 394 zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, driven by increased demand during the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.

Storage capacity also growing

Only a small percentage of this newly created data is kept, though: just two percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.
The number of internet users in the United States was forecast to increase continuously between 2024 and 2029 by a total of 13.5 million users (+4.16 percent). After a ninth consecutive year of growth, the number of users is estimated to reach a new peak of 337.67 million in 2029. Notably, the number of internet users has been increasing continuously over the past years. Depicted is the estimated number of individuals in the country or region at hand that use the internet. As the data source clarifies, connection quality and usage frequency are distinct aspects not taken into account here. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis. For dataset 1: Organizations were selected from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020). For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See the "README" file for further details. References: R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3. Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1. USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/. This dataset is associated with the following publication: Sayles, J., R. Furey, and M.
Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
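The edge-array format mentioned above (node1, node2, edge attribute) can be sketched in Python; the organization domain names and edge weights below are hypothetical, and the real dataset was built with the snaWeb R package rather than this code:

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical excerpt of a cleaned edge-array (node1, node2, edge attribute);
# actual node names come from the scraped Staten Island organization websites.
edges_csv = """node1,node2,weight
org-a.org,org-b.org,3
org-a.org,org-c.org,1
org-b.org,org-c.org,2
"""

# Parse the edge array into an adjacency list keyed by source node.
adjacency = defaultdict(list)
for row in csv.DictReader(StringIO(edges_csv)):
    adjacency[row["node1"]].append((row["node2"], int(row["weight"])))

# Out-degree per organization website (number of distinct link targets).
out_degree = {node: len(targets) for node, targets in adjacency.items()}
print(out_degree)  # {'org-a.org': 2, 'org-b.org': 1}
```

An edge array like this loads directly into most network-analysis tools (e.g., igraph or networkx), which is why the cleaned dataset uses it.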
This dataset collection contains tables sourced from the Statistics Finland (Tilastokeskus) website. The tables in the collection encompass a range of related data with respect to statistical areas in Finland. The source describes the data as 'Statistical Service Interface (WFS)', indicating that it pertains to statistical information accessed through a web feature service. The data is organized in an easy-to-understand table format with distinct rows and columns, making it simple to interpret and analyze. This dataset is licensed under CC BY 4.0 (Creative Commons Attribution 4.0, https://creativecommons.org/licenses/by/4.0/deed.fi).
This dataset collection comprises a series of related data tables sourced from the website of 'Tilastokeskus' (Statistics Finland), based in Finland. The tables within this collection contain data retrieved from the Statistics Finland's service interface (WFS). The content of the tables is organized in a structured format with rows and columns, showcasing a correlation between different sets of data. The collection, while primarily intended for statistical analysis, can be utilized in a variety of ways, depending on the specific needs of the user. This dataset is licensed under CC BY 4.0 (Creative Commons Attribution 4.0, https://creativecommons.org/licenses/by/4.0/deed.fi).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset about the usage of properties and datatypes in the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets (December 2020) based on the Common Crawl September 2020 archive. The dataset has been produced using the RDF Property and Datatype Usage Scanner v1.0.0, which is based on the Apache Jena framework. Only RDFa and embedded JSON-LD data were considered, as Microdata and Microformats do not incorporate explicit datatypes.
Dataset Properties

The number of measurements can be counted with:

gunzip -c measurements.csv.gz | wc -l

The measurements record, per property and datatype, how many values have a lexical representation in the lexical space of the numeric and temporal XSD datatypes (xsd:integer, xsd:decimal, xsd:float, xsd:double, xsd:date, xsd:dateTime, xsd:time), including special float values (INF, +INF, -INF, or NaN) for xsd:float and xsd:double, and how many boolean values use either the true/false notation or the 0/1 notation of xsd:boolean. Note that xsd:double values in embedded JSON-LD got normalized to always use exponential notation with up to 16 fractional digits (see related code). Be careful when drawing conclusions from the corresponding Valid… and Unprecise… measures.

Preview
"CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://schema.org/height","http://www.w3.org/2001/XMLSchema#string","11"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://www.w3.org/ns/csvw#value","http://www.w3.org/2001/XMLSchema#string","27"
"html-embedded-jsonld","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-embedded-jsonld.nq-00000.gz","UnpreciseRepresentableInDouble","http://schema.org/saturatedFatContent","http://www.w3.org/2001/XMLSchema#string","1"
…
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://purl.org/goodrelations/v1#hasMaxValue","http://www.w3.org/2001/XMLSchema#float","2"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://purl.org/goodrelations/v1#hasMinValue","http://www.w3.org/2001/XMLSchema#float","4"
"html-rdfa","http://data.dws.informatik.uni-mannheim.de/structureddata/2020-12/quads/dpef.html-rdfa.nq-05166.gz","ValidZeroOrOneNotation","http://rdfs.org/sioc/ns#num_replies","http://www.w3.org/2001/XMLSchema#integer","1062"
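A minimal sketch of aggregating such a measurements file, assuming only the CSV header shown in the preview (the rows, file names, and quantities below are made up for illustration):

```python
import csv
import gzip
import io
from collections import Counter

# Tiny stand-in for measurements.csv.gz with the header shown in the preview;
# FILE_URL values and quantities are illustrative, not from the real archive.
header = '"CATEGORY","FILE_URL","MEASUREMENT","PROPERTY","DATATYPE","QUANTITY"\n'
rows = (
    '"html-rdfa","file-0.gz","ValidZeroOrOneNotation","p1","t1","2"\n'
    '"html-rdfa","file-0.gz","ValidZeroOrOneNotation","p2","t1","4"\n'
    '"html-embedded-jsonld","file-1.gz","UnpreciseRepresentableInDouble","p3","t2","11"\n'
)
blob = gzip.compress((header + rows).encode("utf-8"))

# Sum QUANTITY per MEASUREMENT, as one would for the real measurements.csv.gz.
totals = Counter()
with gzip.open(io.BytesIO(blob), mode="rt", encoding="utf-8") as fh:
    for record in csv.DictReader(fh):
        totals[record["MEASUREMENT"]] += int(record["QUANTITY"])

print(dict(totals))  # {'ValidZeroOrOneNotation': 6, 'UnpreciseRepresentableInDouble': 11}
```

For the real file, replace the in-memory blob with `gzip.open("measurements.csv.gz", "rt")`.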
Note: The data contain malformed IRIs, like "xsd:dateTime" (instead of, presumably, "http://www.w3.org/2001/XMLSchema#dateTime"), which are caused by missing namespace definitions in the original source website.
Data Access: The data in this research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value as data, so you must be careful to use them only for research purposes. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.
Citation
Please cite our work as
@article{shahi2021overview,
  title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal={Working Notes of CLEF},
  year={2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.
Subtask 3A: Multi-class fake news detection of news articles (English) Subtask 3A is designed as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Subtask 3B: Topical Domain Classification of News Articles (English) Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article (English). This is a classification problem: the task is to categorise fake news articles into six topical categories, such as health, election, crime, climate, and education. This task will be offered for a subset of the data of Subtask 3A.
Input Data
The data will be provided in the format of Id, title, text, rating, and domain; the description of the columns is as follows:
Task 3a
Task 3b
Output data format
Task 3a
Sample File
public_id, predicted_rating
1, false
2, true
Task 3b
Sample file
public_id, predicted_domain
1, health
2, crime
Additional data for Training
To train their models, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:
IMPORTANT!
Evaluation Metrics
This task is evaluated as a classification task. We will use the macro-averaged F1 measure (F1-macro) for the ranking of teams. There is a limit of 5 runs (in total, not per day), and only one person from a team is allowed to submit runs.
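Macro-averaged F1 weights each of the four classes equally, regardless of how often it occurs. A minimal from-scratch sketch (equivalent in behaviour to scikit-learn's `f1_score(..., average="macro")`; the gold and predicted labels below are invented):

```python
from collections import Counter

# The four Subtask 3A classes.
LABELS = ["true", "partially false", "false", "other"]

def f1_macro(gold, pred):
    """Unweighted mean of per-class F1 scores (classes absent from the
    data contribute 0, matching scikit-learn's zero_division=0 default)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = []
    for label in LABELS:
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(LABELS)

# Invented example labels, for illustration only.
gold = ["false", "false", "true", "other"]
pred = ["false", "true", "true", "other"]
print(round(f1_macro(gold, pred), 3))  # 0.583
```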
Submission Link: https://competitions.codalab.org/competitions/31238
Related Work
State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded using Microsoft Excel by one coder with one or more additional coders conducting quality assurance. Data were collected to determine when individuals in states and territories were subject to executive orders, administrative orders, resolutions, and proclamations for COVID-19 that require or recommend people stay in their homes. Data consists exclusively of state and territorial orders, many of which apply to specific counties within their respective state or territory; therefore, data is broken down to the county level. These data are derived from the publicly available state and territorial executive orders, administrative orders, resolutions, and proclamations (“orders”) for COVID-19 that expressly require or recommend individuals stay at home found by the CDC, COVID-19 Community Intervention and At-Risk Task Force, Monitoring and Evaluation Team & CDC, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program from March 15, 2020 through May 31, 2021. These data will be updated as new orders are collected. Any orders not available through publicly accessible websites are not included in these data. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are not included in these data. These data do not include mandatory business closures, curfews, or limitations on public or private gatherings. These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking, which subsequently influences Wikipedia content by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
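The two processing steps described above can be illustrated with a short sketch; the regex below is an assumption for illustration, not the authors' published pattern, and the Reddit post ID is hypothetical:

```python
import hashlib
import re

# Illustrative pattern for Wikipedia article URLs in free text; the dataset
# authors' actual extraction rules are more extensive than this.
WIKI_URL = re.compile(r"https?://([a-z\-]+)\.(?:m\.)?wikipedia\.org/wiki/([^\s)\]]+)")

text = "See https://en.wikipedia.org/wiki/Network_science and other sources."
for match in WIKI_URL.finditer(text):
    lang, title = match.group(1), match.group(2)
    print(lang, title)  # en Network_science

# Anonymize a (hypothetical) Reddit post ID with SHA-256, keeping only the digest.
post_id = "t3_abc123"
digest = hashlib.sha256(post_id.encode("utf-8")).hexdigest()
print(len(digest))  # 64
```

Hashing the IDs keeps records linkable within the dataset (the same ID always maps to the same digest) without exposing the original Reddit identifiers.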
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
post_id | TEXT | Unique identifier for the Reddit post. |
created_at | TIMESTAMP | The timestamp when the post was created. |
updated_at | TIMESTAMP | The timestamp when the post was last updated. |
language_code | TEXT | The language code of the post. |
score | INTEGER | The score (upvotes minus downvotes) of the post. |
upvote_ratio | REAL | The ratio of upvotes to total votes. |
gildings | INTEGER | Number of awards (gildings) received by the post. |
num_comments | INTEGER | Number of comments on the post. |
comments
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
post_id | TEXT | The ID of the Reddit post the comment belongs to. |
parent_id | TEXT | The ID of the parent comment (if a reply). |
comment_id | TEXT | Unique identifier for the comment. |
created_at | TIMESTAMP | The timestamp when the comment was created. |
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
score | INTEGER | The score (upvotes minus downvotes) of the comment. |
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks
Column Name | Type | Description |
---|---|---|
post_id | TEXT | Unique identifier for the Reddit post. |
end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the Reddit post. |
final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
final_url | TEXT | The final URL after redirections. |
redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
commentlinks
Column Name | Type | Description |
---|---|---|
comment_id | TEXT | Unique identifier for the Reddit comment. |
end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the comment. |
final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
final_url | TEXT | The final URL after redirections. |
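Assuming the tables above are distributed as a SQLite database (the source says only "SQL database"), a query joining posts to their resolved links might look like the following sketch; the table contents are invented and only the columns needed for the join are created:

```python
import sqlite3

# Minimal stand-in for the posts and postlinks tables described above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE posts (post_id TEXT, subreddit_id TEXT, score INTEGER);
CREATE TABLE postlinks (post_id TEXT, final_url TEXT, final_valid INTEGER);
INSERT INTO posts VALUES ('p1', 's1', 42), ('p2', 's1', 7);
INSERT INTO postlinks VALUES
  ('p1', 'https://en.wikipedia.org/wiki/Graph_theory', 1),
  ('p2', 'https://en.wikipedia.org/wiki/Graph_theory', 1);
""")

# Posts per linked Wikipedia article, restricted to links that resolved.
rows = con.execute("""
    SELECT pl.final_url, COUNT(*) AS n_posts, AVG(p.score) AS mean_score
    FROM postlinks pl JOIN posts p USING (post_id)
    WHERE pl.final_valid = 1
    GROUP BY pl.final_url
""").fetchall()
print(rows)  # [('https://en.wikipedia.org/wiki/Graph_theory', 2, 24.5)]
```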
Understanding Society (the UK Household Longitudinal Study), which began in 2009, is conducted by the Institute for Social and Economic Research (ISER) at the University of Essex, and the survey research organisations Verian Group (formerly Kantar Public) and NatCen. It builds on and incorporates the British Household Panel Survey (BHPS), which began in 1991.
The Understanding Society: Calendar Year Dataset, 2021, is designed to enable cross-sectional analysis of individuals and households relating specifically to their annual interviews conducted in the year 2021, and therefore combines data collected in three waves (Waves 11, 12 and 13). It has been produced from the same data collected in the main Understanding Society study and released in the longitudinal datasets SN 6614 (End User Licence) and SN 6931 (Special Licence). Such cross-sectional analysis can, however, only involve variables that are collected in every wave, in order to have data for the full sample panel. The 2021 dataset is the second of a series of planned Calendar Year Datasets to facilitate cross-sectional analysis of specific years. Full details of the Calendar Year Dataset sample structure (including why some individual interviews from 2022 are included), data structure and additional supporting information can be found in the document '9194_calendar_year_dataset_2020_user_guide'.
As a multi-topic study, Understanding Society aims to capture the short- and long-term effects of social and economic change in the UK at the household and individual levels. The study has a strong emphasis on domains of family and social ties, employment, education, financial resources, and health. Understanding Society is an annual survey of each adult member of a nationally representative sample. The same individuals are re-interviewed in each wave approximately 12 months apart. When individuals move, they are followed within the UK, and anyone joining their households is also interviewed as long as they are living with them. The fieldwork period for a single wave is 24 months. Data collection uses computer-assisted personal interviewing (CAPI) and web interviews (from wave 7) and includes a telephone mop-up. From March 2020 (the end of wave 10 and second year of wave 11), due to the coronavirus pandemic, face-to-face interviews were suspended, and the survey has been conducted by web and telephone only but otherwise has continued as before. One person completes the household questionnaire. Each person aged 16 or older participates in the individual adult interview and self-completed questionnaire. Youths aged 10 to 15 are asked to respond to a paper self-completion questionnaire. In 2020, an additional frequent web survey was separately issued to sample members to capture data on the rapid changes in people's lives due to the COVID-19 pandemic (see SN 8644). The COVID-19 Survey data are not included in this dataset.
Further information may be found on the Understanding Society main stage webpage and links to publications based on the study can be found on the Understanding Society Latest Research webpage.
Co-funders
In addition to the Economic and Social Research Council, co-funders for the study included the Department for Work and Pensions, the Department for Education, the Department for Transport, the Department for Culture, Media and Sport, the Department for Communities and Local Government, the Department of Health, the Scottish Government, the Welsh Assembly Government, the Northern Ireland Executive, the Department for Environment, Food and Rural Affairs, and the Food Standards Agency.
End User Licence and Special Licence versions:
There are two versions of the Calendar Year 2021 data. One is available under the standard End User Licence (EUL) agreement, and the other is a Special Licence (SL) version. The SL version contains month and year of birth variables instead of just age, more detailed country and occupation coding for a number of variables, and various income variables that have not been top-coded (see xxxx_eul_vs_sl_variable_differences for more details). Users are advised to first obtain the standard EUL version of the data to see if it is sufficient for their research requirements. The SL data have more restrictive access conditions; prospective users of the SL version will need to complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables in order to get permission to use that version. The main longitudinal versions of the Understanding Society study may be found under SNs 6614 (EUL) and 6931 (SL).
Low- and Medium-level geographical identifiers produced for the mainstage longitudinal dataset can be used with this Calendar Year 2021 dataset, subject to SL access conditions. See the User Guide for further details.
Suitable data analysis software
These data are provided by the depositor in Stata format. Users are strongly advised to analyse them in Stata. Transfer to other formats may result in unforeseen issues. Stata SE or MP software is needed to analyse the larger files, which contain about 1,900 variables.
State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded using Microsoft Excel by one coder with one or more additional coders conducting quality assurance. Data were collected to determine when members of the public in states and territories were subject to state and territorial executive orders, administrative orders, resolutions, and proclamations for COVID-19 that require them to wear masks in public. “Members of the public” are defined as individuals operating in a personal capacity. “In public” is defined to mean either (1) anywhere outside the home or (2) both in retail businesses and in restaurants/food establishments. Data consists exclusively of state and territorial orders, many of which apply to specific counties within their respective state or territory; therefore, data is broken down to the county level. These data are derived from publicly available state and territorial executive orders, administrative orders, resolutions, and proclamations (“orders”) for COVID-19 that expressly require individuals to wear masks in public found by the CDC, COVID-19 Community Intervention & Critical Populations Task Force, Monitoring & Evaluation Team, Mitigation Policy Analysis Unit, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program, and Max Gakh, Assistant Professor, School of Public Health, University of Nevada, Las Vegas from April 10, 2020 through July 20, 2021. These data will be updated as new orders are collected. Any orders not available through publicly accessible websites are not included in these data. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are not included in these data. 
Effective and expiration dates were coded using only the dates provided; no distinction was made based on the specific time of the day the order became effective or expired. These data do not include data on counties that have opted out of their state mask mandate pursuant to state law. These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided dataset includes 11,430 URLs with 87 extracted features. The dataset is designed to be used as a benchmark for machine-learning-based phishing detection systems. Features come from three different classes: 56 extracted from the structure and syntax of URLs, 24 extracted from the content of their corresponding pages, and 7 extracted by querying external services. The dataset is balanced: it contains exactly 50% phishing and 50% legitimate URLs. Alongside the dataset, we provide the Python scripts used for the extraction of the features, for potential replication or extension.
dataset_A: contains a list of URLs together with their DOM tree objects, which can be used for replication and for experimenting with new URL- and content-based features, overcoming the short lifetime of phishing web pages.
dataset_B: contains the extracted feature values, which can be used directly as input to classifiers. Note that the data in this dataset are indexed by URL, so the index must be removed before experimentation.
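As a sketch of removing the URL index before classification, assuming a CSV layout with the URL as the first column (the feature names and rows below are invented for illustration, not taken from dataset_B):

```python
import csv
from io import StringIO

# Hypothetical two-row excerpt: first column is the URL index, which must be
# dropped before feeding the feature values to a model; feature names invented.
raw = """url,length_url,nb_dots,status
http://example.com/login,24,1,legitimate
http://phish.example/verify,27,1,phishing
"""

X, y = [], []
for row in csv.DictReader(StringIO(raw)):
    row.pop("url")                               # remove the URL index
    y.append(row.pop("status"))                  # class label
    X.append([float(v) for v in row.values()])   # numeric feature vector

print(X, y)  # [[24.0, 1.0], [27.0, 1.0]] ['legitimate', 'phishing']
```

The resulting `X` and `y` can be passed to any standard classifier (e.g., scikit-learn estimators).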
The datasets were constructed in May 2020. Due to the huge size of dataset_A, only a sample of the dataset is provided; I will try to divide it into sample files and upload them one by one. For a full copy, please contact the author at any time at: hannousse.abdelhakim@univ-guelma.dz
State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded using Microsoft Excel by one coder with one or more additional coders conducting quality assurance. Data were collected to determine when restaurants in states and territories were subject to closing and reopening requirements through executive orders, administrative orders, resolutions, and proclamations for COVID-19. Data consists exclusively of state and territorial orders, many of which apply to specific counties within their respective state or territory; therefore, data is broken down to the county level. These data are derived from publicly available state and territorial executive orders, administrative orders, resolutions, and proclamations (“orders”) for COVID-19 that expressly close or reopen restaurants found by the CDC, COVID-19 Community Intervention & Critical Populations Task Force, Monitoring & Evaluation Team, Mitigation Policy Analysis Unit, and the CDC, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program from March 11, 2020 through May 31, 2021. These data will be updated as new orders are collected. Any orders not available through publicly accessible websites are not included in these data. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are not included in these data. Effective and expiration dates were coded using only the date provided; no distinction was made based on the specific time of the day the order became effective or expired.
These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.
Abstract copyright UK Data Service and data collection copyright owner.

The Annual Population Survey (APS) is a major survey series which aims to provide data that can produce reliable estimates at the local authority level. Key topics covered in the survey include education, employment, health and ethnicity. The APS comprises key variables from the Labour Force Survey (LFS), all its associated LFS boosts and the APS boost. The APS aims to provide enhanced annual data for England, covering a target sample of at least 510 economically active persons for each Unitary Authority (UA)/Local Authority District (LAD) and at least 450 in each Greater London Borough. In combination with local LFS boost samples, the survey provides estimates for a range of indicators down to Local Education Authority (LEA) level across the United Kingdom.

For further detailed information about methodology, users should consult the Labour Force Survey User Guide, included with the APS documentation. For variable and value labelling and coding frames that are not included either in the data or in the current APS documentation, users are advised to consult the latest versions of the LFS User Guides, which are available from the ONS Labour Force Survey - User Guidance webpages.

Occupation data for 2021 and 2022
The ONS has identified an issue with the collection of some occupational data in the 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. None of the ONS' headline statistics, other than those directly sourced from occupational data, are affected, and you can continue to rely on their accuracy. The affected datasets have now been updated.
Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022.

APS Well-Being Datasets
From 2012 to 2015, the ONS published separate APS datasets aimed at providing initial estimates of subjective well-being, based on the Integrated Household Survey. These were discontinued in 2015. A separate set of well-being variables and a corresponding weighting variable have been added to the April-March APS person datasets from A11M12 onwards. Further information on the transition can be found in the Personal well-being in the UK: 2015 to 2016 article on the ONS website.

APS disability variables
Over time, there have been some updates to disability variables in the APS. An article explaining the quality assurance investigations conducted on these variables so far is available on the ONS Methodology webpage.

End User Licence and Secure Access APS data
Users should note that there are two versions of each APS dataset. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. The EUL version includes Government Office Region geography, banded age, 3-digit SOC and industry sector for main, second and last job.
The Secure Access version contains more detailed variables relating to:
- age: single year of age, year and month of birth, age completed full-time education and age obtained highest qualification, age of oldest dependent child and age of youngest dependent child
- family unit and household: including a number of variables concerning the number of dependent children in the family according to their ages, relationship to head of household and relationship to head of family
- nationality and country of origin
- geography: including county, unitary/local authority, place of work, Nomenclature of Territorial Units for Statistics 2 (NUTS2) and NUTS3 regions, and whether lives and works in same local authority district
- health: including main health problem, and current and past health problems
- education and apprenticeship: including numbers and subjects of various qualifications and variables concerning apprenticeships
- industry: including industry, industry class and industry group for main, second and last job, and industry made redundant from
- occupation: including 4-digit Standard Occupational Classification (SOC) for main, second and last job and job made redundant from
- system variables: including week number when interview took place and number of households at address

The Secure Access data have more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if it is sufficient for their research requirements. For the third edition (July 2022), the qualification variable QULNOW was added to the data file.
Main Topics:
Topics covered include: household composition and relationships, housing tenure, nationality, ethnicity and residential history, employment and training (including government schemes), workplace and location, job hunting, and educational background and qualifications. Many of the variables included in the survey are the same as those in the LFS.

Methodology: multi-stage stratified random sample; face-to-face interview; telephone interview. Time period: 2018, 2020.
Three datasets are available, each consisting of 15 CSV files. Each file contains the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range 0.2-0.25, simulated in the ATLAS detector. Two datasets contain photon events with different statistics; the larger sample has about 10 times the number of events of the other. The third dataset contains pions. The pion dataset and the lower-statistics photon dataset were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.
The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped in voxels and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of the csv file defines both the particle and the energy of the sample used to create the file.
The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.
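As a minimal sketch of loading one of these per-energy-point CSV files, the loader below returns each event as a list of per-voxel energies. Whether the files carry a header row is an assumption here; adjust `has_header` to match the actual files, and see the FastCaloGAN repository for the supported tooling.

```python
import csv

def load_voxels(csv_path, has_header=True):
    """Load one energy point: a list of events, each a list of per-voxel
    energies (floats). The header-row assumption is illustrative only."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        if has_header:
            next(reader)  # skip the voxel-label row, if present
        return [[float(x) for x in row] for row in reader]
```

The file name itself encodes the particle type and energy, so a typical workflow loads all 15 files for one particle and concatenates them before training.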
Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.
Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.
State and territorial executive orders, administrative orders, resolutions, and proclamations are collected from government websites and cataloged and coded in Microsoft Excel by one coder, with one or more additional coders conducting quality assurance. The data can be used to determine when bars in states and territories were subject to closing and reopening requirements through executive orders, administrative orders, resolutions, and proclamations (“orders”) issued for COVID-19. The data consist exclusively of state and territorial orders; because many of these orders apply to specific counties within their respective state or territory, the data are broken down to the county level. These data are derived from publicly available orders that expressly close or reopen bars, identified by the CDC, COVID-19 Community Intervention & Critical Populations Task Force, Monitoring & Evaluation Team, Mitigation Policy Analysis Unit, and the CDC, Center for State, Tribal, Local, and Territorial Support, Public Health Law Program, covering March 11, 2020 through May 31, 2021. These data will be updated as new orders are collected. Orders not available through publicly accessible websites are not included. Only official copies of the documents or, where official copies were unavailable, official press releases from government websites describing requirements were coded; news media reports on restrictions were excluded. Recommendations not included in an order are likewise excluded. Effective and expiration dates were coded using only the date provided; no distinction was made based on the time of day the order became effective or expired.
These data do not necessarily represent an official position of the Centers for Disease Control and Prevention.
This file contains behavior data for 5 months (Oct 2019 – Feb 2020) from a large electronics online store.
Each row in the file represents an event. All events are related to products and users, so each event is effectively a many-to-many relation between products and users.
Data was collected by the Open CDP project. Feel free to use this open-source customer data platform.
Check out the other datasets:
There are different types of events. See below.
Semantics (or how to read it):
User user_id, during session user_session, added to the shopping cart (property event_type equals cart) the product product_id of brand brand, from category category_code, with price price, at event_time.
Property | Description
---|---
event_time | Time when the event happened (in UTC).
event_type | Kind of event: view, cart, remove_from_cart or purchase (see below).
product_id | ID of a product.
category_id | Product's category ID.
category_code | Product's category taxonomy (code name), if it could be derived. Usually present for meaningful categories and skipped for various kinds of accessories.
brand | Lowercased brand name. May be missing.
price | Float price of the product. Always present.
user_id | Permanent user ID.
user_session | Temporary session ID. Same for all of a user's events within one session; changes every time the user comes back to the online store after a long pause.
Events can be:
- view - a user viewed a product
- cart - a user added a product to the shopping cart
- remove_from_cart - a user removed a product from the shopping cart
- purchase - a user purchased a product

A session can have multiple purchase events; that is expected, because they all belong to a single order.
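The row semantics above can be sketched with two small helpers; the dict keys mirror the documented columns, but this is an illustrative sketch, not part of the dataset's tooling.

```python
from collections import Counter

def funnel_counts(events):
    """Count how many events of each type (view, cart, remove_from_cart,
    purchase) appear in a list of event dicts."""
    return Counter(e["event_type"] for e in events)

def orders_by_session(events):
    """Group purchased product IDs by user_session: all purchase events
    within one session belong to a single order, as noted above."""
    orders = {}
    for e in events:
        if e["event_type"] == "purchase":
            orders.setdefault(e["user_session"], []).append(e["product_id"])
    return orders
```

Grouping purchases by user_session rather than by user_id matters because one user can place many orders over the five months, while a session maps to a single order.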
Thanks to REES46 Marketing Platform for this dataset.
You can use this dataset for free. Just mention the source of it: link to this page and link to REES46 Marketing Platform.
Notice of data discontinuation: Since the start of the pandemic, AP has reported case and death counts from data provided by Johns Hopkins University. Johns Hopkins University has announced that they will stop their daily data collection efforts after March 10. As Johns Hopkins stops providing data, the AP will also stop collecting daily numbers for COVID cases and deaths. The HHS and CDC now collect and visualize key metrics for the pandemic. AP advises using those resources when reporting on the pandemic going forward.
Update log:
- April 9, 2020
- April 20, 2020
- April 29, 2020
- September 1, 2020
- February 12, 2021 (new_deaths column)
- February 16, 2021
The AP is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.
The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests, and the ability to turn around test results quickly, rather than actual disease spread or true infection rates.
This data is from the Hopkins dashboard that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At this link, you’ll find the Hopkins daily data reports, and a clean version of their feed.
The AP is updating this dataset hourly at 45 minutes past the hour.
To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.
Use AP's queries to filter the data or to join to other datasets we've made available to help cover the coronavirus pandemic
Filter cases by state here
Rank states by their status as current hotspots. Calculates the 7-day rolling average of new cases per capita in each state: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=481e82a4-1b2f-41c2-9ea1-d91aa4b3b1ac
Find recent hotspots within your state by running a query to calculate the 7-day rolling average of new cases per capita in each county: https://data.world/associatedpress/johns-hopkins-coronavirus-case-tracker/workspace/query?queryid=b566f1db-3231-40fe-8099-311909b7b687&showTemplatePreview=true
Join county-level case data to an earlier dataset released by AP on local hospital capacity here. To find out more about the hospital capacity dataset, see the full details.
Pull the 100 counties with the highest per-capita confirmed cases here
Rank all the counties by the highest per-capita rate of new cases in the past 7 days here. Be aware that because this ranks per-capita caseloads, very small counties may rise to the very top, so take into account raw caseload figures as well.
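The hotspot metric behind the queries above can be sketched as a 7-day rolling average of new cases per 100,000 residents; the function below is an illustration under assumed inputs (one county's cumulative daily counts, in date order), not AP's actual query.

```python
def rolling_new_cases_per_100k(cumulative_cases, population, window=7):
    """Day-over-day new cases (clipped at zero to absorb downward
    revisions), averaged over the trailing window and scaled to per-100k."""
    new = [max(b - a, 0) for a, b in zip(cumulative_cases, cumulative_cases[1:])]
    out = []
    for i in range(len(new)):
        span = new[max(0, i - window + 1): i + 1]
        out.append(sum(span) / window / population * 100_000)
    return out
```

Because the metric is per capita, a handful of cases in a tiny county can produce a very high value, which is why the note above recommends checking raw caseloads as well.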
The AP has designed an interactive map to track COVID-19 cases reported by Johns Hopkins.
[Embedded interactive map: "USA counties (2018) choropleth map: Mapping COVID-19 cases by county", https://datawrapper.dwcdn.net/nRyaf/15/]
Johns Hopkins timeseries data:
- Johns Hopkins pulls data regularly to update their dashboard. Once a day, around 8pm EDT, Johns Hopkins adds the counts for all areas they cover to the timeseries file. These counts are snapshots of the latest cumulative counts provided by the source on that day. This can lead to inconsistencies if a source updates their historical data for accuracy, either increasing or decreasing the latest cumulative count.
- Johns Hopkins periodically edits their historical timeseries data for accuracy. They provide a file documenting all errors in their timeseries files that they have identified and fixed here.
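Because the timeseries stores daily snapshots of cumulative counts, a county's series can occasionally decrease when the source revises its history. A quick check for such days (an illustrative helper, not part of the Hopkins feed):

```python
def revision_days(cumulative):
    """Return indices where the cumulative count dropped below the previous
    day's value -- a sign of a downward historical revision."""
    return [i for i in range(1, len(cumulative))
            if cumulative[i] < cumulative[i - 1]]
```

Flagging these days before differencing the series avoids reporting spurious negative daily counts.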
This data should be credited to Johns Hopkins University COVID-19 tracking project
To better understand sediment deposition in marsh environments, scientists from the U.S. Geological Survey, St. Petersburg Coastal and Marine Science Center (USGS-SPCMSC) selected four study sites (Sites 5, 6, 7, and 8) along the Point Aux Chenes Bay shoreline of the Grand Bay National Estuarine Research Reserve (GNDNERR), Mississippi. These datasets were collected to serve as baseline data prior to the installation of a living shoreline (a subtidal sill). Each site consisted of five plots located along a transect perpendicular to the marsh-estuary shoreline at 5-meter (m) increments (5, 10, 15, 20, and 25 m from the shoreline). Each plot contained six net sedimentation tiles (NST) that were secured flush to the marsh surface using polyvinyl chloride (PVC) pipe. NST are an inexpensive and simple tool to assess short- and long-term deposition that can be deployed in highly dynamic environments without the compaction associated with traditional coring methods. The NST were deployed for three-month sampling periods, measuring sediment deposition from July 2018 to January 2020, with one set of NST being deployed for six months. Sediment deposited on the NST was processed to determine physical characteristics, such as deposition thickness, volume, wet weight/dry weight, grain size, and organic content (loss-on-ignition [LOI]). For select sampling periods, ancillary data (water level, elevation, and wave data) are also provided in this data release. Data were collected during USGS Field Activities Numbers (FAN) 2018-332-FA (18CCT01), 2018-358-FA (18CCT10), 2019-303-FA (19CCT01, 19CCT02, 19CCT03, and 19CCT04), and 2020-301-FA (20CCT01). Additional survey and data details are available from the U.S. Geological Survey Coastal and Marine Geoscience Data System (CMGDS) at https://cmgds.marine.usgs.gov/. Data collected between 2016 and 2017 from a related NST study in the GNDNERR (Middle Bay and North Rigolets) can be found at https://doi.org/10.5066/P9BFR2US.
Please read the full metadata for details on data collection, dataset variables, and data quality.
Contains view count data for the top 20 pages each day on the Somerville MA city website dating back to 2020. Data is used in the City's dashboard which can be found at https://www.somervilledata.farm/.