Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Víctor Yeste. Universitat Politècnica de València.

The object of this study is the design of a cybermetric methodology whose objectives are to measure the success of content published in online media and the possible prediction of the selected success variables. Because data from two separate areas had to be integrated (web publishing, and the analysis of article shares and related topics on Twitter), programmatic access to both the Google Analytics Reporting API v4 and the Twitter Standard API was chosen, always respecting the limits of these APIs.

The website analysed is hellofriki.com, an online media outlet whose primary aim is to meet the need for information on topics that generate a vast number of daily news items, published as news pieces as well as analyses, reports, interviews, and many other information formats. All these contents fall under the sections of cinema, series, video games, literature, and comics.

This dataset has contributed to the elaboration of the PhD Thesis: Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009

Data have been obtained from each breaking-news article published online, according to the indicators described in the doctoral thesis. All related data are stored in a database, divided into the following tables:

tesis_followers: user ID list of the media account's followers.

tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the website.
status_id: tweet ID
created_at: date of publication
text: content of the tweet
path: URL extracted after processing the shortened URL in text
post_shared: WordPress ID of the article being shared
retweet_count: number of retweets
favorite_count: number of favorites

tesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the website (other typologies: automatic Facebook shares, custom tweets without a link to an article, etc.), with the same fields as tesis_hometimeline.

tesis_posts: data on articles published by the website and processed in some analysis.
stats_id: analysis ID
post_id: article ID in WordPress
post_date: article publication date in WordPress
post_title: title of the article
path: URL of the article on the media website
tags: IDs of the WordPress tags related to the article
uniquepageviews: unique page views
entrancerate: entrance rate
avgtimeonpage: average time on page
exitrate: exit rate
pageviewspersession: page views per session
adsense_adunitsviewed: number of ad units viewed by users
adsense_viewableimpressionpercent: viewable ad impression rate
adsense_ctr: ad click-through rate
adsense_ecpm: estimated ad revenue per 1,000 page views

tesis_stats: data from a particular analysis, performed for each published breaking-news item. Fields with statistical values can be computed from the data in the other tables, but totals and averages are saved for faster and easier further processing.
id: analysis ID
phase: phase of the thesis in which the analysis was carried out (currently all are 1)
time: "0" if at the time of publication, "1" if 14 days later
start_date: date and time of the measurement on the day of publication
end_date: date and time of the measurement made 14 days later
main_post_id: ID of the published article to be analysed
main_post_theme: main section of the published article to be analysed
superheroes_theme: "1" if about superheroes, "0" if not
trailer_theme: "1" if about a trailer, "0" if not
name: empty field, to optionally add a custom name manually
notes: empty field, to optionally add notes manually (e.g. that a tag was removed manually for being considered too generic, even though the editor had added it)
num_articles: number of articles analysed
num_articles_with_traffic: number of analysed articles with traffic (taken into account for the traffic analysis)
num_articles_with_tw_data: number of articles with data from when they were shared on the media's Twitter account
num_terms: number of terms analysed
uniquepageviews_total: total unique page views
uniquepageviews_mean: average unique page views
entrancerate_mean: average entrance rate
avgtimeonpage_mean: average time on page
exitrate_mean: average exit rate
pageviewspersession_mean: average page views per session
adsense_adunitsviewed_total: total ad units viewed
adsense_adunitsviewed_mean: average ad units viewed
adsense_viewableimpressionpercent_mean: average viewable ad impression rate
adsense_ctr_mean: average ad click-through rate
adsense_ecpm_mean: average estimated ad revenue per 1,000 page views
retweet_count_total: total retweets
retweet_count_mean: average retweets
favorite_count_total: total favorites
favorite_count_mean: average favorites
terms_ini_num_tweets: total tweets on the terms on the day of publication
terms_ini_retweet_count_total: total retweets on the terms on the day of publication
terms_ini_retweet_count_mean: average retweets on the terms on the day of publication
terms_ini_favorite_count_total: total favorites on the terms on the day of publication
terms_ini_favorite_count_mean: average favorites on the terms on the day of publication
terms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet mentioning the terms on the day of publication
terms_ini_user_num_followers_mean: average followers of users who mentioned the terms on the day of publication
terms_ini_user_num_tweets_mean: average number of tweets published by users who mentioned the terms on the day of publication
terms_ini_user_age_mean: average age in days of users who mentioned the terms on the day of publication
terms_ini_ur_inclusion_rate: URL inclusion rate of tweets mentioning the terms on the day of publication
terms_end_num_tweets: total tweets on the terms 14 days after publication
terms_end_retweet_count_total: total retweets on the terms 14 days after publication
terms_end_retweet_count_mean: average retweets on the terms 14 days after publication
terms_end_favorite_count_total: total favorites on the terms 14 days after publication
terms_end_favorite_count_mean: average favorites on the terms 14 days after publication
terms_end_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet mentioning the terms 14 days after publication
terms_end_user_num_followers_mean: average followers of users who mentioned the terms 14 days after publication
terms_end_user_num_tweets_mean: average number of tweets published by users who mentioned the terms 14 days after publication
terms_end_user_age_mean: average age in days of users who mentioned the terms 14 days after publication
terms_end_ur_inclusion_rate: URL inclusion rate of tweets mentioning the terms 14 days after publication

tesis_terms: data on the terms (tags) related to the processed articles.
stats_id: analysis ID
time: "0" if at the time of publication, "1" if 14 days later
term_id: term (tag) ID in WordPress
name: name of the term
slug: URL slug of the term
num_tweets: number of tweets
retweet_count_total: total retweets
retweet_count_mean: average retweets
favorite_count_total: total favorites
favorite_count_mean: average favorites
followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet mentioning the term
user_num_followers_mean: average followers of users who were talking about the term
user_num_tweets_mean: average number of tweets published by users who were talking about the term
user_age_mean: average age in days of users who were talking about the term
url_inclusion_rate: URL inclusion rate
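The thesis code itself is not reproduced here, but as a hedged illustration of the programmatic access mentioned above, a minimal Python sketch pulling the per-article traffic indicators stored in tesis_posts through the Google Analytics Reporting API v4 could look as follows (the credentials file, view ID and date range are placeholders, not values from the study):

    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    # Authenticate with a service account and build the Reporting API v4 client.
    SCOPES = ["https://www.googleapis.com/auth/analytics.readonly"]
    credentials = service_account.Credentials.from_service_account_file(
        "service-account.json", scopes=SCOPES)  # placeholder credentials file
    analytics = build("analyticsreporting", "v4", credentials=credentials)

    # Request the per-page metrics that correspond to the tesis_posts fields.
    response = analytics.reports().batchGet(body={
        "reportRequests": [{
            "viewId": "123456789",  # placeholder GA view ID
            "dateRanges": [{"startDate": "2020-01-01", "endDate": "2020-01-14"}],
            "dimensions": [{"name": "ga:pagePath"}],
            "metrics": [
                {"expression": "ga:uniquePageviews"},
                {"expression": "ga:entranceRate"},
                {"expression": "ga:avgTimeOnPage"},
                {"expression": "ga:exitRate"},
                {"expression": "ga:pageviewsPerSession"},
            ],
        }]
    }).execute()

    # Each row pairs an article path with the requested metric values.
    for row in response["reports"][0]["data"].get("rows", []):
        print(row["dimensions"][0], row["metrics"][0]["values"])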
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Monthly analytics reports for the Brisbane City Council website
Information regarding the sessions for Brisbane City Council website during the month including page views and unique page views.
http://data.europa.eu/eli/dec/2011/833/oj
This dataset provides statistics on the EUR-Lex website from two views: type of content and number of legal acts available. It is updated on a daily basis.
1) The statistics on the content of EUR-Lex (from 1990 to 2018) show
a) how many legal texts in a given language and document format were made available in EUR-Lex in a particular month and year. They include:
Since the eighth parliamentary term, parliamentary questions are no longer included.
b) bibliographical notices by sector (e.g. case-law, treaties).
2) The statistics on legal acts (from 1990 to 2018) provide yearly and monthly figures on the number of adopted acts (also by author and by type) as well as those repealed and expired in a given month.
Sessions, page views, users, devices and referrals for the DataWorks platform using information from Google Analytics. The interactive dashboard provides data by month from September 2018 for sessions, page views, users, devices and referrals to DataWorks. Previous years are covered by static dashboards. Excel and csv files give data on the use of datasets published on our Data Works platform. Please note, each file uploaded on the site is listed and this may include files now deleted or changed. Our Open Data is also published on Data.Gov.UK - Calderdale
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About the NUDA Dataset
Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.
General
This dataset was created through user feedback on automatically generated bias highlights on news articles on the website NewsUnravel made by ANON. Its goal is to improve the detection of linguistic media bias for analysis and to indicate it to the public. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.
The dataset consists of text, namely sentences with processed binary bias labels (biased or not biased), as well as metadata about the articles. It includes all feedback that was given; the single, unprocessed ratings used to create the labels, together with the corresponding user IDs, are also included.
For training, this dataset was combined with the BABE dataset. All data is completely anonymous. Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset neither identifies sub-populations nor can it be considered sensitive to them, and it is not possible to identify individuals.
Description of the Data Files
This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:
NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
Statistics.png: contains all Umami statistics for NewsUnravel's usage data
Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence and the bias rating, and reason, if given
Article.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %
Participant.csv: holds the participant IDs and data processing consent
Collection Process
Data was collected through interactions with the Feedback Mechanism on NewsUnravel. A news article was displayed with automatically generated bias highlights. Each highlight could be selected, and readers were able to agree or disagree with the automatic label. Through a majority vote, labels were generated from those feedback interactions. Spammers were excluded through a spam detection approach.
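The bias labels in this dataset were produced by the majority vote described above. As a minimal sketch of that step, assuming hypothetical column names (participantId, contentId, rating) in Feedback.csv rather than the files' documented schema:

    import pandas as pd

    # Hypothetical column layout; see the file descriptions above.
    feedback = pd.read_csv("Feedback.csv")  # participantId, contentId, rating, reason

    def majority_label(ratings: pd.Series) -> str:
        """Label a sentence as biased when more than half of its ratings agree."""
        biased_votes = (ratings == "biased").sum()
        return "biased" if biased_votes > len(ratings) / 2 else "not biased"

    labels = feedback.groupby("contentId")["rating"].apply(majority_label).rename("label")
    print(labels.value_counts())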
Readers came to our website voluntarily through posts on LinkedIn and social media as well as posts on university boards. The data collection period lasted for one week, from March 4th to March 11th (2023). The landing page informed them about the goal and the data processing. After being informed, they could proceed to the article overview.
So far, the dataset has been used on top of BABE to train a linguistic bias classifier, adopting hyperparameter configurations from BABE with a pre-trained model from Hugging Face.
The dataset will be open source. On acceptance, a link with all details and contact information will be provided. No third parties are involved.
The dataset will not be maintained as it captures the first test of NewsUnravel at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsUnravel paper if you use the dataset and contact us if you're interested in more information or joining the project.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General data collected for the study "Analysis of the Quantitative Impact of Social Networks on Web Traffic of Cybermedia in the 27 Countries of the European Union".
Four research questions are posed: What percentage of the total web traffic generated by cybermedia in the European Union comes from social networks? Is this percentage higher or lower than the share provided through direct traffic and through search engines via SEO positioning? Which social networks have the greatest impact? And is there any relationship between the specific weight of social networks in a cybermedia's web traffic and factors such as the average duration of the user's visit, the number of page views, or the bounce rate, understood formally as performing no interaction on the visited page beyond reading its content?
To answer these questions, we have first proceeded to a selection of the cybermedia with the highest web traffic of the 27 countries that are currently part of the European Union after the United Kingdom left on December 31, 2020. In each nation we have selected five media using a combination of the global web traffic metrics provided by the tools Alexa (https://www.alexa.com/), which ceased to be operational on May 1, 2022, and SimilarWeb (https://www.similarweb.com/). We have not used local metrics by country since the results obtained with these first two tools were sufficiently significant and our objective is not to establish a ranking of cybermedia by nation but to examine the relevance of social networks in their web traffic.
In all cases, cybermedia owned by a journalistic company have been selected, ruling out those belonging to telecommunications portals or service providers; in some cases they correspond to classic information companies (both newspapers and television channels), while in others they are digital natives, without this circumstance affecting the nature of the research proposed.
We then examined the web traffic data of these cybermedia. The period selected covers the months of October, November and December 2021 and January, February and March 2022. We believe that this six-month stretch smooths out possible one-off variations in any single month, reinforcing the precision of the data obtained.
To obtain these data, we used SimilarWeb, currently the most precise tool available for examining the web traffic of a portal, although it is limited to traffic from desktops and laptops and does not account for traffic from mobile devices, which is currently impossible to determine with the measurement tools on the market.
It includes:
Web traffic general data: average visit duration, pages per visit and bounce rate
Web traffic origin by country
Percentage of traffic generated from social media over total web traffic
Distribution of web traffic generated from social networks
Comparison of web traffic generated from social networks with direct and search procedures (see the sketch after this list)
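As an illustration of how the last two comparisons could be computed from the collected figures, a minimal pandas sketch follows; the file name and column layout are assumptions for illustration, not the study's actual data files:

    import pandas as pd

    # Assumed layout: one row per cybermedia with visits by acquisition channel.
    traffic = pd.read_csv("traffic_by_channel.csv")  # media, country, social, direct, search, other

    channels = ["social", "direct", "search", "other"]
    total = traffic[channels].sum(axis=1)
    for channel in ["social", "direct", "search"]:
        traffic[f"{channel}_share_pct"] = 100 * traffic[channel] / total

    # Average weight of each acquisition channel per EU country.
    print(traffic.groupby("country")[
        ["social_share_pct", "direct_share_pct", "search_share_pct"]].mean())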
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This article contains a set of datasets used to demonstrate a strong regularity in inter-activity time.
See the paper:
User Session Identification Based on Strong Regularities in Inter-activity Time
http://arxiv.org/abs/1411.2878
Abstract
Session identification is a common strategy used to develop metrics for web analytics and behavioral analyses of user-facing systems. Past work has argued that session identification strategies based on an inactivity threshold are inherently arbitrary, or has advocated that thresholds be set at about 30 minutes. In this work, we demonstrate a strong regularity in the temporal rhythms of user-initiated events across several different domains of online activity (incl. video gaming, search, page views and volunteer contributions). We describe a methodology for identifying clusters of user activity and argue that the regularity with which these activity clusters appear implies a good rule-of-thumb inactivity threshold of about 1 hour. We conclude with implications that these temporal rhythms may have for system design based on our observations and theories of goal-directed human activity.
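The rule of thumb from the abstract, splitting a user's event stream into sessions whenever the inactivity gap exceeds roughly one hour, can be sketched as follows; this illustrates the thresholding idea only, not the paper's clustering methodology:

    from datetime import datetime, timedelta

    def sessionize(timestamps, threshold=timedelta(hours=1)):
        """Group sorted event timestamps into sessions; a new session starts
        whenever the inactivity gap exceeds the threshold."""
        sessions, current = [], []
        for ts in sorted(timestamps):
            if current and ts - current[-1] > threshold:
                sessions.append(current)
                current = []
            current.append(ts)
        if current:
            sessions.append(current)
        return sessions

    events = [datetime(2014, 11, 1, 9, 0), datetime(2014, 11, 1, 9, 20),
              datetime(2014, 11, 1, 12, 5)]
    print(len(sessionize(events)))  # 2 sessions: the long midday gap splits the stream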
Background: HealthCyberMap aims at mapping parts of health information cyberspace in novel ways to deliver a semantically superior user experience. This is achieved through "intelligent" categorisation and interactive hypermedia visualisation of health resources using metadata, clinical codes and GIS. HealthCyberMap is an ArcView 3.1 project. WebView, the Internet extension to ArcView, publishes HealthCyberMap ArcView Views as Web client-side imagemaps. The basic WebView set-up does not support any GIS database connection, and published Web maps become disconnected from the original project. A dedicated Internet map server would be the best way to serve HealthCyberMap database-driven interactive Web maps, but it is an expensive and complex solution to acquire, run and maintain. This paper describes HealthCyberMap's simple, low-cost method for "patching" WebView to serve hypermaps with dynamic database drill-down functionality on the Web.
Results: The proposed solution is currently used for publishing HealthCyberMap GIS-generated navigational information maps on the Web while maintaining their links with the underlying resource metadata base.
Conclusion: The authors believe their map-serving approach as adopted in HealthCyberMap has been very successful, especially in cases when only map attribute data change without a corresponding effect on map appearance. It should also be possible to use the same solution to publish other interactive GIS-driven maps on the Web, e.g., maps of real-world health problems.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the AIT Alert Data Set (AIT-ADS), a collection of synthetic alerts suitable for evaluation of alert aggregation, alert correlation, alert filtering, and attack graph generation approaches. The alerts were forensically generated from the AIT Log Data Set V2 (AIT-LDSv2) and originate from three intrusion detection systems, namely Suricata, Wazuh, and AMiner. The data sets comprise eight scenarios, each of which has been targeted by a multi-step attack with attack steps such as scans, web application exploits, password cracking, remote command execution, privilege escalation, etc. Each scenario and attack chain has certain variations, so that attack manifestations and resulting alert sequences vary in each scenario; this means that the data set allows researchers to develop and evaluate approaches that compute similarities of attack chains or merge them into meta-alerts. Since only a few benchmark alert data sets are publicly available, the AIT-ADS was developed to address common issues in the research domain of multi-step attack analysis; specifically, the alert data set contains many false positives caused by normal user behavior (e.g., user login attempts or software updates), heterogeneous alert formats (although all alerts are in JSON format, their fields differ for each IDS), repeated executions of attacks according to an attack plan, collection of alerts from diverse log sources (application logs and network traffic) and all components in the network (mail server, web server, DNS, firewall, file share, etc.), and labels for attack phases. For more information on how this alert data set was generated, check out our paper accompanying this data set [1] or our GitHub repository. More information on the original log data set, including a detailed description of scenarios and attacks, can be found in [2].
The alert data set contains two files for each of the eight scenarios, and a file for their labels:
_aminer.json contains alerts from AMiner IDS
_wazuh.json contains alerts from Wazuh IDS and Suricata IDS
labels.csv contains the start and end times of attack phases in each scenario
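A minimal loading sketch for these files, assuming the JSON files hold one alert per line and that labels.csv carries scenario, phase, start and end columns with ISO timestamps (the field and file names are assumptions; check the accompanying repository for the actual format):

    import csv
    import json
    from datetime import datetime

    def load_alerts(path):
        """Read one JSON alert per line (field sets differ between the three IDSs)."""
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def load_phases(path):
        with open(path) as f:
            return list(csv.DictReader(f))  # assumed columns: scenario, phase, start, end

    def phase_of(alert_time: datetime, phases, scenario: str) -> str:
        """Return the attack phase whose time window contains the alert, if any."""
        for row in phases:
            if row["scenario"] != scenario:
                continue
            if datetime.fromisoformat(row["start"]) <= alert_time <= datetime.fromisoformat(row["end"]):
                return row["phase"]
        return "no_attack_phase"

    wazuh_alerts = load_alerts("fox_wazuh.json")  # assumed file name for the fox scenario
    phases = load_phases("labels.csv")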
Besides false positive alerts, the alerts in the AIT-ADS correspond to the following attacks:
Scans (nmap, WPScan, dirb)
Webshell upload (CVE-2020-24186)
Password cracking (John the Ripper)
Privilege escalation
Remote command execution
Data exfiltration (DNSteal) and stopped service
The total number of alerts involved in the data set is 2,655,821, of which 2,293,628 originate from Wazuh, 306,635 originate from Suricata, and 55,558 originate from AMiner. The numbers of alerts in each scenario are as follows. fox: 473,104; harrison: 593,948; russellmitchell: 45,544; santos: 130,779; shaw: 70,782; wardbeck: 91,257; wheeler: 616,161; wilson: 634,246.
Acknowledgements: Partially funded by the European Defence Fund (EDF) projects AInception (101103385) and NEWSROOM (101121403), and the FFG project PRESENT (FO999899544). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. The European Union cannot be held responsible for them.
If you use the AIT-ADS, please cite the following publications:
[1] Landauer, M., Skopik, F., Wurzenberger, M. (2024): Introducing a New Alert Data Set for Multi-Step Attack Analysis. Proceedings of the 17th Cyber Security Experimentation and Test Workshop. [PDF]
[2] Landauer M., Skopik F., Frank M., Hotwagner W., Wurzenberger M., Rauber A. (2023): Maintainable Log Datasets for Evaluation of Intrusion Detection Systems. IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482. [PDF]
This data is out of date, no longer updated, and may contain inaccurate information; please see Salmonid Population Monitoring Areas - California - CMP [ds3001]. This layer identifies locations where the California Department of Fish and Wildlife (CDFW) is currently conducting or has plans to begin salmonid monitoring operations. This may include DIDSON, lifecycle station and presence/absence surveys, as well as other types of data collection activities. These locations may change from one year to the next.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
These datasets contain information about all audio-video recordings of TED Talks uploaded to the official TED.com website until September 2012. The TED favorites dataset contains information about the videos that registered users have favorited. The TED Talks dataset contains information about all talks, including number of views, number of comments, descriptions, speakers and titles.
The original datasets (in the JSON format) contain all the aforementioned information and in addition, also contain all the data related to content and replies.
The original dataset was obtained from https://www.idiap.ch/dataset/ted and was in the JSON Format. Taken verbatim from the website:
The metadata was obtained by crawling the HTML source of the list of talks and users, as well as talk and user webpages using scripts written by Nikolaos Pappas at the Idiap Research Institute, Martigny, Switzerland. The dataset is shared under the Creative Commons license (the same as the content of the TED talks) which is stored in the COPYRIGHT file. The dataset is shared for research purposes which are explained in detail in the following papers. The dataset can be used to benchmark systems that perform two tasks, namely personalized recommendations and generic recommendations. Please check the CBMI 2013 paper for a detailed description of each task.
The datasets uploaded were used by the second paper listed above.
The ones available here in the CSV format do not include the text data of comments. Instead, they just give you the number of comments on each talk.
I've always been fascinated by TED Talks and the immense diversity of content that it provides for free. I was also thoroughly inspired by a TED Talk that visually explored TED Talks stats, and I was motivated to do the same thing, albeit on a much smaller scale.
Some of the questions that can be answered with this dataset:
1. How is each TED Talk related to every other TED Talk?
2. Which are the most viewed and most favorited Talks of all time? Are they mostly the same? What does this tell us?
3. What kind of topics attract the maximum discussion and debate (in the form of comments)?
4. Which months are most popular among TED and TEDx chapters?
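For example, question 2 can be approached with a few lines of pandas; the file and column names below are assumptions about the CSV exports, not the datasets' documented schema:

    import pandas as pd

    talks = pd.read_csv("ted_talks.csv")          # assumed columns: title, views, comments
    favorites = pd.read_csv("ted_favorites.csv")  # assumed columns: user_id, talk_title

    top_viewed = set(talks.nlargest(10, "views")["title"])
    top_favorited = set(favorites["talk_title"].value_counts().head(10).index)

    # Overlap between the most viewed and the most favorited talks.
    print(top_viewed & top_favorited)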
Note: This dataset has been superseded by the dataset found at "End-Use Load Profiles for the U.S. Building Stock" (submission 4520; linked in the submission resources), which is a comprehensive and validated representation of hourly load profiles in the U.S. commercial and residential building stock. The End-Use Load Profiles project website includes links to data viewers for this new dataset. For documentation of dataset validation, model calibration, and uncertainty quantification, see Wilson et al. (2022). These data were first created around 2012 as a byproduct of various analyses of solar photovoltaics and solar water heating (see references below for two examples). This dataset contains several errors and limitations. It is recommended that users of this dataset transition to the updated version of the dataset posted in the resources. This dataset contains weather data, commercial load profile data, and residential load profile data.
Weather: The Typical Meteorological Year 3 (TMY3) provides one year of hourly data for around 1,000 locations. The TMY weather represents 30-year normals, which are typical weather conditions over a 30-year period.
Commercial: The commercial load profiles included are the 16 ASHRAE 90.1-2004 DOE Commercial Prototype Models simulated in all TMY3 locations, with building insulation levels changing based on ASHRAE 90.1-2004 requirements in each climate zone. The folder names within each resource represent the weather station location of the profiles, whereas the file names represent the building type and the representative city for the ASHRAE climate zone that was used to determine code-compliance insulation levels. As indicated by the file names, all building models represent construction that complied with the ASHRAE 90.1-2004 building energy code requirements. No older or newer vintages of buildings are represented.
Residential: The BASE residential load profiles are five EnergyPlus models (one per climate region) representing 2009 IECC construction single-family detached homes simulated in all TMY3 locations. No older or newer vintages of buildings are represented. Each of the five climate regions includes only one heating fuel type; electric heating is only found in the Hot-Humid climate. Air conditioning is not found in the Marine climate region. One major issue with the residential profiles is that for each of the five climate zones, certain location-specific algorithms from one city were applied to entire climate zones. For example, in the Hot-Humid files, the heating season calculated for Tampa, FL (December 1 - March 31) was unknowingly applied to all other locations in the Hot-Humid zone, which restricts heating operation outside of those days (for example, heating is disabled in Dallas, TX during cold weather in November). This causes the heating energy to be artificially low in colder parts of that climate zone, and conversely the cooling-season restriction leads to artificially low cooling energy use in hotter parts of each climate zone. Additionally, the ground temperatures for the representative city were used across the entire climate zone. This affects water heating energy use (because inlet cold water temperature depends on ground temperature) and heating/cooling energy use (because of ground heat transfer through foundation walls and floors). Representative cities were Tampa, FL (Hot-Humid), El Paso, TX (Mixed-Dry/Hot-Dry), Memphis, TN (Mixed-Humid), Arcata, CA (Marine), and Billings, MT (Cold/Very-Cold).
The residential dataset includes a HIGH building load profile that was intended to provide a rough approximation of older home vintages, but it combines poor thermal insulation with larger house size, tighter thermostat setpoints, and less efficient HVAC equipment. Conversely, the LOW building combines excellent thermal insulation with smaller house size, wider thermostat setpoints, and more efficient HVAC equipment. However, it is not known how well these HIGH and LOW permutations represent the range of energy use in the housing stock. Note that on July 2nd, 2013, the Residential High and Low load files were updated from 366 days in a year for leap years to the more general 365 days in a normal year. The archived residential load data is included from prior to this date.
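As an illustration of working with the hourly profiles, monthly energy can be aggregated with pandas; the file name and column layout below are assumptions, so the actual CSVs should be checked against the documentation:

    import pandas as pd

    # Assumed layout: one row per hour with end-use electricity columns in kWh.
    profile = pd.read_csv("residential_base_tampa.csv", parse_dates=["timestamp"])

    use_columns = [c for c in profile.columns if c != "timestamp"]
    profile["total_kwh"] = profile[use_columns].sum(axis=1)

    # Monthly site electricity, e.g. for comparison against simulated PV output.
    monthly_kwh = profile.set_index("timestamp")["total_kwh"].resample("MS").sum()
    print(monthly_kwh)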
https://groups.csail.mit.edu/vision/datasets/ADE20K/terms/
The authors of the ADE20K dataset address the significant challenge of scene parsing, encompassing the recognition and segmentation of objects and stuff within images, a vital task in the domain of computer vision. Despite the efforts made by the research community to gather data, there remains a scarcity of image datasets that comprehensively cover a broad spectrum of scenes and object categories, along with detailed and dense annotations suitable for scene parsing. To fill this void, the authors introduce the ADE20K dataset. This dataset features diverse annotations that span scenes, objects, parts of objects, and, intriguingly, even parts of parts. In order to facilitate benchmarking for scene parsing, the ADE20K dataset includes 150 object and stuff classes, and various segmentation baseline models undergo evaluation using this benchmark. You can access the hierarchy of classes on the official website of the dataset.
https://www.icpsr.umich.edu/web/ICPSR/studies/37484/terms
The Michigan Public Policy Survey (MPPS) is a program of state-wide surveys of local government leaders in Michigan. The MPPS is designed to fill an important information gap in the policymaking process. While there are ongoing surveys of the business community and of the citizens of Michigan, before the MPPS there were no ongoing surveys of local government officials that were representative of all general purpose local governments in the state. Therefore, while we knew the policy priorities and views of the state's businesses and citizens, we knew very little about the views of the local officials who are so important to the economies and community life throughout Michigan.
The MPPS was launched in 2009 by the Center for Local, State, and Urban Policy (CLOSUP) at the University of Michigan and is conducted in partnership with the Michigan Association of Counties, Michigan Municipal League, and Michigan Townships Association. The associations provide CLOSUP with contact information for the survey's respondents, and consult on survey topics. CLOSUP makes all decisions on survey design, data analysis, and reporting, and receives no funding support from the associations.
The surveys investigate local officials' opinions and perspectives on a variety of important public policy issues and solicit factual information about their localities relevant to policymaking. Over time, the program has covered issues such as fiscal, budgetary and operational policy, fiscal health, public sector compensation, workforce development, local-state governmental relations, intergovernmental collaboration, economic development strategies and initiatives such as placemaking and economic gardening, the role of local government in environmental sustainability, energy topics such as hydraulic fracturing ("fracking") and wind power, trust in government, views on state policymaker performance, opinions on the impacts of the Federal Stimulus Program (ARRA), and more. The program will investigate many other issues relevant to local and state policy in the future. A searchable database of every question the MPPS has asked is available on CLOSUP's website. Results of MPPS surveys are currently available as reports, and via online data tables.
The MPPS datasets are being released in two forms: public-use datasets and restricted-use datasets. The public use datasets are available on OpenICPSR. Unlike the public-use datasets, the restricted-use datasets represent full MPPS survey waves, and include all of the survey questions from a wave. Restricted-use datasets also allow for multiple waves to be linked together for longitudinal analysis. The MPPS staff do still modify these restricted-use datasets to remove jurisdiction and respondent identifiers and to recode other variables in order to protect confidentiality. However, it is theoretically possible that a researcher might be able, in some rare cases, to use enough variables from a full dataset to identify a unique jurisdiction, so access to these datasets is restricted and approved on a case-by-case basis. CLOSUP encourages researchers interested in the MPPS to review the codebooks included in this data collection to see the full list of variables including those not found in the public-use datasets, and to explore the MPPS data using the public-use-datasets. The codebooks for these restricted use datasets are available for download on CLOSUP's website.
The global number of Facebook users was forecast to increase continuously between 2023 and 2027 by a total of 391 million users (+14.36 percent). After a fourth consecutive year of growth, the Facebook user base is estimated to reach 3.1 billion users, a new peak, in 2027. Notably, the number of Facebook users has increased continuously over the past years. User figures, shown here for the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Views, visits and downloads of the datasets published on our Data Works platform. Please note, each file uploaded on the site is listed and this may include files now deleted or changed.
This dataset has information from data.gov.uk to show the number of downloads for each file and from Google Analytics to show the number of visits, unique visits, time spent, bounce rate and exit rate for each dataset.
Our Open Data is also published on data.gov.uk and you can see statistics for this site here - Data.Gov.UK - Calderdale
Facebook received 73,390 user data requests from federal agencies and courts in the United States during the second half of 2023. The social network produced some user data in 88.84 percent of requests from U.S. federal authorities. The United States accounts for the largest share of Facebook user data requests worldwide.
PLEASE NOTE: This record has been retired. It is superseded by 'Shoreline Management Plan Mapping': https://environment.data.gov.uk/dataset/8e383070-d465-11e4-b752-f0def148f590 Shoreline Management Plan Explorer (SMP Explorer) is an online digital and map based system to hold the details of shoreline management plans around the coast of England. SMP-Explorer was developed by the Environment Agency on behalf of Coastal Groups. Data is published as linked data but data on action plans within each Shoreline Management Plan (SMP) is available to download in CSV format. SMP Explorer contains hyperlinks to websites operated by other parties. We do not control such websites and we take no responsibility for, and will not incur any liability in respect of, their content. Our inclusion of hyperlinks to such websites does not imply any endorsement of views, statements or information contained in such websites. Attribution statement: © Environment Agency copyright and/or database right 2024. All rights reserved.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The X-GENRE dataset comprises almost 3,000 web texts in English and Slovenian, manually-annotated with genre labels. The dataset allows for automated genre identification and genre analyses as well as other web corpora research. Inter alia, it was used for the development of the multilingual X-GENRE classifier (http://hdl.handle.net/11356/1961).
The X-GENRE dataset was constructed by merging three manually-annotated datasets by mapping the original schemata to the joint genre schema (the "X-GENRE schema"): 1) the Slovenian GINCO dataset (http://hdl.handle.net/11356/1467), 2) the English CORE dataset (https://github.com/TurkuNLP/CORE-corpus), and 3) the English FTD dataset (https://github.com/ssharoff/genre-keras). All of the original genre datasets are based on web corpora. The X-GENRE schema consists of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (refer to the README provided with the files for the details on the labels).
The dataset is separated into train, development and test splits. The train split consists of 1,772 texts and 1,940,317 words, the development split of 592 texts and 798,025 words, and the test split of 592 texts and 583,595 words. The splits are stratified by labels. As the dataset consists of two English datasets and one Slovenian dataset, the distribution of texts in the two languages is roughly two to one: 2,063 English texts and 893 Slovenian texts.
The dataset is in JSONL format. It has the following attributes: text (text instance), labels (genre label), dataset (original manually-annotated genre dataset from which the instance was obtained – CORE, GINCO or FTD), and language (language of the text – Slovenian or English).
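A minimal sketch for reading one of the JSONL splits and inspecting the genre distribution per language (the split file name is an assumption):

    import json
    from collections import Counter

    def read_jsonl(path):
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]

    train = read_jsonl("X-GENRE_train.jsonl")  # assumed file name for the train split

    # Genre label distribution overall and per language, using the attributes listed above.
    print(Counter(example["labels"] for example in train))
    print(Counter((example["language"], example["labels"]) for example in train))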
This work received funding from the European Union’s Connecting Europe Facility 2014–2020 – CEF Telecom – under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the authors’ views. The Agency is not responsible for any use that may be made of the information it contains.
https://brightdata.com/license
Access our extensive Facebook datasets that provide detailed information on public posts, pages, and user engagement. Gain insights into post performance, audience interactions, page details, and content trends with our ethically sourced data. Free samples are available for evaluation.
Over 940M records available
Price starts at $250/100K records
Data formats available in JSON, NDJSON, CSV, XLSX and Parquet
100% ethical and compliant data collection
Included datapoints:
Post ID
Post Content & URL
Date Posted
Hashtags
Number of Comments
Number of Shares
Likes & Reaction Counts (by type)
Video View Count
Page Name & Category
Page Followers & Likes
Page Verification Status
Page Website & Contact Info
Is Sponsored Post
Attachments (Images/Videos)
External Link Data
And much more