9 datasets found
  1. blog_authorship_corpus

    • huggingface.co
    Updated Jul 27, 2003
    Cite
    Bar-Ilan University (2003). blog_authorship_corpus [Dataset]. https://huggingface.co/datasets/barilan/blog_authorship_corpus
    Explore at:
    Dataset updated
    Jul 27, 2003
    Dataset authored and provided by
    Bar-Ilan University
    License

    https://choosealicense.com/licenses/unknown/

    Description

    The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

    Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

    All bloggers included in the corpus fall into one of three age groups:
    - 8240 "10s" blogs (ages 13-17),
    - 8086 "20s" blogs (ages 23-27),
    - 2994 "30s" blogs (ages 33-47).
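
    The age grouping above can be expressed as a small lookup. This is a minimal sketch, not part of the corpus tooling: the names AGE_GROUPS and age_group are hypothetical, and only the ranges and counts come from the description.

```python
from typing import Optional

# Age ranges and blog counts exactly as stated in the corpus description.
# The dictionary layout itself is an illustrative assumption.
AGE_GROUPS = {
    "10s": (13, 17, 8240),
    "20s": (23, 27, 8086),
    "30s": (33, 47, 2994),
}


def age_group(age: int) -> Optional[str]:
    """Return the corpus age-group label for an age, or None if it falls in a gap."""
    for label, (low, high, _count) in AGE_GROUPS.items():
        if low <= age <= high:
            return label
    return None


# The three group sizes should add up to the number of bloggers in the corpus.
total_blogs = sum(count for _low, _high, count in AGE_GROUPS.values())
```

    Ages between the stated ranges (for example 18-22) return None here, since the description lists no group for them.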

    For each age group there are an equal number of male and female bloggers.

    Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped, with two exceptions: individual posts within a single blogger's file are separated by the date of the following post, and links within a post are denoted by the label urllink.

    The corpus may be freely used for non-commercial research purposes.

  2. BlogCatalog dataset

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Nitin Agarwal; Xufei Wang (2023). BlogCatalog dataset [Dataset]. http://doi.org/10.6084/m9.figshare.11923611.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nitin Agarwal; Xufei Wang
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Abstract: BlogCatalog is a social blog directory that manages bloggers and their blogs.

    Number of Nodes: 10,312
    Number of Edges: 333,983
    Missing Values? no

    Source: Nitin Agarwal+, Xufei Wang*, Huan Liu*
    + Department of Information Science, University of Arkansas at Little Rock. E-mail: nxagarwal@ualr.edu
    * School of Computing, Informatics and Decision Systems Engineering, Arizona State University. E-mail: huan.liu@asu.edu, xufei.wang@asu.edu

    Data Set Information: 2 files are included:
    1. nodes.csv -- the file of all the users. It works as a dictionary of all the users in this data set and is useful for fast reference. It contains all the node ids used in the dataset.
    2. edges.csv -- the friendship network among the bloggers. A blogger's friends are represented using edges. For example, the line
       1,2
       means that the blogger with id "1" is a friend of the blogger with id "2".

    Attribute Information: This data set was crawled in July 2009 from BlogCatalog ( http://www.blogcatalog.com ), a social blog directory website. It contains the crawled friendship network. For easier understanding, all the contents are organized in CSV file format.

    Basic statistics
    Number of bloggers: 88,784
    Number of friendship pairs: 4,186,390

    Relevant Papers:
    - Nitin Agarwal and Huan Liu. "Modeling and Data Mining in Blogosphere", Synthesis Lectures on Data Mining and Knowledge Discovery #1, Morgan & Claypool Publishers, Robert Grossman (Editor), August 2009. ISBN: 9781598299083 (paperback), ISBN: 9781598299090 (ebook).
    - Nitin Agarwal, Magdiel Galan, Huan Liu, and Shankar Subramanya. "WisColl: Collective Wisdom based Blog Clustering". Journal of Information Science, 180(1): 39-61, January 2010.
    - Nitin Agarwal, Huan Liu, Sudheendra Murthy, Arunabha Sen, and Xufei Wang. "A Social Identity Approach to Identify Familiar Strangers in a Social Network". In Proceedings of the Third International AAAI Conference on Weblogs and Social Media (ICWSM09), pp. 2-9, May 17-20, 2009. San Jose, California.
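
    The two-file layout described for this dataset (one node id per row in nodes.csv, one "id,id" pair per row in edges.csv) could be read into an adjacency map roughly as follows. This is a sketch under assumptions: the function name load_friendship_graph is hypothetical, the files are assumed to have no header row, and friendship is treated as symmetric, which the description suggests but does not state outright.

```python
import csv
from typing import Dict, Iterable, Set


def load_friendship_graph(
    node_rows: Iterable[str], edge_rows: Iterable[str]
) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from nodes.csv / edges.csv style rows."""
    graph: Dict[str, Set[str]] = {}
    # Register every node id first so bloggers without friends still appear.
    for row in csv.reader(node_rows):
        if row:
            graph[row[0].strip()] = set()
    # Each edge row "a,b" is stored in both directions (assumed symmetric).
    for row in csv.reader(edge_rows):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        a, b = row[0].strip(), row[1].strip()
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


# Tiny in-memory sample in the same shape as the example row "1,2".
sample_nodes = ["1", "2", "3"]
sample_edges = ["1,2", "2,3"]
graph = load_friendship_graph(sample_nodes, sample_edges)
edge_count = sum(len(friends) for friends in graph.values()) // 2
```

    With the real files, the same function accepts open file objects in place of the sample lists, since csv.reader takes any iterable of lines.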

  3. MySciBlog Survey - Top Read SciBlogs by SciBloggers

    • figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Paige B Jarreau (2023). MySciBlog Survey - Top Read SciBlogs by SciBloggers [Dataset]. http://doi.org/10.6084/m9.figshare.1278974.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Paige B Jarreau
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    In a survey (MySciBlog survey 2014, by Paige Brown Jarreau, Louisiana State University) of over 600 science bloggers, participants were asked to list up to the top three science blogs, other than their own, that they read on a regular basis. The resulting dataset was mapped in Gephi, and laid out according to a ForceAtlas 2 algorithm. Each node represents a science blog (either a survey participant's blog or a blog listed by a participant). Communities (represented in color-coded nodes) were detected automatically in Gephi with a resolution of 3.0. Nodes and node labels are sized according to in-degree (how many times the blog was listed by other bloggers as regularly read).

    Update 12/27/2014: Edited files fix several duplicate nodes (same blogs listed under different names, etc.)
    Update 12/30/2014: Interactive data graphic available at http://bit.ly/MySciBlogRead.
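
    The in-degree used for node sizing is simply the number of respondents who listed a given blog. A minimal sketch of that count follows; the function name in_degree, the response format and the blog names are hypothetical, and the actual layout and sizing were done in Gephi.

```python
from collections import Counter
from typing import Dict, List


def in_degree(responses: Dict[str, List[str]]) -> Counter:
    """Count how many respondents listed each blog as regularly read."""
    counts: Counter = Counter()
    for respondent_blog, listed in responses.items():
        # Up to three blogs per respondent; ignore repeats and self-listings.
        for blog in set(listed[:3]):
            if blog != respondent_blog:
                counts[blog] += 1
    return counts


# Hypothetical survey answers: respondent's own blog -> blogs they read.
sample_responses = {
    "blog_a": ["blog_b", "blog_c"],
    "blog_b": ["blog_c"],
    "blog_d": ["blog_c", "blog_b", "blog_a"],
}
degrees = in_degree(sample_responses)
```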

  4. Twitter bot profiling

    • researchdata.smu.edu.sg
    • smu.edu.sg
    • +1more
    pdf
    Updated May 31, 2023
    Cite
    Living Analytics Research Centre (2023). Twitter bot profiling [Dataset]. http://doi.org/10.25440/smu.12062706.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    Living Analytics Research Centre
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    This dataset comprises a set of Twitter accounts in Singapore that are used for social bot profiling research conducted by the Living Analytics Research Centre (LARC) at Singapore Management University (SMU). Here a bot is defined as a Twitter account that generates contents and/or interacts with other users automatically (at least according to human judgment). In this research, Twitter bots have been categorized into three major types:

    - Broadcast bot. This bot aims at disseminating information to a general audience by providing, e.g., benign links to news, blogs or sites. Such a bot is often managed by an organization or a group of people (e.g., bloggers).
    - Consumption bot. The main purpose of this bot is to aggregate contents from various sources and/or provide update services (e.g., horoscope reading, weather update) for personal consumption or use.
    - Spam bot. This type of bot posts malicious contents (e.g., to trick people by hijacking certain accounts or redirecting them to malicious sites), or promotes harmless but invalid/irrelevant contents aggressively.

    This categorization is general enough to cater for new, emerging types of bot (e.g., chatbots can be viewed as a special type of broadcast bots).

    The dataset was collected from 1 January to 30 April 2014 via the Twitter REST and streaming APIs. Starting from popular seed users (i.e., users having many followers), their follow, retweet, and user mention links were crawled. The data collection proceeds by adding those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location. Using this procedure, a total of 159,724 accounts have been collected.

    To identify bots, the first step is to check active accounts who tweeted at least 15 times within the month of April 2014. These accounts were then manually checked and labelled, of which 589 bots were found. As many more human users are expected in the Twitter population, the remaining accounts were randomly sampled and manually checked. With this, 1,024 human accounts were identified. In total, this results in 1,613 labelled accounts.

    Related Publication: R. J. Oentaryo, A. Murdopo, P. K. Prasetyo, and E.-P. Lim. (2016). On profiling bots in social media. Proceedings of the International Conference on Social Informatics (SocInfo’16), 92-109. Bellevue, WA. https://doi.org/10.1007/978-3-319-47880-7_6
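
    The first labelling step (keeping accounts with at least 15 tweets in April 2014) could be sketched as a simple filter. This is an illustration only: the function name active_accounts and the input format (account id mapped to tweet dates) are assumptions, and the manual checking and random sampling steps are not modelled.

```python
from datetime import date
from typing import Dict, List

MIN_TWEETS = 15  # activity threshold stated in the description


def active_accounts(
    tweet_dates: Dict[str, List[date]],
    year: int = 2014,
    month: int = 4,
    min_tweets: int = MIN_TWEETS,
) -> List[str]:
    """Return accounts that tweeted at least min_tweets times in the given month."""
    active = []
    for account, dates in tweet_dates.items():
        in_month = sum(1 for d in dates if d.year == year and d.month == month)
        if in_month >= min_tweets:
            active.append(account)
    return active


# Hypothetical accounts: "a" has 15 April tweets, "b" has 14 in April and 5 in March.
sample = {
    "a": [date(2014, 4, d) for d in range(1, 16)],
    "b": [date(2014, 4, d) for d in range(1, 15)]
    + [date(2014, 3, d) for d in range(1, 6)],
}
active = active_accounts(sample)

# Label counts reported in the description.
labelled_bots = 589
labelled_humans = 1024
total_labelled = labelled_bots + labelled_humans
```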

  5. The internet and everyday rights in Russia - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Jul 17, 2010
    Cite
    (2010). The internet and everyday rights in Russia - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/68c7603b-0a69-5e8d-972d-e18b6030d627
    Explore at:
    Dataset updated
    Jul 17, 2010
    Area covered
    Russia
    Description

    This two-year project analyses whether the internet can champion the causes of citizens in non-democratic states. While there is much speculation that the internet can provide critical social capital when there is a democratic deficit, there is relatively little empirical work on the interplay between online and off-line social protest and action. This project will study the role of the internet in political life in Russia through an analysis of how people seek to fulfil their 'everyday' human rights in gaining access to social services such as pensions and health care. The study uses five central elements to study the role of the internet in these efforts: content, community, catalyst, control and co-optation. The project will analyse internet content against a background of key factors, including the nature and behaviour of online users (community); how internet activity is sparked by real-world events such as protests or funding cuts (catalysts); how the government attempts to regulate the internet (control); and, more pessimistically, how political elites may attempt to hijack the influence of populist bloggers or websites once they have become influential (co-optation).

  6. Microsoft Excel dataset file of YouTube videos.

    • plos.figshare.com
    xlsx
    Updated Nov 29, 2023
    Cite
    Dan Sun; Guochang Zhao (2023). Microsoft Excel dataset file of YouTube videos. [Dataset]. http://doi.org/10.1371/journal.pone.0294665.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Dan Sun; Guochang Zhao
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    YouTube
    Description

    News dissemination plays a vital role in supporting people to incorporate beneficial actions during public health emergencies, thereby significantly reducing the adverse influences of events. Based on big data from YouTube, this research study takes the declaration of COVID-19 National Public Health Emergency (PHE) as the event impact and employs a DiD model to investigate the effect of PHE on the news dissemination strength of relevant videos. The study findings indicate that the views, comments, and likes on relevant videos significantly increased during the COVID-19 public health emergency. Moreover, the public’s response to PHE has been rapid, with the highest growth in comments and views on videos observed within the first week of the public health emergency, followed by a gradual decline and returning to normal levels within four weeks. In addition, during the COVID-19 public health emergency, in the context of different types of media, lifestyle bloggers, local media, and institutional media demonstrated higher growth in the news dissemination strength of relevant videos as compared to news & political bloggers, foreign media, and personal media, respectively. Further, the audience attracted by related news tends to display a certain level of stickiness, therefore this audience may subscribe to these channels during public health emergencies, which confirms the incentive mechanisms of social media platforms to foster relevant news dissemination during public health emergencies. The proposed findings provide essential insights into effective news dissemination in potential future public health events.
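
    For readers unfamiliar with the DiD (difference-in-differences) approach mentioned above, its simplest two-group, two-period form compares the change in a treated group with the change in a control group. The sketch below shows only that textbook form with made-up view counts; the description does not give the study's actual specification, which may well be a regression with additional controls.

```python
from statistics import mean
from typing import Sequence


def did_estimate(
    treated_pre: Sequence[float],
    treated_post: Sequence[float],
    control_pre: Sequence[float],
    control_post: Sequence[float],
) -> float:
    """Two-by-two difference-in-differences on group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical daily views before and after the emergency declaration.
effect = did_estimate(
    treated_pre=[100.0, 120.0],   # relevant videos, before
    treated_post=[180.0, 200.0],  # relevant videos, after
    control_pre=[90.0, 110.0],    # comparison videos, before
    control_post=[100.0, 120.0],  # comparison videos, after
)
```

    Here the treated group rises by 80 and the control group by 10, so the estimated effect is the difference between those two changes.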

  7. The internet, electoral politics and citizen participation in global...

    • b2find.eudat.eu
    Updated Apr 27, 2023
    Cite
    (2023). The internet, electoral politics and citizen participation in global perspective - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/2f02ef66-fc58-5925-baed-66bb0562bf6d
    Explore at:
    Dataset updated
    Apr 27, 2023
    Description

    This project examines how new media are affecting political participation and campaigning in elections worldwide with particular reference to UK and Australian parliamentary elections (2010) and French and US presidential elections (2012). It focuses on the uptake of web 2.0 tools by parties, candidates and voters and asks whether this process is fostering a new type of networked political activism - citizen-campaigning - that challenges established modes of election behaviour and management. More specifically, do the new technologies of blogs, online video and social networking sites enable 'ordinary' voters to play a greater role in the coordination and communication of the campaign, thereby shifting power away from established elites, party members and activists? If so, what factors help to promote this new type of activism at the individual, organisational and institutional level and what does it mean for parties, participation and the wider political system? Do the new forms of engagement ultimately strengthen the representative model of government or encourage a more direct style of involvement by citizens and a by-passing of intermediaries? The research questions are explored using a range of original data including campaign sites, elite and public opinion surveys and new and innovative methodologies developed specifically for web 2.0 platforms.

    Surveys

    Elections studied:
    - United Kingdom 2010 General Election
    - Australia 2010 Federal Parliamentary Election
    - France 2012 Presidential Election
    - United States 2012 Presidential Primaries and General Election

    Data collected: In each case a series of elite and mass-level datasets was collected for meaningful cross-country comparisons to be drawn. The key datasets include:
    - Opinion surveys of citizens' online and offline political activities and attitudes during all 4 countries' elections.
    - Party and candidates' official election websites and web 2.0 presence (i.e. Facebook, blogs, YouTube, Twitter sites)
    - E-campaign manager surveys
    - Elite interviews with e-campaign managers, prominent journalists covering the e-election, and political bloggers

  8. Data from: Bringing ecology blogging into the scientific fold: measuring...

    • datadryad.org
    • search.dataone.org
    • +1more
    zip
    Updated Sep 6, 2017
    Cite
    Manu E. Saunders; Meghan A. Duffy; Stephen B. Heard; Margaret Kosmala; Simon R. Leather; Terrence P. McGlynn; Jeff Ollerton; Amy L. Parachnowitsch (2017). Bringing ecology blogging into the scientific fold: measuring reach and impact of science community blogs [Dataset]. http://doi.org/10.5061/dryad.kf8b0
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 6, 2017
    Dataset provided by
    Dryad
    Authors
    Manu E. Saunders; Meghan A. Duffy; Stephen B. Heard; Margaret Kosmala; Simon R. Leather; Terrence P. McGlynn; Jeff Ollerton; Amy L. Parachnowitsch
    Time period covered
    Sep 4, 2017
    Area covered
    Australia, North America, UK, Europe
    Description

    Bringing ecology blogging into the scientific fold, measuring reach and impact of science community blogs: Supp Material
    Raw datasets used in analyses, including metadata.

  9. Data from: Drama Critiques' Database

    • data.niaid.nih.gov
    Updated Jul 2, 2022
    Cite
    Damien Pellé (2022). Drama Critiques' Database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6787150
    Explore at:
    Dataset updated
    Jul 2, 2022
    Dataset provided by
    Gaëtan Brison
    Mylène Maignant
    Damien Pellé
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Drama Critiques gathers 10 years (2010 – 2020) of London contemporary theatre reviews. By focusing on two literary communities (journalists on the one hand and bloggers on the other hand), this corpus enables one to examine the poetical and political discourse journalistic and digital reviewers build. While our initial corpus is composed of more than 43 000 theatre reviews, the version made available here is constituted of 36 766 reviews. This is explained by the fact that we still have not received the authorisation of some of the bloggers to publish their data in open access. To have more information about our project, the different analyses of the corpus can be found here: https://dramacritiques.com/en/home/

    The corpus based on journalism was created thanks to Theatre Record, a paper magazine originally created by the English critic Ian Herbert. Theatre Record reprints in full all the national drama critics’ reviews of the latest productions in and out of London. It has been published every two weeks in England since January 1981, and its archives were digitized in January 2019 thanks to Julian Oddy (https://www.theatrerecord.com/). We selected 23 newspapers in total, which correspond to 21 717 theatre reviews. All of them were initially available in PDF format. After all the files were converted to a textual format, extensive automatic and manual corrections were applied to each file. This task represents more than 1050 hours of work.

    The corpus based on blog platforms comprises the 28 most popular blog platforms on the Internet. They can be divided into two sub-categories: collective blog platforms on the one hand, and individual blog platforms on the other. Either these digital platforms are run by a publisher who invites other reviewers to post on his website, or the publisher publishes all of his reviews himself. In both cases, these authors are not paid for their activity, and the content of their blog has no printed version and is completely free. This version contains 21 blogs, or 15 049 reviews. All of them were automatically extracted using web scraping techniques before being corrected (250 hours of work). You can discover more about each blog here: https://dramacritiques.com/en/categories-2/the-corpus/


Blog Authorship Corpus (barilan/blog_authorship_corpus)

5 scholarly articles cite this dataset (View in Google Scholar)