License: unknown (https://choosealicense.com/licenses/unknown/)
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)
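As a minimal sketch of how this metadata could be read back from a file name, the snippet below assumes the fields are dot-separated in the order listed above and that each file carries a single extension such as `.xml`. The exact layout, the sample file name, and the names `BloggerInfo` and `parse_blog_filename` are illustrative assumptions, not something this description specifies.

```python
from typing import NamedTuple


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(filename: str) -> BloggerInfo:
    """Split an assumed '<id>.<gender>.<age>.<industry>.<sign>.<ext>' name."""
    stem = filename.rsplit(".", 1)[0]  # drop the assumed file extension
    blogger_id, gender, age, industry, sign = stem.split(".")
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


# Hypothetical file name, used only to exercise the parser.
info = parse_blog_filename("1000331.female.37.indUnk.Leo.xml")
print(info.gender, info.age)
```

The parser raises a `ValueError` if a name does not have exactly five fields, which makes deviations from the assumed layout visible early.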
All bloggers included in the corpus fall into one of three age groups:
- 8,240 "10s" blogs (ages 13-17)
- 8,086 "20s" blogs (ages 23-27)
- 2,994 "30s" blogs (ages 33-47)
For each age group there are an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.
The corpus may be freely used for non-commercial research purposes.
A global study among bloggers conducted in July and August 2023 found that around 76 percent reported having published how-to articles throughout the 12 months preceding the survey. Approximately 55 percent said they posted lists.
License: https://www.isc.org/downloads/software-support-policy/isc-license/
The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ posts, and a pre-defined data split (train/dev/test proportional to ca. 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 for Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.
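A downloaded copy can be checked against the quoted MD5 with a few lines of standard-library Python. This is a sketch only: which file the checksum refers to is not stated here, so the `blog1000.csv.gz` path in the usage note is an assumption, and `md5_of_file` is a hypothetical helper name.

```python
import hashlib

EXPECTED_MD5 = "0a9e38740af9f921b6316b7f400acf06"  # value quoted in the description


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With the file in place, `md5_of_file("blog1000.csv.gz") == EXPECTED_MD5` would confirm that the download matches, assuming the checksum was computed over that archive.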
We first filter out texts shorter than 1,000 characters. Then we select one thousand authors whose writings meet the following criteria:
- at least 10,000 characters in total,
- at most 49,410 characters in total,
- at least 16 posts in total,
- at most 40 posts in total, and
- each text has at least 50 function words found in the Koppel512 list (to filter out non-English prose).
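The filtering steps above can be sketched with pandas as follows. This is not the notebook's actual code: it assumes a frame with `id` and `text` columns, takes the function-word list as a plain set supplied by the caller, counts whitespace-separated token occurrences, and reads the last criterion as "every remaining text of the author has at least 50 function words". It returns all qualifying authors and leaves out the final choice of one thousand. The name `select_authors` is hypothetical.

```python
import pandas as pd


def select_authors(df: pd.DataFrame, function_words: set) -> pd.DataFrame:
    """Apply the described post-level and author-level filters."""
    # Post level: drop texts shorter than 1,000 characters.
    posts = df[df["text"].str.len() >= 1000].copy()
    posts["n_chars"] = posts["text"].str.len()
    # Count tokens that appear in the supplied function-word list (assumed tokenization).
    posts["n_function"] = posts["text"].apply(
        lambda t: sum(tok in function_words for tok in t.lower().split())
    )
    # Author level: aggregate the remaining posts per author id.
    stats = posts.groupby("id").agg(
        total_chars=("n_chars", "sum"),
        n_posts=("n_chars", "count"),
        min_function=("n_function", "min"),
    )
    keep = stats[
        stats["total_chars"].between(10_000, 49_410)  # inclusive bounds
        & stats["n_posts"].between(16, 40)
        & (stats["min_function"] >= 50)
    ].index
    return posts[posts["id"].isin(keep)].drop(columns=["n_chars", "n_function"])
```

Keeping the thresholds as literals mirrors the criteria list directly; in practice they could be lifted into parameters.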
Blog-1K has three columns: 'id', 'text', and 'split', where 'id' corresponds to its parent corpus.
Its creation and statistics can be found in the Jupyter Notebook.
| Split | # Authors | # Posts | # Characters | Avg. Characters Per Author (Std.) | Avg. Characters Per Post (Std.) |
|---|---|---|---|---|---|
| Train | 1,000 | 16,132 | 30,092,057 | 30,092 (5,884) | 1,865 (1,007) |
| Validation | 935 | 2,017 | 3,755,362 | 4,016 (2,269) | 1,862 (999) |
| Test | 924 | 2,017 | 3,732,448 | 4,039 (2,188) | 1,850 (936) |
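import pandas as pd
# Load the compressed CSV; the compression is inferred from the .gz extension.
df = pd.read_csv('blog1000.csv.gz', compression='infer')
# Select the training rows, then unpack the 'text' and 'id' columns into two tuples.
train = df.loc[df['split'] == 'train', ['text', 'id']]
train_text, train_label = zip(*train.itertuples(index=False))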
License: All materials are licensed under the ISC License.
Contact: Please contact the maintainer with questions.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/). License information was derived automatically.
Abstract: BlogCatalog is a social blog directory that manages bloggers and their blogs.
Number of Nodes: 10,312
Number of Edges: 333,983
Missing Values? no
Source:
Nitin Agarwal+, Xufei Wang*, Huan Liu*
+ Department of Information Science, University of Arkansas at Little Rock. E-mail: nxagarwal@ualr.edu
* School of Computing, Informatics and Decision Systems Engineering, Arizona State University. E-mail: huan.liu@asu.edu, xufei.wang@asu.edu
Data Set Information:
2 files are included:
1. nodes.csv -- the file of all the users. This file works as a dictionary of all the users in this data set and is useful for fast reference. It contains all the node ids used in the dataset.
2. edges.csv -- the friendship network among the bloggers. A blogger's friends are represented using edges. Here is an example:
1,2
This means the blogger with id "1" is a friend of the blogger with id "2".
Attribute Information:
This data set was crawled in July 2009 from BlogCatalog (http://www.blogcatalog.com). BlogCatalog is a social blog directory website. The data set contains the crawled friendship network. For easier understanding, all the contents are organized in CSV file format.
Basic statistics:
Number of bloggers: 88,784
Number of friendship pairs: 4,186,390
Relevant Papers:
Nitin Agarwal and Huan Liu. "Modeling and Data Mining in Blogosphere", Synthesis Lectures on Data Mining and Knowledge Discovery #1, Morgan & Claypool Publishers, Robert Grossman (Editor), August 2009. ISBN: 9781598299083 (paperback) ISBN: 9781598299090 (ebook)
Nitin Agarwal, Magdiel Galan, Huan Liu, and Shankar Subramanya. "WisColl: Collective Wisdom based Blog Clustering", Journal of Information Science, 180(1): 39-61, January 2010.
Nitin Agarwal, Huan Liu, Sudheendra Murthy, Arunabha Sen, and Xufei Wang. "A Social Identity Approach to Identify Familiar Strangers in a Social Network", Proceedings of the Third International AAAI Conference on Weblogs and Social Media (ICWSM09), pp. 2 - 9, May 17-20, 2009. San Jose, California.
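The edge-list format described for edges.csv (one `a,b` pair per line) can be read into an adjacency map with the standard library. The sketch below assumes there is no header row and treats friendship as symmetric; both are assumptions, and `load_friendships` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict


def load_friendships(lines):
    """Build an undirected adjacency map from 'a,b' edge rows."""
    friends = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        a, b = row[0].strip(), row[1].strip()
        friends[a].add(b)
        friends[b].add(a)  # friendship is treated as symmetric here
    return friends


# Tiny in-memory stand-in for edges.csv.
sample = io.StringIO("1,2\n1,3\n")
graph = load_friendships(sample)
print(sorted(graph["1"]))  # prints ['2', '3']
```

For the real file, the same function could be called on an open handle, for example `with open("edges.csv", newline="") as f: graph = load_friendships(f)`.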
The blogs in the blogmix are selected through the lists Most visited private blogs, Most visited professional blogs, and the local lists for different regions, at bloggportalen.se.
More information, such as the location and age of the blogger, is also retrieved from Bloggportalen. The material has not been manually checked, which means that spam may occur. Some English blogs have been removed when discovered, and some blogs have not been added for technical reasons.
The blogs span the period from the first to the latest entries of the selected blogs, and the corpus is continually updated.
The dataset contains a folder with PDF files of blog posts written for the Studiotec project's website, https://studiotec.info.
The Political Blogs dataset contains a list of political blogs from the 2004 US Election classified as liberal or conservative, and links between blogs.
As of January 2023, the most popular blog in Sweden was UNDERBARACLARA, described as Sweden's largest blog for those who like feminism and crooked country roads. UNDERBARACLARA had over 185 thousand visitors in the past seven days, followed by Elsa Billgren, with over 134 thousand views in the past week.
What digital content do Swedes read?
A survey conducted in 2019 showed that seven percent of Swedish respondents were reading blogs daily and 49 percent were reading blogs in general. By contrast, 37 percent were reading newspapers online daily and three percent were reading e-books or audio books daily that year.
Blogs by size
Blogging has become a platform for influencers in the past few years. Bloggers are divided into micro influencers and macro influencers. In Sweden in 2020, micro influencers received a minimum of five thousand views, while macro influencers received a minimum of ten thousand. In addition, the average minimum income of micro influencers that year was roughly 2.5 thousand Swedish kronor. Macro influencers received a minimum of 20 thousand Swedish kronor, and icons received 40 thousand Swedish kronor in 2020.
A virtual database created by the Neuroscience Information Framework currently indexing Scientific Blog and News resources such as: Nature Network Blogs, Wired Science Blogs, The Guardian: Science, It Takes 30, Scientific American Cross-Check, Scientific American Bering in Mind, Research Blogging, CENtral Science, ScienceBlogs: Medicine and Health, American Guest Blog, Scientific American Observations, LabSpaces, RetractionWatch.com, Wired Science, Genomes Unzipped, PLoS Blogs, Daring Nucleic Adventures - genegeek, H2SO4Hurts - Brian Krueger PhD, and Sciblogs.
harpreetsahota/medium-blogs-example dataset hosted on Hugging Face and contributed by the HF Datasets community
License: https://webtechsurvey.com/terms
A complete list of live websites using the Class Blogs technology, compiled through global website indexing conducted by WebTechSurvey.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
The blogs in the blogmix are selected through the lists Most visited private blogs, Most visited professional blogs, and the local lists for different regions, at bloggportalen.se.
More information, such as the location and age of the blogger, is also retrieved from Bloggportalen. The material has not been manually checked, which means that spam may occur. Some English blogs have been removed when discovered, and some blogs have not been added for technical reasons.
The blogs span the period from the first to the latest entries of the selected blogs, and the corpus is continually updated.
Krisztian Buza, Budapest University of Technology and Economics, buza '@' cs.bme.hu, http://www.cs.bme.hu/~buza
You can download a zip file from https://archive.ics.uci.edu/ml/datasets/BlogFeedback
This data originates from blog posts. The raw HTML-documents of the blog posts were crawled and processed.
The prediction task associated with the data is the prediction of the number of comments in the upcoming 24 hours.
In order to simulate this situation, we choose a basetime (in the past) and select the blog posts that were published at most 72 hours before the selected base date/time. Then, we calculate all the features of the selected blog posts from the information that was available at the basetime; each instance therefore corresponds to a blog post. The target is the number of comments that the blog post received in the next 24 hours relative to the basetime.
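The selection rule and target definition can be illustrated with a small sketch. It is not the authors' pipeline: the function name `build_instance`, its inputs (a publication time plus a list of comment timestamps), and the choice of inclusive or exclusive window boundaries are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def build_instance(published, comment_times, basetime):
    """Return (comments_before_basetime, target) for one post, or None if not selected."""
    # Select only posts published at most 72 hours before the basetime.
    if not (basetime - timedelta(hours=72) <= published <= basetime):
        return None
    # Features may only use information that was available at the basetime.
    before = sum(1 for t in comment_times if t <= basetime)
    # Target: comments received in the 24 hours after the basetime.
    window_end = basetime + timedelta(hours=24)
    target = sum(1 for t in comment_times if basetime < t <= window_end)
    return before, target


base = datetime(2011, 5, 1, 12, 0)
post = datetime(2011, 4, 30, 12, 0)
comments = [
    datetime(2011, 4, 30, 18, 0),  # before the basetime
    datetime(2011, 5, 1, 15, 0),   # inside the 24-hour target window
    datetime(2011, 5, 3, 9, 0),    # after the target window
]
print(build_instance(post, comments, base))  # prints (1, 1)
```

Because the same post can be selected under several basetimes, one post may yield several instances with different feature values and targets.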
In the train data, the base times were in the years 2010 and 2011. In the test data the base times were in February and March 2012.
This simulates the real-world situation in which training data from the past is available to predict events in the future.
The train data was generated from different base times that may temporally overlap.
Consequently, if you simply split the train data into disjoint partitions, the underlying time intervals may still overlap.
You should therefore use the provided, temporally disjoint train and test splits to ensure that the evaluation is fair.
1...50: Average, standard deviation, min, max and median of the Attributes 51...60 for the source of the current blog post. With source we mean the blog on which the post appeared. For example, myblog.blog.org would be the source of the post myblog.blog.org/post_2010_09_10
51: Total number of comments before basetime
52: Number of comments in the last 24 hours before the base time
53: Let T1 denote the datetime 48 hours before basetime, and let T2 denote the datetime 24 hours before basetime. This attribute is the number of comments in the time period between T1 and T2
54: Number of comments in the first 24 hours after the publication of the blog post, but before basetime
55: The difference of Attribute 52 and Attribute 53
56...60: The same features as the attributes 51...55, but features 56...60 refer to the number of links (trackbacks), while features 51...55 refer to the number of comments.
61: The length of time between the publication of the blog post and base time
62: The length of the blog post
63...262: The 200 bag of words features for 200 frequent words of the text of the blog post
263...269: Binary indicator features (0 or 1) for the weekday (Monday...Sunday) of the basetime
270...276: Binary indicator features (0 or 1) for the weekday (Monday...Sunday) of the date of publication of the blog post
277: Number of parent pages: we consider a blog post P as a parent of blog post B, if B is a reply (trackback) to blog post P.
278...280: Minimum, maximum, average number of comments that the parents received
281: The target: the number of comments in the next 24 hours (relative to base time)
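The attribute layout above maps directly onto column slices. The sketch below assumes the data has been loaded as a headerless table with exactly 281 columns in the listed order (for instance with `pd.read_csv(path, header=None)`); the group names and the function name `split_blogfeedback` are made up for illustration.

```python
import numpy as np
import pandas as pd


def split_blogfeedback(df: pd.DataFrame) -> dict:
    """Slice a 281-column frame into the attribute groups listed above."""
    if df.shape[1] != 281:
        raise ValueError(f"expected 281 columns, got {df.shape[1]}")
    values = df.to_numpy()
    # Attribute k (1-based in the description) is column k - 1 here.
    return {
        "source_stats": values[:, 0:50],           # attributes 1...50
        "comment_counts": values[:, 50:55],        # attributes 51...55
        "link_counts": values[:, 55:60],           # attributes 56...60
        "post_age": values[:, 60],                 # attribute 61
        "post_length": values[:, 61],              # attribute 62
        "bag_of_words": values[:, 62:262],         # attributes 63...262
        "basetime_weekday": values[:, 262:269],    # attributes 263...269
        "publication_weekday": values[:, 269:276], # attributes 270...276
        "parent_stats": values[:, 276:280],        # attributes 277...280
        "target": values[:, 280],                  # attribute 281
    }


# Synthetic two-row frame, used only to check the slicing.
demo = pd.DataFrame(np.arange(2 * 281).reshape(2, 281))
parts = split_blogfeedback(demo)
print(parts["bag_of_words"].shape, parts["target"])
```

Naming the groups once keeps the 1-based attribute numbers from the description out of downstream modelling code.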
Buza, K. (2014). Feedback Prediction for Blogs. In Data Analysis, Machine Learning and Knowledge Discovery (pp. 145-152). Springer International Publishing (http://cs.bme.hu/~buza/pdfs/gfkl2012_blogs.pdf).
Bringing ecology blogging into the scientific fold, measuring reach and impact of science community blogs. Supp Material: Raw datasets used in analyses, including metadata.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
This dataset is about books. It has 2 rows and is filtered to books whose subject is Brosh, Allie-Blogs. It features 9 columns, including author, publication date, language, and book publisher.
As of August 2023, more than **** out of 10 bloggers surveyed worldwide reported using social media to promote their blog posts. E-mail marketing and search engine optimization (SEO) followed, each mentioned by about ********** of respondents.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
The number of micro-blogs by combined inference category CM and corroboration for the ADon_a and ADcomb_a datasets.
News and blogs related to GIS
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
## Overview
Blog is a dataset for object detection tasks - it contains Objects annotations for 2,324 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).