100+ datasets found
  1. blog_authorship_corpus

    • huggingface.co
    Updated Jul 27, 2003
    Cite
    Bar-Ilan University (2003). blog_authorship_corpus [Dataset]. https://huggingface.co/datasets/barilan/blog_authorship_corpus
    Dataset updated
    Jul 27, 2003
    Dataset authored and provided by
    Bar-Ilan University
    License

    https://choosealicense.com/licenses/unknown/

    Description

    The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

    Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)
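As a rough illustration of the naming scheme described above, the sketch below splits one file name into its metadata fields. The dot-separated layout `<id>.<gender>.<age>.<industry>.<sign>.xml`, the `BloggerMeta` type, and the example name are assumptions made for illustration; this listing does not confirm the exact pattern, so adjust it to the files you actually have.

```python
from typing import NamedTuple


class BloggerMeta(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(name: str) -> BloggerMeta:
    """Split an assumed '<id>.<gender>.<age>.<industry>.<sign>.xml' file name."""
    stem = name.rsplit(".", 1)[0]  # drop the assumed '.xml' extension
    parts = stem.split(".")
    if len(parts) != 5:
        raise ValueError(f"unexpected file name layout: {name!r}")
    blogger_id, gender, age, industry, sign = parts
    return BloggerMeta(blogger_id, gender, int(age), industry, sign)


# Hypothetical file name, for illustration only.
meta = parse_blog_filename("1234567.female.24.unknown.Libra.xml")
print(meta.gender, meta.age, meta.industry)
```

Raising on an unexpected layout, rather than guessing, keeps mislabeled files from silently entering an analysis.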

    All bloggers included in the corpus fall into one of three age groups:
    - 8240 "10s" blogs (ages 13-17),
    - 8086 "20s" blogs (ages 23-27),
    - 2994 "30s" blogs (ages 33-47).

    For each age group there are an equal number of male and female bloggers.
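The figures quoted in this description can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable names are illustrative, and the word count is treated as a lower bound because the text says "over 140 million".

```python
# Figures quoted in the description above.
bloggers = 19_320
posts = 681_288
words = 140_000_000  # "over 140 million", so a lower bound
age_groups = {"10s": 8_240, "20s": 8_086, "30s": 2_994}

# The three age groups should account for every blogger.
assert sum(age_groups.values()) == bloggers

# An equal male/female split requires an even count in each group.
assert all(n % 2 == 0 for n in age_groups.values())

posts_per_blogger = posts / bloggers
words_per_blogger = words / bloggers
print(f"{posts_per_blogger:.1f} posts, {words_per_blogger:.0f} words per blogger")
```

The averages come out at roughly 35.3 posts and 7,246 words per blogger, in line with the "approximately 35 posts and 7250 words" stated in the description.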

    Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.

    The corpus may be freely used for non-commercial research purposes.

  2. Most frequent blog content types among bloggers worldwide 2023

    • statista.com
    Updated Aug 22, 2024
    Cite
    Statista (2024). Most frequent blog content types among bloggers worldwide 2023 [Dataset]. https://www.statista.com/statistics/314422/blogging-format-content/
    Dataset updated
    Aug 22, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jul 2023 - Aug 2023
    Area covered
    Worldwide
    Description

    A global study among bloggers conducted in July and August 2023 found that around 76 percent reported having published how-to articles throughout the 12 months preceding the survey. Approximately 55 percent said they posted lists.

  3. Blogs

    • prospectwallet.com
    Updated Aug 4, 2025
    Cite
    Prospect Wallet: B2B Mailing & Email lists | Direct Mail Marketing (2025). Blogs [Dataset]. https://www.prospectwallet.com/blogs/
    Dataset updated
    Aug 4, 2025
    Dataset authored and provided by
    Prospect Wallet: B2B Mailing & Email lists | Direct Mail Marketing
    Description

    Blog categories: Blog, Infographics, Case Study, Glossary, Press Release.

    Trending: Explore Email List by Category & Data Hygiene Services; Technology Email List; Industry Email List; Professional Email List.
    
  4. Blog-1K

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 21, 2022
    Cite
    Haining Wang (2022). Blog-1K [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7455622
    Dataset updated
    Dec 21, 2022
    Dataset authored and provided by
    Haining Wang
    License

    https://www.isc.org/downloads/software-support-policy/isc-license/

    Description

    The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ posts, and a pre-defined data split (train/dev/test proportional to ca. 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 for Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.
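A checksum like the one quoted above is typically used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and which file the published MD5 covers (for example the compressed CSV) is an assumption that should be checked against the dataset's own documentation.

```python
import hashlib
from pathlib import Path

EXPECTED_MD5 = "0a9e38740af9f921b6316b7f400acf06"  # value quoted in the description


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """True when the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

A call such as `verify_download(Path("blog1000.csv.gz"))` would then return `True` only for a byte-identical copy of whichever file the checksum was computed from.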

    1. Preprocessing

    We first filter out texts shorter than 1,000 characters. Then we select one thousand authors whose writings meet the following criteria:
    - accumulatively at least 10,000 characters,
    - accumulatively at most 49,410 characters,
    - accumulatively at least 16 posts,
    - accumulatively at most 40 posts, and
    - each text has at least 50 function words found in the Koppel512 list (to filter out non-English prose).
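The selection criteria above could be sketched roughly as follows. This is not the maintainer's code (the description points to a Jupyter Notebook for that): the function names, the whitespace tokenization, and the choice to exclude an author when any one of their texts fails the function-word test are all assumptions, and the Koppel512 list must be supplied by the caller. How the final one thousand authors are drawn from the eligible pool is not stated, so the sketch does not cap the result.

```python
from collections import defaultdict

# Thresholds quoted in the description above.
MIN_TEXT_CHARS = 1_000
MIN_AUTHOR_CHARS, MAX_AUTHOR_CHARS = 10_000, 49_410
MIN_POSTS, MAX_POSTS = 16, 40
MIN_FUNCTION_WORDS = 50


def count_function_words(text, function_words):
    """Count tokens of `text` that appear in `function_words` (naive whitespace split)."""
    return sum(1 for tok in text.lower().split() if tok in function_words)


def eligible_authors(posts, function_words):
    """Return {author_id: [texts]} for authors meeting the quoted criteria.

    `posts` is an iterable of (author_id, text) pairs.
    """
    by_author = defaultdict(list)
    for author_id, text in posts:
        if len(text) >= MIN_TEXT_CHARS:  # drop short texts first
            by_author[author_id].append(text)

    selected = {}
    for author_id, texts in by_author.items():
        total_chars = sum(len(t) for t in texts)
        if not MIN_AUTHOR_CHARS <= total_chars <= MAX_AUTHOR_CHARS:
            continue
        if not MIN_POSTS <= len(texts) <= MAX_POSTS:
            continue
        # Assumption: one failing text disqualifies the whole author.
        if any(count_function_words(t, function_words) < MIN_FUNCTION_WORDS for t in texts):
            continue
        selected[author_id] = texts
    return selected
```

Filtering short texts before counting posts matters: an author with many short posts can fall below the 16-post minimum once those are removed.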

    Blog-1K has three columns: 'id', 'text', and 'split', where 'id' corresponds to its parent corpus.

    2. Statistics

    Its creation and statistics can be found in the Jupyter Notebook.

    Split       # Authors  # Posts  # Characters  Avg. Characters Per Author (Std.)  Avg. Characters Per Post (Std.)
    Train       1,000      16,132   30,092,057    30,092 (5,884)                     1,865 (1,007)
    Validation  935        2,017    3,755,362     4,016 (2,269)                      1,862 (999)
    Test        924        2,017    3,732,448     4,039 (2,188)                      1,850 (936)
    
    3. Usage

    import pandas as pd

    # the compression format is inferred from the '.gz' extension
    df = pd.read_csv('blog1000.csv.gz', compression='infer')

    # read in training data
    train_text, train_label = zip(*df.loc[df['split'] == 'train', ['text', 'id']].itertuples(index=False))

    4. License: All materials are licensed under the ISC License.

    5. Contact: Please contact its maintainer for questions.

  5. BlogCatalog dataset

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Nitin Agarwal; Xufei Wang (2023). BlogCatalog dataset [Dataset]. http://doi.org/10.6084/m9.figshare.11923611.v3
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Nitin Agarwal; Xufei Wang
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Abstract: BlogCatalog is the social blog directory which manages the bloggers and their blogs.

    Number of Nodes: 10,312
    Number of Edges: 333,983
    Missing Values? no

    Source:
    Nitin Agarwal+, Xufei Wang*, Huan Liu*
    + Department of Information Science, University of Arkansas at Little Rock. E-mail: nxagarwal@ualr.edu
    * School of Computing, Informatics and Decision Systems Engineering, Arizona State University. E-mail: huan.liu@asu.edu, xufei.wang@asu.edu

    Data Set Information:
    2 files are included:
    1. nodes.csv -- the file of all the users. It works as a dictionary of all the users in this data set and is useful for fast reference. It contains all the node ids used in the dataset.
    2. edges.csv -- the friendship network among the bloggers. A blogger's friends are represented using edges. Here is an example:
       1,2
       This means the blogger with id "1" is a friend of the blogger with id "2".

    Attribute Information:
    This data set was crawled in July 2009 from BlogCatalog (http://www.blogcatalog.com). BlogCatalog is a social blog directory website. The data set contains the crawled friendship network. For easier understanding, all the contents are organized in CSV file format.

    Basic statistics:
    Number of bloggers: 88,784
    Number of friendship pairs: 4,186,390

    Relevant Papers:
    - Nitin Agarwal and Huan Liu. "Modeling and Data Mining in Blogosphere", Synthesis Lectures on Data Mining and Knowledge Discovery #1, Morgan & Claypool Publishers, Robert Grossman (Editor), August 2009. ISBN: 9781598299083 (paperback), ISBN: 9781598299090 (ebook).
    - Nitin Agarwal, Magdiel Galan, Huan Liu, and Shankar Subramanya. "WisColl: Collective Wisdom based Blog Clustering". Journal of Information Science, 180(1): 39-61, January 2010.
    - Nitin Agarwal, Huan Liu, Sudheendra Murthy, Arunabha Sen, and Xufei Wang. "A Social Identity Approach to Identify Familiar Strangers in a Social Network". In Proceedings of the Third International AAAI Conference on Weblogs and Social Media (ICWSM09), pp. 2-9, May 17-20, 2009, San Jose, California.
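To make the edge-list format concrete, here is a small sketch that turns rows of the form `id1,id2` into an undirected adjacency map. It assumes that edges.csv has no header row and that friendship is symmetric; the function name and the sample ids are made up for illustration and are not taken from the dataset.

```python
import csv
from collections import defaultdict
from io import StringIO


def load_friendships(edge_file):
    """Build an undirected adjacency map from rows of the form 'id1,id2'."""
    friends = defaultdict(set)
    for row in csv.reader(edge_file):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        a, b = row[0].strip(), row[1].strip()
        friends[a].add(b)
        friends[b].add(a)  # friendship is treated as symmetric
    return friends


# Tiny in-memory stand-in for edges.csv (made-up ids, not real data).
sample = StringIO("1,2\n1,3\n2,3\n4,1\n")
graph = load_friendships(sample)
print(sorted(graph["1"]))
```

For the real file, the same function can be called on an open handle, for example `with open("edges.csv", newline="") as fh: graph = load_friendships(fh)`.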

  6. Blog mix 2003

    • data.europa.eu
    • researchdata.se
    unknown
    + more versions
    Cite
    Göteborgs universitet, Blog mix 2003 [Dataset]. https://data.europa.eu/data/datasets/https-doi-org-10-23695-cjrj-4364?locale=en
    Available download formats: unknown
    Dataset authored and provided by
    Göteborgs universitet
    Description

    The blogs in the blogmix are selected through the lists Most visited private blogs, Most visited professional blogs, and the local lists for different regions, at bloggportalen.se.

    More information, such as the location and age of the blogger is also retrieved from Bloggportalen. The material has not been manually checked, which means that spam may occur. Some English blogs have been removed when discovered, and some blogs have not been added for technical reasons.

    The time of the blogs ranges from the first to the latest entries of the selected blogs, and the corpus is continually updated.

  7. STUDIOTEC BLOGS - Datasets - data.bris

    • data.bris.ac.uk
    Updated Oct 21, 2024
    Cite
    (2024). STUDIOTEC BLOGS - Datasets - data.bris [Dataset]. https://data.bris.ac.uk/data/dataset/1vfev51640v692tokt6fahywgj
    Dataset updated
    Oct 21, 2024
    Description

    The dataset contains a folder with pdf files of blog posts written for the Studiotec project's website, https://studiotec.info.

  8. Political Blogs dataset

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). Political Blogs dataset [Dataset]. https://service.tib.eu/ldmservice/dataset/political-blogs-dataset
    Dataset updated
    Dec 16, 2024
    Description

    The Political Blogs dataset contains a list of political blogs from the 2004 US Election classified as liberal or conservative, and links between blogs.

  9. Ranking of the most popular blogs in Sweden 2023

    • statista.com
    Updated Aug 9, 2024
    Cite
    Statista (2024). Ranking of the most popular blogs in Sweden 2023 [Dataset]. https://www.statista.com/statistics/675977/most-popular-blogs-in-sweden/
    Dataset updated
    Aug 9, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 2023
    Area covered
    Sweden
    Description

    As of January 2023, the most popular blog in Sweden was UNDERBARACLARA, described as Sweden's largest blog for those who like feminism and crooked country roads. UNDERBARACLARA had over 185 thousand visitors in the past seven days, followed by Elsa Billgren, with over 134 thousand views in the past week.

    What digital content do Swedes read?

    A survey conducted in 2019 showed that seven percent of Swedish respondents were reading blogs daily and 49 percent were reading blogs in general. By contrast, 37 percent were reading newspapers online daily and three percent were reading e-books or audio books daily that year.

    Blogs by size

    Blogging became an influencing platform in the past few years. Bloggers have been divided into micro influencers and macro influencers. The minimum views that micro influencers received in Sweden in 2020 were five thousand, while macro influencers got ten thousand views minimum. In addition, the average minimum income that micro influencers received that year was roughly 2.5 thousand Swedish kronor. Macro influencers received 20 thousand Swedish kronor minimum and icons were receiving 40 thousand Swedish kronor in 2020.

  10. Integrated Blogs

    • neuinfo.org
    • dknet.org
    • +2more
    Updated Oct 28, 2024
    Cite
    (2024). Integrated Blogs [Dataset]. http://identifiers.org/RRID:SCR_005386
    Dataset updated
    Oct 28, 2024
    Description

    A virtual database created by the Neuroscience Information Framework currently indexing Scientific Blog and News resources such as: Nature Network Blogs, Wired Science Blogs, The Guardian: Science, It Takes 30, Scientific American Cross-Check, Scientific American Bering in Mind, Research Blogging, CENtral Science, ScienceBlogs: Medicine and Health, American Guest Blog, Scientific American Observations, LabSpaces, RetractionWatch.com, Wired Science, Genomes Unzipped, PLoS Blogs, Daring Nucleic Adventures - genegeek, H2SO4Hurts - Brian Krueger PhD, and Sciblogs.

  11. medium-blogs-example

    • huggingface.co
    Updated Dec 6, 2024
    Cite
    Harpreet Sahota (2024). medium-blogs-example [Dataset]. https://huggingface.co/datasets/harpreetsahota/medium-blogs-example
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 6, 2024
    Authors
    Harpreet Sahota
    Description

    harpreetsahota/medium-blogs-example dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. Websites using Class Blogs

    • webtechsurvey.com
    csv
    Updated Oct 13, 2025
    Cite
    WebTechSurvey (2025). Websites using Class Blogs [Dataset]. https://webtechsurvey.com/technology/class-blogs
    Available download formats: csv
    Dataset updated
    Oct 13, 2025
    Dataset authored and provided by
    WebTechSurvey
    License

    https://webtechsurvey.com/terms

    Time period covered
    2025
    Area covered
    Global
    Description

    A complete list of live websites using the Class Blogs technology, compiled through global website indexing conducted by WebTechSurvey.

  13. Blog mix 2007

    • researchdata.se
    Updated Jan 1, 2024
    + more versions
    Cite
    Språkbanken Text (2024). Blog mix 2007 [Dataset]. http://doi.org/10.23695/DW23-K962
    Dataset updated
    Jan 1, 2024
    Dataset provided by
    University of Gothenburg
    Authors
    Språkbanken Text
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The blogs in the blogmix are selected through the lists Most visited private blogs, Most visited professional blogs, and the local lists for different regions, at bloggportalen.se.

    More information, such as the location and age of the blogger is also retrieved from Bloggportalen. The material has not been manually checked, which means that spam may occur. Some English blogs have been removed when discovered, and some blogs have not been added for technical reasons.

    The time of the blogs ranges from the first to the latest entries of the selected blogs, and the corpus is continually updated.

  14. BlogFeedback Data Set

    • kaggle.com
    zip
    Updated Jul 15, 2022
    Cite
    Julio Tentor (2022). BlogFeedback Data Set [Dataset]. https://www.kaggle.com/datasets/jtentor/blogfeedback-data-set
    Available download formats: zip (2550651 bytes)
    Dataset updated
    Jul 15, 2022
    Authors
    Julio Tentor
    Description

    Source:

    Krisztian Buza Budapest University of Technology and Economics buza '@' cs.bme.hu http://www.cs.bme.hu/~buza

    You can download a zip file from https://archive.ics.uci.edu/ml/datasets/BlogFeedback

    Data Set Information:

    This data originates from blog posts. The raw HTML-documents of the blog posts were crawled and processed.

    The prediction task associated with the data is the prediction of the number of comments in the upcoming 24 hours.

    In order to simulate this situation, we choose a basetime (in the past) and select the blog posts that were published at most 72 hours before the selected base date/time. Then, we calculate all the features of the selected blog posts from the information that was available at the basetime, therefore each instance corresponds to a blog post. The target is the number of comments that the blog post received in the next 24 hours relative to the base time.
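The basetime procedure described above can be illustrated with a small sketch. It is not the dataset authors' code: the function name, the in-memory data layout, the inclusive/exclusive window boundaries, and the timestamps are all assumptions chosen for illustration. It computes only the target, not the input features.

```python
from datetime import datetime, timedelta

WINDOW_BEFORE = timedelta(hours=72)  # posts must be at most this old at basetime
TARGET_WINDOW = timedelta(hours=24)  # comments counted after basetime


def build_instances(posts, basetime):
    """Yield (post_id, target) for posts published within 72 hours before `basetime`.

    `posts` maps a post id to (published_at, [comment timestamps]).
    """
    for post_id, (published_at, comment_times) in posts.items():
        if not (basetime - WINDOW_BEFORE <= published_at <= basetime):
            continue
        target = sum(1 for t in comment_times if basetime < t <= basetime + TARGET_WINDOW)
        yield post_id, target


# Made-up example data.
basetime = datetime(2011, 3, 10, 12, 0)
posts = {
    "a": (datetime(2011, 3, 9, 8, 0), [
        datetime(2011, 3, 9, 9, 0),    # before basetime: a feature, not the target
        datetime(2011, 3, 10, 13, 0),  # within 24 hours after basetime
        datetime(2011, 3, 11, 11, 0),  # within 24 hours after basetime
        datetime(2011, 3, 11, 13, 0),  # more than 24 hours after basetime
    ]),
    "b": (datetime(2011, 3, 1, 8, 0), [datetime(2011, 3, 10, 13, 0)]),  # too old
}
instances = dict(build_instances(posts, basetime))
print(instances)
```

Here post "a" becomes an instance with a target of 2, while post "b" is skipped because it was published more than 72 hours before the basetime.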

    In the train data, the base times were in the years 2010 and 2011. In the test data the base times were in February and March 2012.

    This simulates the real-world situation in which training data from the past is available to predict events in the future.

    The train data was generated from different base times that may temporally overlap.

    Therefore, if you simply split the train into disjoint partitions, the underlying time intervals may overlap.

    Therefore, you should use the provided, temporally disjoint train and test splits in order to ensure that the evaluation is fair.

    Attribute Information:

    1...50: Average, standard deviation, min, max and median of the Attributes 51...60 for the source of the current blog post. With source we mean the blog on which the post appeared. For example, myblog.blog.org would be the source of the post myblog.blog.org/post_2010_09_10

    51: Total number of comments before basetime
    52: Number of comments in the last 24 hours before the base time
    53: Let T1 denote the datetime 48 hours before basetime, and let T2 denote the datetime 24 hours before basetime. This attribute is the number of comments in the time period between T1 and T2
    54: Number of comments in the first 24 hours after the publication of the blog post, but before basetime
    55: The difference of Attribute 52 and Attribute 53
    56...60: The same features as the attributes 51...55, but features 56...60 refer to the number of links (trackbacks), while features 51...55 refer to the number of comments.
    61: The length of time between the publication of the blog post and base time
    62: The length of the blog post
    63...262: The 200 bag of words features for 200 frequent words of the text of the blog post
    263...269: binary indicator features (0 or 1) for the weekday (Monday...Sunday) of the basetime
    270...276: binary indicator features (0 or 1) for the weekday (Monday...Sunday) of the date of publication of the blog post
    277: Number of parent pages: we consider a blog post P as a parent of blog post B, if B is a reply (trackback) to blog post P.
    278...280: Minimum, maximum, average number of comments that the parents received
    281: The target: the number of comments in the next 24 hours (relative to base time)
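Because the attribute list is 1-based while most tools index from 0, it can help to spell out the column layout once. The sketch below maps the ranges listed above to 0-based slices; the group names and the `split_row` helper are illustrative assumptions, and the synthetic row exists only to show that the slices line up.

```python
# 1-based attribute ranges as listed above, mapped to names (names are illustrative).
ATTRIBUTE_RANGES = {
    "source_stats": (1, 50),
    "comment_counts": (51, 55),
    "link_counts": (56, 60),
    "post_age": (61, 61),
    "post_length": (62, 62),
    "bag_of_words": (63, 262),
    "basetime_weekday": (263, 269),
    "publication_weekday": (270, 276),
    "parent_count": (277, 277),
    "parent_comment_stats": (278, 280),
    "target": (281, 281),
}


def split_row(row):
    """Split one 281-value row into named groups using 0-based slices."""
    if len(row) != 281:
        raise ValueError(f"expected 281 values, got {len(row)}")
    return {name: row[start - 1:end] for name, (start, end) in ATTRIBUTE_RANGES.items()}


# Synthetic row whose values equal their 1-based attribute numbers.
groups = split_row(list(range(1, 282)))
print(len(groups["bag_of_words"]), groups["target"])
```

The group sizes add up to 281, so every column is assigned to exactly one group, with the target as the final column.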

    Relevant Papers:

    Buza, K. (2014). Feedback Prediction for Blogs. In Data Analysis, Machine Learning and Knowledge Discovery (pp. 145-152). Springer International Publishing (http://cs.bme.hu/~buza/pdfs/gfkl2012_blogs.pdf).

  15. Data from: Bringing ecology blogging into the scientific fold: measuring...

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    • +2more
    zip
    Updated Sep 6, 2017
    Cite
    Manu E. Saunders; Meghan A. Duffy; Stephen B. Heard; Margaret Kosmala; Simon R. Leather; Terrence P. McGlynn; Jeff Ollerton; Amy L. Parachnowitsch (2017). Bringing ecology blogging into the scientific fold: measuring reach and impact of science community blogs [Dataset]. http://doi.org/10.5061/dryad.kf8b0
    Available download formats: zip
    Dataset updated
    Sep 6, 2017
    Dataset provided by
    Dryad
    Authors
    Manu E. Saunders; Meghan A. Duffy; Stephen B. Heard; Margaret Kosmala; Simon R. Leather; Terrence P. McGlynn; Jeff Ollerton; Amy L. Parachnowitsch
    Time period covered
    Sep 4, 2017
    Area covered
    Australia, North America, Europe, UK
    Description

    Bringing ecology blogging into the scientific fold, measuring reach and impact of science community blogs Supp Material: Raw datasets used in analyses, including metadata.

  16. Dataset of books about Brosh, Allie-Blogs

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books about Brosh, Allie-Blogs [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=j0-book_subject&fop0=%3D&fval0=Brosh%2C+Allie-Blogs&j=1&j0=book_subjects
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset is about books. It has 2 rows and is filtered where the book subjects is Brosh, Allie-Blogs. It features 9 columns including author, publication date, language, and book publisher.

  17. Leading ways to promote blog posts among bloggers worldwide 2023

    • statista.com
    Updated Jul 3, 2025
    Cite
    Statista (2025). Leading ways to promote blog posts among bloggers worldwide 2023 [Dataset]. https://www.statista.com/statistics/487515/blog-posts-promoting-bloggers-us/
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jul 2023 - Aug 2023
    Area covered
    Worldwide
    Description

    As of August 2023, more than **** out of 10 bloggers surveyed worldwide reported using social media to promote their blog posts. E-mail marketing and search engine optimization (SEO) followed, each mentioned by about ********** of respondents.

  18. The number of micro-blogs by combined inference category CM and...

    • plos.figshare.com
    xls
    Updated Jun 17, 2023
    + more versions
    Cite
    Marie Truelove; Maria Vasardani; Stephan Winter (2023). The number of micro-blogs by combined inference category CM and corroboration for the ADon_a and ADcomb_a datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0189378.t016
    Available download formats: xls
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Marie Truelove; Maria Vasardani; Stephan Winter
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The number of micro-blogs by combined inference category CM and corroboration for the ADon_a and ADcomb_a datasets.

  19. GIS News and Blogs

    • data.amerigeoss.org
    • data.wu.ac.at
    html
    Updated Aug 9, 2019
    Cite
    Energy Data Exchange (2019). GIS News and Blogs [Dataset]. https://data.amerigeoss.org/sl/dataset/gis-news-and-blogs
    Available download formats: html
    Dataset updated
    Aug 9, 2019
    Dataset provided by
    Energy Data Exchange
    Description

    News and blogs related to GIS

  20. Blog Dataset

    • universe.roboflow.com
    zip
    Updated Mar 29, 2022
    + more versions
    Cite
    veereshg9921@gmail.com (2022). Blog Dataset [Dataset]. https://universe.roboflow.com/veereshg9921-gmail-com/blog-ysqpf
    Available download formats: zip
    Dataset updated
    Mar 29, 2022
    Dataset provided by
    Gmail (http://gmail.com/)
    Authors
    veereshg9921@gmail.com
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Objects Bounding Boxes
    Description

    Blog

    ## Overview
    
    Blog is a dataset for object detection tasks - it contains Objects annotations for 2,324 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
    
The blog_authorship_corpus entry (barilan/blog_authorship_corpus) is cited by 6 scholarly articles.