Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
https://www.isc.org/downloads/software-support-policy/isc-license/
The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ posts, and a pre-defined data split (train/dev/test in a ratio of roughly 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 for Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.
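A downloaded copy can be checked against the MD5 quoted above. The following is a minimal sketch using only the standard library; the helper names `md5_of_file` and `verify` are hypothetical, and it is an assumption that the checksum refers to the compressed file 'blog1000.csv.gz' rather than to its decompressed contents.

```python
import hashlib

# Checksum quoted in the dataset description.
EXPECTED_MD5 = "0a9e38740af9f921b6316b7f400acf06"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `verify('blog1000.csv.gz')` should return `True` for an intact download, provided the assumption about which file was hashed holds.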
We first filter out texts shorter than 1,000 characters. Then we select one thousand authors whose writings meet the following criteria:
- at least 10,000 characters in total,
- at most 49,410 characters in total,
- at least 16 posts in total,
- at most 40 posts in total, and
- each text contains at least 50 function words from the Koppel512 list (to filter out non-English prose).
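The selection criteria above could be sketched roughly as follows. This is not the code from the notebook: the function name `select_candidate_posts`, the tokenizer, and the argument `function_words` (standing in for the Koppel512 list, which is not reproduced here) are assumptions. The sketch also assumes that function words are counted as token occurrences and that the function-word check is applied per post before the per-author totals are computed. How the final one thousand authors are chosen among the eligible ones is not stated, so that step is left out.

```python
import re

import pandas as pd


def select_candidate_posts(posts, function_words,
                           min_post_chars=1_000, min_function_words=50,
                           min_author_chars=10_000, max_author_chars=49_410,
                           min_author_posts=16, max_author_posts=40):
    """Keep posts of authors that satisfy the length and count criteria.

    `posts` is assumed to have an 'id' (author) column and a 'text' column.
    """
    vocab = {word.lower() for word in function_words}

    def count_function_words(text):
        # Assumption: every token occurrence found in the list counts once.
        return sum(tok in vocab for tok in re.findall(r"[a-z']+", text.lower()))

    # Post-level filters: minimum length and minimum function-word count.
    lengths = posts["text"].str.len()
    n_function_words = posts["text"].map(count_function_words)
    kept = posts[(lengths >= min_post_chars)
                 & (n_function_words >= min_function_words)]

    # Author-level filters: total characters and number of remaining posts.
    per_author = kept["text"].str.len().groupby(kept["id"]).agg(["sum", "count"])
    eligible = per_author.index[
        per_author["sum"].between(min_author_chars, max_author_chars)
        & per_author["count"].between(min_author_posts, max_author_posts)
    ]
    return kept[kept["id"].isin(eligible)]
```

The thresholds are keyword arguments so the same function can be tried on a small sample with looser limits before running it on the full parent corpus.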
Blog-1K has three columns: 'id', 'text', and 'split', where 'id' is the author identifier carried over from the parent corpus.
The creation procedure and corpus statistics are documented in the accompanying Jupyter Notebook.
| Split | # Authors | # Posts | # Characters | Avg. Characters Per Author (Std.) | Avg. Characters Per Post (Std.) |
|---|---|---|---|---|---|
| Train | 1,000 | 16,132 | 30,092,057 | 30,092 (5,884) | 1,865 (1,007) |
| Validation | 935 | 2,017 | 3,755,362 | 4,016 (2,269) | 1,862 (999) |
| Test | 924 | 2,017 | 3,732,448 | 4,039 (2,188) | 1,850 (936) |
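The figures in the table can in principle be recomputed from the three columns of the released file. The sketch below is an assumption about how that might be done, not the notebook's code: the function name `split_statistics` is hypothetical, characters are counted with Python string length, and the standard deviation is pandas' default sample estimate (ddof=1), so small differences from the published values are possible.

```python
import pandas as pd


def split_statistics(df):
    """Per-split counts for a frame with 'id', 'text' and 'split' columns."""
    rows = []
    for name, part in df.groupby("split", sort=False):
        post_chars = part["text"].str.len()
        # Total characters written by each author within this split.
        author_chars = post_chars.groupby(part["id"]).sum()
        rows.append({
            "split": name,
            "authors": part["id"].nunique(),
            "posts": len(part),
            "characters": int(post_chars.sum()),
            "avg_chars_per_author": author_chars.mean(),
            "std_chars_per_author": author_chars.std(),
            "avg_chars_per_post": post_chars.mean(),
            "std_chars_per_post": post_chars.std(),
        })
    return pd.DataFrame(rows).set_index("split")
```

Calling `split_statistics(df)` on the loaded corpus returns one row per value found in the 'split' column.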
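import pandas as pd
# Load the corpus; gzip compression is inferred from the file extension.
df = pd.read_csv('blog1000.csv.gz', compression='infer')
# Training texts and their author labels ('id'), as two parallel tuples.
train_text, train_label = zip(*df.loc[df['split'] == 'train', ['text', 'id']].itertuples(index=False))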
License: All materials are licensed under the ISC License.
Contact: Please contact the maintainer with any questions.