This dataset gathers 728,321 biographies from English Wikipedia. It aims at evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized).
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We present the Pantheon 1.0 dataset: a manually verified dataset of individuals that have transcended linguistic, temporal, and geographic boundaries. The Pantheon 1.0 dataset includes the 11,341 biographies present in more than 25 languages in Wikipedia and is enriched with: (i) manually verified demographic information (place and date of birth, gender), (ii) a taxonomy of occupations classifying each biography at three levels of aggregation, and (iii) two measures of global popularity: the number of languages in which a biography is present in Wikipedia (L), and the Historical Popularity Index (HPI), a metric that combines information on L, time since birth, and page views (2008-2013). We compare the Pantheon 1.0 dataset to data from the 2003 book Human Accomplishment, and also to external measures of accomplishment in individual games and sports: tennis, swimming, car racing, and chess. In all of these cases we find that measures of popularity (L and HPI) correlate highly with individual accomplishment, suggesting that measures of global popularity proxy the historical impact of individuals.
Full biographies of the members of the Biometrics and Forensics Ethics Group.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the biographies of Sverdlovsk Oblast officials (2004-2005; 2019-2020).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: In this article, we begin from the theoretical implications glimpsed by J. M. Schaeffer, with the purpose of creating a new literary genre based on the reader's task of comparing diverse literary works that may not belong to the same tradition. The object of our interest is the existence of tales and narratives of biographies invented by an author interested in real (or even invented) historical characters. To explore the limits of the genre, it is necessary to know in depth the field of biography and its relations with both literary writing and historiographical discourse.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 1,000 biographies of literary writers retrieved from the English version of Wikipedia. It comprises 500 biographies of women writers extracted from the category “19th-century_women_writers” (https://en.wikipedia.org/wiki/Category:19th-century_women_writers) and 500 biographies of male writers extracted from the category “19th-century_male_writers” (https://en.wikipedia.org/wiki/Category:19th-century_male_writers).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Bio-MCP-Data
A repository containing biological datasets to be used by BIO-MCP, following the MCP (Model Context Protocol) standard.
About
This repository hosts biological data assets formatted to be compatible with the Model Context Protocol, enabling AI models to efficiently access and process biological information. The data is managed using Git Large File Storage (LFS) to handle large biological datasets.
Purpose
Provide standardized biological datasets for AI… See the full description on the dataset page: https://huggingface.co/datasets/longevity-genie/bio-mcp-data.
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Preview Dataset for editorial evaluation and review.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Big Data V3 No Bio is a dataset for object detection tasks - it contains Trash annotations for 8,825 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains biographical data produced in the course of the digital humanities project “Mapping historical networks: Building the new Austrian Prosopographical/Biographical Information System (APIS)” at the Austrian Academy of Sciences. It was funded by the Austrian National Fonds for Research, Technology and Development. The biographies were manually annotated by the author via a web application (apis.acdh.oeaw.ac.at) which was developed at the Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH).
The starting point of the dataset (cl Kuenstlerhaus) was a set of 506 annotated artists’ biographies from the Austrian Biographical Encyclopaedia 1815–1950 (ÖBL). For these persons, membership in the Association of Fine Artists Vienna (Genossenschaft der bildenden Künstler Wiens) was confirmed by comparing the yearly published membership lists with the lemmas of the ÖBL. The data were collected primarily to enable a) statistics, b) historical network analyses, and c) cartographic analyses.
The data is provided as graphml files:
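GraphML is an XML format, so files of this kind can be inspected with a standard XML parser before moving to a dedicated network library. The following is only a minimal sketch: the inline sample document, the node identifiers, and the variable names are assumptions made for illustration and do not reflect the actual contents of the APIS files.

```python
import xml.etree.ElementTree as ET

# Minimal GraphML document standing in for one of the dataset's files.
# Node ids and structure are hypothetical, not taken from the APIS data.
sample_graphml = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="person_1"/>
    <node id="person_2"/>
    <node id="institution_1"/>
    <edge source="person_1" target="institution_1"/>
    <edge source="person_2" target="institution_1"/>
  </graph>
</graphml>"""

# GraphML elements live in this XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

root = ET.fromstring(sample_graphml)
node_ids = [n.get("id") for n in root.findall("./g:graph/g:node", NS)]
edges = [(e.get("source"), e.get("target"))
         for e in root.findall("./g:graph/g:edge", NS)]

# Degree of each node: how many edges touch it.
degree = {}
for source, target in edges:
    degree[source] = degree.get(source, 0) + 1
    degree[target] = degree.get(target, 0) + 1

print(len(node_ids), "nodes,", len(edges), "edges")
print(degree)
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring(...)`; the path itself depends on the downloaded file names.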
The dataset was last reviewed in January 2020.
Abstract copyright UK Data Service and data collection copyright owner.
This dataset contains the digitized treatments in Plazi based on the original journal article Reich, Mike (2015): A short biography of Hubert Ludwig and a note on the publication dates of his monograph Die Seewalzen (1889 – 1892). Zootaxa 4052 (2): 332-344, DOI: http://dx.doi.org/10.11646/zootaxa.4052.3.3
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data collected from rural communities in Zimbabwe to evaluate preventive health behavior based on the Health Belief Model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Gensim LDA topic model of English Wikipedia biographical articles, with a list of all 1.8M articles and some associated Wikidata information.
The model has 150 Topics.
This model was developed in the process of isolating a set of visual arts biographical articles, as described in "Clowns in the Visual Artists: Topic Modeling Wikipedia and Wikidata" in the Spring 2022 issue of Art Documentation - https://doi.org/10.1086/719999
Because names, nationalities, and birthdays are so prominent in biographies, the stopwords list removed 170,000 names, surnames, city names, place names, countries, days, months and other time related words (https://github.com/mandiberg/Names-Surnames-and-Countries-for-Stopwords). We also directly removed each article subject’s given and surname, which were almost always the most frequently occurring words in any given article. Otherwise, the model just produced topics based on nationality, and common names and surnames.
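The filtering step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the function name `filter_tokens`, the tiny stopword set, the "Given Surname" name format, and the sample tokens are all hypothetical; the actual stopword list is in the linked repository.

```python
def filter_tokens(tokens, stopwords, subject_name):
    """Drop stopwords and the article subject's own name parts from a token list."""
    # Name parts of the article's subject (assumed format: "Given Surname").
    name_parts = {part.lower() for part in subject_name.split()}
    blocked = {word.lower() for word in stopwords} | name_parts
    return [tok for tok in tokens if tok.lower() not in blocked]


# Illustrative stand-in for the much larger list of names, places, and dates.
sample_stopwords = {"vienna", "march"}
tokens = ["Jane", "Doe", "painter", "Vienna", "exhibited", "March"]

filtered = filter_tokens(tokens, sample_stopwords, "Jane Doe")
print(filtered)
```

Removing the subject's name per article, rather than relying only on the global list, covers rare names that a fixed stopword list would miss.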
Files:
all_enwiki_bios_from_wikidata.csv
The list of all Wikidata items for humans with an enwiki page (i.e., a biographical article) was extracted from a Wikidata JSON dump; the list includes gender, occupation, and nationality. This was joined with the converted plaintext from an English Wikipedia dump. This data was downloaded in March 2021.
Wikipedia Biographies LDA Topic Model human readable summary.csv
A human readable file with the 150 topics ranked by count of articles per topic from the 1.8M corpus. The most popular topics have categorical descriptions of the occupations of each cluster. Some are marked as not an occupation cluster.
BoW_corpus.mm*
model_lda_full_Sep2_150Tv2*
These six files comprise the topic model. The code to load them is present in the python files.
dict_full_Aug-28-2021
processed_docs_full_Aug-28-2021.txt
processed_docs_1000_Aug-18-2021.txt
These are the dictionary and processed corpuses required to build and implement the model using this code. The corpus with the first 1000 items is meant to be used for testing, as the full one is quite large and takes a long time to complete.
topic-model-wikipedia-sept2021.zip
The code and settings used for creating and implementing this model are included in this zip and are also available here: https://github.com/mandiberg/topic-model-wikipedia
All-Wikipedia-Biographies-with-topic1.csv
All-Wikipedia-Biographies-with-topic1and2.csv
These are the lists of 1.8M biographies matched to topics. The "topic1" file includes only the first topic and is the slightly larger list. The "topic1and2" file is slightly smaller because about 2% of articles do not match a second topic.
Analysis-for-Clowns-Visual-Arts.zip
These are the raw data and final data produced for the "Clowns in the Visual Artists." Please see the article for context.
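As a rough sketch of how the "topic1" and "topic1and2" files described above might be compared, the snippet below counts articles per first topic and finds articles that lack a second topic. The column names (`title`, `topic1`, `topic2`) and the in-memory sample rows are assumptions for illustration; the real files' headers should be checked before adapting this.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-ins for the two CSV files; column names are assumed.
topic1_csv = io.StringIO(
    "title,topic1\n"
    "Article A,12\n"
    "Article B,12\n"
    "Article C,87\n"
)
topic1and2_csv = io.StringIO(
    "title,topic1,topic2\n"
    "Article A,12,40\n"
    "Article C,87,3\n"
)

topic1_rows = list(csv.DictReader(topic1_csv))
topic1and2_rows = list(csv.DictReader(topic1and2_csv))

# Count articles per first topic, analogous to the ranking in the summary file.
articles_per_topic = Counter(row["topic1"] for row in topic1_rows)

# Articles in the topic1 file but absent from the topic1and2 file
# are those that did not match a second topic.
titles_with_second = {row["title"] for row in topic1and2_rows}
no_second_topic = [row["title"] for row in topic1_rows
                   if row["title"] not in titles_with_second]

print(articles_per_topic.most_common())
print(no_second_topic)
```

With the real files, `open(path, newline="", encoding="utf-8")` would replace the `io.StringIO` objects.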
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Names, positions, state, party, and congress number of members of US Congress 1774-present.
Scraped from http://bioguide.congress.gov/biosearch/biosearch.asp by https://scraperwiki.com/scrapers/biographical_directory_usc/#
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This merge file can be used to combine detailed biographical data on U.S. District Court Judges from the Federal Judicial Center (http://www.fjc.gov/history/home.nsf/page/export.html) with data on cases from the U.S. Court of Appeals Database Project (http://www.wmich.edu/nsf-coa/). The file includes the unique identifiers used by each group, making it easy for researchers to combine the two data sources. Note that this is a merge file for U.S. District Court Judges only.
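A merge file of this kind acts as a crosswalk between two identifier schemes. The sketch below shows the general idea with the standard `csv` module; every column name (`fjc_id`, `coa_id`, `case_id`, `name`) and every sample row is hypothetical, since the actual layouts of the three files are not described here.

```python
import csv
import io

# Hypothetical stand-ins: the real files' column names and values may differ.
merge_csv = io.StringIO("fjc_id,coa_id\n1001,J17\n1002,J42\n")
fjc_csv = io.StringIO("fjc_id,name\n1001,Judge A\n1002,Judge B\n")
cases_csv = io.StringIO("case_id,coa_id\nC1,J42\nC2,J17\nC3,J42\nC4,J99\n")

# Crosswalk: appeals-database judge code -> Federal Judicial Center identifier.
coa_to_fjc = {row["coa_id"]: row["fjc_id"] for row in csv.DictReader(merge_csv)}

# Biographical records keyed by the FJC identifier.
judges = {row["fjc_id"]: row for row in csv.DictReader(fjc_csv)}

combined = []
for case in csv.DictReader(cases_csv):
    fjc_id = coa_to_fjc.get(case["coa_id"])
    judge = judges.get(fjc_id)
    if judge is None:
        continue  # judge code not covered by the merge file
    combined.append({"case_id": case["case_id"], "judge_name": judge["name"]})

print(combined)
```

Cases whose judge code has no entry in the crosswalk are skipped here; depending on the analysis, it may be preferable to keep them with empty biographical fields.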
The purpose of this dataset was to study gender bias in occupations. Online biographies, written in English, were collected to find the names, pronouns, and occupations. The twenty-eight most frequent occupations were identified based on their appearances. The resulting dataset consists of 397,340 biographies spanning twenty-eight different occupations. Of these occupations, professor is the most frequent, with 118,400 biographies, while rapper is the least frequent, with 1,406 biographies. Important information about the biographies: 1. The longest biography is 194 tokens, while the shortest is eighteen; the median biography length is seventy-two tokens. 2. It should be noted that the demographics of online biographies’ subjects differ from those of the overall workforce and that this dataset does not contain all biographies on the Internet.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set encompasses a collection of detailed biographical data about 850 of the first members of the German National Academy of Sciences Leopoldina (Deutsche Akademie der Naturforscher Leopoldina – Nationale Akademie der Wissenschaften) from 1652 to 1818. The data includes information about the members themselves, their family, their membership in the Leopoldina, academic and professional positions held, as well as works, portraits, and associated sources.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Yellow Nineties 2.0 uses digital tools to advance knowledge of eight late-Victorian little magazines and the people who contributed to their production between 1889 and 1905: Pagan Review (1 volume, 1892); Yellow Book (13 volumes, 1894–1897); The Dial (5 volumes, 1889–1897); The Evergreen: A Northern Seasonal (4 volumes, 1895–1897); The Green Sheaf (13 issues, 1903–1904); The Pageant (2 volumes, 1896–1897); The Savoy (2 quarterly and 6 monthly issues, 1896); and The Venture: An Annual of Art and Literature (2 volumes, 1903 and 1905). The data document the communities of production responsible for these little magazines, particularly by recovering the social networks of and biographical information about women and marginalized persons in those communities. The dataset enables users to query, visualize, and analyze the relationships, connections, and social networks of magazine contributors. The Yellow Nineties project site (https://1890s.ca) includes two biographical tools, one discursive and the other data-driven. Essays on the life and work of a select group of magazine contributors are available in Y90s Biographies. Biographical data for all magazine contributors are available in the Y90s Personography (https://personography.1890s.ca). The data has been transformed into Linked Open Data via the LINCS conversion toolkit of the Linked Infrastructure for Networked Cultural Scholarship (LINCS) project. The data is assembled as a single text file in text/turtle (.ttl) and contains descriptive metadata that has been reconciled into triples using established linked data vocabularies. Yellow Nineties 2.0 has been supported by funding from SSHRC.