Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.
The beta sample dataset is a subset of the Structured Contents Snapshot, focusing on people articles with infoboxes on English Wikipedia; it is delivered as JSON files compressed in tar.gz archives.
We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.
Noteworthy Included Fields:
- name - title of the article.
- identifier - ID of the article.
- image - main image representing the article's subject.
- description - one-sentence description of the article for quick reference.
- abstract - lead section, summarizing what the article is about.
- infoboxes - parsed information from the side panel (infobox) of the Wikipedia article.
- sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables, references, and similar non-prose sections.
The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
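As a rough, hedged sketch of how these fields might be read from the tar.gz JSON files mentioned above, the snippet below streams a downloaded archive with the Python standard library and prints a few of the listed fields. The archive file name and the one-JSON-object-per-line layout inside each member are assumptions, not something confirmed by this description; adjust them to match the actual snapshot files.

```python
import json
import tarfile

# Placeholder archive name; point this at the snapshot file you downloaded.
ARCHIVE = "enwiki_people_infoboxes.tar.gz"

def iter_articles(archive_path):
    """Yield article objects from a tar.gz whose members are assumed to be JSON-lines files."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            handle = tar.extractfile(member)
            if handle is None:
                continue
            for raw in handle:  # one JSON object per line (assumed layout)
                line = raw.strip()
                if line:
                    yield json.loads(line)

# Print a few of the fields listed above for the first record, then stop.
for article in iter_articles(ARCHIVE):
    print(article.get("identifier"), article.get("name"), article.get("description"))
    break
```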
File sizes:
- Infoboxes only: 2 GB compressed, 11 GB uncompressed
- Infoboxes + sections + short description: 4.12 GB compressed, 21.28 GB uncompressed
Article analysis and filtering breakdown:
- Total # of articles analyzed: 6,940,949
- # of people found with QID: 1,778,226
- # of people found with Category: 158,996
- # of people found with Biography Project: 76,150
- Total # of people articles found: 2,013,372
- Total # of people articles with infoboxes: 1,559,985

Final statistics:
- Total number of people articles in this dataset: 1,559,985
- ... that have a short description: 1,416,701
- ... that have an infobox: 1,559,985
- ... that have article sections: 1,559,921
This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.
This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024, so the information it contains may be out of date. It is not being actively updated or maintained, and has been shared for community use and feedback. If you would like to retrieve up-to-date Wikipedia articles or data from other Wikimedia projects, get started with Wikimedia Enterprise's APIs.
The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).
Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical content of the English Wikipedia (https://en.wikipedia.org/), written by the community.
Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/
By wiki_bio (from Hugging Face) [source]
The dataset contains several key columns: input_text and target_text. The input_text column includes the infobox and first paragraph of a Wikipedia biography, providing essential information about the individual's background, accomplishments, and notable features. The target_text column consists of the complete biography text extracted from the corresponding Wikipedia page.
In order to facilitate model training and validation, the dataset is divided into three main files: train.csv, val.csv, and test.csv. The train.csv file contains pairs of input text and target text for model training. It serves as a fundamental resource to develop accurate language generation models by providing abundant examples for learning to generate coherent biographical texts.
The val.csv file provides further validation data consisting of additional Wikipedia biographies with their corresponding infoboxes and first paragraphs. This subset allows researchers to evaluate their trained models' performance on unseen examples during development or fine-tuning stages.
Finally, the test.csv file offers a separate set of input texts paired with corresponding target texts for generating complete biographies using pre-trained models or newly developed algorithms. The purpose of this file is to benchmark system performance on unseen data in order to assess generalization capabilities.
This extended description aims to give an informative overview of the dataset's structure and its intended use cases in natural language processing research tasks such as text generation and summarization. Researchers can leverage this collection to advance applications such as automatic biography writing systems or content generation tasks that require coherent textual output based on partial information extracted from an infobox or the opening paragraph of an online encyclopedia such as Wikipedia.
Overview:
- This dataset consists of biographical information from Wikipedia pages, specifically the infobox and the first paragraph of each biography.
- The dataset is provided in three separate files: train.csv, val.csv, and test.csv.
- Each file contains pairs of input text and target text.
File Descriptions:
- train.csv: This file is used for training purposes. It includes pairs of input text (infobox and first paragraph) and target text (complete biography).
- val.csv: This file is used for validation. It contains a collection of biographies with their infobox and first-paragraph texts.
- test.csv: This file can be used to generate complete biographies based on the given input texts.
Column Information:
a) For train.csv:
- input_text: Input text column containing the infobox and first paragraph of a Wikipedia biography.
- target_text: Target text column containing the complete biography text for each entry.

b) For val.csv:
- input_text: Infobox and first paragraph texts are included in this column.
- target_text: Complete biography texts are present in this column.

c) For test.csv: The columns follow the same pattern, i.e., input_text followed by target_text.

Usage Guidelines:
Training Model or Algorithm Development: If you are working on training a model or developing an algorithm for generating complete biographies from given inputs, it is recommended to use train.csv as your primary dataset.
Model Validation or Evaluation: To validate or evaluate your trained model, you can use val.csv as an independent dataset. This dataset contains biographies that have been withheld from the training data.
Generating Biographies with Trained Models: To generate complete biographies using your trained model, you can make use of test.csv. This dataset provides input texts for which you need to generate the corresponding target texts.
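As a small, hedged sketch of the workflow described above, the snippet below loads the three CSV files with pandas and inspects one input_text/target_text pair from the training split. The file paths are placeholders; only the file and column names come from the description above.

```python
import pandas as pd

# Placeholder paths; point these at the dataset's actual CSV files.
splits = {name: pd.read_csv(f"{name}.csv") for name in ("train", "val", "test")}

for name, frame in splits.items():
    print(f"{name}: {len(frame)} rows, columns: {list(frame.columns)}")

# Each training row pairs the infobox + first paragraph (input_text)
# with the full biography (target_text).
example = splits["train"].iloc[0]
print(example["input_text"][:200])
print(example["target_text"][:200])
```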
Additional Information and Tips:
The input text in this dataset includes both an infobox (a structured section containing key-value pairs) and the first paragraph of a Wikipedia biography.
The target text is the complete biography for each entry.
While working with this dataset, make sure to preprocess and ...
- Text Generation: The dataset can be used to train language models to generate complete Wikipedia biographies given only the infobox and first paragraph ...
Custom license terms: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.3/customlicense?persistentId=doi:10.7910/DVN/JAJ3CP
This is a cleaned and merged version of the OECD's Programme for the International Assessment of Adult Competencies (PIAAC). The data contain individual person-measures of several basic skills, including literacy, numeracy and critical thinking, along with extensive biographical details about each subject. PIAAC is essentially a standardized test taken by a representative sample of all OECD countries (approximately 200K individuals in total). We have found this data useful in studies of predictive algorithms and human capital, in part because of its high quality, size, rich set of biographical features per subject, and representativeness of the population at large.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised and sentence segmented, marked with named entities and the words lemmatised. It has also been automatically annotated with PoS tags (MULTEXT-East morphosyntactic descriptions) and Universal Dependencies PoS tags, morphological features and dependency parses.
Crucially for the envisaged use of the corpus, the abbreviations in the corpus (of which there are 2,041) have been manually expanded so that the expanded abbreviations are also in the correct inflected form, given their context.
The corpus is available in the canonical TEI encoding, and derived plain text and CoNLL-U files. The plain-text file has abbreviations and their expansions marked up with [...]. There are two CoNLL-U files, one with the text stream with abbreviations, and one with the text stream with expansions. Note that only the one with expansions has syntactic parses. Both CoNLL-U files have the expansions / abbreviations and named entities marked up in IOB format in the last column.
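As a hedged sketch of how the CoNLL-U files might be inspected, the snippet below reads one file with plain Python and prints each token form together with the final (MISC) column, which is where the description above says the abbreviation/expansion and named-entity IOB marks live. The file name is a placeholder and the exact annotation labels are not assumed; consult the corpus documentation for the real label inventory.

```python
def iter_tokens(path):
    """Yield the 10 tab-separated columns of each token line in a CoNLL-U file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip sentence-separating blank lines and '#' comment/metadata lines.
            if not line or line.startswith("#"):
                continue
            columns = line.split("\t")
            if len(columns) == 10:
                yield columns

# Placeholder file name; use the corpus's CoNLL-U file with expansions.
for columns in iter_tokens("sbl51.expanded.conllu"):
    form, misc = columns[1], columns[9]  # FORM is the 2nd column, MISC the last
    if misc != "_":                      # only tokens carrying extra annotations
        print(form, misc)
```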
"This FDZ-Methodenreport (including Stata code examples) outlines an approach to construct cross-sectional data at freely selectable reference dates using the Sample of Integrated Labour Market Biographies (version 1975-2023). In addition, the generation of biographical variables is described." (Author's abstract, IAB-Doku) This data report describes the Sample of Integrated Labour Market Biographies (SIAB) 1975-2023.
These data contain biodata, net catch, and detection records for juvenile sea lamprey sampled in natural streams in Michigan and Quebec, Canada, and those stocked into an artificial stream at the USGS Hammond Bay Biological Station for monitoring diel activity. During October 31 through November 9, 2011, scientists collected downstream migrating juvenile sea lamprey from the Little Carp River, Michigan, Lake Superior (46°50'6.34"N 88°28'57.58"W). Collections were permitted by the State of Michigan under the Michigan Department of Natural Resources Scientific Collectors Permit issued to the U.S. Geological Survey, Hammond Bay Biological Station on December 12, 2007; amended February 23, 2011. Between November 3, 2014 and December 15, 2014, scientists monitored lamprey downstream passage using passive integrated transponder (PIT) telemetry in Morpion Stream, Quebec, Canada, Lake Champlain (45°10'23.68"N 73° 2'16.98"W). During October 2014 through February 2015, scientists monitored movement activity of juvenile sea lamprey using PIT telemetry in an artificial stream located at the Hammond Bay Biological Station, Millersburg, Michigan. Collections from Morpion Stream, Quebec during 2014 were conducted under Quebec Ministere des Resources Naturelles permit for scientific purposes (permis a des fins scientifiques) 2012-10-01-1436-16-SP issued to Ellen Marsden and a scientific collection permit from the Vermont Department of Fish and Wildlife. Fish care and protocols for fish holding, surgery, and tagging were conducted under University of Vermont Institutional Animal Care and Use Committee permit 13-017.
To collect psychometric and biographical data which may enhance counselling and selection of students. A similar study of high school pupils is held as SN: 996.
A collection of diverse human biospecimens and their associated clinical and molecular data, available for research purposes through the Specie Bio BioExchange platform. This dataset is contributed by a network of biobanks, academic medical centers, and other research institutions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Biodata items and domains to which they belong.
For any bz2 file, using a parallel bzip2 decompressor (https://github.com/mxmlnkn/indexed_bzip2) is recommended for speed.
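For illustration, here is a hedged baseline using only the single-threaded bz2 module from the Python standard library; the indexed_bzip2 package linked above offers a file-like interface with parallel decompression and can be substituted for large files (check its README for the current API). File names are placeholders.

```python
import bz2
import shutil

# Single-threaded baseline: stream-decompress a .bz2 file in 1 MiB chunks.
with bz2.open("train.csv.bz2", "rb") as compressed, open("train.csv", "wb") as out:
    shutil.copyfileobj(compressed, out, length=1024 * 1024)

# For multi-gigabyte files, swapping in a parallel decompressor such as
# indexed_bzip2 (or the lbzip2 command-line tool) follows the same read pattern.
```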
In summary:
See forum discussion for details of [1],[2]: https://www.kaggle.com/competitions/leash-BELKA/discussion/492846
This is somewhat obsolete as the competition progresses: ecfp6 gives better results and can be extracted quickly with scikit-fingerprints.
See forum discussion for details of [3]: https://www.kaggle.com/competitions/leash-BELKA/discussion/498858 https://www.kaggle.com/code/hengck23/lb6-02-graph-nn-example
See forum discussion for details of [4]: https://www.kaggle.com/competitions/leash-BELKA/discussion/505985 https://www.kaggle.com/code/hengck23/conforge-open-source-conformer-generator
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Point-biserial associations between biodata items and performance.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains answers to a questionnaire on modes of sample and data accessibility in research biobanks.
https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de502761
Abstract (en): This dataset was produced by Darrett B. and Anita H. Rutman while researching their book A Place in Time: Middlesex County Virginia, 1650-1750 and the companion volume, A Place in Time: Explicatus (both New York: Norton, 1984). Together, these works were intended as an ethnography of the English settlers of colonial Middlesex County, which lies on the Chesapeake Bay. The Rutmans created this dataset by consulting documentary records from Middlesex and Lancaster Counties (Middlesex was split from Lancaster in the late 1660s) and material artifacts, including gravestones and house lots. The documentary records include information about birth, marriage, death, migration, land patents and conveyances, probate, church matters, and government matters. The Rutmans organized this material by person involved in the recorded events, producing over 12,000 individual biographical sheets. The biographical sheets contain as much information as could be found for each individual, including dates of birth, marriage, and death; children's names and dates of birth and death; names of parents and spouses; appearance in wills, transaction receipts, and court proceedings; occupation and employers; and public service. This process is described in detail in Chapter 1 of A Place in Time: Middlesex County Virginia, 1650-1750. The Rutmans' biographical sheets have been archived at the Virginia Historical Society in Richmond, Virginia. To produce this dataset, most of the sheets were photographed (those with minimal information -- usually only a name and one date -- were omitted). Information from the sheets was then hand-keyed and organized into two data tables: one containing information about the individuals who were the main subjects of each sheet, and one containing information about children listed on those sheets. Because individuals appear several times, data for the same person frequently appears in both tables and in more than one row in each table. For example, a woman who lived all her life in Middlesex and married once would have two rows in the children's table -- one for her appearance on her mother's sheet and one for her appearance on her father's sheet -- and two rows in the individual table -- one for the sheet with her maiden name and one for the sheet with her married name. After entry, records were linked in order to associate all appearances of the same individual and to associate individuals with spouses, parents, children, siblings, and other relatives. Sheets with minimal information were not included in the dataset. The data include information on 6,586 unique individuals; there are 4,893 observations in the individual file and 7,552 in the kids file.
The data are not weighted. ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: checked for undocumented or out-of-range codes. Universe: English settlers of colonial Middlesex County, Virginia. Smallest geographic unit: county. The original data collection was not sampled. However, in computerizing this resource, biographical shee...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This file contains the empty data collection template and variable and value labels to code biographical data on legislators for WP4 in the ActEU project. It is an abbreviated version of the codebooks produced by the Pathways to Power project and by the InclusiveParl project.
Sampling data captured in Oceanic Exploration Research. This report is submitted in accordance with Section 10 of the Standard Clauses of the Contract contained in Appendix II to the Contract for exploration for cobalt-rich ferromanganese crusts, as concluded between the International Seabed Authority and the Ministry of Natural Resources and Environment of the Russian Federation on March 10, 2015. The report contains information about the results of activities concerning the study of seabed cobalt-rich ferromanganese crusts (CRC) carried out during the first year of the first five-year period within the exploration area, according to the Plan of the Exploration Activities approved by the Council in July 2014 (Appendix 1 to the Contract for exploration of CRC). The main activities were carried out in the following directions:
- Exploration activities;
- Environmental baseline studies.
Bio World Photo Sample Export Import Data. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
The purpose of the project was to make accessible for historical analysis the biographical information contained in Emden's Biographical Registers of the Universities of Oxford to 1540 and Cambridge to 1500. It was not intended to eliminate the need to consult the printed volumes, but rather to facilitate access to the different categories of material contained in them. For example, one could extract the names of those meeting certain predetermined criteria, such as members of Merton College between 1320 and 1339 (dates were encoded as belonging to 20-year 'generations') who were authors. For fuller details the printed volumes would have to be consulted.
https://spdx.org/licenses/CC0-1.0.html
The Arctic bio-optical database assembles a diverse suite of biological and optical data from 34 expeditions throughout the Arctic Ocean. Data were combined into a single Arctic Ocean (AO) database following the OBPG criteria (Pegau et al. 2003), as was done in the development of the global NASA Bio-optical Marine Algorithm Data Set (NOMAD) (Werdell 2005, Werdell & Bailey 2005). This Arctic database combines coincident in situ observations of inherent optical properties (IOPs), apparent optical properties (AOPs), Chl a, environmental data (e.g. temperature, salinity) and station metadata (e.g. sampling depth, latitude, longitude, date). Data were acquired from the NASA SeaWiFS Bio-optical Archive and Storage System (SeaBASS, https://seabass.gsfc.nasa.gov/), the LEFE CYBER database (http://www.obs-vlfr.fr/proof/index2.php), the Data and Sample Research System for Whole Cruise Information in JAMSTEC (DARWIN, http://www.godac.jamstec.go.jp), NOMAD, and individual contributors. To ensure consistency, data were limited to those that were collected using OBPG-defined protocols (Pegau et al. 2003). Only observations shallower than 30 m were included. For spectral parameters, we included data at the following wavelengths that are used by satellites and thus are relevant for ocean color algorithm evaluation: 412, 443, 469, 488, 490, 510, 531, 547, 555, 645, 667, 670 and 678 nm. In situ measurements were binned at the same station if measurements were within 8 hours and 1° of distance (Werdell & Bailey 2005). For regional analyses, each station was assigned to one of ten sub-regions and three functional shelf-types (Carmack et al. 2006).
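As a rough illustration of the screening rules just described (observations shallower than 30 m and the listed satellite wavelengths), here is a hedged pandas sketch. The file name and the depth/wavelength column names are assumptions, not the database's actual schema.

```python
import pandas as pd

# Wavelengths (nm) named above as relevant for ocean-color algorithm evaluation.
WAVELENGTHS = {412, 443, 469, 488, 490, 510, 531, 547, 555, 645, 667, 670, 678}

# Placeholder file and column names; the real database schema may differ.
observations = pd.read_csv("arctic_bio_optical.csv")
screened = observations[
    (observations["depth_m"] < 30) & (observations["wavelength_nm"].isin(WAVELENGTHS))
]
print(f"{len(screened)} of {len(observations)} rows kept after depth and wavelength screening")
```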
Methods This bio-optical database was assembled using in situ measurements from cruises throughout the Arctic Ocean based on the methods in Werdell 2005 and Werdell & Bailey 2005. Please see [my eventual paper citation/DOI] for full details on methods and data source.
Werdell, P. 2005. An evaluation of inherent optical property data for inclusion in the NASA Bio‐optical Marine Algorithm Data Set, NASA Ocean Biology Processing Group paper, NASA Goddard Space Flight Cent., Greenbelt, Md.
Werdell, P.J. and S.W. Bailey. 2005. An improved bio-optical data set for ocean color algorithm development and satellite data product validation. Remote Sens. Environ. 98: 122-140.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Regression analysis of job performance using rational biodata.
The Quarterly Labour Force Survey (QLFS) is a household-based sample survey conducted by Statistics South Africa (Stats SA). It collects data on the labour market activities of individuals aged 15 years or older who live in South Africa.
National coverage
Individuals
The QLFS sample covers the non-institutional population of South Africa with one exception. The only institutional subpopulation included in the QLFS sample are individuals in worker's hostels. Persons living in private dwelling units within institutions are also enumerated. For example, within a school compound, one would enumerate the schoolmaster's house and teachers' accommodation because these are private dwellings. Students living in a dormitory on the school compound would, however, be excluded.
Sample survey data [ssd]
The QLFS uses a master sampling frame that is used by several household surveys conducted by Statistics South Africa. This wave of the QLFS is based on the 2013 master frame, which was created from the 2011 census. There are 3,324 PSUs in the master frame and roughly 33,000 dwelling units.
The sample for the QLFS is based on a stratified two-stage design with probability proportional to size (PPS) sampling of PSUs in the first stage, and sampling of dwelling units (DUs) with systematic sampling in the second stage.
For each quarter of the QLFS, a quarter of the sampled dwellings are rotated out of the sample. These dwellings are replaced by new dwellings from the same PSU or the next PSU on the list. For more information see the statistical release.
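For intuition only, here is a hedged toy sketch of the two stages described above: a PPS-style draw of PSUs followed by systematic sampling of dwelling units within each selected PSU. All numbers are made up, and this is not Stats SA's actual procedure; real PPS-without-replacement schemes are more involved than the simple draw used here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frame: dwelling-unit counts per PSU (made-up sizes for illustration).
psu_sizes = rng.integers(50, 500, size=3324)

# Stage 1: draw PSUs with probability proportional to size (simplified PPS draw).
selection_probs = psu_sizes / psu_sizes.sum()
selected_psus = rng.choice(len(psu_sizes), size=100, replace=False, p=selection_probs)

# Stage 2: systematic sample of dwelling units within each selected PSU.
def systematic_sample(n_units, n_draws, rng):
    step = n_units / n_draws
    start = rng.uniform(0, step)
    return (start + step * np.arange(n_draws)).astype(int)

sample = {int(psu): systematic_sample(int(psu_sizes[psu]), 10, rng) for psu in selected_psus}
print(sum(len(units) for units in sample.values()), "dwelling units sampled")
```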
Computer Assisted Telephone Interview [cati]
The survey questionnaire consists of the following sections:
- Biographical information (marital status, education, etc.)
- Economic activities in the last week for persons aged 15 years and older
- Unemployment and economic inactivity for persons aged 15 years and above
- Main work activity in the last week for persons aged 15 years and above
- Earnings in the main job for employees, employers and own-account workers aged 15 years and above
From 2010, the income data collected by South Africa's Quarterly Labour Force Survey is no longer provided in the QLFS dataset (except for a brief return in QLFS 2010 Q3, which may be an error). Possibly because the data are unreliable at the level of the quarter, Statistics South Africa now provides the income data from the QLFS in an annualised dataset called Labour Market Dynamics in South Africa (LMDSA). The datasets for LMDSA are available from DataFirst's website.