100+ datasets found
  1. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
    Explore at:
    zip (4,293,465,577 bytes)
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

    The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes on English Wikipedia, output as JSON files (compressed in a tar.gz archive).

    We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

    Data Structure

    • File name: wme_people_infobox.tar.gz
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Noteworthy included fields:

    • name - title of the article.
    • identifier - ID of the article.
    • image - main image representing the article's subject.
    • description - one-sentence description of the article for quick reference.
    • abstract - lead section, summarizing what the article is about.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables, and references or similar non-prose sections.

    The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.

    Stats

    Infoboxes only:
    • Compressed: 2 GB
    • Uncompressed: 11 GB

    Infoboxes + sections + short description:
    • Compressed: 4.12 GB
    • Uncompressed: 21.28 GB

    Article analysis and filtering breakdown:
    • Total # of articles analyzed: 6,940,949
    • # people found with QID: 1,778,226
    • # people found with Category: 158,996
    • # people found with Biography Project: 76,150
    • Total # of people articles found: 2,013,372
    • Total # of people articles with infoboxes: 1,559,985

    Total number of people articles in this dataset: 1,559,985
    • that have a short description: 1,416,701
    • that have an infobox: 1,559,985
    • that have article sections: 1,559,921

    This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.
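    The filtering breakdown above is internally consistent, which is easy to verify:

```python
# Counts from the article-analysis breakdown above.
people_with_qid = 1_778_226
people_via_category = 158_996
people_via_bio_project = 76_150

# The three discovery routes sum to the stated total of people articles.
total_people = people_with_qid + people_via_category + people_via_bio_project
assert total_people == 2_013_372

# People found via Category or the Biography Project, but not yet tagged
# on Wikidata as instance of:human, account for the 235,146 figure.
assert people_via_category + people_via_bio_project == 235_146
```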

    Maintenance and Support

    This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024, so the information it contains may be out of date. It isn't being actively updated or maintained and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

    Initial Data Collection and Normalization

    The dataset is built from the Wikimedia Enterprise HTML “snapshots” (https://enterprise.wikimedia.com/docs/snapshot/) and focuses on the Wikipedia article namespace (namespace 0, main).

    Who are the source language producers?

    Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of English Wikipedia (https://en.wikipedia.org/), written by the community.

    Attribution

    Terms and conditions

    Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

  2. Wikipedia Biographies Text Generation Dataset

    • kaggle.com
    zip
    Updated Dec 3, 2023
    Cite
    The Devastator (2023). Wikipedia Biographies Text Generation Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/wikipedia-biographies-text-generation-dataset/code
    Explore at:
    zip (269,983,242 bytes)
    Dataset updated
    Dec 3, 2023
    Authors
    The Devastator
    License

    CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wikipedia Biographies Text Generation Dataset

    Wikipedia Biographies: Infobox and First Paragraphs Texts

    By wiki_bio (from Hugging Face)

    About this dataset

    The dataset contains several key columns: input_text and target_text. The input_text column includes the infobox and first paragraph of a Wikipedia biography, providing essential information about the individual's background, accomplishments, and notable features. The target_text column consists of the complete biography text extracted from the corresponding Wikipedia page.

    In order to facilitate model training and validation, the dataset is divided into three main files: train.csv, val.csv, and test.csv. The train.csv file contains pairs of input text and target text for model training. It serves as a fundamental resource to develop accurate language generation models by providing abundant examples for learning to generate coherent biographical texts.

    The val.csv file provides further validation data consisting of additional Wikipedia biographies with their corresponding infoboxes and first paragraphs. This subset allows researchers to evaluate their trained models' performance on unseen examples during development or fine-tuning stages.

    Finally, the test.csv file offers a separate set of input texts paired with corresponding target texts for generating complete biographies using pre-trained models or newly developed algorithms. The purpose of this file is to benchmark system performance on unseen data in order to assess generalization capabilities.

    This extended description provides an overview of the dataset's structure and its intended uses in natural language processing research tasks such as text generation and summarization. Researchers can leverage this collection to advance applications such as automatic biography writing or content generation systems that must produce coherent text from partial information, such as an infobox or opening paragraph drawn from an online encyclopedia like Wikipedia.

    How to use the dataset

    • Overview:

      • This dataset consists of biographical information from Wikipedia pages, specifically the infobox and the first paragraph of each biography.
      • The dataset is provided in three separate files: train.csv, val.csv, and test.csv.
      • Each file contains pairs of input text and target text.
    • File Descriptions:

      • train.csv: This file is used for training purposes. It includes pairs of input text (infobox and first paragraph) and target text (complete biography).
      • val.csv: This file is used for validation. It contains a collection of biographies with infobox and first-paragraph texts.
      • test.csv: This file can be used to generate complete biographies based on the given input texts.
    • Column Information:

      a) For train.csv:

      • input_text: Input text column containing the infobox and first paragraph of a Wikipedia biography.
      • target_text: Target text column containing the complete biography text for each entry.

      b) For val.csv:

      • input_text: Infobox and first paragraph texts are included in this column.
      • target_text: Complete biography texts are present in this column.

      c) For test.csv: The columns follow the pattern mentioned previously, i.e., input_text followed by target_text.

    • Usage Guidelines:

    • Training Model or Algorithm Development: If you are working on training a model or developing an algorithm for generating complete biographies from given inputs, it is recommended to use train.csv as your primary dataset.

    • Model Validation or Evaluation: To validate or evaluate your trained model, you can use val.csv as an independent dataset. This dataset contains biographies that have been withheld from the training data.

    • Generating Biographies with Trained Models: To generate complete biographies using your trained model, you can make use of test.csv. This dataset provides input texts for which you need to generate the corresponding target texts.

    • Additional Information and Tips:

    • The input text in this dataset includes both an infobox (a structured section containing key-value pairs) and the first paragraph of a Wikipedia biography.

    • The target text is the complete biography for each entry.

    • While working with this dataset, make sure to preprocess and

    Research Ideas

    • Text Generation: The dataset can be used to train language models to generate complete Wikipedia biographies given only the infobox and first paragraph ...
  3. Replication Data (A) for 'Biased Programmers or Biased Data?': Individual Measures of Numeracy, Literacy and Problem Solving Skill -- and Biographical Data -- for a Representative Sample of 200K OECD Residents

    • dataverse.harvard.edu
    Updated Sep 2, 2020
    Cite
    Bo Cowgill; Fabrizio Dell'Acqua; Sam Deng; Daniel Hsu; Nakul Verma; Augustin Chaintreau (2020). Replication Data (A) for 'Biased Programmers or Biased Data?': Individual Measures of Numeracy, Literacy and Problem Solving Skill -- and Biographical Data -- for a Representative Sample of 200K OECD Residents [Dataset]. http://doi.org/10.7910/DVN/JAJ3CP
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Bo Cowgill; Fabrizio Dell'Acqua; Sam Deng; Daniel Hsu; Nakul Verma; Augustin Chaintreau
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.3/customlicense?persistentId=doi:10.7910/DVN/JAJ3CP

    Description

    This is a cleaned and merged version of the OECD's Programme for the International Assessment of Adult Competencies. The data contains individual person-measures of several basic skills including literacy, numeracy and critical thinking, along with extensive biographical details about each subject. PIAAC is essentially a standardized test taken by a representative sample of all OECD countries (approximately 200K individuals in total). We have found this data useful in studies of predictive algorithms and human capital, in part because of its high quality, size, number and quality of biographical features per subject and representativeness of the population at large.

  4. Data from: Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr 1.0

    • live.european-language-grid.eu
    binary format
    Updated Jun 14, 2022
    Cite
    (2022). Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20518
    Explore at:
    binary format
    Dataset updated
    Jun 14, 2022
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised and sentence segmented, marked with named entities and the words lemmatised. It has also been automatically annotated with PoS tags (MULTEXT-East morphosyntactic descriptions) and Universal Dependencies PoS tags, morphological features and dependency parses.

    Crucially for the envisaged use of the corpus, the abbreviations in the corpus (of which there are 2,041) have been manually expanded so that the expanded abbreviations are also in the correct inflected form, given their context.

    The corpus is available in the canonical TEI encoding, and derived plain text and CoNLL-U files. The plain-text file has abbreviations and their expansions marked up with [...]. There are two CoNLL-U files, one with the text stream with abbreviations, and one with the text stream with expansions. Note that only the one with expansions has syntactic parses. Both CoNLL-U files have the expansions / abbreviations and named entities marked up in IOB format in the last column.
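    The IOB markup in the last CoNLL-U column can be grouped into spans with a few lines of standard-library Python. This is a sketch against a hypothetical two-token fragment (not actual corpus content), assuming one IOB tag per last-column field:

```python
def iob_spans(conllu_sentence):
    """Group FORM tokens (column 2) by IOB tags in the last column,
    returning (label, text) spans."""
    spans = []
    for line in conllu_sentence.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        form, tag = cols[1], cols[-1]
        if tag.startswith("B-"):            # beginning of a new span
            spans.append([tag[2:], [form]])
        elif tag.startswith("I-") and spans:  # continuation of the last span
            spans[-1][1].append(form)
    return [(label, " ".join(tokens)) for label, tokens in spans]

# Hypothetical CoNLL-U fragment: a two-token person name and a verb.
sample = "\n".join([
    "\t".join(["1", "Janez", "Janez", "PROPN", "_", "_", "3", "nsubj", "_", "B-PER"]),
    "\t".join(["2", "Novak", "Novak", "PROPN", "_", "_", "1", "flat", "_", "I-PER"]),
    "\t".join(["3", "piše", "pisati", "VERB", "_", "_", "0", "root", "_", "O"]),
])
```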

  5. Academic Year 1973-1974

    • datacatalogue.ukdataservice.ac.uk
    Updated Jan 1, 1978
    Cite
    Markham, S., North East London Polytechnic; Sugarman, L., North East London Polytechnic (1978). Academic Year 1973-1974 [Dataset]. http://doi.org/10.5255/UKDA-SN-974-1
    Explore at:
    Dataset updated
    Jan 1, 1978
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Markham, S., North East London Polytechnic; Sugarman, L., North East London Polytechnic
    Area covered
    England
    Description

    To collect psychometric and biographical data which may enhance counselling and selection of students. A similar study of high school pupils is held as SN: 996.

  6. Biodata items and domains to which they belong.

    • figshare.com
    xls
    Updated Jun 13, 2023
    Cite
    Pedro J. Ramos-Villagrasa; Elena Fernández-del-Río; Ángel Castro (2023). Biodata items and domains to which they belong. [Dataset]. http://doi.org/10.1371/journal.pone.0274878.t001
    Explore at:
    xls
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Pedro J. Ramos-Villagrasa; Elena Fernández-del-Río; Ángel Castro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Biodata items and domains to which they belong.

  7. Leash-Bio-processed-dataset

    • kaggle.com
    Updated May 26, 2024
    Cite
    hengck23 (2024). Leash-Bio-processed-dataset [Dataset]. https://www.kaggle.com/datasets/hengck23/leash-bio-processed-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 26, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    hengck23
    Description

    Processed dataset for https://www.kaggle.com/competitions/leash-BELKA.

    For any .bz2 file, a parallel bzip2 decompressor (https://github.com/mxmlnkn/indexed_bzip2) is recommended for speed.

    Last update: 22 May 2024

    In summary:

    See forum discussion for details of [1],[2]: https://www.kaggle.com/competitions/leash-BELKA/discussion/492846

    [1] reduced data

    • train.reduced.parquet : 98_415_610 training SMILES and their information
    • train.bind.npz : 98_415_610 x 3 target matrix
    • test.reduced.parquet : 878_022 test SMILES
    • all_buildingblock.csv : building block IDs used in train.reduced.parquet/test.reduced.parquet
    • fold0.parquet: train_share,valid_share,valid_nonshare splits for the experiments in the discussion

    [2] extracted ECFP4 fingerprints

    • train.ecfp4.packed.npz : Features extracted using rdkit
      • AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048)
      • repack with np.packbits() to give 98_415_610 x 256 feature matrix
    • test.ecfp4.packed.npz : similarly processed for the test SMILES

    This has become somewhat obsolete as the competition progressed: ECFP6 gives better results and can be extracted quickly with scikit-fingerprints.
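    The packed layout is straightforward: np.packbits stores 8 fingerprint bits per byte, most-significant bit first, so a 2048-bit ECFP4 row becomes 256 bytes (hence the 98_415_610 x 256 matrix shape). A standard-library sketch of the same scheme, for illustration only (the actual files were packed with NumPy, and np.unpackbits would be the idiomatic way to recover them):

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 bits per byte,
    most-significant bit first (the convention used by np.packbits)."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        chunk = bits[i:i + 8]
        byte = 0
        for b in chunk:
            byte = (byte << 1) | b
        byte <<= 8 - len(chunk)  # zero-pad on the right, as np.packbits does
        out.append(byte)
    return bytes(out)

def unpack_bits(data):
    """Inverse of pack_bits: recover the bit list."""
    return [(byte >> shift) & 1
            for byte in data
            for shift in range(7, -1, -1)]

# A 2048-bit fingerprint therefore packs into 2048 / 8 = 256 bytes.
```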

    See forum discussion for details of [3]: https://www.kaggle.com/competitions/leash-BELKA/discussion/498858 https://www.kaggle.com/code/hengck23/lb6-02-graph-nn-example

    [3] graph NN processed data

    • test/train-replace-c.smiles.bytestring.bz2 : replace the linker [Dy] with C. Note that these are bytestrings, not strings.
    • train-replace-c-30m.graph.pickle.**.b2z : 98_415_610 molecule graphs split into 3 files. Test graphs are not provided, as they are generated on the fly.

    See forum discussion for details of [4]: https://www.kaggle.com/competitions/leash-BELKA/discussion/505985 https://www.kaggle.com/code/hengck23/conforge-open-source-conformer-generator

    [4] conformer. i.e. molecule estimated xyz data

    • test-replace-c.conforge.sdf.bz2 : conformers in an SDF file; you can read it using rdkit's Chem.SDMolSupplier().
    • test-replace-c.conforge.status.parquet :
      • 'status' column shows the status of the conformer; 0 means success. For failure cases, the SDF stores a dummy 'CC' molecule.
      • 'idx' column is the index (primary key) into test.reduced.parquet; use it to retrieve the SMILES strings. Note that the conformers are based on test-replace-c.smiles.bytestring.bz2, i.e. [Dy] is replaced by C.
    • train-replace-c.sub-[split].conforge.sdf.bz2/status.parquet : similar format as described above. [split] values are:
      • train: 1000250+(1001610*3) molecules
      • valid: 40000
      • nonshare: about 61674
  8. Point-biserial associations between biodata items and performance.

    • plos.figshare.com
    xls
    Updated Jun 13, 2023
    Cite
    Pedro J. Ramos-Villagrasa; Elena Fernández-del-Río; Ángel Castro (2023). Point-biserial associations between biodata items and performance. [Dataset]. http://doi.org/10.1371/journal.pone.0274878.t002
    Explore at:
    xls
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Pedro J. Ramos-Villagrasa; Elena Fernández-del-Río; Ángel Castro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Point-biserial associations between biodata items and performance.

  9. Indexes to A.B. Emden's Biographical Registers of the Universities of Oxford to 1540 and Cambridge to 1500

    • datacatalogue.ukdataservice.ac.uk
    • search.datacite.org
    Updated Jun 29, 1998
    Cite
    Aston, T. H., University of Oxford, History of the University of Oxford (1998). Indexes to A.B. Emden's Biographical Registers of the Universities of Oxford to 1540 and Cambridge to 1500 [Dataset]. http://doi.org/10.5255/UKDA-SN-3788-1
    Explore at:
    Dataset updated
    Jun 29, 1998
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Aston, T. H., University of Oxford, History of the University of Oxford
    Time period covered
    Jan 1, 1200 - Jan 1, 1500
    Area covered
    Cambridge, England
    Description

    The purpose of the project was to make accessible for historical analysis the biographical information contained in Emden's Biographical Registers of the Universities of Oxford to 1540 and Cambridge to 1500. It was not intended to eliminate the need to consult the printed volumes, but rather to facilitate access to the different categories of material contained in them. For example, one could extract the names of those meeting certain predetermined criteria such as members of Merton College between 1320 and 1339 (dates were encoded as belonging to 20 year 'generations') who were authors. For fuller details the printed volumes would have to be consulted.

  10. Biological Samples and Associated Data

    • specie.bio
    Updated May 6, 2025
    Cite
    Specie Bio Provider Network (2025). Biological Samples and Associated Data [Dataset]. https://specie.bio/providers
    Explore at:
    Dataset updated
    May 6, 2025
    Dataset provided by
    Specie Bio, Inc
    Authors
    Specie Bio Provider Network
    Description

    A collection of diverse human biospecimens and their associated clinical and molecular data, available for research purposes through the Specie Bio BioExchange platform. This dataset is contributed by a network of biobanks, academic medical centers, and other research institutions.

  11. Sample of Integrated Labour Market Biographies (SIAB) 1975-2023

    • demo-b2find.dkrz.de
    Updated Jun 2, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Sample of Integrated Labour Market Biographies (SIAB) 1975-2023 - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/108c0407-02d7-5907-99da-18557e19eee9
    Explore at:
    Dataset updated
    Jun 2, 2025
    Description

    "This FDZ-Methodenreport (including Stata code examples) outlines an approach to construct cross-sectional data at freely selectable reference dates using the Sample of Integrated Labour Market Biographies (version 1975-2023). In addition, the generation of biographical variables is described." (Author's abstract, IAB-Doku) This data report describes the Sample of Integrated Labour Market Biographies (SIAB) 1975-2023.

  12. A Place In Time: Colonial Middlesex County, VA, 1650-1750 - Version 1

    • search.gesis.org
    Updated Jun 16, 2016
    Cite
    GESIS search (2016). A Place In Time: Colonial Middlesex County, VA, 1650-1750 - Version 1 [Dataset]. http://doi.org/10.3886/ICPSR35057.v1
    Explore at:
    Dataset updated
    Jun 16, 2016
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    GESIS search
    License

    https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de502761

    Area covered
    Middlesex County, Virginia
    Description

    Abstract (en): This dataset was produced by Darrett B. and Anita H. Rutman while researching their book A Place in Time: Middlesex County Virginia, 1650-1750 and the companion volume, A Place in Time: Explicatus (both New York: Norton, 1984). Together, these works were intended as an ethnography of the English settlers of colonial Middlesex County, which lies on the Chesapeake Bay. The Rutmans created this dataset by consulting documentary records from Middlesex and Lancaster Counties (Middlesex was split from Lancaster in the late 1660s) and material artifacts, including gravestones and house lots. The documentary records include information about birth, marriage, death, migration, land patents and conveyances, probate, church matters, and government matters. The Rutmans organized this material by person involved in the recorded events, producing over 12,000 individual biographical sheets. The biographical sheets contain as much information as could be found for each individual, including dates of birth, marriage, and death; children's names and dates of birth and death; names of parents and spouses; appearance in wills, transaction receipts, and court proceedings; occupation and employers; and public service. This process is described in detail in Chapter 1 of A Place in Time: Middlesex County Virginia, 1650-1750. The Rutmans' biographical sheets have been archived at the Virginia Historical Society in Richmond, Virginia. To produce this dataset, most of the sheets were photographed (those with minimal information -- usually only a name and one date -- were omitted). Information from the sheets was then hand-keyed and organized into two data tables: one containing information about the individuals who were the main subjects of each sheet, and one containing information about children listed on those sheets. Because individuals appear several times, data for the same person frequently appears in both tables and in more than one row in each table. 
    For example, a woman who lived all her life in Middlesex and married once would have two rows in the children's table -- one for her appearance on her mother's sheet and one for her appearance on her father's sheet -- and two rows in the individual table -- one for the sheet with her maiden name and one for the sheet with her married name. After entry, records were linked in order to associate all appearances of the same individual and to associate individuals with spouses, parents, children, siblings, and other relatives. Sheets with minimal information were not included in the dataset. The data include information on 6,586 unique individuals: 4,893 observations in the individual file and 7,552 in the kids file. The data are not weighted.
    ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats, as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: checked for undocumented or out-of-range codes. English settlers of colonial Middlesex County, Virginia. Smallest geographic unit: county. The original data collection was not sampled. However, in computerizing this resource, biographical shee...

  13. Regression analysis of job performance using rational biodata.

    • plos.figshare.com
    xls
    Updated Jun 13, 2023
    Cite
    Pedro J. Ramos-Villagrasa; Elena Fernández-del-Río; Ángel Castro (2023). Regression analysis of job performance using rational biodata. [Dataset]. http://doi.org/10.1371/journal.pone.0274878.t004
    Explore at:
    xls
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Pedro J. Ramos-Villagrasa; Elena Fernández-del-Río; Ángel Castro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Regression analysis of job performance using rational biodata.

  14. Bio World Photo Sample Export Import Data | Eximpedia

    • eximpedia.app
    Updated Oct 30, 2025
    Cite
    (2025). Bio World Photo Sample Export Import Data | Eximpedia [Dataset]. https://www.eximpedia.app/companies/bio-world-photo-sample/03090638
    Explore at:
    Dataset updated
    Oct 30, 2025
    Description

    Bio World Photo Sample Export Import Data. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.

  15. Harmonizing and synthesizing partnership histories from different research data infrastructures: A model project for linking research data from various infrastructure (HaSpaD)

    • search.gesis.org
    Updated Jun 21, 2022
    Cite
    Schulz, Sonja; Weiß, Bernd; Sterl, Sebastian; Haensch, Anna-Carolina; Schmid, Lisa; May, Antonia (2022). Harmonizing and synthesizing partnership histories from different research data infrastructures: A model project for linking research data from various infrastructure (HaSpaD). [Dataset]. http://doi.org/10.7802/2429
    Explore at:
    Dataset updated
    Jun 21, 2022
    Dataset provided by
    GESIS search
    GESIS, Köln
    Authors
    Schulz, Sonja; Weiß, Bernd; Sterl, Sebastian; Haensch, Anna-Carolina; Schmid, Lisa; May, Antonia
    License

    https://www.gesis.org/en/institute/data-usage-terms

    Description

    English:
    The HaSpaD project harmonizes and pools longitudinal data for the analysis of partnership biographies from nine German survey programs. These are in detail:

    • The German Family Panel (pairfam), Data file Version 12.0.0
    • ALLBUS/GGSS 1980-2016 (Kumulierte Allgemeine Bevölkerungsumfrage der Sozialwissenschaften / Cumulated German General Social Survey 1980-2016)
    • Family Surveys 1988-2000 (Change and Development of Forms of Family Life in West Germany (Survey of Families), Family and Partner Relations in Eastern Germany (Survey of Families), Change and Development of Ways of Family Life - 2nd Wave (Survey of Families), Change and Development of Families` Way of Life - 3rd Wave (Family Survey))
    • Mannheim Divorce Study 1996
    • German Fertility and Family Survey (FFS) 1992
    • German Life History Studies (Courses of Life and Historical Change in East Germany (Life History Study LV DDR), Courses of Life and Social Change: Courses of Life and Welfare Development (Life History Study LV-West I), Courses of Life and Social Change: The Between-the-War Cohort in Transition to Retirement (Life History Study LV-West II A - Personal Interview), Courses of Life and Social Change: The Between-the-War Cohort in Transition to Retirement (Life History Study LV-West II T - Telephone Interview), Courses of Life and Social Change: Access to Occupation in Employment Crisis (Life History Study LV-West III), East German Life Courses After Unification (Life History Study LV-Ost Panel), East German Life Courses After Unification (Life History Study LV Ost 71), Education, Training, and Occupation: Life Courses of the 1964 and 1971 Birth Cohorts in West Germany (Life History Study LV-West 64/71), Early Careers and Starting a Family: Life Courses of the 1971 Birth Cohorts in East and West Germany (Life History Study LV-Panel 71))
    • Generations & Gender Survey (German Subsample) GGS Waves 1 and 2
    • The Survey of Health, Ageing and Retirement in Europe (SHARE), German Sample (Share Waves 1, 2, and 3) and
    • Socio-Economic Panel (SOEP), data for the years 1984-2018.

    The HaSpaD project does not distribute datasets of its own. Instead, the HaSpaD syntax package allows users to harmonize and pool all German surveys with partnership-biographical data that are available for secondary use via a research data repository. Access to these source data must be arranged by users of the HaSpaD syntax themselves. The scripts harmonize and pool the partnership-biographical data, as well as additional variables on respondents and their partnerships, including, for example, the respondents' gender, religious affiliation, and nationality. The pooled dataset provides the opportunity to analyse previously unanswered questions on marriage and partnership stability from a historical and life-course-theoretical perspective, in particular the long-term increase in divorce rates and social changes in risk factors for separation. In addition, it facilitates methodological developments in research synthesis.



  16. Stowell Datasets Digital Archive: Dallas-Ft.Worth, Texas, USA

    • dataverse.harvard.edu
    pdf, tsv
    Updated Jan 28, 2008
    + more versions
    Cite
    Harvard Dataverse (2008). Stowell Datasets Digital Archive: Dallas-Ft.Worth, Texas, USA [Dataset]. http://doi.org/10.7910/DVN/M3RQGI
    Explore at:
    pdf(90427), pdf(33454), tsv(480137)Available download formats
    Dataset updated
    Jan 28, 2008
    Dataset provided by
    Harvard Dataverse
    Time period covered
    1993 - 1997
    Area covered
    Texas, United States
    Description

    This is one of over 400 major media market consumer surveys gifted to Washington State University (WSU) by Leigh Stowell & Company, Inc. of Seattle, Washington, USA, a market research firm which specializes in providing newspapers, television affiliates and cable operators with market segmentation research pertinent to consumer purchasing patterns and the effective marketing of goods and services to program audiences. The data in the Stowell Archive were collected via random digit dialing and computer-aided telephone interviews (CATI). Most of the surveys focus on the marketing needs of mass media clients and contain demographics, psychographics, media exposure information, and purchasing behavior data about consumers in major metropolitan areas of the United States and Canada starting in 1989. The sample sizes of the surveys range from 500 to 3,000 respondents, averaging 1,000 observations per study. Data are available at the respondent level, and all observations are keyed to zip code or other geographic identifiers. Additional surveys are anticipated, with over twenty new media market studies being donated annually. The University's relationship with Leigh Stowell & Company, Inc. was cultivated over the course of a decade by Dr. Nicholas Lovrich, Director of WSU's Division of Governmental Studies and Services (DGSS), and by Dr. John Pierce, former Dean of the WSU College of Liberal Arts. DGSS collaborated with WSU Libraries Digital Services to process the gifted data files into this digital archive, which features powerful search and download capabilities. Further refinement of the archive in accordance with the Data Documentation Initiative is progressing with support from the Office of the Provost, the College of Liberal Arts and the WSU Libraries.
    It is important to note that the year indicated in the study's title is the year the original survey was published, not necessarily the year in which the interviews were conducted. Refer to the metadata field "Dates of Collection" to discern the interview dates of each specific survey. Refer also to date fields within the data file itself.

  17. Samples and data accessibility in research biobanks

    • zenodo.org
    • data.niaid.nih.gov
    xls
    Updated Jan 24, 2020
    Cite
    Marco Capocasa; Paolo Anagnostou; Flavio D'Abramo; Giulia Matteucci; Valentina Dominici; Giovanni Destro Bisol; Fabrizio Rufo (2020). Samples and data accessibility in research biobanks [Dataset]. http://doi.org/10.5281/zenodo.17098
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Marco Capocasa; Paolo Anagnostou; Flavio D'Abramo; Giulia Matteucci; Valentina Dominici; Giovanni Destro Bisol; Fabrizio Rufo
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains answers to a questionnaire on modes of sample and data accessibility in research biobanks.

  18. Quarterly Labour Force Survey 2021, Quarter 1 - South Africa

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 25, 2021
    + more versions
    Cite
    Statistics South Africa (2021). Quarterly Labour Force Survey 2021, Quarter 1 - South Africa [Dataset]. https://microdata.worldbank.org/index.php/catalog/4074
    Explore at:
    Dataset updated
    Oct 25, 2021
    Dataset authored and provided by
    Statistics South Africahttp://www.statssa.gov.za/
    Time period covered
    2021
    Area covered
    South Africa
    Description

    Abstract

    The Quarterly Labour Force Survey (QLFS) is a household-based sample survey conducted by Statistics South Africa (Stats SA). It collects data on the labour market activities of individuals aged 15 years or older who live in South Africa.

    Geographic coverage

    National coverage

    Analysis unit

    Individuals

    Universe

    The QLFS sample covers the non-institutional population of South Africa, with one exception: the only institutional subpopulation included in the QLFS sample is individuals living in workers' hostels. Persons living in private dwelling units within institutions are also enumerated. For example, within a school compound, one would enumerate the schoolmaster's house and teachers' accommodation, because these are private dwellings. Students living in a dormitory on the school compound would, however, be excluded.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The QLFS uses a master sampling frame that is used by several household surveys conducted by Statistics South Africa. This wave of the QLFS is based on the 2013 master frame, which was created based on the 2011 census. There are 3324 PSUs in the master frame and roughly 33000 dwelling units.

    The sample for the QLFS is based on a stratified two-stage design with probability proportional to size (PPS) sampling of PSUs in the first stage, and sampling of dwelling units (DUs) with systematic sampling in the second stage.

    For each quarter of the QLFS, a quarter of the sampled dwellings are rotated out of the sample. These dwellings are replaced by new dwellings from the same PSU or the next PSU on the list. For more information see the statistical release.
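    The two-stage design described above, PPS selection of PSUs followed by systematic sampling of dwelling units, can be sketched in a few lines of Python. This is a hypothetical toy illustration with invented PSU sizes, not the official Stats SA procedure:

```python
import random

def pps_systematic_sample(psus, n_psus, dus_per_psu, seed=0):
    """Toy sketch of a stratified two-stage design: PSUs drawn with
    probability proportional to size (PPS), then a systematic sample of
    dwelling units (DUs) within each selected PSU.

    `psus` maps a PSU identifier to its number of dwelling units.
    """
    rng = random.Random(seed)

    # Stage 1: PPS selection of PSUs, here via successive
    # size-weighted draws without replacement.
    pool = dict(psus)
    chosen = []
    for _ in range(n_psus):
        total = sum(pool.values())
        r = rng.uniform(0, total)
        acc = 0.0
        for psu_id, size in pool.items():
            acc += size
            if r <= acc:
                chosen.append(psu_id)
                del pool[psu_id]
                break

    # Stage 2: systematic sampling of DUs within each selected PSU:
    # a random start, then a fixed sampling interval.
    sample = {}
    for psu_id in chosen:
        size = psus[psu_id]
        step = size / dus_per_psu
        start = rng.uniform(0, step)
        sample[psu_id] = [int(start + k * step) for k in range(dus_per_psu)]
    return sample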

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The survey questionnaire consists of the following sections:
    • Biographical information (marital status, education, etc.)
    • Economic activities in the last week for persons aged 15 years and older
    • Unemployment and economic inactivity for persons aged 15 years and above
    • Main work activity in the last week for persons aged 15 years and above
    • Earnings in the main job for employees, employers and own-account workers aged 15 years and above

    Since 2010, the income data collected by South Africa's Quarterly Labour Force Survey are no longer provided in the QLFS dataset (except for a brief return in QLFS 2010 Q3, which may be an error). Possibly because the data are unreliable at the quarterly level, Statistics South Africa now provides the income data from the QLFS in an annualised dataset called Labour Market Dynamics in South Africa (LMDSA). The datasets for LMDSA are available from DataFirst's website.

  19. Data for: Coverage of Web Accessibility Guidelines Provided by Automated Checking Tools

    • researchdata.se
    Updated Aug 25, 2025
    Cite
    Thomas Fischer (2025). Data for: Coverage of Web Accessibility Guidelines Provided by Automated Checking Tools [Dataset]. http://doi.org/10.5878/qe0c-kb63
    Explore at:
    (6791136), (49025), (34868), (102004), (1320), (8779)Available download formats
    Dataset updated
    Aug 25, 2025
    Dataset provided by
    University of Skövde
    Authors
    Thomas Fischer
    Area covered
    Sweden
    Description

    This data set contains three parts:

    1. A collection of the raw data, which includes (a) the retrieved landing page of each analyzed PSO (to be precise, the DOM representation from a browser showing this page) both as HTML and as text (text without HTML tags), (b) one log file for each of the six automated checker/engine combinations, and (c) other metadata, such as a text file containing the tools' and libraries' version information. Data of case 1(a) may contain personal data (see details below) and is thus kept in a separate archive file that is only available upon request. Data of case 1(b) has been stripped of personal data and may therefore be shared freely. This data allows investigating how the webpages looked at the time of the study and which assessments the then-current automated checkers produced. Future studies can reproduce the same setup and, for example, compare changes over time in PSOs' webpages' accessibility.

    2. A "coverage" file that is essentially a large database of WCAG-2 success criteria, their metadata, and links to automated checkers' documentation and source code. The "coverage" file combines information from various sources, such as information scraped from the W3C web page, accessibility tools' Git repositories, or AXE's documentation. Other researchers can load this "coverage" file to obtain a database of WCAG-2 success criteria and associated metadata for their data analysis without performing those error-prone and tedious steps themselves.

    3. A collection of Python files. This not only allows reproducing how the raw data was processed and filtered (up to the output of LaTeX code), but also lets other researchers draw inspiration for how to solve the problems addressed in this code base, as well as re-use the code in their own projects.

    The data covered by case 1(a) above includes textual data collected from publicly available web pages of Swedish public sector organizations (PSOs), which may include names, contact details, or other personal or biographical information. Thanks to the directory structure, the origin of the data can be determined for every file, so any further questions about the handling of personal data should be directed to the respective PSO.
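    As an aside on case 1(a): a text-without-tags variant of a stored HTML page can be derived with a small stdlib parser along these lines. This is a generic sketch, not the extraction pipeline actually used in the study:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect character data from an HTML document, skipping the
    contents of <script> and <style> elements."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    """Return the visible text of `html` with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())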

  20. The impact of HIV-AIDS on the health sector 2002: Adult data - All provinces in South Africa

    • demo-b2find.dkrz.de
    Updated Sep 21, 2025
    + more versions
    Cite
    (2025). The impact of HIV-AIDS on the health sector 2002: Adult data - All provinces in South Africa - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/34752768-fb2e-5452-b21a-52494cb28419
    Explore at:
    Dataset updated
    Sep 21, 2025
    Area covered
    South Africa
    Description

    Description: The data set contains adult patients' data: demographic, morbidity, behavioural and environmental data, plus data on the health facilities (i.e. type, province and health district). Patient biographic data cover age, sex, race, residence, nationality, language, type of dwelling, education, employment, income, religion, marital status, etc. The data also cover patients' history of hospitalisation, their health status (e.g. weight, STIs, pregnancy) and the symptoms/diseases that had prompted them to seek medical and health care. Behavioural data include sexual partnerships, condom use, alcohol/drug use and circumcision, factors that bear on the risk of infection with HIV. Environmental data on pollution, living on farms and access to clean drinking water and food were collected, as well as data on HIV status. The data contain 192 variables and 1534 cases.
    Abstract: The Nelson Mandela / HSRC study of HIV/AIDS (2002) reported an estimated prevalence of 4.5 million among persons aged two years and older. Given the overall impact of HIV/AIDS on South African society, and the need to make policies on the management of those living with the disease, it was important that studies were undertaken to provide data on the impact on the health system. This study was undertaken by the HSRC in collaboration with the National School of Public Health (NSPH) at the Medical University of South Africa (MEDUNSA) and the Medical Research Council (MRC). It was commissioned by the National Department of Health (DoH) to assess the impact of HIV/AIDS on the health system and to understand its progressive impact over time. The PIs sought to answer the following questions: To what extent does HIV/AIDS affect the health system? What aspects or sub-systems are most affected? How is the impact going to progress over time?
    To answer these questions, a stratified cluster sample of 222 health facilities representative of the public and private sector in South Africa was drawn from the national DoH database on health facilities (1996). A nation-wide, representative sample of 2000 medical professionals was obtained, including nursing professionals, other categories of nursing staff, other health professionals and non-professional health workers. In addition, a representative probability sample of 2000 patients was obtained. Data collection methods included interviews using questionnaires and clinical measurements in which either a blood specimen or an oral fluid (Orasure) specimen was collected. An anonymous linked HIV survey was conducted in the Free State, Mpumalanga, North West and KwaZulu-Natal. Oral fluids were tested for HIV antibodies at three different laboratories, and results were linked with questionnaire data using barcodes. The adult questionnaire covers the patient's biographical data, the hospitalisation history of patients seen at clinics and in-patients interviewed in hospital, health status, sexual behaviour and the environment. Modes of collection were clinical measurements and face-to-face interviews. The universe was all adult (15-45 years) patients in public and private health facilities in South Africa (note: in hospitals, only patients in medical wards were included); children 15 years and older were included, as they were interviewed together with the adult patients. The task was to obtain a representative probability sample of 2000 patients and, at most, a representative probability sample of 2000 health professionals in contact with patients undergoing treatment at the selected health facilities. The sampling frame was the national DoH's health facilities database (1996).
    The target population was selected from two separate sampling frames: (a) a list of all public clinics in the country (excluding mobile, satellite, part-time and specialized clinics); and (b) a list of all hospitals (public and private) and private clinics, with an indication of the number of beds available in each health facility, from the national DoH database on health facilities (1996). Provinces and health regions within provinces were treated as explicit strata: provinces formed the primary stratification variable and health regions the secondary stratification variable. The primary sampling units (PSUs) were the magisterial districts within each health region in the case of public clinics; the secondary sampling units (SSUs) were clinics and hospitals, drawn using simple random sampling; and the ultimate/final sampling units (USUs) were the professional and non-professional health workers and the patients. The measure of size (MOS) for public clinics was a monotonic function of the number of clinics per magisterial district. The 167 selected clinics were allocated disproportionately, i.e. proportional to MOS, and the sample number of clinics allocated to each province was distributed proportionately across the health regions in the province. The MOS for hospitals and private clinics was a monotonic function of the number of beds as recorded in the DoH's database.
    Sample sizes for SSUs:
    • Public clinics: 167
    • Public hospitals: 33
    • Private hospitals and clinics: 22
    Sample sizes for USUs: 1000 patients; 500 nursing personnel; 200 medical doctors; 100 other professional health workers; 400 non-professional health workers.
    • Public clinics: 1000 patients; 500 nursing personnel; 111 non-professional personnel (e.g. cleaners)
    • Public hospitals: 667 patients; 333 nursing personnel; 200 medical doctors; 67 other professionals; 222 non-professionals
    • Private hospitals and clinics: …

English Wikipedia People Dataset

Biographical Data for People on English Wikipedia


The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
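For working with the archive, assuming its members are newline-delimited JSON files carrying the fields listed above (the exact member layout is an assumption here, so adjust the parsing if each member turns out to hold a single JSON document), records can be streamed without unpacking the full 21 GB to disk:

```python
import io
import json
import tarfile

def iter_people(archive_path):
    """Stream person records from a .tar.gz of JSON Lines files,
    one decoded JSON object at a time."""
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fh = tar.extractfile(member)
            for line in io.TextIOWrapper(fh, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)

# Example usage (assuming the archive is in the working directory):
# for person in iter_people("wme_people_infobox.tar.gz"):
#     print(person.get("name"), "-", person.get("description"))
```

Since the iterator is lazy, filters such as "only records with an infobox" can be applied with a generator expression instead of loading everything into memory.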

Stats

Infoboxes only:
  • Size of compressed file: 2 GB
  • Size of uncompressed file: 11 GB

Infoboxes + sections + short description:
  • Size of compressed file: 4.12 GB
  • Size of uncompressed file: 21.28 GB

Article analysis and filtering breakdown:
  • Total # of articles analyzed: 6,940,949
  • # of people found with QID: 1,778,226
  • # of people found with Category: 158,996
  • # of people found with Biography Project: 76,150
  • Total # of people articles found: 2,013,372
  • Total # of people articles with infoboxes: 1,559,985

Of the 1,559,985 people articles in this dataset:
  • 1,416,701 have a short description
  • 1,559,985 have an infobox
  • 1,559,921 have article sections

This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

Maintenance and Support

This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024, so the information it contains may be out of date. The dataset isn't being actively updated or maintained and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

Who are the source language producers?

Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia (https://en.wikipedia.org/), written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
