80 datasets found
  1. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
    Explore at:
    zip (4,293,465,577 bytes)
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

    The beta sample dataset is a subset of the Structured Contents Snapshot, focusing on people with infoboxes on English Wikipedia; it is output as JSON files (compressed in a tar.gz archive).

    We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

    Data Structure

    • File name: wme_people_infobox.tar.gz
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Noteworthy included fields:

    • name - title of the article.
    • identifier - ID of the article.
    • image - main image representing the article's subject.
    • description - one-sentence description of the article for quick reference.
    • abstract - lead section, summarizing what the article is about.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables, references and similar non-prose sections.

    The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
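    The fields above can be pulled out of wme_people_infobox.tar.gz with Python's standard tarfile and json modules. This is a hedged sketch, not an official reader: it assumes each archive member holds one JSON object per line (JSON Lines), which should be verified against the actual dump layout.

```python
import json
import tarfile

# Hedged sketch, not the official reader: the dataset card says the dump
# is JSON files inside a tar.gz, and this assumes one JSON object per
# line per member -- verify against the actual archive layout.
def iter_people(archive_path):
    """Yield one article record (dict) at a time from the tar.gz dump."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            with tar.extractfile(member) as fh:
                for line in fh:
                    if line.strip():
                        yield json.loads(line)
```

    Each yielded record should then expose the documented fields, e.g. record["name"], record["identifier"], and record["infoboxes"].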

    Stats

    Infoboxes only:

    • Compressed: 2 GB
    • Uncompressed: 11 GB

    Infoboxes + sections + short description:

    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Article analysis and filtering breakdown:

    • Total # of articles analyzed: 6,940,949
    • # of people found with QID: 1,778,226
    • # of people found with Category: 158,996
    • # of people found with Biography Project: 76,150
    • Total # of people articles found: 2,013,372
    • Total # of people articles with infoboxes: 1,559,985

    Total number of people articles in this dataset: 1,559,985

    • that have a short description: 1,416,701
    • that have an infobox: 1,559,985
    • that have article sections: 1,559,921

    This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

    Maintenance and Support

    This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikimedia projects, get started with Wikimedia Enterprise's APIs.

    Initial Data Collection and Normalization

    The dataset is built from the Wikimedia Enterprise HTML "snapshots" (https://enterprise.wikimedia.com/docs/snapshot/) and focuses on the Wikipedia article namespace (namespace 0, main).

    Who are the source language producers?

    Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of English Wikipedia (https://en.wikipedia.org/), written by the community.

    Attribution

    Terms and conditions

    Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

  2. Wikipedia Biographies Text Generation Dataset

    • kaggle.com
    zip
    Updated Dec 3, 2023
    Cite
    The Devastator (2023). Wikipedia Biographies Text Generation Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/wikipedia-biographies-text-generation-dataset/code
    Explore at:
    zip (269,983,242 bytes)
    Dataset updated
    Dec 3, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wikipedia Biographies: Infobox and First Paragraphs Texts

    By wiki_bio (From Huggingface) [source]

    About this dataset

    The dataset contains two key columns: input_text and target_text. The input_text column includes the infobox and first paragraph of a Wikipedia biography, providing essential information about the individual's background, accomplishments, and notable features. The target_text column consists of the complete biography text extracted from the corresponding Wikipedia page.

    In order to facilitate model training and validation, the dataset is divided into three main files: train.csv, val.csv, and test.csv. The train.csv file contains pairs of input text and target text for model training. It serves as a fundamental resource to develop accurate language generation models by providing abundant examples for learning to generate coherent biographical texts.

    The val.csv file provides further validation data consisting of additional Wikipedia biographies with their corresponding infoboxes and first paragraphs. This subset allows researchers to evaluate their trained models' performance on unseen examples during development or fine-tuning stages.

    Finally, the test.csv file offers a separate set of input texts paired with corresponding target texts for generating complete biographies using pre-trained models or newly developed algorithms. The purpose of this file is to benchmark system performance on unseen data in order to assess generalization capabilities.

    This extended description aims to provide an informative overview of the dataset's structure and its intended use cases in natural language processing research tasks such as text generation or summarization. Researchers can leverage this collection to advance applications in automatic biography writing systems or content generation tasks that require coherent textual output based on partial information extracted from an infobox or the opening paragraph of an online encyclopedia such as Wikipedia.

    How to use the dataset

    • Overview:

      • This dataset consists of biographical information from Wikipedia pages, specifically the infobox and the first paragraph of each biography.
      • The dataset is provided in three separate files: train.csv, val.csv, and test.csv.
      • Each file contains pairs of input text and target text.
    • File Descriptions:

      • train.csv: This file is used for training purposes. It includes pairs of input text (infobox and first paragraph) and target text (complete biography).
      • val.csv: This file is used for validation. It contains a collection of biographies with infobox and first paragraph texts.
      • test.csv: This file can be used to generate complete biographies based on the given input texts.
    • Column Information:

      a) For train.csv:

      • input_text: Input text column containing the infobox and first paragraph of a Wikipedia biography.
      • target_text: Target text column containing the complete biography text for each entry.

      b) For val.csv:

      • input_text: Infobox and first paragraph texts are included in this column.
      • target_text: Complete biography texts are present in this column.

      c) For test.csv: The columns follow the pattern mentioned previously, i.e., input_text followed by target_text.
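    Given the two-column layout described above, one split can be read with Python's standard csv module. A minimal sketch; the sample row below is a synthetic stand-in, not taken from the dataset:

```python
import csv
import io

# Synthetic stand-in for a few bytes of train.csv / val.csv / test.csv;
# the real files share this two-column header.
SAMPLE = '''input_text,target_text
"name: ada lovelace || born: 1815","ada lovelace was an english mathematician."
'''

def read_pairs(fh):
    """Return (input_text, target_text) tuples from one split file."""
    return [(row["input_text"], row["target_text"]) for row in csv.DictReader(fh)]

pairs = read_pairs(io.StringIO(SAMPLE))
```

    For the real files, open("train.csv", newline="") would replace the io.StringIO wrapper.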

    • Usage Guidelines:

    • Training Model or Algorithm Development: If you are working on training a model or developing an algorithm for generating complete biographies from given inputs, it is recommended to use train.csv as your primary dataset.

    • Model Validation or Evaluation: To validate or evaluate your trained model, you can use val.csv as an independent dataset. This dataset contains biographies that have been withheld from the training data.

    • Generating Biographies with Trained Models: To generate complete biographies using your trained model, you can make use of test.csv. This dataset provides input texts for which you need to generate the corresponding target texts.

    • Additional Information and Tips:

    • The input text in this dataset includes both an infobox (a structured section containing key-value pairs) and the first paragraph of a Wikipedia biography.

    • The target text is the complete biography for each entry.

    • While working with this dataset, make sure to preprocess and

    Research Ideas

    • Text Generation: The dataset can be used to train language models to generate complete Wikipedia biographies given only the infobox and first paragraph ...
  3. Academic Year 1973-1974

    • datacatalogue.ukdataservice.ac.uk
    Updated Jan 1, 1978
    Cite
    Markham, S., North East London Polytechnic; Sugarman, L., North East London Polytechnic (1978). Academic Year 1973-1974 [Dataset]. http://doi.org/10.5255/UKDA-SN-974-1
    Explore at:
    Dataset updated
    Jan 1, 1978
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Markham, S., North East London Polytechnic; Sugarman, L., North East London Polytechnic
    Area covered
    England
    Description

    To collect psychometric and biographical data which may enhance counselling and selection of students. A similar study of high school pupils is held as SN: 996.

  4. Biological Samples and Associated Data

    • specie.bio
    Updated May 6, 2025
    Cite
    Specie Bio Provider Network (2025). Biological Samples and Associated Data [Dataset]. https://specie.bio/providers
    Explore at:
    Dataset updated
    May 6, 2025
    Dataset provided by
    Specie Bio, Inc
    Authors
    Specie Bio Provider Network
    Description

    A collection of diverse human biospecimens and their associated clinical and molecular data, available for research purposes through the Specie Bio BioExchange platform. This dataset is contributed by a network of biobanks, academic medical centers, and other research institutions.

  5. Leash-Bio-processed-dataset

    • kaggle.com
    Updated May 26, 2024
    Cite
    hengck23 (2024). Leash-Bio-processed-dataset [Dataset]. https://www.kaggle.com/datasets/hengck23/leash-bio-processed-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 26, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    hengck23
    Description

    Processed dataset for https://www.kaggle.com/competitions/leash-BELKA.

    For any bz2 file, it is recommended to use a parallel bzip2 decompressor (https://github.com/mxmlnkn/indexed_bzip2) for speed.
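    As a dependency-free illustration, the bz2 members can be streamed with Python's standard bz2 module; indexed_bzip2 (linked above) is reported to expose a similar file interface with parallel decompression, which matters for the multi-gigabyte files here.

```python
import bz2

# Stdlib sketch; for large files, indexed_bzip2's reader (linked above)
# offers similar line-by-line usage with parallel decompression.
def read_bz2_lines(path):
    """Stream decoded text lines from a .bz2 file without loading it whole."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")
```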

    Last update : 22-may-2024

    In summary:

    See forum discussion for details of [1],[2]: https://www.kaggle.com/competitions/leash-BELKA/discussion/492846

    [1] reduced data

    • train.reduced.parquet : 98_415_610 training SMILES and their information
    • train.bind.npz : 98_415_610 x 3 target matrix
    • test.reduced.parquet : 878_022 test SMILES
    • all_buildingblock.csv: building blocks id used in train.reduced.parquet/test.reduced.parquet
    • fold0.parquet: train_share,valid_share,valid_nonshare splits for the experiments in the discussion

    [2] extracted ECFP4 fingerprints

    • train.ecfp4.packed.npz : Features extracted using rdkit
      • AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048)
      • repack with np.packbits() to give 98_415_610 x 256 feature matrix
    • test.ecfp4.packed.npz : similarly processed for the test SMILES
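    The repacking step described above can be sketched with numpy alone; the random bit vector below is a stand-in for one 2048-bit output of rdkit's AllChem.GetMorganFingerprintAsBitVect, so the sketch stays runnable without rdkit.

```python
import numpy as np

# Stand-in for one ECFP4 bit vector (2048 bits of 0/1); a real pipeline
# would fill this from rdkit's AllChem.GetMorganFingerprintAsBitVect.
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=2048, dtype=np.uint8)

packed = np.packbits(bits)        # 2048 bits -> 256 uint8 bytes per molecule
restored = np.unpackbits(packed)  # lossless round trip back to 0/1 bits
```

    Packing all 98_415_610 fingerprints this way yields the 98_415_610 x 256 matrix stored in train.ecfp4.packed.npz.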

    This is somewhat obsolete as the competition progresses: ECFP6 gives better results and can be extracted quickly with scikit-fingerprints.

    See forum discussion for details of [3]: https://www.kaggle.com/competitions/leash-BELKA/discussion/498858 https://www.kaggle.com/code/hengck23/lb6-02-graph-nn-example

    [3] graph NN processed data

    • test/train-replace-c.smiles.bytestring.bz2 : replace linker [Dy] with C. Note that these are bytestrings and not strings.
    • train-replace-c-30m.graph.pickle.**.b2z : 98_415_610 molecule graphs split into 3 files. Test graphs are not provided as they can be generated on the fly.

    See forum discussion for details of [4]: https://www.kaggle.com/competitions/leash-BELKA/discussion/505985 https://www.kaggle.com/code/hengck23/conforge-open-source-conformer-generator

    [4] conformers, i.e. estimated molecule xyz data

    • test-replace-c.conforge.sdf.bz2 : conformers in an sdf file; you can read the file using rdkit's Chem.SDMolSupplier().
    • test-replace-c.conforge.status.parquet :
      • 'status' col shows the status of conformer generation: 0 means success; for failure cases, the sdf stores a dummy 'CC' molecule.
      • 'idx' col is the index (primary key) into test.reduced.parquet; use this to retrieve SMILES strings. Note that the conformers are based on test-replace-c.smiles.bytestring.bz2, i.e. [Dy] is replaced by C.
    • train-replace-c.sub-[split].conforge.sdf.bz2/status.parquet : similar format as described above. [split] values are:
      • train: 1000250+(1001610*3) molecules
      • valid: 40000
      • nonshare: about 61674
  6. Data from: Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr...

    • live.european-language-grid.eu
    binary format
    Updated Jun 14, 2022
    Cite
    (2022). Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20518
    Explore at:
    binary format
    Dataset updated
    Jun 14, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised and sentence segmented, marked with named entities and the words lemmatised. It has also been automatically annotated with PoS tags (MULTEXT-East morphosyntactic descriptions) and Universal Dependencies PoS tags, morphological features and dependency parses.

    Crucially for the envisaged use of the corpus, the abbreviations in the corpus (of which there are 2,041) have been manually expanded so that the expanded abbreviations are also in the correct inflected form, given their context.

    The corpus is available in the canonical TEI encoding, and derived plain text and CoNLL-U files. The plain-text file has abbreviations and their expansions marked up with [...]. There are two CoNLL-U files, one with the text stream with abbreviations, and one with the text stream with expansions. Note that only the one with expansions has syntactic parses. Both CoNLL-U files have the expansions / abbreviations and named entities marked up in IOB format in the last column.
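    A minimal sketch of reading the last column of the CoNLL-U files, where the IOB marks for expansions/abbreviations and named entities live. The two token lines below are invented for illustration; the corpus's actual attribute names in that column may differ.

```python
# Invented two-token sample in CoNLL-U's 10-column, tab-separated layout;
# the real corpus's last-column attribute names may differ.
SAMPLE = (
    "1\tDr\tdoktor\tNOUN\t_\t_\t2\tnmod\t_\tAbbr=B\n"
    "2\tNovak\tNovak\tPROPN\t_\t_\t0\troot\t_\tNER=B-PER\n"
)

def last_column_tags(conllu_text):
    """Return (surface form, last column) for each 10-column token line."""
    pairs = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) == 10:  # skip comments and blank sentence separators
            pairs.append((cols[1], cols[9]))
    return pairs
```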

  7. Replication Data (A) for 'Biased Programmers or Biased Data?': Individual...

    • dataverse.harvard.edu
    Updated Sep 2, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bo Cowgill; Fabrizio Dell'Acqua; Sam Deng; Daniel Hsu; Nakul Verma; Augustin Chaintreau (2020). Replication Data (A) for 'Biased Programmers or Biased Data?': Individual Measures of Numeracy, Literacy and Problem Solving Skill -- and Biographical Data -- for a Representative Sample of 200K OECD Residents [Dataset]. http://doi.org/10.7910/DVN/JAJ3CP
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Bo Cowgill; Fabrizio Dell'Acqua; Sam Deng; Daniel Hsu; Nakul Verma; Augustin Chaintreau
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.3/customlicense?persistentId=doi:10.7910/DVN/JAJ3CP

    Description

    This is a cleaned and merged version of the OECD's Programme for the International Assessment of Adult Competencies. The data contains individual person-measures of several basic skills including literacy, numeracy and critical thinking, along with extensive biographical details about each subject. PIAAC is essentially a standardized test taken by a representative sample of all OECD countries (approximately 200K individuals in total). We have found this data useful in studies of predictive algorithms and human capital, in part because of its high quality, size, number and quality of biographical features per subject and representativeness of the population at large.

  8. The impact of HIV-AIDS on the health sector 2002: Adult data - All provinces...

    • demo-b2find.dkrz.de
    Updated Sep 21, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). The impact of HIV-AIDS on the health sector 2002: Adult data - All provinces in South Africa - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/34752768-fb2e-5452-b21a-52494cb28419
    Explore at:
    Dataset updated
    Sep 21, 2025
    Area covered
    South Africa
    Description

    Description: The data set contains adult patients' data: demographic, morbidity, behavioural, environmental, and data on health facilities (i.e. type, province and health district). Patient biographic data (age, sex, race, residence, nationality, language, type of dwelling, education, employment, income, religion, marital status, etc.). Data on the history of hospitalisation of patients and data on patients' health status such as weight, STIs, pregnancy etc., and symptoms/diseases that had prompted patients to seek medical and health care. Behavioural data includes sexual partnerships, condom use, alcohol/drug use and circumcision that predisposes one towards infection with HIV. Environmental data on pollution, living on farms and access to clean drinking water and food, as well as data on HIV status, was collected. The data contains 192 variables and 1534 cases.

    Abstract: The Nelson Mandela / HSRC study of HIV/AIDS (2002) reported an estimated prevalence of 4.5 million among persons aged two years and older. Given the overall impact of HIV/AIDS on South African society, and the need to make policies on the management of those living with the disease, it was important that studies were undertaken to provide data on the impact on the health system. This study was undertaken by the HSRC in collaboration with the National School of Public Health (NSPH) at the Medical University of South Africa (MEDUNSA) and the Medical Research Council (MRC). It was commissioned by the National Department of Health (DoH) to assess the impact of HIV/AIDS on the health system and to understand its progressive impact over time. The PIs sought to answer the following questions: To what extent does HIV/AIDS affect the health system? What aspects or sub-systems are most affected? How is the impact going to progress over time?

    To answer the questions, a stratified cluster sample of 222 health facilities representative of the public and private sector in South Africa was drawn from the national DoH database on health facilities (1996). A nation-wide, representative sample of 2000 medical professionals, including nursing professionals, other categories of nursing staff, other health professionals and non-professional health workers, was obtained. In addition, a representative probability sample of 2000 patients was obtained. Data collection methods included interviews using questionnaires and clinical measurements where either a blood specimen or an oral fluid (Orasure) specimen was collected. An anonymous linked HIV survey was conducted in the Free State, Mpumalanga, North West and KwaZulu-Natal. Oral fluids were tested for HIV antibodies at three different laboratories and results were linked with questionnaire data using barcodes. The adult questionnaire contains the patient's biographical data, hospitalisation history of patients seen at clinics, in-patients interviewed in a hospital, health status, sexual behaviour and the environment.

    Data collection: clinical measurements; face-to-face interviews. Universe: all adult (15-45 years) patients in public and private health facilities in South Africa (note: in hospitals only patients in medical wards were included; children 15 years and older were included as they were interviewed together with the adult patients). The task was to obtain a representative probability sample of 2000 patients, and a representative probability sample of at most 2000 health professionals who are in contact with patients undergoing treatment at the selected health facilities. The sampling frame was the national DoH's health facilities database (1996).

    The target population was selected from two separate sampling frames: (a) a list of all public clinics in the country (excluding mobile, satellite, part-time and specialized clinics); and (b) a list of all hospitals (public and private) and private clinics, with an indication of the number of beds available in each health facility, from the national DoH database on health facilities (1996). Provinces and health regions within provinces were considered as explicit strata: provinces formed the primary stratification variable and the health regions the secondary stratification variable. The primary sampling unit (PSU) was the magisterial districts within each health region in the case of public clinics; the secondary sampling units (SSU) were clinics and hospitals, drawn using simple random sampling; and the ultimate/final sampling units (USU) were the professional and non-professional health workers and patients. The measure of size (MOS) for public clinics was a monotonic function of the number of clinics per magisterial district. The selected 167 clinics were allocated disproportionately, i.e. proportional to MOS. The allocated sample number of clinics within each province was distributed proportionately across the health regions in the province. The MOS for hospitals and private clinics was a monotonic function of the number of beds as in the DoH's database.

    Sample sizes for SSUs: public clinics (167), public hospitals (33), private hospitals and clinics (22).

    Sample sizes for USUs: 1000 patients, 500 nursing personnel, 200 medical doctors, 100 other professional health workers, 400 non-professional health workers. Public clinics: 1000 patients, 500 nursing personnel, 111 non-professional personnel (e.g. cleaners). Public hospitals: 667 patients, 333 nursing personnel, 200 medical doctors, 67 other professionals, 222 non-professionals. Private hospitals and clinics

  9. Bio World Photo Sample Export Import Data | Eximpedia

    • eximpedia.app
    Updated Oct 30, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). Bio World Photo Sample Export Import Data | Eximpedia [Dataset]. https://www.eximpedia.app/companies/bio-world-photo-sample/03090638
    Explore at:
    Dataset updated
    Oct 30, 2025
    Description

    Bio World Photo Sample Export Import Data. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.

  10. Samples and data accessibility in research biobanks

    • zenodo.org
    • data.niaid.nih.gov
    xls
    Updated Jan 24, 2020
    Cite
    Marco Capocasa; Paolo Anagnostou; Flavio D'Abramo; Giulia Matteucci; Valentina Dominici; Giovanni Destro Bisol; Fabrizio Rufo (2020). Samples and data accessibility in research biobanks [Dataset]. http://doi.org/10.5281/zenodo.17098
    Explore at:
    xls
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marco Capocasa; Paolo Anagnostou; Flavio D'Abramo; Giulia Matteucci; Valentina Dominici; Giovanni Destro Bisol; Fabrizio Rufo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains responses to a questionnaire on modes of sample and data accessibility in research biobanks.

  11. Understanding Society - Health and Biomarkers

    • understandingsociety.ac.uk
    Updated Nov 24, 2017
    + more versions
    Cite
    ISER > Institute for Social and Economic Research, University of Essex (2017). Understanding Society - Health and Biomarkers [Dataset]. http://doi.org/10.5255/UKDA-SN-7251-3
    Explore at:
    Dataset updated
    Nov 24, 2017
    Dataset authored and provided by
    ISER > Institute for Social and Economic Research, University of Essex
    Time period covered
    Jan 1, 2009 - Dec 31, 2012
    Description

    Understanding Society, the UK Household Longitudinal Study (UKHLS), is a longitudinal survey of the members of approximately 40,000 households (at Wave 1) in the United Kingdom, and is conducted by the Institute for Social and Economic Research (ISER) at the University of Essex. Understanding Society collects information about participants' social and economic circumstances, attitudes and beliefs, and it also gathers information about their health. From Wave 1 onwards participants were asked a number of questions about their general health. In Wave 2 and Wave 3 adult participants received a follow-up health assessment visit from a registered nurse. A range of bio-medical measures were collected from around 20,000 adults, including blood pressure, weight, height, waist measurement, body fat, grip strength and lung function. Blood samples were also taken at these visits and frozen for future research. A number of biomarkers have now been extracted from the blood which measure major illnesses in the UK as well as being markers of key physiological systems. A genome-wide scan has been conducted on DNA samples from approximately 10,000 people, which enables the examination of gene-environment interactions for health and social phenomena. Methylation profiling has been conducted on DNA samples from approximately 1,200 individuals from the British Household Panel Survey component of Understanding Society. This is particularly important for advancing understanding of how people's social, economic and physical environments over their lifetime influence their biological processes by altering how their genes work.

  12. Bio-optical data for Australian Inland Waters v.1

    • data.csiro.au
    • researchdata.edu.au
    Updated May 6, 2022
    Cite
    Janet Anstee; Nathan Drayson; Hannelie Botha; Gemma Kerrisk; Stephen Sagar; Phillip Ford; Bozena Wojtasiewicz; Lesley Clementson; Guy Byrne (2022). Bio-optical data for Australian Inland Waters v.1 [Dataset]. http://doi.org/10.25919/rtd7-j815
    Explore at:
    Dataset updated
    May 6, 2022
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Janet Anstee; Nathan Drayson; Hannelie Botha; Gemma Kerrisk; Stephen Sagar; Phillip Ford; Bozena Wojtasiewicz; Lesley Clementson; Guy Byrne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2013 - Dec 14, 2021
    Area covered
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Geoscience Australia
    Description

    This collection comprises bio-optical measurements for a wide range of Australian inland waterbodies. The data was collected to describe the variation in bio-optical properties in Australian waterbodies. These data can be used for validation and development of inversion algorithms. Lineage: The data were collected using a combination of in situ measurements and laboratory analysis; see the readme file for details. The following data were obtained from laboratory analysis of in situ surface samples: absorption, TSS, phytoplankton pigments, organic carbon. The following data were obtained from in situ measurements: backscattering, radiometric measurements. Absorption - laboratory analysis of in situ surface samples. Backscattering - in situ surface measurements. TSS -

  13. Data from: S7 Fig -

    • plos.figshare.com
    zip
    Updated Jun 13, 2023
    Cite
    Isabella Lucia Chiara Mariani Wigley; Massimiliano Pastore; Eleonora Mascheroni; Marta Tremolada; Sabrina Bonichini; Rosario Montirosso (2023). S7 Fig - [Dataset]. http://doi.org/10.1371/journal.pone.0274477.s007
    Explore at:
    zip
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Isabella Lucia Chiara Mariani Wigley; Massimiliano Pastore; Eleonora Mascheroni; Marta Tremolada; Sabrina Bonichini; Rosario Montirosso
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    a. Likelihood Distance (LD) for each observation in the Calibration sample. Each point represents the LD when the observation is deleted from the sample. Here we evaluate case influence, which refers to the impact of a case on study results as quantified by detection statistics. This approach compares the solutions obtained from the original sample with those obtained from the sample excluding case i, where i represents each case in turn. One way to evaluate case influence in SEM is the Likelihood Distance (LDi) (1). Specifically, it evaluates the influence of a case on the global fit of the model: the higher the value of LDi, the greater the influence. In the present study LDi was evaluated with respect to the TBQ four-factor model tested in the Calibration data sample (first step). The graph highlights the absence of cases with a significant influence on the global fit of the model.

    b. The CFI difference (ΔCFI) for each observation in the Calibration sample. We also evaluated the influence of each case on the global fit of the model by computing the CFI difference (ΔCFI). This measure highlights the magnitude and the direction of influence: positive values of ΔCFI indicate that removing case i improves the model, while negative values indicate the opposite. As seen in the graph, no influential cases were detected. Again, each point represents the ΔCFI when the observation is deleted from the sample.

    c. The Generalized Cook's Distance for each observation in the Calibration sample. Each point represents the Generalized Cook's Distance when the observation is deleted from the sample. We used Generalized Cook's Distance to evaluate the influence of a case on the parameter estimates of our model. As seen from the graph, removing case i did not change the parameter estimates significantly. (ZIP)

  14. Data for the Computational Linguistics and Clinical Psychology Shared Task,...

    • datacatalogue.ukdataservice.ac.uk
    Updated Dec 3, 2020
    Cite
    UK Data Service (2020). Data for the Computational Linguistics and Clinical Psychology Shared Task, 2018 [Dataset]. http://doi.org/10.5255/UKDA-SN-8471-1
    Explore at:
    Dataset updated
    Dec 3, 2020
    Dataset provided by
    UK Data Servicehttps://ukdataservice.ac.uk/
    Time period covered
    Jan 1, 1969 - Dec 31, 2008
    Area covered
    United Kingdom
    Description

    The National Child Development Study (NCDS) originated in the Perinatal Mortality Survey (see SN 5565), which examined social and obstetric factors associated with stillbirth and infant mortality among over 17,000 babies born in Britain in one week in March 1958. Surviving members of this birth cohort have been surveyed on eight further occasions in order to monitor their changing health, education, social and economic circumstances - in 1965 at age 7, 1969 at age 11, 1974 at age 16 (the first three sweeps are also held under SN 5565), 1981 (age 23 - SN 5566), 1991 (age 33 - SN 5567), 1999/2000 (age 41/2 - SN 5578), 2004-2005 (age 46/47 - SN 5579), 2008-2009 (age 50 - SN 6137) and 2013 (age 55 - SN 7669).

    There have also been surveys of sub-samples of the cohort, the most recent occurring in 1995 (age 37), when a 10% representative sub-sample was assessed for difficulties with basic skills (SN 4992). Finally, during 2002-2004, 9,340 NCDS cohort members participated in a bio-medical survey, carried out by qualified nurses (SN 5594, available under more restrictive Special Licence access conditions; see catalogue record for details). The bio-medical survey did not cover any of the topics included in the 2004/2005 survey. Further NCDS data separate to the main surveys include a response and deaths dataset, parent migration studies, employment, activity and partnership histories, behavioural studies and essays - see the NCDS series page for details.

    Further information about the NCDS can be found on the Centre for Longitudinal Studies website.

    How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
    A useful overview of the governance routes for applying for genetic and bio-medical sample data, which are not available through the UK Data Service, can be found at Governance of data and sample access on the METADAC (Managing Ethico-social, Technical and Administrative issues in Data Access) website.

    Data for the Computational Linguistics and Clinical Psychology Shared Task, 2018
    contains the outputs of the shared task for the CLPsych 2018 workshop, which focused on predicting current and future psychological health from an essay authored in childhood. Language-based predictions of a person's current health have the potential to supplement traditional psychological assessment such as questionnaires, improving intake risk measurement and monitoring. Predictions of future psychological health can aid both early detection and the development of preventative care. Research into the mental health trajectory of people, beginning in their childhood, has thus far received little attention within the natural language processing (NLP) community. This shared task represented one of the first attempts to evaluate the use of early language to predict future health; this has the potential to support a wide variety of clinical health care tasks, from early assessment of lifetime risk for mental health problems to optimal timing of targeted interventions aimed at both prevention and treatment.

  15. Bio-optical Data from Chilean Coastal waters 2017 - 2020

    • researchdata.edu.au
    • data.csiro.au
    datadownload
    Updated Dec 4, 2020
    Cite
    Elizabeth Brewer; Bozena Wojtasiewicz; Diego Ocampo Melgar; Patricio Bernal; Andy Steven; Joey Crosswell; Nagur Cherukuru; Tim Malthus; Lesley Clementson; Tim Malthus; Lesley Clementson; Joseph Crosswell; Elizabeth Brewer; Bozena Wojtasiewicz (2020). Bio-optical Data from Chilean Coastal waters 2017 - 2020 [Dataset]. http://doi.org/10.25919/QBBV-V359
    Explore at:
    datadownloadAvailable download formats
    Dataset updated
    Dec 4, 2020
    Dataset provided by
    CSIROhttp://www.csiro.au/
    Authors
    Elizabeth Brewer; Bozena Wojtasiewicz; Diego Ocampo Melgar; Patricio Bernal; Andy Steven; Joey Crosswell; Nagur Cherukuru; Tim Malthus; Lesley Clementson; Tim Malthus; Lesley Clementson; Joseph Crosswell; Elizabeth Brewer; Bozena Wojtasiewicz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 17, 2017 - May 8, 2020
    Description

    This is a collection of data consisting of pigment concentration and composition, particulate and dissolved absorption coefficients, and total suspended matter concentration. The data relate to samples collected in Chilean coastal waters where aquaculture is present, and will be used to develop a local algorithm for retrieving satellite estimates of bio-optical parameters in the water column.

    Lineage: Water samples were taken on board the vessel and stored under cool, dark conditions until filtering took place on land. Samples were analysed and QC procedures were carried out in the Bio-Analytical facility, CSIRO Marine Labs, Hobart.

    For pigment analysis, 4 litres of sample water were filtered through a 47 mm glass fibre filter (Whatman GF/F) and then stored in liquid nitrogen until analysis. To extract the pigments, the filters were cut into small pieces and covered with 100% acetone (3 ml) in a 10 ml centrifuge tube. The samples were vortexed for about 30 seconds and then sonicated for 15 minutes in the dark. The samples were then kept in the dark at 4 °C for approximately 15 hours. After this time, 200 µL of water was added to the acetone so that the extract mixture was 90:10 acetone:water (vol:vol), and the samples were sonicated once more for 15 minutes. The extracts were centrifuged to remove the filter paper and then filtered through a 0.2 µm membrane filter (Whatman Anotop) prior to analysis by HPLC, using a Waters Alliance high performance liquid chromatography system comprising a 2695XE separations module with column heater and refrigerated autosampler and a 2996 photo-diode array detector. Immediately prior to injection, the sample extract was mixed with a buffer solution (90:10 28 mM tetrabutyl ammonium acetate, pH 6.5 : methanol) within the sample loop. Pigments were separated using a Zorbax Eclipse XDB-C8 stainless steel 150 mm x 4.6 mm ID column with 3.5 µm particle size (Agilent Technologies) with gradient elution as described in Van Heukelem and Thomas (2001). The separated pigments were detected at 436 nm and identified against standard spectra using Waters Empower software. Concentrations of chlorophyll a, chlorophyll b, b,b-carotene and b,e-carotene in sample chromatograms were determined from standards (Sigma, USA or DHI, Denmark).

    For absorption coefficients, 4 litres of sample water were filtered through a 25 mm glass fibre filter (Whatman GF/F) and the filter was then stored flat in liquid nitrogen until analysis. Optical density spectra for total particulate matter were obtained using a Cintra 404 UV/VIS dual beam spectrophotometer equipped with an integrating sphere.

    For CDOM, water was filtered through a 0.22 µm Durapore filter on an all-glass filter unit. Optical density spectra were obtained using 10 cm cells in a Cintra 404 UV/VIS spectrophotometer with Milli-Q water as a reference.

    TSM was determined by drying the filter at 60 °C to constant weight; the filter may then be muffled at 450 °C to burn off the organic fraction. The inorganic fraction is weighed, and the organic fraction is determined as the difference between the SPM and the inorganic fraction.

  16. The impact of HIV-AIDS on the health sector 2002: Child data - All provinces...

    • demo-b2find.dkrz.de
    Updated Sep 26, 2025
    Cite
    (2025). The impact of HIV-AIDS on the health sector 2002: Child data - All provinces in South Africa - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/ffa74581-569e-504b-be99-bc4451438471
    Explore at:
    Dataset updated
    Sep 26, 2025
    Area covered
    South Africa
    Description

    Description: The data set contains child patients' data - demographic, morbidity, behavioural and environmental - and data on health facilities (name, type, province and health district). Patient biographic data include age, sex, race, residence, nationality, refugee status, place of birth, language, type of dwelling, education, employment, religion, orphanhood status, marital status of parents, etc. The data also cover patients' history of hospitalisation, their health status (e.g. weight loss, diarrhoea) and the symptoms/diseases that had prompted them to seek medical and health care. Furthermore, environmental data on pollution, living on farms and access to clean drinking water and food were collected, as well as data on HIV status. The data contain 108 variables and 415 cases.

    Abstract: The Nelson Mandela / HSRC study of HIV/AIDS (2002) reported an estimated prevalence of 4.5 million among persons aged two years and older. Given the overall impact of HIV/AIDS on South African society, and the need to make policies on the management of those living with the disease, it was important that studies were undertaken to provide data on the impact on the health system. This study was undertaken by the HSRC in collaboration with the National School of Public Health (NSPH) at the Medical University of South Africa (MEDUNSA) and the Medical Research Council (MRC). It was commissioned by the National Department of Health (DoH) to assess the impact of HIV/AIDS on the health system and to understand its progressive impact over time. The PIs sought to answer the following questions: To what extent does HIV/AIDS affect the health system? What aspects or sub-systems are most affected? How is the impact going to progress over time?

    To answer these questions, a stratified cluster sample of 222 health facilities representative of the public and private sector in South Africa was drawn from the national DoH database on health facilities (1996). A nationwide, representative sample of 2000 medical professionals - including nursing professionals, other categories of nursing staff, other health professionals and non-professional health workers - was obtained, in addition to a representative probability sample of 2000 patients. Data collection methods included face-to-face interviews using questionnaires, and clinical measurements in which either a blood specimen or an oral fluid (OraSure) specimen was collected. An anonymous linked HIV survey was conducted in the Free State, Mpumalanga, North West and KwaZulu-Natal. Oral fluids were tested for HIV antibodies at three different laboratories and results were linked with questionnaire data using barcodes. The child questionnaire contains the child patient's biographical data, hospitalisation history (for patients seen at clinics and in-patients interviewed in hospital), health status and environment. The target population comprised all child patients (younger than 15 years) in public and private health facilities in South Africa (note: in hospitals, only patients in medical and paediatric wards were included).

    The task was to obtain a representative probability sample of 2000 patients, and a representative probability sample of at most 2000 health professionals who are in contact with patients undergoing treatment at the selected health facilities. The sampling frame was the national DoH's health facilities database (1996). The target population was selected from two separate sampling frames: (a) a list of all public clinics in the country (excluding mobile, satellite, part-time and specialized clinics); and (b) a list of all hospitals (public and private) and private clinics, with an indication of the number of beds available in each facility, from the national DoH database on health facilities (1996). Provinces and health regions within provinces were treated as explicit strata: provinces formed the primary stratification variable and health regions the secondary stratification variable. The primary sampling units (PSUs) were the magisterial districts within each health region in the case of public clinics; the secondary sampling units (SSUs) were clinics and hospitals, drawn using simple random sampling; and the ultimate/final sampling units (USUs) were the professional and non-professional health workers and patients. The measure of size (MOS) for public clinics was a monotonic function of the number of clinics per magisterial district. The 167 selected clinics were allocated disproportionately, i.e. proportional to MOS, and the allocated sample number of clinics within each province was distributed proportionately across the health regions in the province. The MOS for hospitals and private clinics was a monotonic function of the number of beds in the DoH's database.

    Sample sizes for SSUs: public clinics (167), public hospitals (33), private hospitals and clinics (22). Sample sizes for USUs: 1000 patients, 500 nursing personnel, 200 medical doctors, 100 other professional health workers, 400 non-professional health workers. Public clinics: 1000 patients, 500 nursing personnel, 111 non-professional personnel (e.g. cleaners). Public hospitals: 667 patients, 333 nursing personnel, 200 medical doctors, 67 other professionals, 222 non-professionals. Private hospitals and clinics.
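The measure-of-size allocation described above amounts to probability-proportional-to-size (PPS) selection of facilities within strata. A hedged sketch of one sequential PPS draw follows; this is an illustration of the general technique, not the study's actual procedure, and all names are invented:

```python
import random

def pps_sample(units, sizes, n, seed=0):
    """Draw n units without replacement, with selection probability at each
    step proportional to a measure of size (MOS) -- e.g. beds per hospital."""
    rng = random.Random(seed)
    chosen = []
    pool = list(zip(units, sizes))
    for _ in range(n):
        total = sum(s for _, s in pool)
        r = rng.uniform(0, total)
        acc = 0.0
        for idx, (unit, size) in enumerate(pool):
            acc += size
            if r <= acc:  # unit's share of the cumulative MOS covers r
                chosen.append(unit)
                pool.pop(idx)
                break
    return chosen
```

Running the draw per stratum (province, then health region) mirrors the two-level stratification described above.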

  17. BioMates - Bio-oil from ablative fast pyrolysis: Identifiers for WP3-...

    • data.europa.eu
    • data-staging.niaid.nih.gov
    • +1more
    unknown
    Updated Feb 24, 2022
    Cite
    Zenodo (2022). BioMates - Bio-oil from ablative fast pyrolysis: Identifiers for WP3- samples and blends [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6223154?locale=en
    Explore at:
    unknown(1910757)Available download formats
    Dataset updated
    Feb 24, 2022
    Dataset authored and provided by
    Zenodohttp://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ablative Fast Pyrolysis (AFP) is the first step in the BioMates concept for converting herbaceous biomass into a co-feed with reliable properties for conventional refineries (www.biomates.eu). The document provides the coding behind the identifiers used for samples and sample blends produced by RISE via AFP within the H2020 project BioMates.

  18. n

    Bio-optical Database of the Arctic Ocean

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 21, 2020
    Cite
    Kate Lewis; Gert van Dijken; Kevin Arrigo (2020). Bio-optical Database of the Arctic Ocean [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc17
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 21, 2020
    Dataset provided by
    Stanford University
    Authors
    Kate Lewis; Gert van Dijken; Kevin Arrigo
    License

    CC0 1.0 Universal (CC0 1.0)https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Arctic Ocean
    Description

    The Arctic bio-optical database assembles a diverse suite of biological and optical data from 34 expeditions throughout the Arctic Ocean. Data were combined into a single AO database following the OBPG criteria (Pegau et al. 2003), as was done in the development of the global NASA Bio-optical Marine Algorithm Data Set (NOMAD) (Werdell 2005, Werdell & Bailey 2005). This Arctic database combines coincident in situ observations of inherent optical properties (IOPs), apparent optical properties (AOPs), Chl a, environmental data (e.g. temperature, salinity) and station metadata (e.g. sampling depth, latitude, longitude, date). Data were acquired from the NASA SeaWiFS Bio-optical Archive and Storage System (SeaBASS, https://seabass.gsfc.nasa.gov/), the LEFE CYBER database (http://www.obs-vlfr.fr/proof/index2.php), the Data and Sample Research System for Whole Cruise Information in JAMSTEC (DARWIN, http://www.godac.jamstec.go.jp), NOMAD, and individual contributors. To ensure consistency, data were limited to those collected using OBPG-defined protocols (Pegau et al. 2003), and only observations shallower than 30 m were included. For spectral parameters, we included data at the following wavelengths, which are used by satellite sensors and are thus relevant for ocean color algorithm evaluation: 412, 443, 469, 488, 490, 510, 531, 547, 555, 645, 667, 670 and 678 nm. In situ measurements were binned at the same station if they were within 8 hours and 1° of distance (Werdell & Bailey 2005). For regional analyses, each station was assigned to one of ten sub-regions and three functional shelf types (Carmack et al. 2006).
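The 8-hour / 1° station-binning rule can be sketched as a greedy grouping pass. This is a simplified reading of the Werdell & Bailey (2005) merging criterion; the record field names and the per-coordinate interpretation of "1° of distance" are assumptions:

```python
def bin_stations(records, max_hours=8, max_deg=1.0):
    """Greedily group measurement records into 'stations': a record joins an
    existing bin if it falls within max_hours and max_deg (in both latitude
    and longitude) of the bin's first record; otherwise it starts a new bin.
    Each record is a dict with 'time' (datetime), 'lat' and 'lon' keys."""
    bins = []
    for rec in sorted(records, key=lambda r: r["time"]):
        for b in bins:
            ref = b[0]
            close_in_time = abs((rec["time"] - ref["time"]).total_seconds()) <= max_hours * 3600
            close_in_space = (abs(rec["lat"] - ref["lat"]) <= max_deg
                              and abs(rec["lon"] - ref["lon"]) <= max_deg)
            if close_in_time and close_in_space:
                b.append(rec)
                break
        else:  # no existing bin matched
            bins.append([rec])
    return bins
```

After binning, the coincident IOP, AOP and Chl a observations within each bin can be averaged to one value per station.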

    Methods This bio-optical database was assembled using in situ measurements from cruises throughout the Arctic Ocean based on the methods in Werdell 2005 and Werdell & Bailey 2005. Please see [my eventual paper citation/DOI] for full details on methods and data source.

    Werdell, P. 2005. An evaluation of inherent optical property data for inclusion in the NASA Bio‐optical Marine Algorithm Data Set, NASA Ocean Biology Processing Group paper, NASA Goddard Space Flight Cent., Greenbelt, Md.

    Werdell, P.J. and S.W. Bailey. 2005. An improved bio-optical data set for ocean color algorithm development and satellite data product validation. Remote Sens. Environ. 98: 122-140.

  19. Capillary rise through bio-stabilized rammed earth samples data

    • zenodo.org
    bin
    Updated Mar 20, 2025
    Cite
    Esther Machlein; Esther Machlein (2025). Capillary rise through bio-stabilized rammed earth samples data [Dataset]. http://doi.org/10.5281/zenodo.15056117
    Explore at:
    binAvailable download formats
    Dataset updated
    Mar 20, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Esther Machlein; Esther Machlein
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mass intake for capillary rise and compression test data (in French).

    Rammed earth samples label:

    REF: reference (unstabilized)

    W: rammed earth with wool stabilizer

    L: rammed earth with lignin sulphonate stabilizer

    T: rammed earth with tannin stabilizer

  20. Data from:...

    • osdr.nasa.gov
    • s.cnmilf.com
    • +2more
    Updated Jul 21, 2025
    Cite
    Jonathan Galazka; Ruth Globus (2025). Rodent-Research-1-RR1-NASA-Validation-Flight-Mouse-kidney-transcriptomic-proteomic-and-epigenomic-data [Dataset]. https://osdr.nasa.gov/bio/repo/data/studies/OSD-102
    Explore at:
    Dataset updated
    Jul 21, 2025
    Dataset provided by
    NASAhttp://nasa.gov/
    Authors
    Jonathan Galazka; Ruth Globus
    License

    Attribution 1.0 (CC BY 1.0)https://creativecommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    NASA's Rodent Research (RR) project is playing a critical role in advancing biomedical research on the physiological effects of space environments. Because resources for conducting biological experiments aboard the International Space Station (ISS) are limited, it is imperative to use crew time efficiently while maximizing high-quality science return. NASA's GeneLab project has as its primary objectives to 1) further increase the value of these experiments using a multi-omics, systems biology-based approach, and 2) disseminate these data without restriction to the scientific community. The current investigation assessed the viability of RNA, DNA, and protein extracted from archived RR-1 tissue samples for epigenomic, transcriptomic, and proteomic assays. During the first RR spaceflight experiment, a variety of tissue types were harvested from subjects, snap-frozen or RNAlater-preserved, and then stored for at least a year at -80 °C after return to Earth. They were then prioritized for this investigation based on the likelihood of significant scientific value for spaceflight research. All tissues were made available to GeneLab through the bio-specimen sharing program managed by the Ames Life Science Data Archive and included mouse adrenal glands, quadriceps, gastrocnemius, tibialis anterior, extensor digitorum longus, soleus, eye, and kidney. We report here the protocols for and results of these tissue extractions, and thus the feasibility and value of these kinds of omics analyses. In addition to providing further opportunities for investigating spaceflight effects on the mouse transcriptome and proteome in new kinds of tissues, our results may also be of value to program managers for the prioritization of ISS crew time for rodent research activities.

Cite
Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset

English Wikipedia People Dataset

Biographical Data for People on English Wikipedia

Explore at:
zip(4293465577 bytes)Available download formats
Dataset updated
Jul 31, 2025
Dataset provided by
Wikimedia Foundationhttp://www.wikimedia.org/
Authors
Wikimedia
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Summary

This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

The beta sample dataset is a subset of the Structured Contents Snapshot, focusing on people with infoboxes on English Wikipedia; it is output as JSON files (compressed in a tar.gz archive).

We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

Data Structure

  • File name: wme_people_infobox.tar.gz
  • Size of compressed file: 4.12 GB
  • Size of uncompressed file: 21.28 GB

Noteworthy Included Fields:

  • name - title of the article.
  • identifier - ID of the article.
  • image - main image representing the article's subject.
  • description - one-sentence description of the article for quick reference.
  • abstract - lead section, summarizing what the article is about.
  • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
  • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables, and references or similar non-prose sections.
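Since the archive ships as JSON files inside a tar.gz, records can be streamed without unpacking the full 21 GB to disk. A minimal sketch follows, assuming each archive member is a JSON-lines file carrying the fields listed above (the per-member layout is an assumption; check the Data Dictionary for the authoritative schema):

```python
import json
import tarfile

def iter_people(path="wme_people_infobox.tar.gz"):
    """Yield one person record at a time (a dict with name, identifier,
    description, abstract, infoboxes, sections, ...), streaming straight
    from the compressed archive."""
    with tarfile.open(path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fh = tar.extractfile(member)
            for line in fh:
                if line.strip():
                    yield json.loads(line)
```

For example, `next(iter_people())["name"]` would return the title of the first article in the archive.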

The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.

Stats

Infoboxes only:

  • Compressed: 2 GB
  • Uncompressed: 11 GB

Infoboxes + sections + short description:

  • Compressed: 4.12 GB
  • Uncompressed: 21.28 GB

Article analysis and filtering breakdown:

  • Total # of articles analyzed: 6,940,949
  • # of people found with QID: 1,778,226
  • # of people found with Category: 158,996
  • # of people found with Biography Project: 76,150
  • Total # of people articles found: 2,013,372
  • Total # of people articles with infoboxes: 1,559,985

Of the 1,559,985 people articles in this dataset:

  • 1,416,701 have a short description
  • 1,559,985 have an infobox
  • 1,559,921 have article sections

This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

Maintenance and Support

This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

Who are the source language producers?

Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia (https://en.wikipedia.org/), written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
