85 datasets found

English Wikipedia People Dataset
kaggle.com
zip
Updated Jul 31, 2025
Cite
Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
Explore at:
Available download formats: zip (4293465577 bytes)
Dataset updated
Jul 31, 2025
Dataset provided by
Wikimedia Foundation (http://www.wikimedia.org/)
Authors
Wikimedia
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Summary

This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in English Wikipedia; it is output as JSON files (compressed in tar.gz).

We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

Data Structure

File name: wme_people_infobox.tar.gz

Size of compressed file: 4.12 GB

Size of uncompressed file: 21.28 GB

Noteworthy Included Fields:
- name - title of the article.
- identifier - ID of the article.
- image - main image representing the article's subject.
- description - one-sentence description of the article for quick reference.
- abstract - lead section, summarizing what the article is about.
- infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
- sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
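Because the archive is 21.28 GB uncompressed, it can help to stream records rather than unpack everything first. The following is a minimal sketch, not an official loader. It assumes each file inside wme_people_infobox.tar.gz holds one JSON object per line (the description only says "json files"), and the helper names `iter_people` and `summarize` are hypothetical.

```python
import io
import json
import tarfile
from typing import Iterator


def iter_people(archive_path: str) -> Iterator[dict]:
    """Yield one article record at a time without unpacking the archive to disk."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            # Assumption: each member stores one JSON object per line.
            for raw_line in io.TextIOWrapper(handle, encoding="utf-8"):
                raw_line = raw_line.strip()
                if raw_line:
                    yield json.loads(raw_line)


def summarize(record: dict) -> dict:
    """Keep a few of the fields listed above; missing fields become None."""
    return {
        "name": record.get("name"),
        "identifier": record.get("identifier"),
        "description": record.get("description"),
        "has_infobox": bool(record.get("infoboxes")),
    }
```

If the members turn out to be single JSON documents rather than line-delimited records, the inner loop would need to be replaced with one `json.load(handle)` call per member.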

Stats

Infoboxes:
- Compressed: 2GB
- Uncompressed: 11GB

Infoboxes + sections + short description:
- Size of compressed file: 4.12 GB
- Size of uncompressed file: 21.28 GB

Article analysis and filtering breakdown:
- total # of articles analyzed: 6,940,949
- # people found with QID: 1,778,226
- # people found with Category: 158,996
- # people found with Biography Project: 76,150
- Total # of people articles found: 2,013,372
- Total # people articles with infoboxes: 1,559,985

End stats:
- Total number of people articles in this dataset: 1,559,985
- that have a short description: 1,416,701
- that have an infobox: 1,559,985
- that have article sections: 1,559,921

This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.
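The counts in the filtering breakdown can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. Reading the three detection routes as disjoint is an assumption, although the totals add up exactly under it; the variable names are illustrative.

```python
# Counts copied from the filtering breakdown above.
found_by_qid = 1_778_226
found_by_category = 158_996
found_by_biography_project = 76_150

# Assumption: the three routes are counted disjointly, so they simply add up.
total_people = found_by_qid + found_by_category + found_by_biography_project

# People detected without a Wikidata "instance of: human" tag.
not_tagged_on_wikidata = found_by_category + found_by_biography_project

people_with_infobox = 1_559_985
with_short_description = 1_416_701
with_sections = 1_559_921

infobox_share = people_with_infobox / total_people
short_description_share = with_short_description / people_with_infobox
missing_sections = people_with_infobox - with_sections

print(f"total people articles: {total_people:,}")
print(f"not tagged on Wikidata: {not_tagged_on_wikidata:,}")
print(f"share with infobox: {infobox_share:.1%}")
print(f"share with short description: {short_description_share:.1%}")
print(f"articles without parsed sections: {missing_sections}")
```

The two sums reproduce the stated totals of 2,013,372 people articles and 235,146 articles not yet tagged on Wikidata.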

Maintenance and Support

This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

Who are the source language producers?

Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community; the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English-language edition of Wikipedia (https://en.wikipedia.org/), written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
Wikipedia Biographies Text Generation Dataset
kaggle.com
zip
Updated Dec 3, 2023
Cite
The Devastator (2023). Wikipedia Biographies Text Generation Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/wikipedia-biographies-text-generation-dataset/code
Explore at:
Available download formats: zip (269983242 bytes)
Dataset updated
Dec 3, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Wikipedia Biographies Text Generation Dataset

Wikipedia Biographies: Infobox and First Paragraphs Texts

By wiki_bio (From Huggingface) [source]

About this dataset

The dataset contains several key columns: input_text and target_text. The input_text column includes the infobox and first paragraph of a Wikipedia biography, providing essential information about the individual's background, accomplishments, and notable features. The target_text column consists of the complete biography text extracted from the corresponding Wikipedia page.

In order to facilitate model training and validation, the dataset is divided into three main files: train.csv, val.csv, and test.csv. The train.csv file contains pairs of input text and target text for model training. It serves as a fundamental resource to develop accurate language generation models by providing abundant examples for learning to generate coherent biographical texts.

The val.csv file provides further validation data consisting of additional Wikipedia biographies with their corresponding infoboxes and first paragraphs. This subset allows researchers to evaluate their trained models' performance on unseen examples during development or fine-tuning stages.

Finally, the test.csv file offers a separate set of input texts paired with corresponding target texts for generating complete biographies using pre-trained models or newly developed algorithms. The purpose of this file is to benchmark system performance on unseen data in order to assess generalization capabilities.

This extended description aims to provide an informative overview of the dataset structure, its intended use cases in natural language processing research tasks such as text generation or summarization. Researchers can leverage this comprehensive collection to advance various applications in automatic biography writing systems or content generation tasks that require coherent textual output based on provided partial information extracted from an infobox or initial paragraph sources from online encyclopedias like Wikipedia

How to use the dataset

Overview:

This dataset consists of biographical information from Wikipedia pages, specifically the infobox and the first paragraph of each biography.

The dataset is provided in three separate files: train.csv, val.csv, and test.csv.

Each file contains pairs of input text and target text.

File Descriptions:

train.csv: This file is used for training purposes. It includes pairs of input text (infobox and first paragraph) and target text (complete biography).

val.csv: This file is used for validation. It contains a collection of biographies with infobox and first paragraph texts.

test.csv: This file can be used to generate complete biographies based on the given input texts.

Column Information:

a) For train.csv:

input_text: Input text column containing the infobox and first paragraph of a Wikipedia biography.

target_text: Target text column containing the complete biography text for each entry.

b) For val.csv:

input_text: Infobox and first paragraph texts are included in this column.

target_text: Complete biography texts are present in this column.

c) For test.csv: The columns follow the pattern mentioned previously, i.e., input_text followed by target_text.
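Since all three files share the same two columns, one small reader can serve train.csv, val.csv, and test.csv. This is a minimal sketch using only the standard library; the function name `iter_pairs` is hypothetical, and the raised field size limit is a precaution on the assumption that some biographies exceed the csv module's default limit.

```python
import csv
from typing import Iterator, Tuple

# Assumption: some rows are long, so raise the default per-field limit.
csv.field_size_limit(2**31 - 1)


def iter_pairs(csv_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (input_text, target_text) pairs from one of the dataset files."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield row["input_text"], row["target_text"]
```

A typical call would be `for source, target in iter_pairs("train.csv"): ...`, with the path adjusted to wherever the files were downloaded.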

Usage Guidelines:

Training Model or Algorithm Development: If you are working on training a model or developing an algorithm for generating complete biographies from given inputs, it is recommended to use train.csv as your primary dataset.

Model Validation or Evaluation: To validate or evaluate your trained model, you can use val.csv as an independent dataset. This dataset contains biographies that have been withheld from the training data.

Generating Biographies with Trained Models: To generate complete biographies using your trained model, you can make use of test.csv. This dataset provides input texts for which you need to generate the corresponding target texts.

Additional Information and Tips:

The input text in this dataset includes both an infobox (a structured section containing key-value pairs) and the first paragraph of a Wikipedia biography.

The target text is the complete biography for each entry.

While working with this dataset, make sure to preprocess and

Research Ideas

Text Generation: The dataset can be used to train language models to generate complete Wikipedia biographies given only the infobox and first paragraph ...
Biological Samples and Associated Data
specie.bio
Updated May 6, 2025
Cite
Specie Bio Provider Network (2025). Biological Samples and Associated Data [Dataset]. https://specie.bio/providers
Explore at:
Dataset updated
May 6, 2025
Dataset provided by
Specie Bio, Inc
Authors
Specie Bio Provider Network
Description
A collection of diverse human biospecimens and their associated clinical and molecular data, available for research purposes through the Specie Bio BioExchange platform. This dataset is contributed by a network of biobanks, academic medical centers, and other research institutions.
Data from: S7 Fig -
plos.figshare.com
zip
Updated Jun 13, 2023
Cite
Isabella Lucia Chiara Mariani Wigley; Massimiliano Pastore; Eleonora Mascheroni; Marta Tremolada; Sabrina Bonichini; Rosario Montirosso (2023). S7 Fig - [Dataset]. http://doi.org/10.1371/journal.pone.0274477.s007
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1371/journal.pone.0274477.s007
Dataset updated
Jun 13, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Isabella Lucia Chiara Mariani Wigley; Massimiliano Pastore; Eleonora Mascheroni; Marta Tremolada; Sabrina Bonichini; Rosario Montirosso
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
a. Likelihood Distance (LD) for each observation in the Calibration sample. Each point represents the LD when the observation is deleted from the sample. Here we evaluate case influence, which refers to the impact of a case on study results quantified by detection statistics. This approach compares the solutions obtained from the original sample with those obtained from the sample excluding case i, where i represents each case in turn. A way to evaluate case influence in SEM is the Likelihood Distance (LDi) (1). Specifically, it evaluates the influence of a case on the global fit of the model. The higher the value of LDi, the greater the influence on the global fit of the model. In the present study LDi was evaluated with respect to the TBQ four-factors model tested in the Calibration data sample (first step). The graph below highlights the absence of cases with a significant influence on the global fit of the model.

b. The CFI difference (ΔCFI) for each observation in the Calibration sample. We also evaluated the influence of each case on the global fit of the model by computing the CFI difference (ΔCFI). This measure highlights the magnitude and the direction of influence. Thus, positive values of ΔCFI indicate that by removing case i the model is improved, while negative values indicate the opposite. As seen in the graph below, no influential cases were detected. Again, each point represents the ΔCFI when the observation is deleted from the sample.

c. The Generalized Cook's Distance for each observation in the Calibration sample. Each point represents the Generalized Cook's Distance when the observation is deleted from the sample. We used Generalized Cook's Distance in order to evaluate the influence of a case on the parameter estimates of our model. As seen from the graph below, by removing case i parameter estimates did not change significantly. (ZIP)
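The leave-one-out logic behind the Likelihood Distance can be illustrated on a much simpler model than the SEM used in the study. The sketch below swaps in a univariate normal model as a stand-in and assumes the commonly used definition LD_i = 2[l(theta_hat) - l(theta_hat_(i))], where both log-likelihoods are evaluated on the full sample and theta_hat_(i) is the estimate obtained without case i. The description above does not state this formula, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def fit_normal(sample: Sequence[float]) -> Tuple[float, float]:
    """Maximum-likelihood mean and variance of a univariate normal model."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / n
    return mean, variance


def log_likelihood(sample: Sequence[float], mean: float, variance: float) -> float:
    """Normal log-likelihood of the whole sample under the given parameters."""
    n = len(sample)
    squared = sum((x - mean) ** 2 for x in sample)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared / (2 * variance)


def likelihood_distances(sample: Sequence[float]) -> List[float]:
    """LD_i for every case: refit without case i, then score the full sample."""
    full_ll = log_likelihood(sample, *fit_normal(sample))
    distances = []
    for i in range(len(sample)):
        reduced = list(sample[:i]) + list(sample[i + 1:])
        reduced_fit = fit_normal(reduced)
        distances.append(2 * (full_ll - log_likelihood(sample, *reduced_fit)))
    return distances
```

With a sample such as `[1.0, 2.0, 3.0, 4.0, 100.0]`, the last case receives by far the largest distance, which is the pattern an influence plot is meant to expose.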
Bio World Photo Sample Export Import Data | Eximpedia
eximpedia.app
Updated Oct 30, 2025
+ more versions
Cite
(2025). Bio World Photo Sample Export Import Data | Eximpedia [Dataset]. https://www.eximpedia.app/companies/bio-world-photo-sample/03090638
Explore at:
Dataset updated
Oct 30, 2025
Description
Bio World Photo Sample Export Import Data. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
Leash-Bio-processed-dataset
kaggle.com
Updated May 26, 2024
Cite
hengck23 (2024). Leash-Bio-processed-dataset [Dataset]. https://www.kaggle.com/datasets/hengck23/leash-bio-processed-dataset
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 26, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
hengck23
Description
Processed dataset for https://www.kaggle.com/competitions/leash-BELKA.

For any b2z file, it is recommended to use a parallel bzip2 decompressor (https://github.com/mxmlnkn/indexed_bzip2) for speed.

Last update : 22-may-2024

In summary:

See forum discussion for details of [1],[2]: https://www.kaggle.com/competitions/leash-BELKA/discussion/492846

[1] reduced data

train.reduced.parquet : 98_415_610 training SMILES and their information

train.bind.npz : 98_415_610 x 3 target matrix

test.reduced.parquet : 878_022 test SMILES

all_buildingblock.csv: building blocks id used in train.reduced.parquet/test.reduced.parquet

fold0.parquet: train_share,valid_share,valid_nonshare splits for the experiments in the discussion

[2] extracted ECFP4 fingerprints

train.ecfp4.packed.npz : Features extracted using rdkit

AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048)

repack with np.packbits() to give 98_415_610 x 256 feature matrix

test.ecfp4.packed.npz : similarly processed for the test SMILES

This is somewhat obsolete as the competition progresses. ecfp6 gives better results and can be extracted fast with scikit-fingerprints.
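The shape of the packed feature matrix follows from the bit-packing arithmetic: 2048 fingerprint bits divided by 8 bits per byte gives 256 bytes per molecule. The sketch below is a pure-Python illustration of that step and is not the author's code. It mirrors the default most-significant-bit-first order of np.packbits, and the helper names `pack_bits` and `unpack_bits` are hypothetical.

```python
from typing import List


def pack_bits(bits: List[int]) -> bytes:
    """Pack 0/1 values into bytes, most significant bit first."""
    if len(bits) % 8:
        # np.packbits would zero-pad instead; this sketch keeps things strict.
        raise ValueError("bit count must be a multiple of 8 in this sketch")
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | (bit & 1)
        packed.append(byte)
    return bytes(packed)


def unpack_bits(packed: bytes) -> List[int]:
    """Inverse of pack_bits: expand every byte back into eight 0/1 values."""
    return [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]
```

When working with the actual .npz files, `np.unpackbits(features, axis=1)` would be the natural way to recover the 2048 columns, assuming the array was packed along the last axis.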

See forum discussion for details of [3]: https://www.kaggle.com/competitions/leash-BELKA/discussion/498858 https://www.kaggle.com/code/hengck23/lb6-02-graph-nn-example

[3] graph NN processed data

test/train-replace-c.smiles.bytestring.bz2 : replace linker [Dy] with C. Note that these are bytestrings and not strings.

train-replace-c-30m.graph.pickle.**.b2z : 98_415_610 molecule graphs split into 3 files. Test graphs are not provided as they can be generated on the fly.

See forum discussion for details of [4]: https://www.kaggle.com/competitions/leash-BELKA/discussion/505985 https://www.kaggle.com/code/hengck23/conforge-open-source-conformer-generator

[4] conformer. i.e. molecule estimated xyz data

test-replace-c.conforge.sdf.bz2 : conformer in sdf file. you can read the file using rdkit Chem.SDMolSupplier().

test-replace-c.conforge.status.parquet:

'status col' shows the status of the conformer. 0 means success. For failure cases, the sdf stores a dummy 'CC' molecule.

'idx col' shows the idx (primary key) to test.reduced.parquet. use this to retrieve SMILES strings. Note that conformer is based on test-replace-c.smiles.bytestring.bz2, i.e. [Dy] is replaced by C.

train-replace-c.sub-[split].conforge.sdf.bz2/status.parquet: similar format as described above. [split] are:

train: 1000250+(1001610*3) molecules

valid: 40000

nonshare: about 61674
Academic Year 1973-1974
datacatalogue.ukdataservice.ac.uk
Updated Jan 1, 1978
Cite
Markham, S., North East London Polytechnic; Sugarman, L., North East London Polytechnic (1978). Academic Year 1973-1974 [Dataset]. http://doi.org/10.5255/UKDA-SN-974-1
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-974-1
Dataset updated
Jan 1, 1978
Dataset provided by
UK Data Service (https://ukdataservice.ac.uk/)
Authors
Markham, S., North East London Polytechnic; Sugarman, L., North East London Polytechnic
Area covered
England
Description
To collect psychometric and biographical data which may enhance counselling and selection of students. A similar study of high school pupils is held as SN: 996.
Samples and data accessibility in research biobanks
zenodo.org
data.niaid.nih.gov
xls
Updated Jan 24, 2020
Cite
Marco Capocasa; Paolo Anagnostou; Flavio D'Abramo; Giulia Matteucci; Valentina Dominici; Giovanni Destro Bisol; Fabrizio Rufo (2020). Samples and data accessibility in research biobanks [Dataset]. http://doi.org/10.5281/zenodo.17098
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.5281/zenodo.17098
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Marco Capocasa; Paolo Anagnostou; Flavio D'Abramo; Giulia Matteucci; Valentina Dominici; Giovanni Destro Bisol; Fabrizio Rufo
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains answers to a questionnaire on modes of sample and data accessibility in research biobanks.
Bio-optical Database of the Arctic Ocean
data.niaid.nih.gov
datadryad.org
zip
Updated May 21, 2020
Cite
Kate Lewis; Gert van Dijken; Kevin Arrigo (2020). Bio-optical Database of the Arctic Ocean [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc17
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.cnp5hqc17
Dataset updated
May 21, 2020
Dataset provided by
Stanford University
Authors
Kate Lewis; Gert van Dijken; Kevin Arrigo
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
Arctic Ocean
Description
The Arctic bio-optical database assembles a diverse suite of biological and optical data from 34 expeditions throughout the Arctic Ocean. Data were combined into a single AO database following the OBPG criteria (Pegau et al. 2003), as was done in the development of the global NASA Bio-optical Marine Algorithm Data Set (NOMAD) (Werdell 2005, Werdell & Bailey 2005). This Arctic database combines coincident in situ observations of inherent optical properties (IOPs), apparent optical properties (AOPs), Chl a, environmental data (e.g. temperature, salinity) and station metadata (e.g. sampling depth, latitude, longitude, date). Data were acquired from the NASA SeaWiFS Bio-optical Archive and Storage System (SeaBASS, https://seabass.gsfc.nasa.gov/), the LEFE CYBER database (http://www.obs-vlfr.fr/proof/index2.php), the Data and Sample Research System for Whole Cruise Information in JAMSTEC (DARWIN, http://www.godac.jamstec.go.jp), NOMAD, and individual contributors. To ensure consistency, data were limited to those that were collected using OBPG-defined protocols (Pegau et al. 2003). Only observations shallower than 30 m were included. For spectral parameters, we included data at the following wavelengths, which are used by satellites and thus are relevant for ocean color algorithm evaluation: 412, 443, 469, 488, 490, 510, 531, 547, 555, 645, 667, 670 and 678 nm. In situ measurements were binned at the same station if measurements were within 8 hours and 1° of distance (Werdell & Bailey 2005). For regional analyses, each station was assigned to one of ten sub-regions and three functional shelf-types (Carmack et al. 2006).
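The inclusion and binning rules described above can be sketched as follows. This is an illustration rather than the authors' procedure. It assumes that "1° of distance" means both the latitude and the longitude differences are at most 1°, it ignores longitude wrap-around at ±180°, and it groups observations greedily against the first member of each station. The `Observation` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List

MAX_DEPTH_M = 30.0                  # only observations shallower than 30 m
MAX_TIME_GAP = timedelta(hours=8)   # binning window in time
MAX_DEGREES = 1.0                   # binning window in space (assumed per axis)
SATELLITE_WAVELENGTHS_NM = (412, 443, 469, 488, 490, 510, 531,
                            547, 555, 645, 667, 670, 678)


@dataclass
class Observation:
    time: datetime
    latitude: float
    longitude: float
    depth_m: float


def is_included(obs: Observation) -> bool:
    """Apply the depth criterion."""
    return obs.depth_m < MAX_DEPTH_M


def same_station(a: Observation, b: Observation) -> bool:
    """True when two observations fall within the time and space windows."""
    close_in_time = abs(a.time - b.time) <= MAX_TIME_GAP
    close_in_space = (abs(a.latitude - b.latitude) <= MAX_DEGREES
                      and abs(a.longitude - b.longitude) <= MAX_DEGREES)
    return close_in_time and close_in_space


def bin_stations(observations: Iterable[Observation]) -> List[List[Observation]]:
    """Group included observations into stations, in chronological order."""
    stations: List[List[Observation]] = []
    kept = sorted((o for o in observations if is_included(o)), key=lambda o: o.time)
    for obs in kept:
        for station in stations:
            if same_station(station[0], obs):
                station.append(obs)
                break
        else:
            stations.append([obs])
    return stations
```

Comparing against the first member of each station is a simplification; a centroid or a pairwise rule could give slightly different bins near the window edges.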

Methods: This bio-optical database was assembled using in situ measurements from cruises throughout the Arctic Ocean based on the methods in Werdell 2005 and Werdell & Bailey 2005. Please see [my eventual paper citation/DOI] for full details on methods and data source.

Werdell, P. 2005. An evaluation of inherent optical property data for inclusion in the NASA Bio‐optical Marine Algorithm Data Set, NASA Ocean Biology Processing Group paper, NASA Goddard Space Flight Cent., Greenbelt, Md.

Werdell, P.J. and S.W. Bailey. 2005. An improved bio-optical data set for ocean color algorithm development and satellite data product validation. Remote Sens. Environ. 98: 122-140.
BioMates - Bio-oil from ablative fast pyrolysis: Identifiers for WP3-...
data.europa.eu
data-staging.niaid.nih.gov
+1more
unknown
Updated Feb 24, 2022
Cite
Zenodo (2022). BioMates - Bio-oil from ablative fast pyrolysis: Identifiers for WP3- samples and blends [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6223154?locale=en
Explore at:
Available download formats: unknown (1910757)
Dataset updated
Feb 24, 2022
Dataset authored and provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ablative Fast Pyrolysis (AFP) is the first step in the BioMates-concept to convert herbaceous biomass into co-feed with reliable properties for conventional refineries (www.biomates.eu). The document provides the coding behind the identifiers used for samples and sample blends produced by RISE via AFP within the H2020-project BioMates.
Bio-optical Data from Chilean Coastal waters 2017 - 2020
data.csiro.au
researchdata.edu.au
Updated Dec 4, 2020
Cite
Lesley Clementson; Tim Malthus; Nagur Cherukuru; Joey Crosswell; Andy Steven; Patricio Bernal; Diego Ocampo Melgar; Bozena Wojtasiewicz; Elizabeth Brewer (2020). Bio-optical Data from Chilean Coastal waters 2017 - 2020 [Dataset]. http://doi.org/10.25919/qbbv-v359
Explore at:
Unique identifier
https://doi.org/10.25919/qbbv-v359
Dataset updated
Dec 4, 2020
Dataset provided by
CSIRO (http://www.csiro.au/)
Authors
Lesley Clementson; Tim Malthus; Nagur Cherukuru; Joey Crosswell; Andy Steven; Patricio Bernal; Diego Ocampo Melgar; Bozena Wojtasiewicz; Elizabeth Brewer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Oct 17, 2017 - May 8, 2020
Dataset funded by
CSIRO (http://www.csiro.au/)
Description
This is a collection of data consisting of pigment concentration and composition, particulate and dissolved absorption co-efficients and total suspended matter concentration. The data relates to samples collected in Chilean coastal waters where aquaculture is present. The data will be used to develop a local algorithm for retrieved satellite estimates of bio-optical parameters in the water column. Lineage: Water samples were taken on-board the vessel and stored under cool and dark conditions until filtering took place on land. Samples were analysed and QC procedures were carried out in the Bio-Analytical facility, CSIRO Marine Labs, Hobart. For pigment analysis, 4 litres of sample water was filtered through a 47 mm glass fibre filter (Whatman GF/F) and then stored in liquid nitrogen until analysis. To extract the pigments, the filters were cut into small pieces and covered with 100% acetone (3 mls) in a 10 ml centrifuge tube. The samples were vortexed for about 30 seconds and then sonicated for 15 minutes in the dark. The samples were then kept in the dark at 4 °C for approximately 15 hours. After this time 200 µL water was added to the acetone such that the extract mixture was 90:10 acetone:water (vol:vol) and sonicated once more for 15 minutes. The extracts were centrifuged to remove the filter paper and then filtered through a 0.2 µm membrane filter (Whatman, anatope) prior to analysis by HPLC using a Waters Alliance high performance liquid chromatography system, comprising a 2695XE separations module with column heater and refrigerated autosampler and a 2996 photo-diode array detector. Immediately prior to injection the sample extract was mixed with a buffer solution (90:10 28 mM tetrabutyl ammonium acetate, pH 6.5 : methanol) within the sample loop. Pigments were separated using a Zorbax Eclipse XDB-C8 stainless steel 150 mm x 4.6 mm ID column with 3.5 µm particle size (Agilent Technologies) with gradient elution as described in Van Heukelem and Thomas (2001). 
The separated pigments were detected at 436 nm and identified against standard spectra using Waters Empower software. Concentrations of chlorophyll a, chlorophyll b, β,β-carotene and β,ε-carotene in sample chromatograms were determined from standards (Sigma, USA or DHI, Denmark). For absorption coefficients: 4 litres of sample water was filtered through a 25 mm glass fibre filter (Whatman GF/F) and the filter was then stored flat in liquid nitrogen until analysis. Optical density spectra for total particulate matter were obtained using a Cintra 404 UV/VIS dual beam spectrophotometer equipped with an integrating sphere. For CDOM: water was filtered through a 0.22 Durapore filter on an all-glass filter unit. Optical density spectra were obtained using 10 cm cells in a Cintra 404 UV/VIS spectrophotometer with Milli-Q water as a reference. For TSM: determined by drying the filter at 60°C to constant weight; the filter may then be muffled at 450°C to burn off the organic fraction. The inorganic fraction is weighed and the organic fraction is determined as the difference between the SPM and the inorganic fraction.
Bio-optical data for Australian Inland Waters v.1
data.csiro.au
researchdata.edu.au
Updated May 6, 2022
Cite
Janet Anstee; Nathan Drayson; Hannelie Botha; Gemma Kerrisk; Stephen Sagar; Phillip Ford; Bozena Wojtasiewicz; Lesley Clementson; Guy Byrne (2022). Bio-optical data for Australian Inland Waters v.1 [Dataset]. http://doi.org/10.25919/rtd7-j815
Explore at:
Unique identifier
https://doi.org/10.25919/rtd7-j815
Dataset updated
May 6, 2022
Dataset provided by
CSIRO (http://www.csiro.au/)
Authors
Janet Anstee; Nathan Drayson; Hannelie Botha; Gemma Kerrisk; Stephen Sagar; Phillip Ford; Bozena Wojtasiewicz; Lesley Clementson; Guy Byrne
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2013 - Dec 14, 2021
Area covered

Dataset funded by
Geoscience Australia
CSIRO (http://www.csiro.au/)
Description
This collection is comprised of bio-optical measurements for a wide range of Australian inland waterbodies. The data was collected to describe the variation in bio-optical properties in Australian waterbodies. These data are able to be used for validation and development of inversion algorithms. Lineage: The data were collected using a combination of in situ measurements and laboratory analysis. See the readme file for details. The following data were obtained from laboratory analysis of in situ surface samples: Absorption, TSS, phytoplankton pigments, organic carbon. The following data were obtained from in situ measurements: Backscattering, radiometric measurements. Absorption - Laboratory analysis of in situ surface samples Backscattering - in situ surface measurements TSS -
BIO/DFO CTD Data
ceotr.ocean.dal.ca
Updated Sep 3, 2025
Cite
DFO (2025). BIO/DFO CTD Data [Dataset]. https://ceotr.ocean.dal.ca/erddap/info/bio_ctd_public_bio_ctd_data/index.html
Explore at:
Dataset updated
Sep 3, 2025
Dataset authored and provided by
DFO
Variables measured
id, geom, time, depth, latitude, oxygen_1, oxygen_2, longitude, station_id, station_name, and 9 more
Description
Conductivity, Temperature, Depth (CTD) data gathered by Department of Fisheries and Oceans (DFO). Mainly gathered from the Bedford Basin.

cdm_data_type=Profile
cdm_profile_variables=time
Conventions=COARDS, CF-1.6, ACDD-1.3
featureType=Profile
geospatial_lat_units=degrees_north
geospatial_lon_units=degrees_east
geospatial_vertical_positive=down
geospatial_vertical_units=m
infoUrl=http://www.bio.gc.ca/science/monitoring-monitorage/bbmp-pobb/bbmp-pobb-en.php
institution=DFO
keywords_vocabulary=GCMD Science Keywords
sourceUrl=(source database)
standard_name_vocabulary=CF Standard Name Table v29
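ERDDAP servers typically expose tabular datasets through a tabledap endpoint whose URL encodes the requested variables and constraints. The sketch below only builds such a URL and does not contact the server. That this dataset is served via tabledap is an assumption drawn from the info URL in the citation, and the function name `tabledap_url` is hypothetical.

```python
from typing import Iterable
from urllib.parse import quote

# Base URL and dataset ID taken from the info link in the citation above.
ERDDAP_BASE = "https://ceotr.ocean.dal.ca/erddap"
DATASET_ID = "bio_ctd_public_bio_ctd_data"


def tabledap_url(variables: Iterable[str],
                 constraints: Iterable[str] = (),
                 file_type: str = "csv") -> str:
    """Build a tabledap request URL: variables first, then &-joined constraints."""
    query = ",".join(variables)
    for constraint in constraints:
        # Percent-encode comparison operators such as < and > but keep "=".
        query += "&" + quote(constraint, safe="=")
    return f"{ERDDAP_BASE}/tabledap/{DATASET_ID}.{file_type}?{query}"


if __name__ == "__main__":
    print(tabledap_url(["time", "depth", "latitude", "longitude"], ["depth<=10"]))
```

The variable names used in the example come from the "Variables measured" list; any others would need to be checked against the dataset's info page.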
S3 Table -
plos.figshare.com
datasetcatalog.nlm.nih.gov
zip
Updated Jun 16, 2023
Cite
Isabella Lucia Chiara Mariani Wigley; Massimiliano Pastore; Eleonora Mascheroni; Marta Tremolada; Sabrina Bonichini; Rosario Montirosso (2023). S3 Table - [Dataset]. http://doi.org/10.1371/journal.pone.0274477.s012
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1371/journal.pone.0274477.s012
Dataset updated
Jun 16, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Isabella Lucia Chiara Mariani Wigley; Massimiliano Pastore; Eleonora Mascheroni; Marta Tremolada; Sabrina Bonichini; Rosario Montirosso
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
a. Fit indices obtained by imputing missing data with the full information maximum likelihood approach (FIML) in the Calibration Sample. CFI = comparative fit index; NNFI = Tucker–Lewis index; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; TCD = total coefficient of determination.

b. Factor loadings obtained with data imputation in the Calibration Sample. f1 = Childhood/Adolescent Touch Experience; f2 = Comfort with Interpersonal Touch; f3 = Fondness for Interpersonal Touch; f4 = Adult Touch Experience. (ZIP)
U133A_combat.h5
figshare.com
hdf
Updated Jun 2, 2023
Cite
Christine Staiger (2023). U133A_combat.h5 [Dataset]. http://doi.org/10.6084/m9.figshare.3119248.v1
Explore at:
Available download formats: hdf
Unique identifier
https://doi.org/10.6084/m9.figshare.3119248.v1
Dataset updated
Jun 2, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Christine Staiger
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We compiled a large cohort of breast cancer samples from NCBI's Gene Expression Omnibus (GEO) (see Table 1) as suggested in (Györffy and Schäfer, 2009). We only took samples from the U133A platform into account and removed duplicate samples, that is, samples that occur in several studies under the same GEO id. Array quality checks were executed for all samples belonging to the same study by the R package arrayQualityMetrics. Due to the high memory demands of this package, studies containing more than 400 samples had to be divided into two parts. Samples that were classified as outliers in the RLE or NUSE analysis were discarded. Finally, all samples across all studies were normalized together using R's justRMA function, yielding for each sample and each probe a log(intensity) value. This normalization also included a quantile normalization step. Subsequently, probe intensities were mean centered, yielding for each sample and each probe $p$ a $\log\left(\text{intensity} / \mu(\text{intensity}_p)\right)$ value.

We found batch effects within single studies, where samples have been collected from different locations, and batch effects between studies. Specifically for breast cancer, samples also form batches according to the five subtypes of breast cancer: luminal A, luminal B, Her2 enriched, normal like and basal like. To account for these effects we employed R's combat, where the cancer subtype was modeled as an additional covariate to maintain the variance associated with the subtypes. To do so we needed to stratify the patients according to the subtype. Since this variable is not always available in the annotation of the patients, we predicted the subtype employing the PAM50 marker genes as documented in R's genefu package.

Principal component analysis of the batch corrected data revealed pairs of samples with a very high correlation (>0.9). Those pairs were regarded as replicate samples. For each pair of replicate samples one sample was removed randomly.
Affymetrix probe IDs were mapped to Entrez Gene IDs via the mapping files provided by Affymetrix. Only probes that mapped to exactly one Gene ID were taken into account and probes starting with AFFX were discarded. If an Entrez Gene ID mapped to several Affymetrix probe IDs, probes were considered in the following order according to their suffix (Gohlmann and Talloen, 2010): “_at,” “s_at,” “x_at,” “i_at,” and “a_at.” When there were still several probes valid for one Gene ID, the Affymetrix probe with the higher variance of expression values was chosen.
The patients' class labels corresponding to recurrence free or distant metastasis free survival were calculated with respect to a 5-year threshold. The final cohort is shown in Table 1. We derived two data sets: one labeled according to recurrence free survival (RFS) and one labeled according to distant metastasis free survival (DMFS). Note that the DMFS data set is a subset of the RFS data set.
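The probe selection rule described above (suffix priority first, then highest variance) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code: the function names, the input layout (probe ID mapped to a list of expression values for one gene), and the treatment of suffixes not named in the description (ranked last) are all hypothetical.

```python
import re
from statistics import pvariance

# Suffix priority from the description: "_at", "s_at", "x_at", "i_at", "a_at".
# The key is the single letter before "_at" (None means a plain "_at" probe).
SUFFIX_PRIORITY = {None: 0, "s": 1, "x": 2, "i": 3, "a": 4}
# Assumption: suffixes that the description does not list are ranked last.
UNLISTED_RANK = len(SUFFIX_PRIORITY)
_SUFFIX_RE = re.compile(r"_(?:([a-z])_)?at$")


def suffix_rank(probe_id):
    """Return the priority rank of a probe ID (lower is preferred)."""
    match = _SUFFIX_RE.search(probe_id)
    if match is None:
        raise ValueError(f"not an '_at' style probe ID: {probe_id!r}")
    return SUFFIX_PRIORITY.get(match.group(1), UNLISTED_RANK)


def select_probe(expression_by_probe):
    """Pick one probe for a gene: best suffix rank, then highest variance.

    expression_by_probe maps probe ID -> expression values across samples
    for all probes that map to the same Entrez Gene ID.
    """
    # Control probes starting with AFFX are discarded.
    candidates = {
        probe: values
        for probe, values in expression_by_probe.items()
        if not probe.startswith("AFFX")
    }
    if not candidates:
        raise ValueError("no usable probes for this gene")
    best_rank = min(suffix_rank(probe) for probe in candidates)
    tied = [p for p in candidates if suffix_rank(p) == best_rank]
    # Break remaining ties by the variance of the expression values.
    return max(tied, key=lambda p: pvariance(candidates[p]))


# Hypothetical probe IDs and values, for illustration only.
example = {
    "200000_s_at": [1.0, 5.0, 9.0],
    "200001_at": [2.0, 2.1, 2.2],
    "200002_at": [1.0, 3.0, 5.0],
}
chosen = select_probe(example)  # "200002_at": plain "_at" with the higher variance
```

The step that keeps only probes mapping to exactly one Gene ID is assumed to have happened before this function is called.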
Validation-of-Methods-to-Assess-the-Immunoglobulin-Gene-Repertoire-in-Tissues-Obtained-from-Mice-on-the-International-Space-Station...
osdr.nasa.gov
catalog.data.gov
Updated Oct 18, 2024
Cite
Stephen Chapes (2024). Validation-of-Methods-to-Assess-the-Immunoglobulin-Gene-Repertoire-in-Tissues-Obtained-from-Mice-on-the-International-Space-Station [Dataset]. https://osdr.nasa.gov/bio/repo/data/studies/OSD-141
Dataset updated
Oct 18, 2024
Dataset provided by
NASA (http://nasa.gov/)
Authors
Stephen Chapes
License
Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Description
Spaceflight is known to affect immune cell populations. In particular, splenic B-cell numbers decrease during spaceflight and in ground-based physiological models. Although antibody isotype changes have been assessed during and after spaceflight, an extensive characterization of the impact of spaceflight on antibody composition has not been conducted in mice. Next Generation Sequencing and bioinformatic tools are now available to assess antibody repertoires. We can now identify immunoglobulin gene-segment usage, junctional regions, and modifications that contribute to specificity and diversity. Due to limitations on the International Space Station, alternate sample collection and storage methods must be employed. Our group compared Illumina MiSeq sequencing data from multiple sample preparation methods in normal C57Bl/6J mice to validate that sample preparation and storage would not bias the outcome of antibody repertoire characterization. In this report, we also compared sequencing techniques and a bioinformatic workflow on the data output when we assessed the IgH and Igκ variable gene usage. Our bioinformatic workflow has been optimized for Illumina HiSeq and MiSeq datasets, and is designed specifically to reduce bias, capture the most information from Ig sequences, and produce a data set that provides other data mining options.
Capillary rise through bio-stabilized rammed earth samples data
zenodo.org
bin
Updated Mar 20, 2025
Cite
Esther Machlein; Esther Machlein (2025). Capillary rise through bio-stabilized rammed earth samples data [Dataset]. http://doi.org/10.5281/zenodo.15056117
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.15056117
Dataset updated
Mar 20, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Esther Machlein; Esther Machlein
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Mass intake for capillary rise and compression test data (in French).

Rammed earth samples label:

REF: reference (unstabilized)

W: rammed earth with wool stabilizer

L: rammed earth with lignin sulphonate stabilizer

T: rammed earth with tannin stabilizer
Bacterial and archael 16S rRNA sequences and taxonomic summary tables for...
catalog.data.gov
s.cnmilf.com
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Bacterial and archael 16S rRNA sequences and taxonomic summary tables for biofilm samples from the bio-reactors [Dataset]. https://catalog.data.gov/dataset/bacterial-and-archael-16s-rrna-sequences-and-taxonomic-summary-tables-for-biofilm-samples-
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
A biofilm anode acclimated with acetate, acetate+methane, and methane growth media for over three years produced a steady current density of 1.6-2.3 mA/m^2 in a microbial electrochemical cell (MxC) fed with methane as the sole electron donor. Geobacter was the dominant genus for the bacterial domain (93%) in the biofilm anode, while methanogens (Methanocorpusculum labreanum and Methanosaeta concilii) accounted for 82% of the total archaeal clones in the biofilm. A fluorescence in situ hybridization (FISH) image clearly showed a biofilm of bacteria and archaea, supporting a syntrophic interaction between them for performing anaerobic oxidation of methane (AOM) in the biofilm anode. Measured cumulative coulombs correlated linearly to the methane-gas concentration in the range of 10% to 99.97% (R^2 ≥ 0.99) when the measurement was sustained for at least 50 min. Thus, cumulative coulombs over 50 min. could be used to quantify the methane concentration in gas samples. This dataset is associated with the following publication: Gao, Y., H. Ryu, B. Rittmann, A. Hussain, and H. Lee. Quantification of the methane concentration using anaerobic oxidation of methane coupled to extracellular electron transfer. Bioresource Technology. Elsevier Online, New York, NY, USA, 241: 979-984, (2017).
Center for Bio-Image Informatics
neuinfo.org
scicrunch.org
Cite
Center for Bio-Image Informatics [Dataset]. http://identifiers.org/RRID:SCR_001949
Unique identifier
https://identifiers.org/RRID:SCR_001949
Description
The Center for Bio-Image Informatics is an interdisciplinary research effort between Biology, Computer Science, Statistics, Multimedia and Engineering. The overarching goal of the center is the advancement of human knowledge of the complex biological processes which occur at both cellular and sub-cellular levels. The center employs and develops cutting edge techniques in the fields of imaging, pattern recognition and data mining. Research also focuses on development of new information processing techniques which can afford us a better understanding of biological processes depicted in microscopy images of cells and tissues, specifically on the distributions of biological molecules within these samples. This is achieved by borrowing methods for information processing at the sensor level to enable high speed and super-resolution imaging. By applying pattern recognition and data mining methods to bio-molecular images, full automation of both the extraction of information and the construction of statistically-sound models of the processes depicted in those images was possible. At the heart of the center's research is the BISQUE system, an online repository for multidimensional bio-images, and testbed for new research techniques and methods. BISQUE: Online Semantic Query User Environment is an online database for managing up to 5-dimensional scientific images with associated metadata and a flexible, collaborative tagging system. Currently the system has more than 85,000 user-provided tags and 128006 2-D planes from over 6,000 biological images. BISQUE is much more than just a repository for scientific images; the system provides resources for complex scientific analysis over images, result visualization, user-extensible modules, customized organization of images, advanced search features, graphical annotations, textual annotations and compatible client-side applications. Sponsors: This work is supported in part by NSF infrastructure awards No. EIA-0080134 and IIS-0808772.
Data for the Computational Linguistics and Clinical Psychology Shared Task,...
datacatalogue.ukdataservice.ac.uk
Updated Dec 3, 2020
Cite
UK Data Service (2020). Data for the Computational Linguistics and Clinical Psychology Shared Task, 2018 [Dataset]. http://doi.org/10.5255/UKDA-SN-8471-1
Unique identifier
https://doi.org/10.5255/UKDA-SN-8471-1
Dataset updated
Dec 3, 2020
Dataset provided by
UK Data Service (https://ukdataservice.ac.uk/)
Time period covered
Jan 1, 1969 - Dec 31, 2008
Area covered
United Kingdom
Description
The National Child Development Study (NCDS) originated in the Perinatal Mortality Survey (see SN 5565), which examined social and obstetric factors associated with still birth and infant mortality among over 17,000 babies born in Britain in one week in March 1958. Surviving members of this birth cohort have been surveyed on eight further occasions in order to monitor their changing health, education, social and economic circumstances - in 1965 at age 7, 1969 at age 11, 1974 at age 16 (the first three sweeps are also held under SN 5565), 1981 (age 23 - SN 5566), 1991 (age 33 - SN 5567), 1999/2000 (age 41/2 - SN 5578), 2004-2005 (age 46/47 - SN 5579), 2008-2009 (age 50 - SN 6137) and 2013 (age 55 - SN 7669).

There have also been surveys of sub-samples of the cohort, the most recent occurring in 1995 (age 37), when a 10% representative sub-sample was assessed for difficulties with basic skills (SN 4992). Finally, during 2002-2004, 9,340 NCDS cohort members participated in a bio-medical survey, carried out by qualified nurses (SN 5594, available under more restrictive Special Licence access conditions; see catalogue record for details). The bio-medical survey did not cover any of the topics included in the 2004/2005 survey. Further NCDS data separate to the main surveys include a response and deaths dataset, parent migration studies, employment, activity and partnership histories, behavioural studies and essays - see the NCDS series page for details.

Further information about the NCDS can be found on the Centre for Longitudinal Studies website.

How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
A useful overview of the governance routes for applying for genetic and bio-medical sample data, which are not available through the UK Data Service, can be found at Governance of data and sample access on the METADAC (Managing Ethico-social, Technical and Administrative issues in Data Access) website.

Data for the Computational Linguistics and Clinical Psychology Shared Task, 2018 contains the outputs of the shared task for the CLPsych 2018 workshop, which focused on predicting current and future psychological health from an essay authored in childhood. Language-based predictions of a person's current health have the potential to supplement traditional psychological assessment such as questionnaires, improving intake risk measurement and monitoring. Predictions of future psychological health can aid with both early detection and the development of preventative care. Research into the mental health trajectory of people, beginning from their childhood, has thus far been an area of little work within the natural language processing (NLP) community. This shared task represented one of the first attempts to evaluate the use of early language to predict future health; this has the potential to support a wide variety of clinical health care tasks, from early assessment of lifetime risk for mental health problems, to optimal timing for targeted interventions aimed at both prevention and treatment.

Cite

Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset

English Wikipedia People Dataset

Biographical Data for People on English Wikipedia

Available download formats: zip (4293465577 bytes)

Dataset updated

Jul 31, 2025

Dataset provided by

Wikimedia Foundation (http://www.wikimedia.org/)

Authors

Wikimedia

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Summary

This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes on English Wikipedia, output as JSON files (compressed as tar.gz).

We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

Data Structure

File name: wme_people_infobox.tar.gz
Size of compressed file: 4.12 GB
Size of uncompressed file: 21.28 GB

Noteworthy Included Fields:
- name - title of the article.
- identifier - ID of the article.
- image - main image representing the article's subject.
- description - one-sentence description of the article for quick reference.
- abstract - lead section, summarizing what the article is about.
- infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
- sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
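Because the uncompressed data is over 21 GB, it may be more practical to stream records out of the archive than to unpack it first. The sketch below shows one way to do that with Python's standard library. It assumes each file inside the archive holds one JSON object per line, which the description does not state; the helper names `iter_people` and `preview` are hypothetical, and only the `name` and `identifier` fields listed above are used.

```python
import json
import tarfile
from typing import Iterator


def iter_people(archive_path: str) -> Iterator[dict]:
    """Yield article records one at a time without unpacking the archive to disk.

    Assumption: every regular file in the archive is line-delimited JSON.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = archive.extractfile(member)
            if handle is None:
                continue
            for raw_line in handle:
                line = raw_line.strip()
                if line:
                    yield json.loads(line)


def preview(archive_path: str, limit: int = 5) -> list:
    """Return (name, identifier) pairs for the first few records."""
    pairs = []
    for record in iter_people(archive_path):
        pairs.append((record.get("name"), record.get("identifier")))
        if len(pairs) >= limit:
            break
    return pairs
```

Calling `preview("wme_people_infobox.tar.gz")` on the downloaded file would then list a handful of article titles and IDs. If the files turn out to contain a single JSON array instead, the inner loop would need to be replaced with one `json.load(handle)` call per member.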

Stats

Infoboxes
- Compressed: 2 GB
- Uncompressed: 11 GB

Infoboxes + sections + short description
- Size of compressed file: 4.12 GB
- Size of uncompressed file: 21.28 GB

Article analysis and filtering breakdown:
- total # of articles analyzed: 6,940,949
- # people found with QID: 1,778,226
- # people found with Category: 158,996
- # people found with Biography Project: 76,150
- Total # of people articles found: 2,013,372
- Total # people articles with infoboxes: 1,559,985

End stats
- Total number of people articles in this dataset: 1,559,985
- that have a short description: 1,416,701
- that have an infobox: 1,559,985
- that have article sections: 1,559,921

This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.
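The counts in the filtering breakdown can be cross-checked with a few lines of arithmetic. The sketch below copies the published figures and confirms that they add up. Reading the three detection routes as non-overlapping, and the 235,146 figure as the sum of the Category and Biography Project counts, is an inference from the sums matching and is not stated by the dataset authors.

```python
# Figures copied from the "Stats" breakdown.
found_with_qid = 1_778_226
found_with_category = 158_996
found_with_biography_project = 76_150
total_people_articles = 2_013_372
people_with_infoboxes = 1_559_985
with_short_description = 1_416_701
not_tagged_as_human = 235_146

# The three detection routes sum exactly to the reported total.
routes_total = found_with_qid + found_with_category + found_with_biography_project
assert routes_total == total_people_articles

# Articles found without a QID match the "not yet tagged on Wikidata" figure.
assert found_with_category + found_with_biography_project == not_tagged_as_human

# Share of detected people articles that have an infobox (about 77%).
infobox_share = people_with_infoboxes / total_people_articles

# Share of articles in the dataset that have a short description (about 91%).
description_share = with_short_description / people_with_infoboxes
```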

Maintenance and Support

This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

Initial Data Collection and Normalization

The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

Who are the source language producers?

Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English-language edition of Wikipedia (https://en.wikipedia.org/), written by the community.

Attribution

Terms and conditions

Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
