Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prompts for the different question categories. Every prompt ended with the phrase "The JSON variable should have the name XXX", where XXX was a suitable word (e.g., bodyweight_kg).
Microformat, Microdata and RDFa data from the October 2016 Common Crawl web corpus. We found structured data within 1.24 billion HTML pages out of the 3.2 billion pages contained in the crawl (38%). These pages originate from 5.63 million different pay-level-domains out of the 34 million pay-level-domains covered by the crawl (16.5%). Altogether, the extracted data sets consist of 44.2 billion RDF quads.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Objectives: Unstructured and structured data in electronic health records (EHRs) are a rich source of information for research and quality improvement studies. However, extracting accurate information from EHRs is labor-intensive. Timely and accurate identification of patients with Alzheimer's disease and related dementias (ADRD) or mild cognitive impairment (MCI) is critical for improving patient outcomes through early intervention, optimizing care plans, and reducing healthcare system burdens. Here we introduce an automated EHR phenotyping model to streamline this process and enable efficient identification of these conditions.
Methods: We analyzed data from 3,626 outpatients seen at two hospitals between February 2015 and June 2022. Through manual chart review, we established ground-truth labels for the presence or absence of MCI/ADRD diagnoses. Our model combined three types of data: (1) unstructured clinical notes, from which we extracted single words, two-word phrases (bigrams), and three-word phrases (trigrams) as features, weighted using Term Frequency-Inverse Document Frequency (TF-IDF) to capture their relative importance; (2) International Classification of Diseases (ICD) codes; and (3) medication prescriptions related to MCI/ADRD. We trained a regularized logistic regression model to predict MCI/ADRD diagnoses and evaluated its performance using standard metrics, including area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), accuracy, specificity, precision, recall, and F1 score.
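As a rough illustration of this modeling setup (not the authors' exact pipeline; the function name, hyperparameters, and the notes/structured/labels inputs are all assumptions), a TF-IDF plus regularized logistic regression phenotyping model might be sketched as follows:

# Hypothetical sketch: TF-IDF n-gram features from clinical notes combined with
# structured ICD/medication indicators, fed to an L2-regularized logistic regression.
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

def fit_phenotyping_model(notes, structured, labels):
    """notes: list of note texts; structured: (n, k) array of ICD/medication
    indicators; labels: 0/1 chart-review labels (all assumed inputs)."""
    vec = TfidfVectorizer(ngram_range=(1, 3), min_df=5)   # unigrams to trigrams
    X = hstack([vec.fit_transform(notes), csr_matrix(structured)])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=0)
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
    clf.fit(X_tr, y_tr)
    p = clf.predict_proba(X_te)[:, 1]
    return clf, roc_auc_score(y_te, p), average_precision_score(y_te, p)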
Results: Thirty percent of patients in the cohort carried diagnoses of MCI/ADRD based on manual review. When evaluated on a held-out test set, the best model, which used clinical notes, ICD codes, and medications, achieved an AUROC of 0.98, an AUPRC of 0.98, an accuracy of 0.93, a sensitivity (recall) of 0.91, a specificity of 0.96, a precision of 0.96, and an F1 score of 0.93. The estimated overall accuracy for patients randomly selected from EHRs was 99.88%.
Conclusion: Automated EHR phenotyping accurately identifies patients with MCI/ADRD based on clinical notes, ICD codes, and medication records. This approach holds potential for large-scale MCI/ADRD research utilizing EHR databases.
This is a no-arbitrage dynamic term structure model, implemented as in Kim and Wright using the methodology of Kim and Orphanides. The underlying model is the standard affine Gaussian model with three latent factors (i.e., the factors are defined only statistically and do not have a specific economic meaning). The model is parameterized in a maximally flexible way (i.e., it is the most general model of its kind with three factors that are econometrically identified). In the estimation of the model's parameters, data on survey forecasts of the 3-month Treasury bill (T-bill) rate are used in addition to yields data in order to help address the small-sample problems that often pervade econometric estimation with persistent time series such as bond yields.
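For reference, a generic discrete-time sketch of such an affine Gaussian term structure model (textbook notation under the risk-neutral measure; not the exact parameterization or identification scheme used here) is:

x_{t+1} = \mu + \Phi x_t + \Sigma \varepsilon_{t+1}, \qquad \varepsilon_{t+1} \sim N(0, I), \qquad r_t = \delta_0 + \delta_1' x_t

Zero-coupon bond prices are exponential-affine in the three latent factors, P_t^{(n)} = \exp(A_n + B_n' x_t), with loadings obtained recursively under the risk-neutral dynamics (\mu^{Q}, \Phi^{Q}):

A_{n+1} = A_n + B_n' \mu^{Q} + \tfrac{1}{2} B_n' \Sigma \Sigma' B_n - \delta_0, \qquad B_{n+1} = (\Phi^{Q})' B_n - \delta_1, \qquad A_0 = 0, \; B_0 = 0

so that model-implied yields are y_t^{(n)} = -(A_n + B_n' x_t)/n. In the Kim-Orphanides approach, survey forecasts of the T-bill rate enter estimation as additional noisy measurements of the model-implied expected future short rate, which helps pin down the factor dynamics in small samples.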
This dataset provides initial condition files for initializing the Ecosystem Demography Model (ED2). This dataset holds regional forest structure characteristics across the Brazilian Amazon that were derived from 545 airborne lidar transects (300 x 12500 m each) acquired during the Amazon Biomass Estimation Project (EBA2016) campaign in 2016. These data contain vertical distributions of stem density, carbon storage, and other vegetation traits for over 1,300,000 columns (50 x 50 m each) that were aggregated into 288 grid cells (1 x 1 degree). This dataset also contains soil edaphic characteristics obtained from existing datasets, as well as carbon stored in litter and soil layers estimated from the land use history and limited measurements in different land use types. Three types of files are provided: Site files (*.sss) hold soil and terrain characteristics. Patch files (*.pss) hold patch location, area, disturbance type, stem density, stem basal area, leaf area index (LAI), and aboveground biomass (AGB), along with carbon and nitrogen density in several categories for patches within sites. Cohort files (*.css) hold diameter at breast height, plant height, stem density, mass of living and dead biomass, LAI, AGB, and plant functional type for cohorts of stems within patches and sites. The data are provided in text format compatible with the ED2 model.
Database of three-dimensional structures of macromolecules that allows the user to retrieve structures for specific molecule types as well as structures for genes and proteins of interest. Three main databases comprise Structure: the Molecular Modeling Database; Conserved Domains and Protein Classification; and the BioSystems Database. Structure also links to the PubChem databases to connect biological activity data to the macromolecular structures. Users can locate structural templates for proteins and interactively view structures and sequence data to closely examine sequence-structure relationships.
* Macromolecular structures: The three-dimensional structures of biomolecules provide a wealth of information on their biological function and evolutionary relationships. The Molecular Modeling Database (MMDB), as part of the Entrez system, facilitates access to structure data by connecting them with associated literature, protein and nucleic acid sequences, chemicals, biomolecular interactions, and more. It is possible, for example, to find 3D structures for homologs of a protein of interest by following the Related Structure link in an Entrez Protein sequence record.
* Conserved domains and protein classification: Conserved domains are functional units within a protein that act as building blocks in molecular evolution and recombine in various arrangements to make proteins with different functions. The Conserved Domain Database (CDD) brings together several collections of multiple sequence alignments representing conserved domains, in addition to NCBI-curated domains that use 3D-structure information explicitly to define domain boundaries and provide insights into sequence/structure/function relationships.
* Small molecules and their biological activity: The PubChem project provides information on the biological activities of small molecules and is a component of NIH's Molecular Libraries Roadmap Initiative. PubChem includes three databases: PCSubstance, PCBioAssay, and PCCompound. The PubChem data are linked to other data types in the Entrez system, making it possible, for example, to retrieve information about a compound and then link to its biological activity data, retrieve 3D protein structures bound to the compound and interactively view their active sites, and find biosystems that include the compound as a component.
* Biological systems: A biosystem, or biological system, is a group of molecules that interact directly or indirectly, where the grouping is relevant to the characterization of living matter. The NCBI BioSystems Database provides centralized access to biological pathways from several source databases and connects the biosystem records with associated literature, molecular, and chemical data throughout the Entrez system. BioSystem records list and categorize components, such as the genes, proteins, and small molecules involved in a biological system. The companion FLink tool, in turn, allows you to input a list of proteins, genes, or small molecules and retrieve a ranked list of biosystems.
Identifying leader-follower interactions is crucial for understanding how a group decides where or when to move, and how this information is transferred between members. Although many animal groups have a three-dimensional structure, previous studies investigating leader-follower interactions have often ignored vertical information. This raises the question of whether commonly used two-dimensional leader-follower analyses can be justifiably applied to groups that interact in three dimensions. To address this, we quantified the individual movements of banded tetra fish (Astyanax mexicanus) within shoals by computing the three-dimensional trajectories of all individuals using a stereo-camera technique. We used these data firstly to identify and compare leader-follower interactions in two and three dimensions, and secondly to analyse leadership with respect to an individual's spatial position in three dimensions. We show that for 95% of all pairwise interactions, leadership identified through two-dimensional analysis matches that identified through three-dimensional analysis, and we reveal that fish attend to the same shoalmates for vertical information as they do for horizontal information. Our results therefore highlight that three-dimensional analyses are not always required to identify leader-follower relationships in species that move freely in three dimensions. We discuss our results in terms of the importance of taking species' sensory capacities into account when studying interaction networks within groups.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: In research on the state and public administration, it is common to analyze its structure, function, form, and type. However, on some occasions these general categories do not reveal the distribution of attributions or of real power across the different territorial levels. This report, through a historical-institutional review of the last 50 years in South American countries, proposes the existence of characteristic power structures: some that persist over time with gradual changes that preserve the essence of their historical origin, and others that are formed as a result of disruptive changes that modify the dominant paradigms. The existence of these structures shows three characteristic types, which are called compound, integrated, and simple.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset contains replication data for the research paper "Multiple signal classification as a blind reconstruction approach for three-dimensional structured illumination microscopy" (3DSIM). The paper compares image reconstructions using conventional 3DSIM and two variants of "Multiple signal classification algorithm" (MUSICAL), MUS-S and MUS-CE. This dataset provides 1 image generated of mitochondria in living cells, 1 image of mitochondria in fixed cells, 1 image of nephrin in fixed murine kidney tissue using 3DSIM, MUS-S and MUS-CE. The 3DSIM raw data and reconstructions for H9c2 cells are available from https://doi.org/10.18710/PDCLAS. Four different types of image data are included: -Raw structured illumination kidney data used for all super-resolution image reconstructions (i.e., MUS-S, MUS-CE and 3DSIM) -3DSIM images of a kidney section. -MUS-S for kidney, fixed and live H9c2 cardiomyoblasts -MUS-CE for kidney, fixed and live H9c2 cardiomyoblasts The data is organized in different folders according to sample type (FixedCell/LiveCell H9c2, KidneyTissue), and reconstruction method (MUS-CE, MUS-S, 3DSIM). The image files are TIFF images. Abbreviations: 3DSIM - three-dimensional structured illumination microscopy MUSICAL - Multiple signal classification algorithm MUS-S - soft thresholding variant of MUSICAL MUS-CE - contrast enhancement, a low-resolution variant of MUSICAL LTDR - LysoTracker Deep Red MAX - maximum intensity z-projected three-dimensional images c or ch - channel
https://spdx.org/licenses/CC0-1.0.html
The recent surge in electric vehicles (EVs), driven by a collective push to enhance global environmental sustainability, has underscored the significance of exploring EV charging prediction. To catalyze further research in this domain, we introduce UrbanEV, an open dataset showcasing EV charging space availability and electricity consumption in a pioneering city for vehicle electrification, namely Shenzhen, China. UrbanEV offers a rich repository of charging data (i.e., charging occupancy, duration, volume, and price) captured at hourly intervals across an extensive six-month span for over 20,000 individual charging stations. Beyond these core attributes, the dataset also encompasses diverse influencing factors like weather conditions and spatial proximity. These factors are thoroughly analyzed qualitatively and quantitatively to reveal their correlations and causal impacts on charging behaviors. Furthermore, comprehensive experiments have been conducted to showcase the predictive capabilities of various models, including statistical, deep learning, and transformer-based approaches, using the UrbanEV dataset. This dataset is poised to propel advancements in EV charging prediction and management, positioning itself as a benchmark resource within this burgeoning field.
Methods
To build a comprehensive and reliable benchmark dataset, we conduct a series of rigorous processes from data collection to dataset evaluation. The overall workflow sequentially includes data acquisition, data processing, statistical analysis, and prediction assessment. Detailed descriptions follow.
Study area and data acquisition
Shenzhen, a pioneering city in global vehicle electrification, has been selected for this study with the objective of offering valuable insights into electric vehicle (EV) development that can serve as a reference for other urban centers. This study encompasses the entire expanse of Shenzhen, where data on public EV charging stations distributed around the city have been meticulously gathered. Specifically, EV charging data was automatically collected from a mobile platform used by EV drivers to locate public charging stations. Through this platform, users could access real-time information on each charging pile, including its availability (e.g., busy or idle), charging price, and geographic coordinates. Accordingly, we recorded the charging-related data at five-minute intervals from September 1, 2022, to February 28, 2023. This data collection process was fully digital and did not require manual readings. Furthermore, to delve into the correlation between EV charging patterns and environmental elements, weather data for Shenzhen city were acquired from two meteorological observatories situated in the airport and central regions, respectively. These meteorological data are publicly available on the Shenzhen Government Data Open Platform. Thirdly, point of interest (POI) data was extracted through the Application Programming Interface Platform of AMap.com, along with three primary types: food and beverage services, business and residential, and lifestyle services. Lastly, the spatial and static data were organized based on the traffic zones delineated by the sixth Residential Travel Survey of Shenzhen. The collected data contains detailed spatiotemporal information that can be analyzed to provide valuable insights about urban EV charging patterns and their correlations with meteorological conditions.
Processing raw information into well-structured data
To streamline the utilization of the UrbanEV dataset, we harmonize heterogeneous data from various sources into well-structured data with aligned temporal and spatial resolutions. This process can be segmented into two parts: the reorganization of EV charging data and the preparation of other influencing factors.
EV charging data
The raw charging data, obtained from publicly available EV charging services, pertain to charging stations and predominantly comprise string-type records at a 5-minute interval. To transform this raw data into a structured time series tailored for prediction tasks, we implement the following three key measures:
Initial Extraction. From the string-type records, we extract vital information for each charging pile, such as availability (designated as "busy" or "idle"), rated power, and the corresponding charging and service fees applicable during the observed time periods. First, a charging pile is categorized as "active charging" if its states at two consecutive timestamps are both "busy". Consequently, the occupancy within a charging station can be defined as the count of in-use charging piles, while the charging duration is calculated as the product of the count of in-use piles and the time between the two timestamps (in our case, 5 minutes). Moreover, the charging volume in a station can correspondingly be estimated by multiplying the duration by the piles' rated power. Finally, the average electricity price and service price are calculated for each station in alignment with the same temporal resolution as the three charging variables.
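A minimal sketch of this derivation (the column names, the DataFrame layout, and the 5-minute interval constant are assumptions, not the dataset's actual schema):

# Hypothetical sketch: derive station-level occupancy, charging duration, and
# charging volume from per-pile busy/idle states sampled every 5 minutes.
import pandas as pd

INTERVAL_H = 5 / 60  # 5-minute sampling interval, expressed in hours

def station_metrics(df):
    """df: one station's pile states with columns
    ['timestamp', 'pile_id', 'state', 'rated_power_kw'] (assumed names)."""
    df = df.sort_values(["pile_id", "timestamp"]).copy()
    # a pile counts as actively charging if it is busy at two consecutive timestamps
    busy_now = df["state"].eq("busy")
    busy_prev = df.groupby("pile_id")["state"].shift().eq("busy")
    df["active"] = busy_now & busy_prev
    per_time = df.groupby("timestamp").apply(lambda g: pd.Series({
        "occupancy": int(g["active"].sum()),                           # in-use piles
        "duration_h": g["active"].sum() * INTERVAL_H,                  # pile-hours
        "volume_kwh": (g["active"] * g["rated_power_kw"]).sum() * INTERVAL_H,
    }))
    return per_time.reset_index()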
Error Detection and Imputation. Ensuring data quality is paramount when utilizing charging data for decision-making, advanced analytics, and machine-learning applications. It is crucial to address concerns around data cleanliness, as the presence of inaccuracies and inconsistencies, often referred to as dirty data, can significantly compromise the reliability and validity of any subsequent analysis or modeling efforts. To improve data quality of our charging data, several errors are identified, particularly the negative values for charging fees and the inconsistencies between the counts of occupied, idle, and total charging piles. We remove the records containing these anomalies and treat them as missing data. Besides that, a two-step imputation process was implemented to address missing values. First, forward filling replaced missing values using data from preceding timestamps. Then, backward filling was applied to fill gaps at the start of each time series. Moreover, a certain number of outliers were identified in the dataset, which could significantly impact prediction performance. To address this, the interquartile range (IQR) method was used to detect outliers for metrics including charging volume (v), charging duration (d), and the rate of active charging piles at the charging station (o). To retain more original data and minimize the impact of outlier correction on the overall data distribution, we set the coefficient to 4 instead of the default 1.5. Finally, each outlier was replaced by the mean of its adjacent valid values. This preprocessing pipeline transformed the raw data into a structured and analyzable dataset.
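A rough illustration of the imputation and outlier-handling steps described above (the series layout is assumed; the coefficient of 4 follows the text):

# Hypothetical sketch: forward/backward filling of missing values followed by
# IQR-based outlier correction with a widened coefficient of 4.
import pandas as pd

def clean_series(s, k=4.0):
    """s: one station's time series of a single metric (volume, duration, or
    occupancy rate), with anomalous records already set to NaN."""
    s = s.ffill().bfill()                       # step 1: forward then backward fill
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)   # k = 4 keeps more raw data
    outlier = (s < lo) | (s > hi)
    # step 2: replace each outlier with the mean of its adjacent valid values
    prev_ok = s.where(~outlier).ffill()
    next_ok = s.where(~outlier).bfill()
    s[outlier] = ((prev_ok + next_ok) / 2)[outlier]
    return s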
Aggregation and Filtration. Building upon the station-level charging data that has been extracted and cleansed, we further organize the data into a region-level dataset with an hourly interval, providing a new perspective for EV charging behavior analysis. This is achieved by two major processes: aggregation and filtration. First, we aggregate all the charging data from both temporal and spatial views: a. Temporally, we standardize all time-series data to a common time resolution of one hour, as it serves as the least common denominator among the various resolutions. This aims to establish a unified temporal resolution for all time-series data, including pricing schemes, weather records, and charging data, thereby creating a well-structured dataset. Aggregation rules specify that the five-minute charging volume (v) and duration (d) are summed within each interval (i.e., one hour), whereas the occupancy (o), electricity price (pe), and service price (ps) are assigned the instantaneous values observed at each hour for each charging pile. This distinction arises from the inherent nature of these data types: volume v and duration d are cumulative, while o, pe, and ps are instantaneous variables. Compared to using the mean or median values within each interval, selecting the instantaneous values of o, pe, and ps as representatives preserves the original data patterns more effectively and minimizes the influence of human interpretation. b. Spatially, stations are aggregated based on the traffic zones delineated by the sixth Residential Travel Survey of Shenzhen. After aggregation, the dataset comprises 331 regions (also called traffic zones) with 4344 timestamps. Second, variance tests and zero-value filtering functions were employed to filter out traffic zones with zero or no change in charging data.
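A simplified sketch of the hourly aggregation rule (column names and the datetime index are assumptions):

# Hypothetical sketch: resample 5-minute station data to an hourly resolution.
# Cumulative quantities (volume, duration) are summed over the hour, while
# instantaneous quantities (occupancy, prices) take the value observed at the hour mark.
import pandas as pd

def to_hourly(df):
    """df: 5-minute records with a DatetimeIndex and columns
    ['volume_kwh', 'duration_h', 'occupancy', 'price_elec', 'price_serv']."""
    return df.resample("1h").agg({
        "volume_kwh": "sum",
        "duration_h": "sum",
        "occupancy": "first",     # instantaneous value at the start of the hour
        "price_elec": "first",
        "price_serv": "first",
    })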
https://qdr.syr.edu/policies/qdr-standard-access-conditions
Project Overview
Trends toward open science practices, along with advances in technology, have promoted increased data archiving in recent years, thus bringing new attention to the reuse of archived qualitative data. Qualitative data reuse can increase efficiency and reduce the burden on research subjects, since new studies can be conducted without collecting new data. Qualitative data reuse also supports larger-scale, longitudinal research by combining datasets to analyze more participants. At the same time, qualitative research data can increasingly be collected from online sources. Social scientists can access and analyze personal narratives and social interactions through social media such as blogs, vlogs, online forums, and posts and interactions from social networking sites like Facebook and Twitter. These big social data have been celebrated as an unprecedented source of data analytics, able to produce insights about human behavior on a massive scale. However, both types of research also present key epistemological, ethical, and legal issues. This study explores the issues of context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership, with a focus on data curation strategies. The research suggests that connecting qualitative researchers, big social researchers, and curators can enhance responsible practices for qualitative data reuse and big social research.
This study addressed the following research questions:
RQ1: How is big social data curation similar to and different from qualitative data curation?
RQ1a: How are epistemological, ethical, and legal issues different or similar for qualitative data reuse and big social research?
RQ1b: How can data curation practices such as metadata and archiving support and resolve some of these epistemological and ethical issues?
RQ2: What are the implications of these similarities and differences for big social data curation and qualitative data curation, and what can we learn from combining these two conversations?
Data Description and Collection Overview
The data in this study was collected using semi-structured interviews that centered around specific incidents of qualitative data archiving or reuse, big social research, or data curation. The participants for the interviews were therefore drawn from three categories: researchers who have used big social data, qualitative researchers who have published or reused qualitative data, and data curators who have worked with one or both types of data. Six key issues were identified in a literature review and were then used to structure three interview guides for the semi-structured interviews. The six issues are context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership. Participants were limited to those working in the United States. Ten participants from each of the three target populations (big social researchers, qualitative researchers who had published or reused data, and data curators) were interviewed. The interviews were conducted between March 11 and October 6, 2021. When scheduling the interviews, participants received an email asking them to identify a critical incident prior to the interview. The "incident" in the critical incident interviewing technique is a specific example that focuses a participant's answers to the interview questions.
The participants were asked for their permission to have the interviews recorded, which was done using the built-in recording technology of the Zoom videoconferencing software. The author also took notes during the interviews. Otter.ai speech-to-text software was used to create initial transcriptions of the interview recordings. A hired undergraduate student hand-edited the transcripts for accuracy. The transcripts were manually de-identified. The author analyzed the interview transcripts using a qualitative content analysis approach. This involved using a combination of inductive and deductive coding approaches. After reviewing the research questions, the author used NVivo software to identify chunks of text in the interview transcripts that represented key themes of the research. Because the interviews were structured around each of the six key issues that had been identified in the literature review, the author deductively created a parent code for each of the six key issues. These parent codes were context, data quality and trustworthiness, data comparability, informed consent, privacy and confidentiality, and intellectual property and data ownership. The author then used inductive coding to create sub-codes beneath each of the parent codes for these key issues.
Selection and Organization of Shared Data
The data files consist of 28 of the interview transcripts themselves, i.e., transcripts from Big Social Researchers (BSR), Data Curators (DC), and Qualitative Researchers (QR)...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The 802.11 standard includes several management features and corresponding frame types. One of them is the probe request (PR). PRs are sent by mobile devices in the unassociated state to search the nearby area for existing wireless networks. The frame body of a PR consists of variable-length fields called information elements (IEs). IE fields represent the capabilities of a mobile device, such as supported data rates.
The dataset includes PRs collected in a controlled rural environment and in a semi-controlled indoor environment under different measurement scenarios.
It can be used for various use cases, e.g., analysing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analysing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.
Measurement setup
The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle that captures WiFi traffic in monitor mode. Passive PR monitoring is performed by listening to 802.11 traffic on a single WiFi channel and filtering for PR packets.
The following information is collected about each PR received: MAC address, supported data rates, extended supported rates, HT capabilities, extended capabilities, data under the extended tag and vendor specific tag, interworking, VHT capabilities, RSSI, SSID, and the timestamp when the PR was received.
The collected data was forwarded to a remote database via a secure VPN connection. A Python script based on the Pyshark package was written for data collection, preprocessing, and transmission.
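As a rough illustration (not the actual collection script; the interface name and the captured fields are assumptions, and exact field names depend on the tshark version), capturing PRs with Pyshark might look like this:

# Hypothetical sketch: passively capture 802.11 probe requests with Pyshark
# on a monitor-mode interface.
import pyshark

def capture_probe_requests(interface="wlan1mon", packet_count=100):
    # wlan.fc.type_subtype == 4 selects probe request frames
    cap = pyshark.LiveCapture(interface=interface,
                              display_filter="wlan.fc.type_subtype == 4")
    records = []
    for pkt in cap.sniff_continuously(packet_count=packet_count):
        records.append({
            "mac": pkt.wlan.sa,                  # source MAC address
            "time": float(pkt.sniff_timestamp),
            # RSSI, SSID, and the IE fields would be read from the radiotap and
            # management layers here; field names vary between tshark versions
        })
    return records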
Data preprocessing
The gateway collects PRs for each consecutive predefined scan interval (10 seconds). During this time interval, the data are preprocessed before being transmitted to the database.
For each PR detected in the scan interval, the IE fields are saved in the following JSON structure:
PR_IE_data =
{
'DATA_RTS': {'SUPP': DATA_supp , 'EXT': DATA_ext},
'HT_CAP': DATA_htcap,
'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
'VHT_CAP': DATA_vhtcap,
'INTERWORKING': DATA_inter,
'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext ...},
'VENDOR_SPEC': {VENDOR_1:{
'ID_1': DATA_1_vendor1,
'ID_2': DATA_2_vendor1
...},
VENDOR_2:{
'ID_1': DATA_1_vendor2,
'ID_2': DATA_2_vendor2
...}
...}
}
Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IE data is represented in hexadecimal format. The vendor specific tag is structured differently from the other IEs: this field can contain multiple vendor IDs, each with multiple data IDs and corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
Missing IE fields in the captured PR are not included in PR_IE_data.
When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:
{'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },
where PR_data is structured as follows:
{
'TIME': [ DATA_time ],
'RSSI': [ DATA_rssi ],
'DATA': PR_IE_data
}.
This data structure allows storing only TOA and RSSI for all PRs originating from the same MAC address and containing the same PR_IE_data. All SSIDs from the same MAC address are also stored.
The data of the newly detected PR is compared with the already stored data of the same MAC in the current scan time interval.
If identical PR's IE data from the same MAC address is already stored, then only data for the keys TIME and RSSI are appended.
If no identical PR's IE data has yet been received from the same MAC address, then PR_data structure of the new PR for that MAC address is appended to PROBE_REQs key.
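A condensed sketch of this per-interval grouping logic (the function and variable names are assumptions; only the behaviour described above is encoded):

# Hypothetical sketch: group PRs captured in one scan interval by MAC address,
# appending only TIME and RSSI when an identical IE payload was already stored.
def add_probe_request(devices, mac, ssid, time, rssi, pr_ie_data):
    """devices: dict keyed by MAC address, matching the structure described above."""
    entry = devices.setdefault(mac, {"MAC": mac, "SSIDs": [], "PROBE_REQs": []})
    if ssid and ssid not in entry["SSIDs"]:
        entry["SSIDs"].append(ssid)
    for pr in entry["PROBE_REQs"]:
        if pr["DATA"] == pr_ie_data:            # identical IE data already stored
            pr["TIME"].append(time)
            pr["RSSI"].append(rssi)
            return
    # no identical IE data yet for this MAC: append a new PR_data structure
    entry["PROBE_REQs"].append({"TIME": [time], "RSSI": [rssi], "DATA": pr_ie_data})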
The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png
At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, e.g., the wireless gateway serial number and the scan start and end timestamps. For an example of a single captured PR, see the ./Single_PR_capture_example.json file.
Environments description
We performed measurements in a controlled rural outdoor environment and in a semi-controlled indoor environment of the Jozef Stefan Institute.
See the Excel spreadsheet Measurement_informations.xlsx for a list of mobile devices tested.
Indoor environment
We used three RPis for the acquisition of PRs at the Jozef Stefan Institute. They were placed indoors in the hallways as shown in ./Figures/RPi_locations_JSI.png. Measurements were performed on a weekend to minimize additional uncontrolled traffic from users' mobile devices. While there is some overlap in WiFi coverage between the devices at locations 2 and 3, the device at location 1 has no overlap with the other two devices.
Rural environment outdoors
The three RPis used to collect PRs were placed at three different locations with non-overlapping WiFi coverage, as shown in ./Figures/RPi_locations_rural_env.png. Before starting the measurement campaign, all measured devices were turned off and the environment was checked for active WiFi devices. We did not detect any unknown active devices sending WiFi packets in the RPis' coverage area, so the deployment can be considered fully controlled.
All known WiFi-enabled devices that were used to collect and send data to the database used a global MAC address, so they can easily be excluded in the preprocessing phase. The MAC addresses of these devices can be found in the ./Measurement_informations.xlsx spreadsheet.
Note: The Huawei P20 device with ID 4.3 was not included in the test in this environment.
Scenarios description
We performed three different scenarios of measurements.
Individual device measurements
For each device, we collected PRs for one minute with the screen on, followed by PRs collected for one minute with the screen off. In the indoor environment, the WiFi interfaces of the other devices not being tested were disabled. In the rural environment, the other devices were turned off. Start and end timestamps of the recorded data for each device can be found in the ./Measurement_informations.xlsx spreadsheet under the Indoor environment of Jozef Stefan Institute sheet and the Rural environment sheet.
Three groups test
In this measurement scenario, the devices were divided into three groups. The first group contained devices from different manufacturers. The second group contained devices from only one manufacturer (Samsung). Half of the third group consisted of devices from the same manufacturer (Huawei), and the other half of devices from different manufacturers. The distribution of devices among the groups can be found in the ./Measurement_informations.xlsx spreadsheet.
The same data collection procedure was used for all three groups. Data for each group were collected in both environments at three different RPi locations, as shown in ./Figures/RPi_locations_JSI.png and ./Figures/RPi_locations_rural_env.png.
At each location, PRs were collected from each group for 10 minutes with the screen on. Then all three groups switched locations and the process was repeated. Thus, the dataset contains measurements from all three RPi locations for all three groups of devices in both measurement environments. The group movements and the timestamps for the start and end of the collection of PRs at each location can be found in the spreadsheet ./Measurement_informations.xlsx.
One group test
In the last measurement scenario, all devices were grouped together. In the rural environment, we first collected PRs for 10 minutes while the screens were on, and then for another 10 minutes while the screens were off. In the indoor environment, data were collected at the first location with screens on for 10 minutes. Then all devices were moved to the location of the next RPi, and PRs were collected for 5 minutes with the screens on and then for another 5 minutes with the screens off.
Folder structure
The root directory contains two files in JSON format, one for each of the environments where the measurements took place (Data_indoor_environment.json and Data_rural_environment.json). Both files contain the collected PRs for the entire day on which the measurements were taken (12:00 AM to 12:00 PM) to give a sense of the behaviour of the unknown devices in each environment. The spreadsheet ./Measurement_informations.xlsx contains three sheets. Devices description contains general information about the tested devices, the RPis, and the assigned group for each device. The sheets Indoor environment of Jozef Stefan Institute and Rural environment contain the corresponding timestamps for the start and end of each measurement scenario. For the scenario where the devices were divided into groups, additional information about the movements between locations is included. The location names are based on the RPi gateway ID and may differ from those on the figures showing the
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of supply chains used by the company DataCo Global was used for the analysis. The dataset allows the use of machine learning algorithms and R software. Areas of important registered activities: provisioning, production, sales, and commercial distribution. It also allows the correlation of structured data with unstructured data for knowledge generation.
Data types: structured data: DataCoSupplyChainDataset.csv; unstructured data: tokenized_access_logs.csv (clickstream).
Types of products: clothing, sports, and electronic supplies.
Additionally, another file, DescriptionDataCoSupplyChain.csv, contains the description of each of the variables of DataCoSupplyChainDataset.csv.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We have analyzed 40 different databases ranging in size from a few thousand to nearly 100 million molecules, comprising a total of over 210 million structures, for their tautomeric conflicts. A tautomeric conflict is defined as an occurrence of two or more structures within a data set identified by the tautomeric rules applied as being tautomers of each other. We tested a total of 119 detailed tautomeric transform rules expressed as SMIRKS, out of which 79 yielded at least one conflict. These transformations include three types of tautomerism: prototropic, ring–chain, and valence tautomerism. The databases analyzed spanned a wide variety of types including large aggregating databases, drug collections, and structure collections based on experimental data. All databases analyzed showed intra-database tautomeric conflicts. The conflict rates as percentage of the database were typically in the few tenths of a percent range, which for the largest databases amounts to >100,000 cases per database.
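As a loose illustration of how such conflicts can be flagged (this sketch swaps in RDKit's built-in tautomer rule set rather than the 119 SMIRKS transforms used in the study; the function name and inputs are assumptions):

# Hypothetical sketch: flag tautomeric conflicts within one database by grouping
# structures that share the same canonical tautomer.
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def tautomer_conflicts(smiles_list):
    enumerator = rdMolStandardize.TautomerEnumerator()
    groups = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        canon = Chem.MolToSmiles(enumerator.Canonicalize(mol))
        groups[canon].append(smi)
    # a conflict is any group containing two or more distinct input structures
    return {k: v for k, v in groups.items() if len(set(v)) > 1}

# Example: 2-hydroxypyridine and 2-pyridone are tautomers of each other
print(tautomer_conflicts(["Oc1ccccn1", "O=c1cccc[nH]1"]))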
This dataset, Protein Chemical Structure Comparison from Three Drug Databases, is a selection of a 3-way consensus list from the paper "Comparing the Chemical Structure and Protein Content of ChEMBL, DrugBank, Human Metabolome Database and the Therapeutic Target Database" (2013). It includes 352 proteins in common between the three drug databases.
https://www.mordorintelligence.com/privacy-policy
The Data Wrangling Market Report is Segmented by Data Type (Structured Data, Semi-Structured Data, and Unstructured Data), Component (Software and Services), Business Function (Finance, Marketing and Sales, Operations, and More), End-User Industry (IT and Telecommunication, BFSI, Retail and E-Commerce, and More), and Geography. The Market Forecasts are Provided in Terms of Value (USD).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset presents zooplankton taxa occurrence records and abundance. The data were obtained from a zooplankton survey conducted from October 2009 to January 2010 in three habitat types with varying environmental conditions on Lake Victoria, Uganda. The habitats include wastewater stabilization ponds near the lake shores.
The NCI DIS 3D database is a collection of 3D structures for over 400,000 drugs. The database is an extension of the NCI Drug Information System (DIS), a file of about 450,000 primarily organic compounds that have been tested by NCI for anticancer activity. The structural information stored in the DIS is only the connection table for each drug; the connection table is simply a list of which atoms are connected and how they are connected. A searchable database of three-dimensional structures has been developed from this chemistry database. The DIS database is very similar in size and content to the proprietary databases used in the pharmaceutical industry; its development began in the 1950s, and this history led to a number of problems in the generation of 3D structures. The connection-table information can be searched to find drugs that share similar patterns of connections, which can correlate with similar biological activity. But the cellular targets for drug action, as well as the drugs themselves, are three-dimensional objects, and advances in computer hardware and software have reached the point where they can be represented as such. In many cases the important points of interaction between a drug and its target can be represented by a 3D arrangement of a small number of atoms. Such a group of atoms is called a pharmacophore. The pharmacophore can be used to search 3D databases, and drugs that match the pharmacophore could have similar biological activity but very different patterns of atomic connections. Having a diverse set of lead compounds increases the chances of finding an active compound with acceptable properties for clinical development. Sponsor: The ICBG are supported by the Cooperative Agreement mechanism, with funds from nine components of the NIH, the National Science Foundation, and the Foreign Agricultural Service of the USDA.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This data release contains a geospatial database related to a digital 3D geologic framework of the Rio San Jose watershed, New Mexico. The geospatial database contains two main data elements: (1) input data to the 3D framework model; (2) interpolated elevations and thicknesses of stratigraphic units as a cellular array. Input surface and subsurface data for 18 stratigraphic units have been condensed to points that define the elevation of the top of each stratigraphic unit; these point data sets serve as the digital input to the framework model. The point data are derived from geologic maps, cross sections, oil and gas wells, water wells, structure contour maps, and thickness maps. Additional input geologic features that either cut or overlie the stratigraphic units in the model are provided as separate feature classes, including the locations of faults, volcanic dikes, and volcanic vents.
The interpolated elevations and thicknesses of stratigraphic units are presented as a cellular ...
Reduced representation (RRL) sequencing approaches (e.g., RADSeq, genotyping by sequencing) require decisions about how much to invest in genome coverage and sequencing depth (library quality), as well as choices of values for adjustable bioinformatics parameters. To empirically explore the importance of these "simple" decisions, we generated two independent sequencing libraries for the same 142 individual lake whitefish (Coregonus clupeaformis) using a nextRAD RRL approach: (1) A small number of loci and low sequencing depth (library A); and (2) more loci and higher sequencing depth (library B). The fish were selected from populations with different levels of expected genetic subdivision. Each library was analyzed using the STACKS pipeline followed by three types of population structure assessment (FST, DAPC and ADMIXTURE) with iterative increases in the stringency of sequencing depth and missing data requirements, as well as more specific a priori population maps. Library B was always able to resolve strong population differentiation in all three types of assessment regardless of the selected parameters. In contrast, library A produced more variable results; increasing the minimum sequencing depth threshold (-m) resulted in a reduced number of retained loci, and therefore lost resolution at high -m values for FST and ADMIXTURE, but not DAPC. FST and DAPC were robust to varying the population map and increasing the stringency of missing data requirements. In contrast, ADMIXTURE was unable to resolve strong population differentiation when increasing these same parameters in library A. Similarly, when examining fine scale population subdivision, library B was robust to changing parameters but library A lost resolution depending on the parameter set. We used library B to examine actual subdivision in our study populations. All three types of analysis found complete subdivision among populations in Lake Huron, ON and Dore Lake, SK, Canada using 10,640 SNP loci. Weak population subdivision was detected in Lake Huron with fish from sites in the north-west, Search Bay, North Point and Hammond Bay, showing slight differentiation. Overall, we show that apparently simple decisions about library quality and bioinformatics parameters can have potentially important impacts on the interpretation of population subdivision. Although costly, the early investment in a high-quality library and more conservative stringency settings on STACKS parameters lead to a final dataset that was more consistent and robust when examining both weak and strong population differentiation.