Pooling individual samples prior to DNA extraction can mitigate the cost of DNA extraction and genotyping; however, these methods must accurately achieve equal representation of individuals within pools. This data set was generated to determine the accuracy of pool construction based on white blood cell counts compared with two common DNA quantification methods. Fifty individual bovine blood samples were collected and then pooled, with all individuals represented in each pool. Pools were constructed with the target of equal representation of each animal based on number of white blood cells, spectrophotometric readings, spectrofluorometric readings, and whole blood volume, with 9 pools per method and a total of 36 pools. Pools and the individual samples that comprised them were genotyped using a commercially available genotyping array. ASReml was used to estimate variance components for individual animal contribution to pools. The correlation between animal contributions in two pools was estimated using bivariate analysis, with starting values set to the result of a univariate analysis. The dataset includes: 1) pooling allele frequencies (PAF) for all pools and individual animals, computed from normalized intensities for red (X) and green (Y): PAF = X/(X+Y); 2) genotypes, i.e. the number of copies of the B (green) allele (0, 1, 2); 3) definitions for each sample.
Resources in this dataset:
Resource Title: Pooling Allele Frequencies (PAF) for all pools and individual animals. File Name: pafAnimal.csv.gz. Resource Description: Pooling allele frequencies for all pools and individual animals, computed from normalized intensities for red (X) and green (Y); paf = X / (X + Y).
Resource Title: Genotypes for individuals within pools. File Name: g.csv.gz. Resource Description: Genotypes (number of copies of the B (green) allele: 0, 1, 2) for individual bovine animals within pools.
Resource Title: Sample Definitions. File Name: XY Data Key.xlsx. Resource Description: Definitions for each sample (both pools and individual animals).
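As a concrete illustration, the PAF formula above can be computed per marker from the normalized channel intensities (a minimal sketch; the function and variable names are mine, not the dataset's):

```python
import math

def pooling_allele_frequency(x_red, y_green):
    """paf = X / (X + Y), where X and Y are the normalized red and green intensities."""
    total = x_red + y_green
    if total == 0:
        return math.nan  # no signal at this marker
    return x_red / total

# Equal intensity in both channels gives paf = 0.5
print(pooling_allele_frequency(0.8, 0.8))  # 0.5
```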
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This layer contains the data for the public pools for the Parks and Recreation department in the City of Round Rock, located in Williamson County, Texas. This layer is part of an original dataset provided and maintained by the City of Round Rock GIS/IT Department and the Planning and Development Services Department. The data in this layer are represented as polygons. A public pool is defined as a pool that is regulated and maintained by the City, a MUD, an HOA, or another regulatory body. The public pools in this layer have unique names and addresses, and other information about the pools is available in this layer as well.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In our everyday lives, we are required to make decisions based upon our statistical intuitions. Often, these involve the comparison of two groups, such as luxury versus family cars and their suitability. Research has shown that the mean difference affects judgements where two sets of data are compared, but the variability of the data has only a minor influence, if any at all. However, prior research has tended to present raw data as simple lists of values. Here, we investigated whether displaying data visually, in the form of parallel dot plots, would lead viewers to incorporate variability information. In Experiment 1, we asked a large sample of people to compare two fictional groups (children who drank ‘Brain Juice’ versus water) in a one-shot design, where only a single comparison was made. Our results confirmed that only the mean difference between the groups predicted subsequent judgements of how much they differed, in line with previous work using lists of numbers. In Experiment 2, we asked each participant to make multiple comparisons, with both the mean difference and the pooled standard deviation varying across data sets they were shown. Here, we found that both sources of information were correctly incorporated when making responses. Taken together, we suggest that increasing the salience of variability information, through manipulating this factor across items seen, encourages viewers to consider this in their judgements. Such findings may have useful applications for best practices when teaching difficult concepts like sampling variation.
The concept of an employment pool is often used generically to define the area of influence of a particular economic cluster. It corresponds to a finer division of employment areas; sometimes an employment pool corresponds exactly to an employment area. The exact determination methodology is not communicated by INSEE. The Melchior site gives this definition: INSEE has defined employment areas, but the concept of the employment basin (the perimeter used by the Ministry of Labour) does not have a clear definition. Basins are subdivisions of employment areas and may constitute local policy frameworks for public authorities.
This dataset comprises 119,494 records of idle well fluid level depths, auxiliary measurements, and well parameters from California oil and gas wells that were reported to the California Department of Conservation, Geologic Energy Management Division (CalGEM). The dataset was provided by CalGEM in March 2018 and includes measurements made from 1976 to 2018. There are 5 sets of operator-reported data: idle well fluid level depth (N=101,734), well clean out depth (N=8,402), depth of base of fresh water (N=108,216), well top perforation depth (N=93,569), and depth reached (N=15,756). These are associated with a well, defined by API number, well number, operator name, test date, township, section, range, and pool code. While detailed metadata for these measurements was not provided by CalGEM, they are thought to be collected under idle well testing regulations. Present regulations broadly define an idle well as one that has not been used for production or injection for 24 months or longer (California Code of Regulations, 2022, Title 14 §1760). Below, a summary of current regulations related to this program is presented; however, regulations at the time of data collection may have differed. Once a well is classified as an idle well, a fluid level test using acoustical, mechanical, or other methods must be conducted within 24 months, and every 24 months beyond that, as long as a well is idle, unless the wellbore does not penetrate an underground source of drinking water (USDW) (California Code of Regulations, 2022, Title 14 §1772.1). Currently, within 8 years of a well becoming idle a clean out tag is required. This is done to demonstrate that the well can be properly plugged and abandoned. A clean out tag is done by passing open-ended tubing or a gauge ring of a minimum diameter equal to that of tubing necessary to plug and abandon a well (California Code of Regulations, 2022, Title 14 §1772.1).
This testing must generally be repeated once every 48 months as long as a well is classified as an idle well. Freshwater is defined as water that contains 3,000 milligrams/liter (mg/L) or less of total dissolved solids (California Code of Regulations, 2022, Title 14 §1720.1). The base of freshwater is the depth in a well above which the water is freshwater. Neither top perforation depth nor depth reached is defined by statute. Top perforation is generally the shallowest active perforated interval. It is not clear what depth reached represents. Well elevation and pool name were added from other datasets to aid in analysis. Pools, identified by pool code and pool name, are defined as independent hydrocarbon zones (California Public Resources Code § 3227.6.b). The accuracy of the values reported to CalGEM by oil-field operators is unknown. Unrealistic values were discarded from the data as noted in the process steps. This dataset was compiled and analyzed as part of the California State Water Resources Control Board Oil and Gas Regional Monitoring Program and the U.S. Geological Survey California Oil, Gas, and Groundwater (COGG) program.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the official dataset for paper "Performance Modeling of Data Storage Systems using Generative Models" in IEEE Access [journal] [arxiv].
Keywords: regression, uncertainty estimation, generative models on tabular data, surrogate modeling
Data storage systems (DSS) play a critical role in today's world. Performance modeling is important during the development process of such systems. This modeling enables engineers to evaluate how the system behaves under various conditions and aids in identifying its optimal design. Another application lies in diagnostics and predictive maintenance, where model predictions are compared against real-world measurements to identify failures and anomalies.
A typical DSS comprises three main components: controllers, high-speed cache memory, and storage pools. All data resides within these pools, composed of multiple hard disk drives (HDDs) or solid-state drives (SSDs) organized using RAID configurations. The cache accelerates read and write operations for frequently accessed data blocks.
Performance is measured by the number of input/output operations per second (IOPS) and the average latency of those operations. Data load characteristics include factors like load type, I/O type, read fraction, block size, job count, and queue depth per job. Our dataset considers two load types—random and sequential. Each data load involves a combination of read and write operations. Each operation handles a single data block of predefined size, while data loads are generated through multiple jobs, each with a specific queue depth. Storage pools employing HDDs or SSDs come with configurable parameters, including the total number of disks in the pool and the RAID scheme, defined by the quantity of data and parity blocks.
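To make the feature space concrete, one data-load sample described above might be encoded as follows (a hypothetical sketch; the field names are illustrative, not the dataset's actual columns):

```python
# Hypothetical encoding of one workload/pool configuration and its targets.
workload = {
    "load_type": "random",     # random or sequential
    "read_fraction": 0.7,      # share of read operations in the mix
    "block_size_kib": 8,       # size of each data block
    "jobs": 4,                 # number of load-generating jobs
    "queue_depth": 32,         # queue depth per job
}
pool = {
    "media": "SSD",            # HDD or SSD
    "disks": 24,               # total disks in the pool
    "raid_data": 8,            # RAID scheme: data blocks ...
    "raid_parity": 2,          # ... plus parity blocks
}
targets = ("iops", "avg_latency")  # what the performance model predicts

# Total in-flight I/Os offered to the system
outstanding = workload["jobs"] * workload["queue_depth"]
print(outstanding)  # 128
```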
If you use the dataset in a scientific publication, we would appreciate citations to the paper:
A. R. Al-Maeeni, A. Temirkhanov, A. Ryzhikov and M. Hushchyn, "Performance Modeling of Data Storage Systems Using Generative Models," in IEEE Access, vol. 13, pp. 49643-49658, 2025, doi: 10.1109/ACCESS.2025.3552409
or using BibTeX:
@ARTICLE{10930879,
author={Al-Maeeni, Abdalaziz R. and Temirkhanov, Aziz and Ryzhikov, Artem and Hushchyn, Mikhail},
journal={IEEE Access},
title={Performance Modeling of Data Storage Systems Using Generative Models},
year={2025},
volume={13},
number={},
pages={49643-49658},
doi={10.1109/ACCESS.2025.3552409}}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The concept of an employment pool is often used generically to define the area of influence of a particular economic cluster. It corresponds to a finer division of employment areas; sometimes an employment pool corresponds exactly to an employment area. The exact determination methodology is not communicated by INSEE. The Melchior site gives this definition: INSEE has defined employment areas, but the concept of the employment basin (the perimeter used by the Ministry of Labour) does not have a clear definition. Basins are subdivisions of employment areas and may constitute local policy frameworks for public authorities.
Observed average proportion and standard deviation of low- and high-educated individuals in the pooled data, and definition of the scenarios used to estimate educational inequalities in mortality under different educational distributions, for men and women.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Data Description: This data set contains all inspections issued/performed by City of Cincinnati Departments (including Buildings & Inspections; Cincinnati Fire Department; Cincinnati Health Department; Cincinnati Parks; and Trade/Development), as well as Inspections Bureau Inc (IBI) and Hamilton County departments.
Inspections range from electrical surveys, to swimming pools/spas, to elevator inspections, daycare inspections, and more. This data covers inspections from 1999 through the present day.
Data Creation: All data is input by respective agencies, and maintained/stored by Cincinnati Area Geographic Information Systems (CAGIS), and is additionally available on CAGIS Property Activity Report website: http://cagismaps.hamilton-co.org/PropertyActivity/cagisreport
Data Created By: CAGIS
Refresh Frequency: Daily
Data Dictionary: A data dictionary providing definitions of columns and attributes is available as an attachment to this dataset.
Processing: The City of Cincinnati is committed to providing the most granular and accurate data possible. In that pursuit, the Office of Performance and Data Analytics applies standard processing to most raw data prior to publication. Processing includes, but is not limited to: address verification, geocoding, decoding attributes, and the addition of administrative areas (e.g., Census, neighborhoods, police districts).
Data Usage: For directions on downloading and using open data please visit our How-to Guide: https://data.cincinnati-oh.gov/dataset/Open-Data-How-To-Guide/gdr9-g3ad
Terms of use: https://data.syr.gov/pages/termsofuse
Information about pools owned and maintained by the City of Syracuse. The dataset also contains information about the pools: whether they are handicap accessible, width and length, depth, and whether they have been converted to salt water pools.
Data Dictionary (definition source for all fields: Parks and Rec. Department):
Park: The name of the City park.
Pool: Whether there is a pool at this park ("Yes" or "No").
Type: The type of pool: Outdoor, Indoor, or Outdoor "L"-shaped.
Latitude: The latitude where the pool is located; can be used with GIS mapping.
Longitude: The longitude where the pool is located; can be used with GIS mapping.
Accessible_Pool: Whether the pool has been deemed Americans with Disabilities Act (ADA) accessible, whether by ramp, lift, or other means.
Length_x_Width: The length and width of the pool, measured in yards (yds), meters (M), or feet (').
Depth: The depth of the pool, measured in feet and inches.
Pool_Image: Website link to an image of this pool; can be used for a pop-up on a GIS map.
Website: The City of Syracuse website that contains up-to-date information about pool hours and other information.
Dataset Contact Information:
Organization: Parks and Rec Department
Position: Data Program Manager
City: Syracuse, NY
E-Mail Address: opendata@syrgov.net
This is an integration of 10 independent multi-country, multi-region, multi-cultural social surveys fielded by Gallup International between 2000 and 2013. The integrated data file contains responses from 535,159 adults living in 103 countries. In total, the harmonization project combined 571 social surveys.
These data have value in a number of longitudinal multi-country, multi-regional, and multi-cultural (L3M) research designs. They are best understood as independent, though non-random, L3M samples containing a number of multiple-indicator ASQ (ask same questions) and ADQ (ask different questions) measures of human development, the environment, international relations, gender equality, security, international organizations, and democracy, among other topics [see full list below].
The data can be used for exploratory and descriptive analysis, with greatest utility at low levels of resolution (e.g. nation-states, supranational groupings). Level of resolution in analysis of these data should be sufficiently low to approximate confidence intervals.
These data can be used for teaching 3M methods, including data harmonization in L3M, 3M research design, survey design, 3M measurement invariance, analysis, visualization, and reporting. They also offer opportunities to teach about paradata, metadata, and data management in L3M designs.
The country units are an unbalanced panel derived from non-probability samples of countries and respondents. Panels (countries) are subject to left and right censoring and are thus unbalanced. This design limitation can be overcome to the extent that VOTP panels are harmonized with public measurements from other 3M surveys to establish balance in terms of panels and occasions of measurement. Should L3M harmonization occur, these data can be assigned confidence weights to reflect the amount of error in these surveys.
Pooled public opinion surveys (country means), when combined with higher-quality country measurements of the same concepts (ASQ, ADQ), can be leveraged to increase the statistical power of pooled public-opinion research designs (multiple L3M datasets), that is, studies of public, rather than personal, beliefs.
The Gallup Voice of the People survey data are based on uncertain, underspecified sampling methods. Country sampling is non-random. The sampling method appears to be primarily probability and quota sampling, with occasional oversampling of urban populations in difficult-to-survey populations. The sampling units (countries and individuals) are poorly defined, suggesting these data have more value in research designs calling for independent-samples replication and repeated-measures frameworks.
The Voice of the People Survey Series is WIN/Gallup International Association's End of Year survey and is a global study that collects the public's view on the challenges that the world faces today. Ongoing since 1977, the purpose of WIN/Gallup International's End of Year survey is to provide a platform for respondents to speak out concerning government and corporate policies. The Voice of the People, End of Year Surveys for 2012, fielded June 2012 to February 2013, were conducted in 56 countries to solicit public opinion on social and political issues. Respondents were asked whether their country was governed by the will of the people, as well as their attitudes about their society. Additional questions addressed respondents' living conditions and feelings of safety around their living area, as well as personal happiness. Respondents' opinions were also gathered in relation to business development and their views on the effectiveness of the World Health Organization. Respondents were also surveyed on ownership and use of mobile devices. Demographic information includes sex, age, income, education level, employment status, and type of living area.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Overview
Welcome to Kaggle's second annual Machine Learning and Data Science Survey ― and our first-ever survey data challenge.
This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!
There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.
Challenge
This year Kaggle is launching the first Data Science Survey Challenge, where we will award a prize pool of $28,000 to kernel authors who tell a rich story about a subset of the data science and machine learning community.
In our second year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities represented within the survey. For that reason, we're inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.
The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!
Submissions will be evaluated on the following:
Composition - Is there a clear narrative thread to the story that's articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.
Originality - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought-provoking, and fresh all at the same time.
Documentation - Are your code, kernel, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high-quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.
To be valid, a submission must be contained in one kernel made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science Survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.
While the challenge is running, Kaggle will also give a Weekly Kernel Award of $1,500 to recognize excellent kernels that are public analyses of the survey. Weekly Kernel Awards will be announced every Friday between 11/9 and 11/30.
How to Participate
To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.
No submission is necessary for the Weekly Kernel Awards. To be eligible, a kernel must be public and use the 2018 Data Science Survey as a data source.
Timeline
All dates are 11:59 PM UTC.
Submission deadline: December 3rd
Winners announced: December 10th
Weekly Kernel Award prize winner announcements: November 9th, 16th, 23rd, and 30th
All kernels are evaluated after the deadline.
Rules
To be eligible to win a prize in either of the above prize tracks, you must be:
a registered account holder at Kaggle.com; the older of 18 years old or the age of majority in your jurisdiction of residence; and not a resident of Crimea, Cuba, Iran, Syria, North Korea, or Sudan Your kernels will only be eligible to win if they have been made public on kaggle.com by the above deadline. All prizes are awarded at the discretion of Kaggle. Kaggle reserves the right to cancel or modify prize criteria.
Unfortunately employees, interns, contractors, officers and directors of Kaggle Inc., and their parent companies, are not eligible to win any prizes.
Survey Methodology ...
It can be difficult to find timely ESG data for multiple companies at a time unless you pay for an expensive subscription. This dataset includes ESG ratings and stock market information for approximately 700 companies. When comparing ESG ratings, it's important to compare a company with its industry or sector peers rather than across industries, because different material issues and metrics are considered more pertinent depending on the industry. For example, ESG key issues and metrics for a railroad company will differ from those for a bank.
This dataset includes companies categorized in the "Industrials" sector per the Global Industry Classification Standard (GICS). It includes ESG ratings from 4 different ESG ratings providers, where that data is available for a particular company. It also includes stock market data pulled in the first week of April 2024, including 52-week high and low prices, volume, etc.
[Image: example chart made using this dataset (trucking, April 2024).]
Key columns and descriptors
Unique_id: the number used by ESGAnalytics to uniquely distinguish each company
Symbol: Stock symbol
Exchange: the stock exchange where the company is listed (a company may be listed on multiple exchanges in the real world)
gicSector: sector classification (this is higher in the hierarchy than subindustry per GICS)
gicSubindustry: subindustry classification, the next level down in the GIC hierarchy
ESG ratings columns
Company_ESG_pulse: the main ESG ratings of this dataset; 1 is lowest investor risk and -1 means highest investor risk
ESG_beta: how much the pulse rating affects the stock market price of the company, per ESGAnalytics
SNP: the S&P Global ESG rating for the company (scale of 1-100 with 100 being the LOWEST investment risk)
Sustainalytics: the Sustainalytics ESG rating for a company with ratings 0-10 meaning negligible investment risk; 10-20 low risk; 20-30 medium risk; 30-40 high risk; 40+ severe risk
MSCI: the MSCI ESG rating for the company, with ratings of CCC, B meaning an industry laggard; BB, BBB, A meaning average; AA, AAA meaning industry leader
Update_data-ESG_scores: this is the date when the SNP, Sustainalytics, and MSCI scores were pulled; ESGAnalytics ratings were pulled April 2024 (as they are updated in real-time while the others are updated annually)
Stock market columns
Volume, Market Cap, 52w_highest price (52w means 52-week), 52w_lowest price, 52w_change price, and 52w_average volume were pulled in the first week of April 2024.
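A minimal sketch of peer comparison using the bands described above (the company rows and scores are invented; only the Sustainalytics band boundaries come from the column descriptions):

```python
# Invented peer-group rows for illustration.
companies = [
    {"Symbol": "RR1", "gicSubindustry": "Railroads", "Sustainalytics": 18.0},
    {"Symbol": "RR2", "gicSubindustry": "Railroads", "Sustainalytics": 27.5},
    {"Symbol": "RR3", "gicSubindustry": "Railroads", "Sustainalytics": 35.0},
]

def sustainalytics_band(score):
    """Map a Sustainalytics score to its risk band."""
    for upper, band in [(10, "negligible"), (20, "low"), (30, "medium"), (40, "high")]:
        if score < upper:
            return band
    return "severe"

# Compare one company against the mean of its sub-industry peers,
# since ratings are only meaningful within an industry.
peers = [c for c in companies if c["gicSubindustry"] == "Railroads"]
peer_mean = sum(c["Sustainalytics"] for c in peers) / len(peers)
print(sustainalytics_band(18.0), round(peer_mean, 2))  # low 26.83
```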
For more details about the ESG ratings, please see my Medium post on ESG data providers.
The data is available via ESGAnalytics.io and Finazon.io use licenses (per my subscriptions with them).
Similar to others on kaggle who have shared ESG datasets, my objective is to help make ESG data more accessible and understandable so that more people are versed in what ESG is and how different companies rate.
Please let me know any comments or if there are other ESG-related datasets that you are interested in. Thank you!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comes from the Annual Community Survey questions related to resident satisfaction with City Parks, Recreation, Libraries, Arts, & Cultural Centers. Survey participants are asked: "Please rate your level of satisfaction with a) Quality of City swimming pools; b) Quality of neighborhood parks; c) Quality of City recreation & community centers; d) Quality of Tempe History Museum; e) Quality of Tempe Public Library; f) Quality of Tempe Center for the Arts." Survey respondents are asked to rate their satisfaction level on a scale of 5 to 1, where 5 means "Very Satisfied" and 1 means "Very Dissatisfied" (responses of "don't know" are excluded). The survey is mailed to a random sample of households in the City of Tempe and has a 95% confidence level. This page provides data for the City Parks, Recreation, Libraries, Arts, & Cultural Centers performance measure. The performance measure dashboard is available at 3.16 Community Services Facilities and Open Spaces.
Additional Information:
Source: Community Attitude Survey (Vendor: ETC Institute)
Contact: Wydale Holmes
Contact E-Mail: wydale_holmes@tempe.gov
Data Source Type: Excel and PDF Report
Preparation Method: Extracted from Annual Community Survey results
Publish Frequency: Annual
Publish Method: Manual
Data Dictionary
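A minimal sketch of how the 5-to-1 satisfaction scale might be summarized, assuming ratings are coded as integers and "don't know" responses are dropped as the survey description specifies (the response values are invented):

```python
# Toy responses on the 5-to-1 scale; "don't know" is excluded before summarizing.
responses = [5, 4, 4, "don't know", 3, 5, 1, "don't know", 2]
valid = [r for r in responses if isinstance(r, int)]

mean_satisfaction = sum(valid) / len(valid)
share_satisfied = sum(1 for r in valid if r >= 4) / len(valid)  # rated 4 or 5
print(round(mean_satisfaction, 2), round(share_satisfied, 2))  # 3.43 0.57
```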
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Network traffic from server directly to the clients.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Increasing genetic and phenotypic data size is critical for understanding the genetic determinants of diseases. Evidently, establishing practical means for collaboration and data sharing among institutions is a fundamental methodological barrier to performing high-powered studies. As samples become more heterogeneous, complex statistical approaches, such as generalized linear mixed-effects models, must be used to correct for confounders that may bias results. On another front, due to privacy concerns around Protected Health Information (PHI), sharing of genetic information is restrictively regulated under laws such as the Health Insurance Portability and Accountability Act (HIPAA). This limits data sharing among institutions and hampers efforts to execute high-powered collaborative studies. Federated approaches are promising for alleviating the issues around privacy and performance, since sensitive data never leave the local sites. Motivated by these considerations, we developed FedGMMAT, a federated genetic association testing tool that utilizes a federated statistical testing approach for efficient association tests that can correct for confounding fixed and additive polygenic random effects across collaborating sites. Genetic data are never shared among collaborating sites, and the intermediate statistics are protected by encryption. Using simulated and real datasets, we demonstrate that FedGMMAT can achieve virtually the same results as pooled analysis under a privacy-preserving framework with practical resource requirements.
This dataset comprises eight files related to salt marsh monitoring data or measures of human disturbance (i.e., human impacts in terms of physical, chemical, and land-use stressors) collected at 33 marsh study units (MSUs) in five National Parks within the NPS Northeast Coastal and Barrier Network (NCBN) along the northeastern coast of the US. Two files contain data related to the species and coverage of salt marsh vegetation observed in MSUs (1 data file, 1 definitions file). Two files contain data related to the species and abundance of nekton collected from creeks, pools and ditches in MSUs (1 data file, 1 definitions file). Two files contain data related to the height of key salt marsh vegetation species observed in MSUs (1 data file, 1 definitions file). Two files contain data related to metrics describing the degree of human disturbance in MSUs (1 data file, 1 definitions file). Salt marsh monitoring data were generally collected from 2008-2013; however, salt marsh monitoring data were collected irregularly between 1997 and 2007 as part of a pilot program in a small number of the MSUs. Human disturbance metrics were derived from existing aerial imagery and the 2006 National Land Cover Database.
This dataset represents the dam density and storage volumes within individual local and accumulated upstream catchments for NHDPlusV2 Waterbodies based on the National Anthropogenic Barrier Dataset (NABD). Catchment boundaries in LakeCat are defined in one of two ways: on-network or off-network. The on-network catchment boundaries follow the catchments provided in the NHDPlusV2, and the metrics for these lakes mirror metrics from StreamCat but substitute the COMID of the NHDWaterbody for that of the NHDFlowline. The off-network catchment framework uses the NHDPlusV2 flow direction rasters to define non-overlapping lake-catchment boundaries and then links them through an off-network flow table. The main objective of this project was to develop a dataset of large, anthropogenic barriers that are spatially linked to the National Hydrography Dataset Plus Version 1 (NHDPlusV1) for the conterminous U.S. to facilitate GIS analyses based on the NHDPlusV1/NHD and NID datasets. To meet this objective, Michigan State University conducted a spatial linkage of the point dataset of the 2009 National Inventory of Dams (NID), created by the U.S. Army Corps of Engineers (USACE), to the NHDPlusV1/NHD. The pool of dam data included was modified based on 1) dam removals that occurred after development of the 2009 NID and 2) the identification of duplicate dam records along state boundaries (cases where more than one state reported the same dam). The US Geological Survey (USGS) Aquatic GAP Program supported this work. The (dams/catchment) and (dam_storage/catchment) metrics were summarized and accumulated into watersheds to produce local catchment-level and watershed-level metrics as a point data type.
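The local-to-watershed accumulation described above can be sketched as follows. The catchment IDs, the downstream flow table, and the dam counts here are hypothetical toy values, not taken from NABD or LakeCat:

```python
# Hypothetical sketch: accumulating local catchment dam counts into
# watershed (upstream-accumulated) totals by walking a flow table.
from collections import defaultdict

# Toy flow table: each catchment drains to one downstream catchment (or None).
downstream = {"A": "C", "B": "C", "C": "D", "D": None}
local_dams = {"A": 2, "B": 0, "C": 1, "D": 3}

# Watershed total for a catchment = its local dams plus all upstream dams.
watershed_dams = defaultdict(int)
for cat, n in local_dams.items():
    node = cat
    while node is not None:          # push this catchment's dams downstream
        watershed_dams[node] += n
        node = downstream[node]

print(dict(watershed_dams))  # outlet D accumulates every upstream dam
```

Dividing each total by the corresponding local or accumulated watershed area would then yield density metrics analogous to the (dams/catchment) values in this dataset.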
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Urinary expressed prostatic secretion, or "EPS-urine", is proximal tissue fluid that is collected after a digital rectal exam (DRE). EPS-urine is a rich source of prostate-derived proteins that can be used for biomarker discovery for prostate cancer (PCa) and other prostatic diseases. We previously conducted a comprehensive proteome analysis of direct expressed prostatic secretions (EPS). In the current study, we defined the proteome of EPS-urine employing Multidimensional Protein Identification Technology (MudPIT), providing a comprehensive catalogue of this body fluid for future biomarker studies. We identified 1022 unique proteins in a heterogeneous cohort of 11 EPS-urines derived from biopsy-negative, noncancer diagnoses, including benign prostatic hyperplasia (BPH), and low-grade PCa, representative of secreted prostate and immune system-derived proteins in a urine background. We further applied MudPIT-based proteomics to generate and compare the differential proteome from a subset of pooled urines (pre-DRE) and EPS-urines (post-DRE) from noncancer and PCa patients. The direct proteomic comparison of these highly controlled patient sample pools enabled us to define a list of prostate-enriched proteins detectable in EPS-urine and distinguishable from a complex urine protein background. A combinatorial analysis of both proteomics data sets, together with systematic integration with publicly available proteomics data of related body fluids, human tissue transcriptomic data, and immunohistochemistry images from the Human Protein Atlas database, allowed us to demarcate a robust panel of 49 prostate-derived proteins in EPS-urine. Finally, we validated the expression of seven of these proteins using Western blotting, supporting the likelihood that they originate from the prostate. The definition of these prostatic proteins in EPS-urine samples provides a reference for future investigations of prostatic-disease biomarkers.
Pooling individual samples prior to DNA extraction can mitigate the cost of DNA extraction and genotyping; however, these methods must accurately produce equal representation of individuals within pools. This data set was generated to determine the accuracy of pool construction based on white blood cell counts compared with two common DNA quantification methods. Fifty individual bovine blood samples were collected and then pooled, with all individuals represented in each pool. Pools were constructed with the target of equal representation of each individual animal based on number of white blood cells, spectrophotometric readings, spectrofluorometric readings, and whole blood volume, with 9 pools per method for a total of 36 pools. Pools and the individual samples that comprised them were genotyped using a commercially available genotyping array. ASReml was used to estimate variance components for individual animal contributions to pools. The correlation between animal contributions to two pools was estimated using a bivariate analysis, with starting values set to the result of a univariate analysis. The dataset includes: 1) pooling allele frequencies (PAF) for all pools and individual animals, computed from normalized intensities for red (X) and green (Y): PAF = X / (X + Y); 2) genotypes, i.e., the number of copies of the B (green) allele (0, 1, 2); and 3) definitions for each sample.

Resources in this dataset:

Resource Title: Pooling Allele Frequencies (paf) for all pools and individual animals. File Name: pafAnimal.csv.gz. Resource Description: Pooling allele frequencies (paf) for all pools and individual animals, computed from normalized intensities for red (X) and green (Y); paf = X / (X + Y).

Resource Title: Genotypes for individuals within pools. File Name: g.csv.gz. Resource Description: Genotypes (number of copies of the B (green) allele: 0, 1, 2) for individual bovine animals within pools.

Resource Title: Sample Definitions. File Name: XY Data Key.xlsx. Resource Description: Definitions for each sample (both pools and individual animals).
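The PAF formula given in the description, PAF = X / (X + Y), can be sketched directly; the function name and the example intensity values below are illustrative, not part of the dataset's pipeline:

```python
# Sketch of the dataset's PAF definition: pooling allele frequency from
# normalized red (X) and green (Y) channel intensities, PAF = X / (X + Y).
def pooling_allele_frequency(x, y):
    """Return PAF for one marker, or None when there is no signal."""
    total = x + y
    if total == 0:
        return None  # both channels zero: PAF is undefined at this marker
    return x / total

# Equal red and green intensity corresponds to a PAF of 0.5,
# as expected for a pool with balanced allele representation.
print(pooling_allele_frequency(0.8, 0.8))
```

For a pool with perfectly equal animal representation, the PAF at each marker should approximate the mean of the individual animals' B-allele dosages (0, 1, 2) divided by 2, which is the basis for the variance-component comparison across pooling methods described above.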