Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
Monte Carlo simulations are commonly used to test the performance of estimators and models from rival methods under a range of data generating processes. This tool improves our understanding of the relative merits of rival methods in different contexts, such as varying sample sizes and violations of assumptions. When used, it is common to report the bias and/or the root mean squared error of the different methods. It is far less common to report the standard deviation, overconfidence, coverage probability, or power. Each of these six performance statistics provides important, and often differing, information regarding a method's performance. Here, we present a structured way to think about Monte Carlo performance statistics. In replications of three prominent papers, we demonstrate the utility of our approach and provide new substantive results about the performance of rival methods.
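The six statistics can all be computed from the same simulation output. A minimal sketch follows; the definitions of overconfidence (ratio of Monte Carlo SD to mean reported standard error) and power (rejection rate of H0: theta = 0 at level alpha) are common conventions assumed here, not necessarily the ones used in the replicated papers.

```python
import numpy as np
from statistics import NormalDist

def performance_stats(estimates, std_errors, truth, alpha=0.05):
    """Six common Monte Carlo performance statistics for one method.

    estimates / std_errors: one entry per simulation replication.
    truth: the parameter value used in the data-generating process.
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    z = NormalDist().inv_cdf(1 - alpha / 2)            # ~1.96 for alpha = 0.05
    bias = est.mean() - truth                          # average error
    sd = est.std(ddof=1)                               # Monte Carlo SD of the estimates
    rmse = np.sqrt(((est - truth) ** 2).mean())        # root mean squared error
    overconfidence = sd / se.mean()                    # >1: reported SEs are too small
    lo, hi = est - z * se, est + z * se
    coverage = ((lo <= truth) & (truth <= hi)).mean()  # CI coverage probability
    power = (np.abs(est / se) > z).mean()              # rejection rate of H0: theta = 0
    return {"bias": bias, "sd": sd, "rmse": rmse,
            "overconfidence": overconfidence,
            "coverage": coverage, "power": power}

# Toy illustration with four replications around a true value of 1.0.
print(performance_stats([1.2, 0.8, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5], 1.0))
```

Note that bias and RMSE alone cannot distinguish an unbiased but noisy estimator from a precise but miscalibrated one; that is exactly why the remaining four statistics carry differing information.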
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We show that the expected value of the largest order statistic in Gaussian samples can be accurately approximated as (0.2069 ln(ln(n)) + 0.942)^4, where n ∈ [2, 10^8] is the sample size, while the standard deviation of the largest order statistic can be approximated as −0.4205 arctan(0.5556[ln(ln(n)) − 0.9148]) + 0.5675. We also provide an approximation of the probability density function of the largest order statistic which in turn can be used to approximate its higher order moments. The proposed approximations are computationally efficient, and improve previous approximations of the mean and standard deviation given by Chen and Tyler (1999).
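The two closed-form approximations translate directly into code; a minimal sketch:

```python
import math

def approx_max_mean(n):
    """Approximate E[largest of n iid standard normals], valid for 2 <= n <= 1e8."""
    return (0.2069 * math.log(math.log(n)) + 0.942) ** 4

def approx_max_sd(n):
    """Approximate SD of the largest order statistic over the same range of n."""
    return -0.4205 * math.atan(0.5556 * (math.log(math.log(n)) - 0.9148)) + 0.5675
```

As a sanity check, for n = 2 the exact values are E[max] = 1/√π ≈ 0.5642 and SD = √(1 − 1/π) ≈ 0.8256; the approximations give about 0.563 and 0.828.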
We will learn how to work on a real Data Analysis project with Python. Questions are given in the project and then solved with the help of Python. You can call it a Data Analysis with Python project or, equally, Data Science with Python.
The commands that we used in this project:
Challenges for this dataset:
Q. 1) Find all the unique 'Wind Speed' values in the data.
Q. 2) Find the number of times when the 'Weather' is exactly 'Clear'.
Q. 3) Find the number of times when the 'Wind Speed' was exactly 4 km/h.
Q. 4) Find out all the null values in the data.
Q. 5) Rename the column 'Weather' of the dataframe to 'Weather Condition'.
Q. 6) What is the mean 'Visibility'?
Q. 7) What is the standard deviation of 'Pressure' in this data?
Q. 8) What is the variance of 'Relative Humidity' in this data?
Q. 9) Find all instances when 'Snow' was recorded.
Q. 10) Find all instances when 'Wind Speed' is above 24 and 'Visibility' is 25.
Q. 11) What is the mean value of each column for each 'Weather Condition'?
Q. 12) What are the minimum and maximum values of each column for each 'Weather Condition'?
Q. 13) Show all the records where the 'Weather Condition' is 'Fog'.
Q. 14) Find all instances when 'Weather is Clear' or 'Visibility is above 40'.
Q. 15) Find all instances when: A. 'Weather is Clear' and 'Relative Humidity is greater than 50', or B. 'Visibility is above 40'.
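Most of these questions are pandas one-liners. A minimal sketch using a toy dataframe; the real project presumably loads a weather CSV, so the inline values (and exact column names, taken from the questions) are illustrative assumptions:

```python
import pandas as pd

# Toy stand-in for the weather dataset; the real project would start from
# something like pd.read_csv("weather.csv").
data = pd.DataFrame({
    "Wind Speed": [4, 4, 7, 24, 30],
    "Weather": ["Clear", "Fog", "Clear", "Snow", "Clear"],
    "Visibility": [25.0, 8.0, 25.0, 4.0, 48.3],
    "Pressure": [101.2, 100.9, 101.5, 100.1, 101.3],
    "Relative Humidity": [85, 90, 60, 95, 50],
})

print(data["Wind Speed"].unique())            # Q1: unique wind speed values
print((data["Weather"] == "Clear").sum())     # Q2: count of exactly 'Clear'
print((data["Wind Speed"] == 4).sum())        # Q3: wind speed exactly 4
print(data.isnull().sum())                    # Q4: null values per column
data = data.rename(columns={"Weather": "Weather Condition"})  # Q5
print(data["Visibility"].mean())              # Q6
print(data["Pressure"].std())                 # Q7
print(data["Relative Humidity"].var())        # Q8
print(data[data["Weather Condition"].str.contains("Snow")])   # Q9
print(data[(data["Wind Speed"] > 24) & (data["Visibility"] == 25)])  # Q10
print(data.groupby("Weather Condition").mean(numeric_only=True))     # Q11
print(data.groupby("Weather Condition").agg(["min", "max"]))         # Q12
print(data[(data["Weather Condition"] == "Clear") | (data["Visibility"] > 40)])  # Q14
```

The remaining questions combine the same boolean-mask and groupby patterns.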
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents median household incomes for various household sizes in Show Low, AZ, as reported by the U.S. Census Bureau. The dataset highlights the variation in median household income with the size of the family unit, offering valuable insights into economic trends and disparities within different household sizes, aiding in data analysis and decision-making.
Key observations
[Chart: Show Low, AZ median household income, by household size (in 2022 inflation-adjusted dollars)]
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Household Sizes:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are therefore subject to sampling variability and a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographics and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Show Low median household income. You can refer to it here
1. Colour patterns are used by many species to make decisions that ultimately affect their Darwinian fitness. Colour patterns consist of a mosaic of patches that differ in geometry and visual properties. Although traditionally pattern geometry and colour patch visual properties are analysed separately, these components are likely to work together as a functional unit. Despite this, the combined effect of patch visual properties, patch geometry, and the effects of the patch boundaries on animal visual systems, behaviour and fitness are relatively unexplored.
2. Here we describe Boundary Strength Analysis (BSA), a novel way to combine the geometry of the edges (boundaries among the patch classes) with the receptor noise estimate (ΔS) of the intensity of the edges. The method is based upon known properties of vertebrate and invertebrate retinas. The mean and SD of ΔS (mΔS, sΔS) of a colour pattern can be obtained by weighting each edge class ΔS by its length, separately for chromatic and ac...
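The length-weighted mean and SD of ΔS described above can be computed directly from per-edge-class values. A minimal sketch with hypothetical edge-class numbers (not from the paper):

```python
import numpy as np

def boundary_strength_stats(delta_s, edge_lengths):
    """Length-weighted mean (mDeltaS) and SD (sDeltaS) of edge contrasts.

    delta_s: receptor-noise contrast (Delta-S) of each boundary (edge) class.
    edge_lengths: total length of each edge class, used as the weights.
    """
    ds = np.asarray(delta_s, dtype=float)
    w = np.asarray(edge_lengths, dtype=float)
    m = np.average(ds, weights=w)                      # length-weighted mean
    s = np.sqrt(np.average((ds - m) ** 2, weights=w))  # length-weighted SD
    return m, s

# Hypothetical example: a short high-contrast boundary and a long subtle one.
print(boundary_strength_stats([6.0, 2.0], [1.0, 3.0]))
```

Per the abstract, this computation would be run twice, once over chromatic and once over achromatic ΔS values.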
Evaluating thermodynamic solubility is crucial to designing successful drug candidates. Yet predicting it with in silico approaches remains a challenge. Machine learning methods are used to develop regression models built on molecular descriptors. Recently, powerful solubility predictive models have been published using feature- and graph-based neural networks. These models often display attractive performances; yet their reliability may be deceiving when used for prospective prediction. This review investigates the origins of these discrepancies, following three directions: a historical perspective, an analysis of the structure of the aqueous solubility dataverse, and data quality. We demonstrate that new models are not ready for public usage because they lack a well-defined applicability domain and overlook some historical data sources. On the basis of a carefully reviewed dataset, we illustrate the influence of data quality on model predictivity. We comprehensively investigated over 20 years of published solubility datasets and models, highlighting overlooked and interconnected datasets. We benchmarked recently published models on a Sanofi dataset, as an example of a pharmaceutical context, and they performed poorly. We observed the impact of factors influencing the performance of the models: interlaboratory standard deviation, ionic state of the solute, and source of the solubility data. As a consequence, we propose a general workflow to curate aqueous solubility data with the aim of producing predictive models. Our results show how the data quality and applicability domain of public models affect their utility in a real pharmaceutical-industry context. We found that some data sources may be less reliable than initially expected, for instance the eChem dataset. This exhaustive aqueous solubility data analysis led to the development of a curation workflow; the resulting models and datasets are publicly available.
Data are available as CSV files.

File AqSolDBc.csv (and AqSolDB_Enriched)
AqSolDBc is the final curated dataset after filtering of the AqSolDB_Enriched dataset; it is the curated data from AqSolDB. The available columns are:
Source: if in AqSolDBc, the value is "AqSolDBc"
ID: compound ID (string)
Name: name of the compound (string)
InChI: InChI code of the chemical structure (string)
InChIKey: InChI hash code of the chemical structure (string)
ExperimentalLogS: base-10 logarithm of the thermodynamic solubility in water (mol/L) at pH 7 (+/-1) and ~300 K (float)
SMILEScurated: curated SMILES code of the chemical structure (string)
SD: standard laboratory deviation, default value: -1 (float)
Group: data quality label imported from AqSolDB (string)
Dataset: source of the data point (string)
Composition: purity of the substance: mono-constituent, multi-constituent, UVCB (categorical)
Origin: organic, organometallic, or NaN (categorical)
HasError: Yes or No, see ErrorType for details (boolean)
ErrorType: identifier of the error on the data point, default value: None (string)
AtomCount: number of atoms (integer)
AlertAtoms: true if the molecule contains one of the AlertAtoms; see curation (boolean)
DuplicateGroup: ID used to regroup duplicate structures (integer)
DuplicateSD: standard deviation of the measurements for each unique structure, based on duplicate observations (double)
DuplicateOccurrence: number of measurements for each unique structure (integer)
SD: experimental standard deviation as given in the original AqSolDB (double)

File AqSolDB.csv
Original data from AqSolDB. The available columns are:
ID: compound ID (string)
Name: name of the compound (string)
SMILES: original SMILES code of the chemical structure (string)
SmilesCurated: curated SMILES code of the chemical structure (string)
InChI: InChI code of the chemical structure (string)
InChIKey: InChI hash code of the chemical structure (string)
Composition: purity of the substance: mono-constituent, multi-constituent, UVCB (categorical)
Origin: organic, organometallic, or NaN (categorical)
Dataset: source of the data point (string)
HasError: Yes or No, see ErrorType for details (boolean)
ErrorType: identifier of the error on the data point, default value: None (string)
AtomCount: number of atoms (integer)
AlertAtoms: true if the molecule contains one of the AlertAtoms; see curation (boolean)
SD: experimental standard deviation as given in the original AqSolDB (double)

File AqSolDB_Enriched_for_AqSolDBc.csv
An extended version of AqSolDB_Enriched supplemented with molecular descriptors. Available columns:
ID: compound ID (string)
Name: name of the compound (string)
InChI: InChI code of the chemical structure (string)
InChIKey: InChI hash code of the chemical structure (string)
Solubility: base-10 logarithm of the thermodynamic solubility in water (mol/L) at pH 7 (+/-1) and ~300 K (float)
SMILES: original SMILES code of the chemical structure (string)
SD: standard laboratory deviation, default value: -1 (float)
Ocurrences: number of occurrences in the original merged dataset from the original AqSolDB dataset
Group: data quality label imported from AqSolDB (string)
MolWt: molecular weight (double)
MolLogP: computed logP lipophilicity (double)
MolMR: computed molecular refractivity (double)
HeavyAtomCount: number of heavy atoms (integer)
NumHAcceptors: number of hydrogen bond acceptors (integer)
NumHDonors: number of hydrogen bond donors (integer)
NumRotatableBonds: number of rotatable bonds (integer)
NumValenceElectrons: number of valence electrons (integer)
NumAromaticRings: number of aromatic rings (integer)
NumSaturatedRings: number of saturated rings (integer)
NumAliphaticRings: number of aliphatic rings (integer)
RingCount: number of rings (integer)
TPSA: total polar surface area in Å^2 (double)
LabuteASA: Labute approximate molecular surface area (double)
BalabanJ: Balaban topological descriptor (double)
BertzCT: Bertz molecular complexity descriptor (double)
CAS: Chemical Abstracts Service identifier (string)
Dataset: source of the data point (string)
Composition: purity of the substance: mono-constituent, multi-constituent, UVCB (categorical)
Origin: organic, organometallic, or NaN (categorical)
HasError: Yes or No, see ErrorType for details (boolean)
ErrorType: identifier of the error on the data point, default value: None (string)
ROMol: molecular identifier from RDKit (integer)
Smiles7: curated SMILES ionized at pH 7.0 (string)
hasGroupSeparatedPy: whether the molecule has separated formal charges neutralizing each other at pH 7.0, True/False (boolean)
totalChargePy: count of formal charge values on the molecule at pH 7.0 (integer)
sumChargePy: molecular formal charge at pH 7.0 (integer)
ChargeRatioPy: sumChargePy / totalChargePy, at pH 7.0 (double)
Pass_ChargeRatioPy: category of the compound according to the ionization state at pH 7.0: Uncharged, Negative, Zwitterion, Positive, PureChargeSeparation (categorical)
AtomCount: number of atoms (integer)
AlertAtoms: true if the molecule contains one of the AlertAtoms; see curation (boolean)
SD: experimental standard deviation as given in the original AqSolDB (double)

File OChem.csv
Raw file obtained from OChem. The available columns are:
index: a numerical identifier for each entry in the dataset (integer)
SMILES: Simplified Molecular Input Line Entry System (string)
CASRN: Chemical Abstracts Service Registry Number (string)
EXTERNALID: an external identifier that links to other databases or references (integer)
N: identifier specific to the dataset (integer)
NAME: the name of the chemical compound (string)
ARTICLEID: identifier for the article or publication where the data was reported (string)
PUBMEDID: identifier for the article in the PubMed database, which indexes biomedical and life sciences literature (string)
PAGE: page number in the publication where the data can be found (integer)
TABLE: table number in the publication where the data can be found (integer)
Water solubility: the solubility of the chemical compound in water (double)
UNIT {Water solubility}: the unit of measurement for water solubility (e.g., mg/L, mol/L) (string)
Water solubility {measured, converted}: water solubility data, indicating whether the value is measured directly or converted from another unit (string)
UNIT {Water solubility}.1: the unit of measurement for the converted water solubility value (string)
Dataset: the specific dataset or source from which the data is derived (string)
Temperature: the temperature at which the water solubility measurement was taken (double)
UNIT {Temperature}: the unit of measurement for temperature (e.g., Celsius, Kelvin) (string)
Ionic strength: the ionic strength of the solution in which solubility was measured (double)
UNIT {Ionic strength}: the unit of measurement for ionic strength (e.g., mol/L) (string)
comment (chemical): additional comments or notes about the chemical compound (string)
source: the source from which the data was obtained (string)
pH: the pH value of the solution in which solubility was measured (double)
UNIT {pH}: the unit for pH, which is dimensionless (string)
Quality code: a code indicating the quality or reliability of the data (integer)
UNIT {Quality code}: the unit or scale used for the quality code (string)
MW: molecular weight of the chemical compound (double)
LogS (Format): logarithm of the solubility (double)
Temperature (Format): temperature format (string)
Temperature Keep (Format): indicates whether the row is to be kept based on the temperature (boolean)
NB Hetero (Format): number of heteroatoms in the chemical compound (integer)
CpId: compound identifier, a unique ID assigned to each chemical compound in the dataset (integer)

File OChemUnseen.csv
Solubility data from OChem, curated and orthogonal to AqSolDB. The available columns are:
SMILES: curated SMILES code of the chemical structure (string)
LogS: base-10 logarithm of the thermodynamic solubility in water (mol/L) at pH 7 (+/-1) (float)

File OChemOverlapping.csv
Solubility data from OChem, curated; chemical structures are also present inside
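Given the column documentation above, a typical first step is to load the curated file and drop records with a flagged error. A minimal sketch; the two inline rows are invented placeholders and only the file and column names come from the documentation:

```python
import io
import pandas as pd

# Invented two-row stand-in for AqSolDBc.csv, using a subset of the
# documented columns; in practice: pd.read_csv("AqSolDBc.csv").
csv_text = """ID,SMILEScurated,ExperimentalLogS,HasError,Group
A-1,CCO,-0.77,No,G1
A-2,c1ccccc1,-1.64,Yes,G3
"""

df = pd.read_csv(io.StringIO(csv_text))
clean = df[df["HasError"] == "No"]   # keep only records without a flagged error
print(len(clean), clean["ExperimentalLogS"].mean())
```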
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data is part of the Monthly aggregated Water Vapor MODIS MCD19A2 (1 km) dataset. Check the related identifiers section on the Zenodo side panel to access other parts of the dataset.

General Description

The monthly aggregated water vapor dataset is derived from MCD19A2 v061. The water vapor data measure the column above ground retrieved from the MODIS near-IR band at 0.94 μm. The dataset spans 2000 to 2022 and covers the entire globe. It can be used in many applications, such as water cycle modeling, vegetation mapping, and soil mapping. This dataset includes:

Monthly time-series: Derived from MCD19A2 v061, this data provides a monthly aggregated mean and standard deviation of daily water vapor time-series data from 2000 to 2022. Only positive, non-cloudy pixels were considered valid observations when deriving the mean and standard deviation. The remaining no-data values were filled using the TMWM algorithm. This dataset also includes mean and standard deviation values smoothed with the Whittaker method. The quality assessment layers and the number of valid observations for each month can provide an indication of the reliability of the monthly mean and standard deviation values.
Yearly time-series: Derived from the monthly time-series, this data provides yearly aggregated statistics of the monthly time-series data.
Long-term data (2000-2022): Derived from the monthly time-series, this data provides long-term aggregated statistics for the whole series of monthly observations.

Data Details

Time period: 2000–2002
Type of data: Water vapor column above the ground (0.001 cm)
How the data was collected or derived: Derived from MCD19A2 v061 using Google Earth Engine. Cloudy pixels were removed and only positive water vapor values were used to compute the statistics. Time-series gap-filling and smoothing were computed using the Scikit-map Python package.
Statistical methods used: Four statistics were derived: mean, standard deviation, smoothed mean, smoothed standard deviation.
Limitations or exclusions in the data: The dataset does not include data for Antarctica.
Coordinate reference system: EPSG:4326
Bounding box (Xmin, Ymin, Xmax, Ymax): (-180.00000, -62.00081, 179.99994, 87.37000)
Spatial resolution: 1/120 d.d. = 0.008333333 (1 km)
Image size: 43,200 x 17,924
File format: Cloud Optimized GeoTIFF (COG)

Support

If you discover a bug, artifact, or inconsistency, or if you have a question, please use one of the following channels:

Technical issues and questions about the code: GitLab Issues
General questions and comments: LandGIS Forum

Name convention

To ensure consistency and ease of use across and within the projects, we follow the standard Open-Earth-Monitor file-naming convention. The convention works with 10 fields that describe important properties of the data; in this way users can search files, prepare data analyses, etc., without needing to open the files. The fields are:

Generic variable name: wv = water vapor
Variable procedure combination: mcd19a2v061.seasconv = MCD19A2 v061 with gap-filling algorithm
Position in the probability distribution / variable type: m = mean | sd = standard deviation | n = number of observations | qa = quality assessment
Spatial support: 1km
Depth reference: s = surface
Time reference begin time: 20000101 = 2000-01-01
Time reference end time: 20021231 = 2002-12-31
Bounding box: go = global (without Antarctica)
EPSG code: epsg.4326 = EPSG:4326
Version code: v20230619 = 2023-06-19 (creation date)
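The ten naming-convention fields above can be unpacked programmatically. A minimal sketch, assuming the fields are joined with underscores in the order listed (the separator is an assumption here, inferred from the field values themselves):

```python
FIELDS = ["variable", "procedure", "type", "spatial_support",
          "depth_reference", "time_begin", "time_end",
          "bounding_box", "epsg", "version"]

def parse_layer_name(name):
    """Split an Open-Earth-Monitor style layer name into its 10 fields."""
    tokens = name.split("_")
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(tokens)}")
    return dict(zip(FIELDS, tokens))

meta = parse_layer_name(
    "wv_mcd19a2v061.seasconv_m_1km_s_20000101_20021231_go_epsg.4326_v20230619")
print(meta["type"], meta["time_begin"])  # m 20000101
```

This is exactly the promised benefit of the convention: files can be selected by variable type or time range without being opened.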
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data is part of the Monthly aggregated Water Vapor MODIS MCD19A2 (1 km) dataset. Check the related identifiers section on the Zenodo side panel to access other parts of the dataset. General Description The monthly aggregated water vapor dataset is derived from MCD19A2 v061. The Water Vapor data measures the column above ground retrieved from MODIS near-IR bands at 0.94μm. The dataset time spans from 2000 to 2022 and provides data that covers the entire globe. The dataset can be used in many applications like water cycle modeling, vegetation mapping, and soil mapping. This dataset includes:
Monthly time-series: Derived from MCD19A2 v061, this data provides a monthly aggregated mean and standard deviation of daily water vapor time-series data from 2000 to 2022. Only positive, non-cloudy pixels were considered valid observations when deriving the mean and standard deviation. The remaining no-data values were filled using the TMWM algorithm. This dataset also includes mean and standard deviation values smoothed with the Whittaker method. The quality assessment layers and the number of valid observations for each month can provide an indication of the reliability of the monthly mean and standard deviation values.
Yearly time-series: Derived from the monthly time-series, this data provides yearly aggregated statistics of the monthly time-series data.
Long-term data (2000-2022): Derived from the monthly time-series, this data provides long-term aggregated statistics for the whole series of monthly observations.

Data Details

Time period: 2021–2022
Type of data: Water vapor column above the ground (0.001 cm)
How the data was collected or derived: Derived from MCD19A2 v061 using Google Earth Engine. Cloudy pixels were removed and only positive water vapor values were used to compute the statistics. Time-series gap-filling and smoothing were computed using the Scikit-map Python package.
Statistical methods used: Four statistics were derived: mean, standard deviation, smoothed mean, smoothed standard deviation.
Limitations or exclusions in the data: The dataset does not include data for Antarctica.
Coordinate reference system: EPSG:4326
Bounding box (Xmin, Ymin, Xmax, Ymax): (-180.00000, -62.00081, 179.99994, 87.37000)
Spatial resolution: 1/120 d.d. = 0.008333333 (1 km)
Image size: 43,200 x 17,924
File format: Cloud Optimized GeoTIFF (COG)

Support

If you discover a bug, artifact, or inconsistency, or if you have a question, please use one of the following channels:

Technical issues and questions about the code: GitLab Issues
General questions and comments: LandGIS Forum

Name convention

To ensure consistency and ease of use across and within the projects, we follow the standard Open-Earth-Monitor file-naming convention. The convention works with 10 fields that describe important properties of the data; in this way users can search files, prepare data analyses, etc., without needing to open the files. The fields are:

Generic variable name: wv = water vapor
Variable procedure combination: mcd19a2v061.seasconv = MCD19A2 v061 with gap-filling algorithm
Position in the probability distribution / variable type: m = mean | sd = standard deviation | n = number of observations | qa = quality assessment
Spatial support: 1km
Depth reference: s = surface
Time reference begin time: 20210101 = 2021-01-01
Time reference end time: 20221231 = 2022-12-31
Bounding box: go = global (without Antarctica)
EPSG code: epsg.4326 = EPSG:4326
Version code: v20230619 = 2023-06-19 (creation date)
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This data is part of the Monthly aggregated Water Vapor MODIS MCD19A2 (1 km) dataset. Check the related identifiers section on the Zenodo side panel to access other parts of the dataset.

General Description

The monthly aggregated water vapor dataset is derived from MCD19A2 v061. The water vapor data measure the column above ground retrieved from the MODIS near-IR band at 0.94 μm. The dataset spans 2000 to 2022 and covers the entire globe. It can be used in many applications, such as water cycle modeling, vegetation mapping, and soil mapping. This dataset includes:

Monthly time-series: Derived from MCD19A2 v061, this data provides a monthly aggregated mean and standard deviation of daily water vapor time-series data from 2000 to 2022. Only positive, non-cloudy pixels were considered valid observations when deriving the mean and standard deviation. The remaining no-data values were filled using the TMWM algorithm. This dataset also includes mean and standard deviation values smoothed with the Whittaker method. The quality assessment layers and the number of valid observations for each month can provide an indication of the reliability of the monthly mean and standard deviation values.
Yearly time-series: Derived from the monthly time-series, this data provides yearly aggregated statistics of the monthly time-series data.
Long-term data (2000-2022): Derived from the monthly time-series, this data provides long-term aggregated statistics for the whole series of monthly observations.

Data Details

Time period: 2018–2020
Type of data: Water vapor column above the ground (0.001 cm)
How the data was collected or derived: Derived from MCD19A2 v061 using Google Earth Engine. Cloudy pixels were removed and only positive water vapor values were used to compute the statistics. Time-series gap-filling and smoothing were computed using the Scikit-map Python package.
Statistical methods used: Four statistics were derived: mean, standard deviation, smoothed mean, smoothed standard deviation.
Limitations or exclusions in the data: The dataset does not include data for Antarctica.
Coordinate reference system: EPSG:4326
Bounding box (Xmin, Ymin, Xmax, Ymax): (-180.00000, -62.00081, 179.99994, 87.37000)
Spatial resolution: 1/120 d.d. = 0.008333333 (1 km)
Image size: 43,200 x 17,924
File format: Cloud Optimized GeoTIFF (COG)

Support

If you discover a bug, artifact, or inconsistency, or if you have a question, please use one of the following channels:

Technical issues and questions about the code: GitLab Issues
General questions and comments: LandGIS Forum

Name convention

To ensure consistency and ease of use across and within the projects, we follow the standard Open-Earth-Monitor file-naming convention. The convention works with 10 fields that describe important properties of the data; in this way users can search files, prepare data analyses, etc., without needing to open the files. The fields are:

Generic variable name: wv = water vapor
Variable procedure combination: mcd19a2v061.seasconv = MCD19A2 v061 with gap-filling algorithm
Position in the probability distribution / variable type: m = mean | sd = standard deviation | n = number of observations | qa = quality assessment
Spatial support: 1km
Depth reference: s = surface
Time reference begin time: 20180101 = 2018-01-01
Time reference end time: 20201231 = 2020-12-31
Bounding box: go = global (without Antarctica)
EPSG code: epsg.4326 = EPSG:4326
Version code: v20230619 = 2023-06-19 (creation date)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In cluster analysis, a common first step is to scale the data, aiming to better partition them into clusters. Although many techniques have been introduced for this purpose over the years, it is probably fair to say that the workhorse of this preprocessing phase has been dividing the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques have roots in some statistical view of the data. Here we explore the use of multidimensional shapes of data to obtain scaling factors for use prior to clustering by some method, like k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can help determine appropriate scaling factors. Focusing on what might be called “midrange” distances, we formulate a constrained nonlinear programming problem and use it to produce candidate scaling-factor sets that can then be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
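The baseline the paper compares against, per-dimension division by the standard deviation, is a one-liner. A minimal numpy sketch (the shape-complexity method would supply different scaling factors, but the division step is the same):

```python
import numpy as np

def scale_by_std(X):
    """Classical pre-clustering scaling: divide each dimension (column)
    of X by its standard deviation."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant dimensions untouched
    return X / std

# Two dimensions with very different spreads contribute unequally to
# Euclidean distances before scaling, and equally after.
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
Xs = scale_by_std(X)
print(Xs.std(axis=0))  # [1. 1.]
```

In the proposed approach, the per-dimension standard deviations above would be replaced by scaling factors produced by the shape-complexity-based nonlinear program.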
Added 32,000 more locations. For information on data calculations, please refer to the methodology PDF document. Information on how to calculate the data yourself is also provided, as well as how to buy data for $1.29.
The database contains 32,000 records on US Household Income Statistics & Geo Locations. The field description of the database is documented in the attached PDF file. To access all 348,893 records, on a scale roughly equivalent to a neighborhood (census tract), see the link below and make sure to upvote. Enjoy!
The dataset was originally developed for real estate and business investment research. Income is a vital element when determining both the quality and socioeconomic features of a given geographic location. The data were derived from over 36,000 files and cover 348,893 location records.
Only proper citation is required; please see the documentation for details. Have fun!
Golden Oak Research Group, LLC. “U.S. Income Database Kaggle”. Published: August 5, 2017. Accessed: day, month, year.
2011-2015 ACS 5-Year Documentation was provided by the U.S. Census Reports. Retrieved August 2, 2017, from https://www2.census.gov/programs-surveys/acs/summary_file/2015/data/5_year_by_state/
If you notice any inaccuracies, please tell us so we may provide you the most accurate data possible. You may reach us at: research_development@goldenoakresearch.com
For any questions, you can reach me at 585-626-2965.
Please note: this is my personal number, and email is preferred.
Check our data's accuracy: Census Fact Checker
Don't settle. Go big and win big. Optimize your potential. Overcome limitations and outperform expectations. Access all household income records on a scale roughly equivalent to a neighborhood; see the link below:
Website: Golden Oak Research Kaggle Deals all databases $1.29 Limited time only
A small startup with big dreams, giving the every day, up and coming data scientist professional grade data at affordable prices It's what we do.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Post-purchase online word of mouth (WOM) has emerged as a major research area in the marketing and information systems literature, as it functions both as a predictor and driver of sales revenue. While the relationship between WOM and sales is now quite well understood, it is less clear what drives WOM itself. Of specific interest is the question of whether a population’s propensity to contribute to post-consumption WOM depends on various indicators of that product’s availability. The study of these product-level effects is complicated by simultaneous social dynamic-level effects, which might also differ for products of varying availability and popularity. Furthermore, the bi-modal distribution commonly found in review ratings complicates the use of established summary statistics such as the arithmetic mean and standard deviation. In this paper we study a dataset of almost 280,000 consumer reviews for 433 movies over three years to disentangle these product-level effects and social dynamic-level effects and to better understand their effect on a population’s propensity to contribute to post-consumption WOM. We first show how the use of a simple arithmetic mean can lead to estimation problems. To resolve these problems and allow the study of social dynamics, we introduce a novel measure to capture the disagreement contained in non-normally distributed review ratings, based on an expectation-maximization algorithm for finite mixture models. Using this measure, we find that consumers are more likely to post reviews for products that are perceived to be less available in the market, but find no effect for those perceived to be less successful in the market. Investigating the social dynamic-level effects, we find a positive effect for products that have already accumulated a larger volume of prior reviews and a positive effect of disagreement on consumers’ propensity to engage in post-consumption WOM.
Investigating the interaction between the movie-level effects and social dynamic-level effects, we find that while high disagreement helps generate WOM for less popular (niche) products, it hampers that of highly popular (hit) products.
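The mixture-based disagreement idea above can be sketched as follows. This is a simplified, assumption-laden illustration: a two-component 1-D Gaussian mixture fit by EM, with one plausible disagreement score (balanced, well-separated modes score high); the paper's exact measure may differ.

```python
import numpy as np

def fit_two_component_gmm(x, iters=200):
    """Minimal EM for a 1-D two-component Gaussian mixture (illustrative)."""
    x = np.asarray(x, dtype=float)
    # Crude initialisation: seed the means at the 25th/75th percentiles.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.array([x.var(), x.var()]) + 1e-6
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each rating.
        dens = w / np.sqrt(2 * np.pi * var) * np.exp(
            -(x[:, None] - mu) ** 2 / (2 * var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        n_k = resp.sum(axis=0)
        w = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k + 1e-6
    return w, mu, var

# Hypothetical bimodal ratings: a "love it or hate it" movie.
rng = np.random.default_rng(1)
ratings = np.concatenate([rng.normal(1.5, 0.5, 300),
                          rng.normal(4.5, 0.5, 300)])

w, mu, var = fit_two_component_gmm(ratings)
# One illustrative disagreement score: high when the two modes are both
# well populated and far apart; near zero for a unimodal consensus.
disagreement = 2 * w.min() * abs(mu[1] - mu[0])
```

Note that an arithmetic mean of these ratings (about 3.0) would describe almost no individual reviewer, which is exactly the estimation problem the abstract points to.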
BTM Standard deviation – this mosaic dataset is part of a series of seafloor terrain datasets aimed at providing a consistent baseline to assist users in consistently characterizing Aotearoa New Zealand seafloor habitats. This series has been developed using the tools provided within the Benthic Terrain Model (BTM v3.0) across different multibeam echo-sounder datasets. The series includes derived outputs from 50 MBES survey sets conducted between 1999 and 2020 throughout the New Zealand marine environment (where available), covering an area of approximately 52,000 km2. Consistency and compatibility of the benthic terrain datasets have been achieved by utilising a common projected coordinate system (WGS84 Web Mercator), a common resolution (10 m), and a standard classification dictionary (also utilised by previous BTM studies in NZ). However, we advise caution when comparing the classification between different survey areas. Derived BTM outputs include the Bathymetric Position Index (BPI), Surface Derivative, Rugosity, Depth Statistics, and Terrain Classification. A standardised digital surface model and derived hillshade and aspect datasets have also been made available. The index of the original MBES survey surface models used in this analysis can be accessed from https://data.linz.govt.nz/layer/95574-nz-bathymetric-surface-model-index/ The full report and description of available output datasets are available at: https://www.doc.govt.nz/globalassets/documents/science-and-technical/drds367entire.pdf
We present the second data release of the Radial Velocity Experiment (RAVE), an ambitious spectroscopic survey to measure radial velocities and stellar atmosphere parameters (temperature, metallicity, surface gravity, and rotational velocity) of up to one million stars using the 6dF multi-object spectrograph on the 1.2m UK Schmidt Telescope of the Anglo-Australian Observatory (AAO). The RAVE program started in 2003, obtaining medium resolution spectra (median R=7500) in the Ca-triplet region (8410-8795 Å) for southern hemisphere stars drawn from the Tycho-2 and SuperCOSMOS catalogues, in the magnitude range 9<I<12. Following the first data release, the current release doubles the sample of published radial velocities, now containing 51829 radial velocities for 49327 individual stars observed on 141 nights between 2003 April 11 and 2005 March 31. Comparison with external data sets shows that the new data collected since 2004 April 3 show a standard deviation of 1.3km/s, about twice as good as for the first data release. For the first time, this data release contains values of stellar parameters from 22407 spectra of 21121 individual stars. They were derived by a penalized chi-square method using an extensive grid of synthetic spectra calculated from the latest version of Kurucz stellar atmosphere models. From comparison with external data sets, our conservative estimates of errors of the stellar parameters for a spectrum with an average signal-to-noise ratio (S/N) of ~40 are 400K in temperature, 0.5dex in gravity, and 0.2dex in metallicity. We note however that, for all three stellar parameters, the internal errors estimated from repeat RAVE observations of 855 stars are at least a factor 2 smaller. We demonstrate that the results show no systematic offsets if compared to values derived from photometry or complementary spectroscopic analyses.
The data release includes proper motions from the Starnet2, Tycho-2, and UCAC2 catalogs and photometric measurements from Tycho-2, USNO-B, DENIS, and 2MASS. The data release can be accessed via the RAVE Web site: http://www.rave-survey.org and through CDS.
The data sets included in this resource are published in Kravitz et al., 2012, "Distinct roles for direct and indirect pathway striatal neurons in reinforcement" (Nat. Neurosci.). The data show that optogenetic activation of dopamine D1 or D2 receptor-expressing striatal projection neurons influenced reinforcement learning in mice: stimulating D1 receptor-expressing neurons induced persistent reinforcement, whereas stimulating D2 receptor-expressing neurons induced transient punishment. These 3 recording files contain data collected in 2010 and 2011 by Lex Kravitz in Anatol Kreitzer's lab at the Gladstone Institutes. Each file contains ~1 hour of awake in vivo recording data, containing the spike times for ~30 minutes of spontaneous activity, preceded or followed by 400 laser pulses (473 nm laser light, 1 s pulses, 3 s inter-pulse interval). The laser pulses were presented at 4 intensities: 0.1 mW, 0.3 mW, 1.0 mW, and 3.0 mW, and the pulse times at each intensity are given in separate columns in each data file. Finally, each data file contains the identification of "light-modulated units" as we identified them (see methods below). Methods: Viral expression of DIO-ChR2-YFP and DIO-YFP. We used double-floxed inverted (DIO) constructs to express ChR2-YFP fusions and YFP alone in Cre-expressing neurons, which virtually eliminates recombination in cells that do not express Cre-recombinase (Sohal et al., Nature, 2009). The double-floxed reverse ChR2-YFP or YFP cassette was cloned into a modified version of the pAAV2-MCS vector (Stratagene, La Jolla, CA) carrying the EF-1a promoter and the Woodchuck hepatitis virus posttranscriptional regulatory element (WPRE) to enhance expression. The recombinant AAV vectors were serotyped with AAV1 coat proteins and packaged by the viral vector core at the University of North Carolina. The final viral concentration was 4 x 10^12 virus molecules/mL (by Dot Blot, UNC vector core).
This viral construct can now be ordered in single aliquots directly from the UNC Vector Core as product AAV-EF1a-DIO-hChR2(H134R)-EYFP, at http://genetherapy.unc.edu/services.htm. Implantation of electrode arrays for awake recordings: Anaesthesia was induced with a mixture of ketamine and xylazine (100 mg ketamine plus 5 mg xylazine per kilogram of body weight, i.p.) and maintained with isoflurane through a nose cone mounted on a stereotaxic apparatus (Kopf Instruments). The scalp was opened and a hole was drilled in the skull (0.0 to +1.0 mm AP, -1.0 to -2.0 mm ML from bregma). Two skull screws were implanted in the opposing hemisphere. Dental adhesive (C&B Metabond, Parkell) was used to fix the skull screws in place and coat the surface of the skull. An array of 16 or 32 microwires (35-µm tungsten wires, 100-µm spacing between wires, 200-µm spacing between rows; Innovative Physiology) and one optical fiber in a ferrule was lowered into the striatum (3.0 mm below the surface of the brain) and cemented in place with dental acrylic (Ortho-Jet, Lang Dental). After the cement dried, the scalp was sutured shut. Animals were allowed to recover for at least seven days before striatal recordings were made. In vivo electrophysiology: Voltage signals from each recording site on the microwire array were band-pass-filtered, such that activity between 150 and 8,000 Hz was analysed as spiking activity. These data were amplified, processed, and digitally captured using commercial hardware and software (Plexon). Single units were discriminated with principal component analysis (OFFLINE SORTER, Plexon). Two criteria were used to ensure the quality of recorded units: (1) recorded units smaller than 100 µV (~3 times the noise band) were excluded from further analysis, and (2) recorded units in which more than 1% of interspike intervals were shorter than 2 ms were excluded from further analysis. Average waveforms were exported with OFFLINE SORTER.
During the recording we coupled the array to a laser and pulsed the laser at four intensities (0.1 mW, 0.3 mW, 1 mW, and 3 mW). Laser stimulation was run in a cyclical fashion: on for 1 second and off for 3 seconds. Each neuron received 100 pulses at each laser intensity. Identification of ChR2-expressing units in in vivo recordings: For all neurons, peri-event histograms were generated for each laser intensity independently. Neurons were classified as ChR2-expressing if, within 10 ms of laser onset, they exhibited a firing rate more than 3 standard deviations above their baseline rate in the 1 second preceding the laser pulse. Each neuron was tested independently at each laser power, and neurons that satisfied this criterion at any one power were defined as ChR2-expressing.
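The classification rule above can be sketched roughly as follows. This is a hedged re-implementation under stated assumptions (the threshold is taken as baseline mean + 3 SD of 10 ms baseline bins; the synthetic spike data and function names are illustrative, the published analysis used peri-event histograms in commercial tooling):

```python
import numpy as np

def is_light_modulated(spike_times, pulse_times, window_s=0.010):
    """Flag a unit whose firing rate within 10 ms of laser onset exceeds
    the baseline mean by more than 3 standard deviations, where the
    baseline is the 1 s before each pulse, binned at 10 ms (illustrative)."""
    spike_times = np.asarray(spike_times)
    baseline_rates, onset_rates = [], []
    for t in pulse_times:
        # Baseline: one hundred 10 ms bins covering the second before the pulse.
        counts, _ = np.histogram(spike_times, bins=np.linspace(t - 1.0, t, 101))
        baseline_rates.extend(counts / window_s)
        # Response: firing rate within 10 ms of laser onset.
        n_onset = np.sum((spike_times >= t) & (spike_times < t + window_s))
        onset_rates.append(n_onset / window_s)
    mu, sd = np.mean(baseline_rates), np.std(baseline_rates)
    return np.mean(onset_rates) > mu + 3 * sd

# Hypothetical data: a tonically firing unit (~5 Hz), with or without a
# short-latency burst after each laser pulse.
rng = np.random.default_rng(0)
pulses = np.arange(10.0, 50.0, 4.0)
background = np.sort(rng.uniform(0.0, 60.0, 300))
burst = np.concatenate([t + rng.uniform(0.0, 0.01, 3) for t in pulses])
responsive = np.sort(np.concatenate([background, burst]))
```

In the dataset itself, this test would be repeated independently at each of the four laser powers, flagging the unit if it passes at any one of them.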
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Distributions of strictly positive numbers are common and can be characterized by standard statistical measures such as the mean, standard deviation, and skewness. We demonstrate that for these distributions the skewness D3 is bounded from below by a function of the coefficient of variation (CoV) δ as D3 > δ − 1/δ. The results are extended to any distribution that is bounded with minimum value xmin and/or bounded with maximum value xmax. We build on the results to provide bounds for the kurtosis D4, and conjecture that analogous bounds exist for higher statistical moments.
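The bound D3 > δ − 1/δ is easy to check numerically on strictly positive samples; the snippet below is a sanity check, not a proof, using two illustrative distributions:

```python
import numpy as np

def cov_and_skewness(x):
    """Coefficient of variation and (population) skewness of a sample."""
    x = np.asarray(x, dtype=float)
    mu, sd = x.mean(), x.std()
    skew = np.mean((x - mu) ** 3) / sd ** 3
    return sd / mu, skew

# Check D3 > delta - 1/delta for two strictly positive distributions.
rng = np.random.default_rng(0)
for sample in (rng.lognormal(0.0, 0.5, 100_000),
               rng.exponential(1.0, 100_000)):
    delta, d3 = cov_and_skewness(sample)
    assert d3 > delta - 1 / delta
```

For the exponential distribution, δ = 1 so the bound reduces to D3 > 0, comfortably satisfied by its true skewness of 2; for a lognormal with δ < 1 the bound is negative and thus even looser.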
Methane hydrates are present in marine seep systems and occur within the gas hydrate stability zone. Very little is known about their crystallite sizes and size distributions because they are notoriously difficult to measure. Crystal size distributions are usually considered one of the key petrophysical parameters because they influence mechanical properties and possible compositional changes, which may occur with changing environmental conditions. Variations in grain size are relevant for gas substitution in natural hydrates by replacing CH4 with CO2 for the purpose of carbon dioxide sequestration. Here we show that crystallite sizes of gas hydrates from some locations in the Indian Ocean, Gulf of Mexico, and Black Sea are in the range of 200–400 µm; larger values were obtained for deeper-buried samples from ODP Leg 204. The crystallite sizes generally show a log-normal distribution and appear to vary, sometimes rapidly, with location. Site conditions (water depth, burial depth in meters below sea floor [mbsf], and temperature) are given for marine samples to provide information about the stability conditions for the hydrates (Klapp et al., 2007; see also http://doi.pangaea.de/10.1594/PANGAEA.771920; Tréhu et al., 2006).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To meet the high-frequency sample needs of time-series wetland classification, we developed a method for automatically producing global wetland samples based on 13 global and regional wetland-related datasets and millions of images from the Landsat 8 OLI, MODIS, Sentinel-1 SAR GRD, and Sentinel-2 MSI sensors. Considering the consistency of types and the separability of spectra, we summarized all classification systems into three types: wetland, water body, and non-wetland. Samples are randomly selected using an equal-area stratified sampling scheme based on the existence probability of wetlands. To ensure sufficient samples, we proposed a global sample size of 500,000. According to the global potential wetland distribution data set, the sample size of each grid was allocated, and samples were randomly selected. Based on the 13 auxiliary data sets, we first determined the sample type according to the order of water body and then wetland, and assigned the "non-wetland" attribute to samples that were neither water body nor wetland. The 13 auxiliary data sets include GlobeLand30 (Chen et al., 2014), FROM-GLC (Yu et al., 2013), GlobCover (Arino et al., 2010), GLC_FCS30_2020 (Liu et al., 2020), the Joint Research Centre Global Surface Water Survey and Mapping map (Pekel et al., 2016), the Global Reservoir and Dam Database (GRanD) (Lehner et al., 2011), Global Mangrove Watch (GMW) (Bunting et al., 2018), the Global Lakes and Wetlands Database (GLWD) (Lehner et al., 2004), Murray Global Intertidal Change (MGIC) (Murray et al., 2019), CAS_Wetlands (Mao et al., 2020), CA_wetlands (Wulder et al., 2018), the National Land Cover Database (NLCD) (Yang et al., 2018), and the Global Potential Wetland Distribution Dataset (GPWD) (Hu et al., 2017). We also included 139,027 Landsat 8 OLI images, 21,160 MOD09A1 images, 296,479 Sentinel-1 SAR images, and 4,553,453 Sentinel-2 MSI images globally from January 1 to December 31, 2020.
We extracted minimum, maximum, mean, and median information for each band and for the NDVI, NDWI, MNDWI, and LSWI indexes in the four sensors for the global wetland samples. To remove noise, this study kept only the water, wetland, and non-wetland samples that fall within one standard deviation of the annual mean of each spectral band, as a secondary screening condition to ensure the accuracy of the samples. The number of wetland samples determined by each sensor is different: Landsat 8 has a total of 202,111 samples, including 13,176 water-body samples, 54,229 wetland samples, and 134,706 non-wetland samples; MODIS has a total of 190,898 samples, including 13,436 water-body samples, 50,400 wetland samples, and 127,062 non-wetland samples; Sentinel-1 has a total of 185,943 samples, including 10,885 water samples, 54,224 wetland samples, and 120,834 non-wetland samples; and Sentinel-2 has a total of 185,484 samples, including 11,225 water samples, 52,142 wetland samples, and 122,117 non-wetland samples. They are stored separately in four shapefiles.
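A minimal sketch of that one-standard-deviation secondary screening step (the per-class, per-band filter). The data, class labels, and function name here are illustrative; the actual pipeline applied this per sensor and per spectral band:

```python
import numpy as np

def screen_samples(values, class_labels):
    """Keep, per class, only samples whose band value lies within one
    standard deviation of that class's annual mean (illustrative)."""
    values = np.asarray(values, dtype=float)
    keep = np.zeros(len(values), dtype=bool)
    for cls in np.unique(class_labels):
        idx = class_labels == cls
        mu, sd = values[idx].mean(), values[idx].std()
        # Retain samples of this class within mean +/- one std.
        keep |= idx & (np.abs(values - mu) <= sd)
    return keep

# Hypothetical NDVI values for water, wetland, and non-wetland samples.
rng = np.random.default_rng(0)
ndvi = np.concatenate([rng.normal(-0.1, 0.05, 100),   # water
                       rng.normal(0.4, 0.10, 100),    # wetland
                       rng.normal(0.7, 0.10, 100)])   # non-wetland
labels = np.repeat(["water", "wetland", "non-wetland"], 100)

mask = screen_samples(ndvi, labels)
```

For roughly normal band values this retains about two-thirds of each class, discarding the spectral outliers most likely to be mislabeled.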
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EarSet aims to provide the research community with a novel multi-modal dataset which, for the first time, allows the study of the impact of body and head/face movements both on the morphology of the PPG wave captured at the ear and on vital-sign estimation. To accurately collect in-ear PPG data coupled with a 6 degrees-of-freedom (DoF) motion signature, we prototyped and built a flexible research platform for in-the-ear data collection. The platform is centered around a novel ear-tip design which includes a 3-channel PPG sensor (green, red, infrared) and a 6-axis (accelerometer, gyroscope) motion sensor (IMU) co-located on the same ear-tip. This allows the simultaneous collection of spatially distant (i.e., one tip in the left and one in the right ear) PPG data at multiple wavelengths and the corresponding motion signature, for a total of 18 data streams.
Inspired by the Facial Action Coding System (FACS), we consider a set of potential sources of motion artifact (MA) caused by natural facial and head movements. Specifically, we gather data on 16 different head and facial motions: head movements (nodding, shaking, tilting), eye movements (vertical eye movements, horizontal eye movements, brow raiser, brow lowerer, right-eye wink, left-eye wink), and mouth movements (lip puller, chin raiser, mouth stretch, speaking, chewing).
We also collect motion and PPG data during activities of different intensities that entail the movement of the entire body (walking and running). Together with the in-ear PPG and IMU data, we collect several vital signs, including heart rate, heart rate variability, breathing rate, and raw ECG, from a medical-grade chest device.
With approximately 17 hours of data from 30 participants of mixed gender and ethnicity (mean age: 28.9 years, standard deviation: 6.11 years), our dataset empowers the research community to analyze the morphological characteristics of in-ear PPG signals with respect to motion and device positioning (left ear, right ear), as well as a set of configuration parameters and their corresponding data quality/power consumption trade-offs. We envision that such a dataset could open the door to innovative filtering techniques to mitigate, and eventually eliminate, the impact of MA on in-ear PPG. We ran a set of preliminary analyses on the data, considering both handcrafted features and a DNN (deep neural network) approach. Ultimately, we observe statistically significant morphological differences in the PPG signal across different types of motion when compared to a situation where there is no motion. We also discuss a 3-class classification task and show how full-body motions and head/face motions can be discriminated from a still baseline (and from each other). These preliminary results represent a first step towards the detection of corrupted PPG segments and show the importance of studying how head/face movements impact PPG signals in the ear.
To the best of our knowledge, this is the first in-ear PPG dataset that covers a wide range of full-body and head/facial motion artifacts. Being able to study the signal quality and motion artifacts under such circumstances will serve as a reference for future research in the field, acting as a stepping stone to fully enable PPG-equipped earables.