Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Prior to statistical analysis of mass spectrometry (MS) data, quality control (QC) of the identified biomolecule peak intensities is imperative for reducing process-based sources of variation and extreme biological outliers. Without this step, statistical results can be biased. Additionally, liquid chromatography–MS proteomics data present inherent challenges due to large amounts of missing data that require special consideration during statistical analysis. While a number of R packages exist to address these challenges individually, there is no single R package that addresses all of them. We present pmartR, an open-source R package, for QC (filtering and normalization), exploratory data analysis (EDA), visualization, and statistical analysis robust to missing data. Example analysis using proteomics data from a mouse study comparing smoke exposure to control demonstrates the core functionality of the package and highlights the capabilities for handling missing data. In particular, using a combined quantitative and qualitative statistical test, 19 proteins whose statistical significance would have been missed by a quantitative test alone were identified. The pmartR package provides a single software tool for QC, EDA, and statistical comparisons of MS data that is robust to missing data and includes numerous visualization capabilities.
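pmartR itself is an R package; as a language-neutral illustration of the kind of QC filtering described above (not pmartR's actual API), a minimal sketch that drops biomolecules observed in too few samples might look like this:

```python
# Minimal sketch of a molecule filter: drop any biomolecule (row) observed
# in fewer than `min_obs` samples. Illustrative only -- pmartR is an R
# package and its real filter functions and defaults differ.
def filter_min_observed(intensities, min_obs=2):
    """intensities: dict mapping biomolecule id -> list of peak
    intensities, with None marking a missing value."""
    return {
        mol: vals
        for mol, vals in intensities.items()
        if sum(v is not None for v in vals) >= min_obs
    }

data = {
    "P1": [10.2, None, 9.8, 10.1],   # observed 3 times -> kept
    "P2": [None, None, 8.4, None],   # observed once -> dropped
}
kept = filter_min_observed(data, min_obs=2)
```

Filters of this kind reduce the missing-data burden before normalization and statistical testing.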
https://creativecommons.org/publicdomain/zero/1.0/
Coronavirus disease 2019 (COVID-19) time series listing confirmed cases, reported deaths and reported recoveries. Data is disaggregated by country (and sometimes subregion). Coronavirus disease (COVID-19) is caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) and has had a worldwide effect. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, citing the over 118,000 cases of Coronavirus illness in over 110 countries and territories around the world at the time.
This dataset includes time series data tracking the number of people affected by COVID-19 worldwide, including:
- confirmed tested cases of Coronavirus infection
- the number of people who have reportedly died while sick with Coronavirus
- the number of people who have reportedly recovered from it
Data is in CSV format and updated daily. It is sourced from this upstream repository maintained by the amazing team at Johns Hopkins University Center for Systems Science and Engineering (CSSE) who have been doing a great public service from an early point by collating data from around the world.
We have cleaned and normalized that data, for example tidying dates and consolidating several files into normalized time series. We have also added some metadata, such as column descriptions, and packaged the data.
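The "tidying" step above can be sketched as a pivot from the upstream wide format (one row per region, one column per date) into a long time series (one row per region-date observation). The miniature input below is a hypothetical stand-in for the upstream CSV, not its exact column layout:

```python
import csv
import io

# Hypothetical miniature of the upstream wide format: one row per region,
# one column per date. Tidying pivots it into one row per observation.
wide = io.StringIO(
    "Country/Region,1/22/20,1/23/20\n"
    "Afghanistan,0,0\n"
    "Albania,0,1\n"
)

def tidy(fileobj):
    reader = csv.reader(fileobj)
    header = next(reader)
    dates = header[1:]
    rows = []
    for rec in reader:
        for date, count in zip(dates, rec[1:]):
            rows.append({"region": rec[0], "date": date,
                         "confirmed": int(count)})
    return rows

long_rows = tidy(wide)
```

The long form makes per-country time series trivial to filter and join with other normalized files.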
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This section does not describe the methods of read-tv software development, which can be found in the associated manuscript from JAMIA Open (JAMIO-2020-0121.R1). This section describes the methods involved in collecting the surgical workflow disruption data. A curated version of this dataset, free of protected health information (PHI), was used as a use case for this manuscript.
Observer training
Trained human factors researchers conducted each observation following the completion of observer training. The researchers were two full-time research assistants based in the Department of Surgery at site 3 who visited the other two sites to collect data. Human factors experts guided and trained each observer in the identification and standardized collection of flow disruptions (FDs). The observers were also trained in the basic components of robotic surgery so that they could concretely isolate and describe such disruptive events.
Comprehensive observer training was ensured with both classroom and floor train...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Samples relating to 12 analyses of lay theories of resilience among participants from the USA, New Zealand, India, Iran, and Russia (Moscow; Kazan). Central variables relate to participant endorsements of resilience descriptors. Demographic data includes (though not for all samples) Sex/Gender, Age, Ethnicity, Work, and Educational Status.
Analysis 1. USA Exploratory Factor Analysis data
Analysis 2. New Zealand Exploratory Factor Analysis data
Analysis 3. India Exploratory Factor Analysis data
Analysis 4. Iran Exploratory Factor Analysis data
Analysis 5. Russian (Moscow) Exploratory Factor Analysis data
Analysis 6. Russian (Kazan) Exploratory Factor Analysis data
Analysis 7. USA Confirmatory Factor Analysis data
Analysis 8. New Zealand Confirmatory Factor Analysis data
Analysis 9. India Confirmatory Factor Analysis data
Analysis 10. Iran Confirmatory Factor Analysis data
Analysis 11. Russian (Moscow) Confirmatory Factor Analysis data
Analysis 12. Russian (Kazan) Confirmatory Factor Analysis data
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Supplementary materials for the article: De Winter, J. C. F., Dodou, D., & Wieringa, P. A. (2009). Exploratory factor analysis with small sample sizes. Multivariate Behavioral Research, 44, 147–181.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The high-resolution and mass accuracy of Fourier transform mass spectrometry (FT-MS) has made it an increasingly popular technique for discerning the composition of soil, plant and aquatic samples containing complex mixtures of proteins, carbohydrates, lipids, lignins, hydrocarbons, phytochemicals and other compounds. Thus, there is a growing demand for informatics tools to analyze FT-MS data that will aid investigators seeking to understand the availability of carbon compounds to biotic and abiotic oxidation and to compare fundamental chemical properties of complex samples across groups. We present ftmsRanalysis, an R package which provides an extensive collection of data formatting and processing, filtering, visualization, and sample and group comparison functionalities. The package provides a suite of plotting methods and enables expedient, flexible and interactive visualization of complex datasets through functions which link to a powerful and interactive visualization user interface, Trelliscope. Example analysis using FT-MS data from a soil microbiology study demonstrates the core functionality of the package and highlights the capabilities for producing interactive visualizations.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').
There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).
The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).
The classification goal is to predict whether the client will subscribe to a term deposit (yes/no, variable y).
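Before modeling, a first exploratory step is to tabulate the subscription rate of y by a categorical input. A minimal stdlib sketch, using a tiny inline stand-in for the semicolon-delimited bank.csv (the real file has many more columns and rows):

```python
import csv
import io
from collections import defaultdict

# Tiny inline stand-in for bank.csv; the real file is semicolon-delimited
# with a "y" column taking values "yes"/"no".
sample = io.StringIO(
    "job;y\n"
    "admin.;no\n"
    "admin.;yes\n"
    "technician;no\n"
)

def rate_by(fileobj, feature, target="y"):
    """Subscription rate of `target` grouped by one categorical feature."""
    yes = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(fileobj, delimiter=";"):
        total[row[feature]] += 1
        yes[row[feature]] += (row[target] == "yes")
    return {k: yes[k] / total[k] for k in total}

rates = rate_by(sample, "job")
```

Rates of this kind hint at which inputs are predictive before a classifier is fit.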
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Interval data are widely used in many fields, notably in economics, industry, and health areas. Analogous to the scatterplot for single-value data, the rectangle plot and cross plot are the conventional visualization methods for the relationship between two variables in interval forms. These methods do not provide much information to assess complicated relationships, however. In this article, we propose two visualization methods: Segment and Dandelion plots. They offer much more information than the existing visualization methods and allow us to have a much better understanding of the relationship between two variables in interval forms. A general guide for reading these plots is provided. Relevant theoretical support is developed. Both empirical and real data examples are provided to demonstrate the advantages of the proposed visualization methods. Supplementary materials for this article are available online.
https://creativecommons.org/publicdomain/zero/1.0/
Problem Statement
Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.
Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only on that particular segment.
Attributes: People, Products, Promotion, Place
The task is to perform clustering to summarize customer segments.
The dataset for this project is provided by Dr. Omar Romero-Hernandez.
You can take help from the following link to learn more about the approach to solving this problem: Visit this URL
The average American’s diet does not align with the Dietary Guidelines for Americans (DGA) provided by the U.S. Department of Agriculture and the U.S. Department of Health and Human Services (2020). The present study aimed to compare fruit and vegetable consumption among those who had and had not heard of the DGA, identify characteristics of DGA users, and identify barriers to DGA use. A nationwide survey of 943 Americans revealed that those who had heard of the DGA ate more fruits and vegetables than those who had not. Men, African Americans, and those who have more education had greater odds of using the DGA as a guide when preparing meals relative to their respective counterparts. Disinterest, effort, and time were among the most cited reasons for not using the DGA. Future research should examine how to increase DGA adherence among those unaware of or who do not use the DGA. Comparative analyses of fruit and vegetable consumption among those who were aware/unaware and use/do not use the DGA were completed using independent samples t tests. Fruit and vegetable consumption variables were log-transformed for analysis. Binary logistic regression was used to examine whether demographic features (race, gender, and age) predict DGA awareness and usage. Data were analyzed using SPSS version 28.1 and SAS/STAT® version 9.4 TS1M7 (2023 SAS Institute Inc).
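The comparison described above (independent-samples t tests on log-transformed fruit and vegetable consumption) can be sketched with the standard library. This is an illustrative Welch's t statistic on made-up numbers, not the SPSS/SAS pipeline or data the authors used:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent
    samples, without assuming equal variances."""
    va, vb = variance(a), variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Log-transform the skewed consumption values before testing, as in the
# study; the values themselves are hypothetical.
aware = [math.log(x) for x in (3.0, 4.5, 5.0, 2.5)]
unaware = [math.log(x) for x in (2.0, 2.5, 3.0, 1.5)]
t, df = welch_t(aware, unaware)
```

The t statistic and Welch-Satterthwaite degrees of freedom are then compared against the t distribution to obtain a p value.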
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains supplementary material for the paper 'Funding Covid-19 research: Insights from an exploratory analysis using open data infrastructures' by Alexis-Michel Mugabushaka, Nees Jan van Eck, and Ludo Waltman.
supplementary_material_1_dataset.ods: Dataset of Covid-19 publications.
supplementary_material_2_sample.ods: Samples of publications used to assess the accuracy of funding data in the different databases.
supplementary_material_3_tables_and_figures.ods: Statistics underlying the tables and figures presented in the paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the feature values used in the study "Exploratory Landscape Analysis is Strongly Sensitive to the Sampling Strategy".
The dataset regroups feature values for every "cheap" feature available in the R package flacco, computed using 5 sampling strategies in dimension d = 5.
The csv file features_summury_dim_5_ppsn.csv regroups 100 values for every feature, whereas features_summury_dim_5_ppsn_median.csv regroups, for every feature, the median of the 100 values.
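The median file can be reproduced from the raw file by collapsing each feature's repeated values. A minimal sketch (the feature names and values below are illustrative, and 5 repetitions stand in for the 100):

```python
from statistics import median

# Collapse repeated feature evaluations to their medians, mirroring how
# the *_median.csv summarizes the raw 100-value file.
def summarise(feature_values):
    """feature_values: dict mapping feature name -> list of values."""
    return {name: median(vals) for name, vals in feature_values.items()}

runs = {
    "ela_meta.lin_simple.adj_r2": [0.91, 0.88, 0.93, 0.90, 0.89],
    "disp.ratio_mean_02": [1.1, 1.3, 1.2, 1.0, 1.4],
}
medians = summarise(runs)
```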
In the folder PPSN_feature_plots are the histograms of feature values on the 24 COCO functions for 3 sampling strategies: Random, LHS and Sobol.
The Python file sampling_ppsn.py is the code used to generate the sample points from which the feature values are computed.
The file stats50_knn_dt.csv provides the raw data of medians and IQRs (interquartile ranges) for the heatmaps and boxplots available in the paper.
Finally, the files results_classif_knn100.csv (resp. dt) provide the accuracy of 100 classifications for every setting.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,252 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.
The main dataset is provided as a 17,252-record tab-separated file named enterprise_projects.txt with the following 27 fields.
The file cohost_project_details.txt provides the full set of 309,531 cohort projects that are not part of the enterprise dataset but have comparable quality attributes.
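The core premise above (enterprise employees commit using their employer's email domain) can be sketched as a simple classification heuristic. The domain list and threshold here are illustrative placeholders, not the mined lists or the three heuristics used to build the published dataset:

```python
# Flag a project as enterprise-developed when a sufficient share of its
# commit author emails use a known enterprise domain. The domain set and
# threshold are hypothetical examples.
ENTERPRISE_DOMAINS = {"microsoft.com", "google.com", "redhat.com"}

def is_enterprise(commit_emails, threshold=0.5):
    """True if at least `threshold` of authors use an enterprise domain."""
    domains = [e.rsplit("@", 1)[-1].lower() for e in commit_emails]
    hits = sum(d in ENTERPRISE_DOMAINS for d in domains)
    return hits / len(domains) >= threshold

emails = ["alice@redhat.com", "bob@redhat.com", "carol@gmail.com"]
flag = is_enterprise(emails)
```

In practice such heuristics need white- and blacklisting of domains, as the abstract notes, since generic providers like gmail.com would otherwise dilute or pollute the signal.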
Ten pollen samples collected from sediments in the East Field system at San Pedro Siris, situated in the Yalbac area of the Cayo District of central Belize, were examined for pollen evidence of crops.
The lowest sample collected from a stratigraphic column at Hell Gap was submitted for exploratory pollen analysis. Exploratory pollen analysis of this sample includes a pollen count, evaluation of the condition of the pollen and concentration of pollen in this sediment, and recommendations for the future.
The 42 element, 1190 sample Mobile Metal Ion subset of the National Geochemical Survey of Australia database was used to develop and illustrate the concept of Degree of Geochemical Similarity of soil samples. Element concentrations were unified to parts per million units and log(10)-transformed. The degree of similarity of pairs of samples of known provenance in the Yilgarn Craton were obtained using least squares linear regression analysis and demonstrated that the method successfully assessed the degree of similarity of soils related to granitoid and greenstone lithologies. Exploratory Data Analysis symbol maps of all remaining samples in the database against various reference samples were used to obtain correlation maps for not only granitoid- and greenstone-related soil types, but also to distinguish between for example samples derived from marine vs regolith carbonate. Likewise, the distribution of soil samples having a geochemical fingerprint similar to mineralised provinces (e.g., Mt Isa) can be mapped and this can be used as a first order prospection tool. Sensitivity analysis confirmed the method to produce robust results without undue influence from either single elements with anomalous concentrations or elements with a high proportion of censored values.
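The degree-of-similarity idea described above can be sketched in a few lines: log10-transform two samples' element concentrations (in ppm) and take the least-squares R² of one against the other as the similarity score. The element values below are hypothetical, and this is only the core computation, not the full workflow of the study:

```python
import math

# Degree of geochemical similarity between two soil samples, sketched as
# the R^2 of a least-squares fit of log10 concentrations. Values are
# hypothetical ppm concentrations for the same elements in each sample.
def similarity(sample_a, sample_b):
    xs = [math.log10(v) for v in sample_a]
    ys = [math.log10(v) for v in sample_b]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)  # R^2 of the least-squares fit

granitoid_a = [120.0, 3.5, 40.0, 0.8]
granitoid_b = [110.0, 4.0, 38.0, 0.9]
r2 = similarity(granitoid_a, granitoid_b)
```

Samples from similar lithologies give R² near 1; comparing every sample against a fixed reference sample yields the correlation maps described above.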
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data presented in this paper is used to examine the behavioral factors that influence food preferences in Indonesia, and the segmentation of Indonesian audiences behind those preferences, as provided by social media data. We collected the data through an online platform by performing a query search on Facebook Audience Insights Interests. The keywords used in the query search are based on the United Nations Food and Agriculture Organization (FAO) Food Balance Sheet (FBS), retrieved from FAOStat in May 2020. The data was gathered between 15 May and 2 July 2020. We limited our sample to Indonesia, with a sample size of 100-150 million viewers, or about 36.95 to 55.43 per cent of Indonesia's 2019 population. The dataset is made up of ten tables that can be analyzed separately. For each table, we carry out exploratory data analysis (EDA) to provide more insights. Such data could be of interest to various fields, including food scientists, government and policymakers, data scientists/analysts, and marketers. This data could also be a complementary source given the scarcity of government food survey data, particularly on behavioral aspects.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Given the replication crisis in cognitive science, it is important to consider what researchers need to do in order to report results that are reliable. We consider three changes in current practice that have the potential to deliver more realistic and robust claims. First, the planned experiment should be divided up into two stages, an exploratory stage and a confirmatory stage. This clear separation allows the researcher to check whether any results found in the exploratory stage are robust. The second change is to carry out adequately powered studies. We show that this is imperative if we want to obtain realistic estimates of effects in psycholinguistics. The third change is to use Bayesian data-analytic methods rather than frequentist ones; the Bayesian framework allows us to focus on the best estimates we can obtain of the effect, rather than rejecting a strawman null. As a case study, we investigate number interference effects in German. Number feature interference is predicted by cue-based retrieval models of sentence processing (Van Dyke & Lewis, 2003; Vasishth & Lewis, 2006), but has shown inconsistent results. We show that by implementing the three changes mentioned, suggestive evidence emerges that is consistent with the predicted number interference effects.
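The second recommendation above (adequately powered studies) can be illustrated with a toy simulation: with a small true effect and a small sample, only a fraction of replications reach significance. All numbers below are illustrative and unrelated to the German interference data:

```python
import math
import random
import statistics

# Toy power simulation: fraction of replications in which a two-sample
# z-style test on simulated data reaches |z| > 1.96.
def simulated_power(effect, n, reps=2000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(effect, 1.0) for _ in range(n)]
        se = math.sqrt(statistics.variance(a) / n +
                       statistics.variance(b) / n)
        if abs((statistics.mean(b) - statistics.mean(a)) / se) > 1.96:
            hits += 1
    return hits / reps

low_power = simulated_power(effect=0.3, n=20)    # small-n study
high_power = simulated_power(effect=0.3, n=200)  # larger study
```

A small-n design detects the same effect far less often, which is exactly why underpowered studies produce unreliable, hard-to-replicate estimates.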
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Geochemical data from regional geochemistry survey samples from Yukon have undergone exploratory data analysis and principal component analysis. The results of these analyses clearly demonstrate geological control on the distribution of a number of important commodity and mineral deposit pathfinder elements. Catchment basins have been delineated for the samples and the dominant simplified geological unit in each catchment basin used to level the geochemical data where appropriate. Levelling the geochemical data in this fashion generally fails to fully account for enrichments in many commodity and mineral deposit pathfinder elements in the bedrock due to practical limitations on the resolution of the mapping and knowledge of the relative contributions of different geological units, although the resulting data interpretation is an improvement on one based solely upon raw geochemical data. Weighted sums models have been generated for the deposit types that either exist within the individual map areas covered by this report or are considered by the authors to be of exploration significance. Separate catchment maps showing the distribution of stream water pH and the concentration of elements inferred to have accumulated through hydromorphic dispersion are also provided. An additional series of maps has been generated to display weighted sums models calculated using regression of commodity and mineral deposit pathfinder elements against those principal components containing the same elements that show the strongest spatial associations with bedrock geology. Both model types have been iteratively tested using known mineral occurrences in the relevant map areas and, for the most part, are compatible with the distribution of known mineralization where sampling coverage is adequate. Geochemical anomalies unrelated to known mineral occurrences are evident in both data sets and provide possible targets for further investigation.