Chemical concentration, exposure, and health risk data for U.S. census tracts from National Scale Air Toxics Assessment (NATA). This dataset is associated with the following publication: Huang, H., R. Tornero-Velez, and T. Barzyk. Associations between socio-demographic characteristics and chemical concentrations contributing to cumulative exposures in the United States. Journal of Exposure Science and Environmental Epidemiology. Nature Publishing Group, London, UK, 27(6): 544-550, (2017).
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset from a retailer; the transaction data covers all transactions that took place over a period of time. The retailer will use the results to grow the business and to suggest itemsets to customers, which should increase customer engagement, improve the customer experience, and help identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most often used to find associations between different objects in a set, and in particular frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between items.
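The frequent-pattern search described above can be sketched as a level-wise Apriori pass. This is a minimal Python illustration under stated assumptions, not the R code used in the original write-up: the function name `apriori`, the toy baskets, and the 0.5 support threshold are all hypothetical. Support here means the fraction of transactions that contain an itemset.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data: four baskets.
baskets = [
    ["mouse", "mat"],
    ["mouse", "mat", "keyboard"],
    ["mouse", "keyboard"],
    ["mat"],
]
result = apriori(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search tractable: a candidate is counted against the data only if all of its smaller subsets were already frequent.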
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule bought computer mouse => bought mouse mat:
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(mat) = 0.80/0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
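The three metrics can be checked with a few lines of Python. This is a small sketch using the standard definitions for a rule A => B (support = P(A and B), confidence = support / P(A), lift = confidence / P(B)); the function name `rule_metrics` is hypothetical, and the counts are the ones from the 100-customer example.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence and lift for the rule antecedent => consequent."""
    support = n_both / n_total                     # P(A and B)
    confidence = n_both / n_antecedent             # P(B | A) = support / P(A)
    lift = confidence / (n_consequent / n_total)   # confidence / P(B)
    return support, confidence, lift


# 100 customers, 10 bought a mouse, 9 bought a mat, 8 bought both.
support, confidence, lift = rule_metrics(
    n_total=100, n_antecedent=10, n_consequent=9, n_both=8
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift well above 1 indicates that the two items are bought together far more often than would be expected if the purchases were independent.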
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries. I briefly describe each library.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we clean the data frame by removing missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply association rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice are in ...
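The cleaning and conversion steps can be illustrated as follows. This is a Python sketch rather than the R code shown in the screenshots, and the column names `invoice` and `item` are assumptions made for illustration; the actual column names of Assignment-1_Data.xlsx are not given here.

```python
from collections import defaultdict


def to_transactions(rows, invoice_key="invoice", item_key="item"):
    """Group item rows by invoice, skipping rows with a missing invoice or item."""
    baskets = defaultdict(set)
    for row in rows:
        invoice = row.get(invoice_key)
        item = row.get(item_key)
        if invoice is None or item is None or item == "":
            continue  # drop rows with missing values
        baskets[invoice].add(item)  # a set removes repeated items per invoice
    return {invoice: sorted(items) for invoice, items in baskets.items()}


# Hypothetical rows standing in for the spreadsheet contents.
rows = [
    {"invoice": "A1", "item": "mouse"},
    {"invoice": "A1", "item": "mat"},
    {"invoice": "A2", "item": "keyboard"},
    {"invoice": "A2", "item": None},
    {"invoice": "A1", "item": "mouse"},
]
transactions = to_transactions(rows)
print(transactions)
```

Each value in the resulting dictionary is one basket, which is the input shape that association rule mining expects.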
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variable description.
The result comparison of the different D.
The SAR difference of different confidence degree thresholds in D = 3.
Dataset used for validation of information mining process for spatial association discovery.
Local association analysis, such as local similarity analysis and local shape analysis, of biological time series data helps elucidate the varying dynamics of biological systems. However, their applications to large scale high-throughput data are limited by slow permutation procedures for statistical significance evaluation. We developed a theoretical approach to approximate the statistical significance of local similarity and local shape analysis based on the approximate tail distribution of the maximum partial sum of independent identically distributed (i.i.d) and Markovian random variables. Simulations show that the derived formula approximates the tail distribution reasonably well (starting at time points > 10 with no delay and > 20 with delay) and provides p-values comparable to those from permutations. The new approach enables efficient calculation of statistical significance for pairwise local association analysis, making possible all-to-all association studies otherwise prohibitive. As a demonstration, local association analysis of human microbiome time series shows that core OTUs are highly synergetic and some of the associations are body-site specific across samples. The new approach is implemented in our eLSA package, which now provides pipelines for faster local similarity and shape analysis of time series data. The tool is freely available from eLSA's website: http://meta.usc.edu/softs/lsa.
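To make the cost of the permutation procedure concrete, here is a minimal Python sketch of a local similarity score with a permutation p-value. It is not the eLSA implementation and it does not implement the theoretical approximation described above; it assumes both series are already normalized, uses a simple dynamic program over aligned pairs within a maximum delay, and returns only the magnitude of the score. The function names and the example series are hypothetical.

```python
import random


def local_similarity(x, y, max_delay=0):
    """Magnitude of the best local alignment score between x and y, scaled by length."""
    n = len(x)
    pos = [[0.0] * (n + 1) for _ in range(n + 1)]  # running positive association
    neg = [[0.0] * (n + 1) for _ in range(n + 1)]  # running negative association
    best = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - max_delay), min(n, i + max_delay) + 1):
            prod = x[i - 1] * y[j - 1]
            # Extend the diagonal partial sum, resetting to zero when it goes negative.
            pos[i][j] = max(0.0, pos[i - 1][j - 1] + prod)
            neg[i][j] = max(0.0, neg[i - 1][j - 1] - prod)
            best = max(best, pos[i][j], neg[i][j])
    return best / n


def permutation_p_value(x, y, max_delay=0, n_perm=200, seed=0):
    """Fraction of shuffles of x whose score reaches the observed score."""
    rng = random.Random(seed)
    observed = local_similarity(x, y, max_delay)
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if local_similarity(shuffled, y, max_delay) >= observed:
            hits += 1
    return hits / n_perm


# Hypothetical, already-normalized series.
x = [0.9, 1.1, 1.3, -0.2, -1.0, -1.2, -0.9, 0.0]
y = [1.0, 1.2, 0.8, 0.1, -1.1, -0.7, -1.3, 0.0]
print(local_similarity(x, y, max_delay=1))
print(permutation_p_value(x, y, max_delay=1))
```

Every permutation repeats the full dynamic program, so an all-to-all study over many series multiplies that cost by the number of pairs, which is the bottleneck an analytic tail approximation avoids.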
Analysis of ‘Association – Association’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/6088a69731d514e2fa9bacd0 on 13 January 2022.
--- Dataset description provided by original source is as follows ---
List of associations (descriptive)
--- Original source retains full ownership of the source dataset ---
Supplementary Figures: Supplementary Figures 1-5
Supplementary Tables: Supplementary Tables 1-3
Sample relatedness is a major confounder in large-scale GWAS and can cause inflation if not appropriately controlled. Incorporating GRM-related random effects into conventional models is the most commonly used strategy. Although effective, this strategy is technically challenging to extend to other complex traits with complicated structure. In this work, we propose a scalable, accurate, and universal analysis framework, SPAGRM, in which sample relatedness is controlled via a precise approximation of the joint distribution of genotypes for related samples in families. SPAGRM can utilize GRM-free conventional models and is thus applicable to a wide variety of traits. A hybrid strategy including saddlepoint approximation (SPA) can greatly increase accuracy when analyzing low-frequency and rare genetic variants, especially if the phenotypic distribution is unbalanced. Extensive simulation studies and real data analyses validated that SPAGRM accurately controls type I error rates and can gain power in longitudinal trait analysis. Expanding upon previous studies, we implemented a refined and meticulous QC pipeline to extract 79 longitudinal traits from UK Biobank primary care data. Applying SPAGRM to the 79 longitudinal traits identified 7,463 genetic loci, a pioneering attempt to conduct GWAS for a majority of these traits as longitudinal phenotypes.
Data supporting the paper Luciano et al. Association analysis in over 329,000 individuals identifies 116 independent variants influencing neuroticism. Nature Genetics (2017). doi: 10.1038/s41588-017-0013-8
The one-item SAR of D = 3.
Genotype, phenotype, and pedigree data are uploaded as separate files and should be joined using the individual identifiers common to each relevant fileset. Note that the files have been named using phen_*, gen_*, other*, and source_data* nomenclature according to the numbers and descriptions in the three categories outlined above. A README file is uploaded as part of this submission that details exact file contents and usage.
This dataset provides comprehensive information on road intersection crashes recognised as "high-high" clusters within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 5% of the total "high-high" cluster road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is meticulously organised according to support metric values, ranging from 0,05 to 0,0235, with entries presented in descending order.
Data Specifics
Data Type: Geospatial-temporal categorical data
File Format: Excel document (.xlsx)
Size: 499 KB
Number of Files: The dataset contains a total of 7186 association rules
Date Created: 23rd May 2024
Methodology
Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
Software: ArcGIS Pro, Python
Processing Steps: Following the spatio-temporal analyses and the derivation of "high-high" cluster fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes that occurred within the "high-high" cluster fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,05 support metric value. Consequently, commonly occurring crash attributes among at least 5% of the "high-high" cluster road intersection crashes were extracted for inclusion in this dataset.
Geospatial Information
Spatial Coverage:
West Bounding Coordinate: 18°20'E
East Bounding Coordinate: 19°05'E
North Bounding Coordinate: 33°25'S
South Bounding Coordinate: 34°25'S
Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection
Temporal Information
Temporal Coverage:
Start Date: 01/01/2017
End Date: 31/12/2021 (2020 data omitted)
Analysis of ‘Hourly Association ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/5ae9d0b4c8d8c9146c44cc88 on 16 January 2022.
--- Dataset description provided by original source is as follows ---
No. of Constitutions in the month, Total No. of Constitutions, Percentage of Constitutions per ANH and Average Time of Constitution (accumulated)
--- Original source retains full ownership of the source dataset ---
Type 2 diabetes (T2D) is a global public health challenge. Whilst the advent of genome-wide association studies has identified >400 genetic variants associated with T2D, our understanding of its biological mechanisms and translational insights is still limited. The EPIC-InterAct project, centred in 8 countries in the European Prospective Investigations into Cancer and Nutrition study, is one of the largest prospective studies of T2D. Established as a nested case-cohort study to investigate the interplay between genetic and lifestyle behavioural factors on the risk of T2D, a total of 12,403 individuals were identified as incident T2D cases and a representative sub-cohort of 16,154 individuals was selected from a larger cohort of 340,234 participants with a follow-up time of 3.99 million person-years. We describe the results from a genome-wide association analysis between more than 8.9 million SNPs and T2D risk among 22,326 individuals (9,978 cases and 12,348 non-cases) from the EPIC-I...
Analysis of ‘Affiliate Association’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/50492d3d-6c6e-4cb6-8278-fed55400be75 on 11 February 2022.
--- Dataset description provided by original source is as follows ---
--- Original source retains full ownership of the source dataset ---
Analysis of ‘Nombre d'adhérents par association’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/5f727999469cc4595a5b6549 on 16 January 2022.
--- Dataset description provided by original source is as follows ---
This information is part of the study on the economics of sport published in February 2020
62% of sports associations have fewer than 100 members
Number of members per association (expressed in %)
Source: V. Tchernonog, L. Prouteau, « Le paysage associatif français »
--- Original source retains full ownership of the source dataset ---
The two-item SAR of D = 3.
[No abstract entered]