License: http://opendatacommons.org/licenses/dbcl/1.0/
# https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
#################################
### Variables for downloaded files
data.dir   <- ' '
train.file <- paste0(data.dir, 'training.csv')
test.file  <- paste0(data.dir, 'test.csv')
#################################
### Load the csv files -- read.csv creates a data.frame, where each column can have a different type.
d.train <- read.csv(train.file, stringsAsFactors = F)
d.test  <- read.csv(test.file, stringsAsFactors = F)
### In training.csv, we have 7049 rows, each one with 31 columns.
### The first 30 columns are keypoint locations, which R correctly identified as numbers.
### The last one is a string representation of the image, identified as a string.
###To look at samples of the data, uncomment this line:
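### (The commented-out line is not included in this copy; a typical check, as an assumed example:)
### head(d.train)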
### Let's save the Image column as another variable and remove it from d.train:
### d.train is our data frame, and we want the column called Image.
### Assigning NULL to a column removes it from the data frame.
im.train <- d.train$Image
d.train$Image <- NULL   # removes 'Image' from the data frame
im.test <- d.test$Image
d.test$Image <- NULL    # removes 'Image' from the data frame
#################################
# The image is represented as a series of numbers, stored as a single string.
# Convert these strings to integers by splitting them and converting the result to integer.
# strsplit splits the string,
# unlist simplifies its output to a vector of strings,
# as.integer converts it to a vector of integers.
as.integer(unlist(strsplit(im.train[1], " ")))
as.integer(unlist(strsplit(im.test[1], " ")))
### Install and activate the appropriate libraries.
### The tutorial is written for Linux and OS X, where a different (parallel backend) library is used, so on Windows:
### replace all instances of %dopar% with %do%.
library("foreach", lib.loc="~/R/win-library/3.3")
### Convert the image strings (the original tutorial runs this in parallel with %dopar%;
### with %do% the iterations run sequentially, which works fine on Windows).
im.train <- foreach(im = im.train, .combine=rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
im.test <- foreach(im = im.test, .combine=rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
# The foreach loop evaluates the inner expression for each element of im.train and
# combines the results with rbind (combine by rows).
# im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel).
### Save all four variables in a data.Rd file.
### They can be reloaded at any time with load('data.Rd').
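### A sketch of the save call itself (not shown in this copy), using the four variables above:
save(d.train, im.train, d.test, im.test, file = 'data.Rd')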
# Each image is a vector of 96*96 pixels (96*96 = 9216).
# Convert these 9216 integers into a 96x96 matrix:
im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)
# im.train[1,] returns the first row of im.train, which corresponds to the first training image.
# rev reverses the resulting vector to match the interpretation of R's image function
# (which expects the origin to be in the lower left corner).
# To visualize the image we use R's image function:
image(1:96, 1:96, im, col=gray((0:255)/255))
# Let's color the coordinates for the eyes and nose:
points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red")
points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue")
points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")
# Another good check is to see how variable our data is.
# For example, where are the centers of the noses in the 7049 images? (this takes a while to run):
for(i in 1:nrow(d.train)) {
  points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red")
}
# There are quite a few outliers -- they could be labeling errors. Looking at one extreme example:
# in this case there is no labeling error, but it shows that not all faces are centered in the image.
idx <- which.max(d.train$nose_tip_x)
im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96)
image(1:96, 1:96, im, col=gray((0:255)/255))
points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")
# One of the simplest things to try is to compute the mean of the coordinates of each keypoint
# in the training set and use that as the prediction for all images.
colMeans(d.train, na.rm=T)
# To build a submission file we need to apply these computed coordinates to the test instances:
p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T)
colnames(p) <- names(d.train)
predictions <- data.frame(ImageId = 1:nrow(d.test), p)
head(predictions)
# The expected submission format has one keypoint per row, but we can easily get that with the help of the reshape2 library:
library(reshape2)
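# A sketch of the reshaping step (melt() converts to long format; the submission column
# names below are assumptions based on the expected format):
submission <- melt(predictions, id.vars = "ImageId",
                   variable.name = "FeatureName", value.name = "Location")
head(submission)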
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions on the itemsets they are most likely to purchase. I was given a dataset containing the data of a retailer; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow the business and provide customers with suggestions on itemsets, so we can increase customer engagement, improve the customer experience, and identify customer behaviour. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.
Association Rules are most often used when you are planning to build associations between different objects in a set, or to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, which allows the retailer to identify relationships between items.
Assume there are 100 customers; 10 of them bought a Computer Mouse, 9 bought a Mat for Mouse, and 8 bought both. For the rule "bought Computer Mouse => bought Mat for Mouse":
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support / P(Computer Mouse) = 0.08/0.10 = 0.8
- lift = confidence / P(Mat for Mouse) = 0.8/0.09 ~ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
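In R, the toy arithmetic above can be written out directly (this is just the calculation, not tied to the dataset):
support    <- 8 / 100                 # P(Mouse & Mat) = 0.08
confidence <- support / (10 / 100)    # support / P(Computer Mouse) = 0.8
lift       <- confidence / (9 / 100)  # confidence / P(Mat for Mouse) ~ 8.9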
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries. I briefly describe each library below.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
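The screenshot above is not reproduced as text here; a hedged sketch of what the library-loading step typically looks like for this kind of analysis (the exact package list is an assumption):
library(arules)     # apriori() and transaction objects
library(arulesViz)  # visualising association rules
library(readxl)     # reading the .xlsx source file
library(dplyr)      # general data manipulation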
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset (a sketch of this step follows the screenshots below). Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
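A hedged sketch of the loading step (the object name and sheet layout are assumptions):
library(readxl)
retail <- read_excel("Assignment-1_Data.xlsx")
head(retail)   # first rows of the transaction data
str(retail)    # column names and types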
Next, we will clean our data frame by removing missing values (a sketch follows the screenshot below).
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
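A minimal sketch of this cleaning step, assuming the data frame is called retail:
retail <- retail[complete.cases(retail), ]   # keep only rows without missing values
# equivalently: retail <- na.omit(retail)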
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
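The sentence is cut off above, but with the arules package the usual approach is to group all items of one invoice into a single transaction, roughly as follows (the column names Itemname and BillNo are assumptions):
library(arules)
trans <- as(split(retail$Itemname, retail$BillNo), "transactions")  # one transaction per invoice
summary(trans)   # number of transactions, most frequent items, itemset sizes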
DONALD's raw texts are stored in an R data frame saved as "DONALD.txt.rdata" (size = 12.5 Gb, 4.87 Gb compressed), consisting of 2,173,172 rows (one per document) × 2 columns, namely document ID and text.
License: https://creativecommons.org/publicdomain/zero/1.0/
This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. I scraped this data from www.sports-reference.com in May 2018. The R code I used to scrape and wrangle the data is on GitHub. I recommend checking my kernel before starting your own analysis.
Note that the Winter and Summer Games were held in the same year up until 1992. After that, they staggered them such that Winter Games occur on a four year cycle starting with 1994, then Summer in 1996, then Winter in 1998, and so on. A common mistake people make when analyzing this data is to assume that the Summer and Winter Games have always been staggered.
The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:
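The column list itself is not reproduced in this copy; a quick way to inspect it in R (the file path is an assumption) is:
athletes <- read.csv("athlete_events.csv", stringsAsFactors = FALSE)
str(athletes)   # shows the 15 columns and their types
dim(athletes)   # 271116 rows, 15 columns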
The Olympic data on www.sports-reference.com is the result of an incredible amount of research by a group of Olympic history enthusiasts and self-proclaimed 'statistorians'. Check out their blog for more information. All I did was consolidate their decades of work into a convenient format for data analysis.
This dataset provides an opportunity to ask questions about how the Olympics have evolved over time, including questions about the participation and performance of women, different nations, and different sports and events.
ABSTRACT:
The World Soil Information Service (WoSIS) provides quality-assessed and standardized soil profile data to support digital soil mapping and environmental applications at broad scale levels. Since the release of the ‘WoSIS snapshot 2019’ many new soil data were shared with us, registered in the ISRIC data repository, and subsequently standardized in accordance with the licenses specified by the data providers. The source data were contributed by a wide range of data providers, therefore special attention was paid to the standardization of soil property definitions, soil analytical procedures and soil property values (and units of measurement).
We presently consider the following soil chemical properties (organic carbon, total carbon, total carbonate equivalent, total Nitrogen, Phosphorus (extractable-P, total-P, and P-retention), soil pH, cation exchange capacity, and electrical conductivity) and physical properties (soil texture (sand, silt, and clay), bulk density, coarse fragments, and water retention), grouped according to analytical procedures (aggregates) that are operationally comparable.
For each profile we provide the original soil classification (FAO, WRB, USDA, and version) and horizon designations as far as these have been specified in the source databases.
Three measures for 'fitness-for-intended-use' are provided: positional uncertainty (for site locations), time of sampling/description, and a first approximation for the uncertainty associated with the operationally defined analytical methods. These measures should be considered during digital soil mapping and subsequent earth system modelling that use the present set of soil data.
DATA SET DESCRIPTION:
The 'WoSIS 2023 snapshot' comprises data for 228k profiles from 217k geo-referenced sites that originate from 174 countries. The profiles represent over 900k soil layers (or horizons) and over 6 million records. The actual number of measurements for each property varies greatly between profiles and with depth, generally depending on the objectives of the initial soil sampling programmes.
The data are provided in TSV (tab separated values) format and as GeoPackage. The zip-file (446 Mb) contains the following files:
Readme_WoSIS_202312_v2.pdf: Provides a short description of the dataset, file structure, column names, units and category values (this file is also available directly under 'online resources'). The pdf includes links to tutorials for reading the TSV files into R and Excel, respectively. See also 'HOW TO READ TSV FILES INTO R AND PYTHON' in the next section.
wosis_202312_observations.tsv: This file lists the four to six letter codes for each observation, whether the observation is for a site/profile or a layer (horizon), the unit of measurement, and the number of profiles and layers, respectively, represented in the snapshot. It also provides an estimate of the inferred accuracy of the laboratory measurements.
wosis_202312_sites.tsv: This file characterizes the site location where profiles were sampled.
wosis_202312_profiles.tsv: Presents the unique profile ID (i.e. primary key), site_id, source of the data, country ISO code and name, positional uncertainty, latitude and longitude (WGS 1984), maximum depth of soil described and sampled, as well as information on the soil classification system and edition. Depending on the soil classification system used, the number of fields will vary.
wosis_202312_layers.tsv: This file characterises the layers (or horizons) per profile, and lists their upper and lower depths (cm).
wosis_202312_xxxx.tsv: This type of file presents the results for each observation (e.g. "xxxx" = "BDFIOD"), as defined under "code" in the file wosis_202312_observations.tsv (e.g. wosis_202312_bdfiod.tsv).
wosis_202312.gpkg: Contains the above datafiles in GeoPackage format (which stores the files within an SQLite database).
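To read the GeoPackage directly in R, the sf package is one option (a sketch; the layer name passed to st_read is an assumption, so list the layers first):
library(sf)
st_layers("wosis_202312.gpkg")   # list the layers stored in the GeoPackage
profiles_gpkg <- st_read("wosis_202312.gpkg", layer = "wosis_202312_profiles")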
HOW TO READ TSV FILES INTO R AND PYTHON:
A) To read the data in R, please uncompress the ZIP file and specify the uncompressed folder.
setwd("/YourFolder/WoSIS_2023_December/") ## For example: setwd('D:/WoSIS_2023_December/')
Then use read_tsv to read the TSV files, specifying the data types for each column (c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time).
observations = readr::read_tsv('wosis_202312_observations.tsv', col_types='cccciid')
observations ## show columns and first 10 rows
sites = readr::read_tsv('wosis_202312_sites.tsv', col_types='iddcccc')
sites
profiles = readr::read_tsv('wosis_202312_profiles.tsv', col_types='icciccddcccccciccccicccci')
profiles
layers = readr::read_tsv('wosis_202312_layers.tsv', col_types='iiciciiilcc')
layers
orgc = readr::read_tsv('wosis_202312_orgc.tsv', col_types='iicciilccdccddccccc')
orgc
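Once the tables are loaded they can be joined on their shared keys, for example to attach profile information to the organic carbon records (a sketch; the key column names profile_id and profile_layer_id are assumptions and should be checked against the TSV headers):
library(dplyr)
orgc_profiles <- inner_join(orgc, profiles, by = "profile_id")       # assumed key
orgc_layers   <- inner_join(orgc, layers, by = "profile_layer_id")   # assumed key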
Note: One may also use the following R code (example is for file 'observations.tsv'):
observations <- read.table("wosis_202312_observations.tsv",
  sep = "\t", header = TRUE, quote = "", comment.char = "", stringsAsFactors = FALSE)
B) To read the files into Python, first decompress the files to your selected folder. Then in Python:
import pandas as pd
observations = pd.read_csv("wosis_202312_observations.tsv", sep="\t")
# print the data frame header and some rows
observations.head()
sites = pd.read_csv("wosis_202312_sites.tsv", sep="\t")
profiles = pd.read_csv("wosis_202312_profiles.tsv", sep="\t")
layers = pd.read_csv("wosis_202312_layers.tsv", sep="\t")
cfvo = pd.read_csv("wosis_202312_cfvo.tsv", sep="\t")
CITATION: Calisto, L., de Sousa, L.M., Batjes, N.H., 2023. Standardised soil profile data for the world (WoSIS snapshot – December 2023), https://doi.org/10.17027/isric-wdcsoils-20231130
Supplement to: Batjes N.H., Calisto, L. and de Sousa L.M., 2023. Providing quality-assessed and standardised soil data to support global mapping and modelling (WoSIS snapshot 2023). Earth System Science Data, https://doi.org/10.5194/essd-16-4735-2024.
regression.dat_nurseries_ADW2.Rdat: This R data frame contains 1353 rows corresponding to the international trials in the CIMMYT database used in this study. The column names should be self-descriptive and contain all the predictors used for this regression analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The first two rows of a pandas DataFrame ready to be used with GLAM.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An R Rda-file containing the four hypothetical datasets used in the analysis (flat, increasing, decreasing, unimodal). These are stored in a single data frame, where the 400 rows correspond to observations. For each dataset there are three variables: an event indicator (suffix "_dead"; = 1 if death occurred during follow-up, else 0), the true time of death (suffix "_time"), and the observed follow-up time, which equals 1 if the true time of death is greater than 1 (suffix "_obs"). Hence there are twelve columns. A script is provided separately within this project (Analysis.R) which includes the code used to analyse this dataset in order to obtain the results reported in the manuscript "How uncertain is the survival extrapolation? A study of the impact of different parametric survival models on extrapolated uncertainty about hazard functions, lifetime mean survival and cost-effectiveness."
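A minimal sketch of loading and inspecting the file in R (the .Rda file name is an assumption; the full analysis is in Analysis.R):
load("hypothetical_datasets.Rda")   # assumed file name
ls()                                # see which object(s) were loaded
# str() on the loaded data frame should show 400 rows and the twelve *_dead, *_time, *_obs columns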
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here you will find the raw data (RawData.RData) and code (CHIVAS_SEROLOGY_Code.Rmd, an R Markdown file) for generating the analysis and figures for the following publication:
Streptococcus pyogenes pharyngitis elicits diverse antibody responses to key vaccine antigens influenced by the imprint of past infections.
Joshua Osowicki1,2,3 #, Hannah R Frost1 #, Kristy I Azzopardi1, Alana L Whitcombe4, Reuben McGregor4, Lauren H. Carlton4, Ciara Baker1, Loraine Fabri1,5,6, Manisha Pandey7, Michael F Good7, Jonathan R. Carapetis8,9,10, Mark J Walker11,12,13, Pierre R Smeesters1,2,5,6, Paul V Licciardi2,14, Nicole J Moreland4 *, Danika L Hill15 *, Andrew C Steer1,2,3 *
Provided in the RData file are the following items:
Dataframes:
"outcome" : clinical variables associated with human challenge for each participant
"data" : ELISA and functional antibody responses for human challenge participants. Each timepoint and isotype for each antigen as seperate column)
"data_long": Data equivalent to "data" file but in long format, i.e. One column for each antigen, timepoint and isotype as factors.
"data.melt" : Data equivalent to "data" file but in longer format , i.e. timepoint, isotype and antigen as factors, 'value' as ELISA AU.
"luminex" : IgG responses to 6 antigens analysed by luminex bead-based assay in human challenge participants.
"luminex.children" : IgG responses to 6 antigen analysed by luminex bead-based assay in children
Vectors:
"pharyngitis" : participant "id" for the 19 individuals that developed pharyngitis.
"Antigen.Order" : relates to "Main" antigen classification used in Figure 2
'additional" : relates to "Additional
Function:
"custom_theme" : used as a theme when using ggplot to graph.
Adobe Illustrator or Inkscape were used to generate the final image files for publication, with some editing of the graphs (axis labels, font size, adding p-values, etc.).
Additional files:
Three .csv files have been included for download:
"ELISA_data_wide_format.csv", a wide format data table of 25 human challenge individuals and 219 variables. Equivalent to the 'data' dataframe in the RData file.
"CHIVAS_luminex.csv", a long format data table of 25 human challenge participants at 1 week, 1 month, and 3 months. Equivalent to the 'luminex' dataframe in the RData file.
"Luminex.children.csv", a datatable of 6 luminex variables for 39 children (healthy and post pharyngitis). Equivalent to the 'luminex.children' dataframe in the RData file.
Analyses are reproducible using R version 3.3.2 or above (R Core Team 2016).
Files needed for reproducing the analyses are:
chond-data.csv: Data frame with 63 rows (species) and 11 variables. Some of these variables are based on the same life history trait but are transformed for ease of interpretation and analysis.
stein-et-al-single.tree: Phylogenetic tree with scaled branch lengths from Stein et al. (2018) used in analyses. These are freely downloadable from http://vertlife.org/sharktree/.
rmax-scaling-analysis.R: R code with a minimum working example of how to load the data files, fit phylogenetic linear models using the pgls function in the caper package, run information-theoretic comparisons, and check diagnostics. A rough sketch of these steps is given below.
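A hedged sketch of these steps (the column names species, rmax and mass, and the model formula, are assumptions; the actual code is in rmax-scaling-analysis.R):
library(ape)     # read.tree
library(caper)   # comparative.data, pgls
dat  <- read.csv("chond-data.csv", stringsAsFactors = FALSE)
tree <- read.tree("stein-et-al-single.tree")
cdat <- comparative.data(phy = tree, data = dat, names.col = species, vcv = TRUE)  # assumed name column
m1   <- pgls(log(rmax) ~ log(mass), data = cdat, lambda = "ML")                    # hypothetical model
summary(m1)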