Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data ready for analysis: only the relevant variables have been retained, and no transformations have been applied to the raw data beyond renaming and relabeling. A codebook explaining the data features is included. The code used for the data analysis is attached in two formats: a .Rmd literate-programming file with code, plain text and outputs that can be run directly in RStudio, and a .R file with the programming code and comments.
The purpose of this project was to gain additional practice with, and demonstrate, R data-analysis skills. The data set was located on Kaggle and shows sales information for the years 2010 to 2012. Weekly sales fall into two categories, holiday and non-holiday, represented by 1 and 0 respectively in the Holiday_Flag column.
The main question for this exercise was: were there any factors that affected weekly sales for the stores? The candidate factors included temperature, fuel prices, and unemployment rates.
install.packages("tidyverse")
install.packages("dplyr")
install.packages("tsibble")
library("tidyverse")
library(readr)
library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)
library(tsibble)
Walmart <- read.csv("C:/Users/matth/OneDrive/Desktop/Case Study/Walmart.csv")
Compared column names of each file to verify consistency.
colnames(Walmart)
colnames(Walmart)
dim(Walmart)
str(Walmart)
head(Walmart)
which(is.na(Walmart$Date))
sum(is.na(Walmart))
There is NA data in the set.
Walmart$Store<-as.factor(Walmart$Store)
Walmart$Holiday_Flag<-as.factor(Walmart$Holiday_Flag)
Walmart$week<-yearweek(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y"))) # make sure to install "tsibble"
Walmart$year<-format(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y")),"%Y")
Walmart_Holiday<-
filter(Walmart, Holiday_Flag==1)
Walmart_Non_Holiday<-
filter(Walmart, Holiday_Flag==0)
ggplot(Walmart, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Weekly Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
From the boxplot, it appears that Store 14 had the maximum weekly sales while Store 33 had the minimum. Let's verify the results via slice_max() and slice_min():
Walmart %>% slice_max(Weekly_Sales)
Walmart %>% slice_min(Weekly_Sales)
It looks like the information was correct. Let's check the mean of the Weekly_Sales column:
mean(Walmart$Weekly_Sales)
The mean of the Weekly_Sales column for the Walmart dataset was 1,046,965.
ggplot(Walmart_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Holiday Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
Based on the boxplot, Store 4 had the highest weekly sales during a holiday week, while stores 33 and 5 had some of the lowest holiday sales. Let's verify again with slice_max() and slice_min():
Walmart_Holiday %>% slice_max(Weekly_Sales)
Walmart_Holiday %>% slice_min(Weekly_Sales)
The results match the boxplot. Let's find the mean:
mean(Walmart_Holiday$Weekly_Sales)
The mean was 1,122,888.
ggplot(Walmart_Non_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Non-Holiday Sales Across 45 Stores', x='Weekly sales', y='Store')+theme_bw()
The results match those of the full Walmart dataset, which contained both holiday and non-holiday weeks: Store 14 had the maximum sales and Store 33 had the minimum. Let's verify the results and find the mean:
Walmart_Non_Holiday %>% slice_max(Weekly_Sales)
Walmart_Non_Holiday %>% slice_min(Weekly_Sales)
mean(Walmart_Non_Holiday$Weekly_Sales)
The results matched, and the mean weekly sales value was 1,041,256.
ggplot(data = Walmart) + geom_point(mapping = aes(x=year, y=Weekly_Sales))
According to the plot, 2010 had the most sales. Let's use a boxplot to see more.
ggplot(Walmart, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Weekly Sales for Years 2010 - 2012',
x='Year', y='Weekly Sales')
2010 saw higher sales numbers and a higher median.
Let's start with holiday weekly sales:
ggplot(Walmart_Holiday, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Holiday Weekly Sales for Years ...
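Coming back to the original question about external factors, here is a minimal, hedged sketch of how weekly sales could be checked against them; the column names Temperature, Fuel_Price and Unemployment are assumptions about the Kaggle file rather than something verified above.
# Correlation between weekly sales and the candidate factors (column names assumed)
cor(Walmart[, c("Weekly_Sales", "Temperature", "Fuel_Price", "Unemployment")])
# Example scatter plot for one candidate factor
ggplot(Walmart, aes(x = Temperature, y = Weekly_Sales)) + geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(title = 'Weekly Sales vs Temperature', x = 'Temperature', y = 'Weekly sales') + theme_bw()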
The program MAXENT v3.4.1, which was used for the SDM analyses, is freely available. ASC layers can be viewed using the open-source program QGIS. R and RStudio (both freely available) and associated open-source packages were used to process and analyze data.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Publication
will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to respectively across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).
1-script-create-input-data-raw.r preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocates with be going to) and (iv) will (frequency of the collocates with will); the result is available in input_data_raw.txt.
2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.
input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart used as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).
Use the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
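As a rough illustration of the normalisation step performed by the second script, here is a minimal, hedged sketch in R; the column names (decade, coll, will, BE_going_to) and the layout of coha_size.txt (columns decade and size) are assumptions based on the description above, not the actual file headers.
library(dplyr)
input_data_raw <- read.delim("input_data_raw.txt")   # long-format frequencies
coha_size <- read.delim("coha_size.txt")             # decade sizes (assumed columns: decade, size)
input_data_futurate <- input_data_raw %>%
  left_join(coha_size, by = "decade") %>%
  mutate(will_pmw = will / size * 1e6,                # per-million-words normalisation
         be_going_to_pmw = BE_going_to / size * 1e6)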
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these codes for underwater research work in applied ecology analysis of threatened endemic fishes and their natural habitat. For this, we developed codes in the RStudio® script environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), modelmetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbuettel & Balamuta, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).
It is important to follow all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance similarity metric because of their adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization of the code used to run the GLM and DOMAIN models:
In the first instance, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend the use of 10,000 background points when using regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered factors such as the extent of the area and the type of study species to be important for the correct selection of the number of points (pers. obs.). Then, we extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) as a function of the presence and background points (e.g., Hijmans and Elith, 2017).
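A minimal sketch of what this step could look like with dismo/raster (this is not the authors' Code2; the folder name and the presence_points object are assumptions):
library(dismo)
library(raster)
predictors <- stack(list.files("variables", pattern = "\\.tif$", full.names = TRUE))
set.seed(1)
bg <- randomPoints(predictors, n = 10000)           # background points
presvals <- extract(predictors, presence_points)    # predictor values at presence records
bgvals <- extract(predictors, bg)                   # predictor values at background points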
Subsequently, we subdivided both the presence and background point groups into 75% training data and 25% test data each, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, the 10-fold cross-validation method was selected, with the response variable (presence) assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).
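A hedged sketch of the subdivision and training control with caret; the data frame sdm_data and its presence column are assumed names, not taken from the authors' scripts:
library(caret)
set.seed(1)
idx <- createDataPartition(sdm_data$presence, p = 0.75, list = FALSE)
train_set <- sdm_data[idx, ]
test_set <- sdm_data[-idx, ]
train_set$presence <- as.factor(train_set$presence)   # response assigned as a factor
ctrl <- trainControl(method = "cv", number = 10)      # 10-fold cross-validation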
After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), from which we obtained the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and 5,000 iterations (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The resulting plots were the partial dependence plots for each predictor variable.
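A minimal sketch of such a GBM run with the gbm package, assuming a 0/1 numeric presence response in train_set; the exact arguments of Code3/Code4 are not reproduced here:
library(gbm)
gbm_mod <- gbm(presence ~ ., data = train_set, distribution = "gaussian",
               n.trees = 5000, cv.folds = 10)
summary(gbm_mod)          # relative contribution of each variable
plot(gbm_mod, i.var = 1)  # partial dependence for the first predictor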
Subsequently, the correlation between variables was computed with Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity (Guisan & Hofer, 2003). A bivariate correlation threshold of ±0.70 is recommended for discarding highly correlated variables (e.g., Awan et al., 2021).
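A hedged sketch of this screening step; predictor_names is an assumed character vector of predictor column names:
library(corrplot)
cors <- cor(train_set[, predictor_names], method = "pearson", use = "complete.obs")
corrplot(cors, method = "number")
which(abs(cors) > 0.70 & abs(cors) < 1, arr.ind = TRUE)  # pairs above the +/-0.70 threshold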
Once the above codes were run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing; Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the significance (p-value) of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree, to obtain linear and quadratic responses (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, whose plots include the probability of occurrence against values for continuous variables or categories for discrete variables. The points of the presence and background training groups are also included.
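A minimal sketch of a single-variable polynomial GLM and its response curve; the predictor name bio1 is an assumption for illustration and this is not the authors' Code7:
glm_bio1 <- glm(presence ~ poly(bio1, 2), family = binomial(link = "logit"), data = train_set)
summary(glm_bio1)   # p-value per term (alpha <= 0.05)
newdat <- data.frame(bio1 = seq(min(train_set$bio1), max(train_set$bio1), length.out = 200))
newdat$prob <- predict(glm_bio1, newdata = newdat, type = "response")
plot(newdat$bio1, newdat$prob, type = "l", xlab = "bio1", ylab = "Probability of occurrence")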
A global GLM was also run, and the generalized model was evaluated by means of a 2 x 2 contingency matrix including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we selected an arbitrary threshold of 0.5 to obtain better modeling performance and avoid a high percentage of bias from type I (omission) or type II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).
Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).
                 Validation set
Model            True        False
Presence         A           B
Background       C           D
We then calculated the Overall accuracy and True Skill Statistic (TSS) metrics. The first assesses the proportion of correctly predicted cases, while the second assesses the prevalence of correctly predicted cases (Olden and Jackson, 2002). The TSS also gives equal importance to the prevalence of presence predictions and to the correction for random performance (Fielding and Bell, 1997; Allouche et al., 2006).
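Using the letters of Table 1 (A = true presences, B = false presences, C = true backgrounds, D = false backgrounds), these two metrics reduce to a few lines of R; the counts themselves would come from the 0.5 threshold mentioned above:
# Counts A, B, C, D taken from the 2 x 2 contingency matrix (Table 1)
overall <- (A + C) / (A + B + C + D)  # proportion of correctly predicted cases
sens <- A / (A + D)                   # sensitivity (correctly predicted presences)
spec <- C / (C + B)                   # specificity (correctly predicted backgrounds)
TSS <- sens + spec - 1                # True Skill Statistic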
The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background groups, each subdivided into 75% training and 25% test data. Only the presence training subset and the predictor-variable stack were included in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.
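A hedged sketch of this step with dismo; presence_train, presence_test and background_test are assumed objects holding the corresponding coordinate subsets:
library(dismo)
dom <- domain(predictors, presence_train)    # DOMAIN (Gower) model on the training presences
dom_map <- predict(predictors, dom)          # habitat suitability surface
ev <- evaluate(p = presence_test, a = background_test, model = dom, x = predictors)
ev@auc                                       # AUC used later for validation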
Regarding the model evaluation and estimation, we selected the following estimators:
1) partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's prediction performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).
2) ROC/AUC curve for model validation, where an optimal performance threshold is estimated to have an expected confidence of 75% to 99% probability (De Long et al., 1988).
Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
How do annual members and casual riders use Cyclistic bikes differently?
What is the problem you are trying to solve?
How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
The insights will help the marketing team design a strategy aimed at casual riders.
Where is your data located?
The data is located in Cyclistic's own organizational data.
How is data organized?
The datasets are in CSV format, one per month, covering financial year 2022.
Are there issues with bias or credibility in this data? Does your data ROCCC?
The data is credible and satisfies ROCCC because it was collected by the Cyclistic organization itself.
How are you addressing licensing, privacy, security, and accessibility?
The company has its own license over the dataset, and the dataset does not contain any personal information about the riders.
How did you verify the data’s integrity?
All the files have consistent columns and each column has the correct type of data.
How does it help you answer your questions?
Insights are always hidden in the data; we have to interpret the data to find them.
Are there any problems with the data?
Yes, the starting and ending station names contain null values.
What tools are you choosing and why?
I used RStudio for cleaning and transforming the data for the analysis phase, because of the large dataset size and to gain experience in the language.
Have you ensured the data’s integrity?
Yes, the data is consistent throughout the columns.
What steps have you taken to ensure that your data is clean?
First, duplicates and null values were removed, then new columns were added for analysis.
How can you verify that your data is clean and ready to analyze?
Make sure the column names are consistent throughout all datasets by using the bind_rows() function.
Make sure the column data types are consistent throughout all datasets by using compare_df_cols() from the janitor package.
Combine all the datasets into a single data frame for consistency throughout the analysis.
Remove the columns start_lat, start_lng, end_lat and end_lng from the data frame because they are not required for the analysis.
Create new columns day, date, month and year from the started_at column; this provides additional opportunities to aggregate the data.
Create the ride_length column from the started_at and ended_at columns to find the average ride duration.
Remove rows with null values from the dataset by using the na.omit() function (a sketch of these steps follows below).
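A minimal, hedged sketch of those steps in R; the monthly data-frame names (e.g. apr_2022, may_2022) and the Divvy-style column names started_at/ended_at are assumptions, not taken from the files themselves:
library(dplyr)
library(janitor)
compare_df_cols(apr_2022, may_2022)          # check that column types are consistent
all_trips <- bind_rows(apr_2022, may_2022)   # combine the monthly files (one per month)
all_trips <- all_trips %>%
  select(-c(start_lat, start_lng, end_lat, end_lng)) %>%   # drop columns not needed
  mutate(date = as.Date(started_at),
         month = format(date, "%m"),
         day = format(date, "%d"),
         year = format(date, "%Y"),
         ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))) %>%
  na.omit()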
Have you documented your cleaning process so you can review and share those results?
Yes, the cleaning process is documented clearly.
How should you organize your data to perform analysis on it?
The data has been organized into one single data frame by using the read_csv() function in R.
Has your data been properly formatted?
Yes, all the columns have their correct data type.
What surprises did you discover in the data?
Casual members' ride durations are longer than annual members'.
Casual members use docked bikes far more than annual members do.
What trends or relationships did you find in the data?
Annual members mainly use the service for commuting.
Casual members prefer docked bikes.
Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
These insights help to build a profile for each member type.
Were you able to answer the question of how ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 20 Hz sampled CAN bus data from a passenger vehicle, e.g. WheelSpeed FL (speed of the front left wheel), SteerAngle (steering wheel angle), Role, Pitch, and accelerometer values per direction. In contrast to the dataset published at https://zenodo.org/record/2658168#.XMw2m6JS9PY, this record includes GPS data from the vehicle (see signals 'Latitude_Vehicle' and 'Longitude_Vehicle' in h5 group 'Math') and GPS data from the IMU device (see signals 'Latitude_IMU', 'Longitude_IMU' and 'Time_IMU' in h5 group 'Math'). However, as it was exported with single precision, we lost some precision for those GPS values. We are currently looking for a solution and will update the records if possible.
For data analysis we use R and RStudio (https://www.rstudio.com/) and the library h5. E.g., check a file with R code:
library(h5)
f <- h5file("file path/20181113_Driver1_Trip1.hdf")
summary(f["CAN/Yawrate1"][,])
summary(f["Math/Latitude_IMU"][,])
h5close(f)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Zenodo repository contains raw data tables, a Shiny app (via dockerfile), and a sqlite database that makes up the p53motifDB (p53 motif database).
The p53motifDB is a compendium of genomic locations in the human hg38 reference genome that contain recognizable DNA sequences that match the binding preferences for the transcription factor p53. Multiple types of genomic, epigenomic, and genome variation data were integrated with these locations in order to let researchers quickly generate hypotheses about novel activities of p53 or validate known behaviors.
The raw data tables (raw_tables.tar.gz) are divided into the "primary" table, containing p53 motif locations and other biographical information relating to those genomic locations, and the "accessory" tables, which contain additional descriptive or quantitative information that can be queried based on the information in the "primary" table. A description of the table schema for the primary table and all accessory tables can be found in Schema_p53motifDB.xlsx.
Table_1_DataSources.xlsx contains information about all raw and processed data sources that were used in the construction of the p53motifDB.
The Shiny app is designed to allow rapid filtering, querying, and downloading of the primary and accessory tables. Users can access a web-based version at https://p53motifDB.its.albany.edu. Users can also deploy the Shiny app locally by downloading and extracting p53motifDB_shiny.zip and doing one of the following:
Option 1: From the extracted folder, run the included Dockerfile to create a Docker image which will deploy to localhost port 3838.
Option 2: From the shiny_p53motifDB subfolder, run app.R from R or RStudio. This requires a number of dependencies, which may not be compatible with your current version of R. We highly recommend accessing the Shiny app via the web or through the Dockerfile.
Users can perform more complex database queries (beyond those available in the Shiny app) by first downloading sqlite_db.tar.gz. Unpacking this file will reveal the database file p53motifDB.db. This is a sqlite database file containing the same "primary" and "accessory" data from raw_tables.tar.gz and can be used/queried using standard structured query language. The schema of this database, including relationships between tables, can be seen in p53motifDB_VISUAL_schema.pdf, and additional information about each table and the column contents can be examined in the file Schema_p53motifDB.xlsx.
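For example, a minimal, hedged sketch of opening the unpacked database from R with DBI/RSQLite; no table names are assumed beyond what dbListTables() reports:
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "p53motifDB.db")
tables <- dbListTables(con)   # primary and accessory table names
head(dbGetQuery(con, paste("SELECT * FROM", tables[1], "LIMIT 5")))
dbDisconnect(con)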
The gzipped TAR file sqlite_db.tar.gz also contains all of the files and information necessary to reconstruct p53motifDB.db via R. Users can source the included R script (database_sqlite_commit.R) or can open, examine, and run it via RStudio. We strongly advise unpacking the TAR file, which will produce a folder called sqlite_db, and then running the included R script from within that folder, either with source() or line-by-line in RStudio. The result of this script will be p53motifDB.db and an RData object (sqlite_construction.RData) written to the sqlite_db folder.
If opening and running database_sqlite_commit.R via RStudio, please uncomment line 10 and comment out lines 13 and 14.
Please also be aware of the minimal package dependencies in R. The included version of p53motifDB.db was created using R (v. 3.4.0) and the following packages (and versions) available via CRAN:
RSQLite (v. 2.3.7), DBI (v. 1.2.3), tidyverse (2.0.0), and utils (v. 4.3.0) packages
The p53motifDB was created by Morgan Sammons, Gaby Baniulyte, and Sawyer Hicks.
Please let us know if you have any questions, comments, or would like additional datasets included in the next version of the p53motifDB by contacting masammons(at)albany.edu
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.
1. Datasets
The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.
1.1 CSV format
The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.
1.2 SQLite database format
On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.
2. Scripts
There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).
2.1 Source data
The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.
2.2 Data preparation
Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:
python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE
Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.
2.3 Isogram Extraction
After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:
python isograms.py --batch --infile=INFILE --outfile=OUTFILE
Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.
2.4 Creating a SQLite3 database
The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.
2.5 Statistical processing
The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
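As a starting point, a hedged sketch of querying the database from R with RSQLite; the table name ngrams_isograms is an assumption (check dbListTables() for the real names), while the column names come from the schema in section 1:
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "isograms.db")
dbListTables(con)   # inspect the actual table names first
dbGetQuery(con, "SELECT length, COUNT(*) AS n_palindromes
                 FROM ngrams_isograms
                 WHERE is_palindrome = 1
                 GROUP BY length ORDER BY length")
dbDisconnect(con)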
What is this? In this case study, I use a bike-share company's data to compare the riding behaviour of members and casual riders, determine if there are any trends or patterns, and theorize about what is causing them. I am then able to develop a recommendation based on those findings.
Content: Hi. This is my first data analysis project and also my first time using R in my work. It is the capstone project for the Google Data Analytics Certificate course offered on Coursera (https://www.coursera.org/professional-certificates/google-data-analytics). It is about an operational data analysis of a fictional bike-share company in Chicago. For the detailed background story, please check the PDF file (Case 01.pdf) for reference.
In this case study, I use the bike-share company's data to compare the riding behaviour of members and casual riders, determine if there are any trends or patterns, and theorize, through descriptive analysis, about what is causing them. I am then able to develop a recommendation based on those findings.
First, I give a background introduction, my business tasks and objectives, and how I obtained the data sources for the analysis. This is followed by the R code I wrote in RStudio for data processing, cleaning, and generating graphs for the next part of the analysis. Next come my analyses of the bike data, with graphs and charts generated with ggplot2. At the end, I also provide some recommendations for the business tasks, based on the data findings.
I understand that I am new to data analysis and that the skills and code here are very beginner level, but I am working hard to learn more in both R and the data science field. If you have any ideas or feedback, please feel free to comment.
Stanley Cheng 2021-09-30
Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is retrieved from the user Mobius's page, where it was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. I would like to thank Möbius and everyone responsible for the work.
Bellabeat Case Study 1
2022-11-14
1. Introduction
Hello everyone, my name is Nur Simais and this project is part of the Google Data Analytics Professional Certificate. There have been multiple skills and skillsets learned throughout this course that can mainly be categorized under soft and hard skills. The case study I have chosen is about the company called “Bellabeat”, a maker of fitness tracker devices. The company was founded in 2013 by Urška Sršen and Sando Mur, and it gradually gained recognition and expanded into many countries (https://bellabeat.com/). Having given this brief info about the company, I'd like to say that doing the business analysis will help the company see how it can achieve its goals and what can be done to improve further.
During the analysis process, I will be using Google's “Ask-Prepare-Process-Analyze-Share-Act” framework that I learned throughout this certification, applying the relevant tools and skillsets to it.
1.ASK
1.1 Business Task The goal of this project is to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices and how to apply these insights into Bellabeat’s marketing strategy using these three questions:
What are some trends in smart device usage? How could these trends apply to Bellabeat customers? How could these trends help influence Bellabeat marketing strategy?
2.PREPARE
Prepare the data and libraries in RStudio. Collect the data required for analysis; since the data, FitBit Fitness Tracker Data (CC0: Public Domain), is publicly available on Kaggle, download the dataset from there.
There are 18 files, but after examining them in Excel I decided to use these 8 datasets: dailyActivity_merged.csv, heartrate_seconds_merged.csv, hourlyCalories_merged.csv, hourlyIntensities_merged.csv, hourlySteps_merged.csv, minuteMETsNarrow_merged.csv, sleepDay_merged.csv, weightLogInfo_merged.csv
2.1 Install and load the packages
Install the R libraries for analysis and visualizations:
install.packages("tidyverse") # core package for cleaning and analysis
install.packages("lubridate") # date library mdy()
install.packages("janitor") # clean_names() to consists only _, character, numbers, and letters.
install.packages("dplyr") #helps to check the garmmar of data manioulation
Load the libraries
library(tidyverse)
library(janitor)
library(lubridate)
library(dplyr)
Having loaded the tidyverse package, the rest of the essential packages (ggplot2, dplyr, and tidyr) are loaded as well.
2.2 Importing and Preparing the Dataset Upload the archived dataset to RStudio by clicking the Upload button in the bottom right pane.
The files will be saved in a new folder named “Fitabase Data 4.12.16-5.12.16”. Import the datasets and rename them:
daily_activity <- read.csv("dailyActivity_merged.csv")
heartrate_seconds <- read_csv("heartrate_seconds_merged.csv")
hourly_calories <- read_csv("hourlyCalories_merged.csv")
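The remaining files named earlier can be imported the same way; a hedged sketch, applying clean_names() from janitor to standardise the column names (the object names on the left are my own choices):
hourly_intensities <- read_csv("hourlyIntensities_merged.csv") %>% clean_names()
hourly_steps <- read_csv("hourlySteps_merged.csv") %>% clean_names()
minute_mets <- read_csv("minuteMETsNarrow_merged.csv") %>% clean_names()
sleep_day <- read_csv("sleepDay_merged.csv") %>% clean_names()
weight_log <- read_csv("weightLogInfo_merged.csv") %>% clean_names()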
Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
As a junior data analyst at a fanfiction analytics consultancy, I was tasked with analyzing how archive warnings are distributed across fanfiction works on Archive of Our Own (AO3). The client is interested in understanding:
Better understanding of archive warnings can:
The dataset includes ~600,000 AO3 fanfiction works, organized across three tables:
works: metadata on fanfiction works
tags: includes tag types like archive warnings and fandoms
work_tag: many-to-many mapping of works and tags, keyed on work_id and tag_id; archive warning and fandom tags are identified by type == "ArchiveWarning" and type == "Fandom"
| Warning Name | Total Works | % of All Works |
|---|---|---|
| No Archive Warnings Apply | 32,051 | 5.33% |
| Choose Not To Use Archive Warnings | 21,591 | 3.59% |
| Graphic Depictions Of Violence | 5,281 | 0.88% |
| Major Character Death | 3,009 | 0.50% |
| Rape/Non-Con | 1,650 | 0.27% |
# Filter archive warning tags
archive_warnings <- tags %>%
filter(type == "ArchiveWarning") %>%
select(warning_id = id, warning_name = name)
# Filter tag mapping for works that use archive warnings
work_warnings <- work_tag %>%
filter(tag_id %in% archive_warnings$warning_id)
# Total number of works with at least one archive warning
total_works_with_warning <- work_warnings %>%
summarise(total = n_distinct(work_id)) %>%
pull(total)
# Count per warning and join with tag names
warning_summary <- work_warnings %>%
group_by(tag_id) %>%
summarise(total_works_with_warning = n_distinct(work_id)) %>%
mutate(percent_of_all_works = (total_works_with_warning / 601286) * 100) %>%
rename(warning_id =...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Baseline data table for comparison of the training and validation groups.
Extra-organismal DNA (eoDNA) from material left behind by organisms (non-invasive DNA: e.g., faeces, hair) or from environmental samples (eDNA: e.g., water, soil) is a valuable source of genetic information. However, the relatively low quality and quantity of eoDNA, which can be further degraded by environmental factors, results in reduced amplification and sequencing success. This is often compensated for through cost- and time-intensive replications of genotyping/sequencing procedures. Therefore, system- and site-specific quantifications of environmental degradation are needed to maximize sampling efficiency (e.g., fewer replicates, shorter sampling durations), and to improve species detection and abundance estimates. Using ten environmentally diverse bat roosts as a case study, we developed a robust modelling pipeline to quantify the environmental factors degrading eoDNA, predict eoDNA quality, and estimate sampling-site-specific ideal exposure duration. Maximum humidity was the stro...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here you find an example research data dataset for the automotive demonstrator within the "AEGIS - Advanced Big Data Value Chain for Public Safety and Personal Security" big data project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732189. The time series data has been collected during trips conducted by three drivers driving the same vehicle in Austria.
The dataset contains 20Hz sampled CAN bus data from a passenger vehicle, e.g. WheelSpeed FL (speed of the front left wheel), SteerAngle (steering wheel angle), Role, Pitch, and accelerometer values per direction.
GPS data from the vehicle (see signals 'Latitude_Vehicle' and 'Longitude_Vehicle' in h5 group 'Math') and GPS data from the IMU device (see signals 'Latitude_IMU', 'Longitude_IMU' and 'Time_IMU' in h5 group 'Math') are included. However, as it had to be exported with single-precision, we lost some precision for those GPS values.
For data analysis we use R and R Studio (https://www.rstudio.com/) and the library h5.
e.g. check file with R code:
library(h5)
f <- h5file("file path/20181113_Driver1_Trip1.hdf")
summary(f["CAN/Yawrate1"][,])
summary(f["Math/Latitude_IMU"][,])
h5close(f)
Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
Cyclistic ride-share service 12-month data, used to categorize and analyze how the two user types, annual members and casual riders, use the service differently. The data set involves 12 separate CSV files merged together using RStudio, resulting in 5.8 million+ rows of data. Dashboard: https://public.tableau.com/app/profile/richgg/viz/CyclisticCapstoneDashboard/Dashboard1 (check the PowerPoint presentation for more analysis information from the data set). Notebook reference for the analysis: https://www.kaggle.com/code/therichgg/cyclistic-data-analysis-on-user-type-differences/notebook
Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
Cyclistic Bikes: A Comparison Between Casual and Annual Memberships
As part of the Google Data Analytics Certificate, I have been asked to complete a case study on the maximisation of Annual memberships vs those who choose the single and day-pass options.
The business goal of Cyclistic is clear: convert more riders to annual memberships in an attempt to boost profits. The question is whether such a goal is truly profitable in the long term.
For this task, I will take the previous 12 months of data available from a public AWS server, https://divvy-tripdata.s3.amazonaws.com/index.html, and use that to build a forecast for the following years, looking for trends and possible problems that may impede Cyclistic's ultimate goal.
Sources and Tools
Rstudio: Tidyverse - Lubridate https://divvy-tripdata.s3.amazonaws.com/index.html
Business Goal
Under the direction of Lily Moreno and, by extension Cyclistic, the aim of this case study will be to analyse the differences in usage between Casual and Annual members.
For clarity, Casual members will be those who use the Day and Single Use options when using Cyclistic, whilst Annual refers to those who purchase a 12 month subscription to the service.
The ultimate goal is to see if there is a clear business reason to push forward with a marketing campaign to convert Casual users into Annual memberships.
Tasks and Data Storage
The data I will be using was previously stored on an AWS server at https://divvy-tripdata.s3.amazonaws.com/index.html. This location is publicly accessible but the data within can only be downloaded and edited locally.
For the purposes of this task, I have downloaded the data for the year 2022, 12 separate files that I then collated into a single zip file to upload to Rstudio for the purposes of cleaning, arranging and studying the information. The original files will be located on my PC and at the AWS link. As part of the process, a backup file will be created within Rstudio to ensure that the original data is always available.
Process
After uploading the data to RStudio and applying a naming convention (Month), the next step was to compare and equate the names of the columns. As the information came from 2022, two years after Cyclistic updated their naming conventions, this step was more of a formality to ensure that the files could later be joined into one. No irregularities were found at this stage.
As all column names matched, there was no need to rename them. Furthermore, all ride_id fields were already in character format.
Once this check was complete, all tables were compiled into one, named all_trips.
Cleaning
The first issue found was the set of labels used to identify the different member types. The files used four labels: "member" and "subscriber" for annual members, and "Customer" and "casual" for casual users. These four labels were consolidated into two, member and casual.
As the original files only captured data at the ride level, more fields were added in the form of day, week, month and year to enable more opportunities to aggregate the data.
ride_length was added for consistency and to provide a clearer output. After adding this column, it was converted from factor to numeric so that the final output could be measured (a short sketch of these steps follows).
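A short, hedged sketch of those two steps, assuming the member-type column is named member_casual as in the code further down:
all_trips <- all_trips %>%
  mutate(member_casual = recode(member_casual,
                                "Subscriber" = "member",
                                "Customer" = "casual"))   # consolidate the four labels into two
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))  # factor to numeric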
Analysis
Here, I will provide the final code used to describe the final process.
mean(all_trips_v2$ride_length)   # straight average (total ride length / rides)
median(all_trips_v2$ride_length) # midpoint number in the ascending array of ride lengths
max(all_trips_v2$ride_length)    # longest ride
min(all_trips_v2$ride_length)    # shortest ride
summary(all_trips_v2$ride_length)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
all_trips_v2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%  # creates weekday field using wday()
  group_by(member_casual, weekday) %>%                  # groups by usertype and weekday
  summarise(number_of_rides = n() ...
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
The spatial distribution of individuals within ecological assemblages, and their associated traits and behaviors, are key determinants of ecosystem structure and function. Consequently, determining the spatial distribution of species, and how distributions influence patterns of species richness across ecosystems today and in the past, helps us understand what factors act as fundamental controls on biodiversity. Here, we explore how ecological niche modeling has contributed to understanding the spatiotemporal distribution of past biodiversity, and past ecological and evolutionary processes. We first perform a semi-quantitative literature review to capture studies that applied ecological niche models (ENMs) in the past, identifying 668 studies. We coded each study according to focal taxonomic groups and whether and how the study used fossil evidence, whether it relied on evidence or methods in addition to ENMs, and spatial scale and temporal intervals. We used trends in publication patterns across categories to anchor discussion of recent technical advances in niche modeling, focusing on paleobiogeographic ENM applications. We then explored the contributions of ENMs to paleobiogeography, with a particular focus on examining patterns and associated drivers of range dynamics; phylogeography and within-lineage dynamics; macroevolutionary patterns and processes, including niche change, speciation, and extinction; drivers of community assembly; and conservation paleobiogeography. Overall, ENMs are powerful tools for elucidating paleobiogeographic patterns. ENMs are most commonly used to understand Quaternary dynamics, but an increasing number of studies use ENMs to gain important insight into both ecological and evolutionary processes in pre-Quaternary times. Deeper integration with traits and phylogenies may further extend those insights.
Methods
We conducted an initial search on 15 September 2023 for peer-reviewed articles, written in English, that applied ENMs to past time intervals, using both the Scopus and Web of Science databases with nearly identical search conditions (see Appendix 1 for full search terms). Our search and screening followed the PRISMA protocol for scoping reviews (Tricco et al. 2018). Article metadata was downloaded from each database (Scopus n = 16155, Web of Science n = 15600), and the two datasets were merged and duplicates removed (n = 22656). We screened article titles and abstracts to determine if they (a) projected an ENM to a point in time before 1800 A.D., and/or (b) included fossil occurrences in their ENM. We identified 668 studies that met our criteria, and randomly assigned these to the five authors to gather data on the ENM approaches therein. Data extracted from each article included taxonomic information (taxonomic description and resolution, and the number of taxonomic units analyzed), time periods for which data were modeled and projected, whether the fossil record was used for either model calibration or validation, whether additional data (e.g., molecular, isotopic, morphological, etc.) were used, and the geographic extent of the analysis. All data manipulation and analyses were performed in R (version 4.3.0; R Core Team 2014) using an RStudio interface (version 2023.06.1 Build 524 “Mountain Hydrangea”; Rstudio Team 2020). Data manipulations were carried out with dplyr (version 1.1.2; Wickham et al. 2023b), tidyr (version 1.3.0; Wickham et al. 2023a), and stringr (version 1.5.0; Wickham 2023).
Title and abstract screening was done through revtools (version 0.4.1; Westgate 2019). Referenced Literature:
R Core Team. 2014: R: A language and environment for statistical computing.
Rstudio Team. 2020: RStudio: integrated development for R.
Tricco, A. C., E. Lillie, W. Zarin, K. K. O’Brien, H. Colquhoun, D. Levac, D. Moher, M. D. Peters, T. Horsley, and L. Weeks. 2018: PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Annals of internal medicine 169:467–473.
Westgate, M. J. 2019: revtools: An R package to support article screening for evidence synthesis. Research synthesis methods 10:606–614.
Wickham, H. 2023: stringr: Simple, Consistent Wrappers for Common String Operations.
Wickham, H., D. Vaughan, and M. Girlich. 2023a: tidyr: Tidy Messy Data.
Wickham, H., R. François, L. Henry, K. Müller, and D. Vaughan. 2023b: dplyr: A Grammar of Data Manipulation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study applies a geographical-physical and statistical methodology to predict vineyard distribution in the Taurasi DOCG terroir, southern Italy. Integrating morpho-topographical, climatic, and pedological data through GIS-based logistic regression, it aims to refine vineyard site selection—traditionally guided by local expertise—via scientifically validated predictive tools. The Taurasi territory, marked by pronounced lithological and topographic heterogeneity and a viticulture-favorable climate, serves as an ideal case study. The model was developed using environmental variables, optimized through stepwise selection and Variance Inflation Factor (VIF) analysis, and validated using the Receiver Operating Characteristic (ROC) curve. The resulting suitability map identifies areas most conducive to viticulture, emphasizing the importance of altitude, slope, aspect, and temperature in shaping vineyard potential. Despite sensitivity to environmental data quality, the approach demonstrates the value of integrating geospatial and statistical methods for informed spatial planning. The study reinforces the role of data-driven strategies in optimizing and sustainably managing viticultural landscapes.
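A minimal, hedged sketch of the statistical core described above (logistic regression with stepwise selection, VIF screening and ROC validation); the data frame terroir_data and its variable names are assumptions for illustration, not the study's actual code:
library(car)    # vif()
library(pROC)   # roc(), auc()
full_mod <- glm(vineyard ~ altitude + slope + aspect + temperature,
                family = binomial, data = terroir_data)
step_mod <- step(full_mod, direction = "both")   # stepwise variable selection
vif(step_mod)                                    # check multicollinearity among retained variables
roc_obj <- roc(terroir_data$vineyard, predict(step_mod, type = "response"))
auc(roc_obj)                                     # ROC-based validation of the suitability model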
Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This remarkable dataset provides an awe-inspiring collection of over 50,000 books, encompassing the world's best practices in literature, poetry, and authorship. For each book included in the dataset, users can gain access to a wealth of insightful information such as title, author(s), average rating given by readers and critics alike, a brief description highlighting its plot or characteristics; language it is written in; unique ISBN which enables potential buyers to locate their favorite works with ease; genres it belongs to; any awards it has won or characters that inhabit its storyworld.
Additionally, seeking out readers' opinions on exceptional books is made easier by the availability of bbeScore (the "best books ever" score) alongside detailed rating breakdowns in the "ratingsByStars" column. To make sure visibility and recognition are granted fairly, be it a classic novel from time immemorial or a recently released newcomer, this source also allows us to evaluate new stories based on reader engagement, highlighted by the likedPercent column (the percentage of readers who liked the book), bbeVotes (the number of votes cast), and entries related to publication dates, including firstPublishDate.
Aspiring literature researchers, literary historians and those seeking hidden literary gems alike would no doubt benefit from delving into this collection: 25 variables covering different novels and poets, presented in the Kaggle open-source dataset "Best Books Ever: A Comprehensive Historical Collection of Literary Greats". What worlds await you?
Whether you are a student, researcher, or enthusiast of literature, this dataset provides a valuable source for exploring literary works from varied time periods and genres. By accessing all 25 variables in the dataset, readers have the opportunity to use them for building visualizations, creating new analysis tools and models, or finding books you might be interested in reading.
First, after downloading the dataset into the Kaggle Notebooks platform or another programming interface of your choice, such as RStudio or Python Jupyter Notebooks (pandas), make sure that the data is arranged into columns with clearly labeled names. This will help you understand which variable relates to which piece of information. Afterwards, explore each variable by looking for patterns across particular titles or interesting findings about certain authors or ratings relevant to your research interests.
Use the core columns Title (title), Author (author), Rating (rating), Description (description), Language (language), Genres (genres) and Characters (characters); these can help you discover trends between books according to style of composition, character types, etc. Then examine the more specific details offered by Book Format (bookFormat), Edition (edition) and Pages (pages), and look at publisher information along with Publish Date (publishDate). Also take note of the Awards column for recent recognition different titles have received, observe how many ratings have been collected per text through the Number of Ratings column (numRatings), analyze readers' feedback through Ratings By Stars (ratingsByStars), and view the percentage of readers who liked a particular book (likedPercent).
Beyond these more accessible factors, delve deeper into the other data provided: Setting (setting), Cover Image (coverImg), BBE Score (bbeScore) and BBE Votes (bbeVotes). All of these should provide greater insight when trying to explain why a certain book has made its way onto the Goodreads top-selections list. To estimate value, test out the Price (price) column too, determining whether some texts retain large popularity despite rather costly publishing options currently available on the market.
Finally, combine the different aspects observed while researching individual titles to create personalized recommendations based on the comprehensive lists provided. To achieve that, use the ISBN code provided, compare publication versus first-publication dates, and check the awards information to give context on how the books discussed here have progressed over the years.
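As a quick start, a hedged sketch in R of loading the file and looking at a few of the columns mentioned above; the file name books.csv is an assumption:
library(tidyverse)
books <- read_csv("books.csv")
books %>%
  select(title, author, rating, numRatings, likedPercent, genres) %>%
  arrange(desc(rating)) %>%
  head(10)   # ten highest-rated titles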
- Creating a web or mobile...