Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data ready for analysis: only the relevant variables have been retained, and no transformations have been applied to the raw data beyond renaming and relabeling. A codebook explaining the data features is included. The code used for the data analysis is attached in two formats: a .Rmd literate-programming file with code, plain text and outputs that can be run directly in RStudio, and a .R file with the programming code and comments.
The purpose of this project was to gain additional practice with, and demonstrate, R data-analysis skills. The data set was located on Kaggle and shows sales information for the years 2010 to 2012. Weekly sales fall into two categories, holiday and non-holiday, represented by 1 and 0 respectively in the Holiday_Flag column.
The main question for this exercise was: were there any factors that affected weekly sales for the stores? The candidate factors included temperature, fuel prices, and unemployment rates.
install.packages("tidyverse")
install.packages("dplyr")
install.packages("tsibble")
library("tidyverse")
library(readr)
library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)
library(tsibble)
Walmart <- read.csv("C:/Users/matth/OneDrive/Desktop/Case Study/Walmart.csv")
Compared column names of each file to verify consistency.
colnames(Walmart)
colnames(Walmart)
dim(Walmart)
str(Walmart)
head(Walmart)
which(is.na(Walmart$Date))
sum(is.na(Walmart))
There is NA data in the set.
Walmart$Store<-as.factor(Walmart$Store)
Walmart$Holiday_Flag<-as.factor(Walmart$Holiday_Flag)
Walmart$week<-yearweek(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y"))) # make sure to install "tsibble"
Walmart$year<-format(as.Date(Walmart$Date,tryFormats=c("%d-%m-%Y")),"%Y")
Walmart_Holiday<-
filter(Walmart, Holiday_Flag==1)
Walmart_Non_Holiday<-
filter(Walmart, Holiday_Flag==0)
ggplot(Walmart, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Weekly Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
From the boxplot, it appears that Store 14 had the maximum weekly sales while Store 33 had the minimum. Let's verify the results via slice_max() and slice_min():
Walmart %>% slice_max(Weekly_Sales)
Walmart %>% slice_min(Weekly_Sales)
It looks like the information was correct. Let's check the mean of the Weekly_Sales column:
mean(Walmart$Weekly_Sales)
The mean of the Weekly_Sales column for the Walmart dataset was 1,046,965.
ggplot(Walmart_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Holiday Sales Across 45 Stores',
x='Weekly sales', y='Store')+theme_bw()
Based on the boxplot, Store 4 had the highest weekly sales during a holiday week, while stores 33 and 5 had some of the lowest holiday sales. Let's verify again with slice_max() and slice_min():
Walmart_Holiday %>% slice_max(Weekly_Sales)
Walmart_Holiday %>% slice_min(Weekly_Sales)
The results match the boxplot. Let's find the mean:
mean(Walmart_Holiday$Weekly_Sales)
The mean was 1,122,888.
ggplot(Walmart_Non_Holiday, aes(x=Weekly_Sales, y=Store))+geom_boxplot()+ labs(title = 'Non-Holiday Sales Across 45 Stores', x='Weekly sales', y='Store')+theme_bw()
The results match those of the full Walmart dataset, which contained both holiday and non-holiday weeks: Store 14 had the maximum sales and Store 33 had the minimum. Let's verify the results and find the mean:
Walmart_Non_Holiday %>% slice_max(Weekly_Sales)
Walmart_Non_Holiday %>% slice_min(Weekly_Sales)
mean(Walmart_Non_Holiday$Weekly_Sales)
The results matched, and the mean weekly sales value was 1,041,256.
ggplot(data = Walmart) + geom_point(mapping = aes(x=year, y=Weekly_Sales))
According to the plot, 2010 had the most sales. Let's use a boxplot to see more.
ggplot(Walmart, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Weekly Sales for Years 2010 - 2012',
x='Year', y='Weekly Sales')
2010 saw higher sales numbers and a higher median.
Let's start with holiday weekly sales:
ggplot(Walmart_Holiday, aes(x=year, y=Weekly_Sales))+geom_boxplot()+ labs(title = 'Holiday Weekly Sales for Years ...
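Coming back to the original question about external factors, here is a minimal, hedged sketch of how weekly sales could be checked against them; the column names Temperature, Fuel_Price and Unemployment are assumptions about the Kaggle file rather than something verified above.
# Correlation between weekly sales and the candidate factors (column names assumed)
cor(Walmart[, c("Weekly_Sales", "Temperature", "Fuel_Price", "Unemployment")])
# Example scatter plot for one candidate factor
ggplot(Walmart, aes(x = Temperature, y = Weekly_Sales)) + geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  labs(title = 'Weekly Sales vs Temperature', x = 'Temperature', y = 'Weekly sales') + theme_bw()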
The program MAXENT v3.4.1, which was used for the SDM analyses, is freely available. ASC layers can be viewed using the open-source program QGIS. R and RStudio (both freely available) and associated open-source packages were used to process and analyze data.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Publication
will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to respectively across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).
1-script-create-input-data-raw.r preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocates with be going to) and (iv) will (frequency of the collocates with will); the result is available in input_data_raw.txt.
2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.
input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart used as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).
Use the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
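As a rough illustration of the normalisation step performed by the second script, here is a minimal, hedged sketch in R; the column names (decade, coll, will, BE_going_to) and the layout of coha_size.txt (columns decade and size) are assumptions based on the description above, not the actual file headers.
library(dplyr)
input_data_raw <- read.delim("input_data_raw.txt")   # long-format frequencies
coha_size <- read.delim("coha_size.txt")             # decade sizes (assumed columns: decade, size)
input_data_futurate <- input_data_raw %>%
  left_join(coha_size, by = "decade") %>%
  mutate(will_pmw = will / size * 1e6,                # per-million-words normalisation
         be_going_to_pmw = BE_going_to / size * 1e6)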
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these codes for underwater research work in applied ecology analysis of threatened endemic fishes and their natural habitat. For this, we developed codes in the RStudio® script environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), modelmetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbuettel & Balamuta, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).
It is important to follow all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance similarity metric because of their adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization of the code used to run the GLM and DOMAIN models:
In the first instance, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend the use of 10,000 background points when using regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered factors such as the extent of the area and the type of study species to be important for the correct selection of the number of points (pers. obs.). Then, we extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) as a function of the presence and background points (e.g., Hijmans and Elith, 2017).
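A minimal sketch of what this step could look like with dismo/raster (this is not the authors' Code2; the folder name and the presence_points object are assumptions):
library(dismo)
library(raster)
predictors <- stack(list.files("variables", pattern = "\\.tif$", full.names = TRUE))
set.seed(1)
bg <- randomPoints(predictors, n = 10000)           # background points
presvals <- extract(predictors, presence_points)    # predictor values at presence records
bgvals <- extract(predictors, bg)                   # predictor values at background points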
Subsequently, we subdivided both the presence and background point groups into 75% training data and 25% test data each, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, the 10-fold cross-validation method was selected, with the response variable (presence) assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).
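A hedged sketch of the subdivision and training control with caret; the data frame sdm_data and its presence column are assumed names, not taken from the authors' scripts:
library(caret)
set.seed(1)
idx <- createDataPartition(sdm_data$presence, p = 0.75, list = FALSE)
train_set <- sdm_data[idx, ]
test_set <- sdm_data[-idx, ]
train_set$presence <- as.factor(train_set$presence)   # response assigned as a factor
ctrl <- trainControl(method = "cv", number = 10)      # 10-fold cross-validation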
After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), from which we obtained the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and 5,000 iterations (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The resulting plots were the partial dependence plots for each predictor variable.
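A minimal sketch of such a GBM run with the gbm package, assuming a 0/1 numeric presence response in train_set; the exact arguments of Code3/Code4 are not reproduced here:
library(gbm)
gbm_mod <- gbm(presence ~ ., data = train_set, distribution = "gaussian",
               n.trees = 5000, cv.folds = 10)
summary(gbm_mod)          # relative contribution of each variable
plot(gbm_mod, i.var = 1)  # partial dependence for the first predictor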
Subsequently, the correlation between variables was computed with Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity (Guisan & Hofer, 2003). A bivariate correlation threshold of ±0.70 is recommended for discarding highly correlated variables (e.g., Awan et al., 2021).
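A hedged sketch of this screening step; predictor_names is an assumed character vector of predictor column names:
library(corrplot)
cors <- cor(train_set[, predictor_names], method = "pearson", use = "complete.obs")
corrplot(cors, method = "number")
which(abs(cors) > 0.70 & abs(cors) < 1, arr.ind = TRUE)  # pairs above the +/-0.70 threshold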
Once the above codes were run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing; Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the significance (p-value) of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree, to obtain linear and quadratic responses (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, whose plots include the probability of occurrence against values for continuous variables or categories for discrete variables. The points of the presence and background training groups are also included.
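A minimal sketch of a single-variable polynomial GLM and its response curve; the predictor name bio1 is an assumption for illustration and this is not the authors' Code7:
glm_bio1 <- glm(presence ~ poly(bio1, 2), family = binomial(link = "logit"), data = train_set)
summary(glm_bio1)   # p-value per term (alpha <= 0.05)
newdat <- data.frame(bio1 = seq(min(train_set$bio1), max(train_set$bio1), length.out = 200))
newdat$prob <- predict(glm_bio1, newdata = newdat, type = "response")
plot(newdat$bio1, newdat$prob, type = "l", xlab = "bio1", ylab = "Probability of occurrence")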
A global GLM was also run, and the generalized model was evaluated by means of a 2 x 2 contingency matrix including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we selected an arbitrary threshold of 0.5 to obtain better modeling performance and avoid a high percentage of bias from type I (omission) or type II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).
Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).
                 Validation set
Model            True        False
Presence         A           B
Background       C           D
We then calculated the Overall accuracy and True Skill Statistic (TSS) metrics. The first assesses the proportion of correctly predicted cases, while the second assesses the prevalence of correctly predicted cases (Olden and Jackson, 2002). The TSS also gives equal importance to the prevalence of presence predictions and to the correction for random performance (Fielding and Bell, 1997; Allouche et al., 2006).
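Using the letters of Table 1 (A = true presences, B = false presences, C = true backgrounds, D = false backgrounds), these two metrics reduce to a few lines of R; the counts themselves would come from the 0.5 threshold mentioned above:
# Counts A, B, C, D taken from the 2 x 2 contingency matrix (Table 1)
overall <- (A + C) / (A + B + C + D)  # proportion of correctly predicted cases
sens <- A / (A + D)                   # sensitivity (correctly predicted presences)
spec <- C / (C + B)                   # specificity (correctly predicted backgrounds)
TSS <- sens + spec - 1                # True Skill Statistic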
The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background groups, each subdivided into 75% training and 25% test data. Only the presence training subset and the predictor-variable stack were included in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.
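A hedged sketch of this step with dismo; presence_train, presence_test and background_test are assumed objects holding the corresponding coordinate subsets:
library(dismo)
dom <- domain(predictors, presence_train)    # DOMAIN (Gower) model on the training presences
dom_map <- predict(predictors, dom)          # habitat suitability surface
ev <- evaluate(p = presence_test, a = background_test, model = dom, x = predictors)
ev@auc                                       # AUC used later for validation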
Regarding the model evaluation and estimation, we selected the following estimators:
1) partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's prediction performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).
2) ROC/AUC curve for model validation, where an optimal performance threshold is estimated to have an expected confidence of 75% to 99% probability (De Long et al., 1988).
Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
How do annual members and casual riders use Cyclistic bikes differently?
What is the problem you are trying to solve?
How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
The insights will help the marketing team design a strategy aimed at casual riders.
Where is your data located?
The data is located in Cyclistic's own organizational data.
How is data organized?
The datasets are in CSV format, one per month, covering financial year 2022.
Are there issues with bias or credibility in this data? Does your data ROCCC?
The data is credible and satisfies ROCCC because it was collected by the Cyclistic organization itself.
How are you addressing licensing, privacy, security, and accessibility?
The company has its own license over the dataset, and the dataset does not contain any personal information about the riders.
How did you verify the data’s integrity?
All the files have consistent columns and each column has the correct type of data.
How does it help you answer your questions?
Insights are always hidden in the data; we have to interpret the data to find them.
Are there any problems with the data?
Yes, the starting and ending station names contain null values.
What tools are you choosing and why?
I used RStudio for cleaning and transforming the data for the analysis phase, because of the large dataset size and to gain experience in the language.
Have you ensured the data’s integrity?
Yes, the data is consistent throughout the columns.
What steps have you taken to ensure that your data is clean?
First, duplicates and null values were removed, then new columns were added for analysis.
How can you verify that your data is clean and ready to analyze?
Make sure the column names are consistent throughout all datasets by using the bind_rows() function.
Make sure the column data types are consistent throughout all datasets by using compare_df_cols() from the janitor package.
Combine all the datasets into a single data frame for consistency throughout the analysis.
Remove the columns start_lat, start_lng, end_lat and end_lng from the data frame because they are not required for the analysis.
Create new columns day, date, month and year from the started_at column; this provides additional opportunities to aggregate the data.
Create the ride_length column from the started_at and ended_at columns to find the average ride duration.
Remove rows with null values from the dataset by using the na.omit() function (a sketch of these steps follows below).
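A minimal, hedged sketch of those steps in R; the monthly data-frame names (e.g. apr_2022, may_2022) and the Divvy-style column names started_at/ended_at are assumptions, not taken from the files themselves:
library(dplyr)
library(janitor)
compare_df_cols(apr_2022, may_2022)          # check that column types are consistent
all_trips <- bind_rows(apr_2022, may_2022)   # combine the monthly files (one per month)
all_trips <- all_trips %>%
  select(-c(start_lat, start_lng, end_lat, end_lng)) %>%   # drop columns not needed
  mutate(date = as.Date(started_at),
         month = format(date, "%m"),
         day = format(date, "%d"),
         year = format(date, "%Y"),
         ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))) %>%
  na.omit()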
Have you documented your cleaning process so you can review and share those results?
Yes, the cleaning process is documented clearly.
How should you organize your data to perform analysis on it?
The data has been organized into one single data frame by using the read_csv() function in R.
Has your data been properly formatted?
Yes, all the columns have their correct data type.
What surprises did you discover in the data?
Casual members' ride durations are longer than annual members'.
Casual members use docked bikes far more than annual members do.
What trends or relationships did you find in the data?
Annual members mainly use the service for commuting.
Casual members prefer docked bikes.
Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
These insights help to build a profile for each member type.
Were you able to answer the question of how ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 20 Hz sampled CAN bus data from a passenger vehicle, e.g. WheelSpeed FL (speed of the front left wheel), SteerAngle (steering wheel angle), Role, Pitch, and accelerometer values per direction. In contrast to the dataset published at https://zenodo.org/record/2658168#.XMw2m6JS9PY, this record includes GPS data from the vehicle (see signals 'Latitude_Vehicle' and 'Longitude_Vehicle' in h5 group 'Math') and GPS data from the IMU device (see signals 'Latitude_IMU', 'Longitude_IMU' and 'Time_IMU' in h5 group 'Math'). However, as it was exported with single precision, we lost some precision for those GPS values. We are currently looking for a solution and will update the records if possible.
For data analysis we use R and RStudio (https://www.rstudio.com/) and the library h5. E.g., check a file with R code:
library(h5)
f <- h5file("file path/20181113_Driver1_Trip1.hdf")
summary(f["CAN/Yawrate1"][,])
summary(f["Math/Latitude_IMU"][,])
h5close(f)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Zenodo repository contains raw data tables, a Shiny app (via dockerfile), and a sqlite database that makes up the p53motifDB (p53 motif database).
The p53motifDB is a compendium of genomic locations in the human hg38 reference genome that contain recognizable DNA sequences that match the binding preferences for the transcription factor p53. Multiple types of genomic, epigenomic, and genome variation data were integrated with these locations in order to let researchers quickly generate hypotheses about novel activities of p53 or validate known behaviors.
The raw data tables (raw_tables.tar.gz) are divided into the "primary" table, containing p53 motif locations and other biographical information relating to those genomic locations, and the "accessory" tables, which contain additional descriptive or quantitative information that can be queried based on the information in the "primary" table. A description of the table schema for the primary table and all accessory tables can be found in Schema_p53motifDB.xlsx.
Table_1_DataSources.xlsx contains information about all raw and processed data sources that were used in the construction of the p53motifDB.
The Shiny app is designed to allow rapid filtering, querying, and downloading of the primary and accessory tables. Users can access a web-based version at https://p53motifDB.its.albany.edu. Users can also deploy the Shiny app locally by downloading and extracting p53motifDB_shiny.zip and doing one of the following:
Option 1: From the extracted folder, run the included Dockerfile to create a Docker image which will deploy to localhost port 3838.
Option 2: From the shiny_p53motifDB subfolder, run app.R from R or RStudio. This requires a number of dependencies, which may not be compatible with your current version of R. We highly recommend accessing the Shiny app via the web or through the Dockerfile.
Users can perform more complex database queries (beyond those available in the Shiny app) by first downloading sqlite_db.tar.gz. Unpacking this file will reveal the database file p53motifDB.db. This is a sqlite database file containing the same "primary" and "accessory" data from raw_tables.tar.gz and can be used/queried using standard structured query language. The schema of this database, including relationships between tables, can be seen in p53motifDB_VISUAL_schema.pdf, and additional information about each table and the column contents can be examined in the file Schema_p53motifDB.xlsx.
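For example, a minimal, hedged sketch of opening the unpacked database from R with DBI/RSQLite; no table names are assumed beyond what dbListTables() reports:
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "p53motifDB.db")
tables <- dbListTables(con)   # primary and accessory table names
head(dbGetQuery(con, paste("SELECT * FROM", tables[1], "LIMIT 5")))
dbDisconnect(con)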
The gzipped TAR file sqlite_db.tar.gz also contains all of the files and information necessary to reconstruct p53motifDB.db via R. Users can source the included R script (database_sqlite_commit.R) or can open, examine, and run it via RStudio. We strongly advise unpacking the TAR file, which will produce a folder called sqlite_db, and then running the included R script from within that folder, either with source() or line-by-line in RStudio. The result of this script will be p53motifDB.db and an RData object (sqlite_construction.RData) written to the sqlite_db folder.
If opening and running database_sqlite_commit.R via RStudio, please uncomment line 10 and comment out lines 13 and 14.
Please also be aware of the minimal package dependencies in R. The included version of p53motifDB.db was created using R (v. 3.4.0) and the following packages (and versions) available via CRAN:
RSQLite (v. 2.3.7), DBI (v. 1.2.3), tidyverse (2.0.0), and utils (v. 4.3.0) packages
The p53motifDB was created by Morgan Sammons, Gaby Baniulyte, and Sawyer Hicks.
Please let us know if you have any questions, comments, or would like additional datasets included in the next version of the p53motifDB by contacting masammons(at)albany.edu
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.
1. Datasets
The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.
1.1 CSV format
The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see the section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.
1.2 SQLite database format
On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.
2. Scripts
There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).
2.1 Source data
The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.
2.2 Data preparation
Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:
python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE
Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.
2.3 Isogram Extraction
After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:
python isograms.py --batch --infile=INFILE --outfile=OUTFILE
Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.
2.4 Creating a SQLite3 database
The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.
2.5 Statistical processing
The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
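As a starting point, a hedged sketch of querying the database from R with RSQLite; the table name ngrams_isograms is an assumption (check dbListTables() for the real names), while the column names come from the schema in section 1:
library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "isograms.db")
dbListTables(con)   # inspect the actual table names first
dbGetQuery(con, "SELECT length, COUNT(*) AS n_palindromes
                 FROM ngrams_isograms
                 WHERE is_palindrome = 1
                 GROUP BY length ORDER BY length")
dbDisconnect(con)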
What is this? In this case study, I use a bike-share company's data to compare the riding behaviour of members and casual riders, determine if there are any trends or patterns, and theorize about what is causing them. I am then able to develop a recommendation based on those findings.
Content: Hi. This is my first data analysis project and also my first time using R in my work. It is the capstone project for the Google Data Analytics Certificate course offered on Coursera (https://www.coursera.org/professional-certificates/google-data-analytics). It is about an operational data analysis of a fictional bike-share company in Chicago. For the detailed background story, please check the PDF file (Case 01.pdf) for reference.
In this case study, I use the bike-share company's data to compare the riding behaviour of members and casual riders, determine if there are any trends or patterns, and theorize, through descriptive analysis, about what is causing them. I am then able to develop a recommendation based on those findings.
First, I give a background introduction, my business tasks and objectives, and how I obtained the data sources for the analysis. This is followed by the R code I wrote in RStudio for data processing, cleaning, and generating graphs for the next part of the analysis. Next come my analyses of the bike data, with graphs and charts generated with ggplot2. At the end, I also provide some recommendations for the business tasks, based on the data findings.
I understand that I am new to data analysis and that the skills and code here are very beginner level, but I am working hard to learn more in both R and the data science field. If you have any ideas or feedback, please feel free to comment.
Stanley Cheng 2021-09-30
Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is retrieved from the user Mobius's page, where it was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. I would like to thank Möbius and everyone responsible for the work.
Bellabeat Case Study 1
2022-11-14
1. Introduction
Hello everyone, my name is Nur Simais and this project is part of the Google Data Analytics Professional Certificate. There have been multiple skills and skillsets learned throughout this course that can mainly be categorized under soft and hard skills. The case study I have chosen is about the company called “Bellabeat”, a maker of fitness tracker devices. The company was founded in 2013 by Urška Sršen and Sando Mur, and it gradually gained recognition and expanded into many countries (https://bellabeat.com/). Having given this brief info about the company, I'd like to say that doing the business analysis will help the company see how it can achieve its goals and what can be done to improve further.
During the analysis process, I will be using Google's “Ask-Prepare-Process-Analyze-Share-Act” framework that I learned throughout this certification, applying the relevant tools and skillsets to it.
1.ASK
1.1 Business Task The goal of this project is to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices and how to apply these insights into Bellabeat’s marketing strategy using these three questions:
What are some trends in smart device usage? How could these trends apply to Bellabeat customers? How could these trends help influence Bellabeat marketing strategy?
2.PREPARE
Prepare the data and libraries in RStudio. Collect the data required for analysis; since the data, FitBit Fitness Tracker Data (CC0: Public Domain), is publicly available on Kaggle, download the dataset from there.
There are 18 files, but after examining them in Excel I decided to use these 8 datasets: dailyActivity_merged.csv, heartrate_seconds_merged.csv, hourlyCalories_merged.csv, hourlyIntensities_merged.csv, hourlySteps_merged.csv, minuteMETsNarrow_merged.csv, sleepDay_merged.csv, weightLogInfo_merged.csv
2.1 Install and load the packages
Install the R libraries for analysis and visualizations:
install.packages("tidyverse") # core package for cleaning and analysis
install.packages("lubridate") # date library mdy()
install.packages("janitor") # clean_names() to consists only _, character, numbers, and letters.
install.packages("dplyr") #helps to check the garmmar of data manioulation
Load the libraries
library(tidyverse)
library(janitor)
library(lubridate)
library(dplyr)
Having loaded the tidyverse package, the rest of the essential packages (ggplot2, dplyr, and tidyr) are loaded as well.
2.2 Importing and Preparing the Dataset Upload the archived dataset to RStudio by clicking the Upload button in the bottom right pane.
The files will be saved in a new folder named “Fitabase Data 4.12.16-5.12.16”. Import the datasets and rename them:
daily_activity <- read.csv("dailyActivity_merged.csv")
heartrate_seconds <- read_csv("heartrate_seconds_merged.csv")
hourly_calories <- read_csv("hourlyCalories_merged.csv")
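The remaining files named earlier can be imported the same way; a hedged sketch, applying clean_names() from janitor to standardise the column names (the object names on the left are my own choices):
hourly_intensities <- read_csv("hourlyIntensities_merged.csv") %>% clean_names()
hourly_steps <- read_csv("hourlySteps_merged.csv") %>% clean_names()
minute_mets <- read_csv("minuteMETsNarrow_merged.csv") %>% clean_names()
sleep_day <- read_csv("sleepDay_merged.csv") %>% clean_names()
weight_log <- read_csv("weightLogInfo_merged.csv") %>% clean_names()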
Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
As a junior data analyst at a fanfiction analytics consultancy, I was tasked with analyzing how archive warnings are distributed across fanfiction works on Archive of Our Own (AO3). The client is interested in understanding:
Better understanding of archive warnings can:
The dataset includes ~600,000 AO3 fanfiction works, organized across three tables:
works: metadata on fanfiction works
tags: includes tag types like archive warnings and fandoms
work_tag: many-to-many mapping of works and tags, keyed on work_id and tag_id; archive warning and fandom tags are identified by type == "ArchiveWarning" and type == "Fandom"
| Warning Name | Total Works | % of All Works |
|---|---|---|
| No Archive Warnings Apply | 32,051 | 5.33% |
| Choose Not To Use Archive Warnings | 21,591 | 3.59% |
| Graphic Depictions Of Violence | 5,281 | 0.88% |
| Major Character Death | 3,009 | 0.50% |
| Rape/Non-Con | 1,650 | 0.27% |
# Filter archive warning tags
archive_warnings <- tags %>%
filter(type == "ArchiveWarning") %>%
select(warning_id = id, warning_name = name)
# Filter tag mapping for works that use archive warnings
work_warnings <- work_tag %>%
filter(tag_id %in% archive_warnings$warning_id)
# Total number of works with at least one archive warning
total_works_with_warning <- work_warnings %>%
summarise(total = n_distinct(work_id)) %>%
pull(total)
# Count per warning and join with tag names
warning_summary <- work_warnings %>%
group_by(tag_id) %>%
summarise(total_works_with_warning = n_distinct(work_id)) %>%
mutate(percent_of_all_works = (total_works_with_warning / 601286) * 100) %>%
rename(warning_id =...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Baseline data table for comparison of the training and validation groups.
Extra-organismal DNA (eoDNA) from material left behind by organisms (non-invasive DNA: e.g., faeces, hair) or from environmental samples (eDNA: e.g., water, soil) is a valuable source of genetic information. However, the relatively low quality and quantity of eoDNA, which can be further degraded by environmental factors, results in reduced amplification and sequencing success. This is often compensated for through cost- and time-intensive replications of genotyping/sequencing procedures. Therefore, system- and site-specific quantifications of environmental degradation are needed to maximize sampling efficiency (e.g., fewer replicates, shorter sampling durations), and to improve species detection and abundance estimates. Using ten environmentally diverse bat roosts as a case study, we developed a robust modelling pipeline to quantify the environmental factors degrading eoDNA, predict eoDNA quality, and estimate sampling-site-specific ideal exposure duration. Maximum humidity was the stro...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here you find an example research data dataset for the automotive demonstrator within the "AEGIS - Advanced Big Data Value Chain for Public Safety and Personal Security" big data project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732189. The time series data has been collected during trips conducted by three drivers driving the same vehicle in Austria.
The dataset contains 20Hz sampled CAN bus data from a passenger vehicle, e.g. WheelSpeed FL (speed of the front left wheel), SteerAngle (steering wheel angle), Role, Pitch, and accelerometer values per direction.
GPS data from the vehicle (see signals 'Latitude_Vehicle' and 'Longitude_Vehicle' in h5 group 'Math') and GPS data from the IMU device (see signals 'Latitude_IMU', 'Longitude_IMU' and 'Time_IMU' in h5 group 'Math') are included. However, as it had to be exported with single-precision, we lost some precision for those GPS values.
For data analysis we use R and R Studio (https://www.rstudio.com/) and the library h5.
e.g. check file with R code:
library(h5)
f <- h5file("file path/20181113_Driver1_Trip1.hdf")
summary(f["CAN/Yawrate1"][,])
summary(f["Math/Latitude_IMU"][,])
h5close(f)
Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
Cyclistic ride-share service 12-month data, used to categorize and analyze how the two user types, annual members and casual riders, use the service differently. The data set involves 12 separate CSV files merged together using RStudio, resulting in 5.8 million+ rows of data. Dashboard: https://public.tableau.com/app/profile/richgg/viz/CyclisticCapstoneDashboard/Dashboard1 (check the PowerPoint presentation for more analysis information from the data set). Notebook reference for the analysis: https://www.kaggle.com/code/therichgg/cyclistic-data-analysis-on-user-type-differences/notebook
Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
Cyclistic Bikes: A Comparison Between Casual and Annual Memberships
As part of the Google Data Analytics Certificate, I have been asked to complete a case study on the maximisation of Annual memberships vs those who choose the single and day-pass options.
The business goal of Cyclistic is clear: convert more riders to annual memberships in an attempt to boost profits. The question is whether such a goal is truly profitable in the long term.
For this task, I will take the previous 12 months of data available from a public AWS server, https://divvy-tripdata.s3.amazonaws.com/index.html, and use that to build a forecast for the following years, looking for trends and possible problems that may impede Cyclistic's ultimate goal.
Sources and Tools
Rstudio: Tidyverse - Lubridate https://divvy-tripdata.s3.amazonaws.com/index.html
Business Goal
Under the direction of Lily Moreno and, by extension Cyclistic, the aim of this case study will be to analyse the differences in usage between Casual and Annual members.
For clarity, Casual members will be those who use the Day and Single Use options when using Cyclistic, whilst Annual refers to those who purchase a 12 month subscription to the service.
The ultimate goal is to see if there is a clear business reason to push forward with a marketing campaign to convert Casual users into Annual memberships.
Tasks and Data Storage
The data I will be using was previously stored on an AWS server at https://divvy-tripdata.s3.amazonaws.com/index.html. This location is publicly accessible but the data within can only be downloaded and edited locally.
For the purposes of this task, I have downloaded the data for the year 2022, 12 separate files that I then collated into a single zip file to upload to Rstudio for the purposes of cleaning, arranging and studying the information. The original files will be located on my PC and at the AWS link. As part of the process, a backup file will be created within Rstudio to ensure that the original data is always available.
Process
After uploading the data to RStudio and applying a naming convention (Month), the next step was to compare and equate the names of the columns. As the information came from 2022, two years after Cyclistic updated their naming conventions, this step was more of a formality to ensure that the files could later be joined into one. No irregularities were found at this stage.
As all column names matched, there was no need to rename them. Furthermore, all ride_id fields were already in character format.
Once this check was complete, all tables were compiled into one, named all_trips.
Cleaning
The first issue found was the set of labels used to identify the different member types. The files used four labels: "member" and "subscriber" for annual members, and "Customer" and "casual" for casual users. These four labels were consolidated into two, member and casual.
As the original files only captured data at the ride level, more fields were added in the form of day, week, month and year to enable more opportunities to aggregate the data.
ride_length was added for consistency and to provide a clearer output. After adding this column, it was converted from factor to numeric so that the final output could be measured (a short sketch of these steps follows).
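A short, hedged sketch of those two steps, assuming the member-type column is named member_casual as in the code further down:
all_trips <- all_trips %>%
  mutate(member_casual = recode(member_casual,
                                "Subscriber" = "member",
                                "Customer" = "casual"))   # consolidate the four labels into two
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))  # factor to numeric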
Analysis
Here, I will provide the final code used to describe the final process.
mean(all_trips_v2$ride_length)   # straight average (total ride length / rides)
median(all_trips_v2$ride_length) # midpoint number in the ascending array of ride lengths
max(all_trips_v2$ride_length)    # longest ride
min(all_trips_v2$ride_length)    # shortest ride
summary(all_trips_v2$ride_length)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
all_trips_v2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%  # creates weekday field using wday()
  group_by(member_casual, weekday) %>%                  # groups by usertype and weekday
  summarise(number_of_rides = n() ...
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
The spatial distribution of individuals within ecological assemblages, and their associated traits and behaviors, are key determinants of ecosystem structure and function. Consequently, determining the spatial distribution of species, and how distributions influence patterns of species richness across ecosystems today and in the past, helps us understand what factors act as fundamental controls on biodiversity. Here, we explore how ecological niche modeling has contributed to understanding the spatiotemporal distribution of past biodiversity, and past ecological and evolutionary processes. We first perform a semi-quantitative literature review to capture studies that applied ecological niche models (ENMs) in the past, identifying 668 studies. We coded each study according to focal taxonomic groups and whether and how the study used fossil evidence, whether it relied on evidence or methods in addition to ENMs, and spatial scale and temporal intervals. We used trends in publication patterns across categories to anchor discussion of recent technical advances in niche modeling, focusing on paleobiogeographic ENM applications. We then explored the contributions of ENMs to paleobiogeography, with a particular focus on examining patterns and associated drivers of range dynamics; phylogeography and within-lineage dynamics; macroevolutionary patterns and processes, including niche change, speciation, and extinction; drivers of community assembly; and conservation paleobiogeography. Overall, ENMs are powerful tools for elucidating paleobiogeographic patterns. ENMs are most commonly used to understand Quaternary dynamics, but an increasing number of studies use ENMs to gain important insight into both ecological and evolutionary processes in pre-Quaternary times. Deeper integration with traits and phylogenies may further extend those insights.
Methods
We conducted an initial search on 15 September 2023 for peer-reviewed articles, written in English, that applied ENMs to past time intervals, using both the Scopus and Web of Science databases with nearly identical search conditions (see Appendix 1 for full search terms). Our search and screening followed the PRISMA protocol for scoping reviews (Tricco et al. 2018). Article metadata was downloaded from each database (Scopus n = 16155, Web of Science n = 15600), and the two datasets were merged and duplicates removed (n = 22656). We screened article titles and abstracts to determine if they (a) projected an ENM to a point in time before 1800 A.D., and/or (b) included fossil occurrences in their ENM. We identified 668 studies that met our criteria, and randomly assigned these to the five authors to gather data on the ENM approaches therein. Data extracted from each article included taxonomic information (taxonomic description and resolution, and the number of taxonomic units analyzed), time periods for which data were modeled and projected, whether the fossil record was used for either model calibration or validation, whether additional data (e.g., molecular, isotopic, morphological, etc.) were used, and the geographic extent of the analysis. All data manipulation and analyses were performed in R (version 4.3.0; R Core Team 2014) using an RStudio interface (version 2023.06.1 Build 524 “Mountain Hydrangea”; Rstudio Team 2020). Data manipulations were carried out with dplyr (version 1.1.2; Wickham et al. 2023b), tidyr (version 1.3.0; Wickham et al. 2023a), and stringr (version 1.5.0; Wickham 2023).
Title and abstract screening was done through revtools (version 0.4.1; Westgate 2019). Referenced Literature:
R Core Team. 2014: R: A language and environment for statistical computing.
Rstudio Team. 2020: RStudio: integrated development for R.
Tricco, A. C., E. Lillie, W. Zarin, K. K. O’Brien, H. Colquhoun, D. Levac, D. Moher, M. D. Peters, T. Horsley, and L. Weeks. 2018: PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Annals of internal medicine 169:467–473.
Westgate, M. J. 2019: revtools: An R package to support article screening for evidence synthesis. Research synthesis methods 10:606–614.
Wickham, H. 2023: stringr: Simple, Consistent Wrappers for Common String Operations.
Wickham, H., D. Vaughan, and M. Girlich. 2023a: tidyr: Tidy Messy Data.
Wickham, H., R. François, L. Henry, K. Müller, and D. Vaughan. 2023b: dplyr: A Grammar of Data Manipulation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study applies a geographical-physical and statistical methodology to predict vineyard distribution in the Taurasi DOCG terroir, southern Italy. Integrating morpho-topographical, climatic, and pedological data through GIS-based logistic regression, it aims to refine vineyard site selection—traditionally guided by local expertise—via scientifically validated predictive tools. The Taurasi territory, marked by pronounced lithological and topographic heterogeneity and a viticulture-favorable climate, serves as an ideal case study. The model was developed using environmental variables, optimized through stepwise selection and Variance Inflation Factor (VIF) analysis, and validated using the Receiver Operating Characteristic (ROC) curve. The resulting suitability map identifies areas most conducive to viticulture, emphasizing the importance of altitude, slope, aspect, and temperature in shaping vineyard potential. Despite sensitivity to environmental data quality, the approach demonstrates the value of integrating geospatial and statistical methods for informed spatial planning. The study reinforces the role of data-driven strategies in optimizing and sustainably managing viticultural landscapes.
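A minimal, hedged sketch of the statistical core described above (logistic regression with stepwise selection, VIF screening and ROC validation); the data frame terroir_data and its variable names are assumptions for illustration, not the study's actual code:
library(car)    # vif()
library(pROC)   # roc(), auc()
full_mod <- glm(vineyard ~ altitude + slope + aspect + temperature,
                family = binomial, data = terroir_data)
step_mod <- step(full_mod, direction = "both")   # stepwise variable selection
vif(step_mod)                                    # check multicollinearity among retained variables
roc_obj <- roc(terroir_data$vineyard, predict(step_mod, type = "response"))
auc(roc_obj)                                     # ROC-based validation of the suitability model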
Public domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This remarkable dataset provides an awe-inspiring collection of over 50,000 books, encompassing the world's best practices in literature, poetry, and authorship. For each book included in the dataset, users can gain access to a wealth of insightful information such as title, author(s), average rating given by readers and critics alike, a brief description highlighting its plot or characteristics; language it is written in; unique ISBN which enables potential buyers to locate their favorite works with ease; genres it belongs to; any awards it has won or characters that inhabit its storyworld.
Additionally, seeking out readers' opinions on exceptional books is made easier by the availability of bbeScore (the "best books ever" score) alongside detailed rating breakdowns in the "ratingsByStars" column. To make sure visibility and recognition are granted fairly, be it a classic novel from time immemorial or a recently released newcomer, this source also allows us to evaluate new stories based on reader engagement, highlighted by the likedPercent column (the percentage of readers who liked the book), bbeVotes (the number of votes cast), and entries related to publication dates, including firstPublishDate.
Aspiring literature researchers, literary historians and those seeking hidden literary gems alike would no doubt benefit from delving into this collection: 25 variables covering different novels and poets, presented in the Kaggle open-source dataset "Best Books Ever: A Comprehensive Historical Collection of Literary Greats". What worlds await you?
Whether you are a student, researcher, or enthusiast of literature, this dataset provides a valuable source for exploring literary works from varied time periods and genres. By accessing all 25 variables in the dataset, readers have the opportunity to use them for building visualizations, creating new analysis tools and models, or finding books you might be interested in reading.
First, after downloading the dataset into the Kaggle Notebooks platform or another programming interface of your choice, such as RStudio or Python Jupyter Notebooks (pandas), make sure that the data is arranged into columns with clearly labeled names. This will help you understand which variable relates to which piece of information. Afterwards, explore each variable by looking for patterns across particular titles or interesting findings about certain authors or ratings relevant to your research interests.
Use the core columns Title (title), Author (author), Rating (rating), Description (description), Language (language), Genres (genres) and Characters (characters); these can help you discover trends between books according to style of composition, character types, etc. Then examine the more specific details offered by Book Format (bookFormat), Edition (edition) and Pages (pages), and look at publisher information along with Publish Date (publishDate). Also take note of the Awards column for recent recognition different titles have received, observe how many ratings have been collected per text through the Number of Ratings column (numRatings), analyze readers' feedback through Ratings By Stars (ratingsByStars), and view the percentage of readers who liked a particular book (likedPercent).
Beyond these more accessible factors, delve deeper into the other data provided: Setting (setting), Cover Image (coverImg), BBE Score (bbeScore) and BBE Votes (bbeVotes). All of these should provide greater insight when trying to explain why a certain book has made its way onto the Goodreads top-selections list. To estimate value, test out the Price (price) column too, determining whether some texts retain large popularity despite rather costly publishing options currently available on the market.
Finally, combine the different aspects observed while researching individual titles to create personalized recommendations based on the comprehensive lists provided. To achieve that, use the ISBN code provided, compare publication versus first-publication dates, and check the awards information to give context on how the books discussed here have progressed over the years.
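As a quick start, a hedged sketch in R of loading the file and looking at a few of the columns mentioned above; the file name books.csv is an assumption:
library(tidyverse)
books <- read_csv("books.csv")
books %>%
  select(title, author, rating, numRatings, likedPercent, genres) %>%
  arrange(desc(rating)) %>%
  head(10)   # ten highest-rated titles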
- Creating a web or mobile...