100+ datasets found

w
Dataset of books called An introduction to data analysis in R : hands-on...
workwithdata.com
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=An+introduction+to+data+analysis+in+R+%3A+hands-on+coding%2C+data+mining%2C+visualization+and+statistics+from+scratch
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch. It features 7 columns including author, publication date, language, and book publisher.
Google Data Analytics Capstone Project
kaggle.com
Updated Oct 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Rookie (2022). Google Data Analytics Capstone Project [Dataset]. https://www.kaggle.com/datasets/rookieaj1234/google-data-analytics-capstone-project
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 1, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Data Rookie
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Project Name: Divvy Bikeshare Trip Data_Year2020 Date Range: April 2020 to December 2020. Analyst: Ajith Software: R Program, Microsoft Excel IDE: RStudio

The following are the basic system requirements, necessary for the project: Processor: Intel i3 or AMD Ryzen 3 and higher Internal RAM: 8 GB or higher Operating System: Windows 7 or above, MacOS

**Data Usage License: https://ride.divvybikes.com/data-license-agreement ** Introduction:

In this case, study we aim to utilize different data analysis techniques and tools, to understand the rental patterns of the divvy bike sharing company and understand the key business improvement suggestions. This case study is a mandatory project to be submitted to achieve the Google Data Analytics Certification. The data utilized in this case study was licensed based on the provided data usage license. The trips between April 2020 to December 2020 are used to analyse the data.

Scenario: Marketing team needs to design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ.

Objective: The main objective of this case study, is to understand the customer usage patterns and the breakdown of customers, based on their subscription status and the average durations of the rental bike usage.

Introduction to Data: The Data provided for this project, is adhered to the data usage license, laid down by the source company. The source data was provided in the CSV files and are month and quarter breakdowns. A total of 13 columns of data was provided in each csv file.

The following are the columns, which were initially observed across the datasets.

Ride_id Ride_type Start_station_name Start_station_id End_station_name End_station_id Usertype Start_time End_time Start_lat Start_lng End_lat End_lng

Documentation, Cleaning and Preparing Data for Analysis: The total size of the datasets, for the year 2020, is approximately 450 MB, which is tiring job, when you have to upload them to the SQL database and visualize using the BI tools. I wanted to improve my skills into R environment and this is the best opportunity and optimal to use R for the data analysis.

For more insights, installation procedures for R and RStudio, please refer to the following URL, for additional information.

R Projects Document: https://www.r-project.org/other-docs.html RStudio Download: https://www.rstudio.com/products/rstudio/ Installation Guide: https://www.youtube.com/watch?v=TFGYlKvQEQ4
t
Trusted Research Environments: Analysis of Characteristics and Data...
researchdata.tuwien.ac.at
bin, csv
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117
Explore at:
bin, csvAvailable download formats
Unique identifier
https://doi.org/10.48436/cv20m-sg117
Dataset updated
Jun 25, 2024
Dataset provided by
TU Wien
Authors
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data with technical organizational and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is available publicly on the architecture and descriptions of their building blocks & their slight technical variations. To shine light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system description. We further analyze their technical characteristics, especially in their commonalities & variations and provide insight on their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available a majority of available sensitive data records included in this study.
Methodology
We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google and grey literature focusing on retrieving the following source material:
Peer-reviewed articles where available,
TRE websites,
TRE metadata catalogs.
The goal for this literature study is to discover existing TREs, analyze their characteristics and data availability to give an overview on available infrastructure for sensitive data research as many European initiatives have been emerging in recent months.
Technical details
This dataset consists of five comma-separated values (.csv) files describing our inventory:
countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
tres.csv: Table of TREs with columns id (number), name (text), countryid (number, refering to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonomized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records]), optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
inclusion.csv: Table of included TREs into the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).
Additionally, a MariaDB (10.5 or higher) schema definition .sql file is needed, properly modelling the schema for databases:
schema.sql: Schema definition file to create the tables and views used in the analysis.
The analysis was done through Jupyter Notebook which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb
w
Dataset of books called Data analysis in business research : a step-by-step...
workwithdata.com
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called Data analysis in business research : a step-by-step nonparametric approach [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Data+analysis+in+business+research+%3A+a+step-by-step+nonparametric+approach
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is Data analysis in business research : a step-by-step nonparametric approach. It features 7 columns including author, publication date, language, and book publisher.
w
Dataset of books called Statistical and computational methods in data...
workwithdata.com
Updated Apr 17, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called Statistical and computational methods in data analysis [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Statistical+and+computational+methods+in+data+analysis
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 2 rows and is filtered where the book is Statistical and computational methods in data analysis. It features 7 columns including author, publication date, language, and book publisher.

Automobile Sales data

kaggle.com

zip

Updated Nov 18, 2023

Facebook

Twitter

Click to copy link

Link copied

Cite

dee dee (2023). Automobile Sales data [Dataset]. https://www.kaggle.com/datasets/ddosad/auto-sales-data/data

Explore at:

zip(81125 bytes)Available download formats

Dataset updated

Nov 18, 2023

Authors

dee dee

Description

The dataset contains Sales data of an Automobile company.

Do explore pinned 📌 notebook under code section for quick EDA📊 reference

Consider an upvote ^ if you find the dataset useful

Data Description

Column Name	Description
ORDERNUMBER	This column represents the unique identification number assigned to each order.
QUANTITYORDERED	It indicates the number of items ordered in each order.
PRICEEACH	This column specifies the price of each item in the order.
ORDERLINENUMBER	It represents the line number of each item within an order.
SALES	This column denotes the total sales amount for each order, which is calculated by multiplying the quantity ordered by the price of each item.
ORDERDATE	It denotes the date on which the order was placed.
DAYS_SINCE_LASTORDER	This column represents the number of days that have passed since the last order for each customer. It can be used to analyze customer purchasing patterns.
STATUS	It indicates the status of the order, such as "Shipped," "In Process," "Cancelled," "Disputed," "On Hold," or "Resolved."
PRODUCTLINE	This column specifies the product line categories to which each item belongs.
MSRP	It stands for Manufacturer's Suggested Retail Price and represents the suggested selling price for each item.
PRODUCTCODE	This column represents the unique code assigned to each product.
CUSTOMERNAME	It denotes the name of the customer who placed the order.
PHONE	This column contains the contact phone number for the customer.
ADDRESSLINE1	It represents the first line of the customer's address.
CITY	This column specifies the city where the customer is located.
POSTALCODE	It denotes the postal code or ZIP code associated with the customer's address.
COUNTRY	This column indicates the country where the customer is located.
CONTACTLASTNAME	It represents the last name of the contact person associated with the customer.
CONTACTFIRSTNAME	This column denotes the first name of the contact person associated with the customer.
DEALSIZE	It indicates the size of the deal or order, which are the categories "Small," "Medium," or "Large."

w
Dataset of books called Modeling and data analysis : an introduction with...
workwithdata.com
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called Modeling and data analysis : an introduction with environmental applications [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Modeling+and+data+analysis+%3A+an+introduction+with+environmental+applications
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is Modeling and data analysis : an introduction with environmental applications. It features 7 columns including author, publication date, language, and book publisher.
DatasetofDatasets (DoD)
kaggle.com
zip
Updated Aug 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Konstantinos Malliaridis (2024). DatasetofDatasets (DoD) [Dataset]. https://www.kaggle.com/terminalgr/datasetofdatasets-124-1242024
Explore at:
zip(7583 bytes)Available download formats
Dataset updated
Aug 12, 2024
Authors
Konstantinos Malliaridis
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset is essentially the metadata from 164 datasets. Each of its lines concerns a dataset from which 22 features have been extracted, which are used to classify each dataset into one of the categories 0-Unmanaged, 2-INV, 3-SI, 4-NOA (DatasetType).

This Dataset consists of 164 Rows. Each row is the metadata of an other dataset. The target column is datasetType which has 4 values indicating the dataset type. These are:

2 - Invoice detail (INV): This dataset type is a special report (usually called Detailed Sales Statement) produced by a Company Accounting or an Enterprise Resource Planning software (ERP). Using a INV-type dataset directly for ARM is extremely convenient for users as it relieves them from the tedious work of transforming data into another more suitable form. INV-type data input typically includes a header but, only two of its attributes are essential for data mining. The first attribute serves as the grouping identifier creating a unique transaction (e.g., Invoice ID, Order Number), while the second attribute contains the items utilized for data mining (e.g., Product Code, Product Name, Product ID).

3 - Sparse Item (SI): This type is widespread in Association Rules Mining (ARM). It involves a header and a fixed number of columns. Each item corresponds to a column. Each row represents a transaction. The typical cell stores a value, usually one character in length, that depicts the presence or absence of the item in the corresponding transaction. The absence character must be identified or declared before the Association Rules Mining process takes place.

4 - Nominal Attributes (NOA): This type is commonly used in Machine Learning and Data Mining tasks. It involves a fixed number of columns. Each column registers nominal/categorical values. The presence of a header row is optional. However, in cases where no header is provided, there is a risk of extracting incorrect rules if similar values exist in different attributes of the dataset. The potential values for each attribute can vary.

0 - Unmanaged for ARM: On the other hand, not all datasets are suitable for extracting useful association rules or frequent item sets. For instance, datasets characterized predominantly by numerical features with arbitrary values, or datasets that involve fragmented or mixed types of data types. For such types of datasets, ARM processing becomes possible only by introducing a data discretization stage which in turn introduces information loss. Such types of datasets are not considered in the present treatise and they are termed (0) Unmanaged in the sequel.

The dataset type is crucial to determine for ARM, and the current dataset is used to classify the dataset's type using a Supervised Machine Learning Model.

There is and another dataset type named 1 - Market Basket List (MBL) where each dataset row is a transaction. A transaction involves a variable number of items. However, due to this characteristic, these datasets can be easily categorized using procedural programming and DoD does not include instances of them. For more details about Dataset Types please refer to article "WebApriori: a web application for association rules mining". https://link.springer.com/chapter/10.1007/978-3-030-49663-0_44
LTMP analysis 11-year versus 25-year with missing data
figshare.com
txt
Updated Nov 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alfonso Ruiz Moreno (2025). LTMP analysis 11-year versus 25-year with missing data [Dataset]. http://doi.org/10.6084/m9.figshare.28785908.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28785908.v1
Dataset updated
Nov 6, 2025
Dataset provided by
Figsharehttp://figshare.com/
Authors
Alfonso Ruiz Moreno
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains all the scripts and data used in the analysis of the LTMP data presented in the manuscript “Longer time series with missing data improve parameter estimation in State-Space mode in coral reef fish communities”. There are 22 files in total.All model fits were run on the HPC cluster at James Cook University. The model fit to the 11-year time series took approximately 3-5 days and the model fit to the 25-year time series took approximately 10-12 days. We did not include the model fits as they are big files (~12-30GB) but these can be obtained by running the corresponding scripts.LTMP data and data wranglingLTMP_data_1995_2005_prop_zero_40sp.RData: File containing 45 columns. The first column is Year and it contains the year for each observation in the dataset. The second column Reef contains the reef name, while the latitude and longitude are collected in the third column called Reef_lat and fourth column called Reef_long, respectively. The fifth column is called Shelf and contains the reef shelf position as I for Inner shelf positioning, M for Middle Shelf positioning and O for outer Shelf positioning. The rest of the columns contain the counts of the 40 species with the lowest proportion of zeros in the LTMP data. This contains data from 1995 to 2005.LTMP_data_1995_2019_prop_zero_40sp.RData: Same data structure as above but for the time series from 1995 to 2019 (includes Nas in some of the abundance counts).dw_11y_Pomacentrids.R and dw_25yNA_Pomacentrids.R scripts order species in pomacentrids and non-pomacentrids so the models can be fitted to the data. These files produce the data files LTMP_data_1995_2005_prop_zero_40sp_Pomacentrids.RData and LTMP_data_1995_2019_prop_zero_40sp_PomacentridsNA.RData.Model fittingLTMP_fit_40sp.R is a script that fits the model to the 11-year time series data. Specifically, the input dataset is LTMP_data_1995_2005_prop_zero_40sp_Pomacentrids.RData and the output fit is called LTMP_fit_40sp.RData.LTMP_fit_40sp_NA.R is a script that fits the model to the 25-year time series with missing data. Specifically, the input dataset is LTMP_data_1995_2019_prop_zero_40sp_PomacentridsNA.RData and the output fit is called LTMP_fit_40sp_NA.RData.Stan modelMARPLN_LV_Pomacentrids.stan: Stan code for the multivariate autoregressive Poisson-Lognormal model with the latent variables.MARPLN_LV_Pomacentrids_NA.stan: Stan code for same model as above, but this can deal with missing data.FiguresFigure 1 A and B.R and Figure 4.R produce the corresponding figures in the main text.Note that Figure 1A and B.R requires several files to produce the GBR and Australia maps. These are:Great_Barrier_Reef_Features.cpgGreat_Barrier_Reef_Features.dbfGreat_Barrier_Reef_Features.lyrGreat_Barrier_Reef_Features.shp.xmlReef_lat_long.csvGreat_Barrier_Reef_Features.prjGreat_Barrier_Reef_Features.sbnGreat_Barrier_Reef_Features.sbxGreat_Barrier_Reef_Features.shpGreat_Barrier_Reef_Features.shx
f
UC_vs_US Statistic Analysis.xlsx
figshare.com
xlsx
Updated Jul 9, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
F. (Fabiano) Dalpiaz (2020). UC_vs_US Statistic Analysis.xlsx [Dataset]. http://doi.org/10.23644/uu.12631628.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.23644/uu.12631628.v1
Dataset updated
Jul 9, 2020
Dataset provided by
Utrecht University
Authors
F. (Fabiano) Dalpiaz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the used measures described in the paper. For each subject, it includes multiple columns: A. a sequential student ID B an ID that defines a random group label and the notation C. the used notation: user Story or use Cases D. the case they were assigned to: IFA, Sim, or Hos E. the subject's exam grade (total points out of 100). Empty cells mean that the subject did not take the first exam F. a categorical representation of the grade L/M/H, where H is greater or equal to 80, M is between 65 included and 80 excluded, L otherwise G. the total number of classes in the student's conceptual model H. the total number of relationships in the student's conceptual model I. the total number of classes in the expert's conceptual model J. the total number of relationships in the expert's conceptual model K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, missing (see tagging scheme below) P. the researchers' judgement on how well the derivation process explanation was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping ), or not present.

Tagging scheme: Aligned (AL) - A concept is represented as a class in both models, either

with the same name or using synonyms or clearly linkable names; Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than class, or (ii) using a generic term (e.g., user'' instead ofurban planner''); System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent legacy system or the system under design (portal, simulator) are legitimate; Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud; Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.

All the calculations and information provided in the following sheets

originate from that raw data.

Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection,

including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.

Sheet 3 (Size-Ratio):

The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade) . The primary focus in this study is on the number of classes. However, we also provided the size ratio for the number of relationships between student and expert model.

Sheet 4 (Overall):

Provides an overview of all subjects regarding the encountered situations, completeness, and correctness, respectively. Correctness is defined as the ratio of classes in a student model that is fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. Completeness is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR) and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.

For sheet 4 as well as for the following four sheets, diverging stacked bar

charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated witch solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (T-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:

Sheet 5 (By-Notation):

Model correctness and model completeness is compared by notation - UC, US.

Sheet 6 (By-Case):

Model correctness and model completeness is compared by case - SIM, HOS, IFA.

Sheet 7 (By-Process):

Model correctness and model completeness is compared by how well the derivation process is explained - well explained, partially explained, not present.

Sheet 8 (By-Grade):

Model correctness and model completeness is compared by the exam grades, converted to categorical values High, Low , and Medium.
Datasets for Sentiment Analysis
zenodo.org
csv
Updated Dec 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10157504
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
HR Analytics Dataset
kaggle.com
Updated Jan 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shodolamu Opeyemi (2025). HR Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/hopesb/hr-analytics-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 18, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Shodolamu Opeyemi
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The uploaded dataset contains detailed information about employees, training programs, and other HR-related metrics. Here's an overview:

General Details:

Rows: 3,150

Columns: 39

Column Names:

Unnamed: 0

FirstName

LastName

StartDate

ExitDate

Title

Supervisor

ADEmail

BusinessUnit

EmployeeStatus

EmployeeType

PayZone

EmployeeClassificationType

TerminationType

TerminationDescription

DepartmentType

Division

DOB

State

JobFunctionDescription

GenderCode

LocationCode

RaceDesc

MaritalDesc

Performance Score

Current Employee Rating

Employee ID

Survey Date

Engagement Score

Satisfaction Score

Work-Life Balance Score

Training Date

Training Program Name

Training Type

Training Outcome

Location

Trainer

Training Duration (Days)

Training Cost

Summary:

Employee Data: Contains details such as names, start and exit dates, job titles, and supervisors.

Performance and Survey Metrics: Includes engagement, satisfaction, and work-life balance scores.

Training Information: Covers program names, training types, outcomes, durations, costs, and trainer details.

Diversity Details: Includes gender, race, and marital status.

Status & Classification: Indicates employee status (active/terminated), type, and termination reasons.
w
Dataset of books called Core concepts in data analysis : summarization,...
workwithdata.com
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called Core concepts in data analysis : summarization, correlation and visualization [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Core+concepts+in+data+analysis+%3A+summarization%2C+correlation+and+visualization
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is Core concepts in data analysis : summarization, correlation and visualization. It features 7 columns including author, publication date, language, and book publisher.
d
Data from: Data and code from: The Impacts of Parental Choice and...
catalog.data.gov
agdatacommons.nal.usda.gov
+1more
Updated Apr 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Data and code from: The Impacts of Parental Choice and Intrapopulation Selection for Seed Size on the Uprightness of Progeny Derived from Interspecific Hybridization between Glycine max and Glycine soja [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-the-impacts-of-parental-choice-and-intrapopulation-selection-for-seed-s-e773b
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
This dataset contains all data and code necessary to reproduce the analysis described under the heading "Experiment 3" in the manuscript: Taliercio, E., Eickholt, D., Read, Q. D., Carter, T., Waldeck, N., & Fallen, B. (2023). Parental choice and seed size impact the uprightness of progeny from interspecific Glycine hybridizations. Crop Science. https://doi.org/10.1002/csc2.21015 The attached files are: G_max_G_soja_seedweight_seedcolor_analysis.Rmd: RMarkdown notebook containing all analysis code. The CSV data files should be placed in a subdirectory called data within the working directory from which the notebook is rendered. G_max_G_soja_seedweight_seedcolor_analysis.html: Rendered HTML output from RMarkdown notebook, including figures, tables, and explanatory text. counts_seedwt.csv: CSV file containing the number of progeny selected and average 100-seed weight data for each combination of cross, size class, and replicate. Columns are: F3_location: text identifier of F3 nursery location, either "CLA" or "FF" plot: numeric ID of plot pop: numeric ID of population max: name of G. max parent soja: name of G. soja parent F2_location: text identifier of F2 nursery location, either "Caswell" or "Hugo" n_planted: number of seeds planted (raw) n_selected: number of progeny selected size_ordered: seed size class, to be converted to an ordered factor size_combined: seed size class aggregated to fewer unique levels ave_100sw: average 100-seed weight for the given size class n_planted_trials: number of seeds planted rounded to nearest integer seedcolor.csv: CSV file with additional data on number of seeds of each color by population. Columns are: cross: text identifier of cross line: text identifier of line light: number of light seeds mid: number of mid-green seeds brown: number of brown seeds dark: number of dark or black seeds population: identifier of population type (F2 derived or selected) max: name of G. max parent n_total: sum of the light, mid, brown, and dark columns soja: name of G. soja parent The data processing and analysis pipeline in the RMarkdown notebook includes: Importing the data (slightly cleaned version is provided) Creating boxplots of proportion selected by cross, nursery location, and size class Fitting logistic GLMM to estimate the probability of selection as a function of parent, 100-seed weight, and their interactions Extracting and plotting random effect estimates from model Calculating and plotting estimated marginal means from model Taking contrasts between pairs of estimated marginal means and trends Calculating Bayes Factors associated with the contrasts Generating figures and tables for all above results Additional seed color analysis: importing data (slightly cleaned version is provided) Additional seed color analysis: drawing exploratory bar plot Additional seed color analysis: fitting multinomial GLM modeling the proportion of seeds with each color as a function of population Additional seed color analysis: generating expected value predictions from GLM and taking contrasts Additional seed color analysis: creating figures and tables for model results This research was funded by CRIS 6070-21220-069-00D, United Soybean Board Project # 2333-203-0101, and falls under National Program NP301. Resources in this dataset:Resource Title: RMarkdown document with all analysis code. File Name: G_max_G_soja_seedweight_seedcolor_analysis.RmdResource Title: Rendered HTML version of notebook. File Name: G_max_G_soja_seedweight_seedcolor_analysis.htmlResource Title: Progeny counts and seed weight data. File Name: counts_seedwt.csvResource Title: Seed color counts data. File Name: seedcolor.csv
Z
Experimental data and software for: Defaults: a double-edged sword in...
data-staging.niaid.nih.gov
data.niaid.nih.gov
+1more
Updated Aug 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Montero-Porras, Eladio (2024). Experimental data and software for: Defaults: a double-edged sword in governing common resources [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10228657
Explore at:
Dataset updated
Aug 3, 2024
Dataset provided by
Vrije Universiteit Brussel
Authors
Montero-Porras, Eladio
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Experimental data and software for the paper: Defaults: a double-edged sword in governing common resources

The experiment consisted in three treatments of the Common Pool Resource Dilemma, where three default interventions were applied: pro-social, self-serving and no default. Plus, the participants had to complete an SVO task and a Risk assessment task.

Description of the data and file structure

In the file called all_participants.csv is the full dataset of all participants that took part of the experiment. This includes participants who will end up excluded and dropouts.

The experimental data files come in two formats: wide and long. The wide version, called data_wide_format.csv contains one row per participant and a column for all the fields, including rounds from 1 to 10 of the CPR task. Also, this file includes all demographic information of the participants, times and payments. The ID shown is generated internally and has no relationship with the participants' Prolific ID.

The long version, called data_long_format.csv, contains 10 rows per participant, and columns for the extraction and other variables necessary for analysis. This version contains the necessary data to reproduce all the figures and statistics detailed in the main manuscript.

In both of the previous files, the participants taken into account were the ones who completed the whole experiment. Those who did not complete the comprehension test, dropped out or did not sign the Informed Consent Form were excluded from the experimental data used. More details in the Methods below.

In the file default_opinions.csv, we manually classified the responses by participants to whether they were influenced by the default presented.

The file "Instructions of the experiment.pdf" contains the instructions of the experiment as shown to participants, also screenshots of the platform.
c
Parameter estimates of mixed generalized Gaussian distribution for modelling...
research-data.cardiff.ac.uk
datasetcatalog.nlm.nih.gov
zip
Updated Sep 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zoe Salinger; Alla Sikorskii; Michael J. Boivin; Nenad Šuvak; Maria Veretennikova; Nikolai N. Leonenko (2024). Parameter estimates of mixed generalized Gaussian distribution for modelling the increments of electroencephalogram data [Dataset]. http://doi.org/10.17035/d.2023.0277307170
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.17035/d.2023.0277307170
Dataset updated
Sep 18, 2024
Dataset provided by
Cardiff University
Authors
Zoe Salinger; Alla Sikorskii; Michael J. Boivin; Nenad Šuvak; Maria Veretennikova; Nikolai N. Leonenko
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Electroencephalogram (EEG) is used to monitor child's brain during coma by recording data on electrical neural activity of the brain. Signals are captured by multiple electrodes called channels located over the scalp. Statistical analyses of EEG data includes classification and prediction using arrays of EEG features, but few models for the underlying stochastic processes have been proposed. For this purpose, a new strictly stationary strong mixing diffusion model with marginal multimodal (three-peak) distribution (MixGGDiff) and exponentially decaying autocorrelation function for modeling of increments of EEG data was proposed. The increments were treated as discrete-time observations and a diffusion process where the stationary distribution is viewed as a mixture of three non-central generalized Gaussian distributions (MixGGD) was constructed.Probability density function of a mixed generalized Gaussian distribution (MixGGD) consists of three components and is described using a total of 12 parameters:\muk, location parameter of each of the components,sk, shape parameter of each of the components, \sigma2k, parameter related to the scale of each of the components andwk, weight of each of the components, where k, k={1,2,3} refers to theindex of the component of a MixGGD. The parameters of this distribution were estimated using the expectation-maximization algorithm, where the added shape parameter is estimated using the higher order statistics approach based on an analytical relationship between the shape parameter and kurtosis.To illustrate an application of the MixGGDiff to real data, analysis of EEG data collected in Uganda between 2008 and 2015 from 78 children within age-range of 18 months to 12 years who were in coma due to cerebral malaria was performed. EEG were recorded using the International 10–20 system with the sampling rate of 500 Hz and the average record duration of 30 min. EEG signal for every child was the result of a recording from 19 channels. MixGGD was fitted to each channel of every child's recording separately, hence for each channel a total of 12 parameter estimates were obtained. The data is presented in a matrix form (dimension 79*228) in a .csv format and consists of 79 rows where the first row is a header row which contains the names of the variables and the subsequent 78 rows represent parameter estimates of one instance (i.e. one child, without identifiers that could be related back to a specific child). There are a total of 228 columns (19 channels times 12 parameter estimates) where each column represents one parameter estimate of one component of MixGGD in the order of the channels, thus columns 1 to 12 refer to parameter estimates on the first channel, columns 13 to 24 refer to parameter estimates on the second channel and so on. Each variable name starts with "chi" where "ch" is an abbreviation of "channel" and i refers to the order of the channel from EEG recording. The rest of the characters in variable names refer to the parameter estimate names of the components of a MixGGD, thus for example "ch3sigmasq1" refers to the parameter estimate of \sigma2 of the first component of MixGGD obtained from EEG increments on the third channel. Parameter estimates contained in the .csv file are all real numbers within a range of -671.11 and 259326.96.Research results based upon these data are published at https://doi.org/10.1007/s00477-023-02524-y

Cafe Sales - Dirty Data for Cleaning Training

kaggle.com

zip

Updated Jan 17, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training

Explore at:

zip(113510 bytes)Available download formats

Dataset updated

Jan 17, 2025

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dirty Cafe Sales Dataset

Overview

The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

File Information

File Name: dirty_cafe_sales.csv
Number of Rows: 10,000
Number of Columns: 8

Columns Description

Column Name	Description	Example Values
`Transaction ID`	A unique identifier for each transaction. Always present and unique.	`TXN_1234567`
`Item`	The name of the item purchased. May contain missing or invalid values (e.g., "ERROR").	`Coffee`, `Sandwich`
`Quantity`	The quantity of the item purchased. May contain missing or invalid values.	`1`, `3`, `UNKNOWN`
`Price Per Unit`	The price of a single unit of the item. May contain missing or invalid values.	`2.00`, `4.00`
`Total Spent`	The total amount spent on the transaction. Calculated as `Quantity * Price Per Unit`.	`8.00`, `12.00`
`Payment Method`	The method of payment used. May contain missing or invalid values (e.g., `None`, "UNKNOWN").	`Cash`, `Credit Card`
`Location`	The location where the transaction occurred. May contain missing or invalid values.	`In-store`, `Takeaway`
`Transaction Date`	The date of the transaction. May contain missing or incorrect values.	`2023-01-01`

Data Characteristics

Missing Values:
- Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values:
- Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency:
- Prices for menu items are consistent but may have missing or incorrect values introduced.

Menu Items

The dataset includes the following menu items with their respective price ranges:

Item	Price($)
Coffee	2
Tea	1.5
Sandwich	4
Salad	5
Cake	3
Cookie	1
Smoothie	4
Juice	3

Use Cases

This dataset is suitable for: - Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries. - Exploring EDA techniques like visualizations and summary statistics. - Performing feature engineering for machine learning workflows.

Cleaning Steps Suggestions

To clean this dataset, consider the following steps: 1. Handle Missing Values: - Fill missing numeric values with the median or mean. - Replace missing categorical values with the mode or "Unknown."

Handle Invalid Values:
- Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
Date Consistency:
- Ensure all dates are in a consistent format.
- Fill missing dates with plausible values based on nearby records.
Feature Engineering:
- Create new columns, such as Day of the Week or Transaction Month, for further analysis.

License

This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

Feedback

If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

Bellabeat Case Study II Google Capstone Project
kaggle.com
zip
Updated Nov 18, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NUR SİMAİŞ (2022). Bellabeat Case Study II Google Capstone Project [Dataset]. https://www.kaggle.com/datasets/nursma/bellabeat-case-study-ii-google-capstone-project
Explore at:
zip(25278847 bytes)Available download formats
Dataset updated
Nov 18, 2022
Authors
NUR SİMAİŞ
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset is retrieved from the user Mobius page, where it's generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. I woıuld like to thank Möbius and everyone responsible for the work.

Bellabeat Case Study 1 2022-11-14 1. Introduction Hello everyone, my name is Nur Simais and this project is part of Google Data Analytics Professional Certificate. There have been multiple skills and skillsets learned throughout this course that can mainly be categorized under soft and hard skills. Also, this case study I have chosen is about the company calles “Bellabeat”, a fitness tracker device. The company is founded in 2013 by Urška Sršen and Sando Mur. It gradually gained recognition and expanded in many countires.(https://bellabeat.com/) Adding this brief info about the company, I’d like to say that doing the business analysis will help the company to see how it can achieve it’s goals and what can be done as to improve more.

During the analysis process, I will be using the Google’s “Ask-Prepare-Process-Analyze-Share-Act” Framework that I have learned throughout this certification and apply the tools and skillsets into it.

1.ASK

1.1 Business Task The goal of this project is to analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices and how to apply these insights into Bellabeat’s marketing strategy using these three questions:

What are some trends in smart device usage? How could these trends apply to Bellabeat customers? How could these trends help influence Bellabeat marketing strategy?

2.PREPARE Prepare the Data and Libraries in RStudio Collect the data required for analysis but since the data is available on Kaggle publicly, FitBit Fitness Tracker Data (CC0: Public Domain) and download the dataset.

There are 18 packages but after examining the excel docs, I decided to use these 8 datasets: dailyActivity_merged.csv, heartrate_seconds_merged.csv, hourlyCalories_merged.csv, hourlyIntensities_merged.csv, hourlySteps_merged.csv, minuteMETsNarrow_merged.csv, sleepDay_merged.csv, weightLogInfo_merged.csv 2.1 Install and load the packages Install the RStudio libraries for analysis and visualizations

install.packages("tidyverse") # core package for cleaning and analysis

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

install.packages("lubridate") # date library mdy()

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

install.packages("janitor") # clean_names() to consists only _, character, numbers, and letters.

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

install.packages("dplyr") #helps to check the garmmar of data manioulation

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

Load the libraries

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──

✔ ggplot2 3.4.0 ✔ purrr 0.3.5

✔ tibble 3.1.8 ✔ dplyr 1.0.10

✔ tidyr 1.2.1 ✔ stringr 1.4.1

✔ readr 2.1.3 ✔ forcats 0.5.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

✖ dplyr::filter() masks stats::filter()

✖ dplyr::lag() masks stats::lag()

library(janitor) ##

Attaching package: 'janitor'

##

The following objects are masked from 'package:stats':

##

chisq.test, fisher.test

library(lubridate)

Loading required package: timechange

##

Attaching package: 'lubridate'

##

The following objects are masked from 'package:base':

##

date, intersect, setdiff, union

library(dplyr) Having loaded tidyverse package, the rest of the essential packages (ggplot2, dplyr, and tidyr) are loaded as well.

2.2 Importing and Preparing the Dataset Upload the archived dataset to RStudio by clicking the Upload button in the bottom right pane.

File will be saved in a new folder named “Fitabase Data 4.12.16-5.12.16”. Importing the datasets and renaming them.

Loading your CSV files

daily_activity <- read.csv("dailyActivity_merged.csv") heartrate_seconds <- read_csv("heartrate_seconds_merged.csv")

Rows: 2483658 Columns: 3

── Column specification ────────────────────────────────────────────────────────

Delimiter: ","

chr (1): Time

dbl (2): Id, Value

##

ℹ Use spec() to retrieve the full column specification for this data.

ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

hourly_calories <- read_csv("hourlyCalories_merged.csv")

Rows: 22099 Columns: 3

── Column specification ─────────────────────────────────────...
Case study: Cyclistic bike-share analysis
kaggle.com
zip
Updated Mar 25, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jorge4141 (2022). Case study: Cyclistic bike-share analysis [Dataset]. https://www.kaggle.com/datasets/jorge4141/case-study-cyclistic-bikeshare-analysis
Explore at:
zip(131490806 bytes)Available download formats
Dataset updated
Mar 25, 2022
Authors
Jorge4141
Description
Introduction

This is a case study called Capstone Project from the Google Data Analytics Certificate.

In this case study, I am working as a junior data analyst at a fictitious bike-share company in Chicago called Cyclistic.

Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike.

Scenario

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, our team will design a new marketing strategy to convert casual riders into annual members.

****Primary Stakeholders:****

1: Cyclistic Executive Team

2: Lily Moreno, Director of Marketing and Manager

ASK

How do annual members and casual riders use Cyclistic bikes differently?

Why would casual riders buy Cyclistic annual memberships?

How can Cyclistic use digital media to influence casual riders to become members?

# Prepare

The last four quarters were selected for analysis which cover April 01, 2019 - March 31, 2020. These are the datasets used:

Divvy_Trips_2019_Q2 Divvy_Trips_2019_Q3 Divvy_Trips_2019_Q4 Divvy_Trips_2020_Q1

The data is stored in CSV files. Each file contains one month data for a total of 12 .csv files.

Data appears to be reliable with no bias. It also appears to be original, current and cited.

I used Cyclistic’s historical trip data found here: https://divvy-tripdata.s3.amazonaws.com/index.html

The data has been made available by Motivate International Inc. under this license: https://ride.divvybikes.com/data-license-agreement

Limitations

Financial information is not available.

Process

Used R to analyze and clean data

After installing the R packages, data was collected, wrangled and combined into a single file.

Columns were renamed.

Looked for incongruencies in the dataframes and converted some columns to character type, so they can stack correctly.

Combined all quarters into one big data frame.

Removed unnecessary columns

Analyze

Inspected new data table to ensure column names were correctly assigned.

Formatted columns to ensure proper data types were assigned (numeric, character, etc).

Consolidated the member_casual column.

Added day, month and year columns to aggregate data.

Added ride-length column to the entire dataframe for consistency.

Deleted trip duration rides that showed as negative and bikes out of circulation for quality control.

Replaced the word "member" with "Subscriber" and also replaced the word "casual" with "Customer".

Aggregated data, compared average rides between members and casual users.

Share

After analysis, visuals were created as shown below with R.

Act

Conclusion:

Data appears to show that casual riders and members use bike share differently.

Casual riders' average ride length is more than twice of that of members.

Members use bike share for commuting, casual riders use it for leisure and mostly on the weekends.

Unfortunately, there's no financial data available to determine which of the two (casual or member) is spending more money.

Recommendations

Offer casual riders a membership package with promotions and discounts.
d
Geochemical data supporting analysis of geochemical conditions and nitrogen...
catalog.data.gov
data.usgs.gov
+2more
Updated Nov 19, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Geochemical data supporting analysis of geochemical conditions and nitrogen transport in nearshore groundwater and the subterranean estuary at a Cape Cod embayment, East Falmouth, Massachusetts, 2013 [Dataset]. https://catalog.data.gov/dataset/geochemical-data-supporting-analysis-of-geochemical-conditions-and-nitrogen-transport-in-n
Explore at:
Dataset updated
Nov 19, 2025
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Falmouth, East Falmouth, Cape Cod, Massachusetts
Description
This data release provides analytical and other data in support of an analysis of nitrogen transport and transformation in groundwater and in a subterranean estuary in the Eel River and onshore locations on the Seacoast Shores peninsula, Falmouth, Massachusetts. The analysis is described in U.S. Geological Survey Scientific Investigations Report 2018-5095 by Colman and others (2018). This data release is structured as a set of comma-separated values (CSV) files, each of which contains data columns for laboratory (if applicable), USGS Site Name, date sampled, time sampled, and columns of specific analytical and(or) other data. The .csv data files have the same number of rows and each row in each .csv file corresponds to the same sample. Blank cells in a .csv file indicate that the sample was not analyzed for that constituent. The data release also provides a Data Dictionary (Data_Dictionary.csv) that provides the following information for each constituent (analyte): laboratory or data source, data type, description of units, method, minimum reporting limit, limit of quantitation if appropriate, method reference citations, minimum, maximum, median, and average values for each analyte. The data release also contains a file called Abbreviations in Data_Dictionary.pdf that contains all of the abbreviations in the Data Dictionary and in the well characteristics file in the companion report, Colman and others (2018).

Facebook

Twitter

Click to copy link

Link copied

Cite

Work With Data (2025). Dataset of books called An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=An+introduction+to+data+analysis+in+R+%3A+hands-on+coding%2C+data+mining%2C+visualization+and+statistics+from+scratch

Dataset of books called An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch

Explore at:

Dataset updated

Apr 17, 2025

Dataset authored and provided by

Work With Data

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset is about books. It has 1 row and is filtered where the book is An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch. It features 7 columns including author, publication date, language, and book publisher.

Clear search

Close search

Google apps

Main menu

Dataset of books called An introduction to data analysis in R : hands-on...

Google Data Analytics Capstone Project

Trusted Research Environments: Analysis of Characteristics and Data...

Methodology

Technical details

Dataset of books called Data analysis in business research : a step-by-step...

Dataset of books called Statistical and computational methods in data...

Automobile Sales data

Dataset of books called Modeling and data analysis : an introduction with...

DatasetofDatasets (DoD)

LTMP analysis 11-year versus 25-year with missing data

UC_vs_US Statistic Analysis.xlsx

Datasets for Sentiment Analysis

HR Analytics Dataset

Dataset of books called Core concepts in data analysis : summarization,...

Data from: Data and code from: The Impacts of Parental Choice and...

Experimental data and software for: Defaults: a double-edged sword in...

Parameter estimates of mixed generalized Gaussian distribution for modelling...

Cafe Sales - Dirty Data for Cleaning Training

Dirty Cafe Sales Dataset

Overview

File Information

Columns Description

Data Characteristics

Menu Items

Use Cases

Cleaning Steps Suggestions

License

Feedback

Bellabeat Case Study II Google Capstone Project

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'

(as 'lib' is unspecified)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──

✔ ggplot2 3.4.0 ✔ purrr 0.3.5

✔ tibble 3.1.8 ✔ dplyr 1.0.10

✔ tidyr 1.2.1 ✔ stringr 1.4.1

✔ readr 2.1.3 ✔ forcats 0.5.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

✖ dplyr::filter() masks stats::filter()

✖ dplyr::lag() masks stats::lag()

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

chisq.test, fisher.test

Loading required package: timechange

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

date, intersect, setdiff, union

Loading your CSV files

Rows: 2483658 Columns: 3

── Column specification ────────────────────────────────────────────────────────

Delimiter: ","

chr (1): Time

dbl (2): Id, Value

ℹ Use spec() to retrieve the full column specification for this data.

ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

Rows: 22099 Columns: 3

── Column specification ─────────────────────────────────────...

Case study: Cyclistic bike-share analysis

Introduction

Scenario

****Primary Stakeholders:****

ASK

Limitations

Process

Analyze

Share

Act

Recommendations

Geochemical data supporting analysis of geochemical conditions and nitrogen...

Dataset of books called An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch

ℹ Use `spec()` to retrieve the full column specification for this data.

ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Primary Stakeholders: