14 datasets found
  1. Google Data Analytics Case Study Cyclistic

    • kaggle.com
    zip
    Updated Sep 27, 2022
    + more versions
    Cite
    Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions
    Explore at:
zip (1299 bytes); available download formats
    Dataset updated
    Sep 27, 2022
    Authors
    Udayakumar19
    Description

    Introduction

    Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

    Scenario

    You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Ask

    How do annual members and casual riders use Cyclistic bikes differently?

    Guiding Question:

    What is the problem you are trying to solve?
      How do annual members and casual riders use Cyclistic bikes differently?
    How can your insights drive business decisions?
  These insights will help the marketing team design a strategy to convert casual riders into annual members.
    

    Prepare

    Guiding Question:

    Where is your data located?
  The data is stored internally by the Cyclistic organization.
    
    How is data organized?
  The datasets are CSV files, one per month, covering fiscal year 2022.
    
    Are there issues with bias or credibility in this data? Does your data ROCCC? 
  Yes, the data is ROCCC (reliable, original, comprehensive, current, and cited) because it was collected directly by the Cyclistic organization.
    
    How are you addressing licensing, privacy, security, and accessibility?
  The company holds its own license over the dataset, and the dataset does not contain any personal information about the riders.
    
    How did you verify the data’s integrity?
      All the files have consistent columns and each column has the correct type of data.
    
    How does it help you answer your questions?
  Insights are hidden in the data; we have to interpret the data to find them.
    
    Are there any problems with the data?
  Yes: the start station name and end station name columns contain null values.
    

    Process

    Guiding Question:

    What tools are you choosing and why?
  I used RStudio to clean and transform the data for the analysis phase, both because the dataset is large and to gain experience with the language.
    
    Have you ensured the data’s integrity?
     Yes, the data is consistent throughout the columns.
    
    What steps have you taken to ensure that your data is clean?
  First, duplicates and null values were removed; then new columns were added for analysis.
    
How can you verify that your data is clean and ready to analyze?
  Make sure the column names are consistent across all datasets before combining them with the bind_rows() function.
  Make sure the column data types are consistent across all datasets by using compare_df_cols() from the janitor package.
  Combine all of the datasets into a single data frame so the analysis is consistent throughout.
  Remove the start_lat, start_lng, end_lat, and end_lng columns from the data frame because they are not required for the analysis.
  Create new day, date, month, and year columns from the started_at column; this provides additional ways to aggregate the data.
  Create a ride_length column from the started_at and ended_at columns to find the average ride duration by rider type.
  Remove rows with null values using the na.omit() function.
  A minimal R sketch of these steps follows below.
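A minimal R sketch of the cleaning steps above, assuming the monthly CSVs sit in a data/ folder and use the standard Divvy column names (started_at, ended_at, start_lat, start_lng, end_lat, end_lng); the folder and file pattern are illustrative:

# Hedged sketch; adjust the path and pattern to the actual monthly files.
library(dplyr)
library(readr)
library(janitor)
library(lubridate)

files <- list.files("data", pattern = "tripdata\\.csv$", full.names = TRUE)
trips_list <- lapply(files, read_csv)

compare_df_cols(trips_list)          # check that column names and types agree across files

trips <- bind_rows(trips_list) %>%   # combine everything into one data frame
  select(-start_lat, -start_lng, -end_lat, -end_lng) %>%
  mutate(
    date        = as.Date(started_at),
    day         = wday(started_at, label = TRUE),
    month       = month(started_at, label = TRUE),
    year        = year(started_at),
    ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))
  )

trips <- na.omit(trips)              # drop rows with null values (e.g. missing station names)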
    Have you documented your cleaning process so you can review and share those results? 
      Yes, the cleaning process is documented clearly.
    

    Analyze Phase:

    Guiding Questions:

How should you organize your data to perform analysis on it?
  The data has been organized into one single data frame using the read_csv() function in R.
Has your data been properly formatted?
  Yes, all the columns have their correct data type.

What surprises did you discover in the data?
  Casual riders' average ride duration is higher than that of annual members.
  Casual riders use docked bikes far more than annual members do.
What trends or relationships did you find in the data?
  Annual members ride mainly for commuting.
  Casual riders prefer docked bikes.
  Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
  These insights help build a profile of each rider type for the marketing team (see the aggregation sketch below).
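A hedged aggregation sketch continuing from the cleaned trips data frame; member_casual and rideable_type follow the current Divvy schema but are assumptions here, not confirmed column names from this upload:

library(dplyr)

# Average ride length and ride count by rider type and weekday
trips %>%
  group_by(member_casual, day) %>%
  summarise(
    rides            = n(),
    mean_ride_length = mean(ride_length),
    .groups = "drop"
  )

# Bike-type preference by rider type
trips %>%
  count(member_casual, rideable_type)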
    

    Share

Guiding Questions:

    Were you able to answer the question of how ...
    
  2. Practice makes master: Movie Collection Analysis

    • kaggle.com
    zip
    Updated May 19, 2019
    Cite
    Beyjin (2019). Practice makes master: Movie Collection Analysis [Dataset]. https://www.kaggle.com/beyjin/movies-1990-to-2017
    Explore at:
zip (22259569 bytes); available download formats
    Dataset updated
    May 19, 2019
    Authors
    Beyjin
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

The data set represents movies released from 1990 up to 2017. It is kept quite general and does not have any real problem or challenge as a background. The whole data set is meant for practicing different types of techniques as a data analyst / data scientist.

I'd also like to mention that the dataset is not fully cleaned. The reasoning is that it should show you the real life of being an analyst / scientist: get data, prep data, analyse data, visualize data, and predict outcomes for different use cases ;-)

    Content

I love watching movies, so I tried to combine this hobby with my current self-study toward becoming a data scientist. I therefore needed a data set with movie information so that I could play around and apply what I had learned. At first glance, the data set can be used for regression, classification, or potentially even deep learning (such as image recognition, since poster URLs are given).

I acquired this dataset in several steps. First, I searched the internet for an API I could use to retrieve movie information. After a short time I found omdbapi.com. With the help of this API I was able to fetch information based on the title of a movie.

Then I had another problem: I was missing movie titles, so the next search began. I couldn't find an API for that, but I saw that Wikipedia was quite well structured with regard to movie titles, so I built a scraper to fetch all movie titles from 1990 to 2017.

After collecting all the titles, I could finally obtain the information for each movie using its title + year (there may be movies sharing the same name). Unfortunately, some movie titles were written differently, so I had a failure rate of 10% when obtaining the movie data. Based on the 10% of failed movie titles, I did a text analysis and found around 400,000 new movies / series. The latest version should include nearly 200,000 different movies based on the imdbID.

Additionally, I cleaned some of the information, such as Genre, Actors, and Writer, for easier analysis. Each of the CSV files can be joined on the imdbID. Be aware that some information is missing and is recorded as _NOT_GIVEN.
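A small hedged sketch of that join in R, treating _NOT_GIVEN as missing; the file names and non-key columns are illustrative assumptions, and only imdbID is documented as the join key:

library(dplyr)
library(readr)

# Illustrative file names; substitute the actual CSVs from the archive.
movies  <- read_csv("movies_main.csv",   na = c("", "NA", "_NOT_GIVEN"))
ratings <- read_csv("movies_rating.csv", na = c("", "NA", "_NOT_GIVEN"))

movies_joined <- movies %>%
  inner_join(ratings, by = "imdbID")   # imdbID is the documented join key

glimpse(movies_joined)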

    Acknowledgements

    • Thanks to omdbapi.com for providing such a good API and well structured data.

    Inspiration

The inspiration for this data set came from wanting to get into the practical flow of developing an image recognition application: recognizing the genre of a movie from its poster. On request I could also provide the movie images. For the given dataset, I have the following questions in mind:

1. Does the genre correlate with the given scoring?
2. Can we see a hype around specific genres over the past years?
3. Do the actors or writers prefer a genre?
4. Do the actors or writers have an impact on the IMDb scoring?
5. Do the directors have preferred actors for their movies?
6. Do the directors have preferred writers for their movies?
7. How many movies have been produced by each director?
8. Is there any relation between the director and the IMDb rating?
9. .... many more questions :-)
  3. Cyclistic_data_visualization

    • kaggle.com
    Updated Jun 12, 2021
    Cite
    Mark Woychick (2021). Cyclistic_data_visualization [Dataset]. https://www.kaggle.com/markwoychick/cyclistic-data-visualization
    Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 12, 2021
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Mark Woychick
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I created these files and analysis as part of working on a case study for the Google Data Analyst certificate.

    Question investigated: Do annual members and casual riders use Cyclistic bikes differently? Why do we want to know?: Knowing bike usage/behavior by rider type will allow the Marketing, Analytics, and Executive team stakeholders to design, assess, and approve appropriate strategies that drive profitability.

    Content

I used the script noted below to clean the files and then added some additional steps to create the visualizations that complete my analysis. The additional steps are noted in the corresponding R Markdown file for this data set.
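As a hedged illustration of that visualization step (not the author's R Markdown code), a ggplot2 sketch comparing ride counts by rider type and weekday; the all_trips data frame and the usertype and start_time columns follow the 2019/2020 Divvy schema but should be treated as assumptions:

library(dplyr)
library(ggplot2)
library(lubridate)

all_trips %>%
  mutate(weekday = wday(start_time, label = TRUE)) %>%
  count(usertype, weekday) %>%
  ggplot(aes(x = weekday, y = n, fill = usertype)) +
  geom_col(position = "dodge") +
  labs(title = "Rides by weekday and rider type", x = NULL, y = "Number of rides")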

    Acknowledgements

    Files: most recent 1 year of data available, Divvy_Trips_2019_Q2.csv, Divvy_Trips_2019_Q3.csv, Divvy_Trips_2019_Q4.csv, Divvy_Trips_2020_Q1.csv Source: Downloaded from https://divvy-tripdata.s3.amazonaws.com/index.html

    Data cleaning script: followed this script to clean and merge files https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy

Note: the combined data set has 3,876,042 rows, so you will likely need to run the R analysis on your own computer (e.g., the R console) rather than in the cloud (e.g., RStudio Cloud).

    Inspiration

    This was my first attempt to conduct an analysis in R and create the R Markdown file. As you might guess, it was an eye-opening experience, with both exciting discoveries and aggravating moments.

    One thing I have not yet been able to figure out is how to add a legend to the map. I was able to get a legend to appear on a separate (empty) map, but not on the map you will see here.

    I am also interested to see what others did with this analysis - what were the findings and insights you found?

4. Household Expenditure and Income Survey 2010, Economic Research Forum (ERF)...

    • catalog.ihsn.org
    Updated Mar 29, 2019
    + more versions
    Cite
    The Hashemite Kingdom of Jordan Department of Statistics (DOS) (2019). Household Expenditure and Income Survey 2010, Economic Research Forum (ERF) Harmonization Data - Jordan [Dataset]. https://catalog.ihsn.org/index.php/catalog/7662
    Explore at:
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    The Hashemite Kingdom of Jordan Department of Statistics (DOS)
    Time period covered
    2010 - 2011
    Area covered
    Jordan
    Description

    Abstract

    The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level. The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.

    Data collected through the survey helped in achieving the following objectives: 1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index 2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns 3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators 4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it 5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector 6. Provide the necessary income data to serve in calculating poverty indices and identifying the poor characteristics as well as drawing poverty maps 7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty

    Geographic coverage

    National

    Analysis unit

    • Households
    • Individuals

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The Household Expenditure and Income survey sample for 2010, was designed to serve the basic objectives of the survey through providing a relatively large sample in each sub-district to enable drawing a poverty map in Jordan. The General Census of Population and Housing in 2004 provided a detailed framework for housing and households for different administrative levels in the country. Jordan is administratively divided into 12 governorates, each governorate is composed of a number of districts, each district (Liwa) includes one or more sub-district (Qada). In each sub-district, there are a number of communities (cities and villages). Each community was divided into a number of blocks. Where in each block, the number of houses ranged between 60 and 100 houses. Nomads, persons living in collective dwellings such as hotels, hospitals and prison were excluded from the survey framework.

A two-stage stratified cluster sampling technique was used. In the first stage, a cluster sample was selected with probability proportional to size, where the number of households in each cluster was taken as the cluster's size measure. At the second stage, a sample of 8 households was selected from each cluster, plus another 4 households selected as a backup for the basic sample, using a systematic sampling technique. The 4 backup households were to be used during the first visit to the block in case a visit to an originally selected household was not possible for any reason. For the purposes of this survey, each sub-district was considered a separate stratum to ensure that results could be produced at the sub-district level. In this respect, the survey adopted the strata defined by the General Census of Population and Housing. To estimate the sample size, the coefficient of variation and the design effect of the expenditure variable from the Household Expenditure and Income Survey for the year 2008 were calculated for each sub-district. These results were used to estimate the sample size at the sub-district level so that the coefficient of variation of the expenditure variable in each sub-district is less than 10%, with a minimum number of clusters per sub-district (6 clusters). This ensures adequate representation of clusters in the different administrative areas and enables drawing an indicative poverty map.
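As a purely illustrative sketch of that two-stage design (not the DOS implementation), the R code below draws clusters with probability proportional to household counts within one stratum and then a systematic sample of 8 main households plus 4 backups from a selected cluster; every object name and count here is hypothetical:

# Hypothetical frame: one row per cluster in a sub-district (stratum).
set.seed(1)
clusters <- data.frame(
  cluster_id = 1:40,
  households = sample(60:100, 40, replace = TRUE)  # block sizes as described in the survey
)

# Stage 1: select 6 clusters, probability proportional to size (approximated
# here by weighted sampling without replacement).
selected_clusters <- sample(clusters$cluster_id, size = 6, prob = clusters$households)

# Stage 2: systematic sample of 12 households (8 main + 4 backup) in one cluster.
household_listing <- 1:80                        # hypothetical household listing for the cluster
step   <- length(household_listing) %/% 12
start  <- sample(seq_len(step), 1)
picked <- household_listing[seq(start, by = step, length.out = 12)]

main_sample   <- picked[1:8]
backup_sample <- picked[9:12]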

    It should be noted that in addition to the standard non response rate assumed, higher rates were expected in areas where poor households are concentrated in major cities. Therefore, those were taken into consideration during the sampling design phase, and a higher number of households were selected from those areas, aiming at well covering all regions where poverty spreads.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    • General form
    • Expenditure on food commodities form
    • Expenditure on non-food commodities form

    Cleaning operations

Raw Data:
• Organizing forms/questionnaires: A compatible archive system was used to classify the forms according to the different rounds throughout the year. A registry was prepared to indicate the different stages of the process of data checking, coding and entry until the forms were returned to the archive system.
• Data office checking: This phase was carried out concurrently with the data collection phase in the field, where questionnaires completed in the field were immediately sent to the data office checking phase.
• Data coding: A team was trained to work on the data coding phase, which in this survey is limited to education specialization, profession and economic activity. International classifications were used for these, while for the rest of the questions coding was predefined during the design phase.
• Data entry/validation: A team consisting of system analysts, programmers and data entry personnel worked on the data at this stage. System analysts and programmers started by identifying the survey framework and questionnaire fields to help build computerized data entry forms. A set of validation rules was added to the entry forms to ensure the accuracy of the data entered. A team was then trained to complete the data entry process. Forms prepared for data entry were provided by the archive department to ensure forms were correctly extracted and put back in the archive system. A data validation process was run on the data to ensure the data entered was free of errors.
• Results tabulation and dissemination: After the completion of all data processing operations, ORACLE was used to tabulate the survey's final results. Those results were further checked against similar outputs from SPSS to ensure that the tabulations produced were correct. A check was also run on each table to guarantee consistency of the figures presented, together with the required editing of table titles and report formatting.

Harmonized Data:
• The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets.
• The harmonization process started with cleaning all raw data files received from the Statistical Office.
• Cleaned data files were then merged to produce one data file on the individual level containing all variables subject to harmonization.
• A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables.
• A post-harmonization cleaning process was run on the data.
• Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format.

5. Data Clean Room Platforms Market Research Report 2033

    • researchintelo.com
    csv, pdf, pptx
    Updated Oct 2, 2025
    Cite
    Research Intelo (2025). Data Clean Room Platforms Market Research Report 2033 [Dataset]. https://researchintelo.com/report/data-clean-room-platforms-market
    Explore at:
csv, pptx, pdf; available download formats
    Dataset updated
    Oct 2, 2025
    Dataset authored and provided by
    Research Intelo
    License

https://researchintelo.com/privacy-and-policy

    Time period covered
    2024 - 2033
    Area covered
    Global
    Description

    Data Clean Room Platforms Market Outlook



    According to our latest research, the Global Data Clean Room Platforms Market size was valued at $1.2 billion in 2024 and is projected to reach $8.6 billion by 2033, expanding at a robust CAGR of 24.1% during the forecast period of 2025–2033. One of the primary drivers fueling this remarkable growth is the escalating demand for privacy-centric data collaboration solutions in the wake of stringent data protection regulations and the gradual phase-out of third-party cookies in digital advertising. As organizations across industries strive to unlock the potential of their data assets while ensuring compliance with global privacy standards, data clean room platforms have emerged as a critical enabler, facilitating secure, privacy-compliant data sharing and advanced analytics without exposing sensitive or personally identifiable information.
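As a quick back-of-the-envelope check (not a figure from the report), compounding $1.2 billion in 2024 to $8.6 billion in 2033 implies a growth rate close to the quoted CAGR:

start_value <- 1.2    # USD billion, 2024
end_value   <- 8.6    # USD billion, 2033
years       <- 2033 - 2024

cagr <- (end_value / start_value)^(1 / years) - 1
round(100 * cagr, 1)  # roughly 24.5%, in line with the reported ~24.1% once the exact base year is accounted for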



    Regional Outlook



    North America currently commands the largest share of the global Data Clean Room Platforms market, accounting for more than 42% of total revenue in 2024. This dominance can be attributed to the region's mature digital advertising ecosystem, early adoption of privacy-enhancing technologies, and the presence of leading technology providers and cloud service giants. The United States, in particular, has seen rapid integration of data clean room solutions by major brands, publishers, and agencies seeking to navigate evolving privacy regulations such as CCPA and the deprecation of third-party cookies by major browsers. Furthermore, North American enterprises are leveraging these platforms for advanced cross-channel analytics and targeted audience activation, driving significant market value and fostering innovation in data collaboration frameworks.



    Asia Pacific is anticipated to be the fastest-growing region, projected to register a CAGR exceeding 27% through 2033. The surge in digital transformation initiatives, exponential growth of e-commerce, and increasing investments in MarTech by enterprises in China, India, Japan, and Southeast Asia are key drivers of this accelerated expansion. The region’s burgeoning online consumer base, coupled with rising awareness about data privacy and security, is propelling demand for secure data collaboration tools. Governments in Asia Pacific are also enacting new data protection laws, prompting organizations to adopt data clean room platforms to maintain regulatory compliance while extracting actionable insights from first-party and partner data sources. As a result, vendors are ramping up regional investments and forming strategic alliances to tap into this high-growth market.



    Emerging economies in Latin America and the Middle East & Africa are gradually recognizing the value proposition of data clean room platforms, though adoption remains at a nascent stage due to infrastructural and regulatory challenges. In these regions, local enterprises are grappling with fragmented data landscapes, limited technical expertise, and evolving data governance policies. However, as digital advertising and e-commerce sectors gain momentum and cross-border data sharing becomes more prevalent, there is growing interest in privacy-preserving analytics solutions. International technology providers are increasingly partnering with local players to deliver tailored offerings and build capacity, setting the stage for future market expansion despite current hurdles.



    Report Scope





• Report Title: Data Clean Room Platforms Market Research Report 2033
• By Component: Software, Services
• By Deployment Mode: On-Premises, Cloud
• By Organization Size: Large Enterprises, Small and Medium Enterprises
• By Application: Advertising & Marketing, Data Analytics, Compliance & ...

  6. Goalkeeper and Midfielder Statistics

    • kaggle.com
    zip
    Updated Dec 8, 2022
    Cite
    The Devastator (2022). Goalkeeper and Midfielder Statistics [Dataset]. https://www.kaggle.com/datasets/thedevastator/maximizing-player-performance-with-goalkeeper-an
    Explore at:
zip (108659 bytes); available download formats
    Dataset updated
    Dec 8, 2022
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Goalkeeper and Midfielder Statistics

    Leveraging Statistical Data Of Goalkeepers and Midfielders

    By [source]

    About this dataset

    Welcome to Kaggle's dataset, where we provide rich and detailed insights into professional football players. Analyze player performance and team data with over 125 different metrics covering everything from goal involvement to tackles won, errors made and clean sheets kept. With the high levels of granularity included in our analysis, you can identify which players are underperforming or stand out from their peers for areas such as defense, shot stopping and key passes. Discover current trends in the game or uncover players' hidden value with this comprehensive dataset - a must-have resource for any aspiring football analyst!


    How to use the dataset

    • Define Performance: The first step of using this dataset is defining what type of performance you are measuring. Are you looking at total goals scored? Assists made? Shots on target? This will allow you to choose which metrics from the dataset best fit your criteria.

    • Descriptive Analysis: Once you have chosen your metric(s), it's time for descriptive analysis. This means analyzing the patterns within the data that contribute towards that metric(s). Does one team have more potential assist makers than another? What about shot accuracy or tackles won %? With descriptive analysis, we'll look for general trends across teams or specific players that influence performance in a meaningful way.

• Predictive Analysis: Finally, we can move on to predictive analysis. This type of analysis seeks to answer two questions: which factors predict player performance, and which factors matter most when predicting it? Using various predictive models, for example logistic regression or random forests, we can determine which variables in our dataset best explain a given metric's outcome (for example, expected goals per match) and build models that accurately predict future outcomes from the input values associated with those factors (a hedged modelling sketch appears after this list).

    By following these steps outlined here, you'll be able to get started in finding relationships between different metrics from this dataset and leveraging these insights into predictions about player performance!
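A hedged sketch of that predictive step in R, using the documented per-appearance file and column names where available; the Goals target column and the model choice are illustrative assumptions:

library(randomForest)

# check.names = FALSE keeps headers such as "Shots on Target" intact.
players <- read.csv("DEF PerApp 2GWs.csv", check.names = FALSE)

# `Goals` is an assumed target column; swap in whichever outcome you are modelling.
rf <- randomForest(
  Goals ~ Minutes + Shots + `Shots on Target`,
  data = na.omit(players),
  ntree = 500,
  importance = TRUE
)

importance(rf)   # which predictors best explain the chosen performance metric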

    Research Ideas

    • Creating an advanced predictive analytics model: By using the data in this dataset, it would be possible to create an advanced predictive analytics model that can analyze player performance and provide more accurate insights on which players are likely to have the most impact during a given season.
    • Using Machine Learning algorithms to identify potential transfer targets: By using a variety of metrics included in this dataset, such as shots, shots on target and goals scored, it would be possible to use Machine Learning algorithms to identify potential transfer targets for a team.
    • Analyzing positional differences between players: This dataset contains information about each player's position as well as their performance metrics across various aspects of the game (e.g., crosses attempted, defensive clearances). Thus it could be used for analyzing how certain positional groupings perform differently from one another in certain aspects of their play over different stretches of time or within one season or matchday in particular.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

File: DEF PerApp 2GWs.csv

| Column name | Description |
|:------------|:------------|
| Name | Name of the player. (String) |
| App. | Number of appearances. (Integer) |
| Minutes | Number of minutes played. (Integer) |
| Shots | Number of shots taken. (Integer) |
| Shots on Target | Number of shots on target. (Integer) |
...

  7. Cyclisitic Trip Data 2019 (Google)

    • kaggle.com
    zip
    Updated Aug 4, 2022
    Cite
    Shaine Pepper (2022). Cyclisitic Trip Data 2019 (Google) [Dataset]. https://www.kaggle.com/datasets/shainepepper/divvy-2019-trip-data-clean
    Explore at:
zip (27551971 bytes); available download formats
    Dataset updated
    Aug 4, 2022
    Authors
    Shaine Pepper
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Intro

    Cleaning this data took some time due to many NULL values, typos, and unorganized collection. My first step was to put the dataset into R and work my magic there. After analyzing and cleaning the data, I moved the data to Tableau to create easily understandable and helpful graphs. This step was a learning curve because there are so many potential options inside Tableau. Finding the correct graph to share my findings while keeping the stakeholders' tasks in mind was my biggest obstacle.

    RStudio

First, I needed to combine the 4 datasets into 1; I did this using the rbind() function.

Step two was to rename typo-ridden or poorly named columns:

colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "tripduration"] <- "trip_duration"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "bikeid"] <- "bike_id"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "usertype"] <- "user_type"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "birthyear"] <- "birth_year"

    Next step was to remove all NULL and over exaggerated numbers. Such as trip durations more than 10 hours long.

library(dplyr)

Cyclistic_Clean_v2 <- Cyclistic_Data_2019 %>%
  filter(across(where(is.character), ~ . != "NULL")) %>%
  type.convert(as.is = TRUE)

After removing the NULL data, it was time to remove potential typos and poorly collected data. I could only identify exaggerated values in the trip_duration column, where I found multiple cases of trips longer than 2,000,000 seconds. To find these large values, I used the count() function.

Cyclistic_Clean_v2 %>% count(trip_duration > "30000")

After finding multiple instances of this, I ran into a hard spot: the trip_duration column was categorized as character when it needed to be numeric to be cleaned further. It took me quite a while to find out that this was the issue, and then I remembered the class() function. With it, I was easily able to confirm that the column type was wrong.

    class(Cyclistic_Clean_v2$trip_duration)

Once I had identified the wrong classification, I still had some work to do before converting the column to a numeric type, as it contained quotation marks, periods, and a trailing 0. To remove these I used the gsub() function.

Cyclistic_Clean_v2$trip_duration <- gsub("\\.0$", "", Cyclistic_Clean_v2$trip_duration)
Cyclistic_Clean_v2$trip_duration <- gsub('"', '', Cyclistic_Clean_v2$trip_duration)

    Now that unwanted characters are gone, we can convert the column into numeric.

    Cyclistic_Clean_v2$trip_duration <- as.numeric(Cyclistic_Clean_v2$trip_duration)

    Doing this allows Tableau and R to read the data properly to create graphs without error.

Next, I created a backup data set in case there was any issue while exporting.

Cyclistic_Clean_v3 <- Cyclistic_Clean_v2
write.csv(Cyclistic_Clean_v2, "Folder.Path/Cyclistic_Data_Cleaned_2019.csv", row.names = FALSE)

    After exporting I came to the conclusion that I should have put together a more accurate change log rather than brief notes. That is one major learning lesson I will take away from this project.

All around, I had a lot of fun using R to transform and analyze the data. I learned many different ways to efficiently clean data.

    Tableau

    Now onto the fun part! Tableau is a very good tool to learn. There are so many different ways to bring your data to life and show your creativity inside your work. After a few guides and errors, I could finally start building graphs to bring the stakeholders' tasks to fruition.

    Charts

Please note these are all made in Tableau and meant to be interactive.

    Here you can find the relation between male and female riders.


    Male vs Female tripduration with usertype


    Busiest stations filtered by months. (This is meant to be interactive.)


    Most popular starting stations.


    Most popular ending stations.


    Conclusion

My main goal was to help find out how Cyclistic can convert casual riders into subscribers. Here are my findings.

    1. Casual riders ride much longer than subscribers duration wise.
    2. Although there are many more male riders, females tend to ride longer than males.
    3. Stations #562 & #568 are the most busy by a h...
  8. Adventures of Sherlock Holmes: Sentiment Analysis.

    • kaggle.com
    zip
    Updated Aug 25, 2024
    Cite
    Patrick L Ford (2024). Adventures of Sherlock Holmes: Sentiment Analysis. [Dataset]. https://www.kaggle.com/datasets/patricklford/adventures-of-sherlock-holmes-sentiment-analysis/discussion
    Explore at:
zip (219210 bytes); available download formats
    Dataset updated
    Aug 25, 2024
    Authors
    Patrick L Ford
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Introduction

    The famous Sherlock Holmes quote, “Data! data! data!” from The Copper Beeches perfectly encapsulates the essence of both detective work and data analysis. Holmes’ relentless pursuit of every detail closely mirrors the approach of modern data analysts, who understand that conclusions drawn without solid data are mere conjecture. Just as Holmes systematically gathered clues, analysed them from different perspectives, and tested hypotheses to arrive at the truth, today’s analysts follow similar processes when investigating complex data-driven problems. This project draws a parallel between Holmes’ detective methods and modern data analysis techniques by visualising and interpreting data from The Adventures of Sherlock Holmes.
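As a hedged illustration of the kind of analysis described here (not the author's actual notebook), a tidytext sketch that tokenizes the book and tracks Bing sentiment through the text; it assumes a local plain-text copy of the book, e.g. downloaded from Project Gutenberg, under an assumed file name:

library(dplyr)
library(tidytext)
library(tidyr)

# Assumed local file name for the plain-text book.
holmes <- tibble(text = readLines("adventures_of_sherlock_holmes.txt"))

holmes %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(index = linenumber %/% 100, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)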

    “**Data! data! data!**” he cried, impatiently. “I can’t make bricks without clay.”

    The above quote comes from one of my favourite Sherlock Holmes stories, The Copper Beeches. In this single outburst, Holmes captures a principle that resonates deeply with today’s data analysts: without data, conclusions are mere speculation. Data is the bedrock of any investigation. Without sufficient data, the route to solving a problem or answering a question is clouded with uncertainty.

    Sherlock Holmes, the iconic fictional detective, thrived on difficult cases, relishing the challenge of pitting his wits against the criminal mind.

His methods of detection:
- Examining crime scenes.
- Interrogating witnesses.
- Evaluating motives.

These closely parallel how a data analyst approaches a complex problem today. By carefully collecting and interpreting data, Holmes was able to unravel mysteries that seemed impenetrable at first glance.

    1. Data Collection: Gathering Evidence
Holmes's meticulous approach to data collection mirrors the first stage of data analysis. Just as Holmes would scrutinise a crime scene for every detail, whether a footprint, a discarded note, or a peculiar smell, data analysts seek to gather as much relevant data as possible. And just as incomplete or biased data can skew results in modern analysis, Holmes understood that every clue mattered: overlooking a small piece of information could compromise the entire investigation.

    2. Data Quality: “I can’t make bricks without clay.”
This quote is more than just a witty remark; it highlights the importance of having the right data. In the same way that substandard materials result in poor construction, incomplete or inaccurate data leads to unreliable analysis. Today's analysts face similar issues: they must assess data integrity, clean noisy datasets, and ensure they're working with accurate information before drawing conclusions. Holmes, in his time, would painstakingly verify each clue, ensuring that he was not misled by false leads.

    3. Data Analysis: Considering Multiple Perspectives
    Holmes’s genius lay not just in gathering data, but in the way he analysed it. He would often examine a problem from multiple angles, revisiting clues with fresh perspectives to see what others might have missed. In modern data analysis, this approach is akin to using different models, visualisations, and analytical methods to interpret the same dataset. Analysts explore data from multiple viewpoints, testing different hypotheses, and applying various algorithms to see which provides the most plausible insight.

    4. Hypothesis Testing: Eliminate the Improbable
    One of Holmes’s guiding principles was: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth.” This mirrors the process of hypothesis testing in data analysis. Analysts might begin with several competing theories about what the data suggests. By testing these hypotheses, ruling out those that are contradicted by the data, they zero in on the most likely explanation. For both Holmes and today’s data analysts, the process of elimination is crucial to arriving at the correct answer.

    5. Insight and Conclusion: The Final Deduction
    After piecing together all the clues, Holmes would reveal his conclusion, often leaving his audience in awe at how the seemingly unrelated pieces of data fit together. Similarly, data analysts must present their findings clearly and compellingly, translating raw data into actionable insights. The ability to connect the dots and tell a coherent story from the data is what transforms analysis into impactful decision-making.

In summary, the methods Sherlock Holmes employed, gathering data meticulously, testing multiple angles, and drawing conclusions through careful analysis, are strikingly similar to the techniques used by modern data analysts. Just as Holmes required high-quality data and a structured approach to solve crimes, today's data analysts rely on well-prepared data and methodical analysis to provide insights. Whether you're cracking a case or uncovering business...

  9. Google Data Analytics Capstone

    • kaggle.com
    zip
    Updated Apr 24, 2023
    Cite
    Giancarlo Vincitorio (2023). Google Data Analytics Capstone [Dataset]. https://www.kaggle.com/datasets/giancarlovincitorio/googlecapstone/code
    Explore at:
zip (192803792 bytes); available download formats
    Dataset updated
    Apr 24, 2023
    Authors
    Giancarlo Vincitorio
    Description

    Scenario

In this case study, I hold the position of junior data analyst at a bike-share company named Cyclistic, based in Chicago. Our company has more than 5,800 bicycles and 600 docking stations within the program. A key feature of the bike-share program is that bikes can be unlocked at one station and returned to any other station in the system. We accommodate 2 types of customers:

Customers who purchase single-ride or full-day passes are referred to as casual riders, while customers who purchase annual memberships are Cyclistic members. The finance team concluded that annual members are more profitable to the company than casual riders. So, keeping the company's future success in mind, the marketing team planned to look at how it can maximize the number of annual memberships. As a result, the marketing analysts worked together on designing a new marketing strategy to convert casual riders into annual members. To get there, Moreno, the director of marketing, wanted us first to answer the business question: how do annual members and casual riders use Cyclistic bikes differently?

    To put these pieces into action we need to follow 6 steps for processing the data.

Ask phase: identifying the stakeholders and asking effective questions to define the problem.
Prepare phase: collecting the data used in the upcoming process.
Process phase: documenting how the data is cleaned or manipulated.
Analyze phase: using tools to transform, clean, and organize the data.
Share phase: visualizing the insights and key findings.
Act phase: finalizing conclusions and recommendations.

Ask Phase

    Business Task

Analyze how annual members and casual riders use Cyclistic bikes differently, gain insights into user behavior, and put forward high-level recommendations for an effective Cyclistic marketing strategy.

    Stakeholders

● Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

    ● Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.

● Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

    Prepare Phase

The Cyclistic dataset was collected from https://divvy-tripdata.s3.amazonaws.com/index.html. The data is organized systematically by fiscal quarter and year. For the analysis I used 4 CSV files representing the 4 quarters of fiscal year 2019 (a reading sketch follows the file list below). The data has been made available by Motivate International Inc. under this license: https://ride.divvybikes.com/data-license-agreement

    DataFile

Divvy_Trips_2019_Q4/Divvy_Trips_2019_Q1.csv
Divvy_Trips_2019_Q4/Divvy_Trips_2019_Q2.csv
Divvy_Trips_2019_Q4/Divvy_Trips_2019_Q3.csv
Divvy_Trips_2019_Q4/Divvy_Trips_2019_Q4.csv
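A hedged sketch of the Prepare step in R, reading the four quarterly files listed above; the usertype column follows the 2019 Divvy schema mentioned elsewhere in these listings, and any quarter whose raw headers differ would need renaming before stacking:

library(dplyr)
library(readr)

q1 <- read_csv("Divvy_Trips_2019_Q4/Divvy_Trips_2019_Q1.csv")
q2 <- read_csv("Divvy_Trips_2019_Q4/Divvy_Trips_2019_Q2.csv")
q3 <- read_csv("Divvy_Trips_2019_Q4/Divvy_Trips_2019_Q3.csv")
q4 <- read_csv("Divvy_Trips_2019_Q4/Divvy_Trips_2019_Q4.csv")

# If column names differ between quarters, align them (rename calls omitted) before stacking.
trips_2019 <- bind_rows(q1, q2, q3, q4)

trips_2019 %>% count(usertype)   # ride counts per user type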

  10. Crystal Clean: Brain Tumors MRI Dataset

    • kaggle.com
    zip
    Updated Jul 16, 2023
    Cite
    MH (2023). Crystal Clean: Brain Tumors MRI Dataset [Dataset]. https://www.kaggle.com/datasets/mohammadhossein77/brain-tumors-dataset
    Explore at:
zip (231999018 bytes); available download formats
    Dataset updated
    Jul 16, 2023
    Authors
    MH
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Uncovering Knowledge: A Clean Brain Tumor Dataset for Advanced Medical Research

    Introduction:

    • This dataset, available in RAR archive format, consists of four classes, including three tumor classes (Pituitary, Glioma and Meningioma) and one class representing normal brain MRI scans.
• The strength of this dataset compared with other releases on Kaggle is the cleanness of the data. In this regard, we subjected the initial dataset to a meticulous data cleaning pipeline. This pipeline involved several steps aimed at enhancing the dataset's integrity and usability.
    • The initial data source for this dataset is the brain tumor classification MRI dataset, which can be accessed at this link.

    Data Cleaning Process:

    • Removal of Duplicate Samples: We employed an image vector comparison method to identify and remove duplicate samples, ensuring that each data point is unique.
    • Correction of Mislabeled Images: Using our domain knowledge, we carefully inspected and corrected falsely labeled images, ensuring that they were appropriately categorized. This step greatly enhances the accuracy of the dataset.
• Image Resizing: All images in the dataset were resized to a memory-efficient yet academically accepted size of (224, 224), facilitating easier processing and analysis.

Statistics: Before the cleaning pipeline, the dataset contained the following number of samples for each class from the initial data source:
    • Normal: 500
    • Glioma: 926
    • Meningioma: 937
    • Pituitary: 901

    After applying the data cleaning pipeline, the number of samples in each category decreased on average by approximately 3-9%. This reduction ensures the data integrity while maintaining a sufficient number of samples for comprehensive analysis.

    Data Augmentation:

To enhance the diversity and robustness of the dataset, we employed various image augmentation techniques. These techniques were applied to the images without altering the labels. Here is a summary of the augmentation methods used:
• Salt and Pepper Noise: Introducing random noise by setting pixels to white or black based on a specified intensity.
• Histogram Equalization: Applying histogram equalization to enhance the contrast and details in the images.
• Rotation: Rotating the images clockwise or counterclockwise by a specified angle.
• Brightness Adjustment: Modifying the brightness of the images by adding or subtracting intensity values.
• Horizontal and Vertical Flipping: Flipping the images horizontally or vertically to create mirror images.
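The dataset authors do not state which tools they used; as a hedged sketch, augmentations like these can be reproduced in R with the magick package (file names and parameter values below are illustrative):

library(magick)

img <- image_read("glioma_0001.jpg")      # illustrative file name
img <- image_resize(img, "224x224")       # match the dataset's (224, 224) size

noisy      <- image_noise(img, noisetype = "Impulse")   # salt-and-pepper noise
equalized  <- image_equalize(img)                       # histogram equalization
rotated    <- image_rotate(img, 15)                     # rotate by 15 degrees
brightened <- image_modulate(img, brightness = 115)     # brightness adjustment
flipped_h  <- image_flop(img)                           # horizontal flip
flipped_v  <- image_flip(img)                           # vertical flip

image_write(rotated, "glioma_0001_rot15.jpg")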

    Use Cases and Potential Investigations:

    This dataset offers significant potential for various advanced medical research and analysis applications. Some interesting use cases and potential investigations using this dataset include: - Tumor Classification: Developing advanced machine learning models for accurate and automated brain tumor classification. - Treatment Planning: Analyzing the tumor characteristics to aid in treatment planning and decision-making processes. - Radiomics Analysis: Extracting quantitative features from the images for radiomics analysis to uncover valuable insights and patterns. - Comparative Studies: Conducting comparative studies among different tumor types to understand their unique characteristics and behaviors.

    Acknowledgement

    • We would like to express our sincere gratitude to the original dataset publisher, sartajbhuvaji, for their valuable contribution.
    • This dataset is released under the CC0 license, making it open and accessible for everyone to use. While not mandatory, citing the dataset would be greatly appreciated.
    Important Note

Researchers who want to use this dataset for real-world use cases must consult medical-field experts (radiologists, ...) about the ground truth of the labels and their usability for the intended angle of research.

  11. Restaurant and Consumer Recommendation System Data

    • kaggle.com
    zip
    Updated Jan 6, 2023
    Cite
    The Devastator (2023). Restaurant and Consumer Recommendation System Data [Dataset]. https://www.kaggle.com/datasets/thedevastator/restaurant-and-consumer-context-aware-recommenda
    Explore at:
zip (55179 bytes); available download formats
    Dataset updated
    Jan 6, 2023
    Authors
    The Devastator
    Description

    Restaurant and Consumer Recommendation System Data

    User Preferences, Restaurant Details, Payment Options and Ratings

    By UCI [source]

    About this dataset

    Welcome to the Restaurant and Consumer Data for Context-Aware dataset! This dataset was obtained from a recommender system prototype, where the task was to generate a top-n list of restaurants based on user preferences. The data represented here consists of two different approaches used: one through collaborative filtering techniques and another with a contextual approach.

In total, this dataset contains nine files covering instances, attributes, and associated information. The restaurant attributes comprise chefmozaccepts.csv, chefmozcuisine.csv, chefmozhours4.csv, and chefmozparking.csv, while geoplaces2.csv carries additional attributes such as whether alcohol is served. The consumer attributes are usercuisine.csv, userpayment.csv, and userprofile.csv, and the final file is rating_final.csv, which contains userID, placeID, rating, food_rating, and service_rating. Attribute information includes place ID (nominal), payment methods (nominal: cash, Visa, MasterCard, etc.), cuisine types (nominal: Afghan, African, American, etc.), hours of operation (nominal range 00:00-23:30), and the days each restaurant operates (Mon-Sun). Further features include latitude and longitude for geospatial representation; whether alcohol is served; smoking-area and dress-code stipulations; accessibility for disabled patrons; and website URLs for each restaurant, along with their respective ratings. Consumer information includes smoker status (True/False), dress preference, marital status, temperature preference, birth year, interests, and the budget constraints a diner weighs when choosing among restaurant options. Interesting trends in this dataset can provide useful insights for developing a context-aware recommender system that accurately predicts customer choices.


    How to use the dataset

    The dataset consists of 9 different files with various attributes related to each restaurant, consumer, and user/item/rating information. Each file is described in detail below:

    • chefmozaccepts.csv: This file contains placeID and payment information for each restaurant in the dataset.
    • chefmozcuisine.csv: This file contains place ID and cuisine type for each restaurant in the dataset.
    • chefmozhours4.csv:This file contains the hours and days of operation for each restaurant in the dataset
• chefmozparking.csv: This file contains placeID along with parking-lot information for each restaurant in the dataset.
• geoplaces2.csv: This file provides location data (latitude, longitude, etc.) for all restaurants in the dataset.
• rating_final.csv: This file contains userID along with the ratings given by users (food quality, service quality, etc.) at various places.
• usercuisine.csv: This file consists of user ID along with cuisine-type preferences.
• userpayment.csv: This file provides the payment methods used by users at different places.
• userprofile.csv: This file consists of profile data such as smoker status, activity level, and budget range, coupled with user IDs.

Now that you have a better understanding of the data set, let's go over some simple steps you can take to use it effectively:

1. Clean & Organize Your Data: Before using this data, make sure you first clean it up by removing any duplicates or inconsistencies. You'll also want to parse it into a format that makes sense, such as a CSV/JSON document or a table, so that it is easy to work with when you run your algorithms.

2. Analyze Your Data: Once you have organized your data, it's time to analyze it and see what insights can be found. Use methods like clustering and statistical tests, as well as machine learning techniques such as linear regression and random forest models, to understand the dataset better.

3. Generate Recommendations: With these techniques mastered, it's time to generate a recommendation system over the dataset, which is exactly what this tutorial was created to do. Utilize Collaborative Filtering & Content ...
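A hedged starter sketch in R for steps 1 and 2, joining the ratings with restaurant cuisines and averaging ratings per cuisine; the rating_final.csv columns are documented above, while the cuisine column name in chefmozcuisine.csv is an assumption handled with a positional rename:

library(dplyr)
library(readr)

ratings  <- read_csv("rating_final.csv")    # userID, placeID, rating, food_rating, service_rating
cuisines <- read_csv("chefmozcuisine.csv")  # placeID plus a cuisine-type column

# The cuisine column name is assumed; rename the second column whatever it is actually called.
cuisines <- rename(cuisines, cuisine = 2)

ratings %>%
  distinct() %>%                            # drop exact duplicate rows first
  inner_join(cuisines, by = "placeID") %>%
  group_by(cuisine) %>%
  summarise(
    n_ratings   = n(),
    avg_rating  = mean(rating),
    avg_food    = mean(food_rating),
    avg_service = mean(service_rating),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_rating))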

    Research Ideas

    • Analyzing restaurant customer preferences by creating a model that predicts the type of cuisine a customer is likely to order.
    • Developing an algorithm to ide...
  12. Adventure Works 2022 CSVs

    • kaggle.com
    zip
    Updated Nov 2, 2022
    Cite
    Algorismus (2022). Adventure Works 2022 CSVs [Dataset]. https://www.kaggle.com/datasets/algorismus/adventure-works-in-excel-tables
    Explore at:
zip (567646 bytes); available download formats
    Dataset updated
    Nov 2, 2022
    Authors
    Algorismus
    License

http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Adventure Works 2022 dataset

    How this Dataset is created?

On the official website, the dataset is available via SQL Server (localhost) and as CSVs to be used with Power BI Desktop running on a virtual lab (virtual machine). The first two steps of importing the data were executed in the virtual lab, and the resulting Power BI tables were then copied into CSVs. Records were added up to the year 2022 as required.

    How this Dataset may help you?

This dataset will be helpful if you want to work offline with Adventure Works data in Power BI Desktop in order to carry out the lab instructions in the training material on the official website. It is also useful if you want to work through the Power BI Desktop Sales Analysis example from Microsoft's PL-300 learning path.

    How to use this Dataset?

Download the CSV file(s) and import them into Power BI Desktop as tables. The CSVs are named after the tables created in the first two steps of importing data, as described in the PL-300 Microsoft Power BI Data Analyst exam lab.

  13. iFood Marketing Campaigns Analysis

    • kaggle.com
    zip
    Updated Aug 17, 2023
    Cite
    Ahmad Fayez (2023). iFood Marketing Campaigns Analysis [Dataset]. https://www.kaggle.com/datasets/fayez7/ifood-marketing-campaigns
    Explore at:
    zip(295368 bytes)Available download formats
    Dataset updated
    Aug 17, 2023
    Authors
    Ahmad Fayez
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This data is part of data published on GitHub in this link

    File Name: ifood_df.csv

    The data covers the results of five marketing campaigns run by a food company and how each customer interacted with those campaigns, along with demographic data about customers such as income, age, education level, marital status, number of children and teenagers, and other customer-related attributes.

    I downloaded it to explore, clean, and transform it with Microsoft Excel, and then visualize and analyze it with Python. First of all, in the exploration phase: understand each column and the relationships between columns, and define the important questions that lead to recommendations about the marketing campaigns.

    In the cleaning phase: delete the columns “Z_CostContact” and “Z_Revenue” because they contain fixed values and are not important for my questions. Delete the column “Response” because it is not used in my analysis and I could not find what it stands for. Then check for missing data (the data is complete), check for duplicates (every row is unique), and check for accuracy (the values are correct and logical). Note that the data is not current: it covers more than 2,000 customers and dates from 2020.

    Overall, after the cleaning process the data is accurate, complete, consistent, relevant, valid, and unique, but it still needs some transformation.

    In the transformation phase: add an “Index” column as a unique identifier for each customer, aggregate all marital-status flags into one column, aggregate all education-level flags into one column, and rearrange some columns such as the campaigns and totals.
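
    For reference, here is a minimal R sketch of equivalent transformation steps (the author worked in Excel and Python). The one-hot column prefixes marital_ and education_ are assumptions based on the field list below, not confirmed against the file.

    library(dplyr)
    library(tidyr)
    library(readr)

    ifood <- read_csv("ifood_df.csv") %>%
      select(-any_of(c("Z_CostContact", "Z_Revenue", "Response"))) %>%  # fixed / unused columns
      mutate(Index = row_number())                                      # unique identifier per customer

    # Collapse the assumed one-hot marital_* and education_* flags into single columns.
    ifood <- ifood %>%
      pivot_longer(starts_with("marital_"), names_to = "marital_status",
                   names_prefix = "marital_", values_to = "m_flag") %>%
      filter(m_flag == 1) %>% select(-m_flag) %>%
      pivot_longer(starts_with("education_"), names_to = "education",
                   names_prefix = "education_", values_to = "e_flag") %>%
      filter(e_flag == 1) %>% select(-e_flag)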

    Fields Description:

    • Index: unique identifier for each customer.
    • Income: the yearly income for each customer.
    • Kidhome: number of small children in the customer’s household.
    • Teenhome: number of teenagers in the customer’s household.
    • Recency: number of days since the last purchase.
    • MntWines: amount of wine purchased in the last 2 years.
    • MntFruits: amount of fruit purchased in the last 2 years.
    • MntMeatProducts: amount of meat purchased in the last 2 years.
    • MntFishProducts: amount of fish purchased in the last 2 years.
    • MntSweetProducts: amount of sweets purchased in the last 2 years.
    • MntRegularProds: amount of regular products purchased in the last 2 years.
    • MntGoldProds: amount of special products purchased in the last 2 years.
    • MntTotal: total amount of everything purchased in the last 2 years.
    • NumDealsPurchases: number of purchases made with a discount.
    • NumWebPurchases: number of purchases made through the company’s website.
    • NumCatalogPurchases: number of purchases made using a catalog.
    • NumStorePurchases: number of purchases made directly in the store.
    • NumWebVisitsMonth: number of visits to the company’s website in the last month.
    • AcceptedCmp1: 1 if the customer accepted the offer in the first campaign, 0 otherwise.
    • AcceptedCmp2: 1 if the customer accepted the offer in the second campaign, 0 otherwise.
    • AcceptedCmp3: 1 if the customer accepted the offer in the third campaign, 0 otherwise.
    • AcceptedCmp4: 1 if the customer accepted the offer in the fourth campaign, 0 otherwise.
    • AcceptedCmp5: 1 if the customer accepted the offer in the fifth campaign, 0 otherwise.
    • AcceptedCmpOverall: total number of marketing campaigns the customer accepted.
    • Complain: whether or not the customer complained in the last 2 years.
    • Age: the customer’s age.
    • Customer_Days: days since registration.
    • marital_status: the customer’s marital status.
    • education: the customer’s level of education.

  14. Synthetic Financial Datasets For Fraud Detection

    • kaggle.com
    zip
    Updated Apr 3, 2017
    Cite
    Edgar Lopez-Rojas (2017). Synthetic Financial Datasets For Fraud Detection [Dataset]. https://www.kaggle.com/datasets/ealaxi/paysim1
    Explore at:
    zip(186385561 bytes)Available download formats
    Dataset updated
    Apr 3, 2017
    Authors
    Edgar Lopez-Rojas
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    There is a lack of publicly available datasets on financial services, especially in the emerging mobile money transactions domain. Financial datasets are important to many researchers, and in particular to those of us performing research in fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.

    We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

    Content

    PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that provides the mobile financial service, which is currently running in more than 14 countries around the world.

    This synthetic dataset is scaled down to 1/4 of the original dataset, and it was created just for Kaggle.

    NOTE: Transactions which are detected as fraud are cancelled, so for fraud detection these columns (oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest ) must not be used.

    Headers

    This is a sample of 1 row with headers explanation:

    1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0

    step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).

    type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

    amount - amount of the transaction in local currency.

    nameOrig - customer who started the transaction

    oldbalanceOrg - initial balance before the transaction

    newbalanceOrig - new balance after the transaction.

    nameDest - customer who is the recipient of the transaction

    oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants).

    newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants).

    isFraud - These are the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent behaviour aims to profit by taking control of customers' accounts and emptying the funds by transferring them to another account and then cashing out of the system.

    isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
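
    As a quick orientation, the R sketch below reads the file and summarises fraud counts and rates by transaction type, leaving the balance columns aside as the note above recommends. The file name paysim.csv is a placeholder for the downloaded CSV.

    library(dplyr)
    library(readr)

    paysim <- read_csv("paysim.csv")   # placeholder file name

    # Fraud counts and rates by transaction type.
    paysim %>%
      group_by(type) %>%
      summarise(
        n_transactions = n(),
        n_fraud        = sum(isFraud),
        fraud_rate     = mean(isFraud)
      ) %>%
      arrange(desc(fraud_rate))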

    Past Research

    There are 5 similar files that contain the runs of 5 different scenarios. These files are explained in more detail in chapter 7 of my PhD thesis (available at http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932).

    We ran PaySim several times using random seeds for 744 steps, representing each hour of one month of real time, which matches the original logs. Each run took around 45 minutes on an Intel i7 processor with 16 GB of RAM. The final result of a run contains approximately 24 million financial records divided into the 5 categories: CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.

    Acknowledgements

    This work is part of the research project ”Scalable resource-efficient systems for big data analytics” funded by the Knowledge Foundation (grant: 20140032) in Sweden.

    Please refer to this dataset using the following citations:

    PaySim first paper of the simulator:

    E. A. Lopez-Rojas , A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus. 2016


Cite
Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions

Google Data Analytics Case Study Cyclistic

Difference between Casual vs Member in Cyclistic Riders

Explore at:
zip(1299 bytes)Available download formats
Dataset updated
Sep 27, 2022
Authors
Udayakumar19
Description

Introduction

Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

How do annual members and casual riders use Cyclistic bikes differently?

Guiding Question:

What is the problem you are trying to solve?
  How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
  The insight will help the marketing team to make a strategy for casual riders

Prepare

Guiding Question:

Where is your data located?
  Data located in Cyclistic organization data.

How is data organized?
  Dataset are in csv format for each month wise from Financial year 22.

Are there issues with bias or credibility in this data? Does your data ROCCC? 
  It is good it is ROCCC because data collected in from Cyclistic organization.

How are you addressing licensing, privacy, security, and accessibility?
  The company has their own license over the dataset. Dataset does not have any personal information about the riders.

How did you verify the data’s integrity?
  All the files have consistent columns and each column has the correct type of data.

How does it help you answer your questions?
  Insights always hidden in the data. We have the interpret with data to find the insights.

Are there any problems with the data?
  Yes, starting station names, ending station names have null values.

Process

Guiding Question:

What tools are you choosing and why?
  I used R studio for the cleaning and transforming the data for analysis phase because of large dataset and to gather experience in the language.

Have you ensured the data’s integrity?
 Yes, the data is consistent throughout the columns.

What steps have you taken to ensure that your data is clean?
  First, duplicates and null values are removed, then new columns are added for the analysis.

How can you verify that your data is clean and ready to analyze?
  Make sure the column names are consistent throughout all the datasets so they can be combined with the bind_rows() function.
  Make sure the column data types are consistent throughout all the datasets by using compare_df_cols() from the janitor package.
  Combine all the datasets into a single data frame so the analysis stays consistent.
  Remove the columns start_lat, start_lng, end_lat, and end_lng from the data frame because they are not required for the analysis.
  Create new columns day, date, month, and year from the started_at column; this provides additional opportunities to aggregate the data.
  Create a ride_length column from the started_at and ended_at columns to find the average ride duration.
  Remove rows with null values from the dataset using the na.omit() function (see the R sketch below).
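
A minimal R sketch of the cleaning steps above, assuming the monthly CSVs sit in a data/ folder (a placeholder path) and that read_csv() parses started_at and ended_at as date-times; this is an illustration, not the exact script used:

library(dplyr)
library(readr)
library(lubridate)
library(janitor)

# Read the monthly CSVs (the "data" folder is a placeholder path).
files <- list.files(path = "data", pattern = "\\.csv$", full.names = TRUE)
trips <- lapply(files, read_csv)

# Check that column names and types match across the monthly files.
compare_df_cols(trips)

all_trips <- bind_rows(trips) %>%
  select(-start_lat, -start_lng, -end_lat, -end_lng) %>%   # columns not required for analysis
  mutate(
    date        = as.Date(started_at),
    day         = wday(started_at, label = TRUE),
    month       = month(started_at, label = TRUE),
    year        = year(started_at),
    ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))
  ) %>%
  na.omit()                                                # drop rows with null values
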
Have you documented your cleaning process so you can review and share those results? 
  Yes, the cleaning process is documented clearly.

Analyze Phase:

Guiding Questions:

How should you organize your data to perform analysis on it?
  The data has been organized into one single data frame by using the read_csv() function in R.
Has your data been properly formatted?
  Yes, all the columns have their correct data type.

What surprises did you discover in the data?
  Casual riders' average ride duration is higher than annual members'.
  Casual riders use docked bikes far more than annual members do.
What trends or relationships did you find in the data?
  Annual members ride mainly for commuting.
  Casual riders prefer docked bikes.
  Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
  These insights help build a profile of each rider type (see the R sketch below for the underlying aggregation).
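
A minimal R sketch of the aggregation behind these observations, assuming the cleaned all_trips data frame from the sketch above and that rider type and bike type are stored in columns named member_casual and rideable_type (assumed names, not stated in this description):

library(dplyr)

# Ride counts and average duration by rider type and day of week.
all_trips %>%
  group_by(member_casual, day) %>%
  summarise(rides = n(), avg_ride_length = mean(ride_length), .groups = "drop")

# Bike-type preference by rider type.
all_trips %>%
  count(member_casual, rideable_type)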

Share

Guiding Questions:

Were you able to answer the question of how ...