26 datasets found
  1. Google Capstone Project - BellaBeats

    • kaggle.com
    Updated Jan 5, 2023
    Cite
    Jason Porzelius (2023). Google Capstone Project - BellaBeats [Dataset]. https://www.kaggle.com/datasets/jasonporzelius/google-capstone-project-bellabeats
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jason Porzelius
    Description

    Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeats, Inc., using a locally installed spreadsheet program, Excel, for both my data analysis and my visualizations. This choice was made primarily because I live in a remote area and have limited bandwidth and inconsistent internet access. Therefore, completing a capstone project using web-based programs such as RStudio, SQL Workbench, or Google Sheets was not feasible. I was further limited in which option to choose because the datasets for the ride-share project option were larger than my version of Excel would accept. In the scenario provided, I will be acting as a Junior Data Analyst in support of the Bellabeats, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in the hope that the findings from that dataset might reveal insights which will assist in Bellabeats' marketing strategies for future growth. My task is to provide data-driven insights for the business tasks provided by the Bellabeats, Inc. executive and data analytics team. In order to accomplish this task, I will complete all parts of the Data Analysis Process (Ask, Prepare, Process, Analyze, Share, Act). In addition, I will break each part of the Data Analysis Process down into three sections to provide clarity and accountability: Guiding Questions, Key Tasks, and Deliverables. For the sake of space and to avoid repetition, I will record the deliverables for each Key Task directly under the numbered Key Task, using an asterisk (*) as an identifier.

    Section 1 - Ask: A. Guiding Questions: Who are the key stakeholders and what are their goals for the data analysis project? What is the business task that this data analysis project is attempting to solve?

    B. Key Tasks:
    1. Identify key stakeholders and their goals for the data analysis project.
    *The key stakeholders for this project are as follows:
    - Urška Sršen and Sando Mur, co-founders of Bellabeats, Inc.
    - The Bellabeats marketing analytics team, of which I am a member.
    2. Identify the business task.
    *As provided by co-founder Urška Sršen, the business task for this project is to gain insight into how consumers are using their non-BellaBeats smart devices in order to guide upcoming marketing strategies for the company and help drive future growth. Specifically, the researcher was tasked with applying insights produced by the data analysis process to one BellaBeats product and presenting those insights to BellaBeats stakeholders.

    Section 2 - Prepare: A. Guiding Questions: Where is the data stored and organized? Are there any problems with the data? How does the data help answer the business question?

    B. Key Tasks:
    1. Research and communicate to stakeholders the source of the data and how it is stored/organized.
    *The data source used for our case study is the FitBit Fitness Tracker Data. This dataset is stored on Kaggle and was made available by the user Mobius in an open-source format. The data is therefore public and may be copied, modified, and distributed without asking the user for permission.
    *These datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk, reportedly (see credibility section directly below) between 03/12/2016 and 05/12/2016.
    *Reportedly (see credibility section directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled. This data was broken down into minute-, hour-, and day-level totals and is stored in 18 CSV files. I downloaded all 18 files to my local laptop and decided to use 2 of them for the purposes of this project, as they merge the activity and sleep data from the other files. All unused files were permanently deleted from the laptop. The 2 files used were:
    - sleepDay_merged.csv
    - dailyActivity_merged.csv
    2. Identify and communicate to stakeholders any problems found with the data related to credibility and bias.
    *As will be presented more specifically in the Process section, the data appears to have credibility issues related to the reported time frame of the data collected. The metadata indicates that the data covers roughly 2 months of FitBit tracking; however, upon my initial data processing, I found that only 1 month of data was reported.
    *As will be presented more specifically in the Process section, the data has credibility issues related to the number of individuals who reported FitBit data. Specifically, the metadata states that 30 individual users agreed to report their tracking data, but my initial data processing uncovered 33 individual IDs in the dailyActivity_merged dataset.
    *Due to the small number of participants (...
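    Below is a minimal sketch, in pandas, of the ID-count and date-range checks described above (the original analysis was done entirely in Excel). The file and column names (dailyActivity_merged.csv, Id, ActivityDate) are taken from the Kaggle FitBit Fitness Tracker Data and are assumptions here.

    import pandas as pd

    # Load the merged daily activity file from the FitBit Fitness Tracker Data.
    daily = pd.read_csv("dailyActivity_merged.csv", parse_dates=["ActivityDate"])

    # Credibility check 1: how many distinct participants actually reported data?
    print("Unique participant IDs:", daily["Id"].nunique())   # metadata says 30; the file shows 33

    # Credibility check 2: what date range does the file really cover?
    print("Date range:", daily["ActivityDate"].min(), "to", daily["ActivityDate"].max())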

  2. NOAA GSOD

    • kaggle.com
    zip
    Updated Aug 30, 2019
    + more versions
    Cite
    NOAA (2019). NOAA GSOD [Dataset]. https://www.kaggle.com/datasets/noaa/gsod
    Explore at:
    zip (0 bytes). Available download formats
    Dataset updated
    Aug 30, 2019
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Authors
    NOAA
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    Global Surface Summary of the Day is derived from The Integrated Surface Hourly (ISH) dataset. The ISH dataset includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries.

    Content

    Over 9000 stations' data are typically available.

    The daily elements included in the dataset (as available from each station) are:
    - Mean temperature (.1 Fahrenheit)
    - Mean dew point (.1 Fahrenheit)
    - Mean sea level pressure (.1 mb)
    - Mean station pressure (.1 mb)
    - Mean visibility (.1 miles)
    - Mean wind speed (.1 knots)
    - Maximum sustained wind speed (.1 knots)
    - Maximum wind gust (.1 knots)
    - Maximum temperature (.1 Fahrenheit)
    - Minimum temperature (.1 Fahrenheit)
    - Precipitation amount (.01 inches)
    - Snow depth (.1 inches)

    Indicator for occurrence of: Fog, Rain or Drizzle, Snow or Ice Pellets, Hail, Thunder, Tornado/Funnel

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.noaa_gsod.[TABLENAME]. Fork this kernel to get started and to learn how to safely manage analyzing large BigQuery datasets.
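    As a minimal sketch (not part of the dataset description), the annual GSOD tables can be queried with the BigQuery Python client roughly as follows; the table name gsod2019 and the 99.99 missing-value sentinel for precipitation are assumptions based on the public noaa_gsod tables.

    from google.cloud import bigquery

    client = bigquery.Client()  # assumes BigQuery credentials are already configured (e.g., a Kaggle kernel)

    query = """
        SELECT stn, year, mo, da, temp, prcp
        FROM `bigquery-public-data.noaa_gsod.gsod2019`
        WHERE prcp < 99.99   -- 99.99 marks missing precipitation in GSOD
        ORDER BY temp DESC
        LIMIT 10
    """

    # Print the ten hottest station-days that have a valid precipitation value.
    for row in client.query(query).result():
        print(row.stn, row.year, row.mo, row.da, row.temp, row.prcp)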

    Acknowledgements

    This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and present, collected from over 9000 stations. Dataset Source: NOAA

    Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Photo by Allan Nygren on Unsplash

  3. Coronavirus (COVID-19) Mobility Report

    • data.europa.eu
    Updated Mar 21, 2021
    + more versions
    Cite
    Greater London Authority (2021). Poročilo o mobilnosti zaradi koronavirusa (COVID-19) [Dataset]. https://data.europa.eu/data/datasets/coronavirus-covid-19-mobility-report?locale=sl
    Explore at:
    Dataset updated
    Mar 21, 2021
    Dataset authored and provided by
    Greater London Authority (http://www.london.gov.uk/)
    Description

    Due to changes in the collection and availability of data on COVID-19, this website will no longer be updated. The webpage will no longer be available as of 11 May 2023. On-going, reliable sources of data for COVID-19 are available via the COVID-19 dashboard and the UKHSA

    GLA Covid-19 Mobility Report

    Since March 2020, London has seen many different levels of restrictions - including three separate lockdowns and many other tiers/levels of restrictions, as well as easing of restrictions and even measures to actively encourage people to go to work, their high streets and local restaurants. This report gathers data from a number of sources, including Google, Apple, Citymapper, Purple WiFi and OpenTable, to assess the extent to which these levels of restrictions have translated into a reduction in Londoners' movements.

    The data behind the charts below come from different sources. None of these data represent a direct measure of how well people are adhering to the lockdown rules - nor do they provide an exhaustive data set. Rather, they are measures of different aspects of mobility which, taken together, offer an overall impression of how Londoners are moving around the capital. The information is broken down by use of public transport, pedestrian activity, retail and leisure, and homeworking.

    Public Transport

    For the transport measures, we have included data from Google, Apple, Citymapper and Transport for London. They measure different aspects of public transport usage, depending on the data source. Each of the lines in the chart below represents a percentage of a pre-pandemic baseline.


    activity | Source | Latest | Baseline | Min value in Lockdown 1 | Min value in Lockdown 2 | Min value in Lockdown 3
    Citymapper | Citymapper mobility index | 2021-09-05 | Compares trips planned and trips taken within its app to a baseline of the four weeks from 6 Jan 2020 | 7.9% | 28% | 19%
    Google | Google Mobility Report | 2022-10-15 | Location data shared by users of Android smartphones, comparing time and duration of visits to locations to the median values on the same day of the week in the five weeks from 3 Jan 2020 | 20.4% | 40% | 27%
    TfL Bus | Transport for London | 2022-10-30 | Bus journey ‘taps' on the TfL network compared to same day of the week in four weeks starting 13 Jan 2020 | - | 34% | 24%
    TfL Tube | Transport for London | 2022-10-30 | Tube journey ‘taps' on the TfL network compared to same day of the week in four weeks starting 13 Jan 2020 | - | 30% | 21%

    Pedestrian activity

    With the data we currently have it's harder to estimate pedestrian activity and high street busyness. A few indicators can give us information on how people are making trips out of the house:


    activity | Source | Latest | Baseline | Min value in Lockdown 1 | Min value in Lockdown 2 | Min value in Lockdown 3
    Walking | Apple Mobility Index | 2021-11-09 | Estimates the frequency of trips made on foot compared to baseline of 13 Jan '20 | 22% | 47% | 36%
    Parks | Google Mobility Report | 2022-10-15 | Frequency of trips to parks. Changes in the weather mean this varies a lot. Compared to baseline of 5 weeks from 3 Jan '20 | 30% | 55% | 41%
    Retail & Rec | Google Mobility Report | 2022-10-15 | Estimates frequency of trips to shops/leisure locations. Compared to baseline of 5 weeks from 3 Jan '20 | 30% | 55% | 41%

    Retail and recreation

    In this section, we focus on estimated footfall to shops, restaurants, cafes, shopping centres and so on.


    activity | Source | Latest | Baseline | Min value in Lockdown 1 | Min value in Lockdown 2 | Min value in Lockdown 3
    Grocery/pharmacy | Google Mobility Report | 2022-10-15 | Estimates frequency of trips to grocery shops and pharmacies. Compared to baseline of 5 weeks from 3 Jan '20 | 32% | 55% | 45%
    Retail/rec | <a href="https://ww

  4. Chicago Crime

    • kaggle.com
    zip
    Updated Apr 17, 2018
    Cite
    City of Chicago (2018). Chicago Crime [Dataset]. https://www.kaggle.com/datasets/chicago/chicago-crime
    Explore at:
    zip (0 bytes). Available download formats
    Dataset updated
    Apr 17, 2018
    Dataset authored and provided by
    City of Chicago
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    Context

    Approximately 10 people are shot on an average day in Chicago.

    http://www.chicagotribune.com/news/data/ct-shooting-victims-map-charts-htmlstory.html http://www.chicagotribune.com/news/local/breaking/ct-chicago-homicides-data-tracker-htmlstory.html http://www.chicagotribune.com/news/local/breaking/ct-homicide-victims-2017-htmlstory.html

    Content

    This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. This data includes unverified reports supplied to the Police Department. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time.

    Update Frequency: Daily

    Fork this kernel to get started.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:chicago_crime

    https://cloud.google.com/bigquery/public-data/chicago-crime-data

    Dataset Source: City of Chicago

    This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source —https://data.cityofchicago.org — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by Ferdinand Stohr from Unsplash.

    Inspiration

    What categories of crime exhibited the greatest year-over-year increase between 2015 and 2016?

    Which month generally has the greatest number of motor vehicle thefts?

    How does temperature affect the incident rate of violent crime (assault or battery)?
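    A hedged sketch of answering the motor-vehicle-theft question with the public BigQuery copy of this dataset; the table path bigquery-public-data.chicago_crime.crime and the primary_type value are assumptions based on the public table, not taken from this description.

    from google.cloud import bigquery

    client = bigquery.Client()  # assumes BigQuery credentials are configured

    query = """
        SELECT EXTRACT(MONTH FROM date) AS month, COUNT(*) AS thefts
        FROM `bigquery-public-data.chicago_crime.crime`
        WHERE primary_type = 'MOTOR VEHICLE THEFT'
        GROUP BY month
        ORDER BY thefts DESC
    """

    # Months ranked by total reported motor vehicle thefts across all years.
    for row in client.query(query).result():
        print(row.month, row.thefts)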

    https://cloud.google.com/bigquery/images/chicago-scatter.png

  5. My analysis of the "bike share" data: Google S.

    • kaggle.com
    zip
    Updated May 2, 2023
    Cite
    Lamar McMillan (2023). My analysis of the "bike share" data: Google S. [Dataset]. https://www.kaggle.com/datasets/lamarmcmillan/my-analysis-of-the-bike-share-data-google-s
    Explore at:
    zip (76673 bytes). Available download formats
    Dataset updated
    May 2, 2023
    Authors
    Lamar McMillan
    Description

    Context

    One analysis done in spreadsheets with the 202004 and 202005 data.

    Content

    To adjust for outlier ride lengths, note the max and min below:
    - Max RL: =MAX(N:N) returns 978:40:02
    - Min RL: =MIN(N:N) returns -0:02:56

    TRIMMEAN shaves off the top and bottom of a dataset:
    - =TRIMMEAN(N:N,5%) returns 0:20:20
    - =TRIMMEAN(N:N,2%) returns 0:21:27

    Otherwise, the average ride length (Average RL) for 202004 is 0:35:51.

    The most common day of the week is Sunday (mode of DOW = 1). There are 61,148 members and 23,628 casual riders (COUNTIF of member in member_casual = 61,148; COUNTIF of casual in member_casual = 23,628).

    Pivot table 1 (2020-04): member_casual vs AVERAGE of ride_length.

    Same calculations for 2020-05:
    - Average RL: 0:33:23
    - Max RL: 481:36:53
    - Min RL: -0:01:48
    - mode of DOW: 7
    - COUNTIF of member in member_casual: 113,365
    - COUNTIF of casual in member_casual: 86,909
    - TRIMMEAN: 0:25:22, 0:26:59

    There are 4 pivot tables included in separate sheets for other comparisons.
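    A minimal sketch of the same trimmed-mean outlier adjustment outside of Excel, assuming a trips table with a ride_length column already converted to seconds (the file and column names are assumptions, not from the dataset):

    import pandas as pd
    from scipy import stats

    trips = pd.read_csv("202004-divvy-tripdata.csv")  # file name hypothetical
    ride_len = trips["ride_length"].astype(float)     # assumes ride_length is in seconds

    # Excel's TRIMMEAN(range, 5%) drops 5% of values in total (2.5% from each tail);
    # SciPy's trim_mean takes the fraction to cut from each tail, hence 0.025.
    trimmed = stats.trim_mean(ride_len, proportiontocut=0.025)
    print("Trimmed mean ride length (seconds):", trimmed)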

    Acknowledgements

    I gathered this data using the sources provided by the Google Data Analytics course. All work seen is done by myself.

    Inspiration

    I want to further use the data in SQL, and Tableau.

  6. COVID-19 Community Mobility Reports

    • google.com
    • google.com.tr
    • +6more
    csv, pdf
    Updated Oct 17, 2022
    + more versions
    Cite
    Google (2022). COVID-19 Community Mobility Reports [Dataset]. https://www.google.com/covid19/mobility/
    Explore at:
    csv, pdf. Available download formats
    Dataset updated
    Oct 17, 2022
    Dataset authored and provided by
    Google (http://google.com/)
    Description

    As global communities responded to COVID-19, we heard from public health officials that the same type of aggregated, anonymized insights we use in products such as Google Maps would be helpful as they made critical decisions to combat COVID-19. These Community Mobility Reports aimed to provide insights into what changed in response to policies aimed at combating COVID-19. The reports charted movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.

  7. Wikipedia, Global Oil Refineries, World, 2.3.2004

    • geocommons.com
    Updated Apr 29, 2008
    Cite
    data (2008). Wikipedia, Global Oil Refineries, World, 2.3.2004 [Dataset]. http://geocommons.com/search.html
    Explore at:
    Dataset updated
    Apr 29, 2008
    Dataset provided by
    Wikipedia
    data
    Description

    This is a data set of oil refineries around the globe from the Google Earth BBS, posted on Feb 3rd, 2004. The original creator of the data set posted a set of caveats to the data on the Google BBS (http://bbs.keyhole.com/ubb/showflat.php/Cat/0/Number/142111/): Here are placemarks for most of the world's crude oil refineries and their capacities. There is no way I got them all, and some are probably not in the exact location. Those include refineries that are grouped together, and in very low resolution areas. Please point out any incorrect locations and refineries not listed (with their capacities), because help is needed especially in these areas:
    - Japan: Missing many, and the ones I have marked are probably not in the correct location.
    - China: Missing many, mostly the smaller CNPC (PetroChina) ones.
    - Russia: Must be missing some.
    - France: Same.
    - Italy: Same.
    - Germany: Maybe a few here too.
    - Middle East: Iraq, and some smaller countries not listed.
    You can see most of this in list form at: http://en.wikipedia.org/wiki/List_of_oil_refineries

  8. Data (i.e., evidence) about evidence based medicine

    • figshare.com
    • search.datacite.org
    png
    Updated May 30, 2023
    Cite
    Jorge H Ramirez (2023). Data (i.e., evidence) about evidence based medicine [Dataset]. http://doi.org/10.6084/m9.figshare.1093997.v24
    Explore at:
    png. Available download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jorge H Ramirez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Update — December 7, 2014. – Evidence-based medicine (EBM) is not working, for many reasons, for example: 1. It is incorrect in its foundations (a paradox): hierarchical levels of evidence are supported by opinions (i.e., the lowest strength of evidence according to EBM) instead of real data collected from different types of study designs (i.e., evidence). http://dx.doi.org/10.6084/m9.figshare.1122534 2. The effect of criminal practices by pharmaceutical companies is only possible because of the complicity of others: healthcare systems, professional associations, governmental and academic institutions. Pharmaceutical companies also corrupt at the personal level: politicians and political parties are on their payroll, and medical professionals are seduced by different types of gifts in exchange for prescriptions (i.e., bribery), which very likely results in patients not receiving the proper treatment for their disease; many times there is no such thing: healthy persons not needing pharmacological treatments of any kind are constantly misdiagnosed and treated with unnecessary drugs. Some medical professionals are converted into K.O.L.s (key opinion leaders), puppets appearing on stage to spread lies to their peers; a person supposedly trained to improve the well-being of others now deceives on behalf of pharmaceutical companies. Probably the saddest thing is that many honest doctors are being misled by these lies, created by the rules of pharmaceutical marketing instead of scientific, medical, and ethical principles. Interpretation of EBM in this context was not anticipated by its creators. “The main reason we take so many drugs is that drug companies don’t sell drugs, they sell lies about drugs.” ―Peter C. Gøtzsche “doctors and their organisations should recognise that it is unethical to receive money that has been earned in part through crimes that have harmed those people whose interests doctors are expected to take care of. Many crimes would be impossible to carry out if doctors weren’t willing to participate in them.” —Peter C Gøtzsche, The BMJ, 2012, Big pharma often commits corporate crime, and this must be stopped. Pending (Colombia): Health Promoter Entities (in Spanish: EPS ―Empresas Promotoras de Salud).

    1. Misinterpretations. New technologies or concepts are difficult to understand in the beginning, regardless of their simplicity; we need to get used to new tools aimed at improving our professional practice. Probably the best explanation is in these videos (credits to Antonio Villafaina for sharing them with me). English: https://www.youtube.com/watch?v=pQHX-SjgQvQ&w=420&h=315 Spanish: https://www.youtube.com/watch?v=DApozQBrlhU&w=420&h=315 ----------------------- Hypothesis: hierarchical levels of evidence based medicine are wrong. Dear Editor, I have data to support the hypothesis described in the title of this letter. Before rejecting the null hypothesis I would like to ask the following open question: Could you support with data that hierarchical levels of evidence based medicine are correct? (1,2) Additional explanation to this question: – Only respond to this question attaching publicly available raw data. – Be aware that more than a question this is a challenge: I have data (i.e., evidence) which is contrary to the classic (i.e., McMaster) or current (i.e., Oxford) hierarchical levels of evidence based medicine. An important part of this data (but not all) is publicly available. References
    2. Ramirez, Jorge H (2014): The EBM challenge. figshare. http://dx.doi.org/10.6084/m9.figshare.1135873
    3. The EBM Challenge Day 1: No Answers. Competing interests: I endorse the principles of open data in human biomedical research Read this letter on The BMJ – August 13, 2014.http://www.bmj.com/content/348/bmj.g3725/rr/762595Re: Greenhalgh T, et al. Evidence based medicine: a movement in crisis? BMJ 2014; 348: g3725. _ Fileset contents Raw data: Excel archive: Raw data, interactive figures, and PubMed search terms. Google Spreadsheet is also available (URL below the article description). Figure 1. Unadjusted (Fig 1A) and adjusted (Fig 1B) PubMed publication trends (01/01/1992 to 30/06/2014). Figure 2. Adjusted PubMed publication trends (07/01/2008 to 29/06/2014) Figure 3. Google search trends: Jan 2004 to Jun 2014 / 1-week periods. Figure 4. PubMed publication trends (1962-2013) systematic reviews and meta-analysis, clinical trials, and observational studies.
      Figure 5. Ramirez, Jorge H (2014): Infographics: Unpublished US phase 3 clinical trials (2002-2014) completed before Jan 2011 = 50.8%. figshare.http://dx.doi.org/10.6084/m9.figshare.1121675 Raw data: "13377 studies found for: Completed | Interventional Studies | Phase 3 | received from 01/01/2002 to 01/01/2014 | Worldwide". This database complies with the terms and conditions of ClinicalTrials.gov: http://clinicaltrials.gov/ct2/about-site/terms-conditions Supplementary Figures (S1-S6). PubMed publication delay in the indexation processes does not explain the descending trends in the scientific output of evidence-based medicine. Acknowledgments I would like to acknowledge the following persons for providing valuable concepts in data visualization and infographics:
    4. Maria Fernanda Ramírez. Professor of graphic design. Universidad del Valle. Cali, Colombia.
    5. Lorena Franco. Graphic design student. Universidad del Valle. Cali, Colombia. Related articles by this author (Jorge H. Ramírez)
    6. Ramirez JH. Lack of transparency in clinical trials: a call for action. Colomb Med (Cali) 2013;44(4):243-6. URL: http://www.ncbi.nlm.nih.gov/pubmed/24892242
    7. Ramirez JH. Re: Evidence based medicine is broken (17 June 2014). http://www.bmj.com/node/759181
    8. Ramirez JH. Re: Global rules for global health: why we need an independent, impartial WHO (19 June 2014). http://www.bmj.com/node/759151
    9. Ramirez JH. PubMed publication trends (1992 to 2014): evidence based medicine and clinical practice guidelines (04 July 2014). http://www.bmj.com/content/348/bmj.g3725/rr/759895 Recommended articles
    10. Greenhalgh Trisha, Howick Jeremy,Maskrey Neal. Evidence based medicine: a movement in crisis? BMJ 2014;348:g3725
    11. Spence Des. Evidence based medicine is broken BMJ 2014; 348:g22
    12. Schünemann Holger J, Oxman Andrew D,Brozek Jan, Glasziou Paul, JaeschkeRoman, Vist Gunn E et al. Grading quality of evidence and strength of recommendations for diagnostic tests and strategies BMJ 2008; 336:1106
    13. Lau Joseph, Ioannidis John P A, TerrinNorma, Schmid Christopher H, OlkinIngram. The case of the misleading funnel plot BMJ 2006; 333:597
    14. Moynihan R, Henry D, Moons KGM (2014) Using Evidence to Combat Overdiagnosis and Overtreatment: Evaluating Treatments, Tests, and Disease Definitions in the Time of Too Much. PLoS Med 11(7): e1001655. doi:10.1371/journal.pmed.1001655
    15. Katz D. A holistic view of evidence based medicine. http://thehealthcareblog.com/blog/2014/05/02/a-holistic-view-of-evidence-based-medicine/
  9. Usage metrics of the TousAntiCovid application

    • gimi9.com
    Cite
    Usage metrics of the TousAntiCovid application [Dataset]. https://gimi9.com/dataset/eu_5fa93b994b29f6390f150980_1
    Explore at:
    Description

    The TousAntiCovid app: TousAntiCovid is an application that allows everyone to play an active part in the fight against the epidemic. It is an additional barrier measure to be activated whenever you need to redouble your vigilance: at the restaurant, in the canteen, when you go to a gym, when you take part in a professional event, or whenever there is a risk that not everyone will respect the other barrier gestures. TousAntiCovid complements the action of doctors and health insurance, aimed at containing the spread of the virus by stopping chains of contamination as early as possible. The principle is as follows: to warn, while guaranteeing anonymity, people who have been close to a person who tested positive, so that they can get tested and taken care of as soon as possible. It also makes it possible to stay informed about the evolution of the epidemic and the recommended precautions, and thus to remain vigilant and adopt the right behaviours. It provides easy access to other tools available to citizens wishing to be involved in the fight against the epidemic: DepistageCovid, which gives a map of nearby labs and wait times, and MesConseilsCovid, which provides personalised advice on protecting yourself and others. Installation of the TousAntiCovid app is voluntary. Everyone is supported even if they choose not to use the app. The app can be downloaded from the Apple Store and Google Play: Hello.tousanticovid.gouv.fr/

    Description of the data: This dataset reports, for each day since the launch of the application on 2 June 2020:
    - Cumulative total of the number of registered applications minus the number of deregistrations.
    - Cumulative total of users notified by the application: the number of users notified by the application as risk contacts following exposure to COVID-19, since 2 June 2020.
    - Cumulative total of users reporting as COVID-19 cases: the number of users who reported as COVID-19 cases in the application, since 2 June 2020.

  10. Cleaned Duolingo Learning Data

    • kaggle.com
    Updated Feb 25, 2025
    Cite
    Charity Githogora (2025). Cleaned Duolingo Learning Data [Dataset]. https://www.kaggle.com/datasets/charitygithogora/cleaned-duolingo-learning-data/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Charity Githogora
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains pre-processed learning traces from Duolingo’s spaced repetition system. It includes timestamps, user interactions, and correctness data, structured to analyze learning patterns over time. The dataset was cleaned and refined in Google Colab before being used to generate visual insights, including a heatmap showing learning activity trends.

    Checkout the heatmap visualization: https://github.com/Charity-Githogora/duolingo-heatmap-insights

    Source: The original dataset was obtained from https://www.kaggle.com/datasets/aravinii/duolingo-spaced-repetition-data, and it has been processed to improve usability for data analysis and visualization.

    Columns:

    - timestamp – The time of user interaction (converted to datetime format).
    - hour – The hour of the day the interaction occurred.
    - day_of_week – The day of the week the interaction occurred.
    - correct – Whether the response was correct (1) or incorrect (0).
    - Other relevant features extracted for analysis.

    Usage: It can be used for various analyses, such as identifying peak learning hours, tracking performance trends over time, and understanding how engagement impacts accuracy. Researchers and data enthusiasts can explore predictive modeling, time-series analysis, and interactive visualizations to uncover deeper insights. Additionally, the dataset can be used to generate heatmaps and other visual representations of learning activity.
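    A minimal sketch of the kind of heatmap described above, assuming the dataset is available locally as a CSV with the columns listed (the file name is an assumption):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("cleaned_duolingo_learning_data.csv", parse_dates=["timestamp"])

    # Mean correctness by day of week and hour of day.
    heat = df.pivot_table(index="day_of_week", columns="hour", values="correct", aggfunc="mean")

    plt.imshow(heat, aspect="auto", cmap="viridis")
    plt.colorbar(label="mean correctness")
    plt.xlabel("hour of day")
    plt.ylabel("day of week")
    plt.title("Learning activity by hour and day of week")
    plt.show()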

  11. Data from: EyeFi: Fast Human Identification Through Vision and WiFi-based...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 4, 2022
    Cite
    Shiwei Fang (2022). EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3882103
    Explore at:
    Dataset updated
    Dec 4, 2022
    Dataset provided by
    Shahriar Nirjon
    Sirajum Munir
    Tamzeed Islam
    Shiwei Fang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EyeFi Dataset

    This dataset was collected as part of the EyeFi project at the Bosch Research and Technology Center, Pittsburgh, PA, USA. The dataset contains WiFi CSI values of human motion trajectories along with ground-truth location information captured through a camera. This dataset is used in the paper "EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching", published in the IEEE International Conference on Distributed Computing in Sensor Systems 2020 (DCOSS '20). We also published a dataset paper titled "Dataset: Person Tracking and Identification using Cameras and Wi-Fi Channel State Information (CSI) from Smartphones" in the Data: Acquisition to Analysis 2020 (DATA '20) workshop describing the details of data collection. Please check it out for more information on the dataset.

    Clarification/Bug report: Please note that the order of antennas and subcarriers in .h5 files is not written clearly in the README.md file. The order of antennas and subcarriers are as follows for the 90 csi_real and csi_imag values : [subcarrier1-antenna1, subcarrier1-antenna2, subcarrier1-antenna3, subcarrier2-antenna1, subcarrier2-antenna2, subcarrier2-antenna3,… subcarrier30-antenna1, subcarrier30-antenna2, subcarrier30-antenna3]. Please see the description below. The newer version of the dataset contains this information in README.md. We are sorry for the inconvenience.

    Data Collection Setup

    In our experiments, we used an Intel 5300 WiFi Network Interface Card (NIC) installed in an Intel NUC and the Linux CSI tools [1] to extract the WiFi CSI packets. The (x,y) coordinates of the subjects are collected from a Bosch Flexidome IP Panoramic 7000 panoramic camera mounted on the ceiling, and the Angles of Arrival (AoAs) are derived from the (x,y) coordinates. Both the WiFi card and the camera are located at the same origin coordinates but at different heights: the camera is located around 2.85m above the ground and the WiFi antennas are around 1.12m above the ground.

    The data collection environment consists of two areas: the first is a rectangular space measuring 11.8m x 8.74m, and the second is an irregularly shaped kitchen area with maximum distances of 19.74m and 14.24m between two walls. The kitchen also has numerous obstacles and different materials that pose different RF reflection characteristics, including strong reflectors such as metal refrigerators and dishwashers.

    To collect the WiFi data, we used a Google Pixel 2 XL smartphone as an access point and connected the Intel 5300 NIC to it for WiFi communication. The transmission rate is about 20-25 packets per second. The same WiFi card and phone are used in both the lab and kitchen areas.

    List of Files

    Here is a list of files included in the dataset:

    |- 1_person
       |- 1_person_1.h5
       |- 1_person_2.h5
    |- 2_people
       |- 2_people_1.h5
       |- 2_people_2.h5
       |- 2_people_3.h5
    |- 3_people
       |- 3_people_1.h5
       |- 3_people_2.h5
       |- 3_people_3.h5
    |- 5_people
       |- 5_people_1.h5
       |- 5_people_2.h5
       |- 5_people_3.h5
       |- 5_people_4.h5
    |- 10_people
       |- 10_people_1.h5
       |- 10_people_2.h5
       |- 10_people_3.h5
    |- Kitchen
       |- 1_person
          |- kitchen_1_person_1.h5
          |- kitchen_1_person_2.h5
          |- kitchen_1_person_3.h5
       |- 3_people
          |- kitchen_3_people_1.h5
    |- training
       |- shuffuled_train.h5
       |- shuffuled_valid.h5
       |- shuffuled_test.h5
    View-Dataset-Example.ipynb
    README.md

    In this dataset, the folders 1_person/, 2_people/, 3_people/, 5_people/, and 10_people/ contain data collected from the lab area, whereas the Kitchen/ folder contains data collected from the kitchen area. To see how each file is structured, please see the section Access the data below.

    The training folder contains the training dataset we used to train the neural network discussed in our paper. They are generated by shuffling all the data from 1_person/ folder collected in the lab area (1_person_1.h5 and 1_person_2.h5).

    Why multiple files in one folder?

    Each folder contains multiple files. For example, the 1_person folder has two files: 1_person_1.h5 and 1_person_2.h5. Files in the same folder always have the same number of human subjects present simultaneously in the scene. However, the person holding the phone can be different. Also, the data could be collected on different days, and/or the data collection system sometimes needed to be rebooted due to stability issues. As a result, we provide different files (like 1_person_1.h5 and 1_person_2.h5) to distinguish the different people holding the phone and the possible system reboots that introduce different phase offsets (see below) into the system.

    Special note:

    For 1_person_1.h5, the file is generated by the same person holding the phone throughout, whereas 1_person_2.h5 contains different people holding the phone, but only one person is present in the area at a time. Both files were also collected on different days.

    Access the data

    To access the data, the hdf5 library is needed to open the dataset. There are free HDF5 viewers available on the official website: https://www.hdfgroup.org/downloads/hdfview/. We also provide an example Python notebook, View-Dataset-Example.ipynb, to demonstrate how to access the data.

    Each file is structured as (except the files under "training/" folder):

    |- csi_imag
    |- csi_real
    |- nPaths_1
       |- offset_00
          |- spotfi_aoa
       |- offset_11
          |- spotfi_aoa
       |- offset_12
          |- spotfi_aoa
       |- offset_21
          |- spotfi_aoa
       |- offset_22
          |- spotfi_aoa
    |- nPaths_2
       (same offset_xx / spotfi_aoa structure as nPaths_1)
    |- nPaths_3
       (same offset_xx / spotfi_aoa structure as nPaths_1)
    |- nPaths_4
       (same offset_xx / spotfi_aoa structure as nPaths_1)
    |- num_obj
    |- obj_0
       |- cam_aoa
       |- coordinates
    |- obj_1
       |- cam_aoa
       |- coordinates
    ...
    |- timestamp

    The csi_real and csi_imag are the real and imaginary parts of the CSI measurements. The order of antennas and subcarriers is as follows for the 90 csi_real and csi_imag values: [subcarrier1-antenna1, subcarrier1-antenna2, subcarrier1-antenna3, subcarrier2-antenna1, subcarrier2-antenna2, subcarrier2-antenna3, … subcarrier30-antenna1, subcarrier30-antenna2, subcarrier30-antenna3]. The nPaths_x groups are SpotFi [2]-calculated WiFi Angle of Arrival (AoA) values, with x the number of multiple paths specified during the calculation. Under each nPaths_x group are offset_xx subgroups, where xx stands for the offset combination used to correct the phase offset during the SpotFi calculation. We measured the offsets as:

    Antennas | Offset 1 (rad) | Offset 2 (rad)
    1 & 2 | 1.1899 | -2.0071
    1 & 3 | 1.3883 | -1.8129

    The measurement is based on the work [3], where the authors state there are two possible offsets between two antennas which we measured by booting the device multiple times. The combination of the offset are used for the offset_xx naming. For example, offset_12 is offset 1 between antenna 1 & 2 and offset 2 between antenna 1 & 3 are used in the SpotFi calculation.

    The num_obj field stores the number of human subjects present in the scene. obj_0 is always the subject who is holding the phone. In each file, there are num_obj obj_x groups. For each obj_x, we have the coordinates reported from the camera and cam_aoa, the AoA estimated from the camera-reported coordinates. The (x,y) coordinates and AoA listed here are chronologically ordered (except in the files in the training folder). They reflect the way the person carrying the phone moved in the space (for obj_0) and how everyone else walked (for the other obj_y, where y > 0).

    The timestamp is provided as a time reference for each WiFi packet.

    To access the data (Python):

    import h5py

    # Open one of the HDF5 files in read-only mode.
    data = h5py.File('3_people_3.h5', 'r')

    # Complex CSI components; each row holds the 90 values for one WiFi packet.
    csi_real = data['csi_real'][()]
    csi_imag = data['csi_imag'][()]

    # Camera-derived AoA and (x, y) coordinates for the phone holder (obj_0).
    cam_aoa = data['obj_0/cam_aoa'][()]
    cam_loc = data['obj_0/coordinates'][()]
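    As a small follow-on sketch (not part of the original example), the 90 CSI values per packet can be reshaped into a (subcarrier, antenna) grid using the ordering described above; the variable names continue from the snippet above.

    import numpy as np

    # Combine the real and imaginary parts into complex CSI, shape (num_packets, 90).
    csi = csi_real + 1j * csi_imag

    # The 90 values are ordered subcarrier-major, antenna-minor, so axis 1 becomes
    # the 30 subcarriers and axis 2 the 3 antennas.
    csi = csi.reshape(-1, 30, 3)
    print(csi.shape)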

    Files inside the training/ folder have a different data structure:

    |- nPath-1
       |- aoa
       |- csi_imag
       |- csi_real
       |- spotfi
    |- nPath-2
       |- aoa
       |- csi_imag
       |- csi_real
       |- spotfi
    |- nPath-3
       |- aoa
       |- csi_imag
       |- csi_real
       |- spotfi
    |- nPath-4
       |- aoa
       |- csi_imag
       |- csi_real
       |- spotfi

    The group nPath-x corresponds to the number of multiple paths specified during the SpotFi calculation. aoa is the camera-generated angle of arrival (AoA), which can be considered ground truth; csi_imag and csi_real are the imaginary and real components of the CSI values; spotfi holds the SpotFi-calculated AoA values. The SpotFi values are chosen based on the lowest median and mean error across 1_person_1.h5 and 1_person_2.h5. All rows under the same nPath-x group are aligned (i.e., the first row of aoa corresponds to the first row of csi_imag, csi_real, and spotfi). There is no timestamp recorded, and the sequence of the data is not chronological, as the rows are randomly shuffled from the 1_person_1.h5 and 1_person_2.h5 files.

    Citation If you use the dataset, please cite our paper:

    @inproceedings{eyefi2020,
      title={EyeFi: Fast Human Identification Through Vision and WiFi-based Trajectory Matching},
      author={Fang, Shiwei and Islam, Tamzeed and Munir, Sirajum and Nirjon, Shahriar},
      booktitle={2020 IEEE International Conference on Distributed Computing in Sensor Systems (DCOSS)},

  12. Data from: Recognizing the importance of near-home contact with nature for...

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +2more
    zip
    Updated Aug 29, 2023
    Cite
    Magdalena Lenda; Piotr Skórka; Małgorzata Jaźwa; Hsien-Yung Lin; Edward Nęcka; Piotr Tryjanowski; Dawid Moroń; Johannes M. H. Knops; Hugh P. Possingham (2023). Recognizing the importance of near-home contact with nature for mental well-being based on the COVID-19 lockdown experience [Dataset]. http://doi.org/10.5061/dryad.fn2z34v1h
    Explore at:
    zip. Available download formats
    Dataset updated
    Aug 29, 2023
    Dataset provided by
    Institute of Nature Conservation
    University of Opole
    Xi’an Jiaotong-Liverpool University
    The University of Queensland
    Uniwersytet SWPS
    University of Life Sciences in Poznań
    Institute of Systematics and Evolution of Animals
    Carleton University
    Authors
    Magdalena Lenda; Piotr Skórka; Małgorzata Jaźwa; Hsien-Yung Lin; Edward Nęcka; Piotr Tryjanowski; Dawid Moroń; Johannes M. H. Knops; Hugh P. Possingham
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Several urban landscape planning solutions have been introduced around the world to find a balance between developing urban spaces, maintaining and restoring biodiversity, and enhancing quality of human life. Our global mini-review, combined with analysis of big data collected from Google Trends at global scale, reveals the importance of enjoying day-to-day contact with nature and engaging in such activities as nature observation and identification and gardening for the mental well-being of humans during the COVID-19 pandemic. Home-based activities, such as watching birds from one’s window, identifying species of plants and animals, backyard gardening, and collecting information about nature for citizen science projects, were popular during the first lockdown in spring 2020, when people could not easily venture out of their homes. In our mini-review, we found 37 articles from 28 countries with a total sample of 114,466 people. These papers suggest that home-based engagement with nature was an entertaining and pleasant distraction that helped preserve mental well-being during a challenging time. According to Google Trends, interest in such activities increased during lockdown compared to the previous five years. Millions of people worldwide are chronically or temporarily confined to their homes and neighborhoods because of illness, childcare chores, or elderly care responsibility, which makes it difficult for them to travel far to visit such places as national parks, created through land sparing, where people go to enjoy nature and relieve stress. This article posits that for such people, living in an urban landscape designed to facilitate effortless contact with small natural areas is a more effective way to receive the mental health benefits of contact with nature than visiting a sprawling nature park on rare occasions.

    Methods

    1. Identifying the most common types of activities related to nature observation, gardening, and taxa identification during the first lockdown, based on scientific articles and non-scientific press

    For scientific articles, in March 2023 we searched Scopus and Google Scholar. For countries where Google is restricted, such as China, similar results will be available from other scientific browsers, with the highest number of results from our database being available from Scopus. We used the Google Search browser to search for globally published non-scientific press articles. Some selection criteria were applied during article review. Specifically, we excluded articles that were not about the first lockdown; did not study activities at a local scale (from a balcony, window, or backyard) but rather in areas far away from home (e.g., visiting forests); studied the mental health effect of observing indoor potted plants and pet animals; or only transiently mentioned the topic or keyword without going into any scientific detail. We included all papers that met our criteria, that is, studies that analyzed our chosen topic with experiments or planned observations. We included all research papers, but not letters that made claims without any data. Google Scholar automatically screened the title, abstract, keywords, and the whole text of each article for the keywords we entered. All articles that met our criteria were read and double-checked for keywords and content related to the keywords (e.g., synonyms or if they presented content about the relevant topic without using the specific keywords).
    We identified, from both types of articles, the major nature-based activities that people engaged in during the first lockdown in the spring of 2020. Keywords used in this study were grouped into six main topics: (1) COVID-19 pandemic; (2) nature-oriented activity focused on nature observation, identification of different taxa, or gardening; (3) mental well-being; (4) activities performed from a balcony, window, or in gardens; (5) entertainment; and (6) citizen science (see Table 1 for all keywords).

    2. Increase in global trends in interest in nature observation, gardening, and taxa identification during the first lockdown

    We used the categorical cluster method, combined with big data from Google Trends (downloaded on 1 September 2020) and anomaly detection, to identify trend anomalies in people’s interests globally. We used this combination of methods to examine whether interest in the nature-based activities mentioned in scientific and non-scientific press articles increased during the first lockdown. Keywords linked with the main types of nature-oriented activities, as identified from press and scientific articles and used according to the categorical clustering method, were classified into the following six main categories: (1) global interest in bird-watching and bird identification combined with citizen science; (2) global interest in plant identification and gardening combined with citizen science; (3) global interest in butterfly watching; (4) local interest in early-spring (lockdown time), summer, or autumn flowering species that usually can be found in Central European (country: Poland) backyards; (5) global interest in traveling and social activities; and (6) global interest in nature areas and activities typically enjoyed during holidays and thus requiring traveling to land-spared nature reserves. The six categories were divided into 15 subcategories so that we could attach relevant words or phrases belonging to the same cluster and typically related to the activity (according to Google Trends and the Google browser’s automatic suggestions; e.g., people who searched for “bird-watching” typically also searched for “binoculars,” “bird feeder,” “bird nest,” and “birdhouse”). The subcategories and keywords used for data collection about trends in society’s interest in the studied topics from Google Trends are as follows.

    1. Bird-watching: “binoculars,” “bird feeder,” “bird nest,” “birdhouse,” “bird-watching”
    2. Bird identification: “bird app,” “bird identification,” “bird identification app,” “bird identifier,” “bird song app”
    3. Bird-watching combined with citizen science: “bird guide,” “bird identification,” “eBird,” “feeding birds,” “iNaturalist”
    4. Citizen science and bird-watching apps: “BirdNET,” “BirdSong ID,” “eBird,” “iNaturalist,” “Merlin Bird ID”
    5. Gardening: “gardening,” “planting,” “seedling,” “seeds,” “soil”
    6. Shopping for gardening: “garden shop,” “plant buy,” “plant ebay,” “plant sell,” “plant shop”
    7. Plant identification apps: “FlowerChecker,” “LeafSnap,” “NatureGate,” “Plantifier,” “PlantSnap”
    8. Citizen science and plant identification: “iNaturalist,” “plant app,” “plant check,” “plant identification app,” “plant identifier”
    9. Flowers that were flowering in gardens during lockdown in Poland: “fiołek” (viola), “koniczyna” (shamrock), “mlecz” (dandelion), “pierwiosnek” (primrose), “stokrotka” (daisy). They are typical early-spring flowers growing in gardens in Central Europe. We had to be more specific in this search because there are no plant species blooming across the world at the same time. These plant species have well-known biology; thus, we could easily interpret these results.
    10. Flowers that were not flowering during lockdown in Poland: “chaber” (cornflower), “mak” (poppy), “nawłoć” (goldenrod), “róża” (rose), “rumianek” (chamomile). They are typical mid-summer flowering plants often planted in gardens.
    11. Interest in traveling long distances and in social activities that involve many people: “airport,” “bus,” “café,” “driving,” “pub”
    12. Single or mass commuting, and traveling: “bike,” “boat,” “car,” “flight,” “train”
    13. Interest in distant places and activities for visiting natural areas: “forest,” “nature park,” “safari,” “trekking,” “trip”
    14. Places and activities for holidays (typically located far away): “coral reef,” “rainforest,” “safari,” “savanna,” “snorkeling”
    15. Butterfly watching: “butterfly watching,” “butterfly identification,” “butterfly app,” “butterfly net,” “butterfly guide”

    In Google Trends, we set the following filters: global search, dates: July 2016–July 2020; language: English.
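    A minimal sketch of pulling one of these keyword clusters programmatically, using the unofficial pytrends client rather than the Google Trends web interface used in the study (the library, keyword cluster, and timeframe below are illustrative assumptions, not part of the original methods):

    from pytrends.request import TrendReq

    pytrends = TrendReq(hl="en-US")

    # One five-keyword subcategory from the list above ("Bird-watching").
    keywords = ["binoculars", "bird feeder", "bird nest", "birdhouse", "bird-watching"]

    # Worldwide search interest, July 2016 to July 2020, matching the filters described above.
    pytrends.build_payload(keywords, timeframe="2016-07-01 2020-07-31", geo="")
    trends = pytrends.interest_over_time()
    print(trends.head())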

  13. Lead Scoring Dataset

    • kaggle.com
    zip
    Updated Aug 17, 2020
    Cite
    Amrita Chatterjee (2020). Lead Scoring Dataset [Dataset]. https://www.kaggle.com/amritachatterjee09/lead-scoring-dataset
    Explore at:
    zip (411028 bytes). Available download formats
    Dataset updated
    Aug 17, 2020
    Authors
    Amrita Chatterjee
    Description

    Context

    An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

    The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses or fill up a form for the course or watch some videos. When these people fill up a form providing their email address or phone number, they are classified to be a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X education is around 30%.

    Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if, say, they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most potential leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up as the sales team will now be focusing more on communicating with the potential leads rather than making calls to everyone.

    There are a lot of leads generated in the initial stage (top) but only a few of them come out as paying customers from the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc. ) in order to get a higher lead conversion.

    X Education wants to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model wherein you need to assign a lead score to each of the leads such that the customers with a higher lead score have a higher conversion chance and the customers with a lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.

    Content

    Variables Description

    * Prospect ID - A unique ID with which the customer is identified.
    * Lead Number - A lead number assigned to each lead procured.
    * Lead Origin - The origin from which the customer was identified as a lead. Includes API, Landing Page Submission, etc.
    * Lead Source - The source of the lead. Includes Google, Organic Search, Olark Chat, etc.
    * Do Not Email - An indicator variable in which the customer selects whether or not they want to be emailed about the course.
    * Do Not Call - An indicator variable in which the customer selects whether or not they want to be called about the course.
    * Converted - The target variable. Indicates whether a lead has been successfully converted or not.
    * TotalVisits - The total number of visits made by the customer to the website.
    * Total Time Spent on Website - The total time spent by the customer on the website.
    * Page Views Per Visit - Average number of pages on the website viewed during the visits.
    * Last Activity - Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.
    * Country - The country of the customer.
    * Specialization - The industry domain in which the customer worked before. Includes the level 'Select Specialization', which means the customer had not selected this option while filling in the form (see the sketch after this list).
    * How did you hear about X Education - The source from which the customer heard about X Education.
    * What is your current occupation - Indicates whether the customer is a student, unemployed, or employed.
    * What matters most to you in choosing this course - An option selected by the customer indicating their main motivation for taking the course.
    * Search - Indicates whether the customer had seen the ad in any of the listed items.
    * Magazine
    * Newspaper Article
    * X Education Forums
    * Newspaper
    * Digital Advertisement
    * Through Recommendations - Indicates whether the customer came in through recommendations.
    * Receive More Updates About Our Courses - Indicates whether the customer chose to receive more updates about the courses.
    * Tags - Tags assigned to customers indicating the current status of the lead.
    * Lead Quality - Indicates the quality of the lead, based on the data and on the intuition of the employee assigned to the lead.
    * Update me on Supply Chain Content - Indicates whether the customer wants updates on the Supply Chain Content.
    * Get updates on DM Content - Indicates whether the customer wants updates on the DM Content.
    * Lead Profile - A lead level assigned to each customer based on their profile.
    * City - The city of the customer.
    * Asymmetric Activity Index - An index and score assigned to each customer based on their activity and their profile.
    * Asymmetric Profile Index
    * Asymmetric Activity Score
    * Asymmetric Profile Score
    * I agree to pay the amount through cheque - Indicates whether the customer has agreed to pay the amount through cheque or not.
    * A free copy of Mastering The Interview - Indicates whether the customer wants a free copy of 'Mastering the Interview' or not.
    * Last Notable Activity - The last notable activity performed by the student.
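
    As noted for Specialization above, several of the categorical columns use the placeholder level 'Select' when the customer skipped the field, so in practice it is usually treated as a missing value before any analysis. A minimal sketch, again assuming a file named leads.csv:

    import numpy as np
    import pandas as pd

    # Assumed file name for illustration.
    df = pd.read_csv("leads.csv")

    # Treat the 'Select' placeholder as missing data across the whole frame.
    df = df.replace("Select", np.nan)

    # Inspect how much of each column is now missing.
    print(df.isna().mean().sort_values(ascending=False).head(10))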

    Acknowledgements

    UpGrad Case Study

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  14. COVID19 - The New York Times

    • kaggle.com
    zip
    Updated May 18, 2020
    + more versions
    Cite
    Google BigQuery (2020). COVID19 - The New York Times [Dataset]. https://www.kaggle.com/bigquery/covid19-nyt
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    May 18, 2020
    Dataset provided by
    BigQueryhttps://cloud.google.com/bigquery
    Authors
    Google BigQuery
    Description

    Context

    This is the US Coronavirus data repository from The New York Times. This data includes COVID-19 cases and deaths reported by state and county. The New York Times compiled this data based on reports from state and local health agencies. More information on the data repository is available here. For additional reporting and data visualizations, see The New York Times' U.S. coronavirus interactive site.

    Sample Queries

    Query 1

    Which US counties have the most confirmed cases per capita? This query determines which counties have the most cases per 100,000 residents. Note that this may differ from similar queries of other datasets because of differences in reporting lag, methodologies, or other dataset differences.

    SELECT
      covid19.county,
      covid19.state_name,
      total_pop AS county_population,
      confirmed_cases,
      ROUND(confirmed_cases / total_pop * 100000, 2) AS confirmed_cases_per_100000,
      deaths,
      ROUND(deaths / total_pop * 100000, 2) AS deaths_per_100000
    FROM `bigquery-public-data.covid19_nyt.us_counties` covid19
    JOIN `bigquery-public-data.census_bureau_acs.county_2017_5yr` acs
      ON covid19.county_fips_code = acs.geo_id
    WHERE date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 day)
      AND covid19.county_fips_code != "00000"
    ORDER BY confirmed_cases_per_100000 DESC

    Query 2

    How do I calculate the number of new COVID-19 cases per day? This query determines the total number of new cases in each state for each day available in the dataset.

    SELECT
      b.state_name,
      b.date,
      MAX(b.confirmed_cases - a.confirmed_cases) AS daily_confirmed_cases
    FROM (
      SELECT
        state_name AS state,
        state_fips_code,
        confirmed_cases,
        DATE_ADD(date, INTERVAL 1 day) AS date_shift
      FROM `bigquery-public-data.covid19_nyt.us_states`
      WHERE confirmed_cases + deaths > 0
    ) a
    JOIN `bigquery-public-data.covid19_nyt.us_states` b
      ON a.state_fips_code = b.state_fips_code
      AND a.date_shift = b.date
    GROUP BY b.state_name, date
    ORDER BY date DESC

  15. Data from Time Travelling with Technology: a technology-based program for...

    • researchdata.edu.au
    Updated Oct 23, 2024
    Cite
    Li Weicong; Leahy Andrew; Jones Caroline; Radnan Maddie; Weicong Li; Caroline Jones (2024). Data from Time Travelling with Technology: a technology-based program for promoting relationships and engagement in aged care [Dataset]. http://doi.org/10.26183/RB4C-SS12
    Explore at:
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Western Sydney Universityhttp://www.uws.edu.au/
    Authors
    Li Weicong; Leahy Andrew; Jones Caroline; Radnan Maddie; Weicong Li; Caroline Jones
    Time period covered
    Jul 29, 2019 - Dec 2, 2021
    Description

    This dataset contains transcripts of conversations between elderly people and a facilitator during group reminiscence therapy sessions in a day-respite aged care facility in Sydney, Australia. Each session consisted of 2-4 older adults, sometimes including family and carers, and ran for approximately 30 minutes.

    Each session displayed locations of significance to the clients on a television using Google Maps and Google Street View in a program called Time Travelling with Technology (TTT). Half the sessions involved the High-Tech condition using dynamic images panning the environment and the other half the Low-Tech condition using static images.

    The dataset also includes dyadic interviews between the facilitator and each individual. The interviews were carried out at initial, mid and final intervals and included discourse tasks and autobiographical discussions.

  16. Day & night temperatures, 50yrs, 1666ws, TFRecord

    • kaggle.com
    zip
    Updated Nov 9, 2019
    Cite
    Martin Görner (2019). Day & night temperatures, 50yrs, 1666ws, TFRecord [Dataset]. https://www.kaggle.com/datasets/mgorner/day-night-temperatures-50yrs-1666ws-tfrecord
    Explore at:
    zip(160157825 bytes)Available download formats
    Dataset updated
    Nov 9, 2019
    Authors
    Martin Görner
    License

    https://www.usa.gov/government-works/

    Description

    This dataset is a cleaned-up extract from the following public BigQuery dataset: https://console.cloud.google.com/marketplace/details/noaa-public/ghcn-d

    The dataset contains daily min/max temperatures from a selection of 1666 weather stations. The data spans exactly 50 years. Missing values have been interpolated and are marked as such.

    This dataset is in TFRecord format.
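
    A minimal sketch of reading such a file with tf.data is shown below. The file name and the feature specification are placeholders for illustration only; the actual schema of the TFRecords is defined by the dataset author and may differ.

    import tensorflow as tf

    # Hypothetical feature spec -- consult the dataset documentation for the real one.
    feature_spec = {
        "station_id": tf.io.FixedLenFeature([], tf.string),
        "temp_min": tf.io.VarLenFeature(tf.float32),
        "temp_max": tf.io.VarLenFeature(tf.float32),
    }

    def parse_example(serialized):
        # Decode one serialized tf.train.Example into a dict of tensors.
        return tf.io.parse_single_example(serialized, feature_spec)

    # Assumed file name; the archive may contain one or more shards.
    dataset = tf.data.TFRecordDataset("temperatures.tfrecord").map(parse_example)

    for record in dataset.take(1):
        print(record["station_id"])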

    About the original dataset: NOAA's Global Historical Climatology Network (GHCN) is an integrated database of climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. The data are obtained from more than 20 sources. The GHCN-Daily is an integrated database of daily climate summaries from land surface stations across the globe; it comprises daily climate records from over 100,000 stations in 180 countries and territories, and includes some data from every year since 1763.

  17. Data from: Novel Corona Virus 2019 Dataset

    • kaggle.com
    zip
    Updated Jan 30, 2020
    Cite
    SRK (2020). Novel Corona Virus 2019 Dataset [Dataset]. https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
    Explore at:
    zip(3155 bytes)Available download formats
    Dataset updated
    Jan 30, 2020
    Authors
    SRK
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    From World Health Organization - On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The virus did not match any other known virus. This raised concern because when a virus is new, we do not know how it affects people.

    So daily level information on the affected people can give some interesting insights when it is made available to the broader data science community.

    Johns Hopkins University has made an excellent dashboard using the affected cases data. This data is extracted from the same link and made available in csv format.

    Content

    2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC

    This dataset has daily-level information on the number of affected cases, deaths, and recoveries from the 2019 novel coronavirus.

    The data is available from 22 Jan 2020.
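
    Because the counts are reported as cumulative totals per day, the number of new cases per day can be recovered by differencing within each country. A minimal pandas sketch; the file and column names (covid_19_data.csv, Country, Date, Confirmed) are assumptions and may differ from the actual CSV headers:

    import pandas as pd

    # Assumed file and column names; check the actual CSV headers in the download.
    df = pd.read_csv("covid_19_data.csv", parse_dates=["Date"])

    # Difference the cumulative totals within each country to get daily new cases.
    df = df.sort_values(["Country", "Date"])
    df["NewCases"] = df.groupby("Country")["Confirmed"].diff().fillna(df["Confirmed"])

    print(df[["Country", "Date", "Confirmed", "NewCases"]].tail())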

    Acknowledgements

    Johns Hopkins University has made the data available in Google Sheets format here. Sincere thanks to them.

    Thanks to WHO, CDC, NHC and DXY for making the data available in the first place.

    Picture courtesy: Johns Hopkins University dashboard

    Inspiration

    Some insights could be

    1. Changes in number of affected cases over time
    2. Change in cases over time at country level
    3. Latest number of affected cases
  18. Data from: Fine-Scale Spatiotemporal Air Pollution Analysis Using Mobile...

    • tandf.figshare.com
    • datasetcatalog.nlm.nih.gov
    pdf
    Updated May 30, 2023
    Cite
    Yawen Guan; Margaret C. Johnson; Matthias Katzfuss; Elizabeth Mannshardt; Kyle P. Messier; Brian J. Reich; Joon J. Song (2023). Fine-Scale Spatiotemporal Air Pollution Analysis Using Mobile Monitors on Google Street View Vehicles [Dataset]. http://doi.org/10.6084/m9.figshare.10113239.v3
    Explore at:
    pdfAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Yawen Guan; Margaret C. Johnson; Matthias Katzfuss; Elizabeth Mannshardt; Kyle P. Messier; Brian J. Reich; Joon J. Song
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    People are increasingly concerned with understanding their personal environment, including possible exposure to harmful air pollutants. To make informed decisions on their day-to-day activities, they are interested in real-time information on a localized scale. Publicly available, fine-scale, high-quality air pollution measurements acquired using mobile monitors represent a paradigm shift in measurement technologies. A methodological framework utilizing these increasingly fine-scale measurements to provide real-time air pollution maps and short-term air quality forecasts on a fine-resolution spatial scale could prove to be instrumental in increasing public awareness and understanding. The Google Street View study provides a unique source of data with spatial and temporal complexities, with the potential to provide information about commuter exposure and hot spots within city streets with high traffic. We develop a computationally efficient spatiotemporal model for these data and use the model to make short-term forecasts and high-resolution maps of current air pollution levels. We also show via an experiment that mobile networks can provide more nuanced information than an equally sized fixed-location network. This modeling framework has important real-world implications in understanding citizens’ personal environments, as data production and real-time availability continue to be driven by the ongoing development and improvement of mobile measurement technologies. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.

  19. Data from: A dataset of the crackdown on cross-border wildlife crimes in...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Mar 28, 2023
    Cite
    Tianjian Song; Zexu Luo; Yuxin Huang; Yonghua Li; Lei Fang; Jiang Chang (2023). A dataset of the crackdown on cross-border wildlife crimes in China, 2014-2020 [Dataset]. http://doi.org/10.5061/dryad.t1g1jwt5g
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 28, 2023
    Dataset provided by
    University of Chinese Academy of Sciences
    Chinese Research Academy of Environmental Sciences
    Authors
    Tianjian Song; Zexu Luo; Yuxin Huang; Yonghua Li; Lei Fang; Jiang Chang
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    China
    Description

    Wildlife crimes that involve smuggling threaten national security and biodiversity, cause regional conflicts, and hinder economic development, especially in developing countries with abundant wildlife resources. Over the past few decades, significant headway has been made in combating wildlife smuggling and the related illegal domestic trade in China. Previous studies on the wildlife smuggling trade were mostly based on customs punishment and confiscation data. From the China Judgments Online website, we retrieved cases related to the cross-border smuggling of wildlife and wildlife products from 2014 to 2020. A total of 510 available cases and 927 records covering more than 110 species were registered. We thoroughly studied each judgment and ruling file to extract information on cases, defendants, species, sentences, and the origins and destinations of wildlife and wildlife products. Furthermore, the frequency of origin-destination place occurrences and the spatial patterns of cross-border wildlife crime in China are shown in this data paper. The main purpose of our dataset is to make these wildlife and wildlife products trade data accessible for researchers to develop conservation studies. We expect that this dataset will be valuable for network analysis of regional or global wildlife trafficking, which has attracted global attention. There are no copyright restrictions on the data; we ask that researchers please cite this paper and the associated dataset when using the data in publications.

    Methods

    Data source: The China Judgments Online (CJO) website (https://wenshu.court.gov.cn) provides electronic public access to court records. In 2010, 2013, and 2016, the Supreme People's Court promulgated and revised the provisions on the publication of judicial documents by people's courts on the Internet, and the publication of judicial documents has become the responsibility and obligation of courts at all levels (Wu, 2022). Since January 1, 2014, judgment documents must be published on CJO within seven days of their enforcement, and cannot be amended, replaced, or revoked without court authority. The CJO has thus become an important channel for the publication of judgment documents.

    Data collection: Data were collected up to September 2021. We searched for "wildlife" and "smuggling" on the China Judgments Online website. We then screened the resulting judgment documents according to the following criteria: (I) the full text could be accessed, and the case involved the crimes of illegal hunting, sale, acquisition, transportation, or smuggling of wildlife or wildlife products (including rare and endangered wildlife or wildlife products) overseas; and (II) when there were multiple judgment documents in the same lawsuit, such as a subsequent retrial of a case or the filing and hearing of different perpetrators in batches, a single consistent case number (record) was assigned.

    Data compilation: These judicial documents provide the process of tracing criminal information. We collected as detailed information as possible, such as the date of the seizure, the location of the seizure, the type of illegal activities, the items seized, the source of the items seized, and the actual or expected destination. We used these criteria: (I) on the premise of protecting the personal information in the judgment documents, we obtained the education level and nationality of the principal defendants; (II) for the origin and destination of wildlife or its products, in addition to recording the national, provincial, county, and city levels, the information should be as accurate as possible to specific geographical names by obtaining longitude and latitude coordinate data through Baidu map (https://map.baidu.com/) and Google map (https://www.google.com/maps); and (III) for the identification of “crocodile,” “modern elephant,” “pangolin scale,” and other identifications that are not accurate to the species level in the judgment documents, only the upper classification (genus) level was recorded (i.e., “Crocodylus,” “Loxodonta,” “Manis”; Figure 3). If only the Chinese common name of the species was given but the Latin scientific name was not given, we queried the corresponding species in the International Union for the Conservation of Nature (IUCN)’s Red List of Threatened Species (hereafter: IUCN Red List; https://www.iucnredlist.org) for supplemental information. Eventually these records were translated from Chinese to English.

    Quality control: Because information had to be extracted by reading many parties' statements, defenders' opinions, examination instructions, and other text, the preliminary preparation focused on agreeing on standardized methods and steps for data collection, and on the division of labor and the training of the personnel involved in data collection tasks. In the data entry and summary stage, the established data collection methods and steps were followed to reduce human error. In the data inspection stage, we cross-checked the obtained data and missing values with the author to ensure the accuracy of data input. If there were questions, the lead author and Luo would revisit the original judgment documents and make a final decision after discussion with the other authors.

  20. Iowa City Flood Recovery Resource Center, Status of Bridges: Closed or Open,...

    • geocommons.com
    Updated Jun 23, 2008
    Cite
    Iowa City Flood Recovery Resource Center, icgov.org (2008). Iowa City Flood Recovery Resource Center, Status of Bridges: Closed or Open, Iowa City, 6.19.2008 [Dataset]. http://geocommons.com/search.html
    Explore at:
    Dataset updated
    Jun 23, 2008
    Dataset provided by
    Iowa City Flood Recovery Resource Center, icgov.org
    Burkey
    Description

    This dataset displays the status of bridges in Iowa City as of 6.19.08. The data comes from the City of Iowa City's website (icgov.org), specifically its Flood Recovery Resource Center. The lat/lons were obtained by geocoding visually in Google Earth.
