Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset originates from DataCamp. Many users have reposted copies of the CSV on Kaggle, but most of those uploads omit the original instructions, business context, and problem framing. In this upload, I’ve included that missing context in the About Dataset so the reader of my notebook or any other notebook can fully understand how the data was intended to be used and the intended problem framing.
Note: I have also uploaded a visualization of the workflow I personally took to tackle this problem, but it is not part of the dataset itself.
Additionally, I created a PowerPoint presentation based on my work in the notebook, which you can download from here:
PPTX Presentation
From: Head of Data Science
Received: Today
Subject: New project from the product team
Hey!
I have a new project for you from the product team. Should be an interesting challenge. You can see the background and request in the email below.
I would like you to perform the analysis and write a short report for me. I want to be able to review your code as well as read your thought process for each step. I also want you to prepare and deliver the presentation for the product team - you are ready for the challenge!
They want us to predict which recipes will be popular 80% of the time and minimize the chance of showing unpopular recipes. I don't think that is realistic in the time we have, but do your best and present whatever you find.
You can find more details about what I expect you to do here. And information on the data here.
I will be on vacation for the next couple of weeks, but I know you can do this without my support. If you need to make any decisions, include them in your work and I will review them when I am back.
Good Luck!
From: Product Manager - Recipe Discovery
To: Head of Data Science
Received: Yesterday
Subject: Can you help us predict popular recipes?
Hi,
We haven't met before but I am responsible for choosing which recipes to display on the homepage each day. I have heard about what the data science team is capable of and I was wondering if you can help me choose which recipes we should display on the home page?
At the moment, I choose my favorite recipe from a selection and display that on the home page. We have noticed that traffic to the rest of the website goes up by as much as 40% if I pick a popular recipe. But I don't know how to decide if a recipe will be popular. More traffic means more subscriptions so this is really important to the company.
Can your team:
- Predict which recipes will lead to high traffic?
- Correctly predict high traffic recipes 80% of the time?
We need to make a decision on this soon, so I need you to present your results to me by the end of the month. Whatever your results, what do you recommend we do next?
Look forward to seeing your presentation.
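One way to read the "80% of the time" target is as precision: of the recipes a model flags as high traffic, at least 80% should really be high traffic. That reading is an assumption, not something the email states, and the labels below are made up purely for illustration.

```python
def precision(actual, predicted):
    """Share of recipes predicted as high traffic that really were high traffic."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a and p)
    pred_pos = sum(1 for p in predicted if p)
    return true_pos / pred_pos if pred_pos else 0.0

# Hypothetical labels: True means "high traffic".
actual    = [True, True, False, True, False, True, False, True]
predicted = [True, True, False, True, True,  True, False, False]

score = precision(actual, predicted)
meets_target = score >= 0.80
print(f"precision = {score:.2f}, meets 80% target: {meets_target}")
```

Precision also matches the second request, minimizing the chance of showing unpopular recipes, because every false positive is an unpopular recipe placed on the home page.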
Tasty Bytes was founded in 2020 in the midst of the Covid Pandemic. The world wanted inspiration so we decided to provide it. We started life as a search engine for recipes, helping people to find ways to use up the limited supplies they had at home.
Now, over two years on, we are a fully fledged business. For a monthly subscription we will put together a full meal plan to ensure you and your family are getting a healthy, balanced diet whatever your budget. Subscribe to our premium plan and we will also deliver the ingredients to your door.
This is an example of how a recipe may appear on the website. We haven't included all of the steps, but you should get an idea of what visitors to the site see.
Tomato Soup
Servings: 4
Time to make: 2 hours
Category: Lunch/Snack
Cost per serving: $
Nutritional Information (per serving)
- Calories 123
- Carbohydrate 13g
- Sugar 1g
- Protein 4g
Ingredients:
- Tomatoes
- Onion
- Carrot
- Vegetable Stock
Method:
1. Cut the tomatoes into quarters….
The product manager has tried to make this easier for us and provided data for each recipe, as well as whether there was high traffic when the recipe was featured on the home page.
As you will see, they haven't given us all of the information they have about each recipe.
You can find the data here.
I will let you decide how to process it, just make sure you include all your decisions in your report.
Don't forget to double check the data really does match what they say - it might not.
| Column Name | Details |
|---|---|
| recipe | Numeric, unique identifier of recipe |
| calories | Numeric, number of calories |
| carbohydrate | Numeric, amount of carbohydrates in grams |
| sugar | Numeric, amount of sugar in grams |
| protein | Numeric, amount of prote... |
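
The advice to double check that the data matches its description can be sketched as a small validation pass. This is a minimal sketch using only the columns visible in the table above; the sample rows are invented, and the rules checked (unique `recipe`, numeric and non-negative nutrition values) are assumptions drawn from the column descriptions.

```python
import csv
import io

# Made-up sample rows; only the columns visible in the table above are used.
SAMPLE = """recipe,calories,carbohydrate,sugar,protein
1,123.0,13.0,1.0,4.0
2,,38.56,0.66,0.92
2,914.28,42.68,3.09,2.88
4,abc,30.56,38.63,0.02
"""

NUMERIC_COLUMNS = ["calories", "carbohydrate", "sugar", "protein"]

def validate(rows):
    """Return (recipe_id, problem) tuples for rows that break the data dictionary."""
    problems = []
    seen_ids = set()
    for row in rows:
        rid = row["recipe"]
        if rid in seen_ids:
            problems.append((rid, "duplicate recipe id"))
        seen_ids.add(rid)
        for col in NUMERIC_COLUMNS:
            value = row[col]
            if value == "":
                problems.append((rid, f"{col} is missing"))
                continue
            try:
                if float(value) < 0:
                    problems.append((rid, f"{col} is negative"))
            except ValueError:
                problems.append((rid, f"{col} is not numeric: {value!r}"))
    return problems

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
problems = validate(rows)
for rid, problem in problems:
    print(rid, problem)
```

Each problem found this way is a decision point (drop, impute, or correct), which is the kind of choice the brief asks to be recorded in the report.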

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
First Impressions Dataset
The dataset contains 20,000 images of people. For each person, a first impression of them was created. The first impression is a text consisting of several sentences.
💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on our website to buy the dataset
Content
The dataset includes a folder with images of 20,000 people. The .csv file consists of columns:
image_id - the… See the full description on the dataset page: https://huggingface.co/datasets/UniqueData/first-impressions-dataset.

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains two files related to our article about reader eclecticism.

One file contains metadata about books, derived from their landing pages on Goodreads.com. It's formatted as JSON and structured like a Python dictionary, where the keys are urls for each book's works page on Goodreads. The values include the book's title (as a string), the author (string), the average rating (float), the number of ratings (integer), and some shelves (dictionary). The last of these refers to the shelf data available on each book's landing page; at the time of the scrape (fall 2021), Goodreads showed up to 10 of these, and included information about how many people had tagged the book with each shelf. They no longer do this, and reconstructing the weights is non-trivial (you can find detailed information about all of a book's shelves, but Goodreads sometimes groups shelves into an overarching category for the landing page). The information collected here does reflect user interaction with the book, but these are caveats worth considering. In any case, the sub-dictionary uses the shelves as keys and has their weights as values. The file contains information about 884,722 books.

The second file shows how we've sorted all of the shelves in our dataset into just a few clusters. This file is very simple (just a two-column csv with the name of the shelf and its cluster), but producing it was complicated. First, we made a network out of our shelves. Each shelf is a node, and we draw an edge between two shelves if they appear in the same book. As we see additional books that combine those shelves, we add to the edge weight. In the end we got a network that shows how all 1,194 shelves in our network are used relative to each other. When we had the network, we used community detection to see how the shelves cluster together. There are many ways to do this, but we used the Louvain method.

This approach is non-deterministic and sensitive to various decisions, like the granularity of the community detection. To shore up our sense of the community structure (sometimes called "modularity") of this network, we spent a lot of time on this process. We ran community detection 10,000 times each at a few different granularities. We examined the resulting communities to see which ones tended to show up often and which emerged rarely, and we also observed how shelves tended to show up together. In the end we settled on the eight communities you see in this spreadsheet. We picked the names of each community ourselves. If you want to repeat this process, you will probably wind up with a somewhat different picture.

We request that any outputs resulting from use of this dataset acknowledge the Price Lab / J.D. Porter. We have chosen not to share data about specific Goodreads users, in order to protect their privacy. We are, however, open to corresponding with researchers about sharing and collaboration.
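
The shelf network described above can be sketched in a few lines. This is a minimal sketch, not the authors' code: the `"shelves"` field name, the placeholder book keys, and the choice to add 1 to an edge for each book that combines two shelves are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical miniature of the book metadata: keys stand in for book URLs,
# and "shelves" (an assumed field name) maps shelf name -> weight.
books = {
    "book-url-1": {"shelves": {"fantasy": 120, "magic": 40, "young-adult": 15}},
    "book-url-2": {"shelves": {"fantasy": 80, "magic": 22}},
    "book-url-3": {"shelves": {"young-adult": 60, "romance": 30}},
}

def shelf_edges(books):
    """Count, for each pair of shelves, how many books carry both."""
    edge_weights = Counter()
    for record in books.values():
        # Sort so each undirected pair is always stored in the same order.
        shelves = sorted(record["shelves"])
        for a, b in combinations(shelves, 2):
            edge_weights[(a, b)] += 1
    return edge_weights

edges = shelf_edges(books)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The resulting weighted edge list is the kind of input a Louvain implementation expects, for example `louvain_communities` in NetworkX. Which implementation and settings the authors used is not stated here.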

Statistics of how many adults access the internet and use different types of technology, covering:
- home internet access
- how people connect to the web
- how often people use the web/computers
- whether people use mobile devices
- whether people buy goods over the web
- whether people carried out specified activities over the internet

For more information see the ONS website and the UKDS website.

How many people use social media?
Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
Who uses social media?
Social networking is one of the most popular digital activities worldwide, and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as less developed digital markets catch up with other regions when it comes to infrastructure development and the availability of cheap mobile devices. In fact, most of social media’s global growth is driven by the increasing usage of mobile devices. Mobile-first market Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
How much time do people spend on social media?
Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. Internet users in Latin America had the highest average time spent per day on social media.
What are the most popular social media platforms?
Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classify video clips with natural scenes of actions performed by people visible in the videos.
See the UCF101 Dataset web page: https://www.crcv.ucf.edu/data/UCF101.php#Results_on_UCF101
This example dataset consists of the 10 most numerous video classes from the UCF101 dataset. For the top 5 version, see: https://doi.org/10.5281/zenodo.7924745 .
Based on this code: https://keras.io/examples/vision/video_classification/ (needs to be updated, if it has not been already; see the issue: https://github.com/keras-team/keras-io/issues/1342).
Testing if data can be downloaded from figshare with wget, see: https://github.com/mojaveazure/angsd-wrapper/issues/10
For generating the subset, see this notebook: https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb. However, it also needs to be adjusted (if it has not been already; I will then post a link to the notebook here or elsewhere, e.g., in the corrected notebook with the Keras example).
I would like to thank Sayak Paul for contacting me about his example at Keras documentation being out of date.
Cite this dataset as:
Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. https://doi.org/10.48550/arXiv.1212.0402
To download the dataset via the command line, please use:
wget -q https://zenodo.org/record/7882861/files/ucf101_top10.tar.gz -O ucf101_top10.tar.gz
tar xf ucf101_top10.tar.gz

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for StereoSet
Dataset Summary
StereoSet is a dataset that measures stereotype bias in language models. StereoSet consists of 17,000 sentences that measure model preferences across gender, race, religion, and profession.
Supported Tasks and Leaderboards
multiple-choice question answering
Languages
English (en)
Dataset Structure
Data Instances
{'bias_type': 'race', 'context': 'Many people live in… See the full description on the dataset page: https://huggingface.co/datasets/McGill-NLP/stereoset.

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Biometric Attack Dataset, Hispanic People
A similar dataset that includes all ethnicities: Anti Spoofing Real Dataset
The dataset for face anti-spoofing and face recognition includes images and videos of Hispanic people: 32,600+ photos and videos of 16,300 people from 20 countries. The dataset helps in enhancing the performance of the model by providing a wider range of data for a specific ethnic group. The videos were gathered by capturing faces of genuine individuals… See the full description on the dataset page: https://huggingface.co/datasets/UniqueData/hispanic-people-liveness-detection-video-dataset.

The global number of Facebook users was forecast to continuously increase between 2023 and 2027 by a total of 391 million users (+14.36 percent). After the fourth consecutive increasing year, the Facebook user base is estimated to reach 3.1 billion users and therefore a new peak in 2027. Notably, the number of Facebook users was continuously increasing over the past years.

User figures, shown here regarding the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts by persons only once.

The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).
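
The three headline figures (an increase of 391 million, +14.36 percent, and 3.1 billion in 2027) can be checked against each other with a little arithmetic. This sketch assumes the percentage is measured relative to the 2023 user base, which the text implies but does not state outright.

```python
# Figures quoted in the forecast above.
increase_millions = 391
increase_share = 0.1436  # +14.36 percent, assumed relative to the 2023 base

# Implied 2023 base and 2027 forecast, in millions of users.
base_2023 = increase_millions / increase_share
forecast_2027 = base_2023 + increase_millions

print(f"implied 2023 base: {base_2023 / 1000:.2f} billion")
print(f"implied 2027 forecast: {forecast_2027 / 1000:.2f} billion")
```

Under that assumption the implied 2027 figure rounds to the quoted 3.1 billion, so the three numbers are mutually consistent.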

The dataset was collected via web scraping from Coursera's website and contains six .csv tables with rich information on specializations/professional certificates, courses, and weekly study materials for all available courses. The source code used for web scraping has also been made available online (see GitHub link https://github.com/TK-Problem/Coursera-scrapper). Just keep in mind that Coursera's website can change in the future, so the scraper may not remain fully functional. Also, read the README.md file for the explanation of why the number of reviews doesn't match between different .csv tables.
The data was scraped on 2023-09-03, so it might not be up to date in the future.
All tables can be joined using SpecializationURL and CourseURL columns.
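A join on those key columns might look like the sketch below. Only the `SpecializationURL` and `CourseURL` column names come from the description; the table contents, the other column names, and which table carries which key are hypothetical.

```python
import csv
import io

# Hypothetical extracts of two of the six tables.
SPECIALIZATIONS = """SpecializationURL,SpecializationName
/specializations/a,Spec A
/specializations/b,Spec B
"""
COURSES = """SpecializationURL,CourseURL,CourseName
/specializations/a,/learn/x,Course X
/specializations/a,/learn/y,Course Y
/specializations/b,/learn/z,Course Z
"""

def read_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def left_join(left_rows, right_rows, key):
    """Attach the matching right-hand row (if any) to every left-hand row."""
    index = {row[key]: row for row in right_rows}
    return [{**index.get(row[key], {}), **row} for row in left_rows]

joined = left_join(read_rows(COURSES), read_rows(SPECIALIZATIONS), "SpecializationURL")
for row in joined:
    print(row["CourseURL"], row["CourseName"], row["SpecializationName"])
```

The same helper would work with `CourseURL` as the key for tables that hold per-course details.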

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper constructs a dataset for Tibetan machine reading comprehension. The data comes from Yunzang website, and covers 12 fields of nature, culture, education, geography, history, life, society, art, technology, people, science and sports. The questions and answers of the dataset are manually entered and marked by 20 Tibetan professionals. It contains 631 articles, 903 paragraphs, and 2,000 question-and-answer pairs constructed based on the paragraphs. Data items mainly include article ID, title, paragraph, question and answer. The publication of this dataset is of great value for promoting the development of Tibetan information processing.

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
RCS Data Switzerland can help you connect with many people and grow your business. This dataset is perfect for getting probable RCS users number all across the country. Also, people can always use this for easy communication or direct marketing. Besides, the RCS Data Switzerland is a simple method for talking directly through SMS to interested people. If you want to boost your business easily, this database website is just suitable for you. Moreover, our RCS Data Switzerland is an excellent tool for marketing in this country. In addition, RCS messaging lets businesses send large, high-quality content to users, while SMS has fewer features but works on more devices. SMS became popular first, but RCS can improve its limited abilities. With this trustworthy number list, you can easily follow your marketing techniques. Most importantly, the best part is that everyone can enjoy a remarkable return on investment (ROI). Switzerland RCS Data will make your marketing more successful. The RCS system displays when a message is read or received. In fact, users can share files and high-quality photos. Also, this verified list is perfect for sending messages. However, you can reach people in different parts of the country. Our Switzerland RCS Data has over 95% accurate and up-to-date mobile numbers. Our special team confirms all the numbers to make sure they are the latest and active. Hence, our website presents customizable packages to fit your requirements. Additionally, the Switzerland RCS Data helps you reach the right people in your marketing efforts. By using this data correctly, you can develop your business across the nation. All data was created by obeying GDPR rules. Moreover, you get this dataset in an Excel or CSV file. In other words, this data allows you to share special offers, news, or reminders. In the end, you can buy this RCS Data from our website.

This is a repository for a UKRI Economic and Social Research Council (ESRC) funded project to understand the software used to analyse social sciences data. Any software produced has been made available under a BSD 2-Clause license and any data and other non-software derivative is made available under a CC-BY 4.0 International License. Note that the software that analysed the survey is provided for illustrative purposes - it will not work on the decoupled anonymised data set.

Exceptions to these licences are:
- Data from the UKRI ESRC is mostly made available under a CC BY-NC-SA 4.0 Licence.
- Data from Gateway to Research is made available under an Open Government Licence (Version 3.0).

Contents
- Survey data & analysis: esrc_data-survey-analysis-data.zip
- Other data: esrc_data-other-data.zip
- Transcripts: esrc_data-transcripts.zip
- Data Management Plan: esrc_data-dmp.zip

Survey data & analysis
The survey ran from 3rd February 2022 to 6th March 2023 during which 168 responses were received. Of these responses, three were removed because they were supplied by people from outside the UK without a clear indication of involvement with the UK or associated infrastructure. A fourth response was removed because it and another response came from the same person, which leaves 164 responses in the data.

The survey responses, Question (Q) Q1-Q16, have been decoupled from the demographic data, Q17-Q23. Questions Q24-Q28 are for follow-up and have been removed from the data. The institutions (Q17) and funding sources (Q18) have been provided in a separate file as this could be used to identify respondents. Q17, Q18 and Q19-Q23 have all been independently shuffled.

The data has been made available as Comma Separated Values (CSV) with the question number as the header of each column and the encoded responses in the column below. To see what the question and the responses correspond to you will have to consult the survey-results-key.csv which decodes the question and responses accordingly. A pdf copy of the survey questions is available on GitHub.

The survey data has been decoupled into:
- survey-results-key.csv - maps a question number and the responses to the actual question values.
- q1-16-survey-results.csv - the non-demographic component of the survey responses (Q1-Q16).
- q19-23-demographics.csv - the demographic part of the survey (Q19-Q21, Q23).
- q17-institutions.csv - the institution/location of the respondent (Q17).
- q18-funding.csv - funding sources within the last 5 years (Q18).

Please note the code that has been used to do the analysis will not run with the decoupled survey data.

Other data files included
- CleanedLocations.csv - normalised version of the institutions that the survey respondents volunteered.
- DTPs.csv - information on the UKRI Doctoral Training Partnerships (DTPs) scraped from the UKRI DTP contacts web page in October 2021.
- projectsearch-1646403729132.csv.gz - data snapshot from the UKRI Gateway to Research released on the 24th February 2022 made available under an Open Government Licence.
- locations.csv - latitude and longitude for the institutions in the cleaned locations.
- subjects.csv - research classifications for the ESRC projects for the 24th February data snapshot.
- topics.csv - topic classification for the ESRC projects for the 24th February data snapshot.

Interview transcripts
The interview transcripts have been anonymised and converted to markdown so that they are easier to process in general.
List of interview transcripts: 1269794877.md, 1578450175.md, 1792505583.md, 2964377624.md, 3270614512.md, 40983347262.md, 4288358080.md, 4561769548.md, 4938919540.md, 5037840428.md, 5766299900.md, 5996360861.md, 6422621713.md, 6776362537.md, 7183719943.md, 7227322280.md, 7336263536.md, 75909371872.md, 7869268779.md, 8031500357.md, 9253010492.md

Data Management Plan
The study's Data Management Plan is provided in PDF format and shows the different data sets used throughout the duration of the study and where they have been deposited, as well as how long the SSI will keep these records.
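
The independent shuffling applied to Q17, Q18 and Q19-Q23 can be illustrated with a short sketch. This is not the project's code: it assumes the demographic questions were shuffled together as one block, uses invented rows, and abbreviates the block to two columns.

```python
import random

def shuffle_groups_independently(rows, groups, seed=None):
    """Reorder each column group with its own random permutation.

    Values inside a group stay together, but rows can no longer be
    linked across groups, which is what breaks re-identification.
    """
    rng = random.Random(seed)
    shuffled = [dict(row) for row in rows]
    for group in groups:
        order = list(range(len(rows)))
        rng.shuffle(order)  # a fresh permutation for every group
        for target, source in enumerate(order):
            for col in group:
                shuffled[target][col] = rows[source][col]
    return shuffled

# Invented respondents; the real files use encoded responses.
rows = [
    {"Q17": "Inst A", "Q18": "Funder X", "Q19": 1, "Q20": 3},
    {"Q17": "Inst B", "Q18": "Funder Y", "Q19": 2, "Q20": 1},
    {"Q17": "Inst C", "Q18": "Funder Z", "Q19": 3, "Q20": 2},
    {"Q17": "Inst D", "Q18": "Funder X", "Q19": 1, "Q20": 1},
]
groups = [["Q17"], ["Q18"], ["Q19", "Q20"]]
result = shuffle_groups_independently(rows, groups, seed=0)
```

Column totals and within-group combinations are unchanged, so aggregate statistics survive, while cross-group analysis (for example institution against funding source) is no longer possible. That is consistent with the note that the original analysis code will not run on the decoupled data.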

Key Performance Indicators from the Active People Survey (APS). Data on volunteering, club membership, tuition, organised sport, competition, and satisfaction with local sports provision, for local authorities, based on the Active People Survey.
- KPI 1 Participation is defined as taking part on at least 3 days a week in moderate intensity sport and active recreation (at least 12 days in the last 4 weeks) for at least 30 minutes continuously in any one session. Participation includes recreational walking and cycling.
- KPI 2 Volunteering is defined as ‘Volunteering to support sport for at least one hour a week’.
- KPI 3 Club membership is defined as ‘being a member of a club particularly so that you can participate in sport or recreational activity in the last 4 weeks’.
- KPI 4 Receiving tuition is defined as ‘having received tuition from an instructor or coach to improve your performance in any sport or recreational activity in the last 12 months’.
- KPI 5 Organised Competition is defined as ‘having taken part in any organised competition in any sport or recreational activity in the last 12 months’.
- KPI 6 Satisfaction is the percentage of adults who are very or fairly satisfied with sports provision in their local area.

Organised sport is defined as the percentage of adults who have done at least one of the following: received tuition in the last 12 months, taken part in organised competition in the last 12 months or been a member of a club to play sport.

A statistically significant change is indicated by 'increase' or 'decrease' and this means that we are 95% certain that there has been a real change (increase or decrease). For more information on measuring statistically significant change within Active People, see the briefing note on Sport England’s website. The 'Base' refers to the sample size, i.e. the number of respondents.

http://activepeople.sportengland.org/
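
The KPI 1 and organised sport definitions above translate into simple rules. The sketch below is illustrative only: the session-log format is hypothetical, and it assumes every logged session already counts as moderate intensity sport or active recreation.

```python
def meets_kpi1(session_log):
    """KPI 1: at least 12 days in the last 4 weeks with a session of 30+ continuous minutes.

    session_log is a hypothetical list of (day, minutes) pairs covering the last
    4 weeks; all sessions are assumed to be of moderate intensity.
    """
    qualifying_days = {day for day, minutes in session_log if minutes >= 30}
    return len(qualifying_days) >= 12

def organised_sport(tuition_12m, competition_12m, club_member):
    """Organised sport: at least one of tuition, competition or club membership."""
    return tuition_12m or competition_12m or club_member

# Twelve distinct days with one 30-minute session each.
regular = [(day, 30) for day in range(1, 13)]
# Same twelve days, but one session falls short of 30 minutes.
one_short = [(day, 30) for day in range(1, 12)] + [(12, 20)]

print(meets_kpi1(regular), meets_kpi1(one_short))
print(organised_sport(False, False, True))
```

Counting distinct days rather than sessions matters here: two qualifying sessions on the same day still count as one day towards the threshold of 12.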

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Biometric Attack Dataset, Black People
A similar dataset that includes all ethnicities: Anti Spoofing Real Dataset
The dataset for face anti-spoofing and face recognition includes images and videos of black people. The dataset helps in enhancing the performance of the model by providing a wider range of data for a specific ethnic group. The videos were gathered by capturing faces of genuine individuals presenting spoofs, using facial presentations. Our dataset proposes… See the full description on the dataset page: https://huggingface.co/datasets/UniqueData/black-people-liveness-detection-video-dataset.

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MMLU
Dataset Summary
Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Web Camera People Behavior Dataset for computer vision tasks
Dataset includes 2,300+ individuals, contributing to a total of 53,800+ videos and 9,300+ images captured via webcams. It is designed to study social interactions and behaviors in various remote meetings, including video calls, video conferencing, and online meetings. By leveraging this dataset, developers and researchers can enhance their understanding of human behavior in digital communication settings, contributing to… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/web-camera-people-behavior.

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Crowd Counting Dataset
The dataset includes images featuring crowds of people ranging from 0 to 5000 individuals. The dataset includes a diverse range of scenes and scenarios, capturing crowds in various settings. Each image in the dataset is accompanied by a corresponding JSON file containing detailed labeling information for each person in the crowd for crowd count and classification.
Types of crowds in the dataset: 0-1000, 1000-2000, 2000-3000, 3000-4000 and 4000-5000 This… See the full description on the dataset page: https://huggingface.co/datasets/UniqueData/crowd-counting-dataset.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: Recent years have seen a focus on research into distributed optimization algorithms for multi-robot Collaborative Simultaneous Localization and Mapping (C-SLAM). Research in this domain, however, is made difficult by a lack of standard benchmark datasets. Such datasets have been used to great effect in the field of single-robot SLAM, and researchers focused on multi-robot problems would benefit greatly from dedicated benchmark datasets. To address this gap we design and release the Collaborative Open-Source Multi-robot Optimization Benchmark (COSMO-Bench) -- a suite of 24 datasets derived from a state-of-the-art C-SLAM front-end and real-world LiDAR data. For additional details please see our associated publication: https://arxiv.org/abs/2508.16731

This entry, hosted through Carnegie Mellon University libraries, serves to host the official dataset release in perpetuity. However, we also support a website that provides a somewhat nicer user interface at cosmobench.com

NOTE - Shortly after making this data available we were notified of some issues with the groundtruth of the CU-Multi data on which the kittredge and main_campus datasets are based. This issue has since been resolved and new versions of the affected datasets have been uploaded. If you are one of the handful of people that downloaded these datasets before September 15th 2025, please update to the corrected versions. To verify that you have the correct versions please see instructions in README.md

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Selfies and video dataset
The dataset covers 4000 people. Each person took a selfie on a webcam and a selfie on a mobile phone. In addition, people recorded video from the phone and from the webcam, on which they pronounced a given set of numbers. The dataset includes folders corresponding to the people in it. Each folder includes 8 files (4 images and 4 videos).
💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on… See the full description on the dataset page: https://huggingface.co/datasets/UniqueData/selfie_and_video.