Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset originates from DataCamp. Many users have reposted copies of the CSV on Kaggle, but most of those uploads omit the original instructions, business context, and problem framing. In this upload, I’ve included that missing context in the About Dataset so the reader of my notebook or any other notebook can fully understand how the data was intended to be used and the intended problem framing.
Note: I have also uploaded a visualization of the workflow I personally took to tackle this problem, but it is not part of the dataset itself.
Additionally, I created a PowerPoint presentation based on my work in the notebook, which you can download from here:
PPTX Presentation
From: Head of Data Science
Received: Today
Subject: New project from the product team
Hey!
I have a new project for you from the product team. Should be an interesting challenge. You can see the background and request in the email below.
I would like you to perform the analysis and write a short report for me. I want to be able to review your code as well as read your thought process for each step. I also want you to prepare and deliver the presentation for the product team - you are ready for the challenge!
They want us to predict which recipes will be popular 80% of the time and minimize the chance of showing unpopular recipes. I don't think that is realistic in the time we have, but do your best and present whatever you find.
You can find more details about what I expect you to do here. And information on the data here.
I will be on vacation for the next couple of weeks, but I know you can do this without my support. If you need to make any decisions, include them in your work and I will review them when I am back.
Good Luck!
From: Product Manager - Recipe Discovery
To: Head of Data Science
Received: Yesterday
Subject: Can you help us predict popular recipes?
Hi,
We haven't met before but I am responsible for choosing which recipes to display on the homepage each day. I have heard about what the data science team is capable of and I was wondering if you can help me choose which recipes we should display on the home page?
At the moment, I choose my favorite recipe from a selection and display that on the home page. We have noticed that traffic to the rest of the website goes up by as much as 40% if I pick a popular recipe. But I don't know how to decide if a recipe will be popular. More traffic means more subscriptions so this is really important to the company.
Can your team:
- Predict which recipes will lead to high traffic?
- Correctly predict high traffic recipes 80% of the time?
We need to make a decision on this soon, so I need you to present your results to me by the end of the month. Whatever your results, what do you recommend we do next?
Look forward to seeing your presentation.
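One way to read the "80% of the time" request is as a precision target: of the recipes the model flags as high traffic, at least 80% should turn out to be high traffic. That reading is an assumption, and the labels below are made up for illustration only. A minimal sketch:

```python
def precision(predicted_high, actual_high):
    """Share of recipes predicted as high traffic that really were high traffic."""
    true_pos = sum(1 for p, a in zip(predicted_high, actual_high) if p and a)
    flagged = sum(1 for p in predicted_high if p)
    return true_pos / flagged if flagged else 0.0


# Hypothetical labels for ten recipes (True = high traffic); not real data.
predicted = [True, True, True, True, True, False, False, False, True, False]
actual = [True, True, False, True, True, False, True, False, True, False]

score = precision(predicted, actual)
meets_target = score >= 0.80  # the 80% threshold from the request
print(f"precision={score:.2f}, meets target: {meets_target}")
```

Precision fits the goal of not showing unpopular recipes on the home page. If missing popular recipes matters just as much, recall would need to be reported alongside it.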
Tasty Bytes was founded in 2020 in the midst of the Covid Pandemic. The world wanted inspiration so we decided to provide it. We started life as a search engine for recipes, helping people to find ways to use up the limited supplies they had at home.
Now, over two years on, we are a fully fledged business. For a monthly subscription we will put together a full meal plan to ensure you and your family are getting a healthy, balanced diet whatever your budget. Subscribe to our premium plan and we will also deliver the ingredients to your door.
This is an example of how a recipe may appear on the website. We haven't included all of the steps, but you should get an idea of what visitors to the site see.
Tomato Soup
Servings: 4
Time to make: 2 hours
Category: Lunch/Snack
Cost per serving: $
Nutritional Information (per serving):
- Calories 123
- Carbohydrate 13g
- Sugar 1g
- Protein 4g
Ingredients:
- Tomatoes
- Onion
- Carrot
- Vegetable Stock
Method:
1. Cut the tomatoes into quarters….
The product manager has tried to make this easier for us and provided data for each recipe, as well as whether there was high traffic when the recipe was featured on the home page.
As you will see, they haven't given us all of the information they have about each recipe.
You can find the data here.
I will let you decide how to process it, just make sure you include all your decisions in your report.
Don't forget to double check the data really does match what they say - it might not.
| Column Name | Details |
|---|---|
| recipe | Numeric, unique identifier of recipe |
| calories | Numeric, number of calories |
| carbohydrate | Numeric, amount of carbohydrates in grams |
| sugar | Numeric, amount of sugar in grams |
| protein | Numeric, amount of prote... |
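The advice to double check that the data matches its description can be sketched as a small validation pass. This is only an illustration: the sample rows are invented, only the columns listed in the table above are checked, and the rules (unique `recipe`, numeric and non-negative nutrition values) are assumptions drawn from the column descriptions.

```python
import csv
import io

# Hypothetical sample rows; the real file and its remaining columns are not shown here.
SAMPLE = """recipe,calories,carbohydrate,sugar,protein
1,123.0,13.0,1.0,4.0
2,,38.56,0.66,0.92
2,97.03,30.56,abc,0.02
"""

NUMERIC_COLUMNS = ["calories", "carbohydrate", "sugar", "protein"]


def check_rows(text):
    """Return a list of (line_number, problem) tuples found in the CSV text."""
    problems = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        recipe_id = row["recipe"]
        if recipe_id in seen_ids:
            problems.append((line_no, "duplicate recipe id"))
        seen_ids.add(recipe_id)
        for col in NUMERIC_COLUMNS:
            value = row[col]
            if value == "":
                problems.append((line_no, f"missing {col}"))
                continue
            try:
                number = float(value)
            except ValueError:
                problems.append((line_no, f"non-numeric {col}"))
                continue
            if number < 0:
                problems.append((line_no, f"negative {col}"))
    return problems


problems = check_rows(SAMPLE)
for line_no, problem in problems:
    print(f"line {line_no}: {problem}")
```

Each reported problem is a decision point (drop, impute, or correct), and those decisions belong in the report.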
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This users dataset is a preview of a much bigger dataset, with lots of related data (product listings of sellers, comments on listed products, etc...).
My Telegram bot will answer your queries and allow you to contact me.
There are a lot of unknowns when running an E-commerce store, even when you have analytics to guide your decisions.
Users are an important factor in an e-commerce business. This is especially true in a C2C-oriented store, since they are both the suppliers (by uploading their products) AND the customers (by purchasing other users' articles).
This dataset aims to serve as a benchmark for an e-commerce fashion store. Using this dataset, you may want to try and understand what you can expect of your users and determine in advance what your growth may be.
If you think this kind of dataset may be useful or if you liked it, don't forget to show your support or appreciation with an upvote/comment. You may even include how you think this dataset might be of use to you. This way, I will be more aware of specific needs and be able to adapt my datasets to better suit your needs.
This dataset is part of a preview of a much larger dataset. Please contact me for more.
The data was scraped from a successful online C2C fashion store with over 10M registered users. The store was first launched in Europe around 2009 then expanded worldwide.
Visitors vs Users: Visitors do not appear in this dataset. Only registered users are included. "Visitors" cannot purchase an article but can view the catalog.
Questions you might want to answer using this dataset:
Example works:
For other licensing options, contact me.
A dataset of COVID-19 testing sites. If looking for a test, please use the Testing Sites locator app. You will be asked for identification and will also be asked for health insurance information. Identification will be required to receive a test. If you don’t have health insurance, you may still be able to receive a test by paying out-of-pocket. Some sites may also:
- Limit testing to people who meet certain criteria.
- Require an appointment.
- Require a referral from your doctor.
Check a location’s specific details on the map. Then, call or visit the provider’s website before going for a test.
The global number of Facebook users was forecast to continuously increase between 2023 and 2027 by a total of 391 million users (+14.36 percent). After the fourth consecutive increasing year, the Facebook user base is estimated to reach 3.1 billion users and therefore a new peak in 2027. Notably, the number of Facebook users was continuously increasing over the past years.
User figures, shown here regarding the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts by persons only once.
The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).
Introduction
The GiGL Spaces to Visit dataset provides locations and boundaries for open space sites in Greater London that are available to the public as destinations for leisure, activities and community engagement. It includes green corridors that provide opportunities for walking and cycling. The dataset has been created by Greenspace Information for Greater London CIC (GiGL). As London’s Environmental Records Centre, GiGL mobilises, curates and shares data that underpin our knowledge of London’s natural environment. We provide impartial evidence to support informed discussion and decision making in policy and practice. GiGL maps under licence from the Greater London Authority.
Description
This dataset is a sub-set of the GiGL Open Space dataset, the most comprehensive dataset available of open spaces in London. Sites are selected for inclusion in Spaces to Visit based on their public accessibility and likelihood that people would be interested in visiting. The dataset is a mapped Geographic Information System (GIS) polygon dataset where one polygon (or multi-polygon) represents one space. As well as site boundaries, the dataset includes information about a site’s name, size and type (e.g. park, playing field etc.). GiGL developed the Spaces to Visit dataset to support anyone who is interested in London’s open spaces - including community groups, web and app developers, policy makers and researchers - with an open licence data source. More detailed and extensive data are available under GiGL data use licences for GiGL partners, researchers and students. Information services are also available for ecological consultants, biological recorders and community volunteers – please see www.gigl.org.uk for more information. Please note that access and opening times are subject to change (particularly at the current time), so if you are planning to visit a site, check on the local authority or site website that it is open.
The dataset is updated on a quarterly basis. If you have questions about this dataset please contact GiGL’s GIS and Data Officer.
Data sources
The boundaries and information in this dataset are a combination of data collected during the London Survey Method habitat and open space survey programme (1986 – 2008) and information provided to GiGL from other sources since. These sources include London borough surveys, land use datasets, volunteer surveys, feedback from the public, park friends’ groups, and updates made as part of GiGL’s on-going data validation and verification process. Due to data availability, some areas are more up-to-date than others. We are continually working on updating and improving this dataset. If you have any additional information or corrections for sites included in the Spaces to Visit dataset please contact GiGL’s GIS and Data Officer.
NOTE: The dataset contains OS data © Crown copyright and database rights 2025. The site boundaries are based on Ordnance Survey mapping, and the data are published under Ordnance Survey's 'presumption to publish'. When using these data please acknowledge GiGL and Ordnance Survey as the source of the information using the following citation: ‘Dataset created by Greenspace Information for Greater London CIC (GiGL), 2025 – Contains Ordnance Survey and public sector information licensed under the Open Government Licence v3.0 ’
This dataset shows the Alexa Top 100 International Websites and provides metrics on the volume of traffic that these sites were able to handle. The Alexa Top 100 lists the 100 most visited websites in the world and measures various statistical information. I looked up each site's headquarters, either through Alexa or a Whois lookup, to get a street address, which I was then able to geocode. I was only able to successfully geocode 85 of the top 100 sites throughout the world. Source of data was Alexa.com. Source URL: http://www.alexa.com/site/ds/top_sites?ts_mode=global&lang=none Data was from October 12, 2007. Alexa is updated daily, so to get more up-to-date information visit their site directly. They don't have maps, though.
How much time do people spend on social media?
As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just two hours and 16 minutes.
Global social media usage
Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively.
People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events and friends.
Global impact of social media
Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general.
During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics and heightened everyday distractions.
The global number of internet users was forecast to continuously increase between 2024 and 2029 by a total of 1.3 billion users (+23.66 percent). After the fifteenth consecutive increasing year, the number of users is estimated to reach 7 billion users and therefore a new peak in 2029. Notably, the number of internet users was continuously increasing over the past years.
Depicted is the estimated number of individuals in the country or region at hand that use the internet. As the data source clarifies, connection quality and usage frequency are distinct aspects, not taken into account here.
The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).
Find more key insights for the number of internet users in regions like the Americas and Asia.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Lithuania number dataset is a database of phone numbers collected from trusted sources. This means the numbers come from reliable places like government records, websites, or phone companies. The companies that provide this data work hard to ensure it is correct. They even offer source URLs, so you can see where the data came from. Moreover, you get 24/7 support, so if you have questions, help is always available. List to Data is a helpful website for finding important cell numbers quickly. Additionally, the phone numbers in the Lithuania number dataset follow an opt-in system. This means people agreed to share their phone numbers. This system is important because it keeps the data legal. It ensures that you are only contacting people who have given permission. Number data in Lithuania makes it easy to connect with the right people. Lithuania phone data is a special set of phone numbers that you can filter to meet your needs. You can easily filter the list by gender, age, and relationship status. For example, you can quickly sort the data to contact older adults or young singles easily. This flexibility makes it easier to communicate with the right audience. Therefore, you can connect with the people you want to reach. Also, the Lithuanian phone data follows strict GDPR rules. These rules protect people’s privacy and make sure their information stays safe. We collect and use the database of Lithuania in ways that respect everyone’s rights. Additionally, it removes any invalid numbers. You can find important phone numbers easily on our website, List to Data. Lithuania phone number list is a collection of phone numbers from people living in Lithuania. This list is completely correct and valid, meaning all numbers work properly. Companies check every phone number to ensure it is accurate. If you find a number that doesn’t work, you can get a new one for free. Moreover, Lithuania phone number list is about all numbers from authorized customers. 
People on this list agreed to share their numbers. As a result, you can use the data without worrying about legal issues. This makes the phonebook safe and useful for businesses that want to connect with people in Lithuania.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Italy number dataset includes phone numbers that businesses can trust. The dataset comes from reliable sources, ensuring accuracy. These sources collect numbers from various places, such as public records and directories. You can also find source URLs, which help you verify where the data came from. This adds another layer of credibility to the information. Additionally, this data provides 24/7 support. This is important for businesses that need quick answers. Furthermore, this Italy number dataset follows an opt-in process. This means every person whose number appears in the list agreed to have their number shared. They understand how we will use their information, making it safe to contact them. With this number dataset, businesses gain access to trustworthy and reliable information. List to Data is a website that helps you quickly find important phone numbers. Italy phone data is a valuable database that allows businesses to filter information based on specific needs. This means you can filter the data by gender, age, and relationship status. For example, businesses can easily find numbers for younger people to reach that age group. This ability to filter information makes communication more effective. You can focus on the audience that matters most to you. Moreover, you can remove invalid Italy phone data from the list. That means if any number becomes inactive, you can take it out. Keeping only active numbers helps ensure that your contacts are always up-to-date. This process makes it easy to get up-to-date info regularly. The ability to filter, remove invalid data, and stay GDPR compliant makes this data powerful for organizations. Italy phone number list is a collection of phone numbers from people living in Italy. This list is very useful for businesses and organizations that want to reach out to these individuals. The numbers in this list are 100% correct and valid. This means that every number works, so businesses can call confidently. 
If any number does not work, you receive a replacement guarantee. Furthermore, every number in the Italy phone number list comes from a customer permission basis. This means that people on the list agreed to have their phone numbers shared. By using this list, businesses can effectively connect with the right people while keeping everything legal and safe. The valid numbers and replacement guarantee make this list an excellent tool for outreach.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the results of running the automatic audio annotation algorithms for pitch, tempo and key used for the evaluation of algorithms developed during the AudioCommons H2020 EU project and which are part of the Audio Commons Audio Extractor tool. It also includes estimation results information for the single-eventness audio descriptor also developed for the same tool.
These estimation results have been used to generate the following documents:
All these documents are available in the materials section of the AudioCommons website.
All data in this repository is provided in the form of CSV files. Each CSV file corresponds to the analysis results of one musical task and one of the individual datasets used in the aforementioned deliverables. This repository does not include the audio files of each individual dataset, but includes references to the audio files. The following paragraphs describe the structure of the CSV files and give some notes about how to obtain the audio files in case these would be needed.
Structure of the CSV files
All the CSV files in this repository (with the sole exception of SINGLE EVENT - Estimation Results Truth.csv) are named according to the following convention: "DATASET_NAME - ESTIMATION_TASK Estimation Results.csv". Therefore, estimation results for pitch, tempo and tonality music tasks are separated in different files. All these files share the same structure for the first 2 CSV columns:
The rest of the columns include the estimation results for each one of the algorithms included in the evaluation of each music facet. For each algorithm, two columns are reserved: the first one contains the actual estimation and the second one the confidence of this estimation (see CSV file previews below). The format of actual estimations depends on the musical task; check the description of the corresponding ground truth dataset for more information on that. The confidence value is a float number, typically in the range from 0.0 to 1.0. It can happen that one or both columns are empty for a given analysis algorithm and CSV row. This will be the case if the algorithm could not successfully produce an estimation for the audio file corresponding to the CSV row.
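The column layout described above (two leading columns, then an estimation/confidence pair per algorithm, with cells possibly empty) could be read roughly as follows. This is a sketch under assumptions: the header names `id_a`, `id_b`, `algo1` and `algo2` are placeholders rather than the repository's actual column names, and the sample rows are invented.

```python
import csv
import io

# Hypothetical file contents: the two leading column names and the algorithm
# names are placeholders, not taken from the actual repository.
SAMPLE = """id_a,id_b,algo1,algo1_conf,algo2,algo2_conf
s1,ref1,120.0,0.9,,
s2,ref2,98.5,,99.0,0.4
"""


def read_estimations(text, n_leading=2):
    """Parse rows into (leading_columns, {algorithm: (estimation, confidence)})."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Each algorithm owns two columns; take its name from the first of the pair.
    algorithms = header[n_leading::2]
    rows = []
    for record in reader:
        leading = record[:n_leading]
        results = {}
        for i, name in enumerate(algorithms):
            # Estimations stay as strings because their format depends on the task.
            estimation = record[n_leading + 2 * i] or None
            confidence = record[n_leading + 2 * i + 1]
            results[name] = (estimation, float(confidence) if confidence else None)
        rows.append((leading, results))
    return rows


rows = read_estimations(SAMPLE)
for leading, results in rows:
    print(leading, results)
```

Empty cells are mapped to `None` so that a failed estimation is not confused with a numeric value such as zero.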
The remaining CSV file, SINGLE EVENT - Estimation Results.csv, has the following 4 columns:
How to get the audio data
In this section we provide some notes about how to obtain the audio files corresponding to the estimation results provided here. Note that due to licensing restrictions we are not allowed to re-distribute the audio data corresponding to most of these automatic annotations.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
A dataset providing information about local council services in Leeds. Leeds City Council uses this information to populate the Knowledge Panels on the Google search website. The dataset includes type of service, contact information and opening times. What is a Knowledge Panel? When people search for a business on Google, they may see information about that business in a box that appears to the right of their search results. The information in the box, called the Knowledge Panel, can help customers discover and contact your business. Is the information correct?
This map shows the access to mental health providers in every county and state in the United States according to the 2024 County Health Rankings & Roadmaps data for counties, states, and the nation. It translates the numbers to explain how many additional mental health providers are needed in each county and state. According to the data, there are 319 people per mental health provider in the United States overall. The maps clearly illustrate that access to mental health providers varies widely across the country.
The data comes from this County Health Rankings 2024 layer. An updated layer is usually published each year, which allows comparisons from year to year. This map contains layers for 2024 and also for 2022 as a comparison. County Health Rankings & Roadmaps (CHR&R), a program of the University of Wisconsin Population Health Institute with support provided by the Robert Wood Johnson Foundation, draws attention to why there are differences in health within and across communities by measuring the health of nearly all counties in the nation. This map's layers contain 2024 CHR&R data for nation, state, and county levels. The CHR&R Annual Data Release is compiled using county-level measures from a variety of national and state data sources. CHR&R provides a snapshot of the health of nearly every county in the nation. A wide range of factors influence how long and how well we live, including: opportunities for education, income, safe housing and the right to shape policies and practices that impact our lives and futures. Health Outcomes tell us how long people live on average within a community, and how people experience physical and mental health in a community. Health Factors represent the things we can improve to support longer and healthier lives. They are indicators of the future health of our communities.
Some example measures are:
- Life Expectancy
- Access to Exercise Opportunities
- Uninsured
- Flu Vaccinations
- Children in Poverty
- School Funding Adequacy
- Severe Housing Cost Burden
- Broadband Access
To see a full list of variables, definitions and descriptions, explore the Fields information by clicking the Data tab here in the Item Details of this layer. For full documentation, visit the Measures page on the CHR&R website.
Notable changes in the 2024 CHR&R Annual Data Release:
- Measures of birth and death now provide more detailed race categories including a separate category for ‘Native Hawaiian or Other Pacific Islander’ and a ‘Two or more races’ category where possible. Find more information on the CHR&R website.
- Ranks are no longer calculated nor included in the dataset. CHR&R introduced a new graphic to the County Health Snapshots on their website that shows how a county fares relative to other counties in a state and nation.
Data Processing: County Health Rankings data and metadata were prepared and formatted for Living Atlas use by the CHR&R team. 2021 U.S. boundaries are used in this dataset for a total of 3,143 counties. Analytic data files can be downloaded from the CHR&R website.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: Recent years have seen a focus on research into distributed optimization algorithms for multi-robot Collaborative Simultaneous Localization and Mapping (C-SLAM). Research in this domain, however, is made difficult by a lack of standard benchmark datasets. Such datasets have been used to great effect in the field of single-robot SLAM, and researchers focused on multi-robot problems would benefit greatly from dedicated benchmark datasets. To address this gap we design and release the Collaborative Open-Source Multi-robot Optimization Benchmark (COSMO-Bench) -- a suite of 24 datasets derived from a state-of-the-art C-SLAM front-end and real-world LiDAR data. For additional details please see our associated publication: https://arxiv.org/abs/2508.16731
This entry, hosted through Carnegie Mellon University libraries, serves to host the official dataset release in perpetuity. However, we also support a website that provides a somewhat nicer user interface at cosmobench.com
NOTE - Shortly after making this data available we were notified of some issues with the groundtruth of the CU-Multi data on which the kittredge and main_campus datasets are based. This issue has since been resolved and new versions of the affected datasets have been uploaded. If you are one of the handful of people that downloaded these datasets before September 15th 2025, please update to the corrected versions. To verify that you have the correct versions please see instructions in README.md
The Find Ryan White HIV/AIDS Medical Care Providers tool is a locator that helps people living with HIV/AIDS access medical care and related services. Users can search for Ryan White-funded medical care providers near a specific complete address, city and state, state and county, or ZIP code. Search results are sorted by distance away and include the Ryan White HIV/AIDS facility name, address, approximate distance from the search point, telephone number, website address, and a link for driving directions. HRSA's Ryan White program funds an array of grants at the state and local levels in areas where most needed. These grants provide medical and support services to more than a half million people who otherwise would be unable to afford care.
How many people use social media?
Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
Who uses social media?
Social networking is one of the most popular digital activities worldwide and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as less developed digital markets catch up with other regions when it comes to infrastructure development and the availability of cheap mobile devices. In fact, most of social media’s global growth is driven by the increasing usage of mobile devices. Mobile-first market Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
How much time do people spend on social media?
Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. On average, internet users in Latin America had the highest average time spent per day on social media.
What are the most popular social media platforms?
Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
The live traffic camera feed provides images from 177 cameras at key sites across the Capital, showing what's happening on London's streets. All images are TfL branded, have a location description, and date and time-stamp. They are refreshed at least every three minutes. Individual feeds may be interrupted if there is a system fault or if a camera is being serviced. Images are not captured when a camera is in use for managing traffic, when a camera is being maintained or in the event of a camera or system fault. Some ideas for re-use include:
- Freight or delivery services could use the live feed to follow traffic conditions and plan routes accordingly
- Radio stations could add a live camera feed to a traffic news page
- Organisations with staff intranets could add the traffic camera feed so people can plan their journeys home
Find out more about the feeds available from Transport for London. The BBC use TfL camera images for the live camera feeds on their website. Visit the BBC website to see live camera images.
https://choosealicense.com/licenses/other/
Data Description
We release the training dataset of ChatQA. It is built and derived from existing datasets: DROP, NarrativeQA, NewsQA, Quoref, ROPES, SQuAD1.1, SQuAD2.0, TAT-QA, an SFT dataset, as well as our synthetic conversational QA dataset generated by GPT-3.5-turbo-0613. The SFT dataset is built and derived from: Soda, ELI5, FLAN, the FLAN collection, Self-Instruct, Unnatural Instructions, OpenAssistant, and Dolly. For more information about ChatQA, check the website!
Other… See the full description on the dataset page: https://huggingface.co/datasets/nvidia/ChatQA-Training-Data.
The global number of smartphone users was forecast to increase continuously between 2024 and 2029 by a total of 1.8 billion users (+42.62 percent). After the ninth consecutive year of growth, the smartphone user base is estimated to reach 6.1 billion users, a new peak, in 2029. Notably, the number of smartphone users has increased continuously over the past years. Smartphone users here are limited to internet users of any age using a smartphone. The figures shown have been derived from survey data that has been processed to estimate missing demographics. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the number of smartphone users in countries like Australia & Oceania and Asia.
https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by LLama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset originates from DataCamp. Many users have reposted copies of the CSV on Kaggle, but most of those uploads omit the original instructions, business context, and problem framing. In this upload, I’ve included that missing context in the About Dataset so the reader of my notebook or any other notebook can fully understand how the data was intended to be used and the intended problem framing.
Note: I have also uploaded a visualization of the workflow I personally took to tackle this problem, but it is not part of the dataset itself.
Additionally, I created a PowerPoint presentation based on my work in the notebook, which you can download from here:
PPTX Presentation
From: Head of Data Science
Received: Today
Subject: New project from the product team
Hey!
I have a new project for you from the product team. Should be an interesting challenge. You can see the background and request in the email below.
I would like you to perform the analysis and write a short report for me. I want to be able to review your code as well as read your thought process for each step. I also want you to prepare and deliver the presentation for the product team - you are ready for the challenge!
They want us to predict which recipes will be popular 80% of the time and minimize the chance of showing unpopular recipes. I don't think that is realistic in the time we have, but do your best and present whatever you find.
You can find more details about what I expect you to do here. And information on the data here.
I will be on vacation for the next couple of weeks, but I know you can do this without my support. If you need to make any decisions, include them in your work and I will review them when I am back.
Good Luck!
From: Product Manager - Recipe Discovery
To: Head of Data Science
Received: Yesterday
Subject: Can you help us predict popular recipes?
Hi,
We haven't met before but I am responsible for choosing which recipes to display on the homepage each day. I have heard about what the data science team is capable of and I was wondering if you can help me choose which recipes we should display on the home page?
At the moment, I choose my favorite recipe from a selection and display that on the home page. We have noticed that traffic to the rest of the website goes up by as much as 40% if I pick a popular recipe. But I don't know how to decide if a recipe will be popular. More traffic means more subscriptions so this is really important to the company.
Can your team:
- Predict which recipes will lead to high traffic?
- Correctly predict high traffic recipes 80% of the time?
We need to make a decision on this soon, so I need you to present your results to me by the end of the month. Whatever your results, what do you recommend we do next?
Look forward to seeing your presentation.
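One way to make the "80% of the time" request measurable is to read it as precision on the high-traffic class: of the recipes a model predicts as high traffic, what share really were? This reading is an assumption, since the email does not name a metric. The sketch below uses made-up labels, and the function name `precision_of_high_traffic` is hypothetical.

```python
def precision_of_high_traffic(y_true, y_pred):
    """Share of recipes predicted as high traffic that really were high traffic."""
    # Keep the true labels only for recipes the model flagged as high traffic.
    predicted_high = [truth for truth, pred in zip(y_true, y_pred) if pred]
    if not predicted_high:
        return 0.0  # nothing was flagged, so there is nothing to be right about
    return sum(predicted_high) / len(predicted_high)


# Hypothetical labels for eight recipes (True = high traffic).
actual = [True, True, False, True, False, True, True, False]
predicted = [True, True, True, True, False, True, False, False]

precision = precision_of_high_traffic(actual, predicted)
meets_target = precision >= 0.80  # the 80% goal from the request
print(f"precision={precision:.2f}, meets target: {meets_target}")
```

Precision fits the stated wish to minimize the chance of showing unpopular recipes, because it penalizes false "high traffic" predictions. It says nothing about popular recipes the model misses, so a report would probably quote recall next to it.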
Tasty Bytes was founded in 2020 in the midst of the Covid Pandemic. The world wanted inspiration so we decided to provide it. We started life as a search engine for recipes, helping people to find ways to use up the limited supplies they had at home.
Now, over two years on, we are a fully fledged business. For a monthly subscription we will put together a full meal plan to ensure you and your family are getting a healthy, balanced diet whatever your budget. Subscribe to our premium plan and we will also deliver the ingredients to your door.
This is an example of how a recipe may appear on the website. We haven't included all of the steps, but you should get an idea of what visitors to the site see.
Tomato Soup
Servings: 4
Time to make: 2 hours
Category: Lunch/Snack
Cost per serving: $
Nutritional Information (per serving)
- Calories 123
- Carbohydrate 13g
- Sugar 1g
- Protein 4g

Ingredients:
- Tomatoes
- Onion
- Carrot
- Vegetable Stock

Method:
1. Cut the tomatoes into quarters….
The product manager has tried to make this easier for us and provided data for each recipe, as well as whether there was high traffic when the recipe was featured on the home page.
As you will see, they haven't given us all of the information they have about each recipe.
You can find the data here.
I will let you decide how to process it, just make sure you include all your decisions in your report.
Don't forget to double check the data really does match what they say - it might not.
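A minimal sketch of such a check, using only the Python standard library and only the columns visible in the description table below (`recipe`, `calories`, `carbohydrate`, `sugar`, `protein`). The sample rows are invented for illustration and the helper name `validate_rows` is hypothetical. The real file has further columns that would need their own rules.

```python
import csv
import io

# Columns the description says are numeric (besides the recipe id).
NUMERIC_COLUMNS = ["calories", "carbohydrate", "sugar", "protein"]


def validate_rows(rows):
    """Return (row_number, column, problem) tuples for rows that break the description."""
    problems = []
    seen_ids = set()
    for row_no, row in enumerate(rows, start=1):
        recipe_id = row.get("recipe", "")
        if not recipe_id.isdigit():
            problems.append((row_no, "recipe", "not a numeric id"))
        elif recipe_id in seen_ids:
            problems.append((row_no, "recipe", "duplicate id"))
        seen_ids.add(recipe_id)
        for col in NUMERIC_COLUMNS:
            value = row.get(col, "")
            if value == "":
                problems.append((row_no, col, "missing"))
                continue
            try:
                if float(value) < 0:
                    problems.append((row_no, col, "negative"))
            except ValueError:
                problems.append((row_no, col, "not numeric"))
    return problems


# Invented sample data, not taken from the real dataset.
sample_csv = """recipe,calories,carbohydrate,sugar,protein
1,123,13,1,4
2,,20.5,3,7
2,250,abc,2,9
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
problems = validate_rows(rows)
for problem in problems:
    print(problem)
```

Each reported problem is a decision point for the report: whether to drop the row, impute the value, or correct it, and why.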
| Column Name | Details |
|---|---|
| recipe | Numeric, unique identifier of recipe |
| calories | Numeric, number of calories |
| carbohydrate | Numeric, amount of carbohydrates in grams |
| sugar | Numeric, amount of sugar in grams |
| protein | Numeric, amount of prote... |