This dataset was created by NIYIBIGIRA Geredi
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains job postings related to Data Science roles in 2025, collected from publicly available sources. It includes essential details such as job titles, seniority levels, company information, locations, salaries, industries, company size, and required skills. The dataset has been cleaned and structured to ensure accuracy and consistency, with duplicates and irrelevant entries removed.
It is designed to help researchers, students, and professionals analyze hiring trends, salary ranges, and in-demand skills in the Data Science job market. This dataset can also support projects in machine learning, career prediction, salary forecasting, and workforce analytics.
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
As a data scientist, you have most likely heard of Towards Data Science (TDS) at some point. It is an amazing publication covering many AI-related topics, offering hands-on project expertise, interesting discussions of frameworks and technologies, and the theory behind hundreds of algorithms.
I scraped the TDS archive from 2018 through 2021 to collect the title, tagline, URL, and date of (almost) every article published in those years. You can apply various techniques to this data, such as topic modeling.
If needed, I can also continue labeling this dataset. Just drop me a note about what you'd be interested in, and I'll add labels to this dataset.
Of course, special thanks to Towards Data Science and its editors for providing such great content in their publication. Reading such articles is always a great start to the day for me 😁
Think about ways to make sense of this data. What kind of articles have been published the most? What are the topics of the respective years or months?
Tip: You might also want to think about how you can enrich this data. There are many ways to do so!
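As one possible starting point for the question about topics per year, the sketch below counts the most frequent title words for each year. It is only an illustration: the sample rows, the `title` and `date` field names, and the tiny stop-word list are assumptions, not the actual schema of the scraped file, and plain word counts are a much simpler technique than topic modeling.

```python
from collections import Counter, defaultdict
from datetime import date
import re

# Hypothetical sample rows; the real file's column names and date format may differ.
articles = [
    {"title": "A Gentle Introduction to Topic Modeling", "date": "2018-03-14"},
    {"title": "Topic Modeling with LDA in Python", "date": "2018-07-02"},
    {"title": "Deploying Machine Learning Models with Docker", "date": "2020-01-20"},
    {"title": "Machine Learning in Production", "date": "2020-05-11"},
]

# A deliberately tiny stop-word list, just enough for this illustration.
STOP_WORDS = {"a", "an", "the", "to", "in", "with", "of", "and", "for", "on"}


def top_terms_by_year(rows, n=3):
    """Return the n most frequent title words for each publication year."""
    counts = defaultdict(Counter)
    for row in rows:
        year = date.fromisoformat(row["date"]).year
        words = re.findall(r"[a-z]+", row["title"].lower())
        counts[year].update(w for w in words if w not in STOP_WORDS)
    return {year: counter.most_common(n) for year, counter in sorted(counts.items())}


result = top_terms_by_year(articles, n=2)
for year, terms in result.items():
    print(year, terms)
```

Grouping by `(year, month)` instead of `year` would give a monthly view in the same way.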
I've been creating videos on YouTube since November of 2017 (https://www.youtube.com/c/KenJee1) with the mission of making data science accessible to more people. One of the best ways to do this is to tell stories and work on projects. This is my attempt at my first community project. I am making my YouTube data available for everyone to help better understand the growth of my YouTube community and think about ways that it could be improved! I would love for everyone in the community to feel like they had some hand in contributing to the channel.
Announcement Video: https://youtu.be/YPph59-rTxA
I will be sharing my favorite projects in a few of my videos (with permission of course), and would also like to give away a few small prizes to the top featured notebooks. I hope you have fun with the analysis; I'm interested in seeing what you find in the data!
For those looking for a place to start, some things I'm thinking about are:
- What are the themes of the comment data?
- What types of video titles and thumbnails drive the most traffic?
- Who is my core audience and what are they interested in?
- What types of videos have led to the most growth?
- What type of content are people engaging with the most or watching the longest?
Some advanced projects could be:
- Creating a chat bot to respond to common comments with videos where I have addressed a topic
- Pulling sentiment from thumbnails and titles and comparing that with performance
Data I would like to add over time:
- Video descriptions
- Video subtitles
- Actual video data
There are four files in this repo. The relevant data included in most of them is from Nov 2017 - Jan 2022. I gathered some of this data via the YouTube API and the rest from my specific analytics.
1) Aggregated Metrics By Video - This has all the topline metrics from my channel from its start (around 2015 to Jan 22 2022). I didn't post my first video until around
2) Aggregated Metrics By Video with Country and Subscriber Status - This has the same data as aggregated metrics by video, but it includes dimensions for which country people are viewing from and if the viewers are subscribed to the channel or not.
3) Video Performance Over Time - This has the daily data from each of my videos.
4) All Comments - This is all of my comment data gathered from the YouTube API. I have anonymized the users so don't worry about your name showing up!
This obviously wouldn't be possible without all of the wonderful people who watch and interact with my videos! I'm incredibly grateful for you all and I'm so happy I can share this project with you!
I collected this data from the YouTube API and through my own Google analytics. Any use of it must therefore comply with the YouTube API's terms of service: https://developers.google.com/youtube/terms/api-services-terms-of-service
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
Riga Data Science Club is a non-profit organisation for sharing ideas and experience and building machine learning projects together. A data science community should know its own data, so this is a dataset about ourselves: our website analytics, social media activity, Slack statistics, and even meetup transcriptions!
The dataset is split into several folders by context:
* linkedin - company page visitor, follower and post stats
* slack - messaging and member activity
* typeform - new member responses
* website - website visitors by country, language, device, operating system, screen resolution
* youtube - meetup transcriptions
Let's make Riga Data Science Club better! We expect this data to bring lots of insights on how to improve.
"Know your c̶u̶s̶t̶o̶m̶e̶r̶ member"
- Explore member interests by analysing sign-up survey (typeform) responses
- Explore messaging patterns in Slack to understand how members are retained and when they are lost
Social media intelligence
* Define LinkedIn posting strategy based on historical engagement data
* Define target user profile based on LinkedIn page attendance data
Website
* Define website localisation strategy based on data about visitor countries and languages
* Define website responsive design strategy based on data about visitor devices, operating systems and screen resolutions
Have some fun
* NLP analysis of meetup transcriptions: word frequencies, question answering, something else?
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a synthetic yet realistic representation of personal auto insurance data, crafted using real-world statistics. While actual insurance data is sensitive and unavailable for public use, this dataset bridges the gap by offering a safe and practical alternative for building robust data science projects.
Why This Dataset?
- Realistic Foundation: Synthetic data generated from real-world statistical patterns ensures practical relevance.
- Safe for Use: No personal or sensitive information; completely anonymized and compliant with data privacy standards.
- Flexible Applications: Ideal for testing models, developing prototypes, and showcasing portfolio projects.
How You Can Use It:
- Build machine learning models for predicting customer conversion and retention.
- Design risk assessment tools or premium optimization algorithms.
- Create dashboards to visualize trends in customer segmentation and policy data.
- Explore innovative solutions for the insurance industry using a realistic data foundation.
This dataset empowers you to work on real-world insurance scenarios without compromising on data sensitivity.
This dataset was created by Beshoy Nagy
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Hussein Al Chami
Released under MIT
This dataset was created by Kirill Rudovski
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Abdelrahman Attiea
Released under Apache 2.0
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Danish1212
Released under Apache 2.0
This dataset was created by Pawan Kumar
This dataset was created by Zahid Ali
This dataset was created by Shekhar Parcha
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by yousef Elshahat
Released under Apache 2.0
This dataset was created by Keval joshi
The following datasets were each created and used to create the data visualizations (see https://www.lukas-grosserhode.com/). Raw data sets:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Asraf28
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Project Climate Change, Health, and Artificial Intelligence (Project CCHAIN) dataset is a validated, open-sourced linked dataset containing 20 years (2003-2022) of climate, environmental, socioeconomic, and health dimensions at the barangay (village) level across twelve Philippine cities (Dagupan, Palayan, Navotas, Mandaluyong, Muntinlupa, Legazpi, Iloilo, Mandaue, Tacloban, Zamboanga, Cagayan de Oro, Davao). The full documentation can be accessed here.
The tables are designed in a way that users can choose variables that are most relevant to their focus city and use case, and link these variables to form a single dataset by merging using standard geography codes and calendar dates. This can be done using the provided linking notebook, or offline using the user's own code.
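The linking step described above can be sketched roughly as follows. This is not the provided linking notebook; the miniature tables, the `geo_code` and `date` key names, and the value columns are hypothetical stand-ins for whatever geography-code and date columns the real tables use. The sketch performs a left join and assumes each key pair appears at most once in the right-hand table.

```python
# Hypothetical miniature tables; column names and values are illustrative only.
climate = [
    {"geo_code": "PH0001", "date": "2020-01-01", "rainfall_mm": 12.5},
    {"geo_code": "PH0001", "date": "2020-01-02", "rainfall_mm": 0.0},
    {"geo_code": "PH0002", "date": "2020-01-01", "rainfall_mm": 3.2},
]
health = [
    {"geo_code": "PH0001", "date": "2020-01-01", "cases": 4},
    {"geo_code": "PH0002", "date": "2020-01-01", "cases": 1},
]


def link_tables(left, right, keys=("geo_code", "date")):
    """Left-join two lists of row dicts on the given key columns.

    Assumes each key combination occurs at most once in `right`.
    Rows in `left` without a match are kept unchanged.
    """
    index = {tuple(row[k] for k in keys): row for row in right}
    linked = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        extra = {k: v for k, v in match.items() if k not in keys}
        linked.append({**row, **extra})
    return linked


linked = link_tables(climate, health)
for row in linked:
    print(row)
```

Joining on both keys keeps each barangay-day observation aligned across tables; a row with no health record on a given day simply carries no `cases` field here, which a real analysis would need to treat as missing rather than zero.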
Here are some tips on how to make the most of this dataset:
- Focus on one location. Starting with a detailed analysis of one location allows for a better understanding of the local dynamics, which may differ across locations.
- Choose one health data source. Pick either a central or a local data source. Using two different health data sources is not advised because it will lead to double counting of disease cases.
- Do not use all variables at once. Do a literature review first to identify possible key variables. More often than not, using all variables is not necessary and may even yield subpar results.
- Decide whether to use regular or downscaled climate data. Our downscaled climate data provides nuanced insights on spatial patterns of a few climate variables. Kindly read the documentation before deciding to use this data. If you are uncertain, consider using only the climate_atmosphere table instead.
- Check data availability for your focus location and make sure it fits the requirements of your study.
This dataset also includes household survey tables (see schema here and here) from surveys conducted in partner informal settlement communities in the cities of Muntinlupa, Davao, Iloilo, and Mandaue and administered on various dates from 2001 to 2024. Due to the sensitive nature of the surveys and the vulnerability of the subjects involved, requests for access must be submitted for review and approval by the Philippine Action for Community-Led Shelter Initiatives, Inc. (PACSII). To submit a request, please use this form.
The Project CCHAIN dataset adopts the Creative Commons Attribution 4.0 International (CC BY 4.0) license. This allows anyone to share (copy and redistribute) and adapt (remix, transform, and build upon) a work, as long as they give appropriate credit to the original creator.
One exception, the tm_open_buildings table, follows the Open Database License (ODbL) as directed by its source, OpenStreetMap. Under the ODbL, users are free to use, modify, and distribute the database, but on top of CC BY 4.0's attribution requirement, this license requires users to share any modifications they make under the same ODbL license.
This dataset was created by Phạm Tuấn Kiệt
This dataset was created by NIYIBIGIRA Geredi