https://creativecommons.org/publicdomain/zero/1.0/
I was scared when I saw the large number of connection attempts to my little raspberry pi. I had the idea to share some of them with you. Small mistake on my part, I can't get the exact date (month, year).
The CSV file contains:
- The month
- The hour of the login attempt
- The username used
- The IP address used
- The port used
What can you tell me about all these connections? Where do they come from? What are the most used usernames? Are there days when it is better to cut off the internet? At what time are the bots most active? Which port do I have to use?
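A minimal sketch of how some of these questions could be approached with the Python standard library. The column names (`month`, `hour`, `username`, `ip`, `port`), the integer hour format, and the inline sample rows are all assumptions for illustration; the real CSV header and values may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; column names are assumed, not taken from the real file.
SAMPLE = """month,hour,username,ip,port
Jan,03,root,203.0.113.5,22
Jan,03,admin,203.0.113.5,22
Jan,14,root,198.51.100.7,2222
Feb,03,pi,203.0.113.9,22
"""

def summarize(csv_file):
    """Count login attempts per username, per hour, and per port."""
    usernames, hours, ports = Counter(), Counter(), Counter()
    for row in csv.DictReader(csv_file):
        usernames[row["username"]] += 1
        hours[int(row["hour"])] += 1
        ports[int(row["port"])] += 1
    return usernames, hours, ports

usernames, hours, ports = summarize(io.StringIO(SAMPLE))
print(usernames.most_common(3))  # most frequently tried usernames
print(hours.most_common(1))      # busiest hour of the day
print(ports.most_common(1))      # most targeted port
```

To run this on the actual file, pass an open file handle instead of the `io.StringIO` sample and adjust the column names to match the header.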
http://www.gnu.org/licenses/lgpl-3.0.html
Dataset Name: Spam Email Dataset
Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.
Columns:
text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.
spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.
Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
https://creativecommons.org/publicdomain/zero/1.0/
This data set contains the number of unique customers who logged in to their accounts on a website. The value column shows this count.
Potential use cases:
- Time series modelling
- In-month targeting
https://creativecommons.org/publicdomain/zero/1.0/
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
The columns in this dataset are:
Probabilities of getting a certain label (on v2, v1 are reverted):
- name = 0.8
- email = 0.5
- phone_num = 0.3
- address = 0.3
- url = 0.5
- username = 0.5
The subject of the essay is chosen with equal probability from the following. I specified that it was a design thinking tool:
- "Visualization tool"
- "Storytelling tool"
- "Mind Mapping tool"
- "Learning launch tool"
Model:
- mistralai/Mistral-7B-Instruct-v0.2
Warnings:
- I-USERNAME and I-EMAIL appear to be in the dataset
- Some addresses and other entities can be split by punctuation
https://creativecommons.org/publicdomain/zero/1.0/
The cost of living is a scorching topic. This dataset is composed of tweets sent from August 20 to Sept 9 2022, with over 144k tweets. All tweets are in English and are from different countries. Below is a breakdown of columns and the data in them.
https://creativecommons.org/publicdomain/zero/1.0/
In business, a unicorn is a privately held startup company valued at over $1 billion. The term was first popularised in 2013 by venture capitalist Aileen Lee, choosing the mythical animal to represent the statistical rarity of such successful ventures.
This dataset is a tidied up version of https://www.kaggle.com/ramjasmaurya/unicorn-startups/ shared by @ramjasmaurya
https://creativecommons.org/publicdomain/zero/1.0/
🌟 Enjoying the Dataset? 🌟
If this dataset helped you uncover new insights or made your day a little brighter, thanks a ton for checking it out! Let’s keep those insights rolling! 🔥📈
Dataset Description:
This dataset contains website conversion data for Bluetooth speaker sales. The dataset tracks user sessions on different landing page variants, with the primary goal of analyzing conversion rates, user behavior, and other factors influencing sales. It includes detailed user engagement metrics such as time spent, pages visited, device type, sign-in methods, and geographical information.
Use Case:
This dataset can be used for various analytical tasks including:
A/B testing and multivariate analysis to compare landing page designs.
User segmentation by demographics (age, gender, location, etc.).
Conversion rate optimization (CRO) analysis.
Predictive modeling for conversion likelihood based on session characteristics.
Revenue and payment analysis.
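As a rough illustration of the A/B testing use case, the sketch below compares the conversion rates of two landing page variants with a pooled two-proportion z-test. The function name and the counts are made up for the example; they are not taken from the dataset, and the test assumes independent sessions and reasonably large samples.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: conversions and sessions for variants A and B.
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference in conversion rate is unlikely to be due to chance alone; it says nothing about why one variant performs better.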
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Medically Validated, Age-Accurate, and Balanced
Samples: 35,000 | Features: 16 | Targets: 2 (Binary + Regression)
This dataset is designed for predicting stroke risk using symptoms, demographics, and medical literature-inspired risk modeling. Version 2 significantly improves upon Version 1 by incorporating age-dependent symptom probabilities, gender-specific risk modifiers, and medically validated feature engineering.
Age-Accurate Risk Modeling:
Gender-Specific Risk:
Balanced and Expanded Data:
Column | Type | Description |
---|---|---|
age | Integer | Age (18–90) |
gender | String | Male/Female |
chest_pain | Binary | 1 = Present, 0 = Absent |
shortness_of_breath | Binary | 1 = Present, 0 = Absent |
irregular_heartbeat | Binary | 1 = Present, 0 = Absent |
fatigue_weakness | Binary | 1 = Present, 0 = Absent |
dizziness | Binary | 1 = Present, 0 = Absent |
swelling_edema | Binary | 1 = Present, 0 = Absent |
neck_jaw_pain | Binary | 1 = Present, 0 = Absent |
excessive_sweating | Binary | 1 = Present, 0 = Absent |
persistent_cough | Binary | 1 = Present, 0 = Absent |
nausea_vomiting | Binary | 1 = Present, 0 = Absent |
high_blood_pressure | Binary | 1 = Present, 0 = Absent |
chest_discomfort | Binary | 1 = Present, 0 = Absent |
cold_hands_feet | Binary | 1 = Present, 0 = Absent |
snoring_sleep_apnea | Binary | 1 = Present, 0 = Absent |
anxiety_doom | Binary | 1 = Present, 0 = Absent |
at_risk | Binary | Target for classification (1 = At Risk, 0 = Not At Risk) |
stroke_risk_percentage | Float | Target for regression (0–100%) |
Age distribution in Version 2 vs. Version 1
This dataset is grounded in peer-reviewed medical literature, with symptom probabilities, risk weights, and demographic relationships directly derived from clinical guidelines and epidemiological studies. Below is a detailed breakdown of how medical knowledge was translated into dataset parameters:
The prevalence of symptoms increases with age, reflecting real-world clinical observations. Probabilities are calibrated using population-level data from medical literature:
This dataset contains skills collected from LinkedIn, GitHub, and Stack Overflow, as well as skills extracted from job descriptions on platforms such as Naukri, Indeed, and Monster.com.
It is presented as the world's largest collection of skills data, intended to cover skills across all of these sources.
The dataset contains the https://www.kaggle.com/competitions/icr-identify-age-related-conditions competition data transformed into integers. A common denominator was found for each column, and the distribution of even and odd values was examined to identify whether some values should remain fractional.
Columns 'FL' and 'GL' were left untouched, as they are probably floating-point by nature.
Please refer to notebook for exact transformations: https://www.kaggle.com/code/raddar/convert-icr-data-to-integers
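To illustrate the general idea of a per-column common denominator, here is a minimal sketch. It is not the notebook's actual procedure: the `integerize` helper, the `max_denominator` cutoff, and the sample values are assumptions made for this example.

```python
from fractions import Fraction
from math import lcm  # requires Python 3.9+

def integerize(values, max_denominator=10_000):
    """Scale a column by the least common denominator of its values.

    Returns (scale, integers) such that integers[i] / scale approximates values[i].
    """
    # Approximate each float by a fraction with a bounded denominator.
    fracs = [Fraction(v).limit_denominator(max_denominator) for v in values]
    # The least common multiple of the denominators turns every value into an integer.
    scale = lcm(*(f.denominator for f in fracs))
    return scale, [int(f * scale) for f in fracs]

scale, ints = integerize([0.5, 1.25, 2.0, 0.75])
print(scale, ints)
```

Dividing the integers by `scale` recovers the original values, so the transformation is reversible as long as the chosen `max_denominator` is large enough for the column.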
This dataset consists of 734 entries representing social media activity and performance from a local MSME (micro, small, and medium enterprise; UMKM in Indonesian) across the TikTok, Instagram, and Twitter platforms. It captures key metrics related to audience interaction and content strategy effectiveness, and is useful for evaluating and optimizing digital marketing efforts for small businesses.
- Area: Target location or customer region where the MSME's content is directed.
- Category: The business content category (e.g., product promotion, education, seasonal campaign).
- Day: The day of the week the content was published.
- Month: The month the post went live.
- Platform: The social media platform used by the MSME (TikTok, Instagram, or Twitter).
- Post Type: The format of the content posted: image, video, carousel, or text.
- Timestamp: The exact date and time when the content was posted.
- User: The username or business account that posted the content.
- Week: Week number within the year, for time-based analysis.
- Year: The year the content was posted.
- Comments: Total number of comments received on the post.
- Engagement Rate: A calculated metric showing how engaging the content is (based on likes, comments, and shares relative to reach or impressions).
- Hour: Hour of the day the post was published.
- Impressions: Number of times the content appeared in users' feeds.
- Likes: Number of likes the post received.
- Reach: Number of unique users who saw the content.
- Shares: Number of times users shared the content.
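The exact formula behind the Engagement Rate column is not stated. One common definition, shown here purely as an assumption, divides total interactions by reach and expresses the result as a percentage:

```python
def engagement_rate(likes, comments, shares, reach):
    """Interactions divided by reach, as a percentage (assumed definition)."""
    if reach <= 0:
        # Avoid division by zero for posts with no recorded reach.
        return 0.0
    return 100.0 * (likes + comments + shares) / reach

# Hypothetical post metrics, not taken from the dataset.
print(engagement_rate(likes=120, comments=15, shares=5, reach=2000))
```

If the dataset's values do not match this calculation, the column may have been computed against impressions instead of reach.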
All images and metadata in ISIC archive.
!pip install isic-cli
!isic image download images/
See also:
- https://www.kaggle.com/competitions/isic-2024-challenge/discussion/515356
- https://www.kaggle.com/competitions/siim-isic-melanoma-classification/discussion/171801
- https://www.kaggle.com/competitions/siim-isic-melanoma-classification/discussion/161943
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The data consists of Google Maps reviews of various restaurants. There are 1,100 reviews in total, each accompanied by a picture. The data is labeled with one of four classes (Taste, Menu, Indoor atmosphere, Outdoor atmosphere) as the prediction target. The dataset has been prepared so that it can be used for both text processing and image processing.
The dataset contains the following columns: business_name, author_name, text, photo, rating, rating_category
IMPORTANT: The rating_category column is related to the photo of the review. If you want to use this dataset for NLP, you need to label it yourself. I will label it for you when I am available.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains medical data used for predicting heart disease. The data includes various attributes such as age, sex, chest pain type (cp), resting blood pressure (trestbps), cholesterol (chol), fasting blood sugar (fbs), resting electrocardiographic results (restecg), maximum heart rate achieved (thalach), exercise-induced angina (exang), and ST depression induced by exercise relative to rest (oldpeak).
- age: Age of the patient (in years)
- sex: Sex of the patient (1 = male, 0 = female)
- cp: Chest pain type (1-4)
- trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
- chol: Serum cholesterol in mg/dl
- fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: Resting electrocardiographic results (0-2)
- thalach: Maximum heart rate achieved
- exang: Exercise-induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
The preprocessed and "denoised" data for competition: https://www.kaggle.com/competitions/open-problems-multimodal
For further information see discussion: https://www.kaggle.com/competitions/open-problems-multimodal/discussion/350856
Reposted as a Kaggle dataset for convenience and fast access.
This dataset was created by srinivas365
The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.
This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.
This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.
Cover Photo by: Freepik
Thumbnail by: Clothing icons created by Flat Icons - Flaticon
https://creativecommons.org/publicdomain/zero/1.0/
E-commerce has become a new channel to support business development. Through e-commerce, businesses can reach and establish a wider market presence by providing cheaper and more efficient distribution channels for their products or services. E-commerce has also changed the way people shop and consume products and services. Many people are turning to their computers or smart devices to order goods, which can easily be delivered to their homes.
This is a sales transaction data set of UK-based e-commerce (online retail) for one year. This London-based shop has been selling gifts and homewares for adults and children through the website since 2007. Their customers come from all over the world and usually make direct purchases for themselves. There are also small businesses that buy in bulk and sell to other customers through retail outlet channels.
The data set contains 500K rows and 8 columns. The following is the description of each column.
1. TransactionNo (categorical): a six-digit unique number that defines each transaction. The letter “C” in the code indicates a cancellation.
2. Date (numeric): the date when each transaction was generated.
3. ProductNo (categorical): a five- or six-digit unique code used to identify a specific product.
4. Product (categorical): product/item name.
5. Price (numeric): the price of each product per unit in pound sterling (£).
6. Quantity (numeric): the quantity of each product per transaction. Negative values relate to cancelled transactions.
7. CustomerNo (categorical): a five-digit unique number that defines each customer.
8. Country (categorical): name of the country where the customer resides.
A small percentage of orders in the data set were cancelled. Most of these cancellations were due to out-of-stock conditions on some products. In this situation, customers tend to cancel an order because they want all products delivered at once.
Information is a main asset of businesses nowadays. The success of a business in a competitive environment depends on its ability to acquire, store, and utilize information. Data is one of the main sources of information. Therefore, data analysis is an important activity for acquiring new and useful information. Analyze this dataset and try to answer the following questions.
1. How did sales trend over the months?
2. What are the most frequently purchased products?
3. How many products does a customer purchase in each transaction?
4. Which customer segments are the most profitable?
5. Based on your findings, what strategy could you recommend to the business to gain more profit?
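As a starting point for the first question, the sketch below totals revenue per month while skipping cancelled transactions (those whose TransactionNo starts with "C"). The column names follow the description above; the date format (`%m/%d/%Y`) and the sample rows are assumptions and should be checked against the real file.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Made-up rows using the documented column names; values are illustrative only.
SAMPLE = """TransactionNo,Date,ProductNo,Product,Price,Quantity,CustomerNo,Country
581482,12/9/2019,22485,Product A,10.0,3,17490,United Kingdom
581475,12/9/2019,22596,Product B,2.5,4,13069,United Kingdom
C581484,12/9/2019,23843,Product C,6.0,-10,16446,United Kingdom
581001,11/30/2019,22485,Product A,10.0,2,17490,United Kingdom
"""

def monthly_revenue(csv_file, date_format="%m/%d/%Y"):
    """Sum Price * Quantity per calendar month, excluding cancellations."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        if row["TransactionNo"].startswith("C"):
            continue  # cancelled transaction, marked with a leading "C"
        month = datetime.strptime(row["Date"], date_format).strftime("%Y-%m")
        totals[month] += float(row["Price"]) * int(row["Quantity"])
    return dict(sorted(totals.items()))

revenue = monthly_revenue(io.StringIO(SAMPLE))
print(revenue)
```

The same loop can be extended with a `Counter` keyed on `Product` to answer the second question.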