This dataset was created by Jean_oliveirasi
http://opendatacommons.org/licenses/dbcl/1.0/
The data was generated with the Faker library and is not real-world data. In recent years, there have been numerous reports of bot voting manipulating outcomes in data science competitions, which motivated the creation of this simulated dataset. This is the first version of the dataset, and feedback and constructive criticism are welcome to improve its quality and usefulness.
- NAME: The name of the individual.
- GENDER: The gender of the individual, either male or female.
- EMAIL_ID: The email address of the individual.
- IS_GLOGIN: A boolean indicating whether the individual used Google login to register or not.
- FOLLOWER_COUNT: The number of followers the individual has.
- FOLLOWING_COUNT: The number of individuals the individual is following.
- DATASET_COUNT: The number of datasets the individual has created.
- CODE_COUNT: The number of notebooks the individual has created.
- DISCUSSION_COUNT: The number of discussions the individual has participated in.
- AVG_NB_READ_TIME_MIN: The average time spent reading notebooks in minutes.
- REGISTRATION_IPV4: The IP address used to register.
- REGISTRATION_LOCATION: The location from where the individual registered.
- TOTAL_VOTES_GAVE_NB: The total number of votes the individual has given to notebooks.
- TOTAL_VOTES_GAVE_DS: The total number of votes the individual has given to datasets.
- TOTAL_VOTES_GAVE_DC: The total number of votes the individual has given to discussion comments.
- ISBOT: A boolean indicating whether the individual is a bot or not.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Humans From Https Www.kaggle.com Datasets Constantinwerner Human Detection Dataset is a dataset for object detection tasks - it contains Human annotations for 548 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
This dataset was created by Ahmed Ali
The dataset is designed to simulate password-related events, creating a synthetic representation of actions related to password management. It includes fields like timestamp, action, event type, location, IP address, password, hour, and time difference.
This synthetic dataset can be used for training and testing machine learning models related to cyber security, anomaly detection, or password management. It allows researchers and practitioners to experiment with data resembling real-world scenarios without compromising actual user information.
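The description lists `hour` and `time difference` among the fields. The following is a minimal sketch of how such fields could be derived from raw timestamps. The event dictionaries, the `timestamp` and `action` keys, the ISO timestamp format, and the choice to measure the difference in seconds from the previous event are all assumptions for illustration; the dataset's actual derivation is not documented here.

```python
from datetime import datetime


def add_time_features(events):
    """Return new event dicts with 'hour' and 'time_diff_s' added.

    'time_diff_s' is the number of seconds since the previous event in
    chronological order (0.0 for the first event). This definition is an
    assumption, not the dataset's documented one.
    """
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e["timestamp"]))
    enriched = []
    previous = None
    for event in ordered:
        ts = datetime.fromisoformat(event["timestamp"])
        diff = 0.0 if previous is None else (ts - previous).total_seconds()
        enriched.append({**event, "hour": ts.hour, "time_diff_s": diff})
        previous = ts
    return enriched


# Hypothetical toy events, not rows from the dataset.
events = [
    {"timestamp": "2024-01-01T09:01:30", "action": "password_change"},
    {"timestamp": "2024-01-01T09:00:00", "action": "login"},
    {"timestamp": "2024-01-01T23:15:00", "action": "login"},
]
features = add_time_features(events)
for row in features:
    print(row["timestamp"], row["hour"], row["time_diff_s"])
```

Unusually short or long gaps between events are the kind of signal an anomaly detection model might pick up from a field like this.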
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
F5 score = .690 - Recall = .692, Precision = .639
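As a quick check of the figures above, here is a small sketch that recomputes the score from the quoted precision and recall. It assumes "F5" means the F-beta score with beta = 5, which weights recall more heavily than precision; the function name `f_beta` is illustrative.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Precision and recall as quoted above; beta = 5 is an assumption
# about what "F5" refers to.
score = f_beta(precision=0.639, recall=0.692, beta=5)
print(f"{score:.3f}")
```

With beta = 5 the result lies close to the recall value, which matches the reported score being nearly equal to the recall.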
https://creativecommons.org/publicdomain/zero/1.0/
This graph was created in R:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F16731800%2F418952a3857f2530a53a40d9cc9c320c%2Fgraph1.gif?generation=1732477206118972&alt=media
Due to the size of the full dataset (see Technical Notices below for more information), users are advised to download data for specific time periods and/or geographic areas.
To download all available ACLED data for a specific time period, enter your login information, select a date range in the ‘from’ and ‘to’ boxes, and click ‘export.’ To download all available ACLED data for a specific region, country, or location enter your login information, select a ‘region,’ ‘country,’ or ‘location’ from the relevant drop-down menus, and click ‘export.’ Note: ‘country’ selection will override ‘region’ selection, and only data for the selected country or countries will be downloaded. ‘Location’ selection requires a ‘country’ selection, and will result in an export of only data for that specific subnational location.
To download data for specific event types, select the relevant event types from that category in the ‘event type’ or ‘sub-event type’ boxes and leave all other categories as they are. All data for the selected event type(s) will be exported.
To download data for a specific actor type or a specific actor, select the ‘actor type’ or ‘actor’ in the relevant boxes and leave all other categories as they are. All data for the selected actor type(s) or actor(s) will be exported.
By default, the data are exported in a format where each row represents a single event, on a specific day and location, and involving distinct actors. An ‘actor based’ file displays events by single actors instead, meaning that events are often repeated if two actors are involved. To determine which of the two file types to use, you should consider whether the data are being used to analyze patterns over time, types of violence, conflict between groups, or locations (which the default file type is best for), or to analyze actor types or specific actors. For the former, the default format should be used, while for the latter, the ‘actor based’ file should be used.
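To make the difference between the two file types concrete, here is a minimal sketch that expands event-based rows into actor-based rows. The column names `event_id`, `event_date`, `actor1`, and `actor2`, and the toy rows, are assumptions for illustration and are not taken from an actual export.

```python
def to_actor_based(events):
    """Expand event rows into one row per (event, actor) pair.

    An event with two actors appears twice in the result, which is why
    events are repeated in an actor-based file.
    """
    rows = []
    for event in events:
        for role in ("actor1", "actor2"):
            actor = event.get(role)
            if actor:  # skip the slot when the event has no second actor
                row = {k: v for k, v in event.items() if k not in ("actor1", "actor2")}
                row["actor"] = actor
                rows.append(row)
    return rows


# Hypothetical toy events, not real ACLED records.
events = [
    {"event_id": "E1", "event_date": "2024-01-05", "actor1": "Group A", "actor2": "Group B"},
    {"event_id": "E2", "event_date": "2024-01-06", "actor1": "Group A", "actor2": ""},
]
actor_rows = to_actor_based(events)
for row in actor_rows:
    print(row)
```

Counting rows in the expanded form gives activity per actor, while counting rows in the original form gives the number of events; mixing the two up would double-count two-actor events.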
For systems that use semi-colon separated values by default, you may wish to use the ‘compatibility mode’ option.
https://creativecommons.org/publicdomain/zero/1.0/
If this dataset is useful, an upvote is appreciated. This data covers student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
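The note about G1 and G2 suggests two modeling setups: one that keeps the earlier period grades as inputs and one that drops them. Below is a minimal sketch of selecting the feature columns for each setup. Only `G1`, `G2`, and `G3` come from the description; the other column names and the helper `select_features` are hypothetical.

```python
TARGET = "G3"
PRIOR_GRADES = {"G1", "G2"}


def select_features(columns, use_prior_grades):
    """Return the input columns for predicting G3.

    With use_prior_grades=False, G1 and G2 are excluded, which gives the
    harder but more useful setup described in the text.
    """
    return [
        c for c in columns
        if c != TARGET and (use_prior_grades or c not in PRIOR_GRADES)
    ]


# Hypothetical column list; only G1, G2, and G3 are named in the description.
columns = ["age", "studytime", "absences", "G1", "G2", "G3"]
with_grades = select_features(columns, use_prior_grades=True)
without_grades = select_features(columns, use_prior_grades=False)
print(with_grades)
print(without_grades)
```

Comparing a model trained on each column set is one way to see how much of the predictive power comes from the earlier grades alone.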
This dataset was created by Rhitaza Jana
https://creativecommons.org/publicdomain/zero/1.0/
The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:
- Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc.
- Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc.
- Transactional data: information about the transactions that occur on the Google Merchandise Store website.
Fork this kernel to get started.
Banner Photo by Edho Pratama from Unsplash.
What is the total number of transactions generated per device browser in July 2017?
The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?
What was the average number of product pageviews for users who made a purchase in July 2017?
What was the average number of product pageviews for users who did not make a purchase in July 2017?
What was the average total transactions per user that made a purchase in July 2017?
What is the average amount of money spent per session in July 2017?
What is the sequence of pages viewed?
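The real bounce rate defined above (the percentage of visits with a single pageview, per traffic source) can be sketched in plain Python. The `source` and `pageviews` keys and the toy visits are assumptions for illustration; they are not the field names of the actual Google Analytics export.

```python
from collections import defaultdict


def bounce_rate_by_source(visits):
    """Return {source: percentage of visits with exactly one pageview}."""
    totals = defaultdict(int)
    bounces = defaultdict(int)
    for visit in visits:
        totals[visit["source"]] += 1
        if visit["pageviews"] == 1:
            bounces[visit["source"]] += 1
    return {source: 100.0 * bounces[source] / totals[source] for source in totals}


# Hypothetical toy visits, not rows from the sample dataset.
visits = [
    {"source": "google", "pageviews": 1},
    {"source": "google", "pageviews": 3},
    {"source": "google", "pageviews": 1},
    {"source": "google", "pageviews": 5},
    {"source": "direct", "pageviews": 1},
]
rates = bounce_rate_by_source(visits)
print(rates)
```

The same grouping logic carries over to a SQL `GROUP BY` on the traffic source, with a conditional count for single-pageview visits.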
https://creativecommons.org/publicdomain/zero/1.0/
A user activity is defined as
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F285393%2F76ddd60b7a0afd22fadf3ed21510d52b%2Factivity_map.png?generation=1595260268658485&alt=media
This dataset consists of 4 sub-datasets:
- **USER_ACTIVITY.csv**: Contains the user activity at a day-username level (submissions, comments, script runs, dataset updates).
- competitions_1000_ranks.csv: Top 1000 ranked Kagglers (competitions); columns: username, rank.
- discussion_top1000_ranks.csv: Top 1000 ranked Kagglers (discussions); columns: username, rank.
- scripts_top1000_ranks.csv: Top 1000 ranked Kagglers (kernels); columns: username, rank.
- userid_username_mapping.csv: Kaggle id to Kaggle username mapping file.
This dataset will be updated every Monday
The main USER_ACTIVITY dataset was acquired from Kaggle's user activity tab (on each user's home page). Other metadata was acquired from Meta Kaggle (a public dataset).
Do the top Kagglers show any pattern in their submissions, comments, dataset updates, or script runs?
http://opendatacommons.org/licenses/dbcl/1.0/
** Please Upvote if you like the dataset **
Fake news or hoax news is false or misleading information presented as news. Fake news often has the aim of damaging the reputation of a person or entity, or making money through advertising revenue.
This dataset contains both fake and real news.
The columns present in the dataset are:
1) Title -> Title of the News
2) Text -> Text or Content of the News
3) Label -> Labelling the news as Fake or Real
http://opendatacommons.org/licenses/dbcl/1.0/
Data about Kaggle ranked users
This data is available online here. I imagine it was obtained by a crawler, since it is displayed on the Kaggle leaderboard. I took the data, standardized the country names, and added a continent label to each user, but I did not use the city name. To preserve anonymity, I removed the columns UserName and DisplayName from the original dataset.
Each row represents a ranked user. The columns are: register date, current points, current ranking, highest ranking, country and continent.
In Kaggle, points and ranking change over time. So, all the positions represented here correspond only to a specific point in time (around August 2018).
I want to thank the team from Norconsult responsible for making this data public.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides a detailed look into transactional behavior and financial activity patterns, ideal for exploring fraud detection and anomaly identification. It contains 2,512 samples of transaction data, covering various transaction attributes, customer demographics, and usage patterns. Each entry offers comprehensive insights into transaction behavior, enabling analysis for financial security and fraud detection applications.
Key Features:
This dataset is ideal for data scientists, financial analysts, and researchers looking to analyze transactional patterns, detect fraud, and build predictive models for financial security applications. The dataset was designed for machine learning and pattern analysis tasks and is not intended as a primary data source for academic publications.
https://creativecommons.org/publicdomain/zero/1.0/
Context
The data presented here was captured on a Kali machine at the University of Cincinnati, Cincinnati, Ohio, by running packet captures in Wireshark for 1 hour during the evening of Oct 9th, 2023. The dataset consists of 394137 instances stored in a CSV (Comma Separated Values) file. This large dataset could be used for different machine learning applications, for instance classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.
Content :
This network traffic dataset consists of 7 features. Each instance contains the source and destination IP addresses. The majority of the properties are numeric in nature; however, there are also nominal and date types due to the Timestamp.
The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).
Dataset Columns:
- No: Number of the instance.
- Timestamp: Timestamp of the network traffic instance.
- Source IP: IP address of the source.
- Destination IP: IP address of the destination.
- Protocol: Protocol used by the instance.
- Length: Length of the instance.
- Info: Information about the traffic instance.
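As a starting point for working with a capture exported in this shape, here is a minimal sketch that reads the CSV and summarises packets and bytes per protocol. The header names follow the Wireshark column list quoted above (`No.`, `Time`, `Source`, `Destination`, `Protocol`, `Length`, `Info`), but the exact header of the published file is an assumption, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up sample in the assumed export layout; not rows from the dataset.
SAMPLE = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"1","0.000000","10.0.0.2","10.0.0.1","DNS","74","Standard query"
"2","0.012345","10.0.0.1","10.0.0.2","DNS","90","Standard query response"
"3","0.020000","10.0.0.2","93.184.216.34","TCP","66","SYN"
'''


def protocol_summary(handle):
    """Return (packet count per protocol, total length per protocol)."""
    counts = Counter()
    total_bytes = Counter()
    for row in csv.DictReader(handle):
        counts[row["Protocol"]] += 1
        total_bytes[row["Protocol"]] += int(row["Length"])
    return counts, total_bytes


counts, total_bytes = protocol_summary(io.StringIO(SAMPLE))
print(counts.most_common())
print(dict(total_bytes))
```

For the real file, passing `open(path, newline="")` instead of the `io.StringIO` sample would stream the rows without loading the whole capture into memory.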
Acknowledgements :
I would like to thank the University of Cincinnati for providing the infrastructure used to generate this network traffic dataset.
Ravikumar Gattu , Susmitha Choppadandi
Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it can be used to build machine learning models that identify specific applications (like TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).
**Dataset License:** CC0: Public Domain
Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
ML techniques that benefit from this dataset:
This dataset is highly useful because it consists of 394137 instances of network traffic data obtained by using the 25 applications on public, private, and enterprise networks. The dataset also contains important features that can be used for most applications of machine learning in cybersecurity. A few of the potential machine learning applications that could benefit from this dataset are:
1. Network Performance Monitoring: This large network traffic dataset can be used to analyse network traffic and identify patterns in the network. This helps in designing network security algorithms that minimise network problems.
2. Anomaly Detection: This large network traffic dataset can be used to train machine learning models that find irregularities in the traffic, which could help identify cyber attacks.
3. Network Intrusion Detection: This large dataset could be used to train machine learning algorithms and design models for detecting traffic issues, malicious traffic, network attacks, and DoS attacks.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.
Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.
People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.
Thirteen (13) clinical features:
- age: age of the patient (years)
- anaemia: decrease of red blood cells or hemoglobin (boolean)
- high blood pressure: if the patient has hypertension (boolean)
- creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
- diabetes: if the patient has diabetes (boolean)
- ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
- platelets: platelets in the blood (kiloplatelets/mL)
- sex: woman or man (binary)
- serum creatinine: level of serum creatinine in the blood (mg/dL)
- serum sodium: level of serum sodium in the blood (mEq/L)
- smoking: if the patient smokes or not (boolean)
- time: follow-up period (days)
- [target] death event: if the patient deceased during the follow-up period (boolean)
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset for publication "Comparative analysis of time series models for student data in the Moodle platform".
This dataset is based on train and test dataset from this competition: https://www.kaggle.com/competitions/widsdatathon2024-challenge1 .
What did I change?
1. I dropped 2 columns that contained too little data.
2. Using machine learning, I imputed "payer_type", "patient_race" and "bmi".
3. Using "patient_zip3", I filled missing values in "patient_state", "Region" and "Division".
4. Using SimpleImputer, I imputed a few missing numeric values in "Ozone", "PM2.5" and other columns.
5. I created some new features, based on demographic features, that may be a bit more informative.
6. I tokenized the 'breast_cancer_diagnosis_desc' column.
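Step 3 above can be illustrated with a small sketch: build a lookup from "patient_zip3" to "patient_state" using the rows where the state is known, then use it to fill the rows where it is missing. This is only one plausible way to do it, not necessarily the approach used in the linked notebooks; the helper name and the toy records are hypothetical.

```python
def fill_state_from_zip3(records):
    """Fill missing 'patient_state' values using other rows with the same zip3.

    Rows whose zip3 never appears with a known state are left as None.
    """
    zip_to_state = {}
    for record in records:
        if record["patient_state"]:
            # Keep the first state seen for each zip3 prefix.
            zip_to_state.setdefault(record["patient_zip3"], record["patient_state"])
    filled = []
    for record in records:
        state = record["patient_state"] or zip_to_state.get(record["patient_zip3"])
        filled.append({**record, "patient_state": state})
    return filled


# Hypothetical toy records, not rows from the competition data.
records = [
    {"patient_zip3": "100", "patient_state": "NY"},
    {"patient_zip3": "100", "patient_state": None},
    {"patient_zip3": "941", "patient_state": "CA"},
    {"patient_zip3": "606", "patient_state": None},
]
filled = fill_state_from_zip3(records)
for row in filled:
    print(row)
```

The same lookup idea extends to "Region" and "Division", since both are determined by location in the same way.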
If you're interested in how I did that, check these notebooks: https://www.kaggle.com/code/anopsy/ml-for-missing-values and, for "bmi" and the new features, https://www.kaggle.com/code/anopsy/fe-and-xgb-on-clean-data
According to the description of the original dataset, it's a "39k record dataset (split into training and test sets) representing patients and their characteristics (age, race, BMI, zip code), their diagnosis and treatment information (breast cancer diagnosis code, metastatic cancer diagnosis code, metastatic cancer treatments, … etc.), their geo (zip-code level) demographic data (income, education, rent, race, poverty, …etc), as well as toxic air quality data (Ozone, PM25 and NO2)."
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Kaggle ranking of datasets.
More than 800 rows and 8 columns. Column descriptions are listed below.
Data from Kaggle. Image from The Guardian.
If you're reading this, please upvote.
This dataset was created by Jean_oliveirasi