100+ datasets found
  1. password and username generator

    • kaggle.com
    Updated Apr 22, 2023
    Cite
    Jean_oliveirasi (2023). password and username generator [Dataset]. https://www.kaggle.com/datasets/jeanoliveirasi/password-and-username-generator/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jean_oliveirasi
    Description

    Dataset

    This dataset was created by Jean_oliveirasi

    Contents

  2. Kaggle Bot Account Detection

    • kaggle.com
    Updated Feb 7, 2023
    Cite
    Shriyash Jagtap (2023). Kaggle Bot Account Detection [Dataset]. https://www.kaggle.com/datasets/shriyashjagtap/kaggle-bot-account-detection/code
    Explore at:
    Croissant
    Dataset updated
    Feb 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shriyash Jagtap
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    The data in question was generated using the Faker library and is not authentic real-world data. In recent years, there have been numerous reports suggesting the presence of bot voting practices that have resulted in manipulated outcomes within data science competitions. As a result of this, the idea for creating a simulated dataset arose. Although this is the first time that this dataset has been created, it is open to feedback and constructive criticism in order to improve its overall quality and significance.

    • NAME: The name of the individual.
    • GENDER: The gender of the individual, either male or female.
    • EMAIL_ID: The email address of the individual.
    • IS_GLOGIN: A boolean indicating whether the individual used Google login to register or not.
    • FOLLOWER_COUNT: The number of followers the individual has.
    • FOLLOWING_COUNT: The number of individuals the individual is following.
    • DATASET_COUNT: The number of datasets the individual has created.
    • CODE_COUNT: The number of notebooks the individual has created.
    • DISCUSSION_COUNT: The number of discussions the individual has participated in.
    • AVG_NB_READ_TIME_MIN: The average time spent reading notebooks in minutes.
    • REGISTRATION_IPV4: The IP address used to register.
    • REGISTRATION_LOCATION: The location from where the individual registered.
    • TOTAL_VOTES_GAVE_NB: The total number of votes the individual has given to notebooks.
    • TOTAL_VOTES_GAVE_DS: The total number of votes the individual has given to datasets.
    • TOTAL_VOTES_GAVE_DC: The total number of votes the individual has given to discussion comments.
    • ISBOT: A boolean indicating whether the individual is a bot or not.
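
    A minimal sketch of how the vote columns described above might be combined into derived features for a bot classifier. The `Account` class, the column types, and the `votes_per_follower` feature are illustrative assumptions and are not part of the dataset.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """A subset of the columns described above; the types are assumed."""
    follower_count: int
    following_count: int
    total_votes_gave_nb: int
    total_votes_gave_ds: int
    total_votes_gave_dc: int
    isbot: bool


def total_votes(account: Account) -> int:
    """Sum the votes given to notebooks, datasets and discussion comments."""
    return (
        account.total_votes_gave_nb
        + account.total_votes_gave_ds
        + account.total_votes_gave_dc
    )


def votes_per_follower(account: Account) -> float:
    """Hypothetical feature: votes given per follower (at least 1 follower assumed)."""
    return total_votes(account) / max(account.follower_count, 1)
```

    Whether such a ratio actually separates bots from humans would have to be checked against the ISBOT label.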

  3. Humans From Https Www.kaggle.com Datasets Constantinwerner Human Detection...

    • universe.roboflow.com
    zip
    Updated Jun 20, 2024
    Cite
    ChawawiwatPractice (2024). Humans From Https Www.kaggle.com Datasets Constantinwerner Human Detection Dataset Dataset [Dataset]. https://universe.roboflow.com/chawawiwatpractice/humans-from-https-www.kaggle.com-datasets-constantinwerner-human-detection-dataset-cewfm
    Explore at:
    zip (available download format)
    Dataset updated
    Jun 20, 2024
    Dataset authored and provided by
    ChawawiwatPractice
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Human Bounding Boxes
    Description

    Humans From Https Www.kaggle.com Datasets Constantinwerner Human Detection Dataset

    ## Overview
    
    Humans From Https Www.kaggle.com Datasets Constantinwerner Human Detection Dataset is a dataset for object detection tasks - it contains Human annotations for 548 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  4. Kaggle account verification

    • kaggle.com
    Updated Jun 16, 2022
    Cite
    Ahmed Ali (2022). Kaggle account verification [Dataset]. https://www.kaggle.com/datasets/ahmedali058/kaggle-account-verification
    Explore at:
    Croissant
    Dataset updated
    Jun 16, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ahmed Ali
    Description

    Dataset

    This dataset was created by Ahmed Ali

    Contents

  5. Data from: Password Reset Dataset

    • kaggle.com
    Updated Oct 3, 2023
    Cite
    HariSellowpay (2023). Password Reset Dataset [Dataset]. https://www.kaggle.com/datasets/harisellowpay/password-reset-dataset
    Explore at:
    Croissant
    Dataset updated
    Oct 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    HariSellowpay
    Description

    The dataset is designed to simulate password-related events, creating a synthetic representation of actions related to password management. It includes fields like timestamp, action, event type, location, IP address, password, hour, and time difference.

    • The dataset comprises 50,000 records representing a variety of password-related events.
    • A list of commonly used passwords is incorporated to mimic real-world scenarios.
    • Timestamps are spread throughout the current year.
    • Features like 'hour' and 'time_difference' are derived to provide additional insights into the temporal aspects of the events.

    This synthetic dataset can be used for training and testing machine learning models related to cyber security, anomaly detection, or password management. It allows researchers and practitioners to experiment with data resembling real-world scenarios without compromising actual user information.
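
    A minimal sketch of how the derived 'hour' and 'time_difference' features mentioned above could be computed from raw timestamps. The exact definition used by the dataset author is not stated, so this assumes ISO-formatted timestamps and takes 'time_difference' to mean the seconds elapsed since the previous event; the function name is hypothetical.

```python
from datetime import datetime


def add_time_features(events):
    """Return events sorted by time, each with 'hour' and 'time_difference'.

    Assumes each event is a dict with an ISO 8601 'timestamp' string.
    'time_difference' is the number of seconds since the previous event
    (0.0 for the first event), which is an assumed definition.
    """
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e["timestamp"]))
    enriched = []
    previous = None
    for event in ordered:
        ts = datetime.fromisoformat(event["timestamp"])
        diff = (ts - previous).total_seconds() if previous is not None else 0.0
        enriched.append({**event, "hour": ts.hour, "time_difference": diff})
        previous = ts
    return enriched
```

    Unusually short gaps between password resets, or activity at atypical hours, are the kind of signal an anomaly detection model might pick up from these two features.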

  6. 4367x PII Label-Specific Essays (by 7b Models)

    • kaggle.com
    Updated Feb 7, 2024
    Cite
    Valentin Werner (2024). 4367x PII Label-Specific Essays (by 7b Models) [Dataset]. https://www.kaggle.com/datasets/valentinwerner/pii-label-specific-data
    Explore at:
    Croissant
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Valentin Werner
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Evaluation of my dataset with my .915 baseline:

    F5 score = .690 - Recall = .692, Precision = .639
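
    For reference, the F5 score above can be reproduced from the reported precision and recall with the standard F-beta formula. This is a small sketch; the function name is illustrative and not taken from the linked notebook.

```python
def f_beta(precision: float, recall: float, beta: float = 5.0) -> float:
    """Standard F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall as reported in the description above.
print(round(f_beta(0.639, 0.692), 3))
```

    With beta = 5 the result is roughly .690, dominated by the recall of .692.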

    Distribution of data:

    • 843x Address (ca. 500 US)
    • 496x Names (Incl. Middle Names, Pronunciation or Nicknames)
    • 537x Userid
    • 704x Username (Incl. Name)
    • 531x Phone
    • 755x Email (Incl. Name)
    • 501x URL

    See linked notebook for generation.

    Remarks on labels:

    EMAIL:

    1. Email is always based on the name, but with random domains
    2. The prompt also asked for text about a favourite book; the generated essays heavily favour “To Kill a Mockingbird”

    PHONE:

    1. Generated from multiple countries for diversity
    2. Labelling of phone numbers should only include the full number (not parts of it)

    ADDRESSES:

    1. From multiple countries for diversity
    2. For US addresses, state abbreviations are mapped to the full name, so these are labelled as well
    3. An address is only labelled as such if it starts with either of the first two words of the full address (e.g., if the house number is missing from a US address, it is still labelled)

    NAMES:

    1. Middle names are sometimes generated, separated with either " " or "-"
    2. Pronunciations and nicknames were generated and labelled
    3. However, “t’oma”, as in “my name Thomas is derived from the Aramaic word t’oma”, was not tagged. Let me know if this is wrong. These cases are relatively easy to identify in the names dataset by looking for “derived from”

    URL:

    1. Short domains, full websites and full URIs

    USERID:

    1. Mostly a randomly generated combination of letters and numbers, not modelled on other formats
    2. Can usually be augmented easily by replacing the userid
    3. The userid is sometimes split into parts in the text; these splits are not labelled (not sure if this is right)

    USERNAMES:

    1. Generated based on either name, animal+birthyear, or colour+fruit

  7. Data Export Tool

    • kaggle.com
    Updated Nov 24, 2024
    Cite
    willian oliveira gibin (2024). Data Export Tool [Dataset]. http://doi.org/10.34740/kaggle/dsv/10002590
    Explore at:
    Croissant
    Dataset updated
    Nov 24, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    willian oliveira gibin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This graph was created in R:

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F16731800%2F418952a3857f2530a53a40d9cc9c320c%2Fgraph1.gif?generation=1732477206118972&alt=media

    Due to the size of the full dataset (see Technical Notices below for more information), users are advised to download data for specific time periods and/or geographic areas.

    To download all available ACLED data for a specific time period, enter your login information, select a date range in the ‘from’ and ‘to’ boxes, and click ‘export.’ To download all available ACLED data for a specific region, country, or location enter your login information, select a ‘region,’ ‘country,’ or ‘location’ from the relevant drop-down menus, and click ‘export.’ Note: ‘country’ selection will override ‘region’ selection, and only data for the selected country or countries will be downloaded. ‘Location’ selection requires a ‘country’ selection, and will result in an export of only data for that specific subnational location.

    To download data for specific event types, select the relevant event types from that category in the ‘event type’ or ‘sub-event type’ boxes and leave all other categories as they are. All data for the selected event type(s) will be exported.

    To download data for a specific actor type or a specific actor, select the ‘actor type’ or ‘actor’ in the relevant boxes and leave all other categories as they are. All data for the selected actor type(s) or actor will be exported.

    By default, the data are exported in a format where each row represents a single event, on a specific day and location, and involving distinct actors. An ‘actor based’ file displays events by single actors instead, meaning that events are often repeated if two actors are involved. To determine which of the two file types to use, you should consider whether the data are being used to analyze patterns over time, types of violence, conflict between groups, or locations (which the default file type is best for), or to analyze actor types or specific actors. For the former, the default format should be used, while for the latter, the ‘actor based’ file should be used.
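
    The difference between the two file types can be illustrated with a short sketch that turns event-based rows into actor-based rows. The field names `actor1` and `actor2` and the function name are assumptions made for illustration; this is not the export tool's own logic.

```python
def to_actor_based(events):
    """Expand event-based rows into one row per (event, actor) pair.

    Assumes each event is a dict with hypothetical 'actor1' and 'actor2'
    keys; an event with two actors therefore appears twice in the output.
    """
    actor_keys = ("actor1", "actor2")
    rows = []
    for event in events:
        # Everything except the actor columns is copied unchanged.
        shared = {k: v for k, v in event.items() if k not in actor_keys}
        for key in actor_keys:
            actor = event.get(key)
            if actor:  # skip empty or missing second actors
                rows.append({**shared, "actor": actor})
    return rows
```

    Counting rows in the actor-based form would overstate the number of events, which is why the default format suits analysis of patterns over time or location.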

    For systems that use semi-colon separated values by default, you may wish to use the ‘compatibility mode’ option.

  8. Student Performance Data Set

    • kaggle.com
    Updated Mar 27, 2020
    + more versions
    Cite
    Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
    Explore at:
    Croissant
    Dataset updated
    Mar 27, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data-Science Sean
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    If this data set is useful, an upvote is appreciated. These data address student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

  9. Data from: Spam Email

    • kaggle.com
    Updated Feb 10, 2022
    Cite
    Rhitaza Jana (2022). Spam Email [Dataset]. https://www.kaggle.com/datasets/rhitazajana/spam-email
    Explore at:
    Croissant
    Dataset updated
    Feb 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rhitaza Jana
    Description

    Dataset

    This dataset was created by Rhitaza Jana

    Contents

  10. Google Analytics Sample

    • kaggle.com
    zip
    Updated Sep 19, 2019
    Cite
    Google BigQuery (2019). Google Analytics Sample [Dataset]. https://www.kaggle.com/datasets/bigquery/google-analytics-sample
    Explore at:
    zip (0 bytes), available download format
    Dataset updated
    Sep 19, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Google (http://google.com/)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.

    Content

    The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:

    • Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc.
    • Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc.
    • Transactional data: information about the transactions that occur on the Google Merchandise Store website.

    Fork this kernel to get started.

    Acknowledgements

    Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:google_analytics_sample.ga_sessions_20170801

    Banner Photo by Edho Pratama from Unsplash.

    Inspiration

    What is the total number of transactions generated per device browser in July 2017?

    The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

    What was the average number of product pageviews for users who made a purchase in July 2017?

    What was the average number of product pageviews for users who did not make a purchase in July 2017?

    What was the average total transactions per user that made a purchase in July 2017?

    What is the average amount of money spent per session in July 2017?

    What is the sequence of pages viewed?
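
    As a starting point for the bounce-rate question above, the following sketch computes the percentage of visits with a single pageview per traffic source. It works on plain Python dicts rather than on BigQuery, and the keys `source` and `pageviews` are hypothetical stand-ins for the corresponding fields of the real session table.

```python
from collections import defaultdict


def bounce_rate_by_source(sessions):
    """Percentage of visits with exactly one pageview, per traffic source.

    Assumes each session is a dict with hypothetical 'source' and
    'pageviews' keys.
    """
    visits = defaultdict(int)
    bounces = defaultdict(int)
    for session in sessions:
        visits[session["source"]] += 1
        if session["pageviews"] == 1:
            bounces[session["source"]] += 1
    return {src: 100.0 * bounces[src] / visits[src] for src in visits}
```

    The same grouping and counting could be expressed as a single aggregate query once the actual column names are known.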

  11. Top-ranked kaggler DAILY user activity (updated)

    • kaggle.com
    Updated Jul 22, 2020
    + more versions
    Cite
    piby4 (2020). Top-ranked kaggler DAILY user activity (updated) [Dataset]. https://www.kaggle.com/tomtillo/top-ranked-kaggle-user-activity-1-1000-ranks/code
    Explore at:
    Croissant
    Dataset updated
    Jul 22, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    piby4
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    LAST UPDATED : 20th JULY 2020

    Context

    • Do the top Kagglers comment more?
    • Do they make their competition submissions mostly during weekends?
    • Who are the most active Kagglers among the top-ranked users?

    A user activity is defined as:

    • Making a competition submission
    • Running a script
    • Commenting on a topic
    • Creating a new dataset / updating one.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F285393%2F76ddd60b7a0afd22fadf3ed21510d52b%2Factivity_map.png?generation=1595260268658485&alt=media

    Content

    This dataset consists of 4 sub-datasets

    USER_ACTIVITY.csv: Contains the user activity at the day-username level (submissions, comments, script runs, dataset updates)

    competitions_1000_ranks.csv: Top 1000 ranked Kagglers (competitions), username - rank

    discussion_top1000_ranks.csv: Top 1000 ranked Kagglers (discussions), username - rank

    scripts_top1000_ranks.csv: Top 1000 ranked Kagglers (kernels), username - rank

    userid_username_mapping.csv: Kaggle id - Kaggle username mapping file

    Frequency of Update

    This dataset will be updated every Monday

    Acknowledgements

    The main USER_ACTIVITY data set was acquired from Kaggle's user activity tab (on each user's home page). Other metadata was acquired from Meta Kaggle (a public dataset).

    Inspiration

    Do the top Kagglers show some pattern in their submissions, comments, dataset updates or script runs?

  12. Fake News Prediction Dataset

    • kaggle.com
    Updated Nov 3, 2023
    Cite
    Rajat Kumar (2023). Fake News Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/rajatkumar30/fake-news
    Explore at:
    Croissant
    Dataset updated
    Nov 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rajat Kumar
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    ** Please Upvote if you like the dataset **

    Fake news or hoax news is false or misleading information presented as news. Fake news often has the aim of damaging the reputation of a person or entity, or making money through advertising revenue.

    This dataset contains both fake and real news.

    The columns present in the dataset are:

    1) Title -> Title of the News

    2) Text -> Text or Content of the News

    3) Label -> Labelling the news as Fake or Real

  13. ranked_users_kaggle_data

    • kaggle.com
    Updated Nov 18, 2018
    Cite
    FelipeSalvatore (2018). ranked_users_kaggle_data [Dataset]. https://www.kaggle.com/felsal/ranked-users-kaggle-data/
    Explore at:
    Croissant
    Dataset updated
    Nov 18, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    FelipeSalvatore
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Ranked users Kaggle data

    Data about Kaggle ranked users

    Context

    This data is available online here. I imagine it was obtained by a crawler, since it is displayed on the Kaggle leaderboard. I took the data, standardized the country names, and added a continent label to each user, but I did not use the city name. To preserve anonymity, I removed the columns UserName and DisplayName from the original dataset.

    Content

    Each row represents a ranked user. The columns are: register date, current points, current ranking, highest ranking, country and continent.

    In Kaggle, points and ranking change over time. So, all the positions represented here correspond only to a specific point in time (around August 2018).

    Acknowledgements

    I want to thank the team from Norconsult responsible for making this data public.

  14. Bank Transaction Dataset for Fraud Detection

    • kaggle.com
    Updated Nov 4, 2024
    Cite
    vala khorasani (2024). Bank Transaction Dataset for Fraud Detection [Dataset]. https://www.kaggle.com/datasets/valakhorasani/bank-transaction-dataset-for-fraud-detection
    Explore at:
    Croissant
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    vala khorasani
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset provides a detailed look into transactional behavior and financial activity patterns, ideal for exploring fraud detection and anomaly identification. It contains 2,512 samples of transaction data, covering various transaction attributes, customer demographics, and usage patterns. Each entry offers comprehensive insights into transaction behavior, enabling analysis for financial security and fraud detection applications.

    Key Features:

    • TransactionID: Unique alphanumeric identifier for each transaction.
    • AccountID: Unique identifier for each account, with multiple transactions per account.
    • TransactionAmount: Monetary value of each transaction, ranging from small everyday expenses to larger purchases.
    • TransactionDate: Timestamp of each transaction, capturing date and time.
    • TransactionType: Categorical field indicating 'Credit' or 'Debit' transactions.
    • Location: Geographic location of the transaction, represented by U.S. city names.
    • DeviceID: Alphanumeric identifier for devices used to perform the transaction.
    • IP Address: IPv4 address associated with the transaction, with occasional changes for some accounts.
    • MerchantID: Unique identifier for merchants, showing preferred and outlier merchants for each account.
    • AccountBalance: Balance in the account post-transaction, with logical correlations based on transaction type and amount.
    • PreviousTransactionDate: Timestamp of the last transaction for the account, aiding in calculating transaction frequency.
    • Channel: Channel through which the transaction was performed (e.g., Online, ATM, Branch).
    • CustomerAge: Age of the account holder, with logical groupings based on occupation.
    • CustomerOccupation: Occupation of the account holder (e.g., Doctor, Engineer, Student, Retired), reflecting income patterns.
    • TransactionDuration: Duration of the transaction in seconds, varying by transaction type.
    • LoginAttempts: Number of login attempts before the transaction, with higher values indicating potential anomalies.

    This dataset is ideal for data scientists, financial analysts, and researchers looking to analyze transactional patterns, detect fraud, and build predictive models for financial security applications. The dataset was designed for machine learning and pattern analysis tasks and is not intended as a primary data source for academic publications.

  15. Network Traffic Dataset

    • kaggle.com
    Updated Oct 31, 2023
    Cite
    Ravikumar Gattu (2023). Network Traffic Dataset [Dataset]. https://www.kaggle.com/datasets/ravikumargattu/network-traffic-dataset
    Explore at:
    Croissant
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ravikumar Gattu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The data presented here was obtained on a Kali machine at the University of Cincinnati, Cincinnati, Ohio, by carrying out packet captures for 1 hour during the evening of Oct 9th, 2023 using Wireshark. The dataset consists of 394137 instances stored in a CSV (Comma Separated Values) file. This large dataset could be used for different machine learning applications, for instance classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection and anomaly detection.

    The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.

    Content :

    This network traffic dataset consists of 7 features. Each instance contains the source and destination IP addresses. The majority of the properties are numeric in nature; however, there are also nominal and date types due to the Timestamp.

    The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).

    Dataset Columns:

    • No: Number of the instance
    • Timestamp: Timestamp of the network traffic instance
    • Source IP: IP address of the source
    • Destination IP: IP address of the destination
    • Protocol: Protocol used by the instance
    • Length: Length of the instance
    • Info: Information about the traffic instance

    Acknowledgements :

    I would like to thank the University of Cincinnati for providing the infrastructure used to generate this network traffic data set.

    Ravikumar Gattu , Susmitha Choppadandi

    Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it supports building machine learning models that can identify specific applications (like TikTok, Wikipedia, Instagram, YouTube, websites, blogs etc.) from IP flow statistics (there are currently 25 applications in total).

    Dataset License: CC0: Public Domain

    Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection and anomaly detection.

    ML techniques that benefit from this dataset:

    This dataset is useful because it consists of 394137 instances of network traffic data obtained by using the 25 applications on public, private and enterprise networks. The dataset also contains features that can be used for most applications of machine learning in cybersecurity. A few of the potential machine learning applications that could benefit from this dataset are:

    1. Network Performance Monitoring: This large network traffic dataset can be used to analyse network traffic and identify patterns in the network. This helps in designing network security algorithms that minimise network problems.

    2. Anomaly Detection: A large network traffic dataset can be used to train machine learning models to find irregularities in the traffic, which could help identify cyber attacks.

    3. Network Intrusion Detection: This large dataset could be used to train machine learning algorithms and design models for detecting traffic issues, malicious traffic, network attacks and DoS attacks.

  16. Predicting Heart Failure

    • kaggle.com
    Updated Sep 13, 2022
    + more versions
    Cite
    Aman Chauhan (2022). Predicting Heart Failure [Dataset]. https://www.kaggle.com/datasets/whenamancodes/heart-failure-clinical-records
    Explore at:
    Croissant
    Dataset updated
    Sep 13, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aman Chauhan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

    Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

    People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

    Attribute Information:

    Thirteen (13) clinical features:

    • age: age of the patient (years)
    • anaemia: decrease of red blood cells or hemoglobin (boolean)
    • high blood pressure: if the patient has hypertension (boolean)
    • creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
    • diabetes: if the patient has diabetes (boolean)
    • ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
    • platelets: platelets in the blood (kiloplatelets/mL)
    • sex: woman or man (binary)
    • serum creatinine: level of serum creatinine in the blood (mg/dL)
    • serum sodium: level of serum sodium in the blood (mEq/L)
    • smoking: if the patient smokes or not (boolean)
    • time: follow-up period (days)
    • [target] death event: if the patient deceased during the follow-up period (boolean)


  17. Online Sales Dataset - Popular Marketplace Data

    • kaggle.com
    Updated May 25, 2024
    Cite
    ShreyanshVerma27 (2024). Online Sales Dataset - Popular Marketplace Data [Dataset]. https://www.kaggle.com/datasets/shreyanshverma27/online-sales-dataset-popular-marketplace-data
    Dataset updated
    May 25, 2024
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    ShreyanshVerma27
    License

    CC0 1.0 Universal (Public Domain Dedication) https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.

    Columns:

    • Order ID: Unique identifier for each sales order.
    • Date: Date of the sales transaction.
    • Category: Broad category of the product sold (e.g., Electronics, Home Appliances, Clothing, Books, Beauty Products, Sports).
    • Product Name: Specific name or model of the product sold.
    • Quantity: Number of units of the product sold in the transaction.
    • Unit Price: Price of one unit of the product.
    • Total Price: Total revenue generated from the sales transaction (Quantity * Unit Price).
    • Region: Geographic region where the transaction occurred (e.g., North America, Europe, Asia).
    • Payment Method: Method used for payment (e.g., Credit Card, PayPal, Debit Card).
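
    The relationship Total Price = Quantity * Unit Price stated in the column list can be checked row by row. The following is a minimal sketch, not code from the dataset author: the sample rows are made up for illustration, and the helper name `inconsistent_orders` is hypothetical. Only the column names come from the description.

```python
from decimal import Decimal

# Hypothetical sample rows using the column names from the list above;
# the values are made up for illustration and are not taken from the dataset.
rows = [
    {"Order ID": "10001", "Quantity": "2", "Unit Price": "19.99", "Total Price": "39.98"},
    {"Order ID": "10002", "Quantity": "1", "Unit Price": "499.00", "Total Price": "499.00"},
    {"Order ID": "10003", "Quantity": "3", "Unit Price": "5.50", "Total Price": "16.00"},
]


def inconsistent_orders(rows):
    """Return the Order IDs whose Total Price differs from Quantity * Unit Price."""
    bad = []
    for row in rows:
        # Decimal avoids binary floating-point rounding on currency values.
        expected = int(row["Quantity"]) * Decimal(row["Unit Price"])
        if expected != Decimal(row["Total Price"]):
            bad.append(row["Order ID"])
    return bad


mismatches = inconsistent_orders(rows)
print(mismatches)
```

    Parsing prices as `Decimal` rather than `float` is a design choice made here so that exact equality is a meaningful test for currency amounts.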

    Insights:

    • Analyze sales trends over time to identify seasonal patterns or growth opportunities.
    • Explore the popularity of different product categories across regions.
    • Investigate the impact of payment methods on sales volume or revenue.
    • Identify top-selling products within each category to optimize inventory and marketing strategies.
    • Evaluate the performance of specific products or categories in different regions to tailor marketing campaigns accordingly.
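
    Several of the analyses listed above reduce to grouping transactions and summing Total Price. The sketch below illustrates that idea with the standard library only. The sample rows, the ISO-style date format, and the helper name `revenue_by` are assumptions made for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample rows; column names follow the dataset description,
# values and the YYYY-MM-DD date format are assumptions.
rows = [
    {"Date": "2024-01-15", "Category": "Electronics", "Region": "North America", "Total Price": "100.00"},
    {"Date": "2024-01-20", "Category": "Electronics", "Region": "Europe", "Total Price": "50.00"},
    {"Date": "2024-02-03", "Category": "Books", "Region": "Europe", "Total Price": "20.00"},
    {"Date": "2024-02-10", "Category": "Electronics", "Region": "North America", "Total Price": "25.50"},
]


def revenue_by(rows, *keys):
    """Sum Total Price over rows, grouped by the given column names."""
    totals = defaultdict(Decimal)  # Decimal() is 0, so missing groups start at zero
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += Decimal(row["Total Price"])
    return dict(totals)


# Revenue per (Category, Region) pair, e.g. to compare categories across regions.
by_category_region = revenue_by(rows, "Category", "Region")

# Revenue per month, using the first 7 characters (YYYY-MM) of the assumed date format.
monthly = defaultdict(Decimal)
for row in rows:
    monthly[row["Date"][:7]] += Decimal(row["Total Price"])

print(by_category_region)
print(dict(monthly))
```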
  18. Moodle grades and action logs

    • kaggle.com
    Updated May 2, 2025
    Cite
    Martins Sneiders (2025). Moodle grades and action logs [Dataset]. https://www.kaggle.com/datasets/martinssneiders/moodle-grades-and-action-logs
    Dataset updated
    May 2, 2025
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    Martins Sneiders
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset for publication "Comparative analysis of time series models for student data in the Moodle platform".

  19. Equity in Healthcare Clean DataSets

    • kaggle.com
    Updated Feb 21, 2024
    Cite
    Anopsy (2024). Equity in Healthcare Clean DataSets [Dataset]. https://www.kaggle.com/datasets/anopsy/equity-in-healthcare-clean-datasets
    Dataset updated
    Feb 21, 2024
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    Anopsy
    Description

    This dataset is based on the train and test datasets from this competition: https://www.kaggle.com/competitions/widsdatathon2024-challenge1

    What did I change?

    1. I dropped 2 columns that contained too little data.
    2. Using machine learning, I imputed "payer_type", "patient_race" and "bmi".
    3. Using "patient_zip3", I filled missing values in "patient_state", "Region" and "Division".
    4. Using SimpleImputer, I imputed the few missing numeric values in "Ozone", "PM2.5" and other columns.
    5. I created some new features, based on demographic features, that may be a bit more informative.
    6. I tokenized the 'breast_cancer_diagnosis_desc' column.
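
    Step 3 above (filling "patient_state" from "patient_zip3") can be pictured as a lookup built from the rows where both values are present. The sketch below is only one plausible way to do it and is not the author's code, which lives in the linked notebooks. The sample records and the helper name `fill_state_from_zip3` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records; only the column names come from the description.
records = [
    {"patient_zip3": "100", "patient_state": "NY"},
    {"patient_zip3": "100", "patient_state": None},
    {"patient_zip3": "606", "patient_state": "IL"},
    {"patient_zip3": "606", "patient_state": "IL"},
    {"patient_zip3": "606", "patient_state": None},
    {"patient_zip3": "999", "patient_state": None},  # zip3 never seen with a state
]


def fill_state_from_zip3(records):
    """Fill missing patient_state with the most common state seen for the same zip3."""
    seen = defaultdict(Counter)
    for rec in records:
        if rec["patient_state"] is not None:
            seen[rec["patient_zip3"]][rec["patient_state"]] += 1

    # For each zip3, keep the state that occurs most often.
    lookup = {zip3: counts.most_common(1)[0][0] for zip3, counts in seen.items()}

    # Return new dicts so the input records are left unchanged.
    return [
        {**rec, "patient_state": rec["patient_state"] or lookup.get(rec["patient_zip3"])}
        for rec in records
    ]


filled = fill_state_from_zip3(records)
print([rec["patient_state"] for rec in filled])
```

    A zip3 prefix that never appears with a known state stays missing in this sketch; how the author handled that case is not stated in the description.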

    If you're interested in how I did that, check these notebooks: https://www.kaggle.com/code/anopsy/ml-for-missing-values. For "bmi" and the new features, check this: https://www.kaggle.com/code/anopsy/fe-and-xgb-on-clean-data

    According to the description of the original dataset, it's a "39k record dataset (split into training and test sets) representing patients and their characteristics (age, race, BMI, zip code), their diagnosis and treatment information (breast cancer diagnosis code, metastatic cancer diagnosis code, metastatic cancer treatments, … etc.), their geo (zip-code level) demographic data (income, education, rent, race, poverty, …etc), as well as toxic air quality data (Ozone, PM25 and NO2)."

  20. Kaggle Datasets Ranking

    • kaggle.com
    Updated Jan 19, 2022
    Cite
    Vivo Vinco (2022). Kaggle Datasets Ranking [Dataset]. https://www.kaggle.com/datasets/vivovinco/kaggle-datasets-ranking/discussion
    Dataset updated
    Jan 19, 2022
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    Vivo Vinco
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    This dataset contains the Kaggle ranking of datasets.

    Content

    800+ rows and 8 columns. Column descriptions are listed below.

    • Rank : Rank of the user
    • Tier : Grandmaster, Master or Expert
    • Username : Name of the user
    • Join Date : Year the user joined
    • Gold Medals : Number of gold medals
    • Silver Medals : Number of silver medals
    • Bronze Medals : Number of bronze medals
    • Points : Total points

    Acknowledgements

    Data from Kaggle. Image from The Guardian.

    If you're reading this, please upvote.
