100+ datasets found
  1. Meta Kaggle

    • kaggle.com
    zip
    Updated Mar 7, 2026
    Cite
    Kaggle (2026). Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle
    Available download formats: zip (10349076623 bytes)
    Dataset updated
    Mar 7, 2026
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Meta Kaggle

    Explore our public data on competitions, datasets, kernels (code / notebooks) and more

    Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

    Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.

    [Image: Kaggle Leaderboard Performance (https://imgur.com/2Egeb8R.png)]

    This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

    Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

    August 2023 update

    In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here: https://www.kaggle.com/datasets/kaggle/meta-kaggle-code

    We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

  2. Predictive Maintenance Dataset

    • kaggle.com
    zip
    Updated Nov 7, 2022
    Cite
    Himanshu Agarwal (2022). Predictive Maintenance Dataset [Dataset]. https://www.kaggle.com/datasets/hiimanshuagarwal/predictive-maintenance-dataset
    Available download formats: zip (1798425 bytes)
    Dataset updated
    Nov 7, 2022
    Authors
    Himanshu Agarwal
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    A company has a fleet of devices transmitting daily sensor readings. They would like to create a predictive maintenance solution to proactively identify when maintenance should be performed. This approach promises cost savings over routine or time-based preventive maintenance, because tasks are performed only when warranted.

    The task is to build a machine learning model that predicts the probability of a device failure. When building this model, be sure to minimize false positives and false negatives. The column to predict is called failure and is binary: 0 for non-failure and 1 for failure.
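
    As a rough illustration of the evaluation goal above, the following sketch counts false positives and false negatives for binary failure predictions. The labels are made-up example values and `confusion_counts` is a hypothetical helper; neither comes from the dataset itself.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = failure, 0 = non-failure)."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1  # false alarm: maintenance triggered needlessly
        elif actual == 0 and predicted == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1  # missed failure (assumes labels are only 0 or 1)
    return counts


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
```

    Precision falls as false positives grow and recall falls as false negatives grow, so tracking both reflects the stated goal of keeping each error type low.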

  3. Online Courses

    • kaggle.com
    zip
    Updated Jun 28, 2023
    Cite
    Khaled Shata (2023). Online Courses [Dataset]. https://www.kaggle.com/datasets/khaledatef1/online-courses
    Available download formats: zip (1314629 bytes)
    Dataset updated
    Jun 28, 2023
    Authors
    Khaled Shata
    Description

    The dataset contains information on around 10,000 online courses from popular online learning platforms such as Coursera, Udacity, Simplilearn, and FutureLearn. The data was scraped and compiled, and the dataset was updated until 2023. It provides valuable information for analyzing and understanding the online learning landscape as of that year.

    The dataset is typically available in a structured format, such as a CSV (Comma-Separated Values) file or a spreadsheet, with each row representing a course and each column representing a specific attribute or feature of the course.

    Potential Applications:

    1- Course Recommendations: Analyzing the dataset can provide insights for recommending courses to individuals based on their interests, skill level, and career goals.

    2- Market Analysis: Researchers or analysts can use the dataset to study the market share and popularity of different online learning platforms and subject areas.

    3- Skill Demand Analysis: The dataset can help identify the most in-demand skills and subject areas among online learners.

    4- Educational Research: Researchers can leverage the dataset to investigate trends and patterns in online learning, instructional design, and course delivery.

  4. NYC Open Data

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    NYC Open Data (2019). NYC Open Data [Dataset]. https://www.kaggle.com/datasets/nycopendata/new-york
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset authored and provided by
    NYC Open Data
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    NYC Open Data is an opportunity to engage New Yorkers in the information that is produced and used by City government. We believe that every New Yorker can benefit from Open Data, and Open Data can benefit from every New Yorker. Source: https://opendata.cityofnewyork.us/overview/

    Content

    Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:

    • Over 8 million 311 service requests from 2012-2016

    • More than 1 million motor vehicle collisions 2012-present

    • Citi Bike stations and 30 million Citi Bike trips 2013-present

    • Over 1 billion Yellow and Green Taxi rides from 2009-present

    • Over 500,000 sidewalk trees surveyed decennially in 1995, 2005, and 2015

    This dataset is deprecated and not being updated.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://opendata.cityofnewyork.us/

    https://cloud.google.com/blog/big-data/2017/01/new-york-city-public-datasets-now-available-on-google-bigquery

    This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://data.cityofnewyork.us/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

    The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

    Banner Photo by @bicadmedia from Unsplash.

    Inspiration

    On which New York City streets are you most likely to find a loud party?

    Can you find the Virginia Pines in New York City?

    Where was the only collision caused by an animal that injured a cyclist?

    What’s the Citi Bike record for the Longest Distance in the Shortest Time (on a route with at least 100 rides)?

    [Image: https://cloud.google.com/blog/big-data/2017/01/images/148467900588042/nyc-dataset-6.png]

  5. COVID-19 Dataset

    • kaggle.com
    zip
    Updated Nov 13, 2022
    + more versions
    Cite
    Meir Nizri (2022). COVID-19 Dataset [Dataset]. https://www.kaggle.com/datasets/meirnizri/covid19-dataset
    Available download formats: zip (4890659 bytes)
    Dataset updated
    Nov 13, 2022
    Authors
    Meir Nizri
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus. Most people infected with the COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment. Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer, are more likely to develop serious illness. During the entire course of the pandemic, one of the main problems that healthcare providers have faced is the shortage of medical resources and the lack of a proper plan to distribute them efficiently. In these tough times, being able to predict what kind of resource an individual might require at the time of testing positive, or even before that, will be of immense help to the authorities, as they would be able to procure and arrange the resources necessary to save the life of that patient.

    The main goal of this project is to build a machine learning model that, given a COVID-19 patient's current symptoms, status, and medical history, predicts whether the patient is at high risk.

    Content

    The dataset was provided by the Mexican government (link). It contains a large amount of anonymized patient-related information, including pre-existing conditions. The raw dataset consists of 21 unique features and 1,048,576 unique patients. In the Boolean features, 1 means "yes" and 2 means "no". Values of 97 and 99 indicate missing data.

    • sex: 1 for female and 2 for male.
    • age: age of the patient.
    • classification: covid test findings. Values 1-3 mean that the patient was diagnosed with covid in different degrees. 4 or higher means that the patient is not a carrier of covid or that the test is inconclusive.
    • patient type: type of care the patient received in the unit. 1 for returned home and 2 for hospitalization.
    • pneumonia: whether the patient already has air sac inflammation or not.
    • pregnancy: whether the patient is pregnant or not.
    • diabetes: whether the patient has diabetes or not.
    • copd: whether the patient has chronic obstructive pulmonary disease or not.
    • asthma: whether the patient has asthma or not.
    • inmsupr: whether the patient is immunosuppressed or not.
    • hypertension: whether the patient has hypertension or not.
    • cardiovascular: whether the patient has a heart or blood vessel related disease.
    • renal chronic: whether the patient has chronic renal disease or not.
    • other disease: whether the patient has another disease or not.
    • obesity: whether the patient is obese or not.
    • tobacco: whether the patient is a tobacco user.
    • usmr: whether the patient was treated in medical units of the first, second or third level.
    • medical unit: type of institution of the National Health System that provided the care.
    • intubed: whether the patient was connected to a ventilator.
    • icu: whether the patient was admitted to an Intensive Care Unit.
    • date died: the date of death if the patient died, and 9999-99-99 otherwise.
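
    The coding scheme described above can be sketched as a few small decoding helpers. This is a minimal illustration, not the dataset author's code: the dictionary keys in `record` are hypothetical column names, the sample values are made up, and raising an error on any code other than 1, 2, 97 or 99 is an assumption of this sketch.

```python
from typing import Optional

MISSING_CODES = {97, 99}            # described above as missing data
NO_DEATH_SENTINEL = "9999-99-99"    # value of "date died" for survivors


def decode_boolean(code: int) -> Optional[bool]:
    """Map 1 -> True ("yes"), 2 -> False ("no"), 97/99 -> None (missing)."""
    if code == 1:
        return True
    if code == 2:
        return False
    if code in MISSING_CODES:
        return None
    # Assumption: any other code is treated as unexpected in this sketch.
    raise ValueError(f"unexpected Boolean code: {code}")


def is_covid_positive(classification: int) -> bool:
    """Values 1-3 mean a covid diagnosis; 4 or higher means negative or inconclusive."""
    return 1 <= classification <= 3


def has_died(date_died: str) -> bool:
    """A real date means the patient died; the sentinel means they did not."""
    return date_died != NO_DEATH_SENTINEL


# Hypothetical record; keys and values are for illustration only.
record = {"diabetes": 2, "icu": 97, "classification": 3, "date_died": "9999-99-99"}

decoded = {
    "diabetes": decode_boolean(record["diabetes"]),
    "icu": decode_boolean(record["icu"]),
    "covid_positive": is_covid_positive(record["classification"]),
    "died": has_died(record["date_died"]),
}
```

    Keeping missing values as `None` rather than folding them into "no" avoids silently treating unknown conditions as absent.
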
  6. 🚨 Fake Reviews Dataset

    • kaggle.com
    zip
    Updated Sep 17, 2023
    Cite
    mexwell (2023). 🚨 Fake Reviews Dataset [Dataset]. https://www.kaggle.com/datasets/mexwell/fake-reviews-dataset
    Available download formats: zip (5016888 bytes)
    Dataset updated
    Sep 17, 2023
    Authors
    mexwell
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    A generated fake reviews dataset containing 20k fake reviews and 20k real product reviews. OR = Original reviews (presumably human-created and authentic); CG = Computer-generated fake reviews.

    Citation

    Salminen, J., Kandpal, C., Kamel, A. M., Jung, S., & Jansen, B. J. (2022). Creating and detecting fake reviews of online products. Journal of Retailing and Consumer Services, 64, 102771. https://doi.org/10.1016/j.jretconser.2021.102771

    Acknowledgement

    Photo by Brett Jordan on Unsplash

  7. Detailed Products Datasets

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    Sujay Kapadnis (2023). Detailed Products Datasets [Dataset]. https://www.kaggle.com/datasets/sujaykapadnis/products-datasets
    Available download formats: zip (102115 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    Sujay Kapadnis
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    List of products with the following attributes:

    • S.No
    • BrandName
    • Product ID
    • Product Name
    • Brand Desc
    • Product Size
    • Currency
    • MRP
    • SellPrice
    • Discount
    • Category

      Kari, Venkatram (2023), “Product Dataset”, Mendeley Data, V1, doi: 10.17632/v8yt3r8th2.1

  8. MSRVTT

    • kaggle.com
    • opendatalab.com
    zip
    Updated Nov 7, 2022
    Cite
    Vishnutheep B (2022). MSRVTT [Dataset]. https://www.kaggle.com/datasets/vishnutheepb/msrvtt
    Available download formats: zip (4574604594 bytes)
    Dataset updated
    Nov 7, 2022
    Authors
    Vishnutheep B
    Description

    MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning. It consists of 10,000 video clips from 20 categories, and each video clip is annotated with 20 English sentences by Amazon Mechanical Turk workers. There are about 29,000 unique words in all captions. The standard split uses 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing.
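
    The split sizes above add up to the full 10,000 clips. The sketch below checks that arithmetic and turns the sizes into index ranges. It assumes, purely for illustration, that clips are indexed 0 to 9,999 and assigned contiguously in train, validation, test order; the description does not state how clips are actually assigned.

```python
SPLIT_SIZES = {"train": 6513, "validation": 497, "test": 2990}
TOTAL_CLIPS = 10_000


def split_ranges(sizes):
    """Turn ordered split sizes into contiguous, non-overlapping index ranges."""
    ranges = {}
    start = 0
    for name, size in sizes.items():
        ranges[name] = range(start, start + size)
        start += size
    return ranges


ranges = split_ranges(SPLIT_SIZES)
covered = sum(len(r) for r in ranges.values())  # number of clips covered by all splits
```

    Here `covered` equals `TOTAL_CLIPS`, confirming that the three splits partition the dataset without leftovers.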

  9. Mental Health Dataset

    • kaggle.com
    zip
    Updated Mar 18, 2024
    Cite
    Bhavik Jikadara (2024). Mental Health Dataset [Dataset]. https://www.kaggle.com/datasets/bhavikjikadara/mental-health-dataset
    Available download formats: zip (2048887 bytes)
    Dataset updated
    Mar 18, 2024
    Authors
    Bhavik Jikadara
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset appears to contain a variety of features related to text analysis, sentiment analysis, and psychological indicators, likely derived from posts or text data. Some features include readability indices such as Automated Readability Index (ARI), Coleman Liau Index, and Flesch-Kincaid Grade Level, as well as sentiment analysis scores like sentiment compound, negative, neutral, and positive scores. Additionally, there are features related to psychological aspects such as economic stress, isolation, substance use, and domestic stress. The dataset seems to cover a wide range of linguistic, psychological, and behavioural attributes, potentially suitable for analyzing mental health-related topics in online communities or text data.

    Benefits of using this dataset:

    • Insight into Mental Health: The dataset provides valuable insights into mental health by analyzing linguistic patterns, sentiment, and psychological indicators in text data. Researchers and data scientists can gain a better understanding of how mental health issues manifest in online communication.
    • Predictive Modeling: With a wide range of features, including sentiment analysis scores and psychological indicators, the dataset offers opportunities for developing predictive models to identify or predict mental health outcomes based on textual data. This can be useful for early intervention and support.
    • Community Engagement: Mental health is a topic of increasing importance, and this dataset can foster community engagement on platforms like Kaggle. Data enthusiasts, researchers, and mental health professionals can collaborate to analyze the data and develop solutions to address mental health challenges.
    • Data-driven Insights: By analyzing the dataset, users can uncover correlations and patterns between linguistic features, sentiment, and mental health indicators. These insights can inform interventions, policies, and support systems aimed at promoting mental well-being.
    • Educational Resource: The dataset can serve as a valuable educational resource for teaching and learning about mental health analytics, sentiment analysis, and text mining techniques. It provides a real-world dataset for students and practitioners to apply data science skills in a meaningful context.
  10. Student Mental health

    • kaggle.com
    zip
    Updated Feb 17, 2023
    + more versions
    Cite
    MD Shariful Islam (2023). Student Mental health [Dataset]. https://www.kaggle.com/datasets/shariful07/student-mental-health
    Available download formats: zip (1664 bytes)
    Dataset updated
    Feb 17, 2023
    Authors
    MD Shariful Islam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A statistical study of the effects of mental health on students' CGPA. This dataset was collected through a Google Forms survey of university students in order to examine their current academic situation and mental health.

    All data was collected in Malaysia, at IIUM (International Islamic University Malaysia).

  11. Loan Approval Classification Dataset

    • kaggle.com
    zip
    Updated Oct 29, 2024
    Cite
    Ta-wei Lo (2024). Loan Approval Classification Dataset [Dataset]. https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data
    Available download formats: zip (768769 bytes)
    Dataset updated
    Oct 29, 2024
    Authors
    Ta-wei Lo
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    1. Data Source

    This dataset is a synthetic version inspired by the original Credit Risk dataset on Kaggle and enriched with additional variables based on Financial Risk for Loan Approval data. SMOTENC was used to simulate new data points to enlarge the instances. The dataset is structured for both categorical and continuous features.

    2. Metadata

    The dataset contains 45,000 records and 14 variables, each described below:

    Column | Description | Type
    person_age | Age of the person | Float
    person_gender | Gender of the person | Categorical
    person_education | Highest education level | Categorical
    person_income | Annual income | Float
    person_emp_exp | Years of employment experience | Integer
    person_home_ownership | Home ownership status (e.g., rent, own, mortgage) | Categorical
    loan_amnt | Loan amount requested | Float
    loan_intent | Purpose of the loan | Categorical
    loan_int_rate | Loan interest rate | Float
    loan_percent_income | Loan amount as a percentage of annual income | Float
    cb_person_cred_hist_length | Length of credit history in years | Float
    credit_score | Credit score of the person | Integer
    previous_loan_defaults_on_file | Indicator of previous loan defaults | Categorical
    loan_status (target variable) | Loan approval status: 1 = approved; 0 = rejected | Integer

    3. Data Usage

    The dataset can be used for multiple purposes:

    • Exploratory Data Analysis (EDA): Analyze key features, distribution patterns, and relationships to understand credit risk factors.
    • Classification: Build predictive models to classify the loan_status variable (approved/not approved) for potential applicants.
    • Regression: Develop regression models to predict the credit_score variable based on individual and loan-related attributes.

    Be aware of data issues inherited from the original data, such as records with a person_age above 100 years.
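
    One way to act on this note is to separate implausible ages before modeling. The sketch below is only an illustration: the rows are made-up values, `split_by_age` is a hypothetical helper, and the cutoff of 100 is taken from the remark above rather than from any documented cleaning rule.

```python
def split_by_age(records, max_age=100.0):
    """Separate records into plausible and suspicious groups by person_age."""
    plausible, suspicious = [], []
    for row in records:
        if row["person_age"] <= max_age:
            plausible.append(row)
        else:
            suspicious.append(row)
    return plausible, suspicious


# Made-up rows using two of the column names listed in the metadata.
records = [
    {"person_age": 22.0, "loan_status": 1},
    {"person_age": 144.0, "loan_status": 0},
    {"person_age": 35.0, "loan_status": 0},
]

plausible, suspicious = split_by_age(records)
```

    Returning the suspicious rows instead of dropping them silently makes it easier to inspect how many records are affected before deciding what to do with them.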

    This dataset provides a rich basis for understanding financial risk factors and simulating predictive modeling processes for loan approval and credit scoring.

    Feel free to leave comments on the discussion. I'd appreciate your upvote if you find my dataset useful! 😀

  12. 🖼️ Famous Paintings

    • kaggle.com
    zip
    Updated Oct 5, 2023
    Cite
    mexwell (2023). 🖼️ Famous Paintings [Dataset]. https://www.kaggle.com/datasets/mexwell/famous-paintings
    Available download formats: zip (6681482 bytes)
    Dataset updated
    Oct 5, 2023
    Authors
    mexwell
    Description

    Famous paintings and their artists. This dataset is published to give students interesting data for practicing SQL.

    Original Data

    Acknowledgement

    Photo by Steve Johnson on Unsplash

  13. Kaggle Dataset Metadata Repository

    • kaggle.com
    zip
    Updated Nov 16, 2024
    Cite
    Ijaj Ahmed (2024). Kaggle Dataset Metadata Repository [Dataset]. https://www.kaggle.com/datasets/ijajdatanerd/kaggle-dataset-metadata-repository
    Available download formats: zip (5122110 bytes)
    Dataset updated
    Nov 16, 2024
    Authors
    Ijaj Ahmed
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Kaggle Dataset Metadata Collection 📊

    This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle. 📚

    Dataset Overview:

    • Purpose: To provide detailed insights into Kaggle dataset metadata.
    • Content: Information related to the dataset's owner, creator, usage metrics, licensing, and more.
    • Target Audience: Data scientists, Kaggle competitors, and dataset curators.

    Columns Description 📋

    • datasetUrl 🌐: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.

    • ownerAvatarUrl 🖼️: The URL of the dataset owner's profile avatar on Kaggle.

    • ownerName 👤: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.

    • ownerUrl 🌍: A link to the Kaggle profile page of the dataset owner.

    • ownerUserId 💼: The unique user ID of the dataset owner on Kaggle.

    • ownerTier 🎖️: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.

    • creatorName 👩‍💻: The name of the dataset creator, which could be different from the owner.

    • creatorUrl 🌍: A link to the Kaggle profile page of the dataset creator.

    • creatorUserId 💼: The unique user ID of the dataset creator.

    • scriptCount 📜: The number of scripts (kernels) associated with this dataset.

    • scriptsUrl 🔗: A link to the scripts (kernels) page for the dataset, where you can explore related code.

    • forumUrl 💬: The URL to the discussion forum for this dataset, where users can ask questions and share insights.

    • viewCount 👀: The number of views the dataset page has received on Kaggle.

    • downloadCount ⬇️: The number of times the dataset has been downloaded by users.

    • dateCreated 📅: The date when the dataset was first created and uploaded to Kaggle.

    • dateUpdated 🔄: The date when the dataset was last updated or modified.

    • voteButton 👍: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.

    • categories 🏷️: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").

    • licenseName 🛡️: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").

    • licenseShortName 🔑: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).

    • datasetSize 📦: The size of the dataset in terms of storage, typically measured in MB or GB.

    • commonFileTypes 📂: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).

    • downloadUrl ⬇️: A direct link to download the dataset files.

    • newKernelNotebookUrl 📝: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.

    • newKernelScriptUrl 💻: A link to a new script for running computations or processing data related to the dataset.

    • usabilityRating 🌟: A rating or score representing how usable the dataset is, based on user feedback.

    • firestorePath 🔍: A reference to the path in Firestore where this dataset’s metadata is stored.

    • datasetSlug 🏷️: A URL-friendly version of the dataset name, typically used for URLs.

    • rank 📈: The dataset's rank based on certain metrics (e.g., downloads, votes, views).

    • datasource 🌐: The source or origin of the dataset (e.g., government data, private organizations).

    • medalUrl 🏅: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.

    • hasHashLink 🔗: Indicates whether the dataset has a hash link for verifying data integrity.

    • ownerOrganizationId 🏢: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.

    • totalVotes 🗳️: The total number of votes the dataset has received from users, reflecting its popularity or quality.

    • category_names 📑: A comma-separated string of category names that represent the dataset’s classification.

    This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way. 🌍📊

  14. Ecommerce Text Classification

    • kaggle.com
    zip
    Updated Oct 9, 2023
    + more versions
    Cite
    Saurabh Shahane (2023). Ecommerce Text Classification [Dataset]. https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification
    Available download formats: zip (8236809 bytes)
    Dataset updated
    Oct 9, 2023
    Authors
    Saurabh Shahane
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is a classification-oriented e-commerce text dataset covering 4 categories: "Electronics", "Household", "Books" and "Clothing & Accessories", which cover almost 80% of any e-commerce website.

    The dataset is in ".csv" format with two columns - the first column is the class name and the second one is the datapoint of that class. The data point is the product and description from the e-commerce website.
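
    A two-column file of this shape can be read with Python's standard csv module. The sketch below is a minimal illustration using made-up rows in place of the real file; it assumes there is no header row, which the description does not confirm.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for the real file: class name, then product text.
SAMPLE_CSV = """\
Books,"A paperback novel, 320 pages"
Electronics,USB-C charging cable 1m
Books,Hardcover cookbook with 100 recipes
Household,Stainless steel water bottle
"""


def count_classes(text):
    """Count how many rows belong to each class (first column)."""
    reader = csv.reader(io.StringIO(text))
    return Counter(row[0] for row in reader if row)


class_counts = count_classes(SAMPLE_CSV)
```

    Using `csv.reader` rather than splitting on commas matters here, because product descriptions can themselves contain commas inside quoted fields.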

    The dataset has the following features:

    Data Set Characteristics: Multivariate

    Number of Instances: 50425

    Number of classes: 4

    Area: Computer science

    Attribute Characteristics: Real

    Number of Attributes: 1

    Associated Tasks: Classification

    Missing Values? No

    Gautam. (2019). E commerce text dataset (version - 2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3355823

  15. Kaggle Dataset Medals

    • kaggle.com
    zip
    Updated Dec 19, 2021
    Cite
    Niek van der Zwaag (2021). Kaggle Dataset Medals [Dataset]. https://www.kaggle.com/datasets/niekvanderzwaag/kaggle-dataset-medals
    Available download formats: zip (4426597 bytes)
    Dataset updated
    Dec 19, 2021
    Authors
    Niek van der Zwaag
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    [Image: medals (https://github.com/dean-kg/RoadToExpertRanking_Kaggle/blob/main/kg_medal.png?raw=true)]

    Dataset Medals

    Dataset Medals are awarded to popular public datasets published to the site, as measured by number of upvotes. Not all upvotes count towards medals: votes by novices are excluded from medal calculation.
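
    The distinction between all upvotes and medal-eligible upvotes can be sketched as follows. The voter tiers are made-up example values and `count_votes` is a hypothetical helper; the sketch only mirrors the rule that Novice votes are excluded and says nothing about how many votes a medal requires.

```python
def count_votes(voter_tiers):
    """Return (total votes, votes that count towards medals)."""
    total = len(voter_tiers)
    # Votes from the Novice tier are excluded from medal calculation.
    advanced = sum(1 for tier in voter_tiers if tier != "Novice")
    return total, advanced


# Made-up tiers of the users who upvoted one dataset.
voter_tiers = ["Novice", "Expert", "Contributor", "Novice", "Master"]
votes, votes_advanced = count_votes(voter_tiers)
```

    In this example the dataset has 5 votes in total, of which 3 count towards a medal.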

    Content

    Metadata of 42,955 datasets on Kaggle from 2015-12 to 2021-11

    • Medal: color of received medal
    • Created: time of creation
    • URL: URL to dataset on kaggle.com
    • Views: total view count
    • Votes: total vote count
    • Votes_Advanced: total vote count excluding votes from 'Novice' rank
    • Downloads: total download count
    • Kernels: total kernel count
    • Title: title of dataset
    • Description: description of dataset
    • Tags: tags of dataset
    • License: licence under which dataset is published

    Acknowledgements

    Tidied up version of dataset provided by @kukuroo3

    Source: https://www.kaggle.com/kukuroo3/dataset-of-kaggle-dataset-include-medalvotecount

  16. IT_incident_log_Dataset

    • kaggle.com
    zip
    Updated Jul 4, 2020
    + more versions
    Cite
    shamiul islam shifat (2020). IT_incident_log_Dataset [Dataset]. https://www.kaggle.com/datasets/shamiulislamshifat/it-incident-log-dataset
    Available download formats: zip (2571433 bytes)
    Dataset updated
    Jul 4, 2020
    Authors
    shamiul islam shifat
    Description

    Data Set Information:

    This is an event log of an incident management process extracted from data gathered from the audit system of an instance of the ServiceNow™ platform used by an IT company. The event log is enriched with data loaded from a relational database underlying a corresponding process-aware information system. Information was anonymized for privacy.

    Number of instances: 141,712 events (24,918 incidents) Number of attributes: 36 attributes (1 case identifier, 1 state identifier, 32 descriptive attributes, 2 dependent variables)

    Attribute Information:

    1. number: incident identifier (24,918 different values);
    2. incident state: eight levels controlling the incident management process transitions from opening until closing the case;
    3. active: boolean attribute that shows whether the record is active or closed/canceled;
    4. reassignment_count: number of times the incident's assigned group or support analyst was changed;
    5. reopen_count: number of times the incident resolution was rejected by the caller;
    6. sys_mod_count: number of incident updates until that moment;
    7. made_sla: boolean attribute that shows whether the incident exceeded the target SLA;
    8. caller_id: identifier of the user affected;
    9. opened_by: identifier of the user who reported the incident;
    10. opened_at: incident user opening date and time;
    11. sys_created_by: identifier of the user who registered the incident;
    12. sys_created_at: incident system creation date and time;
    13. sys_updated_by: identifier of the user who updated the incident and generated the current log record;
    14. sys_updated_at: incident system update date and time;
    15. contact_type: categorical attribute that shows by what means the incident was reported;
    16. location: identifier of the location of the place affected;
    17. category: first-level description of the affected service;
    18. subcategory: second-level description of the affected service (related to the first level description, i.e., to category);
    19. u_symptom: description of the user perception about service availability;
    20. cmdb_ci: (configuration item) identifier used to report the affected item (not mandatory);
    21. impact: description of the impact caused by the incident (values: 1–High; 2–Medium; 3–Low);
    22. urgency: description of the urgency informed by the user for the incident resolution (values: 1–High; 2–Medium; 3–Low);
    23. priority: calculated by the system based on 'impact' and 'urgency';
    24. assignment_group: identifier of the support group in charge of the incident;
    25. assigned_to: identifier of the user in charge of the incident;
    26. knowledge: boolean attribute that shows whether a knowledge base document was used to resolve the incident;
    27. u_priority_confirmation: boolean attribute that shows whether the priority field has been double-checked;
    28. notify: categorical attribute that shows whether notifications were generated for the incident;
    29. problem_id: identifier of the problem associated with the incident;
    30. rfc: (request for change) identifier of the change request associated with the incident;
    31. vendor: identifier of the vendor in charge of the incident;
    32. caused_by: identifier of the RFC responsible for the incident;
    33. close_code: identifier of the resolution of the incident;
    34. resolved_by: identifier of the user who resolved the incident;
    35. resolved_at: incident user resolution date and time (dependent variable);
    36. closed_at: incident user close date and time (dependent variable).
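
    The two dependent variables above are timestamps, so a completion-time target has to be derived from them. The following is a minimal sketch of that step, not the authors' method. It assumes day-first `DD/MM/YYYY HH:MM` timestamps, which the description does not confirm, and uses made-up sample rows; only the column names come from the attribute list.

```python
import csv
import io
from datetime import datetime

# Assumption: timestamps are formatted day-first as DD/MM/YYYY HH:MM.
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M"

# Hypothetical sample rows; only the column names come from the attribute list.
SAMPLE_LOG = """number,incident_state,opened_at,closed_at
INC0000001,New,01/03/2016 08:00,03/03/2016 10:30
INC0000001,Closed,01/03/2016 08:00,03/03/2016 10:30
INC0000002,Closed,02/03/2016 09:15,02/03/2016 17:45
"""


def completion_hours(log_file):
    """Return hours from opened_at to closed_at, keyed by incident number."""
    hours = {}
    for row in csv.DictReader(log_file):
        opened = datetime.strptime(row["opened_at"], TIMESTAMP_FORMAT)
        closed = datetime.strptime(row["closed_at"], TIMESTAMP_FORMAT)
        # An incident spans several event rows; the last row seen overwrites earlier ones.
        hours[row["number"]] = (closed - opened).total_seconds() / 3600
    return hours


durations = completion_hours(io.StringIO(SAMPLE_LOG))
```

    Keying the result by `number` collapses the event log to one value per incident, which is the granularity at which a completion time is defined.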

    Relevant Papers:

    Amaral, C. A. L., Fantinato, M., Reijers, H. A., Peres, S. M., Enhancing Completion Time Prediction Through Attribute Selection. Proceedings of the 15th International Conference on Advanced Information Technologies for Management (AITM 2018) and 13th International Conference on Information Systems Management (ISM 2018), Revised Selected Papers – Lecture Notes in Business Information Processing, v. 346, pp. 3-23, 2019.

    Amaral, C. A. L., Fantinato, M., Peres, S. M., Attribute Selection with Filter and Wrapper: An Application on Incident Management Process. Proceedings of the 14th Federated Conference on Computer Science and Information Systems (FedCSIS 2018), pp. 679-682, 2018.

    Maita, A. R. C., Martins, L. C., Paz, C. R. L., Rafferty, L., Hung, P., Peres, S. M., Fantinato, M. A systematic mapping study of process mining. Enterprise Information Systems, v. 12, n. 5, pp. 505-549, 2018.

    Citation Request:

    Please cite this paper if you use this dataset: Amaral, C. A. L., Fantinato, M., Reijers, H. A., Peres, S. M., Enhancing Completion Time Prediction Through Attribute Selection. Proceedings of the 15th International Conference on Advanced Information Technologies for Management (AITM 2018) and 13th International Conference on Information Systems Management (ISM 2018), Revised Selected Papers – Lecture Notes in Business Information Processing, v. 346, pp. 3-23, 2019.

  17. Stellar Classification Dataset - SDSS17

    • kaggle.com
    zip
    Updated Jan 15, 2022
    Cite
    fedesoriano (2022). Stellar Classification Dataset - SDSS17 [Dataset]. https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17
    Explore at:
    zip(7223444 bytes)Available download formats
    Dataset updated
    Jan 15, 2022
    Authors
    fedesoriano
    Description

    Similar Datasets

    • CERN Proton Collision Dataset
    • Airfoil Self-Noise Dataset
    • CERN Electron Collision Data
    • Wind Speed Prediction Dataset
    • Spanish Wine Quality Dataset

    Context

    In astronomy, stellar classification is the classification of stars based on their spectral characteristics. The classification scheme of galaxies, quasars, and stars is one of the most fundamental in astronomy. The early cataloguing of stars and their distribution in the sky led to the understanding that they make up our own galaxy and, once Andromeda was recognized as a galaxy separate from our own, numerous galaxies began to be surveyed as more powerful telescopes were built. This dataset aims to classify stars, galaxies, and quasars based on their spectral characteristics.

    Content

    The data consists of 100,000 observations of space taken by the SDSS (Sloan Digital Sky Survey). Every observation is described by 17 feature columns and 1 class column, which identifies it as a star, galaxy, or quasar.

    1. obj_ID = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS
    2. alpha = Right Ascension angle (at J2000 epoch)
    3. delta = Declination angle (at J2000 epoch)
    4. u = Ultraviolet filter in the photometric system
    5. g = Green filter in the photometric system
    6. r = Red filter in the photometric system
    7. i = Near Infrared filter in the photometric system
    8. z = Infrared filter in the photometric system
    9. run_ID = Run Number used to identify the specific scan
    10. rerun_ID = Rerun Number to specify how the image was processed
    11. cam_col = Camera column to identify the scanline within the run
    12. field_ID = Field number to identify each field
    13. spec_obj_ID = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)
    14. class = Object class (galaxy, star, or quasar object)
    15. redshift = Redshift value based on the increase in wavelength
    16. plate = Plate ID, identifies each plate in SDSS
    17. MJD = Modified Julian Date, used to indicate when a given piece of SDSS data was taken
    18. fiber_ID = Fiber ID that identifies the fiber that pointed the light at the focal plane in each observation
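
    The note on spec_obj_ID implies a consistency rule that can be checked before training: observations sharing a spec_obj_ID should never carry different class labels. The sketch below is one possible way to check it. The sample rows and the label strings (`GALAXY`, `STAR`, `QSO`) are hypothetical; only the column names `spec_obj_ID` and `class` come from the description.

```python
from collections import defaultdict


def conflicting_spec_ids(rows):
    """Return the sorted spec_obj_ID values that appear with more than one class."""
    classes_by_id = defaultdict(set)
    for row in rows:
        classes_by_id[row["spec_obj_ID"]].add(row["class"])
    return sorted(sid for sid, seen in classes_by_id.items() if len(seen) > 1)


# Hypothetical observations; IDs and labels are made up for illustration.
rows = [
    {"spec_obj_ID": "s1", "class": "GALAXY"},
    {"spec_obj_ID": "s1", "class": "GALAXY"},
    {"spec_obj_ID": "s2", "class": "STAR"},
    {"spec_obj_ID": "s2", "class": "QSO"},
]

conflicts = conflicting_spec_ids(rows)
```

    An empty result means the rule holds for the rows checked.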

    Citation

    fedesoriano. (January 2022). Stellar Classification Dataset - SDSS17. Retrieved [Date Retrieved] from https://www.kaggle.com/fedesoriano/stellar-classification-dataset-sdss17.

    Acknowledgements

    The data released by the SDSS is in the public domain. It is taken from the current data release, DR17.

    • More information about the license: http://www.sdss.org/science/image-gallery/

    SDSS Publications:

    • Abdurro’uf et al., The Seventeenth data release of the Sloan Digital Sky Surveys: Complete Release of MaNGA, MaStar and APOGEE-2 DATA (Abdurro’uf et al. submitted to ApJS) [arXiv:2112.02026]

  18. Structural Protein Sequences

    • kaggle.com
    zip
    Updated Feb 3, 2018
    Cite
    SHAHIR (2018). Structural Protein Sequences [Dataset]. https://www.kaggle.com/datasets/shahir/protein-data-set
    Explore at:
    zip(28782775 bytes)Available download formats
    Dataset updated
    Feb 3, 2018
    Authors
    SHAHIR
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    This is a protein data set retrieved from Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).

    The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to each other in the molecule. They then deposit this information, which is then annotated and publicly released into the archive by the wwPDB.

    The constantly-growing PDB is a reflection of the research that is happening in laboratories across the world. This can make it both exciting and challenging to use the database in research and education. Structures are available for many of the proteins and nucleic acids involved in the central processes of life, so you can go to the PDB archive to find structures for ribosomes, oncogenes, drug targets, and even whole viruses. However, it can be a challenge to find the information that you need, since the PDB archives so many different structures. You will often find multiple structures for a given molecule, or partial structures, or structures that have been modified or inactivated from their native form.

    Content

    There are two data files. Both are keyed on the protein's "structureId":

    • pdb_data_no_dups.csv contains protein metadata, including details on protein classification, extraction methods, etc.

    • data_seq.csv contains >400,000 protein structure sequences.
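
    Because both files share "structureId", sequences can be attached to their metadata with a join on that key. The following is a rough sketch using only the standard library. The in-memory CSV text stands in for the two files, and every column name other than `structureId` (`classification`, `chainId`, `sequence`), along with all the IDs and values, is a hypothetical placeholder.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-ins for pdb_data_no_dups.csv and data_seq.csv.
# Only the "structureId" column is named in the description.
META_CSV = """structureId,classification
1ABC,HYDROLASE
2XYZ,TRANSFERASE
"""
SEQ_CSV = """structureId,chainId,sequence
1ABC,A,MKTAYIAK
2XYZ,A,GSHMLEDP
2XYZ,B,GSHMLEDP
9ZZZ,A,AAAA
"""


def join_on_structure_id(meta_file, seq_file):
    """Attach each sequence row to the metadata row with the same structureId."""
    meta_by_id = {row["structureId"]: row for row in csv.DictReader(meta_file)}
    joined = defaultdict(list)
    for seq_row in csv.DictReader(seq_file):
        meta = meta_by_id.get(seq_row["structureId"])
        if meta is None:
            continue  # sequence with no metadata row: dropped in this sketch
        joined[seq_row["structureId"]].append({**meta, **seq_row})
    return dict(joined)


joined = join_on_structure_id(io.StringIO(META_CSV), io.StringIO(SEQ_CSV))
```

    The result maps each structureId to a list because one structure can have several sequence rows; this sketch behaves like an inner join and silently drops unmatched sequences.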

    Acknowledgements

    Original data set downloaded from http://www.rcsb.org/pdb/

    Inspiration

    The Protein Data Bank has helped the life science community study different diseases and develop new drugs and solutions that support human survival.

  19. Framingham heart study dataset

    • kaggle.com
    zip
    Updated Apr 19, 2022
    Cite
    Ashish Bhardwaj (2022). Framingham heart study dataset [Dataset]. https://www.kaggle.com/datasets/aasheesh200/framingham-heart-study-dataset
    Explore at:
    zip(59440 bytes)Available download formats
    Dataset updated
    Apr 19, 2022
    Authors
    Ashish Bhardwaj
    Area covered
    Framingham
    Description

    The "Framingham" heart disease dataset includes over 4,240 records, 16 columns, and 15 attributes. The goal of the dataset is to predict whether the patient has a 10-year risk of future coronary heart disease (CHD).

  20. Financial_Risk

    • kaggle.com
    zip
    Updated Jul 23, 2024
    Cite
    Preetham Gouda (2024). Financial_Risk [Dataset]. https://www.kaggle.com/datasets/preethamgouda/financial-risk
    Explore at:
    zip(709463 bytes)Available download formats
    Dataset updated
    Jul 23, 2024
    Authors
    Preetham Gouda
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Financial Risk Assessment Dataset provides detailed information on individual financial profiles. It includes demographic, financial, and behavioral data to assess financial risk. The dataset features various columns such as income, credit score, and risk rating, with intentional imbalances and missing values to simulate real-world scenarios.


Meta Kaggle

Kaggle's public data on competitions, users, submission scores, code, and more

Explore at:
22 scholarly articles cite this dataset (View in Google Scholar)
