100+ datasets found
  1. Top 1000 Kaggle Datasets

    • kaggle.com
    zip
    Updated Jan 3, 2022
    Cite
    Trrishan (2022). Top 1000 Kaggle Datasets [Dataset]. https://www.kaggle.com/datasets/notkrishna/top-1000-kaggle-datasets
    Available download formats: zip (34269 bytes)
    Dataset updated
    Jan 3, 2022
    Authors
    Trrishan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    From wiki

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was founding chair succeeded by Max Levchin. Equity was raised in 2011 valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]

    Source: Kaggle

  2. Kaggle Top Datasets🚀📊

    • kaggle.com
    zip
    Updated Apr 10, 2024
    Cite
    Aaron Frias (2024). Kaggle Top Datasets🚀📊 [Dataset]. https://www.kaggle.com/datasets/aaronfriasr/kaggle-top-datasets
    Available download formats: zip (1572305 bytes)
    Dataset updated
    Apr 10, 2024
    Authors
    Aaron Frias
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    Kaggle is one of the largest communities of data scientists and machine learning practitioners in the world, and its platform hosts thousands of datasets covering a wide range of topics and industries. With so many options to choose from, it can be difficult to know where to start or which datasets are worth exploring. That's where this dataset comes in. By scraping information about the top 10,000 datasets on Kaggle, we have created a single source of truth for the most popular and useful datasets on the platform. This dataset is not just a list of names and numbers, but a valuable tool for data enthusiasts and professionals alike, providing insights into the latest trends and techniques in data science and machine learning.

    Column description

    • Dataset_name - Name of the dataset
    • Author_name - Name of the author
    • Author_id - Kaggle id of the author
    • No_of_files - Number of files the author has uploaded
    • size - Size of all the files
    • Type_of_file - Type of the files such as csv, json etc.
    • Upvotes - Total upvotes of the dataset
    • Medals - Medal of the dataset
    • Usability - Usability of the dataset
    • Date - Date in which the dataset is uploaded
    • Day - Day in which the dataset is uploaded
    • Time - Time in which the dataset is uploaded
    • Dataset_link - Kaggle link of the dataset
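    As a rough illustration of how these columns might be used, the sketch below ranks rows by Upvotes with Python's standard csv module. The sample rows are invented, and the assumption that Upvotes is stored as a plain integer is not confirmed by the description; only the column names come from the list above.

    ```python
    import csv
    import io

    # Invented sample rows; only the column names come from the description.
    # The real file name and value formats are not stated, so treat these as placeholders.
    SAMPLE = """Dataset_name,Author_name,Upvotes,Usability,Type_of_file
    Example A,alice,120,9.4,csv
    Example B,bob,870,10.0,csv
    Example C,carol,45,7.1,json
    """.replace("    ", "")


    def top_by_upvotes(csv_text, n=2):
        """Return the n rows with the most upvotes, highest first."""
        rows = list(csv.DictReader(io.StringIO(csv_text)))
        # Assumption: Upvotes parses as an integer.
        return sorted(rows, key=lambda r: int(r["Upvotes"]), reverse=True)[:n]


    for row in top_by_upvotes(SAMPLE):
        print(row["Dataset_name"], row["Upvotes"])
    ```

    With the real file, the same function could be given the file's text instead of SAMPLE, provided the header matches the names listed above.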

    Acknowledgements

    The data has been scraped from the official Kaggle website and is available under the Creative Commons license.

    Enjoy & Keep Learning !!!

  3. Kaggle: Forum Discussions

    • kaggle.com
    zip
    Updated Nov 8, 2025
    Cite
    Nicolás Ariel González Muñoz (2025). Kaggle: Forum Discussions [Dataset]. https://www.kaggle.com/datasets/nicolasgonzalezmunoz/kaggle-forum-discussions
    Available download formats: zip (542099 bytes)
    Dataset updated
    Nov 8, 2025
    Authors
    Nicolás Ariel González Muñoz
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Note: This is a work in progress, and not all the Kaggle forums are included in this dataset. The remaining forums will be added once I finish solving some issues with the data generators related to these forums.

    Summary

    Welcome to the Kaggle Forum Discussions dataset! This dataset contains curated data about recent discussions opened in the different forums on Kaggle. The data is obtained through web scraping with the selenium library, and the text data is converted into markdown style using the markdownify package.

    This dataset contains information about the discussion main topic, topic title, comments, votes, medals and more, and is designed to serve as a complement to the data available on the Kaggle meta dataset, specifically for recent discussions. Keep reading to see the details.

    Extraction Technique

    Because Kaggle is a dynamic website that relies heavily on JavaScript (JS), I extracted the data in this dataset through web scraping with the selenium library.

    The functions and classes used to scrape the data on Kaggle were stored in a utility script publicly available here. As JS-generated pages like Kaggle are unstable when trying to scrape them, the mentioned script implements capabilities for retrying connections and waiting for elements to appear.

    Each forum was scraped using one notebook per forum, and those notebooks were then connected to a central notebook that generates this dataset. The discussions are also scraped in parallel to improve speed. This dataset represents all the data that can be gathered in a single notebook session, from the most recent to the oldest.

    If you need more control over the data you want to research, feel free to import all you need from the utility script mentioned before.

    Structure

    This dataset contains several folders, each named as the discussion forum they contain data about. For example, the 'competition-hosting' folder contains data about the Competition Hosting forum. Inside each folder, you'll find two files: one is a csv file and the other a json file.

    The json file (in Python, represented as a dictionary) is indexed with the ID that Kaggle assigns to each discussion. Each ID is paired with its corresponding discussion, which is represented as a nested dictionary (the discussion dict) containing the following fields:

    • title: The title of the main topic.
    • content: Content of the main topic.
    • tags: List containing the discussion's tags.
    • datetime: Date and time at which the discussion was published (in ISO 8601 format).
    • votes: Number of votes received by the discussion.
    • medal: Medal awarded to the main topic (if any).
    • user: User that published the main topic.
    • expertise: Publisher's expertise, measured by the Kaggle progression system.
    • n_comments: Total number of comments in the current discussion.
    • n_appreciation_comments: Total number of appreciation comments in the current discussion.
    • comments: Dictionary containing data about the comments in the discussion. Each comment is indexed by an ID assigned by Kaggle and contains the following fields:
        • content: Comment's content.
        • is_appreciation: Whether the comment is an appreciation comment.
        • is_deleted: Whether the comment was deleted.
        • n_replies: Number of replies to the comment.
        • datetime: Date and time at which the comment was published (in ISO 8601 format).
        • votes: Number of votes received by the current comment.
        • medal: Medal awarded to the comment (if any).
        • user: User that published the comment.
        • expertise: Publisher's expertise, measured by the Kaggle progression system.
        • n_deleted: Total number of deleted replies (including self).
        • replies: A dict following this same format.

    The csv file, on the other hand, serves as a summary of the json file; its information about comments is limited to the hottest and most voted comments.

    Note: Only the 'content' field is mandatory for each discussion. The availability of the other fields is subject to the stability of the scraping tasks, which may also affect the update frequency.
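
    Because replies nest inside comments using the same format, a recursive walk is one natural way to flatten a discussion. The sketch below is only an illustration: the record is hand-written to match the structure described above, the IDs and text are invented, and `.get` is used throughout on the assumption that any field other than 'content' may be missing.

    ```python
    import json

    # A tiny invented record shaped like the structure described above.
    SAMPLE_JSON = """
    {
      "1001": {
        "title": "Example topic",
        "content": "Main post body",
        "votes": 3,
        "comments": {
          "2001": {
            "content": "First comment",
            "votes": 2,
            "replies": {
              "2002": {"content": "A reply", "votes": 1, "replies": {}}
            }
          },
          "2003": {"content": "Second comment", "votes": 0}
        }
      }
    }
    """


    def iter_comments(comments, depth=0):
        """Yield (comment_id, depth, comment) for every comment and nested reply."""
        for comment_id, comment in (comments or {}).items():
            yield comment_id, depth, comment
            # 'replies' may be absent, since only 'content' is guaranteed.
            yield from iter_comments(comment.get("replies"), depth + 1)


    discussions = json.loads(SAMPLE_JSON)
    for discussion_id, discussion in discussions.items():
        flat = list(iter_comments(discussion.get("comments")))
        print(discussion_id, discussion.get("title"), len(flat))
    ```

    The depth value makes it possible to tell top-level comments (depth 0) from replies without keeping the nested structure.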

  4. Student Performance Data Set

    • kaggle.com
    zip
    Updated Mar 27, 2020
    + more versions
    Cite
    Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
    Available download formats: zip (12353 bytes)
    Dataset updated
    Mar 27, 2020
    Authors
    Data-Science Sean
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    If this Data Set is useful, an upvote is appreciated. This data approaches student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and the data was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
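
    The note about G1 and G2 suggests keeping those two columns out of the feature set when predicting G3. A minimal sketch of that split follows; the rows and the 'studytime' and 'absences' column names are placeholders chosen for illustration, and only G1, G2 and G3 are taken from the description.

    ```python
    # Invented rows; only the G1, G2 and G3 names come from the description.
    ROWS = [
        {"studytime": 2, "absences": 4, "G1": 10, "G2": 11, "G3": 11},
        {"studytime": 1, "absences": 10, "G1": 7, "G2": 6, "G3": 6},
        {"studytime": 3, "absences": 0, "G1": 15, "G2": 16, "G3": 16},
    ]

    TARGET = "G3"
    PERIOD_GRADES = {"G1", "G2"}  # strongly correlated with the target


    def split_features(row, drop_period_grades=True):
        """Return (features, target), optionally excluding G1 and G2."""
        excluded = {TARGET} | (PERIOD_GRADES if drop_period_grades else set())
        features = {name: value for name, value in row.items() if name not in excluded}
        return features, row[TARGET]


    for row in ROWS:
        features, target = split_features(row)
        print(sorted(features), "->", target)
    ```

    Passing `drop_period_grades=False` keeps G1 and G2, which corresponds to the easier but less useful setting the description mentions.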

  5. Kaggle dataset

    • kaggle.com
    zip
    Updated Jun 17, 2024
    Cite
    Laxmi Ghimire (2024). Kaggle dataset [Dataset]. https://www.kaggle.com/datasets/rejinaghimire/kaggle-dataset
    Available download formats: zip (425673 bytes)
    Dataset updated
    Jun 17, 2024
    Authors
    Laxmi Ghimire
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Laxmi Ghimire

    Released under Apache 2.0

    Contents

  6. Materials and their Mechanical Properties

    • kaggle.com
    zip
    Updated Apr 15, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Purushottam Nawale (2023). Materials and their Mechanical Properties [Dataset]. https://www.kaggle.com/datasets/purushottamnawale/materials
    Available download formats: zip (145487 bytes)
    Dataset updated
    Apr 15, 2023
    Authors
    Purushottam Nawale
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    We utilized a dataset of Machine Design materials, which includes information on their mechanical properties. The dataset was obtained from the Autodesk Material Library and comprises 15 columns, also referred to as features/attributes. This dataset is a real-world dataset, and it does not contain any random values. However, due to missing values, we only utilized seven of these columns for our ML model. You can access the related GitHub Repository here: https://github.com/purushottamnawale/material-selection-using-machine-learning

    To develop an ML model, we employed several Python libraries, including NumPy, pandas, scikit-learn, and graphviz, in addition to other technologies such as Weka, MS Excel, VS Code, Kaggle, Jupyter Notebook, and GitHub. We employed Weka software to swiftly visualize the data and comprehend the relationships between the features, without requiring any programming expertise.

    My problem statement is Material Selection for EV Chassis. If you have any specific ideas, be sure to implement them and add the code on Kaggle.

    A detailed research paper is available at https://iopscience.iop.org/article/10.1088/1742-6596/2601/1/012014

  7. Social Media and Mental Health

    • kaggle.com
    zip
    Updated Jul 18, 2023
    Cite
    SouvikAhmed071 (2023). Social Media and Mental Health [Dataset]. https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health
    Available download formats: zip (10944 bytes)
    Dataset updated
    Jul 18, 2023
    Authors
    SouvikAhmed071
    License

    Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.

    The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.

    This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.

    The following is the Google Colab link to the project, done on Jupyter Notebook -

    https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN

    The following is the GitHub Repository of the project -

    https://github.com/daerkns/social-media-and-mental-health

    Libraries used for the Project -

    pandas
    NumPy
    Matplotlib
    Seaborn
    scikit-learn
    
  8. Wikipedia Structured Contents

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Cite
    Wikimedia (2025). Wikipedia Structured Contents [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents
    Available download formats: zip (25121685657 bytes)
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Wikimedia Foundation http://www.wikimedia.org/
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Summary

    Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback.

    This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.).

    Invitation for Feedback

    The dataset is built as part of the Structured Contents initiative and based on the Wikimedia Enterprise html snapshots. It is an early beta release to improve transparency in the development process and request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates follow the project's blog and our MediaWiki Quarterly software updates on MediaWiki. As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise's homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.

    The contents of this dataset of Wikipedia articles is collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

    The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking. We would love to hear more about your use cases.

    Data Fields

    The data fields are the same among all; noteworthy included fields:

    • name - title of the article.
    • identifier - ID of the article.
    • url - URL of the article.
    • version - metadata related to the latest specific revision of the article.
    • version.editor - editor-specific signals that can help contextualize the revision.
    • version.scores - returns assessments by ML models on the likelihood of a revision being reverted.
    • main entity - Wikidata QID the article is related to.
    • abstract - lead section, summarizing what the article is about.
    • description - one-sentence description of the article for quick reference.
    • image - main image representing the article's subject.
    • infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    • sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

    Full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/
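
    Since each JSON line holds one article, the files can be read one line at a time rather than loaded whole. The sketch below shows that pattern with Python's json module; the sample record is invented, and the exact key spellings in the real files should be checked against the data dictionary linked above.

    ```python
    import json

    # One invented JSON line using a few of the field names listed above.
    SAMPLE_LINE = json.dumps({
        "name": "Example article",
        "identifier": 12345,
        "url": "https://en.wikipedia.org/wiki/Example_article",
        "abstract": "An example lead section.",
        "description": "placeholder article",
    })


    def summarize(line):
        """Parse one JSON line and keep a few fields; missing ones become None."""
        article = json.loads(line)
        keys = ("name", "identifier", "url", "description", "abstract")
        return {key: article.get(key) for key in keys}


    def read_articles(lines):
        """Lazily summarize an iterable of JSON lines, skipping blank lines."""
        for line in lines:
            if line.strip():
                yield summarize(line)


    for article in read_articles([SAMPLE_LINE, ""]):
        print(article["name"], "-", article["description"])
    ```

    An open file object can be passed to `read_articles` directly, because iterating over a text file yields its lines.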

    Curation Rationale

    This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise with the aim of making Wikimedia data more machine readable. These efforts are both focused on pre-parsing Wikipedia snippets as well as connecting the different projects closer together. Even if Wikipedia is very structured to the human eye, it is a non-triv...

  9. Fake News Classification

    • kaggle.com
    zip
    Updated Oct 8, 2023
    Cite
    Saurabh Shahane (2023). Fake News Classification [Dataset]. https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification
    Available download formats: zip (96615040 bytes)
    Dataset updated
    Oct 8, 2023
    Authors
    Saurabh Shahane
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WELFake is a dataset of 72,134 news articles, with 35,028 real and 37,106 fake news items. To build it, the authors merged four popular news datasets (i.e. Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.

    Dataset contains four columns: Serial number (starting from 0); Title (about the text news heading); Text (about the news content); and Label (0 = fake and 1 = real).

    The csv file contains 78,098 data entries, of which only 72,134 are accessed as per the data frame.
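
    To make the label encoding concrete, the sketch below counts fake and real rows in a small invented sample. The header names ('serial', 'title', 'text', 'label') are assumptions, since the description names the columns but not their exact spelling in the file; the 0 = fake, 1 = real mapping is taken from the description.

    ```python
    import csv
    import io
    from collections import Counter

    LABEL_NAMES = {0: "fake", 1: "real"}  # encoding stated in the description

    # Invented rows; header spellings are assumed, not confirmed.
    SAMPLE = (
        "serial,title,text,label\n"
        "0,Headline one,Body one,1\n"
        "1,Headline two,Body two,0\n"
        "2,Headline three,Body three,0\n"
    )


    def label_counts(csv_text):
        """Count rows per class name using the 0 = fake, 1 = real encoding."""
        reader = csv.DictReader(io.StringIO(csv_text))
        return Counter(LABEL_NAMES[int(row["label"])] for row in reader)


    print(label_counts(SAMPLE))
    ```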

    Published in: IEEE Transactions on Computational Social Systems: pp. 1-13 (doi: 10.1109/TCSS.2021.3068519).

  10. Kaggle Dataset Metadata Repository

    • kaggle.com
    zip
    Updated Nov 16, 2024
    Cite
    Ijaj Ahmed (2024). Kaggle Dataset Metadata Repository [Dataset]. https://www.kaggle.com/datasets/ijajdatanerd/kaggle-dataset-metadata-repository
    Available download formats: zip (5122110 bytes)
    Dataset updated
    Nov 16, 2024
    Authors
    Ijaj Ahmed
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description


    Kaggle Dataset Metadata Collection

    This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle.

    Dataset Overview:

    • Purpose: To provide detailed insights into Kaggle dataset metadata.
    • Content: Information related to the dataset's owner, creator, usage metrics, licensing, and more.
    • Target Audience: Data scientists, Kaggle competitors, and dataset curators.

    Columns Description

    • datasetUrl: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.
    • ownerAvatarUrl: The URL of the dataset owner's profile avatar on Kaggle.
    • ownerName: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.
    • ownerUrl: A link to the Kaggle profile page of the dataset owner.
    • ownerUserId: The unique user ID of the dataset owner on Kaggle.
    • ownerTier: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.
    • creatorName: The name of the dataset creator, which could be different from the owner.
    • creatorUrl: A link to the Kaggle profile page of the dataset creator.
    • creatorUserId: The unique user ID of the dataset creator.
    • scriptCount: The number of scripts (kernels) associated with this dataset.
    • scriptsUrl: A link to the scripts (kernels) page for the dataset, where you can explore related code.
    • forumUrl: The URL to the discussion forum for this dataset, where users can ask questions and share insights.
    • viewCount: The number of views the dataset page has received on Kaggle.
    • downloadCount: The number of times the dataset has been downloaded by users.
    • dateCreated: The date when the dataset was first created and uploaded to Kaggle.
    • dateUpdated: The date when the dataset was last updated or modified.
    • voteButton: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.
    • categories: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").
    • licenseName: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").
    • licenseShortName: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).
    • datasetSize: The size of the dataset in terms of storage, typically measured in MB or GB.
    • commonFileTypes: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).
    • downloadUrl: A direct link to download the dataset files.
    • newKernelNotebookUrl: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.
    • newKernelScriptUrl: A link to a new script for running computations or processing data related to the dataset.
    • usabilityRating: A rating or score representing how usable the dataset is, based on user feedback.
    • firestorePath: A reference to the path in Firestore where this dataset's metadata is stored.
    • datasetSlug: A URL-friendly version of the dataset name, typically used for URLs.
    • rank: The dataset's rank based on certain metrics (e.g., downloads, votes, views).
    • datasource: The source or origin of the dataset (e.g., government data, private organizations).
    • medalUrl: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.
    • hasHashLink: Indicates whether the dataset has a hash link for verifying data integrity.
    • ownerOrganizationId: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.
    • totalVotes: The total number of votes the dataset has received from users, reflecting its popularity or quality.
    • category_names: A comma-separated string of category names that represent the dataset's classification.

    This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way.

  11. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Nov 27, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Available download formats: zip (167219625372 bytes)
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Kaggle http://kaggle.com/
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we're releasing this dataset

    By collecting all of this code created by Kaggle's community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code's author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
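
    The two-level layout described above can be expressed as integer arithmetic on the version id. The sketch below is one reading of that description, not Kaggle's own code; whether folder names are zero-padded (for example "000" rather than "0") is not stated, so the unpadded form used here is an assumption.

    ```python
    def version_folder(kernel_version_id):
        """Map a KernelVersions id to its 'top/sub' folder, e.g. 123456789 -> '123/456'."""
        top = kernel_version_id // 1_000_000      # millions bucket
        sub = (kernel_version_id // 1_000) % 1_000  # thousands bucket within it
        # Assumption: folder names are plain integers without zero padding.
        return f"{top}/{sub}"


    for version_id in (123_456_000, 123_456_789, 123_456_999):
        print(version_id, "->", version_folder(version_id))
    ```

    All three ids in the loop fall in the same sub folder, matching the example range 123,456,000 to 123,456,999 given in the text.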

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  12. Video Game Sales

    • kaggle.com
    zip
    Updated Oct 26, 2016
    + more versions
    Cite
    GregorySmith (2016). Video Game Sales [Dataset]. https://www.kaggle.com/datasets/gregorut/videogamesales
    Available download formats: zip (390286 bytes)
    Dataset updated
    Oct 26, 2016
    Authors
    GregorySmith
    Description

    This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com.

    Fields include

    • Rank - Ranking of overall sales

    • Name - The game's name

    • Platform - Platform of the game's release (e.g. PC, PS4, etc.)

    • Year - Year of the game's release

    • Genre - Genre of the game

    • Publisher - Publisher of the game

    • NA_Sales - Sales in North America (in millions)

    • EU_Sales - Sales in Europe (in millions)

    • JP_Sales - Sales in Japan (in millions)

    • Other_Sales - Sales in the rest of the world (in millions)

    • Global_Sales - Total worldwide sales.

    The script to scrape the data is available at https://github.com/GregorUT/vgchartzScrape. It is based on BeautifulSoup using Python. There are 16,598 records. 2 records were dropped due to incomplete information.

  13. Metadata of Kaggle dataset _Include MedalVoteCount

    • kaggle.com
    zip
    Updated Dec 20, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    kukuroo3 (2021). Metadata of Kaggle dataset _Include MedalVoteCount [Dataset]. https://www.kaggle.com/datasets/kukuroo3/dataset-of-kaggle-dataset-include-medalvotecount
    Available download formats: zip (11216728 bytes)
    Dataset updated
    Dec 20, 2021
    Authors
    kukuroo3
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context


    Under the Kaggle Dataset medal rule, a dataset earns a bronze medal when it receives 5 or more upvotes from users with a rank of novice or higher, a silver medal at 20 or more upvotes, and a gold medal at 50 or more. Recently I uploaded a lot of datasets to Kaggle. However, although I have won many bronze medals, I have never won more than a silver medal. So, I created this dataset to check the characteristics of the datasets that receive the silver medal. The metadata of every Kaggle dataset that received at least one upvote was recorded together with its MedalVoteCount.
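
    The medal thresholds stated above can be written as a small lookup. This is a sketch of the rule as the description gives it (5, 20 and 50 qualifying upvotes), not an official Kaggle implementation; the function and constant names are made up for illustration.

    ```python
    # Thresholds as stated in the description, checked from highest to lowest.
    MEDAL_THRESHOLDS = (("gold", 50), ("silver", 20), ("bronze", 5))


    def dataset_medal(medal_vote_count):
        """Return the medal name for a qualifying-upvote count, or None."""
        for medal, minimum in MEDAL_THRESHOLDS:
            if medal_vote_count >= minimum:
                return medal
        return None


    for votes in (4, 5, 20, 50):
        print(votes, dataset_medal(votes))
    ```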

    This dataset can be used to create strategies for receiving silver and gold medals.

    Content

    Metadata for 42,955 datasets, from 2015-12 to 2021-11

    • DataSetMedals : medal color
    • ct : creation time
    • dataUrl : data URL (the path following https://www.kaggle.com/)
    • totalviews
    • votecount : total vote count
    • medalvotecount : upvotes counted from users ranked above Novice
    • totaldownloads : download count
    • totalkernel : kernel count
    • title
    • description
    • key : dataset tags
    • license

    Source

    https://www.kaggle.com/kaggle/meta-kaggle, with the "MedalVoteCount" values obtained by scraping

  14. Job Dataset

    • kaggle.com
    zip
    Updated Sep 17, 2023
    Cite
    Ravender Singh Rana (2023). Job Dataset [Dataset]. https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset
    Explore at:
    Available download formats: zip (479575920 bytes)
    Dataset updated
    Sep 17, 2023
    Authors
    Ravender Singh Rana
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Job Dataset

    This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.

    Descriptions for each of the columns in the dataset:

    1. Job Id: A unique identifier for each job posting.
    2. Experience: The required or preferred years of experience for the job.
    3. Qualifications: The educational qualifications needed for the job.
    4. Salary Range: The range of salaries or compensation offered for the position.
    5. Location: The city or area where the job is located.
    6. Country: The country where the job is located.
    7. Latitude: The latitude coordinate of the job location.
    8. Longitude: The longitude coordinate of the job location.
    9. Work Type: The type of employment (e.g., full-time, part-time, contract).
    10. Company Size: The approximate size or scale of the hiring company.
    11. Job Posting Date: The date when the job posting was made public.
    12. Preference: Special preferences or requirements for applicants (e.g., Only Male or Only Female, or Both)
    13. Contact Person: The name of the contact person or recruiter for the job.
    14. Contact: Contact information for job inquiries.
    15. Job Title: The job title or position being advertised.
    16. Role: The role or category of the job (e.g., software developer, marketing manager).
    17. Job Portal: The platform or website where the job was posted.
    18. Job Description: A detailed description of the job responsibilities and requirements.
    19. Benefits: Information about benefits offered with the job (e.g., health insurance, retirement plans).
    20. Skills: The skills or qualifications required for the job.
    21. Responsibilities: Specific responsibilities and duties associated with the job.
    22. Company Name: The name of the hiring company.
    23. Company Profile: A brief overview of the company's background and mission.

    Potential Use Cases:

    • Building predictive models to forecast job market trends.
    • Enhancing job recommendation systems for job seekers.
    • Developing NLP models for resume parsing and job matching.
    • Analyzing regional job market disparities and opportunities.
    • Exploring salary prediction models for various job roles.

    Acknowledgements:

    We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.

    Note:

    Please note that the examples provided are fictional and for illustrative purposes. You can tailor the descriptions and examples to match the specifics of your dataset. It is not suitable for real-world applications and should only be used within the scope of research and experimentation. You can also reach me via email at: rrana157@gmail.com

  15. COVID-19 Dataset

    • kaggle.com
    zip
    Updated Nov 13, 2022
    Cite
    Meir Nizri (2022). COVID-19 Dataset [Dataset]. https://www.kaggle.com/datasets/meirnizri/covid19-dataset
    Explore at:
    Available download formats: zip (4890659 bytes)
    Dataset updated
    Nov 13, 2022
    Authors
    Meir Nizri
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus. Most people infected with COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment. Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness. During the entire course of the pandemic, one of the main problems that healthcare providers have faced is the shortage of medical resources and a proper plan to efficiently distribute them. In these tough times, being able to predict what kind of resource an individual might require at the time of being tested positive or even before that will be of immense help to the authorities as they would be able to procure and arrange for the resources necessary to save the life of that patient.

    The main goal of this project is to build a machine learning model that, given a Covid-19 patient's current symptom, status, and medical history, will predict whether the patient is in high risk or not.

    Content

    The dataset was provided by the Mexican government (link). It contains an enormous amount of anonymized patient-related information, including pre-conditions. The raw dataset consists of 21 unique features and 1,048,576 unique patients. In the Boolean features, 1 means "yes" and 2 means "no". Values of 97 and 99 indicate missing data.

    • sex: 1 for female and 2 for male.
    • age: of the patient.
    • classification: covid test findings. Values 1-3 mean that the patient was diagnosed with covid in different degrees. 4 or higher means that the patient is not a carrier of covid or that the test is inconclusive.
    • patient type: type of care the patient received in the unit. 1 for returned home and 2 for hospitalization.
    • pneumonia: whether the patient already has air sac inflammation or not.
    • pregnancy: whether the patient is pregnant or not.
    • diabetes: whether the patient has diabetes or not.
    • copd: Indicates whether the patient has Chronic obstructive pulmonary disease or not.
    • asthma: whether the patient has asthma or not.
    • inmsupr: whether the patient is immunosuppressed or not.
    • hypertension: whether the patient has hypertension or not.
    • cardiovascular: whether the patient has a heart or blood vessel related disease or not.
    • renal chronic: whether the patient has chronic renal disease or not.
    • other disease: whether the patient has other disease or not.
    • obesity: whether the patient is obese or not.
    • tobacco: whether the patient is a tobacco user.
    • usmr: Indicates whether the patient was treated in medical units of the first, second or third level.
    • medical unit: type of institution of the National Health System that provided the care.
    • intubed: whether the patient was connected to the ventilator.
    • icu: Indicates whether the patient had been admitted to an Intensive Care Unit.
    • date died: the date of death if the patient died, and 9999-99-99 otherwise.
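
    The coding conventions described above (1 for "yes", 2 for "no", 97 and 99 for missing data, and 9999-99-99 for patients who did not die) can be sketched as two small decoders. The function names are hypothetical, and raising an error on any other code is an assumption of this sketch, since the description does not say whether other values occur.

```python
from typing import Optional

# Codes the description lists as missing data.
MISSING_CODES = {97, 99}

# Placeholder the description gives for patients who did not die.
NOT_DIED_PLACEHOLDER = "9999-99-99"


def decode_boolean(code: int) -> Optional[bool]:
    """Decode a Boolean feature: 1 -> True, 2 -> False, 97/99 -> None."""
    if code in MISSING_CODES:
        return None
    if code == 1:
        return True
    if code == 2:
        return False
    # Assumption: any other code is unexpected and should be surfaced.
    raise ValueError(f"unexpected Boolean code: {code}")


def patient_died(date_died: str) -> bool:
    """Return True unless the field holds the 'did not die' placeholder."""
    return date_died.strip() != NOT_DIED_PLACEHOLDER
```

    Keeping missing values as None, rather than folding them into "no", leaves the choice of imputation strategy to the later modelling step.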

  16. Diversity, Equity and Inclusion Measures Dataset

    • kaggle.com
    zip
    Updated Nov 2, 2022
    Cite
    Kerem Kurt (2022). Diversity, Equity and Inclusion Measures Dataset [Dataset]. https://www.kaggle.com/datasets/keremkurt/diversity-equity-and-inclusion-measures-dataset
    Explore at:
    Available download formats: zip (2756966 bytes)
    Dataset updated
    Nov 2, 2022
    Authors
    Kerem Kurt
    Description

    General Info

    This data set is generated to simulate an employee data set of a company including sensitive information such as gender, sexual orientation, ethnicity, LGBTQ, and much more. The goal of this data set is to improve Diversity, Equity, and Inclusion in the workplace.

    Survey Questions and Scores

    The main idea of the survey is to track whether the company's efforts to improve DEI actually work and to discover whether any group (coming from different backgrounds, such as different gender, ethnicity, or sexual orientation) falls behind. The survey can be repeated periodically to measure the impact of the company's efforts.

    There are 5 survey questions for each of the DEI categories. Survey scores of employees are also shown in the data set.

  17. E-Commerce Data

    • kaggle.com
    zip
    Updated Aug 17, 2017
    Cite
    Carrie (2017). E-Commerce Data [Dataset]. https://www.kaggle.com/datasets/carrie1/ecommerce-data
    Explore at:
    Available download formats: zip (7548686 bytes)
    Dataset updated
    Aug 17, 2017
    Authors
    Carrie
    Description

    Context

    Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset, which contains actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".

    Content

    "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

    Acknowledgements

    Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

    Image from stocksnap.io.

    Inspiration

    Analyses for this dataset could include time series, clustering, classification and more.

  18. Data from: Global Superstore Dataset

    • kaggle.com
    zip
    Updated Nov 16, 2023
    Cite
    Fatih İlhan (2023). Global Superstore Dataset [Dataset]. https://www.kaggle.com/datasets/fatihilhan/global-superstore-dataset
    Explore at:
    Available download formats: zip (3349507 bytes)
    Dataset updated
    Nov 16, 2023
    Authors
    Fatih İlhan
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    About this file

    The Kaggle Global Superstore dataset is a comprehensive dataset containing information about sales and orders in a global superstore. It is a valuable resource for data analysis and visualization tasks. This dataset has been processed and transformed from its original format (txt) to CSV using the R programming language. The original dataset is available here, and the transformed CSV file used in this analysis can be found here.

    Here is a description of the columns in the dataset:

    category: The category of products sold in the superstore.

    city: The city where the order was placed.

    country: The country in which the superstore is located.

    customer_id: A unique identifier for each customer.

    customer_name: The name of the customer who placed the order.

    discount: The discount applied to the order.

    market: The market or region where the superstore operates.

    ji_lu_shu: An unknown or unspecified column.

    order_date: The date when the order was placed.

    order_id: A unique identifier for each order.

    order_priority: The priority level of the order.

    product_id: A unique identifier for each product.

    product_name: The name of the product.

    profit: The profit generated from the order.

    quantity: The quantity of products ordered.

    region: The region where the order was placed.

    row_id: A unique identifier for each row in the dataset.

    sales: The total sales amount for the order.

    segment: The customer segment (e.g., consumer, corporate, or home office).

    ship_date: The date when the order was shipped.

    ship_mode: The shipping mode used for the order.

    shipping_cost: The cost of shipping for the order.

    state: The state or region within the country.

    sub_category: The sub-category of products within the main category.

    year: The year in which the order was placed.

    market2: Another column related to market information.

    weeknum: The week number when the order was placed.

    This dataset can be used for various data analysis tasks, including understanding sales patterns, customer behavior, and profitability in the context of a global superstore.

  19. Alzheimer MRI 4 classes dataset

    • kaggle.com
    zip
    Updated Jan 4, 2022
    Cite
    Marco Pinamonti (2022). Alzheimer MRI 4 classes dataset [Dataset]. https://www.kaggle.com/datasets/marcopinamonti/alzheimer-mri-4-classes-dataset
    Explore at:
    Available download formats: zip (35808295 bytes)
    Dataset updated
    Jan 4, 2022
    Authors
    Marco Pinamonti
    Description

    Context

    This dataset is a copy of the images in the dataset at the link: Alzheimer's Dataset (4 class of Images).

    Content

    The original dataset contained MRI images of 32 horizontal slices of the brain divided into 4 classes:

    • Mild Dementia
    • Moderate Dementia
    • Non Dementia
    • Very Mild Dementia

    Each class had a different number of subjects:

    • 28 subjects for the Mild Dementia class
    • 2 subjects for the Moderate Dementia class
    • 100 subjects for the Non Dementia class
    • 70 subjects for the Very Mild Dementia class

    The problem with the original dataset was that the train and test sets contained different slices of the brain: the images were ordered by slice position, and the train/test split was made by putting the first portion of the images in the train set and the remainder in the test set.

    In this dataset the original train and test sets have been merged, and the images have been divided randomly among train, test and validation sets.
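
    The difference between the two splitting strategies can be sketched with slice positions standing in for images. This is a hypothetical illustration, not the author's code: the function names, the 75% train fraction, and the fixed seed are all assumptions, and only a two-way split is shown for brevity.

```python
import random


def sequential_split(items, train_fraction):
    """Split an ordered list by position: first part train, rest test."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


def random_split(items, train_fraction, seed=0):
    """Shuffle a copy of the list, then split it by position."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in data: 32 slice positions, ordered as in the original dataset.
slice_positions = list(range(32))

seq_train, seq_test = sequential_split(slice_positions, 0.75)
rnd_train, rnd_test = random_split(slice_positions, 0.75)

# With the sequential split, every test slice lies after every train slice,
# so the model is evaluated on slice positions it never saw in training.
print(max(seq_train) < min(seq_test))
print(sorted(rnd_test))
```

    After shuffling, the test set is drawn from across the whole range of slice positions rather than only from the end of the ordering.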

  20. UCI-dataset

    • kaggle.com
    zip
    Updated Aug 17, 2022
    Cite
    Md Waquar Azam (2022). UCI-dataset [Dataset]. https://www.kaggle.com/datasets/mdwaquarazam/ucidatasetlist
    Explore at:
    Available download formats: zip (20774 bytes)
    Dataset updated
    Aug 17, 2022
    Authors
    Md Waquar Azam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a list of the datasets provided by UCI ML. If you are a learner and want data selected by year, category, profession or some other criterion, you can search for it here.

    There are 8 columns in the dataset, which give all the details:

    • link
    • Data-Name
    • data type
    • default task
    • attribute-type
    • instances
    • attributes
    • year

    Some missing values are also present.

    You can analyse the data as per your requirements.

    EDA

Cite
Trrishan (2022). Top 1000 Kaggle Datasets [Dataset]. https://www.kaggle.com/datasets/notkrishna/top-1000-kaggle-datasets
Top 1000 Kaggle Datasets

Kaggle's most popular datasets

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats: zip (34269 bytes)
Dataset updated
Jan 3, 2022
Authors
Trrishan
License

https://creativecommons.org/publicdomain/zero/1.0/
