82 datasets found
  1. kaggle api key 2

    • kaggle.com
    zip
    Updated Jun 27, 2024
    Cite
    valerie lucro (2024). kaggle api key 2 [Dataset]. https://www.kaggle.com/datasets/valerielucro/kaggle-api-key-2/code
    Explore at:
    zip (238 bytes)
    Dataset updated
    Jun 27, 2024
    Authors
    valerie lucro
    Description

    Dataset

    This dataset was created by valerie lucro

    Contents

  2. Most popular Kaggle datasets

    • kaggle.com
    zip
    Updated Jan 2, 2025
    Cite
    Dany Ocean (2025). Most popular Kaggle datasets [Dataset]. https://www.kaggle.com/datasets/danyocean/most-popular-kaggle-datasets
    Explore at:
    zip (920936 bytes)
    Dataset updated
    Jan 2, 2025
    Authors
    Dany Ocean
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This project focuses on exploring and analyzing the most popular datasets available on Kaggle. By delving into these datasets, we aim to identify key trends, understand user preferences, and highlight the topics that drive engagement within the data science and machine learning communities.

    The attached notebook also includes some interesting charts and analytics.

  3. api key

    • kaggle.com
    zip
    Updated Oct 10, 2024
    Cite
    CIturrieta (2024). api key [Dataset]. https://www.kaggle.com/datasets/citurrieta/api-key
    Explore at:
    zip (213 bytes)
    Dataset updated
    Oct 10, 2024
    Authors
    CIturrieta
    Description

    Dataset

    This dataset was created by CIturrieta

    Contents

  4. Free API List

    • kaggle.com
    zip
    Updated Oct 21, 2023
    Cite
    nancy.jain042 (2023). Free API List [Dataset]. https://www.kaggle.com/datasets/nancyjain042/free-api-list
    Explore at:
    zip (57783 bytes)
    Dataset updated
    Oct 21, 2023
    Authors
    nancy.jain042
    Description

    This dataset provides a curated collection of freely available APIs covering a wide range of categories. Whether you are a developer, data enthusiast, or just someone interested in exploring various API services, this dataset offers a valuable resource to help you discover, access, and understand these APIs.

    Each entry in the dataset includes essential information about the APIs, such as the API name, a brief description of its functionality, authentication requirements, HTTPS support, and a link to the API's documentation or endpoint. The dataset is categorized to facilitate easy exploration and access to APIs across different domains.

    Example entries:

    "AdoptAPet": A resource to help get pets adopted, requiring an API key for access. "Axolotl": A collection of axolotl pictures and facts, with HTTPS support and no authentication required. "Cat Facts": Providing daily cat facts, with HTTPS support and no authentication needed.

    Columns:

    • API: The name or title of the API.

    • Description: A brief description of the API's functionality and what it offers.

    • Auth (Authentication): Whether the API requires authentication for access. If it specifies "apiKey" or another form of authentication, users must provide valid credentials or keys to use the API.

    • HTTPS: Whether the API supports secure communication over HTTPS.

    • Link: A URL or link to the API's documentation.

    • Category: The domain or topic the API belongs to.

  5. api_key

    • kaggle.com
    zip
    Updated Dec 7, 2024
    + more versions
    Cite
    Yamen Mohamed (2024). api_key [Dataset]. https://www.kaggle.com/datasets/yamenmohamed/api-key/code
    Explore at:
    zip (228 bytes)
    Dataset updated
    Dec 7, 2024
    Authors
    Yamen Mohamed
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Yamen Mohamed

    Released under Apache 2.0

    Contents

  6. key-api

    • kaggle.com
    zip
    Updated Nov 3, 2024
    Cite
    Shubhangi C Salunkhe (2024). key-api [Dataset]. https://www.kaggle.com/datasets/shubhangicsalunkhe/key-api/discussion
    Explore at:
    zip (233 bytes)
    Dataset updated
    Nov 3, 2024
    Authors
    Shubhangi C Salunkhe
    Description

    Dataset

    This dataset was created by Shubhangi C Salunkhe

    Contents

  7. api_key

    • kaggle.com
    zip
    Updated Dec 4, 2023
    Cite
    Shefiyyah Aurellia (2023). api_key [Dataset]. https://www.kaggle.com/datasets/shefiyyahaurellia/api-key
    Explore at:
    zip (232 bytes)
    Dataset updated
    Dec 4, 2023
    Authors
    Shefiyyah Aurellia
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Shefiyyah Aurellia

    Released under MIT

    Contents

  8. API_Websites

    • kaggle.com
    zip
    Updated Feb 5, 2023
    Cite
    chandrashekhar G T (2023). API_Websites [Dataset]. https://www.kaggle.com/datasets/chandrashekhargt/api-wwebsites
    Explore at:
    zip (750 bytes)
    Dataset updated
    Feb 5, 2023
    Authors
    chandrashekhar G T
    Description

    APIs, or Application Programming Interfaces, allow developers to access the functionality of an application or service over the internet. To access an API, a developer would typically make a request using a specific URL or endpoint, along with any necessary authentication or parameters, and receive a response in a standardized format such as JSON. This response can then be used to integrate the API's functionality into another application or service. Many websites and web-based services offer APIs that allow developers to access data and perform actions, such as retrieving information about a user, posting content, or making a purchase. In order to access an API, a developer often needs to obtain an API key or access token, which serves as a unique identifier and enables the API to track usage and enforce rate limits.
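
    As a hedged illustration of that request/response flow (the endpoint, path, and key below are placeholders, not part of this dataset):

    import requests

    API_KEY = "your-api-key"  # placeholder credential obtained from the provider

    # Hypothetical endpoint; a real API documents its own base URL and parameters
    response = requests.get(
        "https://api.example.com/v1/users/123",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    response.raise_for_status()
    print(response.json())  # standardized JSON response, ready to integrate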

  9. ShutterStock Dataset for AI vs Human-Gen. Image

    • kaggle.com
    zip
    Updated Jun 19, 2025
    Cite
    Sachin Singh (2025). ShutterStock Dataset for AI vs Human-Gen. Image [Dataset]. https://www.kaggle.com/datasets/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
    Explore at:
    zip (11617243112 bytes)
    Dataset updated
    Jun 19, 2025
    Authors
    Sachin Singh
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    ShutterStock AI vs. Human-Generated Image Dataset

    This dataset is curated to facilitate research in distinguishing AI-generated images from human-created ones, leveraging ShutterStock data. As AI-generated imagery becomes more sophisticated, developing models that can classify and analyze such images is crucial for applications in content moderation, digital forensics, and media authenticity verification.

    Dataset Overview:

    • Total Images: 100,000
    • Training Data: 80,000 images (majority AI-generated)
    • Test Data: 20,000 images
    • Image Sources: A mix of AI-generated images and real photographs or illustrations created by human artists
    • Labeling: Each image is labeled as either AI-generated or human-created

    Potential Use Cases:

    • AI-Generated Image Detection: Train models to distinguish between AI and human-made images.
    • Deep Learning & Computer Vision Research: Develop and benchmark CNNs, transformers, and other architectures.
    • Generative Model Evaluation: Compare AI-generated images to real images for quality assessment.
    • Digital Forensics: Identify synthetic media for applications in fake image detection.
    • Ethical AI & Content Authenticity: Study the impact of AI-generated visuals in media and ensure transparency.

    Why This Dataset?

    With the rise of generative AI models like Stable Diffusion, DALL·E, and MidJourney, the ability to differentiate between synthetic and real images has become a crucial challenge. This dataset offers a structured way to train AI models on this task, making it a valuable resource for both academic research and practical applications.

    Explore the dataset and contribute to advancing AI-generated content detection!

    Step 1: Install and Authenticate Kaggle API

    If you haven't installed the Kaggle API, run:
      pip install kaggle

    Then, download your kaggle.json API key from your Kaggle account settings and move it to ~/.kaggle/ (Linux/Mac) or C:\Users\<YourUser>\.kaggle\ (Windows).

    Step 2: Download the Dataset

    The kaggle.json file holds a username and key rather than a token, and the Kaggle API accepts them as HTTP basic auth, so a Bearer-token wget call will not work. A sketch using the v1 download endpoint (requires jq):

      curl -L -u "$(jq -r '.username+":"+.key' ~/.kaggle/kaggle.json)" \
        "https://www.kaggle.com/api/v1/datasets/download/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image" \
        -o dataset.zip

    Alternatively, the official CLI handles authentication for you: kaggle datasets download -d shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
    

    Step 3: Extract the Dataset

    Once downloaded, extract the dataset using:
      unzip dataset.zip -d dataset_folder

    Now your dataset is ready to use! 🚀

  10. codeparrot_1M

    • kaggle.com
    zip
    Updated Feb 25, 2024
    Cite
    Tanay Mehta (2024). codeparrot_1M [Dataset]. https://www.kaggle.com/datasets/heyytanay/codeparrot-1m/suggestions?status=pending&yourSuggestions=true
    Explore at:
    zip (2368083124 bytes)
    Dataset updated
    Feb 25, 2024
    Authors
    Tanay Mehta
    Description

    A subset of the codeparrot/github-code dataset consisting of 1 million tokenized Python files in the Lance file format for blazing-fast and memory-efficient I/O.

    The files were tokenized using the EleutherAI/gpt-neox-20b tokenizer with no extra tokens.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    The script used for creating the dataset can be found here.

    Instructions for using this dataset

    This dataset is not meant to be used in Kaggle Kernels: Lance requires write access to the dataset's directory, which Kaggle Kernels' input directory doesn't provide, and the dataset's size prohibits moving it to /kaggle/working. Hence, to use this dataset, you must download it using the Kaggle API or through this page and then move the unzipped files to a folder called codeparrot_1M.lance. Below are detailed snippets on how to download and use this dataset.

    First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/kaggle.json):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/codeparrot-1m
    $ mkdir codeparrot_1M.lance/
    $ unzip -qq codeparrot-1m.zip -d codeparrot_1M.lance/
    $ rm codeparrot-1m.zip
    

    Once this is done, you will find your dataset in the codeparrot_1M.lance/ folder. To load and get a feel for the data, run the snippet below.

    import lance
    dataset = lance.dataset('codeparrot_1M.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of tokens in the dataset.
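
    Because the dataset is larger than most machines' memory, it can also be streamed in record batches instead of materialized at once. A minimal sketch (column names depend on the creation script, so inspect dataset.schema first):

    import lance

    dataset = lance.dataset("codeparrot_1M.lance/")
    print(dataset.schema)  # inspect the column layout before reading

    # Stream the data batch by batch rather than loading all rows at once
    rows_seen = 0
    for batch in dataset.to_batches():
        rows_seen += batch.num_rows
    print(rows_seen)  # matches dataset.count_rows()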

    Considerations for Using the Data The dataset consists of source code from a wide range of repositories. As such they can potentially include harmful or biased code as well as sensitive information like passwords or usernames.

  11. coco2017 Lance (train)

    • kaggle.com
    zip
    Updated Apr 10, 2024
    Cite
    Tanay Mehta (2024). coco2017 Lance (train) [Dataset]. https://www.kaggle.com/datasets/heyytanay/coco2017-train-lance
    Explore at:
    zip (19181147792 bytes)
    Dataset updated
    Apr 10, 2024
    Authors
    Tanay Mehta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the COCO-2017 dataset's training split for object detection and segmentation saved in the Lance file format for blazing fast and memory-efficient I/O.

    This dataset only includes data necessary for object detection and segmentation.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    Instructions for using this dataset

    This dataset is not meant to be used in Kaggle Kernels: Lance requires write access to the dataset's directory, which Kaggle Kernels' input directory doesn't provide, and the dataset's size prohibits moving it to /kaggle/working. Hence, to use this dataset, you must download it using the Kaggle API or through this page, and then move the unzipped files to a folder called coco2017_train.lance. Below are detailed snippets on how to download and use this dataset.

    First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/kaggle.json):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/coco2017-train-lance
    $ mkdir coco2017_train.lance/
    $ unzip -qq coco2017-train-lance.zip -d coco2017_train.lance/
    $ rm coco2017-train-lance.zip
    

    Once this is done, you will find your dataset in the coco2017_train.lance/ folder. To load and get a feel for the data, run the snippet below.

    import lance
    dataset = lance.dataset('coco2017_train.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of rows in the dataset.
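
    To peek at a few rows without reading the full ~19 GB, Lance supports random access by row index. A sketch (column names depend on the conversion script, so check dataset.schema):

    import lance

    dataset = lance.dataset("coco2017_train.lance/")
    sample = dataset.take([0, 1, 2])  # returns a pyarrow.Table with just these rows
    print(sample.schema)
    print(sample.num_rows)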

  12. MeDAL Dataset

    • kaggle.com
    • opendatalab.com
    • +1more
    zip
    Updated Nov 16, 2020
    Cite
    xhlulu (2020). MeDAL Dataset [Dataset]. https://www.kaggle.com/xhlulu/medal-emnlp
    Explore at:
    zip (7324382521 bytes)
    Dataset updated
    Nov 16, 2020
    Authors
    xhlulu
    Description


    Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.

    💻 Code · 🤗 Dataset (Hugging Face) · 💾 Dataset (Kaggle) · 💽 Dataset (Zenodo) · 📜 Paper (ACL) · 📝 Paper (ArXiv) · Pre-trained ELECTRA (Hugging Face)

    Downloading the data

    We recommend downloading from Kaggle if you can authenticate through their API. The advantage of Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.

    First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:

    pip install kaggle

    Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:

    kaggle datasets download xhlulu/medal-emnlp

    Now, unzip everything and place the files inside the data directory:

    unzip -nq crawl-300d-2M-subword.zip -d data
    mv data/pretrain_sample/* data/

    Loading FastText Embeddings

    For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:

    wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
    unzip -nq data/crawl-300d-2M-subword.zip -d data/

    Model Quickstart

    Using Torch Hub

    You can directly load LSTM and LSTM-SA with torch.hub:

    import torch

    lstm = torch.hub.load("BruceWen120/medal", "lstm")
    lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")

    If you want to use the Electra model, you need to first install transformers:

    pip install transformers

    Then, you can load it with torch.hub:

    import torch
    electra = torch.hub.load("BruceWen120/medal", "electra")

    Using Huggingface transformers

    If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load it directly from the Hugging Face Repository:

    from transformers import AutoModel, AutoTokenizer
    
    model = AutoModel.from_pretrained("xhlu/electra-medal")
    tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
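
    A short usage sketch for these pre-trained weights (standard transformers inference; the example sentence is illustrative):

    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained("xhlu/electra-medal")
    tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")

    # Encode a sentence containing a medical abbreviation, get contextual embeddings
    inputs = tokenizer("The patient was given IV fluids overnight.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)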
    

    Citation

    Download the bibtex here, or copy the text below:

    @inproceedings{wen-etal-2020-medal,
        title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
        author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
        booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
        month = nov,
        year = "2020",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
        pages = "130--135",
    }

    License, Terms and Conditions

    The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, pytorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.

    The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:

    INTRODUCTION

    Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.

    MEDLINE/PUBMED SPECIFIC TERMS

    NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.

    GENERAL TERMS AND CONDITIONS

    • Users of the data agree to:

      • acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
      • properly use registration and/or trademark symbols when referring to NLM products, and
      • not indicate or imply that NLM has endorsed its products/services/applications.
    • Users who republish or redistribute the data (services, products or raw data) agree to:

      • maintain the most current version of all distributed data, or
      • make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
    • These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.

    • NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.

    • NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.

  13. openwebtext_1M

    • kaggle.com
    zip
    Updated Mar 18, 2024
    Cite
    Tanay Mehta (2024). openwebtext_1M [Dataset]. https://www.kaggle.com/datasets/heyytanay/openwebtext-1m/code
    Explore at:
    zip (2043993317 bytes)
    Dataset updated
    Mar 18, 2024
    Authors
    Tanay Mehta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A subset of the Skylion007/openwebtext dataset consisting of 1 million tokenized samples in the Lance file format for blazing-fast and memory-efficient I/O.

    The files were tokenized using the gpt2 tokenizer with no extra tokens.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    Instructions for using this dataset

    This dataset is not meant to be used in Kaggle Kernels: Lance requires write access to the dataset's directory, which Kaggle Kernels' input directory doesn't provide, and the dataset's size prohibits moving it to /kaggle/working. Hence, to use this dataset, you must download it using the Kaggle API or through this page and then move the unzipped files to a folder called openwebtext_1M.lance. Below are detailed snippets on how to download and use this dataset.

    First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/kaggle.json):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/openwebtext-1m
    $ mkdir openwebtext_1M.lance/
    $ unzip -qq openwebtext-1m.zip -d openwebtext_1M.lance/
    $ rm openwebtext-1m.zip
    

    Once this is done, you will find your dataset in the openwebtext_1M.lance/ folder. To load and get a feel for the data, run the snippet below.

    import lance
    dataset = lance.dataset('openwebtext_1M.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of tokens in the dataset.
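
    Since the samples were tokenized with the gpt2 tokenizer, they can be decoded back to text with transformers. A sketch under the assumption that each row stores a single token id in a column named "input_ids" (verify both via dataset.schema before relying on this):

    import lance
    from transformers import AutoTokenizer

    dataset = lance.dataset("openwebtext_1M.lance/")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    # "input_ids" is an assumed column name; check dataset.schema for the real one
    tokens = dataset.take(list(range(100))).column("input_ids").to_pylist()
    print(tokenizer.decode(tokens))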

  14. Clean Meta Kaggle

    • kaggle.com
    Updated Sep 8, 2023
    Cite
    Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yoni Kremer
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Cleaned Meta-Kaggle Dataset

    The Original Dataset - Meta-Kaggle

    Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

    Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.


    This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

    Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

    August 2023 update

    In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

    We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

    The Problems with the Original Dataset

    • The original dataset is 32 CSV files, with 268 columns and 7 GB of compressed data. Having so many tables and columns makes it hard to understand the data.
    • The data is not normalized, so when you join tables you get a lot of errors.
    • Some values refer to non-existing values in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.
    • There are missing values.
    • There are duplicate values.
    • There are values that are not valid. For example, Ids that are not positive integers.
    • The date and time columns are not in the right format.
    • Some columns only have the same value for all rows, so they are not useful.
    • The boolean columns have string values True or False.
    • Incorrect values for the Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
    • Users upvote their own messages.

    The Solution

    • To handle so many tables and columns I use a relational database. I use MySQL, but you can use any relational database.
    • The steps to create the database are:
    • Creating the database tables with the right data types and constraints. I do that by running the db_abd_create_tables.sql script.
    • Downloading the CSV files from Kaggle using the Kaggle API.
    • Cleaning the data using pandas. I do that by running the clean_data.py script (a sketch of this step follows the list below). The script does the following steps for each table:
      • Drops the columns that are not needed.
      • Converts each column to the right data type.
      • Replaces foreign keys that do not exist with NULL.
      • Replaces some of the missing values with default values.
      • Removes rows where there are missing values in the primary key/not null columns.
      • Removes duplicate rows.
    • Loading the data into the database using the LOAD DATA INFILE command.
    • Checking that the number of rows in the database tables matches the number of rows in the CSV files.
    • Adding foreign key constraints to the database tables. I do that by running the add_foreign_keys.sql script.
    • Updating the Total columns in the database tables. I do that by running the update_totals.sql script.
    • Backing up the database.
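
    A hedged sketch of what the per-table pandas cleaning pass described above can look like (the function and column names here are illustrative, not the actual clean_data.py code):

    import pandas as pd

    def clean_table(df, keep_cols, dtypes, pk_cols, fk_col=None, valid_fk_ids=None):
        df = df[keep_cols]      # drop the columns that are not needed
        df = df.astype(dtypes)  # convert each column to the right data type
        if fk_col is not None:
            # replace foreign keys that do not exist in the parent table with NULL
            df.loc[~df[fk_col].isin(valid_fk_ids), fk_col] = pd.NA
        df = df.dropna(subset=pk_cols)  # drop rows missing primary-key values
        return df.drop_duplicates()     # remove duplicate rows

    users = pd.read_csv("Users.csv")
    messages = clean_table(
        pd.read_csv("ForumMessages.csv"),
        keep_cols=["Id", "UserId", "Message", "PostDate"],
        dtypes={"Id": "Int64", "UserId": "Int64"},
        pk_cols=["Id"],
        fk_col="UserId",
        valid_fk_ids=set(users["Id"]),
    )
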
  15. Data from: Seeking Alpha Dataset

    • kaggle.com
    zip
    Updated Nov 10, 2024
    Cite
    Aman Sharma (2024). Seeking Alpha Dataset [Dataset]. https://www.kaggle.com/datasets/aman2626786/seeking-alpha-dataset
    Explore at:
    zip (250575 bytes)
    Dataset updated
    Nov 10, 2024
    Authors
    Aman Sharma
    Description

    What is the Seeking Alpha API?

    The Seeking Alpha API from RapidAPI queries stock news, market-moving information, price quotes, charts, indices, analysis, and much more from investors and experts on the Seeking Alpha stock research platform. It has a comprehensive list of endpoints for different categories of data.

    Currently, the API has three pricing plans and a free subscription. It supports various programming languages, including Python, PHP, Ruby, and JavaScript. This article digs into its details and shows how to use the API with multiple programming languages.

    How does the Seeking Alpha API work?

    The Seeking Alpha API follows simple request/response logic: the client sends a request to a specific endpoint and obtains the necessary output as the response. The request includes the x-RapidAPI-key and host as authentication parameters so that the server can identify it as a valid request, and the request's body contains any optional parameters needed to process it. Once the API server has received the request, it processes it using the back-end application and sends back the requested information to the client in JSON format.

    Target Audience for the Seeking Alpha API

    Financial Application Developers

    Financial application developers can integrate this API to attract Seeking Alpha's audience to their financial applications. Its comprehensive list of endpoints makes it possible to provide the complete Seeking Alpha experience. The API has affordable pricing plans, each endpoint requires only a few lines of code, and integration into an application is straightforward. Since it supports multiple programming languages, it has widespread usability.

    Stock Market Investors and Learners

    Investors, especially those who research financial companies and the stock market, can use the API to get information directly. It has a free plan, and its Pro plan costs only $10, so anyone learning about the stock market can make use of it at low cost.

    How to Connect to the Seeking Alpha API: Step-by-Step Tutorial

    Step 1: Sign up and get a RapidAPI account. RapidAPI is the world's largest API marketplace, used by more than a million developers worldwide. You can use RapidAPI to search and connect to thousands of APIs using a single SDK, API key, and dashboard.

    To create a RapidAPI account, go to rapidapi.com and click on the Sign Up icon. You can use your Google, Github, or Facebook account for Single Sign-on (SSO) or create an account manually.

  16. Malware Benign API Call Argument Feature Vector

    • kaggle.com
    zip
    Updated Apr 28, 2025
    Cite
    BISHWAJIT PRASAD GOND (2025). Malware Benign API Call Argument Feature Vector [Dataset]. https://www.kaggle.com/datasets/bishwajitprasadgond/malware-benign-api-call-argument-feature-vector
    Explore at:
    zip (36681722 bytes)
    Dataset updated
    Apr 28, 2025
    Authors
    BISHWAJIT PRASAD GOND
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Disclaimer: To process and perform analysis with this dataset, it is strongly recommended that your system has at least 128 GB of RAM. Attempting to work with this dataset on systems with lower memory may result in crashes, incomplete processing, or significant performance issues.

    The process involves acquiring malware data, performing behavioral analysis, and preparing features for deep learning models.

    Step 1: Data Acquisition

    • Source Malware Hashes: Obtain malware hashes from VirusShare.
    • Query VirusTotal: Use the hashes to query VirusTotal and download JSON files containing scan results from over 70 antivirus engines.
    • Class Determination: Analyze the scan results to classify the malware into distinct categories.

    Step 2: Malware Download

    • Download Malware Samples: Based on the classification, download malware samples for each category.

    Step 3: Dynamic Analysis with Cuckoo Sandbox

    • Environment Setup: Conduct dynamic analysis in a controlled environment using Cuckoo Sandbox.
    • Behavioral Report: Generate a JSON behavioral report for each malware sample, focusing on Portable Executable (PE) files.
    • API Call Sequence Extraction: Extract API call sequence reports in JSON format, including:
      • API Name
      • API Argument
      • API Return
      • API Category

    Step 4: Data Preprocessing

    • JSON Report Segmentation: Split the JSON report into four text files:

      • api_name.txt
      • api_argument.txt
      • api_return.txt
      • api_category.txt
    • Unigram Generation:

      • Combine API names with their corresponding arguments using underscores (e.g., LdrLoadDll_urlmon.dll).
      • Generate unigrams for each malware category.

      Example unigram: LdrLoadDll_urlmon_urlmon.dll

    • Output: Create a CSV file containing unigrams for each malware category.

    Step 5: Feature Extraction and Vectorization

    • API Elements Extraction:

      • Extract key elements: API Name and API Argument
    • Unique Unigrams:

      • Identify unique unigrams from the JSON reports.
    • Term Frequency (TF) Calculation:

      • Tokenize the text and compute TF weights for each unigram, reflecting their importance in the dataset.
      • Optionally apply L2 normalization to ensure consistent feature vector lengths (see the sketch after this list).
    • Feature Refinement:

      • Filter unnecessary features from the unigram CSV files to create a refined feature set.
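
    The TF weighting with optional L2 normalization described above maps directly onto scikit-learn's TfidfVectorizer with IDF disabled. A sketch (the CSV file and column names are illustrative):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # One document per sample: space-separated unigrams such as LdrLoadDll_urlmon.dll
    docs = pd.read_csv("unigrams.csv")["unigrams"].tolist()

    # use_idf=False -> plain term frequency; norm="l2" -> consistent vector lengths
    vectorizer = TfidfVectorizer(
        use_idf=False, norm="l2", lowercase=False, token_pattern=r"\S+"
    )
    tf_matrix = vectorizer.fit_transform(docs)
    print(tf_matrix.shape, len(vectorizer.get_feature_names_out()))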

    Output

    • Prepared Dataset:
      • A refined CSV file containing unigrams and TF-weighted features for each malware category.

    Citation

    • B. P. Gond, M. Shahnawaz, Rajneekant and D. P. Mohapatra, "NLP-Driven Malware Classification: A Jaccard Similarity Approach," 2024 IEEE International Conference on Information Technology, Electronics and Intelligent Communication Systems (ICITEICS), Bangalore, India, 2024, pp. 1-8, DOI: https://doi.org/10.1109/ICITEICS61368.2024.10624953
    • B. P. Gond, A. K. Singh and D. P. Mohapatra, "A Deep Learning Framework for Malware Classification using NLP Techniques," 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kamand, India, 2024, pp. 1-8, DOI: https://doi.org/10.1109/ICCCNT61001.2024.10725427
    • Gond, B. P., & Mohapatra, D. P. (2025). Deep Learning-Driven Malware Classification with API Call Sequence Analysis and Concept Drift Handling. ArXiv. https://arxiv.org/abs/2502.08679
  17. MyAnimeList API

    • kaggle.com
    zip
    Updated Aug 2, 2023
    Cite
    Pat Mendoza (2023). MyAnimeList API [Dataset]. https://www.kaggle.com/datasets/patmendoza/myanimelist-api
    Explore at:
    zip (49218834 bytes)
    Dataset updated
    Aug 2, 2023
    Authors
    Pat Mendoza
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    MyAnimeList API Download

    This is the dataset I created as part of the Google Data Analytics Professional Certificate capstone project. The MyAnimeList website has a vast repository of ratings, rankings, and viewership data that can be used in many ways. I extracted several datasets from the MyAnimeList (MAL) details API (https://myanimelist.net/apiconfig/references/api/v2) and plan to update the data roughly every two weeks.

    Possible uses for this data include tracking which anime viewers are watching most within a particular time period, and which titles are being scored well (out of 10) and which aren't.

    My visualization for this data is part of a Tableau dashboard located here. The dashboard allows fans to explore the dataset and locate top-scored or popular titles by genre, time period, and demographic (although this field isn't always entered).

    Documentation

    The extraction and cleaning process is outlined on GitHub here.

    Frequency of Updates

    I plan on updating this roughly every 2 weeks, depending on my availability and the interest in this dataset.

    Caveats

    Extracting and loading this data involved some transformations that should be noted:

    • This data only includes titles that correspond with the "tv" ranking category. This was done to streamline extraction and fine-tune the analysis. If you would like to see other categories, you are welcome to suggest it as an enhancement or use the code to create your own dataset. As a result of subsetting on "tv", the dataset excludes the following ranking categories:
      1. All
      2. airing
      3. upcoming
      4. ova
      5. movie
      6. special
      7. bypopularity
      8. favorite
    • Adult content - This extract excludes all adult content (r+).
    • Note: The previous two points are valid for all tables with the exception of the rank_table. This is the table that was used as a starting point to obtain all MAL ids that were associated with "tv". Because this is a fast download, all categories are included in this table.
    • The creation of the alternative_title field in the anime_table. This uses the English version of the name unless it is null, in which case it falls back to the default name. This was done to make titles accessible to English speakers. The original title field can still be used if desired.
    • The extraction of demographic information from the genres field. MyAnimeList includes demographic information (shounen, seinen, etc.) in the genres field. I've extracted it so it can be used as its own field; however, many of those values are null, making it somewhat difficult to use.
    • Cleaning processes: various methods of cleaning the data have been carried out and are noted on GitHub.
    • start_season.year: this field in the anime_table has been modified for null values. If the value is null, the first four characters of start_date are used instead. I will continue to use this method as long as it is viable.

    Table Structure

    The primary keys in all of the tables (except the tm_ky table) are also foreign keys to other tables. As a result, the tables have composite primary keys of two or more columns.

    1. anime_demo_table

       Field     Type  Key
       tm_ky     int   PK
       mal_id    int   PK
       demo_id   int

    2. anime_genres_table

       Field      Type  Key
       tm_ky      int   PK
       mal_id     int   PK
       genres_id  int   PK

    3. anime_ranking_table

       Field                         Type  Key
       tm_ky                         int   PK
       mal_id                        int   PK
       mean                          dbl
       rank                          int
       popularity                    int
       num_scoring_users             int
       statistics.watching           int
       statistics.completed          int
       statistics.on_hold            int
       statistics.dropped            int
       statistics.plan_to_watch      int
       statistics.num_scoring_users  int

    4. anime_studios_table

       Field      Type  Key
       tm_ky      int   PK
       mal_id     int   PK
       studio_id  int   PK

    5. anime_syn_table

       Field     Type  Key
       tm_ky     int   PK
       mal_id    int   PK
       synonyms  chr

    6. anime_table

       Field                   Type  Key
       tm_ky                   int   PK
       mal_id                  int   PK
       title                   chr
       main_picture.medium     chr
       main_picture.large      chr
       alternative_titles.en   chr
       alternative_titles.ja   chr
       start_date              chr
       end_date                chr
       synopsis                chr
       media_type              chr
       status                  chr
       num_episodes            int
       start_season.year       int
       start_season.season     chr
       rating                  chr
       nsfw                    chr
       demo_de                 chr ...
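
    Given the composite keys above, tables can be joined in pandas like this (a sketch; the CSV file names are illustrative):

    import pandas as pd

    anime = pd.read_csv("anime_table.csv")
    ranking = pd.read_csv("anime_ranking_table.csv")

    # Join on the composite primary key shared by both tables
    merged = anime.merge(ranking, on=["tm_ky", "mal_id"], how="inner")
    print(merged[["title", "mean", "rank", "popularity"]].head())
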
  18. 2M unique spotify songs with audio features.

    • kaggle.com
    zip
    Updated Sep 1, 2024
    Cite
    krish sharma (2024). 2M unique spotify songs with audio features. [Dataset]. https://www.kaggle.com/datasets/krishsharma0413/2-million-songs-from-mpd-with-audio-features
    Explore at:
    zip (408512929 bytes)
    Dataset updated
    Sep 1, 2024
    Authors
    krish sharma
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    Where is the data from?

    The dataset is a combination of the Million Playlist Dataset and the Spotify API.

    SQLite structure.

    The SQLite file is in .db format, with a single extracted table. Its columns are:

    • track_uri (TEXT, PRIMARY KEY): Unique identifier used by Spotify for songs.
    • track_name (TEXT): Song name.
    • artist_name (TEXT): Artist name.
    • artist_uri (TEXT): Unique identifier used by Spotify for artists.
    • album_name (TEXT): Album name.
    • album_uri (TEXT): Unique identifier used by Spotify for albums.
    • duration_ms (INTEGER): Duration of the song.
    • danceability (REAL): Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    • energy (REAL): Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
    • key (INTEGER): The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
    • loudness (REAL): The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
    • mode (INTEGER): Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
    • speechiness (REAL): Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    • acousticness (REAL): A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    • instrumentalness (REAL): Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    • liveness (REAL): Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides a strong likelihood that the track is live.
    • valence (REAL): A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
    • tempo (REAL): The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
    • type (TEXT): The object type.
    • id (TEXT): The Spotify ID for the track.
    • uri (TEXT): The Spotify URI for the track.
    • track_href (TEXT): A link to the Web API endpoint providing full details of the track.
    • analysis_url (TEXT): A URL to access the full audio analysis of the track. An access token is required to access this data.
    • fduration_ms (INTEGER): The duration of the track in milliseconds.
    • time_signature (INTEGER): An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of "3/4" to "7/4".
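
    A quick way to explore the .db file from Python with the standard sqlite3 module (the database and table names below are placeholders, since the table name is not stated above; discover it via sqlite_master first):

    import sqlite3

    conn = sqlite3.connect("spotify.db")  # illustrative file name

    # Find the actual table name before querying
    print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())

    # Example query against the documented columns (substitute the real table name)
    rows = conn.execute(
        "SELECT track_name, artist_name, danceability "
        "FROM tracks ORDER BY danceability DESC LIMIT 10"
    ).fetchall()
    for row in rows:
        print(row)
    conn.close()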

  19. YTB_apiKey

    • kaggle.com
    zip
    Updated May 4, 2024
    Cite
    Trương Văn Khải (2024). YTB_apiKey [Dataset]. https://www.kaggle.com/datasets/khitrngvn/ytb-apikey
    Explore at:
    zip (201 bytes)
    Dataset updated
    May 4, 2024
    Authors
    Trương Văn Khải
    Description

    Dataset

    This dataset was created by Trương Văn Khải

    Contents

  20. Shopping Mall Customer Data Segmentation Analysis

    • kaggle.com
    zip
    Updated Aug 4, 2024
    Cite
    DataZng (2024). Shopping Mall Customer Data Segmentation Analysis [Dataset]. https://www.kaggle.com/datasets/datazng/shopping-mall-customer-data-segmentation-analysis
    Explore at:
    zip (5890828 bytes)
    Dataset updated
    Aug 4, 2024
    Authors
    DataZng
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Demographic Analysis of Shopping Behavior: Insights and Recommendations

    Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.

    Cleaned Data Details: The data was cleaned and standardized into 15,079 unique entries with attributes including Customer ID, age, gender, annual income, and spending score. It can be used by marketing analysts to produce better mall-specific marketing strategies.

    Challenges Faced:
    1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention.
    2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort.
    3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.

    Research Topics:
    1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions.
    2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.

    Suggestions for Project Expansion:
    1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights.
    2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis.
    3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking.

    This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.

    References

    • OpenAI. (2022). ChatGPT [Computer software]. https://openai.com/chatgpt
    • Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data
    • Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook
    • Pandas-Datareader. (n.d.). https://pypi.org/project/pandas-datareader/
