82 datasets found
  1. kaggle api key 2

    • kaggle.com
    zip
    Updated Jun 27, 2024
    Cite
    valerie lucro (2024). kaggle api key 2 [Dataset]. https://www.kaggle.com/datasets/valerielucro/kaggle-api-key-2/code
    Explore at:
    zip (238 bytes)
    Dataset updated
    Jun 27, 2024
    Authors
    valerie lucro
    Description

    Dataset

    This dataset was created by valerie lucro

    Contents

  2. Most popular Kaggle datasets

    • kaggle.com
    zip
    Updated Jan 2, 2025
    Cite
    Dany Ocean (2025). Most popular Kaggle datasets [Dataset]. https://www.kaggle.com/datasets/danyocean/most-popular-kaggle-datasets
    Explore at:
    zip (920936 bytes)
    Dataset updated
    Jan 2, 2025
    Authors
    Dany Ocean
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This project focuses on exploring and analyzing the most popular datasets available on Kaggle. By delving into these datasets, we aim to identify key trends, understand user preferences, and highlight the topics that drive engagement within the data science and machine learning communities.

    The attached notebook also includes some interesting charts and analytics.

  3. api key

    • kaggle.com
    zip
    Updated Oct 10, 2024
    Cite
    CIturrieta (2024). api key [Dataset]. https://www.kaggle.com/datasets/citurrieta/api-key
    Explore at:
    zip (213 bytes)
    Dataset updated
    Oct 10, 2024
    Authors
    CIturrieta
    Description

    Dataset

    This dataset was created by CIturrieta

    Contents

  4. Free API List

    • kaggle.com
    zip
    Updated Oct 21, 2023
    Cite
    nancy.jain042 (2023). Free API List [Dataset]. https://www.kaggle.com/datasets/nancyjain042/free-api-list
    Explore at:
    zip (57783 bytes)
    Dataset updated
    Oct 21, 2023
    Authors
    nancy.jain042
    Description

    This dataset provides a curated collection of freely available APIs covering a wide range of categories. Whether you are a developer, data enthusiast, or just someone interested in exploring various API services, this dataset offers a valuable resource to help you discover, access, and understand these APIs.

    Each entry in the dataset includes essential information about the APIs, such as the API name, a brief description of its functionality, authentication requirements, HTTPS support, and a link to the API's documentation or endpoint. The dataset is categorized to facilitate easy exploration and access to APIs across different domains.

    Example entries:

    "AdoptAPet": A resource to help get pets adopted, requiring an API key for access. "Axolotl": A collection of axolotl pictures and facts, with HTTPS support and no authentication required. "Cat Facts": Providing daily cat facts, with HTTPS support and no authentication needed.

    Columns:

    • API: The name or title of the API.

    • Description: A brief description of the API's functionality and what it offers.

    • Auth (Authentication): Whether the API requires authentication for access. If it specifies "apiKey" or another form of authentication, users must provide valid credentials or keys to use the API.

    • HTTPS: Whether the API supports secure communication over HTTPS.

    • Link: A URL or link to the API's documentation.

    • Category: The domain or topic the API belongs to.

  5. api_key

    • kaggle.com
    zip
    Updated Dec 7, 2024
    + more versions
    Cite
    Yamen Mohamed (2024). api_key [Dataset]. https://www.kaggle.com/datasets/yamenmohamed/api-key/code
    Explore at:
    zip (228 bytes)
    Dataset updated
    Dec 7, 2024
    Authors
    Yamen Mohamed
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Yamen Mohamed

    Released under Apache 2.0

    Contents

  6. key-api

    • kaggle.com
    zip
    Updated Nov 3, 2024
    Cite
    Shubhangi C Salunkhe (2024). key-api [Dataset]. https://www.kaggle.com/datasets/shubhangicsalunkhe/key-api/discussion
    Explore at:
    zip (233 bytes)
    Dataset updated
    Nov 3, 2024
    Authors
    Shubhangi C Salunkhe
    Description

    Dataset

    This dataset was created by Shubhangi C Salunkhe

    Contents

  7. api_key

    • kaggle.com
    zip
    Updated Dec 4, 2023
    Cite
    Shefiyyah Aurellia (2023). api_key [Dataset]. https://www.kaggle.com/datasets/shefiyyahaurellia/api-key
    Explore at:
    zip (232 bytes)
    Dataset updated
    Dec 4, 2023
    Authors
    Shefiyyah Aurellia
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Shefiyyah Aurellia

    Released under MIT

    Contents

  8. API_Websites

    • kaggle.com
    zip
    Updated Feb 5, 2023
    Cite
    chandrashekhar G T (2023). API_Websites [Dataset]. https://www.kaggle.com/datasets/chandrashekhargt/api-wwebsites
    Explore at:
    zip (750 bytes)
    Dataset updated
    Feb 5, 2023
    Authors
    chandrashekhar G T
    Description

    APIs, or Application Programming Interfaces, allow developers to access the functionality of an application or service over the internet. To access an API, a developer would typically make a request using a specific URL or endpoint, along with any necessary authentication or parameters, and receive a response in a standardized format such as JSON. This response can then be used to integrate the API's functionality into another application or service. Many websites and web-based services offer APIs that allow developers to access data and perform actions, such as retrieving information about a user, posting content, or making a purchase. In order to access an API, a developer often needs to obtain an API key or access token, which serves as a unique identifier and enables the API to track usage and enforce rate limits.
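
    As a hedged illustration of that request/response flow (the endpoint, path, and key below are placeholders, not part of this dataset):

    import requests

    API_KEY = "your-api-key"  # placeholder credential obtained from the provider

    # Hypothetical endpoint; a real API documents its own base URL and parameters
    response = requests.get(
        "https://api.example.com/v1/users/123",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    response.raise_for_status()
    print(response.json())  # standardized JSON response, ready to integrate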

  9. ShutterStock Dataset for AI vs Human-Gen. Image

    • kaggle.com
    zip
    Updated Jun 19, 2025
    Cite
    Sachin Singh (2025). ShutterStock Dataset for AI vs Human-Gen. Image [Dataset]. https://www.kaggle.com/datasets/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
    Explore at:
    zip (11617243112 bytes)
    Dataset updated
    Jun 19, 2025
    Authors
    Sachin Singh
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    ShutterStock AI vs. Human-Generated Image Dataset

    This dataset is curated to facilitate research in distinguishing AI-generated images from human-created ones, leveraging ShutterStock data. As AI-generated imagery becomes more sophisticated, developing models that can classify and analyze such images is crucial for applications in content moderation, digital forensics, and media authenticity verification.

    Dataset Overview:

    • Total Images: 100,000
    • Training Data: 80,000 images (majority AI-generated)
    • Test Data: 20,000 images
    • Image Sources: A mix of AI-generated images and real photographs or illustrations created by human artists
    • Labeling: Each image is labeled as either AI-generated or human-created

    Potential Use Cases:

    • AI-Generated Image Detection: Train models to distinguish between AI and human-made images.
    • Deep Learning & Computer Vision Research: Develop and benchmark CNNs, transformers, and other architectures.
    • Generative Model Evaluation: Compare AI-generated images to real images for quality assessment.
    • Digital Forensics: Identify synthetic media for applications in fake image detection.
    • Ethical AI & Content Authenticity: Study the impact of AI-generated visuals in media and ensure transparency.

    Why This Dataset?

    With the rise of generative AI models like Stable Diffusion, DALL·E, and MidJourney, the ability to differentiate between synthetic and real images has become a crucial challenge. This dataset offers a structured way to train AI models on this task, making it a valuable resource for both academic research and practical applications.

    Explore the dataset and contribute to advancing AI-generated content detection!

    Step 1: Install and Authenticate Kaggle API

    If you haven't installed the Kaggle API, run:
      pip install kaggle

    Then, download your kaggle.json API key from your Kaggle account settings and move it to ~/.kaggle/ (Linux/Mac) or C:\Users\<YourUser>\.kaggle\ (Windows).

    Step 2: Download the Dataset

    The kaggle.json file holds a username and key rather than a token, and the Kaggle API accepts them as HTTP basic auth, so a Bearer-token wget call will not work. A sketch using the v1 download endpoint (requires jq):

      curl -L -u "$(jq -r '.username+":"+.key' ~/.kaggle/kaggle.json)" \
        "https://www.kaggle.com/api/v1/datasets/download/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image" \
        -o dataset.zip

    Alternatively, the official CLI handles authentication for you: kaggle datasets download -d shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
    

    Step 3: Extract the Dataset

    Once downloaded, extract the dataset using:
      unzip dataset.zip -d dataset_folder

    Now your dataset is ready to use! 🚀

  10. codeparrot_1M

    • kaggle.com
    zip
    Updated Feb 25, 2024
    Cite
    Tanay Mehta (2024). codeparrot_1M [Dataset]. https://www.kaggle.com/datasets/heyytanay/codeparrot-1m/suggestions?status=pending&yourSuggestions=true
    Explore at:
    zip (2368083124 bytes)
    Dataset updated
    Feb 25, 2024
    Authors
    Tanay Mehta
    Description

    A subset of the codeparrot/github-code dataset consisting of 1 million tokenized Python files in the Lance file format for blazing-fast and memory-efficient I/O.

    The files were tokenized using the EleutherAI/gpt-neox-20b tokenizer with no extra tokens.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    The script used for creating the dataset can be found here.

    Instructions for using this dataset

    This dataset is not meant to be used in Kaggle Kernels: Lance requires write access to the dataset's directory, which Kaggle Kernels' input directory doesn't provide, and the dataset's size prohibits moving it to /kaggle/working. Hence, to use this dataset, you must download it using the Kaggle API or through this page and then move the unzipped files to a folder called codeparrot_1M.lance. Below are detailed snippets on how to download and use this dataset.

    First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/kaggle.json):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/codeparrot-1m
    $ mkdir codeparrot_1M.lance/
    $ unzip -qq codeparrot-1m.zip -d codeparrot_1M.lance/
    $ rm codeparrot-1m.zip
    

    Once this is done, you will find your dataset in the codeparrot_1M.lance/ folder. To load and get a feel for the data, run the snippet below.

    import lance
    dataset = lance.dataset('codeparrot_1M.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of tokens in the dataset.
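
    Because the dataset is larger than most machines' memory, it can also be streamed in record batches instead of materialized at once. A minimal sketch (column names depend on the creation script, so inspect dataset.schema first):

    import lance

    dataset = lance.dataset("codeparrot_1M.lance/")
    print(dataset.schema)  # inspect the column layout before reading

    # Stream the data batch by batch rather than loading all rows at once
    rows_seen = 0
    for batch in dataset.to_batches():
        rows_seen += batch.num_rows
    print(rows_seen)  # matches dataset.count_rows()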

    Considerations for Using the Data The dataset consists of source code from a wide range of repositories. As such they can potentially include harmful or biased code as well as sensitive information like passwords or usernames.

  11. coco2017 Lance (train)

    • kaggle.com
    zip
    Updated Apr 10, 2024
    Cite
    Tanay Mehta (2024). coco2017 Lance (train) [Dataset]. https://www.kaggle.com/datasets/heyytanay/coco2017-train-lance
    Explore at:
    zip (19181147792 bytes)
    Dataset updated
    Apr 10, 2024
    Authors
    Tanay Mehta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the COCO-2017 dataset's training split for object detection and segmentation saved in the Lance file format for blazing fast and memory-efficient I/O.

    This dataset only includes data necessary for object detection and segmentation.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    Instructions for using this dataset

    This dataset is not meant to be used in Kaggle Kernels: Lance requires write access to the dataset's directory, which Kaggle Kernels' input directory doesn't provide, and the dataset's size prohibits moving it to /kaggle/working. Hence, to use this dataset, you must download it using the Kaggle API or through this page, and then move the unzipped files to a folder called coco2017_train.lance. Below are detailed snippets on how to download and use this dataset.

    First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/kaggle.json):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/coco2017-train-lance
    $ mkdir coco2017_train.lance/
    $ unzip -qq coco2017-train-lance.zip -d coco2017_train.lance/
    $ rm coco2017-train-lance.zip
    

    Once this is done, you will find your dataset in the coco2017_train.lance/ folder. To load and get a feel for the data, run the snippet below.

    import lance
    dataset = lance.dataset('coco2017_train.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of rows in the dataset.
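
    To peek at a few rows without reading the full ~19 GB, Lance supports random access by row index. A sketch (column names depend on the conversion script, so check dataset.schema):

    import lance

    dataset = lance.dataset("coco2017_train.lance/")
    sample = dataset.take([0, 1, 2])  # returns a pyarrow.Table with just these rows
    print(sample.schema)
    print(sample.num_rows)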

  12. MeDAL Dataset

    • kaggle.com
    • opendatalab.com
    • +1more
    zip
    Updated Nov 16, 2020
    Cite
    xhlulu (2020). MeDAL Dataset [Dataset]. https://www.kaggle.com/xhlulu/medal-emnlp
    Explore at:
    zip (7324382521 bytes)
    Dataset updated
    Nov 16, 2020
    Authors
    xhlulu
    Description


    Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.

    💻 Code · 🤗 Dataset (Hugging Face) · 💾 Dataset (Kaggle) · 💽 Dataset (Zenodo) · 📜 Paper (ACL) · 📝 Paper (ArXiv) · Pre-trained ELECTRA (Hugging Face)

    Downloading the data

    We recommend downloading from Kaggle if you can authenticate through their API. The advantage of Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.

    First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:

    pip install kaggle

    Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:

    kaggle datasets download xhlulu/medal-emnlp

    Now, unzip everything and place the files inside the data directory:

    unzip -nq crawl-300d-2M-subword.zip -d data
    mv data/pretrain_sample/* data/

    Loading FastText Embeddings

    For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:

    wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
    unzip -nq data/crawl-300d-2M-subword.zip -d data/

    Model Quickstart

    Using Torch Hub

    You can directly load LSTM and LSTM-SA with torch.hub:

    import torch

    lstm = torch.hub.load("BruceWen120/medal", "lstm")
    lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")

    If you want to use the Electra model, you need to first install transformers:

    pip install transformers

    Then, you can load it with torch.hub:

    import torch
    electra = torch.hub.load("BruceWen120/medal", "electra")

    Using Huggingface transformers

    If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load it directly from the Hugging Face Repository:

    from transformers import AutoModel, AutoTokenizer
    
    model = AutoModel.from_pretrained("xhlu/electra-medal")
    tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
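
    A short usage sketch for these pre-trained weights (standard transformers inference; the example sentence is illustrative):

    import torch
    from transformers import AutoModel, AutoTokenizer

    model = AutoModel.from_pretrained("xhlu/electra-medal")
    tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")

    # Encode a sentence containing a medical abbreviation, get contextual embeddings
    inputs = tokenizer("The patient was given IV fluids overnight.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)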
    

    Citation

    Download the bibtex here, or copy the text below:

    @inproceedings{wen-etal-2020-medal,
        title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
        author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
        booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
        month = nov,
        year = "2020",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
        pages = "130--135",
    }

    License, Terms and Conditions

    The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, pytorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.

    The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:

    INTRODUCTION

    Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.

    MEDLINE/PUBMED SPECIFIC TERMS

    NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.

    GENERAL TERMS AND CONDITIONS

    • Users of the data agree to:

      • acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
      • properly use registration and/or trademark symbols when referring to NLM products, and
      • not indicate or imply that NLM has endorsed its products/services/applications.
    • Users who republish or redistribute the data (services, products or raw data) agree to:

      • maintain the most current version of all distributed data, or
      • make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
    • These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.

    • NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.

    • NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.

  13. openwebtext_1M

    • kaggle.com
    zip
    Updated Mar 18, 2024
    Cite
    Tanay Mehta (2024). openwebtext_1M [Dataset]. https://www.kaggle.com/datasets/heyytanay/openwebtext-1m/code
    Explore at:
    zip (2043993317 bytes)
    Dataset updated
    Mar 18, 2024
    Authors
    Tanay Mehta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A subset of the Skylion007/openwebtext dataset consisting of 1 million tokenized samples in the Lance file format for blazing-fast and memory-efficient I/O.

    The files were tokenized using the gpt2 tokenizer with no extra tokens.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    Instructions for using this dataset

    This dataset is not meant to be used in Kaggle Kernels: Lance requires write access to the dataset's directory, which Kaggle Kernels' input directory doesn't provide, and the dataset's size prohibits moving it to /kaggle/working. Hence, to use this dataset, you must download it using the Kaggle API or through this page and then move the unzipped files to a folder called openwebtext_1M.lance. Below are detailed snippets on how to download and use this dataset.

    First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/kaggle.json):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/openwebtext-1m
    $ mkdir openwebtext_1M.lance/
    $ unzip -qq openwebtext-1m.zip -d openwebtext_1M.lance/
    $ rm openwebtext-1m.zip
    

    Once this is done, you will find your dataset in the openwebtext_1M.lance/ folder. To load and get a feel for the data, run the snippet below.

    import lance
    dataset = lance.dataset('openwebtext_1M.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of tokens in the dataset.
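
    Since the samples were tokenized with the gpt2 tokenizer, they can be decoded back to text with transformers. A sketch under the assumption that each row stores a single token id in a column named "input_ids" (verify both via dataset.schema before relying on this):

    import lance
    from transformers import AutoTokenizer

    dataset = lance.dataset("openwebtext_1M.lance/")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    # "input_ids" is an assumed column name; check dataset.schema for the real one
    tokens = dataset.take(list(range(100))).column("input_ids").to_pylist()
    print(tokenizer.decode(tokens))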

  14. Clean Meta Kaggle

    • kaggle.com
    Updated Sep 8, 2023
    Cite
    Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yoni Kremer
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Cleaned Meta-Kaggle Dataset

    The Original Dataset - Meta-Kaggle

    Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

    Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.


    This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

    Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

    August 2023 update

    In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

    We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

    The Problems with the Original Dataset

    • The original dataset is 32 CSV files, with 268 columns and 7 GB of compressed data. Having so many tables and columns makes it hard to understand the data.
    • The data is not normalized, so when you join tables you get a lot of errors.
    • Some values refer to non-existing values in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.
    • There are missing values.
    • There are duplicate values.
    • There are values that are not valid. For example, Ids that are not positive integers.
    • The date and time columns are not in the right format.
    • Some columns only have the same value for all rows, so they are not useful.
    • The boolean columns have string values True or False.
    • Incorrect values for the Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
    • Users upvote their own messages.

    The Solution

    • To handle so many tables and columns I use a relational database. I use MySQL, but you can use any relational database.
    • The steps to create the database are:
    • Creating the database tables with the right data types and constraints. I do that by running the db_abd_create_tables.sql script.
    • Downloading the CSV files from Kaggle using the Kaggle API.
    • Cleaning the data using pandas. I do that by running the clean_data.py script (a sketch of this step follows the list below). The script does the following steps for each table:
      • Drops the columns that are not needed.
      • Converts each column to the right data type.
      • Replaces foreign keys that do not exist with NULL.
      • Replaces some of the missing values with default values.
      • Removes rows where there are missing values in the primary key/not null columns.
      • Removes duplicate rows.
    • Loading the data into the database using the LOAD DATA INFILE command.
    • Checking that the number of rows in the database tables matches the number of rows in the CSV files.
    • Adding foreign key constraints to the database tables. I do that by running the add_foreign_keys.sql script.
    • Updating the Total columns in the database tables. I do that by running the update_totals.sql script.
    • Backing up the database.
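
    A hedged sketch of what the per-table pandas cleaning pass described above can look like (the function and column names here are illustrative, not the actual clean_data.py code):

    import pandas as pd

    def clean_table(df, keep_cols, dtypes, pk_cols, fk_col=None, valid_fk_ids=None):
        df = df[keep_cols]      # drop the columns that are not needed
        df = df.astype(dtypes)  # convert each column to the right data type
        if fk_col is not None:
            # replace foreign keys that do not exist in the parent table with NULL
            df.loc[~df[fk_col].isin(valid_fk_ids), fk_col] = pd.NA
        df = df.dropna(subset=pk_cols)  # drop rows missing primary-key values
        return df.drop_duplicates()     # remove duplicate rows

    users = pd.read_csv("Users.csv")
    messages = clean_table(
        pd.read_csv("ForumMessages.csv"),
        keep_cols=["Id", "UserId", "Message", "PostDate"],
        dtypes={"Id": "Int64", "UserId": "Int64"},
        pk_cols=["Id"],
        fk_col="UserId",
        valid_fk_ids=set(users["Id"]),
    )
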
  15. Data from: Seeking Alpha Dataset

    • kaggle.com
    zip
    Updated Nov 10, 2024
    Cite
    Aman Sharma (2024). Seeking Alpha Dataset [Dataset]. https://www.kaggle.com/datasets/aman2626786/seeking-alpha-dataset
    Explore at:
    zip (250575 bytes)
    Dataset updated
    Nov 10, 2024
    Authors
    Aman Sharma
    Description

    What is the Seeking Alpha API?

    The Seeking Alpha API from RapidAPI queries stock news, market-moving information, price quotes, charts, indices, analysis, and much more from investors and experts on the Seeking Alpha stock research platform. It has a comprehensive list of endpoints for different categories of data.

    Currently, the API has three pricing plans and a free subscription. It supports various programming languages, including Python, PHP, Ruby, and JavaScript. This article digs into its details and shows how to use the API with multiple programming languages.

    How does the Seeking Alpha API work?

    The Seeking Alpha API follows simple request/response logic: the client sends a request to a specific endpoint and obtains the necessary output as the response. The request includes the x-RapidAPI-key and host as authentication parameters so that the server can identify it as a valid request, and the request's body contains any optional parameters needed to process it. Once the API server has received the request, it processes it using the back-end application and sends back the requested information to the client in JSON format.

    Target Audience for the Seeking Alpha API

    Financial Application Developers

    Financial application developers can integrate this API to attract Seeking Alpha's audience to their financial applications. Its comprehensive list of endpoints makes it possible to provide the complete Seeking Alpha experience. The API has affordable pricing plans, each endpoint requires only a few lines of code, and integration into an application is straightforward. Since it supports multiple programming languages, it has widespread usability.

    Stock Market Investors and Learners

    Investors, especially those who research financial companies and the stock market, can use the API to get information directly. It has a free plan, and its Pro plan costs only $10, so anyone learning about the stock market can make use of it at low cost.

    How to Connect to the Seeking Alpha API: Step-by-Step Tutorial

    Step 1: Sign up and get a RapidAPI account. RapidAPI is the world's largest API marketplace, used by more than a million developers worldwide. You can use RapidAPI to search and connect to thousands of APIs using a single SDK, API key, and dashboard.

    To create a RapidAPI account, go to rapidapi.com and click on the Sign Up icon. You can use your Google, Github, or Facebook account for Single Sign-on (SSO) or create an account manually.

  16. Malware Benign API Call Argument Feature Vector

    • kaggle.com
    zip
    Updated Apr 28, 2025
    Cite
    BISHWAJIT PRASAD GOND (2025). Malware Benign API Call Argument Feature Vector [Dataset]. https://www.kaggle.com/datasets/bishwajitprasadgond/malware-benign-api-call-argument-feature-vector
    Explore at:
    zip (36681722 bytes)
    Dataset updated
    Apr 28, 2025
    Authors
    BISHWAJIT PRASAD GOND
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Disclaimer: To process and perform analysis with this dataset, it is strongly recommended that your system has at least 128 GB of RAM. Attempting to work with this dataset on systems with lower memory may result in crashes, incomplete processing, or significant performance issues.

    The process involves acquiring malware data, performing behavioral analysis, and preparing features for deep learning models.

    Step 1: Data Acquisition

    • Source Malware Hashes: Obtain malware hashes from VirusShare.
    • Query VirusTotal: Use the hashes to query VirusTotal and download JSON files containing scan results from over 70 antivirus engines.
    • Class Determination: Analyze the scan results to classify the malware into distinct categories.

    Step 2: Malware Download

    • Download Malware Samples: Based on the classification, download malware samples for each category.

    Step 3: Dynamic Analysis with Cuckoo Sandbox

    • Environment Setup: Conduct dynamic analysis in a controlled environment using Cuckoo Sandbox.
    • Behavioral Report: Generate a JSON behavioral report for each malware sample, focusing on Portable Executable (PE) files.
    • API Call Sequence Extraction: Extract API call sequence reports in JSON format, including:
      • API Name
      • API Argument
      • API Return
      • API Category

    Step 4: Data Preprocessing

    • JSON Report Segmentation: Split the JSON report into four text files:

      • api_name.txt
      • api_argument.txt
      • api_return.txt
      • api_category.txt
    • Unigram Generation:

      • Combine API names with their corresponding arguments using underscores (e.g., LdrLoadDll_urlmon.dll).
      • Generate unigrams for each malware category.

      Example unigram: LdrLoadDll_urlmon_urlmon.dll

    • Output: Create a CSV file containing unigrams for each malware category.

    Step 5: Feature Extraction and Vectorization

    • API Elements Extraction:

      • Extract key elements: API Name and API Argument
    • Unique Unigrams:

      • Identify unique unigrams from the JSON reports.
    • Term Frequency (TF) Calculation:

      • Tokenize the text and compute TF weights for each unigram, reflecting their importance in the dataset.
      • Optionally apply L2 normalization to ensure consistent feature vector lengths (see the sketch after this list).
    • Feature Refinement:

      • Filter unnecessary features from the unigram CSV files to create a refined feature set.
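
    The TF weighting with optional L2 normalization described above maps directly onto scikit-learn's TfidfVectorizer with IDF disabled. A sketch (the CSV file and column names are illustrative):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # One document per sample: space-separated unigrams such as LdrLoadDll_urlmon.dll
    docs = pd.read_csv("unigrams.csv")["unigrams"].tolist()

    # use_idf=False -> plain term frequency; norm="l2" -> consistent vector lengths
    vectorizer = TfidfVectorizer(
        use_idf=False, norm="l2", lowercase=False, token_pattern=r"\S+"
    )
    tf_matrix = vectorizer.fit_transform(docs)
    print(tf_matrix.shape, len(vectorizer.get_feature_names_out()))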

    Output

    • Prepared Dataset:
      • A refined CSV file containing unigrams and TF-weighted features for each malware category.

    Citation

    • B. P. Gond, M. Shahnawaz, Rajneekant and D. P. Mohapatra, "NLP-Driven Malware Classification: A Jaccard Similarity Approach," 2024 IEEE International Conference on Information Technology, Electronics and Intelligent Communication Systems (ICITEICS), Bangalore, India, 2024, pp. 1-8, DOI: https://doi.org/10.1109/ICITEICS61368.2024.10624953
    • B. P. Gond, A. K. Singh and D. P. Mohapatra, "A Deep Learning Framework for Malware Classification using NLP Techniques," 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kamand, India, 2024, pp. 1-8, DOI: https://doi.org/10.1109/ICCCNT61001.2024.10725427
    • Gond, B. P., & Mohapatra, D. P. (2025). Deep Learning-Driven Malware Classification with API Call Sequence Analysis and Concept Drift Handling. ArXiv. https://arxiv.org/abs/2502.08679
  17. MyAnimeList API

    • kaggle.com
    zip
    Updated Aug 2, 2023
    Cite
    Pat Mendoza (2023). MyAnimeList API [Dataset]. https://www.kaggle.com/datasets/patmendoza/myanimelist-api
    Explore at:
    zip (49218834 bytes)
    Dataset updated
    Aug 2, 2023
    Authors
    Pat Mendoza
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    MyAnimeList API Download

    This is the dataset I created as part of the Google Data Analytics Professional Certificate capstone project. The MyAnimeList website has a vast repository of ratings, rankings, and viewership data that can be used in many ways. I extracted several datasets from the MyAnimeList (MAL) details API (https://myanimelist.net/apiconfig/references/api/v2) and plan to update the data roughly every two weeks.

    Possible uses for this data include tracking which anime viewers are watching most within a particular time period, and which titles are being scored well (out of 10) and which aren't.

    My visualization for this data is part of a Tableau dashboard located here. The dashboard allows fans to explore the dataset and locate top-scored or popular titles by genre, time period, and demographic (although this field isn't always entered).

    Documentation

    The extraction and cleaning process is outlined on GitHub here.

    Frequency of Updates

    I plan on updating this roughly every 2 weeks, depending on my availability and the interest in this dataset.

    Caveats

    Extracting and loading this data involved some transformations that should be noted:

    • This data only includes titles that correspond with the "tv" ranking category. This was done to streamline extraction and fine-tune the analysis. If you would like to see other categories, you are welcome to suggest it as an enhancement or use the code to create your own dataset. As a result of subsetting on "tv", the dataset excludes the following ranking categories:
      1. All
      2. airing
      3. upcoming
      4. ova
      5. movie
      6. special
      7. bypopularity
      8. favorite
    • Adult content - This extract excludes all adult content (r+).
    • Note: The previous two points are valid for all tables with the exception of the rank_table. This is the table that was used as a starting point to obtain all MAL ids that were associated with "tv". Because this is a fast download, all categories are included in this table.
    • The creation of the alternative_title field in the anime_table. This uses the English version of the name unless it is null, in which case it falls back to the default name. This was done to make titles accessible to English speakers. The original title field can still be used if desired.
    • The extraction of demographic information from the genres field. MyAnimeList includes demographic information (shounen, seinen, etc.) in the genres field. I've extracted it so it can be used as its own field; however, many of those values are null, making it somewhat difficult to use.
    • Cleaning processes: various methods of cleaning the data have been carried out and are noted on GitHub.
    • start_season.year: this field in the anime_table has been modified for null values. If the value is null, the first four characters of start_date are used instead. I will continue to use this method as long as it is viable.

    Table Structure

    The primary keys in all of the tables (except the tm_ky table) are also foreign keys to other tables. As a result, the tables have composite primary keys of two or more columns.

    1. anime_demo_table

       Field     Type  Key
       tm_ky     int   PK
       mal_id    int   PK
       demo_id   int

    2. anime_genres_table

       Field      Type  Key
       tm_ky      int   PK
       mal_id     int   PK
       genres_id  int   PK

    3. anime_ranking_table

       Field                         Type  Key
       tm_ky                         int   PK
       mal_id                        int   PK
       mean                          dbl
       rank                          int
       popularity                    int
       num_scoring_users             int
       statistics.watching           int
       statistics.completed          int
       statistics.on_hold            int
       statistics.dropped            int
       statistics.plan_to_watch      int
       statistics.num_scoring_users  int

    4. anime_studios_table

       Field      Type  Key
       tm_ky      int   PK
       mal_id     int   PK
       studio_id  int   PK

    5. anime_syn_table

       Field     Type  Key
       tm_ky     int   PK
       mal_id    int   PK
       synonyms  chr

    6. anime_table

       Field                   Type  Key
       tm_ky                   int   PK
       mal_id                  int   PK
       title                   chr
       main_picture.medium     chr
       main_picture.large      chr
       alternative_titles.en   chr
       alternative_titles.ja   chr
       start_date              chr
       end_date                chr
       synopsis                chr
       media_type              chr
       status                  chr
       num_episodes            int
       start_season.year       int
       start_season.season     chr
       rating                  chr
       nsfw                    chr
       demo_de                 chr ...
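
    Given the composite keys above, tables can be joined in pandas like this (a sketch; the CSV file names are illustrative):

    import pandas as pd

    anime = pd.read_csv("anime_table.csv")
    ranking = pd.read_csv("anime_ranking_table.csv")

    # Join on the composite primary key shared by both tables
    merged = anime.merge(ranking, on=["tm_ky", "mal_id"], how="inner")
    print(merged[["title", "mean", "rank", "popularity"]].head())
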
  18. 2M unique spotify songs with audio features.

    • kaggle.com
    zip
    Updated Sep 1, 2024
    Cite
    krish sharma (2024). 2M unique spotify songs with audio features. [Dataset]. https://www.kaggle.com/datasets/krishsharma0413/2-million-songs-from-mpd-with-audio-features
    Explore at:
    zip (408512929 bytes)
    Dataset updated
    Sep 1, 2024
    Authors
    krish sharma
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    Where is the data from?

    The dataset is a combination of the Million Playlist Dataset and the Spotify API.

    SQLite structure.

    The SQLite file is in .db format, with a single extracted table. Its columns are:

    • track_uri (TEXT, PRIMARY KEY): Unique identifier used by Spotify for songs.
    • track_name (TEXT): Song name.
    • artist_name (TEXT): Artist name.
    • artist_uri (TEXT): Unique identifier used by Spotify for artists.
    • album_name (TEXT): Album name.
    • album_uri (TEXT): Unique identifier used by Spotify for albums.
    • duration_ms (INTEGER): Duration of the song.
    • danceability (REAL): Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    • energy (REAL): Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
    • key (INTEGER): The key the track is in. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
    • loudness (REAL): The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
    • mode (INTEGER): Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
    • speechiness (REAL): Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    • acousticness (REAL): A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    • instrumentalness (REAL): Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater the likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    • liveness (REAL): Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides a strong likelihood that the track is live.
    • valence (REAL): A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
    • tempo (REAL): The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
    • type (TEXT): The object type.
    • id (TEXT): The Spotify ID for the track.
    • uri (TEXT): The Spotify URI for the track.
    • track_href (TEXT): A link to the Web API endpoint providing full details of the track.
    • analysis_url (TEXT): A URL to access the full audio analysis of the track. An access token is required to access this data.
    • fduration_ms (INTEGER): The duration of the track in milliseconds.
    • time_signature (INTEGER): An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of "3/4" to "7/4".
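
    A quick way to explore the .db file from Python with the standard sqlite3 module (the database and table names below are placeholders, since the table name is not stated above; discover it via sqlite_master first):

    import sqlite3

    conn = sqlite3.connect("spotify.db")  # illustrative file name

    # Find the actual table name before querying
    print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())

    # Example query against the documented columns (substitute the real table name)
    rows = conn.execute(
        "SELECT track_name, artist_name, danceability "
        "FROM tracks ORDER BY danceability DESC LIMIT 10"
    ).fetchall()
    for row in rows:
        print(row)
    conn.close()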

  19. YTB_apiKey

    • kaggle.com
    zip
    Updated May 4, 2024
    Cite
    Trương Văn Khải (2024). YTB_apiKey [Dataset]. https://www.kaggle.com/datasets/khitrngvn/ytb-apikey
    Explore at:
    zip (201 bytes)
    Dataset updated
    May 4, 2024
    Authors
    Trương Văn Khải
    Description

    Dataset

    This dataset was created by Trương Văn Khải

    Contents

  20. Shopping Mall Customer Data Segmentation Analysis

    • kaggle.com
    zip
    Updated Aug 4, 2024
    Cite
    DataZng (2024). Shopping Mall Customer Data Segmentation Analysis [Dataset]. https://www.kaggle.com/datasets/datazng/shopping-mall-customer-data-segmentation-analysis
    Explore at:
    zip (5890828 bytes)
    Dataset updated
    Aug 4, 2024
    Authors
    DataZng
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Demographic Analysis of Shopping Behavior: Insights and Recommendations

    Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.

    Cleaned Data Details: The data was cleaned and standardized into 15,079 unique entries with attributes including Customer ID, age, gender, annual income, and spending score. It can be used by marketing analysts to produce better mall-specific marketing strategies.

    Challenges Faced:
    1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention.
    2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort.
    3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.

    Research Topics:
    1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions.
    2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.

    Suggestions for Project Expansion:
    1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights.
    2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis.
    3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking.

    This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.

    References

    • OpenAI. (2022). ChatGPT [Computer software]. https://openai.com/chatgpt
    • Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data
    • Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook
    • Pandas-Datareader. (n.d.). https://pypi.org/project/pandas-datareader/
