18 datasets found
  1. authtoken

    • kaggle.com
    zip
    Updated Aug 18, 2023
    Cite
    Sa Atmax (2023). authtoken [Dataset]. https://www.kaggle.com/datasets/saatmax/authtoken
    Explore at:
    zip (1458 bytes). Available download formats
    Dataset updated
    Aug 18, 2023
    Authors
    Sa Atmax
    Description

    Dataset

    This dataset was created by Sa Atmax

    Contents

  2. ShutterStock Dataset for AI vs Human-Gen. Image

    • kaggle.com
    zip
    Updated Jun 19, 2025
    Cite
    Sachin Singh (2025). ShutterStock Dataset for AI vs Human-Gen. Image [Dataset]. https://www.kaggle.com/datasets/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
    Explore at:
    zip (11617243112 bytes). Available download formats
    Dataset updated
    Jun 19, 2025
    Authors
    Sachin Singh
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    ShutterStock AI vs. Human-Generated Image Dataset

    This dataset is curated to facilitate research in distinguishing AI-generated images from human-created ones, leveraging ShutterStock data. As AI-generated imagery becomes more sophisticated, developing models that can classify and analyze such images is crucial for applications in content moderation, digital forensics, and media authenticity verification.

    Dataset Overview:

    • Total Images: 100,000
    • Training Data: 80,000 images (majority AI-generated)
    • Test Data: 20,000 images
    • Image Sources: A mix of AI-generated images and real photographs or illustrations created by human artists
    • Labeling: Each image is labeled as either AI-generated or human-created

    Potential Use Cases:

    • AI-Generated Image Detection: Train models to distinguish between AI and human-made images.
    • Deep Learning & Computer Vision Research: Develop and benchmark CNNs, transformers, and other architectures.
    • Generative Model Evaluation: Compare AI-generated images to real images for quality assessment.
    • Digital Forensics: Identify synthetic media for applications in fake image detection.
    • Ethical AI & Content Authenticity: Study the impact of AI-generated visuals in media and ensure transparency.

    Why This Dataset?

    With the rise of generative AI models like Stable Diffusion, DALL·E, and MidJourney, the ability to differentiate between synthetic and real images has become a crucial challenge. This dataset offers a structured way to train AI models on this task, making it a valuable resource for both academic research and practical applications.

    Explore the dataset and contribute to advancing AI-generated content detection!

    Step 1: Install and Authenticate Kaggle API

    If you haven't installed the Kaggle API, run:
      pip install kaggle

    Then, download your kaggle.json API key from your Kaggle account page and move it to ~/.kaggle/ (Linux/Mac) or C:\Users\YourUser\.kaggle\ (Windows).

    Step 2: Download the Dataset

    The kaggle.json key file holds a username/key pair rather than a bearer token, so authenticate with HTTP Basic auth against the Kaggle download endpoint:

      curl -L -u "$(jq -r .username ~/.kaggle/kaggle.json):$(jq -r .key ~/.kaggle/kaggle.json)" "https://www.kaggle.com/api/v1/datasets/download/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image" -o dataset.zip

    Equivalently, the Kaggle CLI handles authentication for you: kaggle datasets download -d shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
    

    Step 3: Extract the Dataset

    Once downloaded, extract the dataset using:
      unzip dataset.zip -d dataset_folder

    Now your dataset is ready to use! 🚀
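
    As a quick sanity check after extraction, the images can be loaded with torchvision; this is a minimal sketch that assumes the extracted archive follows the usual train/ layout with one subfolder per label:

      from torchvision import datasets, transforms

      # Resize to a common size and convert to tensors
      tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

      # Assumed layout: dataset_folder/train/<label>/<image files>
      train_ds = datasets.ImageFolder("dataset_folder/train", transform=tfm)
      print(len(train_ds), train_ds.classes)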

  3. Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/1
    Explore at:
    zip. Available download formats
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 🔥

    Some Important links

    Contact Information

    Objective

    To showcase custom object detection on the given dataset by training and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display

    # Render the Roboflow README file that ships with the dataset
    display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
    

    Custom Training with YOLOv7 🔥

    In this notebook, I have preprocessed the images with Roboflow, because the COCO-formatted dataset had images of varying dimensions and was not split into separate sets. To train a custom YOLOv7 model, we need to teach it to recognize the objects in the dataset. To do so, I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    📦 YOLOv7

    Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading YOLOV7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    


    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
        user_secrets = UserSecretsClient()
        wandb_api_key = user_secrets.get_secret("wandb_api")
        wandb.login(key=wandb_api_key)
        anonymous = None
    except Exception:
        # Fall back to anonymous logging when no W&B secret is configured
        wandb.login(anonymous='must')
        print('To use your W&B account, go to Add-ons -> Secrets and provide your '
              'W&B access token with the label name WANDB. '
              'Get your W&B access token from here: https://wandb.ai/authorize')

    wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    Image: https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this:

    Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training a Custom Pretrained YOLOv7 Model

    Here, I am able to pass a number of arguments:

    • img: define input image size
    • batch: determine...
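
    Although the argument list is truncated here, a plausible training invocation with these flags looks like the sketch below (flag names follow the YOLOv7 repository's train.py; the epoch, batch, and image-size values are placeholders):

      !python train.py --img-size 640 --batch-size 16 --epochs 30 --data {dataset.location}/data.yaml --weights yolov7.pt --device 0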

  4. API_Websites

    • kaggle.com
    zip
    Updated Feb 5, 2023
    Cite
    chandrashekhar G T (2023). API_Websites [Dataset]. https://www.kaggle.com/datasets/chandrashekhargt/api-wwebsites
    Explore at:
    zip (750 bytes). Available download formats
    Dataset updated
    Feb 5, 2023
    Authors
    chandrashekhar G T
    Description

    APIs, or Application Programming Interfaces, allow developers to access the functionality of an application or service over the internet. To access an API, a developer would typically make a request using a specific URL or endpoint, along with any necessary authentication or parameters, and receive a response in a standardized format such as JSON. This response can then be used to integrate the API's functionality into another application or service. Many websites and web-based services offer APIs that allow developers to access data and perform actions, such as retrieving information about a user, posting content, or making a purchase. In order to access an API, a developer often needs to obtain an API key or access token, which serves as a unique identifier and enables the API to track usage and enforce rate limits.
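
    As a concrete illustration, a typical authenticated API call looks like this minimal Python sketch; the endpoint, key, and parameters are all hypothetical:

      import requests

      API_KEY = "YOUR_API_KEY"  # hypothetical key issued by the provider
      response = requests.get(
          "https://api.example.com/v1/users/42",           # hypothetical endpoint
          headers={"Authorization": f"Bearer {API_KEY}"},  # key sent as a bearer token
          params={"fields": "name,email"},                 # request parameters
      )
      response.raise_for_status()
      print(response.json())  # standardized JSON response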

  5. Metaverse Crypto Tokens Historical data 📊 📓

    • kaggle.com
    zip
    Updated Jul 12, 2022
    Cite
    Kash (2022). Metaverse Crypto Tokens Historical data 📊 📓 [Dataset]. https://www.kaggle.com/datasets/kaushiksuresh147/metaverse-cryptos-historical-data
    Explore at:
    zip (4442545 bytes). Available download formats
    Dataset updated
    Jul 12, 2022
    Authors
    Kash
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Image: https://i2.wp.com/www.mon-livret.fr/wp-content/uploads/2021/10/crypto-Metaverse-696x392.png?resize=696%2C392&ssl=1

    Context

    • The metaverse, a living and breathing space that blends physical and digital, is quickly evolving from a science fiction dream into a reality with endless possibilities. A world where people can interact virtually, create and exchange digital assets for real-world value, own digital land, engage with digitized real-world products and services, and much more.

    • Major tech giants are beginning to recognize the viability and potential of metaverses, following Facebook’s groundbreaking Meta rebrand announcement. In addition to tech companies, entertainment brands like Disney have also announced plans to take the leap into virtual reality.

    • While the media hype is deafening, your average netizen isn’t fully aware of what a metaverse is, how it operates and, most importantly—what benefits and opportunities it can offer them as a user.

    Image: https://cdn.images.express.co.uk/img/dynamic/22/590x/Metaverse-tokens-cryptocurrency-explained-ethereum-killers-new-coins-digital-currency-meta-news-1518777.jpg?r=1638256864800

    What Is The Metaverse?

    • In its digital iteration, a metaverse is a virtual world based on blockchain technology. This all-encompassing space allows users to work and play in a virtual reflection of real-life and fantasy scenarios, an online reality, ranging from sci-fi and dragons to more practical and familiar settings like shopping centers, offices, and even homes.

    • Users can access metaverses via computer, handheld device, or complete immersion with a VR headset. Those entering the metaverse get to experience living in a digital realm, where they will be able to work, play, shop, exercise, and socialize. Users will be able to create their own avatars based on face recognition, set up their own businesses of any kind, buy real estate, create in-world content and assets, and attend concerts from real-world superstars—all in one virtual environment.

    • With that said, a metaverse is a virtual world with a virtual economy. In most cases, it is an online reality powered by decentralized finance (DeFi), where users exchange value and assets via cryptocurrencies and Non-Fungible Tokens.

    What Are Metaverse Tokens?

    • Metaverse tokens are a unit of virtual currency used to make digital transactions within the metaverse. Since metaverses are built on the blockchain, transactions on underlying networks are near-instant. Blockchains are designed to ensure trust and security, making the metaverse the perfect environment for an economy free of corruption and financial fraud.

    • Holders of metaverse tokens can access multiple services and applications inside the virtual space. Some tokens give special in-game abilities. Other tokens represent unique items, like clothing for virtual avatars or membership for a community. If you’ve played MMO games like World of Warcraft, the concept of in-game items and currencies will be very familiar. However, unlike your traditional virtual world games, metaverse tokens have value inside and outside the virtual worlds. Metaverse tokens in the form of cryptocurrency can be exchanged for fiat currencies. Or, if they’re an NFT, they can be used to authenticate ownership of tethered real-world assets like collectibles, works of art, or even cups of coffee.

    • Some examples of metaverse tokens include SAND of the immensely popular Sandbox metaverse. In The Sandbox, users can create a virtual world driven by NFTs. Another token is MANA of the Decentraland project, where users can use MANA to purchase plots of digital real estate called “LAND”. It is even possible to monetize the plots of LAND purchased by renting them to other users for fixed fees. The ENJ token of the Enjin metaverse is the native asset of an ecosystem with the world’s largest game/app NFT networks.

    Dataset Information

    • The dataset covers 198 metaverse cryptos. Please refer to the file Metaverse coins.csv to find the list of metaverse crypto coins.

    • The dataset will be updated on a weekly basis with additional metaverse tokens. Stay tuned ⏳

  6. huggingface_hub

    • kaggle.com
    zip
    Updated Nov 4, 2024
    Cite
    Abhishek Thakur (2024). huggingface_hub [Dataset]. https://www.kaggle.com/datasets/abhishek/huggingface-hub
    Explore at:
    zip (4315332 bytes). Available download formats
    Dataset updated
    Nov 4, 2024
    Authors
    Abhishek Thakur
    Description

    Dataset

    This dataset was created by Abhishek Thakur

    Contents

  7. CrunchDAO Competition Unified Dataset

    • kaggle.com
    zip
    Updated Jun 15, 2023
    Cite
    Joakim Arvidsson (2023). CrunchDAO Competition Unified Dataset [Dataset]. https://www.kaggle.com/datasets/joebeachcapital/crunchdao-competition-unified-dataset
    Explore at:
    zip (183163058 bytes). Available download formats
    Dataset updated
    Jun 15, 2023
    Authors
    Joakim Arvidsson
    Description

    This data set is for creating predictive models for the CrunchDAO tournament. Registration is required in order to participate in the competition, and to be eligible to earn $CRUNCH tokens.

    See notebooks (Code tab) for how to import and explore the data, and build predictive models.

    See Terms of Use for data license.

  8. openwebtext_1M

    • kaggle.com
    zip
    Updated Mar 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tanay Mehta (2024). openwebtext_1M [Dataset]. https://www.kaggle.com/datasets/heyytanay/openwebtext-1m/code
    Explore at:
    zip (2043993317 bytes). Available download formats
    Dataset updated
    Mar 18, 2024
    Authors
    Tanay Mehta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A subset of Skylion007/openwebtext dataset consisting of 1 Million tokenized samples in Lance file format for blazing fast and memory efficient I/O.

    The files were tokenized using the gpt2 tokenizer with no extra tokens.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    Instructions for using this dataset

    This dataset is not meant to be used in Kaggle Kernels: Lance requires write access to the dataset's directory, Kaggle's input directory is read-only, and the dataset is too large to move to /kaggle/working. Hence, to use this dataset, you must download it with the Kaggle API or through this page, and then move the unzipped files to a folder called openwebtext_1M.lance. Below are detailed snippets on how to download and use this dataset.

    First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/openwebtext-1m
    $ mkdir openwebtext_1M.lance/
    $ unzip -qq openwebtext-1m.zip -d openwebtext_1M.lance/
    $ rm openwebtext-1m.zip
    

    Once this is done, you will find your dataset in the openwebtext_1M.lance/ folder. Now, to load the data and get a feel for it, run the snippet below.

    import lance

    # Open the Lance dataset directory and count its rows
    dataset = lance.dataset('openwebtext_1M.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of tokens in the dataset.
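
    To look at actual rows rather than just the count, LanceDataset.take fetches rows by index as a PyArrow table; the schema (column names) depends on how the samples were written, so inspect it first:

      import lance

      ds = lance.dataset('openwebtext_1M.lance/')
      tbl = ds.take([0, 1])  # first two rows, returned as a pyarrow.Table
      print(tbl.schema)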

  9. Ethereum Blockchain

    • kaggle.com
    zip
    Updated Mar 4, 2019
    Cite
    Google BigQuery (2019). Ethereum Blockchain [Dataset]. https://www.kaggle.com/datasets/bigquery/ethereum-blockchain
    Explore at:
    zip (0 bytes). Available download formats
    Dataset updated
    Mar 4, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Bitcoin and other cryptocurrencies have captured the imagination of technologists, financiers, and economists. Digital currencies are only one application of the underlying blockchain technology. Like its predecessor, Bitcoin, the Ethereum blockchain can be described as an immutable distributed ledger. However, creator Vitalik Buterin also extended the set of capabilities by including a virtual machine that can execute arbitrary code stored on the blockchain as smart contracts.

    Both Bitcoin and Ethereum are essentially OLTP databases, and provide little in the way of OLAP (analytics) functionality. However, the Ethereum dataset is notably distinct from the Bitcoin dataset:

    • The Ethereum blockchain has as its primary unit of value Ether, while the Bitcoin blockchain has Bitcoin. However, the majority of value transfer on the Ethereum blockchain is composed of so-called tokens. Tokens are created and managed by smart contracts.

    • Ether value transfers are precise and direct, resembling accounting ledger debits and credits. This is in contrast to the Bitcoin value transfer mechanism, for which it can be difficult to determine the balance of a given wallet address.

    • Addresses can be not only wallets that hold balances, but can also contain smart contract bytecode that allows the programmatic creation of agreements and automatic triggering of their execution. An aggregate of coordinated smart contracts could be used to build a decentralized autonomous organization.

    Content

    The Ethereum blockchain data are now available for exploration with BigQuery. All historical data are in the ethereum_blockchain dataset, which updates daily.

    Our hope is that by making the data on public blockchain systems more readily available, we will promote technological innovation and increase societal benefits.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_ethereum.[TABLENAME]. Fork this kernel to get started.
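
    For example, a minimal query sketch with the BigQuery Python client (the SQL is illustrative; the client must be authenticated against a GCP project):

      from google.cloud import bigquery

      client = bigquery.Client()
      query = """
          SELECT from_address, to_address, value
          FROM `bigquery-public-data.crypto_ethereum.transactions`
          ORDER BY block_timestamp DESC
          LIMIT 10
      """
      # Run the query and materialize the results as a pandas DataFrame
      df = client.query(query).to_dataframe()
      print(df.head())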

    Acknowledgements

    Cover photo by Thought Catalog on Unsplash

    Inspiration

    • What are the most popularly exchanged digital tokens, represented by ERC-721 and ERC-20 smart contracts?
    • Compare transaction volume and transaction networks over time
    • Compare transaction volume to historical prices by joining with other available data sources like Bitcoin Historical Data
  10. LOTTERY CASH

    • kaggle.com
    zip
    Updated May 13, 2022
    Cite
    OYENIYI ISAAC ENIOLA (2022). LOTTERY CASH [Dataset]. https://www.kaggle.com/datasets/yorlepro/lottery-cash
    Explore at:
    zip (590770 bytes). Available download formats
    Dataset updated
    May 13, 2022
    Authors
    OYENIYI ISAAC ENIOLA
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    There are two methods available for authentication: HTTP Basic and OAuth 2.0. For non-interactive applications, we only support HTTP Basic Authentication. We encourage all our developers of interactive applications to use the OAuth 2.0 workflow to authenticate their users.

    HTTP Basic Authentication is required when you are authenticating from a script that runs without interaction with the user, like your ETL tool, an update script, or any other data management automation.

    OAuth 2.0 is the preferred option for cases where you are building a web or mobile application that needs to perform actions on behalf of the user, like accessing data, and the interaction model allows you to present the user with a form to obtain their permission for the app to do so.

    Authenticating using HTTP Basic Authentication Requests can be authenticated using HTTP Basic Authentication. You can use your HTTP library’s Basic Auth feature to pass your credentials. All HTTP-basic-authenticated requests must be performed over a secure (https) connection. Authenticated requests made over an insecure connection will be denied.

    Users may use their username and password or an API key and secret pair to authenticate using Basic Authentication. Documentation on how to create and manage API keys can be found here.

    We recommend using API keys! They provide the following benefits:

    • Access Socrata APIs without the risk of embedding your username and password in scripts or code
    • Users on domains that require SSO (and thus without passwords) can access Socrata APIs
    • Create individual keys for different apps or jobs, so that if any one needs to be revoked or rotated, other apps are unaffected
    • Change your account password without disrupting apps, or rotate API keys without disrupting logins

    Here is a sample HTTP session that uses HTTP Basic Authentication:

      POST /resource/4tka-6guv.json HTTP/1.1
      Host: soda.demo.socrata.com
      Accept: */*
      Authorization: Basic [REDACTED]
      Content-Length: 253
      Content-Type: application/json
      X-App-Token: [REDACTED]

      [ { ... } ]

    Note that the Authorization header in this request will usually be generated via your HTTP library’s Basic Auth feature (as opposed to manually constructing the Base64 encoding of your credentials yourself). For example, if you’re using Python’s requests module, it supports Basic Authentication out of the box. Similarly, an API tool like Postman also handles Basic Authentication.
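
    For instance, a minimal sketch with requests; the resource, payload, and credentials are placeholders:

      import requests

      auth = ("API_KEY_ID", "API_KEY_SECRET")  # key id/secret pair from your account

      response = requests.post(
          "https://soda.demo.socrata.com/resource/4tka-6guv.json",
          json=[{"field": "value"}],                  # placeholder payload
          headers={"X-App-Token": "YOUR_APP_TOKEN"},  # application token
          auth=auth,                                  # HTTP Basic Authentication
      )
      print(response.status_code, response.json())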

    OAuth 2.0

    Note: When developing applications that make use of OAuth, you must provide a web-accessible callback URL when registering your application token. This can make it difficult to develop on a machine that isn't directly exposed to the Internet. One great option is to use a tool like ngrok to create a secure tunnel to expose your web application in a secure manner.

    Workflow

    We support a subset of OAuth 2.0 — the server-based flow with a callback URL — which we believe is more secure than the other flows in the specification. This OAuth flow is used by several other popular API services on the web. We have made the authentication flow similar to Google AuthSub.

    To authenticate with OAuth 2.0, you will first need to register your application, which will create an app token and a secret token. When registering your application, you must preregister your server by filling out the Callback Prefix field, so that we can be sure that access through your application is secure even if both your tokens are stolen. The Callback Prefix is the beginning of the URL that you will use as your redirect URL. Generally, you’ll want to provide as much of your callback URL as you can. For example, if your authentication callback is https://my-website.com/socrata-app/auth/callback, you might want to specify https://my-website.com/socrata-app as your Callback Prefix.

    Once you have an application and a secret token, you’ll be able to authenticate with the SODA OAuth 2.0 endpoint. You’ll first need to redirect the user to the Socrata-powered site you wish to access so that they may log in and approve your application. For example:

      https://soda.demo.socrata.com/oauth/authorize?client_id=YOUR_AUTH_TOKEN&response_type=code&redirect_uri=YOUR_REDIRECT_URI

    Note that the redirect_uri here must be an absolute, secure (https:) URI which starts with the Callback Prefix you specified when you registered your application. If any of these cases fail, the user will be shown an error indicating as much.

    Should the user authorize your application, they will be redirected back to your redirect_uri. For example, if I provide https://my-website.com/socrata-app/auth/callback as my redirect_uri, the user will be redirected to this URL:

      https://my-website.com/socrata-app/auth/callback?code=CODE

    where CODE is an authorization code that you will use later.

    If your redirect_uri contains a querystring, it will be preserved, and the code parameter will...

  11. Cloud Accounting Integrity Log Dataset

    • kaggle.com
    zip
    Updated Nov 5, 2025
    Cite
    Python Developer (2025). Cloud Accounting Integrity Log Dataset [Dataset]. https://www.kaggle.com/datasets/programmer3/cloud-accounting-integrity-log-dataset
    Explore at:
    zip (109578 bytes). Available download formats
    Dataset updated
    Nov 5, 2025
    Authors
    Python Developer
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Cloud Accounting Integrity Verification Dataset (CAIVD) contains 1,900 simulated accounting and cloud platform log entries, designed for evaluating cloud data integrity verification algorithms. Each record represents a real-world accounting system event — including insert, update, delete, view, and approve operations — enriched with financial, network, and system-level metadata. This dataset supports experiments in data integrity auditing, cloud computing performance analysis, and intelligent accounting verification.

    Key Features

    • Record_ID: Unique identifier for each log entry.
    • Timestamp: Date and time when the accounting or cloud event occurred.
    • User_ID: Randomized user identifier (represents an accountant, auditor, or automated process).
    • Action_Type: Type of operation performed in the accounting system: insert, update, delete, view, or approve.
    • Transaction_Amount: Financial amount involved in the accounting transaction.
    • Account_Category: Account classification: Assets, Liabilities, Revenue, or Expense.
    • Approval_Status: Approval state of the transaction: Pending, Approved, or Rejected.
    • Device_ID: Identifier for the source terminal or accounting device.
    • IP_Address: Client network address from which the transaction originated.
    • Cloud_Node: Cloud region or node processing the accounting request: Node-A, Node-B, or Node-C.
    • CPU_Usage: CPU utilization (%) of the cloud node during transaction processing.
    • Memory_Usage: Memory usage (%) of the cloud node at event time.
    • Request_Latency: Observed network delay (milliseconds) for the event request.
    • Encryption_Flag: Indicates encryption status of the record (1 = encrypted, 0 = not encrypted).
    • Access_Token_Validity: Authentication token status: Valid, Expired, or Suspicious.
    • Block_Size_KB: Constant value = 1 KB; size of each data block used in verification tests.
    • File_Size_MB: File size (MB) used in simulation: 200, 400, 600, 800, 1000, 1200, or 1400 MB.
    • Total_Blocks: Computed as File_Size_MB × 1024; total number of 1 KB data blocks per experiment.

  12. nemotron-3-8b-base-4k

    • kaggle.com
    zip
    Updated Aug 31, 2024
    Cite
    Serhii Kharchuk (2024). nemotron-3-8b-base-4k [Dataset]. https://www.kaggle.com/datasets/serhiikharchuk/nemotron-3-8b-base-4k
    Explore at:
    zip (13688476176 bytes). Available download formats
    Dataset updated
    Aug 31, 2024
    Authors
    Serhii Kharchuk
    Description

    Nemotron-3-8B-Base-4k Model Overview

    License

    The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement.

    Description

    Nemotron-3-8B-Base-4k is a large language foundation model for enterprises to build custom LLMs. This foundation model has 8 billion parameters, and supports a context length of 4,096 tokens. Nemotron-3-8B-Base-4k is part of Nemotron-3, which is a family of enterprise ready generative text models compatible with NVIDIA NeMo Framework. For other models in this collection, see the collections page.

    NVIDIA NeMo is an end-to-end, cloud-native platform to build, customize, and deploy generative AI models anywhere. It includes training and inferencing frameworks, guardrailing toolkits, data curation tools, and pretrained models, offering enterprises an easy, cost-effective, and fast way to adopt generative AI. To get access to NeMo Framework, please sign up at this link.

    References

    Announcement Blog

    Model Architecture

    Architecture Type: Transformer

    Network Architecture: Generative Pre-Trained Transformer (GPT-3)

    Software Integration

    Runtime Engine(s): NVIDIA AI Enterprise

    Toolkit: NeMo Framework

    To get access to NeMo Framework, please sign up at this link. See the NeMo inference container documentation for details on how to set up and deploy an inference server with NeMo.

    Sample Inference Code:

      from nemo.deploy import NemoQuery

      # In this case, we run inference on the same machine
      nq = NemoQuery(url="localhost:8000", model_name="Nemotron-3-8B-4K")

      output = nq.query_llm(prompts=["The meaning of life is"],
                            max_output_token=200, top_k=1, top_p=0.0, temperature=0.1)
      print(output)

    Supported Hardware:

    • H100
    • A100 80GB, A100 40GB

    Model Version(s)

    Nemotron-3-8B-base-4k-BF16-1

    Dataset & Training

    The model uses a learning rate of 3e-4 with a warm-up period of 500M tokens and a cosine learning rate annealing schedule for 95% of the total training tokens. The decay stops at a minimum learning rate of 3e-5. The model is trained with a sequence length of 4096 and uses FlashAttention’s Multi-Head Attention implementation. 1,024 A100s were used for 19 days to train the model.

    NVIDIA models are trained on a diverse set of public and proprietary datasets. This model was trained on a dataset containing 3.8 trillion tokens of text. The dataset contains 53 different human languages (including English, German, Russian, Spanish, French, Japanese, Chinese, Italian, and Dutch) and 37 programming languages. The model also uses the training subsets of downstream academic benchmarks from sources like FLANv2, P3, and NaturalInstructions v2. NVIDIA is committed to the responsible development of large language models and conducts reviews of all datasets included in training.

    Evaluation

    | Task | Num-shot | Score |
    |:--------------|:----------------|:------|
    | MMLU* | 5 | 54.4 |
    | WinoGrande | 0 | 70.9 |
    | Hellaswag | 0 | 76.4 |
    | ARC Easy | 0 | 72.9 |
    | TyDiQA-GoldP** | 1 | 49.2 |
    | Lambada | 0 | 70.6 |
    | WebQS | 0 | 22.9 |
    | PiQA | 0 | 80.4 |
    | GSM8K | 8-shot w/ maj@8 | 39.4 |

    * The calculation of MMLU follows the original implementation. See Hugging Face’s explanation of different implementations of MMLU.

    ** The languages used are Arabic, Bangla, Finnish, Indonesian, Korean, Russian and Swahili.

    Intended use

    This is a completion model. For best performance, users are encouraged to customize it using the NeMo Framework suite of customization tools, including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA) and SFT/RLHF. For chat use cases, please consider using the Nemotron-3-8B chat variants.

    Ethical use

    Technology can have a profound impact on people and the world, and NVIDIA is committed to enabling trust and transparency in AI development. NVIDIA encourages users to adopt principles of AI ethics and trustworthiness to guide your business decisions by following the guidelines in the NVIDIA AI Foundation Models Community License Agreement.

    Limitations

    • The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts.
    • The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even if the prompt itself does not include anything explicitly offensive.
  13. South Africa COVID-19 Twitter Posts Dataset

    • kaggle.com
    zip
    Updated Jul 4, 2022
    Cite
    Blessing Ogbuokiri (2022). South Africa COVID-19 Twitter Posts Dataset [Dataset]. https://www.kaggle.com/datasets/ogbuokiriblessing/tweetdatasa
    Explore at:
    zip (1713167 bytes). Available download formats
    Dataset updated
    Jul 4, 2022
    Authors
    Blessing Ogbuokiri
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    South Africa
    Description

    This dataset contains Twitter posts containing daily updates of location-based COVID–19 vaccine-related tweets from January 2021 to August 2021.

    With an existing Twitter account, we applied for Developer Access and were granted access to the Twitter Academic Researcher API, which allows for over 10 million tweets per month. Then, we created an application to generate the API credentials (access tokens) from Twitter. The access token was used in a Python (v3.6) script to authenticate and establish a connection to the Twitter database. To get geo-tagged vaccine-related tweets, we used the Python script we developed to perform a historical search (archive search) of vaccine-related keywords with the place country set to South Africa (ZA). By geo-tagged tweets, we refer to Twitter posts with a known location. These vaccine-related keywords include but are not limited to vaccine, anti-vaxxer, vaccination, AstraZeneca, Oxford-AstraZeneca, IChooseVaccination, VaccineToSaveSouthAfrica, JohnsonJohnson, and Pfizer. The keywords were selected from the trending topics during the period of discussion. A complete list of the keywords is shown below:

    Oxford-AstraZeneca, AstraZeneca, JohnsonJohnson, Vaccine, BioNTech, anti-vaccine, jab, Vaccination, Covax, Vaccine Rollout, Sputnik, VaccineToSaveSouthAfrica, IChooseVaccination, TeachersVaccine, AstraZeneca vaccine, Pfizer, J & J, Johonson & Johnson, Moderna, VaccinesWork, VacciNation, Vaccine, Steriod, COVIDvaccine, covax, VaccineEquity, VaccineReady, Jab OR PfizerGang, Scamdemic, Plandemic, Scaredemic, COVID-19, coronavirus, SARS-CoV-2, anti-vaxxers, jab, Pfizer, BioNTech, JJ, Vaccine, JohnsonJohnson Vaccine, Vaccine Rollout, J & J, Sputnik, COVAX, CoronaVac

    The preferred language of the tweets is English.
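
    A minimal sketch of such an archive search against the Twitter API v2 full-archive endpoint; the bearer token is a placeholder, and the query is a simplified illustration of the keyword list above:

      import requests

      BEARER_TOKEN = "YOUR_ACADEMIC_BEARER_TOKEN"  # placeholder
      response = requests.get(
          "https://api.twitter.com/2/tweets/search/all",
          headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
          params={
              "query": "(vaccine OR AstraZeneca OR Pfizer) place_country:ZA",
              "start_time": "2021-01-01T00:00:00Z",
              "end_time": "2021-08-31T23:59:59Z",
              "max_results": 100,
          },
      )
      for tweet in response.json().get("data", [])[:3]:
          print(tweet["id"], tweet["text"])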

  14. Movies Info

    • kaggle.com
    zip
    Updated Aug 1, 2025
    Cite
    Rushil Dhingra (2025). Movies Info [Dataset]. https://www.kaggle.com/datasets/rushildhingra25/movies-info/code
    Explore at:
    zip (1292894 bytes). Available download formats
    Dataset updated
    Aug 1, 2025
    Authors
    Rushil Dhingra
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains detailed information about top-rated movies, retrieved directly from The Movie Database (TMDB) using its official public API.

    Data Source

    • API Endpoint Used: https://api.themoviedb.org/3/movie/top_rated
    • Authorization: Accessed using TMDB-provided Bearer Token Authentication (see the sketch after this list)
    • Genre Mapping: Genre names were mapped from the TMDB /genre/movie/list endpoint using the provided genre IDs.
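
    A minimal retrieval sketch for this endpoint (the token is a placeholder you generate in your TMDB account settings):

      import requests

      TMDB_TOKEN = "YOUR_TMDB_READ_ACCESS_TOKEN"  # placeholder bearer token
      response = requests.get(
          "https://api.themoviedb.org/3/movie/top_rated",
          headers={"Authorization": f"Bearer {TMDB_TOKEN}"},
          params={"language": "en-US", "page": 1},
      )
      response.raise_for_status()
      for movie in response.json()["results"][:3]:
          print(movie["title"], movie["genre_ids"])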

    Dataset Contents

    Each row in the dataset represents a single movie and includes the following fields:

    | Column | Description |
    |:------------|:------------------------------------------------------------------|
    | title | The official title of the movie |
    | overview | A short description or plot summary of the movie |
    | genre_ids | A list of genre IDs associated with the movie (from TMDB) |
    | genre_names | The corresponding genre names for the movie (e.g. Action, Drama) |
    | genre_str | A comma-separated string of genres for easy text processing |

    Use Cases

    This dataset is ideal for a variety of educational and practical machine learning tasks, including but not limited to:

    • Natural Language Processing (NLP):

      • Text cleaning and preprocessing (lowercasing, stopword removal, etc.)
      • Tokenization and embedding
      • TF-IDF or Word2Vec vectorization
      • Genre classification based on movie descriptions
      • Clustering similar movies using text similarity
    • Data Visualization:

      • Word clouds from overviews
      • Genre frequency analysis
      • Sentiment trends across genres
    • Machine Learning Projects:

      • Supervised learning: Predict movie genres from descriptions
      • Unsupervised learning: Cluster movies based on plot similarity
      • Recommendation engines (content-based filtering)

    Note

    This dataset is intended for learning and educational purposes only. Please adhere to TMDB's API terms of use when using this data in any public or commercial setting.

  15. CommonsenseQA (Multiple-Choice Q&A)

    • kaggle.com
    zip
    Updated Nov 21, 2022
    Cite
    The Devastator (2022). CommonsenseQA (Multiple-Choice Q&A) [Dataset]. https://www.kaggle.com/datasets/thedevastator/new-commonsenseqa-dataset-for-multiple-choice-qu
    Explore at:
    zip (712030 bytes). Available download formats
    Dataset updated
    Nov 21, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CommonsenseQA (Multiple-Choice Q&A)

    12,102 questions with one correct answer and four distractor answers

    Source

    Huggingface Hub: link

    About this dataset

    CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing set splits: the "Random split", which is the main evaluation split, and the "Question token split"; see the paper for details.

    How to use the dataset

    Research Ideas

    • This dataset can be used to train a model to predict the correct answers to multiple-choice questions.
    • This dataset can be used to evaluate the performance of different models on the CommonsenseQA dataset.
    • This dataset can be used to discover new types of commonsense knowledge required to predict the correct answers to questions in the CommonsenseQA dataset

    Acknowledgements

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv

    | Column name | Description |
    |:------------|:----------------------------------------------------------------|
    | answerKey | The correct answer to the question. (String) |
    | choices | The four possible answers for each question. (List of strings) |

    File: train.csv

    | Column name | Description |
    |:------------|:----------------------------------------------------------------|
    | answerKey | The correct answer to the question. (String) |
    | choices | The four possible answers for each question. (List of strings) |

    File: test.csv

    | Column name | Description |
    |:------------|:----------------------------------------------------------------|
    | answerKey | The correct answer to the question. (String) |
    | choices | The four possible answers for each question. (List of strings) |

  16. All Crypto Data - Every 12 hrs

    • kaggle.com
    zip
    Updated Mar 6, 2018
    Cite
    Idan Erez (2018). All Crypto Data - Every 12 hrs [Dataset]. https://www.kaggle.com/idanerez/all-cryoto-data-every-12-hrs
    Explore at:
    zip (2002206 bytes). Available download formats
    Dataset updated
    Mar 6, 2018
    Authors
    Idan Erez
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I started collecting data on all cryptocurrencies and tokens from CoinMarketCap on a 12-hour cycle. I plan to upload it every weekend. If you need it sooner for your analysis, please PM me.

    Content

    All available cryptocurrency data from CoinMarketCap: Symbol, Rank, Price USD, Price BTC, Market Cap, Date, Time.

    Updated every 12 hours - update time is in EST.

    Acknowledgements

    Thanks to coinmarketcap.com for the API access.

    Inspiration

    What correlations are there? Is there a low-risk portfolio? What are the signals before crypto rises or crashes?
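
    For example, a correlation sketch with pandas; the CSV file name and the exact column labels are assumptions about the archive's contents:

      import pandas as pd

      df = pd.read_csv("all_crypto_data.csv")  # assumed file name

      # One price column per symbol, indexed by date
      prices = df.pivot_table(index="Date", columns="Symbol", values="Price USD")

      # Correlation of 12-hourly returns between coins
      print(prices.pct_change().corr().head())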

  17. Mutant Ape Yacht Club NFT Images

    • kaggle.com
    zip
    Updated Dec 6, 2023
    Cite
    The Devastator (2023). Mutant Ape Yacht Club NFT Images [Dataset]. https://www.kaggle.com/datasets/thedevastator/mutant-ape-yacht-club-nft-images
    Explore at:
    zip (1766126750 bytes). Available download formats
    Dataset updated
    Dec 6, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Mutant Ape Yacht Club NFT Images

    NFT image dataset with Mutant Ape Yacht Club artwork and metadata

    By huggingnft (From Huggingface) [source]

    About this dataset

    The NFT Images Dataset for Unconditional Generation, specifically focused on the Mutant Ape Yacht Club, offers valuable information and resources for artificial intelligence and machine learning enthusiasts. This dataset provides comprehensive details about NFT artwork, including high-quality images and associated metadata.

    The dataset includes columns such as image and image_original_url, providing direct access to the image files in their original form through valid web addresses. These image files represent unique pieces of digital artwork from the Mutant Ape Yacht Club collection.

    Moreover, this dataset offers crucial insights into each NFT artwork through the token_metadata column. This metadata encompasses various essential details regarding the specific piece of art, such as artist information, detailed descriptions, and unique attributes associated with each NFT. These attributes help differentiate one piece of art from another in terms of style, theme, rarity factors, or any other distinctive characteristics.

    By utilizing this comprehensive dataset's resources in various AI applications like unconditional generation models or machine learning algorithms, users gain access to a wide range of digital artworks for research or creative purposes. Additionally, with accurate token metadata available alongside each image file, users can explore diverse aspects that contribute to the essence of these digital creations.

    How to use the dataset

    Introduction:

    • Dataset Overview:

      • The dataset contains information about NFT images from the Mutant Ape Yacht Club, including image URLs, IDs, token metadata, and original image URLs.
      • Each artwork in the dataset is represented by an image file and its corresponding metadata.
    • Accessing the Dataset:

      • To access this dataset, download or import the provided train.csv file.
    • Understanding Key Columns:

      • image (Image file): This column contains the image file of each artwork in the NFT collection.
      • token_metadata (Text): This column includes metadata associated with each artwork such as artist details, description, and attributes.
      • image_original_url (URL): Provides the original URL of each image file in case you need to refer back to it or access additional information.
    • Potential Use Cases:

      • Unconditional Generation: The dataset can be used for unconditional generation tasks like training generative models or running experiments on creating novel artworks based on existing ones.
    • Preprocessing Steps: Before using this dataset for unconditional generation tasks, consider performing some preprocessing steps such as:

      a) Image Processing: Resize or normalize images for consistent input dimensions if required by your model architecture.

      b) Text Processing: Clean token_metadata column if needed by removing special characters or irrelevant text that may hinder model training.

    • Exploratory Data Analysis (EDA): Conducting EDA provides insights into patterns within the database that might help you understand art concepts better or optimize your unconditional generation models. Some possible EDA tasks include:

      a) Image Visualization: Display a subset of images to get a visual understanding of the artworks.

      b) Metadata Analysis: Analyze the distribution or correlation between different attributes mentioned in the token_metadata column.

    • Training Unconditional Generation Models: Use this dataset to train generative models such as Variational Autoencoders (VAE), Generative Adversarial Networks (GANs), or other deep learning architectures for creative artwork generation.

    • Iterative Model Improvements: Experiment with different model architectures, hyperparameters, and loss functions to enhance the

    Research Ideas

    • Unconditional Image Generation: This dataset can be used for training generative models to create new and unique NFT artwork. By training a model on this dataset, it can learn the patterns, styles, and attributes of Mutant Ape Yacht Club NFTs and generate new images in a similar style.
    • Artistic Style Transfer: The token_metadata associated with each NFT artwork can provide valuable information about the artist, description, and attributes of the image. This metadata can be used for artistic style transfer techniques to apply the unique style of Mutant Ape Yacht Club artworks to other im...
  18. OpenCorpora: Russian

    • kaggle.com
    zip
    Updated Sep 12, 2017
    Cite
    Rachael Tatman (2017). OpenCorpora: Russian [Dataset]. https://www.kaggle.com/rtatman/opencorpora-russian
    Explore at:
    zip (26456197 bytes). Available download formats
    Dataset updated
    Sep 12, 2017
    Authors
    Rachael Tatman
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context:

    “Russian is an East Slavic language and an official language in Russia, Belarus, Kazakhstan, Kyrgyzstan and many minor or unrecognised territories. It is an unofficial but widely spoken language in Ukraine and Latvia, and to a lesser extent, the other countries that were once constituent republics of the Soviet Union and former participants of the Eastern Bloc.” -- “Russian Language” on Wikipedia

    Russian has around 150 million native speakers and 110 million non-native speakers. Russian is written in Cyrillic script. This dataset is a morphologically, syntactically and semantically annotated corpus of texts in Russian, fully accessible to researchers and edited by users.

    Content:

    This dataset is encoded in UTF-8. There are two files included in this dataset: the corpus and the dictionary. The corpus is in .json format, while the dictionary is in plain text.

    Dictionary

    In the dictionary, each entry is a lemma, presented with all of its tagged derivations. The tags depend on the part of speech of the lemma. Some examples are:

    • Nouns: part of speech, animacy, gender & number, case
    • Verbs: Part of speech, aspect, transitivity, gender & number, person, tense, mood
    • Adjectives: part of speech (ADJF), gender, number, case

    A Python script to convert the tags in this corpus to the set more commonly used in English-language linguistics can be found here.

    Sample dictionary entries:

    1
    ЁЖ NOUN,anim,masc sing,nomn
    ЕЖА NOUN,anim,masc sing,gent
    ЕЖУ NOUN,anim,masc sing,datv
    ЕЖА NOUN,anim,masc sing,accs
    ЕЖОМ  NOUN,anim,masc sing,ablt
    ЕЖЕ NOUN,anim,masc sing,loct
    ЕЖИ NOUN,anim,masc plur,nomn
    ЕЖЕЙ  NOUN,anim,masc plur,gent
    ЕЖАМ  NOUN,anim,masc plur,datv
    
    41
    ЁРНИЧАЮ VERB,impf,intr sing,1per,pres,indc
    ЁРНИЧАЕМ  VERB,impf,intr plur,1per,pres,indc
    ЁРНИЧАЕШЬ  VERB,impf,intr sing,2per,pres,indc
    ЁРНИЧАЕТЕ  VERB,impf,intr plur,2per,pres,indc
    ЁРНИЧАЕТ  VERB,impf,intr sing,3per,pres,indc
    ЁРНИЧАЮТ  VERB,impf,intr plur,3per,pres,indc
    ЁРНИЧАЛ VERB,impf,intr masc,sing,past,indc
    ЁРНИЧАЛА  VERB,impf,intr femn,sing,past,indc
    ЁРНИЧАЛО  VERB,impf,intr neut,sing,past,indc
    ЁРНИЧАЛИ  VERB,impf,intr plur,past,indc
    ЁРНИЧАЙ VERB,impf,intr sing,impr,excl
    ЁРНИЧАЙТЕ  VERB,impf,intr plur,impr,excl
    

    Corpus

    In this corpus, each word has been grammatically tagged. You can access individual tokens using the following general path:

    JSON > text > paragraphs > paragraph > [paragraph number] > sentence > [sentence number] > tokens > [token number]

    Each token has:

    • A unique id number (@id)
    • The text of the token (@text)
    • Information on the lemma (under “l”), including the id number of the lemma as found in the dictionary

    You can see an example of the token portion of the .json structure below:

           {
            "@id": 1714292,
            "@text": "сват",
            "tfr": {
             "@t": "сват",
             "@rev_id": 3754311,
             "v": {
              "l": {
               "@id": 314741,
               "@t": "сват",
               "g": [
                {
                 "@v": "NOUN"
                },
                {
                 "@v": "anim"
                },
                {
                 "@v": "masc"
                },
                {
                 "@v": "sing"
                },
                {
                 "@v": "nomn"
                }
               ]
              }
             }
          }
         }
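
    For programmatic access, here is a minimal sketch following the path above; the corpus file name is an assumption about the zip's contents, and the exact nesting under "tokens" may differ slightly from this reading of the path:

      import json

      # File name is an assumption; use the corpus .json file from the zip
      with open("opencorpora.json", encoding="utf-8") as f:
          corpus = json.load(f)

      # JSON > text > paragraphs > paragraph > [0] > sentence > [0] > tokens > [0]
      token = corpus["text"]["paragraphs"]["paragraph"][0]["sentence"][0]["tokens"][0]
      print(token["@id"], token["@text"])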
    

    Acknowledgements:

    This dataset was collected and annotated by, among others, Svetlana Alekseeva, Anastasia Bodrova, Victor Bocharov, Dmitry Granovsky, Irina Krylova, Maria Nikolaeva, Catherine Protopopova, Alexander Chuchunkov, Anastasia Shimorina, Vasily Alekseev, Natalia Ostapuk, Maria Stepanova and Alexey Surikov. The code used to collect and clean this data is available online.

    It is reproduced here under a CC-BY-SA license.

    More information on this corpus and its most recent version can be found here (in Russian).
