18 datasets found
  1. authtoken

    • kaggle.com
    zip
    Updated Aug 18, 2023
    Cite
    Sa Atmax (2023). authtoken [Dataset]. https://www.kaggle.com/datasets/saatmax/authtoken
    Explore at:
    zip (1458 bytes). Available download formats
    Dataset updated
    Aug 18, 2023
    Authors
    Sa Atmax
    Description

    Dataset

    This dataset was created by Sa Atmax

    Contents

  2. ShutterStock Dataset for AI vs Human-Gen. Image

    • kaggle.com
    zip
    Updated Jun 19, 2025
    Cite
    Sachin Singh (2025). ShutterStock Dataset for AI vs Human-Gen. Image [Dataset]. https://www.kaggle.com/datasets/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
    Explore at:
    zip (11617243112 bytes). Available download formats
    Dataset updated
    Jun 19, 2025
    Authors
    Sachin Singh
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    ShutterStock AI vs. Human-Generated Image Dataset

    This dataset is curated to facilitate research in distinguishing AI-generated images from human-created ones, leveraging ShutterStock data. As AI-generated imagery becomes more sophisticated, developing models that can classify and analyze such images is crucial for applications in content moderation, digital forensics, and media authenticity verification.

    Dataset Overview:

    • Total Images: 100,000
    • Training Data: 80,000 images (majority AI-generated)
    • Test Data: 20,000 images
    • Image Sources: A mix of AI-generated images and real photographs or illustrations created by human artists
    • Labeling: Each image is labeled as either AI-generated or human-created

    Potential Use Cases:

    • AI-Generated Image Detection: Train models to distinguish between AI and human-made images.
    • Deep Learning & Computer Vision Research: Develop and benchmark CNNs, transformers, and other architectures.
    • Generative Model Evaluation: Compare AI-generated images to real images for quality assessment.
    • Digital Forensics: Identify synthetic media for applications in fake image detection.
    • Ethical AI & Content Authenticity: Study the impact of AI-generated visuals in media and ensure transparency.

    Why This Dataset?

    With the rise of generative AI models like Stable Diffusion, DALL·E, and MidJourney, the ability to differentiate between synthetic and real images has become a crucial challenge. This dataset offers a structured way to train AI models on this task, making it a valuable resource for both academic research and practical applications.

    Explore the dataset and contribute to advancing AI-generated content detection!

    Step 1: Install and Authenticate Kaggle API

    If you haven't installed the Kaggle API, run:
      pip install kaggle

    Then, download your kaggle.json API key from your Kaggle account page and move it to ~/.kaggle/ (Linux/Mac) or C:\Users\YourUser\.kaggle\ (Windows).

    Step 2: Download the Dataset

    The kaggle.json key file holds a username/key pair rather than a bearer token, so authenticate with HTTP Basic auth against the Kaggle download endpoint:

      curl -L -u "$(jq -r .username ~/.kaggle/kaggle.json):$(jq -r .key ~/.kaggle/kaggle.json)" "https://www.kaggle.com/api/v1/datasets/download/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image" -o dataset.zip

    Equivalently, the Kaggle CLI handles authentication for you: kaggle datasets download -d shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
    

    Step 3: Extract the Dataset

    Once downloaded, extract the dataset using:
      unzip dataset.zip -d dataset_folder

    Now your dataset is ready to use! 🚀
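
    As a quick sanity check after extraction, the images can be loaded with torchvision; this is a minimal sketch that assumes the extracted archive follows the usual train/ layout with one subfolder per label:

      from torchvision import datasets, transforms

      # Resize to a common size and convert to tensors
      tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

      # Assumed layout: dataset_folder/train/<label>/<image files>
      train_ds = datasets.ImageFolder("dataset_folder/train", transform=tfm)
      print(len(train_ds), train_ds.classes)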

  3. Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/1
    Explore at:
    zip. Available download formats
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 🔥

    Some Important links

    Contact Information

    Objective

    To showcase custom object detection on the given dataset by training and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display

    # Render the Roboflow README file that ships with the dataset
    display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
    

    Custom Training with YOLOv7 🔥

    In this notebook, I have preprocessed the images with Roboflow, because the COCO-formatted dataset had images of varying dimensions and was not split into separate sets. To train a custom YOLOv7 model, we need to teach it to recognize the objects in the dataset. To do so, I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    📦 YOLOv7

    Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading YOLOV7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    


    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
        user_secrets = UserSecretsClient()
        wandb_api_key = user_secrets.get_secret("wandb_api")
        wandb.login(key=wandb_api_key)
        anonymous = None
    except Exception:
        # Fall back to anonymous logging when no W&B secret is configured
        wandb.login(anonymous='must')
        print('To use your W&B account, go to Add-ons -> Secrets and provide your '
              'W&B access token with the label name WANDB. '
              'Get your W&B access token from here: https://wandb.ai/authorize')

    wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    Image: https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this:

    Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training a Custom Pretrained YOLOv7 Model

    Here, I am able to pass a number of arguments:

    • img: define input image size
    • batch: determine...
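
    Although the argument list is truncated here, a plausible training invocation with these flags looks like the sketch below (flag names follow the YOLOv7 repository's train.py; the epoch, batch, and image-size values are placeholders):

      !python train.py --img-size 640 --batch-size 16 --epochs 30 --data {dataset.location}/data.yaml --weights yolov7.pt --device 0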

  4. API_Websites

    • kaggle.com
    zip
    Updated Feb 5, 2023
    Cite
    chandrashekhar G T (2023). API_Websites [Dataset]. https://www.kaggle.com/datasets/chandrashekhargt/api-wwebsites
    Explore at:
    zip (750 bytes). Available download formats
    Dataset updated
    Feb 5, 2023
    Authors
    chandrashekhar G T
    Description

    APIs, or Application Programming Interfaces, allow developers to access the functionality of an application or service over the internet. To access an API, a developer would typically make a request using a specific URL or endpoint, along with any necessary authentication or parameters, and receive a response in a standardized format such as JSON. This response can then be used to integrate the API's functionality into another application or service. Many websites and web-based services offer APIs that allow developers to access data and perform actions, such as retrieving information about a user, posting content, or making a purchase. In order to access an API, a developer often needs to obtain an API key or access token, which serves as a unique identifier and enables the API to track usage and enforce rate limits.
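
    As a concrete illustration, a typical authenticated API call looks like this minimal Python sketch; the endpoint, key, and parameters are all hypothetical:

      import requests

      API_KEY = "YOUR_API_KEY"  # hypothetical key issued by the provider
      response = requests.get(
          "https://api.example.com/v1/users/42",           # hypothetical endpoint
          headers={"Authorization": f"Bearer {API_KEY}"},  # key sent as a bearer token
          params={"fields": "name,email"},                 # request parameters
      )
      response.raise_for_status()
      print(response.json())  # standardized JSON response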

  5. Metaverse Crypto Tokens Historical data 📊 📓

    • kaggle.com
    zip
    Updated Jul 12, 2022
    Cite
    Kash (2022). Metaverse Crypto Tokens Historical data 📊 📓 [Dataset]. https://www.kaggle.com/datasets/kaushiksuresh147/metaverse-cryptos-historical-data
    Explore at:
    zip (4442545 bytes). Available download formats
    Dataset updated
    Jul 12, 2022
    Authors
    Kash
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Image: https://i2.wp.com/www.mon-livret.fr/wp-content/uploads/2021/10/crypto-Metaverse-696x392.png?resize=696%2C392&ssl=1

    Context

    • The metaverse, a living and breathing space that blends physical and digital, is quickly evolving from a science fiction dream into a reality with endless possibilities. A world where people can interact virtually, create and exchange digital assets for real-world value, own digital land, engage with digitized real-world products and services, and much more.

    • Major tech giants are beginning to recognize the viability and potential of metaverses, following Facebook’s groundbreaking Meta rebrand announcement. In addition to tech companies, entertainment brands like Disney have also announced plans to take the leap into virtual reality.

    • While the media hype is deafening, your average netizen isn’t fully aware of what a metaverse is, how it operates and, most importantly—what benefits and opportunities it can offer them as a user.

    Image: https://cdn.images.express.co.uk/img/dynamic/22/590x/Metaverse-tokens-cryptocurrency-explained-ethereum-killers-new-coins-digital-currency-meta-news-1518777.jpg?r=1638256864800

    What Is The Metaverse?

    • In its digital iteration, a metaverse is a virtual world based on blockchain technology. This all-encompassing space allows users to work and play in a virtual reflection of real-life and fantasy scenarios, an online reality, ranging from sci-fi and dragons to more practical and familiar settings like shopping centers, offices, and even homes.

    • Users can access metaverses via computer, handheld device, or complete immersion with a VR headset. Those entering the metaverse get to experience living in a digital realm, where they will be able to work, play, shop, exercise, and socialize. Users will be able to create their own avatars based on face recognition, set up their own businesses of any kind, buy real estate, create in-world content and assets, and attend concerts from real-world superstars—all in one virtual environment.

    • With that said, a metaverse is a virtual world with a virtual economy. In most cases, it is an online reality powered by decentralized finance (DeFi), where users exchange value and assets via cryptocurrencies and Non-Fungible Tokens.

    What Are Metaverse Tokens?

    • Metaverse tokens are a unit of virtual currency used to make digital transactions within the metaverse. Since metaverses are built on the blockchain, transactions on underlying networks are near-instant. Blockchains are designed to ensure trust and security, making the metaverse the perfect environment for an economy free of corruption and financial fraud.

    • Holders of metaverse tokens can access multiple services and applications inside the virtual space. Some tokens give special in-game abilities. Other tokens represent unique items, like clothing for virtual avatars or membership for a community. If you’ve played MMO games like World of Warcraft, the concept of in-game items and currencies will be very familiar. However, unlike your traditional virtual world games, metaverse tokens have value inside and outside the virtual worlds. Metaverse tokens in the form of cryptocurrency can be exchanged for fiat currencies. Or, if they’re an NFT, they can be used to authenticate ownership of tethered real-world assets like collectibles, works of art, or even cups of coffee.

    • Some examples of metaverse tokens include SAND of the immensely popular Sandbox metaverse. In The Sandbox, users can create a virtual world driven by NFTs. Another token is MANA of the Decentraland project, where users can use MANA to purchase plots of digital real estate called “LAND”. It is even possible to monetize the plots of LAND purchased by renting them to other users for fixed fees. The ENJ token of the Enjin metaverse is the native asset of an ecosystem with the world’s largest game/app NFT networks.

    Dataset Information

    • The dataset covers 198 metaverse cryptos. Please refer to the file Metaverse coins.csv to find the list of metaverse crypto coins.

    • The dataset will be updated on a weekly basis with additional metaverse tokens. Stay tuned ⏳

  6. huggingface_hub

    • kaggle.com
    zip
    Updated Nov 4, 2024
    Cite
    Abhishek Thakur (2024). huggingface_hub [Dataset]. https://www.kaggle.com/datasets/abhishek/huggingface-hub
    Explore at:
    zip (4315332 bytes). Available download formats
    Dataset updated
    Nov 4, 2024
    Authors
    Abhishek Thakur
    Description

    Dataset

    This dataset was created by Abhishek Thakur

    Contents

  7. CrunchDAO Competition Unified Dataset

    • kaggle.com
    zip
    Updated Jun 15, 2023
    Cite
    Joakim Arvidsson (2023). CrunchDAO Competition Unified Dataset [Dataset]. https://www.kaggle.com/datasets/joebeachcapital/crunchdao-competition-unified-dataset
    Explore at:
    zip (183163058 bytes). Available download formats
    Dataset updated
    Jun 15, 2023
    Authors
    Joakim Arvidsson
    Description

    This data set is for creating predictive models for the CrunchDAO tournament. Registration is required in order to participate in the competition, and to be eligible to earn $CRUNCH tokens.

    See notebooks (Code tab) for how to import and explore the data, and build predictive models.

    See Terms of Use for data license.

  8. openwebtext_1M

    • kaggle.com
    zip
    Updated Mar 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tanay Mehta (2024). openwebtext_1M [Dataset]. https://www.kaggle.com/datasets/heyytanay/openwebtext-1m/code
    Explore at:
    zip (2043993317 bytes). Available download formats
    Dataset updated
    Mar 18, 2024
    Authors
    Tanay Mehta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A subset of Skylion007/openwebtext dataset consisting of 1 Million tokenized samples in Lance file format for blazing fast and memory efficient I/O.

    The files were tokenized using the gpt2 tokenizer with no extra tokens.

    For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.

    Instructions for using this dataset

    This dataset is not meant to be used in Kaggle Kernels: Lance requires write access to the dataset's directory, Kaggle's input directory is read-only, and the dataset is too large to move to /kaggle/working. Hence, to use this dataset, you must download it with the Kaggle API or through this page, and then move the unzipped files to a folder called openwebtext_1M.lance. Below are detailed snippets on how to download and use this dataset.

    First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/):

    $ pip install -q kaggle pyarrow pylance
    $ kaggle datasets download -d heyytanay/openwebtext-1m
    $ mkdir openwebtext_1M.lance/
    $ unzip -qq openwebtext-1m.zip -d openwebtext_1M.lance/
    $ rm openwebtext-1m.zip
    

    Once this is done, you will find your dataset in the openwebtext_1M.lance/ folder. Now, to load the data and get a feel for it, run the snippet below.

    import lance

    # Open the Lance dataset directory and count its rows
    dataset = lance.dataset('openwebtext_1M.lance/')
    print(dataset.count_rows())
    

    This will give you the total number of tokens in the dataset.
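
    To look at actual rows rather than just the count, LanceDataset.take fetches rows by index as a PyArrow table; the schema (column names) depends on how the samples were written, so inspect it first:

      import lance

      ds = lance.dataset('openwebtext_1M.lance/')
      tbl = ds.take([0, 1])  # first two rows, returned as a pyarrow.Table
      print(tbl.schema)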

  9. Ethereum Blockchain

    • kaggle.com
    zip
    Updated Mar 4, 2019
    Cite
    Google BigQuery (2019). Ethereum Blockchain [Dataset]. https://www.kaggle.com/datasets/bigquery/ethereum-blockchain
    Explore at:
    zip (0 bytes). Available download formats
    Dataset updated
    Mar 4, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Bitcoin and other cryptocurrencies have captured the imagination of technologists, financiers, and economists. Digital currencies are only one application of the underlying blockchain technology. Like its predecessor, Bitcoin, the Ethereum blockchain can be described as an immutable distributed ledger. However, creator Vitalik Buterin also extended the set of capabilities by including a virtual machine that can execute arbitrary code stored on the blockchain as smart contracts.

    Both Bitcoin and Ethereum are essentially OLTP databases, and provide little in the way of OLAP (analytics) functionality. However, the Ethereum dataset is notably distinct from the Bitcoin dataset:

    • The Ethereum blockchain has as its primary unit of value Ether, while the Bitcoin blockchain has Bitcoin. However, the majority of value transfer on the Ethereum blockchain is composed of so-called tokens. Tokens are created and managed by smart contracts.

    • Ether value transfers are precise and direct, resembling accounting ledger debits and credits. This is in contrast to the Bitcoin value transfer mechanism, for which it can be difficult to determine the balance of a given wallet address.

    • Addresses can be not only wallets that hold balances, but can also contain smart contract bytecode that allows the programmatic creation of agreements and automatic triggering of their execution. An aggregate of coordinated smart contracts could be used to build a decentralized autonomous organization.

    Content

    The Ethereum blockchain data are now available for exploration with BigQuery. All historical data are in the ethereum_blockchain dataset, which updates daily.

    Our hope is that by making the data on public blockchain systems more readily available, we will promote technological innovation and increase societal benefits.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_ethereum.[TABLENAME]. Fork this kernel to get started.
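
    For example, a minimal query sketch with the BigQuery Python client (the SQL is illustrative; the client must be authenticated against a GCP project):

      from google.cloud import bigquery

      client = bigquery.Client()
      query = """
          SELECT from_address, to_address, value
          FROM `bigquery-public-data.crypto_ethereum.transactions`
          ORDER BY block_timestamp DESC
          LIMIT 10
      """
      # Run the query and materialize the results as a pandas DataFrame
      df = client.query(query).to_dataframe()
      print(df.head())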

    Acknowledgements

    Cover photo by Thought Catalog on Unsplash

    Inspiration

    • What are the most popularly exchanged digital tokens, represented by ERC-721 and ERC-20 smart contracts?
    • Compare transaction volume and transaction networks over time
    • Compare transaction volume to historical prices by joining with other available data sources like Bitcoin Historical Data
  10. LOTTERY CASH

    • kaggle.com
    zip
    Updated May 13, 2022
    Cite
    OYENIYI ISAAC ENIOLA (2022). LOTTERY CASH [Dataset]. https://www.kaggle.com/datasets/yorlepro/lottery-cash
    Explore at:
    zip (590770 bytes). Available download formats
    Dataset updated
    May 13, 2022
    Authors
    OYENIYI ISAAC ENIOLA
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    There are two methods available for authentication: HTTP Basic and OAuth 2.0. For non-interactive applications, we only support HTTP Basic Authentication. We encourage all our developers of interactive applications to use the OAuth 2.0 workflow to authenticate their users.

    HTTP Basic Authentication is required when you are authenticating from a script that runs without interaction with the user, like your ETL tool, an update script, or any other data management automation.

    OAuth 2.0 is the preferred option for cases where you are building a web or mobile application that needs to perform actions on behalf of the user, like accessing data, and the interaction model allows you to present the user with a form to obtain their permission for the app to do so.

    Authenticating using HTTP Basic Authentication Requests can be authenticated using HTTP Basic Authentication. You can use your HTTP library’s Basic Auth feature to pass your credentials. All HTTP-basic-authenticated requests must be performed over a secure (https) connection. Authenticated requests made over an insecure connection will be denied.

    Users may use their username and password or an API key and secret pair to authenticate using Basic Authentication. Documentation on how to create and manage API keys can be found here.

    We recommend using API keys! They provide the following benefits:

    • Access Socrata APIs without the risk of embedding your username and password in scripts or code
    • Users on domains that require SSO (and thus without passwords) can access Socrata APIs
    • Create individual keys for different apps or jobs, so that if any one needs to be revoked or rotated, other apps are unaffected
    • Change your account password without disrupting apps, or rotate API keys without disrupting logins

    Here is a sample HTTP session that uses HTTP Basic Authentication:

      POST /resource/4tka-6guv.json HTTP/1.1
      Host: soda.demo.socrata.com
      Accept: */*
      Authorization: Basic [REDACTED]
      Content-Length: 253
      Content-Type: application/json
      X-App-Token: [REDACTED]

      [ { ... } ]

    Note that the Authorization header in this request will usually be generated via your HTTP library’s Basic Auth feature (as opposed to manually constructing the Base64 encoding of your credentials yourself). For example, if you’re using Python’s requests module, it supports Basic Authentication out of the box. Similarly, an API tool like Postman also handles Basic Authentication.
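
    For instance, a minimal sketch with requests; the resource, payload, and credentials are placeholders:

      import requests

      auth = ("API_KEY_ID", "API_KEY_SECRET")  # key id/secret pair from your account

      response = requests.post(
          "https://soda.demo.socrata.com/resource/4tka-6guv.json",
          json=[{"field": "value"}],                  # placeholder payload
          headers={"X-App-Token": "YOUR_APP_TOKEN"},  # application token
          auth=auth,                                  # HTTP Basic Authentication
      )
      print(response.status_code, response.json())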

    OAuth 2.0

    Note: When developing applications that make use of OAuth, you must provide a web-accessible callback URL when registering your application token. This can make it difficult to develop on a machine that isn't directly exposed to the Internet. One great option is to use a tool like ngrok to create a secure tunnel to expose your web application in a secure manner.

    Workflow

    We support a subset of OAuth 2.0 — the server-based flow with a callback URL — which we believe is more secure than the other flows in the specification. This OAuth flow is used by several other popular API services on the web. We have made the authentication flow similar to Google AuthSub.

    To authenticate with OAuth 2.0, you will first need to register your application, which will create an app token and a secret token. When registering your application, you must preregister your server by filling out the Callback Prefix field, so that we can be sure that access through your application is secure even if both your tokens are stolen. The Callback Prefix is the beginning of the URL that you will use as your redirect URL. Generally, you’ll want to provide as much of your callback URL as you can. For example, if your authentication callback is https://my-website.com/socrata-app/auth/callback, you might want to specify https://my-website.com/socrata-app as your Callback Prefix.

    Once you have an application and a secret token, you’ll be able to authenticate with the SODA OAuth 2.0 endpoint. You’ll first need to redirect the user to the Socrata-powered site you wish to access so that they may log in and approve your application. For example:

      https://soda.demo.socrata.com/oauth/authorize?client_id=YOUR_AUTH_TOKEN&response_type=code&redirect_uri=YOUR_REDIRECT_URI

    Note that the redirect_uri here must be an absolute, secure (https:) URI which starts with the Callback Prefix you specified when you registered your application. If any of these cases fail, the user will be shown an error indicating as much.

    Should the user authorize your application, they will be redirected back to your redirect_uri. For example, if I provide https://my-website.com/socrata-app/auth/callback as my redirect_uri, the user will be redirected to this URL:

      https://my-website.com/socrata-app/auth/callback?code=CODE

    where CODE is an authorization code that you will use later.

    If your redirect_uri contains a querystring, it will be preserved, and the code parameter will...

  11. Cloud Accounting Integrity Log Dataset

    • kaggle.com
    zip
    Updated Nov 5, 2025
    Cite
    Python Developer (2025). Cloud Accounting Integrity Log Dataset [Dataset]. https://www.kaggle.com/datasets/programmer3/cloud-accounting-integrity-log-dataset
    Explore at:
    zip (109578 bytes). Available download formats
    Dataset updated
    Nov 5, 2025
    Authors
    Python Developer
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Cloud Accounting Integrity Verification Dataset (CAIVD) contains 1,900 simulated accounting and cloud platform log entries, designed for evaluating cloud data integrity verification algorithms. Each record represents a real-world accounting system event — including insert, update, delete, view, and approve operations — enriched with financial, network, and system-level metadata. This dataset supports experiments in data integrity auditing, cloud computing performance analysis, and intelligent accounting verification.

    Key Features

    • Record_ID: Unique identifier for each log entry.
    • Timestamp: Date and time when the accounting or cloud event occurred.
    • User_ID: Randomized user identifier (represents an accountant, auditor, or automated process).
    • Action_Type: Type of operation performed in the accounting system: insert, update, delete, view, or approve.
    • Transaction_Amount: Financial amount involved in the accounting transaction.
    • Account_Category: Account classification: Assets, Liabilities, Revenue, or Expense.
    • Approval_Status: Approval state of the transaction: Pending, Approved, or Rejected.
    • Device_ID: Identifier for the source terminal or accounting device.
    • IP_Address: Client network address from which the transaction originated.
    • Cloud_Node: Cloud region or node processing the accounting request: Node-A, Node-B, or Node-C.
    • CPU_Usage: CPU utilization (%) of the cloud node during transaction processing.
    • Memory_Usage: Memory usage (%) of the cloud node at event time.
    • Request_Latency: Observed network delay (milliseconds) for the event request.
    • Encryption_Flag: Indicates encryption status of the record (1 = encrypted, 0 = not encrypted).
    • Access_Token_Validity: Authentication token status: Valid, Expired, or Suspicious.
    • Block_Size_KB: Constant value = 1 KB; size of each data block used in verification tests.
    • File_Size_MB: File size (MB) used in simulation: 200, 400, 600, 800, 1000, 1200, or 1400 MB.
    • Total_Blocks: Computed as File_Size_MB × 1024; total number of 1 KB data blocks per experiment.

  12. nemotron-3-8b-base-4k

    • kaggle.com
    zip
    Updated Aug 31, 2024
    Cite
    Serhii Kharchuk (2024). nemotron-3-8b-base-4k [Dataset]. https://www.kaggle.com/datasets/serhiikharchuk/nemotron-3-8b-base-4k
    Explore at:
    zip (13688476176 bytes). Available download formats
    Dataset updated
    Aug 31, 2024
    Authors
    Serhii Kharchuk
    Description

    Nemotron-3-8B-Base-4k Model Overview

    License

    The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement.

    Description

    Nemotron-3-8B-Base-4k is a large language foundation model for enterprises to build custom LLMs. This foundation model has 8 billion parameters, and supports a context length of 4,096 tokens. Nemotron-3-8B-Base-4k is part of Nemotron-3, which is a family of enterprise ready generative text models compatible with NVIDIA NeMo Framework. For other models in this collection, see the collections page.

    NVIDIA NeMo is an end-to-end, cloud-native platform to build, customize, and deploy generative AI models anywhere. It includes training and inferencing frameworks, guardrailing toolkits, data curation tools, and pretrained models, offering enterprises an easy, cost-effective, and fast way to adopt generative AI. To get access to NeMo Framework, please sign up at this link.

    References

    Announcement Blog

    Model Architecture

    Architecture Type: Transformer

    Network Architecture: Generative Pre-Trained Transformer (GPT-3)

    Software Integration

    Runtime Engine(s): NVIDIA AI Enterprise

    Toolkit: NeMo Framework

    To get access to NeMo Framework, please sign up at this link. See the NeMo inference container documentation for details on how to set up and deploy an inference server with NeMo.

    Sample Inference Code:

      from nemo.deploy import NemoQuery

      # In this case, we run inference on the same machine
      nq = NemoQuery(url="localhost:8000", model_name="Nemotron-3-8B-4K")

      output = nq.query_llm(prompts=["The meaning of life is"],
                            max_output_token=200, top_k=1, top_p=0.0, temperature=0.1)
      print(output)

    Supported Hardware:

    • H100
    • A100 80GB, A100 40GB

    Model Version(s)

    Nemotron-3-8B-base-4k-BF16-1

    Dataset & Training

    The model uses a learning rate of 3e-4 with a warm-up period of 500M tokens and a cosine learning rate annealing schedule for 95% of the total training tokens. The decay stops at a minimum learning rate of 3e-5. The model is trained with a sequence length of 4096 and uses FlashAttention’s Multi-Head Attention implementation. 1,024 A100s were used for 19 days to train the model.

    NVIDIA models are trained on a diverse set of public and proprietary datasets. This model was trained on a dataset containing 3.8 trillion tokens of text. The dataset contains 53 different human languages (including English, German, Russian, Spanish, French, Japanese, Chinese, Italian, and Dutch) and 37 programming languages. The model also uses the training subsets of downstream academic benchmarks from sources like FLANv2, P3, and NaturalInstructions v2. NVIDIA is committed to the responsible development of large language models and conducts reviews of all datasets included in training.

    Evaluation

    | Task | Num-shot | Score |
    |:--------------|:----------------|:------|
    | MMLU* | 5 | 54.4 |
    | WinoGrande | 0 | 70.9 |
    | Hellaswag | 0 | 76.4 |
    | ARC Easy | 0 | 72.9 |
    | TyDiQA-GoldP** | 1 | 49.2 |
    | Lambada | 0 | 70.6 |
    | WebQS | 0 | 22.9 |
    | PiQA | 0 | 80.4 |
    | GSM8K | 8-shot w/ maj@8 | 39.4 |

    * The calculation of MMLU follows the original implementation. See Hugging Face’s explanation of different implementations of MMLU.

    ** The languages used are Arabic, Bangla, Finnish, Indonesian, Korean, Russian and Swahili.

    Intended use

    This is a completion model. For best performance, users are encouraged to customize it using the NeMo Framework suite of customization tools, including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA) and SFT/RLHF. For chat use cases, please consider using the Nemotron-3-8B chat variants.

    Ethical use

    Technology can have a profound impact on people and the world, and NVIDIA is committed to enabling trust and transparency in AI development. NVIDIA encourages users to adopt principles of AI ethics and trustworthiness to guide your business decisions by following the guidelines in the NVIDIA AI Foundation Models Community License Agreement.

    Limitations

    • The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts.
    • The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even if the prompt itself does not include anything explicitly offensive.
  13. South Africa COVID-19 Twitter Posts Dataset

    • kaggle.com
    zip
    Updated Jul 4, 2022
    Cite
    Blessing Ogbuokiri (2022). South Africa COVID-19 Twitter Posts Dataset [Dataset]. https://www.kaggle.com/datasets/ogbuokiriblessing/tweetdatasa
    Explore at:
    zip (1713167 bytes). Available download formats
    Dataset updated
    Jul 4, 2022
    Authors
    Blessing Ogbuokiri
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    South Africa
    Description

    This dataset contains Twitter posts containing daily updates of location-based COVID–19 vaccine-related tweets from January 2021 to August 2021.

    With an existing Twitter account, we applied for Developer Access and were granted access to the Twitter Academic Researcher API, which allows for over 10 million tweets per month. Then, we created an application to generate the API credentials (access tokens) from Twitter. The access token was used in a Python (v3.6) script to authenticate and establish a connection to the Twitter database. To get geo-tagged vaccine-related tweets, we used the Python script we developed to perform a historical search (archive search) of vaccine-related keywords with the place country set to South Africa (ZA). By geo-tagged tweets, we refer to Twitter posts with a known location. These vaccine-related keywords include but are not limited to vaccine, anti-vaxxer, vaccination, AstraZeneca, Oxford-AstraZeneca, IChooseVaccination, VaccineToSaveSouthAfrica, JohnsonJohnson, and Pfizer. The keywords were selected from the trending topics during the period of discussion. A complete list of the keywords is shown below:

    Oxford-AstraZeneca, AstraZeneca, JohnsonJohnson, Vaccine, BioNTech, anti-vaccine, jab, Vaccination, Covax, Vaccine Rollout, Sputnik, VaccineToSaveSouthAfrica, IChooseVaccination, TeachersVaccine, AstraZeneca vaccine, Pfizer, J & J, Johonson & Johnson, Moderna, VaccinesWork, VacciNation, Vaccine, Steriod, COVIDvaccine, covax, VaccineEquity, VaccineReady, Jab OR PfizerGang, Scamdemic, Plandemic, Scaredemic, COVID-19, coronavirus, SARS-CoV-2, anti-vaxxers, jab, Pfizer, BioNTech, JJ, Vaccine, JohnsonJohnson Vaccine, Vaccine Rollout, J & J, Sputnik, COVAX, CoronaVac

    The preferred language of the tweets is English.
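
    A minimal sketch of such an archive search against the Twitter API v2 full-archive endpoint; the bearer token is a placeholder, and the query is a simplified illustration of the keyword list above:

      import requests

      BEARER_TOKEN = "YOUR_ACADEMIC_BEARER_TOKEN"  # placeholder
      response = requests.get(
          "https://api.twitter.com/2/tweets/search/all",
          headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
          params={
              "query": "(vaccine OR AstraZeneca OR Pfizer) place_country:ZA",
              "start_time": "2021-01-01T00:00:00Z",
              "end_time": "2021-08-31T23:59:59Z",
              "max_results": 100,
          },
      )
      for tweet in response.json().get("data", [])[:3]:
          print(tweet["id"], tweet["text"])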

  14. Movies Info

    • kaggle.com
    zip
    Updated Aug 1, 2025
    Cite
    Rushil Dhingra (2025). Movies Info [Dataset]. https://www.kaggle.com/datasets/rushildhingra25/movies-info/code
    Explore at:
    zip (1292894 bytes). Available download formats
    Dataset updated
    Aug 1, 2025
    Authors
    Rushil Dhingra
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains detailed information about top-rated movies, retrieved directly from The Movie Database (TMDB) using its official public API.

    Data Source

    • API Endpoint Used: https://api.themoviedb.org/3/movie/top_rated
    • Authorization: Accessed using TMDB-provided Bearer Token Authentication (see the sketch after this list)
    • Genre Mapping: Genre names were mapped from the TMDB /genre/movie/list endpoint using the provided genre IDs.
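
    A minimal retrieval sketch for this endpoint (the token is a placeholder you generate in your TMDB account settings):

      import requests

      TMDB_TOKEN = "YOUR_TMDB_READ_ACCESS_TOKEN"  # placeholder bearer token
      response = requests.get(
          "https://api.themoviedb.org/3/movie/top_rated",
          headers={"Authorization": f"Bearer {TMDB_TOKEN}"},
          params={"language": "en-US", "page": 1},
      )
      response.raise_for_status()
      for movie in response.json()["results"][:3]:
          print(movie["title"], movie["genre_ids"])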

    Dataset Contents

    Each row in the dataset represents a single movie and includes the following fields:

    | Column | Description |
    |:------------|:------------------------------------------------------------------|
    | title | The official title of the movie |
    | overview | A short description or plot summary of the movie |
    | genre_ids | A list of genre IDs associated with the movie (from TMDB) |
    | genre_names | The corresponding genre names for the movie (e.g. Action, Drama) |
    | genre_str | A comma-separated string of genres for easy text processing |

    Use Cases

    This dataset is ideal for a variety of educational and practical machine learning tasks, including but not limited to:

    • Natural Language Processing (NLP):

      • Text cleaning and preprocessing (lowercasing, stopword removal, etc.)
      • Tokenization and embedding
      • TF-IDF or Word2Vec vectorization
      • Genre classification based on movie descriptions
      • Clustering similar movies using text similarity
    • Data Visualization:

      • Word clouds from overviews
      • Genre frequency analysis
      • Sentiment trends across genres
    • Machine Learning Projects:

      • Supervised learning: Predict movie genres from descriptions
      • Unsupervised learning: Cluster movies based on plot similarity
      • Recommendation engines (content-based filtering)

    Note

    This dataset is intended for learning and educational purposes only. Please adhere to TMDB's API terms of use when using this data in any public or commercial setting.

  15. CommonsenseQA (Multiple-Choice Q&A)

    • kaggle.com
    zip
    Updated Nov 21, 2022
    Cite
    The Devastator (2022). CommonsenseQA (Multiple-Choice Q&A) [Dataset]. https://www.kaggle.com/datasets/thedevastator/new-commonsenseqa-dataset-for-multiple-choice-qu
    Explore at:
    zip (712030 bytes). Available download formats
    Dataset updated
    Nov 21, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CommonsenseQA (Multiple-Choice Q&A)

    12,102 questions with one correct answer and four distractor answers

    Source

    Huggingface Hub: link

    About this dataset

    CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing set splits: the "Random split", which is the main evaluation split, and the "Question token split"; see the paper for details.

    How to use the dataset

    Research Ideas

    • This dataset can be used to train a model to predict the correct answers to multiple-choice questions.
    • This dataset can be used to evaluate the performance of different models on the CommonsenseQA dataset.
    • This dataset can be used to discover new types of commonsense knowledge required to predict the correct answers to questions in the CommonsenseQA dataset

    Acknowledgements

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv

    | Column name | Description |
    |:------------|:----------------------------------------------------------------|
    | answerKey | The correct answer to the question. (String) |
    | choices | The four possible answers for each question. (List of strings) |

    File: train.csv

    | Column name | Description |
    |:------------|:----------------------------------------------------------------|
    | answerKey | The correct answer to the question. (String) |
    | choices | The four possible answers for each question. (List of strings) |

    File: test.csv

    | Column name | Description |
    |:------------|:----------------------------------------------------------------|
    | answerKey | The correct answer to the question. (String) |
    | choices | The four possible answers for each question. (List of strings) |

  16. All Crypto Data - Every 12 hrs

    • kaggle.com
    zip
    Updated Mar 6, 2018
    Cite
    Idan Erez (2018). All Crypto Data - Every 12 hrs [Dataset]. https://www.kaggle.com/idanerez/all-cryoto-data-every-12-hrs
    Explore at:
    zip (2002206 bytes). Available download formats
    Dataset updated
    Mar 6, 2018
    Authors
    Idan Erez
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I started collecting data on all cryptocurrencies and tokens from CoinMarketCap on a 12-hour cycle. I plan to upload it every weekend. If you need it sooner for your analysis, please PM me.

    Content

    All available cryptocurrency data from CoinMarketCap: Symbol, Rank, Price USD, Price BTC, Market Cap, Date, Time.

    Updated every 12 hours - update time is in EST.

    Acknowledgements

    Thanks to coinmarketcap.com for the API access.

    Inspiration

    What correlations are there? Is there a low-risk portfolio? What are the signals before crypto rises or crashes?
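
    For example, a correlation sketch with pandas; the CSV file name and the exact column labels are assumptions about the archive's contents:

      import pandas as pd

      df = pd.read_csv("all_crypto_data.csv")  # assumed file name

      # One price column per symbol, indexed by date
      prices = df.pivot_table(index="Date", columns="Symbol", values="Price USD")

      # Correlation of 12-hourly returns between coins
      print(prices.pct_change().corr().head())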

  17. Mutant Ape Yacht Club NFT Images

    • kaggle.com
    zip
    Updated Dec 6, 2023
    Cite
    The Devastator (2023). Mutant Ape Yacht Club NFT Images [Dataset]. https://www.kaggle.com/datasets/thedevastator/mutant-ape-yacht-club-nft-images
    Explore at:
    zip (1766126750 bytes). Available download formats
    Dataset updated
    Dec 6, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Mutant Ape Yacht Club NFT Images

    NFT image dataset with Mutant Ape Yacht Club artwork and metadata

    By huggingnft (From Huggingface) [source]

    About this dataset

    The NFT Images Dataset for Unconditional Generation, specifically focused on the Mutant Ape Yacht Club, offers valuable information and resources for artificial intelligence and machine learning enthusiasts. This dataset provides comprehensive details about NFT artwork, including high-quality images and associated metadata.

    The dataset includes columns such as image and image_original_url, providing direct access to the image files in their original form through valid web addresses. These image files represent unique pieces of digital artwork from the Mutant Ape Yacht Club collection.

    Moreover, this dataset offers crucial insights into each NFT artwork through the token_metadata column. This metadata encompasses various essential details regarding the specific piece of art, such as artist information, detailed descriptions, and unique attributes associated with each NFT. These attributes help differentiate one piece of art from another in terms of style, theme, rarity factors, or any other distinctive characteristics.

    By utilizing this comprehensive dataset's resources in various AI applications like unconditional generation models or machine learning algorithms, users gain access to a wide range of digital artworks for research or creative purposes. Additionally, with accurate token metadata available alongside each image file, users can explore diverse aspects that contribute to the essence of these digital creations.

    How to use the dataset

    Introduction:

    • Dataset Overview:

      • The dataset contains information about NFT images from the Mutant Ape Yacht Club, including image URLs, IDs, token metadata, and original image URLs.
      • Each artwork in the dataset is represented by an image file and its corresponding metadata.
    • Accessing the Dataset:

      • To access this dataset, download or import the provided train.csv file.
    • Understanding Key Columns:

      • image (Image file): This column contains the image file of each artwork in the NFT collection.
      • token_metadata (Text): This column includes metadata associated with each artwork such as artist details, description, and attributes.
      • image_original_url (URL): Provides the original URL of each image file in case you need to refer back to it or access additional information.
    • Potential Use Cases:

      • Unconditional Generation: The dataset can be used for unconditional generation tasks like training generative models or running experiments on creating novel artworks based on existing ones.
    • Preprocessing Steps: Before using this dataset for unconditional generation tasks, consider performing some preprocessing steps such as:

      a) Image Processing: Resize or normalize images for consistent input dimensions if required by your model architecture.

      b) Text Processing: Clean token_metadata column if needed by removing special characters or irrelevant text that may hinder model training.

    • Exploratory Data Analysis (EDA): Conducting EDA provides insights into patterns within the database that might help you understand art concepts better or optimize your unconditional generation models. Some possible EDA tasks include:

      a) Image Visualization: Display a subset of images to get a visual understanding of the artworks.

      b) Metadata Analysis: Analyze the distribution or correlation between different attributes mentioned in the token_metadata column.

    • Training Unconditional Generation Models: Use this dataset to train generative models such as Variational Autoencoders (VAE), Generative Adversarial Networks (GANs), or other deep learning architectures for creative artwork generation.

    • Iterative Model Improvements: Experiment with different model architectures, hyperparameters, and loss functions to enhance the

    Research Ideas

    • Unconditional Image Generation: This dataset can be used for training generative models to create new and unique NFT artwork. By training a model on this dataset, it can learn the patterns, styles, and attributes of Mutant Ape Yacht Club NFTs and generate new images in a similar style.
    • Artistic Style Transfer: The token_metadata associated with each NFT artwork can provide valuable information about the artist, description, and attributes of the image. This metadata can be used for artistic style transfer techniques to apply the unique style of Mutant Ape Yacht Club artworks to other im...
  18. OpenCorpora: Russian

    • kaggle.com
    zip
    Updated Sep 12, 2017
    Cite
    Rachael Tatman (2017). OpenCorpora: Russian [Dataset]. https://www.kaggle.com/rtatman/opencorpora-russian
    Explore at:
    zip (26456197 bytes). Available download formats
    Dataset updated
    Sep 12, 2017
    Authors
    Rachael Tatman
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context:

    “Russian is an East Slavic language and an official language in Russia, Belarus, Kazakhstan, Kyrgyzstan and many minor or unrecognised territories. It is an unofficial but widely spoken language in Ukraine and Latvia, and to a lesser extent, the other countries that were once constituent republics of the Soviet Union and former participants of the Eastern Bloc.” -- “Russian Language” on Wikipedia

    Russian has around 150 million native speakers and 110 million non-native speakers. Russian is written in Cyrillic script. This dataset is a morphologically, syntactically and semantically annotated corpus of texts in Russian, fully accessible to researchers and edited by users.

    Content:

    This dataset is encoded in UTF-8. There are two files included in this dataset: the corpus and the dictionary. The corpus is in .json format, while the dictionary is in plain text.

    Dictionary

    In the dictionary, each entry is a lemma, presented with all of its tagged derivations. The tags depend on the part of speech of the lemma. Some examples are:

    • Nouns: part of speech, animacy, gender & number, case
    • Verbs: Part of speech, aspect, transitivity, gender & number, person, tense, mood
    • Adjectives: part of speech (ADJF), gender, number, case

    A Python script to convert the tags in this corpus to the set more commonly used in English-language linguistics can be found here.

    Sample dictionary entries:

    1
    ЁЖ NOUN,anim,masc sing,nomn
    ЕЖА NOUN,anim,masc sing,gent
    ЕЖУ NOUN,anim,masc sing,datv
    ЕЖА NOUN,anim,masc sing,accs
    ЕЖОМ  NOUN,anim,masc sing,ablt
    ЕЖЕ NOUN,anim,masc sing,loct
    ЕЖИ NOUN,anim,masc plur,nomn
    ЕЖЕЙ  NOUN,anim,masc plur,gent
    ЕЖАМ  NOUN,anim,masc plur,datv
    
    41
    ЁРНИЧАЮ VERB,impf,intr sing,1per,pres,indc
    ЁРНИЧАЕМ  VERB,impf,intr plur,1per,pres,indc
    ЁРНИЧАЕШЬ  VERB,impf,intr sing,2per,pres,indc
    ЁРНИЧАЕТЕ  VERB,impf,intr plur,2per,pres,indc
    ЁРНИЧАЕТ  VERB,impf,intr sing,3per,pres,indc
    ЁРНИЧАЮТ  VERB,impf,intr plur,3per,pres,indc
    ЁРНИЧАЛ VERB,impf,intr masc,sing,past,indc
    ЁРНИЧАЛА  VERB,impf,intr femn,sing,past,indc
    ЁРНИЧАЛО  VERB,impf,intr neut,sing,past,indc
    ЁРНИЧАЛИ  VERB,impf,intr plur,past,indc
    ЁРНИЧАЙ VERB,impf,intr sing,impr,excl
    ЁРНИЧАЙТЕ  VERB,impf,intr plur,impr,excl
    

    Corpus

    In this corpus, each word has been grammatically tagged. You can access individual tokens using the following general path:

    JSON > text > paragraphs > paragraph > [paragraph number] > sentence > [sentence number] > tokens > [token number]

    Each token has:

    • A unique id number (@id)
    • The text of the token (@text)
    • Information on the lemma (under “l”), including the id number of the lemma as found in the dictionary

    You can see an example of the token portion of the .json structure below:

           {
            "@id": 1714292,
            "@text": "сват",
            "tfr": {
             "@t": "сват",
             "@rev_id": 3754311,
             "v": {
              "l": {
               "@id": 314741,
               "@t": "сват",
               "g": [
                {
                 "@v": "NOUN"
                },
                {
                 "@v": "anim"
                },
                {
                 "@v": "masc"
                },
                {
                 "@v": "sing"
                },
                {
                 "@v": "nomn"
                }
               ]
              }
             }
          }
         }
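
    For programmatic access, here is a minimal sketch following the path above; the corpus file name is an assumption about the zip's contents, and the exact nesting under "tokens" may differ slightly from this reading of the path:

      import json

      # File name is an assumption; use the corpus .json file from the zip
      with open("opencorpora.json", encoding="utf-8") as f:
          corpus = json.load(f)

      # JSON > text > paragraphs > paragraph > [0] > sentence > [0] > tokens > [0]
      token = corpus["text"]["paragraphs"]["paragraph"][0]["sentence"][0]["tokens"][0]
      print(token["@id"], token["@text"])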
    

    Acknowledgements:

    This dataset was collected and annotated by, among others, Svetlana Alekseeva, Anastasia Bodrova, Victor Bocharov, Dmitry Granovsky, Irina Krylova, Maria Nikolaeva, Catherine Protopopova, Alexander Chuchunkov, Anastasia Shimorina, Vasily Alekseev, Natalia Ostapuk, Maria Stepanova and Alexey Surikov. The code used to collect and clean this data is available online.

    It is reproduced here under a CC-BY-SA license.

    More information on this corpus and its most recent version can be found here (in Russian).
