This dataset was created by Sa Atmax

License: http://opendatacommons.org/licenses/dbcl/1.0/
ShutterStock AI vs. Human-Generated Image Dataset
This dataset is curated to facilitate research in distinguishing AI-generated images from human-created ones, leveraging ShutterStock data. As AI-generated imagery becomes more sophisticated, developing models that can classify and analyze such images is crucial for applications in content moderation, digital forensics, and media authenticity verification.
With the rise of generative AI models like Stable Diffusion, DALL·E, and MidJourney, the ability to differentiate between synthetic and real images has become a crucial challenge. This dataset offers a structured way to train AI models on this task, making it a valuable resource for both academic research and practical applications.
Explore the dataset and contribute to advancing AI-generated content detection!
If you haven't installed the Kaggle API, run:
bash
pip install kaggle
Then, download your kaggle.json API key from your Kaggle account page and move it to `~/.kaggle/` (Linux/Mac) or `C:\Users\<YourUser>\.kaggle\` (Windows).
bash
kaggle datasets download -d shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
Once downloaded, extract the dataset using:
bash
unzip shutterstock-dataset-for-ai-vs-human-gen-image.zip -d dataset_folder
Now your dataset is ready to use! 🚀
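If you prefer to script the extraction step instead of calling unzip, a minimal Python sketch using only the standard library (the archive and destination names are placeholders, not fixed by the dataset):

```python
import zipfile
from pathlib import Path

def extract_dataset(archive_path, dest):
    """Extract a downloaded Kaggle archive and return the extracted member names."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)   # unpack everything under dest_dir
        return zf.namelist()
```

Call it with the zip you downloaded and a target folder, e.g. `extract_dataset("dataset.zip", "dataset_folder")`.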
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.
from IPython.display import Markdown, display
display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
In this notebook, I processed the images with Roboflow because the COCO-formatted dataset had images of varying dimensions and was not split into train/validation/test sets. To train a custom YOLOv7 model we need the objects in the dataset annotated. To do so I took the following steps:
Image Credit - jinfagang
!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
%cd yolov7
!pip install -qr requirements.txt
!pip install -q roboflow
!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
import os
import glob
import wandb
import torch
from roboflow import Roboflow
from kaggle_secrets import UserSecretsClient
from IPython.display import Image, clear_output, display # to display images
print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!
try:
    user_secrets = UserSecretsClient()
    wandb_api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=wandb_api_key)
    anonymous = None
except Exception:
    anonymous = 'must'
    wandb.login(anonymous='must')
    print('To use your W&B account,\n'
          'go to Add-ons -> Secrets and provide your W&B access token. Use the label name WANDB.\n'
          'Get your W&B access token from here: https://wandb.ai/authorize')

wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
In Roboflow, We can choose between two paths:
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")
Here, I am able to pass a number of arguments:
- img: define input image size
- batch: determine batch size
APIs, or Application Programming Interfaces, allow developers to access the functionality of an application or service over the internet. To access an API, a developer would typically make a request using a specific URL or endpoint, along with any necessary authentication or parameters, and receive a response in a standardized format such as JSON. This response can then be used to integrate the API's functionality into another application or service. Many websites and web-based services offer APIs that allow developers to access data and perform actions, such as retrieving information about a user, posting content, or making a purchase. In order to access an API, a developer often needs to obtain an API key or access token, which serves as a unique identifier and enables the API to track usage and enforce rate limits.
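As a toy illustration of the endpoint-plus-key pattern described above, here is a standard-library Python sketch. The endpoint, header name, and parameters are hypothetical placeholders; real services document their own conventions (header, query parameter, or bearer token):

```python
import urllib.request
from urllib.parse import urlencode

def build_api_request(endpoint, api_key, params):
    """Build a GET request carrying an API key header and query parameters."""
    url = f"{endpoint}?{urlencode(params)}"
    req = urllib.request.Request(url)
    # Many APIs accept the key in a custom header; the exact name varies per service.
    req.add_header("X-Api-Key", api_key)
    return req
```

Sending the request is then a single `urllib.request.urlopen(req)` call, subject to the service's rate limits.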
License: https://creativecommons.org/publicdomain/zero/1.0/
The metaverse, a living and breathing space that blends physical and digital, is quickly evolving from a science fiction dream into a reality with endless possibilities. A world where people can interact virtually, create and exchange digital assets for real-world value, own digital land, engage with digitized real-world products and services, and much more.
Major tech giants are beginning to recognize the viability and potential of metaverses, following Facebook’s groundbreaking Meta rebrand announcement. In addition to tech companies, entertainment brands like Disney have also announced plans to take the leap into virtual reality.
While the media hype is deafening, your average netizen isn’t fully aware of what a metaverse is, how it operates and, most importantly—what benefits and opportunities it can offer them as a user.
In its digital iteration, a metaverse is a virtual world based on blockchain technology. This all-encompassing space allows users to work and play in a virtual reflection of real-life and fantasy scenarios, an online reality, ranging from sci-fi and dragons to more practical and familiar settings like shopping centers, offices, and even homes.
Users can access metaverses via computer, handheld device, or complete immersion with a VR headset. Those entering the metaverse get to experience living in a digital realm, where they will be able to work, play, shop, exercise, and socialize. Users will be able to create their own avatars based on face recognition, set up their own businesses of any kind, buy real estate, create in-world content and assets, and attend concerts from real-world superstars, all in one virtual environment.
With that said, a metaverse is a virtual world with a virtual economy. In most cases, it is an online reality powered by decentralized finance (DeFi), where users exchange value and assets via cryptocurrencies and Non-Fungible Tokens.
Metaverse tokens are a unit of virtual currency used to make digital transactions within the metaverse. Since metaverses are built on the blockchain, transactions on underlying networks are near-instant. Blockchains are designed to ensure trust and security, making the metaverse the perfect environment for an economy free of corruption and financial fraud.
Holders of metaverse tokens can access multiple services and applications inside the virtual space. Some tokens give special in-game abilities. Other tokens represent unique items, like clothing for virtual avatars or membership for a community. If you’ve played MMO games like World of Warcraft, the concept of in-game items and currencies are very familiar. However, unlike your traditional virtual world games, metaverse tokens have value inside and outside the virtual worlds. Metaverse tokens in the form of cryptocurrency can be exchanged for fiat currencies. Or if they’re an NFT, they can be used to authenticate ownership to tethered real-world assets like collectibles, works or art, or even cups of coffee.
Some examples of metaverse tokens include SAND of the immensely popular Sandbox metaverse. In The Sandbox, users can create a virtual world driven by NFTs. Another token is MANA of the Decentraland project, where users can use MANA to purchase plots of digital real estate called “LAND”. It is even possible to monetize the plots of LAND purchased by renting them to other users for fixed fees. The ENJ token of the Enjin metaverse is the native asset of an ecosystem with the world’s largest game/app NFT networks.
The dataset includes 198 metaverse cryptocurrencies. Please refer to the file Metaverse coins.csv for the list of metaverse crypto coins.
The dataset will be updated on a weekly basis with additional metaverse tokens. Stay tuned ⏳
This dataset was created by Abhishek Thakur
This dataset is for creating predictive models for the CrunchDAO tournament. Registration is required in order to participate in the competition, and to be eligible to earn $CRUNCH tokens.
See notebooks (Code tab) for how to import and explore the data, and build predictive models.
See Terms of Use for data license.
License: https://creativecommons.org/publicdomain/zero/1.0/
A subset of Skylion007/openwebtext dataset consisting of 1 Million tokenized samples in Lance file format for blazing fast and memory efficient I/O.
The files were tokenized using the gpt2 tokenizer with no extra tokens.
For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.
This dataset is not meant to be used in Kaggle Kernels: Lance requires write access to the dataset's directory, but a Kaggle Kernel's input directory is read-only, and the dataset is too large to move to /kaggle/working. To use this dataset, download it via the Kaggle API or from this page, then move the unzipped files into a folder called openwebtext_1M.lance. Below are detailed snippets on how to download and use this dataset.
First, download and unzip the dataset from your terminal (make sure you have your Kaggle API key at ~/.kaggle/):
$ pip install -q kaggle pyarrow pylance
$ kaggle datasets download -d heyytanay/openwebtext-1m
$ mkdir openwebtext_1M.lance/
$ unzip -qq openwebtext-1m.zip -d openwebtext_1M.lance/
$ rm openwebtext-1m.zip
Once this is done, you will find your dataset in the openwebtext_1M.lance/ folder. Now to load and get a gist of the data, run the below snippet.
import lance
dataset = lance.dataset('openwebtext_1M.lance/')
print(dataset.count_rows())
This will give you the total number of tokens in the dataset.
License: https://creativecommons.org/publicdomain/zero/1.0/
Bitcoin and other cryptocurrencies have captured the imagination of technologists, financiers, and economists. Digital currencies are only one application of the underlying blockchain technology. Like its predecessor, Bitcoin, the Ethereum blockchain can be described as an immutable distributed ledger. However, creator Vitalik Buterin also extended the set of capabilities by including a virtual machine that can execute arbitrary code stored on the blockchain as smart contracts.
Both Bitcoin and Ethereum are essentially OLTP databases, and provide little in the way of OLAP (analytics) functionality. However the Ethereum dataset is notably distinct from the Bitcoin dataset:
The Ethereum blockchain has as its primary unit of value Ether, while the Bitcoin blockchain has Bitcoin. However, the majority of value transfer on the Ethereum blockchain is composed of so-called tokens. Tokens are created and managed by smart contracts.
Ether value transfers are precise and direct, resembling accounting ledger debits and credits. This is in contrast to the Bitcoin value transfer mechanism, for which it can be difficult to determine the balance of a given wallet address.
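The account-style debits and credits described above can be illustrated with a toy ledger; the addresses and amounts here are invented. Each transfer debits the sender and credits the receiver, so any address's balance is directly readable, unlike Bitcoin's UTXO model:

```python
from collections import defaultdict

def apply_transfers(transfers):
    """Replay (sender, receiver, amount) transfers on an account-model ledger."""
    balances = defaultdict(int)
    for sender, receiver, amount in transfers:
        balances[sender] -= amount    # debit the sender
        balances[receiver] += amount  # credit the receiver
    return dict(balances)
```

With an account model the balance query is a plain lookup; with UTXOs you would instead have to scan and sum all unspent outputs attributable to an address.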
Addresses can be not only wallets that hold balances, but can also contain smart contract bytecode that allows the programmatic creation of agreements and automatic triggering of their execution. An aggregate of coordinated smart contracts could be used to build a decentralized autonomous organization.
The Ethereum blockchain data are now available for exploration with BigQuery. All historical data are in the ethereum_blockchain dataset, which updates daily.
Our hope is that by making the data on public blockchain systems more readily available it promotes technological innovation and increases societal benefits.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_ethereum.[TABLENAME]. Fork this kernel to get started.
Cover photo by Thought Catalog on Unsplash
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
There are two methods available for authentication: HTTP Basic and OAuth 2.0. For non-interactive applications, we only support HTTP Basic Authentication. We encourage all our developers of interactive applications to use the OAuth 2.0 workflow to authenticate their users.
HTTP Basic Authentication is required when you are authenticating from a script that runs without interaction with the user, like your ETL tool, an update script, or any other data management automation.
OAuth 2.0 is the preferred option for cases where you are building a web or mobile application that needs to perform actions on behalf of the user, like accessing data, and the interaction model allows you to present the user with a form to obtain their permission for the app to do so.
Authenticating using HTTP Basic Authentication Requests can be authenticated using HTTP Basic Authentication. You can use your HTTP library’s Basic Auth feature to pass your credentials. All HTTP-basic-authenticated requests must be performed over a secure (https) connection. Authenticated requests made over an insecure connection will be denied.
Users may use their username and password or an API key and secret pair to authenticate using Basic Authentication. Documentation on how to create and manage API keys can be found here.
We recommend using API keys! They provide the following benefits:
- Access Socrata APIs without the risk of embedding your username and password in scripts or code
- Users on domains that require SSO (and thus without passwords) can access Socrata APIs
- Create individual keys for different apps or jobs so that if any one needs to be revoked or rotated, other apps are unaffected
- Change your account password without disrupting apps, or rotate API keys without disrupting logins

Here is a sample HTTP session that uses HTTP Basic Authentication:
POST /resource/4tka-6guv.json HTTP/1.1
Host: soda.demo.socrata.com
Accept: */*
Authorization: Basic [REDACTED]
Content-Length: 253
Content-Type: application/json
X-App-Token: [REDACTED]
[ { ... } ]

Note that the Authorization header in this request will usually be generated via your HTTP library's Basic Auth feature (as opposed to manually constructing the Base64 encoding of your credentials yourself). For example, if you're using Python's requests module, it supports Basic Authentication out of the box. Similarly, an API tool like Postman also handles Basic Authentication.
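For illustration only: the Authorization value a Basic Auth feature generates is simply the Base64 encoding of key:secret, which can be sketched with Python's standard library (the credentials below are placeholders):

```python
import base64

def basic_auth_header(key_id, key_secret):
    """Build the value of an HTTP Basic Authorization header from a key pair."""
    token = base64.b64encode(f"{key_id}:{key_secret}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"
```

In practice, let your HTTP library do this for you; the sketch just shows what ends up on the wire (over HTTPS only, per the requirement above).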
OAuth 2.0

Note: When developing applications that make use of OAuth, you must provide a web-accessible callback URL when registering your application token. This can make it difficult to develop on a machine that isn't directly exposed to the Internet. One great option is to use a tool like ngrok to create a secure tunnel to expose your web application in a secure manner.

Workflow

We support a subset of OAuth 2.0 (the server-based flow with a callback URL) which we believe is more secure than the other flows in the specification. This OAuth flow is used by several other popular API services on the web. We have made the authentication flow similar to Google AuthSub.
To authenticate with OAuth 2.0, you will first need to register your application, which will create an app token and a secret token. When registering your application, you must preregister your server by filling out the Callback Prefix field, so that we can be sure that access through your application is secure even if both your tokens are stolen. The Callback Prefix is the beginning of the URL that you will use as your redirect URL. Generally, you'll want to provide as much of your callback URL as you can. For example, if your authentication callback is https://my-website.com/socrata-app/auth/callback, you might want to specify https://my-website.com/socrata-app as your Callback Prefix.
Once you have an application and a secret token, you’ll be able to authenticate with the SODA OAuth 2.0 endpoint. You’ll first need to redirect the user to the Socrata-powered site you wish to access so that they may log in and approve your application. For example:
https://soda.demo.socrata.com/oauth/authorize?client_id=YOUR_AUTH_TOKEN&response_type=code &redirect_uri=YOUR_REDIRECT_URI Note that the redirect_uri here must be an absolute, secure (https:) URI which starts with the Callback Prefix you specified when you registered your application. If any of these cases fail, the user will be shown an error indicating as much.
Should the user authorize your application, they will be redirected back to your redirect_uri. For example, if I provide https://my-website.com/socrata-app/auth/callback as my redirect_uri, the user will be redirected to this URL:
https://my-website.com/socrata-app/auth/callback?code=CODE where CODE is an authorization code that you will use later.
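Extracting that one-time code from the callback URL is a small parsing step; a standard-library sketch (the URL shape follows the example above):

```python
from urllib.parse import urlparse, parse_qs

def extract_auth_code(callback_url):
    """Pull the one-time `code` parameter out of an OAuth 2.0 redirect URL."""
    query = parse_qs(urlparse(callback_url).query)
    return query["code"][0]
```

Your server-side handler would run this on the incoming request URL and then exchange the code for an access token.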
If your redirect_uri contains a querystring, it will be preserved, and the code parameter will...
License: https://creativecommons.org/publicdomain/zero/1.0/
The Cloud Accounting Integrity Verification Dataset (CAIVD) contains 1,900 simulated accounting and cloud platform log entries, designed for evaluating cloud data integrity verification algorithms. Each record represents a real-world accounting system event (including insert, update, delete, view, and approve operations) enriched with financial, network, and system-level metadata. This dataset supports experiments in data integrity auditing, cloud computing performance analysis, and intelligent accounting verification.

Key Features

Record_ID
Unique identifier for each log entry.
Timestamp
Date and time when the accounting or cloud event occurred.
User_ID
Randomized user identifier (represents an accountant, auditor, or automated process).
Action_Type
Type of operation performed in the accounting system: insert, update, delete, view, or approve.
Transaction_Amount
Financial amount involved in the accounting transaction.
Account_Category
Account classification: Assets, Liabilities, Revenue, or Expense.
Approval_Status
Approval state of the transaction: Pending, Approved, or Rejected.
Device_ID
Identifier for the source terminal or accounting device.
IP_Address
Client network address from which the transaction originated.
Cloud_Node
Cloud region or node processing the accounting request: Node-A, Node-B, or Node-C.
CPU_Usage
CPU utilization (%) of the cloud node during transaction processing.
Memory_Usage
Memory usage (%) of the cloud node at event time.
Request_Latency
Observed network delay (milliseconds) for the event request.
Encryption_Flag
Indicates encryption status of the record (1 = encrypted, 0 = not encrypted).
Access_Token_Validity
Authentication token status: Valid, Expired, or Suspicious.
Block_Size_KB
Constant value = 1 KB; the size of each data block used in verification tests.
File_Size_MB
File size (MB) used in simulation: 200, 400, 600, 800, 1000, 1200, or 1400 MB.
Total_Blocks
Computed as File_Size_MB × 1024; the total number of 1 KB data blocks per experiment.
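The Total_Blocks relationship can be checked directly: with a 1 KB block size, a file of N MB splits into N × 1024 blocks. A one-line sketch:

```python
def total_blocks(file_size_mb, block_size_kb=1):
    """Number of fixed-size blocks covering a file (file size in MB, block size in KB)."""
    return (file_size_mb * 1024) // block_size_kb
```

For the simulated sizes this gives 204,800 blocks for a 200 MB file up to 1,433,600 blocks for a 1400 MB file.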
Nemotron-3-8B-Base-4k Model Overview

License
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement.

Description
Nemotron-3-8B-Base-4k is a large language foundation model for enterprises to build custom LLMs. This foundation model has 8 billion parameters, and supports a context length of 4,096 tokens. Nemotron-3-8B-Base-4k is part of Nemotron-3, a family of enterprise-ready generative text models compatible with NVIDIA NeMo Framework. For other models in this collection, see the collections page.
NVIDIA NeMo is an end-to-end, cloud-native platform to build, customize, and deploy generative AI models anywhere. It includes training and inferencing frameworks, guardrailing toolkits, data curation tools, and pretrained models, offering enterprises an easy, cost-effective, and fast way to adopt generative AI. To get access to NeMo Framework, please sign up at this link.

References

Announcement Blog

Model Architecture

Architecture Type: Transformer
Network Architecture: Generative Pre-Trained Transformer (GPT-3)

Software Integration
Runtime Engine(s): NVIDIA AI Enterprise
Toolkit: NeMo Framework
To get access to NeMo Framework, please sign up at this link. See NeMo inference container documentation for details on how to setup and deploy an inference server with NeMo.
Sample Inference Code:
from nemo.deploy import NemoQuery
nq = NemoQuery(url="localhost:8000", model_name="Nemotron-3-8B-4K")
output = nq.query_llm(prompts=["The meaning of life is"], max_output_token=200, top_k=1, top_p=0.0, temperature=0.1)
print(output)
Supported Hardware:
H100
A100 80GB, A100 40GB
Model Version(s)
Nemotron-3-8B-base-4k-BF16-1

Dataset & Training
The model uses a learning rate of 3e-4 with a warm-up period of 500M tokens and a cosine learning rate annealing schedule for 95% of the total training tokens. The decay stops at a minimum learning rate of 3e-5. The model is trained with a sequence length of 4096 and uses FlashAttention’s Multi-Head Attention implementation. 1,024 A100s were used for 19 days to train the model.
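A sketch of the schedule described above, assuming a linear warm-up to the 3e-4 peak followed by cosine annealing down to the 3e-5 floor. The step/token accounting is simplified to a progress fraction, so this is illustrative rather than NVIDIA's exact implementation:

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=3e-4, min_lr=3e-5):
    """Linear warm-up to peak_lr, then cosine-anneal down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warm-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # cosine decay: peak_lr at progress=0, min_lr at progress=1
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At the end of warm-up the rate is exactly 3e-4, and it decays smoothly to 3e-5 by the final step.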
NVIDIA models are trained on a diverse set of public and proprietary datasets. This model was trained on a dataset containing 3.8 trillion tokens of text. The dataset contains 53 different human languages (including English, German, Russian, Spanish, French, Japanese, Chinese, Italian, and Dutch) and 37 programming languages. The model also uses the training subsets of downstream academic benchmarks from sources like FLANv2, P3, and NaturalInstructions v2. NVIDIA is committed to the responsible development of large language models and conducts reviews of all datasets included in training.

Evaluation

| Task | Num-shot | Score |
|---|---|---|
| MMLU* | 5 | 54.4 |
| WinoGrande | 0 | 70.9 |
| Hellaswag | 0 | 76.4 |
| ARC Easy | 0 | 72.9 |
| TyDiQA-GoldP** | 1 | 49.2 |
| Lambada | 0 | 70.6 |
| WebQS | 0 | 22.9 |
| PiQA | 0 | 80.4 |
| GSM8K | 8-shot w/ maj@8 | 39.4 |

** The languages used are Arabic, Bangla, Finnish, Indonesian, Korean, Russian and Swahili.

Intended use
This is a completion model. For best performance, users are encouraged to customize the completion model using the NeMo Framework suite of customization tools, including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA) and SFT/RLHF. For chat use cases, please consider using the Nemotron-3-8B chat variants.

Ethical use

Technology can have a profound impact on people and the world, and NVIDIA is committed to enabling trust and transparency in AI development. NVIDIA encourages users to adopt principles of AI ethics and trustworthiness to guide business decisions by following the guidelines in the NVIDIA AI Foundation Models Community License Agreement.

Limitations
The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts.
The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output, even if the prompt itself does not include anything explicitly offensive.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains daily updates of location-based COVID-19 vaccine-related Twitter posts from January 2021 to August 2021.
With an existing Twitter account, we applied for Developer Access and were granted access to the Twitter Academic Researcher API, which allows for over 10 million tweets per month. Then, we created an application to generate the API credentials (access tokens) from Twitter. The access token was used in a Python (v3.6) script to authenticate and establish a connection to the Twitter database. To get geo-tagged vaccine-related tweets, we used the Python script we developed to perform a historical search (archive search) of vaccine-related keywords with place country South Africa (ZA). By geo-tagged tweets, we refer to Twitter posts with a known location. These vaccine-related keywords include but are not limited to vaccine, anti-vaxxer, vaccination, AstraZeneca, Oxford-AstraZeneca, IChooseVaccination, VaccineToSaveSouthAfrica, JohnsonJohnson, and Pfizer. The keywords were selected from the trending topics during the period of study. A complete list of the keywords is shown below:
Oxford-AstraZeneca, AstraZeneca, JohnsonJohnson, Vaccine, BioNTech, anti-vaccine, jab, Vaccination, Covax, Vaccine Rollout, Sputnik, VaccineToSaveSouthAfrica, IChooseVaccination, TeachersVaccine, AstraZeneca vaccine, Pfizer, J & J, Johonson & Johnson, Moderna, VaccinesWork, VacciNation, Vaccine, Steriod, COVIDvaccine, covax, VaccineEquity, VaccineReady, Jab OR PfizerGang, Scamdemic, Plandemic, Scaredemic, COVID-19, coronavirus, SARS-CoV-2, anti-vaxxers, jab, Pfizer, BioNTech, JJ, Vaccine, JohnsonJohnson Vaccine, Vaccine Rollout, J & J, Sputnik, COVAX, CoronaVac
The language of the tweets is English.
License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains detailed information about top-rated movies, retrieved directly from The Movie Database (TMDB) using its official public API.
Movie data was retrieved from the https://api.themoviedb.org/3/movie/top_rated endpoint, with genre names resolved via the /genre/movie/list endpoint using the provided genre IDs. Each row in the dataset represents a single movie and includes the following fields:
| Column | Description |
|---|---|
title | The official title of the movie |
overview | A short description or plot summary of the movie |
genre_ids | A list of genre IDs associated with the movie (from TMDB) |
genre_names | The corresponding genre names for the movie (e.g. Action, Drama) |
genre_str | A comma-separated string of genres for easy text processing |
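The mapping from genre_ids to genre_names and genre_str can be reproduced with a small lookup table. The IDs below are a few of TMDB's published genre IDs, shown as an illustrative subset rather than the full list:

```python
# Illustrative subset of TMDB genre IDs; the full mapping comes from /genre/movie/list.
GENRES = {28: "Action", 18: "Drama", 35: "Comedy"}

def genre_str(genre_ids):
    """Join genre names into the comma-separated form used by the genre_str column."""
    return ", ".join(GENRES.get(g, "Unknown") for g in genre_ids)
```

This is the kind of preprocessing that produced the genre_names and genre_str columns from the raw genre_ids.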
This dataset is ideal for a variety of educational and practical machine learning tasks, including but not limited to:
Natural Language Processing (NLP):
Data Visualization:
Machine Learning Projects:
This dataset is intended for learning and educational purposes only. Please adhere to TMDB's API terms of use when using this data in any public or commercial setting.
License: https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
CommonsenseQA is a new multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers. It contains 12,102 questions with one correct answer and four distractor answers. The dataset is provided in two major training/validation/testing set splits: "Random split", which is the main evaluation split, and "Question token split"; see the paper for details.
- This dataset can be used to train a model to predict the correct answers to multiple-choice questions.
- This dataset can be used to evaluate the performance of different models on the CommonsenseQA dataset.
- This dataset can be used to discover new types of commonsense knowledge required to predict the correct answers to questions in the CommonsenseQA dataset
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description | |:--------------|:---------------------------------------------------------------| | answerKey | The correct answer to the question. (String) | | choices | The four possible answers for each question. (List of strings) |
File: train.csv | Column name | Description | |:--------------|:---------------------------------------------------------------| | answerKey | The correct answer to the question. (String) | | choices | The four possible answers for each question. (List of strings) |
File: test.csv | Column name | Description | |:--------------|:---------------------------------------------------------------| | answerKey | The correct answer to the question. (String) | | choices | The four possible answers for each question. (List of strings) |
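A minimal sketch of loading one of the splits with the standard library, using the column names from the tables above (the file path is an assumption about where you placed the download):

```python
import csv

def load_split(path):
    """Read a CommonsenseQA split CSV into a list of row dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Each returned dict exposes the answerKey and choices fields directly, e.g. `load_split("validation.csv")[0]["answerKey"]`.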
License: https://creativecommons.org/publicdomain/zero/1.0/
I started collecting data from CoinMarketCap for all cryptocurrencies and tokens on a 12-hour cycle. I plan to upload it every weekend. If you need it sooner for your analysis, please PM me.
All available cryptocurrency data from CoinMarketCap: Symbol, Rank, Price USD, Price BTC, Market Cap, Date, Time.
Updated every 12 hours - update time is in EST.
Thanks coinmarketcap.com for the API access
What correlations are there? Is there a low-risk portfolio? What are the signals before crypto rises or crashes?
License: https://creativecommons.org/publicdomain/zero/1.0/
By huggingnft (From Huggingface) [source]
The NFT Images Dataset for Unconditional Generation, specifically focused on the Mutant Ape Yacht Club, offers valuable information and resources for artificial intelligence and machine learning enthusiasts. This dataset provides comprehensive details about NFT artwork, including high-quality images and associated metadata.
The dataset includes columns such as image and image_original_url, providing direct access to the image files in their original form through valid web addresses. These image files represent unique pieces of digital artwork from the Mutant Ape Yacht Club collection.
Moreover, this dataset offers crucial insights into each NFT artwork through the token_metadata column. This metadata encompasses various essential details regarding the specific piece of art, such as artist information, detailed descriptions, and unique attributes associated with each NFT. These attributes help differentiate one piece of art from another in terms of style, theme, rarity factors, or any other distinctive characteristics.
By utilizing this comprehensive dataset's resources in various AI applications like unconditional generation models or machine learning algorithms, users gain access to a wide range of digital artworks for research or creative purposes. Additionally, with accurate token metadata available alongside each image file, users can explore diverse aspects that contribute to the essence of these digital creations.
Introduction:
Dataset Overview:
- The dataset contains information about NFT images from the Mutant Ape Yacht Club, including image URLs, IDs, token metadata, and original image URLs.
- Each artwork in the dataset is represented by an image file and its corresponding metadata.
Accessing the Dataset:
- To access this dataset, download or import the provided train.csv file.
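Once train.csv is downloaded, a first look might resemble the sketch below. The stand-in row mirrors the three columns described above (image, token_metadata, image_original_url); the metadata field names inside it (name, attributes, trait_type) are assumptions for illustration, not a guarantee of the file's actual schema.

```python
import json
import pandas as pd

# Stand-in for pd.read_csv("train.csv"); the single row below only
# illustrates the column layout described in this dataset's overview.
df = pd.DataFrame({
    "image": ["mutant_ape_0001.png"],
    "token_metadata": [json.dumps({
        "name": "Mutant Ape #1",
        "attributes": [{"trait_type": "Fur", "value": "Zombie"}],
    })],
    "image_original_url": ["https://example.com/mutant_ape_0001.png"],
})

# token_metadata is stored as JSON text; parse it into dicts for analysis.
meta = df["token_metadata"].apply(json.loads)
print(meta.iloc[0]["name"])
```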
Understanding Key Columns:
- image (Image file): The image file of each artwork in the NFT collection.
- token_metadata (Text): Metadata associated with each artwork, such as artist details, description, and attributes.
- image_original_url (URL): The original URL of each image file, in case you need to refer back to it or access additional information.
Potential Use Cases:
- Unconditional Generation: The dataset can be used for unconditional generation tasks like training generative models or running experiments on creating novel artworks based on existing ones.
Preprocessing Steps: Before using this dataset for unconditional generation tasks, consider performing some preprocessing steps such as:
a) Image Processing: Resize or normalize images for consistent input dimensions if required by your model architecture.
b) Text Processing: Clean token_metadata column if needed by removing special characters or irrelevant text that may hinder model training.
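The image-processing step (a) can be sketched with Pillow and NumPy. The target resolution and the [-1, 1] pixel range below are conventional choices for GAN/VAE inputs, not requirements of this dataset.

```python
import numpy as np
from PIL import Image

def preprocess(img: Image.Image, size=(64, 64)) -> np.ndarray:
    """Resize to a fixed resolution and scale pixels to [-1, 1],
    the range many generative architectures expect as input."""
    img = img.convert("RGB").resize(size, Image.BILINEAR)
    arr = np.asarray(img, dtype=np.float32) / 127.5 - 1.0  # [0, 255] -> [-1, 1]
    return arr

# Demo with a synthetic image standing in for a dataset image file.
demo = Image.new("RGB", (512, 512), color=(200, 100, 50))
x = preprocess(demo)
print(x.shape)
```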
Exploratory Data Analysis (EDA): Conducting EDA reveals patterns within the dataset that can help you understand the art concepts better or optimize your unconditional generation models. Possible EDA tasks include:
a) Image Visualization: Display a subset of images to get a visual understanding of the artworks.
b) Metadata Analysis: Analyze the distribution or correlation between different attributes mentioned in the token_metadata column.
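Metadata analysis (b) often starts with counting trait occurrences. The records below are hypothetical parsed token_metadata dicts (their field names are assumptions); in practice they would come from `json.loads` applied to the token_metadata column.

```python
from collections import Counter

# Hypothetical parsed token_metadata records, for illustration only.
records = [
    {"attributes": [{"trait_type": "Fur", "value": "Zombie"},
                    {"trait_type": "Eyes", "value": "Bored"}]},
    {"attributes": [{"trait_type": "Fur", "value": "Golden"},
                    {"trait_type": "Eyes", "value": "Bored"}]},
    {"attributes": [{"trait_type": "Fur", "value": "Zombie"}]},
]

# Count how often each (trait, value) pair occurs; rare pairs point to
# the rarity factors mentioned in the metadata description.
counts = Counter(
    (a["trait_type"], a["value"]) for r in records for a in r["attributes"]
)
for (trait, value), n in counts.most_common():
    print(f"{trait}={value}: {n}")
```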
Training Unconditional Generation Models: Use this dataset to train generative models such as Variational Autoencoders (VAE), Generative Adversarial Networks (GANs), or other deep learning architectures for creative artwork generation.
Iterative Model Improvements: Experiment with different model architectures, hyperparameters, and loss functions to enhance the quality of the generated artwork.
- Unconditional Image Generation: This dataset can be used for training generative models to create new and unique NFT artwork. By training a model on this dataset, it can learn the patterns, styles, and attributes of Mutant Ape Yacht Club NFTs and generate new images in a similar style.
- Artistic Style Transfer: The token_metadata associated with each NFT artwork provides valuable information about the artist, description, and attributes of the image. This metadata can be used with artistic style transfer techniques to apply the distinctive style of Mutant Ape Yacht Club artworks to other images.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
“Russian is an East Slavic language and an official language in Russia, Belarus, Kazakhstan, Kyrgyzstan and many minor or unrecognised territories. It is an unofficial but widely spoken language in Ukraine and Latvia, and to a lesser extent, the other countries that were once constituent republics of the Soviet Union and former participants of the Eastern Bloc.” -- “Russian Language” on Wikipedia
Russian has around 150 million native speakers and 110 million non-native speakers. Russian is written in the Cyrillic script. This dataset is a morphologically, syntactically, and semantically annotated corpus of texts in Russian, fully accessible to researchers and edited by users.
This dataset is encoded in UTF-8. There are two files included in this dataset: the corpus and the dictionary. The corpus is in .json format, while the dictionary is in plain text.
In the dictionary, each entry is a lemma, presented with all of its tagged derivations. The tags depend on the part of speech of the lemma. Some examples are:
A Python script to convert the tags in this corpus to a tag set more commonly used in English-language linguistics can be found here.
Sample dictionary entries:
1
ЁЖ NOUN,anim,masc sing,nomn
ЕЖА NOUN,anim,masc sing,gent
ЕЖУ NOUN,anim,masc sing,datv
ЕЖА NOUN,anim,masc sing,accs
ЕЖОМ NOUN,anim,masc sing,ablt
ЕЖЕ NOUN,anim,masc sing,loct
ЕЖИ NOUN,anim,masc plur,nomn
ЕЖЕЙ NOUN,anim,masc plur,gent
ЕЖАМ NOUN,anim,masc plur,datv
41
ЁРНИЧАЮ VERB,impf,intr sing,1per,pres,indc
ЁРНИЧАЕМ VERB,impf,intr plur,1per,pres,indc
ЁРНИЧАЕШЬ VERB,impf,intr sing,2per,pres,indc
ЁРНИЧАЕТЕ VERB,impf,intr plur,2per,pres,indc
ЁРНИЧАЕТ VERB,impf,intr sing,3per,pres,indc
ЁРНИЧАЮТ VERB,impf,intr plur,3per,pres,indc
ЁРНИЧАЛ VERB,impf,intr masc,sing,past,indc
ЁРНИЧАЛА VERB,impf,intr femn,sing,past,indc
ЁРНИЧАЛО VERB,impf,intr neut,sing,past,indc
ЁРНИЧАЛИ VERB,impf,intr plur,past,indc
ЁРНИЧАЙ VERB,impf,intr sing,impr,excl
ЁРНИЧАЙТЕ VERB,impf,intr plur,impr,excl
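The entries above can be read with a short parser: each entry starts with a numeric lemma id, followed by one "FORM TAGS" line per derivation, where the tags mix comma and space separators. This is a minimal sketch based only on the samples shown; the real file may contain additional structure.

```python
# Minimal parser for the dictionary format shown above.
sample = """\
1
ЁЖ NOUN,anim,masc sing,nomn
ЕЖА NOUN,anim,masc sing,gent
ЕЖУ NOUN,anim,masc sing,datv
"""

def parse_dictionary(text):
    entries = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():            # a bare number starts a new lemma entry
            current_id = int(line)
            entries[current_id] = []
        else:
            form, _, tags = line.partition(" ")
            # Tags are separated by both commas and spaces; normalize to one list.
            entries[current_id].append((form, tags.replace(" ", ",").split(",")))
    return entries

entries = parse_dictionary(sample)
print(entries[1][0])  # ('ЁЖ', ['NOUN', 'anim', 'masc', 'sing', 'nomn'])
```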
In this corpus, each word has been grammatically tagged. You can access individual tokens using the following general path:
JSON > text > paragraphs > paragraph > [paragraph number] > sentence > [sentence number] > tokens > [token number]
Each token has:
- A unique id number (@id)
- The text of the token (@text)
- Information on the lemma (under “l”), including the id number of the lemma as found in the dictionary
You can see an example of the token portion of the .json structure below:
{
"@id": 1714292,
"@text": "сват",
"tfr": {
"@t": "сват",
"@rev_id": 3754311,
"v": {
"l": {
"@id": 314741,
"@t": "сват",
"g": [
{
"@v": "NOUN"
},
{
"@v": "anim"
},
{
"@v": "masc"
},
{
"@v": "sing"
},
{
"@v": "nomn"
}
]
}
}
}
}
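Extracting the lemma and its grammemes from a token of this shape is a few lines of standard-library Python. The sketch below uses the exact token object shown above.

```python
import json

# The token object exactly as shown above; walking the path
# text > paragraphs > ... > tokens eventually yields dicts of this shape.
token = json.loads("""
{
  "@id": 1714292,
  "@text": "сват",
  "tfr": {
    "@t": "сват",
    "@rev_id": 3754311,
    "v": {
      "l": {
        "@id": 314741,
        "@t": "сват",
        "g": [
          {"@v": "NOUN"}, {"@v": "anim"}, {"@v": "masc"},
          {"@v": "sing"}, {"@v": "nomn"}
        ]
      }
    }
  }
}
""")

lemma = token["tfr"]["v"]["l"]                 # lemma info lives under "l"
grammemes = [g["@v"] for g in lemma["g"]]      # flatten the grammeme dicts
print(token["@text"], lemma["@id"], grammemes)
```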
This dataset was collected and annotated by, among others, Svetlana Alekseeva, Anastasia Bodrova, Victor Bocharov, Dmitry Granovsky, Irina Krylova, Maria Nikolaeva, Catherine Protopopova, Alexander Chuchunkov, Anastasia Shimorina, Vasily Alekseev, Natalia Ostapuk, Maria Stepanova and Alexey Surikov. The code used to collect and clean this data is available online.
It is reproduced here under a CC-BY-SA license.
More information on this corpus and its most recent version can be found here (in Russian).
This dataset was created by Sa Atmax.