MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by sundriedtomatoes
Released under MIT
A subset of the codeparrot/github-code dataset consisting of 1 million tokenized Python files in the Lance file format, for fast and memory-efficient I/O.
The files were tokenized using the EleutherAI/gpt-neox-20b tokenizer with no extra tokens.
For detailed information on how the dataset was created, refer to my article on Curating Custom Datasets for efficient LLM training using Lance.
The script used for creating the dataset can be found here.
This dataset is not intended to be used in Kaggle Kernels: Lance requires write access to the dataset's input directory, which Kaggle's input directory does not provide, and the dataset's size prohibits moving it to /kaggle/working. To use this dataset, download it with the Kaggle API or through this page, then move the unzipped files to a folder called codeparrot_1M.lance. Below are detailed snippets on how to download and use this dataset.
First, download and unzip the dataset from your terminal (make sure your Kaggle API key is at ~/.kaggle/):
$ pip install -q kaggle pyarrow pylance
$ kaggle datasets download -d heyytanay/codeparrot-1m
$ mkdir codeparrot_1M.lance/
$ unzip -qq codeparrot-1m.zip -d codeparrot_1M.lance/
$ rm codeparrot-1m.zip
Once this is done, you will find the dataset in the codeparrot_1M.lance/ folder. To load it and get a gist of the data, run the snippet below.
import lance
dataset = lance.dataset('codeparrot_1M.lance/')
print(dataset.count_rows())
This will give you the total number of tokens in the dataset.
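If you want to read actual token data rather than just count rows, Lance supports fast random-access reads. Below is a minimal sketch; the column layout is an assumption (it is not stated above), so check dataset.schema for the real column names first.
import lance

ds = lance.dataset('codeparrot_1M.lance/')
print(ds.schema)  # confirm the actual column name(s) before indexing

# Read a contiguous window of rows without loading the whole dataset into memory
indices = list(range(0, 2048))
batch = ds.take(indices)   # returns a pyarrow.Table with only the requested rows
print(batch.num_rows)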
Considerations for Using the Data The dataset consists of source code from a wide range of repositories. As such they can potentially include harmful or biased code as well as sensitive information like passwords or usernames.
https://choosealicense.com/licenses/cc0-1.0/
What is this?
This is a cleaned version of Amazon Product Dataset 2020 from Kaggle.
Why?
Accessing it via the Hugging Face API is easier; the Kaggle API is annoying because authentication requires keeping credentials in a local folder. The data was cleaned because 13 of the 28 columns are empty.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Doge Coin: An explosion’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/cyruskouhyar/doge-coin-an-explosion on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains Dogecoin prices from 2019 to the present.
Dogecoin prices with details: open, close, low, and high prices, along with all related dates. The API I got the results from is CoinAPI; with the free plan you can access the REST API. I put the link below so you can use it as well.
Thanks to CoinAPI for this amazing service. I will be happy if you upvote the dataset and follow my Kaggle profile. 😃 I did the same thing for Bitcoin: https://www.kaggle.com/cyruskouhyar/btcprices2015now
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
The past two months were crazy in the crypto market. The goal is to allow analysis of correlations between Bitcoin and other cryptocurrencies in order to do smarter day trading.
This dataset was updated every 15 minutes using the CoinMarketCap API and includes the top 100 coins' market cap, price in USD, and price in BTC. Every row has its update time in the EST time zone.
Coin Market Cap API
Who are the followers and leaders in the crypto market? When BTC goes down - what coins should be bought and when? When it goes up - which coins start to rise following it but still giving us enough time to buy them?
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Below is a draft DMP-style description of the credit-card fraud detection experiment:
Research Domain
This work resides in the domain of financial fraud detection and applied machine learning. We focus on detecting anomalous credit‐card transactions in real time to reduce financial losses and improve trust in digital payment systems.
Purpose
The goal is to train and evaluate a binary classification model that flags potentially fraudulent transactions. By publishing both the code and data splits via FAIR repositories, we enable reproducible benchmarking of fraud‐detection algorithms and support future research on anomaly detection in transaction data.
Data Sources
We used the publicly available credit‐card transaction dataset from Kaggle (original source: https://www.kaggle.com/mlg-ulb/creditcardfraud), which contains anonymized transactions made by European cardholders over two days in September 2013. The dataset includes 284 807 transactions, of which 492 are fraudulent.
Method of Dataset Preparation
Schema validation: Renamed columns to snake_case (e.g., transaction_amount, is_declined) so they conform to DBRepo's requirements.
Data import: Uploaded the full CSV into DBRepo, assigned persistent identifiers (PIDs).
Splitting: Programmatically derived three subsets (training 70%, validation 15%, test 15%) using range-based filters on the primary key actionnr. Each subset was materialized in DBRepo and assigned its own PID for precise citation.
Cleaning: Converted the categorical flags (is_declined, isforeigntransaction, ishighriskcountry, isfradulent) from “Y”/“N” to 1/0 and dropped non-feature identifiers (actionnr, merchant_id).
Modeling: Trained a RandomForest classifier on the training split, tuned on validation, and evaluated on the held‐out test set.
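A minimal sketch of the cleaning, splitting, and modeling steps above, assuming pandas and scikit-learn and the column names listed under Dataset Structure (the CSV file name and the exact split boundaries are illustrative, not taken from the original workflow):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("creditcard_fraud.csv")   # hypothetical local file name

# Convert the Y/N categorical flags to 1/0
flag_cols = ["is_declined", "isforeigntransaction", "ishighriskcountry", "isfradulent"]
df[flag_cols] = (df[flag_cols] == "Y").astype(int)

# Target and features: drop non-feature identifiers
y = df["isfradulent"]
X = df.drop(columns=["actionnr", "merchant_id", "isfradulent"])

# Range-based 70/15/15 split following the primary-key order (illustrative boundaries)
n = len(df)
train_end, val_end = int(0.70 * n), int(0.85 * n)
X_train, y_train = X.iloc[:train_end], y.iloc[:train_end]
X_val, y_val = X.iloc[train_end:val_end], y.iloc[train_end:val_end]
X_test, y_test = X.iloc[val_end:], y.iloc[val_end:]

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))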
Dataset Structure
The raw data is a single CSV with columns:
actionnr (integer transaction ID)
merchant_id (string)
average_amount_transaction_day (float)
transaction_amount (float)
is_declined, isforeigntransaction, ishighriskcountry, isfradulent (binary flags)
total_number_of_declines_day, daily_chargeback_avg_amt, sixmonth_avg_chbk_amt, sixmonth_chbk_freq (numeric features)
Naming Conventions
All columns use lowercase snake_case.
Subsets are named creditcard_training, creditcard_validation, and creditcard_test in DBRepo.
Files in the code repo follow a clear structure:
├── data/ # local copies only; raw data lives in DBRepo
├── notebooks/Task.ipynb
├── models/rf_model_v1.joblib
├── outputs/ # confusion_matrix.png, roc_curve.png, predictions.csv
├── README.md
├── requirements.txt
└── codemeta.json
Required Software
Python 3.9+
pandas, numpy (data handling)
scikit-learn (modeling, metrics)
matplotlib (visualizations)
dbrepo_client.py (DBRepo API)
requests (TU WRD API)
Additional Resources
Original dataset: https://www.kaggle.com/mlg-ulb/creditcardfraud
Scikit-learn docs: https://scikit-learn.org/stable
DBRepo API guide: via the starter notebook’s dbrepo_client.py template
TU WRD REST API spec: https://test.researchdata.tuwien.ac.at/api/docs
Data Limitations
Highly imbalanced: only ~0.17% of transactions are fraudulent.
Anonymized PCA features (V1–V28) are hidden; we extended the data with domain features but cannot reverse-engineer the raw variables.
Time-bounded: covers only two days of transactions and may not capture seasonal patterns.
Licensing and Attribution
Raw data: CC-0 (per Kaggle terms)
Code & notebooks: MIT License
Model artifacts & outputs: CC-BY 4.0
TU WRD records include ORCID identifiers for the author.
Recommended Uses
Benchmarking new fraud‐detection algorithms on a standard imbalanced dataset.
Educational purposes: demonstrating model‐training pipelines, FAIR data practices.
Extension: adding time‐series or deep‐learning models.
Known Issues
Possible temporal leakage if date/time features are not handled correctly.
Model performance may degrade on live data due to concept drift.
Binary flags may oversimplify nuanced transaction outcomes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cryptocurrency historical datasets from January 2012 (where available) to October 2021 were obtained and integrated from various sources and Application Programming Interfaces (APIs), including Yahoo Finance, CryptoDataDownload, CoinMarketCap, various Kaggle datasets, and several other APIs. While these sources used various time granularities (e.g., minutes, hours, days), daily data was used to integrate the datasets in this research study. The integrated cryptocurrency historical datasets for 80 cryptocurrencies, including but not limited to Bitcoin (BTC), Ethereum (ETH), Binance Coin (BNB), Cardano (ADA), Tether (USDT), Ripple (XRP), Solana (SOL), Polkadot (DOT), USD Coin (USDC), Dogecoin (DOGE), Tron (TRX), Bitcoin Cash (BCH), Litecoin (LTC), EOS (EOS), Cosmos (ATOM), Stellar (XLM), Wrapped Bitcoin (WBTC), Uniswap (UNI), Terra (LUNA), SHIBA INU (SHIB), and 60 more, were uploaded to this online Mendeley data repository. Although the primary criterion for including the mentioned cryptocurrencies was market capitalization, a subject-matter expert (a professional trader) also guided the initial selection by analyzing various indicators such as the Relative Strength Index (RSI), Moving Average Convergence/Divergence (MACD), MYC Signals, Bollinger Bands, Fibonacci Retracement, the Stochastic Oscillator, and the Ichimoku Cloud. The primary features of this dataset that were used as the decision-making criteria of the CLUS-MCDA II approach are Timestamps, Open, High, Low, Close, Volume (Currency), % Change (7 days and 24 hours), Market Cap, and Weighted Price values.
The available Excel and CSV files in this data set are just part of the integrated data; the other databases, datasets, and API references used in this study are as follows:
[1] https://finance.yahoo.com/
[2] https://coinmarketcap.com/historical/
[3] https://cryptodatadownload.com/
[4] https://kaggle.com/philmohun/cryptocurrency-financial-data
[5] https://kaggle.com/deepshah16/meme-cryptocurrency-historical-data
[6] https://kaggle.com/sudalairajkumar/cryptocurrencypricehistory
[7] https://min-api.cryptocompare.com/data/price?fsym=BTC&tsyms=USD
[8] https://min-api.cryptocompare.com/
[9] https://p.nomics.com/cryptocurrency-bitcoin-api
[10] https://www.coinapi.io/
[11] https://www.coingecko.com/en/api
[12] https://cryptowat.ch/
[13] https://www.alphavantage.co/
This dataset is part of the CLUS-MCDA (Cluster analysis for improving Multiple Criteria Decision Analysis) and CLUS-MCDA II project:
https://aimaghsoodi.github.io/CLUSMCDA-R-Package/
https://github.com/Aimaghsoodi/CLUS-MCDA-II
https://github.com/azadkavian/CLUS-MCDA
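Reference [7] above is a live price endpoint. A minimal sketch of querying it with Python's requests library is shown below; the shape of the response noted in the comment is an assumption to verify against the CryptoCompare documentation.
import requests

url = "https://min-api.cryptocompare.com/data/price"
resp = requests.get(url, params={"fsym": "BTC", "tsyms": "USD"}, timeout=10)
resp.raise_for_status()
print(resp.json())   # expected shape, e.g. {"USD": <current BTC price>}; verify against the API docs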
https://creativecommons.org/publicdomain/zero/1.0/
The Global Financial Inclusion Database provides 800 country-level indicators of financial inclusion summarized for all adults and disaggregated by key demographic characteristics-gender, age, education, income, and rural residence. Covering more than 140 economies, the indicators of financial inclusion measure how people save, borrow, make payments and manage risk.
The reference citation for the data is: Demirguc-Kunt, Asli, Leora Klapper, Dorothe Singer, and Peter Van Oudheusden. 2015. “The Global Findex Database 2014: Measuring Financial Inclusion around the World.” Policy Research Working Paper 7255, World Bank, Washington, DC.
This is a dataset hosted by the World Bank. The organization has an open data platform found here, and they update their information according to the amount of data that is brought in. Explore the World Bank using Kaggle and all of the data sources available through the World Bank organization page!
This dataset is maintained using the World Bank's APIs and Kaggle's API.
Cover photo by ZACHARY STAINES on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
https://creativecommons.org/publicdomain/zero/1.0/
The RAWG Games Dataset contains video game records gathered directly from the RAWG API. It includes essential fields such as game id, title, release date, rating, genres, platforms, descriptive tags, Metacritic score, developers, publishers, playtime, and a detailed description. The data was collected to support studies, trend analysis, and insights into the gaming industry. Each field is aligned with the specifications provided in the RAWG API documentation; a request sketch follows the field table below.
Latest Update: February 14, 2025
Grateful to RAWG for the data API.
Field | Description |
---|---|
id | A unique identifier for each game, serving as the primary key to reference detailed game data via the API. |
name | The official title of the game. |
released | The release date of the game, typically in the YYYY-MM-DD format. |
rating | An aggregated score based on player reviews, computed on a standardized scale reflecting user opinions. |
genres | A list of genre objects categorizing the game (e.g., Action, Adventure, RPG). |
platforms | An array of platform objects that indicate on which systems the game is available (e.g., PC, PlayStation, Xbox). |
tags | A collection of descriptive keyword tags (e.g., multiplayer, indie). |
metacritic | A numerical score derived from Metacritic reviews (usually ranging from 0 to 100). |
developers | The individuals or companies responsible for creating the game. |
publishers | Entities that market and distribute the game. |
playtime | An estimate of the average time (in hours) that players spend engaging with the game. |
description | A detailed narrative of the game, providing in-depth information about gameplay, plot, mechanics, and overall context. |
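As noted in the id row above, each identifier can be used to pull detailed game data from the RAWG API. A minimal sketch with Python's requests library, assuming you have a RAWG API key; the endpoint and parameters should be verified against the RAWG API documentation, and the game id shown is a placeholder.
import requests

API_KEY = "YOUR_RAWG_API_KEY"   # issued by rawg.io
game_id = 3498                  # placeholder; use a value from the dataset's id column

resp = requests.get(
    f"https://api.rawg.io/api/games/{game_id}",   # endpoint per the RAWG API docs; verify before use
    params={"key": API_KEY},
    timeout=10,
)
resp.raise_for_status()
game = resp.json()
print(game["name"], game.get("metacritic"))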
St. Louis Fed’s Economic News Index (ENI) uses economic content from key monthly economic data releases to forecast the growth of real GDP during that quarter. In general, the most-current observation is revised multiple times throughout the quarter. The final forecasted value (before the BEA’s release of the advance estimate of GDP) is the static, historical value for that quarter. For more information, see Grover, Sean P.; Kliesen, Kevin L.; and McCracken, Michael W. “A Macroeconomic News Index for Constructing Nowcasts of U.S. Real Gross Domestic Product Growth" (https://research.stlouisfed.org/publications/review/2016/12/05/a-macroeconomic-news-index-for-constructing-nowcasts-of-u-s-real-gross-domestic-product-growth/ )
This is a dataset from the Federal Reserve Bank of St. Louis hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found here and they update their information according to the frequency that the data updates. Explore the Federal Reserve Bank of St. Louis using Kaggle and all of the data sources available through the St. Louis Fed organization page!
Update Frequency: This dataset is updated daily.
Observation Start: 2013-04-01
Observation End: 2019-10-01
This dataset is maintained using FRED's API and Kaggle's API.
Cover photo by Ferdinand Stöhr on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
The STLFSI measures the degree of financial stress in the markets and is constructed from 18 weekly data series: seven interest rate series, six yield spreads and five other indicators. Each of these variables captures some aspect of financial stress. Accordingly, as the level of financial stress in the economy changes, the data series are likely to move together.
How to Interpret the Index: The average value of the index, which begins in late 1993, is designed to be zero. Thus, zero is viewed as representing normal financial market conditions. Values below zero suggest below-average financial market stress, while values above zero suggest above-average financial market stress.
More information: For additional information on the STLFSI and its construction, see "Measuring Financial Market Stress" (https://files.stlouisfed.org/research/publications/es/10/ES1002.pdf) and the related appendix (https://files.stlouisfed.org/files/htdocs/publications/net/NETJan2010Appendix.pdf).
See this list (https://www.stlouisfed.org/news-releases/st-louis-fed-financial-stress-index/stlfsi-key) of the components that are used to construct the STLFSI.
As of 07/15/2010 the Vanguard Financial Exchange-Traded Fund series has been replaced with the S&P 500 Financials Index. This change was made to facilitate a more timely and automated updating of the FSI. Switching from the Vanguard series to the S&P series produced no meaningful change in the index.
Copyright, 2016, Federal Reserve Bank of St. Louis.
This is a dataset from the Federal Reserve Bank of St. Louis hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found here and they update their information according to the frequency that the data updates. Explore the Federal Reserve Bank of St. Louis using Kaggle and all of the data sources available through the St. Louis Fed organization page!
Update Frequency: This dataset is updated daily.
Observation Start: 1993-12-31
Observation End: 2019-11-29
This dataset is maintained using FRED's API and Kaggle's API.
Cover photo by Laura Lefurgey-Smith on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides comprehensive annual data on Tether (USDT), one of the most widely used stablecoins in the cryptocurrency ecosystem. The data includes key market metrics collected via the CoinGecko API, structured for in-depth analysis and versatile applications, such as market analysis, financial modeling, and machine learning algorithms.
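For readers who want to refresh or extend the annual metrics, a minimal sketch of pulling Tether market data from the CoinGecko API is shown below. The endpoint and parameters are assumptions to verify against the CoinGecko documentation; this is not the exact script used to build the dataset.
import requests

# Daily USDT market data for the last 365 days (verify the endpoint against the CoinGecko docs)
url = "https://api.coingecko.com/api/v3/coins/tether/market_chart"
resp = requests.get(url, params={"vs_currency": "usd", "days": 365}, timeout=10)
resp.raise_for_status()
data = resp.json()
print(len(data["prices"]), "price points")   # 'prices' is assumed to hold [timestamp_ms, price] pairs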
To cite the dataset please reference it as Y. Kim, S. Hakak, and A. Ghorbani. "DDoS Attack Dataset (CICEV2023) against EV Authentication in Charging Infrastructure," in 2023 20th Annual International Conference on Privacy, Security and Trust (PST), IEEE Computer Society, pp. 1-9, August 2023.
Explore a comprehensive dataset capturing DDoS attack scenarios within electric vehicle (EV) charging infrastructure. This dataset features diverse machine learning attributes, including packet access counts, system status details, and authentication profiles across multiple charging stations and grid services. Simulated attack scenarios, authentication protocols, and extensive profiling results offer invaluable insights for training and testing detection models in safeguarding EV charging systems against cyber threats.
Figure 1: Proposed simulator structure, source: Y. Kim, S. Hakak, and A. Ghorbani.
Acknowledgment :
The authors sincerely appreciate the support provided by the Canadian Institute for Cybersecurity (CIC), as well as the funding received from the Canada Research Chair and the Atlantic Canada Opportunities Agency (ACOA).
Reference :
Y. Kim, S. Hakak, and A. Ghorbani. "DDoS Attack Dataset (CICEV2023) against EV Authentication in Charging Infrastructure," in 2023 20th Annual International Conference on Privacy, Security and Trust (PST), IEEE Computer Society, pp. 1-9, August 2023.
Data on the song catalogue of The Beatles, along with audio features such as Tempo, Key, Mode, Energy, Loudness, Valence, and Danceability, among others, as assigned by the Spotify API.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🌍 AI Waste Recognition Dataset – Empowering Sustainable Solutions with Deep Learning
📌 Overview
The AI Waste Recognition Dataset is a high-quality dataset designed for training deep learning models to automatically classify and detect waste materials. With a growing need for smart waste management, this dataset provides a structured approach to recognizing four key waste categories:
♻ Plastic Bottles ♻ Aluminium Cans ♻ Paper Cups ♻ Glass Bottles
By leveraging this dataset, researchers, data scientists, and AI enthusiasts can develop advanced computer vision models to enhance automated recycling systems, reduce environmental pollution, and contribute to a sustainable future.
📊 Dataset Details 🔹 Total Images: 100,000+ (Augmented for diversity) 🔹 Categories: 4 (Plastic Bottles, Aluminium Cans, Paper Cups, Glass Bottles) 🔹 Resolution: High-quality 256x256 images 🔹 Annotations: Labeled with folder names (stored in labels.csv) 🔹 File Format: JPEG / PNG
This dataset includes real-world waste images collected from various environments, augmented with advanced transformations to improve model generalization.
🚀 Ideal Use Cases ✅ Object Detection & Classification – Train CNNs, YOLO, Faster R-CNN, etc. ✅ AI-Powered Recycling Bins – Automate waste sorting in smart bins. ✅ Environmental AI Research – Contribute to eco-friendly AI projects. ✅ Edge AI & IoT – Deploy waste detection models on edge devices.
📥 How to Use? 1️⃣ Download the dataset or load it via Kaggle API. 2️⃣ Use labels.csv to map images to their respective classes. 3️⃣ Train deep learning models using TensorFlow, PyTorch, or YOLO. 4️⃣ Deploy your model for real-world waste classification!
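A minimal sketch of step 3 using PyTorch, assuming labels.csv contains image-path and class-label columns; the column names image and label below are hypothetical, so adjust them to the actual file.
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class WasteDataset(Dataset):
    """Maps rows of labels.csv to (image_tensor, class_index) pairs."""
    def __init__(self, csv_path="labels.csv", root="."):
        self.df = pd.read_csv(csv_path)            # assumed columns: 'image', 'label'
        self.classes = sorted(self.df["label"].unique())
        self.class_to_idx = {c: i for i, c in enumerate(self.classes)}
        self.root = root
        self.tf = transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        row = self.df.iloc[i]
        img = Image.open(f"{self.root}/{row['image']}").convert("RGB")
        return self.tf(img), self.class_to_idx[row["label"]]

loader = DataLoader(WasteDataset(), batch_size=32, shuffle=True)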
🎯 Why This Dataset? 🌟 Well-structured & diverse – Covers different lighting, backgrounds & perspectives. 🌟 AI-ready – Optimized for deep learning & computer vision tasks. 🌟 Promotes sustainability – Helps in developing AI solutions for waste management. 🌟 Real-world applications – Supports smart cities & environmental research.
🛠️ Get Started Today! Use this dataset to build innovative AI models, contribute to making the dataset better, and be part of VELOCIS.
🔹 Keywords: AI Waste Detection, Smart Recycling, Object Recognition, Deep Learning, CNN, YOLO, Kaggle Dataset
http://www.gnu.org/licenses/lgpl-3.0.html
Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.
Size of the dataset: 41 GB uncompressed, 20 GB compressed.
Key Features:
Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.
Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.
Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am sharing it with the team.
Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.
Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.
Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.
Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. The JSON format contains the English word and its equivalent word as a single record. The data was exported from a MongoDB database to ensure the uniqueness of the records; each record is unique and sorted.
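The exact field names are not specified above; the sketch below assumes a JSON Lines layout with hypothetical english, translation, and language keys, purely to illustrate streaming the records without loading the full 41 GB file into memory.
import json

# File name and keys ('english', 'translation', 'language') are hypothetical placeholders
with open("translations.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        pair = (record["english"], record["translation"], record["language"])
        # feed `pair` into your preprocessing / NMT training pipeline here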
Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.
The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.
Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.
Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.
Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.
Dataset Preparation: The translation ...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains match statistics from FACEIT, an online matchmaking service for Counter-Strike 2 (CS2). The data has been collected via the FACEIT API* (see below) and preprocessed for machine learning applications focused on predicting match outcomes based on teams' average player performance history and Elo. Use win_prediciton_clean, or join the individual Excel files data_win_prediction_#.xlsx (the Excel files are not cleaned and contain duplicates and non-competitive maps).
Dataset Details:
- Observations: 9,651 matches
- Response: win, the team that won the given match (a or b)
- Match ID: the ID given by FACEIT for the match played; can be used to pull additional match data from the API.*
Features:
- Average Win Percentage (for the given map)
- Average ELO (team skill rating)
- Average Kills per Round (K/R Ratio)
Also attached are the notebooks used to pull the data, perform feature engineering, and tune the models. The highest predictive accuracy I was able to reach was 77.11% ± 0.84 using a CNN.
*If you'd like to pull data from FACEIT API, you need an authorization token from FACEIT, you can get more information at https://docs.faceit.com/.
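A minimal sketch of pulling a single match by its match ID with Python's requests library, assuming a FACEIT Data API key; the endpoint path should be verified against the FACEIT documentation linked above, and the match ID shown is a placeholder.
import requests

API_KEY = "YOUR_FACEIT_API_KEY"    # server-side key from the FACEIT developer portal
match_id = "1-xxxxxxxx-xxxx"       # placeholder; use a value from the dataset's match ID column

resp = requests.get(
    f"https://open.faceit.com/data/v4/matches/{match_id}",   # Data API endpoint; verify in the docs
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()
match = resp.json()
print(match.get("teams", {}).keys())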
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset encompasses all albums released by the renowned US songwriter and artist Taylor Swift, up to and including June 6, 2024. The most recent addition to this collection is "The Tortured Poets Department: The Anthology," featuring 31 tracks. The dataset has been generated using the Python library SpotiPy and is provided in an untouched, unfiltered, and raw state, making it ideal for model training, data analysis, or visualization projects.
Key Features: - Comprehensive Collection: Includes all of Taylor Swift's albums released by June 6, 2024. - Latest Album: "The Tortured Poets Department: The Anthology" with 31 tracks. - Raw and Unfiltered: The dataset is presented in its original form without any modifications, ensuring the authenticity of the data. - Generated with SpotiPy: Data extracted using the SpotiPy library, ensuring accuracy and reliability.
Usage Notes: - Multiple Versions of Albums: Be aware that the dataset includes multiple versions of some albums. This means that tracks and their details may appear more than once if they are present in different album versions. - Model Training and Visualization: The dataset's comprehensive and unaltered nature makes it an excellent resource for various applications, including machine learning model training, data analysis, and visualizations.
Potential Applications: - Music Analysis: Analyze trends, patterns, and characteristics of Taylor Swift's music over the years. - Machine Learning: Train models for music recommendation, genre classification, or popularity prediction. - Data Visualization: Create visual representations of Taylor Swift's discography, track features, and album details.
Dataset Contents: - Album Details: Information about each album, including release dates, album names, and the number of tracks. - Track Information: Details about each track, such as track names, durations, and other relevant metadata. - Track Audio Features: Includes features like danceability, energy, acousticness, speechiness, etc. Note: The descriptions of the audio features have been taken directly from the Spotify API description of each term to eliminate any confusion.
Acknowledgements: This dataset was created using the SpotiPy library, a Python client for the Spotify Web API, which allows for easy access to Spotify's vast music catalog.
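A minimal sketch of how album, track, and audio-feature data like this can be pulled with SpotiPy; the client credentials and artist ID are placeholders, and this is illustrative rather than the exact script used to build the dataset.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))

artist_id = "06HL4z0CvFAxyc27GXpf02"   # commonly cited Spotify artist ID for Taylor Swift; verify before use
albums = sp.artist_albums(artist_id, album_type="album")
for album in albums["items"]:
    tracks = sp.album_tracks(album["id"])["items"]
    features = sp.audio_features([t["id"] for t in tracks])   # danceability, energy, etc.
    print(album["name"], len(tracks), "tracks")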
We hope this dataset provides valuable insights and facilitates various analyses and applications related to Taylor Swift's music.
For any questions or issues, please feel free to contact us through the Kaggle community forum.
Enjoy exploring Taylor Swift's musical journey!
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Dataset Overview
This dataset provides a detailed snapshot of real estate properties listed in Dubai, UAE, as of August 2024. The dataset includes over 5,000 listings scraped using the Apify API from Propertyfinder and various other real estate websites in the UAE. The data includes key details such as the number of bedrooms and bathrooms, price, location, size, and whether the listing is verified. All personal identifiers, such as agent names and contact details, have been ethically removed.
Data Science Applications
Given the size and structure of this dataset, it is ideal for the following data science applications:
This dataset provides a practical foundation for both beginners and experts in data science, allowing for the exploration of real estate trends, development of predictive models, and implementation of machine learning algorithms.
Column Descriptors
Ethically Mined Data
This dataset was ethically scraped using the Apify API, ensuring compliance with data privacy standards. All personal data such as agent names, phone numbers, and any other sensitive information have been omitted from this dataset to ensure privacy and ethical use. The data is intended solely for educational purposes and should not be used for commercial activities.
Acknowledgements
This dataset was made possible thanks to the following:
- **Photo by**: Francesca Tosolini on Unsplash
Use the Data Responsibly
Please ensure that this dataset is used responsibly, with respect to privacy and data ethics. This data is provided for educational purposes.