Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Abbad Alam
Released under Apache 2.0
This dataset was created by v1nor1
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Zheung Yik2024
Released under Apache 2.0
This dataset was created by Marcos Faria
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains a snapshot of the official Kaggle datasets leaderboard, taken between October 21st and October 24th, 2024. For every user, the dataframe contains all of their datasets, with information sourced through the Kaggle API. Currently, the dataset covers only the top 250 users, but I have a larger snapshot of the leaderboard and aim to expand the dataset to include the top 1000 dataset contributors.
Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.
💻 Code 🤗 Dataset (Hugging Face) 💾 Dataset (Kaggle) 💽 Dataset (Zenodo) 📜 Paper (ACL) 📝 Paper (Arxiv) ⚡ Pre-trained ELECTRA (Hugging Face)
We recommend downloading from Kaggle if you can authenticate through their API. The advantage of Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.
First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:
```
pip install kaggle
```
Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:
```
kaggle datasets download xhlulu/medal-emnlp
```
Now, unzip everything and place the files inside the `data` directory:
```
unzip -nq medal-emnlp.zip -d data
mv data/pretrain_sample/* data/
```
For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:
```
wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
unzip -nq data/crawl-300d-2M-subword.zip -d data/
```
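The extracted archive includes the vectors in fastText's plain-text `.vec` format: a header line with the vocabulary size and dimension, then one token per line followed by its vector components. A minimal loading sketch (the inline two-word sample stands in for the real, much larger file):

```python
# Sketch: reading word vectors in fastText's .vec text format
# (header "vocab_size dim", then "token v1 v2 ... v_dim" per line).
# The inline sample below is a stand-in for crawl-300d-2M-subword.vec.
import io

import numpy as np

sample = io.StringIO(
    "2 4\n"
    "hello 0.1 0.2 0.3 0.4\n"
    "world 0.5 0.6 0.7 0.8\n"
)

def load_vec(fh):
    """Parse a .vec file handle into a {token: vector} dict."""
    vocab_size, dim = map(int, fh.readline().split())
    vectors = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    # Sanity-check against the header
    assert len(vectors) == vocab_size
    assert len(next(iter(vectors.values()))) == dim
    return vectors

embeddings = load_vec(sample)
print(embeddings["hello"])
```

For the real file, pass `open("data/crawl-300d-2M-subword.vec", encoding="utf-8")` instead of the sample buffer.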
You can directly load LSTM and LSTM-SA with `torch.hub`:
```python
import torch

lstm = torch.hub.load("BruceWen120/medal", "lstm")
lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")
```
If you want to use the Electra model, you need to first install transformers:
```
pip install transformers
```
Then, you can load it with `torch.hub`:
```python
import torch

electra = torch.hub.load("BruceWen120/medal", "electra")
```
If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load it directly from the Hugging Face Repository:
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("xhlu/electra-medal")
tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
```
Download the BibTeX here, or copy the text below:
```bibtex
@inproceedings{wen-etal-2020-medal,
    title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
    author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
    booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
    pages = "130--135",
}
```
The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (`transformers`, `pytorch`, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.
The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:
INTRODUCTION
Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.
MEDLINE/PUBMED SPECIFIC TERMS
NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.
GENERAL TERMS AND CONDITIONS
Users of the data agree to:
- acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
- properly use registration and/or trademark symbols when referring to NLM products, and
- not indicate or imply that NLM has endorsed its products/services/applications.
Users who republish or redistribute the data (services, products or raw data) agree to:
- maintain the most current version of all distributed data, or
- make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.
NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.
NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Simple ReAct trajectories generated using Gemini in an LLM agent and the Kaggle environment.
See notebook for more details.
More details about each file are in the individual file descriptions.
This is a dataset from the U.S. Census Bureau hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found here, and they update their information according to the amount of data that is brought in. Explore the U.S. Census Bureau using Kaggle and all of the data sources available through the U.S. Census Bureau organization page!
This dataset is maintained using FRED's API and Kaggle's API.
This dataset was created by mengkoding 47
https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
The primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.
This is a dataset hosted by the World Bank. The organization has an open data platform found here, and they update their information according to the amount of data that is brought in. Explore the World Bank using Kaggle and all of the data sources available through the World Bank organization page!
This dataset is maintained using the World Bank's APIs and Kaggle's API.
Cover photo by Alex Block on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Distributed, microservices-based applications are typically accessed via APIs. These APIs are used either by apps or accessed directly by programmatic means. API access is often abused by attackers trying to exploit the business logic exposed by these APIs, and the way normal users access these APIs differs from the way attackers do. Many applications have hundreds of APIs that are called in a specific order, and depending on factors such as browser refreshes, session refreshes, network errors, or programmatic access, these behaviors are not static and can vary for the same user. API calls in long-running sessions form access graphs that need to be analysed in order to discover attack patterns and anomalies. Graphs don't lend themselves to numerical computation. We address this issue and provide a dataset where user access behavior is quantified as numerical features. In addition, we provide a dataset with the raw API call graphs. To support the use of these datasets, two notebooks on classification, node embeddings, and clustering are also provided.
There are 4 files provided. Two files are in CSV format and two files are in JSON format. The files in CSV format are user behavior graphs represented as behavior metrics. The JSON files are the actual API call graphs. The two datasets can be joined on a key so that those who want to combine graphs with metrics could do so in novel ways.
This data set captures API access patterns in terms of behavior metrics. Behaviors are captured by tracking users' API call graphs which are then summarized in terms of metrics. In some sense a categorical sequence of entities has been reduced to numerical metrics.
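As an illustration of how a call graph can be reduced to numerical metrics, here is a hedged sketch; the metric names below are hypothetical, not the actual columns of the dataset:

```python
# Hypothetical sketch: summarizing an API call graph (edge list) as
# numerical behavior metrics. Metric names are illustrative only.
from collections import defaultdict

def behavior_metrics(edges):
    """edges: list of (source_api, target_api) calls observed in a session."""
    out_degree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        out_degree[src] += 1
        nodes.update((src, dst))
    n_nodes, n_edges = len(nodes), len(edges)
    return {
        "num_unique_apis": n_nodes,
        "num_calls": n_edges,
        "avg_out_degree": n_edges / n_nodes if n_nodes else 0.0,
        "max_out_degree": max(out_degree.values(), default=0),
    }

session = [("login", "profile"), ("profile", "orders"),
           ("orders", "orders"), ("orders", "checkout")]
print(behavior_metrics(session))
```

Once a session is summarized this way, standard tabular classifiers and clustering methods apply directly.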
There are two files provided. One, called supervised_dataset.csv, has behaviors labeled as normal or outlier. The second file, called remaining_behavior_ext.csv, has a larger number of samples that are not labeled but carries additional insights as well as a classification created by another algorithm.
Each row is one instance of an observed behavior that has been manually classified as normal or outlier.
There are two JSON files provided, corresponding to the two CSV files. Each item has an _id field that can be used to join against the CSV datasets, followed by the API behavior graph represented as a list of edges.
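A minimal sketch of joining the JSON graphs with the CSV metrics on the _id key; the inline records and column names other than _id are made-up stand-ins, since the actual schema is not spelled out here:

```python
# Sketch: joining JSON call graphs with CSV behavior metrics on _id.
# The inline records are made-up stand-ins for the real files
# (supervised_dataset.csv and its companion JSON graph file).
import pandas as pd

# Stand-in for rows read from the CSV metrics file
metrics = pd.DataFrame([
    {"_id": "u1", "num_calls": 42, "classification": "normal"},
    {"_id": "u2", "num_calls": 913, "classification": "outlier"},
])

# Stand-in for items read from the JSON file: _id plus an edge list
graphs = [
    {"_id": "u1", "edges": [["login", "profile"], ["profile", "orders"]]},
    {"_id": "u2", "edges": [["login", "export"], ["export", "export"]]},
]

joined = metrics.merge(pd.DataFrame(graphs), on="_id", how="inner")
print(joined[["_id", "classification", "edges"]])
```

For the real data you would read the CSV with `pd.read_csv` and the JSON with `json.load`, then merge exactly as above.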
The prediction target is the classification label, which has a skewed distribution of normal and abnormal cases, with very few labeled samples available. Use supervised_dataset.csv for supervised approaches and remaining_behavior_ext.csv for unsupervised or semi-supervised approaches.
For further information, please refer to the Board of Governors of the Federal Reserve System's G.20 release, online at http://www.federalreserve.gov/releases/g20/ .
This is a dataset from the Federal Reserve hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found here and they update their information according to the frequency that the data updates. Explore the Federal Reserve using Kaggle and all of the data sources available through the Federal Reserve organization page!
Update Frequency: This dataset is updated daily.
Observation Start: 1943-01-01
Observation End: 2019-10-01
This dataset is maintained using FRED's API and Kaggle's API.
Cover photo by Patrick Fore on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
For further information, please refer to the Board of Governors of the Federal Reserve System's H.8 release, online at http://www.federalreserve.gov/releases/h8/.
This is a dataset from the Federal Reserve hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found here and they update their information according to the frequency that the data updates. Explore the Federal Reserve using Kaggle and all of the data sources available through the Federal Reserve organization page!
Update Frequency: This dataset is updated daily.
Observation Start: 1947-01-01
Observation End: 2019-11-01
This dataset is maintained using FRED's API and Kaggle's API.
Cover photo by timJ on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
This series is constructed as Advance Retail and Food Services Sales (https://fred.stlouisfed.org/series/RSAFS) deflated using the Consumer Price Index for All Urban Consumers (1982-84=100) (https://fred.stlouisfed.org/series/CPIAUCSL).
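The deflation described above can be sketched as follows; the series values are illustrative placeholders, not actual FRED observations:

```python
# Sketch: deflating nominal retail sales (RSAFS) by the CPI (CPIAUCSL,
# index 1982-84=100) to obtain a real, inflation-adjusted series.
# Values below are illustrative, not actual FRED data.
import pandas as pd

data = pd.DataFrame({
    "rsafs": [450_000, 455_000, 460_000],   # nominal sales, millions of $
    "cpiaucsl": [255.0, 256.5, 258.0],      # CPI, index 1982-84=100
}, index=pd.to_datetime(["2019-08-01", "2019-09-01", "2019-10-01"]))

# Real sales in 1982-84 dollars: nominal / (CPI / 100)
data["real_sales"] = data["rsafs"] / (data["cpiaucsl"] / 100)
print(data["real_sales"].round(1))
```

Dividing by CPI/100 re-expresses each month's sales in constant 1982-84 dollars, which is what "deflated" means in the series description.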
This is a dataset from the Federal Reserve Bank of St. Louis hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found here and they update their information according to the frequency that the data updates. Explore the Federal Reserve Bank of St. Louis using Kaggle and all of the data sources available through the St. Louis Fed organization page!
Update Frequency: This dataset is updated daily.
Observation Start: 1992-01-01
Observation End: 2019-10-01
This dataset is maintained using FRED's API and Kaggle's API.
Cover photo by Ive Erhard on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
https://creativecommons.org/publicdomain/zero/1.0/
GovData360 is a compendium of the most important governance indicators, drawn from 26 datasets with worldwide coverage and more than 10 years of information, designed to provide guidance on the design of reforms and the monitoring of their impacts. The data form an unbalanced panel (by dataset and country) covering around 3,260 governance-focused indicators.
This is a dataset hosted by the World Bank. The organization has an open data platform found here, and they update their information according to the amount of data that is brought in. Explore the World Bank using Kaggle and all of the data sources available through the World Bank organization page!
This dataset is maintained using the World Bank's APIs and Kaggle's API.
Cover photo by John Jason on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
https://creativecommons.org/publicdomain/zero/1.0/
A significant amount of software is available in Kaggle's Python notebook environment. I had hoped to find a reference somewhere listing which Python packages were available and what each one did.
When I didn't find what I was looking for, I decided to build this dataset instead.
This dataset was assembled in four steps:
More details about each file are in the individual file descriptions.
This is a dataset from the Federal Reserve hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found here and they update their information according to the frequency that the data updates. Explore the Federal Reserve using Kaggle and all of the data sources available through the Federal Reserve organization page!
This dataset is maintained using FRED's API and Kaggle's API.
Cover photo by Copper and Wild on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
This dataset was created by Suman Das
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains realistic traces of topics released by Chrome's Topics API across 4 weeks. Traces are simulated for 10 million fake users to match differentially private statistics computed on real browsing behavior. Full details of the dataset generation can be found in [1]. The code that generated the dataset can be found here.
[1] Travis Dick et al., "Differentially Private Synthetic Data Release for Topics API Outputs," Proceedings of KDD 2025, Toronto, Canada.
More details about each file are in the individual file descriptions.
This is a dataset from the Federal Reserve hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found here and they update their information according to the frequency that the data updates. Explore the Federal Reserve using Kaggle and all of the data sources available through the Federal Reserve organization page!
This dataset is maintained using FRED's API and Kaggle's API.