The dataset was originally obtained from TikTok's trending API by a GitHub user named Ivan Tran. It contains metadata on engagement with user-created videos, along with user profile data. The original create time is a Unix timestamp extracted directly from the video ID. TikTok's API has since become much more difficult to access, so more current data is harder to obtain. The hashtags column contains lists.
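As a rough illustration of working with these two fields, the sketch below converts a Unix timestamp to a UTC datetime and parses a hashtag cell. The column names `create_time` and `hashtags`, the sample values, and the assumption that lists are stored as Python-style list literals are all hypothetical; adjust them to the actual file.

```python
import ast
from datetime import datetime, timezone


def parse_create_time(unix_seconds):
    """Convert a Unix timestamp (seconds since 1970-01-01 UTC) to an aware datetime."""
    return datetime.fromtimestamp(int(unix_seconds), tz=timezone.utc)


def parse_hashtags(cell):
    """Parse a hashtag cell assumed to hold a list literal, e.g. "['fyp', 'dance']"."""
    if not cell:
        return []
    value = ast.literal_eval(cell)
    return list(value) if isinstance(value, (list, tuple)) else [str(value)]


# Hypothetical row, as it might come out of csv.DictReader.
row = {"create_time": "1600000000", "hashtags": "['fyp', 'dance']"}
created = parse_create_time(row["create_time"])
tags = parse_hashtags(row["hashtags"])
print(created.isoformat(), tags)
```

`ast.literal_eval` is used rather than `eval` so that a cell can only ever be read as a literal, never executed as code.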
Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.
💻 Code 🤗 Dataset (Hugging Face) 💾 Dataset (Kaggle) 💽 Dataset (Zenodo) 📜 Paper (ACL) 📝 Paper (Arxiv) ⚡ Pre-trained ELECTRA (Hugging Face)
We recommend downloading from Kaggle if you can authenticate through their API. The advantage of Kaggle is that the data is compressed, so it downloads faster. Links to the data can be found at the top of the readme.
First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:
pip install kaggle
Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:
kaggle datasets download xhlulu/medal-emnlp
Now, unzip everything and place the files inside the data directory:
unzip -nq medal-emnlp.zip -d data
mv data/pretrain_sample/* data/
For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:
wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
unzip -nq data/crawl-300d-2M-subword.zip -d data/
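Once extracted, the text-format vectors can be read line by line. The sketch below is a minimal reader for the fastText `.vec` text layout (a `<count> <dim>` header followed by one word and its values per line). It is not the loader used by this repository, and it runs on a tiny in-memory sample rather than the real file.

```python
import io


def load_vec(handle, limit=None):
    """Read fastText text-format vectors into a dict of word -> list of floats."""
    # The header holds the vocabulary size and the vector dimension.
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
    return dim, vectors


# Tiny stand-in for the real file (hypothetical 3-dimensional vectors).
sample = io.StringIO("2 3\nheart 0.1 0.2 0.3\nlung 0.4 0.5 0.6\n")
dim, vectors = load_vec(sample)
print(dim, sorted(vectors))
```

For the real embeddings, pass an open file handle instead of the sample. The `limit` argument caps how many lines are read, which helps because the full vocabulary is large.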
You can directly load LSTM and LSTM-SA with torch.hub:
```python
import torch

lstm = torch.hub.load("BruceWen120/medal", "lstm")
lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")
```
If you want to use the Electra model, you need to first install transformers:
pip install transformers
Then, you can load it with torch.hub:
import torch
electra = torch.hub.load("BruceWen120/medal", "electra")
If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load them directly from the Hugging Face Repository:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("xhlu/electra-medal")
tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
Download the bibtex here, or copy the text below:
@inproceedings{wen-etal-2020-medal,
title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
pages = "130--135",
}
The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, pytorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.
The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:
INTRODUCTION
Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.
MEDLINE/PUBMED SPECIFIC TERMS
NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.
GENERAL TERMS AND CONDITIONS
Users of the data agree to:
- acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
- properly use registration and/or trademark symbols when referring to NLM products, and
- not indicate or imply that NLM has endorsed its products/services/applications.
Users who republish or redistribute the data (services, products or raw data) agree to:
- maintain the most current version of all distributed data, or
- make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.
NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.
NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.
More details about each file are in the individual file descriptions.
This is a dataset from the U.S. Census Bureau hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found here and updates its information according to the amount of data that is brought in. Explore the U.S. Census Bureau using Kaggle and all of the data sources available through the U.S. Census Bureau organization page!
This dataset is maintained using FRED's API and Kaggle's API.
https://creativecommons.org/publicdomain/zero/1.0/
This folder contains data behind the story Every Guest Jon Stewart Ever Had On ‘The Daily Show’.
Header | Definition |
---|---|
YEAR | The year the episode aired |
GoogleKnowlege_Occupation | Their occupation or office, according to Google's Knowledge Graph or, if they're not in there, how Stewart introduced them on the program. |
Show | Air date of episode. Not unique, as some shows had more than one guest |
Group | A larger group designation for the occupation. For instance, U.S. senators, U.S. presidents, and former presidents are all under "politicians" |
Raw_Guest_List | The person or list of people who appeared on the show, according to Wikipedia. The GoogleKnowlege_Occupation only refers to one of them in a given row. |
Source: Google Knowledge Graph, The Daily Show clip library, Wikipedia.
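To show how the columns above might be used, the sketch below counts rows per Group with the standard library. The sample rows are invented for illustration; only the header names come from the table.

```python
import csv
import io
from collections import Counter


def guests_per_group(handle):
    """Count rows for each value of the Group column."""
    reader = csv.DictReader(handle)
    return Counter(row["Group"] for row in reader)


# Invented sample rows using the headers from the table above.
sample = io.StringIO(
    "YEAR,GoogleKnowlege_Occupation,Show,Group,Raw_Guest_List\n"
    "1999,actor,1/11/99,Acting,Guest A\n"
    "1999,us senator,1/12/99,Politician,Guest B\n"
    "2000,actor,2/1/00,Acting,Guest C\n"
)
counts = guests_per_group(sample)
print(counts.most_common())
```

Because Show is not unique, counts like these are per guest appearance rather than per episode.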
This is a dataset from FiveThirtyEight hosted on their GitHub. Explore FiveThirtyEight data using Kaggle and all of the data sources available through the FiveThirtyEight organization page!
This dataset is maintained using GitHub's API and Kaggle's API.
This dataset is distributed under the Attribution 4.0 International (CC BY 4.0) license.
Cover photo by Oscar Nord on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
I am writing articles on League of Legends and Machine Learning. You can find the full repository where this information is stored here.
This folder contains data behind the story A Statistical Analysis of the Work of Bob Ross.
Cover photo by Alex Kotomanov on Unsplash
This folder contains data behind the stories:
* The Save Ruined Relief Pitching. The Goose Egg Can Fix It
* Kenley Jansen Is The Model Of A Modern Reliever
Header | Definition |
---|---|
name | Pitcher name |
year | Start year of season |
team | Retrosheet team code |
league | NL or AL |
goose_eggs | Goose eggs |
broken_eggs | Broken eggs |
mehs | Mehs |
league_average_gpct | League-average goose percentage |
ppf | Pitcher park factor |
replacement_gpct | Replacement-level goose percentage |
gwar | Goose Wins Above Replacement |
key_retro | Retrosheet unique player identifier |
Source: Retrosheet
This is a dataset from the Federal Reserve hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found here and they update their information according to the frequency that the data updates. Explore the Federal Reserve using Kaggle and all of the data sources available through the Federal Reserve organization page!
This dataset is maintained using FRED's API and Kaggle's API.
Cover photo by Ben Waardenburg on Unsplash
This folder contains data behind the story Obama Granted Clemency Unlike Any Other President In History.
The data in obama_commutations.csv is copied from the Justice Department website. The Python script parses it by looking at the first column to figure out what is contained in the second column.
Source: Department of Justice
https://www.worldbank.org/en/about/legal/terms-of-use-for-datasets
The primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.
This is a dataset hosted by the World Bank. The organization has an open data platform found here and updates its information according to the amount of data that is brought in. Explore the World Bank using Kaggle and all of the data sources available through the World Bank organization page!
This dataset is maintained using the World Bank's APIs and Kaggle's API.
Cover photo by Alex Block on Unsplash
This directory contains the data behind the story ‘Mad Men’ Is Ending. What’s Next For The Cast?
The primary file show-data.csv contains data on actors who appeared in at least half the episodes of television shows that were nominated for an Emmy for Outstanding Drama since the year 2000. It contains the following variables:
Header | Definition |
---|---|
Performer | The name of the actor, according to IMDb. This is not a unique identifier - two performers appeared in more than one program |
Show | The television show where this actor appeared in more than half the episodes |
Show Start | The year the television show began |
Show End | The year the television show ended, "PRESENT" if the show remains on the air as of May 10. |
Status? | Why the actor is no longer on the program: "END" if the show has concluded, "LEFT" if the show remains on the air. |
CharEnd | The year the character left the show. Equal to "Show End" if the performer stayed on until the final season. |
Years Since | 2015 minus CharEnd |
#LEAD | The number of leading roles in films the performer has appeared in since and including "CharEnd", according to OpusData |
#SUPPORT | The number of supporting roles in films the performer has appeared in since and including "CharEnd", according to OpusData |
#Shows | The number of seasons of television of which the performer appeared in at least half the episodes since and including "CharEnd", according to OpusData |
Score | #LEAD + #Shows + 0.25*(#SUPPORT) |
Score/Y | "Score" divided by "Years Since" |
lead_notes | The list of films counted in #LEAD |
support_notes | The list of films counted in #SUPPORT |
show_notes | The seasons of shows counted in #Shows |
The supplemental file performer-scores.csv is the consolidated data from show-data.csv made into a pivot table.
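The derived columns in the table can be reproduced with a few lines of arithmetic. The sketch below follows the definitions of "Years Since", "Score", and "Score/Y" given above. The function names and the example performer are hypothetical, and returning None when "Years Since" is zero is an assumption, since the source does not say how that case is handled.

```python
def years_since(char_end, reference_year=2015):
    """Years Since = 2015 minus CharEnd."""
    return reference_year - char_end


def score(lead, support, shows):
    """Score = #LEAD + #Shows + 0.25*(#SUPPORT)."""
    return lead + shows + 0.25 * support


def score_per_year(lead, support, shows, char_end):
    """Score/Y = Score divided by Years Since (None when Years Since is 0)."""
    years = years_since(char_end)
    if years == 0:
        return None  # assumption: avoid dividing by zero for 2015 departures
    return score(lead, support, shows) / years


# Hypothetical performer: 2 leads, 4 supporting roles, 1 season, left in 2010.
print(score(2, 4, 1))                 # 4.0
print(score_per_year(2, 4, 1, 2010))  # 0.8
```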
This folder contains the data behind the story Trump Might Be The First President To Scrap A National Monument.
This data was compiled by the National Parks Conservation Association and includes national monuments that were created by presidents under the Antiquities Act. It does not include national monuments created by Congress.
Header | Definition |
---|---|
current_name | Current name of piece of land designated under the Antiquities Act |
states | State(s) or territory where land is located |
original_name | If included, original name of piece of land designated under the Antiquities Act |
current_agency | Current land management agency. NPS = National Park Service, BLM = Bureau of Land Management, USFS = US Forest Service, FWS = US Fish and Wildlife Service, NOAA = National Oceanic and Atmospheric Administration |
action | Type of action taken on land |
date | Date of action |
year | Year of action |
pres_or_congress | President or Congress that issued action |
acres_affected | Acres affected by action. Note that total current acreage is not included. National monuments that cover ocean are listed in square miles. |
Sources: National Parks Conservation Association and National Park Service Archeology Program
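Because acres_affected mixes acres (land) and square miles (ocean), values need a common unit before they are summed or compared. The sketch below is one way to normalize them, using the fact that one square mile is 640 acres. The `unit` argument is hypothetical: the table describes no unit column, so which rows are in square miles has to be determined separately.

```python
ACRES_PER_SQUARE_MILE = 640  # 1 square mile = 640 acres


def to_acres(value, unit):
    """Normalize an area to acres; `unit` is "acres" or "square_miles"."""
    if unit == "acres":
        return float(value)
    if unit == "square_miles":
        return float(value) * ACRES_PER_SQUARE_MILE
    raise ValueError(f"unknown unit: {unit}")


# Hypothetical values: a land monument in acres, a marine one in square miles.
print(to_acres(1000, "acres"))
print(to_acres(10, "square_miles"))
```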
Cover photo by Nick Tiemeyer on Unsplash
Cover photo by Andrew Neel on Unsplash
This is a dataset from the Federal Reserve Bank of St. Louis hosted by the Federal Reserve Economic Database (FRED). FRED has a data platform found here and they update their information according to the frequency that the data updates. Explore the Federal Reserve Bank of St. Louis using Kaggle and all of the data sources available through the St. Louis Fed organization page!
This dataset is maintained using FRED's API and Kaggle's API.
Cover photo by Noah Silliman on Unsplash
Cover photo by Copper and Wild on Unsplash
Cover photo by Amruth Pillai on Unsplash
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Cover photo by Markus Spiske on Unsplash
Cover photo by Nathan Dumlao on Unsplash