Variables are included in the same order as they appear in Table 1, and coded as per the reported analyses. (CSV)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data in this fileset are those used in the PLOS ONE publication "A snapshot of translational research funded by the National Institutes of Health (NIH): A case study using behavioral and social science research awards and Clinical and Translational Science Awards funded publications." The article can be accessed here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0196545
Original bBSSR and BSSR-only award and publication data downloaded from NIH Reporter (https://projectreporter.nih.gov/reporter.cfm). Data files used by R scripts in publication:
* bBSSR award data (7 csv files in total): "bBSSR [2008-2014] NIH reporter 7 August 2017"
* bBSSR publication data (6 csv files in total): "Publications from NIH Reporter batch [1-6]"
* BSSR award data (10 csv files in total): "BSSR [2008-2014, part 1/2] NIH reporter 16/17 August 2017"
* BSSR publication data (13 csv files in total): "BSSR publications batch [1-13] 17 August 2017"
* "Foward citation data for analysis": list of forward citation PMIDs retrieved from Scopus for a select subset of bBSSR publications
* Citation data for bBSSR and BSSR-only publications retrieved from the NIH iCite database (https://icite.od.nih.gov/analysis): "NIH iCite batch [1-40]"
* "List of CTSA PMIDs from Surkis et al 2016": publications and associated PMIDs that are linked to CTSA awards; data retrieved from Surkis et al. 2016 (reference below)
Reference
Surkis, A., Hogle, J.A., DiazGranados, D., Hunt, J.D., Mazmanian, P.E., Connors, E., Westaby, K., Whipple, E.C., Adamus, T., Mueller, M., and Aphinyanaphongs, Y. 2016. "Classifying publications from the clinical and translational science award program along the translational research spectrum: a machine learning approach." Journal of Translational Medicine 14: 235. DOI: 10.1186/s12967-016-0992-8.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original data for A Bibliometric Comparison of NIH R01 and R21 Awards: A Case Study Using Basic Behavioral and Social Science Research Awards.
Abbreviations:
BSSR: behavioral and social science research
bBSSR: basic behavioral and social science research
BSSR-only: awards that are BSSR but not bBSSR
The following datasets are in this fileset:
1. OppNet data.csv: Award, publication, and forward citation data for OppNet awards.
2. NIH iCite results for OppNet publications.csv: forward citation data for OppNet publications; retrieved from the NIH iCite database (https://icite.od.nih.gov/analysis)
3. BSSR-only data.csv: BSSR-only award, publication, and related information
4. bBSSR data.csv: bBSSR award, publication, and related information
5. BSSR-only and bBSSR NIH iCite data.csv: forward citation data for both BSSR-only and bBSSR publications; retrieved from the NIH iCite database
6. Forward citation data for BSSR-only and bBSSR publications.csv: forward citation data, retrieved from Scopus, for BSSR-only and bBSSR publications
7. BSSR-only and bBSSR total costs.csv: NIH total cost (direct costs + indirect costs) for BSSR-only and bBSSR awards; data retrieved from NIH Reporter

Metadata (column name and description) for each dataset:

OppNet data.csv
Group: specifies that the data are for OppNet awards and publications
Type: funding type, either R01 or R21 grant award
Grant.number: NIH grant number for award
Project.start.date: NIH project start date for award (month/day/year)
PMID: PubMed identification number for publications associated with respective awards; NA indicates awards without publications
Pub.date: Date of publication retrieved from PubMed; dates designate time of electronic publication unless there was no electronic publication, in which case the date designates time of physical publication (month/day/year)
Fwd.PMID: PMID of forward citations for a particular publication; forward citations retrieved from Scopus database, limited to journal articles and publications; NA indicates no forward citations were retrieved from Scopus
Fwd.pub.date: Date of publication for a forward citation; dates retrieved from PubMed, electronic publication dates were used unless a publication had no electronic publication date (month/day/year)

NIH iCite results for OppNet publications.csv
PubMed.ID: PubMed identification number of OppNet publications (equivalent to PMID in the OppNet data.csv file)
Total.Citations: number of total forward citations for a given OppNet publication
Citations.per.Year: citations per full calendar year after publication (calculated by NIH iCite)
Expected.Citations.per.Year: number of expected citations per year
Field.Citation.Rate: average citations of the field's journals each year
Relative.Citation.Ratio: the RCR represents a citation-based measure of scientific influence of one or more articles. It is calculated as the cites/year of each paper, normalized to the citations per year received by NIH-funded papers in the same field and year.
NIH.Percentile: Percentile rank amongst NIH-funded publications
Year: Year of OppNet publication
Title: OppNet publication title
Authors: List of authors for OppNet publication
Journal: Journal in which the OppNet publication was published
Article: Indicates whether the publication was a journal article (yes or no)

BSSR-only data.csv & bBSSR data.csv
Column names and descriptions are the same for both files. BSSR-only data.csv contains only information related to BSSR-only awards, and bBSSR data.csv contains only information related to bBSSR awards.
Grant.number: NIH grant number for award
FOA: NIH Funding Opportunity Announcement number
Activity: funding type, either R01 or R21 grant award
Announcement.code: type of FOA: Program Announcement (PA); PAR (a PA with special receipt, referral, and/or review considerations); PAS (a PA that includes specific set-aside funds); RFA (request for application)
Announcement.type: either PA (those designated as PAR or PAS under Announcement.code were re-designated as PA) or RFA
Project.start.date: NIH project start date for award (month/day/year)
Project.end.date: NIH project end date for award (month/day/year)
PMID: PubMed identification number for publications associated with respective awards; NA indicates awards without publications
Pub.date: Date of publication retrieved from PubMed; dates designate time of electronic publication unless there was no electronic publication, in which case the date designates time of physical publication (month/day/year)
Accepted.date: Date that the publication of interest was accepted by the publishing journal; data retrieved from PubMed
Received.date: Date that the journal received the publication; data retrieved from PubMed
Translational: 1 indicates publications that are translational research; 0 indicates publications that are not translational research
Note: translational research publications are those that had one or more of the following publication types, as indicated through PubMed: clinical trial; controlled clinical trial; clinical trial, phase I; clinical trial, phase II; clinical trial, phase III; clinical trial, phase IV; pragmatic clinical trial; randomized controlled trial; or observational study
Total.grants: Total number of grants acknowledged by the publication

BSSR-only and bBSSR NIH iCite data.csv
Column names and descriptions are the same as those in NIH iCite results for OppNet publications.csv. Data are for BSSR-only and bBSSR publications.

Forward citation data for BSSR-only and bBSSR publications.csv
Type: Funding mechanism, either R01 or R21
Grant.number: NIH grant number
PMID: PubMed identification number for publications associated with respective awards; NA indicates awards without publications
Fwd.PMID: PMID of forward citation for a particular publication; forward citations retrieved from Scopus database, limited to journal articles and publications; NA indicates no forward citations were retrieved from Scopus
Fwd.pub.date: Date of publication for a forward citation; dates retrieved from PubMed, electronic publication dates were used unless a publication had no electronic publication date (month/day/year)
Fwd.accepted.date: Date of forward citation acceptance by journal (month/day/year)
Fwd.received.date: Date journal received submission from forward citation publication (month/day/year)
Fwd.translational: 1 indicates that the forward citation was translational research; 0 indicates that the forward citation was not translational research
Note: translational research was defined the same as above

BSSR-only and bBSSR total costs.csv
Grant.number: NIH grant number
Total.costs: Total NIH costs (direct costs + indirect costs) for an award (in USD)

Contact Xueying Han (xhan@ida.org) if you have any questions.
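As an illustration of how these files fit together, an analysis might join award records with iCite citation metrics on the PubMed ID. The sketch below uses the file and column names described above, but the sample values are invented for demonstration:

```python
import csv
import io

# Inline stand-ins for "bBSSR data.csv" and
# "BSSR-only and bBSSR NIH iCite data.csv" (values are made up).
awards_csv = """Grant.number,Activity,PMID
R01-000001,R01,11111
R21-000002,R21,22222
R01-000003,R01,NA
"""

icite_csv = """PubMed.ID,Total.Citations,Relative.Citation.Ratio
11111,40,2.1
22222,5,0.8
"""

def merge_awards_with_icite(awards_text, icite_text):
    """Join award rows to iCite metrics on PMID == PubMed.ID."""
    icite = {row["PubMed.ID"]: row
             for row in csv.DictReader(io.StringIO(icite_text))}
    merged = []
    for row in csv.DictReader(io.StringIO(awards_text)):
        pmid = row["PMID"]
        if pmid == "NA":  # NA marks awards without publications
            continue
        merged.append({**row, **icite.get(pmid, {})})
    return merged

rows = merge_awards_with_icite(awards_csv, icite_csv)
print(len(rows))                              # 2 awards with publications
print(rows[0]["Relative.Citation.Ratio"])     # 2.1
```

With the real files, the same join would be done after reading each CSV from disk.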
A CSV file of our study database, which we used for the analyses in this manuscript.
The raw CSV file of the dataset used for analysis in this study.
Sister Study is a prospective cohort of 50,884 U.S. women aged 35 to 74 years conducted by the NIEHS. Eligible participants were women without a history of breast cancer but with at least one sister diagnosed with breast cancer; enrollment took place during 2003-2009. Datasets used in this research effort include health outcomes, lifestyle factors, socioeconomic factors, medication history, and built and natural environment factors. This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: contact NIEHS Sister Study (https://sisterstudy.niehs.nih.gov/English/index1.htm) for data access. Format: Datasets are provided in SAS and/or CSV format.
The purest type of electronic clinical data is that obtained at the point of care at a medical facility, hospital, clinic, or practice. Often referred to as the electronic medical record (EMR), the EMR is generally not available to outside researchers. The data collected include administrative and demographic information, diagnoses, treatments, prescription drugs, laboratory tests, physiologic monitoring data, hospitalizations, patient insurance, and more.
Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network, provide mediated or collaborative access to clinical data repositories by eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.
About Dataset:
333 scholarly articles cite this dataset.
Unique identifier: DOI
Dataset updated: 2023
Authors: Haoyang Mi
This fileset contains two datasets:
1. Clinical Data_Discovery_Cohort. Columns: Patient ID, Specimen date, Dead or Alive, Date of Death, Date of last Follow, Sex, Race, Stage, Event, Time
2. Clinical_Data_Validation_Cohort. Columns: Patient ID, Survival time (days), Event, Tumor size, Grade, Stage, Age, Sex, Cigarette Pack per year, Type, Adjuvant, Batch, EGFR, KRAS
Feel free to share your thoughts and analysis in a notebook for these datasets; you can build some interesting and valuable ML projects from this case. Thanks for your attention.
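As a first pass over rows shaped like the validation cohort, one might compute the observed-event rate and the median survival time among observed events. The column names come from the listing above; the values below are invented for illustration:

```python
from statistics import median

# Toy rows shaped like Clinical_Data_Validation_Cohort (values are made up).
rows = [
    {"Patient ID": "P1", "Survival time (days)": 320, "Event": 1, "Stage": "II"},
    {"Patient ID": "P2", "Survival time (days)": 845, "Event": 0, "Stage": "I"},
    {"Patient ID": "P3", "Survival time (days)": 120, "Event": 1, "Stage": "III"},
    {"Patient ID": "P4", "Survival time (days)": 600, "Event": 1, "Stage": "II"},
]

# Event = 1 is assumed to encode an observed event; Event = 0 a censored record.
observed = [r["Survival time (days)"] for r in rows if r["Event"] == 1]
event_rate = len(observed) / len(rows)

print(event_rate)        # 0.75
print(median(observed))  # 320
```

For a proper survival analysis, censored records would of course be kept (e.g. via a Kaplan-Meier estimator) rather than dropped as in this toy summary.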
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is a database snapshot of the iCite web service (provided here as a single zipped CSV file, or compressed, tarred JSON files). In addition, citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip. iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:
Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.
Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles.
Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection.
Definitions for individual data fields:
pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine
doi: Digital Object Identifier, if available
year: Year the article was published
title: Title of the article
authors: List of author names
journal: Journal name (ISO abbreviation)
is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with those of a primary research article
relative_citation_ratio: Relative Citation Ratio (RCR), OPA's metric of scientific influence. Field-adjusted, time-adjusted, and benchmarked against NIH-funded papers. The median RCR for NIH-funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year as the median NIH-funded paper in its field and year, while an RCR of 0.5 means that it is receiving half as many. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.
provisional: RCRs for papers published in the previous two years are flagged as "provisional," reflecting that citation metrics for newer articles are not necessarily as stable as they are for older articles. Provisional RCRs are provided for papers published the previous year if they have received 5 or more citations, despite being, in many cases, less than a year old. All papers published the year before the previous year receive provisional RCRs. The current year is considered to be the NIH Fiscal Year, which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 or more citations, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 or more citations, and all papers from 2018 receive provisional RCRs.
citation_count: Number of unique articles that have cited this one
citations_per_year: Citations per year that this article has received since its publication. If this appeared as a preprint and a published article, the year of the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.
field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network
expected_citations_per_year: Citations per year that NIH-funded articles with the same Field Citation Rate, published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.
nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH-funded publications.
human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)
animal: Fraction of MeSH terms that are in the Animal category (same denominator)
molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (same denominator)
x_coord: X coordinate of the article on the Triangle of Biomedicine
y_coord: Y coordinate of the article on the Triangle of Biomedicine
is_clinical: Flag indicating that this paper meets the definition of a clinical article
cited_by_clin: PMIDs of clinical articles that this article has been cited by
apt: Approximate Potential to Translate, a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.
cited_by: PMIDs of articles that have cited this one
references: PMIDs of articles in this article's reference list
Large CSV files are zipped using zip version 4.5, which is more recent than the default unzip command line utility in some common Linux distributions. These files can be unzipped with tools that support version 4.5 or later, such as 7zip.
Comments and questions can be addressed to iCite@mail.nih.gov.
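The provisional-RCR rule above can be encoded as a small function. This is a sketch based solely on the description, not code from iCite itself:

```python
def nih_fiscal_year(year: int, month: int) -> int:
    """NIH fiscal year: starts in October of the preceding calendar year."""
    return year + 1 if month >= 10 else year

def rcr_is_provisional(pub_year: int, citation_count: int,
                       today_year: int, today_month: int) -> bool:
    """Papers from the previous (fiscal) year get a provisional RCR only
    once they have 5 or more citations; all papers from the year before
    that get provisional RCRs."""
    fy = nih_fiscal_year(today_year, today_month)
    if pub_year == fy - 1:
        return citation_count >= 5
    return pub_year == fy - 2

# The worked example from the text: July 2019 is NIH Fiscal Year 2019.
print(rcr_is_provisional(2018, 5, 2019, 7))   # True  (2018 paper, >= 5 citations)
print(rcr_is_provisional(2018, 2, 2019, 7))   # False (2018 paper, < 5 citations)
print(rcr_is_provisional(2017, 0, 2019, 7))   # True  (all 2017 papers)
print(rcr_is_provisional(2018, 2, 2019, 10))  # True  (October: now FY2020)
```

Papers older than two fiscal years would carry a regular, non-provisional RCR, which this sketch reports as False.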
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets were used to validate and test the data pipeline deployment following the RADON approach. The dataset has a single CSV file containing around 32,000 Twitter tweets. 100 CSV files were created from that single CSV file, each containing 320 tweets. Those 100 CSV files are used to validate and test (performance/load testing) the data pipeline components.
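The splitting step described above can be sketched as follows. This is a minimal in-memory illustration, not the actual RADON pipeline tooling, and the chunk size of 320 comes from the description above:

```python
# Split a list of rows (e.g. tweets read from one CSV) into fixed-size
# chunks; with 32,000 rows and 320 rows per chunk, this yields 100 chunks,
# each of which would be written out as its own CSV file.
def split_rows(rows, chunk_size=320):
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

tweets = [{"id": i, "text": f"tweet {i}"} for i in range(32000)]
chunks = split_rows(tweets, 320)

print(len(chunks))     # 100
print(len(chunks[0]))  # 320
```

Writing each chunk to disk is then a loop of `csv.DictWriter` calls, one file per chunk.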
Linking registered clinical trials with their published results continues to be a challenge. A variety of natural language processing (NLP)-based and machine learning-based models have been developed to assist users in identifying these connections. Articles from the PubMed Central full-text collection were scanned for mentions of ClinicalTrials.gov and international clinical trial registry identifiers. We analyzed the distribution of trial registry numbers within sections of the articles and characterized their publication type indexing and other metrics. Three supporting files are included herein: a PDF containing supplementary figures pertaining to the distribution of registry numbers found within the full text of articles, a CSV dataset providing the registry numbers discovered and the corresponding XML path location within the document, and an example Python script to locate registry identifiers within an XML article document. It should be noted that the purpose of this study is to...
These datasets and files are the results of scanning 6,901,686 XML documents within the PubMed Central Open Access article datasets available at: https://ftp.ncbi.nlm.nih.gov/pub/pmc/ Each registry identifier match is represented by a row in the xmlScanOutput.csv file, along with PubMed identifiers, file information, XML path information, and several computed columns, including a validation that an NCT number exists within ClinicalTrials.gov, a generalized article section, and publication types from multiple indexing sources. Summaries within the Distribution_of_Trial_Registry_Numbers_Additional_File.pdf were generated by counting distinct PMID values within the CSV file across various groups.
# Distribution of trial registry numbers within full-text PubMed Central - full dataset of discovered links
https://doi.org/10.5061/dryad.dbrv15fb1
This data set contains a table with every combination of publication ID, registry number, XML path, and section of the publication discovered in the Full-Text scanning of PubMed Central articles.
This document contains charts and summaries of the trial registry numbers found from the XML document scanning process. The explicit criteria for locating registry identifiers and designating article sections are provided in this document and may be useful for further research and refinement.
This zip archive contains a comma-separated file named "xmlScanOutput.csv" that contains all rows of registry numbers and art...
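The example script shipped with the dataset is not reproduced here. As an illustration, a minimal sketch of locating ClinicalTrials.gov identifiers (NCT followed by 8 digits) in an article's XML might look like this; the dataset's actual matching criteria and path reporting may differ:

```python
import re
import xml.etree.ElementTree as ET

# NCT registry numbers are "NCT" followed by 8 digits.
NCT_PATTERN = re.compile(r"NCT\d{8}")

xml_doc = """<article>
  <abstract><p>Registered as NCT01234567.</p></abstract>
  <body><sec><p>Results of NCT01234567 and NCT76543210 are reported.</p></sec></body>
</article>"""

def find_registry_ids(xml_text):
    """Return (element tag, registry id) pairs for every NCT match."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter():
        for text in filter(None, (elem.text, elem.tail)):
            for match in NCT_PATTERN.findall(text):
                hits.append((elem.tag, match))
    return hits

ids = find_registry_ids(xml_doc)
print(sorted({m for _, m in ids}))  # ['NCT01234567', 'NCT76543210']
```

A full scanner would also record the XML path to each matching element, as the xmlScanOutput.csv columns suggest.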
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The free database mapping COVID-19 treatment and vaccine development based on global scientific research is available at https://covid19-help.org/.
Files provided here are curated partial data exports in the form of .csv files, or a full data export as a .sql script generated with pg_dump from our PostgreSQL 12 database. You can also find a .png file with the ER diagram of the tables in the .sql file in this repository.
Structure of CSV files
*On our site, compounds are referred to as substances
compounds.csv
Id - Unique identifier in our database (unsigned integer)
Name - Name of the Substance/Compound (string)
Marketed name - The marketed name of the Substance/Compound (string)
Synonyms - Known synonyms (string)
Description - Description (HTML code)
Dietary sources - Dietary sources where the Substance/Compound can be found (string)
Dietary sources URL - Dietary sources URL (string)
Formula - Compound formula (HTML code)
Structure image URL - URL to our website with the structure image (string)
Status - Status of approval (string)
Therapeutic approach - Approach in which Substance/Compound works (string)
Drug status - Availability of Substance/Compound (string)
Additional data - Additional data in stringified JSON format with data as prescribing information and note (string)
General information - General information about Substance/Compound (HTML code)
references.csv
Id - Unique identifier in our database (unsigned integer)
Impact factor - Impact factor of the scientific article (string)
Source title - Title of the scientific article (string)
Source URL - URL link of the scientific article (string)
Tested on species - What testing model was used for the study (string)
Published at - Date of publication of the scientific article (Date in ISO 8601 format)
clinical-trials.csv
Id - Unique identifier in our database (unsigned integer)
Title - Title of the clinical trial study (string)
Acronym title - Acronym of title of the clinical trial study (string)
Source id - Unique identifier in the source database
Source id optional - Optional identifier in other databases (string)
Interventions - Description of interventions (string)
Study type - Type of the conducted study (string)
Study results - Whether the study has results (string)
Phase - Current phase of the clinical trial (string)
Url - URL to clinical trial study page on clinicaltrials.gov (string)
Status - Status in which study currently is (string)
Start date - Date at which study was started (Date in ISO 8601 format)
Completion date - Date at which study was completed (Date in ISO 8601 format)
Additional data - Additional data in the form of stringified JSON with data as locations of study, study design, enrollment, age, outcome measures (string)
compound-reference-relations.csv
Reference id - Id of a reference in our DB (unsigned integer)
Compound id - Id of a substance in our DB (unsigned integer)
Note - Note on the compound-reference relation (string)
Is supporting - Is evidence supporting or contradictory (Boolean, true if supporting)
compound-clinical-trial.csv
Clinical trial id - Id of a clinical trial in our DB (unsigned integer)
Compound id - Id of a Substance/Compound in our DB (unsigned integer)
tags.csv
Id - Unique identifier in our database (unsigned integer)
Name - Name of the tag (string)
tags-entities.csv
Tag id - Id of a tag in our DB (unsigned integer)
Reference id - Id of a reference in our DB (unsigned integer)
API Specification
Our project also has an Open API that gives you access to our data in a format suitable for processing, particularly in JSON format.
https://covid19-help.org/api-specification
Services are split into five endpoints:
Substances - /api/substances
References - /api/references
Substance-reference relations - /api/substance-reference-relations
Clinical trials - /api/clinical-trials
Clinical trials-substances relations - /api/clinical-trials-substances
Method of providing data
All dates are text strings formatted in compliance with ISO 8601 as YYYY-MM-DD
If the syntax request is incorrect (missing or incorrectly formatted parameters) an HTTP 400 Bad Request response will be returned. The body of the response may include an explanation.
Data updated_at (used for querying changed-from) refers only to a particular entity and not its logical relations. Example: If a new substance reference relation is added, but the substance detail has not changed, this is reflected in the substance reference relation endpoint where a new entity with id and current dates in created_at and updated_at fields will be added, but in substances or references endpoint nothing has changed.
The recommended way of sequential download
During the first download, you can obtain all data by entering a sufficiently old date as the changed-from parameter value, for example: changed-from=2020-01-01. It is important to write down the date on which the data retrieval was initiated, let's say 2020-10-20.
For repeated data downloads, it is sufficient to retrieve only the records in which something has changed. You can therefore request with the parameter changed-from=2020-10-20 (the date from the previous bullet). Again, it is important to write down the date when the updates were downloaded (e.g. 2020-10-20). This date will be used in the next update (refresh) of the data.
Services for entities
List of endpoint URLs:
Format of the request
All endpoints have these parameters in common:
changed-from - a parameter to return only the entities that have been modified on a given date or later.
continue-after-id - a parameter to return only the entities that have a larger ID than specified in the parameter.
limit - a parameter to return only the number of records specified (up to 1000). The preset number is 100.
Request example:
/api/references?changed-from=2020-01-01&continue-after-id=1&limit=100
Format of the response
The response format is the same for all endpoints.
number_of_remaining_ids - the number of remaining entities that meet the specified criteria but are not displayed on the page. An integer of virtually unlimited size.
entities - an array of entity details in JSON format.
Response example:
{
  "number_of_remaining_ids": 100,
  "entities": [
    {
      "id": 3,
      "url": "https://www.ncbi.nlm.nih.gov/pubmed/32147628",
      "title": "Discovering drugs to treat coronavirus disease 2019 (COVID-19).",
      "impact_factor": "Discovering drugs to treat coronavirus disease 2019 (COVID-19).",
      "tested_on_species": "in silico",
      "publication_date": "2020-02-22",
      "created_at": "2020-03-30",
      "updated_at": "2020-03-31",
      "deleted_at": null
    },
    {
      "id": 4,
      "url": "https://www.ncbi.nlm.nih.gov/pubmed/32157862",
      "title": "CT Manifestations of Novel Coronavirus Pneumonia: A Case Report",
      "impact_factor": "CT Manifestations of Novel Coronavirus Pneumonia: A Case Report",
      "tested_on_species": "Patient",
      "publication_date": "2020-06-03",
      "created_at": "2020-03-30",
      "updated_at": "2020-03-30",
      "deleted_at": null
    }
  ]
}
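Putting the request and response formats together, a client might page through an endpoint as follows. This is a sketch under the conventions documented above: build_url only assembles the documented query string, and the network call is injectable so the paging logic can be exercised without hitting the API:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://covid19-help.org/api"

def build_url(endpoint, changed_from, continue_after_id=0, limit=100):
    """Assemble the documented query string for an endpoint."""
    query = urlencode({
        "changed-from": changed_from,
        "continue-after-id": continue_after_id,
        "limit": limit,
    })
    return f"{BASE}/{endpoint}?{query}"

def fetch_all(endpoint, changed_from,
              fetch=lambda url: json.load(urlopen(url))):
    """Page through an endpoint until number_of_remaining_ids reaches 0,
    advancing continue-after-id past the last entity id seen."""
    entities, last_id = [], 0
    while True:
        page = fetch(build_url(endpoint, changed_from, last_id))
        entities.extend(page["entities"])
        if page["number_of_remaining_ids"] == 0 or not page["entities"]:
            break
        last_id = page["entities"][-1]["id"]
    return entities

# Reproduces the documented request example:
print(build_url("references", "2020-01-01", 1, 100))
```

On a first run one would pass a sufficiently old changed-from date (e.g. 2020-01-01); on later runs, the date recorded from the previous download.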
Endpoint details
Substances
URL: /api/substances
Metadata supporting Wallis et al. 2024 in Environment International. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Data from the National Children's Study must be accessed through the National Institutes of Health, National Institute of Child Health and Human Development's Data and Specimen Hub (DASH) at https://dash.nichd.nih.gov/. Format: Participant demographic, lifestyle, residence, occupational, and other types of data from questionnaire and observational survey instruments are in .csv and .xlsx files. PFAS measurements in serum and house dust in .csv files. This dataset is associated with the following publication: Wallis, D., K. Miller, N. Deluca, K. Thomas, C. Fuller, J. McCord, E. Cohen-Hubal, and J. Minucci. Understanding prenatal household exposures to per- and polyfluorylalkyl substances using paired Biological and dust measurements with sociodemographic and housing variables. ENVIRONMENT INTERNATIONAL. Elsevier B.V., Amsterdam, NETHERLANDS, 194(December): 109157, (2024).
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 1,000,000 scholarly articles, including over 400,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
We are issuing a call to action to the world's artificial intelligence experts to develop text and data mining tools that can help the medical community develop answers to high priority scientific questions. The CORD-19 dataset represents the most extensive machine-readable coronavirus literature collection available for data mining to date. This allows the worldwide AI research community the opportunity to apply text and data mining approaches to find answers to questions within, and connect insights across, this content in support of the ongoing COVID-19 response efforts worldwide.
A list of our initial key questions can be found under the Tasks section of this dataset. These key scientific questions are drawn from the NASEM’s SCIED (National Academies of Sciences, Engineering, and Medicine’s Standing Committee on Emerging Infectious Diseases and 21st Century Health Threats) research topics and the World Health Organization’s R&D Blueprint for COVID-19.
Many of these questions are suitable for text mining, and we encourage researchers to develop text mining tools to provide insights on these questions.
We are maintaining a summary of the community's contributions. For guidance on how to make your contributions useful, we're maintaining a forum thread with the feedback we're getting from the medical and health policy communities.
Kaggle is sponsoring a $1,000 per task award to the winner whose submission is identified as best meeting the evaluation criteria. The winner may elect to receive this award as a charitable donation to COVID-19 relief/research efforts or as a monetary payment. More details on the prizes and timeline can be found on the discussion post.
We have made this dataset available on Kaggle. Watch out for periodic updates.
The dataset is also hosted on AI2's Semantic Scholar. And you can search the dataset using AI2's new COVID-19 explorer.
The licenses for each dataset can be found in the all_sources_metadata csv file.
This dataset was created by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, IBM, and the National Library of Medicine - National Institutes of Health, in coordination with The White House Office of Science and Technology Policy.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This is a database snapshot of the iCite web service (provided here as a single zipped CSV file, or compressed, tarred JSON files). In addition, citation links in the NIH Open Citation Collection are provided as a two-column CSV table in open_citation_collection.zip. iCite provides bibliometrics and metadata on publications indexed in PubMed, organized into three modules:
Influence: Delivers metrics of scientific influence, field-adjusted and benchmarked to NIH publications as the baseline.
Translation: Measures how Human, Animal, or Molecular/Cellular Biology-oriented each paper is; tracks and predicts citation by clinical articles.
Open Cites: Disseminates link-level, public-domain citation data from the NIH Open Citation Collection.
Definitions for individual data fields:
pmid: PubMed Identifier, an article ID as assigned in PubMed by the National Library of Medicine
doi: Digital Object Identifier, if available
year: Year the article was published
title: Title of the article
authors: List of author names
journal: Journal name (ISO abbreviation)
is_research_article: Flag indicating whether the Publication Type tags for this article are consistent with that of a primary research article
relative_citation_ratio: Relative Citation Ratio (RCR), OPA's metric of scientific influence. Field-adjusted, time-adjusted, and benchmarked against NIH-funded papers. The median RCR for NIH-funded papers in any field is 1.0. An RCR of 2.0 means a paper is receiving twice as many citations per year as the median NIH-funded paper in its field and year, while an RCR of 0.5 means it is receiving half as many. Calculation details are documented in Hutchins et al., PLoS Biol. 2016;14(9):e1002541.
provisional: RCRs for papers published in the previous two years are flagged as "provisional", to reflect that citation metrics for newer articles are not as stable as those for older articles. Papers published the previous year receive provisional RCRs if they have received 5 or more citations, despite being, in many cases, less than a year old; all papers published the year before that receive provisional RCRs. The current year is considered to be the NIH Fiscal Year, which starts in October. For example, in July 2019 (NIH Fiscal Year 2019), papers from 2018 receive provisional RCRs if they have 5 or more citations, and all papers from 2017 receive provisional RCRs. In October 2019, at the start of NIH Fiscal Year 2020, papers from 2019 receive provisional RCRs if they have 5 or more citations, and all papers from 2018 receive provisional RCRs.
citation_count: Number of unique articles that have cited this one
citations_per_year: Citations per year that this article has received since its publication. If this appeared as a preprint and a published article, the year from the published version is used as the primary publication date. This is the numerator for the Relative Citation Ratio.
field_citation_rate: Measure of the intrinsic citation rate of this paper's field, estimated using its co-citation network.
expected_citations_per_year: Citations per year that NIH-funded articles, with the same Field Citation Rate and published in the same year as this paper, receive. This is the denominator for the Relative Citation Ratio.
nih_percentile: Percentile rank of this paper's RCR compared to all NIH publications. For example, 95% indicates that this paper's RCR is higher than 95% of all NIH funded publications.
human: Fraction of MeSH terms that are in the Human category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)
animal: Fraction of MeSH terms that are in the Animal category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)
molecular_cellular: Fraction of MeSH terms that are in the Molecular/Cellular Biology category (out of this article's MeSH terms that fall into the Human, Animal, or Molecular/Cellular Biology categories)
x_coord: X coordinate of the article on the Triangle of Biomedicine
y_coord: Y Coordinate of the article on the Triangle of Biomedicine
is_clinical: Flag indicating that this paper meets the definition of a clinical article.
cited_by_clin: PMIDs of clinical articles that have cited this article.
apt: Approximate Potential to Translate is a machine learning-based estimate of the likelihood that this publication will be cited in later clinical trials or guidelines. Calculation details are documented in Hutchins et al., PLoS Biol. 2019;17(10):e3000416.
cited_by: PMIDs of articles that have cited this one.
references: PMIDs of articles in this article's reference list.
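The ratio structure of these fields can be checked directly: relative_citation_ratio should equal citations_per_year divided by expected_citations_per_year. A minimal sketch over rows in the snapshot's CSV layout (the sample values below are made up for illustration, not taken from the real database):

```python
import csv
import io

# Illustrative rows in the iCite CSV layout (values are invented).
sample = io.StringIO(
    "pmid,year,citations_per_year,expected_citations_per_year,relative_citation_ratio\n"
    "11111111,2010,6.0,3.0,2.0\n"
    "22222222,2012,1.5,3.0,0.5\n"
)

for row in csv.DictReader(sample):
    # RCR is the ratio of actual to field- and year-expected citations per year.
    rcr = float(row["citations_per_year"]) / float(row["expected_citations_per_year"])
    assert abs(rcr - float(row["relative_citation_ratio"])) < 1e-9
    print(row["pmid"], rcr)
```

The same loop, pointed at the real snapshot file, is a quick sanity check that a download parsed cleanly.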
Large CSV files are zipped using zip version 4.5, which is more recent than the default unzip command line utility in some common Linux distributions. These files can be unzipped with tools that support version 4.5 or later such as 7zip.
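If 7zip is not available, Python's zipfile module is another option: it supports the zip64 extensions that zip version 4.5 introduces (whether it handles a given archive still depends on the compression method used, so treat this as an assumption to verify on the actual files). A self-contained sketch that writes and re-reads a small archive, since the real snapshot file is not assumed to be present:

```python
import os
import tempfile
import zipfile

# Create a small archive with zip64 allowed, then read it back.
# The same ZipFile interface is used to open the large zipped iCite CSVs.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED, allowZip64=True) as zf:
        zf.writestr("icite_sample.csv", "pmid,year\n11111111,2010\n")
    with zipfile.ZipFile(path) as zf:
        text = zf.read("icite_sample.csv").decode()

print(text.splitlines()[0])  # header row
```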
Comments and questions can be addressed to iCite@mail.nih.gov
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
RxNorm is a US-specific medical terminology that contains all medications available on the US market. Source: https://en.wikipedia.org/wiki/RxNorm
RxNorm provides normalized names for clinical drugs and links its names to many of the drug vocabularies commonly used in pharmacy management and drug interaction software, including those of First Databank, Micromedex, Gold Standard Drug Database, and Multum. By providing links between these vocabularies, RxNorm can mediate messages between systems not using the same software and vocabulary. Source: https://www.nlm.nih.gov/research/umls/rxnorm/
RxNorm was created by the U.S. National Library of Medicine (NLM) to provide a normalized naming system for clinical drugs, defined as the combination of {ingredient + strength + dose form}. In addition to the naming system, the RxNorm dataset also provides structured information such as brand names, ingredients, drug classes, and so on, for each clinical drug. Typical uses of RxNorm include navigating between names and codes among different drug vocabularies and using information in RxNorm to assist with health information exchange/medication reconciliation, e-prescribing, drug analytics, formulary development, and other functions.
This public dataset includes multiple data files originally released in RxNorm Rich Release Format (RXNRRF) that are loaded into BigQuery tables. The data is updated and archived on a monthly basis.
The following tables are included in the RxNorm dataset:
RXNCONSO contains concept and source information
RXNREL contains information regarding relationships between entities
RXNSAT contains attribute information
RXNSTY contains semantic information
RXNSAB contains source info
RXNCUI contains retired rxcui codes
RXNATOMARCHIVE contains archived data
RXNCUICHANGES contains concept changes
Update Frequency: Monthly
Fork this kernel to get started with this dataset.
https://www.nlm.nih.gov/research/umls/rxnorm/
https://bigquery.cloud.google.com/dataset/bigquery-public-data:nlm_rxnorm
https://cloud.google.com/bigquery/public-data/rxnorm
Dataset Source: Unified Medical Language System RxNorm. The dataset is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. This dataset uses publicly available data from the U.S. National Library of Medicine (NLM), National Institutes of Health, Department of Health and Human Services; NLM is not responsible for the dataset, does not endorse or recommend this or any other dataset.
Banner Photo by @freestocks from Unsplash.
What are the RXCUI codes for the ingredients of a list of drugs?
Which ingredients have the most variety of dose forms?
In what dose forms is the drug phenylephrine found?
What are the ingredients of the drug labeled with the generic code number 072718?
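As an illustration of the first question, ingredient-level RXCUIs can be looked up by name in RXNCONSO, where the term type (TTY) 'IN' marks ingredient names. This sketch only builds the query string so it stays runnable offline; the table path is an assumption (check the dataset listing for the exact table names), and executing it requires the google-cloud-bigquery client with credentials:

```python
# Assumed table path -- verify against the BigQuery dataset listing.
TABLE = "bigquery-public-data.nlm_rxnorm.rxnconso_current"

def ingredient_rxcui_query(drug_name: str) -> str:
    """Build a query for ingredient (TTY = 'IN') concepts matching a drug name."""
    return (
        "SELECT DISTINCT RXCUI, STR "
        f"FROM `{TABLE}` "
        "WHERE TTY = 'IN' "  # 'IN' is RxNorm's term type for ingredients
        f"AND LOWER(STR) = LOWER('{drug_name}')"
    )

query = ingredient_rxcui_query("phenylephrine")
print(query)
# To execute: google.cloud.bigquery.Client().query(query).result()
```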
License: MIT License, https://opensource.org/licenses/MIT
Abstract
MIMIC-III is a large, freely available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [1]. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, we have manually curated a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. Notably, the demo dataset does not include free-text notes.

Background
In recent years there has been a concerted move towards the adoption of digital health record systems in hospitals. Despite this advance, interoperability of digital systems remains an open issue, leading to challenges in data integration. As a result, the potential that hospital data offers in terms of understanding and improving care is yet to be fully realized.
MIMIC-III integrates deidentified, comprehensive clinical data of patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts, and makes it widely accessible to researchers internationally under a data use agreement. The open nature of the data allows clinical studies to be reproduced and improved in ways that would not otherwise be possible.
The MIMIC-III database was populated with data that had been acquired during routine hospital care, so there was no associated burden on caregivers and no interference with their workflow. For more information on the collection of the data, see the MIMIC-III Clinical Database page.
Methods
The demo dataset contains all intensive care unit (ICU) stays for 100 patients. These patients were selected randomly from the subset of patients in the dataset who eventually die. Consequently, all patients will have a date of death (DOD). However, patients do not necessarily die during an individual hospital admission or ICU stay.
This project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was deidentified.
Data Description
MIMIC-III is a relational database consisting of 26 tables. For a detailed description of the database structure, see the MIMIC-III Clinical Database page. The demo shares an identical schema, except all rows in the NOTEEVENTS table have been removed.
The data files are distributed in comma separated value (CSV) format following the RFC 4180 standard. Notably, string fields which contain commas, newlines, and/or double quotes are encapsulated by double quotes ("). Actual double quotes in the data are escaped using an additional double quote. For example, the string she said "the patient was notified at 6pm" would be stored in the CSV as "she said ""the patient was notified at 6pm""". More detail is provided on the RFC 4180 description page: https://tools.ietf.org/html/rfc4180
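Python's csv module implements exactly this RFC 4180 quoting, so the example from the paragraph above round-trips unchanged (a quick check):

```python
import csv
import io

original = 'she said "the patient was notified at 6pm"'

# Write: the csv module wraps the field in quotes and doubles the inner quotes.
buf = io.StringIO()
csv.writer(buf).writerow([original])
encoded = buf.getvalue().strip()
assert encoded == '"she said ""the patient was notified at 6pm"""'

# Read it back: the escaping is undone transparently.
(decoded,) = next(csv.reader(io.StringIO(encoded)))
assert decoded == original
```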
Usage Notes
The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset.
CSV files can be opened natively using any text editor or spreadsheet program. However, some tables are large, and it may be preferable to navigate the data stored in a relational database. One alternative is to create an SQLite database from the CSV files. SQLite is a lightweight database format that stores all constituent tables in a single file, and SQLite databases interoperate well with a number of software tools.
DB Browser for SQLite is a high quality, visual, open source tool to create, design, and edit database files compatible with SQLite. We have found this tool to be useful for navigating SQLite files. Information regarding installation of the software and creation of the database can be found online: https://sqlitebrowser.org/
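The CSV-to-SQLite route can also be scripted with Python's built-in csv and sqlite3 modules. A minimal sketch using an invented two-column table in place of a real MIMIC-III file (the column names here are illustrative only):

```python
import csv
import io
import sqlite3

# Stand-in for one MIMIC-III CSV file; real files are read with open(path).
demo_csv = io.StringIO("subject_id,gender\n1,F\n2,M\n")
rows = list(csv.DictReader(demo_csv))

con = sqlite3.connect(":memory:")  # use a file path to produce a .db file
con.execute("CREATE TABLE patients (subject_id INTEGER, gender TEXT)")
# Named placeholders let executemany consume the DictReader rows directly.
con.executemany("INSERT INTO patients VALUES (:subject_id, :gender)", rows)

count, = con.execute("SELECT COUNT(*) FROM patients").fetchone()
print(count)  # 2
```

Repeating the CREATE/INSERT pair per CSV file yields a single queryable database from the whole demo.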
Release Notes
Release notes for the demo follow the release notes for the MIMIC-III database.

Acknowledgements
This research and development was supported by grants NIH-R01-EB017205, NIH-R01-EB001659, and NIH-R01-GM104987 from the National Institutes of Health. The authors would also like to thank Philips Healthcare and staff at the Beth Israel Deaconess Medical Center, Boston, for supporting database development, and Ken Pierce for providing ongoing support for the MIMIC research community.

Conflicts of Interest
The authors declare no competing financial interests.

References
Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Mo...
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
Dataset created using https://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/. Appropriate transformations and edits were applied to make it more usable.
"This table below is a knowledge database of disease-symptom associations generated by an automated method based on information in textual discharge summaries of patients at New York Presbyterian Hospital admitted during 2004. The first column shows the disease, the second the number of discharge summaries containing a positive and current mention of the disease, and the associated symptom. Associations for the 150 most frequent diseases based on these notes were computed and the symptoms are shown ranked based on the strength of association. The method used the MedLEE natural language processing system to obtain UMLS codes for diseases and symptoms from the notes; then statistical methods based on frequencies and co-occurrences were used to obtain the associations. A more detailed description of the automated method can be found in Wang X, Chused A, Elhadad N, Friedman C, Markatou M. Automated knowledge acquisition from clinical reports. AMIA Annu Symp Proc. 2008. p. 783-7. PMCID: PMC2656103."
This is the CSV file of clinical trial data used for an interactive visualization at Aero Data Lab (https://www.aerodatalab.org/birds-eye-view-of-research-landscape).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset was produced in 2023 from data collected throughout 2022 from MEDLINE (scientific articles) and from Event Registry (news) for the development of the Rare Diseases Mining project (https://idefine-europe.org/medline). The data is distributed across 16 diseases, supporting the research paper "Automatic text classification and interactive data visualization of published scientific and news articles on Rare Diseases". The available data comes in 2 file formats:
CSV - the hand annotation of the news articles in TXT with 5 to 10 MeSH headings
JSON - the input file for the evaluation of the classifier, including the title, news article body, and MeSH heading IDs (available from https://www.ncbi.nlm.nih.gov/mesh/)
The CSV files with names starting in "f1_", "pr_", "re_" are the results of the F1/Precision/Recall evaluation for each of the cases. This work was prepared by Joao Pita Costa (researcher) and curated by Tanja Zdolšek Draksler (domain expert).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
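Deduplicated blobs in Software Heritage-derived datasets are commonly referenced by their git-style SHA1 object identifier. A sketch of how such a content checksum is computed (check the included README for the exact scheme this dataset uses):

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """SHA1 of a git 'blob' object: header 'blob <len>\\0' plus the content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The well-known hash of the empty blob serves as a built-in check.
assert git_blob_sha1(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"

digest = git_blob_sha1(b"MIT License\n")
print(digest)
```

Computing this over a downloaded blob and comparing it with the identifier in the metadata CSVs verifies the file's integrity.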
For more details see the included README file and companion paper:
Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In Proceedings of the 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, Pennsylvania, United States. ACM, 2022.
If you use this dataset for research purposes, please acknowledge its use by citing the above paper.