wshuai190/pubmed-pmc-sr-filtered
Dataset Description
This dataset contains medical literature data for training Boolean query generation models. The data includes PubMed articles with their associated metadata, references, and result section PMIDs.
Dataset Structure
Data Fields
pmid: PubMed ID of the article
pmc-id: PMC ID (if available)
title: Article title
max-date: Maximum publication date
references-pmids: List of PMIDs referenced in the article…
See the full description on the dataset page: https://huggingface.co/datasets/wshuai190/pubmed-pmc-sr-filtered.
https://brightdata.com/license
Unlock valuable biomedical knowledge with our comprehensive PubMed Dataset, designed for researchers, analysts, and healthcare professionals to track medical advancements, explore drug discoveries, and analyze scientific literature.
Dataset Features
Scientific Articles & Abstracts: Access structured data from PubMed, including article titles, abstracts, authors, publication dates, and journal sources.
Medical Research & Clinical Studies: Retrieve data on clinical trials, drug research, disease studies, and healthcare innovations.
Keywords & MeSH Terms: Extract key medical subject headings (MeSH) and keywords to categorize and analyze research topics.
Publication & Citation Data: Track citation counts, journal impact factors, and author affiliations for academic and industry research.
Customizable Subsets for Specific Needs
Our PubMed Dataset is fully customizable, allowing you to filter data based on publication date, research category, keywords, or specific journals. Whether you need broad coverage for medical research or focused data for pharmaceutical analysis, we tailor the dataset to your needs.
Popular Use Cases
Pharmaceutical Research & Drug Development: Analyze clinical trial data, drug efficacy studies, and emerging treatments.
Medical & Healthcare Intelligence: Track disease outbreaks, healthcare trends, and advancements in medical technology.
AI & Machine Learning Applications: Use structured biomedical data to train AI models for predictive analytics, medical diagnosis, and literature summarization.
Academic & Scientific Research: Access a vast collection of peer-reviewed studies for literature reviews, meta-analyses, and academic publishing.
Regulatory & Compliance Monitoring: Stay updated on medical regulations, FDA approvals, and healthcare policy changes.
Whether you're conducting medical research, analyzing healthcare trends, or developing AI-driven solutions, our PubMed Dataset provides the structured data you need. Get started today and customize your dataset to fit your research objectives.
Original data from: https://github.com/armancohan/long-summarization
The first 3000 rows of the test split of the original dataset were processed and filtered as follows.
In the original dataset, some sentences appear several times in the same article, even though they occur only once in the original research paper. For this reason, all dataset rows in which the same sentence appeared more than once were removed. In the original dataset, every sentence is a separate string, and these strings…
See the full description on the dataset page: https://huggingface.co/datasets/giuliadc/pubmed-filtered.
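A minimal sketch of the duplicate-sentence filter described above. It assumes each row stores its article as a list of sentence strings under a field named `article_text`; that field name, the sample rows, and the choice to compare sentences after stripping surrounding whitespace are assumptions for illustration, not details confirmed by the dataset page.

```python
from typing import Dict, List


def has_repeated_sentence(sentences: List[str]) -> bool:
    """Return True if any sentence occurs more than once in the list."""
    seen = set()
    for sentence in sentences:
        key = sentence.strip()  # assumption: ignore surrounding whitespace
        if key in seen:
            return True
        seen.add(key)
    return False


def drop_rows_with_repeats(rows: List[Dict], field: str = "article_text") -> List[Dict]:
    """Keep only rows whose sentence list contains no repeated sentence."""
    return [row for row in rows if not has_repeated_sentence(row[field])]


# Hypothetical rows, invented for this example.
rows = [
    {"id": 1, "article_text": ["Background.", "Methods.", "Results."]},
    {"id": 2, "article_text": ["Background.", "Methods.", "Background."]},
]
kept = drop_rows_with_repeats(rows)
print([row["id"] for row in kept])  # prints [1]
```

Dropping the whole row, rather than deduplicating sentences inside it, mirrors the description above: a repeated sentence is treated as a sign that the row is unreliable.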
timaeus/dsir-pile-13m-filtered-for-pubmed-central dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Using Large Language Models to directly screen electronic databases as an alternative to traditional search strategies in systematic reviews: the example of the Cochrane Highly Sensitive Search
The enclosed files correspond to 1) all studies published in MEDLINE between September 1st and September 30th 2024, retrieved using the sole keyword "diabetes"; and 2) studies published in MEDLINE between September 1st and September 30th 2024, retrieved using the keyword "diabetes" together with the Cochrane Highly Sensitive Search.
The code used to process the data is provided as supplementary material in the publication.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Medical journal usage counts across 814 clinical locations in the U.S. and Canada from 2009 - 2015.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Adaptive stress response pathways (SRPs) restore cellular homeostasis following perturbation but may activate terminal outcomes like apoptosis, autophagy, or cellular senescence if disruption exceeds critical thresholds. Because SRPs hold the key to vital cellular tipping points, they are targeted for therapeutic interventions and assessed as biomarkers of toxicity. Hence, we are developing a public database of chemicals that perturb SRPs to enable new data-driven tools to improve public health. Here, we report on the automated text-mining pipeline we used to build and curate the first version of this database. We started with 100 reference SRP chemicals gathered from published biomarker studies to bootstrap the database. Second, we used information retrieval to find co-occurrences of reference chemicals with SRP terms in PubMed abstracts and determined pairwise mutual information thresholds to filter biologically relevant relationships. Third, we applied these thresholds to find 1206 putative SRP perturbagens within thousands of substances in the Library of Integrated Network-Based Cellular Signatures (LINCS). To assign SRP activity to LINCS chemicals, domain experts had to manually review at least three publications for each of 1206 chemicals out of 181,805 total abstracts. To accomplish this efficiently, we implemented a machine learning approach to predict SRP classifications from texts to prioritize abstracts. In 5-fold cross-validation testing with a corpus derived from the 100 reference chemicals, artificial neural networks performed the best (F1-macro = 0.678) and prioritized 2479/181,805 abstracts for expert review, which resulted in 457 chemicals annotated with SRP activities. An independent analysis of enriched mechanisms of action and chemical use class supported the text-mined chemical associations (p < 0.05): heat shock inducers were linked with HSP90 and DNA damage inducers to topoisomerase inhibition. 
This database will enable novel applications of LINCS data to evaluate SRP activities and to further develop tools for biomedical information extraction from the literature.
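The co-occurrence filtering step can be illustrated with pointwise mutual information (PMI), one common way to score how strongly a chemical and a term co-occur across abstracts. This is a sketch only: the abstract does not give the exact formula or the thresholds used, so the `pmi` definition, the counts, the names `chemA` and `chemB`, and the threshold below are all assumptions for illustration.

```python
import math
from typing import Dict, Tuple


def pmi(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Pointwise mutual information (base 2) from abstract counts.

    n_xy: abstracts mentioning both items; n_x, n_y: abstracts mentioning
    each item; n_total: abstracts in the corpus.
    """
    if n_xy == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2((n_xy * n_total) / (n_x * n_y))


def filter_pairs(
    cooccurrence: Dict[Tuple[str, str], int],
    counts: Dict[str, int],
    n_total: int,
    threshold: float,
) -> Dict[Tuple[str, str], float]:
    """Keep chemical-term pairs whose PMI reaches the threshold."""
    kept = {}
    for (chemical, term), n_xy in cooccurrence.items():
        score = pmi(n_xy, counts[chemical], counts[term], n_total)
        if score >= threshold:
            kept[(chemical, term)] = score
    return kept


# Hypothetical counts over a corpus of 1000 abstracts.
counts = {"chemA": 100, "chemB": 100, "heat shock": 100}
cooccurrence = {("chemA", "heat shock"): 50, ("chemB", "heat shock"): 10}
kept = filter_pairs(cooccurrence, counts, n_total=1000, threshold=1.0)
print(sorted(kept))  # prints [('chemA', 'heat shock')]
```

With these invented counts, `chemB` co-occurs with the term exactly as often as chance would predict (PMI of 0) and is filtered out, while `chemA` co-occurs five times more often than chance and is kept.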
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The PubMed model contains over 18 million PubMed documents (1996-2019) clustered into 28,743 clusters for use in research planning, portfolio analysis, systematic review, etc. This repository contains the PMID-to-cluster listing, an Excel workbook that characterizes each cluster with metadata and cluster-level indicators, and a Tableau workbook containing those same data plus a visual map and filters that can be used to explore the landscape and analyze cluster-level information. Model created by SciTech Strategies, Inc. Details can be found in the accompanying article published in Scientific Data at https://www.nature.com/articles/s41597-020-00749-y (or https://rdcu.be/ca4kv).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Micro-FTIR Filter Images for Particle Detection
This dataset consists of annotated images of filters containing particles. Its primary objective is to serve as training and validation data for developing a particle detection model using computer vision techniques. More specifically, the dataset can be used to train an image segmentation model for use with GEPARD (https://pubmed.ncbi.nlm.nih.gov/32436395/) in order to perform efficient particle detection and analysis with a Micro-FTIR microscope.
Two kinds of samples are used in our case:
In the first case, particles were annotated easily, as they are clearly visible on the filter. In the second case, the most distinguishable particles in the image were annotated.
Note
In the case of saturated filters, the correct method would be to collect a spectral image of the entire filter using an FPA detector or similar and then use tools (e.g. sIMPle) to analyse this image. However, in our scenario such a detector was not available, and a semi-random, operator-dependent method had to be used to select particles or points for scanning.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Protein-Protein, Genetic, and Chemical Interactions for Du K (2002):Human VPAC1 receptor selectivity filter. Identification of a critical domain for restricting secretin binding. curated by BioGRID (https://thebiogrid.org); ABSTRACT: The human VPAC1 receptor for vasoactive intestinal peptide (VIP) and pituitary adenylate cyclase activating peptide (PACAP) belongs to the class II family of G protein coupled receptors with seven transmembrane segments. It recognizes several VIP-related peptides and displays a very low affinity for secretin despite >70% homology between VIP and secretin. Conversely, the human secretin receptor has high affinity for secretin but low affinity for VIP. We took advantage of this reversed selectivity to identify a domain of the VPAC1 receptor responsible for selectivity toward secretin by constructing human VPAC1-secretin receptor chimeras. A first set of chimeras consisted of exchanging the entire N-terminal ectodomain or large parts of this domain. They were constructed by overlap PCR, transfected in COS-7 cells, and their ligand selectivity, expressed as the ratio of EC(50) for secretin/EC(50) for VIP (referred to as S/V), in stimulating cAMP production was measured. Two very informative chimeras respectively referred to as S144V and S123V were obtained by replacing the entire ectodomain or only the first 123 amino acids of the VPAC1 receptor by the corresponding sequences of the secretin receptor. Whereas S144V no longer discriminated between VIP and secretin (S/V = 1.2), S123V discriminated between the two peptides (S/V = 300) in the same manner as the wild-type VPAC1 receptor. The motif responsible for discrimination was determined by introducing small blocks or individual amino acids of secretin receptor in the 123-144 sequence of the S123V chimera. 
The data obtained from 14 new chimeras sustained that two nonadjacent pairs of amino acids, Gln(135) Thr(136) and Gly(140) Ser(141) in the C-terminal end of the N-terminal VPAC1 receptor ectodomain constitute a selective filter that strongly restricts access of secretin to the VPAC1 receptor.
timaeus/dsir-pile-100k-filtered-for-pubmed-central dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset supports filterNHP, an R package and web-based application for generating search filters to query scientific bibliographic sources (PubMed, PsycINFO, Web of Science) for non-human primate related publications. filterNHP can be found at: https://filterNHP.dpz.eu.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This note describes the data sets used for all analyses contained in the manuscript 'Oxytocin - a social peptide?' [1], which is currently under review.
Data Collection
The data sets described here were originally retrieved from the Web of Science (WoS) Core Collection via the University of Edinburgh's library subscription [2]. The aim of the original study for which these data were gathered was to survey peer-reviewed primary studies on oxytocin and social behaviour. To capture relevant papers, we used the following query:
TI = ("oxytocin" OR "pitocin" OR "syntocinon") AND TS = ("social*" OR "pro$social" OR "anti$social")
The final search was performed on 13 September 2021. This returned a total of 2,747 records, of which 2,049 were classified by WoS as "articles". Given our interest in primary studies only (articles reporting original data), we excluded all other document types. We further excluded all articles sub-classified as "book chapters" or as "proceeding papers" in order to limit our analysis to primary studies published in peer-reviewed academic journals. This reduced the set to 1,977 articles. All of these were published in the English language, so no further language refinements were necessary.
All available metadata on these 1,977 articles were exported as plain text "flat" format files in four batches, which we later merged via Notepad++. Upon manual examination, we discovered examples of papers classified as "articles" by WoS that were, in fact, reviews. To further filter our results, we searched all available PMIDs in PubMed (1,903 had associated PMIDs, ~96% of the set). We then filtered the results to identify all records classified as "review", "systematic review", or "meta-analysis", identifying 75 records [3]. After examining a sample and agreeing with the PubMed classification, we removed these from our dataset, leaving a total of 1,902 articles.
From these data, we constructed two datasets by parsing out relevant reference data with the Sci2 Tool [4]. First, we constructed a "node-attribute-list" by linking unique reference strings ("Cite Me As" column in WoS data files) to unique identifiers. We then parsed into this dataset information on the identity of each paper, including the title of the article, all authors, journal of publication, year of publication, total citations as recorded by WoS, and WoS accession number. Second, we constructed an "edge-list" that records the citing paper in the "Source" column and the cited paper in the "Target" column, using the unique identifiers described previously to link these data to the node-attribute-list.
We then constructed a network in which papers are nodes and citation links are directed edges between nodes. We used Gephi Version 0.9.2 [5] to manually clean these data by merging duplicate references caused by different reference formats or by referencing errors. To do this, we needed to retain all retrieved records (1,902) together with all of their references, whether or not the referenced papers were included in our original search. In total, this produced a network of 46,633 nodes (unique reference strings) and 112,520 edges (citation links). Thus, the average reference list size of these articles is ~59 references. The mean in-degree (within-network citations) is 2.4 (median 1) for the entire network, reflecting a great diversity in referencing choices among our 1,902 articles.
After merging duplicates, we restricted the network to include only the fully retrieved articles (1,902), and retained only those that were connected by citation links in one large interconnected network (i.e. the largest component). In total, 1,892 (99.5%) of our initial set were connected via citation links, meaning that ten papers were removed from the following analysis. These ten were neither connected to the largest component nor connected to one another (i.e. they were "isolates").
This left us with a network of 1,892 nodes connected by 26,019 edges. It is this network that is described by the "node-attribute-list" and "edge-list" provided here. This network has a mean in-degree of 13.76 (median in-degree of 4). By restricting our analysis in this way, we lose 44,741 unique references (96%) and 86,501 citations (77%) from the full network, but retain a tightly knit set of articles, all of which were fully retrieved because they contain terms related to oxytocin AND social behaviour in their title, abstract, or associated keywords.
Before moving on, we calculated the in-degree for all nodes in this network, which counts the number of citations to a given paper from other papers within the network, and have included it in the node-attribute-list. We further clustered this network by modularity maximisation using the Leiden algorithm [6]. We set the resolution to 1 and allowed the algorithm to run over 100 iterations and 100 restarts. This gave Q = 0.43 and identified seven clusters, which we describe in detail within the body of the paper. We have included cluster membership as an attribute in the node-attribute-list.
Data description
We include here two datasets: (i) "OTSOC-node-attribute-list.csv" consists of the attributes of 1,892 primary articles retrieved from WoS that include terms indicating a focus on oxytocin and social behaviour; (ii) "OTSOC-edge-list.csv" records the citations between these papers. Together, these can be imported into a range of different software for network analysis; however, we have formatted them for ease of upload into Gephi 0.9.2. Below, we detail their contents:
OTSOC-node-attribute-list.csv:
Id, the unique identifier.
Label, the reference string of the paper to which the attributes in this row correspond. This is taken from the "Cite Me As" column from the original WoS download. The reference string is in the following format: last name of first author, publication year, journal, volume, start page, and DOI (if available).
Wos_id, unique Web of Science (WoS) accession number. These can be used to query WoS to find further data on all papers via the "UT=" field tag.
Title, paper title.
Authors, all named authors.
Journal, journal of publication.
Pub_year, year of publication.
Wos_citations, total number of citations recorded by WoS Core Collection to a given paper as of 13 September 2021.
Indegree, the number of within-network citations to a given paper, calculated for the network shown in Figure 1 of the manuscript.
Cluster, the cluster membership number as discussed within the manuscript (Figure 1). This was established by modularity maximisation using the Leiden algorithm (resolution 1; Q = 0.43; 7 clusters).
OTSOC-edge-list.csv:
Source, the unique identifier of the citing paper.
Target, the unique identifier of the cited paper.
Type, edges are "Directed", and this column tells Gephi to regard all edges as such.
Syr_date, this contains the date of publication of the citing paper.
Tyr_date, this contains the date of publication of the cited paper.
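The Source and Target columns are enough to recompute the Indegree attribute outside Gephi. The sketch below assumes the edge list is a comma-separated file whose header row uses the column names listed above. The three sample edges and the four node identifiers are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable


def indegree(edge_rows: Iterable[Dict[str, str]]) -> Counter:
    """Count, for each paper, how many edges point to it (times cited)."""
    counts = Counter()
    for row in edge_rows:
        counts[row["Target"]] += 1
    return counts


# Hypothetical edge list: papers 1 and 2 cite paper 3, and paper 3 cites paper 4.
sample = "Source,Target,Type\n1,3,Directed\n2,3,Directed\n3,4,Directed\n"
edges = list(csv.DictReader(io.StringIO(sample)))
counts = indegree(edges)

# The mean in-degree averages over all nodes, including those never cited.
node_ids = {"1", "2", "3", "4"}
mean_indegree = sum(counts.values()) / len(node_ids)
print(counts["3"], mean_indegree)  # prints 2 0.75
```

To run this on the real file, replace the `io.StringIO(sample)` argument with an open file handle for OTSOC-edge-list.csv and take the node identifiers from the Id column of the node-attribute-list.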
Software recommended for analysis
Gephi version 0.9.2 was used for the visualisations within the manuscript, and both files can be read into Gephi without modification.
Notes
[1] Leng, G., Leng, R. I., Ludwig, M. (Submitted). Oxytocin - a social peptide? Deconstructing the evidence.
[2] Edinburgh University's subscription to Web of Science covers the following databases: (i) Science Citation Index Expanded, 1900-present; (ii) Social Sciences Citation Index, 1900-present; (iii) Arts & Humanities Citation Index, 1975-present; (iv) Conference Proceedings Citation Index - Science, 1990-present; (v) Conference Proceedings Citation Index - Social Science & Humanities, 1990-present; (vi) Book Citation Index - Science, 2005-present; (vii) Book Citation Index - Social Sciences & Humanities, 2005-present; (viii) Emerging Sources Citation Index, 2015-present.
[3] For those interested, the following PMIDs were identified as "articles" by WoS, but as "reviews" by PubMed: 34502097, 33400920, 32060678, 31925983, 31734142, 30496762, 30253045, 29660735, 29518698, 29065361, 29048602, 28867943, 28586471, 28301323, 27974283, 27626613, 27603523, 27603327, 27513442, 27273834, 27071789, 26940141, 26932552, 26895254, 26869847, 26788924, 26581735, 26548910, 26317636, 26121678, 26094200, 25997760, 25631363, 25526824, 25446893, 25153535, 25092245, 25086828, 24946432, 24637261, 24588761, 24508579, 24486356, 24462936, 24239932, 24239931, 24231551, 24216134, 23955310, 23856187, 23686025, 23589638, 23575742, 23469841, 23055480, 22981649, 22406388, 22373652, 22141469, 21960250, 21881219, 21802859, 21714746, 21618004, 21150165, 20435805, 20173685, 19840865, 19546570, 19309413, 15288368, 12359512, 9401603, 9213136, 7630585
[4] Sci2 Team. (2009). Science of Science (Sci2) Tool. Indiana University and SciTech Strategies. Stable URL: https://sci2.cns.iu.edu
[5] Bastian, M., Heymann, S., & Jacomy, M. (2009).
RegulaTome corpus: this file contains the RegulaTome corpus in BRAT format. The directory "splits" has the corpus split based on the train/dev/test used for the training of the relation extraction system.
RegulaTome annodoc: The annotation guidelines along with the annotation configuration files for BRAT are provided in annodoc+config.tar.gz. The online version of the annotation documentation can be found here: https://katnastou.github.io/regulatome-annodoc/
The tagger software can be found here: https://github.com/larsjuhljensen/tagger. The command used to run tagger before large-scale execution of the RE system is:
gzip -cd $(ls -1 pmc/*.en.merged.filtered.tsv.gz) $(ls -1r pubmed/*.tsv.gz) \
  | cat dictionary/excluded_documents.txt - \
  | tagger/tagcorpus --threads=16 --autodetect \
      --types=dictionary/curated_types.tsv \
      --entities=dictionary/all_entities.tsv \
      --names=dictionary/all_names_textmining.tsv \
      --groups=dictionary/all_groups.tsv \
      --stopwords=dictionary/all_global.tsv \
      --local-stopwords=dictionary/all_local.tsv \
      --type-pairs=dictionary/all_type_pairs.tsv \
      --out-matches=all_matches.tsv
Input documents for large-scale execution: the RE system is run on the entire PubMed (as of March 2024) and PMC Open Access (as of November 2023) articles in BioC format. The files are converted to a tab-delimited format to be compatible with the RE system input (see below).
Input dictionary files: all the files necessary to execute the command above are available in tagger_dictionary_files.tar.gz
Tagger output: we filter the results of the tagger run down to gene/protein hits, and documents with more than 1 hit (since we are doing relation extraction) before feeding it to our RE system. The filtered output is available in tagger_matches_ggp_only_gt_1_hit.tsv.gz
Relation extraction system input: combined_input_for_re.tar.gz: these are the directories with all the .ann and .txt files used as input for the large-scale execution of the relation extraction pipeline. The files are generated from the tagger tsv output (see above, tagger_matches_ggp_only_gt_1_hit.tsv.gz) using the tagger2standoff.py script from the string-db-tools repository.
Relation extraction models: the Transformer-based model used for large-scale relation extraction and prediction on the test set is at relation_extraction_multi-label-best_model.tar.gz
The pre-trained RoBERTa model on PubMed and PMC and MIMIC-III with a BPE Vocab learned from PubMed (RoBERTa-large-PM-M3-Voc), which is used by our system is available here.
Relation extraction system output: the tab-delimited outputs of the relation extraction system are found at large_scale_relation_extraction_results.tar.gz. ATTENTION: this file is approximately 1 TB in size, so make sure you have enough space to download it on your machine.
The relation extraction system output files have 86 columns: PMID, Entity BRAT ID1, Entity BRAT ID2, and scores per class produced by the relation extraction model. Each file has a header to denote which score is in which column.
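The output layout described above (three identifier columns followed by one score column per class, with a header row) can be read with the standard library alone. The sketch below is illustrative: the header names `ClassA` and `ClassB`, the sample row, and the 0.5 cut-off are assumptions, and the real files carry 83 score columns rather than two.

```python
import csv
import io
from typing import Dict, List, Tuple


def classes_above_threshold(
    tsv_text: str, threshold: float = 0.5, n_id_columns: int = 3
) -> List[Tuple[Tuple[str, ...], Dict[str, float]]]:
    """For each entity pair, return the classes scoring at or above the threshold."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    class_names = header[n_id_columns:]  # the header names each score column
    results = []
    for row in reader:
        ids = tuple(row[:n_id_columns])  # PMID and the two entity BRAT IDs
        scores = [float(value) for value in row[n_id_columns:]]
        selected = {
            name: score
            for name, score in zip(class_names, scores)
            if score >= threshold
        }
        results.append((ids, selected))
    return results


# Hypothetical two-class excerpt, invented for this example.
sample = "PMID\tEntity1\tEntity2\tClassA\tClassB\n12345\tT1\tT2\t0.10\t0.85\n"
print(classes_above_threshold(sample))
# prints [(('12345', 'T1', 'T2'), {'ClassB': 0.85})]
```

A per-class threshold is used instead of a single best class because the model is described as multi-label, so one entity pair may legitimately carry several relation types. Given the stated size of the archive, a real run would stream rows from the file instead of holding the text in memory.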
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bibliographic data of biomedical systematic reviews and meta-analysis studies published between 2014 and 2019, where at least one author is affiliated with an institution in Sub-Saharan Africa, were retrieved from MEDLINE via the PubMed search engine. All forty-six (46) countries in Sub-Saharan Africa were included in the search query as affiliations. The search strategy is described in four steps:
Step #1: Nigeria[Affiliation] OR South Africa[Affiliation] OR Ghana[Affiliation] OR Tanzania[Affiliation] OR Kenya[Affiliation] OR Rwanda[Affiliation] OR Botswana[Affiliation] OR Cameroun[Affiliation] OR Senegal[Affiliation] OR Angola[Affiliation] OR Uganda[Affiliation] OR Mali[Affiliation] OR Sierra Leone[Affiliation] OR Ivory Coast[Affiliation] OR Ethiopia[Affiliation] OR Lesotho[Affiliation] OR Zambia[Affiliation] OR Zimbabwe[Affiliation] OR Namibia[Affiliation] OR Guinea[Affiliation] OR Mauritius[Affiliation] OR Mozambique[Affiliation] OR Niger[Affiliation] OR Seychelles[Affiliation] OR Burkina Faso[Affiliation] OR Burundi[Affiliation] OR Cape Verde[Affiliation] OR Cameroon[Affiliation] OR Central African Republic[Affiliation] OR Chad[Affiliation] OR Comoros[Affiliation] OR Democratic Republic of Congo[Affiliation] OR DR Congo[Affiliation] OR Djibouti[Affiliation] OR Cote D'ivoire[Affiliation] OR Congo[Affiliation] OR Equatorial Guinea[Affiliation] OR Eritrea[Affiliation] OR Gabon[Affiliation] OR Guinea-Bissau[Affiliation] OR Madagascar[Affiliation] OR Congo Republic[Affiliation] OR Sao Tome and Principe[Affiliation] OR Swaziland[Affiliation] OR Togo[Affiliation] OR Benin[Affiliation] OR Liberia[Affiliation] OR Namibia[Affiliation] OR Gambia[Affiliation] OR (Cent Afr Republ[Affiliation]) OR (Equat Guinea[Affiliation]) OR (Papua N Guinea[Affiliation]) OR (Sao Tome E Prin[Affiliation]) OR Principe[Affiliation] OR Sao Tome E Principe[Affiliation]
Step #2: The filter was set to Meta-Analysis[ptyp] OR systematic[sb]
Step #3: Text word search systematic review[Text Word] OR meta-analysis[Text Word] OR meta analysis[Text Word]
Step #4: Set publication date to: "2014/01/01"[PDAT] : "2019/12/31"[PDAT]
The search, performed on April 2nd, 2020, returned 3,171 results. The bibliographic data collected with these PubMed queries were cleaned: duplicates were removed, as were articles that were not meta-analyses or systematic reviews. MEDLINE is an authoritative and specialized biomedical database for indexing biomedical publications.
Query: (Step #1) AND (Step #2 OR Step #3) AND (Step #4)
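The four steps combine mechanically into a single PubMed query string. The sketch below assembles it from the field tags quoted above. The helper names `or_join` and `build_query` are hypothetical, and only two countries are passed in the example to keep the output short.

```python
from typing import Iterable


def or_join(terms: Iterable[str]) -> str:
    """Join search terms with the PubMed OR operator."""
    return " OR ".join(terms)


def build_query(countries: Iterable[str], start: str, end: str) -> str:
    """Compose (Step #1) AND (Step #2 OR Step #3) AND (Step #4)."""
    step1 = or_join(f"{country}[Affiliation]" for country in countries)
    step2 = "Meta-Analysis[ptyp] OR systematic[sb]"
    step3 = or_join(
        f"{text}[Text Word]"
        for text in ("systematic review", "meta-analysis", "meta analysis")
    )
    step4 = f'"{start}"[PDAT] : "{end}"[PDAT]'
    return f"({step1}) AND (({step2}) OR ({step3})) AND ({step4})"


query = build_query(["Nigeria", "Ghana"], "2014/01/01", "2019/12/31")
print(query)
```

Each step is wrapped in its own parentheses so that the grouping does not depend on how the search engine orders AND and OR.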
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises a collection of neuroscientific articles published between January 1, 1999, and December 31, 2023. The compilation includes information on articles and research domain clusters in multiple formats, including CSV, GraphML, and HDF5.
cluster_citation_density.graphml file. This has now been corrected.
.
├── Code
│   ├── notebooks
│   │   ├── keyword_search.ipynb
│   │   ├── exploring_clusters.ipynb
│   │   ├── loading_article_shards.ipynb
│   │   ├── traversing_article_graph.ipynb
│   │   ├── discipline_classification.ipynb
│   │   └── from_generic_to_domain_embedding.ipynb
│   ├── requirements.txt
│   └── src
│       ├── data_types.py
│       └── utils.py
└── Data
    ├── CSV
    │   ├── neuroscience_articles_1999-2023.csv
    │   ├── neuroscience_clusters_1999-2023.csv
    │   └── neuroscience_dimensions_1999-2023.csv
    ├── Graphs
    │   ├── cluster_citation_density.graphml
    │   └── article_similarity.graphml
    ├── HDF5
    │   ├── DomainEmbeddings
    │   │   └── 2037 shard_#SHARD_ID.h5 files containing 200 articles
    │   └── VoyageAIEmbeddings
    │       ├── Large_02_Instruct
    │       │   └── 2037 shard_#SHARD_ID.h5 files containing 200 articles
    │       └── Lite_02_Instruct
    │           └── 2037 shard_#SHARD_ID.h5 files containing 200 articles
    └── Models
        ├── discipline_classification_model.pth
        └── domain_embedding_model.pth
The Code folder contains minimal example code to help users get started with the dataset. It includes:
These examples provide a simple foundation for working with the dataset. More advanced analysis and demonstrations are covered in the accompanying publication.
neuroscience_articles_1999-2023.csv
This file contains metadata on neuroscientific articles from 1999 to 2023.
Review or Research).
neuroscience_clusters_1999-2023.csv).
neuroscience_clusters_1999-2023.csv
Clusters of related articles based on research themes.
neuroscience_dimensions_1999-2023.csv
Provides various research dimensions assessed for each cluster. Each dimension comes with specific binarized categories.
The HDF5 directory contains two sets of embeddings for the abstracts of articles. All folders contain 2037 HDF5 shard files, each holding about 200 articles (using a custom-defined article file type).
Please note that abstracts of articles in the subfolders of HDF5/VoyageAIEmbeddings have been embedded using Voyage AI's voyage-lite-02-instruct and voyage-large-02-instruct models, respectively. Those in the folder HDF5/DomainEmbeddings are voyage-large-02-instruct embeddings that have subsequently been transformed into a domain-specific, lower-dimensional embedding using a custom neural network (domain_embedding_model.pth).
article_similarity.graphml
A graph representation of article similarity based on cosine similarity between abstract embeddings (using the domain-specific embedding resulting from domain_embedding_model.pth).
pmid (PubMed ID) as an attribute.
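The similarity measure behind the article graph can be sketched in plain Python. This illustrates cosine similarity only: the description does not say how edges were selected, so the `threshold` parameter, the function `similarity_edges`, and the toy two-dimensional vectors keyed by made-up identifiers are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_edges(
    embeddings: Dict[str, Sequence[float]], threshold: float
) -> List[Tuple[str, str, float]]:
    """Return (id_a, id_b, similarity) for each pair at or above the threshold."""
    ids = sorted(embeddings)
    edges = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            score = cosine_similarity(embeddings[first], embeddings[second])
            if score >= threshold:
                edges.append((first, second, score))
    return edges


# Toy embeddings keyed by made-up identifiers.
toy = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [0.0, 1.0]}
edges = similarity_edges(toy, threshold=0.9)
print([(first, second) for first, second, _ in edges])  # prints [('a', 'b')]
```

The all-pairs loop is quadratic in the number of articles, so a corpus of this size would need a vectorised or approximate nearest-neighbour approach in practice.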
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PubMed (search date: 24/10/2014) | Search query: "retracted publication"[Publication Type] - Filter: systematic reviews | 48 results. Google spreadsheet at the URL below.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Protein-Protein, Genetic, and Chemical Interactions for Li J (2024): Interaction Between SARS-CoV-2 Spike Protein S1 Subunit and Oyster Heat Shock Protein 70. curated by BioGRID (https://thebiogrid.org); ABSTRACT: There is growing evidence that severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) contaminates the marine environment and is bioaccumulated in filter-feeding shellfish. Previous study shows the Pacific oyster tissues can bioaccumulate the SARS-CoV-2, and the oyster heat shock protein 70 (oHSP70) may play as the primary attachment receptor to bind SARS-CoV-2's recombinant spike protein S1 subunit (rS1). However, detailed information about the interaction between rS1 and oHSP70 is still unknown. In this study, we confirmed that the affinity of recombinant oHSP70 (roHSP70) for rS1 (KD = 20.4 nM) is comparable to the receptor-binding affinity of rACE2 for rS1 (KD = 16.7 nM) by surface plasmon resonance (SPR)-based Biacore and further validated by enzyme-linked immunosorbent assay (ELISA). Three truncated proteins (roHSP70-N/C/M) and five mutated proteins (p.I229del, p.D457del, p.V491_K495del, p.K556I, and p.?roHSP70) were constructed according to the molecular docking results. All three truncated proteins have significantly lower affinity for rS1 than the full-length roHSP70, indicating that all three segments of roHSP70 are involved in binding to rS1. Further, the results of SPR and ELISA showed that all five mutant proteins had significantly lower affinity for rS1 than roHSP70, suggesting that amino acids at these sites are involved in binding to rS1. This study provides a preliminary theoretical basis for the bioaccumulation of SARS-CoV-2 in oyster tissues or using roHSP70 as the capture unit to selectively enrich virus particles for detection.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Protein-Protein, Genetic, and Chemical Interactions for Hanck DA (2009): Using lidocaine and benzocaine to link sodium channel molecular conformations to state-dependent antiarrhythmic drug affinity. curated by BioGRID (https://thebiogrid.org); ABSTRACT: Lidocaine and other antiarrhythmic drugs bind in the inner pore of voltage-gated Na channels and affect gating use-dependently. A phenylalanine in domain IV, S6 (Phe1759 in Na(V)1.5), modeled to face the inner pore just below the selectivity filter, is critical in use-dependent drug block. Measurement of gating currents and concentration-dependent availability curves to determine the role of Phe1759 in coupling of drug binding to the gating changes. The measurements showed that replacement of Phe1759 with a nonaromatic residue permits clear separation of action of lidocaine and benzocaine into 2 components that can be related to channel conformations. One component represents the drug acting as a voltage-independent, low-affinity blocker of closed channels (designated as lipophilic block), and the second represents high-affinity, voltage-dependent block of open/inactivated channels linked to stabilization of the S4s in domains III and IV (designated as voltage-sensor inhibition) by Phe1759. A homology model for how lidocaine and benzocaine bind in the closed and open/inactivated channel conformation is proposed. These 2 components, lipophilic block and voltage-sensor inhibition, can explain the differences in estimates between tonic and open-state/inactivated-state affinities, and they identify how differences in affinity for the 2 binding conformations can control use-dependence, the hallmark of successful antiarrhythmic drugs.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 1: EntrezGene official symbols with PubMed abstracts and their aliases classified by the algorithm. Description of data: 73 randomly chosen official gene symbols that produced text corpora of PubMed abstracts and their aliases. Aliases were classified by the algorithm as "synonyms", "ambiguous", "aliases with PubMed abstract but not passing the filters", or "aliases without PubMed abstracts". (XLS 42 KB)