54 datasets found
  1. Data from: Implementing the MSFragger Search Engine as a Node in Proteome Discoverer

    • acs.figshare.com
    zip
    Updated Jun 4, 2023
    Cite
    Hui-Yin Chang; Sarah E. Haynes; Fengchao Yu; Alexey I. Nesvizhskii (2023). Implementing the MSFragger Search Engine as a Node in Proteome Discoverer [Dataset]. http://doi.org/10.1021/acs.jproteome.2c00485.s002
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    ACS Publications
    Authors
    Hui-Yin Chang; Sarah E. Haynes; Fengchao Yu; Alexey I. Nesvizhskii
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Here, we describe the implementation of the fast proteomics search engine MSFragger as a processing node in the widely used Proteome Discoverer (PD) software platform. PeptideProphet (via the Philosopher tool kit) is also implemented as an additional PD node to allow validation of MSFragger open (mass-tolerant) search results. These two nodes, along with the existing Percolator validation module, allow users to employ different search strategies and conveniently inspect search results through PD. Our results have demonstrated the improved numbers of PSMs, peptides, and proteins identified by MSFragger coupled with Percolator and significantly faster search speed compared to the conventional SEQUEST/Percolator PD workflows. The MSFragger-PD node is available at https://github.com/nesvilab/PD-Nodes/releases/.

  2. Data from: QuerTCI: A Tool Integrating GitHub Issue Querying with Comment Classification

    • data.niaid.nih.gov
    Updated Feb 21, 2022
    Cite
    Ye Paing; Tatiana Castro Vélez; Raffi Khatchadourian (2022). QuerTCI: A Tool Integrating GitHub Issue Querying with Comment Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6115403
    Explore at:
    Dataset updated
    Feb 21, 2022
    Dataset provided by
    City University of New York (CUNY) Graduate Center
    City University of New York (CUNY) Hunter College
    Authors
    Ye Paing; Tatiana Castro Vélez; Raffi Khatchadourian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Issue tracking systems enable users and developers to comment on problems plaguing a software system. Empirical Software Engineering (ESE) researchers study (open-source) project issues and the comments and threads within to discover---among others---challenges developers face when, e.g., incorporating new technologies, platforms, and programming language constructs. However, issue discussion threads accumulate over time and thus can become unwieldy, hindering any insight that researchers may gain. While existing approaches alleviate this burden by classifying issue thread comments, there is a gap between searching popular open-source software repositories (e.g., those on GitHub) for issues containing particular keywords and feeding the results into a classification model. In this paper, we demonstrate a research infrastructure tool called QuerTCI that bridges this gap by integrating the GitHub issue comment search API with the classification models found in existing approaches. Using queries, ESE researchers can retrieve GitHub issues containing particular keywords, e.g., those related to a certain programming language construct, and subsequently classify the kinds of discussions occurring in those issues. Using our tool, our hope is that ESE researchers can uncover challenges related to particular technologies using certain keywords through popular open-source repositories more seamlessly than previously possible. A tool demonstration video may be found at: https://youtu.be/fADKSxn0QUk.

  3. awesome archive query log

    • kaggle.com
    zip
    Updated May 30, 2024
    Cite
    Federico Minutoli (2024). awesome archive query log [Dataset]. https://www.kaggle.com/datasets/federicominutoli/awesome-archive-query-log
    Explore at:
    Available download formats: zip (1316195 bytes)
    Dataset updated
    May 30, 2024
    Authors
    Federico Minutoli
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Manually curated multilingual search results from more than 25 search engines.

    Search results are taken from the Archive Query Log (AQL)'s manually curated examples, excluding empty results.

    For more information on the AQL and building your own dataset, see the official repository, website or paper.

    The dataset is intended as a multilingual resource for training semantic models for efficient deep-web scraping: given a natural-language query and a set of possible search results, can a model identify the most relevant subset to expand the search to?

    The dataset contains a JSON file per web search with the following fields:

    - query: original query in natural language
    - interpreted query: query the search engine searched for
    - timestamp: UTC time the query was searched for
    - url: search engine URL for the query
    - results: possible search results

    Each search result contains the following fields:

    - rank: search result relevance (lower is better)
    - snippet: content snippet from the search result
    - title: search result title
    - url: search result URL
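
    To make the layout concrete, here is a minimal Python sketch for consuming one of these files; the file name is hypothetical, and the field names are taken from the lists above:

        import json

        # Hypothetical file name; the dataset ships one JSON file per web search.
        with open("search_0001.json") as f:
            search = json.load(f)

        print(search["query"], "->", search["url"])

        # Results are ranked by relevance; a lower rank is better.
        for result in sorted(search["results"], key=lambda r: r["rank"]):
            print(result["rank"], result["title"], result["url"])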

  4. Data from: Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions

    • data.niaid.nih.gov
    Updated Mar 1, 2023
    Cite
    Haak, Fabian; Schaer, Philipp (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7682914
    Explore at:
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    Technische Hochschule Köln
    Authors
    Haak, Fabian; Schaer, Philipp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

    Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci'23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

    Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

    The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news roundups feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label (left, right, or neutral) by four expert annotators based on the expressed political partisanship. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

    To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated, more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable reproduction of the results of our study.

    Dataset 2: Search Query Suggestions (suggestions.csv)

    The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.

    The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their positions as returned by the search engine at the given time of search ("datetime"). We scraped our data from a US server; the server location is saved in "location".

    We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
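
    As a rough illustration, a minimal pandas sketch for exploring suggestions.csv, assuming the column names described above (the aggregation is only an example):

        import pandas as pd

        # Columns per the description: root_term, query_input, search_engine,
        # query_suggestion, rank, datetime, location.
        suggestions = pd.read_csv("suggestions.csv")

        # Count suggestions per root term and search engine.
        per_engine = (
            suggestions
            .groupby(["root_term", "search_engine"])["query_suggestion"]
            .count()
            .unstack(fill_value=0)
        )
        print(per_engine.head())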

    AllSides Scraper

    At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles from the AllSides balanced news headlines.

    We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.

  5. Data from: A dataset of GitHub Actions workflow histories

    • data.niaid.nih.gov
    Updated Oct 25, 2024
    + more versions
    Cite
    Cardoen, Guillaume (2024). A dataset of GitHub Actions workflow histories [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10259013
    Explore at:
    Dataset updated
    Oct 25, 2024
    Dataset provided by
    University of Mons
    Authors
    Cardoen, Guillaume
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories" published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)

    Important notice: Zenodo appears to compress gzipped files a second time without notice, so they are "double compressed": downloaded files are named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.
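
    Until this is fixed on Zenodo's side, a downloaded file can simply be decompressed twice; a minimal Python sketch, using workflows.csv.gz as an example:

        import gzip
        import shutil

        # The download arrives double compressed as workflows.csv.gz.gz.
        # The first pass recovers the original workflows.csv.gz
        # (the file the provided MD5 refers to).
        with gzip.open("workflows.csv.gz.gz", "rb") as src, open("workflows.csv.gz", "wb") as dst:
            shutil.copyfileobj(src, dst)

        # The second pass yields the plain CSV.
        with gzip.open("workflows.csv.gz", "rb") as src, open("workflows.csv", "wb") as dst:
            shutil.copyfileobj(src, dst)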

    2024-10-25 update: updated the repositories list and observation period. The filters relying on dates were also updated.

    2024-07-09 update: fixed an occasionally invalid valid_yaml flag.

    The dataset was created as follows:

    First, we used GitHub SEART (on October 7th, 2024) to get a list of every non-fork repository created before January 1st, 2024, having at least 300 commits and at least 100 stars, where at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)

    We checked if a folder .github/workflows existed. We filtered out the repositories that did not contain this folder and pulled the others (between the 9th and 10th of October 2024).

    We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)

    We concatenated every file in /ourDataFolder/output into a CSV (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.

    We added the column uid via a script available on GitHub.

    Finally, we archived the folder /ourDataFolder/workflows with pigz (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows).

    Using the extracted data, the following files were created:

    workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.

    workflows_auxiliaries.tar.gz is a similar file that also contains auxiliary files.

    workflows.csv.gz contains the metadata for the extracted workflow files.

    workflows_auxiliaries.csv.gz is a similar file that also contains metadata for auxiliary files.

    repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool.

    The metadata is separated in different columns:

    repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" distinguishes the author from the repository name

    commit_hash: The commit hash returned by git

    author_name: The name of the author that changed this file

    author_email: The email of the author that changed this file

    committer_name: The name of the committer

    committer_email: The email of the committer

    committed_date: The committed date of the commit

    authored_date: The authored date of the commit

    file_path: The path to this file in the repository

    previous_file_path: The path to this file before it has been touched

    file_hash: The name of the related workflow file in the dataset

    previous_file_hash: The name of the related workflow file in the dataset, before it has been touched

    git_change_type: A single letter (A, D, M, or R) representing the type of change made to the workflow (Added, Deleted, Modified, or Renamed). This letter is given by gitpython and provided as is.

    valid_yaml: A boolean indicating if the file is a valid YAML file.

    probably_workflow: A boolean indicating whether the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)

    valid_workflow: A boolean indicating whether the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.

    uid: Unique identifier for a given file surviving modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renames do not change the identifier.

    Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
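
    A minimal pandas sketch for exploring this metadata, assuming the columns and boolean flags described above:

        import pandas as pd

        # Read the workflow-history metadata (columns as listed above).
        meta = pd.read_csv("workflows.csv.gz")

        # Keep only syntactically valid GitHub Actions workflow files and split
        # "author/repository" on the first "/".
        valid = meta[meta["valid_workflow"]].copy()
        valid[["author", "repo_name"]] = valid["repository"].str.split("/", n=1, expand=True)

        # Follow one file's history across modifications and renames via its uid.
        uid = valid["uid"].iloc[0]
        history = valid[valid["uid"] == uid].sort_values("committed_date")
        print(history[["commit_hash", "git_change_type", "file_path"]])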

  6. Meal Plan Search

    • kaggle.com
    zip
    Updated Apr 2, 2019
    Cite
    Shawn (2019). Meal Plan Search [Dataset]. https://www.kaggle.com/shawnwolfe/meal-plan-search
    Explore at:
    Available download formats: zip (23142716 bytes)
    Dataset updated
    Apr 2, 2019
    Authors
    Shawn
    License

    U.S. Government Works: https://www.usa.gov/government-works/

    Description

    Data from a meal plan user study, as described in the paper "Item Retrieval As Utility Estimation". For more details, please see (https://github.com/yizuc/meal-plan).

  7. Office of Head Start (OHS) Head Start Center Locations Search Tool - 4uua-wna9 - Archive Repository

    • healthdata.gov
    csv, xlsx, xml
    Updated Apr 4, 2025
    Cite
    (2025). Office of Head Start (OHS) Head Start Center Locations Search Tool - 4uua-wna9 - Archive Repository [Dataset]. https://healthdata.gov/w/3nre-t4n5/default?cur=Ygi6KL-gVUR
    Explore at:
    Available download formats: csv, xlsx, xml
    Dataset updated
    Apr 4, 2025
    Description

    This dataset tracks the updates made on the dataset "Office of Head Start (OHS) Head Start Center Locations Search Tool" as a repository for previous versions of the data and metadata.

  8. PIA: An Intuitive Protein Inference Engine with a Web-Based User Interface

    • acs.figshare.com
    xlsx
    Updated Jun 3, 2023
    Cite
    Julian Uszkoreit; Alexandra Maerkens; Yasset Perez-Riverol; Helmut E. Meyer; Katrin Marcus; Christian Stephan; Oliver Kohlbacher; Martin Eisenacher (2023). PIA: An Intuitive Protein Inference Engine with a Web-Based User Interface [Dataset]. http://doi.org/10.1021/acs.jproteome.5b00121.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Julian Uszkoreit; Alexandra Maerkens; Yasset Perez-Riverol; Helmut E. Meyer; Katrin Marcus; Christian Stephan; Oliver Kohlbacher; Martin Eisenacher
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Protein inference connects the peptide spectrum matches (PSMs) obtained from database search engines back to proteins, which are typically at the heart of most proteomics studies. Different search engines yield different PSMs and thus different protein lists. Analysis of results from one or multiple search engines is often hampered by different data exchange formats and lack of convenient and intuitive user interfaces. We present PIA, a flexible software suite for combining PSMs from different search engine runs and turning these into consistent results. PIA can be integrated into proteomics data analysis workflows in several ways. A user-friendly graphical user interface can be run either locally or (e.g., for larger core facilities) from a central server. For automated data processing, stand-alone tools are available. PIA implements several established protein inference algorithms and can combine results from different search engines seamlessly. On several benchmark data sets, we show that PIA can identify a larger number of proteins at the same protein FDR when compared to that using inference based on a single search engine. PIA supports the majority of established search engines and data in the mzIdentML standard format. It is implemented in Java and freely available at https://github.com/mpc-bioinformatics/pia.

  9. Ted Talks Transcript

    • kaggle.com
    zip
    Updated Jan 27, 2018
    Cite
    Wei (2018). Ted Talks Transcript [Dataset]. https://www.kaggle.com/goweiting/ted-talks-transcript
    Explore at:
    Available download formats: zip (577954747 bytes)
    Dataset updated
    Jan 27, 2018
    Authors
    Wei
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    An extension to Kaggle's TED dataset

    Using transcripts provided by TED.com, this dataset combines YouTube metadata (at the time of scraping) with metadata from Kaggle's TED dataset.

    This extension provides not just additional metadata from YouTube, but also transcripts of the same videos in different languages (e.g. Portuguese, French, Arabic, Chinese, Japanese, Korean, Turkish, Dutch...). In total, 111 different languages are available (most videos do not have transcripts in all languages).

    Content

    Each of the 111 language files in tedDirector.zip is a CSV file with the following headers:

    1. videoID - YouTube IDs

    2. lang - Language code

    3. title - Title of the TED Talk

    4. transcript - Transcript of the TED Talk in lang
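
    A minimal pandas sketch for reading one language file, assuming the files inside tedDirector.zip are named by language code:

        import pandas as pd

        # Hypothetical file name for the French transcripts; headers per the
        # list above: videoID, lang, title, transcript.
        fr = pd.read_csv("fr.csv")

        # Look up one talk's transcript by its YouTube ID.
        talk = fr[fr["videoID"] == fr["videoID"].iloc[0]]
        print(talk["title"].iloc[0])
        print(talk["transcript"].iloc[0][:200])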

    Acknowledgements

    1. This dataset was developed as part of a larger dataset used for an information retrieval assignment. In that assignment, my team and I used TED talks to evaluate different configurations of search-engine algorithms. We also used different languages for the search and retrieval task, to test the reliability of our search engine. More information can be found in our GitHub repository. The dataset was downloaded from TedTalksDirector using youtube-dl.

    2. Code for downloading can be found in the IR project.

    3. More about language codes can be found at w3schools.com.

    Inspiration

    Some of the problems experienced while preparing this dataset: 1. How can we improve the matching of the YouTube dataset to the data scraped from TED talks?

  10. Data from: metLinkR: Facilitating Metaanalysis of Human Metabolomics Data through Automated Linking of Metabolite Identifiers

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Apr 4, 2025
    Cite
    Andrew Patt; Iris Pang; Fred Lee; Chiraag Gohel; Eoin Fahy; Vicki Stevens; David Ruggieri; Steven C. Moore; Ewy A. Mathé (2025). metLinkR: Facilitating Metaanalysis of Human Metabolomics Data through Automated Linking of Metabolite Identifiers [Dataset]. http://doi.org/10.1021/acs.jproteome.4c01051.s003
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 4, 2025
    Dataset provided by
    ACS Publications
    Authors
    Andrew Patt; Iris Pang; Fred Lee; Chiraag Gohel; Eoin Fahy; Vicki Stevens; David Ruggieri; Steven C. Moore; Ewy A. Mathé
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Metabolites are referenced in spectral, structural and pathway databases with a diverse array of schemas, including various internal database identifiers and large tables of common name synonyms. Cross-linking metabolite identifiers is a required step for meta-analysis of metabolomic results across studies but is made difficult by the lack of a consensus identifier system. We have implemented metLinkR, an R package that leverages RefMet and RaMP-DB to automate and simplify cross-linking metabolite identifiers across studies and generating common names. metLinkR accepts as input metabolite common names and identifiers from five different databases (HMDB, KEGG, ChEBI, LIPIDMAPS and PubChem) to exhaustively search for possible overlap in supplied metabolites from input data sets. In an example of 13 metabolomic data sets totaling 10,400 metabolites, metLinkR identified and provided common names for 1377 metabolites in common between at least 2 data sets in less than 18 min and produced standardized names for 74.4% of the input metabolites. In another example comprising five data sets with 3512 metabolites, metLinkR identified 715 metabolites in common between at least two data sets in under 12 min and produced standardized names for 82.3% of the input metabolites. Outputs of metLinkR include tables and metrics allowing users to readily double-check the mappings and to get an overview of the chemical classes represented. Overall, metLinkR provides a streamlined solution for a common task in metabolomic epidemiology and other fields that meta-analyze metabolomic data. The R package, vignette and source code are freely downloadable at https://github.com/ncats/metLinkR.

  11. nerd-knowledge-api

    • huggingface.co
    Updated Oct 31, 2025
    Cite
    NerdOptimize (2025). nerd-knowledge-api [Dataset]. https://huggingface.co/datasets/NerdOptimize/nerd-knowledge-api
    Explore at:
    Dataset updated
    Oct 31, 2025
    Dataset authored and provided by
    NerdOptimize
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    NerdOptimize Dataset (v1.0.0)

    English dataset for SEO (Data-Driven) and AI Search / AEO by NerdOptimize (Bangkok, TH). Built for GitHub, Hugging Face, and on-site deployment, so LLMs can learn/cite the brand.

      Structure
    

    * data/*.json → core machine-readable data (ICPs, services, case studies, frameworks, articles, labels, metadata, processing steps)
    * server.js / openapi.json → tiny Express API to serve the dataset
    * schema-dataset.jsonld → Dataset JSON-LD for Google Dataset… See the full description on the dataset page: https://huggingface.co/datasets/NerdOptimize/nerd-knowledge-api.

  12. Replication Code and Data for: Suppressing the Search Engine Manipulation Effect (SEME)

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Mar 29, 2021
    Cite
    Robert Epstein; Ronald E. Robertson; David Lazer; Christo Wilson (2021). Replication Code and Data for: Suppressing the Search Engine Manipulation Effect (SEME) [Dataset]. http://doi.org/10.7910/DVN/DZYKFO
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 29, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Robert Epstein; Ronald E. Robertson; David Lazer; Christo Wilson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    R replication code and data for: Suppressing the Search Engine Manipulation Effect (SEME), a follow-up article to a 2015 article that presented the discovery of SEME. Paper to appear in the Proceedings of the ACM: Human Computer Interaction. See the README file in this repository for replication details. Note: Dataverse flattened the directory structure of this repo. A better version (structurally) is available here: https://github.com/gitronald/bias-alerts

  13. Basic Local Alignment Search Tool (BLAST) - 7szi-q9wh - Archive Repository

    • healthdata.gov
    csv, xlsx, xml
    Updated Jul 18, 2025
    Cite
    (2025). Basic Local Alignment Search Tool (BLAST) - 7szi-q9wh - Archive Repository [Dataset]. https://healthdata.gov/dataset/Basic-Local-Alignment-Search-Tool-BLAST-7szi-q9wh-/8r5d-ibg9
    Explore at:
    Available download formats: csv, xml, xlsx
    Dataset updated
    Jul 18, 2025
    Description

    This dataset tracks the updates made on the dataset "Basic Local Alignment Search Tool (BLAST)" as a repository for previous versions of the data and metadata.

  14. Self-citation analysis data based on PubMed Central subset (2002-2005)

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Cite
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik, Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
    Explore at:
    Authors
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    U.S. National Science Foundation (NSF)
    Description

    Self-citation analysis data based on PubMed Central subset (2002-2005)
    ----------------------------------------------------------------------

    Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018.

    ## Introduction

    This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:

    * Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
    * Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
    * Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
    * Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
    * COLUMNS_DESC.txt - Descriptions of all columns
    * model_text_files.tar.gz - Text files containing model coefficients and scores for model selection
    * results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes; v4.reviewer contains models for analysis done after reviewer comments
    * README.txt

    ## Dataset creation

    Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should personally request this data from Clarivate Analytics. However, we do provide a similar but open dataset based on citations from PubMed Central, which can be utilized to get results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of these datasets, please make sure you cite both the dataset and the paper introducing it.

    * MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    * Citation data from PubMed Central (the original paper includes additional citations from Web of Science)
    * Author-ity 2009 dataset:
      - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
      - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304
      - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105
    * Genni 2.0 + Ethnea for identifying author gender and ethnicity:
      - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
      - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
      - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
    * MapAffil for identifying article country of affiliation:
      - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
      - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
    * IMPLICIT journal similarity:
      - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
    * Novelty dataset for identifying article-level novelty:
      - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
      - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
      - Code: https://github.com/napsternxg/Novelty
    * Expertise dataset for identifying author expertise on articles
    * Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

    Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information on getting PubMed/MEDLINE and NLM's data Terms and Conditions. Additional data-related updates can be found at the Torvik Research Group.

    ## Acknowledgments

    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    ## License

    Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
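
    A minimal pandas sketch for loading one of the training files listed in the Introduction, assuming the shared header file is a single tab-separated line:

        import pandas as pd

        # The column names ship separately from the tab-separated data files.
        with open("Training_data_2002_2005_pmc_pair_txt.header.txt") as fh:
            columns = fh.readline().rstrip("\n").split("\t")

        # The data files are large (~1.2G), so read a manageable sample first.
        first_authors = pd.read_csv(
            "Training_data_2002_2005_pmc_pair_First.txt",
            sep="\t",
            names=columns,
            nrows=100_000,
        )
        print(first_authors.shape)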

  15. MMSearch

    • huggingface.co
    Updated Sep 25, 2024
    Cite
    Dongzhi Jiang (2024). MMSearch [Dataset]. https://huggingface.co/datasets/CaraJ/MMSearch
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 25, 2024
    Authors
    Dongzhi Jiang
    Description

    MMSearch 🔥: Benchmarking the Potential of Large Models as Multi-modal Search Engines

    Official repository for the paper "MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines". 🌟 For more details, please refer to the project page with dataset exploration and visualization tools: https://mmsearch.github.io/. [🌐 Webpage] [📖 Paper] [🤗 Huggingface Dataset] [🏆 Leaderboard] [🔍 Visualization]

      💥 News
    

    [2024.09.25] 🌟 The evaluation code now… See the full description on the dataset page: https://huggingface.co/datasets/CaraJ/MMSearch.

  16. Data from: Semisupervised Machine Learning for Sensitive Open Modification Spectral Library Searching

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 1, 2023
    Cite
    Issar Arab; William E. Fondrie; Kris Laukens; Wout Bittremieux (2023). Semisupervised Machine Learning for Sensitive Open Modification Spectral Library Searching [Dataset]. http://doi.org/10.1021/acs.jproteome.2c00616.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Issar Arab; William E. Fondrie; Kris Laukens; Wout Bittremieux
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    A key analysis task in mass spectrometry proteomics is matching the acquired tandem mass spectra to their originating peptides by sequence database searching or spectral library searching. Machine learning is an increasingly popular postprocessing approach to maximize the number of confident spectrum identifications that can be obtained at a given false discovery rate threshold. Here, we have integrated semisupervised machine learning in the ANN-SoLo tool, an efficient spectral library search engine that is optimized for open modification searching to identify peptides with any type of post-translational modification. We show that machine learning rescoring boosts the number of spectra that can be identified for both standard searching and open searching, and we provide insights into relevant spectrum characteristics harnessed by the machine learning model. The semisupervised machine learning functionality has now been fully integrated into ANN-SoLo, which is available as open source under the permissive Apache 2.0 license on GitHub at https://github.com/bittremieux/ANN-SoLo.

  17. Data from: UniSpec: Deep Learning for Predicting the Full Range of Peptide Fragment Ion Series to Enhance the Proteomics Data Analysis Workflow

    • acs.figshare.com
    xlsx
    Updated Feb 8, 2024
    Cite
    Joel Lapin; Xinjian Yan; Qian Dong (2024). UniSpec: Deep Learning for Predicting the Full Range of Peptide Fragment Ion Series to Enhance the Proteomics Data Analysis Workflow [Dataset]. http://doi.org/10.1021/acs.analchem.3c02321.s003
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    ACS Publications
    Authors
    Joel Lapin; Xinjian Yan; Qian Dong
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    We present UniSpec, an attention-driven deep neural network designed to predict comprehensive collision-induced fragmentation spectra, thereby improving peptide identification in shotgun proteomics. Utilizing a training data set of 1.8 million unique high-quality tandem mass spectra (MS2) from 0.8 million unique peptide ions, UniSpec learned with a peptide fragmentation dictionary encompassing 7919 fragment peaks. Among these, 5712 are neutral loss peaks, with 2310 corresponding to modification-specific neutral losses. Remarkably, UniSpec can predict 73%–77% of fragment intensities based on our NIST reference library spectra, a significant leap from the 35%–45% coverage of only b and y ions. Comparative studies with Prosit elucidate that while both models are strong at predicting their respective fragment ion series, UniSpec particularly shines in generating more complex MS2 spectra with diverse ion annotations. The integration of UniSpec’s predictions into shotgun proteomics data analysis boosts the identification rate of tryptic peptides by 48% at a 1% false discovery rate (FDR) and 60% at a more confident 0.1% FDR. Using UniSpec’s predicted in-silico spectral library, the search results closely matched those from search engines and experimental spectral libraries used in peptide identification, highlighting its potential as a stand-alone identification tool. The source code and Python scripts are available on GitHub (https://github.com/usnistgov/UniSpec) and Zenodo (https://zenodo.org/records/10452792), and all data sets and analysis results generated in this work were deposited in Zenodo (https://zenodo.org/records/10052268).

  18. Office of Head Start (OHS) Head Start Center Locations Search Tool

    • healthdata.gov
    • data.virginia.gov
    • +2 more
    csv, xlsx, xml
    Updated Feb 13, 2021
    Cite
    (2021). Office of Head Start (OHS) Head Start Center Locations Search Tool [Dataset]. https://healthdata.gov/widgets/4uua-wna9?mobile_redirect=true
    Explore at:
    Available download formats: xml, csv, xlsx
    Dataset updated
    Feb 13, 2021
    Description

    Office of Head Start (OHS) web-based search tool for finding Head Start program office contact information. Searchable by location, grant number, or center type. Results are downloadable in CSV format.

  19. Vector Alignment Search Tool (VAST) - i6s4-dz8n - Archive Repository

    • healthdata.gov
    csv, xlsx, xml
    Updated Jul 16, 2025
    Cite
    (2025). Vector Alignment Search Tool (VAST) - i6s4-dz8n - Archive Repository [Dataset]. https://healthdata.gov/dataset/Vector-Alignment-Search-Tool-VAST-i6s4-dz8n-Archiv/mfvc-rbfh
    Explore at:
    Available download formats: csv, xlsx, xml
    Dataset updated
    Jul 16, 2025
    Description

    This dataset tracks the updates made on the dataset "Vector Alignment Search Tool (VAST)" as a repository for previous versions of the data and metadata.

  20. KNMI-LENTIS large ensemble time slice dataset description

    • nde-dev.biothings.io
    Updated Sep 29, 2023
    Cite
    Bintanja, Richard (2023). KNMI-LENTIS large ensemble time slice dataset description [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_7573136
    Explore at:
    Dataset updated
    Sep 29, 2023
    Dataset provided by
    Bintanja, Richard
    Muntjewerf, Laura
    Reerink, Thomas
    Van der Wiel, Karin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Contents

    * Available variables in KNMI-LENTIS: request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt
    * Where the data is deposited on ECMWF's tape storage (section 4): LENTIS_on_ECFS.zip
    * Data of all variables for 1 year for 1 ensemble member (section 5): tree_of_files_one_member_all_data.txt and {AERmon,Amon,Emon,LImon,Lmon,Ofx,Omon,SImon,fx,Eday,Oday,day,CFday,3hr,6hrPlev,6hrPlevPt}.zip

    2. Description of this Zenodo dataset

    This Zenodo dataset pertains to the full KNMI-LENTIS dataset: a large ensemble of simulations with the global climate model EC-Earth3. The simulations cover a present-day period (2000-2009) and a future +2K period (2075-2084, following SSP2-4.5). KNMI-LENTIS has 1600 simulated years for each of the two climates. This level of sampled climate variability allows for robust and in-depth research into extreme events. The available variables are listed in the file request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt. All variables are cmorised following the CMIP6 data format convention. Further details on the variables and their output dimensions are available via the following search tool. The total size of KNMI-LENTIS is 128 TB. KNMI-LENTIS is stored at the high performance storage system of the ECMWF (ECFS).

    The global climate model used to generate this large ensemble is EC-Earth3, VAREX project branch: https://svn.ec-earth.org/ecearth3/branches/projects/varex (access restricted to ECMWF members).

    The goals of this Zenodo dataset are:

    to provide an accurate description and example of how the KNMI-LENTIS dataset is organised.

    to describe on which servers the data are deposited and how future users can gain access to the data

    to provide links to related git repositories and other content relating to the KNMI-LENTIS production

    3. How KNMI-LENTIS is organised

    KNMI-LENTIS consists of 2 times 160 runs of 10 years. All simulations have a unique ensemble member label that reflects the forcing, and how the initial conditions are generated. The initial conditions have two aspects: the parent simulation from which the run is branched (macro perturbation, there are 16), and the seed relating to a particular micro-perturbation in the initial three-dimensional atmosphere temperature field (there are 10). The ensemble member label thus is a combination of:

    forcing (h for present-day/historical and s for +2K/SSP2-4.5)

    parent ID (number between 1 and 16)

    micro perturbation ID (number between 0 and 9)

    In this Zenodo dataset we publish 1 year of data from 1 member to give insight into the type of data and metadata that is representative of the full KNMI-LENTIS dataset. The published data is year 2000 from member h010; see section 5.

    Further, all KNMI-LENTIS simulations are labelled per the CMIP6 convention of variant labelling. A variant label is made from four components: the realization index r, the initialization index i, the physics index p and the forcing index f. Further details on CMIP6 variant labelling can be found in The CMIP6 Participation Guidance for Modelers. In the KNMI-LENTIS data set, the forcing is reflected in the first digit of the realization index r of the variant label. For the historical simulations, the one thousands (r1000-r1999) have been reserved. For the SSP2-4.5 simulations, the five thousands (r5000-r5999) have been reserved. The parent is reflected in the second and third digits of the realization index r of the variant label (r?01?-r?16?). The seed is reflected in the fourth digit of the realization index r (r???0-r???9). The seed is also reflected in the initialization index i of the variant label (i0-i9), so this is duplicated information. The physics index p5 has been reserved for the ECE3p5 version: all KNMI-LENTIS simulations have the p5 label. The forcing index f of the variant label is kept at 1 for all KNMI-LENTIS simulations. As an example, variant label r5119i9p5f1 refers to: the +2K time slice with parent 11 and randomizing seed number 9. The physics index is 5, meaning the run is done with the ECE3p5 version of EC-Earth3.
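
    The labelling convention can be decoded mechanically; a small Python sketch (purely illustrative, not part of the dataset):

        import re

        def parse_variant_label(label: str) -> dict:
            """Decode a KNMI-LENTIS variant label per the convention above."""
            m = re.fullmatch(r"r(\d)(\d{2})(\d)i(\d)p(\d+)f(\d+)", label)
            if m is None:
                raise ValueError(f"not a KNMI-LENTIS variant label: {label}")
            thousands, parent, seed, _init, physics, _forcing = m.groups()
            return {
                "forcing": "present-day/historical" if thousands == "1" else "+2K/SSP2-4.5",
                "parent": int(parent),    # 01-16
                "seed": int(seed),        # 0-9, duplicated in the i index
                "physics": int(physics),  # always 5: the ECE3p5 model version
            }

        print(parse_variant_label("r5119i9p5f1"))
        # {'forcing': '+2K/SSP2-4.5', 'parent': 11, 'seed': 9, 'physics': 5}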

    4. Where is the data deposited on ECMWF's tape storage

    In this Zenodo folder, there are several text files and several netcdf files. The text files provide details on where the KNMI-LENTIS files are stored on ECFS and how they are organised.

    Data from KNMI-LENTIS is deposited in the ECMWF ECFS tape storage system. Data can be freely downloaded by those who have access to the ECMWF ECFS. Otherwise, the data can be made available by the authors upon request.

    The way the dataset is organised is detailed in LENTIS_on_ECFS.zip. This archive contains details on all available KNMI-LENTIS files, in particular on how these are filed in ECFS. The files on ECFS are tar-zipped per ensemble member and variable: each archive contains 10 years of ensemble member data (10 separate netcdf files). The location on ECFS of the tar-zipped files listed in the various text files in this Zenodo dataset is

    ec:/nklm/LENTIS/ec-earth/cmorised_by_var/

        #!/bin/bash
        # -------------------
        # script to write out LENTIS details on ECFS
        # -------------------
        for freq in AERmon Amon Emon LImon Lmon Ofx Omon SImon fx Eday Oday day CFday 3hr 6hrPlev 6hrPlevPt; do
          for scen in hxxx sxxx; do
            els -l ec:/nklm/LENTIS/ec-earth/cmorised_by_var/${scen}/${freq}/* >> LENTIS_on_ECFS_${scen}_${freq}.txt
          done
        done

    Further, part of the data will be made publicly available from the Earth System Grid Federation (ESGF) data portal. We aim to upload most of the monthly variables for the full ensemble. As search terms, use EC-Earth for the model and p5 for the physics index to locate the KNMI-LENTIS data.

    5. Data of all variables for 1 year for 1 ensemble member

    The netcdf files with 1 year of data from member h010 are published here to give insight into the type of data and metadata that is representative of the full KNMI-LENTIS dataset. The data are in zipped folders per output frequency: AERmon, Amon, Emon, LImon, Lmon, Ofx, Omon, SImon, fx, Eday, Oday, day, CFday, 3hr, 6hrPlev, 6hrPlevPt. The text file request-overview-CMIP-historical-including-EC-EARTH-AOGCM-preferences.txt gives an overview of the variables available per output frequency. The text file tree_of_files_one_member_all_data.txt gives an overview of the files in the zipped folders.

    6. Related links

    The production of the KNMI-LENTIS ensemble was funded by the KNMI (Royal Netherlands Meteorological Institute) multi-year strategic research fund KNMI MSO Climate Variability And Extremes (VAREX).

    GitHub repository corresponding to this Zenodo dataset: https://github.com/lmuntjewerf/KNMI-LENTIS_dataset_description.git

    GitHub repository for the KNMI-LENTIS production code: https://github.com/lmuntjewerf/KNMI-LENTIS_production_script_train.git
