Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This graph shows the proportion of articles by discipline with original data generated by the research described in the article, along with associated confidence intervals. See Tables 9 and 11 for numeric values.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data were generated for an investigation of research data repository (RDR) mentions in biomedical research articles.
Supplementary Table 1 is a discrete subset of SciCrunch RDRs used to study RDR mentions in the biomedical literature. We generated this list by starting with the top 1000 entries in the SciCrunch database, measured by citations; removing entries for organizations (such as universities without a corresponding RDR) and non-relevant tools (such as reference managers); updating links; and consolidating duplicates resulting from RDR mergers and name variations. The resulting list contains 737 RDRs. The file includes the Research Resource Identifier (RRID), the RDR name, and a link to the RDR record in the SciCrunch database.
Supplementary Table 2 shows the RDRs, associated journals, and article-mention pairs (records) with text snippets extracted from mined Methods text in 2020 PubMed articles. The dataset has four components. The first shows the list of repositories with RDR mentions and includes the Research Resource Identifier (RRID), the RDR name, the number of articles that mention the RDR, and a link to the record in the SciCrunch database. The second shows the list of journals in the study set with at least one RDR mention and includes the journal ID, name, ESSN/ISSN, the total count of publications in 2020, the number of articles that had text available to mine, the number of article-mention pairs (records), the number of articles with RDR mentions, the number of unique RDRs mentioned, and the percentage of articles with minable text. The third shows the top 200 journals by RDR mentions, normalized by the proportion of articles with available text to mine, with the same metadata as the second table. The fourth shows text snippets for each RDR mention and includes the RRID, RDR name, PubMed ID (PMID), DOI, article publication date, journal name, journal ID, ESSN/ISSN, article title, and snippet.
Big Data and Society Abstract & Indexing - ResearchHelpDesk - Big Data & Society (BD&S) is an open access, peer-reviewed scholarly journal that publishes interdisciplinary work, principally in the social sciences, humanities, and computing and their intersections with the arts and natural sciences, on the implications of Big Data for societies. The journal's key purpose is to provide a space for connecting debates about the emerging field of Big Data practices and how they are reconfiguring academic, social, industry, business, and government relations, expertise, methods, concepts, and knowledge. BD&S moves beyond usual notions of Big Data and treats it as an emerging field of practice that is not defined by, but generative of, (sometimes) novel data qualities such as high volume and granularity, and complex analytics such as data linking and mining. It thus attends to digital content generated through online and offline practices in social, commercial, scientific, and government domains. This includes, for instance, content generated on the Internet through social media and search engines, but also content generated in closed networks (commercial or government transactions) and open networks such as digital archives, open government, and crowdsourced data. Critically, rather than settling on a definition, the journal makes this an object of interdisciplinary inquiries and debates explored through studies of a variety of topics and themes. BD&S seeks contributions that analyze Big Data practices and/or involve empirical engagements and experiments with innovative methods, while also reflecting on the consequences for how societies are represented (epistemologies), realized (ontologies), and governed (politics).

Article processing charge (APC): The APC for this journal is currently 1500 USD. Authors who do not have funding for open access publishing can request a waiver from the publisher, SAGE, once their Original Research Article is accepted after peer review. For all other content (Commentaries, Editorials, Demos) and Original Research Articles commissioned by the Editor, the APC will be waived.

Abstract & Indexing: Clarivate Analytics: Social Sciences Citation Index (SSCI); Directory of Open Access Journals (DOAJ); Google Scholar; Scopus.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This description is part of the blog post "Systematic Literature Review of teaching Open Science" https://sozmethode.hypotheses.org/839
In my opinion, we do not pay enough attention to teaching Open Science in higher education. I therefore designed a seminar that teaches students the practices of Open Science through qualitative research, and I described it in the article "Teaching Open Science and qualitative methods". For that article, I began reviewing the literature on teaching Open Science. The result of my literature review is that certain aspects of Open Science are used for teaching. However, Open Science with all its aspects (Open Access, Open Data, Open Methodology, Open Science Evaluation, and Open Science Tools) is not an issue in publications about teaching.
Based on this insight, I have started a systematic literature review. I quickly realized that I need help to analyse and interpret the articles and to evaluate my preliminary findings. The different disciplinary cultures of teaching the various aspects of Open Science are especially challenging, since as a social scientist I do not have enough insight to interpret the results correctly. Therefore, I would like to invite you to participate in this research project!
I am now looking for people who would like to join a collaborative process to further explore and write the systematic literature review on "Teaching Open Science", because I want to turn this project into a Massive Open Online Paper (MOOP). According to the ten rules of Tennant et al. (2019) on MOOPs, it is crucial to find a core group that is enthusiastic about the topic. I am therefore looking for people who are interested in creating the structure of the paper and writing it together with me, as well as people who want to search for and review literature or evaluate the literature I have already found. Together with the interested persons, I would then define the rules for the project (cf. Tennant et al. 2019). So if you are interested in contributing to the further search for articles and/or to the interpretation and writing of results, please get in touch. For everyone interested in contributing, the list of articles collected so far is freely accessible on Zotero: https://www.zotero.org/groups/2359061/teaching_open_science. The figure shown below provides a first overview of my ongoing work. I created the figure with the free software yEd and uploaded the file to Zenodo so that everyone can download and work with it:
To make transparent what I have done so far, I will first introduce what a systematic literature review is. Secondly, I describe the decisions I made to start with the systematic literature review. Third, I present the preliminary results.
Systematic literature review – an Introduction
Systematic literature reviews “are a method of mapping out areas of uncertainty, and identifying where little or no relevant research has been done” (Petticrew/Roberts 2008: 2). Fink defines the systematic literature review as a “systematic, explicit, and reproducible method for identifying, evaluating, and synthesizing the existing body of completed and recorded work produced by researchers, scholars, and practitioners” (Fink 2019: 6). The aim of a systematic literature review is to overcome the subjectivity of a researcher's search for literature. However, there can never be a fully objective selection of articles, because the researcher has already made a preselection by deciding on search strings, for example “Teaching Open Science”. In this respect, transparency is the core criterion for a high-quality review.
In order to achieve high quality and transparency, Fink (2019: 6-7) proposes the following seven steps:
I have adapted these steps for the “Teaching Open Science” systematic literature review. In the following, I will present the decisions I have made.
Systematic literature review – decisions I made
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data file for the first release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 10,006,058 data citation records in JSON and CSV formats. The JSON file is the version of record.
Version 1.0 of the corpus data file was released on January 30, 2024. Release v1.1 is an optimized version of v1.0 designed to make the original citation records more usable. No citations have been added to or removed from the dataset in v1.1.
For convenience, the data file is provided in batches of approximately 1 million records each. The publication date and batch number are included in each component file name, e.g., 2024-05-10-data-citation-corpus-01-v1.1.json.
The data citations in the file originate from DataCite Event Data and from a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.
Each data citation record comprises:
A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication object (journal article or preprint) in which the dataset is cited
Metadata for the cited dataset and for the citing publication object
The data file includes the following fields:
| Field | Description | Required? |
| --- | --- | --- |
| id | Internal identifier for the citation | Yes |
| created | Date of the item's incorporation into the corpus | Yes |
| updated | Date of the item's most recent update in the corpus | Yes |
| repository | Repository where the cited data is stored | No |
| publisher | Publisher of the article citing the data | No |
| journal | Journal of the article citing the data | No |
| title | Title of the cited data | No |
| objId | DOI of the article where the data is cited | Yes |
| subjId | DOI or accession number of the cited data | Yes |
| publishedDate | Date when the citing article was published | No |
| accessionNumber | Accession number of the cited data | No |
| doi | DOI of the cited data | No |
| relationTypeId | Relation type in the metadata between citation object and subject | No |
| source | Source where the citation was harvested | Yes |
| subjects | Subject information for the cited data | No |
| affiliations | Affiliation information for the creator of the cited data | No |
| funders | Funding information for the cited data | No |
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was developed as part of a study that assessed data reuse. First, through bibliometric analysis, corresponding authors of highly cited papers published in 2015 at the University of Illinois at Urbana-Champaign in nine STEM disciplines were identified and surveyed to determine whether data were generated for their article and what they knew of its reuse by other researchers. Second, the corresponding authors who cited those 2015 articles were identified and surveyed to ascertain whether they reused data from the original article and how that data was obtained. The project goal was to better understand data reuse in practice and to explore whether research data from an initial publication were reused in subsequent publications.
This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.
This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.
The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.
Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.
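As a toy illustration of the kind of utility check involved in comparing candidate synthetic datasets (not the project's actual evaluation method), one can compare the marginal distribution of a column in a candidate dataset against the original:

```python
import pandas as pd

# A toy illustration of one utility check: how closely the marginal
# distribution of a column in a synthetic candidate matches the original.
def marginal_distance(original: pd.Series, synthetic: pd.Series) -> float:
    p = original.value_counts(normalize=True)
    q = synthetic.value_counts(normalize=True)
    aligned = pd.concat([p, q], axis=1).fillna(0)
    # Total variation distance: 0 = identical marginals, 1 = disjoint
    return 0.5 * (aligned.iloc[:, 0] - aligned.iloc[:, 1]).abs().sum()
```

A lower distance indicates better utility for that column; privacy checks would then be weighed against such utility metrics when selecting the version to publish.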
For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.
This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of this data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.
Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.
Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).
1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.
unarXive is a scholarly data set containing publications' full-text, annotated in-text citations, and a citation network.
The data is generated from all LaTeX sources on arXiv and is therefore of higher quality than data generated from PDF files.
Typical use cases are
Note: This Zenodo record is an old version of unarXive. You can find the most recent version at https://zenodo.org/record/7752754 and https://zenodo.org/record/7752615
To download the whole data set send an access request and note the following:
Note: this Zenodo record is a "full" version of unarXive, which was generated from all of arXiv.org, including non-permissively licensed papers. Make sure that your use of the data is compliant with the papers' licensing terms.¹
¹ For information on papers' licenses use arXiv's bulk metadata access.
The code used for generating the data set is publicly available.
Usage examples for our data set are provided on GitHub.
This initial version of unarXive is described in the following journal article.
Tarek Saier, Michael Färber: "unarXive: A Large Scholarly Data Set with Publications' Full-Text, Annotated In-Text Citations, and Links to Metadata", Scientometrics, 2020,
[link to an author copy]
The updated version is described in the following conference paper.
Tarek Saier, Michael Färber. "unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network", JCDL 2023.
[link to an author copy]
Big Data and Society Acceptance Rate - ResearchHelpDesk - Big Data & Society (BD&S) is an open access, peer-reviewed scholarly journal that publishes interdisciplinary work on the implications of Big Data for societies; see the full journal description in the "Big Data and Society Abstract & Indexing" entry above.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Conceptual novelty analysis data based on PubMed Medical Subject Headings
----------------------------------------------------------------------

Created by Shubhanshu Mishra and Vetle I. Torvik on April 16th, 2018

## Introduction

This is a dataset created as part of the publication: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine: The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra. It contains final data generated as part of our experiments based on the MEDLINE 2015 baseline and the MeSH tree from 2015. The dataset is distributed in the form of the following tab-separated text files:

* PubMed2015_NoveltyData.tsv - Novelty scores for each paper in PubMed. The file contains 22,349,417 rows and 6 columns, as follows:
  - PMID: PubMed ID
  - Year: year of publication
  - TimeNovelty: time novelty score of the paper based on individual concepts (see paper)
  - VolumeNovelty: volume novelty score of the paper based on individual concepts (see paper)
  - PairTimeNovelty: time novelty score of the paper based on pairs of concepts (see paper)
  - PairVolumeNovelty: volume novelty score of the paper based on pairs of concepts (see paper)
* mesh_scores.tsv - Temporal profiles for each MeSH term for all years. The file contains 1,102,831 rows and 5 columns, as follows:
  - MeshTerm: name of the MeSH term
  - Year: year
  - AbsVal: total publications with that MeSH term in the given year
  - TimeNovelty: age (in years since first publication) of the MeSH term in the given year
  - VolumeNovelty: age (in number of papers since first publication) of the MeSH term in the given year
* meshpair_scores.txt.gz (36 GB uncompressed) - Temporal profiles for each MeSH term pair for all years, with the following columns:
  - Mesh1: name of the first MeSH term (alphabetically sorted)
  - Mesh2: name of the second MeSH term (alphabetically sorted)
  - Year: year
  - AbsVal: total publications with that MeSH pair in the given year
  - TimeNovelty: age (in years since first publication) of the MeSH pair in the given year
  - VolumeNovelty: age (in number of papers since first publication) of the MeSH pair in the given year
* README.txt

## Dataset creation

This dataset was constructed using multiple datasets described in the following locations:

* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* MeSH tree 2015: ftp://nlmpubs.nlm.nih.gov/online/mesh/2015/meshtrees/
* Source code: https://github.com/napsternxg/Novelty

Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October 2016. See NLM's Terms and Conditions for information on obtaining PubMed/MEDLINE data. Additional data-related updates can be found at the Torvik Research Group.

## Acknowledgments

This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License

Conceptual novelty analysis data based on PubMed Medical Subject Headings by Shubhanshu Mishra and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/Novelty
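The novelty-score file is large, so a streaming read is advisable. Below is a minimal pandas sketch, assuming the six-column tab-separated layout described above (adjust `header` if the file ships without a header row):

```python
import pandas as pd

# A minimal sketch, assuming the tab-separated layout described above.
cols = ["PMID", "Year", "TimeNovelty", "VolumeNovelty",
        "PairTimeNovelty", "PairVolumeNovelty"]
chunks = pd.read_csv("PubMed2015_NoveltyData.tsv", sep="\t",
                     names=cols, header=0, chunksize=1_000_000)

# Stream the ~22M rows and accumulate per-year sums/counts of TimeNovelty
totals = pd.Series(dtype=float)
counts = pd.Series(dtype=float)
for chunk in chunks:
    g = chunk.groupby("Year")["TimeNovelty"]
    totals = totals.add(g.sum(), fill_value=0)
    counts = counts.add(g.count(), fill_value=0)

print((totals / counts).sort_index())  # mean time novelty per publication year
```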
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains the dataset for a study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.
Data Collection and Analysis
We used the code for assessing the reproducibility of Jupyter notebooks from the study by Pimentel et al. (2019) and adapted code from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.
Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query "(ipynb OR jupyter OR ipython) AND github". We retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning each entire article (abstract, body, data availability statement, and supplementary materials), we extract GitHub links. Additionally, we mine the repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Leveraging the GitHub API, we enrich our data with repository creation dates, update histories, pushes, and programming languages.
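As a rough illustration of this search step (a sketch, not the authors' exact pipeline; the email address and retmax value are placeholders), Biopython's Entrez module can be used as follows:

```python
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI requires a contact address; placeholder

# Search PubMed Central for articles likely to reference Jupyter notebooks on GitHub
query = "(ipynb OR jupyter OR ipython) AND github"
handle = Entrez.esearch(db="pmc", term=query, retmax=100)
result = Entrez.read(handle)
handle.close()

print(f"{result['Count']} candidate articles")

# Fetch the first few records as XML for downstream extraction of GitHub links
handle = Entrez.efetch(db="pmc", id=",".join(result["IdList"][:5]), retmode="xml")
xml_records = handle.read()
handle.close()
```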
All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.
Our reproducibility pipeline was started on 27 March 2023.
Repository Structure
Our repository is organized into two main folders:
Accessing Data and Resources:
System Requirements:
Running the pipeline:
Running the analysis:
References:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This document contains the datasets and visualizations generated by applying the methodology defined in our work "A qualitative and quantitative citation analysis toward retracted articles: a case study". The methodology defines a citation analysis of the Wakefield et al. [1] retracted article from a quantitative and qualitative point of view. The data contained in this repository are based on the first two steps of the methodology. The first step ("Data gathering") builds an annotated dataset of the citing entities; this step is also discussed at length in [2]. The second step ("Topic Modelling") runs a topic modeling analysis on the textual features contained in the dataset generated by the first step.
Note: the data are all contained inside the "method_data.zip" file. You need to unzip the file to get access to all the files and directories listed below.
Data gathering
The data generated by this step are stored in "data/":

"cits_features.csv": a dataset containing all the entities (rows in the CSV) that have cited the Wakefield et al. retracted article, and a set of features characterizing each citing entity (columns in the CSV). The features included are: DOI ("doi"), year of publication ("year"), title ("title"), venue identifier ("source_id"), venue title ("source_title"), a yes/no value indicating whether the citing entity is itself retracted ("retracted"), subject area ("area"), subject category ("category"), the sections of the in-text citations ("intext_citation.section"), the value of the reference pointer ("intext_citation.pointer"), the in-text citation function ("intext_citation.intent"), the in-text citation perceived sentiment ("intext_citation.sentiment"), and a yes/no value denoting whether the in-text citation context mentions the retraction of the cited entity ("intext_citation.section.ret_mention"). Note: this dataset is licensed under a Creative Commons public domain dedication (CC0).

"cits_text.csv": this dataset stores the abstract ("abstract") and the in-text citation contexts ("intext_citation.context") for each citing entity, identified by its DOI value ("doi"). Note: these data keep their original license (the one provided by their publisher). This dataset is provided to support the reproducibility of the results obtained in our work.
Topic modeling

We run a topic modeling analysis on the textual features gathered (i.e., abstracts and citation contexts). The results are stored inside the "topic_modeling/" directory. The topic modeling has been done using MITAO, a tool for mashing up automatic text analysis tools and creating a completely customizable visual workflow [3]. The topic modeling results for each textual feature are separated into two folders: "abstracts/" for the abstracts and "intext_cit/" for the in-text citation contexts. Both directories contain the following directories/files:

"mitao_workflows/": the workflows of MITAO. These are JSON files that can be reloaded in MITAO to reproduce the results following the same workflows.

"corpus_and_dictionary/": contains the dictionary and the vectorized corpus given as inputs for the LDA topic modeling.

"coherence/coherence.csv": the coherence scores of several topic models trained with the number of topics ranging from 1 to 40.
"datasets_and_views/": the datasets and visualizations generated using MITAO.
References
1. Wakefield, A., Murch, S., Anthony, A., Linnell, J., Casson, D., Malik, M., Berelowitz, M., Dhillon, A., Thomson, M., Harvey, P., Valentine, A., Davies, S., & Walker-Smith, J. (1998). RETRACTED: Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children. The Lancet, 351(9103), 637–641. https://doi.org/10.1016/S0140-6736(97)11096-0
2. Heibi, I., & Peroni, S. (2020). A methodology for gathering and annotating the raw-data/characteristics of the documents citing a retracted article v1 (protocols.io.bdc4i2yw) [Data set]. In protocols.io. ZappyLab, Inc. https://doi.org/10.17504/protocols.io.bdc4i2yw
3. Ferri, P., Heibi, I., Pareschi, L., & Peroni, S. (2020). MITAO: A User Friendly and Modular Software for Topic Modelling. PuntOorg International Journal, 5(2), 135–149. https://doi.org/10.19245/25.05.pij.5.2.3
Big Data and Society Publication fee - ResearchHelpDesk - The article processing charge (APC) for Big Data & Society (BD&S) is currently 1500 USD. Authors who do not have funding for open access publishing can request a waiver from the publisher, SAGE, once their Original Research Article is accepted after peer review. For all other content (Commentaries, Editorials, Demos) and Original Research Articles commissioned by the Editor, the APC will be waived; see the full journal description in the "Big Data and Society Abstract & Indexing" entry above.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About the NUDA Dataset
Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.
General
This dataset was created through user feedback on automatically generated bias highlights on news articles on the website NewsUnravel, made by ANON. Its goal is to improve the detection of linguistic media bias for analysis and to indicate such bias to the public. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.

The dataset consists of text, namely biased sentences with binary bias labels (processed: biased or not biased), as well as metadata about the articles. It includes all feedback that was given; the individual unprocessed ratings used to create the labels, with corresponding user IDs, are included.

For training, this dataset was combined with the BABE dataset. All data is completely anonymous. Some sentences might be offensive or triggering, as they were taken from biased or more extreme news sources. The dataset neither identifies sub-populations nor can it be considered sensitive to them, nor is it possible to identify individuals.
Description of the Data Files
This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:
NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
Statistics.png: contains all Umami statistics for NewsUnravel's usage data
Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence and the bias rating, and reason, if given
Article.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %
Participant.csv: holds the participant IDs and data processing consent
Collection Process
Data was collected through interactions with the Feedback Mechanism on NewsUnravel. A news article was displayed with automatically generated bias highlights. Each highlight could be selected, and readers were able to agree or disagree with the automatic label. Through a majority vote, labels were generated from those feedback interactions. Spammers were excluded through a spam detection approach.
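A minimal sketch of such majority-vote label generation, under assumed column names (contentId and a binary rating column in Feedback.csv; the released files may spell these differently):

```python
import pandas as pd

# Derive one label per sentence by majority vote over reader feedback.
# Column names (contentId, rating) are assumptions based on the file
# descriptions above.
feedback = pd.read_csv("Feedback.csv")

def majority(votes: pd.Series):
    counts = votes.value_counts()
    # Leave ties unlabeled rather than guessing
    if len(counts) > 1 and counts.iloc[0] == counts.iloc[1]:
        return None
    return counts.idxmax()

labels = feedback.groupby("contentId")["rating"].apply(majority).dropna()
print(labels.head())
```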
Readers came to our website voluntarily through posts on LinkedIn and social media as well as posts on university boards. The data collection period lasted for one week, from March 4th to March 11th (2023). The landing page informed them about the goal and the data processing. After being informed, they could proceed to the article overview.
So far, the dataset has been used on top of BABE to train a linguistic bias classifier, adopting hyperparameter configurations from BABE with a pre-trained model from Hugging Face.
The dataset will be open source. On acceptance, a link with all details and contact information will be provided. No third parties are involved.
The dataset will not be maintained as it captures the first test of NewsUnravel at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsUnravel paper if you use the dataset and contact us if you're interested in more information or joining the project.
Data sets used to prepare illustrative figures for the overview article "Multiscale Modeling of Background Ozone"

Overview

The CMAQ model output datasets used to create illustrative figures for this overview article were generated by scientists in EPA/ORD/CEMM and EPA/OAR/OAQPS.

The EPA/ORD/CEMM-generated dataset consisted of hourly CMAQ output from two simulations. The first simulation was performed for July 1 - 31 over a 12 km modeling domain covering the Western U.S. The simulation was configured with the Integrated Source Apportionment Method (ISAM) to estimate the contributions from nine source categories to modeled ozone. ISAM source contributions for July 17 - 31, averaged over all grid cells located in Colorado, were used to generate the illustrative pie chart in the overview article. The second simulation was performed for October 1, 2013 - August 31, 2014 over a 108 km modeling domain covering the northern hemisphere. This simulation was also configured with ISAM to estimate the contributions of non-US anthropogenic sources, natural sources, stratospheric ozone, and other sources to ozone concentrations. Ozone ISAM results from this simulation were extracted along a boundary curtain of the 12 km modeling domain specified over the Western U.S. for the period January 1, 2014 - July 31, 2014 and used to generate the illustrative time-height cross-sections in the overview article.

The EPA/OAR/OAQPS-generated dataset consisted of hourly gridded CMAQ output for surface ozone concentrations for the year 2016. The CMAQ simulations were performed over the northern hemisphere at a horizontal resolution of 108 km. NO2 and O3 data for July 2016 were extracted from these simulations to generate the vertically integrated column densities shown in the illustrative comparison to satellite-derived column densities.

CMAQ Model Data

The data from the CMAQ model simulations used in this research effort are very large (several terabytes) and cannot be uploaded to ScienceHub due to size restrictions. The model simulations are stored on the /asm archival system accessible through the atmos high-performance computing (HPC) system. Due to data management policies, files on /asm are subject to expiry depending on the template of the project; files not requested for extension after the expiry date are deleted permanently from the system. The format of the files used in this analysis and listed below is ioapi/netcdf. Documentation of this format, including definitions of the geographical projection attributes contained in the file headers, is available at https://www.cmascenter.org/ioapi/. Documentation on the CMAQ model, including a description of the output file format and output model species, can be found on the CMAQ GitHub site at https://github.com/USEPA/CMAQ.

This dataset is associated with the following publication: Hogrefe, C., B. Henderson, G. Tonnesen, R. Mathur, and R. Matichuk. Multiscale Modeling of Background Ozone: Research Needs to Inform and Improve Air Quality Management. EM Magazine. Air and Waste Management Association, Pittsburgh, PA, USA, 1-6, (2020).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The original datasets are described in the article by Vanoli et al in Epidemiology (2024) (DOI: 10.1097/EDE.0000000000001796) [freely available here], which also provides information about the data sources.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:
In addition, this repository provides these additional files:
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data including the annual PM2.5 levels in a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
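As a schematic illustration of the second step (the project itself uses R and synthpop; this Python sketch only mirrors the idea under assumed inputs), yearly death events can be simulated from per-subject hazards implied by a fitted survival model:

```python
import numpy as np
import pandas as pd

# Schematic only: given per-subject yearly hazards (e.g., derived from a
# fitted Cox model), simulate death events year by year. The hazard frame
# and its layout are assumptions for illustration.
rng = np.random.default_rng(1)

def simulate_deaths(hazard: pd.DataFrame) -> pd.DataFrame:
    # hazard: rows = subjects, columns = years, values = yearly hazard h
    alive = np.ones(len(hazard), dtype=bool)
    events = pd.DataFrame(False, index=hazard.index, columns=hazard.columns)
    for year in hazard.columns:
        p_death = 1.0 - np.exp(-hazard[year].to_numpy())  # yearly event probability
        died = alive & (rng.random(len(hazard)) < p_death)
        events[year] = died
        alive &= ~died  # subjects who died stop being at risk
    return events
```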
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables as well as the mortality risks resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artifacts for the paper titled "Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?".
This artifact repository contains 9 compressed folders, as follows:
| ID | File Name | Description |
| --- | --- | --- |
| 1 | syn_circa.zip | CIRCA10 and CIRCA50 datasets for Causal Discovery |
| 2 | syn_rcd.zip | RCD10 and RCD50 datasets for Causal Discovery |
| 3 | syn_causil.zip | CausIL10 and CausIL50 datasets for Causal Discovery |
| 4 | rca_circa.zip | CIRCA10 and CIRCA50 datasets for RCA |
| 5 | rca_rcd.zip | RCD10 and RCD50 datasets for RCA |
| 6 | online-boutique.zip | Online Boutique dataset for RCA |
| 7 | sock-shop-1.zip | Sock Shop 1 dataset for RCA |
| 8 | sock-shop-2.zip | Sock Shop 2 dataset for RCA |
| 9 | train-ticket.zip | Train Ticket dataset for RCA |
Each zip file contains the generated/collected data from the corresponding data generator or microservice benchmark systems (e.g., online-boutique.zip contains metrics data collected from the Online Boutique system).
Details about the generation of our datasets
We use three different synthetic data generators from three previous RCA studies [15, 25, 28]: the CIRCA, RCD, and CausIL data generators. Their mechanisms are as follows:

1. The CIRCA data generator [28] generates a random causal directed acyclic graph (DAG) based on a given number of nodes and edges. From this DAG, time series data for each node are generated using a vector auto-regression (VAR) model. A fault is injected into a node by altering the noise term in the VAR model for two timestamps.
2. The RCD data generator [25] uses the pyAgrum package [3] to generate a random DAG based on a given number of nodes, subsequently generating discrete time series data for each node, with values ranging from 0 to 5. A fault is introduced into a node by changing its conditional probability distribution.
3. The CausIL data generator [15] generates causal graphs and time series data that simulate the behavior of microservice systems. It first constructs a DAG of services and metrics based on domain knowledge, then generates metric data for each node of the DAG using regressors trained on real metrics data. Unlike the CIRCA and RCD data generators, the CausIL data generator cannot inject faults.

To create our synthetic datasets, we first generate 10 DAGs whose node counts range from 10 to 50 for each of the synthetic data generators. Next, we generate fault-free datasets using these DAGs with different seedings, resulting in 100 cases for the CIRCA and RCD generators and 10 cases for the CausIL generator. We then create faulty datasets by introducing ten faults into each DAG and generating the corresponding faulty data, yielding 100 cases for the CIRCA and RCD data generators. The fault-free datasets (e.g., syn_rcd, syn_circa) are used to evaluate causal discovery methods, while the faulty datasets (e.g., rca_rcd, rca_circa) are used to assess RCA methods.
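To make the CIRCA-style mechanism concrete, the sketch below builds a random DAG, simulates VAR-style time series along it, and injects a fault by inflating one node's noise term for two timestamps. It is an illustrative approximation under assumed parameters (graph density, edge weights, fault magnitude), not the original generator:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n_nodes, n_steps, fault_node, fault_t = 10, 500, 3, 250

# Random DAG: sample a directed G(n, p) graph and keep only edges u -> v with u < v
g = nx.gnp_random_graph(n_nodes, 0.3, seed=0, directed=True)
dag = nx.DiGraph((u, v) for (u, v) in g.edges() if u < v)
dag.add_nodes_from(range(n_nodes))

weights = {e: rng.uniform(0.5, 1.0) for e in dag.edges()}
data = np.zeros((n_steps, n_nodes))
order = list(nx.topological_sort(dag))

for t in range(1, n_steps):
    for v in order:  # parents are processed before children at each step
        # Fault: inflate the noise term of one node for two timestamps
        scale = 10.0 if v == fault_node and t in (fault_t, fault_t + 1) else 1.0
        data[t, v] = (0.5 * data[t - 1, v]
                      + sum(weights[(p, v)] * data[t, p] for p in dag.predecessors(v))
                      + rng.normal(scale=scale))
```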
We deploy three popular benchmark microservice systems: Sock Shop [6], Online Boutique [4], and Train Ticket [8], on a four-node Kubernetes cluster hosted on AWS. Next, we use the Istio service mesh [2] with Prometheus [5] and cAdvisor [1] to monitor and collect resource-level and service-level metrics of all services, as in previous works [25, 39, 59]. To generate traffic, we use the load generators provided by these systems and customise them to exercise all services with 100 to 200 concurrent users. We then introduce five common faults (CPU hog, memory leak, disk IO stress, network delay, and packet loss) into five different services within each system. Finally, we collect metrics data before and after the fault injection operation. An overview of our setup is presented in the figure below.
Code
The code to reproduce the experimental results in the paper is available at https://github.com/phamquiluan/RCAEval.
References
As in our paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As a critical component in mechanical systems, the operational status of rolling bearings plays a pivotal role in ensuring the stability and safety of the entire system. However, in practical applications, the fault diagnosis of rolling bearings often encounters limitations due to the constraint of sample size, leading to suboptimal diagnostic accuracy. This article proposes a rolling bearing fault diagnosis method based on an improved denoising diffusion probability model (DDPM) to address this issue. The practical value of this research lies in its ability to address the limitation of small sample sizes in rolling bearing fault diagnosis. By leveraging DDPM to generate one-dimensional vibration data, the proposed method significantly enriches the datasets and consequently enhances the generalization capability of the diagnostic model. During the model training process, we innovatively introduce the feature differences between the original vibration data and the predicted vibration data generated based on prediction noise into the loss function, making the generated data more directional and targeted. In addition, this article adopts a one-dimensional convolutional neural network (1D-CNN) to construct a fault diagnosis model to more accurately extract and focus on key feature information related to faults. The experimental results show that this method can effectively improve the accuracy and reliability of rolling bearing fault diagnosis, providing new ideas and methods for fault detection and prevention in industrial applications. This advancement in diagnostic technology has the potential to significantly reduce the risk of system failures, enhance operational efficiency, and lower maintenance costs, thus contributing significantly to the safety and efficiency of mechanical systems.
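As a hedged sketch of the loss modification described above (the standard DDPM noise-prediction MSE plus a term penalizing the difference between the original vibration signal and the signal reconstructed from the predicted noise), the following PyTorch snippet illustrates the idea; the names model, alpha_bar, and the weight lam are assumptions, and the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, t, alpha_bar, lam=0.1):
    # x0: original vibration batch (B, C, L); t: integer timesteps (B,)
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1)                 # cumulative product of alphas at step t
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise   # forward (noising) process sample
    pred_noise = model(x_t, t)                       # network predicts the injected noise
    # Closed-form reconstruction of x0 from the predicted noise
    x0_hat = (x_t - (1 - ab).sqrt() * pred_noise) / ab.sqrt()
    # Standard DDPM term + difference between original and reconstructed signal
    return F.mse_loss(pred_noise, noise) + lam * F.mse_loss(x0_hat, x0)
```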
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GlobalHighPM2.5 is one of the series of long-term, full-coverage, global high-resolution and high-quality datasets of ground-level air pollutants over land (i.e., GlobalHighAirPollutants, GHAP). It is generated from big data (e.g., ground-based measurements, satellite remote sensing products, atmospheric reanalysis, and model simulations) using artificial intelligence by considering the spatiotemporal heterogeneity of air pollution.
This dataset contains the input data, analysis code, and generated dataset used for the following article. If you use the GlobalHighPM2.5 dataset in related scientific research, please cite the corresponding reference (Wei et al., NC, 2023):
Wei, J., Li, Z., Lyapustin, A., Wang, J., Dubovik, O., Schwartz, J., Sun, L., Li, C., Liu, S., and Zhu, T. First close insight into global daily gapless 1 km PM2.5 pollution, variability, and health impact. Nature Communications, 2023, 14, 8349. https://doi.org/10.1038/s41467-023-43862-3
Input Data
Relevant raw data for each figure (compiled into a single sheet within an Excel document) in the manuscript.
Code
Relevant Python scripts for replicating and plotting the analysis results in the manuscript, as well as code for converting data formats.
Generated Dataset
This is the first big data-derived, gapless (spatial coverage = 100%) monthly and yearly 1 km (i.e., M1K and Y1K) global ground-level PM2.5 dataset over land, covering 2017 to 2022. The dataset is of high quality, with cross-validation coefficients of determination (CV-R2) of 0.91, 0.97, and 0.98 and root-mean-square errors (RMSEs) of 9.20, 4.15, and 2.77 µg m-3 on the daily, monthly, and annual bases, respectively.
Due to data volume limitations,
all (including daily) data for the year 2022 is accessible at: GlobalHighPM2.5 (2022)
all (including daily) data for the year 2021 is accessible at: GlobalHighPM2.5 (2021)
all (including daily) data for the year 2020 is accessible at: GlobalHighPM2.5 (2020)
all (including daily) data for the year 2019 is accessible at: GlobalHighPM2.5 (2019)
all (including daily) data for the year 2018 is accessible at: GlobalHighPM2.5 (2018)
all (including daily) data for the year 2017 is accessible at: GlobalHighPM2.5 (2017)
Continuously updated...
More air quality datasets of different air pollutants can be found at: https://weijing-rs.github.io/product.html
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains all the raw data and aggregated data from the study introduced in the article "Citing and referencing habits in Medicine and Social Sciences journals in 2019". The study is based on the bibliographic and citation data contained in 213 articles published in 46 journals in two subject areas: Medicine and Social Sciences. The articles contained a total of 9,911 bibliographic references and 16,193 mentions and quotations overall. In particular, the raw data cover 27 journals, 132 articles, 5,340 bibliographic references, 8,400 mentions, and 5 quotations in Medicine, and 19 journals, 81 articles, 4,571 bibliographic references, 6,953 mentions, and 835 quotations in Social Sciences.
The dataset is composed of three files: