CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.

Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third-party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.

Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
This is a searchable historical collection of standards referenced in regulations: voluntary consensus standards, government-unique standards, industry standards, and international standards referenced in the Code of Federal Regulations (CFR).
Collection of college statistics, draft team information, and NFL career statistics for every quarterback drafted from 2000 through the 2024 offseason. Originally created in an attempt to train a neural network that predicts the NFL success level of a quarterback at the time he was drafted.
This database was only made possible by the many NFL stat-keeping websites I discovered during the data collection process:
year-drafted: The year drafted into the NFL
qb-num-picked: The number taken relative to other quarterbacks (1 = first quarterback selected, 2 = second selected, etc.)
rd-picked: The round of the NFL draft the player was selected
num-picked: The overall draft position the player was drafted at
name: Name of player
height (in): Player height in inches as reported at the NFL Draft
weight (lbs): Player weight in pounds as reported at the NFL Draft
nfl-team: The NFL team that drafted the player
coach-tenure: The number of years the head coach had been employed by the team that drafted the player at the time of the draft
drafted-team-winpr: The win percentage in the most recent season of the team that drafted the player at the time of drafting
drafted_team_ppg_rk: The points per game ranking in the most recent season of the team that drafted the player at the time of drafting
college: The college the player attended at the time of drafting
conf: The conference of the college the player participated in
conf-str: The calculated strength of the conference in the final year the quarterback played (reference link above)
p-cmp: Pass completions in college career
p-att: Pass attempts in college career
cmp-pct: Pass completion percentage in college career
p-yds: Total pass yards in college career
p-ypa: Passing yards per attempt in college career
p-adj-ypa: Adjusted passing yards per attempt in college career
p-td: Passing touchdowns in college career
int: Interceptions in college career
rate: Passing efficiency rating (reference link above)
r-att: Rushing attempt count in college career
r-yds: Rushing yards in college career
r-avg: Average yards per rush in college career
r-tds: Rushing touchdowns in college career
nfl-starts: Total number of games started in the NFL
nfl-wins: Total games won in the NFL
nfl-losses: Total games lost in the NFL
nfl-ties: Total games tied in the NFL
nfl-winpr: Total win percentage as a starter in the NFL
nfl-qbr: Quarterback rating in the NFL
nfl-cmp: Total pass completions in the NFL
nfl-att: Total pass attempts in the NFL
nfl-inc: Total incompletions thrown in the NFL
nfl-comp%: Career completion percentage in the NFL
nfl-yds: Total passing yards in the NFL
nfl-tds: Total passing touchdowns in the NFL
nfl-int: Total interceptions thrown in the NFL
nfl-pick6: Number of interceptions thrown that were returned for touchdowns in the NFL
nfl-int%: Percentage of NFL throws that were interceptions
nfl-sack%: Percentage of NFL passing plays on which the player took a sack
nfl-y/a: Yards per passing attempt in the NFL
nfl-ay/a: Adjusted yards per passing attempt in the NFL
nfl-any/a: Adjusted net yards per passing attempt in the NFL
nfl-y/c: Passing yards per completion in the NFL
nfl-y/g: Passing yards per game in the NFL
nfl-succ%: Passing success rate in the NFL (reference link above)
nfl-4qc: 4th quarter comebacks completed in the NFL
nfl-gwd: Game winning drives completed in the NFL
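As a quick sanity check on these fields, here is a minimal sketch that recomputes nfl-winpr from the win/loss/tie counts, counting ties as half a win. The sample rows and the in-memory CSV are hypothetical; a real session would read the dataset's CSV file instead.

```python
import csv
import io

# Hypothetical sample in the dataset's column layout (names and values invented).
sample = """name,rd-picked,num-picked,nfl-wins,nfl-losses,nfl-ties
Player A,1,1,80,40,0
Player B,6,199,120,60,1
"""

def win_pct(row):
    """Career win percentage as a starter, with a tie counted as half a win."""
    w, l, t = (int(row[k]) for k in ("nfl-wins", "nfl-losses", "nfl-ties"))
    games = w + l + t
    return (w + 0.5 * t) / games if games else None

rows = list(csv.DictReader(io.StringIO(sample)))
for r in rows:
    print(r["name"], round(win_pct(r), 3))
```

The same pattern extends to the other derived columns (cmp-pct, nfl-int%, and so on), which makes it easy to verify a downloaded copy against its own raw counts.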
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Citation metrics are widely used and misused. We have created a publicly available database of over 100,000 top scientists that provides standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator. Separate data are shown for career-long and single-year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given. Scientists are classified into 22 scientific fields and 176 subfields. Field- and subfield-specific percentiles are also provided for all scientists who have published at least 5 papers. Career-long data are updated to end-of-2020. The selection is based on the top 100,000 by c-score (with and without self-citations) or a percentile rank of 2% or above.
The dataset and code provide an update to the previously released version 1 data at https://doi.org/10.17632/btchxktzyw.1. The version 2 dataset is based on the May 06, 2020 snapshot from Scopus, is updated to citation year 2019, and is available at https://doi.org/10.17632/btchxktzyw.2
This version (3) is based on the Aug 01, 2021 snapshot from Scopus and is updated to citation year 2020.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
NFL passing statistics since 2001. Contains a record of every player who attempted a pass within that period. Tracked metrics include passing yards, passing touchdowns, pass attempts, completions, interceptions, and touchdown/interception/completion percentages. More advanced metrics such as yards per attempt and adjusted net yards per attempt are also included. I used this dataset, along with the NFL Rushing Statistics dataset, to predict the NFL MVP winner in 2024.
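For reference, the two advanced rate stats named above are commonly computed as follows (these formulas follow the standard Pro-Football-Reference definitions; the sample season numbers are made up):

```python
def adj_yards_per_att(pass_yds, pass_td, ints, att):
    # AY/A = (pass yards + 20 * pass TD - 45 * INT) / pass attempts
    return (pass_yds + 20 * pass_td - 45 * ints) / att

def adj_net_yards_per_att(pass_yds, pass_td, ints, att, sacks, sack_yds):
    # ANY/A additionally charges sacks: subtract sack yardage and
    # add sacks taken to the attempt denominator
    return (pass_yds + 20 * pass_td - 45 * ints - sack_yds) / (att + sacks)

# A hypothetical 4,000-yard, 30 TD, 10 INT season on 550 attempts,
# with 25 sacks taken for 150 yards lost:
print(round(adj_yards_per_att(4000, 30, 10, 550), 2))               # → 7.55
print(round(adj_net_yards_per_att(4000, 30, 10, 550, 25, 150), 2))  # → 6.96
```

The TD bonus (+20 yards) and INT penalty (-45 yards) are what distinguish these from plain yards per attempt.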
Note: Due to a system migration, this data will cease to update on March 14th, 2023. The current projection is to restart the updates on or around July 17th, 2024.

A list of all uniform citations from the Louisville Metro Police Department, including case number, date, location, division, beat, offender demographics, statutes and charges, and UCR codes. The CSV file is updated daily and can be found in this Link.

INCIDENT_NUMBER or CASE_NUMBER links these data sets together: Crime Data, Uniform Citation Data, Firearm Intake, LMPD Hate Crimes, Assaulted Officers.

CITATION_CONTROL_NUMBER links these data sets together: Uniform Citation Data, LMPD Stops Data.

Note: When examining this data, make sure to read the LMPD Crime Data section in our Terms of Use.

- AGENCY_DESC - the name of the department that issued the citation
- CASE_NUMBER - the number associated with the incident, or used as a reference to store items in our evidence rooms; it can be used (as INCIDENT_NUMBER) to connect this dataset to the following other datasets: 1. Crime Data, 2. Firearms Intake, 3. LMPD Hate Crimes, 4. Assaulted Officers. NOTE: CASE_NUMBER is not formatted the same as the INCIDENT_NUMBER in the other datasets. For example, in the Uniform Citation Data you have CASE_NUMBER 8018013155 (no dashes), which matches up with INCIDENT_NUMBER 80-18-013155 in the other 4 datasets.
- CITATION_YEAR - the year the citation was issued
- CITATION_CONTROL_NUMBER - links this data to the LMPD Stops data
- CITATION_TYPE_DESC - the type of citation issued (citations include: general citations, summons, warrants, arrests, and juvenile)
- CITATION_DATE - the date the citation was issued
- CITATION_LOCATION - the location where the citation was issued
- DIVISION - the LMPD division in which the citation was issued
- BEAT - the LMPD beat in which the citation was issued
- PERSONS_SEX - the gender of the person who received the citation
- PERSONS_RACE - the race of the person who received the citation (W-White, B-Black, H-Hispanic, A-Asian/Pacific Islander, I-American Indian, U-Undeclared, IB-Indian/India/Burmese, M-Middle Eastern Descent, AN-Alaskan Native)
- PERSONS_ETHNICITY - the ethnicity of the person who received the citation (N-Not Hispanic, H-Hispanic, U-Undeclared)
- PERSONS_AGE - the age of the person who received the citation
- PERSONS_HOME_CITY - the city in which the person who received the citation lives
- PERSONS_HOME_STATE - the state in which the person who received the citation lives
- PERSONS_HOME_ZIP - the zip code in which the person who received the citation lives
- VIOLATION_CODE - an alpha/numeric code assigned by the Kentucky State Police to link to a Kentucky Revised Statute. For a full list of codes visit: https://kentuckystatepolice.org/crime-traffic-data/
- ASCF_CODE - the code that follows the guidelines of the American Security Council Foundation. For more details visit https://www.ascfusa.org/
- STATUTE - an alpha/numeric code representing a Kentucky Revised Statute. For a full list of Kentucky Revised Statute information visit: https://apps.legislature.ky.gov/law/statutes/
- CHARGE_DESC - the description of the type of charge for the citation
- UCR_CODE - the code that follows the guidelines of the Uniform Crime Report. For more details visit https://ucr.fbi.gov/
- UCR_DESC - the description of the UCR_CODE. For more details visit https://ucr.fbi.gov/
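Because CASE_NUMBER and INCIDENT_NUMBER differ only in formatting, the join key between these datasets can be normalized programmatically. A minimal sketch, assuming the 2-2-6 digit layout from the example above holds for all records:

```python
def case_to_incident(case_number):
    """Convert an undashed CASE_NUMBER (e.g. 8018013155) to the dashed
    INCIDENT_NUMBER format (80-18-013155) used by the other LMPD datasets.

    Assumes every case number follows the 2-2-6 digit layout shown in
    the example; anything else is rejected rather than guessed at.
    """
    s = str(case_number)
    if len(s) != 10 or not s.isdigit():
        raise ValueError(f"unexpected CASE_NUMBER format: {case_number!r}")
    return f"{s[:2]}-{s[2:4]}-{s[4:]}"

print(case_to_incident(8018013155))  # → 80-18-013155
```

Applying this to the CASE_NUMBER column before a join avoids string-mismatch gaps against the Crime Data, Firearms Intake, Hate Crimes, and Assaulted Officers datasets.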
This dataset provides an in-depth analysis of the 2023/24 Bundesliga season, capturing a wide array of team and player performance metrics across all matchdays. With over 50 individual CSV files, the collection encompasses stats on passing accuracy, goals scored, defensive actions, possession percentages, and player ratings. Whether you’re looking to analyze top scorers, assess team strengths, or dive into individual player contributions, this dataset offers a robust foundation for football analytics enthusiasts and professionals alike.
In addition to the core dataset, we have now added more files related to the league table, expanding the dataset with essential information on match outcomes, league standings, and advanced metrics.
The dataset contains the following types of data:
The file details provide an overview of each dataset, including a brief description of the data structure and potential uses for analysis. This helps users quickly navigate and understand the data available for analysis.
This dataset is ideal for statistical analysis, data visualization, and machine learning applications to uncover patterns in football performance.
This dataset opens up multiple avenues for data analysis and visualization. Here are some ideas:
This dataset is shared for non-commercial, educational, and personal analysis purposes only. It is not intended for redistribution, commercial use, or integration into other public datasets.
This dataset was sourced from FotMob, a proprietary provider of football statistics. All rights to the original data belong to FotMob. The dataset is a restructured collection of publicly available data and does not claim ownership over FotMob's data. Users should reference FotMob as the original source when using this dataset for research or analysis.
By using this dataset, you agree to the following:
- Non-commercial Use: This dataset is only for educational, analytical, and personal use. It may not be used for commercial purposes or integrated into other public datasets.
- Proper Attribution...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes processed citation data for datasets recorded in OpenAlex as of May 2022. It identifies self-citations to these datasets at the individual, institutional, and country level, and includes domain classifications of the citing works using the Science-Metrix classifications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
This document describes the data collection and datasets used in the manuscript "A Matter of Culture? Conceptualising and Investigating ‘Evidence Cultures’ within Research on Evidence-Informed Policymaking" [1].
Data Collection
To construct the citation network analysed in the manuscript, we first designed a series of queries to capture a large sample of literature exploring the relationship between evidence, policy, and culture from various perspectives. Our team of domain experts developed the following queries based on terms common in the literature. These queries search for the terms in the titles, abstracts, and associated keywords of Web of Science (WoS) indexed records (i.e. ‘TS=’). While the queries are separated below for ease of reading, they were combined into a single query via the OR operator in our search. Our search was conducted on the WoS Core Collection through the University of Edinburgh Library subscription on 29/11/2023, returning a total of 2,089 records.
TS = ((“cultures of evidence” OR “culture of evidence” OR “culture of knowledge” OR “cultures of knowledge” OR “research culture” OR “research cultures” OR “culture of research” OR “cultures of research” OR “epistemic culture” OR “epistemic cultures” OR “epistemic community” OR “epistemic communities” OR “epistemic infrastructure” OR “evaluation culture” OR “evaluation cultures” OR “culture of evaluation” OR “cultures of evaluation” OR “thought style” OR “thought styles” OR “thought collective” OR “thought collectives” OR “knowledge regime” OR “knowledge regimes” OR “knowledge system” OR “knowledge systems” OR “civic epistemology” OR “civic epistemologies”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “policy decision” OR “policy decisions” OR “political decision” OR “political decisions” OR “political decision making”))
OR
TS = ((“culture” OR “cultures”) AND ((“evidence-based” OR “evidence-informed” OR “evidence-led” OR “science-based” OR “science-informed” OR “science-led” OR “research-based” OR “research-informed” OR “evidence use” OR “evidence user” OR “evidence utilisation” OR “evidence utilization” OR “research use” OR “researcher user” OR “research utilisation” OR “research utilization” OR “research in” OR “evidence in” OR “science in”) NEAR/1 (“policymaking” OR “policy making” OR “policy maker” OR “policy makers”)))
OR
TS = ((“culture” OR “cultures”) AND (“scientific advice” OR “technical advice” OR “scientific expertise” OR “technical expertise” OR “expert advice”) AND (“policy” OR “policies” OR “policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers” OR “political decision” OR “political decisions” OR “political decision making”))
OR
TS = ((“culture” OR “cultures”) AND (“post-normal science” OR “trans-science” OR “transdisciplinary” OR “transdisiplinarity” OR “science-policy interface” OR “policy sciences” OR “sociology of knowledge” OR “sociology of science” OR “knowledge transfer” OR “knowledge translation” OR “knowledge broker” OR “implementation science” OR “risk society”) AND (“policymaking” OR “policy making” OR “policymaker” OR “policymakers” OR “policy maker” OR “policy makers”))
Citation Network Construction
All bibliographic metadata on these 2,089 records were downloaded in five batches in plain text and then merged in R. We then parsed these data into network-readable files. All unique reference strings are given unique node IDs. A node-attribute-list (‘CE_Node’) links identifying information of each document with its node ID, including authors, title, year of publication, journal, WoS ID, and WoS citations. An edge-list (‘CE_Edge’) records all citations from these documents to their bibliographies – with edges going from a citing document to the cited – using the relevant node IDs. These data were then cleaned by (a) matching DOIs for reference strings that differ but point to the same paper, and (b) manually merging obvious duplicates caused by referencing errors.
Our initial dataset consisted of 2,089 retrieved documents and 123,772 unretrieved cited documents (i.e. documents that were cited within the publications we retrieved but which were not one of these 2,089 documents). These documents were connected by 157,229 citation links, but ~87% of the documents in the network were cited just once. To focus on relevant literature, we filtered the network to include only documents with at least three citation or reference links. We further refined the dataset by focusing on the main connected component, resulting in 6,650 nodes and 29,198 edges. It is this dataset that we publish here, and it is this network that underpins Figure 1, Table 1, and the qualitative examination of documents (see manuscript for further details).
Our final network dataset contains 1,819 of the documents in our original query (~87% of the original retrieved records), and 4,831 documents not retrieved via our Web of Science search but cited by at least three of the retrieved documents. We then clustered this network by modularity maximization via the Leiden algorithm [2], detecting 14 clusters with Q=0.59. Citations to documents within the same cluster constitute ~77% of all citations in the network.
Citation Network Dataset Description
We include two network datasets: (i) ‘CE_Node.csv’, which contains 1,819 retrieved documents and 4,831 unretrieved referenced documents, for a total of 6,650 documents (nodes); (ii) ‘CE_Edge.csv’, which records citations (edges) between the documents (nodes), including a total of 29,198 citation links. These files can be used to construct a network with many different tools, but we have formatted them to be used in Gephi 0.10 [3].
‘CE_Node.csv’ is a comma-separated values file that contains two types of nodes:
i. Retrieved documents – these are documents captured by our query. These include full bibliographic metadata and reference lists.
ii. Non-retrieved documents – these are documents referenced by our retrieved documents but were not retrieved via our query. These only have data contained within their reference string (i.e. first author, journal or book title, year of publication, and possibly DOI).
The columns in the .csv refer to:
- Id, the node ID
- Label, the reference string of the document
- DOI, the DOI for the document, if available
- WOS_ID, WoS accession number
- Authors, named authors
- Title, title of document
- Document_type, variable indicating whether a document is an article, review, etc.
- Journal_book_title, journal of publication or title of book
- Publication_year, year of publication
- WOS_times_cited, total Core Collection citations as of 29/11/2023
- Indegree, number of within network citations to a given document
- Cluster, provides the cluster membership number as discussed in the manuscript (Figure 1)
‘CE_Edge.csv’ is a comma-separated values file that contains edges (citation links) between nodes (documents) (n=29,198). The columns refer to:
- Source, node ID of the citing document
- Target, node ID of the cited document
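The two files can be loaded into any graph tool, but the Indegree attribute can also be recomputed directly from the edge list. A small stdlib-only sketch using a toy edge list in the CE_Edge.csv layout (the node IDs below are invented):

```python
import csv
import io
from collections import Counter

# Toy edge list in the CE_Edge.csv layout: Source cites Target.
edges_csv = """Source,Target
1,2
1,3
3,2
4,2
"""

indegree = Counter()
for row in csv.DictReader(io.StringIO(edges_csv)):
    # Each edge is one within-network citation to the cited node.
    indegree[row["Target"]] += 1

# Node 2 is cited by three documents in this toy network.
print(indegree.most_common(1))  # → [('2', 3)]
```

Running the same loop over the full 29,198-edge file should reproduce the Indegree column in ‘CE_Node.csv’.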
Cluster Analysis
We qualitatively analyse a set of publications from seven of the largest clusters in our manuscript.
Beginning in 2023, certain agencies are required to submit one week of service data on a monthly basis to comply with FTA’s Weekly Reference reporting requirement on form WE-20. This data release will therefore present the limited set of key indicators reported by transit agencies on this form and will be updated each month with the most current data.
The resulting dataset provides data users with data shortly after the transit service was provided and consumed, over one month in advance of FTA’s routine update to the Monthly Ridership Time Series dataset. One use of this data is as a reference for understanding ridership patterns (e.g., to develop a full-month estimate ahead of when the data reflecting the given month of service is released by FTA at the end of the following month).
Generally, FTA has defined the reference week to be the second or third full week of the month. All sampled agencies will report data referencing the same reference week.
The form collects the following service data points, as described in the metadata below:
• Weekday 5-day UPT total for the reference week;
• Weekday 5-day VRM total for the reference week;
• Weekend 2-day UPT total for either the weekend preceding or following the reference week;
• Weekend 2-day VRM total for either the weekend preceding or following the reference week; and
• Vehicles Operated in Maximum Service (vanpool mode only) for the reference week.
FTA has also derived the change from the prior month for the same agency/mode/type of service/data point. Users should take caution when aggregating this measure and are encouraged to use the dataset export to measure service trends at a higher level (i.e., by reporter or nationally).
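One reason for that caution: averaging per-agency percent changes is not the same as measuring the change in aggregate ridership, because it weights a small agency the same as a large one. A toy illustration with made-up UPT figures for two hypothetical agencies:

```python
# Hypothetical weekday UPT totals for a reference week in two months.
prior = {"Agency A": 1_000_000, "Agency B": 10_000}
current = {"Agency A": 1_100_000, "Agency B": 5_000}

# Averaging per-agency percent changes weights the small agency equally...
per_agency = [current[a] / prior[a] - 1 for a in prior]
naive = sum(per_agency) / len(per_agency)

# ...whereas summing ridership first reflects the higher-level trend.
aggregate = sum(current.values()) / sum(prior.values()) - 1

print(f"mean of changes: {naive:+.1%}")    # → -20.0%
print(f"aggregate change: {aggregate:+.1%}")  # → +9.4%
```

This is why summing the underlying UPT/VRM values from the dataset export, and computing the change afterwards, is the safer way to measure reporter-level or national trends.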
For any questions regarding this dataset, please contact the NTD helpdesk at ntdhelp@dot.gov .
The National Software Reference Library (NSRL) collects software from various sources and incorporates file profiles computed from this software into a Reference Data Set (RDS) of information. The RDS can be used by law enforcement, government, and industry organizations to review files on a computer by matching file profiles in the RDS. This alleviates much of the effort involved in determining which files are important as evidence on computers or file systems that have been seized as part of criminal investigations. The RDS is a collection of digital signatures of known, traceable software applications. There are application hash values in the hash set which may be considered malicious, e.g. steganography tools and hacking scripts. There are no hash values of illicit data, e.g. child abuse images.
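In practice, matching against the RDS means hashing each file and looking the digest up in the published hash set (the RDS distributes SHA-1, MD5, and CRC32 values per file). A minimal sketch; the known_hashes set below is illustrative, not a real RDS extract:

```python
import hashlib

def file_sha1(path, chunk_size=1 << 20):
    """SHA-1 digest of a file, read in chunks so large files
    never need to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest().upper()

# In real use this set would be loaded from an RDS hash set;
# the single entry here is the SHA-1 of an empty file.
known_hashes = {"DA39A3EE5E6B4B0D3255BFEF95601890AFD80709"}

def is_known(path):
    """True if the file's profile matches a known, traceable application."""
    return file_sha1(path) in known_hashes
```

Files that match can then be filtered out of an investigation, leaving only unknown files for manual review.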
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for our study on the coverage of software engineering articles in open citation databases:
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Within the ESA-funded WorldCereal project we have built an open harmonized reference data repository at global extent for model training or product validation in support of land cover and crop type mapping. Data from 2017 onwards were collected from many different sources and then harmonized, annotated, and evaluated. These steps are explained in the harmonization protocol (10.5281/zenodo.7584463). This protocol also clarifies the naming convention of the shapefiles and the WorldCereal attributes (LC, CT, IRR, valtime and sampleID) that were added to the original data sets.
This publication includes those harmonized data sets of which the original data set was published under the CC-BY-SA license or a license similar to CC-BY-SA. See document "_In-situ-data-World-Cereal - license - CC-BY-SA.pdf" for an overview of the original data sets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the field of Computer Science, conference and workshop papers serve as important contributions, carrying substantial weight in research assessment processes compared to other disciplines. However, a considerable number of these papers are not assigned a Digital Object Identifier (DOI); hence their citations are not reported in widely used citation datasets like OpenCitations and Crossref, limiting citation analysis. While the Microsoft Academic Graph (MAG) previously addressed this issue by providing substantial coverage, its discontinuation has created a void in available data. BIP! NDR aims to alleviate this issue and enhance the research assessment processes within the field of Computer Science. To accomplish this, it leverages a workflow that identifies and retrieves Open Science papers lacking DOIs from the DBLP Corpus and, by performing text analysis, extracts citation information directly from their full text. The current version of the dataset contains more than 2.4M citations made by approximately 156K open access Computer Science conference or workshop papers that, according to DBLP, do not have a DOI.

File Structure: The dataset is formatted as a JSON Lines (JSONL) file (one JSON object per line) to facilitate file splitting and streaming. Each JSON object has three main fields:
- “_id”: a unique identifier
- “citing_paper”: the “dblp_id” of the citing paper
- “cited_papers”: an array containing one object per reference found in the text of the “citing_paper”; each object may contain the following fields:
  - “dblp_id”: the “dblp_id” of the cited paper (optional; required if a “doi” is not present)
  - “doi”: the DOI of the cited paper (optional; required if a “dblp_id” is not present)
  - “bibliographic_reference”: the raw citation string as it appears in the citing paper

Changes from previous version: Processed additional papers from a more recent version of DBLP.
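Records in this layout can be consumed line by line with any JSON parser. A minimal sketch that tallies how each reference in one record is identified; the identifiers in the sample record are invented:

```python
import json

# One record in the described JSONL layout (identifiers invented for illustration).
line = json.dumps({
    "_id": "rec-001",
    "citing_paper": "conf/example/Author23",
    "cited_papers": [
        {"doi": "10.1000/xyz123",
         "bibliographic_reference": "A. Author. Example Paper. 2020."},
        {"dblp_id": "conf/example/Other19",
         "bibliographic_reference": "B. Other. Sample Paper. 2019."},
    ],
})

record = json.loads(line)
# Per the schema, each cited paper carries a "doi", a "dblp_id", or both.
with_doi = sum(1 for c in record["cited_papers"] if "doi" in c)
with_dblp = sum(1 for c in record["cited_papers"] if "dblp_id" in c)
print(record["citing_paper"], with_doi, with_dblp)  # → conf/example/Author23 1 1
```

Because the file is one object per line, the same loop streams over the full 2.4M-citation dataset without loading it into memory.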
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery
This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists in this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals by consulting the respective schemas of the relations and identifying column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, the AirportCode implies AirportName, as each code should be unique for a given airport.
The file ground_truth.csv is a comma-separated file containing approximate functional dependencies. The column table names the relation we refer to, while lhs and rhs reference two columns of that relation where, semantically, we found that lhs implies rhs.
The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded from or included in the manual annotation, respectively. We excluded a candidate if there was no tuple in which both attributes had a value, or if the g3_prime value was too small.
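An approximate functional dependency is one that holds for most, but not all, tuples. A common way to quantify the violation is the classical g3 measure: the minimum fraction of tuples that must be removed so that lhs implies rhs exactly (the g3_prime value mentioned above is a variant of this idea; the sketch below implements only the classical g3, on invented example rows).

```python
from collections import Counter, defaultdict

def g3_error(rows, lhs, rhs):
    """Classical g3: the minimum fraction of tuples to remove so that
    the functional dependency lhs -> rhs holds exactly."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # Within each lhs group, keep the most frequent rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(rows)

# Invented toy rows echoing the AirportCode -> AirportName example.
rows = [
    {"AirportCode": "TXL", "AirportName": "Tegel"},
    {"AirportCode": "TXL", "AirportName": "Tegel"},
    {"AirportCode": "TXL", "AirportName": "Berlin-Tegel"},  # conflicting spelling
    {"AirportCode": "BER", "AirportName": "Brandenburg"},
]
print(g3_error(rows, "AirportCode", "AirportName"))  # 0.25
```

Here one of the four tuples (the conflicting spelling) must be removed for the dependency to hold exactly, hence an error of 0.25; an exact functional dependency has a g3 error of 0.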
Dataset References
This dataset has been obtained from Basketball-Reference.
player_data.csv
seasons_stats.csv
Over 50 performance stats are included. A list of the columns in this file and their descriptions is provided in the accompanying glossary.
The original version of the INTEGRAL Reference Catalog, as published in 2003, classified previously known bright X-ray and gamma-ray sources before the launch of INTEGRAL. These sources are, or have been at least once, brighter than ~1 milliCrab above 3 keV, and are expected to be detected by INTEGRAL. This catalog was used in the INTEGRAL Quick Look Analysis (QLA) to discover new sources or significantly variable sources. The authors compiled several published X-ray and gamma-ray catalogs and surveyed recent publications for new sources, resulting in 1121 sources in the original INTEGRAL Reference Catalog. In addition to the source positions, an approximate spectral model and expected flux were given for each source, and the expected INTEGRAL counting rates based on these parameters were derived. Assuming nominal instrument performance and at least ~10^5 seconds of exposure time for any part of the sky, INTEGRAL is expected to detect at least ~700 sources below 10 keV and ~400 sources above 20 keV over the mission lifetime. After the launch of INTEGRAL, a version of this catalog was placed on the ISDC website at http://www.isdc.unige.ch/integral/science/catalogue and has been updated periodically since then, for example by adding new sources discovered by INTEGRAL itself (indicated by the IGR prefix in the name). This HEASARC table is based on the web version maintained by the INTEGRAL Science Data Center at http://www.isdc.unige.ch/integral/catalog/latest/catalog.html and is updated automatically in the HEASARC database system within one week of any changes to that page. This is a service provided by NASA HEASARC.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These zipped folders contain all the data produced for the research "Uncovering the Citation Landscape: Exploring OpenCitations COCI, OpenCitations Meta, and ERIH-PLUS in Social Sciences and Humanities Journals": the results datasets (dataset_map_disciplines, dataset_no_SSH, dataset_SSH, erih_meta_with_disciplines and erih_meta_without_disciplines).
dataset_map_disciplines.zip contains CSV files with four columns ("id", "citing", "cited", "disciplines") giving information about publications stored in OpenCitations Meta (version 3, released in February 2023) that belong to SSH journals according to ERIH PLUS (version downloaded on 2023-04-27), specifying the disciplines associated with them and boolean values stating whether they cite or are cited, according to the OpenCitations COCI dataset (version 19, released in January 2023).
dataset_no_SSH.zip and dataset_SSH.zip contain CSV files with the same structure. Each dataset has four columns: "citing", "is_citing_SSH", "cited", and "is_cited_SSH". The "citing" and "cited" columns are filled with DOIs of publications stored in OpenCitations Meta that, according to OpenCitations COCI, are involved in a citation. The "is_citing_SSH" and "is_cited_SSH" columns contain boolean values: "True" if the corresponding publication is associated with an SSH (Social Sciences and Humanities) discipline, according to ERIH PLUS, and "False" otherwise. The two datasets are built from the two subsets obtained from the union of OpenCitations Meta and ERIH PLUS: dataset_SSH comes from erih_meta_with_disciplines and dataset_no_SSH from erih_meta_without_disciplines. erih_meta_with_disciplines.zip and erih_meta_without_disciplines.zip, as explained before, contain CSV files originating from ERIH PLUS and Meta. erih_meta_without_disciplines has just one column, "id", and contains the DOIs of all the publications in Meta that have no associated discipline, that is, that were not published in an SSH journal. erih_meta_with_disciplines derives from all the publications in Meta that have at least one linked discipline and has two columns: "id" and "erih_disciplines", the latter containing a string with all the disciplines linked to that publication, such as "History, Interdisciplinary research in the Humanities, Interdisciplinary research in the Social Sciences, Sociology".
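The relation between the erih_meta files and the SSH flag columns can be sketched as follows. This is a minimal illustration under the stated rule (a DOI is flagged "True" iff it appears in erih_meta_with_disciplines); all DOIs below are invented.

```python
# Invented DOIs standing in for the "id" column of erih_meta_with_disciplines.
ssh_dois = {"10.1000/ssh.1", "10.1000/ssh.2"}

# Invented (citing, cited) DOI pairs, as would come from COCI citations.
citations = [
    ("10.1000/ssh.1", "10.1000/other.9"),
    ("10.1000/other.9", "10.1000/ssh.2"),
]

# Rebuild the four-column structure of dataset_SSH / dataset_no_SSH rows:
# a publication is SSH iff its DOI occurs in erih_meta_with_disciplines.
rows = [
    {"citing": citing,
     "is_citing_SSH": citing in ssh_dois,
     "cited": cited,
     "is_cited_SSH": cited in ssh_dois}
    for citing, cited in citations
]
print(rows[0]["is_citing_SSH"], rows[0]["is_cited_SSH"])  # True False
```

A membership test against the set of SSH DOIs is all that is needed to derive both boolean columns for every citation pair.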
Software: https://doi.org/10.5281/zenodo.8326023
Data preprocessed: https://doi.org/10.5281/zenodo.7973159
Article: https://zenodo.org/record/8326044
DMP: https://zenodo.org/record/8324973
Protocol: https://doi.org/10.17504/protocols.io.n92ldpeenl5b/v5
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a complete dataset of linked bibliography and index data, partially disambiguated and augmented with references to external resources, extracted from the Brill’s archive in the field of Classics. Processed book identifiers are listed in a separate text file. Text fragments extracted from different books via this process are then parsed and compared using a string-based similarity metric to form clusters of bibliographic references to the same published work or (variants of) the same subjects discussed in these books. The entire set of references was then disambiguated using Google Books and Crossref APIs.
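The clustering step described above (grouping reference strings that point to the same published work via a string-based similarity metric) can be sketched as follows. The dataset does not specify which metric or clustering strategy was used; this sketch assumes a simple sequence-similarity ratio with a greedy single-pass grouping, and the reference strings are invented.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """True if the two strings are close under a sequence-similarity ratio.
    The 0.85 threshold is an assumption for illustration."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def cluster(strings, threshold=0.85):
    """Greedy single-pass clustering: each string joins the first cluster
    whose representative (first member) it resembles closely enough."""
    clusters = []
    for s in strings:
        for c in clusters:
            if similar(s, c[0], threshold):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

# Invented reference strings: two variants of the same work, plus one other.
refs = [
    "Homer, Iliad, trans. Lattimore (Chicago 1951)",
    "Homer, Iliad, transl. Lattimore, Chicago 1951",
    "Vergil, Aeneid, ed. Mynors (Oxford 1969)",
]
print(len(cluster(refs)))  # 2
```

The two Homer variants collapse into one cluster while the Vergil entry stays separate; cluster representatives can then be sent to the Google Books and Crossref APIs for disambiguation, as described above.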
Paper about extraction pipeline