Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study estimates the effect of data sharing on the citations of academic articles, using journal policies as a natural experiment. We begin by examining 17 high-impact journals that have adopted the requirement that data from published articles be publicly posted. We match these 17 journals to 13 journals without policy changes and find that empirical articles published just before their change in editorial policy have citation rates with no statistically significant difference from those published shortly after the shift. We then ask whether this null result stems from poor compliance with data sharing policies, and use the data sharing policy changes as instrumental variables to examine more closely two leading journals in economics and political science with relatively strong enforcement of new data policies. We find that articles that make their data available receive 97 additional citations (estimate standard error of 34). We conclude that: a) authors who share data may be rewarded eventually with additional scholarly citations, and b) data-posting policies alone do not increase the impact of articles published in a journal unless those policies are enforced.
Data Access: The data in this research collection may be used for research purposes only. Portions of the data are copyrighted and have commercial value as data, so take care to use them only for research. Because of these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.
Citation
Please cite our work as
@article{shahi2021overview, title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection}, author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas}, journal={Working Notes of CLEF}, year={2021} }
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.
Subtask 3A: Multi-class fake news detection of news articles (English)
Subtask 3A frames fake news detection as a four-class classification problem. The training data will be released in batches and will comprise roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Subtask 3B: Topical Domain Classification of News Articles (English) Fact-checkers require background expertise to identify the truthfulness of an article. The categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article (English). This is a classification problem. The task is to categorise fake news articles into six topical categories like health, election, crime, climate, election, education. This task will be offered for a subset of the data of Subtask 3A.
Input Data
The data will be provided with the columns Id, title, text, rating, and domain; the columns are described as follows:
Task 3a
Task 3b
Output data format
Task 3a
Sample File
public_id, predicted_rating
1, false
2, true
Task 3b
Sample file
public_id, predicted_domain
1, health
2, crime
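The sample files above can be produced with a few lines of Python. This is a minimal sketch, not an official submission tool: the helper name `format_submission` is hypothetical, and the comma-plus-space separator simply mirrors the samples, since the text does not state whether that whitespace is required.

```python
from typing import Iterable, Tuple


def format_submission(predictions: Iterable[Tuple[int, str]], label_column: str) -> str:
    """Render (public_id, label) pairs in the layout of the sample files.

    Assumption: fields are separated by a comma and a space, as in the samples.
    """
    lines = [f"public_id, {label_column}"]
    for public_id, label in predictions:
        lines.append(f"{public_id}, {label}")
    return "\n".join(lines) + "\n"


# Task 3a uses the predicted_rating column; Task 3b uses predicted_domain.
task_3a = format_submission([(1, "false"), (2, "true")], "predicted_rating")
task_3b = format_submission([(1, "health"), (2, "crime")], "predicted_domain")
print(task_3a, end="")
print(task_3b, end="")
```

Writing the returned string to a file yields one run; the labels shown are the ones from the samples, not real predictions.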
Additional data for Training
To train their models, participants may use additional data in a similar format; some datasets are available on the web. We do not provide ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:
IMPORTANT!
Evaluation Metrics
This task is evaluated as a classification task. We will use the F1-macro measure for the ranking of teams. There is a limit of 5 runs (total and not per day), and only one person from a team is allowed to submit runs.
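For reference, the macro-averaged F1 score is the unweighted mean of the per-class F1 scores. The sketch below computes it in plain Python. It assumes that the set of classes is the union of the labels seen in the gold and predicted lists and that a class with no true positives scores 0; the organisers' scorer may handle these edge cases differently, and the example labels are illustrative only.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


gold = ["true", "false", "false", "other"]
predicted = ["true", "false", "true", "other"]
print(round(macro_f1(gold, predicted), 4))
```

Because every class contributes equally, a rare class that is predicted poorly lowers the score as much as a frequent one.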
Submission Link: https://competitions.codalab.org/competitions/31238
Related Work
Input tables used to generate the output, i.e., technology ranking in the manuscript.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Version: 5
Authors: Carlota Balsa-Sánchez, Vanesa Loureiro
Date of data collection: 2023/09/05
General description: Datasets can be published in line with the FAIR principles by publishing a data paper (or software paper) in a data journal or in a standard academic journal. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:
Relationship between files: Both files contain the same information. Two formats are offered to improve reuse.
Type of version of the dataset: final processed version
Versions of the files: 5th version - Information updated: number of journals, URL, document types associated with a specific journal.
Version: 4
Authors: Carlota Balsa-Sánchez, Vanesa Loureiro
Date of data collection: 2022/12/15
General description: Datasets can be published in line with the FAIR principles by publishing a data paper (or software paper) in a data journal or in a standard academic journal. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:
Relationship between files: Both files contain the same information. Two formats are offered to improve reuse.
Type of version of the dataset: final processed version
Versions of the files: 4th version - Information updated: number of journals, URL, document types associated with a specific journal, normalization of publishers, and simplification of document types - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR), Scopus and Web of Science (WOS), Journal Master List.
Version: 3
Authors: Carlota Balsa-Sánchez, Vanesa Loureiro
Date of data collection: 2022/10/28
General description: Datasets can be published in line with the FAIR principles by publishing a data paper (or software paper) in a data journal or in a standard academic journal. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:
Relationship between files: Both files contain the same information. Two formats are offered to improve reuse.
Type of version of the dataset: final processed version
Versions of the files: 3rd version - Information updated: number of journals, URL, document types associated with a specific journal, normalization of publishers, and simplification of document types - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR).
Erratum - Data articles in journals Version 3:
Botanical Studies -- ISSN 1999-3110 -- JCR (JIF) Q2
Data -- ISSN 2306-5729 -- JCR (JIF) n/a
Data in Brief -- ISSN 2352-3409 -- JCR (JIF) n/a
Version: 2
Author: Francisco Rubio, Universitat Politècnica de València.
Date of data collection: 2020/06/23
General description: Datasets can be published in line with the FAIR principles by publishing a data paper (or software paper) in a data journal or in a standard academic journal. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:
Relationship between files: Both files contain the same information. Two formats are offered to improve reuse.
Type of version of the dataset: final processed version
Versions of the files: 2nd version - Information updated: number of journals, URL, document types associated with a specific journal, normalization of publishers, and simplification of document types - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Scimago Journal and Country Rank (SJR).
Total size: 32 KB
Version 1: Description
This dataset contains a list of journals that publish data articles, code, software articles and database articles.
The search strategy in DOAJ and Ulrichsweb was to search for the word "data" in journal titles.
Acknowledgements: Xaquín Lores Torres for his invaluable help in preparing this dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Frequency of reported types of studies and use of descriptive and inferential statistics (n = 216).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This graph shows the proportion of all articles by discipline that share data, making it available to the reader via any indicated mechanism, along with associated confidence intervals. See Tables 9 and 11 for numeric values.
This dataset was created by Sandeep Gautam
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scientific and technical journal articles in India were reported at 207390 in 2022, according to the World Bank collection of development indicators, compiled from officially recognized sources. Actual values, historical data, forecasts, and projections for scientific and technical journal articles in India were sourced from the World Bank in March 2025.
This dataset lists the history of publications featured on data.gouv.nc since 2019.
Data sets used to prepare illustrative figures for the overview article “Multiscale Modeling of Background Ozone”
Overview
The CMAQ model output datasets used to create illustrative figures for this overview article were generated by scientists in EPA/ORD/CEMM and EPA/OAR/OAQPS.
The EPA/ORD/CEMM-generated dataset consisted of hourly CMAQ output from two simulations. The first simulation was performed for July 1 – 31 over a 12 km modeling domain covering the Western U.S. The simulation was configured with the Integrated Source Apportionment Method (ISAM) to estimate the contributions from 9 source categories to modeled ozone. ISAM source contributions for July 17 – 31, averaged over all grid cells located in Colorado, were used to generate the illustrative pie chart in the overview article. The second simulation was performed for October 1, 2013 – August 31, 2014 over a 108 km modeling domain covering the northern hemisphere. This simulation was also configured with ISAM to estimate the contributions of non-US anthropogenic sources, natural sources, stratospheric ozone, and other sources to ozone concentrations. Ozone ISAM results from this simulation were extracted along a boundary curtain of the 12 km modeling domain specified over the Western U.S. for the time period January 1, 2014 – July 31, 2014 and used to generate the illustrative time-height cross-sections in the overview article.
The EPA/OAR/OAQPS-generated dataset consisted of hourly gridded CMAQ output for surface ozone concentrations for the year 2016. The CMAQ simulations were performed over the northern hemisphere at a horizontal resolution of 108 km. NO2 and O3 data for July 2016 were extracted from these simulations to generate the vertically integrated column densities shown in the illustrative comparison to satellite-derived column densities.
CMAQ Model Data
The data from the CMAQ model simulations used in this research effort are very large (several terabytes) and cannot be uploaded to ScienceHub due to size restrictions. The model simulations are stored on the /asm archival system accessible through the atmos high-performance computing (HPC) system. Due to data management policies, files on /asm are subject to expiry depending on the template of the project. Files not requested for extension after the expiry date are deleted permanently from the system. The format of the files used in this analysis and listed below is ioapi/netcdf. Documentation of this format, including definitions of the geographical projection attributes contained in the file headers, is available at https://www.cmascenter.org/ioapi/. Documentation on the CMAQ model, including a description of the output file format and output model species, can be found in the CMAQ documentation on the CMAQ GitHub site at https://github.com/USEPA/CMAQ.
This dataset is associated with the following publication: Hogrefe, C., B. Henderson, G. Tonnesen, R. Mathur, and R. Matichuk. Multiscale Modeling of Background Ozone: Research Needs to Inform and Improve Air Quality Management. EM Magazine. Air and Waste Management Association, Pittsburgh, PA, USA, 1-6, (2020).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prevalence of journal-specific features (peer-reviewed journal articles only).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data collection served to create the CONTRAST-IT corpus.
CONTRAST-IT is a medium-size multilingual corpus (including ca. 1.5 million words) based on a comparable collection of articles published in online daily newspapers. The articles are written in five languages: Italian (from Italy), French (from France), Spanish (from Spain), English (from the UK), and German (from Germany).
This Spanish dataset includes 300'000 words drawn from 476 articles. All the texts collected are authentic, full-length electronic journalistic articles, chosen based on their high representativeness of contemporary Spanish newspaper language. The articles were published in 2011 and 2012 in two electronic daily newspapers (elpais.com and elmundo.es).
The corpus and data collection were used in two Swiss National Science Foundation Projects:
For details on the corpus and data collection, see:
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Reservoir operation rules:
(1) continuous flood discharge with ecological priority
(2) pulse flood discharge with ecological priority
(3) pulse flood discharge with equal weight of ecology and power generation
(4) pulse flood discharge with power generation priority
This dataset supports the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States" (DOI:10.1016/j.rse.2020.112013). The data release allows users to replicate, test, or further explore results. The dataset consists of 4 separate items based on the analysis approach used in the original publication:
1) The 'Phenocam' dataset uses images from a phenocam in a pinyon juniper ecosystem in Grand Canyon National Park to determine phenological patterns of multiple plant species. The 'Phenocam' dataset consists of scripts and tabular data developed while performing analyses and includes the final NDVI values for all areas of interest (AOIs) described in the associated publication.
2) The 'SolarSensorAnalysis' dataset uses downloaded tabular MODIS data to explore relationships between NDVI and multiple solar and sensor angles. The 'SolarSensorAnalysis' dataset consists of download and analysis scripts in Google Earth Engine and R. The source MODIS data used in the analysis are too large to include but are provided through MODIS providers and can be accessed through Google Earth Engine using the included script. A csv file includes solar and sensor angle information for the MODIS pixel closest to the phenocam as well as for a sample of 100 randomly selected MODIS pixels within the GRCA-PJ ecosystem.
3) The 'WinterPeakExtent' dataset includes final geotiffs showing the temporal frequency extent and associated vegetation physiognomic types experiencing winter NDVI peaks in the western US.
4) The 'SensorComparison' dataset contains the NDVI time series at the phenocam location from 4 other satellites as well as the code used to download these data.
These data were used to generate the figures included in the following manuscript: Fahey, et al. (2017) "A framework for expanding aqueous chemistry in the Community Multiscale Air Quality (CMAQ) model version 5.1". Geosci. Mod. Dev. This dataset is associated with the following publication: Fahey, K., A. Carlton, H. Pye, J. Baek, B. Hutzell, C. Stanier, K. Baker, W. Appel, M. Jaoui, and J. Offenberg. A framework for expanding aqueous chemistry in the Community Multiscale Air Quality (CMAQ) model version 5.1. Geoscientific Model Development. Copernicus Publications, Katlenburg-Lindau, GERMANY, 10: 1587-1605, (2017).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Seasonal mean (DJF and JJA) temperature, water vapor and cloud fraction data for different experiments from WACCM4.
https://lindat.mff.cuni.cz/repository/xmlui/page/szn-dataset-licence
The MLASK corpus consists of 41,243 multi-modal documents – video-based news articles in the Czech language – collected from Novinky.cz (https://www.novinky.cz/) and Seznam Zprávy (https://www.seznamzpravy.cz/). It was introduced in "MLASK: Multimodal Summarization of Video-based News Articles" (Krubiński & Pecina, EACL 2023). The articles' publication dates range from September 2016 to February 2022. The intended use case of the dataset is to model the task of multimodal summarization with multimodal output: based on a pair of a textual article and a short video, a textual summary is generated, and a single frame from the video is chosen as a pictorial summary.
Each document consists of the following:
- a .mp4 video
- a single image (cover picture)
- the article's text
- the article's summary
- the article's title
- the article's publication date
All of the videos are re-sampled to 25 fps and resized to the same resolution of 1280x720p. The maximum length of the video is 5 minutes, and the shortest one is 7 seconds. The average video duration is 86 seconds. The quantitative statistics of the lengths of titles, abstracts, and full texts (measured in the number of tokens) are below. Q1 and Q3 denote the first and third quartiles, respectively.
- / mean / Q1 / Median / Q3
Title / 11.16 ± 2.78 / 9 / 11 / 13
Abstract / 33.40 ± 13.86 / 22 / 32 / 43
Article / 276.96 ± 191.74 / 154 / 231 / 343
The proposed training/dev/test split follows the chronological ordering based on publication date. We use the articles published in the first half (Jan-Jun) of 2021 for validation (2,482 instances) and the ones published in the second half (Jul-Dec) of 2021 and the beginning (Jan-Feb) of 2022 for testing (2,652 instances). The remaining data is used for training (36,109 instances).
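The chronological split described above can be sketched as a small date-based rule. This is an illustration only, not the authors' code: the function name `assign_split` and the constants are hypothetical, and the sketch assumes each article carries a calendar publication date and that no article is dated after February 2022, as the stated date range implies.

```python
from datetime import date

# Boundaries taken from the description: validation covers Jan-Jun 2021,
# testing covers Jul 2021 through Feb 2022, and everything earlier is training.
VALIDATION_START = date(2021, 1, 1)
TEST_START = date(2021, 7, 1)

# Split sizes as reported in the description.
REPORTED_SIZES = {"train": 36109, "validation": 2482, "test": 2652}


def assign_split(published: date) -> str:
    """Map a publication date to its split under the chronological rule."""
    if published >= TEST_START:
        return "test"
    if published >= VALIDATION_START:
        return "validation"
    return "train"


print(assign_split(date(2019, 5, 20)))
print(assign_split(date(2021, 3, 1)))
print(assign_split(date(2022, 2, 14)))
print(sum(REPORTED_SIZES.values()))
```

The reported split sizes add up to the 41,243 documents stated for the corpus.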
The textual data is shared as a single .tsv file. The visual data (video+image) is shared as a single archive for validation and test splits, and the one from the training split is partitioned based on the publication date.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
724 global import shipment records for "Article Made", with prices, volumes, and current buyer-supplier relationships, based on an actual global export trade database.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data related to the research article: "Towards mouse genetic-specific RNA-sequencing read mapping".
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores the synthetic data, programs and models developed for the research entitled "PRV-FCM: an extension of fuzzy cognitive maps for prescriptive modeling".