4 datasets found
  1. Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims"

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 22, 2022
    Cite
    Ivan Srba; Branislav Pecher; Matus Tomlein; Robert Moro; Elena Stefancova; Jakub Simko; Maria Bielikova (2022). Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" [Dataset]. http://doi.org/10.5281/zenodo.5996864
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivan Srba; Branislav Pecher; Matus Tomlein; Robert Moro; Elena Stefancova; Jakub Simko; Maria Bielikova
    Description

    Overview

    This dataset of medical misinformation was collected and is published by the Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains the full texts of the articles, their original source URLs, and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of an annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings, published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; each mapping consists of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim, or provides both sides of the argument).

    The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it also enables a range of other misinformation-related tasks, such as misinformation characterisation or analyses of misinformation spreading.

    Its novelty and our main contributions lie in (1) a focus on medical news articles and blog posts, as opposed to social media posts or political discussions; (2) multiple modalities (besides the full texts of the articles, there are also images and videos), enabling research on multimodal approaches; (3) mappings of the articles to the fact-checked claims (with manual as well as predicted labels); and (4) source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.

    The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims", accepted and presented at the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).

    The accompanying GitHub repository provides a small static sample of the dataset and a descriptive analysis of the dataset in the form of Jupyter notebooks.

    Options to access the dataset

    There are two ways to access the dataset:

    1. A static dump of the dataset, available in CSV format
    2. A continuously updated dataset, available via REST API

    To obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
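    For illustration, once access is granted the REST API can be consumed with any standard HTTP client. Below is a minimal Python sketch; the base URL and the bearer-token authentication are placeholders (assumptions, not the service's documented values), and the endpoint names are assumed to mirror the CSV files listed under "Dataset structure" below.

    import requests

    # Hypothetical base URL and token; the real values come with granted access.
    BASE_URL = "https://api.example.org/monant/v1"
    TOKEN = "YOUR_ACCESS_TOKEN"

    def fetch(endpoint, params=None):
        """Fetch records from one dataset endpoint (e.g. 'articles')."""
        response = requests.get(
            f"{BASE_URL}/{endpoint}",
            headers={"Authorization": f"Bearer {TOKEN}"},
            params=params,
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

    # Endpoint names assumed to mirror the CSV file names (sources, articles, ...).
    articles = fetch("articles", params={"page": 1})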

    References

    If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:

    @inproceedings{SrbaMonantPlatform,
      author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
      booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
      pages = {1--7},
      title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
      year = {2019}
    }
    @inproceedings{SrbaMonantMedicalDataset,
      author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
      booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
      numpages = {11},
      title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
      year = {2022},
      doi = {10.1145/3477495.3531726},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3477495.3531726},
    }
    


    Dataset creation process

    In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blog posts from news/blog sites, as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, the Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
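    As a sketch of what a generic RSS data provider of this kind does, the feedparser library can map feed entries into unified article records (an illustrative example, not the Monant platform's actual implementation; the feed URL is a placeholder):

    import feedparser

    # Placeholder feed URL; any news/blog RSS feed would do.
    feed = feedparser.parse("https://example.com/rss.xml")

    for entry in feed.entries:
        # Map feed fields into a unified article record, as a general
        # RSS parser in a monitoring platform might do.
        article = {
            "title": entry.get("title"),
            "url": entry.get("link"),
            "published": entry.get("published"),
            "summary": entry.get("summary"),
        }
        print(article["title"], article["url"])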


    Ethical considerations

    The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains the identities of the articles' authors if they were stated in the original source; we kept this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.

    The main identified ethical issue related to the presented dataset lies in the risk of mislabelling an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require agreement of at least two independent annotators to assign a claim presence or article stance label to an article (see the sketch below). It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
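    A minimal sketch of that agreement rule, for illustration only (the paper's labelling methodology is authoritative; the function name and label strings are hypothetical):

    from collections import Counter

    def consensus_label(annotator_labels, min_agreement=2):
        """Return a label only if at least `min_agreement` annotators
        independently assigned the same value; otherwise None."""
        if not annotator_labels:
            return None
        label, count = Counter(annotator_labels).most_common(1)[0]
        return label if count >= min_agreement else None

    # Example: two of three annotators say the article supports the claim.
    print(consensus_label(["supports", "supports", "denies"]))  # -> "supports"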

    As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.

    Lastly, the dataset also contains automatically predicted labels of claim presence and article stance produced by our baselines, described in the next section. These methods have their limitations and achieve only a certain accuracy, as reported in the paper. This should be taken into account when interpreting them.


    Reporting mistakes in the dataset

    The way to report considerable mistakes in the raw collected data or in the manual annotations is to create a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.


    Dataset structure

    Raw data

    First, the dataset contains so-called raw data, i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as it appears on the original websites. Raw data consist of articles from news sites and blogs (e.g. naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g. snopes.com). In addition, the dataset contains feedback (numbers of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.

    Raw data are contained in these CSV files (and corresponding REST API endpoints):

    • sources.csv
    • articles.csv
    • article_media.csv
    • article_authors.csv
    • discussion_posts.csv
    • discussion_post_authors.csv
    • fact_checking_articles.csv
    • fact_checking_article_media.csv
    • claims.csv
    • feedback_facebook.csv

    Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
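    As an illustration, the raw CSV files can be combined with standard tooling. A minimal Python sketch, assuming each table carries an `id` primary key and that articles reference their source via a `source_id` foreign key (assumed column names; verify against the actual dump):

    import pandas as pd

    # Load two of the raw-data tables from the static CSV dump.
    sources = pd.read_csv("sources.csv")
    articles = pd.read_csv("articles.csv")

    # Assumed join keys: articles.source_id -> sources.id.
    articles_with_sources = articles.merge(
        sources, left_on="source_id", right_on="id",
        suffixes=("", "_source"),
    )
    print(articles_with_sources.head())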


    Annotations

    Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., an article or a source). Relation annotations describe a relation between two such entities.

    Each annotation is described by the following attributes:

    1. the category of the annotation (`annotation_category`). Possible values: label (the annotation corresponds to ground truth determined by human experts) and prediction (the annotation was created by means of an AI method).
    2. the type of the annotation (`annotation_type_id`). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
    3. the method which created the annotation (`method_id`). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
    4. its value (`value`). The value is stored in JSON format and its structure differs according to the particular annotation type.


    At the same time, annotations are associated with a particular object identified by:

    1. entity type (parameter `entity_type` in case of entity annotations, or `source_entity_type` and `target_entity_type` in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
    2. entity id (parameter `entity_id` in case of entity annotations, or `source_entity_id` and `target_entity_id` in case of relation annotations).
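
    A small sketch of how such annotation records could be interpreted, assuming an annotations CSV with the four attribute columns above plus the entity parameters (the file name and exact layout are hypothetical; verify against the actual dump):

    import json
    import pandas as pd

    # Hypothetical file name; entity annotations carry entity_type/entity_id.
    annotations = pd.read_csv("entity_annotations.csv")

    # The 'value' attribute is stored as JSON whose structure depends on the
    # annotation type, so parse it per row.
    annotations["value"] = annotations["value"].apply(json.loads)

    # Keep only human-expert ground truth ("label") attached to articles.
    labels = annotations[
        (annotations["annotation_category"] == "label")
        & (annotations["entity_type"] == "articles")
    ]
    print(labels.head())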

  2. Active Hurricanes, Cyclones and Typhoons

    • resilience.climate.gov
    • pacificgeoportal.com
    • +25 more
    Updated Aug 16, 2022
    + more versions
    Cite
    Esri (2022). Active Hurricanes, Cyclones and Typhoons [Dataset]. https://resilience.climate.gov/maps/248e7b5827a34b248647afb012c58787
    Dataset updated
    Aug 16, 2022
    Dataset authored and provided by
    Esri
    Area covered
    Earth
    Description

    Hurricane tracks and positions provide information on where the storm has been, where it is currently located, and where it is predicted to go. Each storm location is depicted by the sustained wind speed, according to the Saffir-Simpson Scale. It should be noted that the Saffir-Simpson Scale only applies to hurricanes in the Atlantic and Eastern Pacific basins; however, all storms are still symbolized using that classification for consistency.

    Data Source
    This data is provided by the NOAA National Hurricane Center (NHC) for the Central+East Pacific and Atlantic basins, and by the Joint Typhoon Warning Center (JTWC) for the West+Central Pacific and Indian basins. For more disaster-related live feeds visit the Disaster Web Maps & Feeds ArcGIS Online Group.

    Sample Data
    See the Sample Layer Item for sample data during an inactive Hurricane Season.

    Update Frequency
    The Aggregated Live Feeds methodology checks the source for updates every 15 minutes. Advisories for Atlantic tropical cyclones are normally issued every six hours at 5:00 AM EDT, 11:00 AM EDT, 5:00 PM EDT, and 11:00 PM EDT (or 4:00 AM EST, 10:00 AM EST, 4:00 PM EST, and 10:00 PM EST). Public advisories for Eastern Pacific tropical cyclones are normally issued every six hours at 2:00 AM PDT, 8:00 AM PDT, 2:00 PM PDT, and 8:00 PM PDT (or 1:00 AM PST, 7:00 AM PST, 1:00 PM PST, and 7:00 PM PST). Intermediate public advisories may be issued every 3 hours when coastal watches or warnings are in effect, and every 2 hours when coastal watches or warnings are in effect and land-based radars have identified a reliable storm center. Additionally, special public advisories may be issued at any time due to significant changes in warnings or in a cyclone. For the NHC data source you can subscribe to RSS feeds. North Pacific and North Indian Ocean tropical cyclone warnings are updated every 6 hours, and South Indian and South Pacific Ocean tropical cyclone warnings are routinely updated every 12 hours. Times are set to Zulu/UTC.

    Scale/Resolution
    The horizontal accuracy of these datasets is not stated, but it is important to remember that tropical cyclone track forecasts are subject to error, and that the effects of a tropical cyclone can span many hundreds of miles from the center.

    Area Covered
    World
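    For reference, the Saffir-Simpson category can be derived from the 1-minute sustained wind speed. A small sketch using the scale's public thresholds (no assumption is made here about the feed's actual field names):

    def saffir_simpson_category(sustained_wind_mph):
        """Map 1-minute sustained wind speed (mph) to a Saffir-Simpson
        hurricane category; below 74 mph is not hurricane-strength."""
        if sustained_wind_mph >= 157:
            return 5
        if sustained_wind_mph >= 130:
            return 4
        if sustained_wind_mph >= 111:
            return 3
        if sustained_wind_mph >= 96:
            return 2
        if sustained_wind_mph >= 74:
            return 1
        return 0  # tropical storm or depression

    print(saffir_simpson_category(120))  # -> 3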

    Glossary

    Forecast location: Represents the official NHC forecast locations for the center of a tropical cyclone. Forecast center positions are given for projections valid 12, 24, 36, 48, 72, 96, and 120 hours after the forecast's nominal initial time; forecast points from the JTWC are valid 12, 24, 36, 48 and 72 hours after the forecast's initial time. Click here for more information.

    Forecast track: This product aids in the visualization of an NHC official track forecast; the forecast points are connected by a red line. The track lines are not a forecast product and, as such, should not be interpreted as representing a specific forecast for the location of a tropical cyclone in between official forecast points. It is also important to remember that tropical cyclone track forecasts are subject to error, and that the effects of a tropical cyclone can span many hundreds of miles from the center. Click here for more information.

    The Cone of Uncertainty: Cyclone paths are hard to predict with absolute certainty, especially days in advance.

    The cone represents the probable track of the center of a tropical cyclone and is formed by enclosing the area swept out by a set of circles along the forecast track (at 12, 24, 36 hours, etc.). The size of each circle is scaled so that two-thirds of the historical official forecast errors over a 5-year sample fall within the circle. Based on forecasts over the previous 5 years, the entire track of a tropical cyclone can be expected to remain within the cone roughly 60-70% of the time. It is important to note that the area affected by a tropical cyclone can extend well beyond the confines of the cone enclosing the most likely track area of the center. Click here for more information. Now includes 'Danger Area' polygons from JTWC, detailing the US Navy Ship Avoidance Area when wind speeds exceed 34 knots.

    Coastal Watch/Warning: Coastal areas are placed under watches and warnings depending on the proximity and intensity of the approaching storm.

    • Tropical Storm Watch is issued when a tropical cyclone containing winds of 34 to 63 knots (39 to 73 mph) poses a possible threat, generally within 48 hours. These winds may be accompanied by storm surge, coastal flooding, and/or river flooding. The watch does not mean that tropical storm conditions will occur; it only means that these conditions are possible.
    • Tropical Storm Warning is issued when sustained winds of 34 to 63 knots (39 to 73 mph) associated with a tropical cyclone are expected in 36 hours or less. These winds may be accompanied by storm surge, coastal flooding, and/or river flooding.
    • Hurricane Watch is issued when a tropical cyclone containing winds of 64 knots (74 mph) or higher poses a possible threat, generally within 48 hours. These winds may be accompanied by storm surge, coastal flooding, and/or river flooding. The watch does not mean that hurricane conditions will occur; it only means that these conditions are possible.
    • Hurricane Warning is issued when sustained winds of 64 knots (74 mph) or higher associated with a tropical cyclone are expected in 36 hours or less. These winds may be accompanied by storm surge, coastal flooding, and/or river flooding. A hurricane warning can remain in effect when dangerously high water, or a combination of dangerously high water and exceptionally high waves, continues, even though winds may be less than hurricane force.

    Revisions
    • Mar 13, 2025: Altered 'Forecast Error Cone' layer to include 'Danger Area' with updated symbology.
    • Nov 20, 2023: Added Event Label to 'Forecast Position' layer, showing arrival time and wind speed localized to the user's location.
    • Mar 27, 2022: Added UID, Max_SS, Max_Wind, Max_Gust, and Max_Label fields to the ForecastErrorCone layer.

    This map is provided for informational purposes and is not monitored 24/7 for accuracy and currency. Always refer to NOAA or JTWC sources for official guidance. If you would like to be alerted to potential issues or simply see when this Service will update next, please visit our Live Feed Status Page!
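    Live feeds like this one are typically exposed as ArcGIS feature services that can be queried over REST. A hedged Python sketch (the service URL below is a placeholder, not this item's actual endpoint; locate the real FeatureServer URL on the item page):

    import requests

    # Placeholder URL; substitute the item's real FeatureServer layer endpoint.
    SERVICE = "https://services.arcgis.com/XXXX/arcgis/rest/services/Active_Hurricanes/FeatureServer/0"

    params = {
        "where": "1=1",    # all current storm positions
        "outFields": "*",  # every attribute field
        "f": "json",       # JSON response
    }
    resp = requests.get(f"{SERVICE}/query", params=params, timeout=30)
    resp.raise_for_status()
    for feature in resp.json().get("features", []):
        print(feature["attributes"])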

  3. Data from: Dataset of Pathloss and ToA Radio Maps with Localization Application

    • ieee-dataport.org
    Updated Jan 27, 2025
    Cite
    Cagkan Yapar (2025). Dataset of Pathloss and ToA Radio Maps with Localization Application [Dataset]. https://ieee-dataport.org/documents/dataset-pathloss-and-toa-radio-maps-localization-application
    Dataset updated
    Jan 27, 2025
    Authors
    Cagkan Yapar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains pathloss and ToA radio maps generated by the ray-tracing software WinProp from Altair. The dataset allows one to develop pathloss radio map estimation methods and to test the accuracy of RSS- or ToA-based localization algorithms in realistic urban scenarios.
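    As an example of the localization task this dataset supports, time-of-arrival measurements can be turned into a position estimate by nonlinear least squares (a generic textbook approach, not the dataset authors' method; function and variable names are illustrative):

    import numpy as np
    from scipy.optimize import least_squares

    C = 299_792_458.0  # speed of light, m/s

    def toa_localize(anchors, toas, x0=None):
        """Estimate a 2-D transmitter position from ToA measurements (seconds)
        at known anchor positions, via nonlinear least squares."""
        anchors = np.asarray(anchors, dtype=float)
        ranges = np.asarray(toas, dtype=float) * C  # ToA -> distance

        def residuals(p):
            return np.linalg.norm(anchors - p, axis=1) - ranges

        if x0 is None:
            x0 = anchors.mean(axis=0)  # start from the anchors' centroid
        return least_squares(residuals, x0).x

    # Toy example: three base stations, true transmitter at (40, 60).
    anchors = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
    true_pos = np.array([40.0, 60.0])
    toas = [np.linalg.norm(np.array(a) - true_pos) / C for a in anchors]
    print(toa_localize(anchors, toas))  # ~ [40. 60.]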

  4. Ontario Watershed Boundaries (OWB)

    • namp-repository-gbaybiosphere.hub.arcgis.com
    • geohub.lio.gov.on.ca
    • +1 more
    Updated Mar 31, 2020
    + more versions
    Cite
    Ontario Ministry of Natural Resources and Forestry (2020). Ontario Watershed Boundaries (OWB) [Dataset]. https://namp-repository-gbaybiosphere.hub.arcgis.com/maps/53a1c537b320404087c54ef09700a7db
    Dataset updated
    Mar 31, 2020
    Dataset authored and provided by
    Ontario Ministry of Natural Resources and Forestry
    License

    https://www.ontario.ca/page/open-government-licence-ontario

    Description

    The Ontario Watershed Boundaries (OWB) collection represents the authoritative watershed boundaries for Ontario. The data is based on a framework similar to the Atlas of Canada Fundamental Drainage Areas and the United States Watershed Boundary Dataset; however, it adopts a more stringent scientific approach to watershed delineation. The OWB collection includes five data classes:

    • OWB Main (OWB) (Download: Shapefile | File Geodatabase | Open Data Service | QGIS Layer): all watershed levels from primary to quaternary, and level 5 and 6 watersheds for select areas of the province
    • OWB Primary (OWBPRIM) (Download: SHP | FGDB | ODS | QLR-Diverted Flow | QLR-Natural Flow): all primary watersheds or major drainage areas (WSCMDA) in the Canadian classification
    • OWB Secondary (OWBSEC) (Download: SHP | FGDB | ODS | QLR*): all secondary watersheds or sub drainage areas (WSCSDA)
    • OWB Tertiary (OWBTERT) (Download: SHP | FGDB | ODS | QLR*): all tertiary watersheds or sub-sub drainage areas (WSCSSDA)
    • OWB Quaternary (OWBQUAT) (Download: SHP | FGDB | ODS | QLR): all quaternary watersheds or 6-digit drainage areas (WSC6)

    *Display issues in QGIS are currently being investigated for these services. See the RSS feed below for details.

    IMPORTANT NOTE: The OWB data replaces the following data classes: Provincial Watersheds, Historical.

    Additional Documentation
    • User Guide for Ontario Watershed Boundaries (Word)
    • Watershed Delineation Principles and Guidelines for Ontario (Word)
    • Atlas of Canada 1:1,000,000 National Frameworks Data, Hydrology - Fundamental Drainage Areas
    • United States Geological Survey Watershed Boundary Dataset (Website)
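    For example, the shapefile downloads can be read with GeoPandas. A small sketch, assuming a hypothetical file name for the quaternary class (check the User Guide above for the actual file and attribute schema):

    import geopandas as gpd

    # Hypothetical file name for the OWB Quaternary download.
    watersheds = gpd.read_file("OWB_Quaternary.shp")

    # Inspect the schema before relying on any attribute names.
    print(watersheds.columns.tolist())

    # e.g. compute each quaternary watershed's area in the layer's CRS units.
    watersheds["area"] = watersheds.geometry.area
    print(watersheds.head())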

    Status: Completed. Production of the data has been completed.

    Maintenance and Update Frequency: Irregular. Data is updated at intervals that are uneven in duration, usually after the completion of major updates to source data (e.g. OIH), but updates could also include spot fixes and expansion of the dataset over time based on user needs.

    RSS Feed: Follow our feed to get the latest announcements and developments concerning our watersheds. Visit our feed at the bottom of our ArcGIS Online OWB page.

    Contact: Ontario Ministry of Natural Resources - Geospatial Ontario, geospatial@ontario.ca

  5. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
