Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The peer-reviewed publication for this dataset was presented at the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite it when using the dataset.
This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim.
The claims have been obtained from online fact-checking sources, existing datasets and research challenges. The dataset combines sources with different foci, thus enabling a comprehensive approach that spans different media (Twitter, Facebook, general websites, academia), information domains (health, scholarly, media), information types (news, claims) and applications (information retrieval, veracity evaluation).
The processing of the claims included an extensive de-duplication step eliminating repeated or very similar claims. The dataset is presented in a LARGE and a SMALL version, corresponding to different degrees of similarity allowed between the remaining claims (excluding claims with a 90% and a 99% probability of being similar, respectively, as estimated by the MonoT5 model). The similarity of claims was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019).
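As an illustration only (not the authors' released code), the following sketch shows the shape of such a de-duplication pass: BM25 (via the rank_bm25 package) retrieves candidate near-duplicates among the claims kept so far, and a placeholder monot5_similarity function stands in for the MonoT5 re-ranking score; the 0.90 threshold corresponds to the LARGE version.

# Sketch of the de-duplication idea; assumes the rank_bm25 package.
# monot5_similarity is a placeholder for a MonoT5-based similarity score in [0, 1].
from rank_bm25 import BM25Okapi

def deduplicate(claims, monot5_similarity, threshold=0.90):
    """Keep a claim only if no previously kept claim is similar above the threshold."""
    kept, kept_tokens = [], []
    for claim in claims:
        tokens = claim.lower().split()
        if kept:
            bm25 = BM25Okapi(kept_tokens)              # index of claims kept so far
            scores = bm25.get_scores(tokens)
            # re-rank only the top BM25 candidates with the (placeholder) MonoT5 scorer
            top = sorted(range(len(kept)), key=lambda i: scores[i], reverse=True)[:10]
            if any(monot5_similarity(claim, kept[i]) >= threshold for i in top):
                continue                               # near-duplicate: drop it
        kept.append(claim)
        kept_tokens.append(tokens)
    return kept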
The processing of the content also involved removing claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English.
The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis. The following types were identified: (1) Multimodal; (2) Social media references; (3) Claims including questions; (4) Claims including numerical content; (5) Named entities, including: PERSON (people, including fictional); ORGANIZATION (companies, agencies, institutions, etc.); GPE (countries, cities, states); FACILITY (buildings, highways, etc.). These entities were detected using a RoBERTa-base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013), via spaCy.
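For reference, a minimal sketch of how such claim types could be flagged with spaCy is given below. It assumes the en_core_web_trf pipeline (a RoBERTa-based model trained on OntoNotes, where ORGANIZATION and FACILITY appear as the labels ORG and FAC); the heuristics for questions and numerical content are illustrative, not the authors' exact rules.

# Sketch: flag claim types with spaCy (assumes en_core_web_trf is installed).
import spacy

nlp = spacy.load("en_core_web_trf")  # RoBERTa-based pipeline trained on OntoNotes

def claim_types(claim: str) -> dict:
    doc = nlp(claim)
    return {
        "question": claim.strip().endswith("?"),          # illustrative heuristic
        "numerical": any(tok.like_num for tok in doc),    # illustrative heuristic
        "entities": [(ent.text, ent.label_) for ent in doc.ents
                     if ent.label_ in {"PERSON", "ORG", "GPE", "FAC"}],
    }

print(claim_types("5G towers in Wuhan caused the first COVID-19 cases."))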
The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels.
The data sources used are:
- The CoronaVirusFacts/DatosCoronaVirus Alliance Database. https://www.poynter.org/ifcn-covid-19-misinformation/
- CoAID dataset (Cui and Lee, 2020) https://github.com/cuilimeng/CoAID
- MM-COVID (Li et al., 2020) https://github.com/bigheiniu/MM-COVID
- CovidLies (Hossain et al., 2020) https://github.com/ucinlp/covid19-data
- TREC Health Misinformation track https://trec-health-misinfo.github.io/
- TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) https://ir.nist.gov/covidSubmit/data.html
The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True).
The entries in the dataset contain the following information:
- Claim. Text of the claim.
- Claim label. The labels are True and False.
- Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institutions sites, and peer-reviewed scientific journals.
- Original information source. Information about which general information source was used to obtain the claim.
- Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.
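As a pointer for working with these fields, the snippet below is a hedged sketch of loading the LARGE version and splitting it by label; the file name and column names are assumptions, so check the actual schema shipped with the dataset files.

# Sketch only: file and column names are assumed, not taken from the dataset documentation.
import pandas as pd

claims = pd.read_csv("covid_claims_large.csv")            # assumed file name

false_claims = claims[claims["claim_label"] == "False"]   # assumed column name
true_claims = claims[claims["claim_label"] == "True"]
print(len(true_claims), len(false_claims))                # should match the counts reported above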
Funding. This work was supported by the UK Engineering and Physical Sciences Research Council (grant nos. EP/V048597/1 and EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by UK Research and Innovation (grant nos. EP/V030302/1 and EP/V020579/1).
References
- Arana-Catania, M., Kochkina, E., Zubiaga, A., Liakata, M., Procter, R., and He, Y. 2022. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. In Proceedings of NAACL 2022. https://arxiv.org/abs/2205.02596
- Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109:109.
- Fabio Crestani, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell. 1998. "Is this document relevant? ... Probably": A survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.
- Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Now Publishers Inc.
- Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pre-trained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 708–718.
- Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA, 23.
- Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885.
- Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A multilingual and multimodal data repository for combating COVID-19 disinformation.
- Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. COVIDLies: Detecting COVID-19 misinformation on social media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online. Association for Computational Linguistics.
- Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. TREC-COVID: Constructing a pandemic information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM, New York, NY, USA.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data files for the paper submitted to PLOS One. A copy of SPSS software is required to open these .sav files.
- Experiment 1 Interviewer Accuracy (dichotomous choice RM): the number of times each interviewer correctly guessed the veracity of the interviewee for each of the six conditions. A repeated measures ANOVA was conducted on this data.
- Experiment 1 Interviewer Compliance: the number of deviations from script made by interviewers for each interview.
- Experiment 1 Interviewer Reasons: the number of reasons interviewers gave for their veracity decisions, coded into 4 categories. A binary logistic regression was used on this data, with accuracy as the outcome variable.
- Experiment 1 Interviewer Accuracy (scale judgement): the interviewer veracity scale scores, with liars' scores reversed. This was analysed with a between-groups ANOVA.
- Experiment 1 Manipulations & Anticipation: the data on Question Anticipation and Question Difficulty.
- Experiment 1 Reality Monitoring: the percentage of the word count for each interview that fell within the 4 reality monitoring categories. A MANOVA was conducted on this data.
- Experiment 2 Observer Accuracy (dichotomous choice): the accuracy of forced-choice decisions for observers across each condition. This was analysed with a series of one-sample t-tests.
- Experiment 2 Observer Accuracy (scale; with liars' scores reversed): the Observer Veracity Scale Ratings. This was analysed with a repeated measures ANOVA.
Abstract for research paper: Asking unanticipated questions in investigative interviews can elicit differences in the verbal behaviour of truth-tellers and liars: when faced with unanticipated questions, liars give less detailed and consistent responses than truth-tellers. Do such differences in verbal behaviour lead to an improvement in the accuracy of interviewers' veracity judgements? Two empirical studies evaluated the efficacy of the unanticipated questions technique. Experiment 1 compared two types of unanticipated questions, assessing the veracity judgements of interviewers and the verbal content of interviewees' responses. Experiment 2 assessed the veracity judgements of independent observers. Overall, the results provide little support for the technique. For interviewers, unanticipated questions failed to improve veracity judgement accuracy above chance. Reality monitoring analysis revealed qualitatively distinct information in the responses to the two unanticipated question types, though little distinction between the responses of truth-tellers and liars. Accuracy for observers was greater when judging transcripts of unanticipated questions, and this effect was stronger for spatial and temporal questions than planning questions. The benefits of unanticipated questioning appear limited to post-interview situations. Furthermore, the type of unanticipated question affects both the type of information gathered and the ability to detect deceit.
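For readers without SPSS, the .sav files can also be inspected with the pyreadstat package; the sketch below is illustrative, and the file name shown is an example taken from the list above rather than a verified path within the deposit.

# Minimal sketch: inspect one of the .sav files without SPSS (pip install pyreadstat).
# The file name below is illustrative; substitute the actual file names from the deposit.
import pyreadstat

df, meta = pyreadstat.read_sav("Experiment 1 Interviewer Accuracy (dichotomous choice RM).sav")

print(meta.column_names)    # variable names defined in SPSS
print(meta.column_labels)   # human-readable variable labels
print(df.describe())        # summary statistics per condition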
https://www.marketreportanalytics.com/privacy-policy
The Data Wrangling market is experiencing robust growth, projected to reach $3.41 billion in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 11.03% from 2025 to 2033. This expansion is fueled by several key factors. The increasing volume and velocity of data generated across various industries necessitates efficient data preparation techniques. Businesses are increasingly adopting cloud-based data warehousing and analytics solutions, which directly benefit from streamlined data wrangling processes. Furthermore, the rising demand for advanced analytics and machine learning applications further emphasizes the need for high-quality, prepared data. This creates significant opportunities for vendors offering sophisticated data wrangling tools and services. Companies like Alteryx, TIBCO, Altair, Teradata, Oracle, SAS, Datameer, DataRobot, Cloudera, and Cambridge Semantics are key players capitalizing on this market expansion, offering a range of solutions from cloud-based platforms to specialized software. The market's growth trajectory is expected to remain strong throughout the forecast period, driven by continuous technological advancements, growing data literacy, and the increasing adoption of big data analytics across various sectors.

The competitive landscape is characterized by both established players and emerging startups. Established vendors leverage their existing customer bases and robust product portfolios to maintain market share, while startups introduce innovative solutions and technologies to gain traction. Market segmentation will likely continue to evolve, with further differentiation emerging based on specific industry applications, data types, and deployment models (cloud vs. on-premise). Future growth will also hinge on successful integration with other data management and analytics tools, improving the overall efficiency of the data pipeline and reducing the time and resources required for data preparation. The market's trajectory reflects the indispensable role of data wrangling in facilitating data-driven decision-making and powering digital transformation initiatives across businesses globally.

Recent developments include: May 2023 - Adroit DI launched SDF Pro, a cloud-based application that provides a cost-effective solution for storing, sorting, and wrangling 10 million molecules within seconds. SDF Pro offers a user-configurable interface accessible from login, enabling users to organize, structure, and store large data sets. May 2023 - Qlik acquired Talend, expanding the company's innovative capabilities for modern enterprises to transform, access, trust, analyze, and take action with data. Qlik, together with Talend, will bring substantial benefits to consumers, including expanded product offerings, improved support and services, and enhanced investments in innovation and R&D.

Key drivers for this market are: Growing Volumes of Data; Advancement in AI and Big Data Technologies; Growing Concern about Data Veracity. Potential restraints include: Growing Volumes of Data; Advancement in AI and Big Data Technologies; Growing Concern about Data Veracity. Notable trends are: Large Enterprises are Analyzed to Hold Significant Market Share.
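For context on the headline figures, compounding the quoted 2025 base at the stated CAGR gives the implied market size at the end of the forecast period; the snippet below is only that arithmetic and uses no figures beyond the two quoted above.

# Implied 2033 market size from the quoted 2025 base and CAGR.
base_2025 = 3.41        # USD billions, as quoted
cagr = 0.1103           # 11.03% per year
years = 2033 - 2025     # forecast horizon

size_2033 = base_2025 * (1 + cagr) ** years
print(f"Implied 2033 market size: ${size_2033:.2f} billion")   # roughly $7.9 billion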
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) the mapping of articles to fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying GitHub repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
Options to access the dataset
There are two ways to access the dataset:
1. Static dump of the dataset available in the CSV format
2. Continuously updated dataset available via REST API
To obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform,
author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
pages = {1--7},
title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
year = {2019}
}
@inproceedings{SrbaMonantMedicalDataset,
author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
numpages = {11},
title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
year = {2022},
doi = {10.1145/3477495.3531726},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531726},
}
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, WordPress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data are stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
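The exact combination rule is not spelled out here, so the following is only an illustrative sketch of how an article-claim pair veracity might be derived from a claim's fact-checked rating, its presence in an article, and the article's stance; the label names and the mapping itself are assumptions, not the authors' published methodology.

# Illustrative sketch only: label names and mapping below are assumptions,
# not the published labelling rule of the dataset authors.
from typing import Optional

def pair_veracity(claim_rating: str, claim_present: bool, article_stance: str) -> Optional[str]:
    """Derive a partial article-claim pair veracity from a fact-checked claim rating
    and the article's stance towards that claim."""
    if not claim_present:
        return None                      # no article-claim relation to evaluate
    if article_stance == "supporting":
        return claim_rating              # article endorses the claim, so the pair inherits its rating
    if article_stance == "contradicting":
        return "true" if claim_rating == "false" else "false"
    return "unknown"                     # neutral / both-sides stances or mixed ratings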
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance produced by our baselines described in the next section. These methods have their limitations and work with a certain accuracy, as reported in the associated paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset
The way to report significant mistakes in the raw collected data or in the manual annotations is by creating a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data (i.e., data extracted by the web monitoring module of the Monant platform and stored exactly as they appear on the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
Annotations
Second, the dataset contains so-called annotations. Entity annotations describe individual raw data entities (e.g., an article or a source); relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:
At the same time, annotations are associated with a particular object identified by:
- entity_type in the case of entity annotations, or source_entity_type and target_entity_type in the case of relation annotations. Possible values: sources, articles, fact-checking-articles.
- entity_id in the case of entity annotations, or source_entity_id and target_entity_id in the case of relation annotations.
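As a minimal sketch of how these identifier columns could be used to join annotations back to the raw entities with pandas: the CSV file names below and the article ID column name are placeholders, since the exact file list and schema ship with the dataset itself.

# Sketch only: join entity annotations to articles via entity_type / entity_id.
# File names and the "id" column are placeholders; substitute the CSVs from the static dump.
import pandas as pd

articles = pd.read_csv("articles.csv")                  # raw data: one row per article
annotations = pd.read_csv("entity_annotations.csv")     # entity annotations

article_annotations = annotations[annotations["entity_type"] == "articles"]
merged = article_annotations.merge(articles, left_on="entity_id", right_on="id", how="left")
print(merged.head())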
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Biological records of potential host plants of Mexican wild bees identified from Naturalista (Mexican iNaturalist node) observations. This is an interinstitutional effort carried out for the taxonomic curation and mobilization of plant observations compiled in the project "Xicotli Data: Native Bees and their Flowers" (https://www.naturalista.mx/projects/xicotli-data-abejas-mexicanas-y-sus-flores). Plant observations were obtained from bee photographs available in Naturalista. The project and the resultant dataset of biological records of plants seek to maximize the use of the observations in Naturalista, and are thus expected to contribute to the documentation of native plant-bee interactions. This dataset is made up of metadata and three tables: 1) occurrences; 2) ecomorphological attributes (see details here); and 3) type of bee-plant interaction (see details here). Taxonomic determination of plants from photographs is a great challenge. However, the plant taxonomists carried out the curation by means of their expert knowledge and by consulting specialized literature (i.e., taxonomic descriptions and catalogues). They also compared the plants in the photographs with specimens from two Mexican herbaria: IEB "Graciela Calderón and Jerzy Rzedowski" (http://inecolbajio.com/index.php#) and MEXU-UNAM (http://www.ib.unam.mx/botanica/herbario/), applying a very conservative protocol that guarantees the greatest possible veracity of the taxonomic determinations. This dataset was generated within the framework of the project "Integration of biodiversity data of wild bee-plant interactions in Mexico" (https://www.gbif.org/es/project/BID-CA2020-021-NAC/integration-of-biodiversity-data-of-wild-bee-plant-interactions-in-mexico). The original Naturalista data can be consulted in the following resources: https://www.gbif.org/occurrence/download/0194949-230224095556074 and https://doi.org/10.5281/zenodo.7892227
In September 2017, the Intelligence Advanced Research Projects Activity (IARPA) held a data collection as part of its Nail to Nail (N2N) Fingerprint Challenge. Participating Challengers deployed devices designed to collect an image of the full nail to nail surface area of a fingerprint equivalent to a rolled fingerprint from an unacclimated user, without assistance from a trained operator. Traditional operator-assisted live-scan rolled fingerprints were also captured, along with assorted other friction ridge live-scan and latent captures. In this data collection, study participants needed to have their fingerprints captured using traditional operator-assisted techniques in order to quantify the performance of the Challenger devices. IARPA invited members of the Federal Bureau of Investigation (FBI) Biometric Training Team to the data collection to perform this task. Each study participant had N2N fingerprint images captured twice, each by a different FBI expert, resulting in two N2N baseline datasets. To ensure the veracity of recorded N2N finger positions in the baseline datasets, Challenge test staff also captured plain fingerprint impressions in a 4-4-2 slap configuration. This capture method refers to simultaneously imaging the index, middle, ring, and little fingers on the right hand, then repeating the process on the left hand, and finishing with the simultaneous capture of the left and right thumbs. This technique is a best practice to ensure finger sequence order, since it is physically challenging for a study participant to change the ordering of fingers when imaging them simultaneously. There were four baseline (two rolled and two slap), eight challenger and ten auxiliary fingerprint sensors deployed during the data collection, amassing a series of rolled and plain images. It was required that the baseline devices achieve a 100% acquisition rate, in order to verify the recorded friction ridge generalized positions (FRGPs) and study participant identifiers for other devices. There were no such requirements for Challenger devices. Not all devices were able to achieve a 100% acquisition rate. Plain, rolled, and touch-free impression fingerprints were captured from a multitude of devices, as well as sets of plain palm impressions. NIST also partnered with the FBI and Schwarz Forensic Enterprises (SFE) to design activity scenarios in which subjects would likely leave fingerprints on different objects. The activities and associated objects were chosen in order to use a number of latent print development techniques and simulate the types of objects often found in real law enforcement case work.