This dataset contains a listing of incorporated places (cities and towns) and counties within the United States, including the GNIS code, FIPS code, name, entity type and primary point (location) for each entity. The types of entities listed in this dataset are based on codes provided by the U.S. Census Bureau, and include the following:
C1 - An active incorporated place that does not serve as a county subdivision equivalent;
C2 - An active incorporated place legally coextensive with a county subdivision but treated as independent of any county subdivision;
C3 - A consolidated city;
C4 - An active incorporated place with an alternate official common name;
C5 - An active incorporated place that is independent of any county subdivision and serves as a county subdivision equivalent;
C6 - An active incorporated place that partially is independent of any county subdivision and serves as a county subdivision equivalent or partially coextensive with a county subdivision but treated as independent of any county subdivision;
C7 - An incorporated place that is independent of any county;
C8 - The balance of a consolidated city excluding the separately incorporated place(s) within that consolidated government;
C9 - An inactive or nonfunctioning incorporated place;
H1 - An active county or statistically equivalent entity;
H4 - A legally defined inactive or nonfunctioning county or statistically equivalent entity;
H5 - A census area in Alaska, a statistical county equivalent entity; and
H6 - A county or statistically equivalent entity that is areally coextensive or governmentally consolidated with an incorporated place, part of an incorporated place, or a consolidated city.
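The entity type codes above lend themselves to a small lookup table. The following is a minimal sketch, not part of the dataset itself: the abbreviated descriptions and the helper names `entity_kind` and `is_inactive` are illustrative assumptions, and only the codes listed in the description are covered.

```python
# Entity type codes as listed in the dataset description (wording abbreviated).
CLASS_CODES = {
    "C1": "Active incorporated place, not a county subdivision equivalent",
    "C2": "Active incorporated place coextensive with a county subdivision",
    "C3": "Consolidated city",
    "C4": "Active incorporated place with an alternate official common name",
    "C5": "Active incorporated place serving as a county subdivision equivalent",
    "C6": "Active incorporated place, partially C5 and partially C2",
    "C7": "Incorporated place independent of any county",
    "C8": "Balance of a consolidated city",
    "C9": "Inactive or nonfunctioning incorporated place",
    "H1": "Active county or statistically equivalent entity",
    "H4": "Inactive or nonfunctioning county or equivalent",
    "H5": "Census area in Alaska",
    "H6": "County coextensive or consolidated with an incorporated place",
}


def entity_kind(code: str) -> str:
    """Return 'place' for C-class codes and 'county' for H-class codes."""
    if code not in CLASS_CODES:
        raise KeyError(f"unknown class code: {code!r}")
    return "place" if code.startswith("C") else "county"


def is_inactive(code: str) -> bool:
    """True only for the two codes the description marks as inactive or nonfunctioning."""
    if code not in CLASS_CODES:
        raise KeyError(f"unknown class code: {code!r}")
    return code in {"C9", "H4"}
```

A filter such as `[c for c in CLASS_CODES if entity_kind(c) == "county"]` then selects the county-level codes.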
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential Schools Settlement Agreement (IRSSA) are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini. Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake.
If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Residential Schools Locations Dataset in shapefile format contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential Schools Settlement Agreement (IRSSA) are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini. Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School.
When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites. The geographic coordinate system for this dataset is WGS 1984. The data in shapefile format [IRS_locations.zip] can be viewed and mapped in Geographic Information System (GIS) software. Detailed metadata in XML format is available as part of the data in shapefile format. In addition, the field name descriptions (IRS_locfields.csv) and the detailed locations descriptions (IRS_locdescription.csv) should be used alongside the data in shapefile format.
PredictLeads Job Openings Data provides high-quality hiring insights sourced directly from company websites - not job boards. Using advanced web scraping technology, our dataset offers real-time access to job trends, salaries, and skills demand, making it a valuable resource for B2B sales, recruiting, investment analysis, and competitive intelligence.
Key Features:
✅ 206M+ Job Postings Tracked – Data sourced from 1.8M+ company websites worldwide.
✅ 7M+ Active Job Openings – Updated in real-time to reflect hiring demand.
✅ Salary & Compensation Insights – Extract salary ranges, contract types, and job seniority levels.
✅ Technology & Skill Tracking – Identify emerging tech trends and industry demands.
✅ Company Data Enrichment – Link job postings to employer domains, firmographics, and growth signals.
✅ Web Scraping Precision – Directly sourced from employer websites for unmatched accuracy.
Primary Attributes:
Job Metadata:
Salary Data (salary_data)
Occupational Data (onet_data) (object, nullable)
Additional Attributes:
📌 Trusted by enterprises, recruiters, and investors for high-precision job market insights.
Response Example: https://docs.predictleads.com/v3/api_endpoints/job_openings_dataset/retrieve_company_s_job_openings
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of 22 datasets of 50+ requirements each, expressed as user stories.
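Because the requirements are expressed as user stories, a common first processing step is to split each story into its role, goal and benefit. The sketch below assumes the widely used "As a <role>, I want <goal>, so that <benefit>" template and one story per line; neither the template nor the line layout is confirmed by this description, and `STORY_RE` and `parse_story` are hypothetical names introduced for illustration.

```python
import re

# ASSUMPTION: stories follow "As a/an <role>, I want (to) <goal>[, so that <benefit>]."
STORY_RE = re.compile(
    r"^As an? (?P<role>.+?), I want (?:to )?(?P<goal>.+?)"
    r"(?:,? so that (?P<benefit>.+?))?\.?$",
    re.IGNORECASE,
)


def parse_story(line: str):
    """Return a dict with role, goal and benefit, or None if the line does not fit the template."""
    match = STORY_RE.match(line.strip())
    if match is None:
        return None
    # 'benefit' is None when the story has no "so that" clause.
    return match.groupdict()
```

Lines that return `None` are worth inspecting by hand, since real-world user stories often deviate from the template.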
The dataset has been created by gathering data from web sources and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].
The datasets were originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light
This collection was originally published on Mendeley Data: https://data.mendeley.com/datasets/7zbk8zsd8y/1
The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.
g02-federalspending.txt
(2018) originates from early data in the Federal Spending Transparency project, which pertains to the website that is used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS (DATA Act Information Model Schema), also known as Data Broker. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.
g03-loudoun.txt
(2018) is a set of requirements extracted from a document, by Loudoun County, Virginia, that describes the to-be user stories and use cases of a system for land management readiness assessment called Loudoun County LandMARC. The source document can be found here and it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.
g04-recycling.txt
(2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub website and is the basis of a student project on web site design; the code is available (no license).
g05-openspending.txt
(2018) is about the OpenSpending project (www), a project of the Open Knowledge Foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.
g11-nsf.txt
(2018) is a collection of user stories from the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.
g08-frictionless.txt
(2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.
g14-datahub.txt
(2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.
g16-mis.txt
(2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.
g17-cask.txt
(2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.
g18-neurohub.txt
(2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.
g22-rdadmp.txt
(2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.
g23-archivesspace.txt
(2012-2013) refers to ArchivesSpace: an open source, web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and
born digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its
This dataset is composed of the URLs of the top 1 million websites. The domains are ranked using the Alexa traffic ranking, which is determined using a combination of the browsing behavior of users on the website, the number of unique visitors, and the number of pageviews. In more detail, unique visitors are the number of unique users who visit a website on a given day, and pageviews are the total number of user URL requests for the website. However, multiple requests for the same website on the same day are counted as a single pageview. The website with the highest combination of unique visitors and pageviews is ranked the highest.
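The counting rules above can be illustrated with a short sketch. It rests on several assumptions that the description does not confirm: the request log format is hypothetical, repeated requests are deduplicated per user, URL and day, and the "combination" of the two metrics is taken to be their geometric mean. It is not Alexa's actual ranking procedure.

```python
from collections import defaultdict
from math import sqrt


def site_metrics(requests):
    """Compute (unique_visitors, pageviews) per site.

    `requests` is an iterable of (user_id, site, url, day) tuples
    (a hypothetical log format). Visitors are counted once per user and day;
    ASSUMPTION: repeated requests by the same user for the same URL on the
    same day count as a single pageview.
    """
    visitors = defaultdict(set)   # site -> {(user, day)}
    pageviews = defaultdict(set)  # site -> {(user, url, day)}
    for user, site, url, day in requests:
        visitors[site].add((user, day))
        pageviews[site].add((user, url, day))
    return {site: (len(visitors[site]), len(pageviews[site])) for site in visitors}


def rank_sites(requests):
    """Order sites from highest to lowest combined score."""
    metrics = site_metrics(requests)

    def score(site):
        unique_visitors, views = metrics[site]
        # ASSUMPTION: geometric mean as the combination of the two metrics.
        return sqrt(unique_visitors * views)

    return sorted(metrics, key=score, reverse=True)
```

Swapping the `score` function is enough to experiment with other ways of combining visitors and pageviews.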
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
The USGS National Hydrography Dataset (NHD) Downloadable Data Collection from The National Map (TNM) is a comprehensive set of digital spatial data that encodes information about naturally occurring and constructed bodies of surface water (lakes, ponds, and reservoirs), paths through which water flows (canals, ditches, streams, and rivers), and related entities such as point features (springs, wells, stream gages, and dams). The information encoded about these features includes classification and other characteristics, delineation, geographic name, position and related measures, a "reach code" through which other information can be related to the NHD, and the direction of water flow. The network of reach codes delineating water and transported material flow allows users to trace movement in upstream and downstream directions. In addition to this geographic information, the dataset contains metadata that supports the exchange of future updates and improvements to the data. The NHD supports many applications, such as making maps, geocoding observations, flow modeling, data maintenance, and stewardship. For additional information on NHD, go to https://www.usgs.gov/core-science-systems/ngp/national-hydrography.
DWR was the steward for NHD and Watershed Boundary Dataset (WBD) in California. We worked with other organizations to edit and improve NHD and WBD, using the business rules for California. California's NHD improvements were sent to USGS for incorporation into the national database. The most up-to-date products are accessible from the USGS website. Please note that the California portion of the National Hydrography Dataset is appropriate for use at the 1:24,000 scale.
For additional derivative products and resources, including the major features in geopackage format, please go to this page: https://data.cnra.ca.gov/dataset/nhd-major-features Archives of previous statewide extracts of the NHD going back to 2018 may be found at https://data.cnra.ca.gov/dataset/nhd-archive.
In September 2022, USGS officially notified DWR that the NHD would become static as USGS resources will be devoted to the transition to the new 3D Hydrography Program (3DHP). 3DHP will consist of LiDAR-derived hydrography at a higher resolution than NHD. Upon completion, 3DHP data will be easier to maintain, based on a modern data model and architecture, and better meet the requirements of users that were documented in the Hydrography Requirements and Benefits Study (2016). The initial releases of 3DHP will be the NHD data cross-walked into the 3DHP data model. It will take several years for the 3DHP to be built out for California. Please refer to the resources on this page for more information.
The FINAL, STATIC version of the National Hydrography Dataset for California was published for download by USGS on December 27, 2023. This dataset can no longer be edited by the state stewards.
The first public release of the 3D Hydrography Program map service may be accessed at https://hydro.nationalmap.gov/arcgis/rest/services/3DHP_all/MapServer.
Questions about the California stewardship of these datasets may be directed to nhd_stewardship@water.ca.gov.
The Military Bases dataset was last updated on October 23, 2024 and is defined by Fiscal Year 2023 data from the Office of the Assistant Secretary of Defense for Energy, Installations, and Environment. It is part of the U.S. Department of Transportation (USDOT)/Bureau of Transportation Statistics (BTS) National Transportation Atlas Database (NTAD). The dataset depicts the authoritative locations of the most commonly known Department of Defense (DoD) sites, installations, ranges, and training areas worldwide. These sites encompass land which is federally owned or otherwise managed. This dataset was created from source data provided by the four Military Service Component headquarters and was compiled by the Defense Installation Spatial Data Infrastructure (DISDI) Program within the Office of the Assistant Secretary of Defense for Energy, Installations, and Environment. Only sites reported in the BSR or released in a map supplementing the Foreign Investment Risk Review Modernization Act of 2018 (FIRRMA) Real Estate Regulation (31 CFR Part 802) were considered for inclusion. This list does not necessarily represent a comprehensive collection of all Department of Defense facilities. For inventory purposes, installations are comprised of sites, where a site is defined as a specific geographic location of federally owned or managed land that is assigned to a military installation. DoD installations are commonly referred to as a base, camp, post, station, yard, center, homeport facility for any ship, or other activity under the jurisdiction, custody, or control of the DoD. While every attempt has been made to provide the best available data quality, this dataset is intended for use at mapping scales between 1:50,000 and 1:3,000,000. For this reason, boundaries in this dataset may not perfectly align with DoD site boundaries depicted in other federal data sources.
Maps produced at a scale of 1:50,000 or smaller that otherwise comply with National Map Accuracy Standards will remain compliant when this data is incorporated. Boundary data is most suitable for larger scale maps; point locations are better suited for mapping scales between 1:250,000 and 1:3,000,000. If a site is part of a Joint Base (effective/designated on 1 October 2010) as established under the 2005 Base Realignment and Closure process, it is attributed with the name of the Joint Base. All sites comprising a Joint Base are also attributed to the responsible DoD Component, which is not necessarily the pre-2005 Component responsible for the site.
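The scale guidance above can be expressed as a simple check on the scale denominator (the N in 1:N). This is a minimal sketch; the function names are hypothetical, and the thresholds are taken directly from the stated ranges.

```python
def in_intended_range(scale_denominator: int) -> bool:
    """True if a 1:N map scale falls within the dataset's intended range (1:50,000 to 1:3,000,000)."""
    return 50_000 <= scale_denominator <= 3_000_000


def points_recommended(scale_denominator: int) -> bool:
    """True if point locations are the better fit (1:250,000 to 1:3,000,000)."""
    return 250_000 <= scale_denominator <= 3_000_000
```

For example, a 1:100,000 map is inside the intended range but larger in scale than the range recommended for point locations, so boundary data would be the natural choice there.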
Netlas.io is a set of internet intelligence apps that provide accurate technical information on IP addresses, domain names, websites, web applications, IoT devices, and other online assets.
Netlas.io maintains five general data collections: Responses (internet scan data), DNS Registry data, IP Whois data, Domain Whois data, SSL Certificates.
This dataset contains Domain WHOIS data. It covers active domains only, including just registered, published and parked domains, domains on redemption grace period (waiting for renewal), and domains pending delete. This dataset doesn't include any historical records.
This world cities layer presents the locations of many cities of the world, both major cities and many provincial capitals. Population estimates are provided for those cities listed in open source data from the United Nations and US Census.
PromptCloud offers cutting-edge data extraction services that empower businesses with real-time, actionable intelligence from the vast expanses of the online marketplace. We are committed to putting data at the heart of your business. Reach out for a no-frills PromptCloud experience: professional, technologically ahead, and reliable.
Our Amazon Best Seller Products Dataset is a key tool for businesses looking to understand and capitalize on market trends. It allows you to identify top-selling products and sellers, and track their performance across various categories and subcategories. This dataset is invaluable for competitive intelligence, monitoring trending products, and understanding customer sentiment. It also plays a crucial role in monitoring competitor prices and enhancing product inventory, ensuring that your business stays relevant and competitive.
Beyond Amazon, PromptCloud offers access to a diverse range of Ecommerce Product Data from various e-commerce websites.
PromptCloud is a leading provider of advanced web scraping services, uniquely tailored to meet the dynamic needs of modern businesses. Our services are fully customizable, allowing clients to specify source websites, data collection frequencies, data points, and delivery mechanisms to fit their unique requirements. The data aggregation feature of our web crawler enables the extraction of data from multiple sources in a single stream, catering to a diverse range of clients, from ecommerce businesses to news aggregators and job boards.
With over a decade of experience in extracting web data from any e-commerce website, PromptCloud stands as a seasoned veteran in the field. This extensive experience translates into high-quality, reliable data extraction, making PromptCloud your ideal product web data extraction partner. The reliability of our data is uncompromised, with a 100% verification process that ensures accuracy and trustworthiness.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Residential Schools Locations Dataset in Geodatabase format (IRS_Locations.gbd) contains a feature layer "IRS_Locations" that contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Residential Schools Settlement Agreement (IRSSA) are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini. Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake.
If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites. Access Instructions: There are 47 files in this data package. Please download the entire data package by selecting all 47 files and clicking on download. Two files will be downloaded, IRS_Locations.gbd.zip and IRS_LocFields.csv. Uncompress IRS_Locations.gbd.zip. Use QGIS, ArcGIS Pro, or ArcMap to open the feature layer IRS_Locations that is contained within the IRS_Locations.gbd data package. The feature layer is in the WGS 1984 coordinate system. There is also detailed file-level metadata included in this feature layer file. The IRS_locations.csv provides the full description of the fields and codes used in this dataset.
See uploaded ReadMe file.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All cities with a population > 1000 or seats of administrative divisions (ca. 80,000).
Sources and Contributions
Sources: GeoNames aggregates over a hundred different data sources.
Ambassadors: GeoNames Ambassadors help in many countries.
Wiki: A wiki allows users to view the data, quickly fix errors, and add missing places.
Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.
Enrichment: add country name
Discover the convenience of our customized dataset preparation service, designed to meet your industry-specific and location-based requirements. When you request datasets tailored to your needs, we diligently gather, structure, and enrich the data with Local Pack insights, providing you with a comprehensive resource for strategic decision-making.
Whether you're focused on a specific industry or targeting a particular geographic area, our team ensures that the dataset aligns perfectly with your objectives. We meticulously curate keywords belonging to your industry, scrape Local Packs for relevant insights, and organize the data in a structured format for easy analysis.
Our service goes beyond mere data gathering – we understand the importance of accuracy and relevance. Therefore, before sharing the dataset with you, we conduct thorough quality checks and ensure that the information is up-to-date and reliable.
Empower your business with actionable insights derived from our tailored datasets. Make informed decisions, optimize your strategies, and stay ahead of the competition with our comprehensive and customizable data solutions.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Natural monuments and sites whose conservation or preservation presents, from an artistic, historical, scientific, legendary or picturesque point of view, a general interest. Sites may be listed or classified. The inscription either concerns natural monuments or sites that are deserving of protection but not of a remarkable interest sufficient to justify their classification, or constitutes a precautionary measure prior to classification. It can also be a suitable tool for the preservation of small rural heritage in areas with little land pressure.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Department of Conservation (DOC) - Recreation track lines (approx. centreline). Dataset shows tracks managed for walking and tramping. If you intend to walk a track, please confirm with your local office or the DOC website that the track isn't under a temporary or more permanent closure before embarking. Refreshed weekly and reflects the content on the website.
LICENCE
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.
DISCLAIMER
1. DOC makes no express or implied warranties as to the accuracy or completeness of the data or information, nor its suitability for any purpose. Errors are inevitably part of any database, and can arise by a number of means, from errors during field data collection, to errors during data entry.
2. DOC makes no warranties or representations as to possible infringement upon copyrights or other intellectual property rights of others in the data or information.
3. DOC will not accept liability for any direct, indirect, special or consequential damages, losses or expenses howsoever arising and relating to use, or lack of use, of the data or information supplied.
GUIDELINES FOR THE USE OF THE INFORMATION
4. Care should be taken in deriving conclusions from any data or information supplied.
5. Any use of the data or information supplied should state when the data or information was acquired and that it may now be out-of-date.
COPYRIGHT OBLIGATIONS
6. All proprietary rights to the intellectual property in the data or information remain with the Crown as its sole property.
7. Modification of the data and information or the addition of the information does not confer copyright or any other form of property of the original material to a user.
8. All maps or reports that are derived from the data or information must acknowledge the Crown copyright, in the following way: Crown Copyright: Department of Conservation Te Papa Atawhai [year].
9. This information resource may be passed onto another party, in either hard copy or electronic form. If a user does this, then it is recommended that they also supply this metadata record with the information resource.
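Items 5 and 8 above can be combined into a small helper that builds an acknowledgement line for a derived map or report. This is only a sketch under stated assumptions: the function name, the choice to pass the year explicitly, the reading of "[year]" as a placeholder to be replaced by a plain year, and the wording of the acquisition note are all hypothetical. Only the "Crown Copyright: Department of Conservation Te Papa Atawhai" pattern comes from the text.

```python
from datetime import date


def doc_acknowledgement(year: int, acquired: date) -> str:
    """Build an acknowledgement line for material derived from this data.

    The copyright pattern follows item 8; the acquisition note is a
    hypothetical wording for the recommendation in item 5.
    """
    # Assumption: "[year]" in the source text is a placeholder for a plain year.
    copyright_line = (
        f"Crown Copyright: Department of Conservation Te Papa Atawhai {year}"
    )
    # Assumption: an ISO date is an acceptable way to state when data was acquired.
    acquired_note = f"Data acquired {acquired.isoformat()}; it may now be out-of-date."
    return f"{copyright_line}. {acquired_note}"


if __name__ == "__main__":
    print(doc_acknowledgement(2024, date(2024, 3, 1)))
```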
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) a focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
Options to access the dataset
There are two ways to access the dataset:
1. Static dump of the dataset available in the CSV format
2. Continuously updated dataset available via REST API
To obtain access to the dataset (either the full static dump or the REST API), please request it by following the instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform,
author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
pages = {1--7},
title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
year = {2019}
}
@inproceedings{SrbaMonantMedicalDataset,
author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
numpages = {11},
title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
year = {2022},
doi = {10.1145/3477495.3531726},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3477495.3531726},
}
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
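The agreement requirement described above can be illustrated with a short sketch. It is not the authors' labelling tool: the function name, the label strings, and the rule that a tie between the top labels yields no label are assumptions made for illustration. Only the requirement of at least two agreeing annotators comes from the text.

```python
from collections import Counter
from typing import Optional, Sequence


def agreed_label(votes: Sequence[str], min_agreement: int = 2) -> Optional[str]:
    """Return the label supported by at least `min_agreement` annotators.

    Returns None when no label reaches the threshold or when the two most
    frequent labels are tied (an assumed tie-breaking policy).
    """
    ranked = Counter(votes).most_common(2)
    if not ranked:
        return None  # no votes at all
    top_label, top_count = ranked[0]
    if top_count < min_agreement:
        return None  # not enough annotators agree
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # ambiguous: two labels have equal support
    return top_label


if __name__ == "__main__":
    # Hypothetical stance labels; the dataset's real label values may differ.
    print(agreed_label(["supports", "supports", "rejects"]))
    print(agreed_label(["supports", "rejects"]))
```

Under this sketch, an article-claim pair with only one vote, or with split votes, stays unlabelled rather than receiving a possibly wrong label.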
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance, produced by our baselines described in the next section. These methods have their limitations and achieve only a certain accuracy, as reported in the paper. This should be taken into account when interpreting the predicted labels.
Reporting mistakes in the dataset
To report significant mistakes in the raw collected data or in the manual annotations, create a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as they appear on the original websites). Raw data consist of articles from news sites and blogs (e.g. naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g. snopes.com). In addition, the dataset contains feedback (number of likes, shares, comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
Annotations
Secondly, the dataset contains so-called annotations. Entity annotations describe the individual raw data entities (e.g., article, source). Relation annotations describe the relation between two such entities.
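The distinction between the two annotation kinds can be sketched as two record types. This is an illustrative model, not the dataset's actual schema: the identifier field names (entity_type, entity_id, source_entity_type, source_entity_id, target_entity_type, target_entity_id) and the three possible entity types are taken from this section, while the class names, the integer ID type, the `value` field, and the example payloads are assumptions.

```python
from dataclasses import dataclass
from typing import Any

# Possible entity types as listed in this section.
ENTITY_TYPES = {"sources", "articles", "fact-checking-articles"}


def _check_type(entity_type: str) -> None:
    """Reject entity types that this section does not list."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type!r}")


@dataclass(frozen=True)
class EntityAnnotation:
    """Annotation attached to a single raw data entity."""
    entity_type: str
    entity_id: int          # assumption: IDs are integers
    value: Any              # hypothetical payload field

    def __post_init__(self) -> None:
        _check_type(self.entity_type)


@dataclass(frozen=True)
class RelationAnnotation:
    """Annotation attached to a pair of raw data entities."""
    source_entity_type: str
    source_entity_id: int
    target_entity_type: str
    target_entity_id: int
    value: Any              # hypothetical payload field

    def __post_init__(self) -> None:
        _check_type(self.source_entity_type)
        _check_type(self.target_entity_type)


if __name__ == "__main__":
    # Hypothetical examples: a credibility label on a source, and a
    # claim-to-article mapping (the direction of the relation is assumed).
    print(EntityAnnotation("sources", 1, {"credibility": "reliable"}))
    print(RelationAnnotation("fact-checking-articles", 10, "articles", 20,
                             {"claim_presence": "present", "stance": "supports"}))
```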
Each annotation is described by the following attributes:
At the same time, annotations are associated with a particular object identified by:
entity_type (in case of entity annotations), or source_entity_type and target_entity_type (in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
entity_id (in case of entity annotations), or source_entity_id and target_entity_id (in case of relation
The dataset depicts the authoritative locations of the most commonly known Department of Defense (DoD) sites, installations, ranges, and training areas world-wide. These sites encompass land which is federally owned or otherwise managed. This dataset was created from source data provided by the four Military Service Component headquarters and was compiled by the Defense Installation Spatial Data Infrastructure (DISDI) Program within the Office of the Assistant Secretary of Defense for Energy, Installations, and Environment. Only sites reported in the BSR or released in a map supplementing the Foreign Investment Risk Review Modernization Act of 2018 (FIRRMA) Real Estate Regulation (31 CFR Part 802) were considered for inclusion. This list does not necessarily represent a comprehensive collection of all Department of Defense facilities. For inventory purposes, installations are comprised of sites, where a site is defined as a specific geographic location of federally owned or managed land and is assigned to a military installation. DoD installations are commonly referred to as a base, camp, post, station, yard, center, homeport facility for any ship, or other activity under the jurisdiction, custody, control of the DoD.
While every attempt has been made to provide the best available data quality, this data set is intended for use at mapping scales between 1:50,000 and 1:3,000,000. For this reason, boundaries in this data set may not perfectly align with DoD site boundaries depicted in other federal data sources. Maps produced at a scale of 1:50,000 or smaller which otherwise comply with National Map Accuracy Standards will remain compliant when this data is incorporated. Boundary data is most suitable for larger scale maps; point locations are better suited for mapping scales between 1:250,000 and 1:3,000,000.
If a site is part of a Joint Base (effective/designated on 1 October, 2010) as established under the 2005 Base Realignment and Closure process, it is attributed with the name of the Joint Base. All sites comprising a Joint Base are also attributed to the responsible DoD Component, which is not necessarily the pre-2005 Component responsible for the site.
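The scale guidance above can be sketched as a small selection function. This is one possible reading, not part of the dataset: the function name and return strings are hypothetical, and treating 1:250,000 as a hard cut-off between boundary and point geometry is an assumption, since the text only says boundaries suit "larger scale maps". A larger scale means a smaller denominator, so 1:50,000 is a larger scale than 1:250,000.

```python
# Intended mapping scale range stated in the dataset description.
MIN_DENOMINATOR = 50_000       # 1:50,000 (largest supported scale)
MAX_DENOMINATOR = 3_000_000    # 1:3,000,000 (smallest supported scale)
# Assumed cut-off: points are described as suited to 1:250,000 and smaller.
POINT_MIN_DENOMINATOR = 250_000


def geometry_for_scale(scale_denominator: int) -> str:
    """Suggest 'boundary' or 'point' geometry for a 1:scale_denominator map."""
    if not MIN_DENOMINATOR <= scale_denominator <= MAX_DENOMINATOR:
        raise ValueError(
            f"1:{scale_denominator:,} is outside the intended range "
            f"1:{MIN_DENOMINATOR:,} to 1:{MAX_DENOMINATOR:,}"
        )
    if scale_denominator >= POINT_MIN_DENOMINATOR:
        return "point"
    return "boundary"


if __name__ == "__main__":
    for denominator in (50_000, 100_000, 250_000, 1_000_000):
        print(f"1:{denominator:,} -> {geometry_for_scale(denominator)}")
```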
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data provides information on recreation and amenities at river access sites. The information describes the site and its amenities and suggests recreational uses. The public access sites are open to the public without obtaining the landowner’s prior permission.
Almost every public access site has been visited by an Environment Canterbury staff member. During the site visit, information is collected that describes the site and photos are taken. The access site is usually located where it is appropriate to leave a vehicle, such as an informal parking area or picnic area. Distances to the river bank are approximate and may vary depending on water and flow levels.
The purpose of the data is to provide information on Canterbury’s public access sites; the data helps answer questions from the public relating to recreation and access. In addition, the data also helps meet some of the requirements set out by the Canterbury Water Management Strategy for Recreation and Amenity targets.
The information is recorded in a consistent manner with consistent standards applied during its collection.
The categories used to describe access are:
Foot and vehicle access: pedestrian and vehicle access.
Foot access: pedestrian access only.
Foot and vehicle access over private property: pedestrian and vehicle access, available over private property, permitted by the land occupier or owner.
Foot access over private property: pedestrian only access, available over private property, permitted by the land occupier or owner.
The categories used to describe the quality of the road or track are described below:
Good: well maintained, easily passable.
Average: may require some care when driving a road car but still easily passable.
Poor: difficult to drive along; care and skill required, urgently in need of maintenance.
4WD: four-wheel drive recommended for track.
Caveats – conditions or limitations of the layer
Every effort is made to identify public access points through site visits and checking the data by utilising maps, plans, and the digital cadastral database (DCDB).
Every effort is made to survey all access points along riparian margins unless access is in a remote, hard to reach area, or is unlikely to be frequently visited by the public.
The information is only accurate as at the time of the survey; refer to the ‘Date surveyed’ attribute for each access point.