Classification of Mars Terrain Using Multiple Data Sources Alan Kraut1, David Wettergreen1 ABSTRACT. Images of Mars are being collected faster than they can be analyzed by planetary scientists. Automatic analysis of images would enable more rapid and more consistent image interpretation and could draft geologic maps where none yet exist. In this work we develop a method for incorporating images from multiple instruments to classify Martian terrain into multiple types. Each image is segmented into contiguous groups of similar pixels, called superpixels, with an associated vector of discriminative features. We have developed and tested several classification algorithms to associate a best class to each superpixel. These classifiers are trained using three different manual classifications with between 2 and 6 classes. Automatic classification accuracies of 50 to 80% are achieved in leave-one-out cross-validation across 20 scenes using a multi-class boosting classifier.
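A minimal sketch of the evaluation protocol described above (an illustration, not the authors' implementation; scikit-learn's GradientBoostingClassifier stands in for the paper's multi-class boosting classifier, and X, y, and scenes are assumed inputs):

    # Leave-one-scene-out evaluation of a multi-class boosting classifier over
    # superpixel feature vectors. X is an (n_superpixels, n_features) array,
    # y holds the manual class labels, and scenes maps each superpixel to one
    # of the 20 scenes.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import LeaveOneGroupOut

    def leave_one_scene_out_accuracy(X, y, scenes):
        scores = []
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=scenes):
            clf = GradientBoostingClassifier()  # stand-in for the paper's boosting classifier
            clf.fit(X[train_idx], y[train_idx])
            scores.append(clf.score(X[test_idx], y[test_idx]))
        return float(np.mean(scores))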
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Generating synthetic population data from multiple raw data sources is a fundamental step for many data science tasks with a wide range of applications. However, despite the presence of a number of approaches such as iterative proportional fitting (IPF) and combinatorial optimization (CO), an efficient and standard framework for handling this type of problem is absent. In this study, we propose a multi-stage framework called SynC (short for Synthetic Population via Gaussian Copula) to fill the gap. SynC first removes potential outliers in the data and then fits the filtered data with a Gaussian copula model to correctly capture dependencies and marginal distributions of sampled survey data. Finally, SynC leverages neural networks to merge datasets into one. Our key contributions include: 1) proposing a novel framework for generating individual-level data from aggregated data sources by combining state-of-the-art machine learning and statistical techniques, 2) designing a metric for validating the accuracy of generated data when the ground truth is hard to obtain, 3) releasing an easy-to-use framework implementation for reproducibility and demonstrating its effectiveness with the Canada National Census data, and 4) presenting two real-world use cases where datasets of this nature can be leveraged by businesses.
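A minimal sketch of the Gaussian copula step described above (an illustration under assumptions, not the SynC implementation; the outlier-removal and neural-network merging stages are omitted):

    # Fit marginals empirically, estimate a latent normal correlation, then
    # sample from the latent Gaussian and map back through empirical quantiles.
    import numpy as np
    from scipy import stats

    def fit_and_sample_gaussian_copula(data, n_samples, seed=0):
        """data: (n_obs, n_vars) numeric survey matrix; returns synthetic rows."""
        rng = np.random.default_rng(seed)
        n_obs, n_vars = data.shape
        # 1) Transform each column to normal scores via its empirical CDF.
        ranks = np.apply_along_axis(stats.rankdata, 0, data) / (n_obs + 1)
        z = stats.norm.ppf(ranks)
        # 2) The dependence structure is the correlation of the normal scores.
        corr = np.corrcoef(z, rowvar=False)
        # 3) Sample the latent Gaussian and invert through empirical quantiles.
        z_new = rng.multivariate_normal(np.zeros(n_vars), corr, size=n_samples)
        u_new = stats.norm.cdf(z_new)
        synthetic = np.column_stack([
            np.quantile(data[:, j], u_new[:, j]) for j in range(n_vars)
        ])
        return synthetic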
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Global Biotic Interactions: Interpreted Data Products
Global Biotic Interactions (GloBI, https://globalbioticinteractions.org, [1]) aims to facilitate access to existing species interaction records (e.g., predator-prey, plant-pollinator, virus-host). This data publication provides interpreted species interaction data products. These products are the result of a process in which versioned, existing species interaction datasets ([2]) are linked to the so-called GloBI Taxon Graph ([3]) and transformed into various aggregate formats (e.g., tsv, csv, neo4j, rdf/nquad, darwin core-ish archives). In addition, the applied name maps are included to make the applied taxonomic linking explicit.
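As a rough illustration of how the gzipped tabular products can be consumed (an assumed workflow, not part of the data product itself; see the Contents section below for the file descriptions):

    # Load one of the gzipped tab-separated products with pandas and inspect it.
    import pandas as pd

    # pandas infers gzip compression from the .gz suffix.
    interactions = pd.read_csv("interactions.tsv.gz", sep="\t", low_memory=False)

    print(interactions.shape)
    print(interactions.columns.tolist())  # inspect the schema before filtering
    print(interactions.head())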
Citation--------
GloBI is made possible by researchers, collections, projects and institutions openly sharing their datasets. When using this data, please make sure to attribute these original data contributors, including citing the specific datasets in derivative work. Each species interaction record indexed by GloBI contains a reference and dataset citation. Also, a full list of all references can be found in the citations.csv/citations.tsv files in this publication. If you have ideas on how to make it easier to cite original datasets, please open/join a discussion via https://globalbioticinteractions.org or related projects.
To credit GloBI for more easily finding interaction data, please use the following citation to reference GloBI:
Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics. https://doi.org/10.1016/j.ecoinf.2014.08.005.
Bias and Errors--------
As with any analysis and processing workflow, care should be taken to understand the bias and error propagation of data sources and related data transformation processes. The datasets indexed by GloBI are biased geospatially, temporally and taxonomically ([5], [6]). Also, mapping of verbatim names from datasets to known name concepts may contain errors due to synonym mismatches, outdated name lists, typos or conflicting name authorities. Finally, bugs may introduce bias and errors in the resulting integrated data product.
To help better understand where bias and errors are introduced, only versioned data and code are used as inputs: the datasets ([2]), name maps ([3]) and integration software ([6]) are versioned so that the integration processes can be reproduced if needed. This way, the steps taken to compile an integrated data record can be traced and the sources of bias and errors can be more easily found.
This version was preceded by [7].
Contents--------
README: this file
citations.csv.gz: contains data citations in a gzipped comma-separated values format.
citations.tsv.gz: contains data citations in a gzipped tab-separated values format.
datasets.csv.gz: contains a list of indexed datasets in a gzipped comma-separated values format.
datasets.tsv.gz: contains a list of indexed datasets in a gzipped tab-separated values format.
verbatim-interactions.csv.gz: contains species interactions tabulated as pair-wise interactions in a gzipped comma-separated values format. Included taxonomic names are not interpreted, but included as documented in their sources.
verbatim-interactions.tsv.gz: contains species interactions tabulated as pair-wise interactions in a gzipped tab-separated values format. Included taxonomic names are not interpreted, but included as documented in their sources.
interactions.csv.gz: contains species interactions tabulated as pair-wise interactions in a gzipped comma-separated values format. Included taxonomic names are interpreted using taxonomic alignment workflows and may be different than those provided by the original sources.
interactions.tsv.gz: contains species interactions tabulated as pair-wise interactions in a gzipped tab-separated values format. Included taxonomic names are interpreted using taxonomic alignment workflows and may be different than those provided by the original sources.
refuted-interactions.csv.gz: contains refuted species interactions tabulated as pair-wise interactions in a gzipped comma-separated values format. Included taxonomic names are interpreted using taxonomic alignment workflows and may be different than those provided by the original sources.
refuted-interactions.tsv.gz: contains refuted species interactions tabulated as pair-wise interactions in a gzipped tab-separated values format. Included taxonomic names are interpreted using taxonomic alignment workflows and may be different than those provided by the original sources.
refuted-verbatim-interactions.csv.gz: contains refuted species interactions tabulated as pair-wise interactions in a gzipped comma-separated values format. Included taxonomic names are not interpreted, but included as documented in their sources.
refuted-verbatim-interactions.tsv.gz: contains refuted species interactions tabulated as pair-wise interactions in a gzipped tab-separated values format. Included taxonomic names are not interpreted, but included as documented in their sources.
interactions.nq.gz: contains species interactions expressed in the Resource Description Framework in a gzipped rdf/quads format.
dwca-by-study.zip: contains species interaction data as a Darwin Core Archive aggregated by study using a custom, occurrence-level, association extension.
dwca.zip: contains species interaction data as a Darwin Core Archive using a custom, occurrence-level, association extension.
neo4j-graphdb.zip: contains a neo4j v3.5.32 graph database snapshot containing a graph representation of the species interaction data.
taxonCache.tsv.gz: contains hierarchies and identifiers associated with names from naming schemes in a gzipped tab-separated values format.
taxonMap.tsv.gz: describes how names in existing datasets were mapped into existing naming schemes in a gzipped tab-separated values format.
References-----
[1] Jorrit H. Poelen, James D. Simons and Chris J. Mungall. (2014). Global Biotic Interactions: An open infrastructure to share and analyze species-interaction datasets. Ecological Informatics. doi: 10.1016/j.ecoinf.2014.08.005.
[2] Poelen, J. H. (2020) Global Biotic Interactions: Elton Dataset Cache. Zenodo. doi: 10.5281/ZENODO.3950557.
[3] Poelen, J. H. (2021). Global Biotic Interactions: Taxon Graph (Version 0.3.28) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4451472
[4] Hortal, J. et al. (2015) Seven Shortfalls that Beset Large-Scale Knowledge of Biodiversity. Annual Review of Ecology, Evolution, and Systematics, 46(1), pp.523–549. doi: 10.1146/annurev-ecolsys-112414-054400.
[5] Cains, M. et al. (2017) IVMOOC 2017 - Gap Analysis of GloBI: Identifying Research and Data Sharing Opportunities for Species Interactions. Zenodo. doi: 10.5281/ZENODO.814978.
[6] Poelen, J. et al. (2022) globalbioticinteractions/globalbioticinteractions v0.24.6. Zenodo. doi: 10.5281/ZENODO.7327955.
[7] GloBI Community. (2024). Global Biotic Interactions: Interpreted Data Products hash://md5/946f7666667d60657dc89d9af8ffb909 hash://sha256/4e83d2daee05a4fa91819d58259ee58ffc5a29ec37aa7e84fd5ffbb2f92aa5b8 (0.7) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11552565
Content References-----
hash://sha256/5f4906439eba61f936b3dd7455a62c51656a74206f82d3f654e330fda6fbbe45 citations.csv.gz
hash://sha256/c8100368dae39363b241472695c1ae197aaddc6e3d6c0a14f3f5ee704b37f3f6 citations.tsv.gz
hash://sha256/e6f4aa897c5b325e444315e021b246ffed07fef764b0de6c0f1b2688bbdf9d0f datasets.csv.gz
hash://sha256/e6f4aa897c5b325e444315e021b246ffed07fef764b0de6c0f1b2688bbdf9d0f datasets.tsv.gz
hash://sha256/f11dc825609cdb1d4a3e9ba8caca9bf93c90dd6f660c7f6a0c8aa01c035a5e1f dwca-by-study.zip
hash://sha256/7f16aacacae74e8b0cdef04c612ba776f508ff7ffe385abc57583e37aec8fe53 dwca.zip
hash://sha256/b65e4c9a3615f1386bb97e45fb907d053df55476149aa6d71e6f398351218d0d interactions.csv.gz
hash://sha256/0c28032392f82d753690be126805e6334ca46bdc4b5e2102a79b15ce0cc0ba90 interactions.nq.gz
hash://sha256/8a7031250c288ba0da3d5cdbedc19d54c2f16ba3aa70d49826a7369b6edeca04 interactions.tsv.gz
hash://sha256/d0c0fbf536cc63c004d057efc14600ba8cc5874f401b08f51837273b7854f1bb neo4j-graphdb.zip
hash://sha256/50e77636f8b58c040e38b6a70ba7cc8288b190ef252dc0d4eb2f12f4c541e82f README
hash://sha256/a74e2a39cfe133ae9de1eeea94f5dda8cbd58cfe61a8ccf91b7c540757719c74 refuted-interactions.csv.gz
hash://sha256/37b06e274e41ca749399763989816854101238ade9863365f384a2764c639e9d refuted-interactions.tsv.gz
hash://sha256/23315b6cd3fdc91f9c1d5d5bc39fa52cf1cef7a4e97d9d023d452751df13f30e refuted-verbatim-interactions.csv.gz
hash://sha256/ff82e40cee4f8a8852d0c241f5027f66157a2b8a9090ffa3a0a329a206828d96 refuted-verbatim-interactions.tsv.gz
hash://sha256/f072fbc7affb6e29978c7540af6cdccd3a219a23b0a4765b5bae56bd20df0d88 taxonCache.tsv.gz
hash://sha256/cd28c81bb2432646a81ad216bc11818f7568ce81826e0074d9a33579da2c1426 taxonMap.tsv.gz
hash://sha256/a1d14aa47806c624cf7e3a8c8236643dcf19ed1835c79c65958f7317ebfb9566 verbatim-interactions.csv.gz
hash://sha256/2284434219d5fdab1e2152955f04363852c132b76709c330d33e31517817a82e verbatim-interactions.tsv.gz
hash://md5/d6ebf42729d988e15cb30adfa6112234 citations.csv.gz
hash://md5/42877ae68e51871b8eb7116e62f6b268 citations.tsv.gz
hash://md5/3e437580296fdeff3b6f35d1331db9d1 datasets.csv.gz
hash://md5/3e437580296fdeff3b6f35d1331db9d1 datasets.tsv.gz
hash://md5/fe88720fd992771bd64bfa220ad6a7d3 dwca-by-study.zip
hash://md5/cbe132a9288feaef2f3e0c0409b8dc2f dwca.zip
hash://md5/051f6db667c4b84616223c2776464dbf interactions.csv.gz
hash://md5/b66857f8750e56ba9abe484b1f72eac4 interactions.nq.gz
hash://md5/300839c346184b2fedc4e1fb31bcc29c interactions.tsv.gz
hash://md5/e79cf5ffee919672f99ea338f3661566 neo4j-graphdb.zip
hash://md5/898678f47561d7ef53722bc32957dcd9 README
hash://md5/65a185f19df304e53f92a7275f2de291 refuted-interactions.csv.gz
hash://md5/bc37a4354f8a2402e9335ae44f28cbd7
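The content references above can be used to check the integrity of downloaded files. A minimal sketch, assuming the file sits in the working directory (the expected hash is copied from the sha256 list above):

    # Verify a downloaded GloBI product against its published sha256 content reference.
    import hashlib

    def sha256_of(path, chunk_size=1 << 20):
        """Stream a file and return its hex-encoded SHA-256 digest."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    expected = "8a7031250c288ba0da3d5cdbedc19d54c2f16ba3aa70d49826a7369b6edeca04"  # interactions.tsv.gz
    actual = sha256_of("interactions.tsv.gz")
    print("match" if actual == expected else f"mismatch: {actual}")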
By Homeland Infrastructure Foundation [source]
Within this dataset, users can find numerous attributes that provide insight into various aspects of shoreline construction lines. The Category_o field categorizes these structures based on certain characteristics or purposes they serve. Additionally, each object in the dataset possesses a unique name or identifier represented by the Object_Nam column.
Another crucial piece of information captured in this dataset is the status of each shoreline construction line. The Status field indicates whether a particular structure is currently active or inactive. This helps users understand if it still serves its intended purpose or has been decommissioned.
Furthermore, the dataset includes data pertaining to multiple water levels associated with different shoreline construction lines. This information can be found in the Water_Leve column and provides relevant context for understanding how these artificial coastlines interact with various water bodies.
To aid cartographic representations and proper utilization of this data source for mapping purposes at different scales, there is also an attribute called Scale_Mini. This value denotes the minimum scale necessary to visualize a specific shoreline construction line accurately.
Data sources are important for reproducibility and quality assurance purposes in any GIS analysis project; hence identifying who provided and contributed to collecting this data can be critical in assessing its reliability. In this regard, individuals or organizations responsible for providing source data are specified in the column labeled Source_Ind.
Accompanying descriptive information about each source used to create these shoreline construction lines can be found in the Source_D_1 field. This supplemental information provides additional context and details about the data's origin or collection methodology.
The dataset also includes a numerical attribute called SHAPE_Leng, representing the length of each shoreline construction line. This information complements the geographic and spatial attributes associated with these structures.
Understanding the Categories:
- The Category_o column classifies each shoreline construction line into different categories. This can range from seawalls and breakwaters to jetties and groins.
- Use this information to identify specific types of shoreline constructions based on your analysis needs.
Identifying Specific Objects:
- The Object_Nam column provides unique names or identifiers for each shoreline construction line.
- These identifiers help differentiate between different segments of construction lines in a region.
Determining Status:
- The Status column indicates whether a shoreline construction line is active or inactive.
- Active constructions are still in use and may be actively maintained or monitored.
- Inactive constructions are no longer operational or may have been demolished.
Analyzing Water Levels:
- The Water_Leve column describes the water level at which each shoreline construction line is located.
- Different levels may impact the suitability or effectiveness of these structures based on tidal changes or flood zones.
Exploring Additional Information:
- The Informatio column contains additional details about each shoreline construction line.
- This can include various attributes such as materials used, design specifications, ownership details, etc.
Determining Minimum Visible Scale:
- The Scale_Mini column specifies the minimum scale at which you can observe the coastline's man-made structures clearly.
Verifying Data Sources:
- In order to understand data reliability and credibility for further analysis, the Source_Ind, Source_D_1, SHAPE_Leng, and Source_Dat columns provide information about the individual or organization that provided the source data, as well as the length and date of the source data used to create the shoreline construction lines.
Utilize this dataset to perform various analyses related to shorelines, coastal developments, navigational channels, and impacts of man-made structures on marine ecosystems. The combination of categories, object names, status, water levels, additional information, minimum visible scale and reliable source information offers a comprehensive understanding of shoreline constructions across different regions.
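A minimal sketch of such an analysis, assuming the layer has been exported to a CSV with the attribute columns described above (the file name is hypothetical):

    # Filter active structures and summarise total length per construction category.
    import pandas as pd

    lines = pd.read_csv("shoreline_construction_lines.csv")

    active = lines[lines["Status"].str.lower() == "active"]
    length_by_category = (active.groupby("Category_o")["SHAPE_Leng"]
                                .sum()
                                .sort_values(ascending=False))
    print(length_by_category)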
Remember to refer back to the dataset documentation for any specific deta...
This is digital research data corresponding to the manuscript: Reinhart, K.O., Vermeire, L.T. Precipitation Manipulation Experiments May Be Confounded by Water Source. J Soil Sci Plant Nutr (2023). https://doi.org/10.1007/s42729-023-01298-0. Files for a 3x2x2 factorial field experiment and water quality data used to create Table 1. Data for the experiment were used for the statistical analysis and generation of summary statistics for Figure 2.
Purpose: This study aims to investigate the consequences of performing precipitation manipulation experiments with mineralized water in place of rainwater (i.e. demineralized water). Limited attention has been paid to the effects of water mineralization on plant and soil properties, even when the experiments are in a rainfed context.
Methods: We conducted a 6-yr experiment with a gradient in spring rainfall (70, 100, and 130% of ambient). We tested effects of rainfall treatments on plant biomass and six soil properties and interpreted the confounding effects of dissolved solids in irrigation water.
Results: Rainfall treatments affected all response variables. Sulfate was the most common dissolved solid in irrigation water and was 41 times more abundant in irrigated (i.e. 130% of ambient) than other plots. Soils of irrigated plots also had elevated iron (16.5 µg × 10 cm-2 × 60-d vs 8.9) and pH (7.0 vs 6.8). The rainfall gradient also had a nonlinear (hump-shaped) effect on plant available phosphorus (P). Plant and microbial biomasses are often limited by and positively associated with available P, suggesting the predicted positive linear relationship between plant biomass and P was confounded by additions of mineralized water. In other words, the unexpected nonlinear relationship was likely driven by components of mineralized irrigation water (i.e. calcium, iron) and/or shifts in soil pH that immobilized P.
Conclusions: Our results suggest robust precipitation manipulation experiments should either capture rainwater when possible (or use demineralized water) or consider the confounding effects of mineralized water on plant and soil properties.
Resources in this dataset:
Resource Title: Readme file - Data dictionary. File Name: README.txt. Resource Description: File contains the data dictionary to accompany data files for a research study.
Resource Title: 3x2x2 factorial dataset.csv. File Name: 3x2x2 factorial dataset.csv. Resource Description: Dataset is for a 3x2x2 factorial field experiment (factors: rainfall variability, mowing seasons, mowing intensity) conducted in northern mixed-grass prairie vegetation in eastern Montana, USA. Data include activity of 5 plant available nutrients, soil pH, and plant biomass metrics. Data from 2018.
Resource Title: water quality dataset.csv. File Name: water quality dataset.csv. Resource Description: Water properties (pH and common dissolved solids) of samples from the Yellowstone River collected near Miles City, Montana. Data extracted from Rinella MJ, Muscha JM, Reinhart KO, Petersen MK (2021) Water quality for livestock in northern Great Plains rangelands. Rangeland Ecol. Manage. 75: 29-34.
Description
This is a vector tile layer built from the same data as the Jurisdictional Units Public feature service located here: https://nifc.maps.arcgis.com/home/item.html?id=4107b5d1debf4305ba00e929b7e5971a. This service can be used alone as a fast-drawing background layer, or used in combination with the feature service when Identify and Copy Feature capabilities are needed. At fine zoom levels, the feature service will be needed.
Overview
The Jurisdictional Units dataset outlines wildland fire jurisdictional boundaries for federal, state, and local government entities on a national scale and is used within multiple wildland fire systems including the Wildland Fire Decision Support System (WFDSS), the Interior Fuels and Post-Fire Reporting System (IFPRS), the Interagency Fuels Treatment Decision Support System (IFTDSS), the Interagency Fire Occurrence Reporting Modules (InFORM), the Interagency Reporting of Wildland Fire Information System (IRWIN), and the Wildland Computer-Aided Dispatch Enterprise System (WildCAD-E).
In this dataset, agency and unit names are an indication of the primary manager's name and unit name, respectively, recognizing that:
- There may be multiple owner names.
- Jurisdiction may be held jointly by agencies at different levels of government (i.e. State and Local), especially on private lands.
- Some owner names may be blocked for security reasons.
- Some jurisdictions may not allow the distribution of owner names.
Private ownerships are shown in this layer with JurisdictionalUnitID=null, JurisdictionalKind=null, LandownerKind="Private", and LandownerCategory="Private". All land inside the US country boundary is covered by a polygon. Jurisdiction for privately owned land varies widely depending on state, county, or local laws and ordinances, fire workload, and other factors, and is not available in a national dataset in most cases. For publicly held lands the agency name is the surface managing agency, such as Bureau of Land Management, United States Forest Service, etc. The unit name refers to the descriptive name of the polygon (i.e. Northern California District, Boise National Forest, etc.).
Attributes
GeometryID: Primary key for linking geospatial objects with other database systems. Required for every feature. Not populated for Census Block Groups.
JurisdictionalUnitID: Where it could be determined, this is the NWCG Unit Identifier (Unit ID). Where it is unknown, the value is 'Null'. Null Unit IDs can occur because a unit may not have a Unit ID, or because one could not be reliably determined from the source data. Not every land ownership has an NWCG Unit ID. Unit ID assignment rules are available in the Unit ID standard.
JurisdictionalUnitID_sansUS: NWCG Unit ID with the "US" characters removed from the beginning. Provided for backwards compatibility.
JurisdictionalUnitName: The name of the Jurisdictional Unit. Where an NWCG Unit ID exists for a polygon, this is the name used in the Name field from the NWCG Unit ID database. Where no NWCG Unit ID exists, this is the "Unit Name" or other specific, descriptive unit name field from the source dataset. A value is populated for all polygons except for Census Block Groups and for PAD-US polygons that did not have an associated name.
LocalName: Local name for the polygon provided from agency authoritative data, PAD-US, or other source.
JurisdictionalKind: Describes the type of unit jurisdiction using the NWCG Landowner Kind data standard. There are three valid values: Federal, Other, and Private. A value is not populated for Census Block Groups.
JurisdictionalCategory: Describes the type of unit jurisdiction using the NWCG Landowner Category data standard. Valid values include: BIA, BLM, BOR, DOD, DOE, NPS, USFS, USFWS, Foreign, Tribal, City, County, State, OtherLoc (other local, not in the standard), Private, and ANCSA. A value is not populated for Census Block Groups.
LandownerKind: The landowner kind value associated with the polygon. May be inferred from jurisdictional agency, or by lack of a jurisdictional agency. Legal values align with the NWCG Landowner Kind data standard. A value is populated for all polygons.
LandownerCategory: The landowner category value associated with the polygon. May be inferred from jurisdictional agency, or by lack of a jurisdictional agency. Legal values align with the NWCG Landowner Category data standard. A value is populated for all polygons.
LandownerDepartment: Federal department information that aligns with a unit's LandownerCategory information. Legal values include: Department of Agriculture, Department of Interior, Department of Defense, and Department of Energy. A value is not populated for all polygons.
DataSource: The database from which the polygon originated. An effort is made to be as specific as possible (i.e. identify the geodatabase name and feature class in which the polygon originated).
SecondaryDataSource: If the DataSource field is an aggregation from other sources, use this field to specify the source that supplied data to the aggregation. For example, if DataSource is "PAD-US 4.0", then for a TNC polygon, the SecondaryDataSource would be "TNC_PADUS2_0_SA2015_Public_gdb".
SourceUniqueID: Identifier (GUID or ObjectID) in the data source. Used to trace the polygon back to its authoritative source.
DataSourceYear: Year that the source data for the polygon were acquired.
MapMethod: Controlled vocabulary to define how the geospatial feature was derived. MapMethod will be Mixed Methods by default for this layer as the data are from mixed sources. Valid values include: GPS-Driven; GPS-Flight; GPS-Walked; GPS-Walked/Driven; GPS-Unknown Travel Method; Hand Sketch; Digitized-Image; Digitized-Topo; Digitized-Other; Image Interpretation; Infrared Image; Modeled; Mixed Methods; Remote Sensing Derived; Survey/GCDB/Cadastral; Vector; Phone/Tablet; Other.
DateCurrent: The last edit or update of this GIS record. Dates should follow the assigned NWCG Date Time data standard, using the 24-hour clock, YYYY-MM-DDhh.mm.ssZ, ISO 8601 standard.
Comments: Additional information describing the feature.
JoinMethod: Additional information on how the polygon was matched to information in the NWCG Unit ID database.
LegendJurisdictionalCategory: JurisdictionalCategory values grouped for more intuitive use in a map legend or summary table. Census Block Groups are classified as "No Unit".
LegendLandownerCategory: LandownerCategory values grouped for more intuitive use in a map legend or summary table.
Other Relevant NWCG Definition Standards
Unit: A generic term that represents an organizational entity that only has meaning when it is contextualized by a descriptor, e.g. jurisdictional. Definition Extension: When referring to an organizational entity, a unit refers to the smallest area or lowest level. Higher levels of an organization (region, agency, department, etc.) can be derived from a unit based on organization hierarchy.
Unit, Jurisdictional: The governmental entity having overall land and resource management responsibility for a specific geographical area as provided by law. Definition Extension: 1) Ultimately responsible for the fire report to account for statistical fire occurrence; 2) Responsible for setting fire management objectives; 3) Jurisdiction cannot be re-assigned by agreement; 4) The nature and extent of the incident determines jurisdiction (for example, Wildfire vs. All Hazard); 5) Responsible for signing a Delegation of Authority to the Incident Commander. See also: Protecting Unit; Landowner.
Data Sources
This dataset is an aggregation of multiple spatial data sources:
- Authoritative land ownership records from BIA, BLM, NPS, USFS, USFWS, and the Alaska Fire Service/State of Alaska
- The Protected Areas Database US (PAD-US 4.0)
- Census Block-Group Geometry
BIA and Tribal Data: BIA and Tribal land management data were aggregated from BIA regional offices. These data date from 2012 and were reviewed/updated in 2024. Indian Trust Land affiliated with Tribes, Reservations, or BIA Agencies: These data are not considered the system of record and are not intended to be used as such. The Bureau of Indian Affairs (BIA), Branch of Wildland Fire Management (BWFM) is not the originator of these data. The spatial data coverage is a consolidation of the best available records/data received from each of the 12 BIA Regional Offices. The data are no better than the original sources from which they were derived. Care was taken when consolidating these files. However, BWFM cannot accept any responsibility for errors, omissions, or positional accuracy in the original digital data. The information contained in these data is dynamic and is continually changing. Updates to these data will be made whenever such data are received from a Regional Office. The BWFM gives no guarantee, expressed, written, or implied, regarding the accuracy, reliability, or completeness of these data.
Alaska: The state of Alaska and Alaska Fire Service (BLM) co-manage a process to aggregate authoritative land ownership, management, and jurisdictional boundary data, based on Master Title Plats.
Data Processing
To compile this dataset, the authoritative land ownership records and the PAD-US data mentioned above were crosswalked into the Jurisdictional Unit Polygon schema and aggregated through a series of Python scripts and FME models. Once aggregated, steps were taken to reduce overlaps within the data. All overlap areas larger than 300 acres were manually examined and removed with the assistance of fire management SMEs. Once overlaps were removed, Census Block Group geometry were crosswalked to the Jurisdictional Unit Polygon schema and appended in areas in which no jurisdictional boundaries were recorded within the authoritative land ownership records and the PAD-US data. Census Block Group geometries represent areas of unknown Landowner Kind/Category and
Empower Your Business With Professional Data Licensing Services
Discover a 360-Degree View of Worldwide Solution Buyers and Their Needs Leverage over 70 insights that will help you make better decisions to manage your sales pipeline, target key accounts with customized messaging, and focus your sales and marketing efforts:
Here are some of the types of insights our data licensing services can provide:
Technology Insights: Discover companies’ technology preferences, including their tech stack for essential investments such as CRM systems, marketing and sales automation, email security and hosting, data analytics, and cloud security and providers.
Departmental Roles and Openings: Access real-time data on the number of roles and job openings across various departments, including IT, Development, Security, Marketing, Sales, and Customer Success. This information helps you gauge the company’s growth trajectory and possible needs.
Funding Insights: Stay updated on the latest funding, dates, types, and lead investors, providing you with a clear understanding of a company’s potential for growth investments.
Mobile Application Insights: Find out if the company has a mobile app or web app, enabling you to tailor your pitch effectively.
Website traffic and advertising spend metrics: Customers can leverage website traffic and advertising data to gain insights into competitor performance, allowing them to refine their marketing strategies and optimize ad spending.
Access unlimited data and improve conversions by 3X
Leverage the data for your Account-Based Marketing (ABM) strategy
Leverage ICP (industry, company size, location etc) to identify high- potential Accounts.
Utilize GTM strategies to deliver personalized marketing experiences through multi-channel outreach (email, cell, social media) that resonate with the target audience.
Who can leverage our Data:
B2B marketing Teams- Increase marketing leads and enhance conversions.
B2B sales teams- Build a stronger pipeline and increase your deal wins.
Talent sourcing/Staffing companies- Leverage our data to identify and engage top talent, streamlining your recruitment process and finding the best candidates faster.
Research companies/Investors- Insights into the financial investments received by a company, including funding rounds, amounts, and investor details.
Technology companies: Leverage our Technographic data to reveal the technology stack and tools used by companies, helping tailor marketing and sales efforts.
Data Source:
The Database, sourced through multiple sources and validated using proprietary methods on an ongoing basis, is highly customizable. It contains parameters such as employee size, job title, domain, industry, Technography, Ad spends, Funding data, and more, which can be tailored to create segments that perfectly align with your targeting needs. That is exactly why our Database is perfect for licensing!
FAQs
Can licensed data be resold or redistributed? Answer: No, The customer shall not, directly or indirectly, sell, distribute, license, or otherwise make available the licensed data to any third party that intends to resell, sublicense, or redistribute the data. The Customer must take reasonable steps to ensure that any recipient of the licensed data is using it for internal purposes only and not for resale or redistribution. Any breach of this provision shall be considered a material breach of this Order Form and may result in the immediate termination of the Customer's rights under this agreement, as well as any applicable remedies available under law.
What is the duration of the data license and usage terms? Answer: The data license is valid for 12 months (1 year) for unlimited usage. Customers also have the option to license the data for multiple years. At the end of the first year, Customers can renew the license to maintain continued access.
What happens if the customer misuses the data? Answer: The data can be used without limits for a period of one year or multiple years (depending on the contract tenure); however, Thomson Data actively monitors its usage. If any unusual activity is detected, Thomson Data reserves the right to terminate the account.
How frequently is the data updated? Answer: The data is updated on a quarterly basis, and fresh records are added on a monthly basis.
What is the accuracy rate of the data? Answer: Customers can expect 90% accuracy for all data points, with email accuracy ranging between 85% and 90%. Cell phone data accuracy is around 80%.
What types of information are included in the data? Answer: Thomson Data provides over 70+ data points, including contact details (name, job title, LinkedIn profile, cell number, email address, education, certifications, work experience, etc.), company information, department/team sizes, SIC and NAICS codes, industry classification, technographic detai...
The USGS Protected Areas Database of the United States (PAD-US) is the nation's inventory of protected areas, including public land and voluntarily provided private protected areas, identified as an A-16 National Geospatial Data Asset in the Cadastre Theme ( https://communities.geoplatform.gov/ngda-cadastre/ ). The PAD-US is an ongoing project with several published versions of a spatial database including areas dedicated to the preservation of biological diversity, and other natural (including extraction), recreational, or cultural uses, managed for these purposes through legal or other effective means. The database was originally designed to support biodiversity assessments; however, its scope expanded in recent years to include all open space public and nonprofit lands and waters. Most are public lands owned in fee (the owner of the property has full and irrevocable ownership of the land); however, permanent and long-term easements, leases, agreements, Congressional (e.g. 'Wilderness Area'), Executive (e.g. 'National Monument'), and administrative designations (e.g. 'Area of Critical Environmental Concern') documented in agency management plans are also included. The PAD-US strives to be a complete inventory of U.S. public land and other protected areas, compiling “best available” data provided by managing agencies and organizations. The PAD-US geodatabase maps and describes areas using thirty-six attributes and five separate feature classes representing the U.S. protected areas network: Fee (ownership parcels), Designation, Easement, Marine, Proclamation and Other Planning Boundaries. An additional Combined feature class includes the full PAD-US inventory to support data management, queries, web mapping services, and analyses. The Feature Class (FeatClass) field in the Combined layer allows users to extract data types as needed. A Federal Data Reference file geodatabase lookup table (PADUS3_0Combined_Federal_Data_References) facilitates the extraction of authoritative federal data provided or recommended by managing agencies from the Combined PAD-US inventory. This PAD-US Version 3.0 dataset includes a variety of updates from the previous Version 2.1 dataset (USGS, 2020, https://doi.org/10.5066/P92QM3NT ), achieving goals to: 1) Annually update and improve spatial data representing the federal estate for PAD-US applications; 2) Update state and local lands data as state data-steward and PAD-US Team resources allow; and 3) Automate data translation efforts to increase PAD-US update efficiency. The following list summarizes the integration of "best available" spatial data to ensure public lands and other protected areas from all jurisdictions are represented in the PAD-US (other data were transferred from PAD-US 2.1). Federal updates - The USGS remains committed to updating federal fee owned lands data and major designation changes in annual PAD-US updates, where authoritative data provided directly by managing agencies are available or alternative data sources are recommended. The following is a list of updates or revisions associated with the federal estate: 1) Major update of the Federal estate (fee ownership parcels, easement interest, and management designations where available), including authoritative data from 8 agencies: Bureau of Land Management (BLM), U.S. Census Bureau (Census Bureau), Department of Defense (DOD), U.S. Fish and Wildlife Service (FWS), National Park Service (NPS), Natural Resources Conservation Service (NRCS), U.S. 
Forest Service (USFS), and National Oceanic and Atmospheric Administration (NOAA). The federal theme in PAD-US is developed in close collaboration with the Federal Geographic Data Committee (FGDC) Federal Lands Working Group (FLWG, https://communities.geoplatform.gov/ngda-govunits/federal-lands-workgroup/ ). 2) Improved the representation (boundaries and attributes) of the National Park Service, U.S. Forest Service, Bureau of Land Management, and U.S. Fish and Wildlife Service lands, in collaboration with agency data-stewards, in response to feedback from the PAD-US Team and stakeholders. 3) Added a Federal Data Reference file geodatabase lookup table (PADUS3_0Combined_Federal_Data_References) to the PAD-US 3.0 geodatabase to facilitate the extraction (by Data Provider, Dataset Name, and/or Aggregator Source) of authoritative data provided directly (or recommended) by federal managing agencies from the full PAD-US inventory. A summary of the number of records (Frequency) and calculated GIS Acres (vs Documented Acres) associated with features provided by each Aggregator Source is included; however, the number of records may vary from source data as the "State Name" standard is applied to national files. The Feature Class (FeatClass) field in the table and geodatabase describe the data type to highlight overlapping features in the full inventory (e.g. Designation features often overlap Fee features) and to assist users in building queries for applications as needed. 4) Scripted the translation of the Department of Defense, Census Bureau, and Natural Resource Conservation Service source data into the PAD-US format to increase update efficiency. 5) Revised conservation measures (GAP Status Code, IUCN Category) to more accurately represent protected and conserved areas. For example, Fish and Wildlife Service (FWS) Waterfowl Production Area Wetland Easements changed from GAP Status Code 2 to 4 as spatial data currently represents the complete parcel (about 10.54 million acres primarily in North Dakota and South Dakota). Only aliquot parts of these parcels are documented under wetland easement (1.64 million acres). These acreages are provided by the U.S. Fish and Wildlife Service and are referenced in the PAD-US geodatabase Easement feature class 'Comments' field. State updates - The USGS is committed to building capacity in the state data-steward network and the PAD-US Team to increase the frequency of state land updates, as resources allow. The USGS supported efforts to significantly increase state inventory completeness with the integration of local parks data in the PAD-US 2.1, and developed a state-to-PAD-US data translation script during PAD-US 3.0 development to pilot in future updates. Additional efforts are in progress to support the technical and organizational strategies needed to increase the frequency of state updates. The PAD-US 3.0 included major updates to the following three states: 1) California - added or updated state, regional, local, and nonprofit lands data from the California Protected Areas Database (CPAD), managed by GreenInfo Network, and integrated conservation and recreation measure changes following review coordinated by the data-steward with state managing agencies. Developed a data translation Python script (see Process Step 2 Source Data Documentation) in collaboration with the data-steward to increase the accuracy and efficiency of future PAD-US updates from CPAD. 
2) Virginia - added or updated state, local, and nonprofit protected areas data (and removed legacy data) from the Virginia Conservation Lands Database, provided by the Virginia Department of Conservation and Recreation's Natural Heritage Program, and integrated conservation and recreation measure changes following review by the data-steward. 3) West Virginia - added or updated state, local, and nonprofit protected areas data provided by the West Virginia University, GIS Technical Center. For more information regarding the PAD-US dataset please visit, https://www.usgs.gov/gapanalysis/PAD-US/. For more information about data aggregation please review the PAD-US Data Manual available at https://www.usgs.gov/core-science-systems/science-analytics-and-synthesis/gap/pad-us-data-manual . A version history of PAD-US updates is summarized below (See https://www.usgs.gov/core-science-systems/science-analytics-and-synthesis/gap/pad-us-data-history for more information): 1) First posted - April 2009 (Version 1.0 - available from the PAD-US: Team pad-us@usgs.gov). 2) Revised - May 2010 (Version 1.1 - available from the PAD-US: Team pad-us@usgs.gov). 3) Revised - April 2011 (Version 1.2 - available from the PAD-US: Team pad-us@usgs.gov). 4) Revised - November 2012 (Version 1.3) https://doi.org/10.5066/F79Z92XD 5) Revised - May 2016 (Version 1.4) https://doi.org/10.5066/F7G73BSZ 6) Revised - September 2018 (Version 2.0) https://doi.org/10.5066/P955KPLE 7) Revised - September 2020 (Version 2.1) https://doi.org/10.5066/P92QM3NT 8) Revised - January 2022 (Version 3.0) https://doi.org/10.5066/P9Q9LQ4B Comparing protected area trends between PAD-US versions is not recommended without consultation with USGS as many changes reflect improvements to agency and organization GIS systems, or conservation and recreation measure classification, rather than actual changes in protected area acquisition on the ground.
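As an informal illustration (assumed usage, not USGS tooling), the FeatClass field described above can be used to pull a single data type out of the Combined inventory once the geodatabase has been downloaded; the path and layer name below are placeholders, so check the PAD-US Data Manual for the exact names:

    # Read the Combined PAD-US feature class and filter by data type (FeatClass).
    import geopandas as gpd

    combined = gpd.read_file("PADUS3_0_Geodatabase.gdb", layer="PADUS3_0Combined")
    print(combined["FeatClass"].value_counts())  # Fee, Designation, Easement, Marine, ...
    fee_only = combined[combined["FeatClass"] == "Fee"]
    print(len(fee_only), "fee-ownership parcels")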
Abstract Biodiversity research has advanced by testing expectations of ecological and evolutionary hypotheses through the linking of large-scale genetic, distributional, and trait datasets. The rise of molecular systematics over the past 30 years has resulted in a wealth of DNA sequences from around the globe. Yet, advances in molecular systematics also have created taxonomic instability, as new estimates of evolutionary relationships and interpretations of species limits have required widespread scientific name changes. Taxonomic instability, colloquially "splits, lumps, and shuffles," presents logistical challenges to large-scale biodiversity research because (1) the same species or sets of populations may be listed under different names in different data sources, or (2) the same name may apply to different sets of populations representing different taxonomic concepts. Consequently, distributional and trait data are often difficult to link directly to primary DNA sequen...
Taxonomic reconciliation
We downloaded all names from the NCBI Taxonomy database (Schoch et al., 2020) that descended from "Aves" (TaxID: 8782) on 3 May 2020 (Data Repository D2). From this list, we extracted all species and subspecies names as well as their NCBI Taxonomy ID (TaxID) numbers. We then ran a custom Perl script (Data Repository D3) to exactly match binomial (genus, species) and trinomial (genus, species, subspecies) names from NCBI Taxonomy to the names recognized by the eBird/Clements v2019 Integrated Checklist (August 2019; Data Repository D4). For each mismatch with the NCBI Taxonomy name, we then identified the corresponding equivalent eBird/Clements species or subspecies. We first searched for names in Avibase (Lepage et al., 2014). However, Avibase's search function currently facilitates only exact matches to taxonomies it implements. For names that were not an exact match to an Avibase taxonomic concept, we implemented web searches (Google) which often identified minor sp...
D1: "PetersVsClements2Final.txt" - This file tells which species from the Peters taxonomy match the 2019 Clements/ebird taxonomy. The first column has a species name from the Peters taxonomy. In the second column, "Clements" indicates that the species name matches the Clements/ebird taxonomy, "No" means it does not match, and "Close" means that the names match when you disregard the last two letters.
"SibleyMonroeVsClements_Final.txt" - This file tells which species from the Sibley Monroe taxonomy match the 2019 Clements/ebird taxonomy. The first column has a species ID number from the Sibley Monroe taxonomy. The second column has the species scientific name from the Sibley Monroe taxonomy. The third column has the common name from the Sibley Monroe taxonomy. In the fourth column, "Clements" indicates that the species name matches the Clements/ebird taxonomy, "No" means it does not match, and "Close" means that the names match when you disregard the last two letters.
D2: "taxonomy_result.unix.xml" ...
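A minimal sketch of the exact-match step described in the methods above (an illustration in Python, not the authors' Perl script; the example names are hypothetical):

    # Compare binomial/trinomial names from one checklist against another and
    # report the mismatches that would need manual reconciliation.
    def exact_match_names(ncbi_names, clements_names):
        """Both inputs are iterables of scientific name strings."""
        clements_set = {name.strip().lower() for name in clements_names}
        matched, unmatched = [], []
        for name in ncbi_names:
            (matched if name.strip().lower() in clements_set else unmatched).append(name)
        return matched, unmatched

    matched, unmatched = exact_match_names(
        ["Larus argentatus", "Corvus corax varius"],
        ["Larus argentatus", "Corvus corax"],
    )
    print(unmatched)  # names left over for resolution via Avibase or web searches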
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Glycan arrays are indispensable for learning about the specificities of glycan-binding proteins. Despite the abundance of available data, the current analysis methods do not have the ability to interpret and use the variety of data types and to integrate information across datasets. Here, we evaluated whether a novel, automated algorithm for glycan-array analysis could meet that need. We developed a regression-tree algorithm with simultaneous motif optimization and packaged it in software called MotifFinder. We applied the software to analyze data from eight different glycan-array platforms with widely divergent characteristics and observed an accurate analysis of each dataset. We then evaluated the feasibility and value of the combined analyses of multiple datasets. In an integrated analysis of datasets covering multiple lectin concentrations, the software determined approximate binding constants for distinct motifs and identified major differences between the motifs that were not apparent from single-concentration analyses. Furthermore, an integrated analysis of data sources with complementary sets of glycans produced broader views of lectin specificity than produced by the analysis of just one data source. MotifFinder, therefore, enables the optimal use of the expanding resource of the glycan-array data and promises to advance the studies of protein–glycan interactions.
https://spdx.org/licenses/CC0-1.0.html
Estimates of crop nutrient removal (as crop products and crop residues) are an important component of crop nutrient balances. Crop nutrient removal can be estimated through multiplication of the quantity of crop products or crop residues (removed) by the nutrient concentration of those crop products and crop residue components respectively. Data for quantities of crop products removed at a country level are available through FAOSTAT (https://www.fao.org/faostat/en/), but equivalent data for quantities of crop residues are not available at a global level. However, quantities of crop residues can be estimated if the relationship between quantity of crop residues and crop products is known. Harvest index (HI) provides one such indication of the relationship between quantity of crop products and crop residues. HI is the proportion of above-ground biomass as crop products and can be used to estimate quantity of crop residues based on quantity of crop products. Previously, meta-analyses or surveys have been performed to estimate nutrient concentrations of crop products and crop residues and harvest indices (collectively known as crop coefficients). The challenges for using these coefficients in global nutrient balances include the representativeness of world regions or countries. Moreover, it may be unclear which countries or crop types are actually represented in the analyses of data. In addition, units used among studies differ which makes comparisons challenging. To overcome these challenges, data from meta-analyses and surveys were collated in one dataset with standardised units and referrals to the original region and crop names used by the sources of data. Original region and crop names were converted into internationally recognised names, and crop coefficients were summarised into two Tiers of data, representing the world (Tier 1, with single coefficient values for the world) and specific regions or countries of the world (Tier 2, with single coefficient values for each country). This dataset will aid both global and regional analyses for crop nutrient balances.
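A minimal sketch of the arithmetic implied above, assuming HI is expressed as crop product mass divided by total above-ground biomass and nutrient concentrations are percentages of the relevant mass basis (the numbers in the example are hypothetical):

    # Residue mass follows from product mass and HI; nutrient removal is mass
    # times nutrient concentration.
    def residue_quantity(product_quantity, harvest_index):
        """Estimate crop residue quantity from crop product quantity and HI."""
        return product_quantity * (1.0 - harvest_index) / harvest_index

    def nutrient_removal(quantity, nutrient_concentration_pct):
        """Nutrient removed = quantity * concentration (% of that quantity's basis)."""
        return quantity * nutrient_concentration_pct / 100.0

    # Example: 5 t/ha of grain with HI = 0.5 implies 5 t/ha of residues; at an
    # assumed 0.6% N in residues, about 0.03 t N/ha (30 kg N/ha) is removed if
    # the residues are taken off the field.
    residues = residue_quantity(5.0, 0.5)        # 5.0 t/ha
    n_removed = nutrient_removal(residues, 0.6)  # 0.03 t N/ha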
Methods
Data acquisition
Data were primarily collated from meta-analyses found in scientific literature. Terms used in Ovid (https://ovidsp.ovid.com/), CAB Abstracts (https://www.cabdirect.org/) and Google Scholar (https://scholar.google.com/) were: (crop) AND (“nutrient concentration” OR “nutrient content” OR “harvest index”) across any time. This search resulted in over 245,000 results. These results were refined to include studies that purported to represent crop nutrient concentration and/or harvest index of crops for geographic regions of the world, as opposed to site-specific field experiments. Given the range in different crops grown globally, preference was given to acquiring datasets that included multiple crops. In some cases, authors of meta-analyses were asked for raw data to aid the standardisation process. In addition, the International Fertilizer Association (IFA), and the Food and Agriculture Organization of the United Nations (UN FAO) provided data used for crop nutrient balances (FAOSTAT 2020). The request to UN FAO yielded phosphorus and potassium crop nutrient concentrations in addition to their publicly available nitrogen concentration values (FAOSTAT 2020). In total the refined search resulted in 26 different sources of data.
Data files were converted to separate comma-delimited CSV files for each source of data, whereby a unique ‘source’ was a dataset from an article from the scientific literature or a dataset sent by the UN FAO or IFA. Crop nutrient concentrations were expressed as a percentage of dry matter and/or the percentage of fresh weight depending on which units were reported and whether dry matter percentages of crop fresh weight were reported. Meta-data text files were written to accompany each standardized CSV file. The standardized CSV files for each source of data included information on the name of the original region, the crop coefficients it purported to represent, as well as the original names of the crops as categorised by the authors of the data. If the data related to a meta-analysis of multiple sources, information was included for the primary source of data when available. Data from the separate source files were collated into one file named ‘Combined_crop_data.csv’ using R Studio (version 4.1.0) (hereafter referred to as R) with the scripts available at https://github.com/ludemannc/Tier_1_2_crop_coefficients.git.
Processing of data
When transforming the combined data file (‘Combined_crop_data.csv’) into representative crop coefficients for different regions (available in ‘Tier_1_and_2_crop_coefficients.csv’), crop coefficients that were duplicates from the same primary source of data were excluded from processing. For instance, Zhang et al. (2021) referred to multiple primary sources of data, and the data requested from the UN FAO and the IFA referred (in many cases) to crop coefficients from IPNI (2014). Duplicate crop coefficient data that came from the same primary source were therefore excluded from the summarised dataset of crop coefficients.
Two tiers of data
The data were sub-divided into two Tiers to help overcome the challenge of using these data in a global nutrient balance when data are not available for every country. This follows the approach taken by the Intergovernmental Panel on Climate Change (IPCC 2019). Data were assigned different ‘Tiers’ based on complexity and data requirements.
· Tier 1: crop coefficients at the world level.
· Tier 2: crop coefficients at more granular geographic regions of the world (e.g. at regional, country or sub-country levels).
Crop coefficients were summarised as means for each crop item and crop component based on either ‘Tier 1’ or ‘Tier 2’.
One could also envision a more detailed site-specific level (Tier 3). The data in this dataset did not meet the required level of complexity or data requirements for Tier 3, unlike, say, the site-specific data being collected as part of the Consortium for Precision Crop Nutrition (CPCN) (www.cropnutrientdata.net), which could be described as Tier 3. No data from the current dataset were therefore assigned to Tier 3. It is expected that in the future, site-specific data will be used to improve the crop coefficients further with a Tier 3 approach.
The ‘Tier_1_and_2_crop_coefficients.csv’ file includes mean crop coefficients for the Tier 1 data, and mean crop coefficients for the Tier 2 data. The Tier 1 estimates of crop coefficients were mean values across Tier 1 data that purported to represent the World.
Crop coefficients found in the data sources represent quite different geographic areas or regions. To enable combining data with different spatial overlaps for Tier 2, data were disaggregated to the country level. First, each region was assigned a list of countries (which the regional averages were assumed to represent, as listed in the ‘Original_region_names_and_assigned_countries.csv’ file). Countries were assigned alpha-3 country codes following the ISO 3166 international standards (https://www.iso.org/publication/PUB500001.html). Second, for each country, mean crop coefficients were calculated from the coefficients of all regions listed for that country. For Australia for example, the mean values for each crop coefficient were calculated from values that represented sub-country (e.g. Australia New South Wales South East), country (Australia), and multi-country (e.g. Oceania) regions. For instance, if there was a harvest index value of 0.5 for wheat for the original region ‘Australia New South Wales South East’, a value of 0.51 for the original region named ‘Australia’ and a value of 0.47 for the original region named ‘Oceania’, then the mean Tier 2 harvest index for wheat for the country Australia would be 0.493, the unweighted mean. Using our dataset, a user can assign different weights to each entry.
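A minimal sketch of this unweighted Tier 2 aggregation using the wheat/Australia example above (column names are illustrative, not the exact schema of the published files):

    # Regional values are treated as equally weighted country-level observations.
    import pandas as pd

    records = pd.DataFrame({
        "country_iso3": ["AUS", "AUS", "AUS"],
        "original_region": ["Australia New South Wales South East", "Australia", "Oceania"],
        "item": ["Wheat", "Wheat", "Wheat"],
        "harvest_index": [0.50, 0.51, 0.47],
    })

    tier2 = (records
             .groupby(["country_iso3", "item"])["harvest_index"]
             .mean()       # unweighted mean; users may substitute their own weights
             .reset_index())
    print(tier2)  # harvest_index ≈ 0.493 for AUS wheat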
To aid analysis, the names of the original categories of crop were converted into UN FAO crop ‘item’ categories, following UN FAO standards (FAOSTAT 2022) (available in the ‘Original_crop_names_in_each_item_category.csv’ file). These item categories were also assigned categorical numeric codes following UN FAO standards (FAOSTAT 2022). Data related to crop products (e.g. grain, beans, saleable tubers or fibre) were assigned the category “Crop_products” and crop residues (e.g. straw, stover) were assigned the category “Crop_residues”.
Dry and fresh matter weights
In some cases, nutrient concentration values from the original sources were available on a dry matter or a fresh weight basis, but not both. Gaps in either the nutrient concentration on a dry matter or fresh weight basis were given imputed values. If the data source mentioned the dry matter percentage of the crop component, then this was preferentially used to impute the other missing nutrient concentration data. If dry matter percentage information was not available for a particular crop item or crop component, missing data were imputed using the mean dry matter percentage values across all Tier 1 and Tier 2 data.
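A minimal sketch of the imputation rule described above, assuming a fresh-weight concentration equals the dry-matter concentration scaled by the dry matter fraction, and vice versa (the example values are hypothetical):

    def impute_fresh_from_dry(conc_dry_pct, dry_matter_pct):
        """% of fresh weight = % of dry matter * dry matter fraction."""
        return conc_dry_pct * dry_matter_pct / 100.0

    def impute_dry_from_fresh(conc_fresh_pct, dry_matter_pct):
        """% of dry matter = % of fresh weight / dry matter fraction."""
        return conc_fresh_pct * 100.0 / dry_matter_pct

    # Example: a residue with 0.6% N of dry matter at 85% dry matter
    # contains about 0.51% N on a fresh-weight basis.
    print(impute_fresh_from_dry(0.6, 85.0))  # 0.51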
Global means for the UN FAO Cropland Nutrient Budget.
Data were also summarised as means for nitrogen (N), elemental phosphorus (P) and elemental potassium (K) nutrient concentrations of crop products using data that represented the world (Tier 1) for the 2023 UN FAO Cropland Nutrient Budget. These data are available in the file named World_crop_coefficients_for_UN_FAO.csv.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/2.null/customlicense?persistentId=doi:10.5064/F6JOQXNF
This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the Publisher's Website.
Data Generation
The research project engages with perceptions of fairness in criminal justice decisions. The specific focus is a debate between ProPublica, a news organization, and Northpointe, the owner of a popular risk tool called COMPAS. ProPublica wrote that COMPAS was racist against blacks, while Northpointe posted online a reply rejecting such a finding. These two documents were the obvious foci of the qualitative analysis because of the further media attention they attracted, the confusion their competing conclusions caused readers, and the power both companies wield in public circles. There were no barriers to retrieval, as both documents have been publicly available on their corporate websites. This public access was one of the motivators for choosing them, as it meant that they were also easily attainable by the general public, thus extending the documents’ reach and impact. Additional materials from ProPublica relating to the main debate were also freely downloadable from its website and a third-party, open source platform. Access to secondary source materials comprising additional writings from Northpointe representatives that could assist in understanding Northpointe’s main document, though, was more limited. Because of a claim of trade secrets on its tool and the underlying algorithm, it was more difficult to reach Northpointe’s other reports. Nonetheless, largely because its clients are governmental bodies with transparency and accountability obligations, some Northpointe-associated reports were retrievable from third parties who had obtained them, largely through Freedom of Information Act queries. Together, the primary and (retrievable) secondary sources allowed for a triangulation of themes, arguments, and conclusions. The quantitative component uses a dataset of over 7,000 individuals with information that was collected and compiled by ProPublica and made available to the public on GitHub. Because ProPublica gathered the data directly from criminal justice officials via Freedom of Information Act requests, the dataset is in the public domain, and thus no confidentiality issues are present. The dataset was loaded into SPSS v. 25 for data analysis.
Data Analysis
The qualitative enquiry used critical discourse analysis, which investigates ways in which parties in their communications attempt to create, legitimate, rationalize, and control mutual understandings of important issues. Each of the two main discourse documents was parsed on its own merit. Yet the project was also intertextual in studying how the discourses correspond with each other and with other relevant writings by the same authors.
Several more specific types of discursive strategies attracted further critical examination:
- Testing claims and rationalizations that appear to serve the speaker’s self-interest
- Examining conclusions and determining whether sufficient evidence supported them
- Revealing contradictions and/or inconsistencies within the same text and intertextually
- Assessing strategies underlying justifications and rationalizations used to promote a party’s assertions and arguments
- Noticing strategic deployment of lexical phrasings, syntax, and rhetoric
- Judging sincerity of voice and the objective consideration of alternative perspectives
Of equal importance in a critical discourse analysis is consideration of what is not addressed, that is, uncovering facts and/or topics missing from the communication. For this project, this included parsing issues that were either briefly mentioned and then neglected, asserted with their significance left unstated, or not suggested at all. This task required understanding common practices in the algorithmic data science literature. The paper could have been completed with just the critical discourse analysis. However, because one of the salient findings from it highlighted that the discourses overlooked numerous definitions of algorithmic fairness, the call to fill this gap seemed obvious. The availability of the same dataset used by the parties in conflict made this opportunity more appealing: calculating additional algorithmic equity equations would not be troubled by irregularities arising from diverse sample sets. New variables were created as relevant to calculate algorithmic fairness equations. In addition to using various SPSS Analyze functions (e.g., regression, crosstabs, means), online statistical calculators were useful to compute z-test comparisons of proportions and t-test comparisons of means.
Logic of Annotation
Annotations were employed to fulfil a variety of functions, including supplementing the main text with context, observations, counter-points, analysis, and source attributions. These fall under a few categories.
Space considerations. Critical discourse analysis offers a rich method for studying speech and text. The discourse analyst wishes not simply to describe, but to critically assess, explain, and offer insights about the underlying discourses. In practice, this often means the researcher generates far more material than can comfortably be included in the final paper. As a result, many draft passages, evaluations, and issues typically need to be excised. Annotation offered opportunities to incorporate dozens of findings, explanations, and supporting materials that otherwise would have been redacted. Readers wishing to learn more than is within the four corners of the official, published article can review these supplementary offerings through the links.
Visuals. The annotations use multiple data sources to provide visuals to explain, illuminate, or otherwise contextualize particular points in the main body of the paper and/or in the analytic notes. For example, a conclusion that the tool was not calibrated the same for blacks and whites could be better understood with reference to a graph showing the differences in the range of risk scores comparing these two groups. Overall, the visuals deployed here include graphs, screenshots, page extracts, diagrams, and statistical software output.
Context. The data for the qualitative segment involved long discourses.
Thus, annotations were employed to embed longer portions of quotations from the source material than was justified in the main text. This allows the reader to confirm whether quotations were taken in proper context, and thus hold the author accountable for potential errors in this regard.
Sources. Annotations incorporated extra source materials, along with quotations from them, to aid the discussion. Sources that carried some indication that they might not remain permanently available in the same form and format were more likely to be archived and activated. This practice helps ensure that readers continue to have access to the third-party materials relied upon in the research, for transparency and authentication purposes.
1. Framework overview. This paper proposed a pipeline to construct high-quality datasets for text mining in materials science. First, we utilize a traceable automatic acquisition scheme for literature to ensure the traceability of textual data. Then, a data processing method driven by downstream tasks is performed to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.
2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Here, we mainly introduce the details of the NASICON entity recognition dataset.
2.1 Data collection and preprocessing. First, 55 materials science articles related to the NASICON system were collected through the Crystallographic Information File (CIF), which contains a wealth of structure-activity relationship information. Note that materials science literature is mostly stored as portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of the literature, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token
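A minimal sketch of the PDF-to-text conversion and regular-expression cleanup described above, using pdfminer.six; the specific cleanup patterns shown are illustrative rather than the exact rules used to build the NASICON corpus:

    import re
    from pdfminer.high_level import extract_text  # pdfminer.six

    def pdf_to_clean_text(pdf_path: str) -> str:
        """Convert a PDF article to plain text and strip obvious extraction noise."""
        text = extract_text(pdf_path)
        # Rejoin words broken across lines by hyphenation
        text = re.sub(r"-\n(?=\w)", "", text)
        # Collapse line breaks introduced by the column layout, figures and tables
        text = re.sub(r"\s*\n\s*", " ", text)
        # Drop runs of non-printable characters, but keep common unit symbols
        text = re.sub(r"[^\x20-\x7E°µΩ±–—]", " ", text)
        return re.sub(r"\s{2,}", " ", text).strip()

    # Example usage (file name is hypothetical):
    # with open("nasicon_paper.txt", "w") as f:
    #     f.write(pdf_to_clean_text("nasicon_paper.pdf"))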
This digital dataset contains historical geochemical and other information for 200 samples of produced water from 182 sites in 25 oil fields in Los Angeles and Orange Counties, southern California. Produced water is a term used in the oil industry to describe water that is produced as a byproduct along with the oil and gas. The locations from which these historical samples have been collected include 152 wells. Well depth and (or) perforation depths are available for 114 of these wells. Sample depths are available for two additional wells in lieu of well or perforation depths. Additional sample sites include four storage tanks and two unidentifiable sample sources. One of the storage tank samples (Dataset ID 57) is associated with a single identifiable well. Historical samples from other storage tanks and unidentifiable sample sources may also represent pre- or post-treated composite samples of produced water from single or multiple wells. Historical sample descriptions provide further insight about the site type associated with some of the samples. Twenty-four sites, including 21 wells, are classified as "injectate" based on the sample description combined with the designated well use at the time of sample collection (WD, water disposal or WF, water flood). Historical samples associated with these sites may represent water that originated from sources other than the wells from which they were collected. For example, samples collected from two wells (Dataset IDs 86 and 98) include as part of their description “blended and treated produced water from across the field”. Historical samples described as formation water (45 samples), including 38 wells with a well type designation of OG (oil/gas), are probably produced water, representing a mixture of formation water and water injected for enhanced recovery. A possible exception may be samples collected from OG wells prior to the onset of production. Historical samples from four wells, including three with a sample description of "formation water", were from wells identified as water source wells, which access groundwater for use in the production of oil. The numerical water chemistry data were compiled by the U.S. Geological Survey (USGS) from scanned laboratory analysis reports available from the California Geologic Energy Management Division (CalGEM). Sample site characteristics, such as well construction details, were attributed using a combination of information provided with the scanned laboratory analysis reports and well history files from CalGEM Well Finder. The compiled data are divided into two separate data files described as follows: 1) a summary data file identifying each site by name, the site location, basic construction information, and American Petroleum Institute (API) number (for wells), the number of chemistry samples, period of record, sample description, and the geologic formation associated with the origin of the sampled water, or the intended destination (formation into which water was intended to be injected, for samples labeled as injectate) of the sample; and 2) a data file of geochemistry analyses for selected water-quality indicators, major and minor ions, nutrients, and trace elements, parameter code and (or) method, reporting level, reporting level type, and supplemental notes. A data dictionary was created to describe the geochemistry data file and is provided with this data release.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General
The data source shallowgroundwater is a geospatial dataset of multipolygons that represent the estimated areas, in the Flemish region of Belgium, where the mean lowest groundwater level (MLW; Knotters & Van Walsum, 1997; Van Heesen, 1970) is less than approximately 2 m below soil surface (hence, “shallow” groundwater). We expect groundwater dependent species and communities to be present within these areas. Outside these areas we assume they are groundwater independent. We combined several data sources in order to estimate these areas.
Compilation of the data source
We compiled the dataset through an iterative process of adding specific data sources (referred to in the description of attributes below), followed by validation steps based on both the actual presence of groundwater dependent habitat types or regionally important biotopes (Natura 2000 habitat map) and in situ measurements of groundwater levels (Watina+ database). The coverage of these validation data by shallowgroundwater was 96.9% and 98.6%, respectively.
Most steps to compile the data source were done manually using QGIS. Final steps were done in R; see R-code in the GitHub repository 'n2khab-preprocessing' at commit 1b004e1.
Detailed properties
The data source is a GeoPackage with a single spatial multipolygon layer shallowgroundwater in the ‘Belge 1972 / Belgian Lambert 72’ coordinate reference system (EPSG-code 31370).
All attributes are boolean (true/false), each indicating whether a polygon was selected from the corresponding data source by applying a set of criteria. Multiple attributes can be true for a given polygon. In order to reduce file size, polygons were dissolved by each unique combination of the values of all attributes. Hence the dataset consists of multipolygons (i.e. multipart polygons) rather than single-part polygons. The different attributes of this dataset reveal for each polygon (1) the data source(s) we relied on and (2) the selection criteria we applied to judge whether the mean lowest groundwater level is less than approximately 2 m below the soil surface. As far as possible, we reference each used data source in the description of attributes below. If one is interested in the original polygons of each data source, selections can be made by consulting the referenced data sources and applying the specified criteria.
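For users who prefer a scripted workflow, here is a minimal sketch of reading the layer and selecting polygons by attribute, assuming the geopandas package; the published workflow uses QGIS and R, so this Python route is only an illustration.

    import geopandas as gpd

    # Read the single multipolygon layer from the GeoPackage
    gdf = gpd.read_file("shallowgroundwater.gpkg", layer="shallowgroundwater")
    print(gdf.crs)  # expected: EPSG:31370 (Belge 1972 / Belgian Lambert 72)

    # All attributes are boolean flags (possibly stored as 0/1), so cast before masking.
    # Example: every multipolygon included because of its drainage class or a peaty texture.
    selection = gdf[gdf["drainage"].astype(bool) | gdf["peat_texture"].astype(bool)]

    # Combined area of the selection in hectares (the CRS is in metres)
    print(selection.geometry.area.sum() / 1e4)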
These are the attributes:
geomorph_wcoast:
source: Cosyns et al. (2019)
description: polygon belonging to geomorphological entities that are expected to harbour groundwater dependent types, and thus to exhibit shallow groundwater levels. It mainly concerns dune slacks, mud flats and salt marshes.
selection: "Code" IN ('ms', 'msl', 'sm', 'ss', 'ys', 'ysl', 'yfs') OR ("Code" = 't' and "Subtype" like '%vlakte%') with ms = medium old dune slack, msl = leveled medium old dune slack, sm = mud flat, ss = salt marsh, ys = young dune slack, ysl = leveled young dune slack, yfs = young frontal dune (intruded), ("Code" = 't' AND "Subtype" like '%vlakte%') = fossil beach
anthrop_gwdep:
source: soilmap_simple (10.5281/zenodo.3732903) as derived from the digital soil map of the Flemish Region (version soilmap_2017-06-20; 10.5281/zenodo.3387008); habitatmap_terr (10.5281/zenodo.3468948) as derived from the Natura 2000 habitat map of Flanders (10.5281/zenodo.3354381)
description: zones located within a 100 m buffer around (almost) everywhere groundwater dependent habitat types (or regionally important biotopes) and situated within zones classified as “anthropogenic” areas within the soil map. Within the zones of the soil map that are designated as “anthropogenic”, we lack information on soil characteristics. However, (almost) everywhere groundwater dependent types are present in these zones according to the Natura 2000 habitat map of Flanders, implying shallow groundwater levels. By including 100 m buffer zones around these types, restricted to the anthropogenic zones of the soil map, we consider the combined areas to have (potentially) shallow groundwater levels. So in practice, we first select the habitatmap polygons with (almost) everywhere groundwater dependent types that intersect the anthropogenic soil polygons, we buffer them, and then clip the result by the anthropogenic soil polygons (a minimal sketch of this select-buffer-clip sequence follows the selection criteria below).
selection: see https://github.com/inbo/n2khab-preprocessing/pull/61 for the adopted workflow
from soilmap_simple: bsm_mo_soilunitype starts with 'O'
from habitatmap_terr: (almost) everywhere groundwater dependent types only; list of this category of types is available through n2khab R-package.
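A minimal sketch of the select-buffer-clip sequence described for this attribute, assuming geopandas, illustrative file names, and a hypothetical pre-computed boolean flag gw_dependent marking (almost) everywhere groundwater dependent types; the authoritative implementation is the R workflow referenced above.

    import geopandas as gpd

    # Illustrative file names, not the actual distribution formats
    soil = gpd.read_file("soilmap_simple.gpkg")
    habitat = gpd.read_file("habitatmap_terr.gpkg")

    # Anthropogenic soil polygons: soil unit type starting with 'O'
    anthrop = soil[soil["bsm_mo_soilunitype"].str.startswith("O", na=False)]

    # Habitat polygons with (almost) everywhere groundwater dependent types;
    # 'gw_dependent' is an assumed pre-computed flag, not a real column name
    gw_habitat = habitat[habitat["gw_dependent"]]

    # 1) keep habitat polygons intersecting the anthropogenic zones,
    # 2) buffer them by 100 m,
    # 3) clip the buffers back to the anthropogenic zones
    intersecting = gpd.sjoin(gw_habitat, anthrop[["geometry"]], predicate="intersects")
    buffered = gpd.GeoDataFrame(geometry=intersecting.geometry.buffer(100))
    anthrop_gwdep = gpd.clip(buffered, anthrop)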
narrowanthrop_gwdep:
source: soilmap_simple (10.5281/zenodo.3732903) as derived from the digital soil map of the Flemish Region (version soilmap_2017-06-20; 10.5281/zenodo.3387008); habitatmap_terr (10.5281/zenodo.3468948) as derived from the Natura 2000 habitat map of Flanders (10.5281/zenodo.3354381)
description: narrow zones classified as “anthropogenic” areas within the soil map that include (almost) everywhere groundwater dependent habitat types (or regionally important biotopes). Regarding the anthropogenic soil type polygons, it appears that the narrow ones containing (almost) everywhere groundwater dependent types are worth including as a whole as zones with supposedly shallow groundwater levels. Hence we select them in full instead of selecting buffers around the polygons with (almost) everywhere groundwater dependent types (cf. anthrop_gwdep and dunes_gwdep). An appropriate algorithm selects meaningful polygons, based on a “thinness” criterion and the fraction of (almost) everywhere groundwater dependent types present within the polygons.
selection: see https://github.com/inbo/n2khab-preprocessing/pull/61
drainage:
source: soilmap_simple (10.5281/zenodo.3732903) as derived from the digital soil map of the Flemish Region (version soilmap_2017-06-20; 10.5281/zenodo.3387008)
description: Drainage classification is based on a combination of groundwater depth, soil permeability, presence of impermeable layers, soil depth and topography (see Van Ranst & Sys, 2000).
selection: bsm_mo_drain in ('c-d', 'd', 'e', 'f', 'g', 'h', 'i', 'e-f', 'e-i', 'h-i'); these are soils that are at least moderately gleyic or wet.
dunes_gwdep:
source: soilmap_simple (10.5281/zenodo.3732903) as derived from the digital soil map of the Flemish Region (version soilmap_2017-06-20; 10.5281/zenodo.3387008); habitatmap_terr (10.5281/zenodo.3468948) as derived from the Natura 2000 habitat map of Flanders (10.5281/zenodo.3354381)
description: zones located within a 100 m buffer around (almost) everywhere groundwater dependent habitat types (or regionally important biotopes) and situated within zones classified as “dunes” areas within the soil map. Within the zones of the Belgian soil map that are designated as “dunes”, we lack information on soil characteristics. However, (almost) everywhere groundwater dependent types are present in these zones according to the Natura 2000 habitat map of Flanders, implying shallow groundwater levels. By including 100 m buffer zones around these types, restricted to the “dunes” of the soil map, we consider the combined areas to have shallow groundwater levels. So in practice, we first select the habitatmap polygons with (almost) everywhere groundwater dependent types that intersect the “dunes” polygons, we buffer them, and then clip the result by the “dunes” polygons.
selection: see https://github.com/inbo/n2khab-preprocessing/pull/61
from soilmap_simple: bsm_mo_soilunitype = 'X'
from habitatmap_terr: (almost) everywhere groundwater dependent types only; list of this category of types available through n2khab R-package.
peat_profile:
source: soilmap_simple (10.5281/zenodo.3732903) as derived from the digital soil map of the Flemish Region (version soilmap_2017-06-20; 10.5281/zenodo.3387008)
description: variant of the soil profile indicates a superficial peaty cover, mostly on gleyic or permanently water saturated soil with or without profile development (‘(v)’), possibly combined with strong anthropogenic influence (‘(o)’)
selection: bsm_mo_profvar in ('(o)(v)', '(v)')
peat_substr:
source: soilmap_simple (10.5281/zenodo.3732903) as derived from the digital soil map of the Flemish Region (version soilmap_2017-06-20; 10.5281/zenodo.3387008)
description: soil substrate (layer underlying superficial layer, and lithologically diverging from it) consists of peat material starting at small (less than 75 cm; ‘v’) or moderate depths (75-125 cm; ‘(v)’), or a combination of the previous (‘v-’)
selection: bsm_mo_substr in ('(v)', 'v', 'v-')
peat_parentmat:
source: soilmap_simple (10.5281/zenodo.3732903) as derived from the digital soil map of the Flemish Region (version soilmap_2017-06-20; 10.5281/zenodo.3387008)
description: parent material contains a mixture of at least 30% of peaty material
selection: bsm_mo_parentmat = 'v'
peat_texture:
source: soilmap_simple (10.5281/zenodo.3732903) as derived from the digital soil map of the Flemish Region (version soilmap_2017-06-20; 10.5281/zenodo.3387008)
description: soil consists of plain peat material
selection: bsm_mo_tex in ('V-E', 'V')
phys_system:
source: Fysische systeemkaart - Gegeneraliseerde bodemkaart voor Vlaanderen (physical system map, a generalised soil map for Flanders; available at geopunt.be). Lhermitte & Honnay (1994).
description: polygons designated as seepage areas where groundwater is supposed to gather after having infiltrated elsewhere (infiltration areas) and
NOTE: A more current version of the Protected Areas Database of the United States (PAD-US) is available: PAD-US 2.0 https://doi.org/10.5066/P955KPLE. The USGS Protected Areas Database of the United States (PAD-US) is the nation's inventory of protected areas, including public open space and voluntarily provided, private protected areas, identified as an A-16 National Geospatial Data Asset in the Cadastral Theme (http://www.fgdc.gov/ngda-reports/NGDA_Datasets.html). PAD-US is an ongoing project with several published versions of a spatial database of areas dedicated to the preservation of biological diversity, and other natural, recreational or cultural uses, managed for these purposes through legal or other effective means. The geodatabase maps and describes public open space and other protected areas. Most areas are public lands owned in fee; however, long-term easements, leases, and agreements or administrative designations documented in agency management plans may be included. The PAD-US database strives to be a complete “best available” inventory of protected areas (lands and waters) including data provided by managing agencies and organizations. The dataset is built in collaboration with several partners and data providers (http://gapanalysis.usgs.gov/padus/stewards/). See Supplemental Information Section of this metadata record for more information on partnerships and links to major partner organizations. As this dataset is a compilation of many data sets; data completeness, accuracy, and scale may vary. Federal and state data are generally complete, while local government and private protected area coverage is about 50% complete, and depends on data management capacity in the state. For completeness estimates by state: http://www.protectedlands.net/partners. As the federal and state data are reasonably complete; focus is shifting to completing the inventory of local gov and voluntarily provided, private protected areas. The PAD-US geodatabase contains over twenty-five attributes and four feature classes to support data management, queries, web mapping services and analyses: Marine Protected Areas (MPA), Fee, Easements and Combined. The data contained in the MPA Feature class are provided directly by the National Oceanic and Atmospheric Administration (NOAA) Marine Protected Areas Center (MPA, http://marineprotectedareas.noaa.gov ) tracking the National Marine Protected Areas System. The Easements feature class contains data provided directly from the National Conservation Easement Database (NCED, http://conservationeasement.us ) The MPA and Easement feature classes contain some attributes unique to the sole source databases tracking them (e.g. Easement Holder Name from NCED, Protection Level from NOAA MPA Inventory). The "Combined" feature class integrates all fee, easement and MPA features as the best available national inventory of protected areas in the standard PAD-US framework. In addition to geographic boundaries, PAD-US describes the protection mechanism category (e.g. fee, easement, designation, other), owner and managing agency, designation type, unit name, area, public access and state name in a suite of standardized fields. An informative set of references (i.e. Aggregator Source, GIS Source, GIS Source Date) and "local" or source data fields provide a transparent link between standardized PAD-US fields and information from authoritative data sources. 
The areas in PAD-US are also assigned conservation measures that assess management intent to permanently protect biological diversity: the nationally relevant "GAP Status Code" and global "IUCN Category" standard. A wealth of attributes facilitates a wide variety of data analyses and creates a context for data to be used at local, regional, state, national and international scales. More information about specific updates and changes to this PAD-US version can be found in the Data Quality Information section of this metadata record as well as on the PAD-US website, http://gapanalysis.usgs.gov/padus/data/history/.) Due to the completeness and complexity of these data, it is highly recommended to review the Supplemental Information Section of the metadata record as well as the Data Use Constraints, to better understand data partnerships as well as see tips and ideas of appropriate uses of the data and how to parse out the data that you are looking for. For more information regarding the PAD-US dataset please visit, http://gapanalysis.usgs.gov/padus/. To find more data resources as well as view example analysis performed using PAD-US data visit, http://gapanalysis.usgs.gov/padus/resources/. The PAD-US dataset and data standard are compiled and maintained by the USGS Gap Analysis Program, http://gapanalysis.usgs.gov/ . For more information about data standards and how the data are aggregated please review the “Standards and Methods Manual for PAD-US,” http://gapanalysis.usgs.gov/padus/data/standards/ .
In order to improve the capacity of storage, exploration and processing of sensor data, a spatial DBMS was used and the Aquopts system was implemented.
In field surveys using different sensors on the aquatic environment, the existence of spatial attributes in the dataset is common, motivating the adoption of PostgreSQL and its spatial extension PostGIS. To enable the insertion of new data sets as well as new devices and sensing equipment, the database was modeled to support updates and provide structures for storing all the data collected in the field campaigns in conjunction with other possible future data sources. The database model provides resources to manage spatial and temporal data and allows flexibility to select and filter the dataset.
The data model ensures the storage integrity of the information related to the samplings performed during field surveys, in an architecture that supports the organization and management of the data. However, in addition to the storage specified in the data model, several procedures need to be applied to the data to prepare it for analysis. Some validations are important to identify spurious data, which can itself be an important source of information about data quality. Other corrections are essential to adjust the data and eliminate undesirable effects. Further quantities can be derived by combining attributes through known equations. In general, the processing steps comprise a cycle of operations that are directly related to the characteristics of the data set. Considering the sensor data stored in the database, an interactive prototype system, named Aquopts, was developed to perform the necessary standardization and basic corrections and to produce data ready for analysis, following correction methods known in the literature.
The system provides resources for the analyst to automate the process of reading, inserting, integrating, interpolating, correcting, and performing other calculations that are repeated every time field campaign data are exported and new data sets are produced. All operations and processing required for data integration and correction were implemented in PHP and Python and are available through a Web interface, which can be accessed from any computer connected to the internet. The data can be accessed online (http://sertie.fct.unesp.br/aquopts), but the resources are restricted by registration and per-user permissions. After a user is identified, the system evaluates their access permissions and makes the options for inserting new datasets available.
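As an illustration of how a sensor reading with a spatial attribute might be stored in such a PostgreSQL/PostGIS schema, here is a minimal sketch using psycopg2 and a hypothetical sample_point table; the actual Aquopts schema, table, and column names are assumptions and can be checked against the repository linked below.

    import psycopg2

    # Connection parameters are placeholders
    conn = psycopg2.connect("dbname=aquopts user=analyst password=secret host=localhost")
    with conn, conn.cursor() as cur:
        # Hypothetical table: campaign, measured_at, depth_m, turbidity_ntu, geom (POINT, EPSG:4326)
        cur.execute(
            """
            INSERT INTO sample_point (campaign, measured_at, depth_m, turbidity_ntu, geom)
            VALUES (%s, %s, %s, %s, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
            """,
            ("field survey 2016-03", "2016-03-15T10:42:00", 1.5, 12.3, -48.55, -22.49),
        )
    conn.close()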
The source code of the entire Aquopts system is available at: https://github.com/carmoafc/aquopts
The system and additional results are described in the accompanying paper (under review).
https://creativecommons.org/publicdomain/zero/1.0/
By fanqiwan (From Huggingface) [source]
The Mixture of Conversations Dataset is a collection of conversations gathered from various sources. Each conversation is represented as a list of messages, where each message is a string. This dataset provides a valuable resource for studying and analyzing conversations in different contexts.
The conversations in this dataset are diverse, covering a wide range of topics and scenarios. They include casual chats between friends, customer support interactions, online forum discussions, and more. The dataset aims to capture the natural flow of conversation and includes both structured and unstructured dialogues.
Each conversation entry in the dataset is associated with metadata information such as the name or identifier of the model that generated it and the corresponding dataset it belongs to. This information helps to keep track of the source and origin of each conversation.
The train.csv file provided in this dataset specifically serves as training data for various machine learning models. It contains an assortment of conversations that can be used to train chatbot systems, dialogue generation models, sentiment analysis algorithms, or any other conversational AI application.
Researchers, practitioners, developers, and enthusiasts can leverage this Mixture of Conversations Dataset to analyze patterns in human communication, explore language understanding capabilities, test dialogue strategies, or develop novel AI-powered conversational systems. Its versatility makes it useful for various NLP tasks such as text classification, intent recognition, sentiment analysis, and language modeling.
By exploring this rich collection of conversational data points across different domains and platforms, you can gain valuable insights into how people communicate using textual input. The breadth and depth of this extensive dataset provide ample opportunities for studies related to language understanding, recommendation systems, and other research areas involving human-computer interaction.
Overview of the Dataset
The dataset consists of conversational data represented as a list of messages. Each conversation is represented as a list of strings, where each string corresponds to a message in the conversation. The dataset also includes information about the model that generated the conversations and the name or identifier of the dataset itself.
Accessing the Dataset
Understanding Column Information
This dataset has several columns (a minimal loading sketch follows the list):
- conversations: A list representing each conversation; each conversation is further represented as a list containing individual messages.
- dataset: The name or identifier of the dataset that these conversations belong to.
- model: The name or identifier of the model that generated these conversations.
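Below is a minimal loading sketch, assuming pandas and that the conversations column is stored as stringified Python lists in train.csv; if it is JSON-encoded instead, json.loads would replace ast.literal_eval.

    import ast
    import pandas as pd

    df = pd.read_csv("train.csv")

    # Parse each conversation from its string representation into a list of messages
    df["conversations"] = df["conversations"].apply(ast.literal_eval)

    # Basic inspection: number of conversations per source dataset and model,
    # and the first two messages of the first conversation
    print(df.groupby(["dataset", "model"]).size())
    print(df.loc[0, "conversations"][:2])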
Utilizing Conversations
To make use of this dataset, consider the following potential applications:
- Chatbot Training: This dataset can be used to train chatbot models by providing a diverse range of conversations for the model to learn from. The conversations can cover various topics and scenarios, helping the chatbot to generate more accurate and relevant responses.
- Customer Support Training: The dataset can be used to train customer support models to handle different types of customer queries and provide appropriate solutions or responses. By exposing the model to a variety of conversation patterns, it can learn how to effectively address customer concerns.
- Conversation Analysis: Researchers or linguists may use this dataset for analyzing conversational patterns, language usage, or studying social interactions within conversations. The dataset's mixture of conversations from different sources can provide valuable insights into how people communicate in different settings or domains
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv | Column name | Description ...
The peer-reviewed publication for this dataset has been presented in the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite this when using the dataset. This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim. The claims have been obtained from online fact-checking sources, existing datasets and research challenges. It combines different data sources with different foci, thus enabling a comprehensive approach that combines different media (Twitter, Facebook, general websites, academia), information domains (health, scholar, media), information types (news, claims) and applications (information retrieval, veracity evaluation). The processing of the claims included an extensive de-duplication process eliminating repeated or very similar claims. The dataset is presented in a LARGE and a SMALL version, accounting for different degrees of similarity between the remaining claims (excluding respectively claims with a 90% and 99% probability of being similar, as obtained through the MonoT5 model). The similarity of claims was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019). The processing of the content also involved removing claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English. The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis. The following types were identified: (1) Multimodal; (2) Social media references; (3) Claims including questions; (4) Claims including numerical content; (5) Named entities, including: PERSON − People, including fictional; ORGANIZATION − Companies, agencies, institutions, etc.; GPE − Countries, cities, states; FACILITY − Buildings, highways, etc. These entities have been detected using a RoBERTa base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013) using Spacy. The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels. The data sources used are: - The CoronaVirusFacts/DatosCoronaVirus Alliance Database. https://www.poynter.org/ifcn-covid-19-misinformation/ - CoAID dataset (Cui and Lee, 2020) https://github.com/cuilimeng/CoAID - MM-COVID (Li et al., 2020) https://github.com/bigheiniu/MM-COVID - CovidLies (Hossain et al., 2020) https://github.com/ucinlp/covid19-data - TREC Health Misinformation track https://trec-health-misinfo.github.io/ - TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) https://ir.nist.gov/covidSubmit/data.html The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True). The entries in the dataset contain the following information: - Claim. Text of the claim. - Claim label. The labels are: False, and True. - Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institutions sites, and peer-reviewed scientific journals. 
- Original information source. Information about which general information source was used to obtain the claim.
- Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.
Funding. This work was supported by the UK Engineering and Physical Sciences Research Council (grant no. EP/V048597/1, EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by UK Research and Innovation (grant no. EP/V030302/1, EP/V020579/1).
References
- Arana-Catania M., Kochkina E., Zubiaga A., Liakata M., Procter R., He Y. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. NAACL 2022. https://arxiv.org/abs/2205.02596
- Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109:109.
- Fabio Crestani, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell. 1998. “Is this document relevant? ... Probably”: a survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.
- Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.
- Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pre-trained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Pro...
This dataset contains a collection of known point locations of Cuvier's beaked whales identified through direct human observation via shipborne and aerial surveys. This can be useful for assessing species abundance, population structure, habitat use, and behavior. This collection is aggregated from multiple data sources and survey periods listed below. Each data point contains attributes for further information about the time and source of the observation. This dataset was compiled by the Pacific Islands Ocean Observing System (PacIOOS) and may be updated in the future if additional data sources are acquired. Cascadia Research Collective (CRC) has been undertaking shipborne surveys for odontocetes in Hawaiian waters since 2000. Photo-identification and satellite-tagging indicate a small resident population of Cuvier's beaked whales off of Hawaii Island. Less is known about this species around the other Hawaiian islands. In addition, Dr. Joseph Mobley of the Marine Mammal Research Consultants (MMRC) led aerial surveys for cetaceans in Hawaiian waters from 1993-2003. For further information, please see: http://www.cascadiaresearch.org/hawaiian-cetacean-studies/beaked-whales-hawaii