SFT Format Dataset
Overview
This dataset has been converted to SFT (Supervised Fine-Tuning) format. It was created by transforming the OpenMathInstruct and Stanford Human Preferences (SHP) datasets.
Dataset Structure
Each entry follows this format:
Instruction: [Problem, question, or conversation history]
Response: [Solution, answer, or response]
Usage Guide
Loading the Dataset
from datasets import load_dataset
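Only the import line of the loading example survives in this description, and the repository ID is not given, so no load_dataset call is shown. As a minimal sketch of the Instruction/Response layout described under Dataset Structure, the helpers below build and parse one entry. The function names format_entry and parse_entry, and the single newline between the two parts, are assumptions made for illustration and are not part of the dataset.

```python
def format_entry(instruction: str, response: str) -> str:
    """Build one SFT-style entry (hypothetical helper; layout assumed from the description)."""
    return f"Instruction: {instruction.strip()}\nResponse: {response.strip()}"


def parse_entry(text: str) -> dict:
    """Split an entry back into its two parts at the first 'Response:' line."""
    head, sep, response = text.partition("\nResponse:")
    if not sep or not head.startswith("Instruction:"):
        raise ValueError("entry does not follow the Instruction/Response layout")
    return {
        "instruction": head[len("Instruction:"):].strip(),
        "response": response.strip(),
    }


entry = format_entry("What is 2 + 3?", "2 + 3 = 5")
print(entry)
print(parse_entry(entry))
```

Because parse_entry splits at the first line starting with "Response:", a conversation history that itself contains such a line would need a more robust delimiter.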
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0) https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
This data set contains interpreted polygons describing different sedimentary energy environments of the Long Island Sound mapping project Phase II. This data set is the result of manual interpretation of detailed bathymetry data and resulting seafloor morphology, backscatter data, sediment core analysis results, and interpretation of sub-bottom data. It distinguishes high, low, and moderate energy environments, which can be caused by current and wave action. The outline of polygons was based on manual interpretation mostly following morphological and backscatter boundaries. Interpretation was cross-checked with sediment grab and core information. Polygon outlines are based on morphology and backscatter data that have 1 m pixel resolution, but interpretation could be several pixels (~ +/-10 m) in each direction, since the exact boundary is not always clear. Small pockets of different environments might not have been distinguished. The data is presented here as an ESRI shapefile in UTM-18 N projection. Funding was provided by the Long Island Sound Mapping Fund administered cooperatively by the EPA Long Island Sound Study and the Connecticut Department of Energy and Environmental Protection (DEEP).
Seattle Parks and Recreation GIS Map Layer Shapefile - View Points
Shapefile - This Seattle Parks and Recreation ARCGIS park feature map layer was exported from SPU ARCGIS and converted to a shapefile then manually uploaded to data.seattle.gov via Socrata.
OR
Web Services - Live "read only" data connection ESRI web services URL: http://gisrevprxy.seattle.gov/arcgis/rest/services/DPR_EXT/ParksExternalWebsite/MapServer/54
The original dataset is the train set of stanfordnlp/SHP. We keep only pairs with a score ratio > 2.0 and take at most 5 pairs per prompt.

    def filter_example(example):
        prompt = example['history']
        if example['labels'] == 0:
            ratio = example['score_B'] * 1.0 / example['score_A']
        elif example['labels'] == 1:
            ratio = example['score_A'] * 1.0 / example['score_B']
        if ratio > 2.0:
            return True
        else:
            return False

import itertools from collections… See the full description on the dataset page: https://huggingface.co/datasets/RLHFlow/SHP-standard.
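The rest of the original script is truncated here, so the step that limits each prompt to 5 pairs is not visible. The following is a sketch of how the two stated rules (ratio > 2.0, at most 5 pairs per prompt) could be combined in plain Python. The helper names preference_ratio and cap_pairs_per_prompt, the choice to keep the first five qualifying pairs in input order, and the handling of zero scores or unexpected labels are all assumptions, not the dataset authors' code.

```python
from collections import defaultdict


def preference_ratio(example):
    """Score of the preferred response divided by the other one, or None if undefined."""
    if example["labels"] == 0:      # same branch as the original: ratio = score_B / score_A
        preferred, other = example["score_B"], example["score_A"]
    elif example["labels"] == 1:    # same branch as the original: ratio = score_A / score_B
        preferred, other = example["score_A"], example["score_B"]
    else:
        return None                 # assumption: skip unexpected labels
    if other <= 0:
        return None                 # assumption: skip pairs that would divide by zero
    return preferred / other


def filter_example(example, min_ratio=2.0):
    ratio = preference_ratio(example)
    return ratio is not None and ratio > min_ratio


def cap_pairs_per_prompt(examples, max_pairs=5):
    """Keep at most max_pairs qualifying pairs per prompt (first come, first kept)."""
    counts = defaultdict(int)
    kept = []
    for example in examples:
        if not filter_example(example):
            continue
        prompt = example["history"]
        if counts[prompt] < max_pairs:
            counts[prompt] += 1
            kept.append(example)
    return kept


# Tiny synthetic demo: 7 strong pairs for one prompt, 1 weak pair for another.
demo = [{"history": "p1", "labels": 1, "score_A": 30, "score_B": 10} for _ in range(7)]
demo.append({"history": "p2", "labels": 0, "score_A": 10, "score_B": 15})
print(len(cap_pairs_per_prompt(demo)))
```

With the Hugging Face datasets library, the same filter_example could presumably be passed to Dataset.filter; that call is omitted because the original continuation is not shown.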
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Shapefile for 492 Coastal Zone Management Program (CZMP) counties and county equivalents, 2009, extracted from the U.S. Census Bureau's MAF/TIGER database of U.S. counties and cross-referenced to a list of CZMP counties published by the NOAA/NOS Office of Ocean and Coastal Resource Management (OCRM). Data extent to the nearest quarter degree is 141.00 E to 64.50 W longitude and 14.75 S to 71.50 N latitude. TL2009 in this document refers to metadata content inherited from the original U.S. Census Bureau (2009) TIGER/Line shapefile. TL2009: The TIGER/Line Shapefiles are an extract of selected geographic and cartographic information from the Census MAF/TIGER database. The Census MAF/TIGER database represents a seamless national file with no overlaps or gaps between parts. However, each TIGER/Line Shapefile is designed to stand alone as an independent dataset or the shapefiles can be combined to cover the whole nation.
https://borealisdata.ca/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.5683/SP3/2AFGSW
The UNI-CEN Digital Boundary File Series facilitates the mapping of UNI-CEN census data tables. Boundaries are provided in multiple formats for different use cases: Esri Shapefile (SHP), geoJson, and File Geodatabase (FGDB). SHP and FGDB files are provided in two projections: NAD83 CSRS for print cartography and WGS84 for web applications. The geoJson version is provided in WGS84 only. The UNI-CEN Standardized Census Data Tables are readily merged to these boundary files. For more information about file sources, the methods used to create them, and how to use them, consult the documentation at https://borealisdata.ca/dataverse/unicen_docs. For more information about the project, visit https://observatory.uwo.ca/unicen.
This layer is a high-resolution tree canopy change-detection layer for Baltimore City, MD. It contains three tree-canopy classes for the period 2007-2015: (1) No Change; (2) Gain; and (3) Loss. It was created by extracting tree canopy from existing high-resolution land-cover maps for 2007 and 2015 and then comparing the mapped trees directly. Tree canopy that existed during both time periods was assigned to the No Change category, while trees removed by development, storms, or disease were assigned to the Loss class. Trees planted during the interval were assigned to the Gain category, as were the edges of existing trees that expanded noticeably. Direct comparison was possible because both the 2007 and 2015 maps were created using object-based image analysis (OBIA) and included similar source datasets (LiDAR-derived surface models, multispectral imagery, and thematic GIS inputs). OBIA systems work by grouping pixels into meaningful objects based on their spectral and spatial properties, while taking into account boundaries imposed by existing vector datasets. Within the OBIA environment, a rule-based expert system was designed to mimic the process of manual image analysis by incorporating the elements of image interpretation (color/tone, texture, pattern, location, size, and shape) into the classification process. A series of morphological procedures was employed to ensure that the end product is both accurate and cartographically pleasing. No accuracy assessment was conducted, but the dataset will be subjected to manual review and correction. 2006 and 2014 LiDAR data were also used to assist in detecting tree canopy change.
This dataset contains the White Mountain National Forest Boundary. The boundary was extracted from the National Forest boundaries coverage for the lower 48 states, including Puerto Rico developed by the USDA Forest Service - Geospatial Service and Technology Center. The coverage was projected from decimal degrees to UTM zone 19. This dataset includes administrative unit boundaries, derived primarily from the GSTC SOC data system, comprised of Cartographic Feature Files (CFFs), using ESRI Spatial Data Engine (SDE) and an Oracle database. The data that was available in SOC was extracted on November 10, 1999. Some of the data that had been entered into SOC was outdated, and some national forest boundaries had never been entered for a variety of reasons. The USDA Forest Service, Geospatial Service and Technology Center has edited this data in places where it was questionable or missing, to match the National Forest Inventoried Roadless Area data submitted for the President's Roadless Area Initiative. Data distributed as shapefile in Coordinate system EPSG:26919 - NAD83 / UTM zone 19N.
This map and corresponding dataset provide the location, satellite images and square footage of existing green roofs within the City of Chicago. This dataset is in ESRI shapefile format. To view or use these files, compression software and special GIS software, such as ESRI ArcGIS, is required. This information is derived from an analysis of high-spatial resolution (50cm), pan-sharpened, ortho-rectified, 8-band multi-spectral satellite images collected by Digital Globe’s Worldview-2 satellite. The City supplied the consultant with a 2009 City boundary shapefile to determine the required extent of the imagery. Acquisition of three different strips of imagery corresponding to the satellite’s paths was required. These strips of imagery spanned three consecutive months and were collected in August 2010 (90% coverage), September 2010 (5% coverage) and October 2010 (5% coverage). The results of the analysis include overall count of vegetated roofs, their total square footage, and the ratio of required to elective vegetated roofs. A total of 359 vegetated roofs were identified within the City of Chicago. The total square footage of these vegetated roofs was calculated to be approximately 5,469,463 square feet. The ratio of required vegetated roofs to elective vegetative roofs was 297:62 (~5:1). The median size of the vegetated roofs was calculated to be 5,234 square feet.
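The reported figures can be cross-checked with a few lines of arithmetic. The sketch below only recombines numbers stated above; the mean roof size it prints is a derived value that the source does not report, and the variable names are illustrative.

```python
# Figures as stated in the description above.
required_roofs = 297
elective_roofs = 62
total_area_sqft = 5_469_463
median_area_sqft = 5_234

# The two categories should add up to the reported total of 359 roofs.
total_roofs = required_roofs + elective_roofs

# Required-to-elective ratio, reported as roughly 5:1.
ratio = required_roofs / elective_roofs

# Derived, not reported in the source: mean roof size from total area and count.
mean_area_sqft = total_area_sqft / total_roofs

print(f"total roofs: {total_roofs}")
print(f"required:elective ratio: {ratio:.2f}:1")
print(f"mean roof size: {mean_area_sqft:,.0f} sq ft (median reported: {median_area_sqft:,})")
```

A mean well above the reported median would suggest that a small number of very large vegetated roofs account for much of the total area.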
A GIS polygon shapefile outlining the extent of the 14 individual DEM sections that comprise the seamless, 2-meter resolution DEM for the open-coast region of the San Francisco Bay Area (outside of the Golden Gate Bridge), extending from Half Moon Bay to Bodega Head along the north-central California coastline. The goal was to integrate the most recent high-resolution bathymetric and topographic datasets available (for example, Light Detection and Ranging (lidar) topography, multibeam and single-beam sonar bathymetry) into a seamless surface model extending offshore at least 3 nautical miles and inland beyond the +20 meter elevation contour.
This dataset contains a point shapefile with benthic habitat classifications of vertical relief, geomorphological structure, substrate, and biological cover for selected points along various Remotely Operated Vehicle (ROV) underwater video transects in the US Virgin Islands and Puerto Rico. NOAA's NOS/NCCOS/CCMA Biogeography Team, in collaboration with NOAA vessel Nancy Foster and territory, fe...
The map is designed to be used as a basemap by marine GIS professionals and as a reference map by anyone interested in ocean data. The basemap focuses on bathymetry. It also includes inland waters and roads, overlaid on land cover and shaded relief imagery.
TRCA GIS Open data on ArcGIS online. This link will take you to an external site URL: https://trca-camaps.opendata.arcgis.com/
This resource is a member of a series. The TIGER/Line shapefiles and related database files (.dbf) are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line shapefile is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. The All Roads Shapefile includes all features within the MTDB Super Class "Road/Path Features", identified by a MAF/TIGER Feature Classification Code (MTFCC) that begins with "S". This includes all primary, secondary, local neighborhood, and rural roads, city streets, vehicular trails (4wd), ramps, service drives, alleys, parking lot roads, private roads for service vehicles (logging, oil fields, ranches, etc.), bike paths or trails, bridle/horse paths, walkways/pedestrian trails, and stairways.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Geographical table containing the contours of all the PPRi of the Loiret.
Attribution-NoDerivs 3.0 (CC BY-ND 3.0) https://creativecommons.org/licenses/by-nd/3.0/
License information was derived automatically
The seamless, county-wide parcel layer was digitized from official Assessor Parcel (AP) Maps, which were originally maintained on mylar sheets and/or as individual Computer Aided Design (CAD) drawing files (e.g., DWG). The Clerk-Recorder-Assessor (CRA) office continues to maintain the official AP Maps in CAD drawings, and Information Systems Department/Geographic Information Systems (ISD/GIS) staff apply updates from these maps to the seamless parcel base in the County’s Enterprise GIS. This layer is a partial view of the Information Sales System (ISS) extract, a report of property characteristics taken from the County’s Megabyte Property Tax System (MPTS). This layer may be missing some attributes (e.g., Owner Name) which may not be published to the Internet due to privacy conditions under the California Public Records Act (CPRA). Please contact the CRA office at (707) 565-1888 for information on availability, associated fees, and access to other versions of Sonoma County parcels containing additional property characteristics.
The seamless parcel layer is updated and published to the Internet on a monthly basis.
The seamless parcel layer was developed from the source data using the general methodology outlined below. The mylar sheets were scanned and saved to a standard image file format (e.g., TIFF). The individual scanned maps or CAD drawing files were imported into GIS software and geo-referenced to their corresponding real-world locations using high-resolution orthophotography as control. The standard approach was to rescale and rotate the scanned drawing (or CAD file) to match the general location on the orthophotograph. Then, appropriate control points were selected to register and rectify features on the scanned map (or CAD drawing file) to the orthophotography. In the process, features in the scanned map (or CAD drawing file) were transformed to real-world coordinates, and line features were created using “heads-up digitizing” and stored in new GIS feature classes. Recommended industry best practices were followed to minimize root mean square (RMS) error in the transformation of the data, and to ensure the integrity of the overall pattern of each AP map relative to neighboring pages. Where available, Coordinate Geometry (COGO) and survey data, tied to global positioning system (GPS) coordinates, were also referenced and input to improve the fit and absolute location of each page. The vector lines were then assembled into polygon features, with each polygon being assigned a unique identifier, the Assessor Parcel Number (APN). The APN field in the parcel table was joined to the corresponding APN field in the assessor property characteristics table extracted from the MPTS database to create the final parcel layer.
The result is a seamless parcel land base, each parcel polygon coded with a unique APN, assembled from approximately 6,000 individual map pages of varying scale and accuracy, but ensuring the correct topology of each feature within the whole (i.e., no gaps or overlaps). The accuracy and quality of the parcels vary depending on the source. See the RANK and DESCRIPTION fields below for information on the fit assessment for each source page. These data should be used only for general reference and planning purposes. It is important to note that while these data were generated from authoritative public records, and checked for quality assurance, they do not provide survey-quality spatial accuracy and should NOT be used to interpret the true location of individual property boundary lines. Please contact the Sonoma County CRA and/or a licensed land surveyor before making a business decision that involves official boundary descriptions.
This GIS overlay is a component of the U.S. Geological Survey, Woods Hole Science Center's, Gulf of Mexico GIS database. The Gulf of Mexico GIS database is intended to organize and display USGS held data and provide on-line (WWW) access to the data and/or metadata.
A shapefile of 311 undersea features from all major oceans and seas has been created as an aid for retrieving georeferenced information resources. The geographic extent of the shapefile is 0 degrees E to 0 degrees W longitude and 75 degrees S to 90 degrees N latitude. Many of the undersea features (UF) in the shapefile were selected from a list assembled by Weatherall and Cramer (2008) in a report from the British Oceanographic Data Centre (BODC) to the General Bathymetric Chart of the Oceans (GEBCO) Sub-Committee on Undersea Feature Names (SCUFN). Annex II of the Weatherall and Cramer report (p. 20-22) lists 183 undersea features that "may need additional points to define their shape" and includes online links to additional BODC documents providing coordinate pairs sufficient to define detailed linestrings for these features. For the first phase of the U.S. Geological Survey (USGS) project, Wingfield created polygons for 87 of the undersea features on the BODC list, using the linestrings as guides; the selected features were primarily ridges, rises, trenches, fracture zones, basins, and seamount chains. In the second phase of the USGS project, Wingfield and Hartwell created polygons for an additional 224 undersea features, mostly basins, abyssal plains, and fracture zones. Because USGS is a Federal agency, the attribute tables follow the conventions of the National Geospatial-Intelligence Agency (NGA) GEOnet Names Server (http://earth-info.nga.mil/gns/html).
We present a flora and fauna dataset for the Mira-Mataje binational basins. This is an area shared between southwestern Colombia and northwestern Ecuador, where both the Chocó and Tropical Andes biodiversity hotspots converge. Information from 120 sources was systematized in the Darwin Core Archive (DwC-A) standard and geospatial vector data format for geographic information systems (GIS) (shapefiles). Sources included natural history museums, published literature, and citizen science repositories across 18 countries. The resulting database has 33,460 records from 5,281 species, of which 1,083 are endemic and 680 threatened. The diversity represented in the dataset is equivalent to 10% of the total plant species and 26% of the total terrestrial vertebrate species in the hotspots. It corresponds to 0.07% of their total area. The dataset can be used to estimate and compare biodiversity patterns with environmental parameters and provide value to ecosystems, ecoregions, and protected areas. The dataset is a baseline for future assessments of biodiversity in the face of environmental degradation, climate change, and accelerated extinction processes. The data has been formally presented in the manuscript entitled "The Tropical Andes Biodiversity Hotspot: A Comprehensive Dataset for the Mira-Mataje Binational Basins" in the journal "Scientific Data". To maintain DOI integrity, this version will not change after publication of the manuscript and therefore we cannot provide further references on volume, issue, and DOI of manuscript publication.
- Data format 1: The .rds file extension saves a single object to be read in R and provides better compression, serialization, and integration within the R environment than simple .csv files. The description of file names is in the original manuscript.
-- m_m_flora_2021_voucher_ecuador.rds
-- m_m_flora_2021_observation_ecuador.rds
-- m_m_flora_2021_total_ecuador.rds
-- m_m_fauna_2021_ecuador.rds
- Data format 2: The .csv file has been encoded in UTF-8, and is an ASCII file with text separated by commas. The description of file names is in the original manuscript.
-- m_m_flora_fauna_2021_all.zip. This file includes all biodiversity datasets.
-- m_m_flora_2021_voucher_ecuador.csv
-- m_m_flora_2021_observation_ecuador.csv
-- m_m_flora_2021_total_ecuador.csv
-- m_m_fauna_2021_ecuador.csv
- Data format 3: We consolidated a shapefile for the basin containing layers for vegetation ecosystems and the total number of occurrences, species, and endemic and threatened species for each ecosystem.
-- biodiversity_measures_mira_mataje.zip. This file includes the .shp file and accessory geomatic files.
- A set of 3D shaded-relief map representations of the data in the shapefile can be found at https://doi.org/10.6084/m9.figshare.23499180.v4
Three taxonomic data tables were used in our technical validation of the presented dataset. These three files are:
1) the_catalog_of_life.tsv (Source: Bánki, O. et al. Catalogue of life checklist (version 2024-03-26). https://doi.org/10.48580/dfz8d (2024))
2) world_checklist_of_vascular_plants_names.csv (we are also including ancillary tables "world_checklist_of_vascular_plants_distribution.csv", and "README_world_checklist_of_vascular_plants_.xlsx") (Source: Govaerts, R., Lughadha, E. N., Black, N., Turner, R. & Paton, A. The World Checklist of Vascular Plants is a continuously updated resource for exploring global plant diversity. Sci. Data 8, 215, 10.1038/s41597-021-00997-6 (2021).)
3) world_flora_online.csv (Source: The World Flora Online Consortium et al. World flora online plant list December 2023, 10.5281/zenodo.10425161 (2023).)