100+ datasets found
  1. Data from: Rare Feature Selection in High Dimensions

    • tandf.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Xiaohan Yan; Jacob Bien (2023). Rare Feature Selection in High Dimensions [Dataset]. http://doi.org/10.6084/m9.figshare.12851331.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Xiaohan Yan; Jacob Bien
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such “rare features” has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers. Supplementary materials for this article are available online.

  2. Great Basin Montane Watersheds - Streams (Feature Layer)

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +5more
    Updated Apr 21, 2025
    Cite
    U.S. Forest Service (2025). Great Basin Montane Watersheds - Streams (Feature Layer) [Dataset]. https://catalog.data.gov/dataset/great-basin-montane-watersheds-streams-feature-layer
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    U.S. Forest Service
    Area covered
    Great Basin
    Description

    Multiple research and management partners collaboratively developed a multiscale approach for assessing the geomorphic sensitivity of streams and ecological resilience of riparian and meadow ecosystems in upland watersheds of the Great Basin to disturbances and management actions. The approach builds on long-term work by the partners on the responses of these systems to disturbances and management actions. At the core of the assessments is information on past and present watershed and stream channel characteristics, geomorphic and hydrologic processes, and riparian and meadow vegetation. In this report, we describe the approach used to delineate Great Basin mountain ranges and the watersheds within them, and the data that are available for the individual watersheds. We also describe the resulting database and the data sources. Furthermore, we summarize information on the characteristics of the regions and watersheds within the regions and the implications of the assessments for geomorphic sensitivity and ecological resilience. The target audience for this multiscale approach is managers and stakeholders interested in assessing and adaptively managing Great Basin stream systems and riparian and meadow ecosystems. Anyone interested in delineating the mountain ranges and watersheds within the Great Basin or quantifying the characteristics of the watersheds will be interested in this report. For more information, visit: https://www.fs.usda.gov/research/treesearch/61573

  3. Allegheny County Park Features

    • catalog.data.gov
    • data.wprdc.org
    • +2more
    Updated May 14, 2023
    Cite
    Allegheny County (2023). Allegheny County Park Features [Dataset]. https://catalog.data.gov/dataset/allegheny-county-park-features
    Dataset updated
    May 14, 2023
    Dataset provided by
    Allegheny County
    Area covered
    Allegheny County
    Description

    A combination of all park features, events, recreation, and facilities in one layer, including Activenet information.

  4. Features Comparison Data

    • landing-fe-dev.mentionnetwork.xyz
    • mention.network
    Updated 2024
    Cite
    Mention Network (2024). Features Comparison Data [Dataset]. https://landing-fe-dev.mentionnetwork.xyz/
    Dataset updated
    2024
    Dataset authored and provided by
    Mention Network
    License

    http://schema.org/PublicDomain

    Description

    Comparison matrix showing feature availability across different AI visibility platforms

  5. Lakes, Rivers and Glaciers in Canada - CanVec Series - Hydrographic Features...

    • open.canada.ca
    • catalogue.arctic-sdi.org
    • +3more
    fgdb/gdb, html, kmz +2
    Updated May 19, 2023
    Cite
    Natural Resources Canada (2023). Lakes, Rivers and Glaciers in Canada - CanVec Series - Hydrographic Features [Dataset]. https://open.canada.ca/data/en/dataset/9d96e8c9-22fe-4ad2-b5e8-94a6991b744b
    Explore at:
    Available download formats: html, fgdb/gdb, kmz, wms, shp
    Dataset updated
    May 19, 2023
    Dataset provided by
    Ministry of Natural Resources of Canada https://www.nrcan.gc.ca/
    License

    Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Area covered
    Canada
    Description

    The hydrographic features of the CanVec series include watercourses, water linear flow segments, hydrographic obstacles (falls, rapids, etc.), waterbodies (lakes, watercourses, etc.), permanent snow and ice features, water wells and springs. The Hydrographic features theme provides quality vector geospatial data (current, accurate, and consistent) of Canadian hydrographic phenomena. It aims to offer a geometric description and a set of basic attributes on hydrographic features that comply with international geomatics standards, seamlessly across Canada. The CanVec multiscale series is available as prepackaged downloadable files and by user-defined extent via a Geospatial data extraction tool. Related Products: Topographic Data of Canada - CanVec Series

  6. Customer Segmentation Data

    • kaggle.com
    Updated Mar 11, 2024
    Cite
    Raval Smit (2024). Customer Segmentation Data [Dataset]. https://www.kaggle.com/datasets/ravalsmit/customer-segmentation-data
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    Raval Smit
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides comprehensive customer data suitable for segmentation analysis. It includes anonymized demographic, transactional, and behavioral attributes, allowing for detailed exploration of customer segments. Leveraging this dataset, marketers, data scientists, and business analysts can uncover valuable insights to optimize targeted marketing strategies and enhance customer engagement. Whether you're looking to understand customer behavior or improve campaign effectiveness, this dataset offers a rich resource for actionable insights and informed decision-making.

    Key Features:

    • Anonymized demographic, transactional, and behavioral data.
    • Suitable for customer segmentation analysis.
    • Opportunities to optimize targeted marketing strategies.
    • Valuable insights for improving campaign effectiveness.
    • Ideal for marketers, data scientists, and business analysts.

    Usage Examples:

    • Segmenting customers based on demographic attributes.
    • Analyzing purchase behavior to identify high-value customer segments.
    • Optimizing marketing campaigns for targeted engagement.
    • Understanding customer preferences and tailoring product offerings accordingly.
    • Evaluating the effectiveness of marketing strategies and iterating for improvement.

    Explore this dataset to unlock actionable insights and drive success in your marketing initiatives!

  7. Research Data Repository Requirements and Features Review

    • borealisdata.ca
    • dataone.org
    Updated Aug 24, 2015
    Cite
    Amber Leahey; Peter Webster; Claire Austin; Nancy Fong; Julie Friddell; Chuck Humphrey; Susan Brown; Walter Stewart (2015). Research Data Repository Requirements and Features Review [Dataset]. http://doi.org/10.5683/SP3/UPABVH
    Dataset updated
    Aug 24, 2015
    Dataset provided by
    Borealis
    Authors
    Amber Leahey; Peter Webster; Claire Austin; Nancy Fong; Julie Friddell; Chuck Humphrey; Susan Brown; Walter Stewart
    License

    https://borealisdata.ca/api/datasets/:persistentId/versions/4.0/customlicense?persistentId=doi:10.5683/SP3/UPABVH

    Time period covered
    Sep 2014 - Feb 2015
    Area covered
    Canada, United States, Europe, United Kingdom, International
    Description

    Data collected from major Canadian and international research data repositories cover data storage, preservation, metadata, interchange, data file types, and other standard features used in the retention and sharing of research data. The outputs of this project primarily aim to assist in the establishment of recommended minimum requirements for a Canadian research data infrastructure. The committee also aims to further develop guidelines and criteria for the assessment and selection of repositories for deposit of Canadian research data by researchers, data managers, librarians, archivists, etc.

  8. GIS Features of the Geospatial Fabric for National Hydrologic Modeling

    • catalog.data.gov
    • data.usgs.gov
    • +5more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). GIS Features of the Geospatial Fabric for National Hydrologic Modeling [Dataset]. https://catalog.data.gov/dataset/gis-features-of-the-geospatial-fabric-for-national-hydrologic-modeling
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey http://www.usgs.gov/
    Description

    The Geospatial Fabric provides a consistent, documented, and topologically connected set of spatial features that create an abstracted stream/basin network of features useful for hydrologic modeling. The GIS vector features contained in this Geospatial Fabric (GF) data set cover the lower 48 U.S. states, Hawaii, and Puerto Rico. Four GIS feature classes are provided for each Region: 1) the Region outline ("one"), 2) Points of Interest ("POIs"), 3) a routing network ("nsegment"), and 4) Hydrologic Response Units ("nhru"). A graphic showing the boundaries for all Regions is provided at http://dx.doi.org/doi:10.5066/F7542KMD. These Regions are identical to those used to organize the NHDPlus v.1 dataset (US EPA and US Geological Survey, 2005). Although the GF Feature data set has been derived from NHDPlus v.1, it is an entirely new data set that has been designed to generically support regional and national scale applications of hydrologic models. Definition of each type of feature class and its derivation is provided within the

  9. Effect of gamelike features on cognitive test performance - Datasets -...

    • data.bris.ac.uk
    Updated Apr 4, 2016
    Cite
    (2016). Effect of gamelike features on cognitive test performance - Datasets - data.bris [Dataset]. https://data.bris.ac.uk/data/dataset/1hjvqlpbtrk961ua9ml40bauie
    Dataset updated
    Apr 4, 2016
    Description

    This study compared three versions of a Go/No-Go (GNG) task, each with different gamelike features (non-game, points, theme), across two different testing sites (laboratory and online). We used a between-subjects design, with reaction times (RT) on Go trials, Go trial accuracy, No-Go trial accuracy, and subjective ratings as the dependent variables of interest.

  10. 24k Hydro Full File Geodatabase

    • data-wi-dnr.opendata.arcgis.com
    • arc-gis-hub-home-arcgishub.hub.arcgis.com
    Updated Aug 1, 2017
    Cite
    Wisconsin Department of Natural Resources (2017). 24k Hydro Full File Geodatabase [Dataset]. https://data-wi-dnr.opendata.arcgis.com/datasets/cb1c7f75d14f42ee819a46894fd2e771
    Dataset updated
    Aug 1, 2017
    Dataset authored and provided by
    Wisconsin Department of Natural Resources http://dnr.wi.gov/
    Description

    24K Hydro File Geodatabase, including bank lines, flow lines, junction points, hydro lines, water bodies, hydro points, and a network. Access the user guide, data dictionaries, and metadata below.

    The DNR Hydrography database was developed statewide using several 1:24,000-scale sources. This data layer includes information about surface water features represented on the USGS 1:24,000-scale topographic map series such as perennial and intermittent streams, lakes, etc. Because the sources of the Hydrography data span many years and originate from several sources, the data may reflect areas of transition from one source to another. As a result, the water features as represented in the Hydrography data may not always match what you see on a particular USGS quad or Digital Raster Graphic (DRG). General source information is presented on this map: Wisconsin Hydrography Source Information. Note: Wetlands delineations are not included in the DNR Hydrography data layer. For information about DNR Wetlands data, see the Wisconsin Wetland Inventory web page.

    Report errors in this data to Dennis Wiese (dennis.wiese@wisconsin.gov) with the following information:

    • HYDROID of the feature in question; OR if the feature is missing, a location coordinate or description (e.g. latitude/longitude, Public Land Survey System Township, Range, and Section identifier) that identifies the area in question.
    • Optional but very helpful: a screen capture of the area in question, or the Water Body Identification Code (WBIC) of the feature in question.

    DNR staff can access the hydrography database in the agency's central GIS data repository. The hydrography feature classes are stored in the feature dataset "W23324.WD_HYDRO_DATA_24K".

    USER GUIDES AND DOCUMENTATION:

    • WDNR_HYDRO_24k_GETTING STARTED
    • WDNR HYDRO 24K UPDATES DOCUMENT
    • 24K HYDRO DECISION RULES

    Data Dictionaries and Metadata:

    • WDNR_HYDRO_24k_waterbody_data_dict
    • WDNR_HYDRO_24k_waterbody_metadata
    • WDNR_HYDRO_24k_flowline_data_dict
    • WDNR_HYDRO_24k_flowline_metadata
    • WDNR_HYDRO_24k_bank_data_dict
    • WDNR_HYDRO_24k_bank_metadata
    • WDNR_HYDRO_24k_junction_data_dict
    • WDNR_HYDRO_24k_junction_metadata
    • WDNR_HYDRO_24k_line_data_dict
    • WDNR_HYDRO_24k_line_metadata
    • WDNR_HYDRO_24k_flowline_wbic_data_dict
    • WDNR_HYDRO_24k_flowline_wbic_metadata
    • WDNR_HYDRO_24k_waterbody_wbic_data_dict
    • WDNR_HYDRO_24k_waterbody_wbic_metadata

    ArcMap Layer (.lyr) Files:

    • 24k Hydro Flowline Duration
    • 24k Hydro Bank Lines
    • 24k Hydro Flowline Streams
    • 24k Hydro Waterbody Open Water

  11. Zoning GIS Data: Geodatabase

    • data.cityofnewyork.us
    • data.ny.gov
    • +1more
    application/rdfxml +5
    Updated Jan 29, 2013
    Cite
    Department of City Planning (DCP) (2013). Zoning GIS Data: Geodatabase [Dataset]. https://data.cityofnewyork.us/City-Government/Zoning-GIS-Data-Geodatabase/mm69-vrje
    Explore at:
    Available download formats: csv, application/rssxml, xml, application/rdfxml, json, tsv
    Dataset updated
    Jan 29, 2013
    Dataset authored and provided by
    Department of City Planning (DCP)
    Description

    This data set consists of 6 classes of zoning features: zoning districts, special purpose districts, special purpose district subdistricts, limited height districts, commercial overlay districts, and zoning map amendments.

    All previously released versions of this data are available at BYTES of the BIG APPLE - Archive.

  12. Landmark Features - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Mar 29, 2023
    Cite
    ckan.publishing.service.gov.uk (2023). Landmark Features - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/landmark-features1
    Dataset updated
    Mar 29, 2023
    Dataset provided by
    CKAN https://ckan.org/
    Description

    The location of Landmark Features within Nottingham City Centre. Landmark Features are points of local interest and significance within the townscape.

  13. Park Features By PMAID

    • cos-data.seattle.gov
    • data.seattle.gov
    • +2more
    application/rdfxml +5
    Updated Oct 4, 2024
    Cite
    City of Seattle (2024). Park Features By PMAID [Dataset]. https://cos-data.seattle.gov/Community-and-Culture/Park-Features-By-PMAID/xrnu-8eiq
    Explore at:
    Available download formats: csv, json, xml, application/rssxml, tsv, application/rdfxml
    Dataset updated
    Oct 4, 2024
    Dataset authored and provided by
    City of Seattle
    License

    U.S. Government Works https://www.usa.gov/government-works
    License information was derived automatically

    Description

    This dataset contains a list of features for each park PMAID.

  14. Loan Approval Classification Dataset

    • kaggle.com
    Updated Oct 29, 2024
    Cite
    Ta-wei Lo (2024). Loan Approval Classification Dataset [Dataset]. https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    Ta-wei Lo
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    1. Data Source

    This dataset is a synthetic version inspired by the original Credit Risk dataset on Kaggle and enriched with additional variables based on Financial Risk for Loan Approval data. SMOTENC was used to simulate new data points and increase the number of instances. The dataset is structured for both categorical and continuous features.

    2. Metadata

    The dataset contains 45,000 records and 14 variables, each described below:

    Column | Description | Type
    person_age | Age of the person | Float
    person_gender | Gender of the person | Categorical
    person_education | Highest education level | Categorical
    person_income | Annual income | Float
    person_emp_exp | Years of employment experience | Integer
    person_home_ownership | Home ownership status (e.g., rent, own, mortgage) | Categorical
    loan_amnt | Loan amount requested | Float
    loan_intent | Purpose of the loan | Categorical
    loan_int_rate | Loan interest rate | Float
    loan_percent_income | Loan amount as a percentage of annual income | Float
    cb_person_cred_hist_length | Length of credit history in years | Float
    credit_score | Credit score of the person | Integer
    previous_loan_defaults_on_file | Indicator of previous loan defaults | Categorical
    loan_status (target variable) | Loan approval status: 1 = approved; 0 = rejected | Integer

    3. Data Usage

    The dataset can be used for multiple purposes:

    • Exploratory Data Analysis (EDA): Analyze key features, distribution patterns, and relationships to understand credit risk factors.
    • Classification: Build predictive models to classify the loan_status variable (approved/not approved) for potential applicants.
    • Regression: Develop regression models to predict the credit_score variable based on individual and loan-related attributes.

    Be aware of data issues carried over from the original data, such as instances with a recorded age above 100 years.
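
    The age caveat above can be handled with a small filtering step before any modeling. What follows is a minimal sketch, not the dataset author's method: it assumes the records have already been loaded as dictionaries keyed by the column names from the metadata section, and both the 100-year cutoff constant and the helper name `drop_implausible_ages` are hypothetical choices made for illustration.

```python
from typing import Iterable, List

# Hypothetical cutoff: the description only flags ages above 100 as a known issue.
MAX_PLAUSIBLE_AGE = 100.0


def drop_implausible_ages(records: Iterable[dict],
                          max_age: float = MAX_PLAUSIBLE_AGE) -> List[dict]:
    """Return only the records whose person_age is at most max_age."""
    kept = []
    for row in records:
        # person_age is documented as a Float column; convert defensively
        # in case the value arrived as a string (for example, from a CSV reader).
        if float(row["person_age"]) <= max_age:
            kept.append(row)
    return kept


# Made-up rows for illustration only; these values are not taken from the dataset.
sample = [
    {"person_age": "22.0", "loan_status": "1"},
    {"person_age": "144.0", "loan_status": "0"},
    {"person_age": "35.0", "loan_status": "0"},
]
cleaned = drop_implausible_ages(sample)
print(len(cleaned))
```

    Whether to drop such rows, cap the age, or keep them is an analysis decision; the sketch only shows the simplest option.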

    This dataset provides a rich basis for understanding financial risk factors and simulating predictive modeling processes for loan approval and credit scoring.

    Feel free to leave comments on the discussion. I'd appreciate your upvote if you find my dataset useful! 😀

  15. Topographic Data of Canada - CanVec Series

    • open.canada.ca
    • catalogue.arctic-sdi.org
    • +3more
    fgdb/gdb, html, kmz +3
    Updated May 19, 2023
    Cite
    Natural Resources Canada (2023). Topographic Data of Canada - CanVec Series [Dataset]. https://open.canada.ca/data/en/dataset/8ba2aa2a-7bb9-4448-b4d7-f164409fe056
    Explore at:
    Available download formats: html, fgdb/gdb, wms, shp, kmz, pdf
    Dataset updated
    May 19, 2023
    Dataset provided by
    Ministry of Natural Resources of Canada https://www.nrcan.gc.ca/
    License

    Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Area covered
    Canada
    Description

    CanVec contains more than 60 topographic feature classes organized into 8 themes: Transport Features, Administrative Features, Hydro Features, Land Features, Manmade Features, Elevation Features, Resource Management Features and Toponymic Features. This multiscale product originates from the best available geospatial data sources covering Canadian territory. It offers quality topographic information in vector format complying with international geomatics standards. CanVec can be used in Web Map Services (WMS) and geographic information systems (GIS) applications and used to produce thematic maps. Because of its many attributes, CanVec allows for extensive spatial analysis.

    Related Products:

    • Constructions and Land Use in Canada - CanVec Series - Manmade Features
    • Lakes, Rivers and Glaciers in Canada - CanVec Series - Hydrographic Features
    • Administrative Boundaries in Canada - CanVec Series - Administrative Features
    • Mines, Energy and Communication Networks in Canada - CanVec Series - Resources Management Features
    • Wooded Areas, Saturated Soils and Landscape in Canada - CanVec Series - Land Features
    • Transport Networks in Canada - CanVec Series - Transport Features
    • Elevation in Canada - CanVec Series - Elevation Features
    • Map Labels - CanVec Series - Toponymic Features

  16. Ecological Sections (Feature Layer)

    • sal-urichmond.hub.arcgis.com
    • datasets.ai
    • +6more
    Updated Jan 1, 2007
    Cite
    U.S. Forest Service (2007). Ecological Sections (Feature Layer) [Dataset]. https://sal-urichmond.hub.arcgis.com/datasets/usfs::ecological-sections-feature-layer
    Dataset updated
    Jan 1, 2007
    Dataset authored and provided by
    U.S. Forest Service
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This data set includes polygons for ecological sections within Subregions within the conterminous United States. This data set contains regional geographic delineations for analysis of ecological relationships across ecological units.

  17. Data from: LBA-ECO LC-09 Natural, Infrastructure, and Boundary Features,...

    • data.nasa.gov
    • datasets.ai
    • +6more
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). LBA-ECO LC-09 Natural, Infrastructure, and Boundary Features, Amazonian Sites, Brazil [Dataset]. https://data.nasa.gov/dataset/lba-eco-lc-09-natural-infrastructure-and-boundary-features-amazonian-sites-brazil-804ed
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA http://nasa.gov/
    Area covered
    Brazil
    Description

    This data set includes 16 zipped archives of shapefiles of cities, rivers and streams, roads, and study area boundaries of several Amazonian study sites: Altamira, Santarem, Bragantina, and Ponta de Pedras, in the state of Para, and 1 site at Machadinho D'Oeste, in the state of Rondonia. Data from Brazil were digitized from Instituto Nacional de Colonizacao e Reforma Agraria (INCRA) maps and other data from Instituto Brasileiro de Geografia e Estatistica (IBGE). These products were prepared in the 2000-2004 time period. The date of creation for the source material is unknown.

  18. Data from: Feature selection in an interactive search-based PLA design...

    • zenodo.org
    Updated May 18, 2023
    Cite
    Anonymous; Anonymous (2023). Feature selection in an interactive search-based PLA design approach [Dataset]. http://doi.org/10.5281/zenodo.7942374
    Dataset updated
    May 18, 2023
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Anonymous; Anonymous
    Description

    The Product Line Architecture (PLA) is one of the most important artifacts of a Software Product Line (SPL). PLA design can be formulated as an interactive optimization problem with many conflicting factors. Incorporating Decision Makers’ (DM) preferences during the search process may help the algorithms find solutions better suited to their profiles. Interactive approaches allow the DM to evaluate solutions, guiding the optimization according to their preferences. However, this brings up human fatigue problems caused by the excessive number of interactions and solutions to evaluate. A common strategy to prevent this problem is limiting the number of interactions and solutions evaluated by the DM. Machine Learning (ML) models have also been used to learn how to evaluate solutions according to the DM profile and to replace the DM after some interactions. Feature selection performs an essential task, as non-relevant and/or redundant features used to train the ML model can reduce the accuracy and comprehensibility of the hypotheses induced by ML algorithms. This work aims to select features of an ML model used to prevent human fatigue in an interactive search-based PLA design approach. We applied four selectors and, based on the results, were able to reduce the number of features by 30% while obtaining an accuracy of 99%.

  19. Stress Trajectories Determined from Breakouts (GIS data, line features) -...

    • data.urbandatacentre.ca
    • beta.data.urbandatacentre.ca
    Updated Jun 24, 2025
    Cite
    (2025). Stress Trajectories Determined from Breakouts (GIS data, line features) - Catalogue - Canadian Urban Data Catalogue (CUDC) [Dataset]. https://data.urbandatacentre.ca/dataset/ab-gda-dig_2008_0316
    Dataset updated
    Jun 24, 2025
    Description

    The Geological Atlas of the Western Canada Sedimentary Basin was designed primarily as a reference volume documenting the subsurface geology of the Western Canada Sedimentary Basin. This GIS dataset is one of a collection of shapefiles representing part of Chapter 29 of the Atlas, In-situ Stress in the Western Canada Sedimentary Basin, Figure 10, Stress Trajectories Determined from Breakouts. Shapefiles were produced from archived digital files created by the Alberta Geological Survey in the mid-1990s, and edited in 2005-06 to correct, attribute and consolidate the data into single files by feature type and by figure.

  20. Chinook Abundance - Point Features [ds180]

    • gis-california.opendata.arcgis.com
    • data.cnra.ca.gov
    • +8more
    Updated Jan 31, 2020
    Cite
    California Department of Fish and Wildlife (2020). Chinook Abundance - Point Features [ds180] [Dataset]. https://gis-california.opendata.arcgis.com/datasets/CDFW::chinook-abundance-point-features-ds180
    Dataset updated
    Jan 31, 2020
    Dataset authored and provided by
    California Department of Fish and Wildlife https://wildlife.ca.gov/
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset ds180_Chinook_pnts is a product of the CalFish Adult Salmonid Abundance Database. Data in this shapefile are collected from point features, such as dams and hatcheries. Some escapement monitoring locations, such as spawning stock surveys, are logically represented by linear features. See the companion linear feature shapefile ds181_Chinook_ln for information collected from stream reaches.

    The CalFish Abundance Database contains a comprehensive collection of anadromous fisheries abundance information. Beginning in 1998, the Pacific States Marine Fisheries Commission, the California Department of Fish and Game, and the National Marine Fisheries Service began a cooperative project aimed at collecting, archiving, and entering into standardized electronic formats the wealth of information generated by fisheries resource management agencies and tribes throughout California.

    The data format provides sufficient detail to convey the relative accuracy of each population trend index record, yet is simple and straightforward enough to be suited for public use. For those interested in more detail, the database offers hyperlinks to digital copies of the original documents used to compile the information. In this way the database serves as an information hub directing the user to additional supporting information. This offers utility to field biologists and others interested in obtaining information for more in-depth analysis. Hyperlinks, built into the spatial data attribute tables used in the BIOS and CalFish I-map viewers, open the detailed index data archived in the on-line CalFish database application. The information can also be queried directly from the database via the CalFish Tabular Data Query. Once the detailed annual trend data are in view, another hyperlink opens a digital copy of the document used to compile each record.

    During 2010, as a part of the Central Valley Chinook Comprehensive Monitoring Plan, the CalFish Salmonid Abundance Database was reorganized and updated. CalFish provides a central location for sharing Central Valley Chinook salmon escapement estimates and annual monitoring reports with all stakeholders, including the public. Annual Chinook salmon in-river escapement indices that were, in many cases, eight to ten years behind are now current through 2009. In some cases, multiple datasets were consolidated into a single, more comprehensive dataset to more closely reflect how data are reported in the California Department of Fish and Game standard index, Grandtab.

    Extensive data are currently available in the CalFish Abundance Database for California Chinook, coho, and steelhead. Major data categories include adult abundance population estimates, actual fish and/or carcass counts, counts of fish collected at dams, weirs, or traps, and redd counts. Harvest data have also been compiled for many streams.

    This CalFish Abundance Database shapefile was generated from fully routed 1:100,000 hydrography. In a few cases, streams had to be added to the hydrography dataset in order to create shapefiles representing the abundance data associated with them. Streams added were digitized at no more than 1:24,000 scale based on stream line images portrayed in 1:24,000 Digital Raster Graphics (DRG).

    The features in this layer represent the locations to which abundance data records apply. In many cases there are multiple datasets associated with the same location, so features may overlap. Please view the associated datasets for detail regarding specific features. In CalFish these are accessed through the "link" field that is visible when performing an identify or query operation. A URL string is provided with each feature in the downloadable data, which can also be used to access the underlying datasets.

    The Chinook data available from the CalFish website are mirrored from the StreamNet website, where the CalFish Abundance Database's tabular data are currently stored. Additional information about StreamNet may be downloaded at http://www.streamnet.org. Complete documentation for the StreamNet database may be accessed at http://www.streamnet.org/def.html

