95 datasets found
  1. Overview Metadata for the Regression Model Data, Estimated Discharge Data,...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). Overview Metadata for the Regression Model Data, Estimated Discharge Data, and Calculated Flux and Yields Data at Tumacácori National Historical Park and the Upper Santa Cruz River, Arizona (1994-2017) [Dataset]. https://catalog.data.gov/dataset/overview-metadata-for-the-regression-model-data-estimated-discharge-data-and-calculat-1994
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Tumacacori-Carmen, Arizona, Santa Cruz River
    Description

    This data release contains three datasets that were used in the Scientific Investigations Report (SIR): Spatial and Temporal Distribution of Bacterial Indicators and Microbial Source Tracking within Tumacácori National Historical Park and the Upper Santa Cruz River, Arizona, 2015-16. The datasets contain regression model data, estimated discharge data, and calculated flux and yields data.

    Regression Model Data: This dataset contains data used in regression model development in the SIR. The period of record ranges from May 25, 1994 to May 19, 2017. Data from 2015 to 2017 were collected by the U.S. Geological Survey; data prior to 2015 were provided by various agencies. The dataset contains:
    - Season, represented as an indicator variable (Fall, Spring, Summer, and Winter)
    - Hydrologic condition, represented as an indicator variable (rising limb, recession limb, peak, or unable to classify)
    - Flood, a binary variable indicating whether the sample was collected during a flood event
    - Decimal date (DT), represented as a continuous variable
    - Sine of DT, a continuous variable for a periodic function describing seasonal variation
    - Cosine of DT, a continuous variable for a periodic function describing seasonal variation

    Estimated Discharge: This dataset contains estimated discharge at four sites between 03/02/2015 and 12/14/2016. Discharge was estimated using nearby streamgage relations; methods are described in detail in the SIR. The sites where discharge was estimated are:
    - NW8; 312551110573901; Nogales Wash at Ruby Road
    - SC3; 312654110573201; Santa Cruz River abv Nogales Wash
    - SC10; 313343110024701; Santa Cruz River at Santa Gertrudis Lane
    - SC14; 09481740; Santa Cruz River at Tubac, AZ

    Calculated Flux and Yields: This dataset contains calculated flux and yields for E. coli and suspended sediment concentrations. Mean daily flux was calculated when mean daily discharge was available at a corresponding streamgage. Instantaneous flux was calculated when instantaneous discharge (at 15-minute intervals) was available at a corresponding streamgage, or from a measured or estimated discharge value. Yields were calculated using the calculated flux values and the areas of the different watersheds. Methods and equations are described in detail in the SIR. The dataset contains:
    - Mean daily E. coli flux, in most probable number per day
    - Mean daily suspended sediment flux, in tons per day
    - Instantaneous E. coli flux, in most probable number per second
    - Instantaneous suspended sediment flux, in tons per second
    - E. coli yield, in most probable number per square mile
    - Suspended sediment yield, in tons per square mile
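
    As a rough illustration of the predictors and derived quantities described above (not the USGS workflow), the sketch below builds the sine/cosine seasonal terms from a decimal date and computes an instantaneous flux and yield; the column names, unit conversion, and area value are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Minimal sketch: periodic seasonal predictors from a decimal date, plus
# flux and yield from concentration, discharge, and watershed area.
df = pd.DataFrame({
    "decimal_date": [1994.39, 2015.17, 2016.95],    # DT
    "ecoli_mpn_per_100ml": [120.0, 2400.0, 350.0],   # concentration
    "discharge_cfs": [5.2, 310.0, 12.5],             # discharge, ft^3/s
})

# Periodic terms with a one-year period: sin(2*pi*DT) and cos(2*pi*DT)
df["sin_DT"] = np.sin(2 * np.pi * df["decimal_date"])
df["cos_DT"] = np.cos(2 * np.pi * df["decimal_date"])

# Instantaneous flux ~ concentration * discharge (with unit conversion);
# 1 ft^3 = 28,316.8 ml, so MPN/100 ml * ft^3/s -> MPN/s
df["ecoli_flux_mpn_per_s"] = (
    df["ecoli_mpn_per_100ml"] / 100.0 * df["discharge_cfs"] * 28316.8
)

# Yield = flux divided by contributing watershed area (placeholder value)
watershed_area_sq_mi = 533.0
df["ecoli_yield_mpn_per_s_per_sq_mi"] = (
    df["ecoli_flux_mpn_per_s"] / watershed_area_sq_mi
)
print(df.round(2))
```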

  2. Root mean square errors (RMSE) obtained with three different datasets for...

    • plos.figshare.com
    xls
    Updated Jun 5, 2023
    Cite
    Lucie Michel; David Makowski (2023). Root mean square errors (RMSE) obtained with three different datasets for several statistical models: linear regression (L), quadratic regression (Q), cubic regression (C), dynamic linear models with and without trend (DLMs, DLM0), and linear-plus-plateau (LP). [Dataset]. http://doi.org/10.1371/journal.pone.0078615.t001
    Available download formats: xls
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lucie Michel; David Makowski
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The differences with respect to the lowest RMSE value are expressed as a percentage of the lowest RMSE value (RMSEmin): Difference = 100 × (RMSE − RMSEmin) / RMSEmin.
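
    A small sketch of the table's "Difference" formula, with made-up numbers:

```python
def rmse_difference_pct(rmse: float, rmse_min: float) -> float:
    """Relative difference from the lowest RMSE, as used in the table:
    Difference = 100 * (RMSE - RMSEmin) / RMSEmin."""
    return 100.0 * (rmse - rmse_min) / rmse_min

# Example: an RMSE of 0.55 against a best (lowest) RMSE of 0.50
# is about 10% above the minimum.
print(rmse_difference_pct(0.55, 0.50))  # ~10.0
```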

  3. Dataset for Space Partitioning and Regression Mode Seeking via a...

    • ieee-dataport.org
    Updated Mar 15, 2021
    Cite
    Wanli Qiao (2021). Dataset for Space Partitioning and Regression Mode Seeking via a Mean-Shift-Inspired Algorithm [Dataset]. https://ieee-dataport.org/open-access/dataset-space-partitioning-and-regression-mode-seeking-mean-shift-inspired-algorithm
    Dataset updated
    Mar 15, 2021
    Authors
    Wanli Qiao
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The mean shift algorithm estimates the modes of a density function by iterative gradient ascent. In this paper we develop a mean-shift-inspired algorithm to estimate the modes of regression functions and partition the sample points in the input space. We prove convergence of the sequences generated by the algorithm and derive non-asymptotic rates of convergence of the estimated local modes for the underlying regression model.
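
    As a rough illustration only (not the paper's algorithm), the sketch below runs plain numerical gradient ascent on a Nadaraya-Watson regression estimate in one dimension; starting points that ascend to the same estimated regression mode form one cell of the induced partition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a regression function with two local modes plus noise
x = rng.uniform(-3, 3, 400)
y = np.exp(-(x - 1.5) ** 2) + 0.8 * np.exp(-(x + 1.5) ** 2) + rng.normal(0, 0.05, x.size)

def nw_estimate(t, x, y, h=0.3):
    """Nadaraya-Watson kernel regression estimate at point(s) t."""
    w = np.exp(-0.5 * ((t[:, None] - x[None, :]) / h) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

def ascend(t0, step=0.05, eps=1e-3, iters=200):
    """Numerical gradient ascent on the estimated regression function."""
    t = float(t0)
    for _ in range(iters):
        grad = (nw_estimate(np.array([t + eps]), x, y)
                - nw_estimate(np.array([t - eps]), x, y)) / (2 * eps)
        t_new = t + step * grad[0]
        if abs(t_new - t) < 1e-6:
            break
        t = t_new
    return t

# Each starting point converges to (roughly) one of the two regression modes.
starts = np.linspace(-3, 3, 7)
print([round(ascend(s), 2) for s in starts])
```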

  4. Sea Surface Temperature (SST) Standard Deviation of Long-term Mean,...

    • data.amerigeoss.org
    • data.ioos.us
    • +2 more
    wcs, wms, xml
    Updated Jul 15, 2019
    Cite
    ioos (2019). Sea Surface Temperature (SST) Standard Deviation of Long-term Mean, 2000-2013 - Hawaii [Dataset]. https://data.amerigeoss.org/pt_PT/dataset/sea-surface-temperature-sst-standard-deviation-of-long-term-mean-2000-2013-hawaii
    Available download formats: wcs, wms, xml
    Dataset updated
    Jul 15, 2019
    Dataset provided by
    ioos
    Area covered
    Hawaii
    Description

    Sea surface temperature (SST) plays an important role in a number of ecological processes and can vary over a wide range of time scales, from daily to decadal changes. SST influences primary production, species migration patterns, and coral health. If temperatures are anomalously warm for extended periods of time, drastic changes in the surrounding ecosystem can result, including harmful effects such as coral bleaching. This layer represents the standard deviation of SST (degrees Celsius) of the weekly time series from 2000-2013.

    Three SST datasets were combined to provide continuous coverage from 1985-2013. The concatenation applies bias adjustment derived from linear regression to the overlap periods of datasets, with the final representation matching the 0.05-degree (~5-km) near real-time SST product. First, a weekly composite, gap-filled SST dataset from the NOAA Pathfinder v5.2 SST 1/24-degree (~4-km), daily dataset (a NOAA Climate Data Record) for each location was produced following Heron et al. (2010) for January 1985 to December 2012. Next, weekly composite SST data from the NOAA/NESDIS/STAR Blended SST 0.1-degree (~11-km), daily dataset was produced for February 2009 to October 2013. Finally, a weekly composite SST dataset from the NOAA/NESDIS/STAR Blended SST 0.05-degree (~5-km), daily dataset was produced for March 2012 to December 2013.

    The standard deviation of the long-term mean SST was calculated by taking the standard deviation over all weekly data from 2000-2013 for each pixel.
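
    As a rough illustration (not the product's actual processing), the sketch below computes a per-pixel standard deviation over the time axis of a weekly SST stack and shows a simple linear-regression bias adjustment over an overlap period; array shapes and values are synthetic.

```python
import numpy as np

# Per-pixel standard deviation of a weekly SST stack with shape (time, lat, lon)
weeks, nlat, nlon = 14 * 52, 60, 80      # ~2000-2013 weekly stack (toy size)
sst = 25 + 2 * np.random.default_rng(1).standard_normal((weeks, nlat, nlon))

sst_std = np.nanstd(sst, axis=0)         # degrees Celsius, per pixel
print(sst_std.shape)                     # (60, 80)

# Bias adjustment between two products over an overlap period can be a
# simple linear regression (illustrative only):
overlap_a = sst[:200].mean(axis=(1, 2))  # coarse product, overlap weeks
overlap_b = overlap_a * 1.02 + 0.3       # finer product (synthetic offset)
slope, intercept = np.polyfit(overlap_a, overlap_b, 1)
adjusted_a = slope * overlap_a + intercept  # align product A to product B
```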

  5. Root mean square error of one-year ahead predictions (RMSEP) obtained with...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Lucie Michel; David Makowski (2023). Root mean square error of one-year ahead predictions (RMSEP) obtained with two different datasets for several statistical models: linear regression (L), quadratic regression (Q), cubic regression (C), dynamic linear models with and without trend (DLMs, DLM0), and Holt-Winters with and without trend (HWs, HW0). [Dataset]. http://doi.org/10.1371/journal.pone.0078615.t002
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lucie Michel; David Makowski
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The differences with respect to the lowest RMSEP value are expressed as a percentage of the lowest RMSEP value (RMSEPmin): Difference = 100 × (RMSEP − RMSEPmin) / RMSEPmin.

  6. Example Groundwater-Level Datasets and Benchmarking Results for the...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Oct 13, 2024
    Cite
    U.S. Geological Survey (2024). Example Groundwater-Level Datasets and Benchmarking Results for the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) Software Package [Dataset]. https://catalog.data.gov/dataset/example-groundwater-level-datasets-and-benchmarking-results-for-the-automated-regional-cor
    Dataset updated
    Oct 13, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    This data release provides two example groundwater-level datasets used to benchmark the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) software package (Levy and others, 2024). The first dataset contains groundwater-level records and site metadata for wells located on Long Island, New York (NY), and some surrounding mainland sites in New York and Connecticut. The second dataset contains groundwater-level records and site metadata for wells located in the southeastern San Joaquin Valley of the Central Valley, California (CA). For ease of exposition these are referred to as the NY and CA datasets, respectively. Both datasets are formatted with column headers that can be read by the ARCHI software package within the R computing environment.

    These datasets were used to benchmark the imputation accuracy of three ARCHI model settings (OLS, ridge, and MOVE.1) against the widely used imputation program missForest (Stekhoven and Bühlmann, 2012). The ARCHI program was used to process the NY and CA datasets on monthly and annual timesteps, respectively, filter out sites with insufficient data for imputation, and create 200 test datasets from each of the example datasets with 5 percent of observations removed at random (herein referred to as "holdouts"). Imputation accuracy for test datasets was assessed using normalized root mean square error (NRMSE), which is the root mean square error divided by the standard deviation of the observed holdout values. ARCHI produces prediction intervals (PIs) using a non-parametric bootstrapping routine, which were assessed by computing a coverage rate (CR), defined as the proportion of holdout observations falling within the estimated PI. The multiple regression models included with the ARCHI package (OLS and ridge) were further tested on all test datasets at eleven different levels of the p_per_n input parameter, which limits the maximum ratio of regression model predictors (p) per observations (n) as a decimal fraction greater than zero and less than or equal to one.

    This data release contains ten tables formatted as tab-delimited text files. The "CA_data.txt" and "NY_data.txt" tables contain 243,094 and 89,997 depth-to-groundwater measurement values (value, in feet below land surface) indexed by site identifier (site_no) and measurement date (date) for the CA and NY datasets, respectively. The "CA_sites.txt" and "NY_sites.txt" tables contain site metadata for the 4,380 and 476 unique sites included in the CA and NY datasets, respectively. The "CA_NRMSE.txt" and "NY_NRMSE.txt" tables contain NRMSE values computed by imputing 200 test datasets with 5 percent random holdouts to assess imputation accuracy for three different ARCHI model settings and missForest using the CA and NY datasets, respectively. The "CA_CR.txt" and "NY_CR.txt" tables contain CR values used to evaluate non-parametric PIs generated by bootstrapping regressions with three different ARCHI model settings using the CA and NY test datasets, respectively. The "CA_p_per_n.txt" and "NY_p_per_n.txt" tables contain mean NRMSE values computed for 200 test datasets with 5 percent random holdouts at 11 different levels of p_per_n for OLS and ridge models, compared to training error for the same models on the entire CA and NY datasets, respectively.

    References Cited
    Levy, Z.F., Stagnitta, T.J., and Glas, R.L., 2024, ARCHI: Automated Regional Correlation Analysis for Hydrologic Record Imputation, v1.0.0: U.S. Geological Survey software release, https://doi.org/10.5066/P1VVHWKE.
    Stekhoven, D.J., and Bühlmann, P., 2012, MissForest—non-parametric missing value imputation for mixed-type data: Bioinformatics 28(1), 112-118, https://doi.org/10.1093/bioinformatics/btr597.
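
    A minimal sketch of the two benchmark metrics described above (NRMSE and coverage rate), using made-up arrays for the held-out observations, the imputed values, and the prediction-interval bounds; this is not ARCHI's code.

```python
import numpy as np

# Made-up holdout values, imputed estimates, and PI bounds
observed = np.array([12.3, 15.1, 9.8, 20.4, 14.7])
imputed = np.array([11.9, 15.8, 10.5, 19.6, 14.2])
pi_lower = imputed - 1.5
pi_upper = imputed + 1.5

rmse = np.sqrt(np.mean((imputed - observed) ** 2))
nrmse = rmse / np.std(observed)                       # RMSE / std of holdouts
coverage_rate = np.mean((observed >= pi_lower) & (observed <= pi_upper))

print(f"NRMSE = {nrmse:.3f}, CR = {coverage_rate:.2f}")
```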

  7. Data from: Count-Based Morgan Fingerprint: A More Efficient and...

    • acs.figshare.com
    xlsx
    Updated Jul 5, 2023
    Cite
    Shifa Zhong; Xiaohong Guan (2023). Count-Based Morgan Fingerprint: A More Efficient and Interpretable Molecular Representation in Developing Machine Learning-Based Predictive Regression Models for Water Contaminants’ Activities and Properties [Dataset]. http://doi.org/10.1021/acs.est.3c02198.s002
    Available download formats: xlsx
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    ACS Publications
    Authors
    Shifa Zhong; Xiaohong Guan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In this study, we introduce the count-based Morgan fingerprint (C-MF) to represent chemical structures of contaminants and develop machine learning (ML)-based predictive models for their activities and properties. Compared with the binary Morgan fingerprint (B-MF), C-MF not only qualifies the presence or absence of an atom group but also quantifies its counts in a molecule. We employ six different ML algorithms (ridge regression, SVM, KNN, RF, XGBoost, and CatBoost) to develop models on 10 contaminant-related data sets based on C-MF and B-MF to compare them in terms of the model’s predictive performance, interpretation, and applicability domain (AD). Our results show that C-MF outperforms B-MF in nine of 10 data sets in terms of model predictive performance. The advantage of C-MF over B-MF is dependent on the ML algorithm, and the performance enhancements are proportional to the difference in the chemical diversity of data sets calculated by B-MF and C-MF. Model interpretation results show that the C-MF-based model can elucidate the effect of atom group counts on the target and have a wider range of SHAP values. AD analysis shows that C-MF-based models have an AD similar to that of B-MF-based ones. Finally, we developed a “ContaminaNET” platform to deploy these C-MF-based models for free use.
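
    As a rough illustration (not the authors' code), the sketch below builds both representations with RDKit's commonly used Morgan fingerprint helpers; the SMILES string is an arbitrary example and the calls may vary slightly across RDKit versions.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# A binary Morgan fingerprint records only presence/absence of each hashed
# atom environment, while the count-based variant records how many times
# each environment occurs in the molecule.
mol = Chem.MolFromSmiles("CCOc1ccccc1OCC")   # example molecule

binary_fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
count_fp = AllChem.GetHashedMorganFingerprint(mol, 2, nBits=2048)

b = np.zeros((1,))
c = np.zeros((1,))
DataStructs.ConvertToNumpyArray(binary_fp, b)   # 0/1 per bit
DataStructs.ConvertToNumpyArray(count_fp, c)    # integer count per bit

# Bits set vs. total environment counts for the same molecule; either
# array can be fed to a regressor as the molecular representation.
print(int(b.sum()), int(c.sum()))
```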

  8. ‘Walmart Dataset (Retail)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Apr 18, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Walmart Dataset (Retail)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-walmart-dataset-retail-0283/e07567d8/?iid=003-947&v=presentation
    Dataset updated
    Apr 18, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Walmart Dataset (Retail)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/rutuspatel/walmart-dataset-retail on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Dataset Description:

    This is the historical data that covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields:

    Store - the store number

    Date - the week of sales

    Weekly_Sales - sales for the given store

    Holiday_Flag - whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week)

    Temperature - Temperature on the day of sale

    Fuel_Price - Cost of fuel in the region

    CPI – Prevailing consumer price index

    Unemployment - Prevailing unemployment rate

    Holiday Events:
    Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
    Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
    Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
    Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

    Analysis Tasks

    Basic Statistics tasks

    1) Which store has maximum sales

    2) Which store has the maximum standard deviation, i.e., where sales vary the most? Also, find out the coefficient of mean to standard deviation

    3) Which store(s) had a good quarterly growth rate in Q3 2012

    4) Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together

    5) Provide a monthly and semester view of sales in units and give insights

    Statistical Model

    For Store 1 – Build prediction models to forecast demand

    Linear Regression – Utilize variables like date and restructure dates as 1 for 5 Feb 2010 (starting from the earliest date in order). Hypothesize if CPI, unemployment, and fuel price have any impact on sales.

    Change dates into days by creating new variable.

    Select the model which gives best accuracy.

    --- Original source retains full ownership of the source dataset ---
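
    A minimal sketch of the Store 1 linear-regression task described above, assuming the file Walmart_Store_sales.csv with the listed columns and day-first dates; the actual file layout may differ.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("Walmart_Store_sales.csv")
store1 = df[df["Store"] == 1].copy()

# Restructure dates: 1 for the earliest week (5 Feb 2010), increasing in order
store1["Date"] = pd.to_datetime(store1["Date"], dayfirst=True)
store1 = store1.sort_values("Date")
store1["DayIndex"] = (store1["Date"] - store1["Date"].min()).dt.days + 1

X = store1[["DayIndex", "CPI", "Unemployment", "Fuel_Price"]]
y = store1["Weekly_Sales"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_.round(2))))
print("R^2:", round(r2_score(y, model.predict(X)), 3))
```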

  9. ‘Jiffs house price prediction dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Jiffs house price prediction dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-jiffs-house-price-prediction-dataset-458f/1a7ff5ac/?iid=048-724&v=presentation
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Jiffs house price prediction dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/elakiricoder/jiffs-house-price-prediction-dataset on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    I previously shared a classification dataset for predicting gender, which beginners liked because it gives pretty good accuracy. That encouraged me to create a regression dataset for predicting continuous values. I have tried many real-world regression datasets that predict with low accuracy and high error rates, and as a beginner I struggled to understand why a dataset performs poorly. This is another main reason why I created this dataset. Although it is a made-up dataset, I considered all the relevant features when deciding the price of the property. If you are a beginner, you will love trying this, as the results are stunning.

    Content

    Since this is a synthetically populated dataset, I will straightaway explain the features and the label.

    FEATURES
    1. land_size_sqm - The total size of the land, in square meters.
    2. house_size_sqm - The area of the house within the land, measured in square meters.
    3. no_of_rooms - The number of rooms in the house.
    4. no_of_bathrooms - The total number of bathrooms in the house.
    5. large_living_room - Whether the house has a large living room. All houses are assumed to contain a living room; '1' means large and '0' means small. In the categorical dataset, 1 and 0 are represented by 'yes' and 'No', respectively.
    6. parking_space - Whether there is a parking space ('1' = available, '0' = not available). In the categorical dataset, 1 and 0 are represented by 'yes' and 'No', respectively.
    7. front_garden - Whether there is a garden in front of the house ('1' = garden, '0' = no garden). In the categorical dataset, 1 and 0 are represented by 'yes' and 'No', respectively.
    8. swimming_pool - Whether a swimming pool is available ('1' = available, '0' = not available). In the categorical dataset, 1 and 0 are represented by 'yes' and 'No', respectively.
    9. distance_to_school_km - Distance from the house to the nearest school, in kilometers.
    10. wall_fence - Whether there is a wall fence ('1' = wall fence, '0' = no wall fence). In the categorical dataset, 1 and 0 are represented by 'yes' and 'No', respectively.
    11. house_age_or_renovated - Either the age of the house in years or the time since the date of renovation.
    12. water_front - Whether the house is located on the waterfront ('1' = waterfront, '0' = not near the water). In the categorical dataset, 1 and 0 are represented by 'yes' and 'No', respectively.
    13. distance_to_supermarket_km - Distance to the nearest supermarket, in kilometers.

    LABEL property_value - This is the price of the property

    The following features are available only in the "house price dataset original v2 cleaned" and "house price dataset original v2 with categorical features" data.
    14. crime_rate - A float between 0 and 7; lower is better.
    15. room_size - The size of the rooms: 0 is 'small', 1 is 'medium', 2 is 'large', and 3 is 'extra large'. In the categorical dataset, these values are categorical and self-explanatory.

    Acknowledgements

    I spent around 3 hours creating this dataset. Enjoy..

    Inspiration

    Share your notebooks to see which algorithm predicts the house price precisely.

    --- Original source retains full ownership of the source dataset ---
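
    A minimal sketch of one way to use the categorical variant described above: map the 'yes'/'No' flags back to 1/0 and fit a linear regression on property_value. The file name is a guess, and the remaining columns are assumed to be numeric.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Sketch only: file name is a guess, column names follow the description.
df = pd.read_csv("house_price_dataset_with_categorical_features.csv")

# Map the categorical 'yes'/'No' flags back to 1/0
yes_no_cols = ["large_living_room", "parking_space", "front_garden",
               "swimming_pool", "wall_fence", "water_front"]
for col in yes_no_cols:
    df[col] = df[col].str.lower().map({"yes": 1, "no": 0})

X = df.drop(columns=["property_value"])
y = df["property_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("Test R^2:", round(model.score(X_test, y_test), 3))
```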

  10. NOAA Climate Data Record (CDR) of Zonal Mean Ozone Binary Database of...

    • catalog.data.gov
    • data.cnra.ca.gov
    • +3 more
    Updated Sep 19, 2023
    Cite
    DOC/NOAA/NESDIS/NCEI > National Centers for Environmental Information, NESDIS, NOAA, U.S. Department of Commerce (Point of Contact) (2023). NOAA Climate Data Record (CDR) of Zonal Mean Ozone Binary Database of Profiles (BDBP), version 1.0 [Dataset]. https://catalog.data.gov/dataset/noaa-climate-data-record-cdr-of-zonal-mean-ozone-binary-database-of-profiles-bdbp-version-1-02
    Dataset updated
    Sep 19, 2023
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    National Centers for Environmental Information (https://www.ncei.noaa.gov/)
    United States Department of Commerce (http://www.commerce.gov/)
    National Environmental Satellite, Data, and Information Service
    Description

    This NOAA Climate Data Record (CDR) of Zonal Mean Ozone Binary Database of Profiles (BDBP) dataset is a vertically resolved, global, gap-free, zonal mean dataset created with a multiple-linear regression model. The dataset has a monthly resolution and spans the period 1979 to 2007. It provides a global product in 5-degree zonal bands and 70 vertical levels of the atmosphere. The regression is based on monthly mean ozone concentrations calculated from several different satellite instruments and global ozone soundings. Because of the regression model used to create the product, the various basis-function contributions are provided as unique levels, or tiers:
    - Tier 0: raw monthly mean data that were used in the regression model
    - Tier 1.1: Anthropogenic influences (as determined by the regression model)
    - Tier 1.2: Natural influences (as determined by the regression model)
    - Tier 1.3: Natural and volcanic influences (as determined by the regression model)
    - Tier 1.4: All influences (as determined by the regression model, CDR variable)

  11. Statistical analysis for: Mode I fracture of beech-adhesive bondline at...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    bin, csv, html, txt
    Updated Oct 4, 2022
    Cite
    Michael Burnard; Jaka Gašper Pečnik (2022). Statistical analysis for: Mode I fracture of beech-adhesive bondline at three different temperatures [Dataset]. http://doi.org/10.5281/zenodo.6839197
    Available download formats: csv, html, bin, txt
    Dataset updated
    Oct 4, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Burnard; Jaka Gašper Pečnik
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset collects a raw dataset and a processed dataset derived from the raw dataset. There is a document containing the analytical code for statistical analysis of the processed dataset in .Rmd format and .html format.

    The study examined some aspects of mechanical performance of solid wood composites. We were interested in certain properties of solid wood composites made using different adhesives with different grain orientations at the bondline, then treated at different temperatures prior to testing.

    Performance was tested by assessing fracture energy and critical fracture energy, lap shear strength, and compression strength of the composites. This document concerns only the fracture properties, which are the focus of the related paper.

    Notes:

    * the raw data is provided in this upload, but the processing is not addressed here.
    * the authors of this document are a subset of the authors of the related paper.
    * this document and the related data files were uploaded at the time of submission for review. An update providing the doi of the related paper will be provided when it is available.

  12. Telecom Company Dataset - Logistic Regression

    • kaggle.com
    Updated Jan 12, 2022
    Cite
    Uddhav Parab (2022). Telecom Company Dataset - Logistic Regression [Dataset]. https://www.kaggle.com/datasets/uddhavparab/telecom-company-dataset-logistic-regression
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Uddhav Parab
    Description

    "You have a telecom firm which has collected data of all its customers" The main types of attributes are : 1.Demographics (age, gender etc.) 2.Services availed (internet packs purchased, special offers etc) 3.Expenses (amount of recharge done per month etc.) Based on all this past information, you want to build a model which will predict whether a particular customer will churn or not. So the variable of interest, i.e. the target variable here is ‘Churn’ which will tell us whether or not a particular customer has churned. It is a binary variable 1 means that the customer has churned and 0 means the customer has not churned. With 21 predictor variables we need to predict whether a particular customer will switch to another telecom provider or not.

  13. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North Carolina
    Description

    The purpose of data mining analysis is always to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; before doing any work on the data, it has to be pre-processed, which normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. The reason may be that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all of the information.

    From the perspective of creating new features: clustering analysis creates labels based on patterns in the data, which brings uncertainty into the data. When using clustering prior to classification, the choice of the number of clusters strongly affects clustering performance, which in turn affects classification performance. If the subset of features we cluster on is well suited to it, clustering might increase overall classification performance; for example, if the features we run k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods. The practical outcome was that our results are not much better than random when applying clustering in the data preprocessing.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
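
    A small synthetic illustration of the idea discussed above (not the project's code): compare a classifier on the raw features against one that also receives a k-means cluster label as an engineered feature. The data, cluster count, and classifier are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the school data
X, y = make_classification(n_samples=1500, n_features=20, n_informative=8,
                           random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
baseline = cross_val_score(clf, X, y, cv=5).mean()

# Note: fitting k-means on the full X before cross-validation is a mild
# leak; it is kept simple here to mirror the exploratory setup described.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, labels])
augmented = cross_val_score(clf, X_aug, y, cv=5).mean()

print(f"baseline accuracy:  {baseline:.3f}")
print(f"with cluster label: {augmented:.3f}")
```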

  14. Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...

    • datahub.hku.hk
    Updated Aug 13, 2020
    Cite
    Wen Ma (2020). Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets [Dataset]. http://doi.org/10.25442/hku.12752549.v1
    Dataset updated
    Aug 13, 2020
    Dataset provided by
    HKU Data Repository
    Authors
    Wen Ma
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description
    1. NCCTG Lung cancer dataset: Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.
    2. CNV measurements of GBM: This dataset records information about copy number variation of Glioblastoma (GBM).

    Abstract: In biology and medicine, conservative patient and data collection malpractice can lead to missing or incorrect values in patient registries, which can affect both diagnosis and prognosis. Insufficient or biased patient information significantly impedes the sensitivity and accuracy of predicting cancer survival. In bioinformatics, making a best guess of the missing values and identifying the incorrect values are collectively called "imputation". Existing imputation methods work by establishing a model based on the data mechanism of the missing values, and they work well under two assumptions: 1) the data is missing completely at random, and 2) the percentage of missing values is not high. Neither case is typical of biomedical datasets, such as the Cancer Genome Atlas Glioblastoma Copy-Number Variant dataset (TCGA: 108 columns) or the North Central Cancer Treatment Group Lung Cancer (NCCTG) dataset (NCCTG: 9 columns). We tested six existing imputation methods, but only two of them worked with these datasets: Last Observation Carried Forward (LOCF) and the K-nearest neighbor algorithm (KNN). Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) worked only with the NCCTG lung cancer dataset, which has fewer columns, except when the dataset contains 45% missing data. The quality of the values imputed with existing methods is poor because the two assumptions are not met.

    In our study, we propose a Restricted Boltzmann Machine (RBM)-based imputation method to cope with low randomness and a high percentage of missing values. RBM is an undirected, probabilistic, parameterized two-layer neural network model, often used for extracting abstract information from data, especially high-dimensional data with unknown or non-standard distributions. In our benchmarks, we applied our method to two cancer datasets: 1) NCCTG and 2) TCGA. The running time and root mean squared error (RMSE) of the different methods were gauged. The benchmarks for the NCCTG dataset show that our method performs better than other methods when there is 5% missing data in the dataset, with an RMSE 4.64 lower than the best KNN. For the TCGA dataset, our method achieved an RMSE 0.78 lower than the best KNN.

    In addition to imputation, RBM can make predictions simultaneously. We compared the RBM model with four traditional prediction methods, measuring running time and area under the curve (AUC) to evaluate performance. Our RBM-based approach outperformed the traditional methods: the AUC was up to 19.8% higher than the multivariate logistic regression model on the NCCTG lung cancer dataset, and 28.1% higher than the Cox proportional hazards regression model on the TCGA dataset.

    Apart from imputation and prediction, RBM models can detect outliers in one pass by allowing reconstruction of all the inputs in the visible layer in a single backward pass. Our results show that RBM models achieved higher precision and recall in detecting outliers than other methods.
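
    As a rough illustration of the benchmarking scheme described above (not the RBM method itself), the sketch below hides 5 percent of the entries of a complete numeric matrix, imputes them with a simple baseline (scikit-learn's KNNImputer), and scores RMSE on the hidden entries; the data are synthetic.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

X_true = rng.normal(size=(200, 9))            # stand-in for a 9-column dataset
mask = rng.random(X_true.shape) < 0.05        # hide 5% of entries at random

X_missing = X_true.copy()
X_missing[mask] = np.nan

X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)
rmse = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))
print(f"RMSE on held-out entries: {rmse:.3f}")
```
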
  15. Predicting Returns of Discounted Articles Sales

    • kaggle.com
    Updated Jul 5, 2023
    Cite
    Oscar Aguilar (2023). Predicting Returns of Discounted Articles Sales [Dataset]. https://www.kaggle.com/datasets/oscarm524/predicting-returns-of-discounted-articles-sales/discussion
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Oscar Aguilar
    License

    Public Domain Dedication (CC0 1.0), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A fashion distributor sells articles of particular sizes and colors to its customers. In some cases items are returned to the distributor for various reasons. The order data and the related return data were recorded over a two-year period. The aim is to use this data and machine learning to build a model which enables a good prediction of return rates.

    The Data

    For this task real anonymized shop data are provided in the form of structured text files consisting of individual data sets. Below are some points to note about the files:

    1. Each data set is on a single line ending with "CR" ("carriage return", 0x0D) and "LF" ("line feed", 0x0A).
    2. The first line has the same structure as the data sets, but contains the names of the respective columns (data fields).
    3. The header line and each data set contain multiple fields separated from each other by a semi-colon (;).
    4. There is no escape character, quotes are not used.
    5. ASCII is the character set used.
    6. Missing values may occur. These are coded using the character string NA.

    Only the field names from the included document features.pdf can appear as column headings, in the order used in that document. The associated value ranges are also listed there.

    The file orders_train.txt contains all the data fields from the document, whereas the associated test file orders_class.txt does not contain the target variable 'returnQuantity'.
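
    A minimal loading sketch following the format notes above (semicolon separator, header row, "NA" for missing values); the file names are those given in the description, and the columns are assumed to match features.pdf.

```python
import pandas as pd

# Load the order data: semicolon-separated fields, header row with column
# names, and "NA" marking missing values.
train = pd.read_csv("orders_train.txt", sep=";", na_values="NA")
test = pd.read_csv("orders_class.txt", sep=";", na_values="NA")

print(train.shape, test.shape)
print("target present in train only:",
      "returnQuantity" in train.columns, "returnQuantity" in test.columns)
```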

    The Task

    The task is to use known historical data from January 2014 to September 2015 (approx. 2.33 million order positions) to build a model that makes predictions about return rates for order positions. The attribute returnQuantity in the given data indicates the number of articles for each order position (the value 0 means that the article will be kept, while a value larger than 0 means that the article will be returned). For sales in the period from October 2015 to December 2015 (approx. 340,000 order positions), the model should then provide predictions for the number of articles that will be returned per order position. The prediction has to be a value from the set of natural numbers including 0. The difference between the prediction and the actual value for an order position (i.e., the error) must be as low as possible.

    Acknowledgement

    This dataset is publicly available on the Data Mining Cup website.

  16. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...

    • test.researchdata.tuwien.ac.at
    bin, csv, json +1
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Available download formats: csv, text/markdown, json, bin
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description:

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.

    Software Requirements

    To work with this dataset, you will need to have specific software installed, including:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

    5. sample_submission.csv:
      This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
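
    As a rough, simplified sketch of the pipeline these resources describe (not the project's notebook), the following merges the train and store tables, fits a random forest on a few of the listed fields, and saves the model as a .pkl file; the preprocessing choices and column handling here are assumptions.

```python
import pandas as pd
import joblib
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv("train.csv", parse_dates=["Date"], low_memory=False)
store = pd.read_csv("store.csv")
df = train.merge(store, on="Store", how="left")

df = df[df["Open"] == 1].copy()                      # model only open days
df["Month"] = df["Date"].dt.month
df["DayOfWeek"] = df["Date"].dt.dayofweek
df["CompetitionDistance"] = df["CompetitionDistance"].fillna(
    df["CompetitionDistance"].median()
)

features = ["Store", "DayOfWeek", "Month", "Promo",
            "SchoolHoliday", "CompetitionDistance"]
X, y = df[features], df["Sales"]

model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X, y)
joblib.dump(model, "rossmann_rf.pkl")                # matches the .pkl artifacts
```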

  17. Regression Analysis Tool Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Regression Analysis Tool Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/regression-analysis-tool-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Regression Analysis Tool Market Outlook



    The global regression analysis tool market size is projected to grow from USD 2.5 billion in 2023 to USD 5.3 billion by 2032, exhibiting a robust CAGR of 8.5% during the forecast period. This anticipated growth is driven by the increasing integration of analytics in business processes and the rising need for data-driven decision-making across various industries.



    One of the primary growth factors for the regression analysis tool market is the exponential increase in data generation. Organizations across different sectors are collecting vast amounts of data to gain insights into consumer behavior, operational efficiency, and market trends. This surge in data necessitates advanced analytics tools, such as regression analysis tools, to extract meaningful patterns and correlations, which, in turn, aids in strategic decision-making. The need to handle complex datasets and derive actionable insights is propelling the demand for these analytical tools.



    Another significant growth driver is the advent of artificial intelligence (AI) and machine learning (ML) technologies. These technologies have revolutionized the analytics landscape by enabling more accurate and predictive analytics. Regression analysis tools, when combined with AI and ML, can provide more sophisticated and precise analytics solutions. This integration is particularly beneficial in sectors like healthcare and finance, where predictive analytics can lead to better patient outcomes and improved financial forecasting, respectively. The enhanced capabilities brought about by AI and ML are thus significantly boosting the adoption of regression analysis tools.



    The increasing adoption of cloud-based solutions is also contributing to the growth of the regression analysis tool market. Cloud computing offers several advantages such as scalability, cost-effectiveness, and accessibility, which are crucial for businesses of all sizes. The ability to deploy regression analysis tools on the cloud means that even small and medium enterprises (SMEs) can leverage these advanced tools without the need for significant upfront investment in IT infrastructure. This democratization of advanced analytics is a major factor driving the market growth.



    Business Analysis Tools play a pivotal role in today’s data-driven business environment. These tools encompass a wide range of applications that help organizations analyze data, identify trends, and make informed decisions. From financial forecasting to customer behavior analysis, business analysis tools provide the insights needed to optimize operations and drive growth. As businesses continue to face complex challenges, the demand for robust and versatile business analysis tools is on the rise. These tools not only enhance decision-making capabilities but also improve overall business efficiency and competitiveness. With the integration of advanced technologies like AI and machine learning, business analysis tools are becoming even more powerful, offering predictive analytics and real-time insights that are crucial for staying ahead in the competitive market.



    Regionally, North America is expected to dominate the regression analysis tool market during the forecast period. This is attributed to the presence of major technology companies, high adoption rate of advanced analytics tools, and a focus on data-driven decision-making. Furthermore, the region's strong IT infrastructure supports the seamless integration and deployment of these tools. Other regions, such as Asia Pacific, are also witnessing rapid growth due to increasing digitalization, investment in IT infrastructure, and an expanding base of tech-savvy enterprises.



    Component Analysis



    The regression analysis tool market can be segmented by component into software and services. The software segment includes a variety of analytic solutions that enable businesses to perform regression analysis on their data. These software tools range from basic standalone applications to more complex, integrated solutions that can handle large datasets and provide advanced analytics capabilities. The growing need for robust and user-friendly analytics software is driving the demand in this segment, with many companies continuously innovating to improve the functionality and ease of use of their products.



    The services segment encompasses consulting, implementation, training, and support services provided by vendors to help orga

  18. Replication Data for: "A Topic-based Segmentation Model for Identifying...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Sep 25, 2024
    Cite
    Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert (2024). Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews" [Dataset]. http://doi.org/10.7910/DVN/EE3DE2
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert
    Description

    We provide instructions, code, and datasets for replicating the article by Kim, Lee, and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package that lets researchers or practitioners apply the topic-based segmentation model with unstructured texts (latent class regression with group variable selection) to their own datasets. First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, and 4-b. Note that, due to Yelp's dataset terms of use and data-size restrictions, we instead provide a link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file.

    [A guide on how to use the code to reproduce each study in the paper]
    1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: R source code to replicate the illustrative simulation study. Run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships, you will get dendrograms of the selected groups of variables in Figure 2. Computing time is approximately 20 to 30 minutes.
    3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.
    3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 3 to 4 hours.
    4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.
    4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 10 to 12 hours.

    [Guidelines for running benchmark models in Table 6]
    Unsupervised topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package, compute topic probabilities per restaurant (with the 'posterior' function in the package), use them as predictors, and then conduct prediction with a regression (see the Python sketch after this description for the general idea).
    Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).
    Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.
    Aggregate regression: default 'lm' function in R.
    Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a chosen number of segments (e.g., 3 segments in this study); then, with the estimated coefficients and memberships, predict the dependent variable for each segment.
    Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong, and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a chosen number of segments (e.g., 3 segments in this study); then, with the estimated coefficients and memberships, predict the dependent variable for each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

    5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings reviews study. Computing time is approximately 10 hours.

    [A list of the versions of R, packages, and computer...
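    The benchmark guidelines above refer to R packages ('topicmodels', 'flexmix', 'lm', etc.). As a rough, language-agnostic illustration of the unsupervised-topic-model benchmark only (fit a topic model, use per-document topic probabilities as regression predictors, then predict the rating), here is a minimal Python sketch using scikit-learn; the example reviews, ratings, and number of topics are illustrative assumptions and are not part of the replication package.

        # Minimal Python analogue of the "Unsupervised topic model" benchmark:
        # LDA topic probabilities per document -> predictors in a linear regression.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.linear_model import LinearRegression

        reviews = ["great food and friendly service",           # illustrative documents
                   "slow service but tasty dishes",
                   "terrible experience, would not return"]
        star_ratings = [5.0, 3.5, 1.0]                           # illustrative DV, one per document

        # Document-term matrix from raw review text
        dtm = CountVectorizer(stop_words="english").fit_transform(reviews)

        # Fit LDA with a pre-chosen number of topics, then use per-document topic shares as predictors
        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        topic_probs = lda.fit_transform(dtm)                     # rows sum to 1: topic shares per document

        reg = LinearRegression().fit(topic_probs, star_ratings)
        print(reg.predict(topic_probs))                          # in-sample predictions of the ratings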

  19. Coupled Model Intercomparison Project Phase 5 (CMIP5) University of...

    • registry.opendata.aws
    Updated Mar 14, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    NOAA (2022). Coupled Model Intercomparison Project Phase 5 (CMIP5) University of Wisconsin-Madison Probabilistic Downscaling Dataset [Dataset]. https://registry.opendata.aws/noaa-uwpd-cmip5/
    Explore at:
    Dataset updated
    Mar 14, 2022
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Area covered
    Madison, Wisconsin
    Description

    The University of Wisconsin Probabilistic Downscaling (UWPD) is a statistically downscaled dataset based on the Coupled Model Intercomparison Project Phase 5 (CMIP5) climate models. UWPD consists of three variables: daily precipitation and daily maximum and minimum temperature. The spatial resolution is 0.1° x 0.1° for the United States and southern Canada east of the Rocky Mountains.

    The downscaling methodology is not deterministic. Instead, to properly capture unexplained variability and extreme events, the methodology predicts a spatially and temporally varying Probability Density Function (PDF) for each variable. Statistics such as the mean, the mean PDF, and annual-maximum statistics can be calculated directly from the daily PDFs, and these statistics are included in the dataset. In addition, "standard" (raw) data are created by randomly sampling from the PDFs to create a "realization" of the local scale given the large scale from the climate model. There are 3 realizations for temperature and 14 realizations for precipitation.
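    Because each daily value is distributed rather than deterministic, a realization is obtained by drawing from the predicted distribution at each grid cell and day. Below is a minimal Python sketch of that inverse-CDF sampling idea; the precipitation values and CDF numbers are made up for illustration and are not taken from the dataset files.

        import numpy as np

        # Hypothetical discretized CDF of daily precipitation (mm) at one grid cell on one day.
        precip_bins = np.array([0.0, 1.0, 2.0, 5.0, 10.0, 25.0])      # illustrative precipitation values
        cdf_values  = np.array([0.55, 0.70, 0.80, 0.90, 0.97, 1.00])  # illustrative CDF at those values

        rng = np.random.default_rng(seed=42)
        u = rng.uniform(size=3)                       # one uniform draw per desired realization

        # Inverse-transform sampling: interpolate the quantile function CDF^-1(u).
        realizations = np.interp(u, cdf_values, precip_bins)
        print(realizations)                           # three sampled daily precipitation values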

    The directory structure of the data is as follows:
    [cmip_version]/[scenario]/[climate_model]/[ensemble_member]/

    The realization files are named as follows:
    prcp_[realization_number]_[year].nc
    temp_[realization_number]_[year].nc

    The time-mean files, averaged over certain year bounds, are named as follows:
    prcp_mean_[year_bound_1]_[year_bound_2].nc
    temp_mean_[year_bound_1]_[year_bound_2].nc

    The time-mean Cumulative Distribution Function (CDF) files are named as follows:
    prcp_cdf_[year_bound_1]_[year_bound_2].nc
    temp_cdf_[year_bound_1]_[year_bound_2].nc

    The CDF of the annual maximum precipitation is given for each year in the record:
    prcp_annual_max_cdf_[start_year_of_scenario]_[end_year_of_scenario].nc
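    Given that layout, a single file can be located by filling in the bracketed placeholders. The sketch below is only illustrative: the root directory, scenario, climate model, and ensemble member names are placeholders, not a listing of what the archive actually contains.

        from pathlib import Path
        import xarray as xr   # any netCDF reader works; xarray is one common choice

        root = Path("uwpd-data")                                  # placeholder local download root
        cmip_version, scenario = "cmip5", "rcp85"                 # placeholder values
        climate_model, ensemble_member = "GFDL-ESM2M", "r1i1p1"   # placeholder values

        realization, year = 1, 2050
        fname = f"prcp_{realization}_{year}.nc"   # follows prcp_[realization_number]_[year].nc
        path = root / cmip_version / scenario / climate_model / ensemble_member / fname

        ds = xr.open_dataset(path)                # inspect ds.data_vars for the stored fields
        print(ds)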

  20. Doodleverse/Segmentation Zoo/Seg2Map Res-UNet models for DeepGlobe/7-class...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Buscombe, Daniel (2024). Doodleverse/Segmentation Zoo/Seg2Map Res-UNet models for DeepGlobe/7-class segmentation of RGB 512x512 high-res. images [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7576897
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset authored and provided by
    Buscombe, Daniel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Doodleverse/Segmentation Zoo/Seg2Map Res-UNet models for DeepGlobe/7-class segmentation of RGB 512x512 high-res. images

    These Residual-UNet model data are based on the DeepGlobe dataset

    Models were created with Segmentation Gym* using the following dataset**: https://www.kaggle.com/datasets/balraj98/deepglobe-land-cover-classification-dataset

    Image size used by model: 512 x 512 x 3 pixels

    classes: 1. urban 2. agricultural 3. rangeland 4. forest 5. water 6. bare 7. unknown

    File descriptions

    For each model, there are 5 files with the same root name:

    1. '.json' config file: this is the file that was used by Segmentation Gym* to create the weights file. It contains instructions for how to make the model and the data it used, as well as instructions for how to use the model for prediction. It is a handy wee thing and mastering it means mastering the entire Doodleverse.

    2. '.h5' weights file: this is the file that was created by the Segmentation Gym* function train_model.py. It contains the trained model's parameter weights. It can be called by the Segmentation Gym* function seg_images_in_folder.py. Models may be ensembled.

    3. '_modelcard.json' model card file: this is a json file containing fields that collectively describe the model origins, training choices, and the dataset that the model is based upon. There is some redundancy between this file and the config file (described above), which contains the instructions for model training and implementation. The model card file is not used by the program, but it is important metadata, so it should be kept with the other files that collectively make up the model; as such, it is considered part of the model.

    4. '_model_history.npz' model training history file: this numpy archive file contains numpy arrays describing the training and validation losses and metrics. It is created by the Segmentation Gym function train_model.py (see the sketch after this list for one way to inspect it).

    5. '.png' model training loss and mean IoU plot: this png file contains plots of training and validation losses and mean IoU scores during model training. It plots a subset of the data inside the .npz file and is created by the Segmentation Gym function train_model.py.

    Additionally, BEST_MODEL.txt contains the name of the model with the best validation loss and mean IoU.
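    Since the exact array names stored in the '_model_history.npz' file are not listed here, one simple way to inspect a model's training history is to iterate over whatever keys the archive contains and plot each per-epoch curve. The file name below is illustrative and should be replaced with the actual root name of the model.

        import numpy as np
        import matplotlib.pyplot as plt

        history = np.load("my_model_model_history.npz")   # illustrative file name

        for key in history.files:                 # e.g. training/validation loss and mean IoU arrays
            values = history[key]
            if values.ndim == 1:                  # plot only simple per-epoch curves
                plt.plot(values, label=key)

        plt.xlabel("epoch")
        plt.ylabel("value")
        plt.legend()
        plt.savefig("training_history.png")       # or plt.show()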

    References *Segmentation Gym: Buscombe, D., & Goldstein, E. B. (2022). A reproducible and reusable pipeline for segmentation of geoscientific imagery. Earth and Space Science, 9, e2022EA002332. https://doi.org/10.1029/2022EA002332 See: https://github.com/Doodleverse/segmentation_gym

    **Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D. and Raskar, R., 2018. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 172-181).
