2 datasets found
  1. Classification of rare land cover types: Distinguishing annual and perennial...

    • plos.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Christina Bogner; Bumsuk Seo; Dorian Rohner; Björn Reineking (2023). Classification of rare land cover types: Distinguishing annual and perennial crops in an agricultural catchment in South Korea [Dataset]. http://doi.org/10.1371/journal.pone.0190476
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Christina Bogner; Bumsuk Seo; Dorian Rohner; Björn Reineking
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    South Korea
    Description

    Many environmental data are inherently imbalanced, with some majority land use and land cover types dominating over rare ones. In cultivated ecosystems minority classes are often the target as they might indicate a beginning land use change. Most standard classifiers perform best on a balanced distribution of classes, and fail to detect minority classes. We used the synthetic minority oversampling technique (SMOTE) with Random Forest to classify land cover classes in a small agricultural catchment in South Korea using MODIS time series. This area faces a major soil erosion problem and policy measures encourage farmers to replace annual by perennial crops to mitigate this issue. Our major goal was therefore to improve the classification performance on annual and perennial crops. We compared four different classification scenarios on original imbalanced and synthetically oversampled balanced data to quantify the effect of SMOTE on classification performance. SMOTE substantially increased the true positive rate of all oversampled minority classes. However, the performance on minor classes remained lower than on the majority class. We attribute this result to a class overlap already present in the original data set that is not resolved by SMOTE. Our results show that resampling algorithms could help to derive more accurate land use and land cover maps from freely available data. These maps can be used to provide information on the distribution of land use classes in heterogeneous agricultural areas and could potentially benefit decision making.
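
The oversampling idea named in the description can be illustrated with a minimal sketch. This is not the authors' code: the function name `smote`, its parameters, and the toy points are assumptions made for illustration, and the Random Forest step is left out. The sketch shows the core of the technique, which creates each synthetic sample by interpolating between a minority-class point and one of its k nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority points, nearest to the base first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Pick one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy, made-up feature vectors for a rare class (not data from the study).
rare_class = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(rare_class, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point lies on a segment between two existing minority points, the method cannot separate classes that already overlap in feature space, which is consistent with the limitation the description reports.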

  2. Data from: Investigating the contributors to hit-and-run crashes using...

    • figshare.com
    xlsx
    Updated Oct 7, 2024
    Cite
    Gen Li (2024). Investigating the contributors to hit-and-run crashes using gradient boosting decision trees [Dataset]. http://doi.org/10.6084/m9.figshare.27178305.v1
    Available download formats: xlsx
    Dataset updated
    Oct 7, 2024
    Dataset provided by
    figshare
    Authors
    Gen Li
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper uses the 2021 traffic crash data from the NHTSA CRSS as a sample for model training and validation. The CRSS collects crash reports provided by police departments from all 50 states in the United States. It details various factors of each traffic crash, including crash, driver, vehicle, road, and environmental information. The crash data provided by CRSS include details such as the location, time, cause, type of crash, driver’s age, gender, attention level, injury status, risky driving behavior, vehicle type, usage, damage, and hit-and-run status. However, because the dataset is recorded in separate parts and contains systematic errors and redundant information, the CRSS 2021 data undergo the following merging and filtering processes:

    1) Match and merge separately recorded data based on the unique case number "CASENUM" in the dataset.
    2) Records with missing values in critical variables (e.g., whether the crash involved a hit-and-run) were removed to avoid bias in the analysis. For non-critical variables, missing values were imputed depending on the variable type: mean imputation for continuous variables (e.g., speed limits) and mode imputation for categorical variables (e.g., weather, road surface conditions).
    3) Noise in the dataset arises from both human error in crash reporting and random fluctuations in recorded variables. We used z-scores to detect and remove extreme outliers in numerical variables (e.g., speed limits, crash angle). Data points with a z-score beyond ±3 were considered outliers and were excluded from the analysis. To handle noisy fluctuations in continuous variables (e.g., speed limits), we applied a symmetrical exponential moving average (EMA) filter.

    After processing, the CRSS 2021 data include a total of 54,187 crashes, of which 5,944 are hit-and-run crashes (10.97%). Hit-and-run crashes thus constitute a relatively small proportion of the dataset, which leads to a serious class imbalance in the binary classification target, so data balancing is applied to the target variable during parameter calibration. To address this issue, we utilized the resampling techniques available in the data mining software. Specifically, random undersampling was applied to the majority class (non-hit-and-run crashes), while the Synthetic Minority Over-sampling Technique (SMOTE) was used for the minority class. This ensured a balanced class distribution in the training set, improving model performance and preventing the classifier from being biased toward the majority class.
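
The filtering steps described above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the author's pipeline: the function `preprocess`, the column names `hit_and_run`, `speed_limit`, and `weather`, and the toy records are all hypothetical, and the CASENUM merge and the EMA filter are not covered. The sketch drops records with a missing critical variable, imputes the mean or mode, and then removes rows whose z-score exceeds 3 in absolute value.

```python
from collections import Counter
from statistics import mean, pstdev


def preprocess(records, critical, numeric, categorical, z_max=3.0):
    """Drop, impute, and filter records in the order the description gives."""
    # Remove records whose critical variable is missing.
    rows = [dict(r) for r in records if r.get(critical) is not None]
    # Mean imputation for continuous variables.
    for col in numeric:
        fill = mean(r[col] for r in rows if r.get(col) is not None)
        for r in rows:
            if r.get(col) is None:
                r[col] = fill
    # Mode imputation for categorical variables.
    for col in categorical:
        observed = [r[col] for r in rows if r.get(col) is not None]
        fill = Counter(observed).most_common(1)[0][0]
        for r in rows:
            if r.get(col) is None:
                r[col] = fill
    # Exclude rows with |z| > z_max in any numeric column.
    for col in numeric:
        values = [r[col] for r in rows]
        mu, sigma = mean(values), pstdev(values)
        if sigma > 0:
            rows = [r for r in rows if abs(r[col] - mu) / sigma <= z_max]
    return rows


# Toy, made-up records (not CRSS data); column names are hypothetical.
records = [
    {"hit_and_run": 0, "speed_limit": 50.0, "weather": "clear"}
    for _ in range(20)
]
records.append({"hit_and_run": 1, "speed_limit": 500.0, "weather": "rain"})
records.append({"hit_and_run": None, "speed_limit": 50.0, "weather": "clear"})
records.append({"hit_and_run": 1, "speed_limit": None, "weather": None})

clean = preprocess(records, "hit_and_run", ["speed_limit"], ["weather"])

# Share of hit-and-run crashes reported in the description, in percent.
hit_share = round(100 * 5944 / 54187, 2)
print(len(clean), hit_share)
```

In this sketch the mean used for imputation is computed before outliers are removed, following the order of steps 2 and 3 above, so an extreme value can still influence the imputed values.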


Classification of rare land cover types: Distinguishing annual and perennial crops in an agricultural catchment in South Korea

12 scholarly articles cite this dataset (View in Google Scholar)
