This resource collects teaching materials originally created for the in-person course 'GEOSC/GEOG 497 – Data Mining in Environmental Sciences' at Penn State University (co-taught by Tao Wen, Susan Brantley, and Alan Taylor) and later refined/revised by Tao Wen for the online teaching module 'Data Science in Earth and Environmental Sciences' hosted on the NSF-sponsored HydroLearn platform.
This resource includes both R Notebooks and Python Jupyter Notebooks that teach the basics of R and Python coding, data analysis and data visualization, and the building of machine learning models in both languages, using authentic research data and questions. All of the R/Python scripts can be executed either on the CUAHSI JupyterHub or on a local machine.
This resource is shared under the CC-BY license. Please contact the creator Tao Wen at Syracuse University (twen08@syr.edu) for any questions you have about this resource. If you identify any errors in the files, please contact the creator.
No license specified (https://academictorrents.com/)
The dataset contains 62,058 high-quality Google Street View images. The images cover the downtown and neighboring areas of Pittsburgh, PA and Orlando, FL, and parts of Manhattan, NY. Accurate GPS coordinates of the images and their compass directions are provided as well. For each Street View placemark (i.e. each spot on one street), the 360° spherical view is broken down into 4 side views and 1 upward view. There is one additional image per placemark which shows some overlaid markers, such as the address, names of streets, etc.

Citation: Please cite the following paper, for which this data was collected (in part): Amir Roshan Zamir and Mubarak Shah, "Image Geo-localization based on Multiple Nearest Neighbor Feature Matching using Generalized Graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource contains data collected by the Whitefish Lake Institute (WLI) as well as R code used to compile and conduct quality assurance on the data. This resource reflects joint publication efforts between WLI and the Montana State University Extension Water Quality (MSUEWQ) program. All data included here was uploaded to the National Water Quality Portal (WQX) in 2022. It is the intention of WLI to upload all future data to WQX and this HydroShare resource may also be updated in the future with data for 2022 and forward.
Data Purpose: The 'Data' folder of this resource holds the final data products for the extensive dataset collected by WLI between 2007 and 2021. This folder is likely of interest to users who want data for research and analysis purposes. The dataset contains physical water parameter field data collected by Hydrolab MS5 and DS5 loggers, including water temperature, specific conductance, dissolved oxygen concentration and saturation, barometric pressure, and turbidity. Additional field data that need further quality assurance prior to use include chlorophyll a, ORP, pH, and PAR. The dataset also contains water chemistry data analyzed at certified laboratories, including total nitrogen, total phosphorus, nitrate, orthophosphate, total suspended solids, organic carbon, and chlorophyll a. The data folder includes R scripts with example code for data visualization. This dataset can provide insight into water quality trends in lakes and streams of northwestern Montana over time.

Data Summary: Over this period, WLI collected water quality data for 63 lake sites and 17 stream and river sites in northwestern Montana under two separate monitoring projects. The Northwest Montana Lakes Network (NMLN) project currently visits 41 lake sites in northwestern Montana once per summer. Field data from Hydrolabs are collected at discrete depths throughout a lake's profile, and depth-integrated water chemistry samples are collected as well. The Whitefish Water Quality Monitoring Project (WWQMP) currently visits two sites on Whitefish Lake, one site on Tally Lake, and 11 stream and river sites in the Whitefish Lake and Upper Whitefish River watersheds monthly between April and November. Field data are collected at one depth for streams and at many depths throughout the lake profiles, and water chemistry samples are collected at discrete depths for Whitefish Lake and the streams. The final dataset for both programs includes over 112,000 datapoints that passed quality assurance assessment and an additional 72,000 datapoints that would need further quality assurance before use.
Workflow Purpose: The 'Workflow' folder of this resource contains the raw data, folder structure, and R code used during the data compilation and upload process. This folder is likely of interest to users who have similar datasets and are interested in code for automating data compilation or upload processes. The R scripts included here stitch together many individual Hydrolab MS5 and DS5 logger files as well as lab electronic data deliverables (EDDs), which may be useful for users who want to compile one or multiple seasons' worth of data into a single file. Reformatting scripts format the data to match the multi-sheet Excel workbook format required by the Montana Department of Environmental Quality for uploads to WQX, and may be useful to others hoping to automate database uploads.

Workflow Summary: Compilation code in the workflow folder compiles data from its most original forms, including Hydrolab sonde export files and lab EDDs. This compilation process includes extracting dates and times from comment fields and producing a single file from many input files. Formatting code then reformats the data to match WQX upload requirements, which includes generating unique activity IDs for data collected at the same site, date, and time, and then linking these activity IDs with results across worksheets in an Excel workbook. Code for generating all quality assurance figures used in the decision-making process outlined in the Quality Assurance Document, and the resulting data removal decisions, is included here as well. Finally, this folder includes code for combining data from the separate program uploads for WQX into the more user-friendly structure for analysis provided in the 'Data' folder of this HydroShare resource.
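As a rough illustration of this compilation step, the following minimal R sketch stitches a folder of logger export CSVs into one table and builds an activity ID from site, date, and time; the folder name and the column names (Site, Date, Time) are placeholders, not the actual WLI field names.

# Minimal sketch: compile many logger export CSVs and build activity IDs.
# "hydrolab_exports" and the Site/Date/Time column names are placeholders.
files <- list.files("hydrolab_exports", pattern = "\\.csv$", full.names = TRUE)
compiled <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
# one activity ID per unique combination of site, date, and time
compiled$ActivityID <- paste(compiled$Site, compiled$Date, compiled$Time, sep = "_")
write.csv(compiled, "compiled_hydrolab.csv", row.names = FALSE)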
This module series covers how to import, manipulate, format, and plot time series data stored in .csv format in R. It was originally designed to teach researchers to use NEON plant phenology and air temperature data, and it has also been used in undergraduate classrooms.
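For orientation, here is a minimal R sketch of the import-format-plot pattern the module teaches; the file name and the column names (date, value) are placeholders rather than actual NEON field names.

# Minimal sketch of the import/format/plot workflow; names are placeholders.
temps <- read.csv("NEON_air_temperature.csv", stringsAsFactors = FALSE)
temps$date <- as.Date(temps$date)          # convert text dates to Date class
plot(temps$date, temps$value, type = "l",
     xlab = "Date", ylab = "Air temperature (C)")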
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R is a very powerful language for statistical computing in many disciplines of research, but it has a steep learning curve. The software is open source, freely available, and has a thriving community. This crash course provides an overview of Base-R concepts for beginners and covers the topics 1) introduction to R, 2) reading, saving, and viewing data, 3) selecting and changing objects in R, and 4) descriptive statistics. The course was held by Lisa Spitzer on September 3, 2021, as a precursor to the R tidyverse workshop by Aurélien Ginolhac and Roland Krause (September 8-10, 2021). This entry features the slides, exercises/results, and chat messages of the crash course. Related to this entry are the recordings of the course and the R tidyverse workshop materials. Click on "related PsychArchives objects" to view or download the recordings of the workshop.
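As a flavor of what the four topics look like in practice, a few Base-R one-liners (illustrative only; 'mydata.csv' is a placeholder file name):

x <- c(4, 8, 15, 16, 23, 42)     # 1) creating an object in R
dat <- read.csv("mydata.csv")    # 2) reading data
saveRDS(dat, "mydata.rds")       #    saving data
head(dat); str(dat)              #    viewing data
x[2] <- 99                       # 3) selecting and changing elements of an object
first_rows <- dat[1:3, ]         #    selecting rows by index
summary(dat); mean(x); sd(x)     # 4) descriptive statistics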
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of data measured on different scales is a recurring challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need to integrate other features, possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while further information, like clinical factors, is added on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where the visualization also combines different data sources in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as clustering of samples. We extend the variable clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable, and in many cases beneficial, compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix, available from https://cran.r-project.org/web/packages/CluMix.
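The dissimilarity-matrix idea can be previewed with standard tools. The following minimal R sketch is not the authors' new method; it uses the standard Gower dissimilarity from cluster::daisy on toy mixed-type data and clusters it hierarchically, just to illustrate the kind of dissimilarity-based workflow described above.

# Standard-approach sketch, not the CluMix methods: Gower dissimilarity for
# mixed-type data, followed by hierarchical clustering. Toy data only.
library(cluster)
set.seed(1)
df <- data.frame(expr  = rnorm(50),                                # quantitative
                 stage = factor(sample(1:3, 50, replace = TRUE)),  # categorical/clinical
                 sex   = factor(sample(c("f", "m"), 50, replace = TRUE)))
d  <- daisy(df, metric = "gower")            # mixed-type dissimilarity between samples
hc <- hclust(as.dist(d), method = "average") # hierarchical clustering on the dissimilarity
plot(hc)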
We provide instructions, code, and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package that lets researchers or practitioners apply the topic-based segmentation model with unstructured texts (latent class regression with group variable selection) to their own datasets. First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note that, owing to Yelp's dataset terms of use and the restriction on data size, we provide a link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file.

[A guide on how to use the code to reproduce each study in the paper]

1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships, you will get dendrograms of the selected groups of variables shown in Figure 2. Computing time is approximately 20 to 30 minutes.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 3 to 4 hours.

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.

4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selections, and segment memberships. Computing time is approximately 10 to 12 hours.

[Guidelines for running Benchmark models in Table 6]

Unsupervised topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors.
Then, conduct prediction with regression.

Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.

Aggregate regression: the default 'lm' function in R.

Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a chosen number of segments (e.g., 3 segments in this study); then, with the estimated coefficients and memberships, predict the dependent variable within each segment.

Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a chosen number of segments (e.g., 3 segments in this study); then, with the estimated coefficients and memberships, predict the dependent variable within each segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings reviews study. Computing time is approximately 10 hours.

[A list of the versions of R, packages, and computer...
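To make one of these benchmarks concrete, here is a minimal R sketch of the 'latent class regression without variable selection' benchmark using flexmix on simulated toy data with 3 segments; the data are generated here purely for illustration.

# Sketch of the 'latent class regression without variable selection' benchmark
# using the flexmix package on simulated toy data with 3 segments.
library(flexmix)
set.seed(1)
n   <- 300
seg <- sample(1:3, n, replace = TRUE)                 # true (unobserved) segments
x1  <- rnorm(n); x2 <- rnorm(n)
y   <- c(1, -2, 0.5)[seg] * x1 + c(0, 1, -1)[seg] * x2 + rnorm(n, sd = 0.3)
toy <- data.frame(y, x1, x2)
fit <- flexmix(y ~ x1 + x2, data = toy, k = 3)        # mixture of 3 regressions
parameters(fit)                                       # per-segment coefficients
table(clusters(fit), seg)                             # recovered vs. true segments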
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
This data set was acquired with a Navigation System on ROV SuBastian during R/V Falkor (too) expedition FKt250220 conducted in 2025 (Chief Scientist: Dr. Michelle Taylor). These data files are of Text File (ASCII) format and include Navigation data that have not been processed.
This dataset compiles salmon escapement data from Alaska Department of Fish and Game reports and salmon harvest data from commercial, personal use, sport fish, and subsistence sectors to generate an estimate of total salmon abundance in each of the regions defined by the State of Alaska's Salmon and People Project (SASAP). This dataset was assembled to enable a broad view of the salmon resource, whether biological (escapement) or economic/cultural (harvest), across regions. With that intent in mind, a fish is counted within a region if it escaped a fishery and was counted in a river that is contained within that region, or if it was caught in a fishery that is within the region. For commercial fisheries, each commercial fishing district was assigned to a region based on the location of the commercial fishing district relative to the bounding watersheds of the region. No effort was made to determine the region of origin for any commercially caught fish - thus for some regions, fish caught in one region may have been headed to spawn in another. This is especially true of Alaska Peninsula and Aleutian Islands commercial fishing areas, which are well known for having large amounts of Bristol Bay bound fish. Note that some regions have missing escapement data during some years. This dataset includes an R Markdown file which processes the original data and creates figures, the rendered html file generated from running the R Markdown file (which includes many figures and data explanation), and several standalone versions of those figures.

Data sources:
- Jeanette Clark; Alaska Department of Fish and Game, Division of Commercial Fisheries; Alaska Department of Fish and Game, Division of Sport Fish; Alaska Department of Fish and Game, Division of Subsistence. Harvest of Salmon across Commercial, Subsistence, Personal Use, and Sport Fish sectors, Alaska, 1995-2016. Knowledge Network for Biocomplexity. doi:10.5063/F1TT4P73
- Andrew Munro and Eric Volk. 2018. Summary of Pacific Salmon Escapement Goals in Alaska with a Review of Escapements from 2001 to 2009. Knowledge Network for Biocomplexity. doi:10.5063/F1416VB4
- Andrew Munro and Eric Volk. 2017. Summary of Pacific Salmon Escapement Goals in Alaska with a Review of Escapements from 2007 to 2015. Knowledge Network for Biocomplexity. doi:10.5063/F1GX48V4
- James Savereide. 2017. Estimated annual Chinook Salmon escapement at Copper River from 1980 to 2016. Knowledge Network for Biocomplexity. doi:10.5063/F1G15Z4D
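The region-level totals described above amount to summing escapement (assigned by the river's region) and harvest (assigned by the fishery's region) per region and year. A minimal R/dplyr sketch, with hypothetical column names (region, year, count), is:

# Minimal sketch of the regional totals: escapement and harvest counts are
# stacked and summed per region and year. Column names are placeholders.
library(dplyr)
total_abundance <- bind_rows(
  escapement %>% select(region, year, count),   # fish counted in rivers, by region
  harvest    %>% select(region, year, count)    # fish caught in fisheries, by region
) %>%
  group_by(region, year) %>%
  summarise(total_salmon = sum(count, na.rm = TRUE), .groups = "drop")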
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Perfectly Accurate, Synthetic dataset featuring a virtual railway EnVironment for Multi-View Stereopsis (RailEnV-PASMVS) is presented, consisting of 40 scenes and 79,800 renderings together with ground truth depth maps, extrinsic and intrinsic camera parameters and binary segmentation masks of all the track components and surrounding environment. Every scene is rendered from a set of 3 cameras, each positioned relative to the track for optimal 3D reconstruction of the rail profile. The set of cameras is translated across the 100-meter length of tangent (straight) track to yield a total of 1,995 camera views. Photorealistic lighting of each of the 40 scenes is achieved with the implementation of high-definition, high dynamic range (HDR) environmental textures. Additional variation is introduced in the form of camera focal lengths, random noise for the camera location and rotation parameters and shader modifications of the rail profile. Representative track geometry data is used to generate random and unique vertical alignment data for the rail profile for every scene. This primary, synthetic dataset is augmented by a smaller image collection consisting of 320 manually annotated photographs for improved segmentation performance. The specular rail profile represents the most challenging component for MVS reconstruction algorithms, pipelines and neural network architectures, increasing the ambiguity and complexity of the data distribution. RailEnV-PASMVS represents an application specific dataset for railway engineering, against the backdrop of existing datasets available in the field of computer vision, providing the precision required for novel research applications in the field of transportation engineering.
File descriptions
RailEnV-PASMVS.blend (227 Mb) - Blender file (developed using Blender version 2.8.1) used to generate the dataset. The Blender file packs only one of the HDR environmental textures to use as an example, along with all the other asset textures.
RailEnV-PASMVS_sample.png (28 Mb) - A visual collage of 30 scenes, illustrating the variability introduced by using different models, illumination, material properties and camera focal lengths.
geometry.zip (2 Mb) - Geometry CSV files used for scenes 01 to 20. The Bezier curve defines the geometry of the rail profile (10 mm intervals).
PhysicalDataset.7z (2.0 Gb) - A smaller, secondary dataset of 320 manually annotated photographs of railway environments; only the railway profiles are annotated.
01.7z-20.7z (2.0 Gb each) - Archive of each scene (01 through 20).
all_list.txt, training_list.txt, validation_list.txt - Text files containing all the scene names, together with those used for validation (validation_list.txt) and training (training_list.txt), as used by MVSNet.
index.csv - CSV file providing a convenient reference for all the sample files, linking each file to its relative data path.
NOTE: Only 20 of the original 40 scenes are made available owing to size limitations of the data repository. This is still adequate for the purposes of training MVS neural networks. The Blender file is made available specifically so that the scenes can be re-rendered, or the camera framework adapted altogether, for different applications. Please refer to the corresponding manuscript for additional details.
Steps to reproduce
The open source Blender software suite (https://www.blender.org/) was used to generate the dataset, with the entire pipeline developed using the exposed Python API. The camera trajectory is kept fixed for all 40 scenes, except for small perturbations introduced in the form of random noise to increase the camera variation. The camera intrinsic information was initially exported as a single CSV file (scene.csv) for every scene, from which the camera information files were generated; this includes the focal length (focalLengthmm), image sensor dimensions (pixelDimensionX, pixelDimensionY), position coordinate vector (vectC) and rotation vector (vectR). The STL model files, as provided in this data repository, were exported directly from Blender, such that the geometry/scenes can be reproduced. The data processing below is written for a Python implementation, transforming the information from Blender's coordinate system into universal rotation (R_world2cv) and translation (T_world2cv) matrices.
import numpy as np
from scipy.spatial.transform import Rotation as R
# focalLengthmm, sensorWidthmm, pixelDimensionX/Y, vectR, vectC are read from the per-scene scene.csv
focalLengthPixel = focalLengthmm * pixelDimensionX / sensorWidthmm
K = np.array([[focalLengthPixel, 0, pixelDimensionX / 2],
              [0, focalLengthPixel, pixelDimensionY / 2],
              [0, 0, 1]])  # camera intrinsic matrix
r = R.from_euler('xyz', vectR, degrees=True)
matR = r.as_matrix()
R_world2bcam = np.transpose(matR)  # world -> Blender camera rotation
R_bcam2cv = np.array([[1, 0, 0], [0, -1, 0], [0, 0, -1]])  # Blender camera -> OpenCV axes
R_world2cv = R_bcam2cv.dot(R_world2bcam)
T_world2bcam = -1 * R_world2bcam.dot(vectC)
T_world2cv = R_bcam2cv.dot(T_world2bcam)
The resulting R_world2cv and T_world2cv matrices are written to the camera information file using exactly the same format as that of BlendedMVS developed by Dr. Yao. The original rotation and translation information can be found by following the process in reverse. Note that additional steps were required to convert from Blender's unique coordinate system to that of OpenCV; this ensures universal compatibility in the way that the camera intrinsic and extrinsic information is provided.
Equivalent GPS information is provided (gps.csv), whereby the local coordinate frame is transformed into equivalent GPS information, centered around the Engineering 4.0 campus, University of Pretoria, South Africa. This information is embedded within the JPG files as EXIF data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We propose an extensive framework for additive regression models for correlated functional responses, allowing for multiple partially nested or crossed functional random effects with flexible correlation structures for, for example, spatial, temporal, or longitudinal functional data. Additionally, our framework includes linear and nonlinear effects of functional and scalar covariates that may vary smoothly over the index of the functional response. It accommodates densely or sparsely observed functional responses and predictors which may be observed with additional error and includes both spline-based and functional principal component-based terms. Estimation and inference in this framework is based on standard additive mixed models, allowing us to take advantage of established methods and robust, flexible algorithms. We provide easy-to-use open source software in the pffr() function for the R package refund. Simulations show that the proposed method recovers relevant effects reliably, handles small sample sizes well, and also scales to larger datasets. Applications with spatially and longitudinally observed functional data demonstrate the flexibility in modeling and interpretability of results of our approach.
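For orientation, a minimal usage sketch for pffr() follows; it is kept as comments because it assumes a particular data layout (Y an n x G matrix of functional responses observed on grid tgrid, X a functional covariate on grid sgrid, z a scalar covariate), and all data names are placeholders.

library(refund)
# fit <- pffr(Y ~ ff(X, xind = sgrid)   # linear effect of a functional covariate
#               + s(z),                 # smooth, index-varying effect of a scalar covariate
#             yind = tgrid, data = dat)
# plot(fit)                             # estimated coefficient functions/surfaces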
New driver application status check. For historical data, please see- https://data.cityofnewyork.us/Transportation/Historical-Driver-Application-Status/p32s-yqxq
Data dictionary available here: https://data.cityofnewyork.us/api/views/dpec-ucu7/files/6cd40752-22c4-4c56-ba39-dfa51cc6e14c?download=true&filename=NewDriverAppStatusLookupLegend.pdf
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and R code from Sabrina Heiser's study of the reproductive system of Plocamium sp. in the Palmer Station region.
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains inventories and location maps for CTD data acquired by the icebreaker R/V Xue Long in the Prydz Bay-Amery Ice Shelf region. A total of 68 stations were acquired in February 2015 and 24 stations in March 2017, as part of a joint US/China project to study Antarctic Bottom Water (AABW) formation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data sets were originally created for the following publications:
M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek: Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In: Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.
H.-P. Kriegel, E. Schubert, A. Zimek: Evaluation of Multiple Clustering Solutions. In: 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, held in conjunction with ECML PKDD 2011, Athens, Greece, 2011.
The outlier data set versions were introduced in:
E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel: On Evaluation of Outlier Rankings and Outlier Scores. In: Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.
They are derived from the original image data available at https://aloi.science.uva.nl/
The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders: The Amsterdam Library of Object Images. Int. J. Comput. Vision 61(1), 103-112, January 2005.
Additional information is available at: https://elki-project.github.io/datasets/multi_view
The following views are currently available (feature type: description; files):

- Object number: sparse 1000-dimensional vectors that give the true object assignment. Files: objs.arff.gz
- RGB color histograms: standard RGB color histograms (uniform binning). Files: aloi-8d.csv.gz, aloi-27d.csv.gz, aloi-64d.csv.gz, aloi-125d.csv.gz, aloi-216d.csv.gz, aloi-343d.csv.gz, aloi-512d.csv.gz, aloi-729d.csv.gz, aloi-1000d.csv.gz
- HSV color histograms: standard HSV/HSB color histograms in various binnings. Files: aloi-hsb-2x2x2.csv.gz, aloi-hsb-3x3x3.csv.gz, aloi-hsb-4x4x4.csv.gz, aloi-hsb-5x5x5.csv.gz, aloi-hsb-6x6x6.csv.gz, aloi-hsb-7x7x7.csv.gz, aloi-hsb-7x2x2.csv.gz, aloi-hsb-7x3x3.csv.gz, aloi-hsb-14x3x3.csv.gz, aloi-hsb-8x4x4.csv.gz, aloi-hsb-9x5x5.csv.gz, aloi-hsb-13x4x4.csv.gz, aloi-hsb-14x5x5.csv.gz, aloi-hsb-10x6x6.csv.gz, aloi-hsb-14x6x6.csv.gz
- Color similarity: average similarity to 77 reference colors (not histograms); 18 colors x 2 saturations x 2 brightnesses + 5 grey values (incl. white, black). Files: aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
- Haralick features: first 13 Haralick features (radius 1 pixel). Files: aloi-haralick-1.csv.gz
- Front to back: vectors representing front faces vs. back faces of individual objects. Files: front.arff.gz
- Basic light: vectors indicating basic light situations. Files: light.arff.gz
- Manual annotations: manually annotated object groups of semantically related objects such as cups. Files: manual1.arff.gz
Outlier Detection Versions
Additionally, we generated a number of subsets for outlier detection:
- RGB histograms, downsampled to 100,000 objects (553 outliers). Files: aloi-27d-100000-max10-tot553.csv.gz, aloi-64d-100000-max10-tot553.csv.gz
- RGB histograms, downsampled to 75,000 objects (717 outliers). Files: aloi-27d-75000-max4-tot717.csv.gz, aloi-64d-75000-max4-tot717.csv.gz
- RGB histograms, downsampled to 50,000 objects (1,508 outliers). Files: aloi-27d-50000-max5-tot1508.csv.gz, aloi-64d-50000-max5-tot1508.csv.gz