Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets often incorporate various functional patterns related to different aspects or regimes, which are typically not equally present throughout the dataset. We propose a novel partitioning algorithm that uses competition between models to detect and separate these functional patterns. This competition is induced by multiple models iteratively submitting their predictions for the dataset, with the best prediction for each data point being rewarded with training on that data point. This reward mechanism amplifies each model's strengths and encourages specialization in different patterns. The specializations can then be translated into a partitioning scheme. We validate our concept on datasets with clearly distinct functional patterns, such as mechanical stress and strain data in a porous structure. Our partitioning algorithm produces valuable insights into the datasets' structure, which can serve various further applications. As a demonstration of one exemplary usage, we set up modular models consisting of multiple expert models, each learning a single partition, and compare their performance on more than twenty popular regression problems against single models learning all partitions simultaneously. Our results show significant improvements, with up to 56% loss reduction, confirming our algorithm's utility.
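The abstract describes the competition mechanism procedurally; the following is a minimal sketch of that loop under stated assumptions (simple SGD regressors standing in for the competing models, squared error as the competition criterion), not the authors' implementation:

```python
# Hedged sketch of the competition mechanism described above, assuming
# simple SGD regressors as the competing models and squared error as the
# competition criterion; an illustration, not the authors' code.
import numpy as np
from sklearn.linear_model import SGDRegressor

def competitive_partition(X, y, n_models=3, n_rounds=50, seed=0):
    rng = np.random.default_rng(seed)
    models = [SGDRegressor(max_iter=5, tol=None, random_state=k)
              for k in range(n_models)]
    # Warm-start each model on a random subset so all predictions are defined.
    for m in models:
        idx = rng.choice(len(X), size=max(2, len(X) // n_models), replace=False)
        m.fit(X[idx], y[idx])
    for _ in range(n_rounds):
        # Every model predicts every point; the best prediction wins the point.
        errors = np.stack([(m.predict(X) - y) ** 2 for m in models])
        winners = errors.argmin(axis=0)
        for k, m in enumerate(models):
            won = winners == k
            if won.any():
                m.partial_fit(X[won], y[won])  # reward: train on the points won
    return winners, models  # per-point winner indices define the partition
```

The winner indices play the role of the partitioning scheme; each index set can then be handed to a dedicated expert model, as in the modular models the abstract evaluates.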
This dataset was created by Robbie Manolache.
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors, and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only because of the massive volume of data, but also because these datasets are physically stored at different geographical locations, with only a subset of features available at any one location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose centralizes only a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization at only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the Commercial Modular Aero-Propulsion System Simulation (CMAPSS).
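As a rough illustration of the communication-saving idea (small samples centralized, a global model shipped back), here is a hedged sketch; the detector (scikit-learn's LocalOutlierFactor in novelty mode) and the sampling scheme are stand-ins, not the paper's algorithm:

```python
# Sample-then-centralize sketch: each site sends a small random sample, a
# global outlier model is fit centrally, and the model is shipped back so
# each site flags its own outliers locally.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def distributed_outliers(sites, sample_frac=0.01, n_neighbors=20, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: each site contributes a small sample (the only data moved).
    samples = []
    for s in sites:
        size = min(len(s), max(n_neighbors + 1, int(sample_frac * len(s))))
        samples.append(s[rng.choice(len(s), size=size, replace=False)])
    central = np.vstack(samples)
    # Step 2: fit a global model on the centralized sample.
    model = LocalOutlierFactor(n_neighbors=n_neighbors, novelty=True).fit(central)
    # Step 3: score locally at each site; True marks an outlier.
    return [model.predict(s) == -1 for s in sites]
```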
OPTIMAL PARTITIONS OF DATA IN HIGHER DIMENSIONS. Bradley W. Jackson, Jeffrey D. Scargle, Chris Cusanza, David Barnes, Dennis Kanygin, Russell Sarmiento, Sowmya Subramaniam, and Tzu-Wang Chuang. Abstract: Consider piece-wise constant approximations to a function of several parameters, and the problem of finding the best such approximation from measurements at a set of points in the parameter space. We find good approximate solutions to this problem in two steps: (1) partition the parameter space into cells, one for each of the N data points, and (2) collect these cells into blocks, such that within each block the function is constant to within measurement uncertainty. We describe a branch-and-bound algorithm for finding the optimal partition into connected blocks, as well as an O(N^2) dynamic programming algorithm that finds the exact global optimum over this exponentially large search space, in a data space of any dimension. This second solution relaxes the connectivity constraint and requires additivity and convexity conditions on the block fitness function, but in practice none of these restrictions cause problems. From the wide variety of intelligent data understanding applications (including cluster analysis, classification, and anomaly detection) we demonstrate two: partitioning of the State of California (2D) and the Universe (3D).
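For intuition, here is a minimal sketch of the O(N^2) dynamic program in the one-dimensional, ordered case; the sum-of-squared-residuals fitness and the per-block penalty are illustrative placeholders for any additive block fitness:

```python
# best[j] is the fitness of the optimal partition of the first j cells;
# fitness(i, j) scores the single block covering cells i..j-1.
import numpy as np

def optimal_partition(values, penalty=1.0):
    values = np.asarray(values, dtype=float)
    n = len(values)
    csum = np.concatenate([[0.0], np.cumsum(values)])
    csum2 = np.concatenate([[0.0], np.cumsum(values ** 2)])

    def fitness(i, j):
        # Negative sum of squared residuals around the block mean, minus a
        # per-block penalty; any additive, convex fitness could be swapped in.
        m = j - i
        sse = csum2[j] - csum2[i] - (csum[j] - csum[i]) ** 2 / m
        return -sse - penalty

    best = np.full(n + 1, -np.inf)
    best[0] = 0.0
    last = np.zeros(n + 1, dtype=int)  # start index of the final block
    for j in range(1, n + 1):
        for i in range(j):
            cand = best[i] + fitness(i, j)
            if cand > best[j]:
                best[j], last[j] = cand, i
    blocks, j = [], n  # backtrack the optimal block boundaries
    while j > 0:
        blocks.append((last[j], j))
        j = last[j]
    return blocks[::-1]
```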
Describes the data partitioning of the Berlin dataset.
THIS RESOURCE IS NO LONGER IN SERVICE, documented August 23, 2016. The GO Partition Database was designed to feature ontology partitions with GO terms of similar specificity. The GO partitions comprise varying numbers of nodes and present relevant information-theoretic statistics, so researchers can choose to analyze datasets at arbitrary levels of specificity. The GO Partition Database featured GO partition sets for functional analysis of genes from human and ten other commonly studied organisms, covering a total of 131,972 genes.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Solute descriptors have been widely used to model chemical transfer processes through poly-parameter linear free energy relationships (pp-LFERs); however, there are still substantial difficulties in obtaining these descriptors accurately and quickly for new organic chemicals. In this research, models (PaDEL-DNN) that require only SMILES of chemicals were built to satisfactorily estimate pp-LFER descriptors using deep neural networks (DNN) and the PaDEL chemical representation. The PaDEL-DNN-estimated pp-LFER descriptors demonstrated good performance in modeling storage-lipid/water partitioning coefficient (log Kstorage‑lipid/water), bioconcentration factor (BCF), aqueous solubility (ESOL), and hydration free energy (freesolve). Then, assuming that the accuracy in the estimated values of widely available properties, e.g., logP (octanol–water partition coefficient), can calibrate estimates for less available but related properties, we proposed logP as a surrogate metric for evaluating the overall accuracy of the estimated pp-LFER descriptors. When using the pp-LFER descriptors to model log Kstorage‑lipid/water, BCF, ESOL, and freesolve, we achieved around 0.1 log unit lower errors for chemicals whose estimated pp-LFER descriptors were deemed “accurate” by the surrogate metric. The interpretation of the PaDEL-DNN models revealed that, for a given test chemical, having several (around 5) “similar” chemicals in the training data set was crucial for accurate estimation while the remaining less similar training chemicals provided reasonable baseline estimates. Lastly, pp-LFER descriptors for over 2800 persistent, bioaccumulative, and toxic chemicals were reasonably estimated by combining PaDEL-DNN with the surrogate metric. Overall, the PaDEL-DNN/surrogate metric and newly estimated descriptors will greatly benefit chemical transfer modeling.
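As a hedged sketch of the overall PaDEL-to-DNN idea (not the authors' pipeline): compute a fixed-length PaDEL descriptor vector per SMILES string and regress one pp-LFER descriptor on it. The padelpy wrapper, zero-filling of missing values, and network size below are assumptions for illustration:

```python
import numpy as np
from padelpy import from_smiles  # assumed PaDEL wrapper
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_descriptor_model(smiles_list, targets):
    # Compute PaDEL descriptors for each molecule (dict of name -> value).
    records = [from_smiles(s) for s in smiles_list]
    keys = sorted(records[0])
    # Missing or empty descriptor values are zero-filled (an assumption).
    X = np.array([[float(r.get(k) or 0.0) for k in keys] for r in records])
    model = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(256, 128), max_iter=2000),
    )
    model.fit(X, np.asarray(targets, dtype=float))
    return model, keys
```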
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Land Use partitioned by sub-national region and year (1992-2019)
What is this?
This archive includes land use partitioned by sub-national administrative region and year, i.e., for each year a table reports the count of each land-use class per region. Data is available as one CSV file per year in the folder "out-computedLUseStatsByRegionAndYear" (a minimal loading sketch follows below).
This archive also contains the set of scripts used to compute that partition (including input data download), which can easily be modified to retrieve a partition at a different geographical level.
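A minimal sketch of assembling the per-year CSVs into one table; the folder name comes from the description above, while the column names in the groupby example are assumptions (check the repository README for the authoritative schema):

```python
import glob
import pandas as pd

# Read every per-year CSV and stack them into one long table.
frames = [pd.read_csv(p)
          for p in sorted(glob.glob("out-computedLUseStatsByRegionAndYear/*.csv"))]
landuse = pd.concat(frames, ignore_index=True)
# e.g., total count per land-use class over all regions and years:
# landuse.groupby("luClass")["count"].sum()  # assumed column names
```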
Warnings
See the README at https://github.com/sylvaticus/landUsePartitionByRegionAndYear/ for further information and the citation format.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data set accompanying the publication "Systematic partitioning of proteins for quantum-chemical fragmentation methods using graph algorithms"
The data set contains:
Input script for PyADF (v0.97) for calculating (a) all two-body terms to use as graph weights and (b) the fragmentation error for all k and nmax (aspf)
PDB files of proteins and the "regions of interest" (RoI) used in this work.
Raw data: protein graph representations, resulting partitions, data underlying all figures shown in our article.
Jupyter notebook to create all figures shown in the article and in the supporting information from data in the results folder.
Images of protein structures and graph representations of ubiquitin.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Context: The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).
The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:
A clean, pre-defined 80/20 train-test split.
Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.
A flat directory structure (train/, test/) for simplified file access.
File Content: The dataset is organized into a single top-level folder and two CSV files:
train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.
test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.
Caltech-256_Train_Test/: The primary data folder.
train/: This directory contains 80% of the images from all 257 categories, intended for model training.
test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.
Data Split: The dataset has been partitioned into a standard 80% training and 20% testing split. This split is (or should be assumed to be) stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion in the respective sets.
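A minimal sketch of consuming the manifest files with a PyTorch dataset; the image_path and label columns come from the description above, while the root path default is an assumption:

```python
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class Caltech256Manifest(Dataset):
    def __init__(self, csv_file, root="Caltech-256_Train_Test", transform=None):
        # Each manifest row maps an image path to its label.
        self.df = pd.read_csv(csv_file)
        self.root = root
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(f"{self.root}/{row['image_path']}").convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, row["label"]

train_ds = Caltech256Manifest("train.csv")
```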
Acknowledgements & Original Source: This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.
Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data
Citation: Griffin, G., Holub, A. D., & Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Overview: This dataset has 17 classes. The data are divided into three partitions: train, val, and test.
Dataset Characteristics: Image. Feature Type: Categorical. Associated Tasks: Classification, Other.
Class Labels: 0: Beet Armyworm; 1: Black Hairy; 2: Cutworm; 3: Field Cricket; 4: Jute Aphid; 5: Jute Hairy; 6: Jute Red Mite; 7: Jute Semilooper; 8: Jute Stem Girdler; 9: Jute Stem Weevil; 10: Leaf Beetle; 11: Mealybug; 12: Pod Borer; 13: Scopula Emissaria; 14: Termite; 15: Termite odontotermes (Rambur); 16: Yellow Mite.
Has Missing Values?: No
Files included: PartitionAnalysisTimeMC.FigS1 (Figure S1); PartitionAnalysisTimeMC.TableS1S2 (Tables S1 & S2); MakeTree.R; RelClock.R.
The dataset includes 30-minute values of partitioned evaporation (E) and transpiration (T), T:ET ratios, and other ancillary datasets for three ET partitioning methods, viz. the Flux Variance Similarity (FVS) method, the Transpiration Estimation Algorithm (TEA), and the Underlying Water Use Efficiency (uWUE) method, for three wheat sites. The three wheat sites had different grazing treatments: Site 1 was Grain-only and Graze-grain wheat for the 2016-17 and 2017-18 growing seasons, respectively; Site 2 was Grain-only wheat for the 2017-18 growing season; and Site 3 was Graze-grain and Graze-out wheat for the 2016-17 and 2017-18 growing seasons, respectively. The Grain-only wheat system has the single purpose of producing wheat grain. The Graze-grain wheat system has a dual purpose: it serves as a pasture for grazing cattle from November to February and is used to produce wheat grain later. The Graze-out wheat system is also a single-purpose crop; it is grazed by cattle for the entire season and serves solely as pasture. The FVS method performed ET partitioning using the high-frequency (10 Hz) data collected from eddy covariance flux stations located near the middle of each field. The high-frequency data were also processed using the EddyPro software to obtain good-quality estimates of the different fluxes at 30-minute intervals; these processed 30-min data were used by the TEA and uWUE methods for ET partitioning. Ancillary hydro-meteorological variables, including net radiation, air temperature, soil water content, relative humidity, and others, are also included in this dataset. The study sites were located at the United States Department of Agriculture, Agricultural Research Service (USDA-ARS) Grazinglands Research Laboratory, El Reno, Oklahoma. All sites were rainfed.
Resources in this dataset:
Resource Title: FVS output and other met data and site info. File Name: FVS_output_and_other_met_data_and_site_info.xlsx. Resource Description: Output of the FVS model along with corresponding meteorological data and site metadata.
Resource Title: TEA output. File Name: TEA_output.xlsx. Resource Description: Output of the TEA model along with site metadata.
Resource Title: uWUE output. File Name: uWUE_output.xlsx. Resource Description: Output of the uWUE model run along with site metadata.
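A small sketch of working with the 30-minute partitioned fluxes, computing a T:ET ratio from the transpiration (T) and evaporation (E) series; the column names are assumptions for illustration (the files already include T:ET ratios), so check the .xlsx resources for the actual schema:

```python
import pandas as pd

df = pd.read_excel("FVS_output_and_other_met_data_and_site_info.xlsx")
et = df["T"] + df["E"]                         # total evapotranspiration
df["T_to_ET"] = (df["T"] / et).where(et > 0)   # guard against zero ET
print(df["T_to_ET"].describe())
```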
The programs and software required are R, IQ-TREE2, and Seq-Gen-1.3.4.
The trace element selenium is an essential element with a narrow window between the concentrations needed to support life and those that cause toxicity to egg-laying organisms. Selenium bioaccumulation in aquatic organisms is primarily the result of trophic transfer through food webs and is poorly predicted by dissolved concentrations in freshwater bodies. To better understand the hydrologic and biological dynamics that control selenium accumulation in fishes of the Lower Gunnison River Basin (Colorado), ecosystem-scale selenium accumulation models were developed from data collected between June 2015 and October 2016.
Data for all figures, provided as zipped files in NetCDF format. This dataset is associated with the following publication: He, J., and K. Alapaty. Precipitation Partitioning in Multiscale Atmospheric Simulations: Impacts of Stability Restoration Methods. Journal of Geophysical Research: Atmospheres, American Geophysical Union, Washington, DC, USA, 123(18): 10185-10201 (2018).
Partitioning a permutation into a minimum number of monotone subsequences is NP-hard. We extend this complexity result to minimum partitions into unimodal subsequences. In graph-theoretical terms, these problems are cocoloring and what we call split-coloring of permutation graphs. Based on a network flow interpretation of both problems we introduce mixed integer programs; this is the first approach to obtain optimal partitions for these problems in general. We derive an LP rounding algorithm which is a 2-approximation for both coloring problems; it performs much better in practice. In an online setting the permutation becomes known to an algorithm sequentially, and we give a logarithmic lower bound on the competitive ratio and analyze two online algorithms.
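For context on where the hardness begins, a short sketch: if only increasing subsequences are allowed, a greedy, patience-sorting-style pass already yields a minimum partition (by Dilworth's theorem the number of subsequences equals the length of the longest decreasing subsequence); it is mixing increasing with decreasing pieces (cocoloring) or allowing unimodal pieces that makes the problem NP-hard. This is illustrative context, not the paper's MIP or LP rounding algorithm:

```python
# Greedy minimum partition of a permutation into increasing subsequences:
# place each element on the pile whose top is the largest value below it;
# if no pile can take it, open a new pile.
import bisect

def partition_into_increasing(perm):
    tops = []   # last element of each open subsequence, kept sorted
    piles = []  # piles[i] is the subsequence whose last element is tops[i]
    for x in perm:
        i = bisect.bisect_left(tops, x) - 1  # pile with largest top below x
        if i >= 0:
            tops[i] = x
            piles[i].append(x)
        else:                                # no pile can extend: open one
            tops.insert(0, x)
            piles.insert(0, [x])
    return piles

print(partition_into_increasing([3, 1, 4, 2, 5, 9, 7, 6]))
# -> [[6], [1, 2, 7], [3, 4, 5, 9]]: three parts, matching the longest
#    decreasing subsequence 9, 7, 6.
```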
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset describes the correlation of the Ronlow beds to other geological units in the Galilee subregion. The Ronlow beds are stratigraphic equivalents of three formal geological units: the Hutton Sandstone, the Hooray Sandstone, and the Injune Creek Group. For the preparation of potentiometric surface maps and other hydrogeological interpretation of data from the Galilee subregion, the Ronlow beds were partitioned into three sub-units, which were assigned to either the Hutton Sandstone, Hooray Sandstone, or Injune Creek Group. This partitioning was based on potentiometry of bores screened in the Ronlow beds.
Hydraulic head data for bores screened in the Ronlow beds from the dataset 'JkrRonlow_beds_Partitioning.gdb' were compared to hydraulic head values in bores assigned to the Hutton Sandstone, Hooray Sandstone, and Injune Creek Group. Bores screened in the Ronlow beds were then assigned to either the Hutton Sandstone aquifer, Hooray Sandstone aquifer, or Injune Creek Group aquitard based on similarities in hydraulic head. The polygons were created in an ArcMap editing session.
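An illustrative sketch (not the Programme's ArcMap workflow) of the assignment rule described above: each Ronlow beds bore is attributed to the unit whose bores show the most similar hydraulic head. The DataFrame layout and column names are assumptions:

```python
import pandas as pd

def assign_bores(ronlow: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    # reference holds bores already attributed to the Hutton Sandstone,
    # Hooray Sandstone, or Injune Creek Group: columns ["unit", "head_m"].
    unit_means = reference.groupby("unit")["head_m"].mean()
    out = ronlow.copy()
    # Assign each bore to the unit with the closest mean hydraulic head.
    out["assigned_unit"] = out["head_m"].map(
        lambda head: (unit_means - head).abs().idxmin()
    )
    return out
```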
Bioregional Assessment Programme (2015) Ronlow beds partitioning. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/d2f60560-eda7-417d-86ca-1d29ce994edd.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From QDEX well completion reports (WCR) - Galilee v01
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores v3 03122014
Derived From Potentiometric head difference v01
Derived From QLD Department of Natural Resources and Mines Groundwater Database Extract 20142808
Derived From Galilee subregion groundwater usage estimates dataset v01
Derived From Galilee Water Accounts Table: volumes and purposes