License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Basic information on 40 datasets from UCI repository used in this study including information about number of instances, attributes, classes, length of longest attribute name (LAN) and length of the longest nominal attribute value (LAV).
License: MIT License, https://opensource.org/licenses/MIT (license information was derived automatically)
This dataset contains the metadata of 164 datasets. Each row describes one dataset by 22 extracted features, which are used to classify it into one of four categories (0 - Unmanaged, 2 - INV, 3 - SI, 4 - NOA). The target column, datasetType, holds the category. Its four values are:
2 - Invoice detail (INV): This dataset type is a special report (usually called Detailed Sales Statement) produced by a Company Accounting or an Enterprise Resource Planning software (ERP). Using an INV-type dataset directly for ARM is extremely convenient for users as it relieves them from the tedious work of transforming data into another more suitable form. INV-type data input typically includes a header, but only two of its attributes are essential for data mining. The first attribute serves as the grouping identifier creating a unique transaction (e.g., Invoice ID, Order Number), while the second attribute contains the items utilized for data mining (e.g., Product Code, Product Name, Product ID).
3 - Sparse Item (SI): This type is widespread in Association Rules Mining (ARM). It involves a header and a fixed number of columns. Each item corresponds to a column. Each row represents a transaction. The typical cell stores a value, usually one character in length, that depicts the presence or absence of the item in the corresponding transaction. The absence character must be identified or declared before the Association Rules Mining process takes place.
4 - Nominal Attributes (NOA): This type is commonly used in Machine Learning and Data Mining tasks. It involves a fixed number of columns. Each column registers nominal/categorical values. The presence of a header row is optional. However, in cases where no header is provided, there is a risk of extracting incorrect rules if similar values exist in different attributes of the dataset. The potential values for each attribute can vary.
0 - Unmanaged for ARM: Not all datasets are suitable for extracting useful association rules or frequent item sets. Examples are datasets characterized predominantly by numerical features with arbitrary values, and datasets that involve fragmented or mixed data types. For such datasets, ARM processing becomes possible only by introducing a data discretization stage, which in turn introduces information loss. Such datasets are not considered in the present treatise and are termed (0) Unmanaged in the sequel.
Determining the dataset type is crucial for ARM, and the current dataset is used to classify a dataset's type with a Supervised Machine Learning model.
There is also another dataset type, 1 - Market Basket List (MBL), where each dataset row is a transaction involving a variable number of items. Because of this characteristic, these datasets can be easily categorized using procedural programming, and DoD does not include instances of them. For more details about dataset types, please refer to the article "WebApriori: a web application for association rules mining": https://link.springer.com/chapter/10.1007/978-3-030-49663-0_44
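The INV and SI layouts described above can both be reduced to the item sets that ARM algorithms consume. The following is a minimal sketch of that idea, not the WebApriori implementation; the function names, the column names in the usage note, and the default absence character "0" are assumptions made for illustration.

```python
from collections import defaultdict


def inv_to_transactions(rows, id_col, item_col):
    """Group INV-type rows (one item per row) into item sets,
    keyed by the grouping identifier (e.g. an invoice ID)."""
    baskets = defaultdict(set)
    for row in rows:
        baskets[row[id_col]].add(row[item_col])
    return dict(baskets)


def si_to_transactions(header, rows, absence="0"):
    """Turn SI-type rows (one column per item) into item sets.
    A cell equal to the declared absence character means the item
    is not part of that transaction."""
    return [
        {item for item, cell in zip(header, row) if cell != absence}
        for row in rows
    ]
```

For example, `inv_to_transactions(rows, "InvoiceID", "ProductCode")` would return one item set per invoice, provided those two (hypothetical) column names exist in the input rows.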
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Patient categorical and nominal attributes.
License: https://creativecommons.org/publicdomain/zero/1.0/
All nominal attributes and instances with missing values are deleted. Price treated as the class attribute.
As used by Kilpatrick, D. & Cameron-Jones, M. (1998). Numeric prediction using instance-based learning with encoding length selection. In Progress in Connectionist-Based Information Systems. Singapore: Springer-Verlag.
Title: 1985 Auto Imports Database
Source Information:
-- Creator/Donor: Jeffrey C. Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
-- Date: 19 May 1987
-- Sources:
   1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
   2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
   3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037
Past Usage:
-- Kibler, D., Aha, D. W., & Albert, M. (1989). Instance-based prediction of real-valued attributes. Computational Intelligence, 5, 51-57.
-- Predicted price of car using all numeric and Boolean attributes
-- Method: an instance-based learning (IBL) algorithm derived from a localized k-nearest neighbor algorithm. Compared with a linear regression prediction...so all instances with missing attribute values were discarded. This resulted in a training set of 159 instances, which was also used as a test set (minus the actual instance during testing).
-- Results: Percent Average Deviation Error of Prediction from Actual
   -- 11.84% for the IBL algorithm
   -- 14.12% for the resulting linear regression equation
Relevant Information:
-- Description: This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.
The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc...), and represents the average loss per car per year.
-- Note: Several of the attributes in the database could be used as a "class" attribute.
Number of Instances: 205
Number of Attributes: 26 total -- 15 continuous -- 1 integer -- 10 nominal
Attribute Information (attribute: attribute range):
- symboling: -3, -2, -1, 0, 1, 2, 3.
- normalized-losses: continuous from 65 to 256.
- make: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo
- fuel-type: diesel, gas.
- aspiration: std, turbo.
- num-of-doors: four, two.
- body-style: hardtop, wagon, sedan, hatchback, convertible.
- drive-wheels: 4wd, fwd, rwd.
- engine-location: front, rear.
- wheel-base: continuous from 86.6 to 120.9.
- length: continuous from 141.1 to 208.1.
- width: continuous from 60.3 to 72.3.
- height: continuous from 47.8 to 59.8.
- curb-weight: continuous from 1488 to 4066.
- engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
- num-of-cylinders: eight, five, four, six, three, twelve, two.
- engine-size: continuous from 61 to 326.
- fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
- bore: continuous from 2.54 to 3.94.
- stroke: continuous from 2.07 to 4.17.
- compression-ratio: continuous from 7 to 23.
- horsepower: continuous from 48 to 288.
- peak-rpm: continuous from 4150 to 6600.
- city-mpg: continuous from 13 to 49.
- highway-mpg: continuous from 16 to 54.
- price: continuous from 5118 to 45400.
Missing Attribute Values: (denoted by "?") Number of instances missing a value, per affected attribute: 41, 2, 4, 4, 2, 2, 4.
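Since missing values are denoted by "?" and this version of the data discards every instance that has one, the filtering step can be sketched as follows. This is an illustration only: the sample rows are made up and shortened to four fields, and the function name is hypothetical.

```python
import csv
import io

MISSING = "?"  # missing-value marker stated in the description


def drop_incomplete(rows):
    """Keep only the rows in which no field equals the missing marker."""
    return [row for row in rows if MISSING not in row]


# Made-up, shortened rows standing in for the real 26-column records.
sample = io.StringIO("3,?,alfa-romero,13495\n1,164,audi,13950\n2,164,audi,?\n")
rows = list(csv.reader(sample))
complete = drop_incomplete(rows)
```

Only the second sample row survives, because the first and third each contain a "?" field.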
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
This data set supports Machine Learning for characterizing breathing patterns during sleep in adults from preprocessed abdominal electromyograms (EMGs). The 40 records were picked from a larger database (Computing in Cardiology Challenge 2018: Training/Test Sets. 2018. URL: https://archive.physionet.org/physiobank/database/challenge/2018/).
The optimal exponential smoothing model was the same for all records: additive errors, small undamped trends, and no seasonality. After trends and noise were removed, the signals had autocorrelation functions with power-law decay, which made it possible to estimate their persistence factors (Hurst exponent).
Most of the signals (38 of 40) showed frequent outliers, from a few percent up to 24.6% of values. The wide data variability was measured with the median absolute deviation, which is the most robust statistic in such a case. The high variability looks a bit odd, considering the fairly low noise levels.
The outliers' percentage, variability, SNR (signal-to-noise ratio), and persistence factors were z-scored using medians and median absolute deviations. Their linear combinations then form three independent Principal Components: the numeric attributes z_1, z_2, and z_3 of the data set.
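A z-score based on the median and the median absolute deviation (MAD) can be sketched as below. This is an assumed form: the description does not say whether the MAD was multiplied by a consistency constant (such as 1.4826), so the sketch uses the raw MAD, and the function name is hypothetical.

```python
from statistics import median


def robust_z(values):
    """Scale values as (x - median) / MAD, using the unscaled MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the robust z-score is undefined")
    return [(v - med) / mad for v in values]
```

Unlike a mean and standard deviation, the median and MAD are barely moved by a single extreme value, which suits signals with frequent outliers.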
The matrix of Manhattan distances among the subjects' vectors in the 4D attribute space allows the data set to be represented as a weighted biconnected graph whose vertices are the subjects. The weights of the graph's edges reflect the distances between pairs of subjects. The "closeness centralities" of the vertices, a well-known parameter in graph theory, allowed us to split the data into two clusters with 11 and 29 subjects. They form two biconnected subgraphs, peripheral and core, respectively. Membership in one of them is recorded in the binary (nominal) attribute z_4: 0 labels the peripheral subgraph and 1 the core one.
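The distance matrix and closeness centrality mentioned above can be sketched in plain Python. The sketch assumes a complete graph in which every pair of subjects is joined by an edge weighted with their Manhattan distance; because that distance obeys the triangle inequality, the direct edge is already a shortest path, so closeness reduces to (n - 1) divided by the sum of a row of the matrix. How the authors turned the centralities into the 11/29 split is not stated, so no threshold is shown. Function names are hypothetical.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(vectors):
    """Square matrix of pairwise Manhattan distances."""
    return [[manhattan(a, b) for b in vectors] for a in vectors]


def closeness(dist):
    """Closeness centrality of each vertex in a complete weighted graph.
    Assumes at least two distinct points, so no row sums to zero."""
    n = len(dist)
    return [(n - 1) / sum(row) for row in dist]
```

Vertices with high closeness lie near the bulk of the subjects (a core), while low values mark peripheral subjects.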
The periodograms of the EMGs allowed us to identify ten subjects with regular breathing and 30 with irregular breathing, defining two unequal classes recorded in the nominal attribute z_5.
So, we offer here the data set for Machine Learning in ARFF format, containing 40 instances with five attributes, whose meaning is described above.
License: https://creativecommons.org/publicdomain/zero/1.0/
Number of Instances: - 48842 instances, mix of continuous and discrete (train=32561, test=16281) - 45222 if instances with unknown values are removed (train=30162, test=15060)
Number of Attributes: - 6 continuous, 8 nominal attributes.
Attribute Information:
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- gender: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- income: >50K, <=50K
DATA MINING THE GALAXY ZOO MERGERS
STEVEN BAEHR, ARUN VEDACHALAM, KIRK BORNE, AND DANIEL SPONSELLER
Abstract. Collisions between pairs of galaxies usually end in the coalescence (merger) of the two galaxies. Collisions and mergers are rare phenomena, yet they may signal the ultimate fate of most galaxies, including our own Milky Way. With the onset of massive collection of astronomical data, a computerized and automated method will be necessary for identifying those colliding galaxies worthy of more detailed study. This project researches methods to accomplish that goal. Astronomical data from the Sloan Digital Sky Survey (SDSS) and human-provided classifications on merger status from the Galaxy Zoo project are combined and processed with machine learning algorithms. The goal is to determine indicators of merger status based solely on discovering those automated pipeline-generated attributes in the astronomical database that correlate most strongly with the patterns identified through visual inspection by the Galaxy Zoo volunteers. In the end, we aim to provide a new and improved automated procedure for classification of collisions and mergers in future petascale astronomical sky surveys. Both information gain analysis (via the C4.5 decision tree algorithm) and cluster analysis (via the Davies-Bouldin Index) are explored as techniques for finding the strongest correlations between human-identified patterns and existing database attributes. Galaxy attributes measured in the SDSS green waveband images are found to represent the most influential of the attributes for correct classification of collisions and mergers. Only a nominal information gain is noted in this research; however, there is a clear indication of which attributes contribute, so that a direction for further study is apparent.
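To make the notion of information gain concrete, here is a minimal sketch for a nominal (or already discretized) attribute. It is not the authors' pipeline: C4.5 additionally handles continuous attributes through threshold splits and ranks candidates by gain ratio, neither of which is shown. The function names and the toy labels in the usage note are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy after splitting on a nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

With labels such as "merger" and "non-merger", an attribute whose values separate the two classes perfectly has a gain equal to the full label entropy, while an attribute unrelated to the labels has a gain of zero.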
Objectives: Understanding the preferences of patients with multiple sclerosis (MS) for disease-modifying drugs and involving these patients in clinical decision making can improve the concordance between medical decisions and patient values and may, subsequently, improve adherence to disease-modifying drugs. This study aims first to identify which characteristics, or attributes, of disease-modifying drugs influence patients' decisions about these treatments and second to quantify the attributes' relative importance among patients.
Methods: First, three focus groups of relapsing-remitting MS patients were formed to compile a preliminary list of attributes using a nominal group technique. Based on this qualitative research, a survey with several choice tasks (best-worst scaling) was developed to prioritize attributes, asking a larger patient group to choose the most and least important attributes. The attributes' mean relative importance scores (RIS) were calculated.
Results: Nineteen patients reported 34 attributes during the focus groups and 185 patients evaluated the importance of the attributes in the survey. The effect on disease progression received the highest RIS (RIS = 9.64, 95% confidence interval: [9.48–9.81]), followed by quality of life (RIS = 9.21 [9.00–9.42]), relapse rate (RIS = 7.76 [7.39–8.13]), severity of side effects (RIS = 7.63 [7.33–7.94]) and relapse severity (RIS = 7.39 [7.06–7.73]). Subgroup analyses showed heterogeneity in preference of patients. For example, side effect-related attributes were statistically more important for patients who had no experience in using disease-modifying drugs compared to experienced patients (p < .001).
Conclusions: This study shows that, on average, patients valued effectiveness and unwanted effects as most important. Clinicians should be aware of the average preferences but also that attributes of disease-modifying drugs are valued differently by different patients. Person-centred clinical decision making would be needed and requires eliciting individual preferences.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
This is a PROMISE Software Engineering Repository data set made publicly available in order to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering.
Title: Nickle Repository Transaction Data
Sources: (a) Original creators of database: Bart Massey 01 503 725-5393 Computer Science Dept. PO Box 751 MS CMPS Portland State University Portland, OR USA 97207-0751 bart@cs.pdx.edu
(b) Donor of database: owner
(c) Date received: 31 March 2005
Past Usage: none
Relevant Information:
This dataset was assembled by analyzing the publicly-available CVS archives of the Nickle programming language (http://nickle.org) using a modified version of CVSAnalY (http://metricsgrimoire.github.io/CVSAnalY). It is intended for a wide variety of uses, and thus no dependent variable is specified. See Massey's PROMISE 2005 paper, "Longitudinal Analysis of Long-Timescale Open Source Repository Data", for further information.
Number of Instances: 2972
Number of Attributes: 10 (incl. 2 inferred)
Attribute information:
Inferred filetype:
non-numeric—nominal
9 values (documentation,images,i18n,ui,multimedia,code,build,devel-doc,unknown)
File Pathname: UNIX relative path name
non-numeric—structured
15 directories, 265 files
Revision: dotted-decimal revision string
non-numeric—structured
161 unique revision numbers
Author ID: integer identifier
non-numeric—nominal
6 unique authors
Lines added: integer
numeric—integer
MIN: 0 MAX: 1011 MEAN: 24.0148 STDEV: 65.6858
Lines removed: integer
numeric—integer
MIN: 0 MAX: 678 MEAN: 11.6413 STDEV: 42.4373
File has since been removed ("in Attic"): boolean (0, 1)
non-numeric—nominal
11.98% positive
Commit has CVS_SILENT flag: boolean (0, 1)
non-numeric—nominal
never set
Inferred that committer was not author: boolean (0, 1)
non-numeric—nominal
never set
Commit date: date
MIN: "1999-01-13 06:22:11" MAX: "2005-01-14 16:58:53"
Note: CVS has trouble with timezones; we assume all dates are UTC.
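Because all commit dates are assumed to be UTC, it can help to attach that time zone explicitly when parsing them, so later arithmetic does not silently use local time. A minimal sketch, assuming the "YYYY-MM-DD HH:MM:SS" layout of the MIN and MAX values shown above (the function name is hypothetical):

```python
from datetime import datetime, timezone


def parse_commit_date(text):
    """Parse a commit timestamp and mark it as UTC, per the dataset note."""
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


first = parse_commit_date("1999-01-13 06:22:11")
last = parse_commit_date("2005-01-14 16:58:53")
span_days = (last - first).days  # whole days covered by the archive
```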
Missing Attribute Values:
unknown inferred filetype (attr 1): 69
Class Distribution: N/A
This database contains 5 numeric-valued attributes.
Attribute Information:
ID: distinct for each instance and represented numerically
hobby: nominal values ranging between 1 and 3 (Chess, Sports, Stamps)
age: nominal values ranging between 1 and 4 (Child, Teenager, Young, Old)
educational level: nominal values ranging between 1 and 4 (Primary, Higher Secondary, Graduate, Post Graduate)
marital status: nominal values ranging between 1 and 4 (Not Married, Married, Divorced, Complicated)
class: nominal value between 1 and 3 (Lower, Middle, Upper)
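Since every nominal attribute here is stored as an integer code, a small lookup table makes records readable again. The sketch below assumes that the codes follow the order in which the labels are listed (1 for the first label, 2 for the second, and so on), which the description implies but does not state outright; `CODEBOOK` and `decode` are hypothetical names.

```python
# Assumed code-to-label mapping: code k is the k-th label as listed.
CODEBOOK = {
    "hobby": {1: "Chess", 2: "Sports", 3: "Stamps"},
    "age": {1: "Child", 2: "Teenager", 3: "Young", 4: "Old"},
    "educational level": {1: "Primary", 2: "Higher Secondary",
                          3: "Graduate", 4: "Post Graduate"},
    "marital status": {1: "Not Married", 2: "Married",
                       3: "Divorced", 4: "Complicated"},
    "class": {1: "Lower", 2: "Middle", 3: "Upper"},
}


def decode(record):
    """Replace integer codes with labels; fields without a codebook
    entry (such as ID) are passed through unchanged."""
    return {
        name: CODEBOOK[name][value] if name in CODEBOOK else value
        for name, value in record.items()
    }
```

Keeping the codes as categories rather than numbers also prevents a learner from treating, say, hobby 3 as "larger" than hobby 1.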
License: https://creativecommons.org/publicdomain/zero/1.0/
This is a transnational data set which contains all the online retail transactions occurring between 2010 and 2011. The information was collected from countries such as the US, UK, and France.
- InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
- UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).
- CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal. The name of the country where a customer resides.
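The cancellation rule for InvoiceNo can be applied with a one-line check. The sketch below assumes the prefix should be matched case-insensitively, since the description says 'c' but does not state the case used in the file; the function name and the sample invoice numbers are made up.

```python
def is_cancellation(invoice_no):
    """True when the invoice code starts with the letter 'c' (any case)."""
    return str(invoice_no).strip().lower().startswith("c")


# Hypothetical invoice numbers for illustration.
invoices = ["536365", "C536379", "536381"]
completed = [n for n in invoices if not is_cancellation(n)]
```

Converting to `str` first keeps the check safe when invoice numbers were read as integers.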
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
1 Introduction
The Peatland Decomposition Database (PDD) stores data from published litterbag experiments related to peatlands. Currently, the database focuses on northern peatlands and Sphagnum litter and peat, but it also contains data from some vascular plant litterbag experiments. Currently, the database contains entries from 34 studies, 2,160 litterbag experiments, and 7,297 individual samples with 117,841 measurements for various attributes (e.g. relative mass remaining, N content, holocellulose content, mesh size). The aim is to provide a harmonized data source that can be useful to re-analyse existing data and to plan future litterbag experiments.
The Peatland Productivity and Decomposition Parameter Database (PPDPD) (Bona et al. 2018) is similar to the Peatland Decomposition Database (PDD) in that both contain data from peatland litterbag experiments. The differences are that the two databases only partly contain the same data, that PPDPD additionally contains information on vegetation productivity, which PDD does not, and that PDD provides more information and metadata on litterbag experiments, as well as measurement errors.
2 Updates
Compared to version 1.0.0, this version has a new structure for table experimental_design_format, contains additional metadata on the experimental design (these were omitted in version 1.0.0), and contains the scripts that were used to import the data into the database.
3 Methods
3.1 Data collection
Data for the database was collected from published litterbag studies, by extracting published data from figures, tables, or other data sources, and by contacting the authors of the studies to obtain raw data. All data processing was done with R (R version 4.2.0 (2022-04-22)) (R Core Team 2022).
Studies were identified via a Scopus search with search string (TITLE-ABS-KEY ( peat* AND ( "litter bag" OR "decomposition rate" OR "decay rate" OR "mass loss")) AND NOT ("tropic*")) (2022-12-17). These studies were further screened to exclude those which do not contain litterbag data or which recycle data from other studies that have already been considered. Additional studies with litterbag experiments in northern peatlands that we were aware of, but which were not identified in the literature search, were added to the list of publications. For studies not older than 10 years, authors were contacted to obtain raw data; however, this was successful only in a few cases. To date, the database focuses on Sphagnum litterbag experiments, and data from some of the studies identified by the literature search have not yet been included in the database.
Data from figures were extracted using the package ‘metaDigitise’ (1.0.1) (Pick, Nakagawa, and Noble 2018). Data from tables were extracted manually.
Data from the following studies are currently included: Farrish and Grigal (1985), Bartsch and Moore (1985), Farrish and Grigal (1988), Vitt (1990), Hogg, Lieffers, and Wein (1992), Sanger, Billett, and Cresser (1994), Hiroki and Watanabe (1996), Szumigalski and Bayley (1996), Prevost, Belleau, and Plamondon (1997), Arp, Cooper, and Stednick (1999), Robbert A. Scheffer and Aerts (2000), R. A. Scheffer, Van Logtestijn, and Verhoeven (2001), Limpens and Berendse (2003), Waddington, Rochefort, and Campeau (2003), Asada, Warner, and Banner (2004), Thormann, Bayley, and Currah (2001), Trinder, Johnson, and Artz (2008), Breeuwer et al. (2008), Trinder, Johnson, and Artz (2009), Bragazza and Iacumin (2009), Hoorens, Stroetenga, and Aerts (2010), Straková et al. (2010), Straková et al. (2012), Orwin and Ostle (2012), Lieffers (1988), Manninen et al. (2016), Johnson and Damman (1991), Bengtsson, Rydin, and Hájek (2018a), Bengtsson, Rydin, and Hájek (2018b), Asada and Warner (2005), Bengtsson, Granath, and Rydin (2017), Bengtsson, Granath, and Rydin (2016), Hagemann and Moroni (2015), Hagemann and Moroni (2016), B. Piatkowski et al. (2021), B. T. Piatkowski et al. (2021), Mäkilä et al. (2018), Golovatskaya and Nikonova (2017), Golovatskaya and Nikonova (2017).
4 Database records
The database is a ‘MariaDB’ database and the database schema was designed to store data and metadata following the Ecological Metadata Language (EML) (Jones et al. 2019). Descriptions of the tables are shown in Tab. 1.
The database contains general metadata relevant for litterbag experiments (e.g., geographical, temporal, and taxonomic coverage, mesh sizes, experimental design). However, it does not contain a detailed description of sample handling, sample preprocessing methods, or site descriptions, because there currently are no discipline-specific metadata and reporting standards.
Table 1: Description of the individual tables in the database.
Name Description
attributes Defines the attributes of the database and the values in column attribute_name in table data.
citations Stores bibtex entries for references and data sources.
citations_to_datasets Links entries in table citations with entries in table datasets.
custom_units Stores custom units.
data Stores measured values for samples, for example remaining masses.
datasets Lists the individual datasets.
experimental_design_format Stores information on the experimental design of litterbag experiments.
measurement_scales, measurement_scales_date_time, measurement_scales_interval, measurement_scales_nominal, measurement_scales_ordinal, measurement_scales_ratio Defines data value types.
missing_value_codes Defines how missing values are encoded.
samples Stores information on individual samples.
samples_to_samples Links samples to other samples, for example litter samples collected in the field to litter samples collected during the incubation of the litterbags.
units, unit_types Stores information on measurement units.
5 Attributes
Table 2: Definition of attributes in the Peatland Decomposition Database and entries in the column attribute_name in table data.
Name Definition Example value Unit Measurement scale Number type Minimum value Maximum value String format
4_hydroxyacetophenone_mass_absolute A numeric value representing the content of 4-hydroxyacetophenone, as described in Straková et al. (2010). 0.26 g ratio real 0 Inf NA
4_hydroxyacetophenone_mass_relative_mass A numeric value representing the content of 4-hydroxyacetophenone, as described in Straková et al. (2010). 0.26 g/g ratio real 0 1 NA
4_hydroxybenzaldehyde_mass_absolute A numeric value representing the content of 4-hydroxybenzaldehyde, as described in Straková et al. (2010). 0.26 g ratio real 0 Inf NA
4_hydroxybenzaldehyde_mass_relative_mass A numeric value representing the content of 4-hydroxybenzaldehyde, as described in Straková et al. (2010). 0.26 g/g ratio real 0 1 NA
4_hydroxybenzoic_acid_mass_absolute A numeric value representing the content of 4-hydroxybenzoic acid, as described in Straková et al. (2010). 0.26 g ratio real 0 Inf NA
4_hydroxybenzoic_acid_mass_relative_mass A numeric value representing the content of 4-hydroxybenzoic acid, as described in Straková et al. (2010). 0.26 g/g ratio real 0 1 NA
abbreviation In table custom_units: A string representing an abbreviation for the custom unit. gC NA nominal NA NA NA NA
acetone_extractives_mass_absolute A numeric value representing the content of acetone extractives, as described in Straková et al. (2010). 0.26 g ratio real 0 Inf NA
acetone_extractives_mass_relative_mass A numeric value representing the content of acetone extractives, as described in Straková et al. (2010). 0.26 g/g ratio real 0 1 NA
acetosyringone_mass_absolute A numeric value representing the content of acetosyringone, as described in Straková et al. (2010). 0.26 g ratio real 0 Inf NA
acetosyringone_mass_relative_mass A numeric value representing the content of acetosyringone, as described in Straková et al. (2010). 0.26 g/g ratio real 0 1 NA
acetovanillone_mass_absolute A numeric value representing the content of acetovanillone, as described in Straková et al. (2010). 0.26 g ratio real 0 Inf NA
acetovanillone_mass_relative_mass A numeric value representing the content of acetovanillone, as described in Straková et al. (2010). 0.26 g/g ratio real 0 1 NA
arabinose_mass_absolute A numeric value representing the content of arabinose, as described in Straková et al. (2010). 0.26 g ratio real 0 Inf NA
arabinose_mass_relative_mass A numeric value representing the content of arabinose, as described in Straková et al. (2010). 0.26 g/g ratio real 0 1 NA
ash_mass_absolute A numeric value representing the content of ash (after burning at 550°C). 4 g ratio real 0 Inf NA
ash_mass_relative_mass A numeric value representing the content of ash (after burning at 550°C). 0.05 g/g ratio real 0 Inf NA
attribute_definition A free text field with a textual description of the meaning of attributes in the dpeatdecomposition database. NA NA nominal NA NA NA NA
attribute_name A string describing the names of the attributes in all tables of the dpeatdecomposition database. attribute_name NA nominal NA NA NA NA
bibtex A string representing the bibtex code used for a literature reference throughout the dpeatdecomposition database. Galka.2021 NA nominal NA NA NA NA
bounds_maximum A numeric value representing the maximum possible value for a numeric attribute. INF NA interval real Inf Inf NA
bounds_minimum A numeric value representing the minimum possible value for a numeric attribute. 0 NA interval real Inf Inf NA
bulk_density A numeric value representing the bulk density of the sample [g cm-3]. 0.2 g/cm^3 ratio real 0 Inf NA
C_absolute The absolute mass of C in the sample. 1 g ratio real 0 Inf NA
C_relative_mass The relative mass of C in the sample. 1 g/g ratio real 0 Inf NA
C_to_N A numeric value representing the C to N ratio of the sample. 35 g/g ratio real 0 Inf NA
C_to_P A numeric value representing the C to P ratio of the sample. 35 g/g ratio real 0 Inf NA
Ca_absolute The
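Table 2 lists a minimum and a maximum value for each numeric attribute, which suggests a simple range check before values are used. The sketch below copies the bounds of three attributes from the table; treating both bounds as inclusive is an assumption, and `BOUNDS` and `in_bounds` are hypothetical names rather than part of the database.

```python
import math

# (minimum, maximum) pairs copied from Table 2 for three attributes.
BOUNDS = {
    "4_hydroxyacetophenone_mass_relative_mass": (0.0, 1.0),
    "ash_mass_absolute": (0.0, math.inf),
    "C_to_N": (0.0, math.inf),
}


def in_bounds(attribute_name, value):
    """Check a measured value against the attribute's declared range
    (both ends treated as inclusive, which is an assumption)."""
    low, high = BOUNDS[attribute_name]
    return low <= value <= high
```

A relative mass above 1 g/g, for instance, would fail the check and point to a unit or extraction error.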
IBRA version 5.1 Sub-regions, like their parent regionalisation IBRA version 5.1, represent a landscape based approach to classifying the land surface of Australia from a range of continental data on environmental attributes, at a finer scale. 354 IBRA Sub-regions have been delineated, each reflecting a unifying set of major environmental influences which shape the occurrence of flora and fauna and their interaction with the physical environment. The IBRA Version 5.1 Sub-regions are the result of refinement of the IBRA Version 4 boundaries. These refined boundaries were jointly defined by the Commonwealth, State and Territory nature and conservation agencies. Following a DEH facilitated workshop on the revision of boundaries on 24 July 2000, spatial data refinements were undertaken by DEH in conjunction with relevant State / Territory agencies. Nominal attributes for the IBRA and IBRA Sub-regions are: climate, lithology/geology, landform, vegetation, flora and fauna, and landuse. The use of these attributes varies across the States.
License: https://creativecommons.org/publicdomain/zero/1.0/
Dr. Daqing Chen, Course Director: MSc Data Science. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retail between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.
- InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
- UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).
- CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal. The name of the country where a customer resides.
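As one illustration of how these attributes combine, the sketch below totals the amount spent per customer (Quantity times UnitPrice) while skipping cancelled invoices. It is a hedged example, not a method taken from the data set's authors: the function name and sample rows are made up, values are assumed to arrive as strings, and rows are assumed to carry a CustomerID.

```python
from collections import defaultdict
from decimal import Decimal


def spend_per_customer(rows):
    """Sum Quantity * UnitPrice (sterling) per CustomerID,
    ignoring invoices whose number starts with 'c' (cancellations)."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for row in rows:
        if str(row["InvoiceNo"]).lower().startswith("c"):
            continue
        totals[row["CustomerID"]] += Decimal(row["UnitPrice"]) * int(row["Quantity"])
    return dict(totals)
```

`Decimal` is used instead of `float` so that currency sums do not pick up binary rounding noise.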
Chen, D., Sain, S.L., and Guo, K. (2012), Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208. doi: [Web Link].
Chen, D., Guo, K. and Ubakanma, G. (2015), Predicting customer profitability over time based on RFM time series, International Journal of Business Forecasting and Marketing Intelligence, Vol. 2, No. 1, pp. 1-18. doi: [Web Link].
Chen, D., Guo, K., and Li, Bo (2019), Predicting Customer Profitability Dynamically over Time: An Experimental Comparative Study, 24th Iberoamerican Congress on Pattern Recognition (CIARP 2019), Havana, Cuba, 28-31 Oct, 2019.
Laha Ale, Ning Zhang, Huici Wu, Dajiang Chen, and Tao Han, Online Proactive Caching in Mobile Edge Computing Using Bidirectional Deep Recurrent Neural Network, IEEE Internet of Things Journal, Vol. 6, Issue 3, pp. 5520-5530, 2019.
Rina Singh, Jeffrey A. Graves, Douglas A. Talbert, William Eberle, Prefix and Suffix Sequential Pattern Mining, Industrial Conference on Data Mining 2018: Advances in Data Mining. Applications and Theoretical Aspects, pp. 309-324. 2018.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IBRA version 5.1 Sub-regions, like their parent regionalisation IBRA version 5.1, represent a landscape-based approach to classifying the land surface of Australia from a range of continental data on environmental attributes, at a finer scale. 354 IBRA Sub-regions have been delineated, each reflecting a unifying set of major environmental influences which shape the occurrence of flora and fauna and their interaction with the physical environment. The IBRA Version 5.1 Sub-regions are the result of refinement of the IBRA Version 4 boundaries. These refined boundaries were jointly defined by the Commonwealth, State and Territory nature and conservation agencies. Following a DEH-facilitated workshop on the revision of boundaries on 24 July 2000, spatial data refinements were undertaken by DEH in conjunction with relevant State / Territory agencies. Nominal attributes for the IBRA and IBRA Sub-regions are: climate, lithology/geology, landform, vegetation, flora and fauna, and landuse. The use of these attributes varies across the States.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
List of potentially important attributes/interventions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is teaser data from the International Software Benchmarking Standards Group. Stored here is a small subset of the ISBSG data. The rest of the data can be accessed, for a nominal cost, from https://www.isbsg.org/.
See also
The COSMIC data set http://openscience.us/repo/effort/isbsg/cosmic.html
Reference
The International Software Benchmarking Standards Group Limited, ISBSG http://www.isbsg.org
Attribute Information
This dataset contains Acoustic Doppler Current Profiler (ADCP), serial sensor, shipboard computer system (South China Sea (SCS)) measurements, Conductivity, Temperature, Depth (CTD), and the ship load and cruise track data from aboard the R/V Point Sur for cruise DP06 for an area encompassing roughly 27°N to 28°N and 86.5°W to 90°W. Data was collected July 19-August 1, 2018. The overall purpose of this cruise is to perform deep water sampling of in-situ seawater and associated fauna. CODAS_variables: Variables in this Common Oceanographic Data Analysis System (CODAS) short-form Netcdf file are intended for most end-user scientific analysis and display purposes. For additional information see the CODAS_processing_note global attribute and the attributes of each of the variables.
============= =================================================================
time          Time at the end of the ensemble, days from start of year.
lon, lat      Longitude, Latitude from Global Positioning System (GPS) at the end of the ensemble.
u,v           Ocean zonal and meridional velocity component profiles.
uship, vship  Zonal and meridional velocity components of the ship.
heading       Mean ship heading during the ensemble.
depth         Bin centers in nominal meters (no sound speed profile correction).
tr_temp       ADCP transducer temperature.
pg            Percent Good pings for u, v averaging after editing.
pflag         Profile Flags based on editing, used to mask u, v.
amp           Received signal strength in ADCP-specific units; no correction for spreading or attenuation.
============= =================================================================
cdm_data_type=TrajectoryProfile
cdm_profile_variables=time
cdm_trajectory_variables=trajectory, latitude, longitude
comment=software: pycurrents
comment1=CODAS_variables: Variables in this CODAS short-form Netcdf file are intended for most end-user scientific analysis and display purposes. For additional information see the CODAS_processing_note global attribute and the attributes of each of the variables.
============= =================================================================
time          Time at the end of the ensemble, days from start of year.
lon, lat      Longitude, Latitude from GPS at the end of the ensemble.
u,v           Ocean zonal and meridional velocity component profiles.
uship, vship  Zonal and meridional velocity components of the ship.
heading       Mean ship heading during the ensemble.
depth         Bin centers in nominal meters (no sound speed profile correction).
tr_temp       ADCP transducer temperature.
pg            Percent Good pings for u, v averaging after editing.
pflag         Profile Flags based on editing, used to mask u, v.
amp           Received signal strength in ADCP-specific units; no correction for spreading or attenuation.
============= =================================================================
comment2=CODAS_processing_note:
The CODAS database is a specialized storage format designed for shipboard ADCP data. "CODAS processing" uses this format to hold averaged shipboard ADCP velocities and other variables during the stages of data processing. The CODAS database stores velocity profiles relative to the ship as east and north components along with position, ship speed, heading, and other variables. The netCDF short form contains ocean velocities relative to earth, time, position, transducer temperature, and ship heading; these are designed to be "ready for immediate use". The netCDF long form is just a dump of the entire CODAS database. Some variables are no longer used, and all have names derived from their original CODAS names, dating back to the late 1980s.
CODAS post-processing, i.e. that which occurs after the single-ping profiles have been vector-averaged and loaded into the CODAS database, includes editing (using automated algorithms and manual tools), rotation and scaling of the measured velocities, and application of a time-varying heading correction. Additional algorithms developed more recently include translation of the GPS positions to the transducer location, and averaging of ship's speed over the times of valid pings when Percent Good is reduced. Such post-processing is needed prior to submission of "processed ADCP data" to JASADCP or other archives.
Whenever single-ping data have been recorded, full CODAS processing provides the best end product.
Full CODAS processing starts with the single-ping velocities in beam coordinates. Based on the transducer orientation relative to the hull, the beam velocities are transformed to horizontal, vertical, and "error velocity" components. Using a reliable heading (typically from the ship's gyro compass), the velocities in ship coordinates are rotated into earth coordinates.
Pings are grouped into an "ensemble" (usually 2-5 minutes duration) and undergo a suite of automated editing algorithms (removal of acoustic interference; identification of the bottom; editing based on thresholds; and specialized editing that targets CTD wire interference and "weak, biased profiles"). The ensemble of single-ping velocities is then averaged using an iterative reference layer averaging scheme. Each ensemble is approximated as a single function of depth, with a zero-average over a reference layer plus a reference layer velocity for each ping. Adding the average of the single-ping reference layer velocities to the function of depth yields the ensemble-average velocity profile. These averaged profiles, along with ancillary measurements, are written to disk, and subsequently loaded into the CODAS database. Everything after this stage is "post-processing".
Time is stored in the database using UTC Year, Month, Day, Hour, Minute, Seconds. Floating point time "Decimal Day" is the floating point interval in days since the start of the year, usually the year of the first day of the cruise.
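The relationship between the stored UTC fields and "Decimal Day" can be sketched as follows. This is a minimal illustration, not CODAS code: the function name is hypothetical, and it assumes the count is zero-based (00:00 UTC on 1 January of the base year is day 0.0), which is how "interval in days since the start of the year" is read here.

```python
from datetime import datetime, timezone


def decimal_day(t: datetime, base_year: int) -> float:
    """Days elapsed since 00:00 UTC on 1 January of base_year.

    Assumption: zero-based, so 1 January 00:00 UTC maps to 0.0.
    """
    start = datetime(base_year, 1, 1, tzinfo=timezone.utc)
    return (t - start).total_seconds() / 86400.0


# Noon UTC on 19 July 2018, relative to the start of 2018.
t = datetime(2018, 7, 19, 12, 0, 0, tzinfo=timezone.utc)
print(decimal_day(t, 2018))  # 199.5

# A cruise that runs past New Year keeps counting from the base year.
t2 = datetime(2019, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(decimal_day(t2, 2018))  # 365.5
```

Because the base year is usually that of the first day of the cruise, values above 365 (or 366) simply indicate that the cruise crossed a year boundary.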
CODAS processing uses heading from a reliable device, and (if available) uses a time-dependent correction by an accurate heading device. The reliable heading device is typically a gyro compass (for example, the Bridge gyro). Accurate heading devices can be POSMV, Seapath, Phins, Hydrins, MAHRS, or various Ashtech devices; this varies with the technology of the time. It is always confusing to keep track of the sign of the heading correction. Headings are written in degrees, positive clockwise. Setting up some variables:
X = transducer angle (CONFIG1_heading_bias), positive clockwise (beam 3 angle relative to ship)
G = Reliable heading (gyrocompass)
A = Accurate heading
dh = G - A = time-dependent heading correction (ANCIL2_watrk_hd_misalign)
Rotation of the measured velocities into the correct coordinate system amounts to (u+i*v)*(exp(i*theta)) where theta is the sum of the corrected heading and the transducer angle.
theta = X + (G - dh) = X + G - dh
Watertrack and Bottomtrack calibrations give an indication of the residual angle offset to apply; for example, if the mean and median of the phase are both 0.5, then R = 0.5. Using the "rotate" command, the value of R is added to "ANCIL2_watrk_hd_misalign".
new_dh = dh + R
Therefore the total angle used in rotation is
new_theta = X + G - new_dh = X + G - (dh + R) = (X - R) + (G - dh)
The new estimate of the transducer angle is: X - R
ANCIL2_watrk_hd_misalign contains: dh + R
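The bookkeeping above can be checked numerically with a short sketch. It is not taken from the CODAS software: the function names and all numeric values are hypothetical, and the rotation applies the expression (u+i*v)*exp(i*theta) literally as written, without resolving how the clockwise-positive heading convention maps onto the sign of theta.

```python
import cmath
import math


def rotation_angle(X: float, G: float, dh: float) -> float:
    """theta = X + (G - dh), with all angles in degrees."""
    return X + (G - dh)


def rotate(u: float, v: float, theta_deg: float):
    """Apply (u + i*v) * exp(i*theta) and return the rotated (u, v)."""
    w = complex(u, v) * cmath.exp(1j * math.radians(theta_deg))
    return w.real, w.imag


# Hypothetical values, in degrees.
X = 45.0            # transducer angle
G = 120.0           # reliable (gyro) heading
A = 119.2           # accurate heading
dh = G - A          # time-dependent heading correction
R = 0.5             # residual offset from calibration

new_dh = dh + R
new_theta = rotation_angle(X, G, new_dh)

# Adding R to dh is equivalent to subtracting R from the transducer angle.
alt_theta = rotation_angle(X - R, G, dh)
print(math.isclose(new_theta, alt_theta))  # True

print(rotate(1.0, 0.0, new_theta))
```

Both routes give the same total angle, which is why X - R can be read as the new transducer-angle estimate even though R is stored in the heading-correction variable.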
====================================================
Profile editing flags are provided for each depth cell:
binary    decimal    below    Percent
value     value      bottom   Good       bin
-------+----------+--------+----------+-------+
000         0
001         1                            bad
010         2                  bad
011         3                  bad       bad
100         4         bad
101         5         bad                bad
110         6         bad      bad
111         7         bad      bad       bad
-------+----------+--------+----------+-------+
contributor_email=acook1@nova.edu
contributor_institution=Nova Southeastern University / Halmos College of Natural Sciences and Oceanography
contributor_name=April Cook
contributor_phone=+1-954-262-3733
contributor_role=Project Manager
contributor_role_vocabulary=https://vocab.nerc.ac.uk/collection/G04/current/
contributor_url=https://cnso.nova.edu/overview/faculty-staff-profiles/april_cook.html
Conventions=CF-1.6, ACDD-1.3, IOOS-1.2, COARDS
Country=USA
cruise_name=PS19-04_Sutton_ADCP
date_metadata_modified=2021-01-13T17:55:08Z
Easternmost_Easting=-87.30255555555556
featureType=TrajectoryProfile
geospatial_bounds=Polygon ((25.08759 -91.13281, 31.35551 -91.13281, 31.35551 -81.81641, 25.08759 -81.81641, 25.08759 -91.13281))
geospatial_bounds_crs=EPSG:4326
geospatial_bounds_vertical_crs=EPSG:5831
geospatial_lat_max=30.358338888888888
geospatial_lat_min=27.422730555555557
geospatial_lat_resolution=-1.4737620404287115E-4
geospatial_lat_units=degrees_north
geospatial_lon_max=-87.30255555555556
geospatial_lon_min=-89.09468055555556
geospatial_lon_resolution=1.7407068240401683E-4
geospatial_lon_units=degrees_east
geospatial_vertical_max=970.989990234375
geospatial_vertical_min=26.93000030517578
geospatial_vertical_positive=down
geospatial_vertical_resolution=16.0
geospatial_vertical_units=m
history=Created: 2018-08-02 14:24:54 UTC
id=PS19-04_Sutton_ADCP
infoUrl=https://data.gulfresearchinitiative.org/data/R4.x257.000:0019
institution=Nova Southeastern University / Halmos College of Natural Sciences and Oceanography
instrument=ADCP
instrument_vocabulary=GCMD Science Keywords Version 9.1.5
keywords_vocabulary=GCMD Science
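The profile editing flags tabulated earlier in this record (binary values 000 to 111) can be decoded with a few bit tests. The sketch below is an illustration only: the constant and function names are hypothetical, and the bit assignment (4 = below bottom, 2 = Percent Good, 1 = bin) is inferred from the column order of that table.

```python
# Bit assignment inferred from the flag table (an assumption, not from CODAS source).
BELOW_BOTTOM = 0b100
PERCENT_GOOD = 0b010
BAD_BIN = 0b001


def decode_pflag(flag: int) -> list:
    """Return the names of the editing criteria that marked a depth cell bad."""
    reasons = []
    if flag & BELOW_BOTTOM:
        reasons.append("below bottom")
    if flag & PERCENT_GOOD:
        reasons.append("Percent Good")
    if flag & BAD_BIN:
        reasons.append("bin")
    return reasons


def is_good(flag: int) -> bool:
    """A depth cell is kept only when no editing flag is set."""
    return flag == 0


for value in range(8):
    print(format(value, "03b"), value, decode_pflag(value))
```

Masking u and v wherever `is_good` is False corresponds to how pflag is described in the variable table: flags based on editing, used to mask u, v.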
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nominal transport risk value of hazardous materials.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.
This dataset is interesting because there is a good mix of attributes -- continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values.