45 datasets found
  1. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    zip (23875170 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all transactions that occurred over a period of time. The retailer will use the results to grow the business and offer customers itemset suggestions, which lets us increase customer engagement, improve customer experience, and identify customer behaviour. I will solve this problem with association rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rules are most useful when you want to discover associations between different objects in a set, i.e., to find frequent patterns in a transaction database. They tell you which items customers frequently buy together and allow the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.80; lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9. This is just a simple example. In practice, a rule typically needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
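
    The same measures can be reproduced with the arules package on a toy set of transactions. A minimal sketch, assuming arules is installed; the basket composition below is illustrative only:

      library(arules)

      # 100 toy baskets matching the worked example: 8 contain both items,
      # 2 only the mouse, 1 only the mat, and 89 an unrelated filler item
      baskets <- c(
        rep(list(c("computer mouse", "mouse mat")), 8),
        rep(list("computer mouse"), 2),
        rep(list("mouse mat"), 1),
        rep(list("filler item"), 89)
      )
      trans <- as(baskets, "transactions")

      # Mine rules with low thresholds so the example rule survives
      rules <- apriori(trans, parameter = list(supp = 0.05, conf = 0.5))
      inspect(rules)  # {computer mouse} => {mouse mat}: support 0.08, confidence 0.80, lift ~8.9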

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data, so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    [Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png]

    Libraries in R

    First, we need to load the required libraries. Each is briefly described below, and a short loading sketch follows the list.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science; the package makes it easy to install and load multiple 'tidyverse' packages in a single step.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
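
    A minimal loading sketch, assuming the packages above are already installed:

      # Load the libraries used throughout the analysis
      library(arules)
      library(arulesViz)
      library(readxl)
      library(knitr)
      library(ggplot2)
      library(plyr)      # load plyr before dplyr to avoid masking issues
      library(dplyr)
      library(magrittr)
      library(tidyverse)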

    [Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png]

    Data Pre-processing

    Next, we load Assignment-1_Data.xlsx into R and read the dataset. Now we can see our data in R.
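
    A minimal reading sketch with readxl, assuming the file sits in the working directory (path and sheet are assumptions):

      library(readxl)

      retaildata <- read_excel("Assignment-1_Data.xlsx")
      str(retaildata)   # expect 522065 rows and 7 attributes
      head(retaildata)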

    [Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png]
    [Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png]

    Next, we clean the data frame by removing missing values.

    [Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png]

    To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
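
    A minimal sketch of the conversion and of running apriori, assuming the cleaned data frame is called retaildata and uses the BillNo and Itemname columns described above; the support and confidence thresholds are illustrative:

      library(arules)

      # Drop rows with missing values before building baskets
      retaildata <- retaildata[complete.cases(retaildata), ]

      # One basket per invoice: group item names by BillNo and coerce to transactions
      baskets <- lapply(split(retaildata$Itemname, retaildata$BillNo), unique)
      trans   <- as(baskets, "transactions")
      summary(trans)

      # Mine association rules (thresholds are illustrative)
      rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.8))
      inspect(head(sort(rules, by = "lift"), 10))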

  2. The dataset for Example 3 of Table 3.

    • plos.figshare.com
    txt
    Updated Nov 30, 2023
    + more versions
    Cite
    Razaw Al-Sarraj; Johannes Forkman (2023). The dataset for Example 3 of Table 3. [Dataset]. http://doi.org/10.1371/journal.pone.0295066.s005
    Explore at:
    txt
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Razaw Al-Sarraj; Johannes Forkman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    It is commonly believed that if a two-way analysis of variance (ANOVA) is carried out in R, then reported p-values are correct. This article shows that this is not always the case. Results can vary from non-significant to highly significant, depending on the choice of options. The user must know exactly which options result in correct p-values, and which options do not. Furthermore, it is commonly supposed that analyses in SAS and R of simple balanced experiments using mixed-effects models result in correct p-values. However, the simulation study of the current article indicates that frequency of Type I error deviates from the nominal value. The objective of this article is to compare SAS and R with respect to correctness of results when analyzing small experiments. It is concluded that modern functions and procedures for analysis of mixed-effects models are sometimes not as reliable as traditional ANOVA based on simple computations of sums of squares.
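
    As a hedged illustration of the general point about option choices (not the authors' own example), in an unbalanced two-way layout base R's sequential (Type I) sums of squares give different p-values depending on the order in which the factors are entered:

      # A small unbalanced two-way layout: with sequential (Type I) sums of squares,
      # the p-values depend on the order in which A and B enter the model
      d <- data.frame(
        A = factor(c("a1", "a1", "a1", "a1", "a2", "a2", "a2")),
        B = factor(c("b1", "b1", "b2", "b2", "b1", "b2", "b2")),
        y = c(4.1, 4.4, 6.0, 5.8, 7.2, 9.5, 9.1)
      )

      anova(lm(y ~ A + B, data = d))  # A entered first
      anova(lm(y ~ B + A, data = d))  # B entered first: different p-values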

  3. Global monthly catch of tuna, tuna-like and shark species (1950-2021) by 1°...

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). Global monthly catch of tuna, tuna-like and shark species (1950-2021) by 1° or 5° squares (IRD level 2) - and efforts level 0 (1950-2023) [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-15221705?locale=da
    Explore at:
    unknown (21391)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Major differences from previous work:

    For level 2 catch:

    • Catches in tons, raised to match nominal values, now take the geographic area of the nominal data into account for improved accuracy.
    • Captures in "Number of fish" are converted to weight based on nominal data. The conversion factors used in the previous version are no longer used, as they did not adequately represent the diversity of captures.
    • Numbers of fish without corresponding nominal data are no longer removed as they were before, creating a large difference for this measurement_unit between the two datasets.
    • Nominal data from WCPFC include fishing fleet information, and georeferenced data have been raised based on this instead of solely on the year/gear/species triplet, to avoid random reallocations.
    • Strata for which catches in tons are raised to match nominal data have had their numbers removed.
    • Raising only applies to complete years to avoid overrepresenting specific months, particularly in the early years of georeferenced reporting.
    • Strata where georeferenced data exceed nominal data have not been adjusted downward, as it is unclear whether these discrepancies arise from missing nominal data or from different aggregation methods in the two datasets.
    • The data are not aggregated to 5-degree squares and thus remain spatially unharmonized. Aggregation can be performed using CWP codes for geographic identifiers; for example, an R function is available (see the sketch after this description): source("https://raw.githubusercontent.com/firms-gta/geoflow-tunaatlas/master/sardara_functions/transform_cwp_code_from_1deg_to_5deg.R")

    The level 0 dataset has also been modified, creating differences in this new version, notably:

    • The species retained are different; only 32 major species are kept.
    • Mappings have been somewhat modified based on new standards implemented by FIRMS.
    • New rules have been applied for overlapping areas.
    • Data are only displayed in 1-degree and 5-degree square areas.
    • The data are enriched with "Species group" and "Gear labels" using the fdiwg standards.

    These main differences are recapped in Differences_v2018_v2024.zip.

    Recommendations:

    • To avoid converting data from numbers using nominal strata, we recommend the use of conversion factors, which could be provided by tRFMOs.
    • In some strata, nominal data appear higher than georeferenced data, as observed during level 2 processing. These discrepancies may result from errors or from differences in aggregation methods. Further analysis will examine these differences in detail to refine treatments accordingly. A summary of differences by tRFMO, based on the number of strata, is included in the appendix.
    • Some nominal data have no equivalent in georeferenced data and therefore cannot be disaggregated. One option would be to check, for each nominal record without an equivalent, whether georeferenced data exist within different buffers, average the distribution of this footprint, and then disaggregate the nominal data based on the georeferenced data. This would lead to the creation of data (approximately 3%) and would require reducing or removing all georeferenced data without a nominal equivalent or with a lesser equivalent. Tests are currently being conducted with and without this option. It would help improve the captured-biomass footprint but could lead to unexpected discrepancies with current datasets.

    For level 0 effort:

    In some datasets, namely those from ICCAT and the purse seine (PS) data from WCPFC, the same effort data has been reported multiple times using different units, and these records have been kept as is, since no official mapping allows conversion between the units. As a result, users should be reminded that some ICCAT and WCPFC effort data are deliberately duplicated:

    • In the case of ICCAT data, lines with identical strata but different effort units are duplicates reporting the same fishing activity with different measurement units. It is not possible to infer strict equivalence between units, as some contain information about others (e.g., Hours.FAD and Hours.FSC may inform Hours.STD).
    • In the case of WCPFC data, effort records were also kept in all originally reported units. Here, duplicates do not necessarily share the same "fishing_mode", as SETS for purse seiners are reported with an explicit association to fishing_mode, while DAYS are not. This distinction allows SETS records to be separated by fishing mode, whereas DAYS records remain aggregated.

    Some limited harmonization, particularly between units such as NET-days and Nets, has not been implemented in the current version of the dataset, but may be considered in future releases if a consistent relationship can be established.
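
    A minimal usage sketch, assuming the sourced script defines transform_cwp_code_from_1deg_to_5deg and that the catch table has a CWP geographic identifier column; the column names and the function's exact signature are assumptions and should be checked in the script itself:

      # Load the helper published alongside the dataset
      source("https://raw.githubusercontent.com/firms-gta/geoflow-tunaatlas/master/sardara_functions/transform_cwp_code_from_1deg_to_5deg.R")

      # Hypothetical call: map each 1-degree CWP code to its enclosing 5-degree code,
      # then aggregate catches on the coarser grid (column names are illustrative)
      catch$geographic_identifier_5deg <- sapply(
        catch$geographic_identifier,
        transform_cwp_code_from_1deg_to_5deg
      )
      catch_5deg <- aggregate(value ~ geographic_identifier_5deg + species + year,
                              data = catch, FUN = sum)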

  4. 💰 Global GDP Dataset (Latest)

    • kaggle.com
    zip
    Updated Oct 17, 2025
    Cite
    Asadullah Shehbaz (2025). 💰 Global GDP Dataset (Latest) [Dataset]. https://www.kaggle.com/datasets/asadullahcreative/global-gdp-explorer-2024-world-bank-un-data
    Explore at:
    zip (6672 bytes)
    Dataset updated
    Oct 17, 2025
    Authors
    Asadullah Shehbaz
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧾 About Dataset

    🌍 Global GDP by Country — 2024 Edition

    📖 Overview

    The Global GDP by Country (2024) dataset provides an up-to-date snapshot of worldwide economic performance, summarizing each country’s nominal GDP, growth rate, population, and global economic contribution.

    This dataset is ideal for economic analysis, data visualization, policy modeling, and machine learning applications related to global development and financial forecasting.

    📊 Dataset Information

    • Total Records: 181 countries
    • Time Period: 2024 (latest available global data)
    • Geographic Coverage: Worldwide
    • File Format: CSV
    • File Size: ~10 KB
    • Missing Values: None (100% complete dataset)

    🎯 Target Use-Cases:
    - Economic growth trend analysis
    - GDP-based country clustering
    - Per capita wealth comparison
    - Share of world economy visualization

    🧩 Key Features

    • Country: Official country name
    • GDP (nominal, 2023): Total nominal GDP in USD
    • GDP (abbrev.): Simplified GDP format (e.g., “$25.46 Trillion”)
    • GDP Growth: Annual GDP growth rate (%)
    • Population 2023: Estimated population for 2023
    • GDP per capita: Average income per person (USD)
    • Share of World GDP: Percentage contribution to global GDP

    📈 Statistical Summary

    Population Overview

    • Mean Population: 43.6 million
    • Standard Deviation: 155.5 million
    • Minimum Population: 9,816 (small island nations)
    • Median Population: 9.1 million
    • Maximum Population: 1.43 billion (China)

    🌟 Highlights

    💰 Top Economies (Nominal GDP):
    United States, China, Japan, Germany, India

    📈 Fastest Growing Economies:
    India, Bangladesh, Vietnam, and Rwanda

    🌐 Global Insights:
    - The dataset covers 181 countries representing 100% of global GDP.
    - Suitable for data visualization dashboards, AI-driven economic forecasting, and educational research.

    💡 Example Use-Cases

    • Build a choropleth map showing GDP distribution across continents.
    • Train a regression model to predict GDP per capita based on population and growth.
    • Compare economic inequality using population vs GDP share.

    📚 Dataset Citation

    Source: Worldometers — GDP by Country (2024)
    Dataset compiled and cleaned by: Asadullah Shehbaz
    For open research and data analysis.

  5. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    bin
    Updated Jul 12, 2024
    + more versions
    Cite
    Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
    Explore at:
    bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Solenix Engineering GmbH
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are sampled at 1 Hz.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable to start learning the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable to evaluate both semi-supervised approaches (novelty detection) as well as unsupervised approaches (outlier detection).
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" to detect by eye (i.e., there are very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting those obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage, since users of the dataset can decide to add any type of noise, at any amplitude, on top of the provided series. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise (see the sketch after this list).
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
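
    A minimal sketch of such a robustness experiment in R, assuming the telemetry has already been loaded into a data frame; the column name and noise level are illustrative:

      # Add zero-mean Gaussian noise of a chosen relative amplitude to one clean channel
      add_noise <- function(x, sd_fraction = 0.05) {
        x + rnorm(length(x), mean = 0, sd = sd_fraction * sd(x))
      }

      clean <- telemetry$sensor_01                 # hypothetical column of the CATS data
      noisy <- add_noise(clean, sd_fraction = 0.05)

      # Re-run the detector on 'noisy' and compare its scores against the clean baseline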

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  6. Fundamental Data Record for Atmospheric Composition [ATMOS_L1B]

    • earth.esa.int
    Updated Jul 1, 2024
    + more versions
    Cite
    European Space Agency (2024). Fundamental Data Record for Atmospheric Composition [ATMOS_L1B] [Dataset]. https://earth.esa.int/eogateway/catalog/fdr-for-atmospheric-composition
    Explore at:
    Dataset updated
    Jul 1, 2024
    Dataset authored and provided by
    European Space Agency (http://www.esa.int/)
    License

    https://earth.esa.int/eogateway/documents/20142/1564626/Terms-and-Conditions-for-the-use-of-ESA-Data.pdf

    Time period covered
    Jun 28, 1995 - Apr 7, 2012
    Description

    The Fundamental Data Record (FDR) for Atmospheric Composition UVN v.1.0 dataset is a cross-instrument Level-1 product [ATMOS_L1B] generated in 2023 and resulting from the ESA FDR4ATMOS project. The FDR contains selected Earth Observation Level 1b parameters (irradiance/reflectance) from the nadir-looking measurements of the ERS-2 GOME and Envisat SCIAMACHY missions for the period 1995 to 2012. The data record offers harmonised, cross-calibrated spectra focused on spectral windows in the ultraviolet-visible-near-infrared regions for the retrieval of critical atmospheric constituents such as ozone (O3), sulphur dioxide (SO2) and nitrogen dioxide (NO2) column densities, alongside cloud parameters. The FDR4ATMOS products should be regarded as experimental due to the innovative approach and the current use of a limited-sized test dataset to investigate the impact of harmonization on the Level 2 target species, specifically SO2, O3 and NO2. Presently, this analysis is being carried out within follow-on activities. The FDR4ATMOS V1 is currently being extended to include the MetOp GOME-2 series.

    Product format

    For many aspects, the FDR product has improved compared to the existing individual mission datasets:

    • GOME solar irradiances are harmonised using a validated SCIAMACHY solar reference spectrum, solving the problem of the fast-changing etalon present in the original GOME Level 1b data.
    • Reflectances for both GOME and SCIAMACHY are provided in the FDR product. GOME reflectances are harmonised to degradation-corrected SCIAMACHY values, using collocated data from the CEOS PIC sites.
    • SCIAMACHY data are scaled to the lowest integration time within the spectral band using high-frequency PMD measurements from the same wavelength range. This simplifies the use of the SCIAMACHY spectra, which were split into a complex cluster structure (each with its own integration time) in the original Level 1b data.
    • The harmonization process applied mitigates the viewing angle dependency observed in the UV spectral region for GOME data.
    • Uncertainties are provided.

    Each FDR product provides, within the same file, irradiance/reflectance data for the UV-VIS-NIR spectral regions across all orbits on a single day, including therein information from the individual ERS-2 GOME and Envisat SCIAMACHY measurements. The FDR has been generated in two formats, Level 1A and Level 1B, targeting expert users and nominal applications respectively. The Level 1A [ATMOS_L1A] data include additional parameters such as harmonisation factors, PMD, and polarisation data extracted from the original mission Level 1 products. The ATMOS_L1A dataset is not part of the nominal dissemination to users; in case of specific requirements, please contact EOHelp. Please refer to the README file for essential guidance before using the data. All the new products are conveniently formatted in NetCDF. Free standard tools, such as Panoply, can be used to read NetCDF data. Panoply is sourced and updated by external entities. For further details, please consult our Terms and Conditions page.

    Uncertainty characterisation

    One of the main aspects of the project was the characterization of Level 1 uncertainties for both instruments, based on metrological best practices.
    The following documents are provided:

    • General guidance on a metrological approach to Fundamental Data Records (FDR)
    • Uncertainty Characterisation document
    • Effect tables
    • NetCDF files containing example uncertainty propagation analysis and spectral error correlation matrices for SCIAMACHY (Atlantic and Mauretania scenes for 2003 and 2010) and GOME (Atlantic scene for 2003): reflectance_uncertainty_example_FDR4ATMOS_GOME.nc, reflectance_uncertainty_example_FDR4ATMOS_SCIA.nc

    Known Issues

    Non-monotonous wavelength axis for SCIAMACHY in FDR data version 1.0: In the SCIAMACHY OBSERVATION group of the atmospheric FDR v1.0 dataset (DOI: 10.5270/ESA-852456e), the wavelength axis (lambda variable) is not monotonically increasing. This issue affects all spectral channels (UV, VIS, NIR) in the SCIAMACHY group, while GOME OBSERVATION data remain unaffected. The root cause of the issue lies in the incorrect indexing of the lambda variable during the NetCDF writing process. Notably, the wavelength values themselves are calculated correctly within the processing chain.

    Temporary Workaround

    The wavelength axis is correct in the first record of each product. As a workaround, users can extract the wavelength axis from the first record and apply it to all subsequent measurements within the same product. The first record can be retrieved by setting the first two indices (time and scanline) to 0 (assuming counting of array indices starts at 0). Note that this process must be repeated separately for each spectral range (UV, VIS, NIR) and every daily product. Since the wavelength axis of SCIAMACHY is highly stable over time, using the first record introduces no expected impact on retrieval results. Python pseudo-code example: lambda_...

  7. Banking Dataset Classification

    • kaggle.com
    Updated Sep 6, 2020
    Cite
    Rashmi (2020). Banking Dataset Classification [Dataset]. https://www.kaggle.com/datasets/rashmiranu/banking-dataset-classification
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 6, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rashmi
    Description

    About Dataset

    There has been a revenue decline at the Portuguese bank, and they would like to know what actions to take. After investigation, they found that the root cause was that their customers are not investing enough in long-term deposits. The bank would therefore like to identify existing customers that have a higher chance of subscribing to a long-term deposit and focus marketing efforts on those customers.

    Data Set Information

    The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed to ('yes') or not ('no').

    There are two datasets: train.csv, with all 32,950 examples and 21 inputs including the target feature, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014];

    and test.csv, the test data, which consists of 8,238 observations and 20 features without the target feature.

    Goal: The classification goal is to predict whether the client will subscribe ('yes'/'no') to a term deposit (variable y). A minimal modelling sketch follows the feature lists below.

    The dataset contains train and test data. The features of the train data are listed below; the test data have already been preprocessed.

    Features

    Feature (type): description

    • age (numeric): age of a person
    • job (categorical, nominal): type of job ('admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
    • marital (categorical, nominal): marital status ('divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed)
    • education (categorical, nominal): ('basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
    • default (categorical, nominal): has credit in default? ('no', 'yes', 'unknown')
    • housing (categorical, nominal): has housing loan? ('no', 'yes', 'unknown')
    • loan (categorical, nominal): has personal loan? ('no', 'yes', 'unknown')
    • contact (categorical, nominal): contact communication type ('cellular', 'telephone')
    • month (categorical, ordinal): last contact month of year ('jan', 'feb', 'mar', ..., 'nov', 'dec')
    • day_of_week (categorical, ordinal): last contact day of the week ('mon', 'tue', 'wed', 'thu', 'fri')
    • duration (numeric): last contact duration, in seconds. Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no')
    • campaign (numeric): number of contacts performed during this campaign and for this client (includes last contact)
    • pdays (numeric): number of days that passed after the client was last contacted in a previous campaign (999 means the client was not previously contacted)
    • previous (numeric): number of contacts performed before this campaign and for this client
    • poutcome (categorical, nominal): outcome of the previous marketing campaign ('failure', 'nonexistent', 'success')

    Target variable (desired output):

    • y (binary): has the client subscribed to a term deposit? ('yes', 'no')
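
    A minimal modelling sketch in R, assuming train.csv and test.csv sit in the working directory and that y is coded as 'yes'/'no'; file layout details are assumptions:

      train <- read.csv("train.csv", stringsAsFactors = TRUE)

      # Logistic regression on all available inputs; y must be a factor ('no'/'yes')
      model <- glm(y ~ ., data = train, family = binomial)
      summary(model)

      # Predicted subscription probabilities for the held-out test set
      test  <- read.csv("test.csv", stringsAsFactors = TRUE)
      p_hat <- predict(model, newdata = test, type = "response")
      head(p_hat)
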
  8. Global monthly catch of tuna, tuna-like and shark species (1950-2023) by 1°...

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). Global monthly catch of tuna, tuna-like and shark species (1950-2023) by 1° or 5° squares (IRD level 2) - and efforts level 0 (1950-2023) [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-15405414?locale=fi
    Explore at:
    unknown (2677816)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Major differences from v1:

    For level 2 catch:

    • Catches and numbers raised to nominal are only raised to exactly matching strata or, if none exist, to a stratum corresponding to UNK/NEI or 99.9. (new feature in v4)
    • When nominal strata lack specific dimensions (e.g., fishing_mode always UNK) but georeferenced strata include them, the nominal data are "upgraded" to match, preventing loss of detail. Currently this adjustment aligns nominal values to georeferenced totals; future versions may apply proportional scaling. This does not create a direct raising but rather allows more precise reallocation. (new feature in v4)
    • IATTC purse seine catch-and-effort data are available in 3 separate files according to the group of species: tuna, billfishes, sharks. This is because PS data are collected from 2 sources: observers and fishing vessel logbooks. Observer records are used when available, and logbooks are used for unobserved trips. Both sources collect tuna data, but only observers collect shark and billfish data. For example, a stratum may have observer effort, and the number of sets from the observed trips would be counted for tuna, shark and billfish; but there may also have been logbook data for unobserved sets in the same stratum, so the tuna catch and number of sets for a cell would be added, making a higher total number of sets for tuna catch than for shark or billfish. Efforts in the billfish and shark datasets might hence represent only a proportion of the total effort allocated in some strata, since it is the observed effort, i.e. effort for which there was an observer on board. As a result, catch in the billfish and shark datasets might represent only a proportion of the total catch allocated in some strata. Hence, shark and billfish catch were raised to the fishing effort reported in the tuna dataset. (new feature in v4, was done in FIRMS Level 0 before)
    • Data with a resolution of 10deg x 10deg are removed; disaggregating them is being considered for the next versions.
    • Catches in tons, raised to match nominal values, now consider the geographic area of the nominal data for improved accuracy. (as v3)
    • Captures in "Number of fish" are converted to weight based on nominal data. The conversion factors used in the previous version are no longer used, as they did not adequately represent the diversity of captures. (as v3)
    • Numbers of fish without corresponding nominal data are not removed as they were before, creating a large difference for this measurement_unit between the two datasets. (as v3)
    • Strata for which catches in tons are raised to match nominal data have had their numbers removed. (as v3)
    • Raising only applies to complete years to avoid overrepresenting specific months, particularly in the early years of georeferenced reporting. (as v3)
    • Strata where georeferenced data exceed nominal data have not been adjusted downward, as it is unclear whether these discrepancies arise from missing nominal data or from different aggregation methods in the two datasets. (as v3)
    • The data are not aggregated to 5-degree squares and thus remain spatially unharmonized. Aggregation can be performed using CWP codes for geographic identifiers; for example, an R function is available: source("https://raw.githubusercontent.com/firms-gta/geoflow-tunaatlas/master/sardara_functions/transform_cwp_code_from_1deg_to_5deg.R") (as v3)
    • This results in a raising of the data compared to v3 for IOTC, ICCAT, IATTC and WCPFC. However, as the raising is more specific for CCSBT, it is 22% lower than in the previous version.

    The level 0 dataset has also been modified, creating differences in this new version, notably:

    • The species retained are different; only 32 major species are kept.
    • Mappings have been somewhat modified based on new standards implemented by FIRMS.
    • New rules have been applied for overlapping areas.
    • Data are only displayed in 1-degree and 5-degree square areas.
    • The data are enriched with "Species group" and "Gear labels" using the fdiwg standards.

    These main differences are recapped in Differences_v2018_v2024.zip.

    Recommendations:

    • To avoid converting data from numbers using nominal strata, we recommend the use of conversion factors, which could be provided by tRFMOs.
    • In some strata, nominal data appear higher than georeferenced data, as observed during level 2 processing. These discrepancies may result from errors or from differences in aggregation methods. Further analysis will examine these differences in detail to refine treatments accordingly. A summary of differences by tRFMO, based on the number of strata, is included in the appendix.

    For level 0 effort:

    In some datasets, namely those from ICCAT and the purse seine (PS) data from WCPFC, the same effort data has been reported multiple times using different units, and these records have been kept as is, since no official mapping allows conversion between the units. As a result, users should be reminded that some ICCAT and WCPFC effort data are deliberately duplicated: in the case of ICCAT data, lines wi

  9. MKAD (Open Sourced Code) - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). MKAD (Open Sourced Code) - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/mkad-open-sourced-code
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Area covered
    MKAD
    Description

    The Multiple Kernel Anomaly Detection (MKAD) algorithm is designed for anomaly detection over a set of files. It combines multiple kernels into a single optimization function using the One Class Support Vector Machine (OCSVM) framework. Any kernel function can be combined in the algorithm as long as it meets the Mercer conditions; however, for the purposes of this code, the data preformatting and kernel type are specific to Flight Operations Quality Assurance (FOQA) data and have been integrated into the coding steps. For this domain, discrete binary switch sequences are used in the discrete kernel, and discretized continuous parameter features are used to form the continuous kernel. The OCSVM uses a training set of nominal examples (in this case flights) and evaluates test examples to determine whether they are anomalous or not. After completing this analysis, the algorithm reports the anomalous examples and determines whether there is a contribution from the continuous elements, the discrete elements, or both.

  10. Degradation Measurement of Robot Arm Position Accuracy

    • data.nist.gov
    • catalog.data.gov
    Updated Sep 7, 2018
    Cite
    Helen Qiao (2018). Degradation Measurement of Robot Arm Position Accuracy [Dataset]. http://doi.org/10.18434/M31962
    Explore at:
    Dataset updated
    Sep 7, 2018
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Helen Qiao
    License

    https://www.nist.gov/open/license

    Description

    The dataset contains both the robot's high-level tool center position (TCP) health data and controller-level component information (i.e., joint positions, velocities, currents, and temperatures). The datasets can be used by users (e.g., software developers, data scientists) who work on robot health management (including accuracy) but have limited or no access to robots that can capture real data. The datasets can support:

    • the development of robot health monitoring algorithms and tools;
    • research into technologies and tools to support robot monitoring, diagnostics, prognostics, and health management (collectively called PHM);
    • the validation and verification of industrial PHM implementations, for example, the verification of a robot's TCP accuracy after the work cell has been reconfigured, or whenever a manufacturer wants to determine if the robot arm has experienced a degradation.

    For data collection, a trajectory is programmed for the Universal Robot (UR5), approaching and stopping at randomly selected locations in its workspace. The robot moves along this preprogrammed trajectory under different conditions of temperature, payload, and speed. The TCP (x, y, z) positions of the robot are measured by a 7-D measurement system developed at NIST. Differences are calculated between the positions measured by the 7-D measurement system and the nominal positions calculated from the nominal robot kinematic parameters, and the results are recorded within the dataset. Controller-level sensing data are also collected from each joint (direct output from the controller of the UR5) to understand the influence of temperature, payload, and speed on position degradation. Controller-level data can be used for root cause analysis of robot performance degradation, by providing joint positions, velocities, currents, accelerations, torques, and temperatures. For example, the cold-start temperatures of the six joints were approximately 25 degrees Celsius; after two hours of operation, the joint temperatures increased to approximately 35 degrees Celsius. Control variables are listed in the header file in the data set (UR5TestResult_header.xlsx). If you'd like to comment on this data and/or offer recommendations on future datasets, please email guixiu.qiao@nist.gov.

  11. Various Aspects of Indian States

    • kaggle.com
    zip
    Updated Oct 19, 2021
    Cite
    Sayantan Sadhu (2021). Various Aspects of Indian States [Dataset]. https://www.kaggle.com/datasets/sayantansadhu/various-aspects-of-indian-states/discussion
    Explore at:
    zip (1507 bytes)
    Dataset updated
    Oct 19, 2021
    Authors
    Sayantan Sadhu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    India
    Description

    Context

    Every politician lies, but data doesn't. So I collected data on some of the important metrics of all the Indian states to check what is good and bad in each of them. The data is mostly scraped from Wikipedia, so it can be a little inconsistent; however, I will improve that in subsequent versions.

    Content

    The dataset contains data on metrics like HDI (Human Development Index), nominal GDP, crime rate, percentage of population below the poverty line, and unemployment rate for all the states of India.

    Acknowledgements

    Most of the data is scraped from Wikipedia, so thanks to them for providing the data; however, I wish they would improve its accuracy.

    Inspiration

    1. Feel free to play around with the data and check where each state stands on all the metrics.
    2. Try finding out why some states are at the top of some metrics while at the bottom of others.
    3. See if there's any correlation between different metrics. For example, I am very interested in whether there's any correlation between HDI and unemployment, HDI and nominal GDP, or HDI and poverty.
  12. Chen, C., Kyathanahally, S., Reyes, M., Merkli, S., Merz, E., Francazi, E.,...

    • opendata.eawag.ch
    • opendata-stage.eawag.ch
    Updated Nov 27, 2024
    Cite
    (2024). Chen, C., Kyathanahally, S., Reyes, M., Merkli, S., Merz, E., Francazi, E., et al. (2024). Data for: Producing Plankton Classifiers that are Robust to Dataset Shift (Version 1.0). Eawag: Swiss Federal Institute of Aquatic Science and Technology. https://doi.org/10.25678/000C6M [Dataset]. https://opendata.eawag.ch/dataset/data-for-producing-plankton-classifiers-that-are-robust-to-dataset-shift
    Explore at:
    Dataset updated
    Nov 27, 2024
    Description

    Modern plankton high-throughput monitoring relies on deep learning classifiers for species recognition in water ecosystems. Despite satisfactory nominal performances, a significant challenge arises from the dataset shift, where performance drops during real-world deployment compared to ideal testing conditions. In our study, we integrate the ZooLake dataset, which consists of dark-field images of lake plankton, with manually-annotated images from 10 independent days of deployment, serving as test cells to benchmark out-of-dataset (OOD) performances. Our analysis reveals instances where classifiers, initially performing well in ideal conditions, encounter notable failures in real-world scenarios. For example, a MobileNet with a 92% nominal test accuracy shows a 77% OOD accuracy. We systematically investigate conditions leading to OOD performance drops and propose a preemptive assessment method to identify potential pitfalls when classifying new data, and pinpoint features in OOD images that adversely impact classification. We present a three-step pipeline: (i) identifying OOD degradation compared to nominal test performance, (ii) conducting a diagnostic analysis of degradation causes, and (iii) providing solutions. We find that ensembles of BEiT vision transformers, with targeted augmentations addressing OOD robustness, geometric ensembling, and rotation-based test-time augmentation, constitute the most robust model. It achieves an 83% OOD accuracy, with errors concentrated on container classes. Moreover, it exhibits lower sensitivity to dataset shift, and reproduces well the plankton abundances. Our proposed pipeline is applicable to generic plankton classifiers, contingent on the availability of suitable test cells. Implementation of this pipeline is anticipated to usher in a new era of robust classifiers, resilient to dataset shift, and capable of delivering reliable plankton abundance data. By identifying critical shortcomings and offering practical procedures to fortify models against dataset shift, our study contributes to the development of more reliable plankton classification technologies.

  13. Data from: bicycle store dataset

    • kaggle.com
    zip
    Updated Sep 11, 2020
    Cite
    Rohit Sahoo (2020). bicycle store dataset [Dataset]. https://www.kaggle.com/rohitsahoo/bicycle-store-dataset
    Explore at:
    zip (682639 bytes)
    Dataset updated
    Sep 11, 2020
    Authors
    Rohit Sahoo
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    Perform Exploratory Data Analysis on the Bicycle Store Dataset!

    DATA EXPLORATION: Understand the characteristics of the given fields in the underlying data, such as variable distributions, whether the dataset is skewed towards a certain demographic, and the data validity of the fields. For example, a training dataset may be highly skewed towards the younger age bracket; if so, how will this impact your results when using it to predict over the remaining customer base? Identify limitations surrounding the data and gather external data which may be useful for modelling purposes. This may include bringing in ABS data at different geographic levels and creating additional features for the model. For example, the geographic remoteness of different postcodes may be used as an indicator of proximity when considering whether a customer needs a bike to ride to work.

    MODEL DEVELOPMENT: Determine a hypothesis related to the business question that can be answered with the data. Perform statistical testing to determine whether the hypothesis is valid. Create calculated fields based on existing data, for example, convert the D.O.B into an age bracket (see the sketch below). Other fields that may be engineered include 'High Margin Product', which may indicate whether the product purchased by the customer was in a high-margin category in the past three months, based on the fields 'list_price' and 'standard cost'. Other examples include calculating the distance from office to home address as a factor in determining whether customers may purchase a bicycle for transportation purposes. Additionally, this may include thoughts around determining what the predicted variable actually is; for example, are results predicted in ordinal buckets, nominal, binary or continuous? Test the performance of the model using factors relevant for the given model chosen (i.e. residual deviance, AIC, ROC curves, R squared). Appropriately document model performance, assumptions and limitations.
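
    A minimal feature-engineering sketch in R, assuming a customer table with a date-of-birth column and a transaction table with list_price and standard_cost columns; all column and table names are assumptions:

      library(dplyr)

      # Age and age bracket from date of birth (column 'dob' is hypothetical)
      customers <- customers %>%
        mutate(
          age         = as.integer(floor(as.numeric(Sys.Date() - as.Date(dob)) / 365.25)),
          age_bracket = cut(age, breaks = c(0, 25, 35, 45, 55, 65, Inf),
                            labels = c("<25", "25-34", "35-44", "45-54", "55-64", "65+"))
        )

      # A simple high-margin flag from list price and standard cost
      transactions <- transactions %>%
        mutate(
          margin              = list_price - standard_cost,
          high_margin_product = margin > quantile(margin, 0.75, na.rm = TRUE)
        )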

    INTERPRETATION AND REPORTING: Visualisation and presentation of findings. This may involve interpreting the significant variables and coefficients from a business perspective. The slides should tell a compelling story around the business issue and support your case with quantitative and qualitative observations. Please refer to the module below for further details.

    Content

    The dataset is easy to understand and self-explanatory!

    Inspiration

    It is important to keep in mind the business context when presenting your findings: 1. What are the trends in the underlying data? 2. Which customer segment has the highest customer value? 3. What do you propose should be the marketing and growth strategy?

  14. MKAD (Open Sourced Code)

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). MKAD (Open Sourced Code) [Dataset]. https://catalog.data.gov/dataset/mkad-open-sourced-code
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Area covered
    MKAD
    Description

    The Multiple Kernel Anomaly Detection (MKAD) algorithm is designed for anomaly detection over a set of files. It combines multiple kernels into a single optimization function using the One Class Support Vector Machine (OCSVM) framework. Any kernel function can be combined in the algorithm as long as it meets the Mercer conditions; however, for the purposes of this code, the data preformatting and kernel type are specific to Flight Operations Quality Assurance (FOQA) data and have been integrated into the coding steps. For this domain, discrete binary switch sequences are used in the discrete kernel, and discretized continuous parameter features are used to form the continuous kernel. The OCSVM uses a training set of nominal examples (in this case flights) and evaluates test examples to determine whether they are anomalous or not. After completing this analysis, the algorithm reports the anomalous examples and determines whether there is a contribution from the continuous elements, the discrete elements, or both.

  15. Fault Adaptive Control of Overactuated Systems Using Prognostic Estimation -...

    • data.nasa.gov
    Updated Mar 31, 2025
    + more versions
    Cite
    nasa.gov (2025). Fault Adaptive Control of Overactuated Systems Using Prognostic Estimation - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/fault-adaptive-control-of-overactuated-systems-using-prognostic-estimation
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Most fault adaptive control research addresses the preservation of system stability or functionality in the presence of a specific failure (fault). This paper examines the fault adaptive control problem for a generic class of incipient failure modes, which do not initially affect system stability, but will eventually cause a catastrophic failure to occur. The risk of catastrophic failure due to a component fault mode is some monotonically increasing function of the load on the component. Assuming that a probabilistic prognostic model is available to evaluate the risk of incipient fault modes growing into catastrophic failure conditions, then fundamentally the fault adaptive control problem is to adjust component loads to minimize risk of failure, while not overly degrading nominal performance. A methodology is proposed for posing this problem as a finite horizon constrained optimization, where constraints correspond to maximum risk of failure and maximum deviation from nominal performance. Development of the methodology to handle a general class of overactuated systems is given. Also, the fault adaptive control methodology is demonstrated on an application example of practical significance, an electro-mechanical actuator (EMA) consisting of three DC motors geared to the same output shaft. Similar actuator systems are commonly used in aerospace, transportation, and industrial processes to actuate critical loads, such as aircraft control surfaces. The fault mode simulated in the system is a temperature dependent motor winding insulation degradation.

  16. Data from: Dataset of Vibration, Temperature and Speed Measurements for...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Oct 24, 2024
    Cite
    Di Maggio, Luigi Gianpio; Giorio, Lorenzo; Delprete, Cristiana; Brusa, Eugenio (2024). Dataset of Vibration, Temperature and Speed Measurements for Multiple Types of Localized Defects on Spherical Roller Bearings across Multiple Operating Conditions [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_13913253
    Explore at:
    Dataset updated
    Oct 24, 2024
    Dataset provided by
    Department of Mechanical and Aerospace Engineering (DIMEAS), Politecnico di Torino
    Department of Management and Production Engineering (DIGEP), Politecnico di Torino
    Authors
    Di Maggio, Luigi Gianpio; Giorio, Lorenzo; Delprete, Cristiana; Brusa, Eugenio
    Description

    Description:

    This dataset has been created by the ISED group of Politecnico di Torino to address the lack of data available for developing fault detection models and/or predictive maintenance systems for medium/large-sized spherical roller bearings, commonly used in industrial settings.

    The data were collected using SKF 22240 CCK/W33 spherical roller bearings through an extensive experimental campaign on the medium/large-scale bearing test rig (capable of testing bearings with an outer diameter up to 420 mm), developed by the ISED research group at Politecnico di Torino (ISED Research Group). The technical details of the test rig can be found in this Paper.

    The dataset contains data for individual localized defects applied to one of the four bearings tested simultaneously on the rig, which is capable of independently applying both axial and radial loads.

    Dataset Structure:

    The dataset is organized into four folders:

    Undamaged

    InnerRaceDamage

    OuterRaceDamage

    RollerDamage

    The Undamaged folder contains data for all bearings in healthy condition. The other folders contain data from tests where one of the bearings presents a localized defect. The defects, introduced by chip removal, have a diameter of 2 mm and a depth of 0.5 mm, affecting either the inner race (IR), outer race (OR), or roller elements (B). More detailed information about the defect geometry and location can be found in the following publications:

    Intelligent Fault Diagnosis of Industrial Bearings Using Transfer Learning and CNNs Pre-Trained for Audio Classification

    Explainable AI for Machine Fault Diagnosis: Understanding Features’ Contribution in Machine Learning Models for Industrial Condition Monitoring

    Zero-Shot Generative AI for Rotating Machinery Fault Diagnosis: Synthesizing Highly Realistic Training Data via Cycle-Consistent Adversarial Networks

    Each folder contains .mat files named according to the following format:

    (Nominal_Rotation_Speed)rpm_(Radial_Force)kN_(Axial_Force)kN.mat

    Where:

    Nominal_Rotation_Speed is the machine's nominal rotational speed

    Radial_Force is the radial force applied to the bearing

    Axial_Force is the axial force applied to the bearing

    The dataset includes measurements from 10 different nominal rotation speeds and four load conditions, one of which includes an axial load. In some cases, "ramp" files are included, containing data where the rotational speed was linearly varied during the test.

    Each .mat file contains multiple structures, depending on the type of test. Each structure is labeled as Signal_ followed by a number (0, 1, 2, 3, 4), where each represents a specific signal extracted during the test. There is no fixed correspondence between the signal number and the type of measurement. For example, Signal_2 does not always represent the accelerometer signal. Users are encouraged to inspect the y_values.quantity field to identify the signal's unit and nature. For instance, if y_values.quantity.label shows "g", the signal corresponds to an accelerometric measurement.

    All signals have been exported using the MKS system, so y_values.values contains data in units of m/s² for acceleration signals. To convert the values to the unit indicated in y_values.quantity.label, users can apply the multiplication factor and offset provided in y_values.quantity.unit_transformation. In the case of accelerometric data, the multiplication factor is 0.1020, converting y_values.values from m/s² to g.
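
    A minimal reading sketch in R with the R.matlab package, assuming the .mat files are not in the v7.3 (HDF5) format and that the nested field names survive the import as shown; the file name and access path are hypothetical and should be verified with str():

      library(R.matlab)

      m <- readMat("1000rpm_10kN_0kN.mat")   # hypothetical file name
      str(m, max.level = 2)                  # inspect which Signal_x structures exist

      # Hypothetical access path: acceleration values in m/s2 for the four bearings
      acc_ms2 <- m$Signal.2[["y.values"]][["values"]]

      # Convert to g using the documented multiplication factor
      acc_g <- acc_ms2 * 0.1020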

    A more detailed description of the data structure can be found in the Test.Lab documentation.

    In addition to acceleration data, the files include temperature signals, tacho sensor signals measuring shaft speed, and tacho impulse signals. In some cases, there is also a signal measured in "N", representing the frictional force generated by the bearings (as detailed in this Paper). This friction force signal is not always present, as the load cell could go into overload during certain tests.

    Sensor Data Organization:

    Acceleration and temperature data are presented in tables with four columns, each corresponding to one of the four bearings. In the damaged condition tests, the damaged bearing is always the one corresponding to sensor 4 (i.e., y_values.values(:, 4)).
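    Continuing the sketch above, the damaged bearing's channel can be extracted from such a four-column table as follows (MATLAB's column 4 becomes index 3 in Python's zero-based indexing):

        # sig is one of the Signal_* structures from the previous sketch,
        # holding an acceleration or temperature table with one column per bearing.
        acc = sig.y_values.values        # shape (n_samples, 4)
        acc_damaged = acc[:, 3]          # sensor 4 = the damaged bearing in defect tests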

  17. Historical silver price from 1791 to 2020 in USD

    • kaggle.com
    zip
    Updated Jul 18, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    JS (2021). Historical silver price from 1791 to 2020 in USD [Dataset]. https://www.kaggle.com/joseserrat/yearly-silver-price-from-1791-to-2020
    Explore at:
    zip(2001 bytes)Available download formats
    Dataset updated
    Jul 18, 2021
    Authors
    JS
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Context

    I'm creating a new website (centralbankanalytics.com) for which I need this type of data. It wasn't easily available, so I scraped it from an interactive graph and am uploading it here for everyone.

    Content

    In this dataset you can find real and nominal silver prices from 1791 to 2020. The differences between real and nominal prices are:

    • Nominal values are the current monetary values.
    • Real values are adjusted for inflation and show prices/wages at constant prices.
    • Real values give a better guide to what you can actually buy and the opportunity costs you face.

    Example of real vs nominal:

    • If you receive an 8% increase in your wages, from £100 to £108, this is the nominal increase.
    • However, if inflation is 2%, then the real increase in wages is about 6% (8% - 2%).
    • The real wage is a better guide to how your living standards change: it shows what you are actually able to buy with the extra increase in wages.
    • If wages increased 80% but inflation was also 80%, the real increase in wages would be 0%: despite the monetary increase of 80%, the amount of goods and services you could buy would be the same.
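    For readers who want the arithmetic spelled out, the short sketch below reproduces the wage example using the exact deflation formula (1 + nominal) / (1 + inflation) - 1; the "8% - 2% = 6%" figure quoted above is the usual first-order approximation of this value.

        # Exact real growth behind the example above.
        nominal_growth = 0.08            # wages rise from 100 to 108
        inflation = 0.02

        real_growth = (1 + nominal_growth) / (1 + inflation) - 1
        print(f"real increase = {real_growth:.2%}")   # about 5.88%, close to the quoted 6%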

    Hope this dataset is useful for you! If you have any questions, do not hesitate to contact me.

  18. Ames Housing Engineered Dataset

    • kaggle.com
    Updated Sep 27, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Atefeh Amjadian (2025). Ames Housing Engineered Dataset [Dataset]. https://www.kaggle.com/datasets/atefehamjadian/ameshousing-engineered
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 27, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Atefeh Amjadian
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Ames
    Description

    This dataset is an engineered version of the original Ames Housing dataset from the "House Prices: Advanced Regression Techniques" Kaggle competition. The goal of this engineering was to clean the data, handle missing values, encode categorical features, scale numeric features, manage outliers, reduce skewness, select useful features, and create new features to improve model performance for house price prediction.

    The original dataset contains information on 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, with the target variable being SalePrice. This engineered version has undergone several preprocessing steps to make it ready for machine learning models.

    Preprocessing Steps Applied

    1. Missing Value Handling: Missing values in categorical columns with meaningful absence (e.g., no pool for PoolQC) were filled with "None". Numeric columns were filled with median, and other categorical columns with mode.
    2. Correlation-based Feature Selection: Numeric features with absolute correlation < 0.1 with SalePrice were removed.
    3. Encoding Categorical Variables: Ordinal features (e.g., quality ratings) were encoded using OrdinalEncoder, and nominal features (e.g., neighborhoods) using OneHotEncoder.
    4. Outlier Handling: Outliers in numeric features were detected using IQR and capped (Winsorized) to IQR bounds to preserve data while reducing extreme values.
    5. Skewness Handling: Highly skewed numeric features (|skew| > 1) were transformed using Yeo-Johnson to make distributions more normal-like.
    6. Additional Feature Selection: Low-variance one-hot features (variance < 0.01) and highly collinear features (|corr| > 0.8) were removed.
    7. Feature Scaling: Numeric features were scaled using RobustScaler to handle outliers.
    8. Duplicate Removal: Duplicate rows were checked and removed if found (none in this dataset).

    The column count changes at each stage: one-hot encoding expands the original 81 columns to approximately 250, and the feature-selection steps then trim that number back down, leaving a final dataset with improved quality for modeling.
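    The steps above correspond closely to standard pandas and scikit-learn operations. The sketch below is a compressed illustration of such a pipeline rather than the exact code used to build this dataset; the column subsets, category orders, and thresholds are placeholders.

        # Compressed illustration of steps 1-7 above; column choices are placeholders.
        import pandas as pd
        from sklearn.compose import ColumnTransformer
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import (OneHotEncoder, OrdinalEncoder,
                                           PowerTransformer, RobustScaler)

        df = pd.read_csv("train.csv")   # assumed: the Kaggle "House Prices" training file

        # 1. Missing values: "None" where absence is meaningful, median otherwise.
        df["PoolQC"] = df["PoolQC"].fillna("None")
        df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())

        # 4. Winsorize a numeric column to its IQR bounds.
        q1, q3 = df["GrLivArea"].quantile([0.25, 0.75])
        iqr = q3 - q1
        df["GrLivArea"] = df["GrLivArea"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

        ordinal_cols = ["ExterQual"]                      # ordered quality rating
        nominal_cols = ["MSZoning", "Neighborhood"]       # unordered categories
        numeric_cols = ["GrLivArea", "TotalBsmtSF"]       # small subset for illustration
        quality_order = [["Po", "Fa", "TA", "Gd", "Ex"]]  # Po=0 ... Ex=4

        # 3, 5, 7: encoding, Yeo-Johnson for skew, robust scaling for outliers.
        preprocess = ColumnTransformer([
            ("ord", OrdinalEncoder(categories=quality_order), ordinal_cols),
            ("ohe", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
            ("num", Pipeline([
                ("yeojohnson", PowerTransformer(method="yeo-johnson", standardize=False)),
                ("scale", RobustScaler()),
            ]), numeric_cols),
        ])

        X = preprocess.fit_transform(df)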

    New Features Created

    To add more predictive power, the following new features were created based on domain knowledge:

    1. HouseAge: Age of the house at the time of sale, calculated as YrSold - YearBuilt. This captures how old the house is, which can negatively affect price due to depreciation.
       - Example: A house built in 2000 and sold in 2008 has HouseAge = 8.
    2. Quality_x_Size: Interaction term between overall quality and living area, calculated as OverallQual * GrLivArea. This combines quality and size to capture the value of high-quality large homes.
       - Example: A house with OverallQual = 7 and GrLivArea = 1500 has Quality_x_Size = 10500.
    3. TotalSF: Total square footage of the house, calculated as GrLivArea + TotalBsmtSF + 1stFlrSF + 2ndFlrSF (if available). This aggregates area features into a single metric for better price prediction.
       - Example: If GrLivArea = 1500 and TotalBsmtSF = 1000, TotalSF = 2500.
    4. Log_LotArea: Log-transformed lot area to reduce skewness, calculated as np.log1p(LotArea). This makes the distribution of lot sizes more normal, helping models handle extreme values.
       - Example: A lot area of 10000 becomes Log_LotArea ≈ 9.21.

    These new features were created using the original (unscaled) values to maintain interpretability, then scaled with RobustScaler to match the rest of the dataset.
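    As an illustration, the four features can be reproduced from the raw competition columns roughly as follows. This is a sketch assuming the standard Kaggle column names, with scaling applied afterwards as described above.

        import numpy as np
        import pandas as pd
        from sklearn.preprocessing import RobustScaler

        raw = pd.read_csv("train.csv")   # assumed: original, unscaled Kaggle columns

        feats = pd.DataFrame({
            "HouseAge":       raw["YrSold"] - raw["YearBuilt"],
            "Quality_x_Size": raw["OverallQual"] * raw["GrLivArea"],
            "TotalSF":        raw["GrLivArea"] + raw["TotalBsmtSF"]
                              + raw["1stFlrSF"] + raw["2ndFlrSF"],
            "Log_LotArea":    np.log1p(raw["LotArea"]),
        })

        # Scaled afterwards so the new columns match the rest of the engineered dataset.
        feats_scaled = pd.DataFrame(RobustScaler().fit_transform(feats),
                                    columns=feats.columns, index=feats.index)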

    Data Dictionary

    • Original Numeric Features: Kept features with |corr| > 0.1 with SalePrice, such as:
      • OverallQual: Material and finish quality (scaled, 1-10).
      • GrLivArea: Above grade (ground) living area square feet (scaled).
      • GarageCars: Size of garage in car capacity (scaled).
      • TotalBsmtSF: Total square feet of basement area (scaled).
      • And others like FullBath, YearBuilt, etc. (see the code for the full list).
    • Ordinal Encoded Features: Quality and condition ratings, e.g.:
      • ExterQual: Exterior material quality (encoded as 0=Po to 4=Ex).
      • BsmtQual: Basement quality (encoded as 0=None to 5=Ex).
    • One-Hot Encoded Features: Nominal categorical features, e.g.:
      • MSZoning_RL: 1 if residential low density, 0 otherwise.
      • Neighborhood_NAmes: 1 if in NAmes neighborhood, 0 otherwise.
    • New Engineered Features (as described above):
      • HouseAge: Age of the house (scaled).
      • Quality_x_Size: Overall quality times living area (scaled).
      • TotalSF: Total square footage (scaled).
      • Log_LotArea: Log-transformed lot area (scaled).
    • Target: SalePrice - The property's sale price in dollars (not scaled, as it's the target).

    Total columns: Approximately 200-250 (after one-hot encoding and feature selection).

    License

    This dataset is derived from the Ames Housing...

  19. Data from: Datasets for the article "Embedded Digital Phase Noise Analyzer for Optical Frequency Metrology"

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Oct 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Donadello, Simone; Bertacco, Elio K.; Calonico, Davide; Clivati, Cecilia (2023). Datasets for the article "Embedded Digital Phase Noise Analyzer for Optical Frequency Metrology" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8392621
    Explore at:
    Dataset updated
    Oct 2, 2023
    Dataset provided by
    Istituto Nazionale di Ricerca Metrologica (INRIM)
    Authors
    Donadello, Simone; Bertacco, Elio K.; Calonico, Davide; Clivati, Cecilia
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These files contain the datasets related to the figures presented in the article Donadello et al. (2023) "Embedded Digital Phase Noise Analyzer for Optical Frequency Metrology", IEEE Transactions on Instrumentation and Measurement, 72, pp. 1–12, Article Sequence Number: 2005412. Available at: https://doi.org/10.1109/TIM.2023.3288255. Each dataset is related to the respective figure number in the article, as indicated in the filename (e.g. "data_fig3a.csv" contains the data used to produce Fig. 3, subplot a).

    All files are provided in comma-separated values (CSV) format.

    Description: "data_fig3a.csv": coefficients for IQ demodulation and filtering, with f_int=200kHz. "data_fig3b.csv": frequency response of demodulation filtering, with f_int=200kHz and f_int=10kHz. "data_fig4a.csv": power spectral density (PSD) of frequency signals, at different signal amplitudes. "data_fig4b.csv": overlapping Allan deviation of frequency signals, with either OCXO and maser references. "data_fig4c.csv": measured signal frequency and amplitude as a function of nominal frequency deviation. "data_fig4d.csv": measured signal amplitude as a function of nominal amplitude. "data_fig5.csv": example signals related to the synchronized acquisition of RF inputs. "data_fig7a.csv": synchronized time series of frequency deviation acquired over a fiber link in the self-heterodyne interference scheme. "data_fig7b.csv": PSD of the frequency signals acquired over a fiber link in the self-heterodyne interference scheme. "data_fig8a-without-corr.csv": synchronized time series of frequency deviation acquired over a fiber link in the heterodyne interference scheme without frequency drift correction. "data_fig8a-with-corr.csv": synchronized time series of frequency deviation acquired over a fiber link in the heterodyne interference scheme with frequency drift correction. "data_fig8b.csv": PSD of the frequency signals acquired over a fiber link in the heterodyne interference scheme.

  20. High Resolution Research Soundings from Algonquin

    • ckanprod.data-commons.k8s.ucar.edu
    • data.ucar.edu
    ascii
    Updated Oct 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2025). High Resolution Research Soundings from Algonquin [Dataset]. http://doi.org/10.26023/W1S2-ZED1-AP0N
    Explore at:
    asciiAvailable download formats
    Dataset updated
    Oct 7, 2025
    Time period covered
    Dec 4, 1997 - Jan 25, 1998
    Area covered
    Description

    This data set contains ten-second vertical resolution sounding data for Algonquin Park, Ontario, Canada. The site was operated by the Atmospheric Technology Division (ATD). Soundings were typically taken four times a day (every six hours), or eight times a day (every three hours) during IOPs, beginning at 00 UTC. This dataset underwent JOSS automatic quality control procedures. Soundings are loaded into the database by their actual release times, and the actual release is typically one hour before the nominal time, so when previewing a sounding try the hour right before the nominal time first. For example, for an 18 UTC sounding, try hour 17. Consult the README for more information.
