CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
CSV file - 19,237 rows x 18 columns (includes the Price column as the target)
Columns: ID, Price (price of the car; target column), Levy, Manufacturer, Model, Prod. year, Category, Leather interior, Fuel type, Engine volume, Mileage, Cylinders, Gear box type, Drive wheels, Doors, Wheel, Color, Airbags
Confused or have any doubts about the data column values? Check the dataset discussion tab!
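For a notebook working toward the price target, the first step is usually separating the Price column from the 17 feature columns. A minimal pandas sketch; the toy rows below stand in for the real 19,237 x 18 CSV, and the filename in the comment is an assumption:

```python
import pandas as pd

def split_features_target(df, target="Price"):
    """Separate the target column (Price) from the feature columns."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y

# Tiny stand-in for the real CSV, which would be read with something
# like pd.read_csv("car_price.csv") -- the filename is an assumption.
cars = pd.DataFrame({
    "ID": [1, 2],
    "Price": [13500, 8200],
    "Manufacturer": ["LEXUS", "HONDA"],
    "Mileage": [186005, 192000],
})
X, y = split_features_target(cars)
```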
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1. Introduction
Sales data collection is a crucial aspect of any manufacturing industry as it provides valuable insights about the performance of products, customer behaviour, and market trends. By gathering and analysing this data, manufacturers can make informed decisions about product development, pricing, and marketing strategies in Internet of Things (IoT) business environments like the dairy supply chain.
One of the most important benefits of the sales data collection process is that it allows manufacturers to identify their most successful products and target their efforts towards those areas. For example, if a manufacturer notices that a particular product is selling well in a certain region, this information can be used to develop new products, optimise the supply chain, or improve existing products to meet the changing needs of customers.
This dataset includes information about 7 of MEVGAL’s products [1]. The published data will help researchers understand the dynamics of the dairy market and its consumption patterns, creating fertile ground for synergies between academia and industry and eventually helping the industry make informed decisions regarding product development, pricing and market strategies in the IoT playground. The dataset can also be used to understand the impact of external factors on the dairy market, such as economic, environmental, and technological factors, and can help in understanding the current state of the dairy industry and identifying potential opportunities for growth and development.
Please cite the following papers when using this dataset:
I. Siniosoglou, K. Xouveroudis, V. Argyriou, T. Lagkas, S. K. Goudos, K. E. Psannis and P. Sarigiannidis, "Evaluating the Effect of Volatile Federated Timeseries on Modern DNNs: Attention over Long/Short Memory," in the 12th International Conference on Circuits and Systems Technologies (MOCAST 2023), April 2023, Accepted
The dataset includes data regarding the daily sales of a series of dairy product codes offered by MEVGAL. In particular, the dataset includes information gathered by the logistics division and agencies within the industrial infrastructures overseeing the production of each product code. The products included in this dataset represent the daily sales and logistics of a variety of yogurt-based stock. Each of the different files includes the logistics for that product on a daily basis for three years, from 2020 to 2022.
3.1 Data Collection
The process of building this dataset involves several steps to ensure that the data is accurate, comprehensive and relevant.
The first step is to determine the specific data that is needed to support the business objectives of the industry, i.e., in this publication’s case the daily sales data.
Once the data requirements have been identified, the next step is to implement an effective sales data collection method. In MEVGAL’s case this is conducted through direct communication and reports generated each day by representatives & selling points.
It is also important for MEVGAL to ensure that the data collection process is conducted in an ethical and compliant manner, adhering to data privacy laws and regulations. The industry also has a data management plan in place to ensure that the data is securely stored and protected from unauthorised access.
The published dataset consists of 13 features providing information about the date and the number of products sold. Finally, the dataset was anonymised in consideration of the privacy requirements of the data owner (MEVGAL).
| File | Period | Number of Samples (days) |
| --- | --- | --- |
| product 1 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 1 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 1 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 2 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 2 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 2 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 3 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 3 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 3 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 4 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 4 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 4 2022.xlsx | 01/01/2022–31/12/2022 | 364 |
| product 5 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 5 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 5 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 6 2020.xlsx | 01/01/2020–31/12/2020 | 362 |
| product 6 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 6 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 7 2020.xlsx | 01/01/2020–31/12/2020 | 362 |
| product 7 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 7 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
3.2 Dataset Overview
The following table enumerates and explains the features included in all of the files.
| Feature | Description | Unit |
| --- | --- | --- |
| Day | Day of the month | - |
| Month | Month | - |
| Year | Year | - |
| daily_unit_sales | Daily sales: the number of products, measured in units, sold on that specific day | units |
| previous_year_daily_unit_sales | Previous year's sales: the number of products, measured in units, sold on the same day of the previous year | units |
| percentage_difference_daily_unit_sales | The percentage difference between the two values above | % |
| daily_unit_sales_kg | The amount of products, measured in kilograms, sold on that specific day | kg |
| previous_year_daily_unit_sales_kg | Previous year's sales: the amount of products, measured in kilograms, sold on the same day of the previous year | kg |
| percentage_difference_daily_unit_sales_kg | The percentage difference between the two values above | % |
| daily_unit_returns_kg | The percentage of the products that were shipped to selling points and were returned | % |
| previous_year_daily_unit_returns_kg | The percentage of the products that were shipped to selling points and were returned the previous year | % |
| points_of_distribution | The number of sales representatives through which the product was sold to the market for this year | - |
| previous_year_points_of_distribution | The number of sales representatives through which the product was sold to the market on the same day of the previous year | - |
Table 1 – Dataset Feature Description
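The percentage-difference features pair each value with its previous-year counterpart; they are presumably derived as (current − previous) / previous × 100. The exact formula is an assumption, but a sketch of that derivation is:

```python
def pct_difference(current, previous):
    """Percentage difference of a value vs. its previous-year counterpart.
    Formula assumed: (current - previous) / previous * 100."""
    if previous == 0:
        return float("nan")  # undefined when nothing was sold last year
    return (current - previous) / previous * 100.0

# 120 units sold today vs. 100 on the same day last year -> +20%.
change = pct_difference(120, 100)
```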
4.1 Dataset Structure
The provided dataset has the following structure:

| Name | Type | Property |
| --- | --- | --- |
| Readme.docx | Report | A file that contains the documentation of the dataset. |
| product X | Folder | A folder containing the data of product X. |
| product X YYYY.xlsx | Data file | An Excel file containing the sales data of product X for year YYYY. |
Table 2 - Dataset File Description
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 957406 (TERMINET).
References
[1] MEVGAL is a Greek dairy production company.
Success.ai’s Company Financial Data for Banking & Capital Markets Professionals in the Middle East offers a reliable and comprehensive dataset designed to connect businesses with key stakeholders in the financial sector. Covering banking executives, capital markets professionals, and financial advisors, this dataset provides verified contact details, decision-maker profiles, and firmographic insights tailored for the Middle Eastern market.
With access to over 170 million verified professional profiles and 30 million company profiles, Success.ai ensures your outreach and strategic initiatives are powered by accurate, continuously updated, and AI-validated data. Backed by our Best Price Guarantee, this solution empowers your organization to build meaningful connections in the region’s thriving financial industry.
Why Choose Success.ai’s Company Financial Data?
Verified Contact Data for Financial Professionals
Targeted Insights for the Middle East Financial Sector
Continuously Updated Datasets
Ethical and Compliant
Data Highlights:
Key Features of the Dataset:
Decision-Maker Profiles in Banking & Capital Markets
Advanced Filters for Precision Targeting
Firmographic and Leadership Insights
AI-Driven Enrichment
Strategic Use Cases:
Sales and Lead Generation
Market Research and Competitive Analysis
Partnership Development and Vendor Evaluation
Recruitment and Talent Solutions
Why Choose Success.ai?
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Fruits dataset is an image classification dataset of various fruits against white backgrounds from various angles, originally open-sourced by GitHub user horea. This is a subset of that full dataset.
Example Image:
https://github.com/Horea94/Fruit-Images-Dataset/blob/master/Training/Apple%20Braeburn/101_100.jpg?raw=true
Build a fruit classifier! This could be a just-for-fun project just as much as you could be building a color sorter for agricultural use cases before fruits make their way to market.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or with additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.
This is a daily updated dataset, automatically collecting market data for the G-Research crypto forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated. See the discussion topic.
For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.
1. **timestamp** - A timestamp for the minute covered by the row.
2. **Asset_ID** - An ID code for the cryptoasset.
3. **Count** - The number of trades that took place this minute.
4. **Open** - The USD price at the beginning of the minute.
5. **High** - The highest USD price during the minute.
6. **Low** - The lowest USD price during the minute.
7. **Close** - The USD price at the end of the minute.
8. **Volume** - The number of cryptoasset units traded during the minute.
9. **VWAP** - The volume-weighted average price for the minute.
10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
12. **Asset_Name** - Human readable Asset name.
The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.
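Because rows are minute candles indexed by timestamp, coarser bars can be built with pandas `resample`. A minimal sketch on synthetic data; the column names follow the field list above, and the aggregation rules are the usual OHLCV conventions:

```python
import pandas as pd

def resample_ohlcv(df, rule="1h"):
    """Aggregate 1-minute candles (timestamp-indexed) into coarser OHLCV bars."""
    return df.resample(rule).agg({
        "Open": "first",   # first open of the window
        "High": "max",     # highest high
        "Low": "min",      # lowest low
        "Close": "last",   # last close
        "Volume": "sum",   # total units traded
        "Count": "sum",    # total number of trades
    })

# Two hours of synthetic minute candles standing in for the real data.
idx = pd.date_range("2021-01-01", periods=120, freq="min")
minute = pd.DataFrame({
    "Open": range(120), "High": range(1, 121),
    "Low": range(120), "Close": range(1, 121),
    "Volume": [1.0] * 120, "Count": [2] * 120,
}, index=idx)
hourly = resample_ohlcv(minute)
```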
The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.
These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris's great notebook series on the SIIM-ISIC melanoma detection competition here.
This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:
Opening price with an added indicator (MA50):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media
Volume and number of trades:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media
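The MA50 indicator shown in the first chart is a simple 50-period rolling mean of the opening price; a minimal pandas sketch on synthetic prices:

```python
import pandas as pd

def add_ma(df, column="Open", window=50):
    """Append a simple moving average column such as MA50 over the opening price."""
    out = df.copy()
    out[f"MA{window}"] = out[column].rolling(window).mean()
    return out

# Synthetic opening prices 1..100 standing in for real candle data.
prices = pd.DataFrame({"Open": [float(i) for i in range(1, 101)]})
with_ma = add_ma(prices)  # MA50 is NaN until 50 observations are available
```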
This data is being collected automatically from the crypto exchange Binance.
The Cybersecurity Framework Manufacturing Profile Low Security Level Example Implementations Guide provides example proof-of-concept solutions demonstrating how open-source and commercial off-the-shelf (COTS) products that are currently available can be implemented in manufacturing environments to satisfy the requirements in the Cybersecurity Framework (CSF) Manufacturing Profile Low Security Level. Example proof-of-concept solutions for a process-based manufacturing environment and a discrete-based manufacturing environment are included in the guide. Depending on factors like size, sophistication, risk tolerance, and threat landscape, manufacturers should make their own determinations about the breadth of the proof-of-concept solutions they may voluntarily implement. The dataset includes all of the raw and processed measurement data for the example implementation of the discrete-based manufacturing system use case.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The objective of this dataset is the fault diagnosis in diesel engines to assist the predictive maintenance, through the analysis of the variation of the pressure curves inside the cylinders and the torsional vibration response of the crankshaft. Hence a fault simulation model based on a zero-dimensional thermodynamic model was developed.
The adopted feature vectors were chosen from the thermodynamic model and obtained from processing signals as pressure and temperature inside the cylinder, as well as, torsional vibration of the engine’s flywheel. These vectors are used as input of the machine learning technique in order to discriminate among several machine conditions.
The database is expected to emulate all operating scenarios under study. In our case, all possible diesel machine faults and system conditions variations, which correspond to severities levels containing enough information to characterize and discriminate the faults. The developed database covered the following operating conditions: Normal (without faults), Pressure reduction in the intake manifold, Compression ratio reduction in the cylinders and Reduction of amount of fuel injected into the cylinders.
In all scenarios, the motor rotation frequency was set at 2500 RPM. The rotation of 2500 RPM was used, since it presented the lowest joint error rate in the estimation of the mean and maximum pressures of the burning cycle, between the experimental data (according to data supplied by the manufacturer) and the simulated data, during the validation stage of the thermodynamic and dynamic models.
The entire database comprises a total of 3500 different fault scenarios across 4 distinct operational conditions: 250 from the "normal" class, 250 from the "pressure reduction in the intake manifold" class, 1500 from the "compression ratio reduction in the cylinders" class, and 1500 from the "reduction of amount of fuel injected into the cylinders" class. This database is named the 3500-DEFault database.
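Given the heavy class imbalance (250 vs. 1500 scenarios per class), inverse-frequency class weights are a common way to keep the machine-learning step from ignoring the rare classes. A sketch of that calculation; the class names are shorthand, not the dataset's own labels:

```python
# Scenario counts per operating condition, as stated above (names are shorthand).
class_counts = {
    "normal": 250,
    "intake_pressure_reduction": 250,
    "compression_ratio_reduction": 1500,
    "fuel_injection_reduction": 1500,
}
total = sum(class_counts.values())  # 3500 scenarios in all

# Inverse-frequency weights, total / (n_classes * count), following the
# common "balanced" heuristic: rare classes get proportionally larger weights.
weights = {k: total / (len(class_counts) * v) for k, v in class_counts.items()}
```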
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
The dataset could include various features and measurements related to the engine health of vehicles, such as engine RPM, temperature, pressure, and other sensor data. It may also include metadata on the vehicle, such as make, model, year, and mileage.
One potential project using this dataset could be to build a predictive maintenance model for automotive engines. By analyzing the patterns and trends in the data, machine learning algorithms could be trained to predict when an engine is likely to require maintenance or repair. This could help vehicle owners and mechanics proactively address potential issues before they become more severe, leading to better vehicle performance and longer engine lifetimes.
Another potential use for this dataset could be to analyze the performance of different types of engines and vehicles. Researchers could use the data to compare the performance of engines from different manufacturers, for example, or to evaluate the effectiveness of different maintenance strategies. This could help drive innovation and improvements in the automotive industry.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is James Allison: a biography of the engine manufacturer and Indianapolis 500 cofounder. It features 7 columns including author, publication date, language, and book publisher.
The Fabrics Dataset consists of about 2000 samples of garments and fabrics. A small patch of each surface has been captured under 4 different illumination conditions using a custom-made, portable photometric stereo sensor. All images have been acquired "in the field" (at clothes shops) and the dataset reflects the distribution of fabrics in the real world, hence it is not balanced. The majority of clothes are made of specific fabrics, such as cotton and polyester, while other fabrics, such as silk and linen, are rarer. Also, a large number of clothes are not composed of a single fabric; two or more fabrics are used to give the garment the desired properties (blended fabrics). For every garment there is information (attributes) about its material composition from the manufacturer label and its type (pants, shirt, skirt etc.).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Amazon Product Reviews Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/amazon-product-reviews-datasete on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains 30K records of product reviews from amazon.com.
This dataset was created by PromptCloud and DataStock
This dataset contains the following:
Total Records Count: 43729
Domain Name: amazon.com
Date Range: 01st Jan 2020 - 31st Mar 2020
File Extension: CSV
Available Fields:
-- Uniq Id,
-- Crawl Timestamp,
-- Billing Uniq Id,
-- Rating,
-- Review Title,
-- Review Rating,
-- Review Date,
-- User Id,
-- Brand,
-- Category,
-- Sub Category,
-- Product Description,
-- Asin,
-- Url,
-- Review Content,
-- Verified Purchase,
-- Helpful Review Count,
-- Manufacturer Response
We wouldn't be here without the help of our in-house teams at PromptCloud and DataStock, who have put their heart and soul into this project, as they do with every project. We want to provide the best quality data, and we will continue to do so.
The inspiration for these datasets came from research. Reviews are important to everybody across the globe, so we decided to come up with this dataset, which shows exactly how user reviews help companies better their products.
This dataset was created by PromptCloud and contains around 0 samples along with Billing Uniq Id, Verified Purchase, technical information and other features such as: - Crawl Timestamp - Manufacturer Response - and more.
- Analyze Helpful Review Count in relation to Sub Category
- Study the influence of Review Date on Product Description
- More datasets
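The first idea above, relating "Helpful Review Count" to "Sub Category", is a one-line groupby once the CSV is loaded. A sketch on toy rows; the field names are taken from the list above, and the real frame would come from reading the dataset file:

```python
import pandas as pd

def mean_helpful_by_subcategory(df):
    """Average 'Helpful Review Count' per 'Sub Category', most helpful first."""
    return (df.groupby("Sub Category")["Helpful Review Count"]
              .mean()
              .sort_values(ascending=False))

# Toy rows standing in for the real reviews CSV.
reviews = pd.DataFrame({
    "Sub Category": ["Audio", "Audio", "Video"],
    "Helpful Review Count": [4, 6, 1],
})
ranked = mean_helpful_by_subcategory(reviews)
```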
If you use this dataset in your research, please credit PromptCloud
--- Original source retains full ownership of the source dataset ---
Analysis of ‘Television Brands Ecommerce Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/devsubhash/television-brands-ecommerce-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains 912 samples with 7 attributes. There are some missing values in this dataset.
Here are the columns in this dataset:
1. Brand: the manufacturer of the product, i.e. the television
2. Resolution: multiple categories indicating the type of display, i.e. LED, HD LED, etc.
3. Size: the screen size in inches
4. Selling Price: the selling price or the discounted price of the product
5. Original Price: the original price of the product from the manufacturer
6. Operating system: a categorical variable showing the type of OS, like Android, Linux, etc.
7. Rating: average customer ratings on a scale of 5
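With both Selling Price and Original Price available, the discount implied by each listing is a simple derived quantity; a minimal sketch:

```python
def discount_pct(selling, original):
    """Discount implied by the selling price vs. the original price, in percent."""
    if original <= 0:
        return float("nan")  # guard against bad or missing original prices
    return (original - selling) / original * 100.0

# A set priced at 100 and sold at 75 carries a 25% discount.
example = discount_pct(75, 100)
```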
Inspiration: This dataset could be used to explore the current market scenario for Televisions. There are various types of screens with different operating systems offered by several manufacturers at competitive prices. Some questions this dataset could be used to answer are -
--- Original source retains full ownership of the source dataset ---
Clotho is an audio captioning dataset, now at version 2. Clotho consists of 6974 audio samples, and each audio sample has five captions (a total of 34,870 captions). Audio samples are 15 to 30 s in duration and captions are eight to 20 words long.
Clotho is thoroughly described in our paper:
K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.
available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990
If you use Clotho, please cite our paper.
To use the dataset, you can use our code at: https://github.com/audio-captioning/clotho-dataset
These are the files for the development, validation, and evaluation splits of Clotho dataset.
== Changes in version 2.1 ==
In version 2.1 of Clotho, we fixed some files that were corrupted from the compression and transferring processes (around 150 files) and we also replaced some characters that were illegal for most filesystems, e.g. ":" (around 10 files).
Please use this version for your experiments.
== Changes in version 2 ==
In version 2 of Clotho, there are audio files added in the development split and a new validation split is added. There are no changes in the evaluation split.
Specifically:
Now there are 3840 audio files in the development split. In Clotho version 1, there were 2893 audio files. Now, 947 new audio files are added.
There are 1046 new audio files in the validation split.
All new captions are treated as in version 1 of Clotho, i.e. having word consistency, no named entities, no speech transcription, and no hapax legomena between splits (i.e. words appearing only in one of the splits).
== Usage ==
To use the dataset you have to:
Download the audio files: clotho_audio_development.7z,clotho_audio_validation.7z, and clotho_audio_evalution.7z
Download the files with the captions: clotho_captions_development.csv, clotho_captions_validation.csv, and clotho_captions_evaluation.csv
Download the files with the associated metadata: clotho_metadata_development.csv, clotho_metadata_validation.csv, and clotho_metadata_evaluation.csv
Extract the audio files
Then you can use each audio file with its corresponding captions
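After extraction, each audio file can be paired with its five captions by reading the captions CSV. A sketch using an inline stand-in; the column names (`file_name`, `caption_1` … `caption_5`) are assumptions about the CSV layout:

```python
import csv
from io import StringIO

def captions_by_file(csv_text):
    """Map each audio file name to its five captions."""
    return {
        row["file_name"]: [row[f"caption_{i}"] for i in range(1, 6)]
        for row in csv.DictReader(StringIO(csv_text))
    }

# Inline stand-in for clotho_captions_development.csv (layout assumed).
sample = ("file_name,caption_1,caption_2,caption_3,caption_4,caption_5\n"
          "birds.wav,a,b,c,d,e\n")
caps = captions_by_file(sample)
```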
== License ==
The audio files in the archives:
clotho_audio_development.7z,
clotho_audio_validation.7z, and
clotho_audio_evalution.7z
and the associated meta-data in the CSV files:
clotho_metadata_development.csv
clotho_metadata_validation.csv
clotho_metadata_evaluation.csv
are under the corresponding licences (mostly Creative Commons with attribution) of the Freesound [1] platform, mentioned explicitly in the CSV files for each of the audio files. That is, each audio file in the 7z archives is listed in the CSV files with its meta-data. The meta-data for each file are:
File name
Keywords
URL for the original audio file
Start and ending samples for the excerpt that is used in the Clotho dataset
Uploader/user in the Freesound platform (manufacturer)
Link to the licence of the file
The captions in the files:
clotho_captions_development.csv
clotho_captions_validation.csv
clotho_captions_evaluation.csv
are under the Tampere University licence, described in the LICENCE file (mainly a non-commercial with attribution licence).
== References == [1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 112 non-contrast cranial CT scans of patients with hyperacute stroke, featuring delineated zones of penumbra and core of the stroke on each slice where present. The data in the dataset are anonymized using the Kitware DicomAnonymizer, with standard anonymization settings, except for preserving the values of the following fields:
(0x0010, 0x0040) – Patient's Sex
(0x0010, 0x1010) – Patient's Age
(0x0008, 0x0070) – Manufacturer
(0x0008, 0x1090) – Manufacturer’s Model Name
The patient's sex and age are retained for demographic analysis of the samples, and the equipment manufacturer and model are kept for dataset statistics and the potential for domain shift analysis.
The dataset is split into three folds:
Training fold (92 studies, 8,376 slices).
Validation fold (10 studies, 980 slices).
Testing fold (10 studies, 809 slices).
The dataset has the following structure:
metadata.json – dataset metadata
summary.csv – metadata of each study in a CSV format table
Part of the dataset (train, val, and test)
Study
Slice
raw.dcm – original slice file
image.npz – slice in Numpy array format
mask.npz – segmentation mask in Numpy array format
metadata.json – slice metadata in JSON format
metadata.json – study metadata in JSON format
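A slice and its mask can be loaded from the `.npz` files with NumPy. A sketch that round-trips synthetic arrays; the archive key `arr_0` is `np.savez`'s default and an assumption about how these files were written:

```python
import os
import tempfile
import numpy as np

def load_slice(image_path, mask_path, key="arr_0"):
    """Load a CT slice and its segmentation mask from .npz archives."""
    image = np.load(image_path)[key]
    mask = np.load(mask_path)[key]
    return image, mask

# Round-trip demo with synthetic arrays standing in for image.npz / mask.npz.
tmp = tempfile.mkdtemp()
img_p, msk_p = os.path.join(tmp, "image.npz"), os.path.join(tmp, "mask.npz")
np.savez(img_p, np.zeros((512, 512), dtype=np.int16))
np.savez(msk_p, np.zeros((512, 512), dtype=np.uint8))
image, mask = load_slice(img_p, msk_p)
```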
The metadata.json at the root of the dataset has the following format:
generation_params – dataset generation parameters:
test_size – proportion of the test part
val_size – proportion of the validation part
stats – statistical data:
common – general statistical data:
train_size_in_studies – number of studies in the training part of the dataset.
train_size_in_images – number of slices in the training part of the dataset.
val_size_in_studies – number of studies in the validation part of the dataset.
val_size_in_images – number of slices in the validation part of the dataset.
test_size_in_studies – number of studies in the test part of the dataset.
test_size_in_images – number of slices in the test part of the dataset.
train – statistical data for the training part of the dataset:
min – minimum pixel value.
max – maximum pixel value.
mean – average pixel value.
std – standard deviation for all pixel values.
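These training-fold statistics are typically used to standardize slices before feeding them to a model (applying the train mean and std to all folds is the usual convention, assumed here); a minimal sketch:

```python
import numpy as np

def normalize(image, mean, std):
    """Standardize a slice with the training-fold mean and std from metadata.json."""
    return (image.astype(np.float32) - mean) / std

# Synthetic pixel values; mean/std are placeholders for the metadata values.
x = np.array([[0.0, 100.0]], dtype=np.float32)
z = normalize(x, mean=50.0, std=25.0)
```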
The metadata.json at the root of the study has the following format; if a field value is unknown, it is given as 'unknown':
manufacturer – manufacturer of the tomograph.
model – model of the tomograph.
device – full name of the tomograph (manufacturer + model).
age – patient's age in years.
sex – patient's sex. M – male, F – female.
dsa – whether cerebral angiography was performed. true if yes, false if no.
nihss – NIHSS score.
time – time in hours from the onset of the stroke to the conduct of the study. Can be either a number or a range.
lethality – whether the person died as a result of this stroke. true if yes, false if no.
The summary.csv contains the same fields as the metadata.json from the root of the study, plus two additional fields:
name – name of the study.
part – part of the dataset in which the study is located.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
LearnPlatform is a unique technology platform in the K-12 market, providing the only broadly interoperable platform for the breadth of edtech solutions in the US K-12 field. A key component of edtech effectiveness is integrated reporting on tool usage and, where applicable, evidence of efficacy. With COVID closures, LearnPlatform has emerged as an important and singular resource to measure whether students are accessing digital resources within distance learning constraints. This platform provides a unique and needed source of data to understand whether students are accessing digital resources, and where resources have disparate usage and impact.
In this dataset we are sharing educational technology usage across the 8,000+ tools used in the education field in 2020. We make this dataset available to the public so that educators, district leaders, researchers, institutions, policy-makers or anyone interested in learning about digital learning in 2020 can use it to understand student engagement with core learning activities during the COVID-19 pandemic. Some example research questions that this dataset can help stakeholders answer:
- What is the picture of digital connectivity and engagement in 2020?
- What is the effect of the COVID-19 pandemic on online and distance learning, and how might this evolve in the future?
- How does student engagement with different types of education technology change over the course of the pandemic?
- How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
- Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with increases or decreases in online engagement?
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Pakistan Corona Virus Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zusmani/pakistan-corona-virus-citywise-data on 30 September 2021.
--- Dataset description provided by original source is as follows ---
Pakistan witnessed its first coronavirus patient on February 26th, 2020. It has been a bumpy ride since then. The cases are increasing gradually and we haven't seen the worst yet. While there are a few government resources for cumulative updates, there is no place where you can find city-level patient data. It is also not possible to find a running chronological tally of patients as they test positive. We have decided to create our own dataset for all the researchers out there with such details, so we can model the infection spread and forecast the situation in the coming days. We hope, by doing so, we will be able to inform policy makers on various intervention models, and help healthcare professionals be ready for the influx of new patients. We certainly hope that this little contribution will go a long way towards saving lives in Pakistan.
The dataset contains seven columns for date, number of cases, number of deaths, number of people recovered, travel history of those cases, and location of the cases (province and city).
The first version has the data from the first case on February 26, 2020 to April 19, 2020. We intend to publish weekly updates.
Users are allowed to use, copy, distribute and cite the dataset as follows: “Zeeshan-ul-hassan Usmani, Sana Rasheed, Pakistan Corona Virus Data, Kaggle Dataset Repository, April 19, 2020.”
Some ideas worth exploring:
Can we find the spread factor for the Corona virus in Pakistan?
How long does it take for a positive case to infect another in Pakistan?
How can we use this data to simulate lockdown scenarios and find their impact on the country's economy? Here is a good read to get started - http://zeeshanusmani.com/urdu/corona-economic-impact/
How does the Corona virus spread in Pakistan compare against its neighbors and other developed countries?
What would be the impact of this infection spread on the country's economy and on people living in poverty? Here are two briefs to get you started:
http://zeeshanusmani.com/urdu/corona/ http://zeeshanusmani.com/urdu/corona-what-to-learn/
How do we visualize this dataset to inform policy makers? Here is one example https://zeeshanusmani.com/corona/
Can we predict the number of cases in next 10 days and a month?
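For the forecasting question above, here is a minimal sketch of a short-term projection, assuming a simple exponential-growth model on cumulative case counts. The model choice and the synthetic numbers are illustrative assumptions, not part of the dataset:

```python
# Hedged sketch: fit log-linear (exponential) growth to cumulative
# case counts and project a few days ahead. A real analysis would use
# the dataset's date/case columns and a richer epidemic model.
import numpy as np

def forecast_cases(cumulative, horizon):
    """Fit log(cases) = a*day + b and project `horizon` days ahead."""
    days = np.arange(len(cumulative))
    a, b = np.polyfit(days, np.log(cumulative), 1)
    future = np.arange(len(cumulative), len(cumulative) + horizon)
    return np.exp(a * future + b)

# Synthetic history with roughly 25% daily growth (illustrative only)
history = [10, 12, 15, 19, 24, 30, 38, 48, 61, 77]
print(forecast_cases(history, 10).round())
```

A logistic or SEIR-type model would be more defensible once growth slows; the log-linear fit is only a first pass.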
--- Original source retains full ownership of the source dataset ---
Clotho-AQA is an audio question-answering dataset consisting of 1,991 audio samples taken from the Clotho dataset [1]. Each audio sample has six associated questions collected through crowdsourcing. For each question, answers are provided by three different annotators, making a total of 35,838 question-answer pairs. For each audio sample, four questions are designed to be answered with 'yes' or 'no', while the remaining two are designed to be answered with a single word. More details about the data collection and splitting processes can be found in the following paper.
S. Lipping, P. Sudarsanam, K. Drossos, T. Virtanen, ‘Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering.’ The paper is available online at https://arxiv.org/abs/2204.09634.
If you use the Clotho-AQA dataset, please cite the paper mentioned above. A sample baseline model to use the Clotho-AQA dataset can be found at partha2409/AquaNet (github.com)
To use the dataset,
• Download and extract ‘audio_files.zip’. This contains all the 1991 audio samples in the dataset.
• Download ‘clotho_aqa_train.csv’, ‘clotho_aqa_val.csv’, and ‘clotho_aqa_test.csv’. These files contain the train, validation, and test splits, respectively. They contain the audio file name, questions, answers, and confidence scores provided by the annotators.
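A minimal sketch for loading a split and separating the yes/no questions from the single-word ones; the column name "answer" is an assumption based on the description above, not a confirmed header:

```python
# Sketch, assuming an "answer" column in the split CSVs.
import pandas as pd

def split_by_answer_type(df):
    """Return (yes/no rows, single-word rows) from a split table."""
    is_binary = df["answer"].str.lower().isin(["yes", "no"])
    return df[is_binary], df[~is_binary]

# Typical usage with the files listed above:
# train = pd.read_csv("clotho_aqa_train.csv")
# binary_qa, open_qa = split_by_answer_type(train)
```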
License:
The audio files in the archive ‘audio_files.zip’ are under the corresponding licenses (mostly Creative Commons with attribution) of the Freesound [2] platform, mentioned explicitly in the CSV file ’clotho_aqa_metadata.csv’ for each of the audio files. That is, each audio file in the archive is listed in the CSV file with meta-data. The meta-data for each file are:
• File name
• Keywords
• URL for the original audio file
• Start and end samples of the excerpt used in the Clotho dataset
• Uploader/user in the Freesound platform (manufacturer)
• Link to the license of the file.
The questions and answers in the files:
• clotho_aqa_train.csv
• clotho_aqa_val.csv
• clotho_aqa_test.csv
are under the MIT license, described in the LICENSE file.
References:
[1] K. Drossos, S. Lipping and T. Virtanen, "Clotho: An Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736- 740, doi: 10.1109/ICASSP40776.2020.9052990.
[2] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
By Department of Energy [source]
The Building Energy Data Book (2011) is an invaluable resource for gaining insight into the current state of energy consumption in the buildings sector. This dataset provides comprehensive data on residential, commercial and industrial building energy consumption, construction techniques, building technologies and characteristics. With this resource, you can get an in-depth understanding of how energy is used in various types of buildings - from single-family homes to large office complexes - as well as its impact on the environment. The BTO within the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy developed this dataset to provide a wealth of knowledge for researchers, policy makers, engineers and even everyday observers who are interested in learning more about our built environment and its energy usage patterns.
This dataset provides comprehensive information regarding energy consumption in the buildings sector of the United States. It contains a number of key variables which can be used to analyze and explore the relationships between energy consumption and building characteristics, technologies, and construction. The data is provided in both CSV and tabular formats, which can be helpful for those who prefer to use programs like Excel or other statistical modeling software.
In order to get started with this dataset we've developed a guide outlining how to effectively use it for your research or project needs.
Understand what's included: Before you start analyzing the data, you should read through the provided documentation so that you fully understand what is included in the datasets. You'll want to be aware of any potential limitations or requirements associated with each type of data point so that your results are valid and reliable when drawing conclusions from them.
Clean up any outliers: You may need to spend some time upfront investigating suspicious outliers in your dataset before using it in further analyses; otherwise they can skew results later if not dealt with first. They can also make complex statistical modeling more difficult, since they artificially inflate values depending on their magnitude (a single outlier can affect an entire model's prior distributions). Missing values should also be accounted for, since they may not be obvious at first glance when reviewing a table or graphical representation, but accurate statistics must still be obtained either way.
Exploratory data analysis: After cleaning your dataset, do some basic exploration by visualizing summaries such as boxplots, histograms and scatter plots. This will give you an initial sense of what trends might exist across demographic, geographic and other regions and variables, which can then inform future predictive models. This step will also highlight any clear discontinuities over time, helping ensure predictors contribute meaningful signal rather than noise to overall predictions.
Analyze key metrics and observations: Once exploratory analyses have been carried out on the raw samples, post-processing steps come next, such as analyzing correlations among explanatory variables, performing significance tests on regression models, and imputing missing or outlier values, depending on the specific needs of the project.
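The outlier step above can be sketched with a standard 1.5×IQR fence; the column name "energy_use" is a hypothetical placeholder, not a column confirmed in the Data Book:

```python
# Sketch: flag values outside the 1.5*IQR fences of a numeric column.
import pandas as pd

def iqr_outliers(series):
    """Boolean mask marking values outside the 1.5*IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (series < lo) | (series > hi)

df = pd.DataFrame({"energy_use": [10, 12, 11, 13, 12, 95]})
print(df[iqr_outliers(df["energy_use"])])  # the 95 row stands out
```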
- Creating an energy efficiency rating system for buildings - Using the dataset, an organization can develop a metric to rate the energy efficiency of commercial and residential buildings in a standardized way.
- Developing targeted campaigns to raise awareness about energy conservation - Analyzing data from this dataset can help organizations identify areas of high energy consumption and create targeted campaigns and incentives to encourage people to conserve energy in those areas.
- Estimating costs associated with upgrading building technologies - By evaluating various trends in building technologies and their associated costs, decision-makers can determine the most cost-effective option when it comes time to upgrade their structures' energy efficiency...
Distributed data mining from privacy-sensitive multi-party data is likely to play an important role in the next generation of integrated vehicle health monitoring systems. For example, consider an aircraft manufacturer C manufacturing an aircraft model A and selling it to five different airline operating companies V1, ..., V5. These aircraft, during their operation, generate huge amounts of data. Mining this data can reveal useful information regarding the health and operability of the aircraft, which can be useful for disaster management and prediction of efficient operating regimes. Now, if the manufacturer C wants to analyze the performance data collected from different aircraft of model type A belonging to different airlines, then central collection of data for subsequent analysis may not be an option. It should be noted that the result of this analysis may be statistically more significant if the data for aircraft model A across all companies were available to C. The potential problems arising out of such a data mining scenario are:
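As one illustration of how analysis can proceed without central data collection (a generic additive secret-sharing "secure sum", not the specific protocol of this work), each operator splits its statistic into random shares so that only the aggregate is recoverable:

```python
# Illustrative secure-sum sketch: each party's value is split into
# random shares mod `modulus`; shares cancel in the aggregate, so the
# manufacturer learns only the total, never a single party's value.
import random

def secure_sum(values, modulus=10**9):
    n = len(values)
    # Each party splits its value into n random shares mod `modulus`.
    shares = []
    for v in values:
        parts = [random.randrange(modulus) for _ in range(n - 1)]
        parts.append((v - sum(parts)) % modulus)
        shares.append(parts)
    # Each party sums the shares it receives; subtotals are combined.
    subtotals = [sum(shares[p][i] for p in range(n)) % modulus
                 for i in range(n)]
    return sum(subtotals) % modulus

print(secure_sum([120, 340, 95, 210, 88]))  # 853, same as the plain sum
```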
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
AeroSonicDB (YPAD-0523): Labelled audio dataset for acoustic detection and classification of aircraft
Version 1.1.2 (November 2023)
[UPDATE: June 2024]
Version 2.0 is currently in beta and can be found at https://zenodo.org/records/12775560. The repository is currently restricted, however you can gain access by emailing Blake Downward at aerosonicdb@gmail.com, or by submitting the following Google Form.
Version 2 vastly extends the number of aircraft audio samples to over 3,000 (V1 contains 625 aircraft samples), for more than 38 hours of strongly annotated aircraft audio (V1 contains 8.9 hours of aircraft audio).
Publication
When using this data in an academic work, please reference the dataset DOI and version. Please also reference the following paper which describes the methodology for collecting the dataset and presents baseline model results.
Downward, B., & Nordby, J. (2023). The AeroSonicDB (YPAD-0523) Dataset for Acoustic Detection and Classification of Aircraft. ArXiv, abs/2311.06368.
Description
AeroSonicDB:YPAD-0523 is a specialised dataset of ADS-B labelled audio clips for research in the fields of environmental noise attribution and machine listening, particularly acoustic detection and classification of low-flying aircraft. Audio files in this dataset were recorded at locations in close proximity to a flight path approaching or departing Adelaide International Airport's (ICAO code: YPAD) primary runway, 05/23. Recordings are initially labelled from radio (ADS-B) messages received from the aircraft overhead, then human verified and annotated with the first and final moments which the target aircraft is audible.
A total of 1,895 audio clips are distributed across two top-level classes, "Aircraft" (8.87 hours) and "Silence" (3.52 hours). The aircraft class is then further broken down into four subclasses, which broadly describe the structure of the aircraft and its propulsion mechanism. A variety of additional "airframe" features are provided to give researchers finer control of the dataset, and the opportunity to develop ontologies specific to their own use case.
For convenience, the dataset has been split into training (10.04 hours) and testing (2.35 hours) subsets, with the training set further split into 5 distinct folds for cross-validation. These splits are performed to prevent data-leakage between folds and the test set, ensuring samples collected in the same recording session (distinct in time, location and microphone) are assigned to the same fold.
Researchers may find applications for this dataset in a number of fields; particularly aircraft noise isolation and noise monitoring in an urban environment, development of passive acoustic systems to assist radar technology, and understanding the sources of aircraft noise to help manufacturers design less-noisy aircraft.
Audio data
ADS-B (Automatic Dependent Surveillance–Broadcast) messages transmitted directly from aircraft are used to automatically trigger, capture and label audio samples. A 60-second recording is triggered when an aircraft transmits a message indicating it is within a specified distance of the recording device (see "Location data" below for specifics). The resulting audio file is labelled with the unique ICAO identifier code for the aircraft, as well as its last reported altitude, date, time, location and microphone. The recording is then human verified and annotated with timestamps for the first and last moments the aircraft is audible. In total, AeroSonicDB contains 625 recordings of low-altitude aircraft - varying in length from 18 to 60 seconds, for a total of 8.87 hours of aircraft audio.
A collection of urban background noise without aircraft (silence) is included with the dataset as a means of distinguishing location specific environmental noises from aircraft noises. 10-second background noise, or "silence" recordings are triggered only when there are no aircraft broadcasting they are within a specified distance of the recording device (see "Location data" below). These "silence" recordings are also human verified to ensure no aircraft noise is present. The dataset contains 1,270 clips of silence/urban background noise.
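The distance trigger described above can be sketched as follows; the haversine helper, the threshold and the example coordinates are illustrative, not the project's actual implementation:

```python
# Sketch: start a recording when an ADS-B position report falls
# within the location's trigger radius (illustrative logic only).
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def should_trigger(aircraft_pos, mic_pos, trigger_km=3.0):
    return haversine_km(*aircraft_pos, *mic_pos) <= trigger_km

print(should_trigger((-34.85, 138.55), (-34.86, 138.54)))  # True
```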
Location data
Recordings have been collected from three (3) locations. GPS coordinates for each location are provided in the "locations.json" file. In order to protect privacy, coordinates have been provided for a road or public space nearby the recording device instead of its exact location.
Location: 0
Situated in a suburban environment approximately 15.5km north-east of the start/end of the runway. For Adelaide, typical south-westerly winds bring most arriving aircraft past this location on approach. Winds from the north or east will cause aircraft to take off to the north-east; however, not all departing aircraft will maintain a course that triggers a recording at this location. The "trigger distance" for this location is set to 3km to ensure small/slower aircraft and large/faster aircraft are captured within a sixty-second recording.
"Silence" or ambient background noises at this location include; cars, motorbikes, light-trucks, garbage trucks, power-tools, lawn mowers, construction sounds, sirens, people talking, dogs barking and a wide range of Australian native birds (New Holland Honeyeaters, Wattlebirds, Australian Magpies, Australian Ravens, Spotted Doves, Rainbow Lorikeets and others).
Location: 1
Situated approximately 500m south-east of the south-eastern end of the runway, this location is near recreational areas (golf course, skate park and parklands), with a busy road/highway in between the location and the runway. This location features heavy winds and road traffic, as well as people talking, walking and riding, and birds such as the Australian Magpie and Noisy Miner. The trigger distance for this location is set to 1km. Due to their low altitude, aircraft are louder, but audible for a shorter time compared to "Location 0".
Location: 2
As an alternative to "Location 1", this location is situated approximately 950m south-east of the end of the runway. This location has a wastewater facility to the north, a residential area to the south and a popular beach to the west. It offers greater wind protection and further distance from airport and highway noises. Ambient background sounds feature close-proximity cars and motorbikes, cyclists, people walking, nail guns and other construction sounds, as well as the local birds mentioned above.
Aircraft metadata
Supplementary "airframe" metadata for all aircraft has been gathered to help broaden the research possibilities of this dataset. Airframe information was collected and cross-checked from a number of open-source databases. The author has no reason to believe any significant errors exist in the "aircraft_meta" files; however, future versions of this dataset plan to obtain aircraft information directly from ICAO (International Civil Aviation Organization) to ensure a single, verifiable source of information.
Class/subclass ontology (minutes of recordings)
no aircraft (211)
  0: no aircraft (211)
aircraft (533)
  1: piston-propeller aeroplane (30)
  2: turbine-propeller aeroplane (90)
  3: turbine-fan aeroplane (409)
  4: rotorcraft (4)
The subclasses are a combination of the "airframe" and "engtype" features. Piston and turboshaft rotorcraft/helicopters have been combined into a single subclass due to the small number of samples.
Data splits
Audio recordings have been split into training (81%) and test (19%) sets. The training set has further been split into 5 folds, giving researchers a common split to perform 5-fold cross-validation, ensuring reproducibility and comparable results. Data leakage into the test set has been avoided by ensuring test recordings are disjoint from the training set in time and location, meaning samples in the test set for a particular location were recorded after any samples included in the training set for that location.
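The published folds can be reconstructed directly from "sample_meta.csv" using its "train-test", "fold" and "filename" columns; the filtering logic here is a reasonable reading of the description, not official code:

```python
# Sketch: yield (train_files, val_files) per cross-validation fold
# from the metadata table, keeping the test split untouched.
import pandas as pd

def cv_splits(meta):
    """Yield (train_files, val_files) for each fold in the metadata."""
    train = meta[meta["train-test"] == "train"]
    for k in sorted(train["fold"].unique()):
        val = train[train["fold"] == k]["filename"].tolist()
        trn = train[train["fold"] != k]["filename"].tolist()
        yield trn, val

# Typical usage:
# meta = pd.read_csv("sample_meta.csv")
# for trn, val in cv_splits(meta): ...
```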
Labelled data
The entire dataset (training and test) is referenced and labelled in the "sample_meta.csv" file. Each row contains a reference to a unique recording, its meta information, annotations and airframe features.
Alternatively, these labels can be derived directly from the filename of the sample (see below). The "aircraft_meta.csv" and "aircraft_meta.json" files can be used to reference aircraft specific features - such as; manufacturer, engine type, ICAO type designator etc. (see "Columns/Labels" below for all features).
File naming convention
Audio samples are in WAV format, with some metadata stored in the filename.
Basic Convention
"Aircraft ID + Date + Time + Location ID + Microphone ID"
"XXXXXX_YYYY-MM-DD_hh-mm-ss_X_X"
Sample with aircraft
{hex_id} _ {date} _ {time} _ {location_id} _ {microphone_id} . {file_ext}
7C7CD0_2023-05-09_12-42-55_2_1.wav
Sample without aircraft
"Silence" files are denoted with six (6) leading zeros rather than an aircraft hex code. All relevant metadata for "silence" samples are contained in the audio filename, and again in the accompanying "sample_meta.csv"
000000 _ {date} _ {time} _ {location_id} _ {microphone_id} . {file_ext}
000000_2023-05-09_12-30-55_2_1.wav
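A small helper for unpacking the convention above into its fields (illustrative, not an official utility):

```python
# Sketch: parse an AeroSonicDB filename into its metadata fields.
def parse_filename(name):
    stem = name.rsplit(".", 1)[0]
    hex_id, date, time, loc, mic = stem.split("_")
    return {
        "hex_id": hex_id,
        "is_silence": hex_id == "000000",  # six zeros = no aircraft
        "date": date,
        "time": time,
        "location_id": int(loc),
        "microphone_id": int(mic),
    }

print(parse_filename("7C7CD0_2023-05-09_12-42-55_2_1.wav"))
```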
Columns/Labels
(found in sample_meta.csv, aircraft_meta.csv/json files)
train-test: Train-test split (train, test)
fold: Digit from 1 to 5 splitting the training data 5 ways (else test)
filename: The filename of the audio recording
date: Date of the recording
time: Time of the recording
location: ID for the location of the recording
mic: ID of the microphone used
class: Top-level label for the recording (e.g. 0 = No aircraft, 1 = Aircraft audible)
subclass: Subclass label for the recording (e.g. 0 = No aircraft, 3 = Turbine-fan aeroplane)
altitude: Approximate altitude of the aircraft (in feet) at the start of the recording
hex_id: Unique ICAO 24-bit address for the aircraft recorded
session: Unique recording session identifier