CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
CSV file - 19,237 rows x 18 columns (includes the Price column as the target)
Columns: ID, Price (price of the car; target column), Levy, Manufacturer, Model, Prod. year, Category, Leather interior, Fuel type, Engine volume, Mileage, Cylinders, Gear box type, Drive wheels, Doors, Wheel, Color, Airbags
Confused or have any doubts about the data column values? Check the dataset discussion tab!
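For a notebook working toward the price target, the first step is usually separating the Price column from the 17 feature columns. A minimal pandas sketch; the toy rows below stand in for the real 19,237 x 18 CSV, and the filename in the comment is an assumption:

```python
import pandas as pd

def split_features_target(df, target="Price"):
    """Separate the target column (Price) from the feature columns."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y

# Tiny stand-in for the real CSV, which would be read with something
# like pd.read_csv("car_price.csv") -- the filename is an assumption.
cars = pd.DataFrame({
    "ID": [1, 2],
    "Price": [13500, 8200],
    "Manufacturer": ["LEXUS", "HONDA"],
    "Mileage": [186005, 192000],
})
X, y = split_features_target(cars)
```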
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1. Introduction
Sales data collection is a crucial aspect of any manufacturing industry as it provides valuable insights about the performance of products, customer behaviour, and market trends. By gathering and analysing this data, manufacturers can make informed decisions about product development, pricing, and marketing strategies in Internet of Things (IoT) business environments like the dairy supply chain.
One of the most important benefits of the sales data collection process is that it allows manufacturers to identify their most successful products and target their efforts towards those areas. For example, if a manufacturer notices that a particular product is selling well in a certain region, this information can be used to develop new products, optimise the supply chain, or improve existing products to meet the changing needs of customers.
This dataset includes information about 7 of MEVGAL’s products [1]. The published data will help researchers understand the dynamics of the dairy market and its consumption patterns, creating fertile ground for synergies between academia and industry and eventually helping the industry make informed decisions regarding product development, pricing and market strategies in the IoT playground. The dataset can also be used to understand the impact of external factors on the dairy market, such as economic, environmental, and technological factors, and can help in understanding the current state of the dairy industry and identifying potential opportunities for growth and development.
Please cite the following papers when using this dataset:
I. Siniosoglou, K. Xouveroudis, V. Argyriou, T. Lagkas, S. K. Goudos, K. E. Psannis and P. Sarigiannidis, "Evaluating the Effect of Volatile Federated Timeseries on Modern DNNs: Attention over Long/Short Memory," in the 12th International Conference on Circuits and Systems Technologies (MOCAST 2023), April 2023, Accepted
The dataset includes data regarding the daily sales of a series of dairy product codes offered by MEVGAL. In particular, the dataset includes information gathered by the logistics division and agencies within the industrial infrastructures overseeing the production of each product code. The products included in this dataset represent the daily sales and logistics of a variety of yogurt-based stock. Each of the different files includes the logistics for that product on a daily basis for three years, from 2020 to 2022.
3.1 Data Collection
The process of building this dataset involves several steps to ensure that the data is accurate, comprehensive and relevant.
The first step is to determine the specific data that is needed to support the business objectives of the industry, i.e., in this publication’s case the daily sales data.
Once the data requirements have been identified, the next step is to implement an effective sales data collection method. In MEVGAL’s case this is conducted through direct communication and reports generated each day by representatives & selling points.
It is also important for MEVGAL to ensure that the data collection process is conducted in an ethical and compliant manner, adhering to data privacy laws and regulations. The industry also has a data management plan in place to ensure that the data is securely stored and protected from unauthorised access.
The published dataset consists of 13 features providing information about the date and the number of products sold. Finally, the dataset was anonymised in consideration of the privacy requirements of the data owner (MEVGAL).
| File | Period | Number of Samples (days) |
| --- | --- | --- |
| product 1 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 1 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 1 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 2 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 2 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 2 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 3 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 3 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 3 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 4 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 4 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 4 2022.xlsx | 01/01/2022–31/12/2022 | 364 |
| product 5 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 5 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 5 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 6 2020.xlsx | 01/01/2020–31/12/2020 | 362 |
| product 6 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 6 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 7 2020.xlsx | 01/01/2020–31/12/2020 | 362 |
| product 7 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 7 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
3.2 Dataset Overview
The following table enumerates and explains the features included in all of the files.
| Feature | Description | Unit |
| --- | --- | --- |
| Day | Day of the month | - |
| Month | Month | - |
| Year | Year | - |
| daily_unit_sales | Daily sales: the number of products, measured in units, sold on that specific day | units |
| previous_year_daily_unit_sales | Previous year's sales: the number of products, measured in units, sold on the same day of the previous year | units |
| percentage_difference_daily_unit_sales | The percentage difference between the two values above | % |
| daily_unit_sales_kg | The amount of products, measured in kilograms, sold on that specific day | kg |
| previous_year_daily_unit_sales_kg | Previous year's sales: the amount of products, measured in kilograms, sold on the same day of the previous year | kg |
| percentage_difference_daily_unit_sales_kg | The percentage difference between the two values above | % |
| daily_unit_returns_kg | The percentage of the products that were shipped to selling points and were returned | % |
| previous_year_daily_unit_returns_kg | The percentage of the products that were shipped to selling points and were returned the previous year | % |
| points_of_distribution | The number of sales representatives through which the product was sold to the market for this year | - |
| previous_year_points_of_distribution | The number of sales representatives through which the product was sold to the market on the same day of the previous year | - |
Table 1 – Dataset Feature Description
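The percentage-difference features pair each value with its previous-year counterpart; they are presumably derived as (current − previous) / previous × 100. The exact formula is an assumption, but a sketch of that derivation is:

```python
def pct_difference(current, previous):
    """Percentage difference of a value vs. its previous-year counterpart.
    Formula assumed: (current - previous) / previous * 100."""
    if previous == 0:
        return float("nan")  # undefined when nothing was sold last year
    return (current - previous) / previous * 100.0

# 120 units sold today vs. 100 on the same day last year -> +20%.
change = pct_difference(120, 100)
```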
4.1 Dataset Structure
The provided dataset has the following structure:

| Name | Type | Property |
| --- | --- | --- |
| Readme.docx | Report | A file that contains the documentation of the dataset. |
| product X | Folder | A folder containing the data of product X. |
| product X YYYY.xlsx | Data file | An Excel file containing the sales data of product X for year YYYY. |
Table 2 - Dataset File Description
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 957406 (TERMINET).
References
[1] MEVGAL is a Greek dairy production company.
Success.ai’s Company Financial Data for Banking & Capital Markets Professionals in the Middle East offers a reliable and comprehensive dataset designed to connect businesses with key stakeholders in the financial sector. Covering banking executives, capital markets professionals, and financial advisors, this dataset provides verified contact details, decision-maker profiles, and firmographic insights tailored for the Middle Eastern market.
With access to over 170 million verified professional profiles and 30 million company profiles, Success.ai ensures your outreach and strategic initiatives are powered by accurate, continuously updated, and AI-validated data. Backed by our Best Price Guarantee, this solution empowers your organization to build meaningful connections in the region’s thriving financial industry.
Why Choose Success.ai’s Company Financial Data?
Verified Contact Data for Financial Professionals
Targeted Insights for the Middle East Financial Sector
Continuously Updated Datasets
Ethical and Compliant
Data Highlights:
Key Features of the Dataset:
Decision-Maker Profiles in Banking & Capital Markets
Advanced Filters for Precision Targeting
Firmographic and Leadership Insights
AI-Driven Enrichment
Strategic Use Cases:
Sales and Lead Generation
Market Research and Competitive Analysis
Partnership Development and Vendor Evaluation
Recruitment and Talent Solutions
Why Choose Success.ai?
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Fruits dataset is an image classification dataset of various fruits against white backgrounds from various angles, originally open-sourced by GitHub user horea. This is a subset of that full dataset.
Example Image:
https://github.com/Horea94/Fruit-Images-Dataset/blob/master/Training/Apple%20Braeburn/101_100.jpg?raw=true
Build a fruit classifier! This could be a just-for-fun project just as much as you could be building a color sorter for agricultural use cases before fruits make their way to market.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or with additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.
This is a daily updated dataset, automatically collecting market data for the G-Research crypto forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated. See the discussion topic.
For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.
1. **timestamp** - A timestamp for the minute covered by the row.
2. **Asset_ID** - An ID code for the cryptoasset.
3. **Count** - The number of trades that took place this minute.
4. **Open** - The USD price at the beginning of the minute.
5. **High** - The highest USD price during the minute.
6. **Low** - The lowest USD price during the minute.
7. **Close** - The USD price at the end of the minute.
8. **Volume** - The number of cryptoasset units traded during the minute.
9. **VWAP** - The volume-weighted average price for the minute.
10. **Target** - 15-minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
12. **Asset_Name** - Human readable Asset name.
The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.
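Because rows are minute candles indexed by timestamp, coarser bars can be built with pandas `resample`. A minimal sketch on synthetic data; the column names follow the field list above, and the aggregation rules are the usual OHLCV conventions:

```python
import pandas as pd

def resample_ohlcv(df, rule="1h"):
    """Aggregate 1-minute candles (timestamp-indexed) into coarser OHLCV bars."""
    return df.resample(rule).agg({
        "Open": "first",   # first open of the window
        "High": "max",     # highest high
        "Low": "min",      # lowest low
        "Close": "last",   # last close
        "Volume": "sum",   # total units traded
        "Count": "sum",    # total number of trades
    })

# Two hours of synthetic minute candles standing in for the real data.
idx = pd.date_range("2021-01-01", periods=120, freq="min")
minute = pd.DataFrame({
    "Open": range(120), "High": range(1, 121),
    "Low": range(120), "Close": range(1, 121),
    "Volume": [1.0] * 120, "Count": [2] * 120,
}, index=idx)
hourly = resample_ohlcv(minute)
```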
The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architectures, losses, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.
These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris's great notebook series on the SIIM-ISIC melanoma detection competition here.
This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:
Opening price with an added indicator (MA50):
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media
Volume and number of trades:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media
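The MA50 indicator shown in the first chart is a simple 50-period rolling mean of the opening price; a minimal pandas sketch on synthetic prices:

```python
import pandas as pd

def add_ma(df, column="Open", window=50):
    """Append a simple moving average column such as MA50 over the opening price."""
    out = df.copy()
    out[f"MA{window}"] = out[column].rolling(window).mean()
    return out

# Synthetic opening prices 1..100 standing in for real candle data.
prices = pd.DataFrame({"Open": [float(i) for i in range(1, 101)]})
with_ma = add_ma(prices)  # MA50 is NaN until 50 observations are available
```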
This data is being collected automatically from the crypto exchange Binance.
The Cybersecurity Framework Manufacturing Profile Low Security Level Example Implementations Guide provides example proof-of-concept solutions demonstrating how open-source and commercial off-the-shelf (COTS) products that are currently available can be implemented in manufacturing environments to satisfy the requirements in the Cybersecurity Framework (CSF) Manufacturing Profile Low Security Level. Example proof-of-concept solutions for a process-based manufacturing environment and a discrete-based manufacturing environment are included in the guide. Depending on factors like size, sophistication, risk tolerance, and threat landscape, manufacturers should make their own determinations about the breadth of the proof-of-concept solutions they may voluntarily implement. The dataset includes all of the raw and processed measurement data for the example implementation of the discrete-based manufacturing system use case.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The objective of this dataset is the fault diagnosis in diesel engines to assist the predictive maintenance, through the analysis of the variation of the pressure curves inside the cylinders and the torsional vibration response of the crankshaft. Hence a fault simulation model based on a zero-dimensional thermodynamic model was developed.
The adopted feature vectors were chosen from the thermodynamic model and obtained from processing signals as pressure and temperature inside the cylinder, as well as, torsional vibration of the engine’s flywheel. These vectors are used as input of the machine learning technique in order to discriminate among several machine conditions.
The database is expected to emulate all operating scenarios under study. In our case, all possible diesel machine faults and system conditions variations, which correspond to severities levels containing enough information to characterize and discriminate the faults. The developed database covered the following operating conditions: Normal (without faults), Pressure reduction in the intake manifold, Compression ratio reduction in the cylinders and Reduction of amount of fuel injected into the cylinders.
In all scenarios, the motor rotation frequency was set at 2500 RPM. The rotation of 2500 RPM was used, since it presented the lowest joint error rate in the estimation of the mean and maximum pressures of the burning cycle, between the experimental data (according to data supplied by the manufacturer) and the simulated data, during the validation stage of the thermodynamic and dynamic models.
The entire database comprises a total of 3500 different fault scenarios across 4 distinct operational conditions: 250 from the "normal" class, 250 from the "pressure reduction in the intake manifold" class, 1500 from the "compression ratio reduction in the cylinders" class, and 1500 from the "reduction of amount of fuel injected into the cylinders" class. This database is named the 3500-DEFault database.
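Given the heavy class imbalance (250 vs. 1500 scenarios per class), inverse-frequency class weights are a common way to keep the machine-learning step from ignoring the rare classes. A sketch of that calculation; the class names are shorthand, not the dataset's own labels:

```python
# Scenario counts per operating condition, as stated above (names are shorthand).
class_counts = {
    "normal": 250,
    "intake_pressure_reduction": 250,
    "compression_ratio_reduction": 1500,
    "fuel_injection_reduction": 1500,
}
total = sum(class_counts.values())  # 3500 scenarios in all

# Inverse-frequency weights, total / (n_classes * count), following the
# common "balanced" heuristic: rare classes get proportionally larger weights.
weights = {k: total / (len(class_counts) * v) for k, v in class_counts.items()}
```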
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
The dataset could include various features and measurements related to the engine health of vehicles, such as engine RPM, temperature, pressure, and other sensor data. It may also include metadata on the vehicle, such as make, model, year, and mileage.
One potential project using this dataset could be to build a predictive maintenance model for automotive engines. By analyzing the patterns and trends in the data, machine learning algorithms could be trained to predict when an engine is likely to require maintenance or repair. This could help vehicle owners and mechanics proactively address potential issues before they become more severe, leading to better vehicle performance and longer engine lifetimes.
Another potential use for this dataset could be to analyze the performance of different types of engines and vehicles. Researchers could use the data to compare the performance of engines from different manufacturers, for example, or to evaluate the effectiveness of different maintenance strategies. This could help drive innovation and improvements in the automotive industry.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is James Allison: a biography of the engine manufacturer and Indianapolis 500 cofounder. It features 7 columns including author, publication date, language, and book publisher.
The Fabrics Dataset consists of about 2000 samples of garments and fabrics. A small patch of each surface has been captured under 4 different illumination conditions using a custom-made, portable photometric stereo sensor. All images have been acquired "in the field" (at clothes shops) and the dataset reflects the distribution of fabrics in the real world, hence it is not balanced. The majority of clothes are made of specific fabrics, such as cotton and polyester, while other fabrics, such as silk and linen, are rarer. Also, a large number of clothes are not composed of a single fabric; two or more fabrics are used to give the garment the desired properties (blended fabrics). For every garment there is information (attributes) about its material composition from the manufacturer label and its type (pants, shirt, skirt etc.).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Amazon Product Reviews Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/amazon-product-reviews-datasete on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains 30K records of product reviews from amazon.com.
This dataset was created by PromptCloud and DataStock
This dataset contains the following:
Total Records Count: 43729
Domain Name: amazon.com
Date Range: 01st Jan 2020 - 31st Mar 2020
File Extension: CSV
Available Fields:
-- Uniq Id,
-- Crawl Timestamp,
-- Billing Uniq Id,
-- Rating,
-- Review Title,
-- Review Rating,
-- Review Date,
-- User Id,
-- Brand,
-- Category,
-- Sub Category,
-- Product Description,
-- Asin,
-- Url,
-- Review Content,
-- Verified Purchase,
-- Helpful Review Count,
-- Manufacturer Response
We wouldn't be here without the help of our in-house teams at PromptCloud and DataStock, who have put their heart and soul into this project, as they do with every project. We want to provide the best quality data, and we will continue to do so.
The inspiration for these datasets came from research. Reviews are important to everybody across the globe, so we decided to come up with this dataset, which shows exactly how user reviews help companies better their products.
This dataset was created by PromptCloud and contains around 0 samples along with Billing Uniq Id, Verified Purchase, technical information and other features such as: - Crawl Timestamp - Manufacturer Response - and more.
- Analyze Helpful Review Count in relation to Sub Category
- Study the influence of Review Date on Product Description
- More datasets
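The first idea above, relating "Helpful Review Count" to "Sub Category", is a one-line groupby once the CSV is loaded. A sketch on toy rows; the field names are taken from the list above, and the real frame would come from reading the dataset file:

```python
import pandas as pd

def mean_helpful_by_subcategory(df):
    """Average 'Helpful Review Count' per 'Sub Category', most helpful first."""
    return (df.groupby("Sub Category")["Helpful Review Count"]
              .mean()
              .sort_values(ascending=False))

# Toy rows standing in for the real reviews CSV.
reviews = pd.DataFrame({
    "Sub Category": ["Audio", "Audio", "Video"],
    "Helpful Review Count": [4, 6, 1],
})
ranked = mean_helpful_by_subcategory(reviews)
```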
If you use this dataset in your research, please credit PromptCloud
--- Original source retains full ownership of the source dataset ---
Analysis of ‘Television Brands Ecommerce Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/devsubhash/television-brands-ecommerce-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains 912 samples with 7 attributes. There are some missing values in this dataset.
Here are the columns in this dataset:
1. Brand: the manufacturer of the product, i.e. the television
2. Resolution: multiple categories indicating the type of display, i.e. LED, HD LED, etc.
3. Size: the screen size in inches
4. Selling Price: the selling price or the discounted price of the product
5. Original Price: the original price of the product from the manufacturer
6. Operating system: a categorical variable showing the type of OS, like Android, Linux, etc.
7. Rating: average customer ratings on a scale of 5
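With both Selling Price and Original Price available, the discount implied by each listing is a simple derived quantity; a minimal sketch:

```python
def discount_pct(selling, original):
    """Discount implied by the selling price vs. the original price, in percent."""
    if original <= 0:
        return float("nan")  # guard against bad or missing original prices
    return (original - selling) / original * 100.0

# A set priced at 100 and sold at 75 carries a 25% discount.
example = discount_pct(75, 100)
```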
Inspiration: This dataset could be used to explore the current market scenario for Televisions. There are various types of screens with different operating systems offered by several manufacturers at competitive prices. Some questions this dataset could be used to answer are -
--- Original source retains full ownership of the source dataset ---
Clotho is an audio captioning dataset, now at version 2. Clotho consists of 6974 audio samples, and each audio sample has five captions (a total of 34,870 captions). Audio samples are 15 to 30 s in duration and captions are eight to 20 words long.
Clotho is thoroughly described in our paper:
K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.
available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990
If you use Clotho, please cite our paper.
To use the dataset, you can use our code at: https://github.com/audio-captioning/clotho-dataset
These are the files for the development, validation, and evaluation splits of Clotho dataset.
== Changes in version 2.1 ==
In version 2.1 of Clotho, we fixed some files that were corrupted from the compression and transferring processes (around 150 files) and we also replaced some characters that were illegal for most filesystems, e.g. ":" (around 10 files).
Please use this version for your experiments.
== Changes in version 2 ==
In version 2 of Clotho, there are audio files added in the development split and a new validation split is added. There are no changes in the evaluation split.
Specifically:
Now there are 3840 audio files in the development split. In Clotho version 1, there were 2893 audio files. Now, 947 new audio files are added.
There are 1046 new audio files in the validation split.
All new captions are treated as in version 1 of Clotho, i.e. having word consistency, no named entities, no speech transcription, and no hapax legomena between splits (i.e. words appearing only in one of the splits).
== Usage ==
To use the dataset you have to:
Download the audio files: clotho_audio_development.7z,clotho_audio_validation.7z, and clotho_audio_evalution.7z
Download the files with the captions: clotho_captions_development.csv, clotho_captions_validation.csv, and clotho_captions_evaluation.csv
Download the files with the associated metadata: clotho_metadata_development.csv, clotho_metadata_validation.csv, and clotho_metadata_evaluation.csv
Extract the audio files
Then you can use each audio file with its corresponding captions
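After extraction, each audio file can be paired with its five captions by reading the captions CSV. A sketch using an inline stand-in; the column names (`file_name`, `caption_1` … `caption_5`) are assumptions about the CSV layout:

```python
import csv
from io import StringIO

def captions_by_file(csv_text):
    """Map each audio file name to its five captions."""
    return {
        row["file_name"]: [row[f"caption_{i}"] for i in range(1, 6)]
        for row in csv.DictReader(StringIO(csv_text))
    }

# Inline stand-in for clotho_captions_development.csv (layout assumed).
sample = ("file_name,caption_1,caption_2,caption_3,caption_4,caption_5\n"
          "birds.wav,a,b,c,d,e\n")
caps = captions_by_file(sample)
```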
== License ==
The audio files in the archives:
clotho_audio_development.7z,
clotho_audio_validation.7z, and
clotho_audio_evalution.7z
and the associated meta-data in the CSV files:
clotho_metadata_development.csv
clotho_metadata_validation.csv
clotho_metadata_evaluation.csv
are under the corresponding licences (mostly Creative Commons with attribution) of the Freesound [1] platform, mentioned explicitly in the CSV files for each of the audio files. That is, each audio file in the 7z archives is listed in the CSV files with its meta-data. The meta-data for each file are:
File name
Keywords
URL for the original audio file
Start and ending samples for the excerpt that is used in the Clotho dataset
Uploader/user in the Freesound platform (manufacturer)
Link to the licence of the file
The captions in the files:
clotho_captions_development.csv
clotho_captions_validation.csv
clotho_captions_evaluation.csv
are under the Tampere University licence, described in the LICENCE file (mainly a non-commercial with attribution licence).
== References == [1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 112 non-contrast cranial CT scans of patients with hyperacute stroke, featuring delineated zones of penumbra and core of the stroke on each slice where present. The data in the dataset are anonymized using the Kitware DicomAnonymizer, with standard anonymization settings, except for preserving the values of the following fields:
(0x0010, 0x0040) – Patient's Sex
(0x0010, 0x1010) – Patient's Age
(0x0008, 0x0070) – Manufacturer
(0x0008, 0x1090) – Manufacturer’s Model Name
The patient's sex and age are retained for demographic analysis of the samples, and the equipment manufacturer and model are kept for dataset statistics and the potential for domain shift analysis.
The dataset is split into three folds:
Training fold (92 studies, 8,376 slices).
Validation fold (10 studies, 980 slices).
Testing fold (10 studies, 809 slices).
The dataset has the following structure:
metadata.json – dataset metadata
summary.csv – metadata of each study in a CSV format table
Part of the dataset (train, val, and test)
Study
Slice
raw.dcm – original slice file
image.npz – slice in Numpy array format
mask.npz – segmentation mask in Numpy array format
metadata.json – slice metadata in JSON format
metadata.json – study metadata in JSON format
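A slice and its mask can be loaded from the `.npz` files with NumPy. A sketch that round-trips synthetic arrays; the archive key `arr_0` is `np.savez`'s default and an assumption about how these files were written:

```python
import os
import tempfile
import numpy as np

def load_slice(image_path, mask_path, key="arr_0"):
    """Load a CT slice and its segmentation mask from .npz archives."""
    image = np.load(image_path)[key]
    mask = np.load(mask_path)[key]
    return image, mask

# Round-trip demo with synthetic arrays standing in for image.npz / mask.npz.
tmp = tempfile.mkdtemp()
img_p, msk_p = os.path.join(tmp, "image.npz"), os.path.join(tmp, "mask.npz")
np.savez(img_p, np.zeros((512, 512), dtype=np.int16))
np.savez(msk_p, np.zeros((512, 512), dtype=np.uint8))
image, mask = load_slice(img_p, msk_p)
```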
The metadata.json at the root of the dataset has the following format:
generation_params – dataset generation parameters:
test_size – proportion of the test part
val_size – proportion of the validation part
stats – statistical data:
common – general statistical data:
train_size_in_studies – number of studies in the training part of the dataset.
train_size_in_images – number of slices in the training part of the dataset.
val_size_in_studies – number of studies in the validation part of the dataset.
val_size_in_images – number of slices in the validation part of the dataset.
test_size_in_studies – number of studies in the test part of the dataset.
test_size_in_images – number of slices in the test part of the dataset.
train – statistical data for the training part of the dataset:
min – minimum pixel value.
max – maximum pixel value.
mean – average pixel value.
std – standard deviation for all pixel values.
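These training-fold statistics are typically used to standardize slices before feeding them to a model (applying the train mean and std to all folds is the usual convention, assumed here); a minimal sketch:

```python
import numpy as np

def normalize(image, mean, std):
    """Standardize a slice with the training-fold mean and std from metadata.json."""
    return (image.astype(np.float32) - mean) / std

# Synthetic pixel values; mean/std are placeholders for the metadata values.
x = np.array([[0.0, 100.0]], dtype=np.float32)
z = normalize(x, mean=50.0, std=25.0)
```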
The metadata.json at the root of the study has the following format; if a field value is unknown, it is given as 'unknown':
manufacturer – manufacturer of the tomograph.
model – model of the tomograph.
device – full name of the tomograph (manufacturer + model).
age – patient's age in years.
sex – patient's sex. M – male, F – female.
dsa – whether cerebral angiography was performed. true if yes, false if no.
nihss – NIHSS score.
time – time in hours from the onset of the stroke to the conduct of the study. Can be either a number or a range.
lethality – whether the person died as a result of this stroke. true if yes, false if no.
The summary.csv contains the same fields as the metadata.json from the root of the study, plus two additional fields:
name – name of the study.
part – part of the dataset in which the study is located.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
LearnPlatform is a unique technology platform in the K-12 market, providing the only broadly interoperable platform for the breadth of edtech solutions in the US K-12 field. A key component of edtech effectiveness is integrated reporting on tool usage and, where applicable, evidence of efficacy. With COVID closures, LearnPlatform has emerged as an important and singular resource to measure whether students are accessing digital resources within distance learning constraints. This platform provides a unique and needed source of data to understand whether students are accessing digital resources, and where resources have disparate usage and impact.
In this dataset we are sharing educational technology usage across the 8,000+ tools used in the education field in 2020. We make this dataset available to the public so that educators, district leaders, researchers, institutions, policy-makers or anyone interested in learning about digital learning in 2020 can use it to understand student engagement with core learning activities during the COVID-19 pandemic. Some example research questions that this dataset can help stakeholders answer:
- What is the picture of digital connectivity and engagement in 2020?
- What is the effect of the COVID-19 pandemic on online and distance learning, and how might this evolve in the future?
- How does student engagement with different types of education technology change over the course of the pandemic?
- How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
- Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with increases or decreases in online engagement?
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Pakistan Corona Virus Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zusmani/pakistan-corona-virus-citywise-data on 30 September 2021.
--- Dataset description provided by original source is as follows ---
Pakistan witnessed its first coronavirus patient on February 26th, 2020. It has been a bumpy ride since then. The cases are increasing gradually and we haven't seen the worst yet. While there are a few government resources for cumulative updates, there is no place where you can find city-level patient data. It is also not possible to find a running chronological tally of patients as they test positive. We have decided to create our own dataset for all the researchers out there with such details, so we can model the infection spread and forecast the situation in the coming days. We hope, by doing so, we will be able to inform policy makers on various intervention models, and help healthcare professionals be ready for the influx of new patients. We certainly hope that this little contribution will go a long way towards saving lives in Pakistan.
The dataset contains seven columns for date, number of cases, number of deaths, number of people recovered, travel history of those cases, and location of the cases (province and city).
The first version has the data from the first case on February 26, 2020 to April 19, 2020. We intend to publish weekly updates.
Users are allowed to use, copy, distribute and cite the dataset as follows: “Zeeshan-ul-hassan Usmani, Sana Rasheed, Pakistan Corona Virus Data, Kaggle Dataset Repository, April 19, 2020.”
Some ideas worth exploring:
Can we find the spread factor for the Corona virus in Pakistan?
How long does it take for a positive case to infect another in Pakistan?
How can we use this data to simulate lockdown scenarios and find their impact on the country's economy? Here is a good read to get started - http://zeeshanusmani.com/urdu/corona-economic-impact/
How does the Corona virus spread in Pakistan compare against its neighbors and other developed countries?
What would be the impact of this infection spread on the country's economy and on people living in poverty? Here are two briefs to get you started:
http://zeeshanusmani.com/urdu/corona/ http://zeeshanusmani.com/urdu/corona-what-to-learn/
How do we visualize this dataset to inform policy makers? Here is one example https://zeeshanusmani.com/corona/
Can we predict the number of cases in next 10 days and a month?
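For the forecasting question above, here is a minimal sketch of a short-term projection, assuming a simple exponential-growth model on cumulative case counts. The model choice and the synthetic numbers are illustrative assumptions, not part of the dataset:

```python
# Hedged sketch: fit log-linear (exponential) growth to cumulative
# case counts and project a few days ahead. A real analysis would use
# the dataset's date/case columns and a richer epidemic model.
import numpy as np

def forecast_cases(cumulative, horizon):
    """Fit log(cases) = a*day + b and project `horizon` days ahead."""
    days = np.arange(len(cumulative))
    a, b = np.polyfit(days, np.log(cumulative), 1)
    future = np.arange(len(cumulative), len(cumulative) + horizon)
    return np.exp(a * future + b)

# Synthetic history with roughly 25% daily growth (illustrative only)
history = [10, 12, 15, 19, 24, 30, 38, 48, 61, 77]
print(forecast_cases(history, 10).round())
```

A logistic or SEIR-type model would be more defensible once growth slows; the log-linear fit is only a first pass.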
--- Original source retains full ownership of the source dataset ---
Clotho-AQA is an audio question-answering dataset consisting of 1,991 audio samples taken from the Clotho dataset [1]. Each audio sample has six associated questions collected through crowdsourcing. For each question, answers are provided by three different annotators, making a total of 35,838 question-answer pairs. For each audio sample, four questions are designed to be answered with 'yes' or 'no', while the remaining two are designed to be answered with a single word. More details about the data collection and splitting processes can be found in the following paper.
S. Lipping, P. Sudarsanam, K. Drossos, T. Virtanen, ‘Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering.’ The paper is available online at https://arxiv.org/abs/2204.09634.
If you use the Clotho-AQA dataset, please cite the paper mentioned above. A sample baseline model to use the Clotho-AQA dataset can be found at partha2409/AquaNet (github.com)
To use the dataset,
• Download and extract ‘audio_files.zip’. This contains all the 1991 audio samples in the dataset.
• Download ‘clotho_aqa_train.csv’, ‘clotho_aqa_val.csv’, and ‘clotho_aqa_test.csv’. These files contain the train, validation, and test splits, respectively. They contain the audio file name, questions, answers, and confidence scores provided by the annotators.
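A minimal sketch for loading a split and separating the yes/no questions from the single-word ones; the column name "answer" is an assumption based on the description above, not a confirmed header:

```python
# Sketch, assuming an "answer" column in the split CSVs.
import pandas as pd

def split_by_answer_type(df):
    """Return (yes/no rows, single-word rows) from a split table."""
    is_binary = df["answer"].str.lower().isin(["yes", "no"])
    return df[is_binary], df[~is_binary]

# Typical usage with the files listed above:
# train = pd.read_csv("clotho_aqa_train.csv")
# binary_qa, open_qa = split_by_answer_type(train)
```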
License:
The audio files in the archive ‘audio_files.zip’ are under the corresponding licenses (mostly Creative Commons with attribution) of the Freesound [2] platform, mentioned explicitly in the CSV file ’clotho_aqa_metadata.csv’ for each of the audio files. That is, each audio file in the archive is listed in the CSV file with meta-data. The meta-data for each file are:
• File name
• Keywords
• URL for the original audio file
• Start and end samples of the excerpt used in the Clotho dataset
• Uploader/user in the Freesound platform (manufacturer)
• Link to the license of the file.
The questions and answers in the files:
• clotho_aqa_train.csv
• clotho_aqa_val.csv
• clotho_aqa_test.csv
are under the MIT license, described in the LICENSE file.
References:
[1] K. Drossos, S. Lipping and T. Virtanen, "Clotho: An Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736- 740, doi: 10.1109/ICASSP40776.2020.9052990.
[2] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
By Department of Energy [source]
The Building Energy Data Book (2011) is an invaluable resource for gaining insight into the current state of energy consumption in the buildings sector. This dataset provides comprehensive data on residential, commercial and industrial building energy consumption, construction techniques, building technologies and characteristics. With this resource, you can get an in-depth understanding of how energy is used in various types of buildings - from single-family homes to large office complexes - as well as its impact on the environment. The BTO within the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy developed this dataset to provide a wealth of knowledge for researchers, policy makers, engineers and even everyday observers who are interested in learning more about our built environment and its energy usage patterns.
This dataset provides comprehensive information regarding energy consumption in the buildings sector of the United States. It contains a number of key variables which can be used to analyze and explore the relationships between energy consumption and building characteristics, technologies, and construction. The data is provided in both CSV and tabular formats, which can be helpful for those who prefer to use programs like Excel or other statistical modeling software.
In order to get started with this dataset we've developed a guide outlining how to effectively use it for your research or project needs.
Understand what's included: Before you start analyzing the data, you should read through the provided documentation so that you fully understand what is included in the datasets. You'll want to be aware of any potential limitations or requirements associated with each type of data point so that your results are valid and reliable when drawing conclusions from them.
Clean up any outliers: You may need to spend some time upfront investigating suspicious outliers in your dataset before using it in further analyses; otherwise they can skew results later if not dealt with first. They can also make complex statistical modeling more difficult, since they artificially inflate values depending on their magnitude (a single outlier can affect an entire model's prior distributions). Missing values should also be accounted for, since they may not be obvious at first glance when reviewing a table or graphical representation, but accurate statistics must still be obtained either way.
Exploratory data analysis: After cleaning your dataset, do some basic exploration by visualizing summaries such as boxplots, histograms and scatter plots. This will give you an initial sense of what trends might exist across demographic, geographic and other regions and variables, which can then inform future predictive models. This step will also highlight any clear discontinuities over time, helping ensure predictors contribute meaningful signal rather than noise to overall predictions.
Analyze key metrics and observations: Once exploratory analyses have been carried out on the raw samples, post-processing steps come next, such as analyzing correlations among explanatory variables, performing significance tests on regression models, and imputing missing or outlier values, depending on the specific needs of the project.
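The outlier step above can be sketched with a standard 1.5×IQR fence; the column name "energy_use" is a hypothetical placeholder, not a column confirmed in the Data Book:

```python
# Sketch: flag values outside the 1.5*IQR fences of a numeric column.
import pandas as pd

def iqr_outliers(series):
    """Boolean mask marking values outside the 1.5*IQR fences."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (series < lo) | (series > hi)

df = pd.DataFrame({"energy_use": [10, 12, 11, 13, 12, 95]})
print(df[iqr_outliers(df["energy_use"])])  # the 95 row stands out
```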
- Creating an energy efficiency rating system for buildings - Using the dataset, an organization can develop a metric to rate the energy efficiency of commercial and residential buildings in a standardized way.
- Developing targeted campaigns to raise awareness about energy conservation - Analyzing data from this dataset can help organizations identify areas of high energy consumption and create targeted campaigns and incentives to encourage people to conserve energy in those areas.
- Estimating costs associated with upgrading building technologies - By evaluating various trends in building technologies and their associated costs, decision-makers can determine the most cost-effective option when it comes time to upgrade their structures' energy efficiency...
Distributed data mining from privacy-sensitive multi-party data is likely to play an important role in the next generation of integrated vehicle health monitoring systems. For example, consider an aircraft manufacturer C manufacturing an aircraft model A and selling it to five different airline operating companies V1, ..., V5. These aircraft, during their operation, generate huge amounts of data. Mining this data can reveal useful information regarding the health and operability of the aircraft, which can be useful for disaster management and prediction of efficient operating regimes. Now, if the manufacturer C wants to analyze the performance data collected from different aircraft of model type A belonging to different airlines, then central collection of data for subsequent analysis may not be an option. It should be noted that the result of this analysis may be statistically more significant if the data for aircraft model A across all companies were available to C. The potential problems arising out of such a data mining scenario are:
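As one illustration of how analysis can proceed without central data collection (a generic additive secret-sharing "secure sum", not the specific protocol of this work), each operator splits its statistic into random shares so that only the aggregate is recoverable:

```python
# Illustrative secure-sum sketch: each party's value is split into
# random shares mod `modulus`; shares cancel in the aggregate, so the
# manufacturer learns only the total, never a single party's value.
import random

def secure_sum(values, modulus=10**9):
    n = len(values)
    # Each party splits its value into n random shares mod `modulus`.
    shares = []
    for v in values:
        parts = [random.randrange(modulus) for _ in range(n - 1)]
        parts.append((v - sum(parts)) % modulus)
        shares.append(parts)
    # Each party sums the shares it receives; subtotals are combined.
    subtotals = [sum(shares[p][i] for p in range(n)) % modulus
                 for i in range(n)]
    return sum(subtotals) % modulus

print(secure_sum([120, 340, 95, 210, 88]))  # 853, same as the plain sum
```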
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
AeroSonicDB (YPAD-0523): Labelled audio dataset for acoustic detection and classification of aircraft
Version 1.1.2 (November 2023)
[UPDATE: June 2024]
Version 2.0 is currently in beta and can be found at https://zenodo.org/records/12775560. The repository is currently restricted, however you can gain access by emailing Blake Downward at aerosonicdb@gmail.com, or by submitting the following Google Form.
Version 2 vastly extends the number of aircraft audio samples to over 3,000 (V1 contains 625 aircraft samples), for more than 38 hours of strongly annotated aircraft audio (V1 contains 8.9 hours of aircraft audio).
Publication
When using this data in an academic work, please reference the dataset DOI and version. Please also reference the following paper which describes the methodology for collecting the dataset and presents baseline model results.
Downward, B., & Nordby, J. (2023). The AeroSonicDB (YPAD-0523) Dataset for Acoustic Detection and Classification of Aircraft. ArXiv, abs/2311.06368.
Description
AeroSonicDB:YPAD-0523 is a specialised dataset of ADS-B labelled audio clips for research in the fields of environmental noise attribution and machine listening, particularly acoustic detection and classification of low-flying aircraft. Audio files in this dataset were recorded at locations in close proximity to a flight path approaching or departing Adelaide International Airport's (ICAO code: YPAD) primary runway, 05/23. Recordings are initially labelled from radio (ADS-B) messages received from the aircraft overhead, then human verified and annotated with the first and final moments which the target aircraft is audible.
A total of 1,895 audio clips are distributed across two top-level classes, "Aircraft" (8.87 hours) and "Silence" (3.52 hours). The aircraft class is then further broken down into four subclasses, which broadly describe the structure of the aircraft and its propulsion mechanism. A variety of additional "airframe" features are provided to give researchers finer control of the dataset, and the opportunity to develop ontologies specific to their own use case.
For convenience, the dataset has been split into training (10.04 hours) and testing (2.35 hours) subsets, with the training set further split into 5 distinct folds for cross-validation. These splits are performed to prevent data-leakage between folds and the test set, ensuring samples collected in the same recording session (distinct in time, location and microphone) are assigned to the same fold.
Researchers may find applications for this dataset in a number of fields; particularly aircraft noise isolation and noise monitoring in an urban environment, development of passive acoustic systems to assist radar technology, and understanding the sources of aircraft noise to help manufacturers design less-noisy aircraft.
Audio data
ADS-B (Automatic Dependent Surveillance–Broadcast) messages transmitted directly from aircraft are used to automatically trigger, capture and label audio samples. A 60-second recording is triggered when an aircraft transmits a message indicating it is within a specified distance of the recording device (see "Location data" below for specifics). The resulting audio file is labelled with the unique ICAO identifier code for the aircraft, as well as its last reported altitude, date, time, location and microphone. The recording is then human verified and annotated with timestamps for the first and last moments the aircraft is audible. In total, AeroSonicDB contains 625 recordings of low-altitude aircraft - varying in length from 18 to 60 seconds, for a total of 8.87 hours of aircraft audio.
A collection of urban background noise without aircraft (silence) is included with the dataset as a means of distinguishing location specific environmental noises from aircraft noises. 10-second background noise, or "silence" recordings are triggered only when there are no aircraft broadcasting they are within a specified distance of the recording device (see "Location data" below). These "silence" recordings are also human verified to ensure no aircraft noise is present. The dataset contains 1,270 clips of silence/urban background noise.
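The distance trigger described above can be sketched as follows; the haversine helper, the threshold and the example coordinates are illustrative, not the project's actual implementation:

```python
# Sketch: start a recording when an ADS-B position report falls
# within the location's trigger radius (illustrative logic only).
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def should_trigger(aircraft_pos, mic_pos, trigger_km=3.0):
    return haversine_km(*aircraft_pos, *mic_pos) <= trigger_km

print(should_trigger((-34.85, 138.55), (-34.86, 138.54)))  # True
```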
Location data
Recordings have been collected from three (3) locations. GPS coordinates for each location are provided in the "locations.json" file. In order to protect privacy, coordinates have been provided for a road or public space nearby the recording device instead of its exact location.
Location: 0
Situated in a suburban environment approximately 15.5km north-east of the start/end of the runway. For Adelaide, typical south-westerly winds bring most arriving aircraft past this location on approach. Winds from the north or east will cause aircraft to take off to the north-east; however, not all departing aircraft will maintain a course that triggers a recording at this location. The "trigger distance" for this location is set to 3km to ensure small/slower aircraft and large/faster aircraft are captured within a sixty-second recording.
"Silence" or ambient background noises at this location include; cars, motorbikes, light-trucks, garbage trucks, power-tools, lawn mowers, construction sounds, sirens, people talking, dogs barking and a wide range of Australian native birds (New Holland Honeyeaters, Wattlebirds, Australian Magpies, Australian Ravens, Spotted Doves, Rainbow Lorikeets and others).
Location: 1
Situated approximately 500m south-east of the south-eastern end of the runway, this location is near recreational areas (golf course, skate park and parklands), with a busy road/highway in between the location and the runway. This location features heavy winds and road traffic, as well as people talking, walking and riding, and birds such as the Australian Magpie and Noisy Miner. The trigger distance for this location is set to 1km. Due to their low altitude, aircraft are louder, but audible for a shorter time compared to "Location 0".
Location: 2
As an alternative to "Location 1", this location is situated approximately 950m south-east of the end of the runway. This location has a wastewater facility to the north, a residential area to the south and a popular beach to the west. It offers greater wind protection and further distance from airport and highway noises. Ambient background sounds feature close-proximity cars and motorbikes, cyclists, people walking, nail guns and other construction sounds, as well as the local birds mentioned above.
Aircraft metadata
Supplementary "airframe" metadata for all aircraft has been gathered to help broaden the research possibilities of this dataset. Airframe information was collected and cross-checked from a number of open-source databases. The author has no reason to believe any significant errors exist in the "aircraft_meta" files; however, future versions of this dataset plan to obtain aircraft information directly from ICAO (International Civil Aviation Organization) to ensure a single, verifiable source of information.
Class/subclass ontology (minutes of recordings)
no aircraft (211)
  0: no aircraft (211)
aircraft (533)
  1: piston-propeller aeroplane (30)
  2: turbine-propeller aeroplane (90)
  3: turbine-fan aeroplane (409)
  4: rotorcraft (4)
The subclasses are a combination of the "airframe" and "engtype" features. Piston and turboshaft rotorcraft/helicopters have been combined into a single subclass due to the small number of samples.
Data splits
Audio recordings have been split into training (81%) and test (19%) sets. The training set has further been split into 5 folds, giving researchers a common split to perform 5-fold cross-validation, ensuring reproducibility and comparable results. Data leakage into the test set has been avoided by ensuring test recordings are disjoint from the training set in time and location, meaning samples in the test set for a particular location were recorded after any samples included in the training set for that location.
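The published folds can be reconstructed directly from "sample_meta.csv" using its "train-test", "fold" and "filename" columns; the filtering logic here is a reasonable reading of the description, not official code:

```python
# Sketch: yield (train_files, val_files) per cross-validation fold
# from the metadata table, keeping the test split untouched.
import pandas as pd

def cv_splits(meta):
    """Yield (train_files, val_files) for each fold in the metadata."""
    train = meta[meta["train-test"] == "train"]
    for k in sorted(train["fold"].unique()):
        val = train[train["fold"] == k]["filename"].tolist()
        trn = train[train["fold"] != k]["filename"].tolist()
        yield trn, val

# Typical usage:
# meta = pd.read_csv("sample_meta.csv")
# for trn, val in cv_splits(meta): ...
```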
Labelled data
The entire dataset (training and test) is referenced and labelled in the "sample_meta.csv" file. Each row contains a reference to a unique recording, its meta information, annotations and airframe features.
Alternatively, these labels can be derived directly from the filename of the sample (see below). The "aircraft_meta.csv" and "aircraft_meta.json" files can be used to reference aircraft specific features - such as; manufacturer, engine type, ICAO type designator etc. (see "Columns/Labels" below for all features).
File naming convention
Audio samples are in WAV format, with some metadata stored in the filename.
Basic Convention
"Aircraft ID + Date + Time + Location ID + Microphone ID"
"XXXXXX_YYYY-MM-DD_hh-mm-ss_X_X"
Sample with aircraft
{hex_id} _ {date} _ {time} _ {location_id} _ {microphone_id} . {file_ext}
7C7CD0_2023-05-09_12-42-55_2_1.wav
Sample without aircraft
"Silence" files are denoted with six (6) leading zeros rather than an aircraft hex code. All relevant metadata for "silence" samples are contained in the audio filename, and again in the accompanying "sample_meta.csv"
000000 _ {date} _ {time} _ {location_id} _ {microphone_id} . {file_ext}
000000_2023-05-09_12-30-55_2_1.wav
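A small helper for unpacking the convention above into its fields (illustrative, not an official utility):

```python
# Sketch: parse an AeroSonicDB filename into its metadata fields.
def parse_filename(name):
    stem = name.rsplit(".", 1)[0]
    hex_id, date, time, loc, mic = stem.split("_")
    return {
        "hex_id": hex_id,
        "is_silence": hex_id == "000000",  # six zeros = no aircraft
        "date": date,
        "time": time,
        "location_id": int(loc),
        "microphone_id": int(mic),
    }

print(parse_filename("7C7CD0_2023-05-09_12-42-55_2_1.wav"))
```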
Columns/Labels
(found in sample_meta.csv, aircraft_meta.csv/json files)
train-test: Train-test split (train, test)
fold: Digit from 1 to 5 splitting the training data 5 ways (else test)
filename: The filename of the audio recording
date: Date of the recording
time: Time of the recording
location: ID for the location of the recording
mic: ID of the microphone used
class: Top-level label for the recording (e.g. 0 = No aircraft, 1 = Aircraft audible)
subclass: Subclass label for the recording (e.g. 0 = No aircraft, 3 = Turbine-fan aeroplane)
altitude: Approximate altitude of the aircraft (in feet) at the start of the recording
hex_id: Unique ICAO 24-bit address for the aircraft recorded
session: Unique recording session identifier