93 datasets found
  1. Car Price Prediction Challenge

    • kaggle.com
    Updated Jul 6, 2022
    Cite
    Deep Contractor (2022). Car Price Prediction Challenge [Dataset]. https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Deep Contractor
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Assignment

    Your notebooks must contain the following steps:

    • Perform data cleaning and pre-processing.
      • What steps did you use in this process and how did you clean your data?
    • Perform exploratory data analysis on the given dataset.
      • Explain each and every graph that you make.
    • Train an ML model and evaluate it using different metrics.
      • Why did you choose that particular model? What was the accuracy?
    • Hyperparameter optimization and feature selection are a plus.
    • Model deployment and use of MLflow are a plus.
    • Perform model interpretation and show feature importance for your model.
      • Provide some explanation for the above point.
    • Future steps. Note: try to make your notebooks as presentable as possible.

    Dataset Description

    CSV file: 19,237 rows × 18 columns (includes the Price column as the target)

    Attributes

    ID, Price (price of the car; target column), Levy, Manufacturer, Model, Prod. year, Category, Leather interior, Fuel type, Engine volume, Mileage, Cylinders, Gear box type, Drive wheels, Doors, Wheel, Color, Airbags

    Confused or have any doubts about the data column values? Check the dataset discussion tab!
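
    As a starting point for the assignment steps above, here is a minimal sketch in Python (pandas + scikit-learn). The filename car_price_prediction.csv is a hypothetical placeholder for the downloaded Kaggle file, and the column names are taken from the Attributes list above.

    ```python
    # Minimal sketch of the assignment workflow; filename is a placeholder.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("car_price_prediction.csv")

    # Basic cleaning: drop the ID column and rows with a missing target.
    df = df.drop(columns=["ID"], errors="ignore").dropna(subset=["Price"])

    X = pd.get_dummies(df.drop(columns=["Price"]), drop_first=True).fillna(0)
    y = df["Price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))

    # Model interpretation step: feature importances, largest first.
    importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
    print(importances.head(10))
    ```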

  2. Dairy Supply Chain Sales Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    + more versions
    Cite
    Vasileios Argyriou (2024). Dairy Supply Chain Sales Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7853252
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Thomas Lagkas
    Konstantinos Georgakidis
    Dimitris Iatropoulos
    Christos Chaschatzis
    Dimitrios Pliatsios
    Panagiotis Sarigiannidis
    Anna Triantafyllou
    Vasileios Argyriou
    Athanasios Liatifis
    Ilias Siniosoglou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Introduction

    Sales data collection is a crucial aspect of any manufacturing industry as it provides valuable insights about the performance of products, customer behaviour, and market trends. By gathering and analysing this data, manufacturers can make informed decisions about product development, pricing, and marketing strategies in Internet of Things (IoT) business environments like the dairy supply chain.

    One of the most important benefits of the sales data collection process is that it allows manufacturers to identify their most successful products and target their efforts towards those areas. For example, if a manufacturer notices that a particular product is selling well in a certain region, this information can be used to develop new products, optimise the supply chain, or improve existing products to meet the changing needs of customers.

    This dataset includes information about 7 of MEVGAL’s products [1]. The published data will help researchers understand the dynamics of the dairy market and its consumption patterns, creating fertile ground for synergies between academia and industry and ultimately helping the industry make informed decisions regarding product development, pricing, and market strategies in the IoT playground. The dataset can also be used to understand the impact of various external factors on the dairy market, such as economic, environmental, and technological factors, and to help understand the current state of the dairy industry and identify potential opportunities for growth and development.

    2. Citation

    Please cite the following papers when using this dataset:

    I. Siniosoglou, K. Xouveroudis, V. Argyriou, T. Lagkas, S. K. Goudos, K. E. Psannis and P. Sarigiannidis, "Evaluating the Effect of Volatile Federated Timeseries on Modern DNNs: Attention over Long/Short Memory," in the 12th International Conference on Circuits and Systems Technologies (MOCAST 2023), April 2023, Accepted

    3. Dataset Modalities

    The dataset includes data regarding the daily sales of a series of dairy product codes offered by MEVGAL. In particular, the dataset includes information gathered by the logistics division and agencies within the industrial infrastructures overseeing the production of each product code. The products included in this dataset represent the daily sales and logistics of a variety of yogurt-based stock. Each of the files includes the logistics for that product on a daily basis for three years, from 2020 to 2022.

    3.1 Data Collection

    The process of building this dataset involves several steps to ensure that the data is accurate, comprehensive and relevant.

    The first step is to determine the specific data that is needed to support the business objectives of the industry, i.e., in this publication’s case the daily sales data.

    Once the data requirements have been identified, the next step is to implement an effective sales data collection method. In MEVGAL’s case this is conducted through direct communication and reports generated each day by representatives & selling points.

    It is also important for MEVGAL to ensure that the data collection process is conducted in an ethical and compliant manner, adhering to data privacy laws and regulations. The industry also has a data management plan in place to ensure that the data is securely stored and protected from unauthorised access.

    The published dataset consists of 13 features providing information about the date and the number of products sold. Finally, the dataset was anonymised in consideration of the privacy requirements of the data owner (MEVGAL).

    | File | Period | Number of Samples (days) |
    |------|--------|--------------------------|
    | product 1 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
    | product 1 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
    | product 1 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
    | product 2 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
    | product 2 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
    | product 2 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
    | product 3 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
    | product 3 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
    | product 3 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
    | product 4 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
    | product 4 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
    | product 4 2022.xlsx | 01/01/2022–31/12/2022 | 364 |
    | product 5 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
    | product 5 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
    | product 5 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
    | product 6 2020.xlsx | 01/01/2020–31/12/2020 | 362 |
    | product 6 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
    | product 6 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
    | product 7 2020.xlsx | 01/01/2020–31/12/2020 | 362 |
    | product 7 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
    | product 7 2022.xlsx | 01/01/2022–31/12/2022 | 365 |

    3.2 Dataset Overview

    The following table enumerates and explains the features included across all of the included files.

    | Feature | Description | Unit |
    |---------|-------------|------|
    | Day | Day of the month | - |
    | Month | Month | - |
    | Year | Year | - |
    | daily_unit_sales | Daily sales: the amount of products, measured in units, sold during that specific day | units |
    | previous_year_daily_unit_sales | Previous year’s sales: the amount of products, measured in units, sold during that specific day the previous year | units |
    | percentage_difference_daily_unit_sales | The percentage difference between the two values above | % |
    | daily_unit_sales_kg | The amount of products, measured in kilograms, sold during that specific day | kg |
    | previous_year_daily_unit_sales_kg | Previous year’s sales: the amount of products, measured in kilograms, sold during that specific day the previous year | kg |
    | percentage_difference_daily_unit_sales_kg | The percentage difference between the two values above | % |
    | daily_unit_returns_kg | The percentage of the products that were shipped to selling points and were returned | % |
    | previous_year_daily_unit_returns_kg | The percentage of the products that were shipped to selling points and were returned the previous year | % |
    | points_of_distribution | The number of sales representatives through which the product was sold to the market this year | |
    | previous_year_points_of_distribution | The number of sales representatives through which the product was sold to the market on the same day the previous year | |

    Table 1 – Dataset Feature Description

    4. Structure and Format

    4.1 Dataset Structure

    The provided dataset has the following structure:

    | Name | Type | Property |
    |------|------|----------|
    | Readme.docx | Report | A file that contains the documentation of the dataset. |
    | product X | Folder | A folder containing the data of product X. |
    | product X YYYY.xlsx | Data file | An Excel file containing the sales data of product X for year YYYY. |

    Table 2 - Dataset File Description
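
    To illustrate how the files in Tables 1 and 2 fit together, the following Python sketch loads every "product X YYYY.xlsx" file into one DataFrame. It assumes the folder layout from Table 2 and the feature names from Table 1; reading .xlsx files with pandas also requires the openpyxl package.

    ```python
    from pathlib import Path

    import pandas as pd  # reading .xlsx files also needs the openpyxl package

    frames = []
    for path in sorted(Path(".").glob("product */product * *.xlsx")):
        df = pd.read_excel(path)
        df["product"] = path.parent.name  # e.g. "product 1"
        frames.append(df)

    sales = pd.concat(frames, ignore_index=True)

    # Rebuild a date column from the Day / Month / Year features of Table 1.
    sales["date"] = pd.to_datetime(
        sales[["Year", "Month", "Day"]].rename(columns=str.lower), errors="coerce"
    )

    # Example aggregate: total units sold per product and year.
    print(sales.groupby(["product", "Year"])["daily_unit_sales"].sum())
    ```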

    5. Acknowledgement

    This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 957406 (TERMINET).

    References

    [1] MEVGAL is a Greek dairy production company

  3. Company Financial Data | Banking & Capital Markets Professionals in the...

    • datarade.ai
    + more versions
    Cite
    Success.ai, Company Financial Data | Banking & Capital Markets Professionals in the Middle East | Verified Global Profiles from 700M+ Dataset [Dataset]. https://datarade.ai/data-products/company-financial-data-banking-capital-markets-profession-success-ai
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset provided by
    Area covered
    Kyrgyzstan, Mongolia, Korea (Republic of), State of, Brunei Darussalam, Maldives, Jordan, Uzbekistan, Bahrain, Georgia
    Description

    Success.ai’s Company Financial Data for Banking & Capital Markets Professionals in the Middle East offers a reliable and comprehensive dataset designed to connect businesses with key stakeholders in the financial sector. Covering banking executives, capital markets professionals, and financial advisors, this dataset provides verified contact details, decision-maker profiles, and firmographic insights tailored for the Middle Eastern market.

    With access to over 170 million verified professional profiles and 30 million company profiles, Success.ai ensures your outreach and strategic initiatives are powered by accurate, continuously updated, and AI-validated data. Backed by our Best Price Guarantee, this solution empowers your organization to build meaningful connections in the region’s thriving financial industry.

    Why Choose Success.ai’s Company Financial Data?

    1. Verified Contact Data for Financial Professionals

      • Access verified email addresses, direct phone numbers, and LinkedIn profiles of banking executives, capital markets advisors, and financial consultants.
      • AI-driven validation ensures 99% accuracy, enabling confident communication and minimizing data inefficiencies.
    2. Targeted Insights for the Middle East Financial Sector

      • Includes profiles from major Middle Eastern financial hubs such as Dubai, Riyadh, Abu Dhabi, and Doha, covering diverse institutions like banks, investment firms, and regulatory bodies.
      • Gain insights into region-specific financial trends, regulatory frameworks, and market opportunities.
    3. Continuously Updated Datasets

      • Real-time updates reflect changes in leadership, market activities, and organizational structures.
      • Stay ahead of emerging opportunities and align your strategies with evolving market dynamics.
    4. Ethical and Compliant

      • Adheres to GDPR, CCPA, and other global privacy regulations, ensuring responsible data usage and compliance with legal standards.

    Data Highlights:

    • 170M+ Verified Professional Profiles: Engage with decision-makers and professionals in banking, investment management, and capital markets across the Middle East.
    • 30M Company Profiles: Access detailed firmographic data, including organization sizes, revenue ranges, and geographic footprints.
    • Leadership Contact Information: Connect directly with CEOs, CFOs, risk managers, and regulatory professionals driving financial strategies.
    • Decision-Maker Insights: Understand key decision-makers’ roles and responsibilities to tailor your outreach effectively.

    Key Features of the Dataset:

    1. Decision-Maker Profiles in Banking & Capital Markets

      • Identify and connect with executives, portfolio managers, and analysts shaping investment strategies and financial operations.
      • Target professionals responsible for compliance, risk management, and operational efficiency.
    2. Advanced Filters for Precision Targeting

      • Filter institutions by segment (retail banking, investment banking, private equity), geographic location, revenue size, or workforce composition.
      • Tailor campaigns to align with specific financial needs, such as digital transformation, customer retention, or risk mitigation.
    3. Firmographic and Leadership Insights

      • Access detailed firmographic data, including company hierarchies, financial health indicators, and service specializations.
      • Gain a deeper understanding of organizational structures and market positioning.
    4. AI-Driven Enrichment

      • Profiles enriched with actionable data allow for personalized messaging, highlight unique value propositions, and enhance engagement outcomes.

    Strategic Use Cases:

    1. Sales and Lead Generation

      • Offer financial technology solutions, consulting services, or compliance tools to banking institutions and investment firms.
      • Build relationships with decision-makers responsible for vendor selection and financial strategy implementation.
    2. Market Research and Competitive Analysis

      • Analyze trends in Middle Eastern banking and capital markets to guide product development and market entry strategies.
      • Benchmark against competitors to identify market gaps, emerging niches, and growth opportunities.
    3. Partnership Development and Vendor Evaluation

      • Connect with financial institutions seeking strategic partnerships or evaluating service providers for operational improvements.
      • Foster alliances that drive mutual growth and innovation.
    4. Recruitment and Talent Solutions

      • Engage HR professionals and hiring managers seeking top talent in finance, compliance, or risk management.
      • Provide staffing solutions, training programs, or workforce optimization tools tailored to the financial sector.

    Why Choose Success.ai?

    1. Best Price Guarantee
      • Access premium-quality financial data at competitive prices, ensuring strong ROI for your outreach, marketing, and partners...
  4. Fruits Classification Dataset

    • universe.roboflow.com
    zip
    Updated Apr 6, 2020
    + more versions
    Cite
    Joseph Nelson (2020). Fruits Classification Dataset [Dataset]. https://universe.roboflow.com/joseph-nelson/fruits-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 6, 2020
    Dataset authored and provided by
    Joseph Nelson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Fruit
    Description

    Overview

    The Fruits dataset is an image classification dataset of various fruits against white backgrounds from various angles, originally open sourced by GitHub user horea. This is a subset of that full dataset.

    Example Image: https://github.com/Horea94/Fruit-Images-Dataset/blob/master/Training/Apple%20Braeburn/101_100.jpg?raw=true

    Use Cases

    Build a fruit classifier! This could be a just-for-fun project, or you could be building a color sorter for agricultural use cases before fruits make their way to market.

    Using this Dataset

    Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.


  5. Cryptocurrency extra data - Maker

    • kaggle.com
    zip
    Updated Nov 15, 2021
    + more versions
    Cite
    Yam Peleg (2021). Cryptocurrency extra data - Maker [Dataset]. https://www.kaggle.com/yamqwe/cryptocurrency-extra-data-maker
    Explore at:
    Available download formats: zip (79531062 bytes)
    Dataset updated
    Nov 15, 2021
    Authors
    Yam Peleg
    Description

    Context:

    This dataset is an extra updating dataset for the G-Research Crypto Forecasting competition.

    Introduction

    This is a daily-updated dataset that automatically collects market data for the G-Research Crypto Forecasting competition. The data is at 1-minute resolution, collected for all competition assets, and both retrieval and uploading are fully automated. See the discussion topic.

    The Data

    For every asset in the competition, the following fields from Binance's official API endpoint for historical candlestick data are collected, saved, and processed.

    
    1. **timestamp** - A timestamp for the minute covered by the row.
    2. **Asset_ID** - An ID code for the cryptoasset.
    3. **Count** - The number of trades that took place this minute.
    4. **Open** - The USD price at the beginning of the minute.
    5. **High** - The highest USD price during the minute.
    6. **Low** - The lowest USD price during the minute.
    7. **Close** - The USD price at the end of the minute.
    8. **Volume** - The number of cryptoasset units traded during the minute.
    9. **VWAP** - The volume-weighted average price for the minute.
    10. **Target** - 15 minute residualized returns. See the 'Prediction and Evaluation' section of this notebook for details of how the target is calculated.
    11. **Weight** - Weight, defined by the competition hosts [here](https://www.kaggle.com/cstein06/tutorial-to-the-g-research-crypto-competition)
    12. **Asset_Name** - Human readable Asset name.
    

    Indexing

    The dataframe is indexed by timestamp and sorted from oldest to newest. The first row starts at the first timestamp available on the exchange, which is July 2017 for the longest-running pairs.

    Usage Example

    The following is a collection of simple starter notebooks for Kaggle's Crypto Comp showing PurgedTimeSeries in use with the collected dataset. PurgedTimeSeries is explained here. There are many configuration variables below to allow you to experiment. Use either GPU or TPU. You can control which years are loaded, which neural networks are used, and whether to use feature engineering. You can experiment with different data preprocessing, model architecture, loss, optimizers, and learning rate schedules. The extra datasets contain the full history of the assets in the same format as the competition, so you can input that into your model too.

    Baseline Example Notebooks:

    These notebooks follow the ideas presented in my "Initial Thoughts" here. Some code sections have been reused from Chris' great (great) notebook series on the SIIM ISIC melanoma detection competition here.

    Loose-ends:

    This is a work in progress and will be updated constantly throughout the competition. At the moment, there are some known issues that still need to be addressed:

    • VWAP: At the moment the VWAP calculation formula is still unclear. Currently the dataset uses an approximation calculated from the Open, High, Low, Close, and Volume candlesticks; a common form of such an approximation is sketched after this list. [Waiting for competition hosts' input]
    • Target Labeling: There exist some mismatches with the original target provided by the hosts at some time intervals. On all the others it is the same. The labeling code can be seen here. [Waiting for competition hosts' input]
    • Filtering: No filtering of zero-volume data is performed.
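
    For reference, the sketch below shows one common way to approximate VWAP from OHLCV candles: the "typical price" weighted by volume over a rolling window. This is only an illustration of the general technique, not necessarily the exact formula used by this dataset.

    ```python
    import pandas as pd

    def typical_price(candles: pd.DataFrame) -> pd.Series:
        """(High + Low + Close) / 3 for each 1-minute candle."""
        return (candles["High"] + candles["Low"] + candles["Close"]) / 3

    def rolling_vwap(candles: pd.DataFrame, window: int = 15) -> pd.Series:
        """Volume-weighted average of the typical price over a rolling window."""
        tp = typical_price(candles)
        pv = (tp * candles["Volume"]).rolling(window).sum()
        return pv / candles["Volume"].rolling(window).sum()
    ```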

    Example Visualisations

    Opening price with an added indicator (MA50): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fb8664e6f26dc84e9a40d5a3d915c9640%2Fdownload.png?generation=1582053879538546&alt=media

    Volume and number of trades: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2234678%2Fcd04ed586b08c1576a7b67d163ad9889%2Fdownload-1.png?generation=1582053899082078&alt=media

    License

    This data is being collected automatically from the crypto exchange Binance.

  6. Cybersecurity Framework Manufacturing Profile Low Security Level Example...

    • catalog.data.gov
    • data.nist.gov
    Updated Jul 29, 2022
    + more versions
    Cite
    National Institute of Standards and Technology (2022). Cybersecurity Framework Manufacturing Profile Low Security Level Example Implementations for Discrete-based Manufacturing System Datasets [Dataset]. https://catalog.data.gov/dataset/cybersecurity-framework-manufacturing-profile-low-security-level-example-implementations-f-ccc42
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The Cybersecurity Framework Manufacturing Profile Low Security Level Example Implementations Guide provides example proof-of-concept solutions demonstrating how open-source and commercial off-the-shelf (COTS) products that are currently available today can be implemented in manufacturing environments to satisfy the requirements in the Cybersecurity Framework (CSF) Manufacturing Profile Low Security Level. Example proof-of-concept solutions for a process-based manufacturing environment and a discrete-based manufacturing environment are included in the guide. Depending on factors like size, sophistication, risk tolerance, and threat landscape, manufacturers should make their own determinations about the breadth of the proof-of-concept solutions they may voluntarily implement. The dataset includes all of the raw and processed measurement data for the example implementation of the discrete-based manufacturing system use case.

  7. Diesel Engine Faults Features Dataset (3500-DEFault)

    • data.mendeley.com
    Updated Apr 29, 2020
    Cite
    Denys Pestana (2020). Diesel Engine Faults Features Dataset (3500-DEFault) [Dataset]. http://doi.org/10.17632/k22zxz29kr.1
    Explore at:
    Dataset updated
    Apr 29, 2020
    Authors
    Denys Pestana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The objective of this dataset is fault diagnosis in diesel engines to assist predictive maintenance, through the analysis of the variation of the pressure curves inside the cylinders and the torsional vibration response of the crankshaft. Hence, a fault simulation model based on a zero-dimensional thermodynamic model was developed.

    The adopted feature vectors were chosen from the thermodynamic model and obtained by processing signals such as the pressure and temperature inside the cylinder, as well as the torsional vibration of the engine’s flywheel. These vectors are used as input to the machine learning technique in order to discriminate among several machine conditions.

    The database is expected to emulate all operating scenarios under study; in our case, all possible diesel machine faults and system condition variations, which correspond to severity levels containing enough information to characterize and discriminate the faults. The developed database covers the following operating conditions: normal (without faults), pressure reduction in the intake manifold, compression ratio reduction in the cylinders, and reduction of the amount of fuel injected into the cylinders.

    In all scenarios, the motor rotation frequency was set at 2500 RPM. The rotation of 2500 RPM was used, since it presented the lowest joint error rate in the estimation of the mean and maximum pressures of the burning cycle, between the experimental data (according to data supplied by the manufacturer) and the simulated data, during the validation stage of the thermodynamic and dynamic models.

    The entire database comprises a total of 3500 different fault scenarios for 4 distinct operational conditions: 250 from the "normal" class, 250 from the "pressure reduction in the intake manifold" class, 1500 from the "compression ratio reduction in the cylinders" class, and 1500 from the "reduction of amount of fuel injected into the cylinders" class. This database is named the 3500-DEFault database.
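
    To make the classification setup concrete, here is a hypothetical sketch of the fault-discrimination step: a multi-class classifier over feature vectors labelled with the four operating conditions. The feature matrix X and labels y below are random placeholders; the real feature layout comes from the 3500-DEFault files.

    ```python
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3500, 20))    # placeholder feature vectors (3500 scenarios)
    y = rng.integers(0, 4, size=3500)  # placeholder labels for the 4 conditions

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    print("5-fold accuracy:", cross_val_score(clf, X, y, cv=5).mean())
    ```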

  8. Automotive Vehicles Engine Health Dataset

    • kaggle.com
    Updated Apr 5, 2023
    Cite
    PARV MODI (2023). Automotive Vehicles Engine Health Dataset [Dataset]. https://www.kaggle.com/datasets/parvmodi/automotive-vehicles-engine-health-dataset/versions/1
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 5, 2023
    Dataset provided by
    Kaggle
    Authors
    PARV MODI
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset could include various features and measurements related to the engine health of vehicles, such as engine RPM, temperature, pressure, and other sensor data. It may also include metadata on the vehicle, such as make, model, year, and mileage.

    One potential project using this dataset could be to build a predictive maintenance model for automotive engines. By analyzing the patterns and trends in the data, machine learning algorithms could be trained to predict when an engine is likely to require maintenance or repair. This could help vehicle owners and mechanics proactively address potential issues before they become more severe, leading to better vehicle performance and longer engine lifetimes.
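
    A minimal sketch of that predictive-maintenance idea is shown below. The filename and column names (engine_rpm, coolant_temp, oil_pressure, mileage, needs_maintenance) are hypothetical placeholders, not the dataset's actual schema.

    ```python
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("engine_health.csv")  # hypothetical filename

    X = df[["engine_rpm", "coolant_temp", "oil_pressure", "mileage"]]  # placeholder features
    y = df["needs_maintenance"]  # placeholder label: 1 = maintenance required

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    ```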

    Another potential use for this dataset could be to analyze the performance of different types of engines and vehicles. Researchers could use the data to compare the performance of engines from different manufacturers, for example, or to evaluate the effectiveness of different maintenance strategies. This could help drive innovation and improvements in the automotive industry.

  9. Dataset of books called James Allison : a biography of the engine...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called James Allison : a biography of the engine manufacturer and Indianapolis 500 cofounder [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=James+Allison+%3A+a+biography+of+the+engine+manufacturer+and+Indianapolis+500+cofounder
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Indianapolis
    Description

    This dataset is about books. It has 1 row and is filtered where the book is James Allison : a biography of the engine manufacturer and Indianapolis 500 cofounder. It features 7 columns including author, publication date, language, and book publisher.

  10. Fabrics Dataset Dataset

    • paperswithcode.com
    Updated Dec 10, 2016
    Cite
    (2016). Fabrics Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/fabrics-dataset
    Explore at:
    Dataset updated
    Dec 10, 2016
    Description

    The Fabrics Dataset consists of about 2000 samples of garments and fabrics. A small patch of each surface has been captured under 4 different illumination conditions using a custom-made, portable photometric stereo sensor. All images have been acquired "in the field" (at clothes shops), and the dataset reflects the distribution of fabrics in the real world, hence it is not balanced. The majority of clothes are made of specific fabrics, such as cotton and polyester, while some other fabrics, such as silk and linen, are rarer. Also, a large number of clothes are not composed of a single fabric; two or more fabrics are used to give the garment the desired properties (blended fabrics). For every garment there is information (attributes) about its material composition from the manufacturer label and its type (pants, shirt, skirt, etc.).

  11. ‘Amazon Product Reviews Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Amazon Product Reviews Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-amazon-product-reviews-dataset-7933/c5d57177/?iid=005-711&v=presentation
    Explore at:
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Amazon Product Reviews Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/amazon-product-reviews-datasete on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    This dataset contains 30K records of product reviews from amazon.com.

    This dataset was created by PromptCloud and DataStock

    Content

    This dataset contains the following:

    • Total Records Count: 43729

    • Domain Name: amazon.com

    • Date Range: 01st Jan 2020 - 31st Mar 2020

    • File Extension: CSV

    • Available Fields:
      -- Uniq Id,
      -- Crawl Timestamp,
      -- Billing Uniq Id,
      -- Rating,
      -- Review Title,
      -- Review Rating,
      -- Review Date,
      -- User Id,
      -- Brand,
      -- Category,
      -- Sub Category,
      -- Product Description,
      -- Asin,
      -- Url,
      -- Review Content,
      -- Verified Purchase,
      -- Helpful Review Count,
      -- Manufacturer Response

    Acknowledgements

    We wouldn't be here without the help of our in-house teams at PromptCloud and DataStock, who have put their heart and soul into this project like all other projects. We want to provide the best quality data and we will continue to do so.

    Inspiration

    The inspiration for these datasets came from research. Reviews are something that is important to everybody across the globe, so we decided to come up with this dataset, which shows exactly how user reviews help companies to better their products.

    This dataset was created by PromptCloud and contains around 0 samples along with Billing Uniq Id, Verified Purchase, technical information and other features such as: - Crawl Timestamp - Manufacturer Response - and more.

    How to use this dataset

    • Analyze Helpful Review Count in relation to Sub Category (see the sketch after this list)
    • Study the influence of Review Date on Product Description
    • More datasets
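
    A minimal pandas sketch of the first idea above, assuming the CSV uses the field names from the Available Fields list (the filename is a placeholder):

    ```python
    import pandas as pd

    reviews = pd.read_csv("amazon_product_reviews.csv")  # placeholder filename

    # Helpful Review Count by Sub Category (coerce to numeric in case of blanks).
    reviews["Helpful Review Count"] = pd.to_numeric(
        reviews["Helpful Review Count"], errors="coerce"
    )
    helpful_by_subcategory = (
        reviews.groupby("Sub Category")["Helpful Review Count"]
        .mean()
        .sort_values(ascending=False)
    )
    print(helpful_by_subcategory.head(10))
    ```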

    Acknowledgements

    If you use this dataset in your research, please credit PromptCloud

    Start A New Notebook!

    --- Original source retains full ownership of the source dataset ---

  12. ‘Television Brands Ecommerce Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Television Brands Ecommerce Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-television-brands-ecommerce-dataset-bfa2/c4113040/?iid=003-490&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    Description

    Analysis of ‘Television Brands Ecommerce Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/devsubhash/television-brands-ecommerce-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    This dataset contains 912 samples with 7 attributes. There are some missing values in this dataset.

    Here are the columns in this dataset:

    1. Brand: This indicates the manufacturer of the product, i.e. the television
    2. Resolution: This has multiple categories and indicates the type of display, i.e. LED, HD LED, etc.
    3. Size: This indicates the screen size in inches
    4. Selling Price: This column has the selling price or the discounted price of the product
    5. Original Price: This includes the original price of the product from the manufacturer
    6. Operating system: This categorical variable shows the type of OS, like Android, Linux, etc.
    7. Rating: Average customer ratings on a scale of 5

    Inspiration: This dataset could be used to explore the current market scenario for televisions. There are various types of screens with different operating systems offered by several manufacturers at competitive prices. Some questions this dataset could be used to answer are:

    1. Demand for different types of televisions and the number of players in the market
    2. Which are the top 5 brands for televisions?
    3. Which brand has the highest number of products (i.e., televisions)?
    4. Are televisions with higher ratings more expensive?
    5. Average selling price by brand (questions 2 and 5 are sketched in code below)
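
    A quick pandas sketch for questions 2 and 5, assuming the seven columns described above (the filename is a placeholder):

    ```python
    import pandas as pd

    tv = pd.read_csv("television_brands.csv")  # placeholder filename

    # Question 2: top 5 brands by number of listed products.
    print(tv["Brand"].value_counts().head(5))

    # Question 5: average selling price by brand (coerce prices to numeric first).
    tv["Selling Price"] = pd.to_numeric(tv["Selling Price"], errors="coerce")
    print(tv.groupby("Brand")["Selling Price"].mean().sort_values(ascending=False))
    ```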

    --- Original source retains full ownership of the source dataset ---

  13. Data from: Clotho dataset

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated May 30, 2021
    + more versions
    Cite
    Konstantinos Drossos (2021). Clotho dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3490683
    Explore at:
    Dataset updated
    May 30, 2021
    Dataset provided by
    Samuel Lipping
    Konstantinos Drossos
    Tuomas Virtanen
    Description

    Clotho is an audio captioning dataset that has now reached version 2. Clotho consists of 6974 audio samples, and each audio sample has five captions (a total of 34,870 captions). Audio samples are 15 to 30 s in duration and captions are eight to 20 words long.

    Clotho is thoroughly described in our paper:

    K. Drossos, S. Lipping and T. Virtanen, "Clotho: an Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736-740, doi: 10.1109/ICASSP40776.2020.9052990.

    available online at: https://arxiv.org/abs/1910.09387 and at: https://ieeexplore.ieee.org/document/9052990

    If you use Clotho, please cite our paper.

    To use the dataset, you can use our code at: https://github.com/audio-captioning/clotho-dataset

    These are the files for the development, validation, and evaluation splits of Clotho dataset.

    == Changes in version 2.1 ==

    In version 2.1 of Clotho, we fixed some files that were corrupted from the compression and transferring processes (around 150 files) and we also replaced some characters that were illegal for most filesystems, e.g. ":" (around 10 files).

    Please use this version for your experiments.

    == Changes in version 2 ==

    In version 2 of Clotho, there are audio files added in the development split and a new validation split is added. There are no changes in the evaluation split.

    Specifically:

    Now there are 3840 audio files in the development split. In Clotho version 1, there were 2893 audio files. Now, 947 new audio files are added.

    There are 1046 new audio files in the validation split.

    All new captions are treated as in version 1 of Clotho, i.e. having word consistency, no named entities, no speech transcription, and no hapax legomena between splits (i.e. words appearing only in one of the splits).

    == Usage ==

    To use the dataset you have to:

    Download the audio files: clotho_audio_development.7z, clotho_audio_validation.7z, and clotho_audio_evalution.7z

    Download the files with the captions: clotho_captions_development.csv, clotho_captions_validation.csv, and clotho_captions_evaluation.csv

    Download the files with the associated metadata: clotho_metadata_development.csv, clotho_metadata_validation.csv, and clotho_metadata_evaluation.csv

    Extract the audio files

    Then you can use each audio file with its corresponding captions
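
    A minimal sketch of that last step, assuming the development split was extracted into a development/ folder and that the caption CSV uses file_name / caption_1 ... caption_5 columns (these folder and column names are assumptions; check the header of your copy):

    ```python
    from pathlib import Path

    import pandas as pd

    captions = pd.read_csv("clotho_captions_development.csv")
    audio_dir = Path("development")  # folder extracted from clotho_audio_development.7z

    # Pair the first few audio files with one of their captions.
    for _, row in captions.head(3).iterrows():
        wav_path = audio_dir / row["file_name"]
        print(wav_path, "->", row["caption_1"])
    ```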

    == License ==

    The audio files in the archives:

    clotho_audio_development.7z,

    clotho_audio_validation.7z, and

    clotho_audio_evalution.7z

    and the associated meta-data in the CSV files:

    clotho_metadata_development.csv

    clotho_metadata_validation.csv

    clotho_metadata_evaluation.csv

    are under the corresponding licences (mostly CreativeCommons with attribution) of Freesound [1] platform, mentioned explicitly in the CSV files for each of the audio files. That is, each audio file in the 7z archives is listed in the CSV files with the meta-data. The meta-data for each file are:

    File name

    Keywords

    URL for the original audio file

    Start and ending samples for the excerpt that is used in the Clotho dataset

    Uploader/user in the Freesound platform (manufacturer)

    Link to the licence of the file

    The captions in the files:

    clotho_captions_development.csv

    clotho_captions_validation.csv

    clotho_captions_evaluation.csv

    are under the Tampere University licence, described in the LICENCE file (mainly a non-commercial with attribution licence).

    == References == [1] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245

  14. CPAISD: Core-Penumbra Acute Ischemic Stroke Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 11, 2025
    Cite
    Peksheva, Marina (2025). CPAISD: Core-Penumbra Acute Ischemic Stroke Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10892315
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Umerenkov, Dmitriy
    Peksheva, Marina
    Kudin, Stepan
    Pavlov, Denis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains 112 non-contrast cranial CT scans of patients with hyperacute stroke, featuring delineated zones of penumbra and core of the stroke on each slice where present. The data in the dataset are anonymized using the Kitware DicomAnonymizer, with standard anonymization settings, except for preserving the values of the following fields:

    (0x0010, 0x0040) – Patient's Sex

    (0x0010, 0x1010) – Patient's Age

    (0x0008, 0x0070) – Manufacturer

    (0x0008, 0x1090) – Manufacturer’s Model Name

    The patient's sex and age are retained for demographic analysis of the samples, and the equipment manufacturer and model are kept for dataset statistics and the potential for domain shift analysis.

    The dataset is split into three folds:

    Training fold (92 studies, 8,376 slices).

    Validation fold (10 studies, 980 slices).

    Testing fold (10 studies, 809 slices).

    The dataset has the following structure:

    • metadata.json – dataset metadata
    • summary.csv – metadata of each study in a CSV format table
    • Part of the dataset (train, val, and test)
      • Study
        • Slice
          • raw.dcm – original slice file
          • image.npz – slice in Numpy array format
          • mask.npz – segmentation mask in Numpy array format
          • metadata.json – slice metadata in JSON format
        • metadata.json – study metadata in JSON format
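
    A small sketch of reading one slice from the structure above. The study and slice folder names are placeholders, and the code assumes each .npz archive stores a single array (if the archives use named keys, adjust the indexing accordingly):

    ```python
    import json
    from pathlib import Path

    import numpy as np

    slice_dir = Path("train") / "study_001" / "slice_001"  # placeholder path

    image_npz = np.load(slice_dir / "image.npz")
    mask_npz = np.load(slice_dir / "mask.npz")
    slice_meta = json.loads((slice_dir / "metadata.json").read_text())

    # np.load on an .npz returns an archive; take the first stored array from each.
    image = image_npz[image_npz.files[0]]
    mask = mask_npz[mask_npz.files[0]]
    print(image.shape, mask.shape, slice_meta)
    ```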

    The metadata.json at the root of the dataset has the following format:

    • generation_params – dataset generation parameters:
      • test_size – proportion of the test part
      • val_size – proportion of the validation part
    • stats – statistical data:
      • common – general statistical data:
        • train_size_in_studies – number of studies in the training part of the dataset.
        • train_size_in_images – number of slices in the training part of the dataset.
        • val_size_in_studies – number of studies in the validation part of the dataset.
        • val_size_in_images – number of slices in the validation part of the dataset.
        • test_size_in_studies – number of studies in the test part of the dataset.
        • test_size_in_images – number of slices in the test part of the dataset.
      • train – statistical data for the training part of the dataset:
        • min – minimum pixel value.
        • max – maximum pixel value.
        • mean – average pixel value.
        • std – standard deviation for all pixel values.

    The metadata.json at the root of the study has the following format; if a field value is unknown, it is given as 'unknown':

    manufacturer – manufacturer of the tomograph.

    model – model of the tomograph.

    device – full name of the tomograph (manufacturer + model).

    age – patient's age in years.

    sex – patient's sex. M – male, F – female.

    dsa – whether cerebral angiography was performed. true if yes, false if no.

    nihss – NIHSS score.

    time – time in hours from the onset of the stroke to the conduct of the study. Can be either a number or a range.

    lethality – whether the person died as a result of this stroke. true if yes, false if no.

    The summary.csv contains the same fields as the metadata.json from the root of the study, plus two additional fields:

    name – name of the study.

    part – part of the dataset in which the study is located.

  15. LearnPlatform Educational Technology Engagement Dataset: Impact of COVID-19...

    • openicpsr.org
    Updated Sep 16, 2021
    Cite
    Mary Styers (2021). LearnPlatform Educational Technology Engagement Dataset: Impact of COVID-19 on Digital Learning [Dataset]. http://doi.org/10.3886/E150042V1
    Explore at:
    Dataset updated
    Sep 16, 2021
    Dataset provided by
    LearnPlatform, Inc.
    Authors
    Mary Styers
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    Jan 2020 - Dec 2020
    Area covered
    United States
    Description

    LearnPlatform is a unique technology platform in the K-12 market, providing the only broadly interoperable platform spanning the breadth of edtech solutions in the US K-12 field. A key component of edtech effectiveness is integrated reporting on tool usage and, where applicable, evidence of efficacy. With COVID closures, LearnPlatform has emerged as an important and singular resource to measure whether students are accessing digital resources within distance learning constraints. This platform provides a unique and needed source of data to understand if students are accessing digital resources, and where resources have disparate usage and impact.

    In this dataset we are sharing educational technology usage across the 8,000+ tools used in the education field in 2020. We make this dataset available to the public so that educators, district leaders, researchers, institutions, policy-makers, or anyone interested in learning about digital learning in 2020 can use it to understand student engagement with core learning activities during the COVID-19 pandemic. Some example research questions that this dataset can help stakeholders answer:

    • What is the picture of digital connectivity and engagement in 2020?
    • What is the effect of the COVID-19 pandemic on online and distance learning, and how might this evolve in the future?
    • How does student engagement with different types of education technology change over the course of the pandemic?
    • How does student engagement with online learning platforms relate to different geography? Demographic context (e.g., race/ethnicity, ESL, learning disability)? Learning context? Socioeconomic status?
    • Do certain state interventions, practices or policies (e.g., stimulus, reopening, eviction moratorium) correlate with increases or decreases in online engagement?

  16. ‘Pakistan Corona Virus Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Sep 30, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Pakistan Corona Virus Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-pakistan-corona-virus-dataset-7f50/f59c6dcf/?iid=027-428&v=presentation
    Explore at:
    Dataset updated
    Sep 30, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Pakistan
    Description

    Analysis of ‘Pakistan Corona Virus Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zusmani/pakistan-corona-virus-citywise-data on 30 September 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    Pakistan witnessed its first Corona virus patient on February 26th, 2020. It's been a bumpy ride since then. The cases are increasing gradually and we haven't seen the worst yet. While there are a few government resources for cumulative updates, there is no place where you can find city-level patient data. It's also not possible to find a running chronological tally of patients as they test positive. We have decided to create our own dataset for all the researchers out there with such details so we can model the infection spread and forecast the situation in the coming days. We hope, by doing so, we will be able to inform policy makers on various intervention models, and help healthcare professionals be ready for the influx of new patients. We certainly hope that this little contribution will go a long way toward saving lives in Pakistan.

    Content

    The dataset contains seven columns for date, number of cases, number of deaths, number of people recovered, travel history of those cases, and location of the cases (province and city).

    The first version has the data from the first case on February 26, 2020 to April 19, 2020. We intend to publish weekly updates.

    Acknowledgements

    Users are allowed to use, copy, distribute and cite the dataset as follows: “Zeeshan-ul-hassan Usmani, Sana Rasheed, Pakistan Corona Virus Data, Kaggle Dataset Repository, April 19, 2020.”

    Inspiration

    Some ideas worth exploring:

    Can we find the spread factor for the Corona virus in Pakistan?

    How long does it take for a positive case to infect another in Pakistan?

    How can we use this data to simulate lockdown scenarios and find their impact on the country's economy? Here is a good read to get started: http://zeeshanusmani.com/urdu/corona-economic-impact/

    How does the Corona virus spread in Pakistan compare against its neighbors and other developed countries?

    What would be the impact of this infection spread on the country's economy and people living in poverty? Here are two briefs to get you started:

    http://zeeshanusmani.com/urdu/corona/ http://zeeshanusmani.com/urdu/corona-what-to-learn/

    How do we visualize this dataset to inform policy makers? Here is one example https://zeeshanusmani.com/corona/

    Can we predict the number of cases in next 10 days and a month?

    --- Original source retains full ownership of the source dataset ---

  17. Clotho-AQA dataset

    • zenodo.org
    • explore.openaire.eu
    • +1more
    csv, txt, zip
    Updated Apr 22, 2022
    Cite
    Samuel Lipping; Parthasaarathy Sudarsanam; Konstantinos Drossos; Tuomas Virtanen (2022). Clotho-AQA dataset [Dataset]. http://doi.org/10.5281/zenodo.6473207
    Explore at:
    Available download formats: csv, txt, zip
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Samuel Lipping; Parthasaarathy Sudarsanam; Konstantinos Drossos; Tuomas Virtanen
    Description

    Clotho-AQA is an audio question-answering dataset consisting of 1991 audio samples taken from Clotho dataset [1]. Each audio sample has 6 associated questions collected through crowdsourcing. For each question, the answers are provided by three different annotators making a total of 35,838 question-answer pairs. For each audio sample, 4 questions are designed to be answered with 'yes' or 'no', while the remaining two questions are designed to be answered in a single word. More details about the data collection process and data splitting process can be found in our following paper.

    S. Lipping, P. Sudarsanam, K. Drossos, T. Virtanen ‘Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering.’ The paper is available online at 2204.09634.pdf (arxiv.org)

    If you use the Clotho-AQA dataset, please cite the paper mentioned above. A sample baseline model to use the Clotho-AQA dataset can be found at partha2409/AquaNet (github.com)

    To use the dataset,

    • Download and extract ‘audio_files.zip’. This contains all the 1991 audio samples in the dataset.

    • Download ‘clotho_aqa_train.csv’, ‘clotho_aqa_val.csv’, and ‘clotho_aqa_test.csv’. These files contain the train, validation, and test splits, respectively. They contain the audio file name, questions, answers, and confidence scores provided by the annotators.
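
    A minimal sketch of loading the train split. The exact column names are not listed above, so the code prints the header first and only assumes an 'answer' column for the yes/no versus single-word breakdown:

    ```python
    import pandas as pd

    train = pd.read_csv("clotho_aqa_train.csv")
    print(train.columns.tolist())  # inspect the actual column names first
    print(train.head())

    # If there is an 'answer' column, split yes/no answers from single-word answers.
    if "answer" in train.columns:
        is_yes_no = train["answer"].astype(str).str.lower().isin(["yes", "no"])
        print(is_yes_no.value_counts())
    ```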

    License:

    The audio files in the archive ‘audio_files.zip’ are under the corresponding licenses (mostly CreativeCommons with attribution) of Freesound [2] platform, mentioned explicitly in the CSV file ’clotho_aqa_metadata.csv’ for each of the audio files. That is, each audio file in the archive is listed in the CSV file with meta-data. The meta-data for each file are:

    • File name

    • Keywords

    • URL for the original audio file

    • Start and ending samples for the excerpt that is used in the Clotho dataset

    • Uploader/user in the Freesound platform (manufacturer)

    • Link to the license of the file.

    The questions and answers in the files:

    • clotho_aqa_train.csv

    • clotho_aqa_val.csv

    • clotho_aqa_test.csv

    are under the MIT license, described in the LICENSE file.

    References:

    [1] K. Drossos, S. Lipping and T. Virtanen, "Clotho: An Audio Captioning Dataset," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 736- 740, doi: 10.1109/ICASSP40776.2020.9052990.

    [2] Frederic Font, Gerard Roma, and Xavier Serra. 2013. Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia (MM '13). ACM, New York, NY, USA, 411-412. DOI: https://doi.org/10.1145/2502081.2502245

  18. Energy Consumption of United States Over Time

    • kaggle.com
    Updated Dec 14, 2022
    Cite
    The Devastator (2022). Energy Consumption of United States Over Time [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-the-energy-consumption-of-united-state
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 14, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Energy Consumption of United States Over Time

    Building Energy Data Book

    By Department of Energy [source]

    About this dataset

    The Building Energy Data Book (2011) is an invaluable resource for gaining insight into the current state of energy consumption in the buildings sector. This dataset provides comprehensive data on residential, commercial and industrial building energy consumption, construction techniques, building technologies and characteristics. With this resource, you can get an in-depth understanding of how energy is used in various types of buildings - from single family homes to large office complexes - as well as its impact on the environment. The BTO within the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy developed this dataset to provide a wealth of knowledge for researchers, policy makers, engineers and even everyday observers who are interested in learning more about our built environment and its energy usage patterns.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset provides comprehensive information regarding energy consumption in the buildings sector of the United States. It contains a number of key variables which can be used to analyze and explore the relations between energy consumption and building characteristics, technologies, and construction. The data is provided in both CSV format as well as tabular format which can make it helpful for those who prefer to use programs like Excel or other statistical modeling software.

    In order to get started with this dataset we've developed a guide outlining how to effectively use it for your research or project needs.

    • Understand what's included: Before you start analyzing the data, you should read through the provided documentation so that you fully understand what is included in the datasets. You'll want to be aware of any potential limitations or requirements associated with each type of data point so that your results are valid and reliable when drawing conclusions from them.

    • Clean up outliers and missing values: Take some time upfront to investigate suspicious outliers before running any further analyses - otherwise they can skew results later and complicate statistical modeling, since a single extreme value can distort a model's fitted parameters. Missing values should be handled as well; they are not always obvious when scanning a table or a plot, but they must be accounted for if the resulting statistics are to be accurate.

    • Exploratory data analysis: After cleaning the dataset, do some basic exploration by visualizing summaries such as boxplots, histograms and scatter plots. This gives an initial picture of the trends that exist across regions and variables, which can inform later predictive modeling. It also highlights any abrupt changes over time, helping to ensure that predictors contribute meaningful signal rather than noise.

    • Analyze key metrics and observations: Once the exploratory analysis is done, move on to post-processing steps such as computing correlations among explanatory variables, running significance tests on regression models, and imputing missing or outlier values, depending on the needs of the project. A minimal example of loading and exploring the data is sketched after this list.
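
    As a minimal sketch of the cleaning and exploration steps above, under stated assumptions: the file name building_energy.csv and the column names energy_consumption and floor_area are placeholders rather than the dataset's actual schema.

        import pandas as pd
        import matplotlib.pyplot as plt

        # Placeholder file and column names; replace with the actual ones in the download.
        df = pd.read_csv("building_energy.csv")

        # Basic cleaning: drop duplicate rows and inspect missing values.
        df = df.drop_duplicates()
        print(df.isna().sum())

        # Flag outliers in a numeric column with the 1.5 * IQR rule.
        col = "energy_consumption"  # placeholder column name
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outliers = df[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)]
        print(f"{len(outliers)} potential outliers in '{col}'")

        # Quick exploratory plots: distribution and a simple relationship.
        df[col].plot.hist(bins=30, title=f"Distribution of {col}")
        plt.show()
        df.plot.scatter(x="floor_area", y=col)  # placeholder x column
        plt.show()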

    Research Ideas

    • Creating an energy efficiency rating system for buildings - Using the dataset, an organization can develop a metric to rate the energy efficiency of commercial and residential buildings in a standardized way.
    • Developing targeted campaigns to raise awareness about energy conservation - Analyzing data from this dataset can help organizations identify areas of high energy consumption and create targeted campaigns and incentives to encourage people to conserve energy in those areas.
    • Estimating costs associated with upgrading building technologies - By evaluating various trends in building technologies and their associated costs, decision-makers can determine the most cost-effective option when it comes time to upgrade their structures' energy efficiency...
  19. Privacy Preserving Distributed Data Mining

    • catalog.data.gov
    • datadiscoverystudio.org
    • +2more
    Updated Apr 10, 2025
    Dashlink (2025). Privacy Preserving Distributed Data Mining [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-distributed-data-mining
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    Distributed data mining from privacy-sensitive multi-party data is likely to play an important role in the next generation of integrated vehicle health monitoring systems. For example, consider an airline manufacturer $\mathcal{C}$ manufacturing an aircraft model $A$ and selling it to five different airline operating companies $\mathcal{V}_1 \dots \mathcal{V}_5$. These aircraft, during their operation, generate huge amounts of data. Mining this data can reveal useful information regarding the health and operability of the aircraft, which can be useful for disaster management and prediction of efficient operating regimes. Now if the manufacturer $\mathcal{C}$ wants to analyze the performance data collected from different aircraft of model type $A$ belonging to different airlines, then central collection of data for subsequent analysis may not be an option. It should be noted that the result of this analysis may be statistically more significant if the data for aircraft model $A$ across all companies were available to $\mathcal{C}$. The potential problems arising out of such a data mining scenario are:
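
    The scenario above calls for mining aggregate patterns without pooling the raw data. As a hedged, illustrative sketch (not part of this dataset or its reference implementation), the snippet below shows how five parties could compute a global mean of a private statistic via additive secret sharing, so that no party reveals its own values; all numbers and names are invented for the example:

        import random

        PRIME = 2**61 - 1  # field size for additive secret sharing

        def share(value, n_parties):
            """Split an integer into n random shares that sum to value mod PRIME."""
            shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
            shares.append((value - sum(shares)) % PRIME)
            return shares

        # Each airline holds a private total (e.g. a summed engine-health score)
        # and a private flight count; values here are illustrative only.
        private_totals = [1040, 980, 1210, 870, 1115]  # one per airline V1..V5
        private_counts = [52, 49, 61, 43, 55]

        n = len(private_totals)
        # Every party splits its total and count into one share per party.
        total_shares = [share(t, n) for t in private_totals]
        count_shares = [share(c, n) for c in private_counts]

        # Party j only ever sees the j-th share from each other party,
        # and publishes the sum of the shares it received.
        partial_totals = [sum(total_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
        partial_counts = [sum(count_shares[i][j] for i in range(n)) % PRIME for j in range(n)]

        # The manufacturer combines the published partials and recovers only the aggregates.
        global_total = sum(partial_totals) % PRIME
        global_count = sum(partial_counts) % PRIME
        print("global mean score:", global_total / global_count)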

  20. AeroSonicDB (YPAD-0523): Labelled audio dataset for acoustic detection and classification of aircraft

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 1, 2024
    Downward, Blake (2024). AeroSonicDB (YPAD-0523): Labelled audio dataset for acoustic detection and classification of aircraft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8000468
    Explore at:
    Dataset updated
    Aug 1, 2024
    Dataset authored and provided by
    Downward, Blake
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    AeroSonicDB (YPAD-0523): Labelled audio dataset for acoustic detection and classification of aircraft

    Version 1.1.2 (November 2023)

    [UPDATE: June 2024]

    Version 2.0 is currently in beta and can be found at https://zenodo.org/records/12775560. The repository is currently restricted; however, you can gain access by emailing Blake Downward at aerosonicdb@gmail.com or by submitting the following Google Form.

    Version 2 vastly extends the number of aircraft audio samples to over 3,000 (V1 contains 625 aircraft samples), for more than 38 hours of strongly annotated aircraft audio (V1 contains 8.9 hours of aircraft audio).

    Publication

    When using this data in an academic work, please reference the dataset DOI and version. Please also reference the following paper which describes the methodology for collecting the dataset and presents baseline model results.

    Downward, B., & Nordby, J. (2023). The AeroSonicDB (YPAD-0523) Dataset for Acoustic Detection and Classification of Aircraft. ArXiv, abs/2311.06368.

    Description

    AeroSonicDB:YPAD-0523 is a specialised dataset of ADS-B labelled audio clips for research in the fields of environmental noise attribution and machine listening, particularly acoustic detection and classification of low-flying aircraft. Audio files in this dataset were recorded at locations in close proximity to a flight path approaching or departing Adelaide International Airport's (ICAO code: YPAD) primary runway, 05/23. Recordings are initially labelled from radio (ADS-B) messages received from the aircraft overhead, then human verified and annotated with the first and final moments at which the target aircraft is audible.

    A total of 1,895 audio clips are distributed across two top-level classes, "Aircraft" (8.87 hours) and "Silence" (3.52 hours). The aircraft class is then further broken down into four subclasses, which broadly describe the structure of the aircraft and its propulsion mechanism. A variety of additional "airframe" features are provided to give researchers finer control over the dataset and the opportunity to develop ontologies specific to their own use case.

    For convenience, the dataset has been split into training (10.04 hours) and testing (2.35 hours) subsets, with the training set further split into 5 distinct folds for cross-validation. These splits are performed to prevent data-leakage between folds and the test set, ensuring samples collected in the same recording session (distinct in time, location and microphone) are assigned to the same fold.

    Researchers may find applications for this dataset in a number of fields; particularly aircraft noise isolation and noise monitoring in an urban environment, development of passive acoustic systems to assist radar technology, and understanding the sources of aircraft noise to help manufacturers design less-noisy aircraft.

    Audio data

    ADS-B (Automatic Dependent Surveillance–Broadcast) messages transmitted directly from aircraft are used to automatically trigger, capture and label audio samples. A 60-second recording is triggered when an aircraft transmits a message indicating it is within a specified distance of the recording device (see "Location data" below for specifics). The resulting audio file is labelled with the unique ICAO identifier code for the aircraft, as well as its last reported altitude, date, time, location and microphone. The recording is then human verified and annotated with timestamps for the first and last moments the aircraft is audible. In total, AeroSonicDB contains 625 recordings of low-altitude aircraft - varying in length from 18 to 60 seconds, for a total of 8.87 hours of aircraft audio.

    A collection of urban background noise without aircraft ("silence") is included with the dataset as a means of distinguishing location-specific environmental noises from aircraft noises. Ten-second background-noise, or "silence", recordings are triggered only when no aircraft are broadcasting that they are within a specified distance of the recording device (see "Location data" below). These "silence" recordings are also human verified to ensure no aircraft noise is present. The dataset contains 1,270 clips of silence/urban background noise.
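
    As a rough sketch of the trigger logic described above (an assumption-based illustration, not the authors' recording code): a decoded ADS-B position report is compared with the receiver's location, and a recording starts only when the aircraft is within the location's trigger distance. The coordinates and message fields below are invented for the example.

        import math

        def haversine_km(lat1, lon1, lat2, lon2):
            """Great-circle distance between two points in kilometres."""
            r = 6371.0
            p1, p2 = math.radians(lat1), math.radians(lat2)
            dp = math.radians(lat2 - lat1)
            dl = math.radians(lon2 - lon1)
            a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
            return 2 * r * math.asin(math.sqrt(a))

        # Hypothetical receiver position and trigger distance (3 km, as at "Location 0").
        RECEIVER_LAT, RECEIVER_LON = -34.83, 138.58  # illustrative coordinates only
        TRIGGER_KM = 3.0

        def should_trigger(msg):
            """msg is a decoded ADS-B position report: dict with hex_id, lat, lon, altitude."""
            dist = haversine_km(msg["lat"], msg["lon"], RECEIVER_LAT, RECEIVER_LON)
            return dist <= TRIGGER_KM

        msg = {"hex_id": "7C7CD0", "lat": -34.84, "lon": 138.57, "altitude": 1800}
        if should_trigger(msg):
            print(f"start 60 s recording labelled with {msg['hex_id']} at {msg['altitude']} ft")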

    Location data

    Recordings have been collected from three (3) locations. GPS coordinates for each location are provided in the "locations.json" file. In order to protect privacy, coordinates have been provided for a road or public space nearby the recording device instead of its exact location.

    Location 0: Situated in a suburban environment approximately 15.5 km north-east of the start/end of the runway. For Adelaide, typical south-westerly winds bring most arriving aircraft past this location on approach. Winds from the north or east will cause aircraft to take off to the north-east; however, not all departing aircraft will maintain a course that triggers a recording at this location. The "trigger distance" for this location is set to 3 km to ensure both small/slower aircraft and large/faster aircraft are captured within a sixty-second recording.

    "Silence" or ambient background noises at this location include; cars, motorbikes, light-trucks, garbage trucks, power-tools, lawn mowers, construction sounds, sirens, people talking, dogs barking and a wide range of Australian native birds (New Holland Honeyeaters, Wattlebirds, Australian Magpies, Australian Ravens, Spotted Doves, Rainbow Lorikeets and others).

    Location 1: Situated approximately 500 m south-east of the south-eastern end of the runway, this location is near recreational areas (golf course, skate park and parklands), with a busy road/highway between the location and the runway. This location features heavy winds and road traffic, as well as people talking, walking and riding, and birds such as the Australian Magpie and Noisy Miner. The trigger distance for this location is set to 1 km. Due to their low altitude, aircraft are louder but audible for a shorter time compared to Location 0.

    Location 2: As an alternative to Location 1, this location is situated approximately 950 m south-east of the end of the runway. It has a wastewater facility to the north, a residential area to the south and a popular beach to the west. This location offers greater wind protection and more distance from airport and highway noise. Ambient background sounds feature close-proximity cars and motorbikes, cyclists, people walking, nail guns and other construction sounds, as well as the local birds mentioned above.

    Aircraft metadata

    Supplementary "airframe" metadata for all aircraft has been gathered to help broaden the research possibilities from this dataset. Airframe information was collected and cross-checked from a number of open-source databases. The author has no reason to beleive any significant errors exist in the "aircraft_meta" files, however future versions of this dataset plan to obtain aircraft information directly from ICAO (International Civil Aviation Organization) to ensure a single, verifiable source of information.

    Class/subclass ontology (minutes of recordings)

    1. no aircraft (211)

        0: no aircraft (211)

    2. aircraft (533)

        1: piston-propeller aeroplane (30)

        2: turbine-propeller aeroplane (90)

        3: turbine-fan aeroplane (409)

        4: rotorcraft (4)

    The subclasses are a combination of the "airframe" and "engtype" features. Piston and turboshaft rotorcraft/helicopters have been combined into a single subclass due to the small number of samples.

    Data splits

    Audio recordings have been split into training (81%) and test (19%) sets. The training set has further been split into 5 folds, giving researchers a common split for 5-fold cross-validation to ensure reproducibility and comparable results. Data leakage into the test set has been avoided by ensuring recordings are disjoint from the training set by time and location, meaning samples in the test set for a particular location were recorded after any samples included in the training set for that location.

    Labelled data

    The entire dataset (training and test) is referenced and labelled in the "sample_meta.csv" file. Each row contains a reference to a unique recording, its meta information, annotations and airframe features.

    Alternatively, these labels can be derived directly from the filename of the sample (see below). The "aircraft_meta.csv" and "aircraft_meta.json" files can be used to reference aircraft specific features - such as; manufacturer, engine type, ICAO type designator etc. (see "Columns/Labels" below for all features).
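
    A minimal sketch of using the provided splits for 5-fold cross-validation with pandas; the column names ("train-test", "fold", "filename", "class") are those documented under "Columns/Labels" below, and the fold values are assumed to parse as integers.

        import pandas as pd

        meta = pd.read_csv("sample_meta.csv")

        train = meta[meta["train-test"] == "train"].copy()
        test = meta[meta["train-test"] == "test"]
        train["fold"] = train["fold"].astype(int)  # assumes folds are stored as digits 1-5

        for fold in range(1, 6):
            val_split = train[train["fold"] == fold]
            fit_split = train[train["fold"] != fold]
            # Fit a model on fit_split["filename"] / fit_split["class"] and
            # evaluate it on val_split; average the metric over the five folds.
            print(f"fold {fold}: {len(fit_split)} training clips, {len(val_split)} validation clips")

        print(f"held-out test set: {len(test)} clips")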

    File naming convention

    Audio samples are in WAV format, with some metadata stored in the filename.

    Basic Convention

    "Aircraft ID + Date + Time + Location ID + Microphone ID"

    "XXXXXX_YYYY-MM-DD_hh-mm-ss_X_X"

    Sample with aircraft

    {hex_id} _ {date} _ {time} _ {location_id} _ {microphone_id} . {file_ext}

    7C7CD0_2023-05-09_12-42-55_2_1.wav

    Sample without aircraft

    "Silence" files are denoted with six (6) leading zeros rather than an aircraft hex code. All relevant metadata for "silence" samples are contained in the audio filename, and again in the accompanying "sample_meta.csv"

    000000 _ {date} _ {time} _ {location_id} _ {microphone_id} . {file_ext}

    000000_2023-05-09_12-30-55_2_1.wav
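
    A small sketch of recovering the metadata encoded in a filename under this convention; it relies only on the pattern shown above.

        from pathlib import Path

        def parse_clip_name(path):
            """Split an AeroSonicDB filename into its metadata fields."""
            hex_id, date, time, location_id, mic_id = Path(path).stem.split("_")
            return {
                "hex_id": None if hex_id == "000000" else hex_id,  # 000000 marks a silence clip
                "date": date,
                "time": time.replace("-", ":"),
                "location_id": int(location_id),
                "microphone_id": int(mic_id),
            }

        print(parse_clip_name("7C7CD0_2023-05-09_12-42-55_2_1.wav"))
        print(parse_clip_name("000000_2023-05-09_12-30-55_2_1.wav"))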

    Columns/Labels

    (found in sample_meta.csv, aircraft_meta.csv/json files)

    train-test: Train-test split (train, test)

    fold: Digit from 1 to 5 splitting the training data 5 ways (else test)

    filename: The filename of the audio recording

    date: Date of the recording

    time: Time of the recording

    location: ID for the location of the recording

    mic: ID of the microphone used

    class: Top-level label for the recording (e.g. 0 = No aircraft, 1 = Aircraft audible)

    subclass: Subclass label for the recording (e.g. 0 = No aircraft, 3 = Turbine-fan aeroplane)

    altitude: Approximate altitude of the aircraft (in feet) at the start of the recording

    hex_id: Unique ICAO 24-bit address for the aircraft recorded

    session: Unique recording
