70 datasets found
  1. Brain Tumor CSV

    • kaggle.com
    zip
    Updated Oct 30, 2024
    Cite
    Akash Nath (2024). Brain Tumor CSV [Dataset]. https://www.kaggle.com/datasets/akashnath29/brain-tumor-csv/code
    Explore at:
    zip (538175483 bytes)
    Dataset updated
    Oct 30, 2024
    Authors
    Akash Nath
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset provides grayscale pixel values for brain tumor MRI images, stored in a CSV format for simplified access and ease of use. The goal is to create a "MNIST-like" dataset for brain tumors, where each row in the CSV file represents the pixel values of a single image in its original resolution. This format makes it convenient for researchers and developers to quickly load and analyze MRI data for brain tumor detection, classification, and segmentation tasks without needing to handle large image files directly.

    Motivation and Use Cases

    Brain tumor classification and segmentation are critical tasks in medical imaging, and datasets like these are valuable for developing and testing machine learning and deep learning models. While there are several publicly available brain tumor image datasets, they often consist of large image files that can be challenging to process. This CSV-based dataset addresses that by providing a compact and accessible format. Potential use cases include:

    • Tumor Classification: Identifying different types of brain tumors, such as glioma, meningioma, and pituitary tumors, or distinguishing between tumor and non-tumor images.
    • Tumor Segmentation: Applying pixel-level classification and segmentation techniques for tumor boundary detection.
    • Educational and Rapid Prototyping: Ideal for educational purposes or quick experimentation without requiring large image processing capabilities.

    Data Structure

    This dataset is structured as a single CSV file where each row represents an image, and each column represents a grayscale pixel value. The pixel values are stored as integers ranging from 0 (black) to 255 (white).

    CSV File Contents

    • Pixel Values: Each row contains the pixel values of a single grayscale image, flattened into a 1-dimensional array. The original image dimensions vary, and rows in the CSV will correspondingly vary in length.
    • Simplified Access: By using a CSV format, this dataset avoids the need for specialized image processing libraries and can be easily loaded into data analysis and machine learning frameworks like Pandas, Scikit-Learn, and TensorFlow.

    How to Use This Dataset

    1. Loading the Data: The CSV can be loaded using standard data analysis libraries, making it compatible with Python, R, and other platforms.
    2. Data Preprocessing: Users may normalize pixel values (e.g., between 0 and 1) for deep learning applications.
    3. Splitting Data: While this dataset does not predefine training and testing splits, users can separate rows into training, validation, and test sets.
    4. Reshaping for Models: If needed, each row can be reshaped to the original dimensions (retrieved from the subfolder structure) to view or process as an image.
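    A minimal loading sketch under stated assumptions: the CSV has no header, each row is one flattened grayscale image, and the file name "brain_tumor.csv" is a placeholder. Because rows vary in length, the standard csv module is used rather than a fixed-width reader; the square-image reshape is illustrative only, since the true dimensions must come from the original images.

    ```python
    import csv
    import math

    import numpy as np

    images = []
    with open("brain_tumor.csv", newline="") as f:
        for row in csv.reader(f):
            pixels = np.array([int(v) for v in row if v != ""], dtype=np.uint8)
            images.append(pixels)

    # Normalize one image to [0, 1] for deep learning use.
    img = images[0].astype(np.float32) / 255.0

    # Reshape to 2-D only if the image happens to be square (illustrative).
    side = int(math.isqrt(img.size))
    if side * side == img.size:
        img_2d = img.reshape(side, side)
    ```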

    Technical Details

    • Image Format: Grayscale MRI images, with pixel values ranging from 0 to 255.
    • Resolution: Original resolution, no resizing applied.
    • Size: Each row’s length varies according to the original dimensions of each MRI image.
    • Data Type: CSV file with integer pixel values.

    Acknowledgments

    This dataset is intended for research and educational purposes only. Users are encouraged to cite and credit the original data sources if using this dataset in any publications or projects. This is a derived CSV version aimed to simplify access and usability for machine learning and data science applications.

  2. Mecca Australia Extracted Data in CSV Format

    • crawlfeeds.com
    csv, zip
    Updated Sep 2, 2024
    Cite
    Crawl Feeds (2024). Mecca Australia Extracted Data in CSV Format [Dataset]. https://crawlfeeds.com/datasets/mecca-australia-extracted-data-in-csv-format
    Explore at:
    csv, zip
    Dataset updated
    Sep 2, 2024
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Area covered
    Australia
    Description

    This dataset, provided in CSV format, offers comprehensive details on a wide range of beauty products listed on Mecca Australia, one of the leading beauty retailers in the country.

    Perfect for market researchers, data analysts, and beauty industry professionals, this dataset enables a deep dive into product offerings and trends without the clutter of customer reviews.

    Features:

    • Product Information: Detailed data on various beauty products, including product names, categories, and brands.
    • Pricing Data: Up-to-date pricing details for each product, allowing for competitive analysis and pricing strategy development.
    • Product Descriptions: Comprehensive descriptions that provide insight into product features and benefits.
    • Stock Availability: Information on stock status to help track product availability and manage inventory.
    • CSV Format: Easy-to-use CSV file format for seamless integration into any data analysis or business intelligence tool.

    Applications:

    • Market Analysis: Gain insights into the beauty market trends in Australia by analyzing product categories, brands, and pricing.
    • Competitor Research: Compare product offerings and pricing strategies to understand the competitive landscape.
    • Inventory Management: Use stock availability data to optimize inventory and ensure popular items are always in stock.
    • Product Development: Leverage product descriptions to identify gaps in the market and innovate new product offerings.

    With the "Mecca Australia Extracted Data" in CSV format, you can easily access and analyze crucial product data, enabling informed decision-making and strategic planning in the beauty industry.

  3. Clean Meta Kaggle

    • kaggle.com
    Updated Sep 8, 2023
    Cite
    Yoni Kremer (2023). Clean Meta Kaggle [Dataset]. https://www.kaggle.com/datasets/yonikremer/clean-meta-kaggle
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yoni Kremer
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Cleaned Meta-Kaggle Dataset

    The Original Dataset - Meta-Kaggle

    Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle’s community and activity.

    Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.

    (Image: https://i.imgur.com/2Egeb8R.png)

    This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.

    Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.

    August 2023 update

    In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here

    We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.

    The Problems with the Original Dataset

    • The original dataset is 32 CSV files, with 268 columns and 7 GB of compressed data. Having so many tables and columns makes it hard to understand the data.
    • The data is not normalized, so when you join tables you get a lot of errors.
    • Some values refer to non-existing values in other tables. For example, the UserId column in the ForumMessages table has values that do not exist in the Users table.
    • There are missing values.
    • There are duplicate values.
    • There are values that are not valid. For example, Ids that are not positive integers.
    • The date and time columns are not in the right format.
    • Some columns only have the same value for all rows, so they are not useful.
    • The boolean columns have string values True or False.
    • Incorrect values for the Total columns. For example, the DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
    • Users upvote their own messages.

    The Solution

    • To handle so many tables and columns I use a relational database. I use MySQL, but you can use any relational database.
    • The steps to create the database are:
      1. Creating the database tables with the right data types and constraints, by running the db_abd_create_tables.sql script.
      2. Downloading the CSV files from Kaggle using the Kaggle API.
      3. Cleaning the data using pandas, by running the clean_data.py script. The script performs the following steps for each table (a sketch follows this list):
        • Drops the columns that are not needed.
        • Converts each column to the right data type.
        • Replaces foreign keys that do not exist with NULL.
        • Replaces some of the missing values with default values.
        • Removes rows with missing values in the primary key / not-null columns.
        • Removes duplicate rows.
      4. Loading the data into the database using the LOAD DATA INFILE command.
      5. Checking that the number of rows in the database tables matches the number of rows in the CSV files.
      6. Adding foreign key constraints to the database tables, by running the add_foreign_keys.sql script.
      7. Updating the Total columns in the database tables, by running the update_totals.sql script.
      8. Backing up the database.
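    A minimal sketch of the kind of per-table cleaning that clean_data.py performs, using the ForumMessages/Users example from above; the file names, column names, and handling choices here are illustrative assumptions, not the actual script:

    ```python
    import pandas as pd

    msgs = pd.read_csv("ForumMessages.csv")
    users = pd.read_csv("Users.csv")

    # Convert columns to the right data types.
    msgs["Id"] = pd.to_numeric(msgs["Id"], errors="coerce")
    msgs["PostDate"] = pd.to_datetime(msgs["PostDate"], errors="coerce")

    # Replace foreign keys that do not exist in Users with NULL (NaN).
    msgs["UserId"] = msgs["UserId"].where(msgs["UserId"].isin(users["Id"]))

    # Remove rows with missing primary keys, then drop duplicates.
    msgs = msgs.dropna(subset=["Id"]).drop_duplicates()

    msgs.to_csv("ForumMessages_clean.csv", index=False)
    ```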
  4. HarDWR - Harmonized Water Rights Records

    • osti.gov
    Updated Apr 25, 2024
    + more versions
    Cite
    MultiSector Dynamics - Living, Intuitive, Value-adding, Environment (2024). HarDWR - Harmonized Water Rights Records [Dataset]. http://doi.org/10.57931/2341234
    Explore at:
    Dataset updated
    Apr 25, 2024
    Dataset provided by
    USDOE Office of Science (SC), Biological and Environmental Research (BER)
    MultiSector Dynamics - Living, Intuitive, Value-adding, Environment
    Description

    For a detailed description of the database of which this record is only one part, please see the HarDWR meta-record. Here we present a new dataset of western U.S. water rights records. This dataset provides consistent unique identifiers for each spatial unit of water management across the domain, unique identifiers for each water right record, and a consistent categorization scheme that puts each water right record into one of 7 broad use categories. These data were instrumental in conducting a study of the multi-sector dynamics of intersectoral water allocation changes through water markets (Grogan et al., in review). Specifically, the data were formatted for use as input to a process-based hydrologic model, WBM, with a water rights module (Grogan et al., in review). While this specific study motivated the development of the database presented here, western U.S. water management is a rich area of study (e.g., Anderson and Woosly, 2005; Tidwell, 2014; Null and Prudencio, 2016; Carney et al., 2021), so releasing this database publicly with documentation and usage notes will enable other researchers to do further work on water management in the U.S. west. The raw downloaded data for each state are described in Lisk et al. (in review), as well as here.

    The dataset is a series of files organized into state sub-directories. The first two characters of each file name are the abbreviation of the state for which the file contains data; the remainder of the name describes the contents of the file. Each file type is described in detail below.

    • XXFullHarmonizedRights.csv: The combined groundwater and surface water records for each state; essentially, this file is the merging of XXGroundwaterHarmonizedRights.csv and XXSurfaceWaterHarmonizedRights.csv by state. The column headers for each file of this type are:
      • state - The name of the state the data comes from.
      • FIPS - The two-digit numeric state ID code.
      • waterRightID - The unique identifying ID of the water right, the same identifier as its state uses.
      • priorityDate - The priority date associated with the right.
      • origWaterUse - The original stated water use(s) from the state.
      • waterUse - The water use category under the unified use categories established here.
      • source - Whether the right is for surface water or groundwater.
      • basinNum - The alpha-numeric identifier of the WMA the record belongs to.
      • CFS - The maximum flow of the allocation in cubic feet per second (ft³ s⁻¹).
      Arizona is unique among the states, as its surface and groundwater resources are managed with two different sets of boundaries. For Arizona, the basinNum column is therefore replaced by two columns:
      • surBasinNum - The alpha-numeric identifier of the surface water WMA the record belongs to.
      • grdBasinNum - The alpha-numeric identifier of the groundwater WMA the record belongs to.
    • XXStatePOD.shp: A shapefile which identifies the location of the Points of Diversion for the state's water rights. Note that not all water right records in XXFullHarmonizedRights.csv have coordinates, and some may therefore be missing from this file.
    • XXStatePOU.shp: A shapefile which contains the area(s) in which each water right is claimed to be used. Currently, only Idaho and Washington provided valid data to include within this file.
    • XXGroundwaterHarmonizedRights.csv: A file which contains only the harmonized groundwater rights collected from each state. See XXFullHarmonizedRights.csv for details on how the data are formatted.
    • XXSurfaceWaterHarmonizedRights.csv: A file which contains only the harmonized surface water rights collected from each state. See XXFullHarmonizedRights.csv for details on how the data are formatted.

    Additionally, one file, stateWMALabels.csv, is not stored within a sub-directory. While we have referred to the spatial boundaries that each state uses to manage its water resources as WMAs, this term is not shared across all states; this file lists the proper name for each boundary set, by state.

    For those who may be interested in exploring our code in more depth, we also make available an internal data file for convenience. The file is in .RData format and contains everything described above as well as some minor additional objects used within the code calculating the cumulative curves. For completeness, the objects found within the .RData file are:

    • states: A character vector containing the names of the states for which data was collected. The index of each state name is also the index at which that state's data can be found in the list objects below. For example, if California is the third element of this object, the data for California will also be in the third element of each accompanying list.
    • rightsByState_ground: A list of data frames with the cleaned groundwater rights collected from each state. This object holds the data exported to create the xxGroundwaterHarmonizedRights.csv files.
    • rightsByState_surface: A list of data frames with the cleaned surface water rights collected from each state. This object holds the data exported to create the xxSurfaceWaterHarmonizedRights.csv files.
    • fullRightsRecs: A list of the combined groundwater and surface water records for each state. This object holds the data exported to create the xxFullHarmonizedRights.csv files.
    • projProj: The spatial projection used for map creation at the beginning of the project; specifically, the World Geodetic System (WGS84) as a coordinate reference system (CRS) string in PROJ.4 format.
    • wmaStateLabel: The name and/or abbreviation for what each state legally calls its WMAs.
    • h2oUseByState: A list of spatial polygon data frames which contain the area(s) in which each water right is claimed to be used. Not all water right records have a listed area of use in this object. Currently, only Idaho and Washington provided valid data to be included.
    • h2oDivByState: A list of spatial points data frames which identify the location of the Point of Diversion for each state's water rights. Not all water right records have a listed Point of Diversion in this object.
    • spatialWMAByState: A list of spatial polygon data frames which contain the spatial WMA boundaries for each state. The only data contained within the table are identifiers for each polygon. It is worth reiterating that Arizona is the only state in which the surface and groundwater WMA boundaries are not the same.
    • wmaIDByState: A list which contains the unique ID values of the WMAs for each state.
    • plottingDim: A character vector used to inform mapping functions for internal map making. Each state is classified as either "tall" or "wide", to maximize space on a typical 8x11 page.

    The code related to the creation of this dataset can be viewed within the HarDWR GitHub Repository, under dataHarmonization.
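    A minimal access sketch for one state's combined file, assuming the archive has been extracted and using Washington ("WA") as an illustrative state; the directory layout and the date parsing are assumptions, not part of the dataset documentation:

    ```python
    import pandas as pd

    rights = pd.read_csv("WA/WAFullHarmonizedRights.csv", parse_dates=["priorityDate"])

    # Total allocated flow (CFS) by unified water-use category.
    print(rights.groupby("waterUse")["CFS"].sum())
    ```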

  5. The Device Activity Report with Complete Knowledge (DARCK) for NILM

    • zenodo.org
    bin, xz
    Updated Sep 19, 2025
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2025). The Device Activity Report with Complete Knowledge (DARCK) for NILM [Dataset]. http://doi.org/10.5281/zenodo.17159850
    Explore at:
    bin, xz
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Abstract

    This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.

    2. Dataset Overview

    • Apartment: Two-person apartment, approx. 58m², located in Aachen, Germany.
    • Aggregate Meter: eBZ DD3
    • Sub-meters: 31 Shelly Plus Plug S, 6 Shelly Plus 1PM, 3 Shelly Plus PM Mini Gen3
    • Sampling Rate: 1 Hz
    • Measured Quantity: Active Power
    • Unit of Measurement: Watt
    • Duration: 6 months
    • Format: Single CSV file (`DARCK.csv`)
    • Structure: Timestamped rows with columns for the aggregate meter and each sub-metered appliance.
    • Completeness: The main power meter has a completeness of 99.3%. Missing values were linearly interpolated.

    3. Download and Usage

    The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850

    Because the data contain long off periods with zeros, the CSV file compresses well: compression reduces the file size by about 97% (from 4 GB to 90.9 MB).

    To extract it, use: xz -d DARCK.csv.xz


    To use the dataset in Python, you can, e.g., load the CSV file into a pandas DataFrame:

    ```python
    import pandas as pd

    df = pd.read_csv("DARCK.csv", parse_dates=["time"])
    ```

    4. Measurement Setup

    The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.

    5. File Format (DARCK.csv)

    The dataset is provided as a single comma-separated value (CSV) file.

    • The first row is a header containing the column names.
    • All power values are rounded to the first decimal place.
    • There are no missing values in the final dataset.
    • Each row represents 1 second, from start of measuring in March until the end in September.

    Column Descriptions

    | Column Name | Data Type | Unit | Description |
    |---|---|---|---|
    | time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS |
    | main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel. |
    | [appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list. |

    Aggregate Columns

    | Column Name | Data Type | Unit | Description |
    |---|---|---|---|
    | aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger. |
    | aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2. |
    | aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap. |

    Analysis Columns

    | Column Name | Data Type | Unit | Description |
    |---|---|---|---|
    | inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30 W offset is applied to the sum, since the measurement devices themselves draw power which is otherwise unaccounted for. |
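    The inaccuracy column can be recomputed from the CSV itself. A hedged sketch (the 30 W offset follows the description above; the list of non-appliance columns is an assumption about the file layout):

    ```python
    import pandas as pd

    df = pd.read_csv("DARCK.csv", parse_dates=["time"])

    non_appliance = ["time", "main", "inaccuracy",
                     "aggr_chargers", "aggr_stoveplates", "aggr_lights"]
    appliances = [c for c in df.columns if c not in non_appliance]

    # Sum of sub-meters plus 30 W for the measurement devices' own draw,
    # compared against the mains reading.
    submeter_sum = df[appliances].sum(axis=1) + 30.0
    recomputed = (submeter_sum - df["main"]).abs()
    print((recomputed - df["inaccuracy"]).abs().describe())
    ```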

    6. Data Postprocessing Pipeline

    The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.

    6.1. Main Meter (main) Postprocessing

    The aggregate power data required several cleaning steps to ensure accuracy.

    1. Outlier Removal: Readings below 10W or above 10,000W were removed (merely 3 occurrences).
    2. Timestamp Burst Correction: The source data contained bursts of delayed readings. A custom algorithm was used to identify these bursts (large time gap followed by rapid readings) and back-fill the timestamps to create an evenly spaced time series.
    3. Alignment & Interpolation: The smart meter pushes a new value via infrared every second. To align those to the whole seconds, it was resampled to a 1-second frequency by taking the mean of all readings within each second (in 99.5% only 1 value). Any resulting gaps (0.7% outage ratio) were filled using linear interpolation.

    6.2. Sub-metered Devices (shellies) Postprocessing

    The Shelly devices are not prone to the same burst issue as the ESP8266 is. They push a new reading at every change in power drawn. If no power change is observed or the one observed is too small (less than a few Watt), the reading is pushed once a minute, together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.

    1. Grouping: Data was grouped by the unique device identifier.
    2. Resampling & Filling: The data for each device was resampled to a 1-second frequency using .resample('1s').last().ffill().
      This method was chosen, first, to capture the last known state of the device within each second, handling rapid on/off events, and second, to forward-fill the last state across periods with no new data, modeling that the device's consumption remained constant until a new reading was sent (see the sketch after this list).
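    A minimal sketch of this per-device resampling step, using synthetic Shelly-style readings; the column names and values are illustrative, not taken from the raw files:

    ```python
    import pandas as pd

    # Irregular raw readings: a value is pushed on every change in power,
    # plus a heartbeat at most once a minute.
    raw = pd.DataFrame(
        {
            "time": pd.to_datetime(
                ["2025-03-05 12:00:00.2", "2025-03-05 12:00:00.7",
                 "2025-03-05 12:00:03.1", "2025-03-05 12:01:03.1"]
            ),
            "power": [0.0, 45.3, 44.9, 44.9],  # Watt
        }
    ).set_index("time")

    # Keep the last reported value within each second, then carry it forward
    # across seconds with no new readings.
    per_second = raw["power"].resample("1s").last().ffill()
    print(per_second.head())
    ```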

    6.3. Merging and Finalization

    1. Merge: The cleaned main meter and all sub-metered device dataframes were merged into a single dataframe on the time index.
    2. Final Fill: Any remaining NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption.

    7. Manual Corrections and Known Data Issues

    During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.

    1. March 10th - Unmetered Bulb: An unmetered 107 W bulb was active. Its contribution was subtracted from the main reading, as if the event had never happened.
    2. May 31st - Unmetered Air Pump: An unmetered 101W pump for an air mattress was used directly in an outlet with no intermediary plug and hence manually added to the respective plug.

    8. Appliance Details and Multipurpose Plugs

    The following table lists the column names, with explanations where needed. As Watson moved at the beginning of June, some metering plugs were reassigned to different appliances.

  6. FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1)

    • zenodo.org
    bin, png, zip
    Updated Jul 11, 2024
    Cite
    Christoph Balada; Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Andreas Dengel; Markus Zdrallek; Max Bondorf; Sheraz Ahmed; Markus Zdrallek (2024). FiN-2: Larg-Scale Powerline Communication Dataset (Pt.1) [Dataset]. http://doi.org/10.5281/zenodo.8328113
    Explore at:
    bin, zip, png
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christoph Balada; Christoph Balada; Max Bondorf; Sheraz Ahmed; Andreas Dengel; Andreas Dengel; Markus Zdrallek; Max Bondorf; Sheraz Ahmed; Markus Zdrallek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    # FiN-2 Large-Scale Real-World PLC-Dataset

    ## About
    #### FiN-2 dataset in a nutshell:
    FiN-2 is the first large-scale real-world dataset on data collected in a powerline communication infrastructure. Since the electricity grid is inherently a graph, our dataset could be interpreted as a graph dataset. Therefore, we use the word node to describe points (cable distribution cabinets) of measurement within the low-voltage electricity grid and the word edge to describe connections (cables) in between them. However, since these are PLC connections, an edge does not necessarily have to correspond to a real cable; more on this in our paper.
    FiN-2 shows measurements that relate to the nodes (voltage, total harmonic distortion) as well as to the edges (signal-to-noise ratio spectrum, tonemap). In total, FiN-2 is distributed across three different sites with a total of 1,930,762,116 node measurements each for the individual features and 638,394,025 edge measurements each for all 917 PLC channels. All data was collected over a 25-month period from mid-2020 to the end of 2022.
    We propose this dataset to foster research in the domain of grid automation and smart grids. Therefore, we provide different example use cases in asset management, grid state visualization, forecasting, predictive maintenance, and novelty detection. For more detailed information on this dataset, please see our [paper](https://arxiv.org/abs/2209.12693).

    * * *
    ## Content
    The FiN-2 dataset is split into two compressed CSV files: *nodes.csv* and *edges.csv*.

    All files are provided as a compressed ZIP file and are divided into four parts. The first part can be found in this repo, while the remaining parts can be found in the following:
    - https://zenodo.org/record/8328105
    - https://zenodo.org/record/8328108
    - https://zenodo.org/record/8328111

    ### Node data

    | id | ts | v1 | v2 | v3 | thd1 | thd2 | thd3 | phase_angle1 | phase_angle2 | phase_angle3 | temp |
    |----|----|----|----|----|----|----|----|----|----|----|----|
    |112|1605530460|236.5|236.4|236.0|2.9|2.5|2.4|120.0|119.8|120.0|35.3|
    |112|1605530520|236.9|236.6|236.6|3.1|2.7|2.5|120.1|119.8|120.0|35.3|
    |112|1605530580|236.2|236.4|236.0|3.1|2.7|2.5|120.0|120.0|119.9|35.5|

    - id / ts: Unique identifier of the node that is measured and timestamp of the measurement
    - v1/v2/v3: Voltage measurements of all three phases
    - thd1/thd2/thd3: Total harmonic distortion of all three phases
    - phase_angle1/2/3: Phase angle of all three phases
    - temp: Temperature in-circuit of the sensor inside a cable distribution unit (in °C)

    ### Edge data
    | src | dst | ts | snr0 | snr1 | snr2 | ... | snr916 |
    |----|----|----|----|----|----|----|----|
    |62|94|1605528900|70|72|45|...|-53|
    |62|32|1605529800|16|24|13|...|-51|
    |17|94|1605530700|37|25|24|...|-55|

    - src & dst & ts: Unique identifier of the source and target nodes where the spectrum is measured and time of measurement
    - snr0/snr1/.../snr916: 917 SNR measurements in tenths of a decibel (e.g. 50 --> 5dB).

    ### Metadata
    Metadata that is provided along with the data covers:

    - Number of cable joints
    - Cable properties (length, type, number of sections)
    - Relative position of the nodes (location, zero-centered gps)
    - Adjacent PV or wallbox installations
    - Year of installation w.r.t. the nodes and cables

    Since the electricity grid is part of the critical infrastructure, it is not possible to provide exact GPS locations.

    * * *
    ## Usage
    Simple data access using pandas:

    ```python
    import pandas as pd

    nodes_file = "nodes.csv.gz" # /path/to/nodes.csv.gz
    edges_file = "edges.csv.gz" # /path/to/edges.csv.gz

    # read the first 10 data rows
    data = pd.read_csv(nodes_file, nrows=10, compression='gzip')

    # read data rows 6 to 15 (skip the first 5 data rows after the header)
    data = pd.read_csv(nodes_file, nrows=10, skiprows=[i for i in range(1, 6)], compression='gzip')

    # ... same for the edges
    ```

    The compressed CSV format was used to make sharing as easy as possible; however, it comes with significant drawbacks for machine learning. Due to the inherent graph structure, a single snapshot of the whole graph consists of a set of node and edge measurements. But due to timeouts, noise and other disturbances, nodes sometimes fail to collect data, so the number of measurements available for a specific timestamp varies. This, plus the high sparsity of the graph, makes the CSV format very inefficient for ML training.
    To utilize the data in an ML pipeline, we recommend other data formats like [datadings](https://datadings.readthedocs.io/en/latest/) or specialized database solutions like [VictoriaMetrics](https://victoriametrics.com/).


    ### Example use case (voltage forecasting)

    Forecasting the voltage is one potential use case. The Jupyter notebook provided in the repository gives an overview of how the dataset can be loaded, preprocessed and used for ML training. MinMax scaling is used as a simple preprocessing step, and a PyTorch dataset class handles the data. Furthermore, a vanilla autoencoder is used to process and forecast the voltage into the future.
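    A hedged sketch of the kind of pipeline described above (not the notebook shipped with the dataset): MinMax-scale the voltage readings of one node and wrap them in a PyTorch dataset of sliding windows for forecasting. The node id, row limit, and window sizes are illustrative.

    ```python
    import pandas as pd
    import torch
    from torch.utils.data import Dataset

    class VoltageWindows(Dataset):
        def __init__(self, csv_path, node_id, window=60, horizon=10):
            # Limit rows for the sketch; the full file is very large.
            df = pd.read_csv(csv_path, compression="gzip", nrows=1_000_000)
            v = df.loc[df["id"] == node_id, ["v1", "v2", "v3"]].to_numpy(dtype="float32")
            # Simple MinMax scaling per phase.
            v = (v - v.min(axis=0)) / (v.max(axis=0) - v.min(axis=0) + 1e-9)
            self.series = torch.from_numpy(v)
            self.window, self.horizon = window, horizon

        def __len__(self):
            return max(0, len(self.series) - self.window - self.horizon + 1)

        def __getitem__(self, i):
            x = self.series[i : i + self.window]                               # past voltages
            y = self.series[i + self.window : i + self.window + self.horizon]  # future voltages
            return x, y

    # ds = VoltageWindows("nodes.csv.gz", node_id=112)
    ```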

  7. Cafe Sales - Dirty Data for Cleaning Training

    • kaggle.com
    zip
    Updated Jan 17, 2025
    Cite
    Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training
    Explore at:
    zip (113510 bytes)
    Dataset updated
    Jan 17, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Cafe Sales Dataset

    Overview

    The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

    File Information

    • File Name: dirty_cafe_sales.csv
    • Number of Rows: 10,000
    • Number of Columns: 8

    Columns Description

    | Column Name | Description | Example Values |
    |---|---|---|
    | Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
    | Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich |
    | Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN |
    | Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00 |
    | Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00 |
    | Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card |
    | Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway |
    | Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01 |

    Data Characteristics

    1. Missing Values:

      • Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
    2. Invalid Values:

      • Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
    3. Price Consistency:

      • Prices for menu items are consistent but may have missing or incorrect values introduced.

    Menu Items

    The dataset includes the following menu items with their respective price ranges:

    | Item | Price ($) |
    |---|---|
    | Coffee | 2 |
    | Tea | 1.5 |
    | Sandwich | 4 |
    | Salad | 5 |
    | Cake | 3 |
    | Cookie | 1 |
    | Smoothie | 4 |
    | Juice | 3 |

    Use Cases

    This dataset is suitable for:

    • Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
    • Exploring EDA techniques like visualizations and summary statistics.
    • Performing feature engineering for machine learning workflows.

    Cleaning Steps Suggestions

    To clean this dataset, consider the following steps:

    1. Handle Missing Values:

      • Fill missing numeric values with the median or mean.
      • Replace missing categorical values with the mode or "Unknown."
    2. Handle Invalid Values:

      • Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
    3. Date Consistency:

      • Ensure all dates are in a consistent format.
      • Fill missing dates with plausible values based on nearby records.
    4. Feature Engineering:

      • Create new columns, such as Day of the Week or Transaction Month, for further analysis.
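    A minimal cleaning sketch that follows the suggested steps; the column names come from the table above, and the chosen fill strategies are illustrative:

    ```python
    import pandas as pd

    df = pd.read_csv("dirty_cafe_sales.csv")

    # Treat placeholder strings as missing values.
    df = df.replace(["ERROR", "UNKNOWN", ""], pd.NA)

    # Convert numeric and date columns to proper dtypes.
    for col in ["Quantity", "Price Per Unit", "Total Spent"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")

    # Fill missing values: median for numerics, "Unknown" for categoricals.
    for col in ["Quantity", "Price Per Unit", "Total Spent"]:
        df[col] = df[col].fillna(df[col].median())
    for col in ["Item", "Payment Method", "Location"]:
        df[col] = df[col].fillna("Unknown")

    # Simple feature engineering.
    df["Day of the Week"] = df["Transaction Date"].dt.day_name()
    df["Transaction Month"] = df["Transaction Date"].dt.month
    ```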

    License

    This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

    Feedback

    If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

  8. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need an environment such as VS Code or Jupyter, together with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
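    A minimal loading sketch for the split files described above, assuming the folder and file names given (train_data.csv, validation_data.csv, test_data.csv); the "target" label column is a hypothetical placeholder:

    ```python
    import pandas as pd

    train = pd.read_csv("Training Data/train_data.csv")
    val = pd.read_csv("Validation Data/validation_data.csv")
    test = pd.read_csv("Test Data/test_data.csv")

    # Hypothetical label column; replace with the dataset's actual target name.
    X_train, y_train = train.drop(columns=["target"]), train["target"]
    print(train.shape, val.shape, test.shape)
    ```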

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  9. GP Practice Prescribing Presentation-level Data - July 2014

    • digital.nhs.uk
    csv, zip
    Updated Oct 31, 2014
    + more versions
    Cite
    (2014). GP Practice Prescribing Presentation-level Data - July 2014 [Dataset]. https://digital.nhs.uk/data-and-information/publications/statistical/practice-level-prescribing-data
    Explore at:
    csv (1.4 GB), zip (257.7 MB), csv (1.7 MB), csv (275.8 kB)
    Dataset updated
    Oct 31, 2014
    License

    https://digital.nhs.uk/about-nhs-digital/terms-and-conditions

    Time period covered
    Jul 1, 2014 - Jul 31, 2014
    Area covered
    United Kingdom
    Description

    Warning: large file size (over 1 GB). Each monthly data set is large (over 4 million rows), but can be viewed in standard software such as Microsoft WordPad (save by right-clicking on the file name and selecting 'Save Target As', or equivalent on Mac OS X). It is then possible to select the required rows of data and copy and paste the information into another software application, such as a spreadsheet. Alternatively, add-ons to existing software that can handle larger data sets, such as the Microsoft PowerPivot add-on for Excel, can be used. The Microsoft PowerPivot add-on for Excel is available from Microsoft: http://office.microsoft.com/en-gb/excel/download-power-pivot-HA101959985.aspx

    Once PowerPivot has been installed, follow the instructions below to load the large files. Note that it may take at least 20 to 30 minutes to load one monthly file.

    1. Start Excel as normal.
    2. Click on the PowerPivot tab.
    3. Click on the PowerPivot Window icon (top left).
    4. In the PowerPivot Window, click on the "From Other Sources" icon.
    5. In the Table Import Wizard, scroll to the bottom and select Text File.
    6. Browse to the file you want to open and choose the file extension you require, e.g. CSV.

    Once the data has been imported you can view it in a spreadsheet.

    What does the data cover?

    General practice prescribing data is a list of all medicines, dressings and appliances that are prescribed and dispensed each month. A record will only be produced when this has occurred; there is no record for a zero total. For each practice in England, the following information is presented at presentation level for each medicine, dressing and appliance (by presentation name):

    • the total number of items prescribed and dispensed
    • the total net ingredient cost
    • the total actual cost
    • the total quantity

    The data covers NHS prescriptions written in England and dispensed in the community in the UK. Prescriptions written in England but dispensed outside England are included. The data includes prescriptions written by GPs and other non-medical prescribers (such as nurses and pharmacists) who are attached to GP practices. GP practices are identified only by their national code, so an additional data file, linked to the first by the practice code, provides further detail in relation to the practice. Presentations are identified only by their BNF code, so an additional data file, linked to the first by the BNF code, provides the chemical name for that presentation.
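    For users working in Python rather than a spreadsheet, a hedged sketch of streaming one monthly file in chunks with pandas, so the full 1 GB+ CSV never needs to fit in memory at once; the file name is illustrative:

    ```python
    import pandas as pd

    total_rows = 0
    for chunk in pd.read_csv("gp_prescribing_july_2014.csv", chunksize=500_000):
        total_rows += len(chunk)  # replace with filtering or aggregation as needed

    print(f"rows: {total_rows}")
    ```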

  10. Group Health Dataset (Sleep and Screen Time)

    • zenodo.org
    csv
    Updated Apr 8, 2025
    Cite
    Gogate; Gogate (2025). Group Health Dataset (Sleep and Screen Time) [Dataset]. http://doi.org/10.5281/zenodo.15171250
    Explore at:
    csv
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gogate; Gogate
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    Group Health (Sleep and Screen Time) Dataset


    Title: Group Health (Sleep and Screen Time) Dataset

    Description: This dataset includes biometric and self-reported sleep-related information from users wearing health monitoring devices. It tracks heart rate data, screen time, and sleep quality ratings, intended for health analytics, sleep research, or machine learning applications.
    Creator: Eindhoven University of Technology
    Version: 1.0
    License: CC-BY 4.0
    Keywords: sleep health, wearable data, heart rate, screen time, sleep rating, health analytics
    Format: CSV (.csv)
    Size: 301,556 records
    PID: 10.5281/zenodo.15171250

    Column Descriptions

    - Uid (int64): Unique identifier for the user. Example: `2`
    - Sid (object): Session ID representing device/session (e.g., wearable device). Example: `huami.32093/11110030`
    - Key (object): The type of health metric (e.g., 'heart_rate'). Example: `heart_rate`
    - Time (int64): Unix timestamp of when the measurement was taken. Example: `1743911820`
    - Value (object): JSON object containing measurement details (e.g., heart rate BPM). Example: `{"time":1743911820,"bpm":64}`
    - UpdateTime (float64): Timestamp when the record was last updated. Example: `1743911982.0`
    - screentime (object): Reported or detected screen time during sleep period. Example: `0 days 08:25:00`
    - expected_sleep (object): Expected sleep time duration (possibly self-reported or algorithmic). Example: `0 days 07:45:00`
    - sleep_rating (float64): Numerical rating of sleep quality. Example: `0.65`

    Notes
    - The `Value` field stores JSON-like strings which should be parsed for specific values such as heart rate (`bpm`).
    - Missing data in `screentime`, `expected_sleep`, and `sleep_rating` should be handled carefully during analysis.
    - Timestamps are in Unix format and may need conversion to readable datetime.
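    A minimal sketch for parsing the JSON-like Value field and converting the Unix timestamps noted above; the CSV file name is illustrative:

    ```python
    import json

    import pandas as pd

    df = pd.read_csv("group_health.csv")

    # Extract the bpm reading from heart_rate rows.
    hr = df[df["Key"] == "heart_rate"].copy()
    hr["bpm"] = hr["Value"].apply(lambda s: json.loads(s).get("bpm"))

    # Convert Unix timestamps to readable datetimes.
    hr["Time"] = pd.to_datetime(hr["Time"], unit="s")
    print(hr[["Uid", "Time", "bpm"]].head())
    ```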
    Provenance
    The Group Health (Sleep and Screen Time) Dataset was collected by students at the Eindhoven University of Technology as part of a health monitoring study. Participants wore wearable health devices (Mi Band smartwatches) that tracked biometric data, including heart rate, screen time, and self-reported sleep information. The dataset was compiled from multiple sessions of device usage over the course of two weeks, with the data anonymized for privacy and research purposes. The original data was already in a standardized CSV format and was then altered for preprocessing and analysis purposes. This dataset is openly shared under a CC-BY 4.0 license, enabling users to reuse and modify the data while properly attributing the original creators.
  11. career_change_prediction_analysis

    • huggingface.co
    Updated Nov 19, 2025
    Cite
    harry (2025). career_change_prediction_analysis [Dataset]. https://huggingface.co/datasets/harry120/career_change_prediction_analysis
    Explore at:
    Dataset updated
    Nov 19, 2025
    Authors
    harry
    Description

    🎯 Assignment #1: Career Change Prediction Analysis

      1. Dataset Overview and Project Goal

    • Dataset: career_change_prediction_dataset.csv (38,444 rows, 22 features)
    • Source: Kaggle
    • Research Question: What are the primary factors that predict an individual's likelihood of changing careers?
    • Target Variable: Likely to Change Occupation (Binary Classification: 0/1)

      2. Data Handling and Integrity (The Logical Process)

    Before any analysis could begin, the first… See the full description on the dataset page: https://huggingface.co/datasets/harry120/career_change_prediction_analysis.

  12. Integrating urinary metabolomics and clinical datasets for multi-cancer...

    • figshare.com
    Updated Nov 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dongyong Lee (2025). Integrating urinary metabolomics and clinical datasets for multi-cancer detection [Dataset]. http://doi.org/10.6084/m9.figshare.30716096.v2
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Dongyong Lee
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background

    This dataset contains raw urinary surface-enhanced Raman scattering (SERS) spectra acquired from participants with cardiometabolic conditions and solid cancers, as well as non-disease controls. The data are intended for method development and benchmarking of machine-learning based diagnostic models.

    ## Study design and groups

    - Sample type: spot urine
    - Measurement: surface-enhanced Raman scattering (SERS), [instrument model / laser wavelength / objective / integration time / SERS substrate: to be filled by data owner]
    - Technical replicates: 5 SERS acquisitions per subject on the same specimen
    - Groups and sample sizes (subjects × replicates):
      - Normal controls: 100 × 5 = 500 spectra
      - Hypertension (HTN): 100 × 5 = 500 spectra
      - Diabetes mellitus (DM): 100 × 5 = 500 spectra
      - Hypertension + Diabetes (HTN+DM): 100 × 5 = 500 spectra
      - Colorectal cancer (CRC): 300 × 5 = 1,500 spectra
      - Lung cancer: 200 × 5 = 1,000 spectra
      - Pancreatic cancer: 53 × 5 = 265 spectra
      - Total: 953 subjects, 4,765 spectra

    ## File organization

    The dataset is organized into seven zip archives, each corresponding to one clinical group, plus a metadata file:

    - normal_SERS.zip: contains 500 CSV files under the folder normal_SERS/; file naming pattern: NOR <subject>_<replicate>.CSV
    - HTN_SERS.zip: contains 500 CSV files under the folder HTN_SERS/; file naming pattern: HBP <subject>_<replicate>.CSV
    - DM_SERS.zip: contains 500 CSV files under the folder DM_SERS/; file naming pattern: DIA <subject>_<replicate>.CSV
    - HTN+DM_SERS.zip: contains 500 CSV files under the folder HTN+DM_SERS/; file naming pattern: H.D. <subject>_<replicate>.CSV
    - colorectal+cancer_SERS.zip: contains 1,500 CSV files under the folder colorectal+cancer_SERS/; file naming pattern: CRC <subject>_<replicate>.CSV
    - lung+cancer_SERS.zip: contains 1,000 CSV files under the folder lung+cancer_SERS/; file naming pattern: LUN <subject>_<replicate>.CSV
    - pancreatic+cancer_SERS.zip: contains 265 CSV files under the folder pancreatic+cancer_SERS/; file naming pattern: SPAN <subject>_<replicate>.CSV
    - sample_metadata.csv: sample-level metadata linking each spectrum file to its clinical group, subject, and replicate index.

    ## sample_metadata.csv columns

    The sample_metadata.csv file has one row per SERS spectrum (4,765 rows in total) and the following columns:

    - group: descriptive group label, e.g., Normal control, Hypertension, Diabetes mellitus, Hypertension + Diabetes, Colorectal cancer, Lung cancer, Pancreatic cancer.
    - group_code: short group code, e.g., Normal, HTN, DM, HTN+DM, CRC, LungCA, PancreasCA.
    - original_prefix: prefix as it appears in the original file names (NOR, HBP, DIA, H.D., CRC, LUN, SPAN).
    - canonical_prefix: cleaned/standardized prefix used for constructing sample_id (NOR, HBP, DIA, HD, CRC, LUN, SPAN); for example, H.D. becomes HD.
    - subject_id: integer subject identifier within each prefix (1–100, 1–300, 1–200, or 1–53 depending on group).
    - sample_id: standardized subject identifier combining canonical_prefix and zero-padded subject_id, e.g., NOR_001, HBP_093, DIA_048, HD_027, CRC_077, LUN_151, SPAN_022.
    - replicate_index: technical replicate index (1–5).
    - filename: original CSV file name (e.g., HBP 93_5.CSV).
    - filepath_in_zip: relative path to the CSV file inside the corresponding zip archive (e.g., HTN_SERS/HBP 93_5.CSV).
    - zip_file: name of the zip archive that contains this file (e.g., HTN_SERS.zip).

    ## Data format

    - Each CSV file contains two columns without a header:
      1. Raman shift (cm⁻¹), typically spanning ~50–3300 cm⁻¹
      2. SERS intensity (arbitrary units)
    - All spectra have a uniform number of data points (rows) per file.
    - No baseline correction, smoothing, normalization, or other signal processing has been applied; these spectra should be considered raw measurements.

    ## Recommended usage

    This dataset is suitable for:

    - Development and benchmarking of:
      - Preprocessing algorithms (baseline correction, denoising, normalization).
      - Feature extraction and dimensionality reduction methods for SERS.
      - Diagnostic and multi-disease classification models based on SERS spectra.
    - Methodological studies on:
      - Handling of technical replicates.
      - Cross-disease model generalization and domain adaptation.

    Users are encouraged to:

    - Implement and clearly describe their own preprocessing and validation strategies.
    - Report details such as train/validation splits, cross-validation schemes, and performance metrics when publishing work based on this dataset.
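    A minimal sketch that uses sample_metadata.csv to locate and load one raw spectrum (two columns, no header), assuming the zip archives have been extracted into the working directory:

    ```python
    import pandas as pd

    meta = pd.read_csv("sample_metadata.csv")
    row = meta.iloc[0]

    spectrum = pd.read_csv(
        row["filepath_in_zip"],  # e.g. "HTN_SERS/HBP 93_5.CSV"
        header=None,
        names=["raman_shift_cm_1", "intensity_au"],
    )
    print(row["group"], row["sample_id"], spectrum.shape)
    ```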

  13. AQUAIR Dataset

    • figshare.com
    csv
    Updated Oct 1, 2025
    Cite
    Youssef Sabiri (2025). AQUAIR Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28934375.v1
    Explore at:
    csv
    Dataset updated
    Oct 1, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Youssef Sabiri
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset overview

    This repository contains the AQUAIR Dataset, a high-resolution log of indoor-environment quality (IEQ) gathered in a trout (Oncorhynchus mykiss) hatchery room at Amghass, Azrou, Morocco. Six airborne variables were sampled every 5 minutes between 14 October 2024 and 09 January 2025: air temperature, relative humidity, carbon dioxide (CO₂), total volatile organic compounds (TVOC), fine particulate matter (PM₂.₅) and inhalable particulate matter (PM₁₀). The data are provided as two comma-separated files:

    • AQUAIR_1.csv: data recorded from 14 October 2024 to 10 December 2024; 16,533 rows in total.
    • AQUAIR_2.csv: data recorded from 15 December 2024 to 9 January 2025; 7,323 rows in total.

    Combined, the set delivers 23,856 time-stamped observations suitable for time-series modelling, forecasting, anomaly detection and studies of airborne stressors in aquaculture facilities.

    Parameters and units

    | Parameter | Unit | Relevance in trout culture |
    |---|---|---|
    | Temperature | °C | Influences metabolic rate, feed conversion and dissolved-oxygen levels. |
    | Relative humidity | % RH | High RH accelerates mould growth; low RH increases evaporation. |
    | CO₂ | ppm | Head-space CO₂ equilibrates with water; sustained excess slows growth. |
    | VOC | ppb | Proxy for disinfectant off-gassing and human activity; ventilation indicator. |
    | PM₂.₅ | µg m⁻³ | Fine particles can load bio-filters and irritate gill tissue. |
    | PM₁₀ | µg m⁻³ | Coarser dust from feed handling and maintenance. |

    All values are recorded in SI units; timestamps use ISO-8601 in Coordinated Universal Time (UTC).

    Reuse potential

    • Benchmark short-horizon IEQ forecasting (ARIMA, LSTM, transformer models).
    • Develop anomaly detectors for hatchery monitoring dashboards.
    • Correlate airborne conditions with fish-health metrics in future multi-modal studies.
    • Validate low-cost sensor stability in high-humidity aquaculture environments.

    How to cite

    If you use the AQUAIR dataset, please also cite our paper: Sabiri, Y., Houmaidi, W., El Maadi, O., & Chtouki, Y. (2025). AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring. arXiv:2509.24069. https://arxiv.org/abs/2509.24069

    BibTeX:

    @misc{sabiri2025aquairhighresolutionindoorenvironmental,
      title={AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring},
      author={Youssef Sabiri and Walid Houmaidi and Ouail El Maadi and Yousra Chtouki},
      year={2025},
      eprint={2509.24069},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.24069},
    }

    Licence

    Creative Commons Attribution 4.0 International (CC-BY-4.0).
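    A minimal sketch for combining the two files into one time series; it relies only on the file names given above:

    ```python
    import pandas as pd

    parts = [pd.read_csv(name) for name in ("AQUAIR_1.csv", "AQUAIR_2.csv")]
    df = pd.concat(parts, ignore_index=True)
    print(len(df))  # 23,856 observations expected
    ```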

  14. IMU Data for different Motorcyclist Behaviour

    • researchdata.tuwien.ac.at
    • researchdata.tuwien.at
    zip
    Updated Jun 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gerhard Navratil; Ioannis Giannopoulos; Ioannis Giannopoulos; Gerhard Navratil; Gerhard Navratil; Gerhard Navratil (2024). IMU Data for different Motorcyclist Behaviour [Dataset]. http://doi.org/10.48436/re6xk-ydq75
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Gerhard Navratil; Ioannis Giannopoulos; Ioannis Giannopoulos; Gerhard Navratil; Gerhard Navratil; Gerhard Navratil
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 17, 2023
    Description

    The data sets were collected during motorcycle trips near Vienna in 2021 and 2022. The behavior was split into different classes using videos (not part of the published data due to privacy concerns) and then cut into segments of 10 seconds.

    Context and methodology

    • The data set was collected to show how accurately motorcyclist behavior can be assessed using IMU data
    • The work follows the ideas published in http://hdl.handle.net/20.500.12708/43982
    • The authors have backgrounds in geodesy and computer science, respectively, and work in the field of geoinformation/navigation

    Technical details

    • The data are stored as CSV files
    • Each file contains data from a unique behavior and has a length of 10 seconds
    • Each file has a header describing the columns
    • Units for acceleration are meters per second squared; units for angles are degrees
    • The files are named AB_Daten_D_C.csv (a filename-parsing sketch follows this list)
      • D: Date of the trip (as YYYY_MM_DD)
      • A: Behavior (cruise, fun, overtake, traffic, or wait)
      • B: Number of the occurrence of this behavior during the trip
      • C: Number of the segment within the occurrence
    • The files are grouped by folders named after the corresponding behavior
    • The IMU used to collect the data was a XSENS MTi
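    Given the naming scheme above, a small loader can recover the behavior label, occurrence, date, and segment number from each filename. This is a hedged sketch: the root folder name, the exact concatenation of A and B, and the example filename are assumptions, so the regular expression may need adjusting against the real archive.

```python
# Hypothetical loader for the 10-second IMU segments described above.
import glob
import os
import re

import pandas as pd

# Filenames follow AB_Daten_D_C.csv: behavior (A), occurrence number (B),
# trip date D as YYYY_MM_DD, and segment number C within the occurrence,
# e.g. "cruise1_Daten_2022_05_14_0.csv" (example name is an assumption).
PATTERN = re.compile(
    r"(?P<behavior>[a-z]+)(?P<occurrence>\d+)_Daten_"
    r"(?P<date>\d{4}_\d{2}_\d{2})_(?P<segment>\d+)\.csv$"
)

def load_segments(root="imu_data"):
    """Read every segment CSV into one DataFrame with label columns."""
    frames = []
    for path in glob.glob(os.path.join(root, "*", "*.csv")):
        match = PATTERN.search(os.path.basename(path))
        if match is None:
            continue  # skip files that do not follow the naming scheme
        df = pd.read_csv(path)  # each file ships with a header row
        df["behavior"] = match["behavior"]
        df["occurrence"] = int(match["occurrence"])
        df["segment"] = int(match["segment"])
        df["trip_date"] = match["date"]
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

if __name__ == "__main__":
    segments = load_segments()
    print(segments.groupby("behavior").size())
```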

  15. Bot_IoT

    • kaggle.com
    zip
    Updated Feb 28, 2023
    Cite
    Vignesh Venkateswaran (2023). Bot_IoT [Dataset]. https://www.kaggle.com/datasets/vigneshvenkateswaran/bot-iot
    Explore at:
    zip(1257092644 bytes)Available download formats
    Dataset updated
    Feb 28, 2023
    Authors
    Vignesh Venkateswaran
    Description

    Info about the BoT-IoT dataset. Note: only the CSV files stated in the description are used.

    The BoT-IoT dataset can be downloaded from HERE. You can also use our new datasets: the TON_IoT and UNSW-NB15.

    --------------------------------------------------------------------------

    The BoT-IoT dataset was created by designing a realistic network environment in the Cyber Range Lab of UNSW Canberra. The network environment incorporated a combination of normal and botnet traffic. The dataset’s source files are provided in different formats, including the original pcap files, the generated argus files and csv files. The files were separated, based on attack category and subcategory, to better assist in labeling process.

    The captured pcap files are 69.3 GB in size, with more than 72,000,000 records. The extracted flow traffic, in CSV format, is 16.7 GB in size. The dataset includes DDoS, DoS, OS and Service Scan, Keylogging and Data Exfiltration attacks, with the DDoS and DoS attacks further organized based on the protocol used.

    To ease handling of the dataset, we extracted 5% of the original dataset using select MySQL queries. The extracted 5% comprises 4 files of approximately 1.07 GB total size and about 3 million records.
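    A quick way to work with the extracted 5% is to concatenate the CSV files in chunks. This is only a sketch: the folder name and the presence of a "category" label column are assumptions, since the archive layout is not spelled out in this listing.

```python
# Hedged sketch for combining the ~4 extracted CSV files (~3 million rows).
from pathlib import Path

import pandas as pd

DATA_DIR = Path("bot_iot_5_percent")        # hypothetical extraction folder
csv_files = sorted(DATA_DIR.glob("*.csv"))

chunks = []
for path in csv_files:
    # Chunked reading keeps memory use modest even on smaller machines.
    for chunk in pd.read_csv(path, chunksize=250_000, low_memory=False):
        chunks.append(chunk)

flows = pd.concat(chunks, ignore_index=True)
print(flows.shape)

if "category" in flows.columns:              # column name is an assumption
    print(flows["category"].value_counts())  # rough attack-class balance
```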

    --------------------------------------------------------------------------

    Free use of the Bot-IoT dataset for academic research purposes is hereby granted in perpetuity. Use for commercial purposes must be agreed upon with the authors. The authors have asserted their rights under copyright. Anyone intending to use the Bot-IoT dataset must cite the following papers, which describe the dataset's details:

    Koroniotis, Nickolaos, Nour Moustafa, Elena Sitnikova, and Benjamin Turnbull. "Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: Bot-iot dataset." Future Generation Computer Systems 100 (2019): 779-796. Public Access Here.

    Koroniotis, Nickolaos, Nour Moustafa, Elena Sitnikova, and Jill Slay. "Towards developing network forensic mechanism for botnet activities in the iot based on machine learning techniques." In International Conference on Mobile Networks and Management, pp. 30-44. Springer, Cham, 2017.

    Koroniotis, Nickolaos, Nour Moustafa, and Elena Sitnikova. "A new network forensic framework based on deep learning for Internet of Things networks: A particle deep framework." Future Generation Computer Systems 110 (2020): 91-106.

    Koroniotis, Nickolaos, and Nour Moustafa. "Enhancing network forensics with particle swarm and deep learning: The particle deep framework." arXiv preprint arXiv:2005.00722 (2020).

    Koroniotis, Nickolaos, Nour Moustafa, Francesco Schiliro, Praveen Gauravaram, and Helge Janicke. "A Holistic Review of Cybersecurity and Reliability Perspectives in Smart Airports." IEEE Access (2020).

    Koroniotis, Nickolaos. "Designing an effective network forensic framework for the investigation of botnets in the Internet of Things." PhD diss., The University of New South Wales Australia, 2020.

    --------------------------------------------------------------------------

  16. Target store furniture datasets

    • crawlfeeds.com
    csv, zip
    Updated Aug 28, 2024
    Cite
    Crawl Feeds (2024). Target store furniture datasets [Dataset]. https://crawlfeeds.com/datasets/target-store-furniture-datasets
    Explore at:
    zip, csvAvailable download formats
    Dataset updated
    Aug 28, 2024
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Explore our comprehensive Target store furniture datasets, designed to provide extensive product details for businesses and researchers. Our datasets include a wide range of information that can be used for market analysis, product development, and competitive strategy.

    What’s Included in the Target Store Furniture Datasets:

    • Product Names: Detailed names of all furniture items available at Target stores, including brands and specific product lines.
    • Prices: Current and historical pricing data for various furniture pieces, enabling price comparison and market trend analysis.
    • Descriptions: In-depth product descriptions, covering features, dimensions, materials, and customer benefits.
    • Stock Availability: Real-time stock information, including availability status and inventory levels, to help manage supply chain and stock replenishment strategies.
    • Category Information: Classification of products by category, such as living room, bedroom, office, and outdoor furniture, to help businesses identify market segments and trends.

    Our Target store furniture datasets are ideal for businesses looking to enhance their product offerings, optimize pricing strategies, and understand market dynamics within the furniture industry.

    Whether you're a retailer, market analyst, or business strategist, our datasets provide the comprehensive information you need to stay ahead in the competitive furniture market.

  17. BBC NEWS SUMMARY(CSV FORMAT)

    • kaggle.com
    zip
    Updated Sep 9, 2024
    Cite
    Dhiraj (2024). BBC NEWS SUMMARY(CSV FORMAT) [Dataset]. https://www.kaggle.com/datasets/dignity45/bbc-news-summarycsv-format
    Explore at:
    zip(2097600 bytes)Available download formats
    Dataset updated
    Sep 9, 2024
    Authors
    Dhiraj
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description: Text Summarization Dataset

    This dataset is designed for users aiming to train models for text summarization. It contains 2,225 rows of data with two columns: "Text" and "Summary". Each row features a detailed news article or piece of text paired with its corresponding summary, providing a rich resource for developing and fine-tuning summarization algorithms.

    Key Features:

    • Text: Full-length articles or passages that serve as the input for summarization.
    • Summary: Concise summaries of the articles, which are ideal for training models to generate brief, coherent summaries from longer texts.

    Future Enhancements:

    This evolving dataset is planned to include additional features, such as text class labels, in future updates. These enhancements will provide more context and facilitate the development of models that can perform summarization across different categories of news content.

    Usage:

    Ideal for researchers and developers focused on text summarization tasks, this dataset enables the training of models to effectively compress information while retaining the essence of the original content.

    Acknowledgment

    We would like to extend our sincere gratitude to the dataset creator for their contribution to this valuable resource. This dataset, sourced from the BBC News Summary dataset on Kaggle, was created by Pariza. Their work has provided an invaluable asset for those working on text summarization tasks, and we appreciate their efforts in curating and sharing this data with the community.

    Thank you for supporting research and development in the field of natural language processing!

    File Description

    This script processes and consolidates text data from various directories containing news articles and their corresponding summaries. It reads the files from specified folders, handles encoding issues, and then creates a DataFrame that is saved as a CSV file for further analysis (a hedged reconstruction of the script appears after the component list below).

    Key Components:

    1. Imports:

      • numpy (np): Numerical operations library, though it's not used in this script.
      • pandas (pd): Data manipulation and analysis library.
      • os: For interacting with the operating system, e.g., building file paths.
      • glob: For file pattern matching and retrieving file paths.
    2. Function: get_texts

      • Parameters:
        • text_folders: List of folders containing news article text files.
        • text_list: List to store the content of text files.
        • summ_folder: List of folders containing summary text files.
        • sum_list: List to store the content of summary files.
        • encodings: List of encodings to try for reading files.
      • Purpose:
        • Reads text files from specified folders, handles different encodings, and appends the content to text_list and sum_list.
        • Returns the updated lists of texts and summaries.
    3. Data Preparation:

      • text_folder: List of directories for news articles.
      • summ_folder: List of directories for summaries.
      • text_list and summ_list: Initialize empty lists to store the contents.
      • data_df: Empty DataFrame to store the final data.
    4. Execution:

      • Calls get_texts function to populate text_list and summ_list.
      • Creates a DataFrame data_df with columns 'Text' and 'Summary'.
      • Saves data_df to a CSV file at /kaggle/working/bbc_news_data.csv.
    5. Output:

      • Prints the first few entries of the DataFrame to verify the content.
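
    The following is a hedged reconstruction of the script described above, not the author's exact code. The input folder paths and category names mirror the typical Kaggle "BBC News Summary" layout and are assumptions; the output path matches the one stated in step 4.

```python
# Reconstructed consolidation script (sketch): pairs article and summary
# files by filename, handles encoding fallbacks, and writes one CSV.
import glob
import os

import pandas as pd

ARTICLE_ROOT = "/kaggle/input/bbc-news-summary/BBC News Summary/News Articles"  # assumed
SUMMARY_ROOT = "/kaggle/input/bbc-news-summary/BBC News Summary/Summaries"      # assumed
CATEGORIES = ["business", "entertainment", "politics", "sport", "tech"]

text_folders = [os.path.join(ARTICLE_ROOT, c) for c in CATEGORIES]
summ_folders = [os.path.join(SUMMARY_ROOT, c) for c in CATEGORIES]


def read_file(path, encodings=("utf-8", "latin-1")):
    """Try several encodings and return the first successful read."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as handle:
                return handle.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Could not decode {path} with {encodings}")


def get_texts(text_folders, text_list, summ_folders, sum_list,
              encodings=("utf-8", "latin-1")):
    """Append matching article and summary contents to the given lists."""
    for art_dir, sum_dir in zip(text_folders, summ_folders):
        for art_path in sorted(glob.glob(os.path.join(art_dir, "*.txt"))):
            sum_path = os.path.join(sum_dir, os.path.basename(art_path))
            if not os.path.exists(sum_path):
                continue  # skip articles without a paired summary
            text_list.append(read_file(art_path, encodings))
            sum_list.append(read_file(sum_path, encodings))
    return text_list, sum_list


text_list, summ_list = get_texts(text_folders, [], summ_folders, [])
data_df = pd.DataFrame({"Text": text_list, "Summary": summ_list})
data_df.to_csv("/kaggle/working/bbc_news_data.csv", index=False)
print(data_df.head())
```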

    Column Descriptions:

    • Text: Contains the full-length articles or passages of news content. This column is used as the input for summarization models.
    • Summary: Contains concise summaries of the corresponding articles in the "Text" column. This column is used as the target output for summarization models.

    Usage:

    • This script is designed to be run in a Kaggle environment where paths to text data are predefined.
    • It is intended for preprocessing and saving text data from news articles and summaries for subsequent analysis or model training.
  18. Data Citation Corpus Data File

    • zenodo.org
    zip
    Updated Oct 14, 2024
    Cite
    DataCite (2024). Data Citation Corpus Data File [Dataset]. http://doi.org/10.5281/zenodo.13376773
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    DataCitehttps://www.datacite.org/
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.

    The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.

    For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, ex: 2024-08-23-data-citation-corpus-01-v2.0.json.

    The data citations in the file originate from DataCite Event Data and a project by Chan Zuckerberg Initiative (CZI) to identify mentions to datasets in the full text of articles.

    Each data citation record comprises:

    • A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited

    • Metadata for the cited dataset and for the citing publication

    The data file includes the following fields (required fields marked accordingly):

    • id (required): Internal identifier for the citation
    • created (required): Date of item's incorporation into the corpus
    • updated (required): Date of item's most recent update in corpus
    • repository (optional): Repository where cited data is stored
    • publisher (optional): Publisher for the article citing the data
    • journal (optional): Journal for the article citing the data
    • title (optional): Title of cited data
    • publication (required): DOI of article where data is cited
    • dataset (required): DOI or accession number of cited data
    • publishedDate (optional): Date when citing article was published
    • source (required): Source where citation was harvested
    • subjects (optional): Subject information for cited data
    • affiliations (optional): Affiliation information for creator of cited data
    • funders (optional): Funding information for cited data
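    Because each batch is distributed as JSON (the version of record) and CSV, a short script can load one batch and summarize it by the required fields. This is a sketch: the batch file name follows the pattern quoted above but the local copy is an assumption, as is the assumption that the JSON file is a flat array of citation records.

```python
# Hedged sketch for exploring one Data Citation Corpus batch file.
import json

import pandas as pd

BATCH_FILE = "2024-08-23-data-citation-corpus-01-v2.0.json"  # assumed local copy

with open(BATCH_FILE, encoding="utf-8") as handle:
    records = json.load(handle)        # assumed: a list of citation records

df = pd.json_normalize(records)

# Required fields per the list above; optional fields may be missing or empty.
print(df[["id", "dataset", "publication", "source"]].head())

# How many citations were harvested from each source?
print(df["source"].value_counts())

# Citations per repository, when that optional field is populated.
if "repository" in df.columns:
    print(df["repository"].value_counts().head(10))
```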

    Additional documentation about the citations and metadata in the file is available on the Make Data Count website.

    The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:

    Add and update Event Data citations:

    • Add 179,885 new data citations created in DataCite Event Data from 1 June 2023 through 30 June 2024

    Remove citation records deemed out of scope for the corpus:

    • 273,567 records from DataCite Event Data with non-citation relationship types

    • 28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)

    • 44,117 invalid citations where subj_id value was the same as the obj_id value or subj_id and obj_id are inverted, indicating a citation from a dataset to a publication

    • 473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions

    • 4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)

    Metadata enhancements:

    • Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository

    • Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)

    Data structure updates to improve usability and eliminate redundancies:

    • Rename subj_id and obj_id fields to “dataset” and “publication” for clarity

    • Remove accessionNumber and doi elements to eliminate redundancy with subj_id

    • Remove relationTypeId fields as these are specific to Event Data only

    Full details of the above changes, including the scripts used to perform the above tasks, are available on GitHub.

    While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.


    Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.

  19. SmartMeter Energy Consumption Data in London Households

    • data.wu.ac.at
    csv, xlsx, zip
    Updated Sep 26, 2015
    + more versions
    Cite
    London Datastore Archive (2015). SmartMeter Energy Consumption Data in London Households [Dataset]. https://data.wu.ac.at/schema/datahub_io/MDAzMjYwNDMtNjJiNi00N2E4LTlhNDktMWFhMjI2YjdlMmM0
    Explore at:
    zip(802288064.0), zip(802394933.0), csv(1010679.0), xlsx(245384.0)Available download formats
    Dataset updated
    Sep 26, 2015
    Dataset provided by
    London Datastore Archive
    Description

    Energy consumption readings for a sample of 5,567 London Households that took part in the UK Power Networks led Low Carbon London project between November 2011 and February 2014.

    Readings were taken at half hourly intervals. Households have been allocated to a CACI Acorn group (2010). The customers in the trial were recruited as a balanced sample representative of the Greater London population.

    The dataset contains energy consumption in kWh (per half hour), a unique household identifier, date and time, and CACI Acorn group. The CSV file is around 10 GB when unzipped and contains around 167 million rows.
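    At this size, the file is easiest to process in chunks. The sketch below computes daily totals; the file name and the column names ("DateTime", "KWH/hh") are assumptions and should be checked against the downloaded file's header.

```python
# Chunked aggregation sketch for the ~10 GB half-hourly readings CSV.
import pandas as pd

CSV_PATH = "smartmeter_london_households.csv"   # hypothetical local file name

daily_totals = {}
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000,
                         parse_dates=["DateTime"]):                # assumed column
    chunk["kwh"] = pd.to_numeric(chunk["KWH/hh"], errors="coerce") # assumed column
    grouped = chunk.groupby(chunk["DateTime"].dt.date)["kwh"].sum()
    for day, total in grouped.items():
        daily_totals[day] = daily_totals.get(day, 0.0) + total

daily = pd.Series(daily_totals).sort_index()
print(daily.head())          # total kWh recorded across the sample per day
```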

    Within the data set are two groups of customers. The first is a sub-group of approximately 1,100 customers who were subjected to Dynamic Time of Use (dToU) energy prices throughout the 2013 calendar year. The tariff prices were given a day ahead via the Smart Meter IHD (In Home Display) or by text message to mobile phone. Customers were issued High (67.20p/kWh), Low (3.99p/kWh) or Normal (11.76p/kWh) price signals together with the times of day these applied. The dates/times and the price signal schedule are available as part of this dataset. All non-Time of Use customers were on a flat-rate tariff of 14.228p/kWh.

    The signals were designed to be representative of the types of signal that may be used in the future both to manage high renewable generation (supply-following operation) and to test the potential of high price signals to reduce load on local distribution grids during periods of stress.

    The energy consumption readings of the remaining sample of approximately 4,500 customers were not subject to the dToU tariff.

    More information can be found on the Low Carbon London webpage.

    Some analysis of this data can be seen here.

  20. Data from: A 26-year time series of mortality and growth of the Pacific oyster C. gigas recorded along French coasts

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 19, 2022
    Cite
    Mazaleyrat, Anna; Normand, Julien; Dubroca, Laurent; Fleury, Elodie (2022). Data from: A 26-year time series of mortality and growth of the Pacific oyster C. gigas recorded along French coasts [Dataset]. http://doi.org/10.5281/zenodo.6536065
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 19, 2022
    Dataset provided by
    Laboratoire de Biologie des Organismes et Ecosystèmes Aquatiques (BOREA) Université de Caen-Normandie, MNHN, SU, UA, CNRS, IRD, Esplanade de la Paix – CS, 14032 CAEN Cedex 5, France
    Univ Brest, Ifremer, CNRS, IRD, LEMAR, F-29280 Plouzané, France
    Ifremer, LRHPB, F-14520 Port-en-Bessin, France
    Ifremer, LERN, F-14520 Port-en-Bessin, France
    Authors
    Mazaleyrat, Anna; Normand, Julien; Dubroca, Laurent; Fleury, Elodie
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Contents: database of oyster growth (i.e., the changes in mass over time) and mortality along French coasts since 1993. To build this database, we took advantage of the Pacific oyster production monitoring network coordinated by IFREMER (the French Research Institute for the Exploitation of the Sea). This network has monitored the growth and mortality of spat (less than one-year-old individuals) and half-grown (between one- and two-year-old individuals) Crassostrea gigas oysters since 1993. As the number of sites monitored over the years varied, we focused on 13 sites that were almost continuously monitored during this period. For these locations, we modeled growth and cumulative mortality for spat and half-grown oysters as a function of time, to cope with changes in data acquisition frequency, and produced standardized growth and cumulative mortality indicators to improve data usability. Code to reproduce these analyses is archived here, as well as the figures included in the companion data paper: "A 26-year time series of mortality and growth of the Pacific oyster C. gigas recorded along French coasts".

    Sampling protocol: in the oyster production monitoring network, oysters were mainly reared in plastic meshed bags fixed on iron tables, mimicking oyster farmers' practices. After their deployment at the beginning of the campaign (seeding dates from February to April depending on the year), growth and mortality were monitored longitudinally each year. At each sampling date, local operators carefully emptied each bag into separate baskets, counted the dead and live individuals, and removed the dead ones. Local operators then weighed all live individuals in each basket (mass taken at the bag level, a protocol mainly used between 1993 and 1998 and since 2004) and/or collected 30 individuals to weigh individually in the laboratory (mass taken at the individual level, a protocol used between 1995 and 2010 for spat and since 1996 for half-grown oysters).

    Data:

    • AllDataresco.csv is a CSV file containing the raw observations of oyster growth and mortality recorded within the REMORA, RESCO and ECOSCOPA programs. This data set is a modified extraction (carried out on 2021-07-20) of the RESCO REMORA Database (https://doi.org/10.17882/53007) available in SEANOE, an academic publisher of marine research data. The table contains 571,101 rows and 18 columns (a small loading sketch follows this column list). Description of columns:
      • program: the name of the program. Blank cells indicate that this information was not available.
      • mnemonic_site: the mnemonic is a unique identifier of the site and is constructed as follows: code of the marine area - P (for monitoring point) - order number of the monitoring location in the marine area. For example, 014-P-055.
      • site: the name of the site.
      • class_age: the age class of the oyster: N0 (spat), J1 (half-grown) or A2 (commercial size). Blank cells indicate that this information was not available.
      • ploidy: the ploidy of the oysters: diploïdes or triploïdes (in English: diploid or triploid). Blank cells indicate that this information was not available.
      • date: the date of data collection (format DD/MM/YYYY).
      • mnemonic_date: mnemonic of the visit. The name of the quarterly operation (P0, P1, P2, P3 or RF: last data collection). For intermediate operations, we use the previous name of the operation followed by an underscore and the number of the week. For example, data collection on 2019-05-06 corresponds to P1_S19. Biométrie initiale (in English: initial biometrics) is equivalent to P0 (first data collected during the campaign).
      • param: the name of the measured parameter: Nombre d'individus morts, Nombre d'individus vivants, Poids de l'individu or Poids total des individus vivants (in English: number of dead oysters, number of alive oysters, mass of the individual and total mass of alive individuals).
      • code_param: code of the measured parameter. INDVVIVNB = number of alive oysters, INDVMORNB = number of dead oysters, INDVPOID = mass of the individual, TOTVIVPOI = total mass of alive individuals (i.e., the mass of the bag).
      • unit_measure: the unit of measurement: Gramme or Unité de dénombrement (d'individus, de cellules, ...) (in English: gram, or counting unit (of individuals, cells, ...)).
      • fraction: either the measure was made at the bag level on which case the fraction is "Sans objet" = Not applicable or the measure was made at the individual level (code_param = INDVPOID), in which case the fraction indicates the part of the oyster that was measured: Chair totale égouttée or coquille (in English: total flesh drained or shell).
      • method: the method used to obtain the data. For the number of alive and dead oysters (code_param = INDVVIVNB and INDMORNB), the method is comptage macroscopique (in English: macroscopic count). For mass taken at the individual level (code_param = INDVPOID), the method is Pesée après lyophilisation or Pesée simple sans préparation (in English: weighing without preparation or weighing after lyophilization).
      • id_ind: the id of the individual oyster when code_param is INDVPOI or the id of the bag when code_param is INDVVIVNB, INDVPOID and TOTVIVPOI.
      • value: numeric value of the measurement.
      • mnemonic_sampling: This is a concatenated field. Its coding is not consistent throughout the dataset. Indeed, it is sometimes composed of the first letter of the program name attached to 2 numbers indicating the year of data collection and the age class (gj: spat, ga: half-grown or commercial size oysters) - 2 letters indicating the region attached to a 4-character site identifier- mnemonic passage. For example, R05gj-NOBV02-P0 corresponds to data collected in the program REMORA in 2005 on gigas spat (gj) in Normandy (NO) in the site Géfosse 02 (BV02) in the 1st quarter (P0). Other times the mnemonic_prelevement is composed of the first two letters of the program name attached to 2 numbers indicating the year of data collection _ the age class (GJ: spat, GA18: half-grown, GA30: commercial size oysters) attached to the origin of the initial spat group (this information is not always indicated) (CN + number: identifier of wild-caught site, ET + character: identifier of the hatchery, NSI: Argenton hatchery via a standardized protocol) _ a 4-character identifier for the site. For example, RE12_GJET2_BV02 corresponds to data collected in the program REMORA in 2012 (RE12) on gigas spat born in hatchery 2 (GJET2) in the site Géfosse 02 (BV02). Finally, mnemonic_prelevement is sometimes: Biométrie initiale (initial biometrics), Biométrie initiale 6 mois (initial biometrics of spat), Biométrie initiale 18 mois and Biométrie initiale adulte (both correspond to initial biometrics of half-grown oysters), Biométrie initiale 30 mois (initial biometrics of commercial size oysters), Biométrie initiale NSI (initial biometrics of spat batch produced in Argenton Ifremer hatchery via a standardized protocol).
      • long: The longitudinal coordinate of the site given in decimal in the WGS 84 system.
      • lat: The latitudinal coordinate of the site given in decimal in the WGS 84 system.
      • pop_init_batch: this is a concatenated field. It is composed of the two first letters of the name of the program name attached to 2 numbers indicating the year of data collection _ the age class code (GJ: gigas spat, GA: gigas half-grown, GA30: gigas commercial size) _ the origin of the initial spat group (CN: wild-caught, ET: hatchery, NSI: Argenton hatchery) attached to two numbers indicating the year of birth of the initial spat group _ the birth place of the initial spat group (this one is optional). For example, RE00_GJ_CN99_AR corresponds to data collected in the program REMORA in 2000 (RE00) on spat oysters (GJ) born in 1999 and wild-caught (CN99) in the Bay of Arcachon (AR). Blank cells indicate that this information was not available.
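    As referenced above, the sketch below shows one way to load AllDataresco.csv and derive a per-visit mortality fraction from the count parameters. It is a sketch only: the field separator, decimal convention, and exact file location are assumptions to verify against the downloaded file.

```python
# Hedged sketch: reshape the long-format parameter table and compute the
# fraction of dead individuals observed at each visit (not yet cumulative).
import pandas as pd

raw = pd.read_csv("AllDataresco.csv", parse_dates=["date"], dayfirst=True)

counts = (
    raw[raw["code_param"].isin(["INDVVIVNB", "INDVMORNB"])]
    .pivot_table(index=["site", "date", "class_age", "id_ind"],
                 columns="code_param", values="value", aggfunc="sum")
    .reset_index()
)

counts["mortality"] = counts["INDVMORNB"] / (
    counts["INDVMORNB"] + counts["INDVVIVNB"]
)
print(counts.head())
```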

    • sites.csv is a csv file of 7 columns and 13 rows containing information about the 13 sites. Description of the columns found in the data set:
      • num: a unique identifier for each site. Ranges between 1 and 13.
      • site: the abbreviated name of the site.
      • Name: the full name of the site.
      • zone_fr: the French name of the zone where data collection took place.
      • zone_en: the English name of the zone where data collection took place.
      • lat: the latitudinal coordinate of the site given in decimal in the WGS 84 system.
      • long: the longitudinal coordinate of the site given in decimal in the WGS 84 system.

    • DataResco_clean.csv is the curated data set of oyster growth and mortality (csv file). The table contains 5178 rows and 13 columns. Each row corresponds to the mean cumulative mortality and mean mass of oysters for a specific date x site x age class combination. This is the data set we used to fit logistic and Gompertz models to describe mean mass and cumulative mortality at time t. Description of the columns found in the data set:

      • num, site, name, zone_en, lat, long: see the description above for the data set sites.csv.
      • campaign: the year of data collection. Ranges between 1993 and 2018.
      • class_age: the age class of the oyster (i.e. spat: N0 or half-grown: J1).
      • batch: the identifier of the batch (group of oysters born from the same reproductive event, having experienced strictly the same zootechnical route). It is a field that concatenates the campaign, the age class of oysters (spat: N0 or half-grown: J1), the origin of the initial spat group (wild-caught: CAPT or Ifremer hatchery: ECLO), ploidy (diploid: 2n) and birthplace of the original spat group (AR: Bay of Arcachon or E4: