Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset provides grayscale pixel values for brain tumor MRI images, stored in a CSV format for simplified access and ease of use. The goal is to create a "MNIST-like" dataset for brain tumors, where each row in the CSV file represents the pixel values of a single image in its original resolution. This format makes it convenient for researchers and developers to quickly load and analyze MRI data for brain tumor detection, classification, and segmentation tasks without needing to handle large image files directly.
Brain tumor classification and segmentation are critical tasks in medical imaging, and datasets like these are valuable for developing and testing machine learning and deep learning models. While there are several publicly available brain tumor image datasets, they often consist of large image files that can be challenging to process. This CSV-based dataset addresses that by providing a compact and accessible format. Potential use cases include: - Tumor Classification: Identifying different types of brain tumors, such as glioma, meningioma, and pituitary tumors, or distinguishing between tumor and non-tumor images. - Tumor Segmentation: Applying pixel-level classification and segmentation techniques for tumor boundary detection. - Educational and Rapid Prototyping: Ideal for educational purposes or quick experimentation without requiring large image processing capabilities.
This dataset is structured as a single CSV file where each row represents an image, and each column represents a grayscale pixel value. The pixel values are stored as integers ranging from 0 (black) to 255 (white).
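As a minimal sketch of working with this layout: the snippet below loads the CSV with pandas and reshapes one row back into an image. The file name, the presence of a header, and the name of any label column are assumptions, and a square image is assumed since the original resolution is not fixed.

```python
import numpy as np
import pandas as pd

# Hypothetical file name and column layout; adjust to the actual CSV.
df = pd.read_csv("brain_tumor_mnist.csv")

# Treat every column except an assumed "label" column as a pixel value.
pixel_cols = [c for c in df.columns if c != "label"]
pixels = df[pixel_cols].to_numpy(dtype=np.uint8)  # grayscale values 0..255

# Rows keep each image's original resolution; a square image is assumed here.
side = int(np.sqrt(pixels.shape[1]))
first_image = pixels[0, : side * side].reshape(side, side)
print(df.shape, first_image.shape)
```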
This dataset is intended for research and educational purposes only. Users are encouraged to cite and credit the original data sources if using this dataset in any publications or projects. This is a derived CSV version aimed to simplify access and usability for machine learning and data science applications.
https://crawlfeeds.com/privacy_policy
This dataset, provided in CSV format, offers comprehensive details on a wide range of beauty products listed on Mecca Australia, one of the leading beauty retailers in the country.
Perfect for market researchers, data analysts, and beauty industry professionals, this dataset enables a deep dive into product offerings and trends without the clutter of customer reviews.
With the "Mecca Australia Extracted Data" in CSV format, you can easily access and analyze crucial product data, enabling informed decision-making and strategic planning in the beauty industry.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.
Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.
This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.
Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.
In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here
We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.
Notes on data quirks and the companion database scripts:
- The UserId column in the ForumMessages table has values that do not exist in the Users table.
- The Total columns are not always consistent with the detail tables; for example, DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
- Tables are created with the db_abd_create_tables.sql script and cleaned with the clean_data.py script, which processes each table in turn, including the handling of NULL values.
- Foreign keys are added with the add_foreign_keys.sql script, and the Total columns in the database tables are refreshed by running the update_totals.sql script.
For a detailed description of the database of which this record is only one part, please see the HarDWR meta-record. Here we present a new dataset of western U.S. water rights records. This dataset provides consistent unique identifiers for each spatial unit of water management across the domain, unique identifiers for each water right record, and a consistent categorization scheme that puts each water right record into one of 7 broad use categories. These data were instrumental in conducting a study of the multi-sector dynamics of intersectoral water allocation changes through water markets (Grogan et al., in review). Specifically, the data were formatted for use as input to a process-based hydrologic model, WBM, with a water rights module (Grogan et al., in review). While this specific study motivated the development of the database presented here, water management in the U.S. west is a rich area of study (e.g., Anderson and Woosly, 2005; Tidwell, 2014; Null and Prudencio, 2016; Carney et al., 2021), so releasing this database publicly with documentation and usage notes will enable other researchers to do further work on water management in the U.S. west. The raw downloaded data for each state is described in Lisk et al. (in review), as well as here.

The dataset is a series of files organized into state sub-directories. The first two characters of each file name are the abbreviation of the state whose data the file contains; the remainder of the name describes the contents of the file. Each file type is described in detail below.

XXFullHarmonizedRights.csv: A file of the combined groundwater and surface water records for each state. Essentially, this file is the merging of XXGroundwaterHarmonizedRights.csv and XXSurfaceWaterHarmonizedRights.csv by state. The column headers for this type of file are:
- state: The name of the state the data comes from.
- FIPS: The two-digit numeric state ID code.
- waterRightID: The unique identifying ID of the water right, the same identifier as its state uses.
- priorityDate: The priority date associated with the right.
- origWaterUse: The original stated water use(s) from the state.
- waterUse: The water use category under the unified use categories established here.
- source: Whether the right is for surface water or groundwater.
- basinNum: The alpha-numeric identifier of the WMA the record belongs to.
- CFS: The maximum flow of the allocation in cubic feet per second (ft3 s-1).

Arizona is unique among the states, as its surface and groundwater resources are managed with two different sets of boundaries. So, for Arizona, the basinNum column is missing and instead there are two columns:
- surBasinNum: The alpha-numeric identifier of the surface water WMA the record belongs to.
- grdBasinNum: The alpha-numeric identifier of the groundwater WMA the record belongs to.

XXStatePOD.shp: A shapefile which identifies the locations of the Points of Diversion for the state's water rights. It should be noted that not all water right records in XXFullHarmonizedRights.csv have coordinates, and therefore may be missing from this file.

XXStatePOU.shp: A shapefile which contains the area(s) in which each water right is claimed to be used. Currently, only Idaho and Washington provided valid data to include within this file.

XXGroundwaterHarmonizedRights.csv: A file which contains only the harmonized groundwater rights collected from each state. See XXFullHarmonizedRights.csv for more details on how the data is formatted.
XXSurfaceWaterHarmonizedRights.csv: A file which contains only the harmonized surface water rights collected from each state. See XXFullHarmonizedRights.csv for more details on how the data is formatted.

Additionally, one file, stateWMALabels.csv, is not stored within a sub-directory. While we have referred to the spatial boundaries that each state uses to manage its water resources as WMAs, this term is not shared across all states. This file lists the proper name for each boundary set, by state.

For those who may be interested in exploring our code in more depth, we are also making available an internal data file for convenience. The file is in .RData format and contains everything described above as well as some minor additional objects used within the code calculating the cumulative curves. For completeness, here is a detailed description of the objects found within the .RData file:
- states: A character vector containing the state names for those states for which data was collected. More importantly, the index of the state name is also the index at which that state's data can be found in the following list objects. For example, if California is the third index in this object, the data for California will also be in the third index of each accompanying list.
- rightsByState_ground: A list of data frames with the cleaned groundwater rights collected from each state. This object holds the data that is exported to create the xxGroundwaterHarmonizedRights.csv files.
- rightsByState_surface: A list of data frames with the cleaned surface water rights collected from each state. This object holds the data that is exported to create the xxSurfaceWaterHarmonizedRights.csv files.
- fullRightsRecs: A list of the combined groundwater and surface water records for each state. This object holds the data that is exported to create the xxFullHarmonizedRights.csv files.
- projProj: The spatial projection used for map creation at the beginning of the project; specifically, the World Geodetic System (WGS84) as a coordinate reference system (CRS) string in PROJ.4 format.
- wmaStateLabel: The name and/or abbreviation for what each state legally calls their WMAs.
- h2oUseByState: A list of spatial polygon data frames which contain the area(s) in which each water right is claimed to be used. It should be noted that not all water right records have a listed area of use in this object. Currently, only Idaho and Washington provided valid data to be included in this object.
- h2oDivByState: A list of spatial points data frames which identify the location of the Point of Diversion for each state's water rights. It should be noted that not all water right records have a listed Point of Diversion in this object.
- spatialWMAByState: A list of spatial polygon data frames which contain the spatial WMA boundaries for each state. The only data contained within the table are identifiers for each polygon. It is worth reiterating that Arizona is the only state in which the surface and groundwater WMA boundaries are not the same.
- wmaIDByState: A list which contains the unique ID values of the WMAs for each state.
- plottingDim: A character vector used to inform mapping functions for internal map making. Each state is classified as either "tall" or "wide", to maximize space on a typical 8x11 page.

The code related to the creation of this dataset can be viewed within HarDWR GitHub Repository/dataHarmonization.
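As a hedged sketch of reading one state's harmonized rights file with pandas: the state abbreviation and directory layout below are placeholders, and the grouping assumes the basinNum column (for Arizona, surBasinNum or grdBasinNum would be used instead).

```python
import pandas as pd

# Placeholder state code and path; substitute the sub-directory you downloaded.
state = "CA"
rights = pd.read_csv(f"{state}/{state}FullHarmonizedRights.csv")

# Total allocated flow (CFS) per water management area and unified use category.
summary = (rights.groupby(["basinNum", "waterUse"])["CFS"]
                 .sum()
                 .reset_index())
print(summary.head())
```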
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.
The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850
As it contains long off-periods with zeros, the CSV file compresses well.
To extract it, use: xz -d DARCK.csv.xz.
The compression reduces the file size by 97% (from 4 GB to 90.9 MB).
To use the dataset in Python, you can, for example, load the CSV file into a pandas DataFrame:

```python
import pandas as pd

df = pd.read_csv("DARCK.csv", parse_dates=["time"])
```
The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.
The dataset is provided as a single comma-separated values (CSV) file, DARCK.csv.
| Column Name | Data Type | Unit | Description |
|---|---|---|---|
| time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS |
| main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel. |
| [appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list. |
| **Aggregate Columns** | | | |
| aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger. |
| aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2. |
| aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap. |
| **Analysis Columns** | | | |
| inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30 W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for. |
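The sketch below recomputes the inaccuracy column as described above; it assumes that every column other than time, main, inaccuracy, and the aggr_* columns is an individual appliance channel, which may not hold exactly for the released file.

```python
import pandas as pd

df = pd.read_csv("DARCK.csv", parse_dates=["time"]).set_index("time")

# Assumed: all remaining columns are individual appliance channels.
appliance_cols = [c for c in df.columns
                  if c not in ("main", "inaccuracy") and not c.startswith("aggr_")]

# 30 W offset accounts for the power drawn by the measurement devices themselves.
recomputed = (df[appliance_cols].sum(axis=1) + 30 - df["main"]).abs()
print((recomputed - df["inaccuracy"]).abs().describe())
```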
The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.
Aggregate (main) postprocessing: The aggregate power data required several cleaning steps to ensure accuracy.
Shelly (shellies) postprocessing: The Shelly devices are not prone to the same burst issue as the ESP8266. They push a new reading at every change in power drawn. If no power change is observed, or the observed change is too small (less than a few Watt), a reading is pushed once a minute together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.
The Shelly readings were aligned to a common one-second time index using .resample('1s').last().ffill(). NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption. During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.
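A minimal sketch of that resampling step, applied to one raw Shelly power series; the raw file layout and the column names "time" and "power" are assumptions, not the published schema.

```python
import pandas as pd

# Hypothetical raw export: one row per MQTT message with timestamp and Watt value.
raw = pd.read_csv("shellies.csv", parse_dates=["time"])

aligned = (raw.set_index("time")["power"]
              .resample("1s").last()   # keep the last sub-second value per second
              .ffill()                 # hold the reading until the next change
              .fillna(0.0))            # assume zero consumption before installation
print(aligned.head())
```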
The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# FiN-2 Large-Scale Real-World PLC-Dataset
## About
#### FiN-2 dataset in a nutshell:
FiN-2 is the first large-scale real-world dataset on data collected in a powerline communication infrastructure. Since the electricity grid is inherently a graph, our dataset could be interpreted as a graph dataset. Therefore, we use the word node to describe points (cable distribution cabinets) of measurement within the low-voltage electricity grid and the word edge to describe connections (cables) in between them. However, since these are PLC connections, an edge does not necessarily have to correspond to a real cable; more on this in our paper.
FiN-2 shows measurements that relate to the nodes (voltage, total harmonic distortion) as well as to the edges (signal-to-noise ratio spectrum, tonemap). In total, FiN-2 is distributed across three different sites with a total of 1,930,762,116 node measurements each for the individual features and 638,394,025 edge measurements each for all 917 PLC channels. All data was collected over a 25-month period from mid-2020 to the end of 2022.
We propose this dataset to foster research in the domain of grid automation and smart grid. Therefore, we provide different example use cases in asset management, grid state visualization, forecasting, predictive maintenance, and novelty detection. For more detailed information on this dataset, please see our [paper](https://arxiv.org/abs/2209.12693).
* * *
## Content
The FiN-2 dataset is split into two compressed CSV files: *nodes.csv* and *edges.csv*.
All files are provided as a compressed ZIP file and are divided into four parts. The first part can be found in this repo, while the remaining parts can be found in the following:
- https://zenodo.org/record/8328105
- https://zenodo.org/record/8328108
- https://zenodo.org/record/8328111
### Node data
| id | ts | v1 | v2 | v3 | thd1 | thd2 | thd3 | phase_angle1 | phase_angle2 | phase_angle3 | temp |
|----|----|----|----|----|----|----|----|----|----|----|----|
|112|1605530460|236.5|236.4|236.0|2.9|2.5|2.4|120.0|119.8|120.0|35.3|
|112|1605530520|236.9|236.6|236.6|3.1|2.7|2.5|120.1|119.8|120.0|35.3|
|112|1605530580|236.2|236.4|236.0|3.1|2.7|2.5|120.0|120.0|119.9|35.5|
- id / ts: Unique identifier of the node that is measured and timestamp of the measurement
- v1/v2/v3: Voltage measurements of all three phases
- thd1/thd2/thd3: Total harmonic distortion of all three phases
- phase_angle1/2/3: Phase angle of all three phases
- temp: Temperature in-circuit of the sensor inside a cable distribution unit (in °C)
### Edge data
| src | dst | ts | snr0 | snr1 | snr2 | ... | snr916 |
|----|----|----|----|----|----|----|----|
|62|94|1605528900|70|72|45|...|-53|
|62|32|1605529800|16|24|13|...|-51|
|17|94|1605530700|37|25|24|...|-55|
- src & dst & ts: Unique identifier of the source and target nodes where the spectrum is measured and time of measurement
- snr0/snr1/.../snr916: 917 SNR measurements in tenths of a decibel (e.g. 50 --> 5dB).
### Metadata
Metadata that is provided along with the data covers:
- Number of cable joints
- Cable properties (length, type, number of sections)
- Relative position of the nodes (location, zero-centered gps)
- Adjacent PV or wallbox installations
- Year of installation w.r.t. the nodes and cables
Since the electricity grid is part of the critical infrastructure, it is not possible to provide exact GPS locations.
* * *
## Usage
Simple data access using pandas:
```python
import pandas as pd
nodes_file = "nodes.csv.gz" # /path/to/nodes.csv.gz
edges_file = "edges.csv.gz" # /path/to/edges.csv.gz
# read the first 10 rows
data = pd.read_csv(nodes_file, nrows=10, compression='gzip')
# read data rows 6 to 15 (skip the first 5 data rows after the header)
data = pd.read_csv(nodes_file, nrows=10, skiprows=[i for i in range(1, 6)], compression='gzip')
# ... same for the edges
```
The compressed CSV format was chosen to make sharing as easy as possible; however, it comes with significant drawbacks for machine learning. Due to the inherent graph structure, a single snapshot of the whole graph consists of a set of node and edge measurements. But due to timeouts, noise and other disturbances, nodes sometimes fail to collect data, so the number of measurements available for a specific timestamp varies. This, plus the high sparsity of the graph, makes the CSV format inefficient for ML training.
To utilize the data in an ML pipeline, we recommend other data formats like [datadings](https://datadings.readthedocs.io/en/latest/) or specialized database solutions like [VictoriaMetrics](https://victoriametrics.com/).
### Example use case (voltage forecasting)
Forecasting the voltage is one potential use case. The Jupyter notebook provided in the repository gives an overview of how the dataset can be loaded, preprocessed and used for ML training. MinMax scaling is used as a simple preprocessing step and a PyTorch dataset class is created to handle the data. Furthermore, a vanilla autoencoder is used to process the voltage and forecast it into the future.
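The notebook itself is not reproduced here; the following is a hypothetical sketch of such a sliding-window PyTorch dataset class over one node's voltages, with the window and horizon lengths chosen arbitrarily.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class VoltageWindows(Dataset):
    """Sliding windows over one node's three-phase voltages (sketch only)."""

    def __init__(self, csv_path, node_id, window=60, horizon=10):
        df = pd.read_csv(csv_path, compression="gzip")
        df = df[df["id"] == node_id].sort_values("ts")
        # Min-max scale the phase voltages, in the spirit of the example notebook.
        v = df[["v1", "v2", "v3"]].to_numpy(dtype="float32")
        v = (v - v.min(axis=0)) / (v.max(axis=0) - v.min(axis=0) + 1e-9)
        self.v = torch.from_numpy(v)
        self.window, self.horizon = window, horizon

    def __len__(self):
        return max(0, len(self.v) - self.window - self.horizon + 1)

    def __getitem__(self, i):
        x = self.v[i : i + self.window]                                # past voltages
        y = self.v[i + self.window : i + self.window + self.horizon]  # future voltages
        return x, y

# ds = VoltageWindows("nodes.csv.gz", node_id=112)
```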
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.
The file dirty_cafe_sales.csv contains the following columns:

| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich |
| Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN |
| Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00 |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00 |
| Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway |
| Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01 |
Missing Values: Columns such as Item, Payment Method, and Location may contain missing values represented as None or empty cells.
Invalid Values: Some entries contain "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency: The dataset includes the following menu items with their respective prices:
| Item | Price($) |
|---|---|
| Coffee | 2 |
| Tea | 1.5 |
| Sandwich | 4 |
| Salad | 5 |
| Cake | 3 |
| Cookie | 1 |
| Smoothie | 4 |
| Juice | 3 |
This dataset is suitable for: - Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries. - Exploring EDA techniques like visualizations and summary statistics. - Performing feature engineering for machine learning workflows.
To clean this dataset, consider the following steps:
1. Handle Missing Values: Fill missing numeric values with the median or mean; replace missing categorical values with the mode or "Unknown."
2. Handle Invalid Values: Replace "ERROR" and "UNKNOWN" with NaN or other appropriate values.
3. Date Consistency: Convert Transaction Date to a valid date format and handle missing or incorrect dates.
4. Feature Engineering: Create new features, such as Day of the Week or Transaction Month, for further analysis.

This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.
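A minimal pandas sketch of the cleaning steps suggested above; the fill strategies are examples, not the only valid choices.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("dirty_cafe_sales.csv")

# Treat the documented placeholder strings as missing values.
df = df.replace(["ERROR", "UNKNOWN"], np.nan)

# Numeric columns: coerce and fill with the median.
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: fill with "Unknown".
for col in ["Item", "Payment Method", "Location"]:
    df[col] = df[col].fillna("Unknown")

# Dates: coerce invalid entries to NaT, then derive simple features.
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")
df["Day of the Week"] = df["Transaction Date"].dt.day_name()
df["Transaction Month"] = df["Transaction Date"].dt.month
```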
If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, together with tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
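A hedged sketch of that workflow follows: it loads the three CSV files named above and fits a simple scikit-learn classifier. The label column name ("target") is a placeholder, since the actual feature names are not listed in this description.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# "target" is a placeholder for the actual label column in these files.
train = pd.read_csv("train_data.csv")
valid = pd.read_csv("validation_data.csv")
test = pd.read_csv("test_data.csv")

X_train, y_train = train.drop(columns=["target"]), train["target"]
X_valid, y_valid = valid.drop(columns=["target"]), valid["target"]
X_test, y_test = test.drop(columns=["target"]), test["target"]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_valid, model.predict(X_valid)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```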
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
https://digital.nhs.uk/about-nhs-digital/terms-and-conditions
Warning: Large file size (over 1 GB). Each monthly data set is large (over 4 million rows), but can be viewed in standard software such as Microsoft WordPad (save by right-clicking on the file name and selecting 'Save Target As', or equivalent on Mac OS X). It is then possible to select the required rows of data and copy and paste the information into another software application, such as a spreadsheet. Alternatively, add-ons to existing software that handle larger data sets, such as the Microsoft PowerPivot add-on for Excel, can be used. The Microsoft PowerPivot add-on for Excel is available from Microsoft: http://office.microsoft.com/en-gb/excel/download-power-pivot-HA101959985.aspx

Once PowerPivot has been installed, follow the instructions below to load the large files. Note that it may take at least 20 to 30 minutes to load one monthly file.

1. Start Excel as normal.
2. Click on the PowerPivot tab.
3. Click on the PowerPivot Window icon (top left).
4. In the PowerPivot Window, click on the "From Other Sources" icon.
5. In the Table Import Wizard, scroll to the bottom and select Text File.
6. Browse to the file you want to open and choose the file extension you require, e.g. CSV.

Once the data has been imported you can view it in a spreadsheet.

What does the data cover? General practice prescribing data is a list of all medicines, dressings and appliances that are prescribed and dispensed each month. A record will only be produced when this has occurred; there is no record for a zero total. For each practice in England, the following information is presented at presentation level for each medicine, dressing and appliance (by presentation name):

- the total number of items prescribed and dispensed
- the total net ingredient cost
- the total actual cost
- the total quantity

The data covers NHS prescriptions written in England and dispensed in the community in the UK. Prescriptions written in England but dispensed outside England are included. The data includes prescriptions written by GPs and other non-medical prescribers (such as nurses and pharmacists) who are attached to GP practices. GP practices are identified only by their national code, so an additional data file, linked to the first by the practice code, provides further detail in relation to the practice. Presentations are identified only by their BNF code, so an additional data file, linked to the first by the BNF code, provides the chemical name for that presentation.
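For those working outside Excel, the large monthly file can also be processed in chunks. The sketch below assumes a local file name and column names ("BNF CODE", "ITEMS") that should be checked against the header of the release actually downloaded.

```python
import pandas as pd

# Stream the large monthly CSV in chunks instead of loading it all at once.
# File name and column names are placeholders for the downloaded release.
totals = {}
for chunk in pd.read_csv("prescribing_monthly.csv", chunksize=500_000):
    grouped = chunk.groupby("BNF CODE")["ITEMS"].sum()
    for code, items in grouped.items():
        totals[code] = totals.get(code, 0) + items

print(f"{len(totals)} distinct presentations")
```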
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Title: Group Health (Sleep and Screen Time) Dataset
🎯 Assignment #1: Career Change Prediction Analysis
1. Dataset Overview and Project Goal
Dataset: career_change_prediction_dataset.csv (38,444 rows, 22 features)
Source: Kaggle
Research Question: What are the primary factors that predict an individual's likelihood of changing careers?
Target Variable: Likely to Change Occupation (Binary Classification: 0/1)
2. Data Handling and Integrity (The Logical Process)
Before any analysis could begin, the first… See the full description on the dataset page: https://huggingface.co/datasets/harry120/career_change_prediction_analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of the following files:

- normal_SERS.zip: Contains 500 CSV files under the folder normal_SERS/. File naming pattern: NOR _.CSV
- HTN_SERS.zip: Contains 500 CSV files under the folder HTN_SERS/. File naming pattern: HBP _.CSV
- DM_SERS.zip: Contains 500 CSV files under the folder DM_SERS/. File naming pattern: DIA _.CSV
- HTN+DM_SERS.zip: Contains 500 CSV files under the folder HTN+DM_SERS/. File naming pattern: H.D. _.CSV
- colorectal+cancer_SERS.zip: Contains 1,500 CSV files under the folder colorectal+cancer_SERS/. File naming pattern: CRC _.CSV
- lung+cancer_SERS.zip: Contains 1,000 CSV files under the folder lung+cancer_SERS/. File naming pattern: LUN _.CSV
- pancreatic+cancer_SERS.zip: Contains 265 CSV files under the folder pancreatic+cancer_SERS/. File naming pattern: SPAN _.CSV
- sample_metadata.csv: Sample-level metadata linking each spectrum file to its clinical group, subject, and replicate index.

## sample_metadata.csv columns

The sample_metadata.csv file has one row per SERS spectrum (4,765 rows in total) and the following columns:
- group: descriptive group label, e.g., Normal control, Hypertension, Diabetes mellitus, Hypertension + Diabetes, Colorectal cancer, Lung cancer, Pancreatic cancer.
- group_code: short group code, e.g., Normal, HTN, DM, HTN+DM, CRC, LungCA, PancreasCA.
- original_prefix: prefix as it appears in the original file names: NOR, HBP, DIA, H.D., CRC, LUN, SPAN.
- canonical_prefix: cleaned/standardized prefix used for constructing sample_id: NOR, HBP, DIA, HD, CRC, LUN, SPAN. For example, H.D. → HD.
- subject_id: integer subject identifier within each prefix (1–100, 1–300, 1–200, or 1–53 depending on group).
- sample_id: standardized subject identifier combining canonical_prefix and zero-padded subject_id, e.g., NOR_001, HBP_093, DIA_048, HD_027, CRC_077, LUN_151, SPAN_022.
- replicate_index: technical replicate index (1–5).
- filename: original CSV file name (e.g., HBP 93_5.CSV).
- filepath_in_zip: relative path to the CSV file inside the corresponding zip archive (e.g., HTN_SERS/HBP 93_5.CSV).
- zip_file: name of the zip archive that contains this file (e.g., HTN_SERS.zip).

## Data format

- Each CSV file contains two columns without a header:
  1. Raman shift (cm⁻¹), typically spanning ~50–3300 cm⁻¹
  2. SERS intensity (arbitrary units)
- All spectra have a uniform number of data points (rows) per file.
- No baseline correction, smoothing, normalization, or other signal processing has been applied. These spectra should be considered raw measurements.

## Recommended usage

This dataset is suitable for:
- Development and benchmarking of:
  - Preprocessing algorithms (baseline correction, denoising, normalization).
  - Feature extraction and dimensionality reduction methods for SERS.
  - Diagnostic and multi-disease classification models based on SERS spectra.
- Methodological studies on:
  - Handling of technical replicates.
  - Cross-disease model generalization and domain adaptation.

Users are encouraged to:
- Implement and clearly describe their own preprocessing and validation strategies.
- Report details such as train/validation splits, cross-validation schemes, and performance metrics when publishing work based on this dataset.
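A minimal sketch of reading one spectrum together with its metadata row; it assumes the zip archives have been extracted next to sample_metadata.csv, so that filepath_in_zip resolves as a local path.

```python
import pandas as pd

meta = pd.read_csv("sample_metadata.csv")
row = meta.iloc[0]  # pick one spectrum to inspect

# Each spectrum CSV has two header-less columns: Raman shift and intensity.
spectrum = pd.read_csv(row["filepath_in_zip"], header=None,
                       names=["raman_shift_cm-1", "intensity"])

print(row["group"], row["sample_id"], row["replicate_index"])
print(spectrum.head())
```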
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset overview: This repository contains the AQUAIR Dataset, a high-resolution log of indoor-environment quality (IEQ) gathered in a trout (Oncorhynchus mykiss) hatchery room at Amghass, Azrou, Morocco. Six airborne variables (air temperature, relative humidity, carbon dioxide (CO₂), total volatile organic compounds (TVOC), fine particulate matter (PM₂.₅) and inhalable particulate matter (PM₁₀)) were sampled every 5 minutes between 14 October 2024 and 09 January 2025. The data are provided as two comma-separated files:

- AQUAIR_1.csv: Contains data recorded from 14 October 2024 to 10 December 2024, with a total of 16,533 rows.
- AQUAIR_2.csv: Contains data recorded from 15 December 2024 to 9 January 2025, with a total of 7,323 rows.

Combined, the set delivers 23,856 time-stamped observations suitable for time-series modelling, forecasting, anomaly detection and studies of airborne stressors in aquaculture facilities.

Parameters and units:

| Parameter | Unit | Relevance in trout culture |
|---|---|---|
| Temperature | °C | Influences metabolic rate, feed conversion and dissolved-oxygen levels. |
| Relative humidity | % RH | High RH accelerates mould growth; low RH increases evaporation. |
| CO₂ | ppm | Head-space CO₂ equilibrates with water; sustained excess slows growth. |
| VOC | ppb | Proxy for disinfectant off-gassing and human activity; ventilation indicator. |
| PM₂.₅ | µg m⁻³ | Fine particles can load bio-filters and irritate gill tissue. |
| PM₁₀ | µg m⁻³ | Coarser dust from feed handling and maintenance. |

All values are recorded in SI units; timestamps use ISO-8601 in Coordinated Universal Time (UTC).

Reuse potential:
- Benchmark short-horizon IEQ forecasting (ARIMA, LSTM, transformer models).
- Develop anomaly detectors for hatchery monitoring dashboards.
- Correlate airborne conditions with fish-health metrics in future multi-modal studies.
- Validate low-cost sensor stability in high-humidity aquaculture environments.

How to cite: If you use the AQUAIR dataset, please also cite our paper: Sabiri, Y., Houmaidi, W., El Maadi, O., & Chtouki, Y. (2025). AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring. arXiv:2509.24069. https://arxiv.org/abs/2509.24069

BibTeX:
@misc{sabiri2025aquairhighresolutionindoorenvironmental,
  title={AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring},
  author={Youssef Sabiri and Walid Houmaidi and Ouail El Maadi and Yousra Chtouki},
  year={2025},
  eprint={2509.24069},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.24069},
}

Licence: Creative Commons Attribution 4.0 International (CC-BY-4.0).
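As a starting point for the forecasting use cases listed above, the sketch below concatenates both files into one 5-minute series and resamples to hourly means. The time column name ("timestamp") and the variable column names are assumptions and should be checked against the file headers.

```python
import pandas as pd

# "timestamp" is a placeholder for the actual time column name.
parts = [pd.read_csv(f, parse_dates=["timestamp"])
         for f in ("AQUAIR_1.csv", "AQUAIR_2.csv")]
aquair = pd.concat(parts).set_index("timestamp").sort_index()

# Hourly means are a convenient baseline aggregation for forecasting experiments.
hourly = aquair.resample("1h").mean()
print(hourly.describe())
```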
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data sets were collected during motorcycle trips near Vienna in 2021 and 2022. The behavior was split into different classes using videos (not part of the published data due to privacy concerns) and then cut into segments of 10 seconds.
https://crawlfeeds.com/privacy_policy
Explore our comprehensive Target store furniture datasets, designed to provide extensive product details for businesses and researchers. Our datasets include a wide range of information that can be used for market analysis, product development, and competitive strategy.
What’s Included in the Target Store Furniture Datasets:
Our Target store furniture datasets are ideal for businesses looking to enhance their product offerings, optimize pricing strategies, and understand market dynamics within the furniture industry.
Whether you're a retailer, market analyst, or business strategist, our datasets provide the comprehensive information you need to stay ahead in the competitive furniture market.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed for users aiming to train models for text summarization. It contains 2,225 rows of data with two columns: "Text" and "Summary". Each row features a detailed news article or piece of text paired with its corresponding summary, providing a rich resource for developing and fine-tuning summarization algorithms.
This evolving dataset is planned to include additional features, such as text class labels, in future updates. These enhancements will provide more context and facilitate the development of models that can perform summarization across different categories of news content.
Ideal for researchers and developers focused on text summarization tasks, this dataset enables the training of models to effectively compress information while retaining the essence of the original content.
We would like to extend our sincere gratitude to the dataset creator for their contribution to this valuable resource. This dataset, sourced from the BBC News Summary dataset on Kaggle, was created by Pariza. Their work has provided an invaluable asset for those working on text summarization tasks, and we appreciate their efforts in curating and sharing this data with the community.
Thank you for supporting research and development in the field of natural language processing!
This script processes and consolidates text data from various directories containing news articles and their corresponding summaries. It reads the files from specified folders, handles encoding issues, and then creates a DataFrame that is saved as a CSV file for further analysis.
Imports:
- numpy (np): Numerical operations library, though it is not used in this script.
- pandas (pd): Data manipulation and analysis library.
- os: For interacting with the operating system, e.g., building file paths.
- glob: For file pattern matching and retrieving file paths.

Function: get_texts
- text_folders: List of folders containing news article text files.
- text_list: List to store the content of text files.
- summ_folder: List of folders containing summary text files.
- sum_list: List to store the content of summary files.
- encodings: List of encodings to try for reading files.
- The function reads each file, trying the listed encodings, and appends the contents to text_list and sum_list.

Data Preparation:
- text_folder: List of directories for news articles.
- summ_folder: List of directories for summaries.
- text_list and summ_list: Empty lists initialized to store the contents.
- data_df: Empty DataFrame to store the final data.

Execution:
- Calls the get_texts function to populate text_list and summ_list.
- Builds data_df with columns 'Text' and 'Summary'.
- Saves data_df to a CSV file at /kaggle/working/bbc_news_data.csv.

Output:
- A consolidated CSV file (bbc_news_data.csv) pairing each article with its summary, ready for further analysis.
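The original script is not reproduced here; the following is a minimal sketch of the workflow described above, with the folder paths and the encoding list as assumptions.

```python
import glob
import os
import pandas as pd

def get_texts(folders, out_list, encodings=("utf-8", "latin-1")):
    """Read every .txt file in the given folders, trying several encodings."""
    for folder in folders:
        for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
            for enc in encodings:
                try:
                    with open(path, encoding=enc) as f:
                        out_list.append(f.read())
                    break
                except UnicodeDecodeError:
                    continue

# Placeholder paths: point these at the article and summary directories.
text_folders = ["News Articles/business", "News Articles/tech"]
summ_folders = ["Summaries/business", "Summaries/tech"]

text_list, summ_list = [], []
get_texts(text_folders, text_list)
get_texts(summ_folders, summ_list)

data_df = pd.DataFrame({"Text": text_list, "Summary": summ_list})
data_df.to_csv("/kaggle/working/bbc_news_data.csv", index=False)
```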
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.
For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, ex: 2024-08-23-data-citation-corpus-01-v2.0.json.
The data citations in the file originate from DataCite Event Data and a project by Chan Zuckerberg Initiative (CZI) to identify mentions to datasets in the full text of articles.
Each data citation record comprises:
A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited
Metadata for the cited dataset and for the citing publication
The data file includes the following fields:
| Field | Description | Required? |
|---|---|---|
| id | Internal identifier for the citation | Yes |
| created | Date of item's incorporation into the corpus | Yes |
| updated | Date of item's most recent update in corpus | Yes |
| repository | Repository where cited data is stored | No |
| publisher | Publisher for the article citing the data | No |
| journal | Journal for the article citing the data | No |
| title | Title of cited data | No |
| publication | DOI of article where data is cited | Yes |
| dataset | DOI or accession number of cited data | Yes |
| publishedDate | Date when citing article was published | No |
| source | Source where citation was harvested | Yes |
| subjects | Subject information for cited data | No |
| affiliations | Affiliation information for creator of cited data | No |
| funders | Funding information for cited data | No |
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
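As a hedged sketch of working with one batch: the snippet below assumes a CSV counterpart to the JSON batch file name quoted above exists locally with the fields listed in the table.

```python
import pandas as pd

# Assumed local CSV batch following the "<date>-data-citation-corpus-<nn>-v2.0" pattern.
batch = pd.read_csv("2024-08-23-data-citation-corpus-01-v2.0.csv")

# Required fields per the table: id, created, updated, publication, dataset, source.
print(batch[["dataset", "publication", "source"]].head())

# Optional fields may be empty; count citations per repository where present.
print(batch["repository"].value_counts().head(10))
```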
The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:
Add and update Event Data citations:
Add 179,885 new data citations created in DataCite Event Data between 01 June 2023 through 30 June 2024
Remove citation records deemed out of scope for the corpus:
273,567 records from DataCite Event Data with non-citation relationship types
28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)
44,117 invalid citations where subj_id value was the same as the obj_id value or subj_id and obj_id are inverted, indicating a citation from a dataset to a publication
473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions
4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)
Metadata enhancements:
Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository
Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)
Data structure updates to improve usability and eliminate redundancies:
Rename subj_id and obj_id fields to “dataset” and “publication” for clarity
Remove accessionNumber and doi elements to eliminate redundancy with subj_id
Remove relationTypeId fields as these are specific to Event Data only
Full details of the above changes, including scripts used to perform the above tasks, are available in GitHub.
While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.
Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.
Energy consumption readings for a sample of 5,567 London Households that took part in the UK Power Networks led Low Carbon London project between November 2011 and February 2014.
Readings were taken at half hourly intervals. Households have been allocated to a CACI Acorn group (2010). The customers in the trial were recruited as a balanced sample representative of the Greater London population.
The dataset contains energy consumption in kWh (per half hour), a unique household identifier, date and time, and the CACI Acorn group. The CSV file is around 10GB when unzipped and contains around 167 million rows.

Within the data set are two groups of customers. The first is a sub-group, of approximately 1,100 customers, who were subjected to Dynamic Time of Use (dToU) energy prices throughout the 2013 calendar year. The tariff prices were given a day ahead via the Smart Meter IHD (In Home Display) or by text message to mobile phone. Customers were issued High (67.20p/kWh), Low (3.99p/kWh) or normal (11.76p/kWh) price signals and the times of day these applied. The dates/times and the price signal schedule are available as part of this dataset. All non-Time of Use customers were on a flat rate tariff of 14.228 pence/kWh.
The signals given were designed to be representative of the types of signal that may be used in the future, both to manage operation under high renewable generation (supply following) and to test the potential of high price signals to relieve local distribution grids during periods of stress.
The energy consumption readings of the remaining sample of approximately 4,500 customers were not subject to the dToU tariff.
More information can be found on the Low Carbon London webpage
Some analysis of this data can be seen here.
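Because the unzipped file is around 10 GB, a chunked aggregation is a practical starting point. The column names in the sketch below ("LCLid", "DateTime", "KWH/hh") are placeholders and should be checked against the header of the file actually downloaded.

```python
import pandas as pd

# Aggregate total consumption per household without loading the full 10 GB file.
per_household = None
for chunk in pd.read_csv("low_carbon_london.csv", chunksize=1_000_000,
                         parse_dates=["DateTime"]):
    chunk["KWH/hh"] = pd.to_numeric(chunk["KWH/hh"], errors="coerce")
    part = chunk.groupby("LCLid")["KWH/hh"].sum()
    per_household = part if per_household is None else per_household.add(part, fill_value=0)

print(per_household.sort_values(ascending=False).head())
```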
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contents: database of oyster growth (i.e., the changes in mass over time) and mortality along French coasts since 1993. To build this database, we took advantage of the Pacific oyster production monitoring network coordinated by IFREMER (the French Research Institute for the Exploitation of the Sea). This network monitors the growth and mortality of spat (less than one-year-old individuals) and half-grown (between one and two-year-old individuals) Crassostrea gigas oysters since 1993. As the number of sites monitored over the years varied, we focused on 13 sites that were almost continuously monitored during this period. For these locations, we modeled growth and cumulative mortality for spat and half-grown oysters as a function of time, to cope with changes in data acquisition frequency, and produced standardized growth and cumulative mortality indicators to improve data usability. Code to reproduce these analyses are archived here, as well as figures included in the companion data paper: "A 26-year time series of mortality and growth of the Pacific oyster C. gigas recorded along French coasts".
Sampling protocol: in the oyster production monitoring network, oysters were mainly reared in plastic meshed bags fixed on iron tables, mimicking the oyster farmers practices. After their deployment at the beginning of the campaign (seeding dates from February to April depending on the year), growth and mortality were longitudinally monitored yearly. At each sampling date, local operators carefully emptied each bag in separate baskets, counted the dead individuals and alive ones, and removed the dead individuals. Then local operators weighed all alive individuals in each basket (mass taken at the bag level, protocol mainly used between 1993 and 1998 and since 2004) and/or collected 30 individuals to individually weigh them in the laboratory (mass taken at the individual level, protocol used between 1995 and 2010 for spat and since 1996 for half-grown oysters).
Data:
- num, site, name, zone_en, lat, long: see the description above for the data set sites.csv.
- campaign: the year of data collection. Ranges between 1993 and 2018.
- class_age: the age class of the oyster (i.e. spat: N0 or half-grown: J1).
- batch: the identifier of the batch (group of oysters born from the same reproductive event, having experienced strictly the same zootechnical route). It is a field that concatenates the campaign, the age class of oysters (spat: N0 or half-grown: J1), the origin of the initial spatgroup (wild-caught: CAPT or Ifremer hatchery: ECLO), ploidy (diploid: 2n) and birthplace of the original spatgroup (AR: Bay of Arcachon or E4: