Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset provides grayscale pixel values for brain tumor MRI images, stored in a CSV format for simplified access and ease of use. The goal is to create a "MNIST-like" dataset for brain tumors, where each row in the CSV file represents the pixel values of a single image in its original resolution. This format makes it convenient for researchers and developers to quickly load and analyze MRI data for brain tumor detection, classification, and segmentation tasks without needing to handle large image files directly.
Brain tumor classification and segmentation are critical tasks in medical imaging, and datasets like these are valuable for developing and testing machine learning and deep learning models. While there are several publicly available brain tumor image datasets, they often consist of large image files that can be challenging to process. This CSV-based dataset addresses that by providing a compact and accessible format. Potential use cases include: - Tumor Classification: Identifying different types of brain tumors, such as glioma, meningioma, and pituitary tumors, or distinguishing between tumor and non-tumor images. - Tumor Segmentation: Applying pixel-level classification and segmentation techniques for tumor boundary detection. - Educational and Rapid Prototyping: Ideal for educational purposes or quick experimentation without requiring large image processing capabilities.
This dataset is structured as a single CSV file where each row represents an image, and each column represents a grayscale pixel value. The pixel values are stored as integers ranging from 0 (black) to 255 (white).
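As a minimal sketch of working with this layout: the snippet below loads the CSV with pandas and reshapes one row back into an image. The file name, the presence of a header, and the name of any label column are assumptions, and a square image is assumed since the original resolution is not fixed.

```python
import numpy as np
import pandas as pd

# Hypothetical file name and column layout; adjust to the actual CSV.
df = pd.read_csv("brain_tumor_mnist.csv")

# Treat every column except an assumed "label" column as a pixel value.
pixel_cols = [c for c in df.columns if c != "label"]
pixels = df[pixel_cols].to_numpy(dtype=np.uint8)  # grayscale values 0..255

# Rows keep each image's original resolution; a square image is assumed here.
side = int(np.sqrt(pixels.shape[1]))
first_image = pixels[0, : side * side].reshape(side, side)
print(df.shape, first_image.shape)
```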
This dataset is intended for research and educational purposes only. Users are encouraged to cite and credit the original data sources if using this dataset in any publications or projects. This is a derived CSV version aimed to simplify access and usability for machine learning and data science applications.
https://crawlfeeds.com/privacy_policy
This dataset, provided in CSV format, offers comprehensive details on a wide range of beauty products listed on Mecca Australia, one of the leading beauty retailers in the country.
Perfect for market researchers, data analysts, and beauty industry professionals, this dataset enables a deep dive into product offerings and trends without the clutter of customer reviews.
With the "Mecca Australia Extracted Data" in CSV format, you can easily access and analyze crucial product data, enabling informed decision-making and strategic planning in the beauty industry.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Explore our public data on competitions, datasets, kernels (code / notebooks) and more. Meta Kaggle may not be the Rosetta Stone of data science, but we do think there's a lot to learn (and plenty of fun to be had) from this collection of rich data about Kaggle's community and activity.
Strategizing to become a Competitions Grandmaster? Wondering who, where, and what goes into a winning team? Choosing evaluation metrics for your next data science project? The kernels published using this data can help. We also hope they'll spark some lively Kaggler conversations and be a useful resource for the larger data science community.
This dataset is made available as CSV files through Kaggle Kernels. It contains tables on public activity from Competitions, Datasets, Kernels, Discussions, and more. The tables are updated daily.
Please note: This data is not a complete dump of our database. Rows, columns, and tables have been filtered out and transformed.
In August 2023, we released Meta Kaggle for Code, a companion to Meta Kaggle containing public, Apache 2.0 licensed notebook data. View the dataset and instructions for how to join it with Meta Kaggle here
We also updated the license on Meta Kaggle from CC-BY-NC-SA to Apache 2.0.
Notes on data quirks and the companion database scripts:
- The UserId column in the ForumMessages table has values that do not exist in the Users table.
- The Total columns are not always consistent with the detail tables; for example, DatasetCount is not the total number of datasets with the Tag according to the DatasetTags table.
- Tables are created with the db_abd_create_tables.sql script and cleaned with the clean_data.py script, which processes each table in turn, including the handling of NULL values.
- Foreign keys are added with the add_foreign_keys.sql script, and the Total columns in the database tables are refreshed by running the update_totals.sql script.
For a detailed description of the database of which this record is only one part, please see the HarDWR meta-record. Here we present a new dataset of western U.S. water rights records. This dataset provides consistent unique identifiers for each spatial unit of water management across the domain, unique identifiers for each water right record, and a consistent categorization scheme that puts each water right record into one of 7 broad use categories. These data were instrumental in conducting a study of the multi-sector dynamics of intersectoral water allocation changes through water markets (Grogan et al., in review). Specifically, the data were formatted for use as input to a process-based hydrologic model, WBM, with a water rights module (Grogan et al., in review). While this specific study motivated the development of the database presented here, water management in the U.S. west is a rich area of study (e.g., Anderson and Woosly, 2005; Tidwell, 2014; Null and Prudencio, 2016; Carney et al., 2021), so releasing this database publicly with documentation and usage notes will enable other researchers to do further work on water management in the U.S. west. The raw downloaded data for each state is described in Lisk et al. (in review), as well as here.

The dataset is a series of files organized into state sub-directories. The first two characters of each file name are the abbreviation of the state whose data the file contains; the remainder of the name describes the contents of the file. Each file type is described in detail below.

XXFullHarmonizedRights.csv: A file of the combined groundwater and surface water records for each state. Essentially, this file is the merging of XXGroundwaterHarmonizedRights.csv and XXSurfaceWaterHarmonizedRights.csv by state. The column headers for this type of file are:
- state: The name of the state the data comes from.
- FIPS: The two-digit numeric state ID code.
- waterRightID: The unique identifying ID of the water right, the same identifier as its state uses.
- priorityDate: The priority date associated with the right.
- origWaterUse: The original stated water use(s) from the state.
- waterUse: The water use category under the unified use categories established here.
- source: Whether the right is for surface water or groundwater.
- basinNum: The alpha-numeric identifier of the WMA the record belongs to.
- CFS: The maximum flow of the allocation in cubic feet per second (ft3 s-1).

Arizona is unique among the states, as its surface and groundwater resources are managed with two different sets of boundaries. So, for Arizona, the basinNum column is missing and instead there are two columns:
- surBasinNum: The alpha-numeric identifier of the surface water WMA the record belongs to.
- grdBasinNum: The alpha-numeric identifier of the groundwater WMA the record belongs to.

XXStatePOD.shp: A shapefile which identifies the locations of the Points of Diversion for the state's water rights. It should be noted that not all water right records in XXFullHarmonizedRights.csv have coordinates, and therefore may be missing from this file.

XXStatePOU.shp: A shapefile which contains the area(s) in which each water right is claimed to be used. Currently, only Idaho and Washington provided valid data to include within this file.

XXGroundwaterHarmonizedRights.csv: A file which contains only the harmonized groundwater rights collected from each state. See XXFullHarmonizedRights.csv for more details on how the data is formatted.
XXSurfaceWaterHarmonizedRights.csv: A file which contains only the harmonized surface water rights collected from each state. See XXFullHarmonizedRights.csv for more details on how the data is formatted.

Additionally, one file, stateWMALabels.csv, is not stored within a sub-directory. While we have referred to the spatial boundaries that each state uses to manage its water resources as WMAs, this term is not shared across all states. This file lists the proper name for each boundary set, by state.

For those who may be interested in exploring our code in more depth, we are also making available an internal data file for convenience. The file is in .RData format and contains everything described above as well as some minor additional objects used within the code calculating the cumulative curves. For completeness, here is a detailed description of the objects found within the .RData file:
- states: A character vector containing the state names for those states for which data was collected. More importantly, the index of the state name is also the index at which that state's data can be found in the following list objects. For example, if California is the third index in this object, the data for California will also be in the third index of each accompanying list.
- rightsByState_ground: A list of data frames with the cleaned groundwater rights collected from each state. This object holds the data that is exported to create the xxGroundwaterHarmonizedRights.csv files.
- rightsByState_surface: A list of data frames with the cleaned surface water rights collected from each state. This object holds the data that is exported to create the xxSurfaceWaterHarmonizedRights.csv files.
- fullRightsRecs: A list of the combined groundwater and surface water records for each state. This object holds the data that is exported to create the xxFullHarmonizedRights.csv files.
- projProj: The spatial projection used for map creation at the beginning of the project; specifically, the World Geodetic System (WGS84) as a coordinate reference system (CRS) string in PROJ.4 format.
- wmaStateLabel: The name and/or abbreviation for what each state legally calls their WMAs.
- h2oUseByState: A list of spatial polygon data frames which contain the area(s) in which each water right is claimed to be used. It should be noted that not all water right records have a listed area of use in this object. Currently, only Idaho and Washington provided valid data to be included in this object.
- h2oDivByState: A list of spatial points data frames which identify the location of the Point of Diversion for each state's water rights. It should be noted that not all water right records have a listed Point of Diversion in this object.
- spatialWMAByState: A list of spatial polygon data frames which contain the spatial WMA boundaries for each state. The only data contained within the table are identifiers for each polygon. It is worth reiterating that Arizona is the only state in which the surface and groundwater WMA boundaries are not the same.
- wmaIDByState: A list which contains the unique ID values of the WMAs for each state.
- plottingDim: A character vector used to inform mapping functions for internal map making. Each state is classified as either "tall" or "wide", to maximize space on a typical 8x11 page.

The code related to the creation of this dataset can be viewed within HarDWR GitHub Repository/dataHarmonization.
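As a hedged sketch of reading one state's harmonized rights file with pandas: the state abbreviation and directory layout below are placeholders, and the grouping assumes the basinNum column (for Arizona, surBasinNum or grdBasinNum would be used instead).

```python
import pandas as pd

# Placeholder state code and path; substitute the sub-directory you downloaded.
state = "CA"
rights = pd.read_csv(f"{state}/{state}FullHarmonizedRights.csv")

# Total allocated flow (CFS) per water management area and unified use category.
summary = (rights.groupby(["basinNum", "waterUse"])["CFS"]
                 .sum()
                 .reset_index())
print(summary.head())
```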
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.
The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850
As it contains long off-periods with zeros, the CSV file compresses well.
To extract it, use: xz -d DARCK.csv.xz.
The compression reduces the file size by 97% (from 4 GB to 90.9 MB).
To use the dataset in Python, you can, for example, load the CSV file into a pandas DataFrame:

```python
import pandas as pd

df = pd.read_csv("DARCK.csv", parse_dates=["time"])
```
The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.
The dataset is provided as a single comma-separated values (CSV) file, DARCK.csv.
| Column Name | Data Type | Unit | Description |
|---|---|---|---|
| time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS |
| main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel. |
| [appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list. |
| **Aggregate Columns** | | | |
| aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger. |
| aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2. |
| aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap. |
| **Analysis Columns** | | | |
| inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30 W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for. |
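The sketch below recomputes the inaccuracy column as described above; it assumes that every column other than time, main, inaccuracy, and the aggr_* columns is an individual appliance channel, which may not hold exactly for the released file.

```python
import pandas as pd

df = pd.read_csv("DARCK.csv", parse_dates=["time"]).set_index("time")

# Assumed: all remaining columns are individual appliance channels.
appliance_cols = [c for c in df.columns
                  if c not in ("main", "inaccuracy") and not c.startswith("aggr_")]

# 30 W offset accounts for the power drawn by the measurement devices themselves.
recomputed = (df[appliance_cols].sum(axis=1) + 30 - df["main"]).abs()
print((recomputed - df["inaccuracy"]).abs().describe())
```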
The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.
Aggregate (main) postprocessing: The aggregate power data required several cleaning steps to ensure accuracy.
Shelly (shellies) postprocessing: The Shelly devices are not prone to the same burst issue as the ESP8266. They push a new reading at every change in power drawn. If no power change is observed, or the observed change is too small (less than a few Watt), a reading is pushed once a minute together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.
The Shelly readings were aligned to a common one-second time index using .resample('1s').last().ffill(). NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption. During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.
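A minimal sketch of that resampling step, applied to one raw Shelly power series; the raw file layout and the column names "time" and "power" are assumptions, not the published schema.

```python
import pandas as pd

# Hypothetical raw export: one row per MQTT message with timestamp and Watt value.
raw = pd.read_csv("shellies.csv", parse_dates=["time"])

aligned = (raw.set_index("time")["power"]
              .resample("1s").last()   # keep the last sub-second value per second
              .ffill()                 # hold the reading until the next change
              .fillna(0.0))            # assume zero consumption before installation
print(aligned.head())
```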
The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# FiN-2 Large-Scale Real-World PLC-Dataset
## About
#### FiN-2 dataset in a nutshell:
FiN-2 is the first large-scale real-world dataset on data collected in a powerline communication infrastructure. Since the electricity grid is inherently a graph, our dataset could be interpreted as a graph dataset. Therefore, we use the word node to describe points (cable distribution cabinets) of measurement within the low-voltage electricity grid and the word edge to describe connections (cables) in between them. However, since these are PLC connections, an edge does not necessarily have to correspond to a real cable; more on this in our paper.
FiN-2 shows measurements that relate to the nodes (voltage, total harmonic distortion) as well as to the edges (signal-to-noise ratio spectrum, tonemap). In total, FiN-2 is distributed across three different sites with a total of 1,930,762,116 node measurements each for the individual features and 638,394,025 edge measurements each for all 917 PLC channels. All data was collected over a 25-month period from mid-2020 to the end of 2022.
We propose this dataset to foster research in the domain of grid automation and smart grid. Therefore, we provide different example use cases in asset management, grid state visualization, forecasting, predictive maintenance, and novelty detection. For more detailed information on this dataset, please see our [paper](https://arxiv.org/abs/2209.12693).
* * *
## Content
The FiN-2 dataset is split into two compressed CSV files: *nodes.csv* and *edges.csv*.
All files are provided as a compressed ZIP file and are divided into four parts. The first part can be found in this repo, while the remaining parts can be found in the following:
- https://zenodo.org/record/8328105
- https://zenodo.org/record/8328108
- https://zenodo.org/record/8328111
### Node data
| id | ts | v1 | v2 | v3 | thd1 | thd2 | thd3 | phase_angle1 | phase_angle2 | phase_angle3 | temp |
|----|----|----|----|----|----|----|----|----|----|----|----|
|112|1605530460|236.5|236.4|236.0|2.9|2.5|2.4|120.0|119.8|120.0|35.3|
|112|1605530520|236.9|236.6|236.6|3.1|2.7|2.5|120.1|119.8|120.0|35.3|
|112|1605530580|236.2|236.4|236.0|3.1|2.7|2.5|120.0|120.0|119.9|35.5|
- id / ts: Unique identifier of the node that is measured and timestamp of the measurement
- v1/v2/v3: Voltage measurements of all three phases
- thd1/thd2/thd3: Total harmonic distortion of all three phases
- phase_angle1/2/3: Phase angle of all three phases
- temp: Temperature in-circuit of the sensor inside a cable distribution unit (in °C)
### Edge data
| src | dst | ts | snr0 | snr1 | snr2 | ... | snr916 |
|----|----|----|----|----|----|----|----|
|62|94|1605528900|70|72|45|...|-53|
|62|32|1605529800|16|24|13|...|-51|
|17|94|1605530700|37|25|24|...|-55|
- src & dst & ts: Unique identifier of the source and target nodes where the spectrum is measured and time of measurement
- snr0/snr1/.../snr916: 917 SNR measurements in tenths of a decibel (e.g. 50 --> 5dB).
### Metadata
Metadata that is provided along with the data covers:
- Number of cable joints
- Cable properties (length, type, number of sections)
- Relative position of the nodes (location, zero-centered gps)
- Adjacent PV or wallbox installations
- Year of installation w.r.t. the nodes and cables
Since the electricity grid is part of the critical infrastructure, it is not possible to provide exact GPS locations.
* * *
## Usage
Simple data access using pandas:
```python
import pandas as pd
nodes_file = "nodes.csv.gz" # /path/to/nodes.csv.gz
edges_file = "edges.csv.gz" # /path/to/edges.csv.gz
# read the first 10 rows
data = pd.read_csv(nodes_file, nrows=10, compression='gzip')
# read data rows 6 to 15 (skip the first 5 data rows after the header)
data = pd.read_csv(nodes_file, nrows=10, skiprows=[i for i in range(1, 6)], compression='gzip')
# ... same for the edges
```
The compressed CSV format was chosen to make sharing as easy as possible; however, it comes with significant drawbacks for machine learning. Due to the inherent graph structure, a single snapshot of the whole graph consists of a set of node and edge measurements. But due to timeouts, noise and other disturbances, nodes sometimes fail to collect data, so the number of measurements available for a specific timestamp varies. This, plus the high sparsity of the graph, makes the CSV format inefficient for ML training.
To utilize the data in an ML pipeline, we recommend other data formats like [datadings](https://datadings.readthedocs.io/en/latest/) or specialized database solutions like [VictoriaMetrics](https://victoriametrics.com/).
### Example use case (voltage forecasting)
Forecasting the voltage is one potential use case. The Jupyter notebook provided in the repository gives an overview of how the dataset can be loaded, preprocessed and used for ML training. MinMax scaling is used as a simple preprocessing step and a PyTorch dataset class is created to handle the data. Furthermore, a vanilla autoencoder is used to process the voltage and forecast it into the future.
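The notebook itself is not reproduced here; the following is a hypothetical sketch of such a sliding-window PyTorch dataset class over one node's voltages, with the window and horizon lengths chosen arbitrarily.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class VoltageWindows(Dataset):
    """Sliding windows over one node's three-phase voltages (sketch only)."""

    def __init__(self, csv_path, node_id, window=60, horizon=10):
        df = pd.read_csv(csv_path, compression="gzip")
        df = df[df["id"] == node_id].sort_values("ts")
        # Min-max scale the phase voltages, in the spirit of the example notebook.
        v = df[["v1", "v2", "v3"]].to_numpy(dtype="float32")
        v = (v - v.min(axis=0)) / (v.max(axis=0) - v.min(axis=0) + 1e-9)
        self.v = torch.from_numpy(v)
        self.window, self.horizon = window, horizon

    def __len__(self):
        return max(0, len(self.v) - self.window - self.horizon + 1)

    def __getitem__(self, i):
        x = self.v[i : i + self.window]                                # past voltages
        y = self.v[i + self.window : i + self.window + self.horizon]  # future voltages
        return x, y

# ds = VoltageWindows("nodes.csv.gz", node_id=112)
```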
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.
The file dirty_cafe_sales.csv contains the following columns:

| Column Name | Description | Example Values |
|---|---|---|
| Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
| Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich |
| Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN |
| Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00 |
| Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00 |
| Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card |
| Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway |
| Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01 |
Missing Values: Columns such as Item, Payment Method, and Location may contain missing values represented as None or empty cells.
Invalid Values: Some entries contain "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency: The dataset includes the following menu items with their respective prices:
| Item | Price($) |
|---|---|
| Coffee | 2 |
| Tea | 1.5 |
| Sandwich | 4 |
| Salad | 5 |
| Cake | 3 |
| Cookie | 1 |
| Smoothie | 4 |
| Juice | 3 |
This dataset is suitable for: - Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries. - Exploring EDA techniques like visualizations and summary statistics. - Performing feature engineering for machine learning workflows.
To clean this dataset, consider the following steps:
1. Handle Missing Values: Fill missing numeric values with the median or mean; replace missing categorical values with the mode or "Unknown."
2. Handle Invalid Values: Replace "ERROR" and "UNKNOWN" with NaN or other appropriate values.
3. Date Consistency: Convert Transaction Date to a valid date format and handle missing or incorrect dates.
4. Feature Engineering: Create new features, such as Day of the Week or Transaction Month, for further analysis.

This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.
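A minimal pandas sketch of the cleaning steps suggested above; the fill strategies are examples, not the only valid choices.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("dirty_cafe_sales.csv")

# Treat the documented placeholder strings as missing values.
df = df.replace(["ERROR", "UNKNOWN"], np.nan)

# Numeric columns: coerce and fill with the median.
for col in ["Quantity", "Price Per Unit", "Total Spent"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: fill with "Unknown".
for col in ["Item", "Payment Method", "Location"]:
    df[col] = df[col].fillna("Unknown")

# Dates: coerce invalid entries to NaT, then derive simple features.
df["Transaction Date"] = pd.to_datetime(df["Transaction Date"], errors="coerce")
df["Day of the Week"] = df["Transaction Date"].dt.day_name()
df["Transaction Month"] = df["Transaction Date"].dt.month
```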
If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, together with tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
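A hedged sketch of that workflow follows: it loads the three CSV files named above and fits a simple scikit-learn classifier. The label column name ("target") is a placeholder, since the actual feature names are not listed in this description.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# "target" is a placeholder for the actual label column in these files.
train = pd.read_csv("train_data.csv")
valid = pd.read_csv("validation_data.csv")
test = pd.read_csv("test_data.csv")

X_train, y_train = train.drop(columns=["target"]), train["target"]
X_valid, y_valid = valid.drop(columns=["target"]), valid["target"]
X_test, y_test = test.drop(columns=["target"]), test["target"]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_valid, model.predict(X_valid)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```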
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
https://digital.nhs.uk/about-nhs-digital/terms-and-conditions
Warning: Large file size (over 1 GB). Each monthly data set is large (over 4 million rows), but can be viewed in standard software such as Microsoft WordPad (save by right-clicking on the file name and selecting 'Save Target As', or equivalent on Mac OS X). It is then possible to select the required rows of data and copy and paste the information into another software application, such as a spreadsheet. Alternatively, add-ons to existing software that handle larger data sets, such as the Microsoft PowerPivot add-on for Excel, can be used. The Microsoft PowerPivot add-on for Excel is available from Microsoft: http://office.microsoft.com/en-gb/excel/download-power-pivot-HA101959985.aspx

Once PowerPivot has been installed, follow the instructions below to load the large files. Note that it may take at least 20 to 30 minutes to load one monthly file.

1. Start Excel as normal.
2. Click on the PowerPivot tab.
3. Click on the PowerPivot Window icon (top left).
4. In the PowerPivot Window, click on the "From Other Sources" icon.
5. In the Table Import Wizard, scroll to the bottom and select Text File.
6. Browse to the file you want to open and choose the file extension you require, e.g. CSV.

Once the data has been imported you can view it in a spreadsheet.

What does the data cover? General practice prescribing data is a list of all medicines, dressings and appliances that are prescribed and dispensed each month. A record will only be produced when this has occurred; there is no record for a zero total. For each practice in England, the following information is presented at presentation level for each medicine, dressing and appliance (by presentation name):

- the total number of items prescribed and dispensed
- the total net ingredient cost
- the total actual cost
- the total quantity

The data covers NHS prescriptions written in England and dispensed in the community in the UK. Prescriptions written in England but dispensed outside England are included. The data includes prescriptions written by GPs and other non-medical prescribers (such as nurses and pharmacists) who are attached to GP practices. GP practices are identified only by their national code, so an additional data file, linked to the first by the practice code, provides further detail in relation to the practice. Presentations are identified only by their BNF code, so an additional data file, linked to the first by the BNF code, provides the chemical name for that presentation.
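For those working outside Excel, the large monthly file can also be processed in chunks. The sketch below assumes a local file name and column names ("BNF CODE", "ITEMS") that should be checked against the header of the release actually downloaded.

```python
import pandas as pd

# Stream the large monthly CSV in chunks instead of loading it all at once.
# File name and column names are placeholders for the downloaded release.
totals = {}
for chunk in pd.read_csv("prescribing_monthly.csv", chunksize=500_000):
    grouped = chunk.groupby("BNF CODE")["ITEMS"].sum()
    for code, items in grouped.items():
        totals[code] = totals.get(code, 0) + items

print(f"{len(totals)} distinct presentations")
```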
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Title: Group Health (Sleep and Screen Time) Dataset
🎯 Assignment #1: Career Change Prediction Analysis
1. Dataset Overview and Project Goal
Dataset: career_change_prediction_dataset.csv (38,444 rows, 22 features)
Source: Kaggle
Research Question: What are the primary factors that predict an individual's likelihood of changing careers?
Target Variable: Likely to Change Occupation (Binary Classification: 0/1)
2. Data Handling and Integrity (The Logical Process)
Before any analysis could begin, the first… See the full description on the dataset page: https://huggingface.co/datasets/harry120/career_change_prediction_analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of the following files:

- normal_SERS.zip: Contains 500 CSV files under the folder normal_SERS/. File naming pattern: NOR _.CSV
- HTN_SERS.zip: Contains 500 CSV files under the folder HTN_SERS/. File naming pattern: HBP _.CSV
- DM_SERS.zip: Contains 500 CSV files under the folder DM_SERS/. File naming pattern: DIA _.CSV
- HTN+DM_SERS.zip: Contains 500 CSV files under the folder HTN+DM_SERS/. File naming pattern: H.D. _.CSV
- colorectal+cancer_SERS.zip: Contains 1,500 CSV files under the folder colorectal+cancer_SERS/. File naming pattern: CRC _.CSV
- lung+cancer_SERS.zip: Contains 1,000 CSV files under the folder lung+cancer_SERS/. File naming pattern: LUN _.CSV
- pancreatic+cancer_SERS.zip: Contains 265 CSV files under the folder pancreatic+cancer_SERS/. File naming pattern: SPAN _.CSV
- sample_metadata.csv: Sample-level metadata linking each spectrum file to its clinical group, subject, and replicate index.

## sample_metadata.csv columns

The sample_metadata.csv file has one row per SERS spectrum (4,765 rows in total) and the following columns:
- group: descriptive group label, e.g., Normal control, Hypertension, Diabetes mellitus, Hypertension + Diabetes, Colorectal cancer, Lung cancer, Pancreatic cancer.
- group_code: short group code, e.g., Normal, HTN, DM, HTN+DM, CRC, LungCA, PancreasCA.
- original_prefix: prefix as it appears in the original file names: NOR, HBP, DIA, H.D., CRC, LUN, SPAN.
- canonical_prefix: cleaned/standardized prefix used for constructing sample_id: NOR, HBP, DIA, HD, CRC, LUN, SPAN. For example, H.D. → HD.
- subject_id: integer subject identifier within each prefix (1–100, 1–300, 1–200, or 1–53 depending on group).
- sample_id: standardized subject identifier combining canonical_prefix and zero-padded subject_id, e.g., NOR_001, HBP_093, DIA_048, HD_027, CRC_077, LUN_151, SPAN_022.
- replicate_index: technical replicate index (1–5).
- filename: original CSV file name (e.g., HBP 93_5.CSV).
- filepath_in_zip: relative path to the CSV file inside the corresponding zip archive (e.g., HTN_SERS/HBP 93_5.CSV).
- zip_file: name of the zip archive that contains this file (e.g., HTN_SERS.zip).

## Data format

- Each CSV file contains two columns without a header:
  1. Raman shift (cm⁻¹), typically spanning ~50–3300 cm⁻¹
  2. SERS intensity (arbitrary units)
- All spectra have a uniform number of data points (rows) per file.
- No baseline correction, smoothing, normalization, or other signal processing has been applied. These spectra should be considered raw measurements.

## Recommended usage

This dataset is suitable for:
- Development and benchmarking of:
  - Preprocessing algorithms (baseline correction, denoising, normalization).
  - Feature extraction and dimensionality reduction methods for SERS.
  - Diagnostic and multi-disease classification models based on SERS spectra.
- Methodological studies on:
  - Handling of technical replicates.
  - Cross-disease model generalization and domain adaptation.

Users are encouraged to:
- Implement and clearly describe their own preprocessing and validation strategies.
- Report details such as train/validation splits, cross-validation schemes, and performance metrics when publishing work based on this dataset.
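A minimal sketch of reading one spectrum together with its metadata row; it assumes the zip archives have been extracted next to sample_metadata.csv, so that filepath_in_zip resolves as a local path.

```python
import pandas as pd

meta = pd.read_csv("sample_metadata.csv")
row = meta.iloc[0]  # pick one spectrum to inspect

# Each spectrum CSV has two header-less columns: Raman shift and intensity.
spectrum = pd.read_csv(row["filepath_in_zip"], header=None,
                       names=["raman_shift_cm-1", "intensity"])

print(row["group"], row["sample_id"], row["replicate_index"])
print(spectrum.head())
```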
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset overview: This repository contains the AQUAIR Dataset, a high-resolution log of indoor-environment quality (IEQ) gathered in a trout (Oncorhynchus mykiss) hatchery room at Amghass, Azrou, Morocco. Six airborne variables (air temperature, relative humidity, carbon dioxide (CO₂), total volatile organic compounds (TVOC), fine particulate matter (PM₂.₅) and inhalable particulate matter (PM₁₀)) were sampled every 5 minutes between 14 October 2024 and 09 January 2025. The data are provided as two comma-separated files:

- AQUAIR_1.csv: Contains data recorded from 14 October 2024 to 10 December 2024, with a total of 16,533 rows.
- AQUAIR_2.csv: Contains data recorded from 15 December 2024 to 9 January 2025, with a total of 7,323 rows.

Combined, the set delivers 23,856 time-stamped observations suitable for time-series modelling, forecasting, anomaly detection and studies of airborne stressors in aquaculture facilities.

Parameters and units:

| Parameter | Unit | Relevance in trout culture |
|---|---|---|
| Temperature | °C | Influences metabolic rate, feed conversion and dissolved-oxygen levels. |
| Relative humidity | % RH | High RH accelerates mould growth; low RH increases evaporation. |
| CO₂ | ppm | Head-space CO₂ equilibrates with water; sustained excess slows growth. |
| VOC | ppb | Proxy for disinfectant off-gassing and human activity; ventilation indicator. |
| PM₂.₅ | µg m⁻³ | Fine particles can load bio-filters and irritate gill tissue. |
| PM₁₀ | µg m⁻³ | Coarser dust from feed handling and maintenance. |

All values are recorded in SI units; timestamps use ISO-8601 in Coordinated Universal Time (UTC).

Reuse potential:
- Benchmark short-horizon IEQ forecasting (ARIMA, LSTM, transformer models).
- Develop anomaly detectors for hatchery monitoring dashboards.
- Correlate airborne conditions with fish-health metrics in future multi-modal studies.
- Validate low-cost sensor stability in high-humidity aquaculture environments.

How to cite: If you use the AQUAIR dataset, please also cite our paper: Sabiri, Y., Houmaidi, W., El Maadi, O., & Chtouki, Y. (2025). AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring. arXiv:2509.24069. https://arxiv.org/abs/2509.24069

BibTeX:
@misc{sabiri2025aquairhighresolutionindoorenvironmental,
  title={AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring},
  author={Youssef Sabiri and Walid Houmaidi and Ouail El Maadi and Yousra Chtouki},
  year={2025},
  eprint={2509.24069},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.24069},
}

Licence: Creative Commons Attribution 4.0 International (CC-BY-4.0).
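As a starting point for the forecasting use cases listed above, the sketch below concatenates both files into one 5-minute series and resamples to hourly means. The time column name ("timestamp") and the variable column names are assumptions and should be checked against the file headers.

```python
import pandas as pd

# "timestamp" is a placeholder for the actual time column name.
parts = [pd.read_csv(f, parse_dates=["timestamp"])
         for f in ("AQUAIR_1.csv", "AQUAIR_2.csv")]
aquair = pd.concat(parts).set_index("timestamp").sort_index()

# Hourly means are a convenient baseline aggregation for forecasting experiments.
hourly = aquair.resample("1h").mean()
print(hourly.describe())
```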
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data sets were collected during motorcycle trips near Vienna in 2021 and 2022. The behavior was split into different classes using videos (not part of the published data due to privacy concerns) and then cut into segments of 10 seconds.
https://crawlfeeds.com/privacy_policy
Explore our comprehensive Target store furniture datasets, designed to provide extensive product details for businesses and researchers. Our datasets include a wide range of information that can be used for market analysis, product development, and competitive strategy.
What’s Included in the Target Store Furniture Datasets:
Our Target store furniture datasets are ideal for businesses looking to enhance their product offerings, optimize pricing strategies, and understand market dynamics within the furniture industry.
Whether you're a retailer, market analyst, or business strategist, our datasets provide the comprehensive information you need to stay ahead in the competitive furniture market.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed for users aiming to train models for text summarization. It contains 2,225 rows of data with two columns: "Text" and "Summary". Each row features a detailed news article or piece of text paired with its corresponding summary, providing a rich resource for developing and fine-tuning summarization algorithms.
This evolving dataset is planned to include additional features, such as text class labels, in future updates. These enhancements will provide more context and facilitate the development of models that can perform summarization across different categories of news content.
Ideal for researchers and developers focused on text summarization tasks, this dataset enables the training of models to effectively compress information while retaining the essence of the original content.
We would like to extend our sincere gratitude to the dataset creator for their contribution to this valuable resource. This dataset, sourced from the BBC News Summary dataset on Kaggle, was created by Pariza. Their work has provided an invaluable asset for those working on text summarization tasks, and we appreciate their efforts in curating and sharing this data with the community.
Thank you for supporting research and development in the field of natural language processing!
This script processes and consolidates text data from various directories containing news articles and their corresponding summaries. It reads the files from specified folders, handles encoding issues, and then creates a DataFrame that is saved as a CSV file for further analysis.
Imports:
- numpy (np): Numerical operations library, though it is not used in this script.
- pandas (pd): Data manipulation and analysis library.
- os: For interacting with the operating system, e.g., building file paths.
- glob: For file pattern matching and retrieving file paths.

Function: get_texts
- text_folders: List of folders containing news article text files.
- text_list: List to store the content of text files.
- summ_folder: List of folders containing summary text files.
- sum_list: List to store the content of summary files.
- encodings: List of encodings to try for reading files.
- The function reads each file, trying the listed encodings, and appends the contents to text_list and sum_list.

Data Preparation:
- text_folder: List of directories for news articles.
- summ_folder: List of directories for summaries.
- text_list and summ_list: Empty lists initialized to store the contents.
- data_df: Empty DataFrame to store the final data.

Execution:
- Calls the get_texts function to populate text_list and summ_list.
- Builds data_df with columns 'Text' and 'Summary'.
- Saves data_df to a CSV file at /kaggle/working/bbc_news_data.csv.

Output:
- A consolidated CSV file (bbc_news_data.csv) pairing each article with its summary, ready for further analysis.
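The original script is not reproduced here; the following is a minimal sketch of the workflow described above, with the folder paths and the encoding list as assumptions.

```python
import glob
import os
import pandas as pd

def get_texts(folders, out_list, encodings=("utf-8", "latin-1")):
    """Read every .txt file in the given folders, trying several encodings."""
    for folder in folders:
        for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
            for enc in encodings:
                try:
                    with open(path, encoding=enc) as f:
                        out_list.append(f.read())
                    break
                except UnicodeDecodeError:
                    continue

# Placeholder paths: point these at the article and summary directories.
text_folders = ["News Articles/business", "News Articles/tech"]
summ_folders = ["Summaries/business", "Summaries/tech"]

text_list, summ_list = [], []
get_texts(text_folders, text_list)
get_texts(summ_folders, summ_list)

data_df = pd.DataFrame({"Text": text_list, "Summary": summ_list})
data_df.to_csv("/kaggle/working/bbc_news_data.csv", index=False)
```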
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.
For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, ex: 2024-08-23-data-citation-corpus-01-v2.0.json.
The data citations in the file originate from DataCite Event Data and a project by Chan Zuckerberg Initiative (CZI) to identify mentions to datasets in the full text of articles.
Each data citation record comprises:
A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited
Metadata for the cited dataset and for the citing publication
The data file includes the following fields:
| Field | Description | Required? |
|---|---|---|
| id | Internal identifier for the citation | Yes |
| created | Date of item's incorporation into the corpus | Yes |
| updated | Date of item's most recent update in corpus | Yes |
| repository | Repository where cited data is stored | No |
| publisher | Publisher for the article citing the data | No |
| journal | Journal for the article citing the data | No |
| title | Title of cited data | No |
| publication | DOI of article where data is cited | Yes |
| dataset | DOI or accession number of cited data | Yes |
| publishedDate | Date when citing article was published | No |
| source | Source where citation was harvested | Yes |
| subjects | Subject information for cited data | No |
| affiliations | Affiliation information for creator of cited data | No |
| funders | Funding information for cited data | No |
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
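As a hedged sketch of working with one batch: the snippet below assumes a CSV counterpart to the JSON batch file name quoted above exists locally with the fields listed in the table.

```python
import pandas as pd

# Assumed local CSV batch following the "<date>-data-citation-corpus-<nn>-v2.0" pattern.
batch = pd.read_csv("2024-08-23-data-citation-corpus-01-v2.0.csv")

# Required fields per the table: id, created, updated, publication, dataset, source.
print(batch[["dataset", "publication", "source"]].head())

# Optional fields may be empty; count citations per repository where present.
print(batch["repository"].value_counts().head(10))
```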
The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:
Add and update Event Data citations:
Add 179,885 new data citations created in DataCite Event Data between 01 June 2023 through 30 June 2024
Remove citation records deemed out of scope for the corpus:
273,567 records from DataCite Event Data with non-citation relationship types
28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)
44,117 invalid citations where subj_id value was the same as the obj_id value or subj_id and obj_id are inverted, indicating a citation from a dataset to a publication
473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions
4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)
Metadata enhancements:
Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository
Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)
Data structure updates to improve usability and eliminate redundancies:
Rename subj_id and obj_id fields to “dataset” and “publication” for clarity
Remove accessionNumber and doi elements to eliminate redundancy with subj_id
Remove relationTypeId fields as these are specific to Event Data only
Full details of the above changes, including scripts used to perform the above tasks, are available in GitHub.
While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.
Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.
Energy consumption readings for a sample of 5,567 London Households that took part in the UK Power Networks led Low Carbon London project between November 2011 and February 2014.
Readings were taken at half hourly intervals. Households have been allocated to a CACI Acorn group (2010). The customers in the trial were recruited as a balanced sample representative of the Greater London population.
The dataset contains energy consumption in kWh (per half hour), a unique household identifier, date and time, and the CACI Acorn group. The CSV file is around 10GB when unzipped and contains around 167 million rows.

Within the data set are two groups of customers. The first is a sub-group, of approximately 1,100 customers, who were subjected to Dynamic Time of Use (dToU) energy prices throughout the 2013 calendar year. The tariff prices were given a day ahead via the Smart Meter IHD (In Home Display) or by text message to mobile phone. Customers were issued High (67.20p/kWh), Low (3.99p/kWh) or normal (11.76p/kWh) price signals and the times of day these applied. The dates/times and the price signal schedule are available as part of this dataset. All non-Time of Use customers were on a flat rate tariff of 14.228 pence/kWh.
The signals given were designed to be representative of the types of signal that may be used in the future, both to manage operation under high renewable generation (supply following) and to test the potential of high price signals to relieve local distribution grids during periods of stress.
The energy consumption readings of the remaining sample of approximately 4,500 customers were not subject to the dToU tariff.
More information can be found on the Low Carbon London webpage
Some analysis of this data can be seen here.
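Because the unzipped file is around 10 GB, a chunked aggregation is a practical starting point. The column names in the sketch below ("LCLid", "DateTime", "KWH/hh") are placeholders and should be checked against the header of the file actually downloaded.

```python
import pandas as pd

# Aggregate total consumption per household without loading the full 10 GB file.
per_household = None
for chunk in pd.read_csv("low_carbon_london.csv", chunksize=1_000_000,
                         parse_dates=["DateTime"]):
    chunk["KWH/hh"] = pd.to_numeric(chunk["KWH/hh"], errors="coerce")
    part = chunk.groupby("LCLid")["KWH/hh"].sum()
    per_household = part if per_household is None else per_household.add(part, fill_value=0)

print(per_household.sort_values(ascending=False).head())
```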
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contents: database of oyster growth (i.e., the changes in mass over time) and mortality along French coasts since 1993. To build this database, we took advantage of the Pacific oyster production monitoring network coordinated by IFREMER (the French Research Institute for the Exploitation of the Sea). This network monitors the growth and mortality of spat (less than one-year-old individuals) and half-grown (between one and two-year-old individuals) Crassostrea gigas oysters since 1993. As the number of sites monitored over the years varied, we focused on 13 sites that were almost continuously monitored during this period. For these locations, we modeled growth and cumulative mortality for spat and half-grown oysters as a function of time, to cope with changes in data acquisition frequency, and produced standardized growth and cumulative mortality indicators to improve data usability. Code to reproduce these analyses are archived here, as well as figures included in the companion data paper: "A 26-year time series of mortality and growth of the Pacific oyster C. gigas recorded along French coasts".
Sampling protocol: in the oyster production monitoring network, oysters were mainly reared in plastic meshed bags fixed on iron tables, mimicking the oyster farmers practices. After their deployment at the beginning of the campaign (seeding dates from February to April depending on the year), growth and mortality were longitudinally monitored yearly. At each sampling date, local operators carefully emptied each bag in separate baskets, counted the dead individuals and alive ones, and removed the dead individuals. Then local operators weighed all alive individuals in each basket (mass taken at the bag level, protocol mainly used between 1993 and 1998 and since 2004) and/or collected 30 individuals to individually weigh them in the laboratory (mass taken at the individual level, protocol used between 1995 and 2010 for spat and since 1996 for half-grown oysters).
Data:
- num, site, name, zone_en, lat, long: see the description above for the data set sites.csv.
- campaign: the year of data collection. Ranges between 1993 and 2018.
- class_age: the age class of the oyster (i.e. spat: N0 or half-grown: J1).
- batch: the identifier of the batch (group of oysters born from the same reproductive event, having experienced strictly the same zootechnical route). It is a field that concatenates the campaign, the age class of oysters (spat: N0 or half-grown: J1), the origin of the initial spatgroup (wild-caught: CAPT or Ifremer hatchery: ECLO), ploidy (diploid: 2n) and birthplace of the original spatgroup (AR: Bay of Arcachon or E4: