28 datasets found

Dairy Supply Chain Sales Dataset

zenodo.org
data.niaid.nih.gov

pdf, zip

Updated Jul 12, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Dimitris Iatropoulos; Konstantinos Georgakidis; Konstantinos Georgakidis; Ilias Siniosoglou; Ilias Siniosoglou; Christos Chaschatzis; Christos Chaschatzis; Anna Triantafyllou; Anna Triantafyllou; Athanasios Liatifis; Athanasios Liatifis; Dimitrios Pliatsios; Dimitrios Pliatsios; Thomas Lagkas; Thomas Lagkas; Vasileios Argyriou; Vasileios Argyriou; Panagiotis Sarigiannidis; Panagiotis Sarigiannidis; Dimitris Iatropoulos (2024). Dairy Supply Chain Sales Dataset [Dataset]. http://doi.org/10.21227/smv6-z405

Explore at:

zip, pdfAvailable download formats

Unique identifier

https://doi.org/10.21227/smv6-z405

Dataset updated

Jul 12, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

1.Introduction

Sales data collection is a crucial aspect of any manufacturing industry as it provides valuable insights about the performance of products, customer behaviour, and market trends. By gathering and analysing this data, manufacturers can make informed decisions about product development, pricing, and marketing strategies in Internet of Things (IoT) business environments like the dairy supply chain.

One of the most important benefits of the sales data collection process is that it allows manufacturers to identify their most successful products and target their efforts towards those areas. For example, if a manufacturer could notice that a particular product is selling well in a certain region, this information could be utilised to develop new products, optimise the supply chain or improve existing ones to meet the changing needs of customers.

This dataset includes information about 7 of MEVGAL’s products [1]. According to the above information the data published will help researchers to understand the dynamics of the dairy market and its consumption patterns, which is creating the fertile ground for synergies between academia and industry and eventually help the industry in making informed decisions regarding product development, pricing and market strategies in the IoT playground. The use of this dataset could also aim to understand the impact of various external factors on the dairy market such as the economic, environmental, and technological factors. It could help in understanding the current state of the dairy industry and identifying potential opportunities for growth and development.

2. Citation

Please cite the following papers when using this dataset:

I. Siniosoglou, K. Xouveroudis, V. Argyriou, T. Lagkas, S. K. Goudos, K. E. Psannis and P. Sarigiannidis, "Evaluating the Effect of Volatile Federated Timeseries on Modern DNNs: Attention over Long/Short Memory," in the 12th International Conference on Circuits and Systems Technologies (MOCAST 2023), April 2023, Accepted

3. Dataset Modalities

The dataset includes data regarding the daily sales of a series of dairy product codes offered by MEVGAL. In particular, the dataset includes information gathered by the logistics division and agencies within the industrial infrastructures overseeing the production of each product code. The products included in this dataset represent the daily sales and logistics of a variety of yogurt-based stock. Each of the different files include the logistics for that product on a daily basis for three years, from 2020 to 2022.

3.1 Data Collection

The process of building this dataset involves several steps to ensure that the data is accurate, comprehensive and relevant.

The first step is to determine the specific data that is needed to support the business objectives of the industry, i.e., in this publication’s case the daily sales data.

Once the data requirements have been identified, the next step is to implement an effective sales data collection method. In MEVGAL’s case this is conducted through direct communication and reports generated each day by representatives & selling points.

It is also important for MEVGAL to ensure that the data collection process conducted is in an ethical and compliant manner, adhering to data privacy laws and regulation. The industry also has a data management plan in place to ensure that the data is securely stored and protected from unauthorised access.

The published dataset is consisted of 13 features providing information about the date and the number of products that have been sold. Finally, the dataset was anonymised in consideration to the privacy requirement of the data owner (MEVGAL).

File	Period	Number of Samples (days)
product 1 2020.xlsx	01/01/2020–31/12/2020	363
product 1 2021.xlsx	01/01/2021–31/12/2021	364
product 1 2022.xlsx	01/01/2022–31/12/2022	365
product 2 2020.xlsx	01/01/2020–31/12/2020	363
product 2 2021.xlsx	01/01/2021–31/12/2021	364
product 2 2022.xlsx	01/01/2022–31/12/2022	365
product 3 2020.xlsx	01/01/2020–31/12/2020	363
product 3 2021.xlsx	01/01/2021–31/12/2021	364
product 3 2022.xlsx	01/01/2022–31/12/2022	365
product 4 2020.xlsx	01/01/2020–31/12/2020	363
product 4 2021.xlsx	01/01/2021–31/12/2021	364
product 4 2022.xlsx	01/01/2022–31/12/2022	364
product 5 2020.xlsx	01/01/2020–31/12/2020	363
product 5 2021.xlsx	01/01/2021–31/12/2021	364
product 5 2022.xlsx	01/01/2022–31/12/2022	365
product 6 2020.xlsx	01/01/2020–31/12/2020	362
product 6 2021.xlsx	01/01/2021–31/12/2021	364
product 6 2022.xlsx	01/01/2022–31/12/2022	365
product 7 2020.xlsx	01/01/2020–31/12/2020	362
product 7 2021.xlsx	01/01/2021–31/12/2021	364
product 7 2022.xlsx	01/01/2022–31/12/2022	365

3.2 Dataset Overview

The following table enumerates and explains the features included across all of the included files.

Feature	Description	Unit
Day	day of the month	-
Month	Month	-
Year	Year	-
daily_unit_sales	Daily sales - the amount of products, measured in units, that during that specific day were sold	units
previous_year_daily_unit_sales	Previous Year’s sales - the amount of products, measured in units, that during that specific day were sold the previous year	units
percentage_difference_daily_unit_sales	The percentage difference between the two above values	%
daily_unit_sales_kg	The amount of products, measured in kilograms, that during that specific day were sold	kg
previous_year_daily_unit_sales_kg	Previous Year’s sales - the amount of products, measured in kilograms, that during that specific day were sold, the previous year	kg
percentage_difference_daily_unit_sales_kg	The percentage difference between the two above values	kg
daily_unit_returns_kg	The percentage of the products that were shipped to selling points and were returned	%
previous_year_daily_unit_returns_kg	The percentage of the products that were shipped to

t
Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...
test.researchdata.tuwien.ac.at
bin, csv, json +1
Updated Apr 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dilara Çakmak; Dilara Çakmak; Dilara Çakmak; Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
Explore at:
csv, text/markdown, json, binAvailable download formats
Unique identifier
https://doi.org/10.70124/f5t2d-xt904
Dataset updated
Apr 28, 2025
Dataset provided by
TU Wien
Authors
Dilara Çakmak; Dilara Çakmak; Dilara Çakmak; Dilara Çakmak
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 2025
Description
Context and Methodology

Research Domain:
The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

Purpose:
The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

How the Dataset Was Created:
The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

Technical Details

Dataset Structure:

The dataset consists of three main files, each with its specific role:

Train:
This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

https://handle.test.datacite.org/10.82556/yb6j-jw41
PID: b1c59499-9c6e-42c2-af8f-840181e809db

Test2:
The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

https://handle.test.datacite.org/10.82556/jerg-4b84
PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9

Store:
This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

https://handle.test.datacite.org/10.82556/nqeg-gy34
PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

Data Fields Description:

Id: A unique identifier for each (Store, Date) combination within the test set.

Store: A unique identifier for each store.

Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

Customers: The number of customers visiting the store on a given day.

Open: An indicator of whether the store was open (1 = open, 0 = closed).

StateHoliday: Indicates if the day is a state holiday, with values like:

'a' = public holiday,

'b' = Easter holiday,

'c' = Christmas,

'0' = no holiday.

SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

Assortment: Describes the level of product assortment in the store:

'a' = basic,

'b' = extra,

'c' = extended.

CompetitionDistance: Distance (in meters) to the nearest competitor store.

CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.

Software Requirements

To work with this dataset, you will need to have specific software installed, including:

DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

Python Libraries: Key libraries for working with the dataset include:

pandas for data manipulation,

numpy for numerical operations,

matplotlib and seaborn for data visualization,

scikit-learn for machine learning algorithms.

Additional Resources

Several additional resources are available for working with the dataset:

Presentation:
A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

Jupyter Notebook:
A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

Model Evaluation Results:
The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

Trained Models (.pkl files):
The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

sample_submission.csv:
This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
Market Basket Analysis
kaggle.com
Updated Dec 9, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 9, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions on itemset that a customer is most likely to purchase .I was given dataset contains data of a retailer; the transaction data provides data around all the transactions that have happened over a period of time. Retailer will use result to grove in his industry and provide for customer suggestions on itemset, we be able increase customer engagement and improve customer experience and identify customer behavior. I will solve this problem with use Association Rules type of unsupervised learning technique that checks for the dependency of one data item on another data item.

Introduction

Association Rule is most used when you are planning to build association in different objects in a set. It works when you are planning to find frequent patterns in a transaction database. It can tell you what items do customers frequently buy together and it allows retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers, 10 of them bought Computer Mouth, 9 bought Mat for Mouse and 8 bought both of them. - bought Computer Mouth => bought Mat for Mouse - support = P(Mouth & Mat) = 8/100 = 0.08 - confidence = support/P(Mat for Mouse) = 0.08/0.09 = 0.89 - lift = confidence/P(Computer Mouth) = 0.89/0.10 = 8.9 This just simple example. In practice, a rule needs the support of several hundred transactions, before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data – so that is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of Rule

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: . xlsx

Number of Row: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.

https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png">

Libraries in R

First, we need to load required libraries. Shortly I describe all libraries.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization. techniques for association rules and item-sets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png">

Data Pre-processing

Next, we need to upload Assignment-1_Data. xlsx to R to read the dataset.Now we can see our data in R.

https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png"> https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png">

After we will clear our data frame, will remove missing values.

https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png">

To apply Association Rule mining, we need to convert dataframe into transaction data to make all items that are bought together in one invoice will be in ...
Fundamental Data Record for Atmospheric Composition [ATMOS_L1B]
earth.esa.int
Updated Jul 1, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
European Space Agency (2024). Fundamental Data Record for Atmospheric Composition [ATMOS_L1B] [Dataset]. https://earth.esa.int/eogateway/catalog/fdr-for-atmospheric-composition
Explore at:
Dataset updated
Jul 1, 2024
Dataset authored and provided by
European Space Agencyhttp://www.esa.int/
License
https://earth.esa.int/eogateway/documents/20142/1564626/Terms-and-Conditions-for-the-use-of-ESA-Data.pdfhttps://earth.esa.int/eogateway/documents/20142/1564626/Terms-and-Conditions-for-the-use-of-ESA-Data.pdf
Time period covered
Jun 28, 1995 - Apr 7, 2012
Description
The Fundamental Data Record (FDR) for Atmospheric Composition UVN v.1.0 dataset is a cross-instrument Level-1 product [ATMOS_L1B] generated in 2023 and resulting from the ESA FDR4ATMOS project. The FDR contains selected Earth Observation Level 1b parameters (irradiance/reflectance) from the nadir-looking measurements of the ERS-2 GOME and Envisat SCIAMACHY missions for the period ranging from 1995 to 2012. The data record offers harmonised cross-calibrated spectra with focus on spectral windows in the Ultraviolet-Visible-Near Infrared regions for the retrieval of critical atmospheric constituents like ozone (O3), sulphur dioxide (SO2), nitrogen dioxide (NO2) column densities, alongside cloud parameters. The FDR4ATMOS products should be regarded as experimental due to the innovative approach and the current use of a limited-sized test dataset to investigate the impact of harmonization on the Level 2 target species, specifically SO2, O3 and NO2. Presently, this analysis is being carried out within follow-on activities. The FDR4ATMOS V1 is currently being extended to include the MetOp GOME-2 series. Product format For many aspects, the FDR product has improved compared to the existing individual mission datasets: GOME solar irradiances are harmonised using a validated SCIAMACHY solar reference spectrum, solving the problem of the fast-changing etalon present in the original GOME Level 1b data; Reflectances for both GOME and SCIAMACHY are provided in the FDR product. GOME reflectances are harmonised to degradation-corrected SCIAMACHY values, using collocated data from the CEOS PIC sites; SCIAMACHY data are scaled to the lowest integration time within the spectral band using high-frequency PMD measurements from the same wavelength range. This simplifies the use of the SCIAMACHY spectra which were split in a complex cluster structure (with own integration time) in the original Level 1b data; The harmonization process applied mitigates the viewing angle dependency observed in the UV spectral region for GOME data; Uncertainties are provided. Each FDR product provides, within the same file, irradiance/reflectance data for UV-VIS-NIR special regions across all orbits on a single day, including therein information from the individual ERS-2 GOME and Envisat SCIAMACHY measurements. FDR has been generated in two formats: Level 1A and Level 1B targeting expert users and nominal applications respectively. The Level 1A [ATMOS_L1A] data include additional parameters such as harmonisation factors, PMD, and polarisation data extracted from the original mission Level 1 products. The ATMOS_L1A dataset is not part of the nominal dissemination to users. In case of specific requirements, please contact EOHelp. Please refer to the README file for essential guidance before using the data. All the new products are conveniently formatted in NetCDF. Free standard tools, such as Panoply, can be used to read NetCDF data. Panoply is sourced and updated by external entities. For further details, please consult our Terms and Conditions page. Uncertainty characterisation One of the main aspects of the project was the characterization of Level 1 uncertainties for both instruments, based on metrological best practices. The following documents are provided: General guidance on a metrological approach to Fundamental Data Records (FDR) Uncertainty Characterisation document Effect tables NetCDF files containing example uncertainty propagation analysis and spectral error correlation matrices for SCIAMACHY (Atlantic and Mauretania scene for 2003 and 2010) and GOME (Atlantic scene for 2003) reflectance_uncertainty_example_FDR4ATMOS_GOME.nc reflectance_uncertainty_example_FDR4ATMOS_SCIA.nc Known Issues Non-monotonous wavelength axis for SCIAMACHY in FDR data version 1.0 In the SCIAMACHY OBSERVATION group of the atmospheric FDR v1.0 dataset (DOI: 10.5270/ESA-852456e), the wavelength axis (lambda variable) is not monotonically increasing. This issue affects all spectral channels (UV, VIS, NIR) in the SCIAMACHY group, while GOME OBSERVATION data remain unaffected. The root cause of the issue lies in the incorrect indexing of the lambda variable during the NetCDF writing process. Notably, the wavelength values themselves are calculated correctly within the processing chain. Temporary Workaround The wavelength axis is correct in the first record of each product. As a workaround, users can extract the wavelength axis from the first record and apply it to all subsequent measurements within the same product. The first record can be retrieved by setting the first two indices (time and scanline) to 0 (assuming counting of array indices starts at 0). Note that this process must be repeated separately for each spectral range (UV, VIS, NIR) and every daily product. Since the wavelength axis of SCIAMACHY is highly stable over time, using the first record introduces no expected impact on retrieval results. Python pseudo-code example: lambda_...
Data from: A large synthetic dataset for machine learning applications in...
zenodo.org
csv, json, png, zip
Updated Mar 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marc Gillioz; Marc Gillioz; Guillaume Dubuis; Philippe Jacquod; Philippe Jacquod; Guillaume Dubuis (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
Explore at:
zip, png, csv, jsonAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.13378476
Dataset updated
Mar 25, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Marc Gillioz; Marc Gillioz; Guillaume Dubuis; Philippe Jacquod; Philippe Jacquod; Guillaume Dubuis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.

This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated agains real-world data.

Data generation algorithm

The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

Network

The network information is encoded in the file europe_network.json. It is given in PowerModels format, which it itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

Time series

The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.

There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amount to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent a same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).

Usage

The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analyis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

Selecting a particular country

This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

import pandas as pd CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

The object created in this way is Dataframe with some null values (not all countries have the same number of generators). It can be turned into a list with:

CH_gens_list = CH_gens.dropna().squeeze().to_list()

Finally, we can import all the time series of Swiss generators from a given data table with

pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

Averaging over time

This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

hourly_loads = pd.read_csv('loads_2018_3.csv')

To get a daily average of the loads, we can use:

daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()

Source code

The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists in two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

Funding

This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
t
ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...
researchdata.tuwien.ac.at
zip
Updated Jun 6, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.48436/3fcxr-cde10
Dataset updated
Jun 6, 2025
Dataset provided by
TU Wien
Authors
Wolfgang Preimesberger; Wolfgang Preimesberger; Pietro Stradiotti; Pietro Stradiotti; Wouter Arnoud Dorigo; Wouter Arnoud Dorigo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

Dataset paper (public preprint)

A description of this dataset, including the methodology and validation results, is available at:

Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

Abstract

ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling method is to rely only on the original observational record, without need for ancillary variable or model-based information. Due to the intrinsic challenge, there was until present no global, long-term univariate gap-filled product available. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gapfilling performance.

Summary

Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling

Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology

Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.

More information: See Preimesberger et al. (2025) and https://doi.org/10.5281/zenodo.8320869" target="_blank" rel="noopener">ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023)

Programmatic Download

You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Download on Linux or macOS systems.

#!/bin/bash

# Set download directory
DOWNLOAD_DIR=~/Downloads

base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
echo "Downloading $year.zip..."
wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
unzip -o "$DOWNLOAD_DIR/$year.zip" -d $DOWNLOAD_DIR
rm "$DOWNLOAD_DIR/$year.zip"
done

Data details

The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc

Data Variables

Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).

sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.

sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)

sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided for cases where an observation was initially available (compare `gapmask`). In this case, they provided a smoothed version of the original data.

gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.

frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

Additional information for each variable is given in the netCDF attributes.

Version Changelog

Changes in v9.1r1 (previous version was v09.1):

This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

Software to open netCDF files

These data can be read by any software that supports Climate and Forecast (CF) conform metadata standards for netCDF files, such as:

https://github.com/pydata/xarray" target="_blank" rel="noopener">Xarray (python)

https://unidata.github.io/netcdf4-python/" target="_blank" rel="noopener">netCDF4 (python)

https://github.com/TUW-GEO/esa_cci_sm">esa_cci_sm (python)

Similar tools exists for other programming languages (Matlab, R, etc.)

Software packages and GIS tools can open netCDF files, e.g. CDO, NCO, QGIS, ArCGIS

You can also use the GUI software Panoply to view the contents of each file

References

Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869

Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020

Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

Related Records

The following records are all part of the Soil Moisture Climate Data Records from satellites community

1
ESA CCI SM MODELFREE Surface Soil Moisture Record
<a href="https://doi.org/10.48436/svr1r-27j77" target="_blank"
LYMob-4Cities: Multi-City Human Mobility Dataset
zenodo.org
data.niaid.nih.gov
zip
Updated Dec 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Takahiro Yabe; Takahiro Yabe; Kota Tsubouchi; Kota Tsubouchi; Toru Shimizu; Toru Shimizu (2024). LYMob-4Cities: Multi-City Human Mobility Dataset [Dataset]. http://doi.org/10.5281/zenodo.14219563
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14219563
Dataset updated
Dec 3, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Takahiro Yabe; Takahiro Yabe; Kota Tsubouchi; Kota Tsubouchi; Toru Shimizu; Toru Shimizu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This multi-city human mobility dataset contains data from 4 metropolitan areas (cities A, B, C, D), somewhere in Japan. Each city is divided into 500 meters x 500 meters cells, which span a 200 x 200 grid. The human mobility datasets contain the movement of individuals across a 75-day period, discretized into 30-minute intervals and 500-meter grid cells. Each city contains the movement data of 100,000, 25,000, 20,000, and 6,000 individuals, respectively.

While the name or location of the city is not disclosed, the participants are provided with points-of-interest (POIs; e.g., restaurants, parks) data for each grid cell (~85 dimensional vector) for the four cities as supplementary information (e.g., POIdata_cityA). The list of 85 POI categories can be found in POI_datacategories.csv.

This dataset was used for the HuMob Data Challenge 2024 competition. For more details, see https://wp.nyu.edu/humobchallenge2024/

Researchers may use this dataset for publications and reports, as long as: 1) Users shall not carry out activities that involve unethical usage of the data, including attempts at re-identifying data subjects, harming individuals, or damaging companies, and 2) The Data Descriptor paper of an earlier version of the dataset (citation below) needs to be cited when using the data for research and/or commercial purposes. Downloading this dataset implies agreement with the above two conditions.

Yabe, T., Tsubouchi, K., Shimizu, T., Sekimoto, Y., Sezaki, K., Moro, E., & Pentland, A. (2024). YJMob100K: City-scale and longitudinal dataset of anonymized human mobility trajectories. Scientific Data, 11(1), 397. https://www.nature.com/articles/s41597-024-03237-9
A
‘ 🚴 Bike Sharing Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 13, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2014). ‘ 🚴 Bike Sharing Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-bike-sharing-dataset-20d4/6ac341fa/?iid=032-878&v=presentation
Explore at:
Dataset updated
Jan 13, 2014
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘ 🚴 Bike Sharing Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/bike-sharing-datasete on 13 February 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

Source:

Hadi Fanaee-T
Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of PortoINESC Porto, Campus da FEUPRua Dr. Roberto Frias, 3784200 - 465 Porto, Portugal
Original Source:
http://capitalbikeshare.com/system-data
Weather Information:
http://www.freemeteo.com
Holiday Schedule:
http://dchr.dc.gov/page/holiday-schedule

Data Set Information:

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.
Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.

Attribute Information:

Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv
- instant: record index - dteday : date - season : season (1:springer, 2:summer, 3:fall, 4:winter) - yr : year (0: 2011, 1:2012) - mnth : month ( 1 to 12) - hr : hour (0 to 23) - holiday : weather day is holiday or not (extracted from ) - weekday : day of the week - workingday : if day is neither weekend nor holiday is 1, otherwise is 0. + weathersit :
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog - temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale) - atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale) - hum: Normalized humidity. The values are divided to 100 (max) - windspeed: Normalized wind speed. The values are divided to 67 (max) - casual: count of casual users - registered: count of registered users - cnt: count of total rental bikes including both casual and registered

Relevant Papers:

Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, .

Citation Request:

Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, .
@article{ year={2013}, issn={2192-6352}, journal={Progress in Artificial Intelligence}, doi={10.1007/s13748-013-0040-3}, title={Event labeling combining ensemble detectors and background knowledge}, url={ }, publisher={Springer Berlin Heidelberg}, keywords={Event labeling; Event detection; Ensemble learning; Background knowledge}, author={Fanaee-T, Hadi and Gama, Joao}, pages={1-15}}

Source: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

This dataset was created by UCI and contains around 20000 samples along with Dteday, Windspeed, technical information and other features such as: - Registered - Cnt - and more.

How to use this dataset

Analyze Weekday in relation to Casual

Study the influence of Season on Holiday

More datasets

Acknowledgements

If you use this dataset in your research, please credit UCI

Start A New Notebook!

--- Original source retains full ownership of the source dataset ---
O
Child and Adult Care Food Programs (CACFP) –Child Centers – Meal...
data.texas.gov
catalog.data.gov
application/rdfxml +5
Updated Apr 2, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Texas Department of Agriculture (2025). Child and Adult Care Food Programs (CACFP) –Child Centers – Meal Reimbursement – Program Year 2023 - 2024 [Dataset]. https://data.texas.gov/dataset/Child-and-Adult-Care-Food-Programs-CACFP-Child-Cen/pp75-3tt8
Explore at:
csv, application/rdfxml, tsv, xml, json, application/rssxmlAvailable download formats
Dataset updated
Apr 2, 2025
Dataset authored and provided by
Texas Department of Agriculturehttp://www.texasagriculture.gov/
Description
Help us provide the most useful data by completing our ODP User Feedback Survey for Child and Adult Care Food Program (CACFP) Data

About the Dataset
This data set contains claims information for meal reimbursement for sites participating in CACFP as child centers for the program year 2023-2024. This includes Child Care Centers, At-Risk centers, Head Start sites, Outside School Hours sites, and Emergency Shelters . The CACFP program year begins October 1 and ends September 30.

This dataset only includes claims submitted by CACFP sites operating as child centers.Sites can participate in multiple CACFP sub-programs. Each record (row) represents monthly meals data for a single site and for a single CACFP center sub-program.

To filter data for a specific CACFP center Program, select "View Data" to open the Exploration Canvas filter tools. Select the program(s) of interest from the Program field. A filtering tutorial can be found HERE

For meals data on CACFP participants operating as Day Care Homes, Adult Day Care Centers, or child care centers for previous program years, please refer to the corresponding “Child and Adult Care Food Programs (CACFP) – Meal Reimbursement” dataset for that sub-program available on the State of Texas Open Data Portal.

An overview of all CACFP data available on the Texas Open Data Portal can be found at our TDA Data Overview - Child and Adult Care Food Programs page.

An overview of all TDA Food and Nutrition data available on the Texas Open Data Portal can be found at our TDA Data Overview - Food and Nutrition Open Data page.

More information about accessing and working with TDA data on the Texas Open Data Portal can be found on the SquareMeals.org website on the TDA Food and Nutrition Open Data page.

About Dataset Updates
TDA aims to post new program year data by December 15 of the active program year. Participants have 60 days to file monthly reimbursement claims. Dataset updates will occur daily until 90 days after the close of the program year. After 90 days from the close of the program year, the dataset will be updated at six months and one year from the close of the program year before becoming archived. Archived datasets will remain published but will not be updated. Any data posted during the active program year is subject to change.

About the Agency
The Texas Department of Agriculture administers 12 U.S. Department of Agriculture nutrition programs in Texas including the National School Lunch and School Breakfast Programs, the Child and Adult Care Food Programs (CACFP), and the summer meal programs. TDA’s Food and Nutrition division provides technical assistance and training resources to partners operating the programs and oversees the USDA reimbursements they receive to cover part of the cost associated with serving food in their facilities. By working to ensure these partners serve nutritious meals and snacks, the division adheres to its mission — Feeding the Hungry and Promoting Healthy Lifestyles.

For more information on these programs, please visit our website.
"
Data from: EHR data
kaggle.com
Updated Apr 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bipul Shahi (2025). EHR data [Dataset]. https://www.kaggle.com/datasets/vipulshahi/ehr-data/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 30, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Bipul Shahi
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
🔍 Dataset Overview

Each patient in the dataset has 30 days of continuous health data. The goal is to predict if a patient will progress to a critical condition based on their vital signs, medication adherence, and symptoms recorded daily.

There are 10 columns in the dataset:

Column Name Description patient_id Unique identifier for each patient. day Day number (from 1 to 30) indicating sequential daily records. bp_systolic Systolic blood pressure (top number) in mm Hg. Higher values may indicate hypertension. bp_diastolic Diastolic blood pressure (bottom number) in mm Hg. heart_rate Heartbeats per minute. Elevated heart rate can signal stress, infection, or deterioration. respiratory_rate Breaths per minute. Elevated rates can indicate respiratory distress. temperature Body temperature in °F. Fever or hypothermia are signs of infection or inflammation. oxygen_saturation Percentage of oxygen in blood. Lower values are concerning (< 94%). med_adherence Patient’s medication adherence (between 0 and 1). Lower values may contribute to worsening. symptom_severity Subjective symptom rating (scale of 1–10). Higher means worse condition. progressed_to_critical Target label: 1 if patient deteriorated to a critical condition, else 0. 🎯 Final Task (Prediction Objective)

Problem Type: Binary classification with time-series data.

Goal: Train deep learning models (RNN, LSTM, GRU) to learn temporal patterns from a patient's 30-day health history and predict whether the patient will progress to a critical condition.

📈 How the Data is Used for Modeling

Input: A 3D array shaped as (num_patients, 30, 8) where: 30 = number of days (timesteps), 8 = features per day (excluding ID, day, and target). Output: A binary label for each patient (0 or 1). 🔄 Feature Contribution to Prediction

Feature Why It Matters bp_systolic/dia Persistently high or rising BP may signal stress, cardiac issues, or deterioration. heart_rate A rising heart rate can indicate fever, infection, or organ distress. respiratory_rate Often increases early in critical illnesses like sepsis or COVID. temperature Fever is a key sign of infection. Chronic low/high temp may indicate underlying pathology. oxygen_saturation A declining oxygen level is a strong predictor of respiratory failure. med_adherence Poor medication adherence is often linked to worsening chronic conditions. symptom_severity Patient-reported worsening symptoms may precede measurable physiological changes. 🛠 Tools You’ll Use

Task Tool/Technique Data processing Pandas, NumPy, Scikit-learn Time series modeling Keras (using SimpleRNN, LSTM, GRU) Evaluation Accuracy, Loss, ROC Curve (optional)
A
‘Vehicle Miles Traveled During Covid-19 Lock-Downs ’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 4, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Vehicle Miles Traveled During Covid-19 Lock-Downs ’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-vehicle-miles-traveled-during-covid-19-lock-downs-636d/latest
Explore at:
Dataset updated
Jan 4, 2021
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Vehicle Miles Traveled During Covid-19 Lock-Downs ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/vehicle-miles-travelede on 13 February 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

**This data set was last updated 3:30 PM ET Monday, January 4, 2021. The last date of data in this dataset is December 31, 2020. **

Overview

Data shows that mobility declined nationally since states and localities began shelter-in-place strategies to stem the spread of COVID-19. The numbers began climbing as more people ventured out and traveled further from their homes, but in parallel with the rise of COVID-19 cases in July, travel declined again.

This distribution contains county level data for vehicle miles traveled (VMT) from StreetLight Data, Inc, updated three times a week. This data offers a detailed look at estimates of how much people are moving around in each county.

Data available has a two day lag - the most recent data is from two days prior to the update date. Going forward, this dataset will be updated by AP at 3:30pm ET on Monday, Wednesday and Friday each week.

This data has been made available to members of AP’s Data Distribution Program. To inquire about access for your organization - publishers, researchers, corporations, etc. - please click Request Access in the upper right corner of the page or email kromano@ap.org. Be sure to include your contact information and use case.

Findings

Nationally, data shows that vehicle travel in the US has doubled compared to the seven-day period ending April 13, which was the lowest VMT since the COVID-19 crisis began. In early December, travel reached a low not seen since May, with a small rise leading up to the Christmas holiday.

Average vehicle miles traveled continues to be below what would be expected without a pandemic - down 38% compared to January 2020. September 4 reported the largest single day estimate of vehicle miles traveled since March 14.

New Jersey, Michigan and New York are among the states with the largest relative uptick in travel at this point of the pandemic - they report almost two times the miles traveled compared to their lowest seven-day period. However, travel in New Jersey and New York is still much lower than expected without a pandemic. Other states such as New Mexico, Vermont and West Virginia have rebounded the least.

About This Data

The county level data is provided by StreetLight Data, Inc, a transportation analysis firm that measures travel patterns across the U.S.. The data is from their Vehicle Miles Traveled (VMT) Monitor which uses anonymized and aggregated data from smartphones and other GPS-enabled devices to provide county-by-county VMT metrics for more than 3,100 counties. The VMT Monitor provides an estimate of total vehicle miles travelled by residents of each county, each day since the COVID-19 crisis began (March 1, 2020), as well as a change from the baseline average daily VMT calculated for January 2020. Additional columns are calculations by AP.

Included Data

01_vmt_nation.csv - Data summarized to provide a nationwide look at vehicle miles traveled. Includes single day VMT across counties, daily percent change compared to January and seven day rolling averages to smooth out the trend lines over time.

02_vmt_state.csv - Data summarized to provide a statewide look at vehicle miles traveled. Includes single day VMT across counties, daily percent change compared to January and seven day rolling averages to smooth out the trend lines over time.

03_vmt_county.csv - Data providing a county level look at vehicle miles traveled. Includes VMT estimate, percent change compared to January and seven day rolling averages to smooth out the trend lines over time.

Additional Data Queries

* Filter for specific state - filters 02_vmt_state.csv daily data for specific state.

* Filter counties by state - filters 03_vmt_county.csv daily data for counties in specific state.

* Filter for specific county - filters 03_vmt_county.csv daily data for specific county.

Interactive

The AP has designed an interactive map to show percent change in vehicle miles traveled by county since each counties lowest point during the pandemic:

This dataset was created by Angeliki Kastanis and contains around 0 samples along with Date At Low, Mean7 County Vmt At Low, technical information and other features such as: - County Name - County Fips - and more.

How to use this dataset

Analyze State Name in relation to Baseline Jan Vmt

Study the influence of Date At Low on Mean7 County Vmt At Low

More datasets

Acknowledgements

If you use this dataset in your research, please credit Angeliki Kastanis

Start A New Notebook!

--- Original source retains full ownership of the source dataset ---
z
Primary documentation on the scientific study of indicators of continuous...
zenodo.org
Updated Apr 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mariia Matveeva; Mariia Matveeva; Marina Koshmeleva; Marina Koshmeleva; Dmitriy Kachanov; Dmitriy Kachanov; Svetlana Fomina; Svetlana Fomina; Iuliia Samoilova; Iuliia Samoilova; Екатерина Трифонова; Екатерина Трифонова (2025). Primary documentation on the scientific study of indicators of continuous monitoring and flash monitoring of glycemia in children and adolescents with type 1 diabetes mellitus [Dataset]. http://doi.org/10.5281/zenodo.10074289
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.10074289
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo
Authors
Mariia Matveeva; Mariia Matveeva; Marina Koshmeleva; Marina Koshmeleva; Dmitriy Kachanov; Dmitriy Kachanov; Svetlana Fomina; Svetlana Fomina; Iuliia Samoilova; Iuliia Samoilova; Екатерина Трифонова; Екатерина Трифонова
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The main purpose of creating an electronic database was to evaluate the performance of continuous glucose monitoring (CGM) and flash monitoring (FMS) in children and adolescents diagnosed with type 1 diabetes mellitus. The database is intended for entering, systematizing, storing and displaying patient data (date of birth, age, date of diagnosis of type 1 diabetes mellitus, length of illness, date of first visit to an endocrinologist, with the installation of a CGM or FMS), glycated hemoglobin indicators initially , during the study and ultimately, as well as CGM, FMS data (average glucose level, glycemic variability, percentage of cases above the target range, percentage of cases within the target range, percentage of cases below the target range, number of hypoglycemic episodes and their average duration, frequency of daily scans and frequency of sensor readings).
The database is the basis for comparative statistical analysis of dynamic monitoring indicators in groups of patients with the presence or absence of diabetic complications (neuropathy, retinopathy and nephropathy). The database presents the results of a prospective, open, controlled, clinical study obtained over a year and a half. The database includes information on 307 patients (adolescent children) aged 3 to 17 years inclusive. During the study, the observed patients were divided into two groups: Group 1 - patients diagnosed with type 1 diabetes mellitus and with diabetic complications, 152 people, Group 2 – patients diagnosed with type 1 diabetes mellitus and with no diabetic complications, 155 people. All registrants of the database were assigned individual codes, which made it possible to exclude personal data (full name) from the database.
The database is executed in the Microsoft Office Excel program and has the character of a depersonalized summary table, which consists of two blocks-sheets: patients of groups 1 and 2 and is structured according to the following sections: "Patient number"; "Patient code"; "Date of birth"; "Age of the patient"; section "Date of diagnosis of DM1" indicates the date of the official diagnosis of type 1 diabetes mellitus at the first hospitalization of the patient, this information is borrowed from medical information systems; section "Length of service DM1" reflects information about the duration of the patient's illness; the section "Date of the first visit" contains information about the date of the registrant's visit to the endocrinologist with the installation of FMS / CGM devices; the section "Frequency of self-monitoring with a glucometer" contains information about the frequency of measuring blood glucose levels by the patient at home using a glucometer until the establishment of FMS / CGM.
Sections "HbA1c initially (GMI)", "HbA1c (GMI)", "HbA1c final (GMI)", display the indicators of the level of glycated hemoglobin from the total for the period of the beginning of the study, at the intermediate stages of the study and at the end of observation.
The database structure has a number of sections accumulating information obtained with CGM/FMS, in particular: the section "Average glucose level"; the section "% above the target range", reflecting the percentage of the patient's stay with glycemia above the target indicators during the day; the section "% within the target range", reflecting the percentage of the patient's stay within the target glycemia indicators per day; the section "% below the target range", reflecting the percentage of the patient's stay with glycemia below the target indicators during the day; the section "Hypoglycemic phenomena", reflecting the number of cases of hypoglycemia in patients within 2 weeks; the section "Average duration", reflecting the average duration of hypoglycemic phenomena registered in the patient; the section "Sensor data received", indicating the percentage of time the patient was with an active device sensor; the section "Daily scans" show the frequency of scans of the patient's glycemic level (once a day); the section "%CV" displays the variability of the patient's glycemia recorded by the device. The listed sections are repeated in the database in accordance with the number of follow-up visit.
Also in the database there is a section "Mid. values", which contains indicators of the average values of patient data for all of the above sections, both in the first and in the second group of patients.
When working with the database, the use of filters (in the "Data" tab) containing the names of indicators allows you to enter information about new registrants in a convenient form or correct existing data, as well as sort and search for one or more specified indicators.
The electronic database allows you to systematize a large volume of results, distribute data into categories, search for any field or set of fields in the input format, systematize the selected array, makes it possible to directly use this data for statistical analysis, as well as to view and print information on specified conditions with the location of fields in a convenient sequence.
Long-term Continuous SIF-informed Photosynthesis Proxy reconstructed with...
zenodo.org
application/gzip
Updated Jan 10, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jianing Fang; Jianing Fang; Xu Lian; Xu Lian; Youngryel Ryu; Youngryel Ryu; Sungchan Jeong; Sungchan Jeong; Chongya Jiang; Chongya Jiang; Pierre Gentine; Pierre Gentine (2025). Long-term Continuous SIF-informed Photosynthesis Proxy reconstructed with calibrated AVHRR surface reflectance (LCSPP-AVHRR), 2001-2023 [Dataset]. http://doi.org/10.5281/zenodo.14568491
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14568491
Dataset updated
Jan 10, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jianing Fang; Jianing Fang; Xu Lian; Xu Lian; Youngryel Ryu; Youngryel Ryu; Sungchan Jeong; Sungchan Jeong; Chongya Jiang; Chongya Jiang; Pierre Gentine; Pierre Gentine
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Usage Notes:
This is the updated LCSPP dataset (v3.2), generated using the LCREF-AVHRR record from 1982–2023. Due to Zenodo’s size constraints, LCSPP-AVHRR is divided into two separate repositories. Previously referred to as "LCSIF," the dataset was renamed to emphasize its role as a SIF-informed long-term photosynthesis proxy derived from surface reflectance and to avoid confusion with directly measured SIF signals.

Key updates in version 3.2 include:

Improved Calibration: Enhanced consistency in calibration methods, addressing technical limitations in version 3.1 including applying more stringent quality filtering and snow masks.

Quality Flags: New quality flag layer enables users to identify whether a pixel is derived from observed surface reflectance (QA=0), high-quality gap-filled values (QA=1), lower-quality gap-filled based on the mean seasonal cycle (QA=2), or missing entirely (QA=3). We advice the user to rely only on observed and high-quality gap-filled values for their analyses.

Extension to include observations from the year of 2023.

Other LCSPP repositories can be accessed via the following links:

LCSPP-AVHRR v3.2 (1982-2000): 10.5281/zenodo.7916850

LCSPP-MODIS v3.2(2001-2023): 10.5281/zenodo.11658088

The user can choose between LCSPP-AVHRR and LCSPP-MODIS for the overlapping period from 2001-2023. The two datasets are generally consistent during this overlapping period, although LCSPP-MODIS shows a stronger greening trend between 2001-2023. For studies exploring the long-term vegetation dynamics, the user can either use only LCSPP-AVHRR or use a blend dataset of LCSPP-AVHRR and LCSPP-MODIS as a sensitivity test.

In addition, the updated long-term continuous reflectance datasets (LCREF), used for the production of LCSPP, can be accessed using the following links:

LCREF-AVHRR v3.2 (1982-2023): 10.5281/zenodo.11905959

LCREF-MODIS v3.2 (2001-2023): 10.5281/zenodo.11657458

A manuscript describing the technical details is available at https://arxiv.org/abs/2311.14987, while detailed the uses and limitations of the dataset. In particular, we note that LCSPP is a reconstruction of SIF-informed photosynthesis proxy and should not be treated as SIF measurements. Although LCSPP has demonstrated skill in tracking the dynamics of GPP and PAR absorbed by canopy chlorophyll (APARchl), it is not suitable for estimating fluorescence quantum yield.

All data outputs from this study are available at 0.05° spatial resolution and biweekly temporal resolution in NetCDF format. Each month is divided into two files, with the first file “a” representative of the 1^st day to the 15^th day of a month, and the second file “b” representative of the 16^th day to the last day of a month.

Abstract:

Satellite-observed solar-induced chlorophyll fluorescence (SIF) is a powerful proxy for the photosynthetic characteristics of terrestrial ecosystems. Direct SIF observations are primarily limited to the recent decade, impeding their application in detecting long-term dynamics of ecosystem function. In this study, we leverage two surface reflectance bands available both from Advanced Very High-Resolution Radiometer (AVHRR, 1982-2023) and MODerate-resolution Imaging Spectroradiometer (MODIS, 2001-2023). Importantly, we calibrate and orbit-correct the AVHRR bands against their MODIS counterparts during their overlapping period. Using the long-term bias-corrected reflectance data from AVHRR and MODIS, a neural network is trained to produce a Long-term Continuous SIF-informed Photosynthesis Proxy (LCSPP) by emulating Orbiting Carbon Observatory-2 SIF, mapping it globally over the 1982-2023 period. Compared with previous SIF-informed photosynthesis proxies, LCSPP has similar skill but can be advantageously extended to the AVHRR period. Further comparison with three widely used vegetation indices (NDVI, kNDVI, NIRv) shows a higher or comparable correlation of LCSPP with satellite SIF and site-level GPP estimates across vegetation types, ensuring a greater capacity for representing long-term photosynthetic activity.
Percentage of time IT security services are available
open.canada.ca
datasets.ai
+1more
csv
Updated Dec 9, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shared Services Canada (2024). Percentage of time IT security services are available [Dataset]. https://open.canada.ca/data/dataset/29017f8f-54ce-42a9-9ff7-61f53d5abee1
Explore at:
csvAvailable download formats
Dataset updated
Dec 9, 2024
Dataset provided by
Shared Services Canadahttps://www.canada.ca/en/shared-services.html
License
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Description
This indicator reports on the SSC Enterprise IT security services that are part of the SSC service catalogue. This indicator is an aggregation of myKEY, Secure Remote Access and External Credential Management. These services best represent CITS’ ability to secure IT infrastructure. Calculation / formula: Numerator: Total time (# hours, minutes, and seconds) the infrastructure security services are available (i.e. Up-Time) in assessment period (day, week, month, and year) multiplied by the number of services, and multiplied by the number of applicable customer departments. Denominator: Total time (# hours, minutes and seconds) in assessment period (day, week, month, and year) multiplied by the number of services, and multiplied by the number of applicable customer departments. Trend should be interpreted such that a higher % represents progress toward target. Once target has been reach, additional % represents excellence. The % of availability, exclude the maintenance windows. For example: If there were a 1-day outage of an IS Service (e.g. myKey), the Up-Time would be: 24 hrs/day x30 days/month x1 service x 40 customers = 28800 hours (per month), which is the Numerator The Total time for the reporting period (in this case, monthly) is: 24 hrs/day x31 days/month x1 service x 40 customers = 29760 hours, which is the Denominator. Numerator over Denominator is % of Time IS service is available = 28800/29760 = 96.8% Target: 99.8%.
Executive Functioning Data
openneuro.org
Updated Dec 5, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tracy Brandmeyer; Arnaud Delorme (2022). Executive Functioning Data [Dataset]. http://doi.org/10.18112/openneuro.ds004350.v1.0.0
Explore at:
Unique identifier
https://doi.org/10.18112/openneuro.ds004350.v1.0.0
Dataset updated
Dec 5, 2022
Dataset provided by
OpenNeurohttps://openneuro.org/
Authors
Tracy Brandmeyer; Arnaud Delorme
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Executive Functioning Tasks

The data of this dataset was collected as part of an executive functioning battery consisting of three separate tasks:

1) N-Back (NB)

2) Sustained Attention to Response Task (SART)

3) Local Global (LG)

The original experiment details in which these tasks were conducted in addition to can be read about here (https://doi.org/10.3389/fnhum.2020.00246).

Experiment Design: Two sessions of each task were conducted on the first and last day of the neurofeedback experiment with 24 participants (mentioned above).

[N-Back (NB)] Participants performed a visual sequential letter n-back working memory task, with memory load ranging from 1-back to 3-back. The visual stimuli consisted of a sequence of 4 letters (A, B, C, D) presented black on a gray background. Participants observed stimuli on a visual display and responded using the spacebar on a provided keyboard. In the 1-back condition, the target was any letter identical to the trial immediately preceding one. In the 2-back and 3-back conditions, the target was any letter that was presented two or three trials back, respectively. The stimuli were presented on a screen for a duration of 1 s, after which a fixation cross was presented for 500 ms. Participants responded to each stimulus by pressing the spacebar with their right hand upon target presentation. If no spacebar was pressed within 1500 ms of the stimulus presentation, a new stimulus was presented. Each n-back condition (1, 2, and 3-back) consisted of the presentation of 280 stimuli selected randomly in the 4-letter pool.

[Sustained Attention to Response Task (SART)] Participants were presented with a series of single numerical digits (randomly selected from 0 to 9 - the same digit could not be presented twice in a row) and instructed to press the spacebar for each digit, except for when presented with the digit 3. Each number was presented for 400 ms in white on a gray background. The inter-stimulus interval was 2 s irrespective of the button press and a fixation cross was present at all times except for when the digits were presented. Participants performed the SART for approximately 10 minutes corresponding to 250 digit presentations.

[Local Global (LG)] Participants were shown large letters (H and T) on a computer screen. The large letters were made up of an aggregate of smaller letters that could be congruent (i.e large H made of small Hs or large T made of small Ts) or incongruent (large H made of small Ts or large T made of small Hs) with respect to the large letter. The small letters were 0.8 cm high and the large letters were 8 cm high on the computer screen. A fixation cross was present at all times except when the stimulus letters were presented. Letters were shown on the computer screen until the subject responded. After each subject's response, there was a delay of 1 s before the next stimulus was presented. Before each sequence of letters, instructions were shown on a computer screen indicating to participants whether they should respond to the presence of small (local condition) or large (global condition) letters. The participants were instructed to categorize specifically large letters or small letters and to press the letter H or T on the computer keyboard to indicate their choice.

Data Processing: Data processing was performed in Matlab and EEGLAB. The EEG data was average referenced and down-sampled from 2048 to 256 Hz. A high-pass filter at 1 HZ using an elliptical non-linear filter was applied and the data was then average referenced.

Note: The data files in this dataset were converted into the .set format for EEGLAB. The .bdf files that were converted for each of the tasks can be found in the sourcedata folder.

Exclusion Note: The second run of NB in session 1 of sub-11 and the run of SART in session 1 of sub-18 were both excluded due to issues with conversion to .set format. However, the .bdf files of these runs can be found in the sourcedata folder.
Lead Scoring Dataset
kaggle.com
zip
Updated Aug 17, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Amrita Chatterjee (2020). Lead Scoring Dataset [Dataset]. https://www.kaggle.com/amritachatterjee09/lead-scoring-dataset
Explore at:
zip(411028 bytes)Available download formats
Dataset updated
Aug 17, 2020
Authors
Amrita Chatterjee
Description
Context

An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses or fill up a form for the course or watch some videos. When these people fill up a form providing their email address or phone number, they are classified to be a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X education is around 30%.

Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if, say, they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most potential leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up as the sales team will now be focusing more on communicating with the potential leads rather than making calls to everyone.

There are a lot of leads generated in the initial stage (top) but only a few of them come out as paying customers from the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc. ) in order to get a higher lead conversion.

X Education wants to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model wherein you need to assign a lead score to each of the leads such that the customers with higher lead score h have a higher conversion chance and the customers with lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.

Content

Variables Description * Prospect ID - A unique ID with which the customer is identified. * Lead Number - A lead number assigned to each lead procured. * Lead Origin - The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc. * Lead Source - The source of the lead. Includes Google, Organic Search, Olark Chat, etc. * Do Not Email -An indicator variable selected by the customer wherein they select whether of not they want to be emailed about the course or not. * Do Not Call - An indicator variable selected by the customer wherein they select whether of not they want to be called about the course or not. * Converted - The target variable. Indicates whether a lead has been successfully converted or not. * TotalVisits - The total number of visits made by the customer on the website. * Total Time Spent on Website - The total time spent by the customer on the website. * Page Views Per Visit - Average number of pages on the website viewed during the visits. * Last Activity - Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc. * Country - The country of the customer. * Specialization - The industry domain in which the customer worked before. Includes the level 'Select Specialization' which means the customer had not selected this option while filling the form. * How did you hear about X Education - The source from which the customer heard about X Education. * What is your current occupation - Indicates whether the customer is a student, umemployed or employed. * What matters most to you in choosing this course An option selected by the customer - indicating what is their main motto behind doing this course. * Search - Indicating whether the customer had seen the ad in any of the listed items. * Magazine
* Newspaper Article * X Education Forums
* Newspaper * Digital Advertisement * Through Recommendations - Indicates whether the customer came in through recommendations. * Receive More Updates About Our Courses - Indicates whether the customer chose to receive more updates about the courses. * Tags - Tags assigned to customers indicating the current status of the lead. * Lead Quality - Indicates the quality of lead based on the data and intuition the employee who has been assigned to the lead. * Update me on Supply Chain Content - Indicates whether the customer wants updates on the Supply Chain Content. * Get updates on DM Content - Indicates whether the customer wants updates on the DM Content. * Lead Profile - A lead level assigned to each customer based on their profile. * City - The city of the customer. * Asymmetric Activity Index - An index and score assigned to each customer based on their activity and their profile * Asymmetric Profile Index * Asymmetric Activity Score * Asymmetric Profile Score
* I agree to pay the amount through cheque - Indicates whether the customer has agreed to pay the amount through cheque or not. * a free copy of Mastering The Interview - Indicates whether the customer wants a free copy of 'Mastering the Interview' or not. * Last Notable Activity - The last notable activity performed by the student.

Acknowledgements

UpGrad Case Study

Inspiration

Your data will be in front of the world's largest data science community. What questions do you want to see answered?
h
cnn_dailymail
huggingface.co
tensorflow.org
+2more
Updated Dec 18, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ccdv (2021). cnn_dailymail [Dataset]. https://huggingface.co/datasets/ccdv/cnn_dailymail
Explore at:
Dataset updated
Dec 18, 2021
Authors
ccdv
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
CNN/DailyMail non-anonymized summarization dataset.

There are two features: - article: text of news article, used as the document to be summarized - highlights: joined text of highlights with and around each highlight, which is the target summary
c
Rain in Australia Dataset
cubig.ai
Updated Jun 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CUBIG (2025). Rain in Australia Dataset [Dataset]. https://cubig.ai/store/products/501/rain-in-australia-dataset
Explore at:
Dataset updated
Jun 22, 2025
Dataset authored and provided by
CUBIG
License
https://cubig.ai/store/terms-of-servicehttps://cubig.ai/store/terms-of-service
Area covered
Australia
Measurement technique
Privacy-preserving data transformation via differential privacy, Synthetic data generation using AI techniques for model training
Description
1) Data Introduction • The Rain in Australia Dataset is a tabular weather forecasting dataset, including daily weather information collected for approximately 10 years from various weather stations across Australia, and next-day precipitation (more than 1 mm, RainTomorrow).

2) Data Utilization (1) Rain in Australia Dataset has characteristics that: • Each row contains a variety of daily weather variables and target variables (RainTomorrow: Next Day RainTomorrow) such as date, region, highest/lowest temperature, precipitation, humidity, wind speed, and air pressure. • The data reflect multiple regions and various weather conditions, making them suitable for time series and spatial weather pattern analysis and the development of binary classification prediction models. (2) Rain in Australia Dataset can be used to: • Development of precipitation prediction models: Machine learning-based next-day precipitation prediction (whether an umbrella is required) models can be built using various weather variables and RainTomorrow labels. • Weather Patterns and Regional Analysis: By analyzing regional and seasonal weather variables and precipitation patterns, it can be used to establish customized weather strategies for each industry, such as climate change research and agriculture and tourism.
o
Data from: Financial Fraud Detection Dataset
opendatabay.com
.undefined
Updated Jun 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Review Nexus (2025). Financial Fraud Detection Dataset [Dataset]. https://www.opendatabay.com/data/financial/d226c56e-5929-4059-a30d-13632e07b344
Explore at:
.undefinedAvailable download formats
Dataset updated
Jun 25, 2025
Dataset authored and provided by
Review Nexus
License
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Area covered
Fraud Detection & Risk Management
Description
This dataset is designed to support research and model development in the area of fraud detection. It consists of real-world credit card transactions made by European cardholders over a two-day period in September 2013. Out of 284,807 transactions, 492 are labeled as fraudulent (positive class), making this a highly imbalanced classification problem.

Performance Note:

Due to the extreme class imbalance, standard accuracy metrics are not informative. We recommend using the Area Under the Precision-Recall Curve (AUPRC) or F1-score for model evaluation.

Features:

Time Series Data: Each row represents a transaction, with the Time feature indicating the number of seconds elapsed since the first transaction.

Dimensionality Reduction Applied: Features V1 through V28 are anonymized principal components derived from a PCA transformation due to confidentiality constraints.

Raw Transaction Amount: The Amount field reflects the transaction value, useful for cost-sensitive modeling.

Binary Classification Target: The Class label is 1 for fraud and 0 for legitimate transactions.

Usage:

Machine learning model training for fraud detection.

Evaluation of anomaly detection and imbalanced classification methods.

Development of cost-sensitive learning approaches using the Amount variable.

Data Summary:

Total Records: 284,807

Fraud Cases: 492

Imbalance Ratio: Fraudulent transactions account for just 0.172% of the dataset.

Columns: 31 total (28 PCA features, plus Time, Amount, and Class)

License:

The dataset is provided under the CC0 (Public Domain) license, allowing users to freely use, modify, and distribute the data without any restrictions.

Acknowledgements

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project

Please cite the following works:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

Yann-Aël Le Borgne, Gianluca Bontempi Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi Incremental learning strategies for credit cards fraud detection, IInternational Journal of Data Science and Analytics
RADAR data - TO2015 Pan and Parapan American Games
ouvert.canada.ca
datasets.ai
+4more
html, zip
Updated Mar 21, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Environment and Climate Change Canada (2018). RADAR data - TO2015 Pan and Parapan American Games [Dataset]. https://ouvert.canada.ca/data/dataset/cbe2ac22-c492-43e2-9187-3e95fcbb2f99
Explore at:
html, zipAvailable download formats
Dataset updated
Mar 21, 2018
Dataset provided by
Environment And Climate Change Canadahttps://www.canada.ca/en/environment-climate-change.html
License
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Time period covered
May 1, 2015 - Sep 30, 2015
Description
The main radar covering the Toronto Area is the King City Doppler Dual-Polarization C-Band radar (43.96388, -79.57416). Other nearby radars include the Exeter Doppler C-Band radar (43.37027,-81.38416) and the Buffalo Doppler Dual-Polarization S-Band radar (42.94889,-78.73667) from the United States. Though the primary radar for the project is the King City radar, the raw data from all three radars are included in their native format (IRIS or Nexrad Level 2) and are intended for radar specialists. The data is available from May 1 2015 to Sept 30 2015. The scan strategy for each radar is different, with at least 10 minute scan cycles or better. The user should consult with a radar specialist for more details. Reflectivity and radial velocity images (presented as a pair) for the lowest elevation angle (0.5o) centred on a 128 km x 128 km box around from the King City radar are provided for general use. Besides their normal use as precipitation observations, they are particularly useful to identify Lake Breezes as weak linear reflectivity and as radial velocity discontinuity features for the entire period. The target providing the radar returns are insects. Analysis indicates the presence of Lake Breezes on 118 days, only 35 days did not have any kind of lake breeze-like features. Daily movies have been created. The format of the single images is PNG and the movies is an animated GIF. The data is organized by radar and by day in the following structure. The raw data is organized in the following directory structure: RADAR ->

Facebook

Twitter

Click to copy link

Link copied

Cite

Dairy Supply Chain Sales Dataset

Explore at:

3 scholarly articles cite this dataset (View in Google Scholar)

zip, pdfAvailable download formats

Unique identifier

https://doi.org/10.21227/smv6-z405

Dataset updated

Jul 12, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

1.Introduction

2. Citation

Please cite the following papers when using this dataset:

I. Siniosoglou, K. Xouveroudis, V. Argyriou, T. Lagkas, S. K. Goudos, K. E. Psannis and P. Sarigiannidis, "Evaluating the Effect of Volatile Federated Timeseries on Modern DNNs: Attention over Long/Short Memory," in the 12th International Conference on Circuits and Systems Technologies (MOCAST 2023), April 2023, Accepted

3. Dataset Modalities

3.1 Data Collection

The process of building this dataset involves several steps to ensure that the data is accurate, comprehensive and relevant.

The first step is to determine the specific data that is needed to support the business objectives of the industry, i.e., in this publication’s case the daily sales data.

File	Period	Number of Samples (days)
product 1 2020.xlsx	01/01/2020–31/12/2020	363
product 1 2021.xlsx	01/01/2021–31/12/2021	364
product 1 2022.xlsx	01/01/2022–31/12/2022	365
product 2 2020.xlsx	01/01/2020–31/12/2020	363
product 2 2021.xlsx	01/01/2021–31/12/2021	364
product 2 2022.xlsx	01/01/2022–31/12/2022	365
product 3 2020.xlsx	01/01/2020–31/12/2020	363
product 3 2021.xlsx	01/01/2021–31/12/2021	364
product 3 2022.xlsx	01/01/2022–31/12/2022	365
product 4 2020.xlsx	01/01/2020–31/12/2020	363
product 4 2021.xlsx	01/01/2021–31/12/2021	364
product 4 2022.xlsx	01/01/2022–31/12/2022	364
product 5 2020.xlsx	01/01/2020–31/12/2020	363
product 5 2021.xlsx	01/01/2021–31/12/2021	364
product 5 2022.xlsx	01/01/2022–31/12/2022	365
product 6 2020.xlsx	01/01/2020–31/12/2020	362
product 6 2021.xlsx	01/01/2021–31/12/2021	364
product 6 2022.xlsx	01/01/2022–31/12/2022	365
product 7 2020.xlsx	01/01/2020–31/12/2020	362
product 7 2021.xlsx	01/01/2021–31/12/2021	364
product 7 2022.xlsx	01/01/2022–31/12/2022	365

3.2 Dataset Overview

The following table enumerates and explains the features included across all of the included files.

Feature	Description	Unit
Day	day of the month	-
Month	Month	-
Year	Year	-
daily_unit_sales	Daily sales - the amount of products, measured in units, that during that specific day were sold	units
previous_year_daily_unit_sales	Previous Year’s sales - the amount of products, measured in units, that during that specific day were sold the previous year	units
percentage_difference_daily_unit_sales	The percentage difference between the two above values	%
daily_unit_sales_kg	The amount of products, measured in kilograms, that during that specific day were sold	kg
previous_year_daily_unit_sales_kg	Previous Year’s sales - the amount of products, measured in kilograms, that during that specific day were sold, the previous year	kg
percentage_difference_daily_unit_sales_kg	The percentage difference between the two above values	kg
daily_unit_returns_kg	The percentage of the products that were shipped to selling points and were returned	%
previous_year_daily_unit_returns_kg	The percentage of the products that were shipped to

Clear search

Close search

Google apps

Main menu

Dairy Supply Chain Sales Dataset

Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and...

Context and Methodology

Technical Details

Data Fields Description:

Software Requirements

Additional Resources

Market Basket Analysis

Market Basket Analysis

Introduction

An Example of Association Rules

Strategy

Dataset Description

Libraries in R

Data Pre-processing

Fundamental Data Record for Atmospheric Composition [ATMOS_L1B]

Data from: A large synthetic dataset for machine learning applications in...

Data generation algorithm

Network

Time series

Usage

Selecting a particular country

Averaging over time

Source code

Funding

ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture...

Dataset paper (public preprint)

Abstract

Summary

Programmatic Download

Data details

Data Variables

Version Changelog

Software to open netCDF files

References

Related Records

LYMob-4Cities: Multi-City Human Mobility Dataset

‘ 🚴 Bike Sharing Dataset’ analyzed by Analyst-2

About this dataset

Source:

Data Set Information:

Attribute Information:

Relevant Papers:

Citation Request:

How to use this dataset

Acknowledgements

Start A New Notebook!

Child and Adult Care Food Programs (CACFP) –Child Centers – Meal...

Data from: EHR data

‘Vehicle Miles Traveled During Covid-19 Lock-Downs ’ analyzed by Analyst-2

About this dataset

Overview

Findings

About This Data

Included Data

Additional Data Queries

Interactive

How to use this dataset

Acknowledgements

Start A New Notebook!

Primary documentation on the scientific study of indicators of continuous...

Long-term Continuous SIF-informed Photosynthesis Proxy reconstructed with...

Percentage of time IT security services are available

Executive Functioning Data

Executive Functioning Tasks

Lead Scoring Dataset

Context

Content

Acknowledgements

Inspiration

cnn_dailymail

Rain in Australia Dataset

Data from: Financial Fraud Detection Dataset

Performance Note:

Features:

Usage:

Data Summary:

License:

Acknowledgements

RADAR data - TO2015 Pan and Parapan American Games

Dairy Supply Chain Sales Dataset