In 2024, Target Corporation's food and beverage product segment generated sales of approximately 23.8 billion U.S. dollars. In contrast, the hardline segment, which includes electronics, toys, entertainment, sporting goods, and luggage, registered sales of 15.8 billion U.S. dollars. Target Corporation's total revenue that year amounted to around 106.6 billion U.S. dollars.
https://crawlfeeds.com/privacy_policy
Get access to a curated dataset of over 160,000 products from Target.com, all featuring a 30% or greater discount. This collection is ideal for anyone studying pricing trends, consumer deal behavior, or building retail pricing intelligence platforms.
The data spans categories including home goods, electronics, fashion, beauty, and personal care, offering insights into Target’s promotional strategies and markdown inventory.
Product Title & URL
Original & Discounted Prices
% Discount
Brand, Category
Image links, Description
Availability (in stock / out of stock)
Scraped Date
Build daily deal apps or deal newsletters
Monitor Target’s price drops and markdown strategy
Analyze clearance vs. everyday discount trends
Create dashboards for pricing analytics
Feed retail bots or price comparison engines
This dataset can be refreshed weekly or monthly upon request.
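If the collection is delivered as a CSV, a short pandas sketch like the one below can surface category-level markdown patterns; the file name and column names (e.g. category, discount_percent) are assumptions to be matched against the actual export:

```python
import pandas as pd

# Load the product export (file name is an assumption).
df = pd.read_csv("target_discounted_products.csv")

# Assumed column names: adjust to the actual schema of the export.
df["discount_percent"] = pd.to_numeric(df["discount_percent"], errors="coerce")

# Average and maximum discount per category, largest markdowns first.
summary = (
    df.groupby("category")["discount_percent"]
      .agg(["count", "mean", "max"])
      .sort_values("mean", ascending=False)
)
print(summary.head(10))
```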
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1. Introduction
Sales data collection is a crucial aspect of any manufacturing industry as it provides valuable insights about the performance of products, customer behaviour, and market trends. By gathering and analysing this data, manufacturers can make informed decisions about product development, pricing, and marketing strategies in Internet of Things (IoT) business environments like the dairy supply chain.
One of the most important benefits of the sales data collection process is that it allows manufacturers to identify their most successful products and target their efforts towards those areas. For example, if a manufacturer notices that a particular product is selling well in a certain region, this information can be used to develop new products, improve existing ones, or optimise the supply chain to meet the changing needs of customers.
This dataset includes information about 7 of MEVGAL’s products [1]. The published data will help researchers understand the dynamics of the dairy market and its consumption patterns, creating fertile ground for synergies between academia and industry and ultimately helping the industry make informed decisions regarding product development, pricing and marketing strategies in the IoT playground. The dataset can also be used to study the impact of various external factors on the dairy market, such as economic, environmental, and technological factors, and to understand the current state of the dairy industry and identify potential opportunities for growth and development.
2. Citation
Please cite the following papers when using this dataset:
3. Dataset Modalities
The dataset includes data regarding the daily sales of a series of dairy product codes offered by MEVGAL. In particular, the dataset includes information gathered by the logistics division and agencies within the industrial infrastructures overseeing the production of each product code. The products included in this dataset represent the daily sales and logistics of a variety of yogurt-based stock. Each file includes the logistics for one product on a daily basis for three years, from 2020 to 2022.
3.1 Data Collection
The process of building this dataset involves several steps to ensure that the data is accurate, comprehensive and relevant.
The first step is to determine the specific data that is needed to support the business objectives of the industry, i.e., in this publication’s case the daily sales data.
Once the data requirements have been identified, the next step is to implement an effective sales data collection method. In MEVGAL’s case this is conducted through direct communication and reports generated each day by representatives & selling points.
It is also important for MEVGAL to ensure that the data collection process is conducted in an ethical and compliant manner, adhering to data privacy laws and regulations. The industry also has a data management plan in place to ensure that the data is securely stored and protected from unauthorised access.
The published dataset consists of 13 features providing information about the date and the number of products sold. Finally, the dataset was anonymised in consideration of the privacy requirements of the data owner (MEVGAL).
| File | Period | Number of Samples (days) |
| --- | --- | --- |
| product 1 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 1 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 1 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 2 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 2 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 2 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 3 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 3 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 3 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 4 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 4 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 4 2022.xlsx | 01/01/2022–31/12/2022 | 364 |
| product 5 2020.xlsx | 01/01/2020–31/12/2020 | 363 |
| product 5 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 5 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 6 2020.xlsx | 01/01/2020–31/12/2020 | 362 |
| product 6 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 6 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
| product 7 2020.xlsx | 01/01/2020–31/12/2020 | 362 |
| product 7 2021.xlsx | 01/01/2021–31/12/2021 | 364 |
| product 7 2022.xlsx | 01/01/2022–31/12/2022 | 365 |
3.2 Dataset Overview
The following table lists and explains the features included in all of the files.
| Feature | Description | Unit |
| --- | --- | --- |
| Day | Day of the month | - |
| Month | Month | - |
| Year | Year | - |
| daily_unit_sales | Daily sales: the number of products, measured in units, sold on that specific day | units |
| previous_year_daily_unit_sales | Previous year's sales: the number of products, measured in units, sold on the same day of the previous year | units |
| percentage_difference_daily_unit_sales | The percentage difference between the two values above | % |
| daily_unit_sales_kg | The amount of products, measured in kilograms, sold on that specific day | kg |
| previous_year_daily_unit_sales_kg | Previous year's sales: the amount of products, measured in kilograms, sold on the same day of the previous year | kg |
| percentage_difference_daily_unit_sales_kg | The percentage difference between the two values above | % |
| daily_unit_returns_kg | The percentage of the products that were shipped to selling points and were returned | % |
| previous_year_daily_unit_returns_kg | The percentage of the products that were shipped to selling points and were returned, the previous year | % |
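As a quick illustration of how these columns relate, a pandas sketch can recompute the year-over-year percentage difference from the unit-sales columns; the file name comes from the table above, and the exact formula behind the published column is an assumption (the usual (current - previous) / previous definition):

```python
import pandas as pd

# Read one product-year file (file name taken from the table above).
df = pd.read_excel("product 1 2020.xlsx")

# Recompute the year-over-year percentage difference from the documented columns.
# Assumption: the published column uses (current - previous) / previous * 100.
recomputed = 100 * (
    df["daily_unit_sales"] - df["previous_year_daily_unit_sales"]
) / df["previous_year_daily_unit_sales"]

print(df["percentage_difference_daily_unit_sales"].head())
print(recomputed.head())
```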
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain:
The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.
Purpose:
The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.
How the Dataset Was Created:
The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.
Dataset Structure:
The dataset consists of three main files, each with its specific role:
Train:
This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).
https://handle.test.datacite.org/10.82556/yb6j-jw41
PID: b1c59499-9c6e-42c2-af8f-840181e809db
Test2:
The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models and to evaluate the accuracy of those predictions when the true sales data is unknown.
https://handle.test.datacite.org/10.82556/jerg-4b84
PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
Store:
This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.
https://handle.test.datacite.org/10.82556/nqeg-gy34
PID: 9627ec46-4ee6-4969-b14a-bda555fe34db
Id: A unique identifier for each (Store, Date) combination within the test set.
Store: A unique identifier for each store.
Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).
Customers: The number of customers visiting the store on a given day.
Open: An indicator of whether the store was open (1 = open, 0 = closed).
StateHoliday: Indicates if the day is a state holiday, with values like:
'a' = public holiday,
'b' = Easter holiday,
'c' = Christmas,
'0' = no holiday.
SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).
StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.
Assortment: Describes the level of product assortment in the store:
'a' = basic,
'b' = extra,
'c' = extended.
CompetitionDistance: Distance (in meters) to the nearest competitor store.
CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.
Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).
Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).
Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.
PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
To work with this dataset, you will need to have specific software installed, including:
DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.
Python Libraries: Key libraries for working with the dataset include:
- pandas for data manipulation,
- numpy for numerical operations,
- matplotlib and seaborn for data visualization,
- scikit-learn for machine learning algorithms.
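As an illustration of how these libraries fit together, the sketch below merges the historical sales with the store metadata and fits a simple model; train.csv follows the structure described above, while the store file name and the exact feature handling are assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error

# Load historical sales and store metadata (Kaggle-style file names are assumed).
train = pd.read_csv("train.csv", parse_dates=["Date"], low_memory=False)
store = pd.read_csv("store.csv")
df = train.merge(store, on="Store", how="left")

# Keep open days only and build a few simple features from the documented columns.
df = df[df["Open"] == 1].copy()
df["Month"] = df["Date"].dt.month
df["DayOfWeek"] = df["Date"].dt.dayofweek
features = ["Store", "Promo", "SchoolHoliday", "Month", "DayOfWeek", "CompetitionDistance"]
X = df[features].fillna(0)
y = df["Sales"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
print("Validation MAPE:", mean_absolute_percentage_error(y_val, model.predict(X_val)))
```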
Several additional resources are available for working with the dataset:
Presentation:
A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.
Jupyter Notebook:
A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.
Model Evaluation Results:
The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.
Trained Models (.pkl files):
The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.
sample_submission.csv:
This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.
These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. The dataset I was given contains a retailer's transaction data, covering all transactions that happened over a period of time. The retailer will use the results to grow its business and suggest itemsets to customers, which should increase customer engagement, improve the customer experience, and reveal customer behaviour. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most useful when you want to build associations between different objects in a set and to find frequent patterns in a transaction database. It can tell you which items customers frequently buy together and allows the retailer to identify relationships between items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
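A minimal Python sketch of these calculations, using the numbers from the toy example above:

```python
# Toy example: 100 customers, 10 bought a mouse, 9 bought a mat, 8 bought both.
n_customers = 100
n_mouse, n_mat, n_both = 10, 9, 8

support = n_both / n_customers                   # P(mouse and mat) = 0.08
confidence = support / (n_mouse / n_customers)   # support / P(mouse) = 0.80
lift = confidence / (n_mat / n_customers)        # confidence / P(mat) ≈ 8.9

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.1f}")
```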
Number of Attributes: 7
Screenshot: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries, each of which is briefly described below.
Screenshot: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
Screenshot: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
Screenshot: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we will clean our data frame by removing missing values.
Screenshot: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
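The original workflow continues in R; an equivalent sketch in Python with pandas and the mlxtend library is shown below, where the column names (BillNo, Itemname) are assumptions about the Excel file's schema:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Read the raw data and drop rows with missing values (column names are assumptions).
raw = pd.read_excel("Assignment-1_Data.xlsx").dropna(subset=["BillNo", "Itemname"])

# Group items by invoice so each transaction is a list of items bought together.
transactions = raw.groupby("BillNo")["Itemname"].apply(list).tolist()

# One-hot encode the transactions, then mine frequent itemsets and rules.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)
frequent_itemsets = apriori(onehot, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
print(rules.sort_values("lift", ascending=False).head())
```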
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousands of loads and several hundreds of generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using loads, generators, and lines profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
About the Dataset
This data set contains claims information for meal reimbursement for sites participating in CACFP as child centers for the program year 2023-2024. This includes Child Care Centers, At-Risk centers, Head Start sites, Outside School Hours sites, and Emergency Shelters. The CACFP program year begins October 1 and ends September 30.
This dataset only includes claims submitted by CACFP sites operating as child centers. Sites can participate in multiple CACFP sub-programs. Each record (row) represents monthly meals data for a single site and for a single CACFP center sub-program.
To filter data for a specific CACFP center Program, select "View Data" to open the Exploration Canvas filter tools. Select the program(s) of interest from the Program field. A filtering tutorial can be found HERE
For meals data on CACFP participants operating as Day Care Homes, Adult Day Care Centers, or child care centers for previous program years, please refer to the corresponding “Child and Adult Care Food Programs (CACFP) – Meal Reimbursement” dataset for that sub-program available on the State of Texas Open Data Portal.
An overview of all CACFP data available on the Texas Open Data Portal can be found at our TDA Data Overview - Child and Adult Care Food Programs page.
An overview of all TDA Food and Nutrition data available on the Texas Open Data Portal can be found at our TDA Data Overview - Food and Nutrition Open Data page.
More information about accessing and working with TDA data on the Texas Open Data Portal can be found on the SquareMeals.org website on the TDA Food and Nutrition Open Data page.
About Dataset Updates
TDA aims to post new program year data by December 15 of the active program year. Participants have 60 days to file monthly reimbursement claims. Dataset updates will occur daily until 90 days after the close of the program year. After 90 days from the close of the program year, the dataset will be updated at six months and one year from the close of the program year before becoming archived. Archived datasets will remain published but will not be updated. Any data posted during the active program year is subject to change.
About the Agency
The Texas Department of Agriculture administers 12 U.S. Department of Agriculture nutrition programs in Texas including the National School Lunch and School Breakfast Programs, the Child and Adult Care Food Programs (CACFP), and the summer meal programs. TDA’s Food and Nutrition division provides technical assistance and training resources to partners operating the programs and oversees the USDA reimbursements they receive to cover part of the cost associated with serving food in their facilities. By working to ensure these partners serve nutritious meals and snacks, the division adheres to its mission — Feeding the Hungry and Promoting Healthy Lifestyles.
For more information on these programs, please visit our website.
"
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Vehicle Miles Traveled During Covid-19 Lock-Downs’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/vehicle-miles-travelede on 13 February 2022.
--- Dataset description provided by original source is as follows ---
**This data set was last updated 3:30 PM ET Monday, January 4, 2021. The last date of data in this dataset is December 31, 2020.**
Overview
Data shows that mobility declined nationally since states and localities began shelter-in-place strategies to stem the spread of COVID-19. The numbers began climbing as more people ventured out and traveled further from their homes, but in parallel with the rise of COVID-19 cases in July, travel declined again.
This distribution contains county level data for vehicle miles traveled (VMT) from StreetLight Data, Inc, updated three times a week. This data offers a detailed look at estimates of how much people are moving around in each county.
The available data has a two-day lag: the most recent data is from two days prior to the update date. Going forward, this dataset will be updated by AP at 3:30pm ET on Monday, Wednesday and Friday each week.
This data has been made available to members of AP’s Data Distribution Program. To inquire about access for your organization - publishers, researchers, corporations, etc. - please click Request Access in the upper right corner of the page or email kromano@ap.org. Be sure to include your contact information and use case.
Findings
- Nationally, data shows that vehicle travel in the US has doubled compared to the seven-day period ending April 13, which was the lowest VMT since the COVID-19 crisis began. In early December, travel reached a low not seen since May, with a small rise leading up to the Christmas holiday.
- Average vehicle miles traveled continues to be below what would be expected without a pandemic - down 38% compared to January 2020. September 4 reported the largest single day estimate of vehicle miles traveled since March 14.
- New Jersey, Michigan and New York are among the states with the largest relative uptick in travel at this point of the pandemic - they report almost two times the miles traveled compared to their lowest seven-day period. However, travel in New Jersey and New York is still much lower than expected without a pandemic. Other states such as New Mexico, Vermont and West Virginia have rebounded the least.
About This Data
The county level data is provided by StreetLight Data, Inc, a transportation analysis firm that measures travel patterns across the U.S. The data is from their Vehicle Miles Traveled (VMT) Monitor which uses anonymized and aggregated data from smartphones and other GPS-enabled devices to provide county-by-county VMT metrics for more than 3,100 counties. The VMT Monitor provides an estimate of total vehicle miles travelled by residents of each county, each day since the COVID-19 crisis began (March 1, 2020), as well as a change from the baseline average daily VMT calculated for January 2020. Additional columns are calculations by AP.
Included Data
01_vmt_nation.csv - Data summarized to provide a nationwide look at vehicle miles traveled. Includes single day VMT across counties, daily percent change compared to January and seven day rolling averages to smooth out the trend lines over time.
02_vmt_state.csv - Data summarized to provide a statewide look at vehicle miles traveled. Includes single day VMT across counties, daily percent change compared to January and seven day rolling averages to smooth out the trend lines over time.
03_vmt_county.csv - Data providing a county level look at vehicle miles traveled. Includes VMT estimate, percent change compared to January and seven day rolling averages to smooth out the trend lines over time.
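These CSV files can be loaded directly with pandas; the sketch below mirrors the filter queries described in the next section (the state and county column names are assumptions based on the feature names mentioned further down, so match them to the actual headers):

```python
import pandas as pd

# Load the state- and county-level VMT tables named above.
state_vmt = pd.read_csv("02_vmt_state.csv")
county_vmt = pd.read_csv("03_vmt_county.csv")

# Filter for a specific state and for counties within that state.
# Column names ("State Name", "County Name") are assumptions; check the file headers.
nj_state = state_vmt[state_vmt["State Name"] == "New Jersey"]
nj_counties = county_vmt[county_vmt["State Name"] == "New Jersey"]
one_county = county_vmt[county_vmt["County Name"] == "Bergen County"]  # example county
print(len(nj_state), len(nj_counties), len(one_county))
```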
Additional Data Queries
* Filter for specific state - filters 02_vmt_state.csv daily data for a specific state.
* Filter counties by state - filters 03_vmt_county.csv daily data for counties in a specific state.
* Filter for specific county - filters 03_vmt_county.csv daily data for a specific county.

Interactive
The AP has designed an interactive map to show the percent change in vehicle miles traveled by county since each county's lowest point during the pandemic.
This dataset was created by Angeliki Kastanis and contains around 0 samples along with Date At Low, Mean7 County Vmt At Low, technical information and other features such as: - County Name - County Fips - and more.
- Analyze State Name in relation to Baseline Jan Vmt
- Study the influence of Date At Low on Mean7 County Vmt At Low
- More datasets
If you use this dataset in your research, please credit Angeliki Kastanis
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such a gap-filling method is that it relies only on the original observational record, without the need for ancillary variables or model-based information. Due to this intrinsic challenge, no global, long-term, univariate gap-filled product has been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.
You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"
# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
echo "Downloading $year.zip..."
wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
Changes in v9.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conformant metadata standards for netCDF files, such as:
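For example, in Python a daily file can be opened with the xarray library (a sketch; the data variable names are not listed here, so they are printed after opening rather than assumed):

```python
import xarray as xr

# Open one daily gap-filled soil moisture image; the file name follows the
# naming convention given above (here: 1 January 2020).
ds = xr.open_dataset(
    "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc"
)

# Inspect the available data variables and the global attributes.
print(ds.data_vars)
print(ds.attrs)
```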
The following records are all part of the Soil Moisture Climate Data Records from satellites community
1. ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
https://earth.esa.int/eogateway/documents/20142/1564626/Terms-and-Conditions-for-the-use-of-ESA-Data.pdf
The Fundamental Data Record (FDR) for Atmospheric Composition UVN v.1.0 dataset is a cross-instrument Level-1 product [ATMOS_L1B] generated in 2023 and resulting from the ESA FDR4ATMOS project. The FDR contains selected Earth Observation Level 1b parameters (irradiance/reflectance) from the nadir-looking measurements of the ERS-2 GOME and Envisat SCIAMACHY missions for the period ranging from 1995 to 2012. The data record offers harmonised cross-calibrated spectra with a focus on spectral windows in the Ultraviolet-Visible-Near Infrared regions for the retrieval of critical atmospheric constituents like ozone (O3), sulphur dioxide (SO2) and nitrogen dioxide (NO2) column densities, alongside cloud parameters.

The FDR4ATMOS products should be regarded as experimental due to the innovative approach and the current use of a limited-sized test dataset to investigate the impact of harmonization on the Level 2 target species, specifically SO2, O3 and NO2. Presently, this analysis is being carried out within follow-on activities. The FDR4ATMOS V1 is currently being extended to include the MetOp GOME-2 series.

Product format

For many aspects, the FDR product has improved compared to the existing individual mission datasets:
- GOME solar irradiances are harmonised using a validated SCIAMACHY solar reference spectrum, solving the problem of the fast-changing etalon present in the original GOME Level 1b data;
- Reflectances for both GOME and SCIAMACHY are provided in the FDR product. GOME reflectances are harmonised to degradation-corrected SCIAMACHY values, using collocated data from the CEOS PIC sites;
- SCIAMACHY data are scaled to the lowest integration time within the spectral band using high-frequency PMD measurements from the same wavelength range. This simplifies the use of the SCIAMACHY spectra, which were split into a complex cluster structure (each with its own integration time) in the original Level 1b data;
- The harmonization process applied mitigates the viewing angle dependency observed in the UV spectral region for GOME data;
- Uncertainties are provided.

Each FDR product provides, within the same file, irradiance/reflectance data for the UV-VIS-NIR spectral regions across all orbits on a single day, including information from the individual ERS-2 GOME and Envisat SCIAMACHY measurements.

The FDR has been generated in two formats, Level 1A and Level 1B, targeting expert users and nominal applications respectively. The Level 1A [ATMOS_L1A] data include additional parameters such as harmonisation factors, PMD, and polarisation data extracted from the original mission Level 1 products. The ATMOS_L1A dataset is not part of the nominal dissemination to users. In case of specific requirements, please contact EOHelp.

Please refer to the README file for essential guidance before using the data. All the new products are conveniently formatted in NetCDF. Free standard tools, such as Panoply, can be used to read NetCDF data. Panoply is sourced and updated by external entities. For further details, please consult our Terms and Conditions page.

Uncertainty characterisation

One of the main aspects of the project was the characterization of Level 1 uncertainties for both instruments, based on metrological best practices. The following documents are provided:
- General guidance on a metrological approach to Fundamental Data Records (FDR)
- Uncertainty Characterisation document
- Effect tables
- NetCDF files containing example uncertainty propagation analysis and spectral error correlation matrices for SCIAMACHY (Atlantic and Mauretania scenes for 2003 and 2010) and GOME (Atlantic scene for 2003): reflectance_uncertainty_example_FDR4ATMOS_GOME.nc and reflectance_uncertainty_example_FDR4ATMOS_SCIA.nc

Known Issues

Non-monotonous wavelength axis for SCIAMACHY in FDR data version 1.0: in the SCIAMACHY OBSERVATION group of the atmospheric FDR v1.0 dataset (DOI: 10.5270/ESA-852456e), the wavelength axis (lambda variable) is not monotonically increasing. This issue affects all spectral channels (UV, VIS, NIR) in the SCIAMACHY group, while GOME OBSERVATION data remain unaffected. The root cause of the issue lies in the incorrect indexing of the lambda variable during the NetCDF writing process. Notably, the wavelength values themselves are calculated correctly within the processing chain.

Temporary Workaround

The wavelength axis is correct in the first record of each product. As a workaround, users can extract the wavelength axis from the first record and apply it to all subsequent measurements within the same product. The first record can be retrieved by setting the first two indices (time and scanline) to 0 (assuming counting of array indices starts at 0). Note that this process must be repeated separately for each spectral range (UV, VIS, NIR) and every daily product. Since the wavelength axis of SCIAMACHY is highly stable over time, using the first record introduces no expected impact on retrieval results.

Python pseudo-code example: lambda_...
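The pseudo-code reference above is truncated in this copy. Purely as an illustration of the described workaround, here is a hedged netCDF4-based sketch; the file name, group name and dimension layout are assumptions inferred from the text, not the product's documented structure:

```python
from netCDF4 import Dataset
import numpy as np

# Open one daily FDR product (file name is a placeholder, not a real product name).
with Dataset("FDR4ATMOS_daily_product.nc") as nc:
    # Assumption: the SCIAMACHY OBSERVATION group for one spectral range (e.g. UV)
    # holds a 'lambda' variable indexed as (time, scanline, spectral_pixel).
    group = nc["SCIAMACHY_OBSERVATION_UV"]            # assumed group name
    lam = np.asarray(group.variables["lambda"][:])    # full wavelength axis array

    # Workaround from the description: only the first record (time=0, scanline=0)
    # carries a correct axis, so reuse it for every other measurement in the product.
    lam_fixed = np.broadcast_to(lam[0, 0], lam.shape).copy()
```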
An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.
The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses or fill up a form for the course or watch some videos. When these people fill up a form providing their email address or phone number, they are classified to be a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X education is around 30%.
Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if, say, they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify this set of leads, the lead conversion rate should go up as the sales team will now be focusing more on communicating with the potential leads rather than making calls to everyone.
There are a lot of leads generated in the initial stage (top) but only a few of them come out as paying customers from the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc. ) in order to get a higher lead conversion.
X Education wants to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead such that customers with a higher lead score have a higher conversion chance and customers with a lower lead score have a lower conversion chance. The CEO, in particular, has given a ballpark of the target lead conversion rate to be around 80%.
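As an illustration of such lead scoring (a sketch, not the case study's prescribed solution), a logistic regression can turn predicted conversion probabilities into scores; the file name is an assumption, and only a few of the columns documented in the variable list below are used:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the leads data (file name is an assumption) and pick a few documented numeric columns.
leads = pd.read_csv("Leads.csv")
features = ["TotalVisits", "Total Time Spent on Website", "Page Views Per Visit"]
X = leads[features].fillna(0)
y = leads["Converted"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Lead score: predicted conversion probability scaled to 0-100.
leads["lead_score"] = (model.predict_proba(X)[:, 1] * 100).round().astype(int)
print(leads[["Lead Number", "lead_score"]].head())
```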
Variables Description
* Prospect ID - A unique ID with which the customer is identified.
* Lead Number - A lead number assigned to each lead procured.
* Lead Origin - The origin identifier with which the customer was identified to be a lead. Includes API, Landing Page Submission, etc.
* Lead Source - The source of the lead. Includes Google, Organic Search, Olark Chat, etc.
* Do Not Email - An indicator variable selected by the customer indicating whether or not they want to be emailed about the course.
* Do Not Call - An indicator variable selected by the customer indicating whether or not they want to be called about the course.
* Converted - The target variable. Indicates whether a lead has been successfully converted or not.
* TotalVisits - The total number of visits made by the customer on the website.
* Total Time Spent on Website - The total time spent by the customer on the website.
* Page Views Per Visit - Average number of pages on the website viewed during the visits.
* Last Activity - Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.
* Country - The country of the customer.
* Specialization - The industry domain in which the customer worked before. Includes the level 'Select Specialization' which means the customer had not selected this option while filling the form.
* How did you hear about X Education - The source from which the customer heard about X Education.
* What is your current occupation - Indicates whether the customer is a student, unemployed or employed.
* What matters most to you in choosing this course - An option selected by the customer indicating their main motive for taking the course.
* Search - Indicating whether the customer had seen the ad in any of the listed items.
* Magazine
* Newspaper Article
* X Education Forums
* Newspaper
* Digital Advertisement
* Through Recommendations - Indicates whether the customer came in through recommendations.
* Receive More Updates About Our Courses - Indicates whether the customer chose to receive more updates about the courses.
* Tags - Tags assigned to customers indicating the current status of the lead.
* Lead Quality - Indicates the quality of the lead based on the data and on the intuition of the employee assigned to the lead.
* Update me on Supply Chain Content - Indicates whether the customer wants updates on the Supply Chain Content.
* Get updates on DM Content - Indicates whether the customer wants updates on the DM Content.
* Lead Profile - A lead level assigned to each customer based on their profile.
* City - The city of the customer.
* Asymmetric Activity Index - An index and score assigned to each customer based on their activity and their profile
* Asymmetric Profile Index
* Asymmetric Activity Score
* Asymmetric Profile Score
* I agree to pay the amount through cheque - Indicates whether the customer has agreed to pay the amount through cheque or not.
* a free copy of Mastering The Interview - Indicates whether the customer wants a free copy of 'Mastering the Interview' or not.
* Last Notable Activity - The last notable activity performed by the student.
UpGrad Case Study
This dataset was created as part of the following study, which was published in the Journal of Hydrology: A new framework for experimental design using Bayesian Evidential Learning: the case of wellhead protection area https://doi.org/10.1016/j.jhydrol.2021.126903. The pre-print is available on arXiv: https://arxiv.org/pdf/2105.05539.pdf
Files description
This dataset contains 4148 simulation results, i.e., 4148 pairs of predictor/target. bkt.npy contains the breakthrough curves from all 6 injection wells recorded at the pumping well. pz.npy contains the 2D coordinates of the backtracked particles' end points, used to delineate the WHPA.
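A minimal NumPy sketch for loading the two arrays (the array shapes are not stated above, so the sketch prints them rather than assuming them):

```python
import numpy as np

# Predictor: breakthrough curves from the 6 injection wells recorded at the pumping well.
bkt = np.load("bkt.npy")
# Target: 2D end-point coordinates of the backtracked particles (used to delineate the WHPA).
pz = np.load("pz.npy")

print("bkt shape:", bkt.shape)  # one entry per simulation (4148 pairs in total)
print("pz shape:", pz.shape)
```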
Introduction
The Wellhead Protection Area (WHPA) is a zone around a pumping well where human activities are limited in order to preserve water resources, usually defined by how long dangerous chemicals released in the area would take to reach the pumping well (according to local regulation). It is determined by the flow velocity in the subsurface around the well and can be computed numerically using particle tracking or transport simulation, or in practice using tracer testing. A groundwater model is typically calibrated against field data before being used to calculate the WHPA. In highly populated places where land occupation is a big concern, the introduction of such zones can have a large socioeconomic impact.
WHPA prediction
Tracers are released from six data sources (injection wells) scattered around the pumping well: individual tracers are injected into the system in order to predict their transport and record their breakthrough curves (BCs) at the pumping well location. Numerous particles are artificially positioned around the pumping well, and their origins are traced backward in time to identify the associated WHPA.
Our predictor and target are generated using the USGS open-source finite-difference code Modflow. To get different sets of predictors and targets, we run different hydrologic models with one variable parameter, namely the hydraulic conductivity in metres per day. To obtain satisfactory heterogeneity in the hydraulic conductivity fields, which controls the shape and extent of our target (the PAs), we use sequential Gaussian simulation based on arbitrarily defined variograms. The pumping well is located at the (1000 m, 500 m) mark and is surrounded by the six injection wells.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ActiSynValidator data set is an aggregated activity data set derived from the HETUS 2010 time use survey. It is intended to enable reusable and reproducible validation of various behavior models.
The ActiSynValidator software framework belongs to this data set, and together they can be used to validate activity profiles, e.g. the results of an occupant behavior model. It provides modules for preprocessing and categorising activity profiles, and comparing them to the statistics in this data set using indicators and plots. It also contains the code that was used to create this data set out of the HETUS 2010 data, so that the generation of this data set is fully reproducible.
The HETUS data set consists of many single-day activity profiles. These cannot be made publicly accessible due to data protection regulations. The idea of the ActiSynValidator data set is to aggregate these activity profiles using a meaningful classification, to provide behavior statistics for different types of activity profiles. For that, the attributes country, sex, work status, and day type are used.
Human behavior is complex, and in order to thoroughly validate it, multiple aspects have to be taken into account. Therefore, the ActiSynValidator data set contains distributions for duration and frequency of each activity, as well as the temporal distribution throughout the day. For that purpose, a set of 15 common activity groups is defined. The mapping from the 108 activity codes used in HETUS 2010 is provided as part of the validation framework.
For convenience, the ActiSynValidator data set is provided both as .tar.gz and as .zip archive. Both files contain the same content, the full activity validation data set.
Additionally, the document ActiSynValidator_data_set_description.pdf contains a more thorough description of the data set, including its file structure, the content and meaning of its files, and examples on how to use it.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🔍 Dataset Overview
Each patient in the dataset has 30 days of continuous health data. The goal is to predict if a patient will progress to a critical condition based on their vital signs, medication adherence, and symptoms recorded daily.
There are 10 columns in the dataset:
| Column Name | Description |
| --- | --- |
| patient_id | Unique identifier for each patient. |
| day | Day number (from 1 to 30) indicating sequential daily records. |
| bp_systolic | Systolic blood pressure (top number) in mm Hg. Higher values may indicate hypertension. |
| bp_diastolic | Diastolic blood pressure (bottom number) in mm Hg. |
| heart_rate | Heartbeats per minute. An elevated heart rate can signal stress, infection, or deterioration. |
| respiratory_rate | Breaths per minute. Elevated rates can indicate respiratory distress. |
| temperature | Body temperature in °F. Fever or hypothermia are signs of infection or inflammation. |
| oxygen_saturation | Percentage of oxygen in blood. Lower values (< 94%) are concerning. |
| med_adherence | Patient’s medication adherence (between 0 and 1). Lower values may contribute to worsening. |
| symptom_severity | Subjective symptom rating (scale of 1–10). Higher means worse condition. |
| progressed_to_critical | Target label: 1 if the patient deteriorated to a critical condition, else 0. |

🎯 Final Task (Prediction Objective)
Problem Type: Binary classification with time-series data.
Goal: Train deep learning models (RNN, LSTM, GRU) to learn temporal patterns from a patient's 30-day health history and predict whether the patient will progress to a critical condition.
📈 How the Data is Used for Modeling
Input: A 3D array shaped as (num_patients, 30, 8), where 30 = number of days (timesteps) and 8 = features per day (excluding ID, day, and target).
Output: A binary label for each patient (0 or 1).
🔄 Feature Contribution to Prediction
| Feature | Why It Matters |
| --- | --- |
| bp_systolic/dia | Persistently high or rising BP may signal stress, cardiac issues, or deterioration. |
| heart_rate | A rising heart rate can indicate fever, infection, or organ distress. |
| respiratory_rate | Often increases early in critical illnesses like sepsis or COVID. |
| temperature | Fever is a key sign of infection. Chronic low/high temp may indicate underlying pathology. |
| oxygen_saturation | A declining oxygen level is a strong predictor of respiratory failure. |
| med_adherence | Poor medication adherence is often linked to worsening chronic conditions. |
| symptom_severity | Patient-reported worsening symptoms may precede measurable physiological changes. |

🛠 Tools You’ll Use
| Task | Tool/Technique |
| --- | --- |
| Data processing | Pandas, NumPy, Scikit-learn |
| Time series modeling | Keras (using SimpleRNN, LSTM, GRU) |
| Evaluation | Accuracy, Loss, ROC Curve (optional) |
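A minimal Keras sketch of this setup, using random placeholder arrays in place of the real preprocessed data (all variable names here are illustrative):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# X: (num_patients, 30, 8) scaled daily features; y: (num_patients,) binary labels.
# Random placeholders stand in for the real preprocessed arrays.
X = np.random.rand(200, 30, 8).astype("float32")
y = np.random.randint(0, 2, size=200)

model = Sequential([
    LSTM(32, input_shape=(30, 8)),      # learn temporal patterns over the 30-day window
    Dense(1, activation="sigmoid"),     # probability of progressing to a critical condition
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)
```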
This indicator reports on the SSC Enterprise IT security services that are part of the SSC service catalogue. This indicator is an aggregation of myKEY, Secure Remote Access and External Credential Management. These services best represent CITS’ ability to secure IT infrastructure.
Calculation / formula: Numerator: Total time (# hours, minutes, and seconds) the infrastructure security services are available (i.e. Up-Time) in assessment period (day, week, month, and year) multiplied by the number of services, and multiplied by the number of applicable customer departments.
Denominator: Total time (# hours, minutes and seconds) in assessment period (day, week, month, and year) multiplied by the number of services, and multiplied by the number of applicable customer departments.
The trend should be interpreted such that a higher % represents progress toward the target. Once the target has been reached, any additional % represents excellence.
The % of availability excludes maintenance windows. For example, if there were a 1-day outage of an IS service (e.g. myKey) in a 31-day month, the Up-Time would be: 24 hrs/day x 30 days/month x 1 service x 40 customers = 28,800 hours (per month), which is the Numerator. The Total time for the reporting period (in this case, monthly) is: 24 hrs/day x 31 days/month x 1 service x 40 customers = 29,760 hours, which is the Denominator. The Numerator over the Denominator gives the % of time the IS service is available: 28,800/29,760 = 96.8%.
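A small Python sketch of this calculation, using the numbers from the example above:

```python
def availability_percent(uptime_hours: float, period_hours: float,
                         services: int, customers: int) -> float:
    """Availability = (uptime x services x customers) / (period x services x customers) x 100."""
    numerator = uptime_hours * services * customers
    denominator = period_hours * services * customers
    return 100 * numerator / denominator

# Example from the text: a 1-day outage of one service across 40 customer departments
# in a 31-day month (24 x 30 = 720 uptime hours vs. 24 x 31 = 744 total hours).
print(round(availability_percent(24 * 30, 24 * 31, 1, 40), 1))  # 96.8
```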
Target: 99.8%.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The main purpose of creating the electronic database was to evaluate the performance of continuous glucose monitoring (CGM) and flash monitoring (FMS) in children and adolescents diagnosed with type 1 diabetes mellitus. The database is intended for entering, systematizing, storing and displaying patient data (date of birth, age, date of diagnosis of type 1 diabetes mellitus, duration of illness, date of the first visit to an endocrinologist with installation of a CGM or FMS device), glycated hemoglobin values at baseline, during the study and at the end, as well as CGM and FMS data (average glucose level, glycemic variability, percentage of readings above the target range, percentage within the target range, percentage below the target range, number of hypoglycemic episodes and their average duration, frequency of daily scans and frequency of sensor readings).
The database is the basis for comparative statistical analysis of dynamic monitoring indicators in groups of patients with the presence or absence of diabetic complications (neuropathy, retinopathy and nephropathy). The database presents the results of a prospective, open, controlled, clinical study obtained over a year and a half. The database includes information on 307 patients (adolescent children) aged 3 to 17 years inclusive. During the study, the observed patients were divided into two groups: Group 1 - patients diagnosed with type 1 diabetes mellitus and with diabetic complications, 152 people, Group 2 – patients diagnosed with type 1 diabetes mellitus and with no diabetic complications, 155 people. All registrants of the database were assigned individual codes, which made it possible to exclude personal data (full name) from the database.
The database is implemented in Microsoft Office Excel and takes the form of a depersonalized summary table consisting of two sheets (patients of groups 1 and 2), structured according to the following sections:
- "Patient number"; "Patient code"; "Date of birth"; "Age of the patient";
- "Date of diagnosis of DM1" - the date of the official diagnosis of type 1 diabetes mellitus at the patient's first hospitalization, borrowed from medical information systems;
- "Length of service DM1" - the duration of the patient's illness;
- "Date of the first visit" - the date of the registrant's visit to the endocrinologist with installation of FMS/CGM devices;
- "Frequency of self-monitoring with a glucometer" - how often the patient measured blood glucose at home with a glucometer before FMS/CGM was installed.
Sections "HbA1c initially (GMI)", "HbA1c (GMI)", "HbA1c final (GMI)", display the indicators of the level of glycated hemoglobin from the total for the period of the beginning of the study, at the intermediate stages of the study and at the end of observation.
The database structure has a number of sections accumulating information obtained with CGM/FMS, in particular:
- "Average glucose level";
- "% above the target range" - the percentage of the day the patient spent with glycemia above the target values;
- "% within the target range" - the percentage of the day the patient spent within the target glycemia values;
- "% below the target range" - the percentage of the day the patient spent with glycemia below the target values;
- "Hypoglycemic phenomena" - the number of cases of hypoglycemia in the patient within 2 weeks;
- "Average duration" - the average duration of the hypoglycemic episodes registered in the patient;
- "Sensor data received" - the percentage of time the patient was with an active device sensor;
- "Daily scans" - the frequency of scans of the patient's glycemic level (times per day);
- "%CV" - the variability of the patient's glycemia recorded by the device.
The listed sections are repeated in the database in accordance with the number of follow-up visits.
The database also contains a "Mid. values" section, which holds the average values of the patient data for all of the above sections, in both the first and the second group of patients.
When working with the database, filters (on the "Data" tab) on the indicator names make it convenient to enter information about new registrants or correct existing data, as well as to sort and search by one or more specified indicators.
The electronic database makes it possible to systematize a large volume of results, distribute data into categories, search on any field or set of fields in the input format, organize the selected array, use the data directly for statistical analysis, and view and print information matching specified conditions with the fields arranged in a convenient order.
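The following minimal pandas sketch shows one way the two-sheet workbook could be loaded for the comparative analysis described above; the file name, sheet names, and column labels below are assumptions for illustration and should be replaced with the actual ones used in the workbook.

```python
import pandas as pd

# Hypothetical file and sheet names; adjust to the actual workbook.
WORKBOOK = "cgm_fms_database.xlsx"
SHEETS = {"Group 1 (with complications)": "Group1",
          "Group 2 (without complications)": "Group2"}

# Indicator columns assumed to match the section names described above.
INDICATORS = ["Average glucose level", "% above the target range",
              "% within the target range", "% below the target range"]

for label, sheet in SHEETS.items():
    df = pd.read_excel(WORKBOOK, sheet_name=sheet)
    # Mean and standard deviation of each CGM/FMS indicator in this group.
    print(label)
    print(df[INDICATORS].agg(["mean", "std"]).round(2))
```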
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ETHOS.ActivityAssure dataset is an aggregated activity dataset derived from the HETUS 2010 time use survey. It is intended to enable reusable and reproducible validation of various behavior models.
The ETHOS.ActivityAssure software framework accompanies this dataset; together they can be used to validate activity profiles, e.g. the results of an occupant behavior model. The framework provides modules for preprocessing and categorising activity profiles and comparing them to the statistics in this dataset using indicators and plots. It also contains the code that was used to create this dataset from the HETUS 2010 data, so that the generation of this dataset is fully reproducible.
The HETUS dataset consists of many single-day activity profiles. These cannot be made publicly accessible due to data protection regulations. The idea of the ETHOS.ActivityAssure dataset is to aggregate these activity profiles using a meaningful classification, to provide behavior statistics for different types of activity profiles. For that, the attributes country, sex, work status, and day type are used.
Human behavior is complex, and in order to thoroughly validate it, multiple aspects have to be taken into account. Therefore, the ETHOS.ActivityAssure dataset contains distributions for the duration and frequency of each activity, as well as the temporal distribution throughout the day. For that purpose, a set of 15 common activity groups is defined. The mapping from the 108 activity codes used in HETUS 2010 is provided as part of the validation framework.
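To make the categorisation step more concrete, the following Python sketch shows how a single-day activity profile might be mapped to activity groups and aggregated into duration and frequency counts. The mapping excerpt and the 10-minute slot length are illustrative assumptions only; the actual 108-code mapping and resolution are defined by the validation framework.

```python
from collections import Counter, defaultdict

# Hypothetical excerpt of a code-to-group mapping; the real mapping of all
# 108 HETUS 2010 codes ships with the ETHOS.ActivityAssure framework.
CODE_TO_GROUP = {
    "110": "sleep",
    "210": "work",
    "310": "cook",
    "510": "eat",
    "821": "watch tv",
}

def profile_statistics(profile, slot_minutes=10):
    """Aggregate a single-day activity profile (a list of HETUS-style codes,
    one per time slot) into per-group duration and episode counts."""
    durations = defaultdict(int)   # total minutes per activity group
    frequencies = Counter()        # number of contiguous episodes per group
    previous = None
    for code in profile:
        group = CODE_TO_GROUP.get(code, "other")
        durations[group] += slot_minutes
        if group != previous:
            frequencies[group] += 1
        previous = group
    return durations, frequencies

# Example: a toy six-slot morning profile.
durations, frequencies = profile_statistics(["110", "110", "310", "510", "210", "210"])
print(dict(durations), dict(frequencies))
```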
For convenience, the ETHOS.ActivityAssure dataset is provided both as .tar.gz and as .zip archive. Both files contain the same content, the full activity validation dataset.
Additionally, the document ActivityAssure_data_set_description.pdf contains a more thorough description of the dataset, including its file structure, the content and meaning of its files, and examples on how to use it.
About the Dataset
This dataset contains site-level meal counts from approved TDA claims for Seamless Summer Option (SSO) and Summer Food Service Program (SFSP) in summer 2023. Summer meal programs typically operate mid-May through August unless otherwise noted. Participants have 60 days from the final meal service day of the month to submit claims to TDA.
Meal count information for individual summer meal programs can be found as filtered views of this dataset on our TDA Data Overview - Summer Meals Programs page.
Meal reimbursement data is collected at the sponsor/CE level and is reported in the "Summer Meal Programs - Seamless Summer Option (SSO) - Meal Reimbursements" and “Summer Meal Programs – Summer Food Service Program (SFSP) – Meal Reimbursements” datasets found on the Summer Meal Program Data Overview page.
An overview of all Summer Meal Program data available on the Texas Open Data Portal can be found at our TDA Data Overview - Summer Meals Programs page.
An overview of all TDA Food and Nutrition data available on the Texas Open Data Portal can be found at our TDA Data Overview - Food and Nutrition Open Data page.
About Dataset Updates
TDA aims to post new program year data by July 15 of the active program period. Participants have 60 days to submit claims. Data updates will occur daily and end 90 days after the close of the program year. Any data posted during the active program year is subject to change.
About the Agency
The Texas Department of Agriculture administers 12 U.S. Department of Agriculture nutrition programs in Texas including the National School Lunch and School Breakfast Programs, the Child and Adult Care Food Programs (CACFP), and summer meal programs. TDA’s Food and Nutrition division provides technical assistance and training resources to partners operating the programs and oversees the USDA reimbursements they receive to cover part of the cost associated with serving food in their facilities. By working to ensure these partners serve nutritious meals and snacks, the division adheres to its mission — Feeding the Hungry and Promoting Healthy Lifestyles.
For more information on these programs, please visit our website.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CNN/DailyMail non-anonymized summarization dataset.
There are two features:
- article: text of the news article, used as the document to be summarized
- highlights: joined text of the highlights, with <s> and </s> around each highlight, which is the target summary
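A minimal loading sketch, assuming the TensorFlow Datasets build of this corpus (registered as cnn_dailymail) and an environment with tensorflow_datasets installed, could look like this:

```python
import tensorflow_datasets as tfds

# Load the non-anonymized CNN/DailyMail summarization dataset.
ds = tfds.load("cnn_dailymail", split="train")

# Each example is a dict with the two features described above.
for example in ds.take(1):
    article = example["article"].numpy().decode("utf-8")
    highlights = example["highlights"].numpy().decode("utf-8")
    print(article[:200], "...")
    print("SUMMARY:", highlights)
```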
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains all data used during the evaluation of trace meaning preservation. The archives are protected by the password "trace-share" to avoid false detection by antivirus software.
For more information, see the project repository at https://github.com/Trace-Share.
Selected Attack Traces
The following list contains trace datasets used for evaluation. Each attack was chosen to have not only a different meaning but also different statistical properties.
dos_http_flood — the capture of GET and POST requests sent to one server by one attacker (HTTP traffic);
ftp_bruteforce — a short and unsuccessful attempt to guess a user's password for the FTP service (FTP traffic);
ponyloader_botnet — the Pony Loader botnet used for stealing credentials from 3 target devices reporting to a single IP with a large number of intermediate addresses (DNS and HTTP traffic);
scan — the capture of the nmap tool scanning a given subnet using ICMP echo and TCP SYN requests (consists of ARP, ICMP, and TCP traffic);
wannacry_ransomware — the capture of WannaCry ransomware spreading in a domain with three workstations, a domain controller, and a file-sharing server (SMB and SMBv2 traffic).
Background Traffic Data
The publicly available CSE-CIC-IDS-2018 dataset was used as the background traffic data. The evaluation uses data from the day Thursday-01-03-2018, which contains a sufficient proportion of regular traffic without any statistically significant attacks. Only traffic aimed at the victim machines (range 172.31.69.0/24) is used, to reduce less significant traffic.
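As an illustration of this kind of subnet filtering (a sketch, not the project's own tooling), a short scapy snippet could look like the following; the input and output file names are placeholders.

```python
from ipaddress import ip_address, ip_network
from scapy.all import IP, PcapReader, PcapWriter

VICTIM_NET = ip_network("172.31.69.0/24")

# Keep only packets destined for the victim subnet; file names are placeholders.
with PcapReader("background_full.pcap") as reader, \
        PcapWriter("background_victims.pcap", sync=True) as writer:
    for packet in reader:
        if IP in packet and ip_address(packet[IP].dst) in VICTIM_NET:
            writer.write(packet)
```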
Evaluation Results and Dataset Structure
Trace variants (traces.zip)
./traces-original/ — trace PCAP files and crawled details in YAML format;
./traces-normalized/ — normalized PCAP files and details in YAML format;
./traces-adjusted/ — adjusted PCAP files using various timestamp generation settings, the combination configuration in YAML format, and labels provided by ID2T in XML format.
Extracted alerts (alerts.zip)
./alerts-original/ — extracted Suricata alerts, Suricata log, and full Suricata output for all original trace files;
./alerts-normalized/ — extracted Suricata alerts, Suricata log, and full Suricata output for all normalized trace files;
./alerts-adjusted/ — extracted Suricata alerts, Suricata log, and full Suricata output for all adjusted trace files.
Evaluation results
*.csv files in the root directory — contain the extracted alert signatures and their counts for each trace variant.
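To compare how well alerts are preserved across the trace variants, the per-variant signature counts in these CSV files could be combined, for example with pandas; the column names used below (signature, trace_variant, count) are assumptions and should be checked against the actual CSV headers.

```python
import glob
import pandas as pd

# Load every evaluation CSV from the archive root into one table.
frames = []
for path in glob.glob("*.csv"):
    df = pd.read_csv(path)
    df["source_file"] = path
    frames.append(df)
alerts = pd.concat(frames, ignore_index=True)

# Pivot alert signature counts per trace variant to check whether normalized
# and adjusted traces trigger the same alerts as the original ones.
# Column names are assumptions; adjust to the actual CSV headers.
pivot = alerts.pivot_table(index="signature", columns="trace_variant",
                           values="count", aggfunc="sum", fill_value=0)
print(pivot)
```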