30 datasets found
  1. Target: sales in the U.S. 2017-2024, by product category

    • statista.com
    Updated Apr 4, 2025
    Cite
    Statista (2025). Target: sales in the U.S. 2017-2024, by product category [Dataset]. https://www.statista.com/statistics/1113245/target-sales-by-product-segment-in-the-us/
    Explore at:
    Dataset updated
    Apr 4, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    United States
    Description

    In 2024, Target Corporation's food and beverage product segment generated sales of approximately 23.8 billion U.S. dollars. In contrast, the hardline segment, which includes electronics, toys, entertainment, sporting goods, and luggage, registered sales of 15.8 billion U.S. dollars. Target Corporation's total revenues amounted to around 106.6 billion U.S. dollars that year.

  2. Target Discounted Products Dataset (30%+ Off) — 160K+ Items

    • crawlfeeds.com
    csv, zip
    Updated Jun 30, 2025
    Cite
    Crawl Feeds (2025). Target Discounted Products Dataset (30%+ Off) — 160K+ Items [Dataset]. https://crawlfeeds.com/datasets/target-discounted-products-dataset-30-off-190k-items
    Explore at:
    csv, zip (available download formats)
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Get access to a curated dataset of over 160,000 products from Target.com, all featuring a 30% or greater discount. This collection is ideal for anyone studying pricing trends, consumer deal behavior, or building retail pricing intelligence platforms.

    The data spans categories including home goods, electronics, fashion, beauty, and personal care, offering insights into Target’s promotional strategies and markdown inventory.

    What’s Included:

    • Product Title & URL

    • Original & Discounted Prices

    • % Discount

    • Brand, Category

    • Image links, Description

    • Availability (in stock / out of stock)

    • Scraped Date

    Use Cases:

    • Build daily deal apps or deal newsletters

    • Monitor Target’s price drops and markdown strategy

    • Analyze clearance vs. everyday discount trends

    • Create dashboards for pricing analytics

    • Feed retail bots or price comparison engines

    Updates:

    This dataset can be refreshed weekly or monthly upon request.
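
    As a quick orientation to the fields listed above, here is a minimal sketch of how the CSV export could be explored with pandas; the file name and column names (category, discount_percent) are assumptions for illustration and should be checked against the actual header row.

    import pandas as pd

    # Hypothetical file and column names -- adjust to the actual CSV header.
    df = pd.read_csv("target_discounted_products.csv")

    # Keep only rows at or above the dataset's stated 30% discount threshold.
    deals = df[df["discount_percent"] >= 30]

    # Average discount and item count per category, deepest markdowns first.
    summary = (
        deals.groupby("category")["discount_percent"]
        .agg(["mean", "count"])
        .sort_values("mean", ascending=False)
    )
    print(summary.head(10))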

  3. Dairy Supply Chain Sales Dataset

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jul 12, 2024
    + more versions
    Cite
    Dimitris Iatropoulos; Konstantinos Georgakidis; Ilias Siniosoglou; Christos Chaschatzis; Anna Triantafyllou; Athanasios Liatifis; Dimitrios Pliatsios; Thomas Lagkas; Vasileios Argyriou; Panagiotis Sarigiannidis (2024). Dairy Supply Chain Sales Dataset [Dataset]. http://doi.org/10.21227/smv6-z405
    Explore at:
    zip, pdf (available download formats)
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dimitris Iatropoulos; Konstantinos Georgakidis; Ilias Siniosoglou; Christos Chaschatzis; Anna Triantafyllou; Athanasios Liatifis; Dimitrios Pliatsios; Thomas Lagkas; Vasileios Argyriou; Panagiotis Sarigiannidis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Introduction

    Sales data collection is a crucial aspect of any manufacturing industry as it provides valuable insights about the performance of products, customer behaviour, and market trends. By gathering and analysing this data, manufacturers can make informed decisions about product development, pricing, and marketing strategies in Internet of Things (IoT) business environments like the dairy supply chain.

    One of the most important benefits of the sales data collection process is that it allows manufacturers to identify their most successful products and target their efforts towards those areas. For example, if a manufacturer notices that a particular product is selling well in a certain region, this information can be used to develop new products, improve existing ones, or optimise the supply chain to meet the changing needs of customers.

    This dataset includes information about 7 of MEVGAL’s products [1]. The published data will help researchers understand the dynamics of the dairy market and its consumption patterns, creating fertile ground for synergies between academia and industry and ultimately helping the industry make informed decisions regarding product development, pricing and market strategies in the IoT playground. The dataset can also be used to study the impact of external factors on the dairy market, such as economic, environmental, and technological factors, and to understand the current state of the dairy industry and identify potential opportunities for growth and development.

    2. Citation

    Please cite the following papers when using this dataset:

    1. I. Siniosoglou, K. Xouveroudis, V. Argyriou, T. Lagkas, S. K. Goudos, K. E. Psannis and P. Sarigiannidis, "Evaluating the Effect of Volatile Federated Timeseries on Modern DNNs: Attention over Long/Short Memory," in the 12th International Conference on Circuits and Systems Technologies (MOCAST 2023), April 2023, Accepted

    3. Dataset Modalities

    The dataset includes data regarding the daily sales of a series of dairy product codes offered by MEVGAL. In particular, the dataset includes information gathered by the logistics division and agencies within the industrial infrastructures overseeing the production of each product code. The products included in this dataset represent the daily sales and logistics of a variety of yogurt-based stock. Each of the files includes the logistics for that product on a daily basis for three years, from 2020 to 2022.

    3.1 Data Collection

    The process of building this dataset involves several steps to ensure that the data is accurate, comprehensive and relevant.

    The first step is to determine the specific data that is needed to support the business objectives of the industry, i.e., in this publication’s case the daily sales data.

    Once the data requirements have been identified, the next step is to implement an effective sales data collection method. In MEVGAL’s case this is conducted through direct communication and reports generated each day by representatives & selling points.

    It is also important for MEVGAL to ensure that the data collection process is conducted in an ethical and compliant manner, adhering to data privacy laws and regulations. The industry also has a data management plan in place to ensure that the data is securely stored and protected from unauthorised access.

    The published dataset consists of 13 features providing information about the date and the number of products that have been sold. Finally, the dataset was anonymised in consideration of the privacy requirements of the data owner (MEVGAL).

    File | Period | Number of Samples (days)
    product 1 2020.xlsx | 01/01/2020–31/12/2020 | 363
    product 1 2021.xlsx | 01/01/2021–31/12/2021 | 364
    product 1 2022.xlsx | 01/01/2022–31/12/2022 | 365
    product 2 2020.xlsx | 01/01/2020–31/12/2020 | 363
    product 2 2021.xlsx | 01/01/2021–31/12/2021 | 364
    product 2 2022.xlsx | 01/01/2022–31/12/2022 | 365
    product 3 2020.xlsx | 01/01/2020–31/12/2020 | 363
    product 3 2021.xlsx | 01/01/2021–31/12/2021 | 364
    product 3 2022.xlsx | 01/01/2022–31/12/2022 | 365
    product 4 2020.xlsx | 01/01/2020–31/12/2020 | 363
    product 4 2021.xlsx | 01/01/2021–31/12/2021 | 364
    product 4 2022.xlsx | 01/01/2022–31/12/2022 | 364
    product 5 2020.xlsx | 01/01/2020–31/12/2020 | 363
    product 5 2021.xlsx | 01/01/2021–31/12/2021 | 364
    product 5 2022.xlsx | 01/01/2022–31/12/2022 | 365
    product 6 2020.xlsx | 01/01/2020–31/12/2020 | 362
    product 6 2021.xlsx | 01/01/2021–31/12/2021 | 364
    product 6 2022.xlsx | 01/01/2022–31/12/2022 | 365
    product 7 2020.xlsx | 01/01/2020–31/12/2020 | 362
    product 7 2021.xlsx | 01/01/2021–31/12/2021 | 364
    product 7 2022.xlsx | 01/01/2022–31/12/2022 | 365
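
    As a minimal illustration of how these files can be combined, the sketch below reads the three yearly workbooks for one product with pandas and concatenates them into a single daily series. Reading .xlsx files requires the openpyxl engine, and the column names follow the feature table in Section 3.2; the sheet layout should be verified against the actual files.

    import pandas as pd

    # Yearly workbooks for one product, as listed in the table above.
    files = [
        "product 1 2020.xlsx",
        "product 1 2021.xlsx",
        "product 1 2022.xlsx",
    ]

    # Read each year and stack the rows into one continuous daily record.
    frames = [pd.read_excel(f) for f in files]  # requires openpyxl for .xlsx
    product_1 = pd.concat(frames, ignore_index=True)

    # Example: total units sold per year (feature names per Section 3.2).
    print(product_1.groupby("Year")["daily_unit_sales"].sum())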

    3.2 Dataset Overview

    The following table enumerates and explains the features included in all of the files.

    Feature | Description | Unit
    Day | Day of the month | -
    Month | Month | -
    Year | Year | -
    daily_unit_sales | Daily sales - the amount of products, measured in units, sold during that specific day | units
    previous_year_daily_unit_sales | Previous year's sales - the amount of products, measured in units, sold during that specific day in the previous year | units
    percentage_difference_daily_unit_sales | The percentage difference between the two values above | %
    daily_unit_sales_kg | The amount of products, measured in kilograms, sold during that specific day | kg
    previous_year_daily_unit_sales_kg | Previous year's sales - the amount of products, measured in kilograms, sold during that specific day in the previous year | kg
    percentage_difference_daily_unit_sales_kg | The percentage difference between the two values above | kg
    daily_unit_returns_kg | The percentage of the products that were shipped to selling points and were returned | %
    previous_year_daily_unit_returns_kg | The percentage of the products that were shipped to

  4. Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis

    • test.researchdata.tuwien.ac.at
    bin, csv, json +1
    Updated Apr 28, 2025
    Cite
    Dilara Çakmak (2025). Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis [Dataset]. http://doi.org/10.70124/f5t2d-xt904
    Explore at:
    csv, text/markdown, json, bin (available download formats)
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Dilara Çakmak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2025
    Description

    Context and Methodology

    Research Domain:
    The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

    Purpose:
    The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

    How the Dataset Was Created:
    The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. The data is split into multiple features, such as the store's location, promotion details, whether the store was open or closed, and weather information. The dataset is publicly available on platforms like Kaggle and was initially created for the Kaggle Rossmann Store Sales competition. The data is made accessible via an API for further analysis and modeling, and it is structured to help machine learning models predict future sales based on various input variables.

    Technical Details

    Dataset Structure:

    The dataset consists of three main files, each with its specific role:

    1. Train:
      This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).

      https://handle.test.datacite.org/10.82556/yb6j-jw41
      PID: b1c59499-9c6e-42c2-af8f-840181e809db
    2. Test2:
      The test dataset mirrors the structure of train.csv but does not include the actual sales values (i.e., the target variable). This file is used for making predictions using the trained machine learning models. It is used to evaluate the accuracy of predictions when the true sales data is unknown.

      https://handle.test.datacite.org/10.82556/jerg-4b84
      PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
    3. Store:
      This file provides metadata about each store, including information such as the store’s location, type, and assortment level. This data is essential for understanding the context in which the sales data is gathered.

      https://handle.test.datacite.org/10.82556/nqeg-gy34
      PID: 9627ec46-4ee6-4969-b14a-bda555fe34db

    Data Fields Description (see the loading sketch after this list):

    • Id: A unique identifier for each (Store, Date) combination within the test set.

    • Store: A unique identifier for each store.

    • Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).

    • Customers: The number of customers visiting the store on a given day.

    • Open: An indicator of whether the store was open (1 = open, 0 = closed).

    • StateHoliday: Indicates if the day is a state holiday, with values like:

      • 'a' = public holiday,

      • 'b' = Easter holiday,

      • 'c' = Christmas,

      • '0' = no holiday.

    • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

    • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

    • Assortment: Describes the level of product assortment in the store:

      • 'a' = basic,

      • 'b' = extra,

      • 'c' = extended.

    • CompetitionDistance: Distance (in meters) to the nearest competitor store.

    • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

    • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

    • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

    • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

    • PromoInterval: Describes the months when Promo2 is active, e.g., "Feb,May,Aug,Nov" means the promotion starts in February, May, August, and November.
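
    To make the field layout concrete, here is a minimal sketch, assuming the training and store tables are available locally as train.csv and store.csv as described above; the exact file names and any further preprocessing should be checked against the actual repository.

    import pandas as pd

    # Assumed local file names for the Train and Store resources described above.
    train = pd.read_csv("train.csv", parse_dates=["Date"], dtype={"StateHoliday": str})
    store = pd.read_csv("store.csv")

    # Attach store metadata (StoreType, Assortment, CompetitionDistance, ...) to each daily record.
    data = train.merge(store, on="Store", how="left")

    # Closed days have zero sales and are usually dropped before modelling.
    data = data[data["Open"] == 1]

    # Quick check: average daily sales with and without a running promotion.
    print(data.groupby("Promo")["Sales"].mean())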

    Software Requirements

    To work with this dataset, you will need to have specific software installed, including:

    • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

    • Python Libraries: Key libraries for working with the dataset include:

      • pandas for data manipulation,

      • numpy for numerical operations,

      • matplotlib and seaborn for data visualization,

      • scikit-learn for machine learning algorithms.

    Additional Resources

    Several additional resources are available for working with the dataset:

    1. Presentation:
      A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

    2. Jupyter Notebook:
      A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

    3. Model Evaluation Results:
      The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

    4. Trained Models (.pkl files):
      The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.

    5. sample_submission.csv:
      This file is a sample submission file that demonstrates the format of predictions expected when using the trained model. The sample_submission.csv contains predictions made on the test dataset using the trained Random Forest model. It provides an example of how the output should be structured for submission.

    These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.

  5. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for itemsets that they are most likely to purchase. I was given a dataset containing a retailer's transaction data; it covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and provide customers with itemset suggestions; this should increase customer engagement, improve the customer experience, and help identify customer behaviour. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association Rules are most useful when you want to discover associations between different objects in a set, i.e., to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, and they allow the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(mouse) = 0.08/0.10 = 0.80; lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
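
    A quick way to sanity-check these numbers is to compute them directly; the short Python sketch below reproduces the toy example using the counts stated above.

    # Toy example: 100 customers, 10 bought a mouse, 9 bought a mat, 8 bought both.
    n_customers = 100
    n_mouse, n_mat, n_both = 10, 9, 8

    support = n_both / n_customers             # P(mouse & mat)      = 0.08
    confidence = n_both / n_mouse              # P(mat | mouse)      = 0.80
    lift = confidence / (n_mat / n_customers)  # confidence / P(mat) ~ 8.9

    print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.1f}")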

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    [image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png]

    Libraries in R

    First, we need to load the required libraries. Below, each library is briefly described.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - An opinionated collection of R packages designed for data science; it makes it easy to install and load multiple 'tidyverse' packages in a single step.
    • readxl - Read Excel files in R.
    • plyr - Tools for splitting, applying and combining data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator forwards a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data-frame-like objects, both in memory and out of memory.

    [image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png]

    Data Pre-processing

    Next, we need to upload Assignment-1_Data.xlsx to R and read the dataset. Now we can see our data in R.

    [images: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png and https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png]

    After that, we will clean our data frame and remove missing values.

    [image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png]

    To apply Association Rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice will be in ...
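
    The transformation described here (grouping line items into one transaction per invoice) can be sketched as follows. This is a Python/pandas analogue of the R workflow, using the BillNo and Itemname columns from the dataset description; the exact cleaning steps are illustrative only.

    import pandas as pd

    # Read the raw line items (columns per the dataset description:
    # BillNo, Itemname, Quantity, Date, Price, CustomerID, Country).
    df = pd.read_excel("Assignment-1_Data.xlsx")

    # Remove rows with missing item names before grouping.
    df = df.dropna(subset=["Itemname"])

    # One transaction per invoice: the set of items sharing a BillNo.
    transactions = df.groupby("BillNo")["Itemname"].apply(set)
    print(transactions.head())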

  6. Data from: A large synthetic dataset for machine learning applications in power transmission grids

    • zenodo.org
    csv, json, png, zip
    Updated Mar 25, 2025
    Cite
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod (2025). A large synthetic dataset for machine learning applications in power transmission grids [Dataset]. http://doi.org/10.5281/zenodo.13378476
    Explore at:
    zip, png, csv, json (available download formats)
    Dataset updated
    Mar 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marc Gillioz; Guillaume Dubuis; Philippe Jacquod
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.

    This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.

    Data generation algorithm

    The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.

    Network

    The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.

    Time series

    The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.

    There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).

    Usage

    The time series can be used without reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, and how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.

    Selecting a particular country

    This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):

    import pandas as pd
    CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)

    The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:

    CH_gens_list = CH_gens.dropna().squeeze().to_list()

    Finally, we can import all the time series of Swiss generators from a given data table with

    pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)

    The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.

    Averaging over time

    This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:

    hourly_loads = pd.read_csv('loads_2018_3.csv')

    To get a daily average of the loads, we can use:

    daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()

    This results in series of length 364. To average further over entire weeks and get series of length 52, we use:

    weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()

    Source code

    The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples on how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.

    Funding

    This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.

  7. Child and Adult Care Food Programs (CACFP) – Child Centers – Meal Reimbursement – Program Year 2023 - 2024

    • data.texas.gov
    • catalog.data.gov
    application/rdfxml +5
    Updated Apr 2, 2025
    + more versions
    Cite
    Texas Department of Agriculture (2025). Child and Adult Care Food Programs (CACFP) – Child Centers – Meal Reimbursement – Program Year 2023 - 2024 [Dataset]. https://data.texas.gov/dataset/Child-and-Adult-Care-Food-Programs-CACFP-Child-Cen/pp75-3tt8
    Explore at:
    csv, application/rdfxml, tsv, xml, json, application/rssxml (available download formats)
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Texas Department of Agriculture (http://www.texasagriculture.gov/)
    Description

    Help us provide the most useful data by completing our ODP User Feedback Survey for Child and Adult Care Food Program (CACFP) Data

    About the Dataset
    This data set contains claims information for meal reimbursement for sites participating in CACFP as child centers for the program year 2023-2024. This includes Child Care Centers, At-Risk centers, Head Start sites, Outside School Hours sites, and Emergency Shelters. The CACFP program year begins October 1 and ends September 30.

    This dataset only includes claims submitted by CACFP sites operating as child centers. Sites can participate in multiple CACFP sub-programs. Each record (row) represents monthly meals data for a single site and for a single CACFP center sub-program.

    To filter data for a specific CACFP center Program, select "View Data" to open the Exploration Canvas filter tools. Select the program(s) of interest from the Program field. A filtering tutorial can be found HERE

    For meals data on CACFP participants operating as Day Care Homes, Adult Day Care Centers, or child care centers for previous program years, please refer to the corresponding “Child and Adult Care Food Programs (CACFP) – Meal Reimbursement” dataset for that sub-program available on the State of Texas Open Data Portal.

    An overview of all CACFP data available on the Texas Open Data Portal can be found at our TDA Data Overview - Child and Adult Care Food Programs page.

    An overview of all TDA Food and Nutrition data available on the Texas Open Data Portal can be found at our TDA Data Overview - Food and Nutrition Open Data page.

    More information about accessing and working with TDA data on the Texas Open Data Portal can be found on the SquareMeals.org website on the TDA Food and Nutrition Open Data page.

    About Dataset Updates
    TDA aims to post new program year data by December 15 of the active program year. Participants have 60 days to file monthly reimbursement claims. Dataset updates will occur daily until 90 days after the close of the program year. After 90 days from the close of the program year, the dataset will be updated at six months and one year from the close of the program year before becoming archived. Archived datasets will remain published but will not be updated. Any data posted during the active program year is subject to change.

    About the Agency
    The Texas Department of Agriculture administers 12 U.S. Department of Agriculture nutrition programs in Texas including the National School Lunch and School Breakfast Programs, the Child and Adult Care Food Programs (CACFP), and the summer meal programs. TDA’s Food and Nutrition division provides technical assistance and training resources to partners operating the programs and oversees the USDA reimbursements they receive to cover part of the cost associated with serving food in their facilities. By working to ensure these partners serve nutritious meals and snacks, the division adheres to its mission — Feeding the Hungry and Promoting Healthy Lifestyles.

    For more information on these programs, please visit our website.
    "

  8. ‘Vehicle Miles Traveled During Covid-19 Lock-Downs’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 4, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Vehicle Miles Traveled During Covid-19 Lock-Downs ’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-vehicle-miles-traveled-during-covid-19-lock-downs-636d/latest
    Explore at:
    Dataset updated
    Jan 4, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Vehicle Miles Traveled During Covid-19 Lock-Downs ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/vehicle-miles-travelede on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    **This data set was last updated 3:30 PM ET Monday, January 4, 2021. The last date of data in this dataset is December 31, 2020.**

    Overview

    Data shows that mobility declined nationally since states and localities began shelter-in-place strategies to stem the spread of COVID-19. The numbers began climbing as more people ventured out and traveled further from their homes, but in parallel with the rise of COVID-19 cases in July, travel declined again.

    This distribution contains county level data for vehicle miles traveled (VMT) from StreetLight Data, Inc, updated three times a week. This data offers a detailed look at estimates of how much people are moving around in each county.

    Available data has a two-day lag: the most recent data is from two days prior to the update date. Going forward, this dataset will be updated by AP at 3:30 p.m. ET on Monday, Wednesday and Friday each week.

    This data has been made available to members of AP’s Data Distribution Program. To inquire about access for your organization - publishers, researchers, corporations, etc. - please click Request Access in the upper right corner of the page or email kromano@ap.org. Be sure to include your contact information and use case.

    Findings

    • Nationally, data shows that vehicle travel in the US has doubled compared to the seven-day period ending April 13, which was the lowest VMT since the COVID-19 crisis began. In early December, travel reached a low not seen since May, with a small rise leading up to the Christmas holiday.
    • Average vehicle miles traveled continues to be below what would be expected without a pandemic - down 38% compared to January 2020. September 4 reported the largest single day estimate of vehicle miles traveled since March 14.
    • New Jersey, Michigan and New York are among the states with the largest relative uptick in travel at this point of the pandemic - they report almost two times the miles traveled compared to their lowest seven-day period. However, travel in New Jersey and New York is still much lower than expected without a pandemic. Other states such as New Mexico, Vermont and West Virginia have rebounded the least.

    About This Data

    The county-level data is provided by StreetLight Data, Inc., a transportation analysis firm that measures travel patterns across the U.S. The data is from their Vehicle Miles Traveled (VMT) Monitor, which uses anonymized and aggregated data from smartphones and other GPS-enabled devices to provide county-by-county VMT metrics for more than 3,100 counties. The VMT Monitor provides an estimate of total vehicle miles traveled by residents of each county, each day since the COVID-19 crisis began (March 1, 2020), as well as a change from the baseline average daily VMT calculated for January 2020. Additional columns are calculations by AP.

    Included Data

    01_vmt_nation.csv - Data summarized to provide a nationwide look at vehicle miles traveled. Includes single day VMT across counties, daily percent change compared to January and seven day rolling averages to smooth out the trend lines over time.

    02_vmt_state.csv - Data summarized to provide a statewide look at vehicle miles traveled. Includes single day VMT across counties, daily percent change compared to January and seven day rolling averages to smooth out the trend lines over time.

    03_vmt_county.csv - Data providing a county level look at vehicle miles traveled. Includes VMT estimate, percent change compared to January and seven day rolling averages to smooth out the trend lines over time.
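
    As an illustration of the seven-day smoothing mentioned above, the sketch below computes a rolling average from the county-level file; the column names used here (date, county_fips, vmt) are placeholders, since the exact headers are not listed in this description.

    import pandas as pd

    # Placeholder column names -- check the actual header of 03_vmt_county.csv.
    vmt = pd.read_csv("03_vmt_county.csv", parse_dates=["date"])

    # Seven-day rolling average of daily VMT for each county, to smooth the trend lines.
    vmt = vmt.sort_values(["county_fips", "date"])
    vmt["vmt_7day_avg"] = (
        vmt.groupby("county_fips")["vmt"]
        .transform(lambda s: s.rolling(window=7, min_periods=1).mean())
    )
    print(vmt.head())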

    Additional Data Queries

    * Filter for specific state - filters 02_vmt_state.csv daily data for specific state.

    * Filter counties by state - filters 03_vmt_county.csv daily data for counties in specific state.

    * Filter for specific county - filters 03_vmt_county.csv daily data for specific county.

    Interactive

    The AP has designed an interactive map to show the percent change in vehicle miles traveled by county since each county's lowest point during the pandemic:

    This dataset was created by Angeliki Kastanis and contains around 0 samples along with Date At Low, Mean7 County Vmt At Low, technical information and other features such as: - County Name - County Fips - and more.

    How to use this dataset

    • Analyze State Name in relation to Baseline Jan Vmt
    • Study the influence of Date At Low on Mean7 County Vmt At Low
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit Angeliki Kastanis


    --- Original source retains full ownership of the source dataset ---

  9. ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations

    • researchdata.tuwien.ac.at
    zip
    Updated Jun 6, 2025
    + more versions
    Cite
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo (2025). ESA CCI SM GAPFILLED Long-term Climate Data Record of Surface Soil Moisture from merged multi-satellite observations [Dataset]. http://doi.org/10.48436/3fcxr-cde10
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    TU Wien
    Authors
    Wolfgang Preimesberger; Pietro Stradiotti; Wouter Arnoud Dorigo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This dataset was produced with funding from the European Space Agency (ESA) Climate Change Initiative (CCI) Plus Soil Moisture Project (CCN 3 to ESRIN Contract No: 4000126684/19/I-NB "ESA CCI+ Phase 1 New R&D on CCI ECVS Soil Moisture"). Project website: https://climate.esa.int/en/projects/soil-moisture/

    This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.

    Dataset paper (public preprint)

    A description of this dataset, including the methodology and validation results, is available at:

    Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.

    Abstract

    ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
    However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
    Since the requirement of a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling method is to rely only on the original observational record, without need for ancillary variable or model-based information. Due to the intrinsic challenge, there was until present no global, long-term univariate gap-filled product available. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gapfilling performance.

    Summary

    • Gap-filled global estimates of volumetric surface soil moisture from 1991-2023 at 0.25° sampling
    • Fields of application (partial): climate variability and change, land-atmosphere interactions, global biogeochemical cycles and ecology, hydrological and land surface modelling, drought applications, and meteorology
    • Method: Modified version of DCT-PLS (Garcia, 2010) interpolation/smoothing algorithm, linear interpolation over periods of frozen soils. Uncertainty estimates are provided for all data points.
    • More information: See Preimesberger et al. (2025) and the ESA CCI SM Algorithm Theoretical Baseline Document [Chapter 7.2.9] (Dorigo et al., 2023), https://doi.org/10.5281/zenodo.8320869

    Programmatic Download

    You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following command will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.

    #!/bin/bash

    # Set download directory
    DOWNLOAD_DIR=~/Downloads

    base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"

    # Loop through years 1991 to 2023 and download & extract data
    for year in {1991..2023}; do
        echo "Downloading $year.zip..."
        wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
        unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
        rm "$DOWNLOAD_DIR/$year.zip"
    done

    Data details

    The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:

    ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc

    Data Variables

    Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:

    • sm: (float) The Soil Moisture variable reflects estimates of daily average volumetric soil moisture content (m3/m3) in the soil surface layer (~0-5 cm) over a whole grid cell (0.25 degree).
    • sm_uncertainty: (float) The Soil Moisture Uncertainty variable reflects the uncertainty (random error) of the original satellite observations and of the predictions used to fill observation data gaps.
    • sm_anomaly: Soil moisture anomalies (reference period 1991-2020) derived from the gap-filled values (`sm`)
    • sm_smoothed: Contains DCT-PLS predictions used to fill data gaps in the original soil moisture field. These values are also provided for cases where an observation was initially available (compare `gapmask`). In this case, they provide a smoothed version of the original data.
    • gapmask: (0 | 1) Indicates grid cells where a satellite observation is available (1), and where the interpolated (smoothed) values are used instead (0) in the 'sm' field.
    • frozenmask: (0 | 1) Indicates grid cells where ERA5 soil temperature is <0 °C. In this case, a linear interpolation over time is applied.

    Additional information for each variable is given in the netCDF attributes.

    Version Changelog

    Changes in v9.1r1 (previous version was v09.1):

    • This version uses a novel uncertainty estimation scheme as described in Preimesberger et al. (2025).

    Software to open netCDF files

    These data can be read by any software that supports the Climate and Forecast (CF) metadata conventions for netCDF files.
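
    For example, here is a minimal sketch using the Python library xarray (one of many CF-aware options; the file name below simply follows the naming convention given above):

    import xarray as xr

    # Example file name following the convention above (here: 1 June 2020).
    path = "ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200601000000-fv09.1r1.nc"
    ds = xr.open_dataset(path)

    # Daily volumetric surface soil moisture (m3/m3) and its uncertainty on the 0.25 deg grid.
    sm = ds["sm"]
    sm_unc = ds["sm_uncertainty"]

    # Keep only grid cells backed by an actual satellite observation (gapmask == 1).
    observed_only = sm.where(ds["gapmask"] == 1)

    print(float(sm.mean()), float(sm_unc.mean()))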

    References

    • Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: An independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data Discuss. [preprint], https://doi.org/10.5194/essd-2024-610, in review, 2025.
    • Dorigo, W., Preimesberger, W., Stradiotti, P., Kidd, R., van der Schalie, R., van der Vliet, M., Rodriguez-Fernandez, N., Madelon, R., & Baghdadi, N. (2023). ESA Climate Change Initiative Plus - Soil Moisture Algorithm Theoretical Baseline Document (ATBD) Supporting Product Version 08.1 (version 1.1). Zenodo. https://doi.org/10.5281/zenodo.8320869
    • Garcia, D., 2010. Robust smoothing of gridded data in one and higher dimensions with missing values. Computational Statistics & Data Analysis, 54(4), pp.1167-1178. Available at: https://doi.org/10.1016/j.csda.2009.09.020
    • Rodell, M., Houser, P. R., Jambor, U., Gottschalck, J., Mitchell, K., Meng, C.-J., Arsenault, K., Cosgrove, B., Radakovich, J., Bosilovich, M., Entin, J. K., Walker, J. P., Lohmann, D., and Toll, D.: The Global Land Data Assimilation System, Bulletin of the American Meteorological Society, 85, 381 – 394, https://doi.org/10.1175/BAMS-85-3-381, 2004.

    Related Records

    The following records are all part of the Soil Moisture Climate Data Records from satellites community

    1. ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77

  10. Fundamental Data Record for Atmospheric Composition [ATMOS_L1B]

    • earth.esa.int
    Updated Jul 1, 2024
    + more versions
    Cite
    European Space Agency (2024). Fundamental Data Record for Atmospheric Composition [ATMOS_L1B] [Dataset]. https://earth.esa.int/eogateway/catalog/fdr-for-atmospheric-composition
    Explore at:
    Dataset updated
    Jul 1, 2024
    Dataset authored and provided by
    European Space Agency (http://www.esa.int/)
    License

    https://earth.esa.int/eogateway/documents/20142/1564626/Terms-and-Conditions-for-the-use-of-ESA-Data.pdf

    Time period covered
    Jun 28, 1995 - Apr 7, 2012
    Description

    The Fundamental Data Record (FDR) for Atmospheric Composition UVN v.1.0 dataset is a cross-instrument Level-1 product [ATMOS_L1B] generated in 2023 and resulting from the ESA FDR4ATMOS project. The FDR contains selected Earth Observation Level 1b parameters (irradiance/reflectance) from the nadir-looking measurements of the ERS-2 GOME and Envisat SCIAMACHY missions for the period ranging from 1995 to 2012. The data record offers harmonised cross-calibrated spectra with a focus on spectral windows in the Ultraviolet-Visible-Near Infrared regions for the retrieval of critical atmospheric constituents like ozone (O3), sulphur dioxide (SO2) and nitrogen dioxide (NO2) column densities, alongside cloud parameters.

    The FDR4ATMOS products should be regarded as experimental due to the innovative approach and the current use of a limited-sized test dataset to investigate the impact of harmonization on the Level 2 target species, specifically SO2, O3 and NO2. Presently, this analysis is being carried out within follow-on activities. The FDR4ATMOS V1 is currently being extended to include the MetOp GOME-2 series.

    Product format

    For many aspects, the FDR product has improved compared to the existing individual mission datasets:

    • GOME solar irradiances are harmonised using a validated SCIAMACHY solar reference spectrum, solving the problem of the fast-changing etalon present in the original GOME Level 1b data;
    • Reflectances for both GOME and SCIAMACHY are provided in the FDR product. GOME reflectances are harmonised to degradation-corrected SCIAMACHY values, using collocated data from the CEOS PIC sites;
    • SCIAMACHY data are scaled to the lowest integration time within the spectral band using high-frequency PMD measurements from the same wavelength range. This simplifies the use of the SCIAMACHY spectra, which were split into a complex cluster structure (with its own integration time) in the original Level 1b data;
    • The harmonization process applied mitigates the viewing angle dependency observed in the UV spectral region for GOME data;
    • Uncertainties are provided.

    Each FDR product provides, within the same file, irradiance/reflectance data for the UV-VIS-NIR spectral regions across all orbits on a single day, including therein information from the individual ERS-2 GOME and Envisat SCIAMACHY measurements.

    FDR has been generated in two formats: Level 1A and Level 1B, targeting expert users and nominal applications respectively. The Level 1A [ATMOS_L1A] data include additional parameters such as harmonisation factors, PMD, and polarisation data extracted from the original mission Level 1 products. The ATMOS_L1A dataset is not part of the nominal dissemination to users. In case of specific requirements, please contact EOHelp.

    Please refer to the README file for essential guidance before using the data. All the new products are conveniently formatted in NetCDF. Free standard tools, such as Panoply, can be used to read NetCDF data. Panoply is sourced and updated by external entities. For further details, please consult our Terms and Conditions page.

    Uncertainty characterisation

    One of the main aspects of the project was the characterization of Level 1 uncertainties for both instruments, based on metrological best practices.
    The following documents are provided:

    • General guidance on a metrological approach to Fundamental Data Records (FDR)
    • Uncertainty Characterisation document
    • Effect tables
    • NetCDF files containing example uncertainty propagation analysis and spectral error correlation matrices for SCIAMACHY (Atlantic and Mauretania scene for 2003 and 2010) and GOME (Atlantic scene for 2003):
      • reflectance_uncertainty_example_FDR4ATMOS_GOME.nc
      • reflectance_uncertainty_example_FDR4ATMOS_SCIA.nc

    Known Issues

    Non-monotonous wavelength axis for SCIAMACHY in FDR data version 1.0

    In the SCIAMACHY OBSERVATION group of the atmospheric FDR v1.0 dataset (DOI: 10.5270/ESA-852456e), the wavelength axis (lambda variable) is not monotonically increasing. This issue affects all spectral channels (UV, VIS, NIR) in the SCIAMACHY group, while GOME OBSERVATION data remain unaffected. The root cause of the issue lies in the incorrect indexing of the lambda variable during the NetCDF writing process. Notably, the wavelength values themselves are calculated correctly within the processing chain.

    Temporary Workaround

    The wavelength axis is correct in the first record of each product. As a workaround, users can extract the wavelength axis from the first record and apply it to all subsequent measurements within the same product. The first record can be retrieved by setting the first two indices (time and scanline) to 0 (assuming counting of array indices starts at 0). Note that this process must be repeated separately for each spectral range (UV, VIS, NIR) and every daily product. Since the wavelength axis of SCIAMACHY is highly stable over time, using the first record introduces no expected impact on retrieval results.

    Python pseudo-code example: lambda_...
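
    A minimal sketch of this workaround in Python, assuming the netCDF4 library, a placeholder file name, and a 'SCIAMACHY' group whose lambda variable is indexed as (time, scanline, spectral pixel); the actual group and dimension layout should be verified against the product:

    import numpy as np
    from netCDF4 import Dataset

    # Placeholder file name and assumed group/variable layout -- verify against the product.
    with Dataset("FDR_ATMOS_L1B_example.nc") as nc:
        lam = nc["SCIAMACHY"]["lambda"]  # wavelength axis, assumed shape (time, scanline, pixel)

        # The wavelength axis is only correct in the first record:
        # take time index 0 and scanline index 0 ...
        lambda_ref = np.asarray(lam[0, 0, ...])

        # ... and reuse it for every measurement in this product.
        lambda_fixed = np.broadcast_to(lambda_ref, lam.shape)

    # Repeat separately for each spectral range (UV, VIS, NIR) and for every daily product.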

  11. Lead Scoring Dataset

    • kaggle.com
    zip
    Updated Aug 17, 2020
    Cite
    Amrita Chatterjee (2020). Lead Scoring Dataset [Dataset]. https://www.kaggle.com/amritachatterjee09/lead-scoring-dataset
    Explore at:
    zip, 411028 bytes (available download formats)
    Dataset updated
    Aug 17, 2020
    Authors
    Amrita Chatterjee
    Description

    Context

    An education company named X Education sells online courses to industry professionals. On any given day, many professionals who are interested in the courses land on their website and browse for courses.

    The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill up a form for the course, or watch some videos. When these people fill up a form providing their email address or phone number, they are classified as a lead. Moreover, the company also gets leads through past referrals. Once these leads are acquired, employees from the sales team start making calls, writing emails, etc. Through this process, some of the leads get converted while most do not. The typical lead conversion rate at X Education is around 30%.

    Although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. To make this process more efficient, the company wishes to identify the most promising leads, also known as ‘Hot Leads’. If this set of leads is identified successfully, the lead conversion rate should go up, as the sales team will focus on communicating with the potential leads rather than making calls to everyone.

    There are a lot of leads generated in the initial stage (top) but only a few of them come out as paying customers from the bottom. In the middle stage, you need to nurture the potential leads well (i.e. educating the leads about the product, constantly communicating, etc. ) in order to get a higher lead conversion.

    X Education wants to select the most promising leads, i.e. the leads that are most likely to convert into paying customers. The company requires you to build a model that assigns a lead score to each lead, such that customers with a higher lead score have a higher conversion chance and customers with a lower lead score have a lower conversion chance. The CEO has given a ballpark target lead conversion rate of around 80%.

    Content

    Variables Description

    * Prospect ID - A unique ID with which the customer is identified.
    * Lead Number - A lead number assigned to each lead procured.
    * Lead Origin - The origin identifier with which the customer was identified as a lead. Includes API, Landing Page Submission, etc.
    * Lead Source - The source of the lead. Includes Google, Organic Search, Olark Chat, etc.
    * Do Not Email - An indicator selected by the customer specifying whether or not they want to be emailed about the course.
    * Do Not Call - An indicator selected by the customer specifying whether or not they want to be called about the course.
    * Converted - The target variable. Indicates whether a lead has been successfully converted.
    * TotalVisits - The total number of visits made by the customer to the website.
    * Total Time Spent on Website - The total time spent by the customer on the website.
    * Page Views Per Visit - Average number of pages on the website viewed per visit.
    * Last Activity - Last activity performed by the customer. Includes Email Opened, Olark Chat Conversation, etc.
    * Country - The country of the customer.
    * Specialization - The industry domain in which the customer worked before. Includes the level 'Select Specialization', which means the customer did not select this option while filling in the form.
    * How did you hear about X Education - The source from which the customer heard about X Education.
    * What is your current occupation - Indicates whether the customer is a student, unemployed, or employed.
    * What matters most to you in choosing this course - An option selected by the customer indicating their main motive for taking the course.
    * Search, Magazine, Newspaper Article, X Education Forums, Newspaper, Digital Advertisement - Indicate whether the customer had seen an ad for the course in each of the listed media.
    * Through Recommendations - Indicates whether the customer came in through recommendations.
    * Receive More Updates About Our Courses - Indicates whether the customer chose to receive more updates about the courses.
    * Tags - Tags assigned to the customer indicating the current status of the lead.
    * Lead Quality - Indicates the quality of the lead, based on the data and the intuition of the employee assigned to the lead.
    * Update me on Supply Chain Content - Indicates whether the customer wants updates on the Supply Chain Content.
    * Get updates on DM Content - Indicates whether the customer wants updates on the DM Content.
    * Lead Profile - A lead level assigned to each customer based on their profile.
    * City - The city of the customer.
    * Asymmetric Activity Index, Asymmetric Profile Index, Asymmetric Activity Score, Asymmetric Profile Score - Indices and scores assigned to each customer based on their activity and profile.
    * I agree to pay the amount through cheque - Indicates whether the customer has agreed to pay the amount through cheque.
    * A free copy of Mastering The Interview - Indicates whether the customer wants a free copy of 'Mastering the Interview'.
    * Last Notable Activity - The last notable activity performed by the student.
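    As a rough illustration of the modeling task, the sketch below trains a logistic regression on a small subset of the numeric columns listed above and converts the predicted conversion probabilities into 0-100 lead scores. The file name Leads.csv and the chosen feature subset are assumptions, not part of this listing.

    # Hedged sketch: assign lead scores with a simple logistic regression.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("Leads.csv")  # assumed file name

    features = ["TotalVisits", "Total Time Spent on Website", "Page Views Per Visit"]
    X = df[features].fillna(0)
    y = df["Converted"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    # Lead score = predicted conversion probability scaled to 0-100.
    df.loc[X_test.index, "lead_score"] = (model.predict_proba(X_test)[:, 1] * 100).round()
    print(df.loc[X_test.index, ["Lead Number", "lead_score"]].head())

    In practice the categorical variables (Lead Source, Last Activity, Tags, etc.) would also be encoded and included; this sketch only shows the score-assignment idea.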

    Acknowledgements

    UpGrad Case Study

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  12. WHPA Dataset

    • paperswithcode.com
    Updated May 11, 2021
    Cite
    Robin Thibaut; Eric Laloy; Thomas Hermans (2021). WHPA Dataset [Dataset]. https://paperswithcode.com/dataset/whpa
    Explore at:
    Dataset updated
    May 11, 2021
    Authors
    Robin Thibaut; Eric Laloy; Thomas Hermans
    Description

    This dataset was created as part of the following study, which was published in the Journal of Hydrology: A new framework for experimental design using Bayesian Evidential Learning: the case of wellhead protection area https://doi.org/10.1016/j.jhydrol.2021.126903. The pre-print is available on arXiv: https://arxiv.org/pdf/2105.05539.pdf

    Files description

    This dataset contains 4148 simulation results, i.e., 4148 pairs of predictor/target. bkt.npy contains the breakthrough curves from all 6 injection wells recorded at the pumping well. pz.npy contains the 2D coordinates of the backtracked particles' end points, used to delineate the WHPA.
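    A minimal sketch for inspecting the two arrays with NumPy; the printed shapes are not documented in this listing and depend on the simulation setup, so only the leading dimension of 4148 simulations is expected.

    # Hedged sketch: load the predictor/target pairs of the WHPA dataset.
    import numpy as np

    bkt = np.load("bkt.npy")  # breakthrough curves from the 6 injection wells (predictor)
    pz = np.load("pz.npy")    # 2D end points of backtracked particles (target, delineates the WHPA)

    print(bkt.shape, pz.shape)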

    Introduction

    The Wellhead Protection Area (WHPA) is a zone around a pumping well where human activities are limited in order to preserve water resources, usually defined by how long dangerous chemicals in the area would take to reach the pumping well (according to local regulation). The WHPA is determined by the flow velocity in the subsurface around the well and can be computed numerically using particle tracking or transport simulation, or estimated in practice using tracer testing. A groundwater model is typically calibrated against field data before being used to calculate the WHPA. In highly populated places where land occupation is a major concern, the introduction of such zones can have a large socioeconomic impact.

    WHPA prediction

    Individual tracers are injected at six data sources (injection wells) scattered around the pumping well; their transport is simulated and their breakthrough curves (BCs) are recorded at the pumping well location. Numerous particles are artificially positioned around the pumping well, and their origins are traced backward in time to identify the associated WHPA.

    The predictor and target are generated using the USGS open-source finite-difference code MODFLOW. To obtain different sets of predictors and targets, different hydrologic models are run with one variable parameter, namely hydraulic conductivity in metres per day. To obtain satisfactory heterogeneity in the hydraulic conductivity fields, which controls the shape and extent of the target WHPAs, sequential Gaussian simulation based on arbitrarily defined variograms is used. The pumping well is located at the (1000 m, 500 m) mark and is surrounded by the six injection wells.

  13. ActiSynValidator Data Set

    • zenodo.org
    application/gzip, pdf +1
    Updated Jul 6, 2024
    Cite
    David Neuroth; David Neuroth; Noah Pflugradt; Noah Pflugradt; Jann Michael Weinand; Jann Michael Weinand; Detlef Stolten; Detlef Stolten (2024). ActiSynValidator Data Set [Dataset]. http://doi.org/10.5281/zenodo.10835209
    Explore at:
    pdf, application/gzip, zipAvailable download formats
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    David Neuroth; David Neuroth; Noah Pflugradt; Noah Pflugradt; Jann Michael Weinand; Jann Michael Weinand; Detlef Stolten; Detlef Stolten
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 15, 2024
    Description

    ActiSynValidator Data Set

    The ActiSynValidator data set is an aggregated activity data set derived from the HETUS 2010 time use survey. It is intended to enable reusable and reproducible validation of various behavior models.

    The ActiSynValidator software framework belongs to this data set, and together they can be used to validate activity profiles, e.g. the results of an occupant behavior model. It provides modules for preprocessing and categorising activity profiles, and comparing them to the statistics in this data set using indicators and plots. It also contains the code that was used to create this data set out of the HETUS 2010 data, so that the generation of this data set is fully reproducible.

    Activity Profile Categorization

    The HETUS data set consists of many single-day activity profiles. These cannot be made publicly accessible due to data protection regulations. The idea of the ActiSynValidator data set is to aggregate these activity profiles using a meaningful classification, to provide behavior statistics for different types of activity profiles. For that, the attributes country, sex, work status, and day type are used.

    Activity Statistics

    Human behavior is complex, and in order to thoroughly validate it, multiple aspects have to be taken into account. Therefore, the ActiSynValidator data set contains distributions for duration and frequency of each activity, as well as the temporal distribution throughout the day. For that purpose, a set of 15 common activity groups is defined. The mapping from the 108 activity codes used in HETUS 2010 is provided as part of the validation framework.

    File Overview

    For convenience, the ActiSynValidator data set is provided both as .tar.gz and as .zip archive. Both files contain the same content, the full activity validation data set.
    Additionally, the document ActiSynValidator_data_set_description.pdf contains a more thorough description of the data set, including its file structure, the content and meaning of its files, and examples on how to use it.

  14. Data from: EHR data

    • kaggle.com
    Updated Apr 30, 2025
    Cite
    Bipul Shahi (2025). EHR data [Dataset]. https://www.kaggle.com/datasets/vipulshahi/ehr-data/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Bipul Shahi
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🔍 Dataset Overview

    Each patient in the dataset has 30 days of continuous health data. The goal is to predict if a patient will progress to a critical condition based on their vital signs, medication adherence, and symptoms recorded daily.

    There are 11 columns in the dataset:

    patient_id: Unique identifier for each patient.
    day: Day number (from 1 to 30) indicating sequential daily records.
    bp_systolic: Systolic blood pressure (top number) in mm Hg. Higher values may indicate hypertension.
    bp_diastolic: Diastolic blood pressure (bottom number) in mm Hg.
    heart_rate: Heartbeats per minute. An elevated heart rate can signal stress, infection, or deterioration.
    respiratory_rate: Breaths per minute. Elevated rates can indicate respiratory distress.
    temperature: Body temperature in °F. Fever or hypothermia are signs of infection or inflammation.
    oxygen_saturation: Percentage of oxygen in blood. Lower values (< 94%) are concerning.
    med_adherence: Patient’s medication adherence (between 0 and 1). Lower values may contribute to worsening.
    symptom_severity: Subjective symptom rating (scale of 1–10). Higher means worse condition.
    progressed_to_critical: Target label: 1 if the patient deteriorated to a critical condition, else 0.

    🎯 Final Task (Prediction Objective)

    Problem Type: Binary classification with time-series data.

    Goal: Train deep learning models (RNN, LSTM, GRU) to learn temporal patterns from a patient's 30-day health history and predict whether the patient will progress to a critical condition.

    📈 How the Data is Used for Modeling

    Input: a 3D array shaped (num_patients, 30, 8), where 30 is the number of days (timesteps) and 8 is the number of features per day (excluding ID, day, and target).
    Output: a binary label for each patient (0 or 1).

    🔄 Feature Contribution to Prediction

    bp_systolic / bp_diastolic: Persistently high or rising BP may signal stress, cardiac issues, or deterioration.
    heart_rate: A rising heart rate can indicate fever, infection, or organ distress.
    respiratory_rate: Often increases early in critical illnesses like sepsis or COVID.
    temperature: Fever is a key sign of infection. Chronically low/high temperature may indicate underlying pathology.
    oxygen_saturation: A declining oxygen level is a strong predictor of respiratory failure.
    med_adherence: Poor medication adherence is often linked to worsening chronic conditions.
    symptom_severity: Patient-reported worsening symptoms may precede measurable physiological changes.

    🛠 Tools You’ll Use

    Data processing: Pandas, NumPy, Scikit-learn
    Time series modeling: Keras (using SimpleRNN, LSTM, GRU)
    Evaluation: Accuracy, Loss, ROC Curve (optional)
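    To make the setup concrete, here is a hedged Keras sketch that reshapes the per-day records into the (num_patients, 30, 8) tensor described above and trains a small LSTM. The CSV file name and exact column order are assumptions and should be adapted to the actual dataset file.

    # Hedged sketch: LSTM on 30-day patient sequences (file name and columns assumed).
    import numpy as np
    import pandas as pd
    from tensorflow import keras

    df = pd.read_csv("ehr_data.csv")  # assumed file name
    feature_cols = ["bp_systolic", "bp_diastolic", "heart_rate", "respiratory_rate",
                    "temperature", "oxygen_saturation", "med_adherence", "symptom_severity"]

    # Build one (30, 8) sequence per patient, plus a single label per patient.
    df = df.sort_values(["patient_id", "day"])
    X = np.stack([g[feature_cols].to_numpy() for _, g in df.groupby("patient_id")])
    y = df.groupby("patient_id")["progressed_to_critical"].max().to_numpy()

    model = keras.Sequential([
        keras.layers.Input(shape=(30, 8)),
        keras.layers.LSTM(32),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)

    Swapping keras.layers.LSTM for SimpleRNN or GRU gives the other model variants mentioned above.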

  15. Percentage of time IT security services are available

    • datasets.ai
    • ouvert.canada.ca
    • +1more
    8
    Updated Oct 8, 2024
    + more versions
    Cite
    Shared Services Canada | Services partagés Canada (2024). Percentage of time IT security services are available [Dataset]. https://datasets.ai/datasets/29017f8f-54ce-42a9-9ff7-61f53d5abee1
    Explore at:
    8Available download formats
    Dataset updated
    Oct 8, 2024
    Dataset authored and provided by
    Shared Services Canada | Services partagés Canada
    Description

    This indicator reports on the SSC Enterprise IT security services that are part of the SSC service catalogue. This indicator is an aggregation of myKEY, Secure Remote Access and External Credential Management. These services best represent CITS’ ability to secure IT infrastructure.

    Calculation / formula: Numerator: Total time (# hours, minutes, and seconds) the infrastructure security services are available (i.e. Up-Time) in assessment period (day, week, month, and year) multiplied by the number of services, and multiplied by the number of applicable customer departments.

    Denominator: Total time (# hours, minutes and seconds) in assessment period (day, week, month, and year) multiplied by the number of services, and multiplied by the number of applicable customer departments.

    The trend should be interpreted such that a higher % represents progress toward the target. Once the target has been reached, any additional % represents excellence.

    The availability percentage excludes maintenance windows. For example, if there were a 1-day outage of an IS service (e.g. myKey) in a 31-day month, the Up-Time would be 24 hrs/day x 30 days x 1 service x 40 customers = 28,800 hours (per month), which is the numerator. The total time for the reporting period (in this case, monthly) is 24 hrs/day x 31 days x 1 service x 40 customers = 29,760 hours, which is the denominator. The numerator over the denominator gives the % of time the IS service is available: 28,800 / 29,760 = 96.8%.
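    The same calculation can be written as a small function; the numbers below mirror the example above, and the service and customer counts are purely illustrative.

    # Hedged sketch: availability = up-time / total time over the assessment period.
    def availability(uptime_hours, period_hours, services, customers):
        numerator = uptime_hours * services * customers
        denominator = period_hours * services * customers
        return numerator / denominator

    # 1-day outage of one service in a 31-day month, 40 customer departments:
    pct = availability(uptime_hours=24 * 30, period_hours=24 * 31, services=1, customers=40)
    print(f"{pct:.1%}")  # 96.8%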

    Target: 99.8%.

  16. Primary documentation on the scientific study of indicators of continuous...

    • zenodo.org
    Updated Apr 24, 2025
    Cite
    Mariia Matveeva; Mariia Matveeva; Marina Koshmeleva; Marina Koshmeleva; Dmitriy Kachanov; Dmitriy Kachanov; Svetlana Fomina; Svetlana Fomina; Iuliia Samoilova; Iuliia Samoilova; Екатерина Трифонова; Екатерина Трифонова (2025). Primary documentation on the scientific study of indicators of continuous monitoring and flash monitoring of glycemia in children and adolescents with type 1 diabetes mellitus [Dataset]. http://doi.org/10.5281/zenodo.10074289
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo
    Authors
    Mariia Matveeva; Mariia Matveeva; Marina Koshmeleva; Marina Koshmeleva; Dmitriy Kachanov; Dmitriy Kachanov; Svetlana Fomina; Svetlana Fomina; Iuliia Samoilova; Iuliia Samoilova; Екатерина Трифонова; Екатерина Трифонова
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The main purpose of creating an electronic database was to evaluate the performance of continuous glucose monitoring (CGM) and flash monitoring (FMS) in children and adolescents diagnosed with type 1 diabetes mellitus. The database is intended for entering, systematizing, storing and displaying patient data (date of birth, age, date of diagnosis of type 1 diabetes mellitus, duration of illness, date of first visit to an endocrinologist with installation of a CGM or FMS device), glycated hemoglobin values at baseline, during the study and at the end, as well as CGM and FMS data (average glucose level, glycemic variability, percentage of time above the target range, percentage of time within the target range, percentage of time below the target range, number of hypoglycemic episodes and their average duration, frequency of daily scans and frequency of sensor readings).

    The database serves as the basis for comparative statistical analysis of dynamic monitoring indicators in groups of patients with or without diabetic complications (neuropathy, retinopathy and nephropathy). It presents the results of a prospective, open, controlled clinical study obtained over a year and a half. The database includes information on 307 patients (children and adolescents) aged 3 to 17 years inclusive. During the study, the observed patients were divided into two groups: Group 1 - 152 patients diagnosed with type 1 diabetes mellitus and with diabetic complications; Group 2 - 155 patients diagnosed with type 1 diabetes mellitus and with no diabetic complications. All registrants in the database were assigned individual codes, which made it possible to exclude personal data (full names) from the database.

    The database is implemented in Microsoft Excel as a de-identified summary table consisting of two sheets (patients of group 1 and patients of group 2) and is structured according to the following sections: "Patient number"; "Patient code"; "Date of birth"; "Age of the patient"; the section "Date of diagnosis of DM1" indicates the date of the official diagnosis of type 1 diabetes mellitus at the patient's first hospitalization (this information is taken from medical information systems); the section "Length of service DM1" reflects the duration of the patient's illness; the section "Date of the first visit" contains the date of the registrant's visit to the endocrinologist at which the FMS/CGM device was installed; the section "Frequency of self-monitoring with a glucometer" contains the frequency at which the patient measured blood glucose levels at home using a glucometer before the FMS/CGM was installed.

    Sections "HbA1c initially (GMI)", "HbA1c (GMI)", "HbA1c final (GMI)", display the indicators of the level of glycated hemoglobin from the total for the period of the beginning of the study, at the intermediate stages of the study and at the end of observation.

    The database structure has a number of sections accumulating information obtained with CGM/FMS, in particular: the section "Average glucose level"; the section "% above the target range", reflecting the percentage of the day the patient spent with glycemia above the target indicators; the section "% within the target range", reflecting the percentage of the day the patient spent within the target glycemia indicators; the section "% below the target range", reflecting the percentage of the day the patient spent with glycemia below the target indicators; the section "Hypoglycemic phenomena", reflecting the number of cases of hypoglycemia in the patient within 2 weeks; the section "Average duration", reflecting the average duration of the hypoglycemic phenomena registered in the patient; the section "Sensor data received", indicating the percentage of time the patient was wearing an active device sensor; the section "Daily scans", showing how many times per day the patient scanned their glycemic level; and the section "%CV", displaying the variability of the patient's glycemia recorded by the device. The listed sections are repeated in the database in accordance with the number of follow-up visits.

    The database also contains a section "Mid. values", which holds the average values of the patient data for all of the above sections, in both the first and the second group of patients.

    When working with the database, the use of filters (in the "Data" tab) containing the names of indicators allows you to enter information about new registrants in a convenient form or correct existing data, as well as sort and search for one or more specified indicators.

    The electronic database makes it possible to systematize a large volume of results, distribute data into categories, search on any field or set of fields in the input format, organize the selected array, use the data directly for statistical analysis, and view and print information matching specified conditions with the fields arranged in a convenient sequence.
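    As an illustration of how such a two-sheet workbook could be pulled into a statistical workflow, a hedged pandas sketch follows; the file name, sheet names, and column label are assumptions and should be replaced with the actual ones described above.

    # Hedged sketch: load both patient groups from the Excel workbook and
    # compare one monitoring indicator between them.
    import pandas as pd

    sheets = pd.read_excel("cgm_fms_database.xlsx", sheet_name=["Group 1", "Group 2"])  # assumed names

    for name, group_df in sheets.items():
        # "% within the target range" is one of the sections described above.
        print(name, group_df["% within the target range"].mean())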

  17. ETHOS.ActivityAssure Dataset

    • zenodo.org
    application/gzip, pdf +1
    Updated Nov 27, 2024
    Cite
    David Neuroth; David Neuroth; Noah Pflugradt; Noah Pflugradt; Jann Michael Weinand; Jann Michael Weinand; Detlef Stolten; Detlef Stolten (2024). ETHOS.ActivityAssure Dataset [Dataset]. http://doi.org/10.5281/zenodo.11035251
    Explore at:
    zip, application/gzip, pdfAvailable download formats
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    David Neuroth; David Neuroth; Noah Pflugradt; Noah Pflugradt; Jann Michael Weinand; Jann Michael Weinand; Detlef Stolten; Detlef Stolten
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 15, 2024
    Description

    ETHOS.ActivityAssure Dataset

    The ETHOS.ActivityAssure dataset is an aggregated activity dataset derived from the HETUS 2010 time use survey. It is intended to enable reusable and reproducible validation of various behavior models.

    The ETHOS.ActivityAssure software framework belongs to this dataset, and together they can be used to validate activity profiles, e.g. the results of an occupant behavior model. It provides modules for preprocessing and categorising activity profiles, and comparing them to the statistics in this dataset using indicators and plots. It also contains the code that was used to create this dataset out of the HETUS 2010 data, so that the generation of this dataset is fully reproducible.

    Activity Profile Categorization

    The HETUS dataset consists of many single-day activity profiles. These cannot be made publicly accessible due to data protection regulations. The idea of the ETHOS.ActivityAssure dataset is to aggregate these activity profiles using a meaningful classification, to provide behavior statistics for different types of activity profiles. For that, the attributes country, sex, work status, and day type are used.

    Activity Statistics

    Human behavior is complex, and in order to thoroughly validate it, multiple aspects have to be taken into account. Therefore, the ETHOS.ActivityAssure dataset contains distributions for duration and frequency of each activity, as well as the temporal distribution throughout the day. For that purpose, a set of 15 common activity groups is defined. The mapping from the 108 activity codes used in HETUS 2010 is provided as part of the validation framework.

    File Overview

    For convenience, the ETHOS.ActivityAssure dataset is provided both as .tar.gz and as .zip archive. Both files contain the same content, the full activity validation dataset.
    Additionally, the document ActivityAssure_data_set_description.pdf contains a more thorough description of the dataset, including its file structure, the content and meaning of its files, and examples on how to use it.

  18. Summer Meal Programs - All Summer Sites - Meal Count - Program Period 2023

    • data.texas.gov
    • catalog.data.gov
    application/rdfxml +5
    Updated Feb 7, 2024
    + more versions
    Cite
    Texas Department of Agriculture (2024). Summer Meal Programs - All Summer Sites - Meal Count - Program Period 2023 [Dataset]. https://data.texas.gov/dataset/Summer-Meal-Programs-All-Summer-Sites-Meal-Count-P/ihhh-b9xf
    Explore at:
    csv, json, application/rssxml, xml, application/rdfxml, tsvAvailable download formats
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    Texas Department of Agriculturehttp://www.texasagriculture.gov/
    Description

    Help us provide the most useful data by completing our ODP User Feedback Survey for Summer Meal Programs Data

    About the Dataset
    This dataset contains site-level meal counts from approved TDA claims for Seamless Summer Option (SSO) and Summer Food Service Program (SFSP) in summer 2023. Summer meal programs typically operate mid-May through August unless otherwise noted. Participants have 60 days from the final meal service day of the month to submit claims to TDA.

    Meal count information for individual summer meal programs can be found as filtered views of this dataset on our TDA Data Overview - Summer Meals Programs page.

    Meal reimbursement data is collected at the sponsor/CE level and is reported in the "Summer Meal Programs - Seamless Summer Option (SSO) - Meal Reimbursements" and “Summer Meal Programs – Summer Food Service Program (SFSP) – Meal Reimbursements” datasets found on the Summer Meal Program Data Overview page.

    An overview of all Summer Meal Program data available on the Texas Open Data Portal can be found at our TDA Data Overview - Summer Meals Programs page.

    An overview of all TDA Food and Nutrition data available on the Texas Open Data Portal can be found at our TDA Data Overview - Food and Nutrition Open Data page.

    About Dataset Updates

    TDA aims to post new program year data by July 15 of the active program period. Participants have 60 days to submit claims. Data updates will occur daily and end 90 days after the close of the program year. Any data posted during the active program year is subject to change.

    About the Agency
    The Texas Department of Agriculture administers 12 U.S. Department of Agriculture nutrition programs in Texas including the National School Lunch and School Breakfast Programs, the Child and Adult Care Food Programs (CACFP), and summer meal programs. TDA’s Food and Nutrition division provides technical assistance and training resources to partners operating the programs and oversees the USDA reimbursements they receive to cover part of the cost associated with serving food in their facilities. By working to ensure these partners serve nutritious meals and snacks, the division adheres to its mission — Feeding the Hungry and Promoting Healthy Lifestyles.

    For more information on these programs, please visit our website.

  19. cnn_dailymail

    • huggingface.co
    • tensorflow.org
    • +2more
    Updated Dec 18, 2021
    + more versions
    Cite
    ccdv (2021). cnn_dailymail [Dataset]. https://huggingface.co/datasets/ccdv/cnn_dailymail
    Explore at:
    Dataset updated
    Dec 18, 2021
    Authors
    ccdv
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    CNN/DailyMail non-anonymized summarization dataset.

    There are two features:

    - article: text of the news article, used as the document to be summarized
    - highlights: joined text of the highlights, with <s> and </s> around each highlight, which is the target summary
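    A hedged sketch for loading this copy of the dataset with the Hugging Face datasets library; the configuration name "3.0.0" mirrors the standard cnn_dailymail configurations and should be checked against the dataset card.

    # Hedged sketch: load the ccdv/cnn_dailymail summarization dataset.
    from datasets import load_dataset

    ds = load_dataset("ccdv/cnn_dailymail", "3.0.0", split="train")
    example = ds[0]
    print(example["article"][:300])   # document to be summarized
    print(example["highlights"])      # target summary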

  20. Trace-Share Dataset for Evaluation of Trace Meaning Preservation

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 7, 2020
    Cite
    Madeja, Tomas (2020). Trace-Share Dataset for Evaluation of Trace Meaning Preservation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3547527
    Explore at:
    Dataset updated
    May 7, 2020
    Dataset provided by
    Cermak, Milan
    Madeja, Tomas
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains all data used during the evaluation of trace meaning preservation. Archives are protected by password "trace-share" to avoid false detection by antivirus software.

    For more information, see the project repository at https://github.com/Trace-Share.
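    Since the archives are password-protected, a small sketch for extracting them with Python's standard zipfile module is shown below; the archive names follow the dataset structure listed further down, and the sketch assumes standard ZipCrypto encryption (zipfile cannot decrypt AES-encrypted archives).

    # Hedged sketch: extract the password-protected evaluation archives.
    import zipfile

    for archive in ("traces.zip", "alerts.zip"):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(path=archive.removesuffix(".zip"), pwd=b"trace-share")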

    Selected Attack Traces

    The following list contains trace datasets used for evaluation. Each attack was chosen to have not only a different meaning but also different statistical properties.

    dos_http_flood — the capture of GET and POST requests sent to one server by one attacker (HTTP traffic);

    ftp_bruteforce — short and unsuccessful attempt to guess a user’s password for FTP service (FTP traffic);

    ponyloader_botnet — the capture of a Pony Loader botnet used to steal credentials from 3 target devices reporting to a single IP through a large number of intermediate addresses (DNS and HTTP traffic);

    scan — the capture of the nmap tool scanning a given subnet using ICMP echo and TCP SYN requests (consists of ARP, ICMP, and TCP traffic);

    wannacry_ransomware — the capture of WannaCry ransomware spreading in a domain with three workstations, a domain controller, and a file-sharing server (SMB and SMBv2 traffic).

    Background Traffic Data

    The publicly available dataset CSE-CIC-IDS-2018 was used as background traffic data. The evaluation uses data from the day Thursday-01-03-2018, which contains a sufficient proportion of regular traffic without any statistically significant attacks. Only traffic aimed at the victim machines (range 172.31.69.0/24) is used, to reduce less significant traffic.

    Evaluation Results and Dataset Structure

    Traces variants (traces.zip)

    ./traces-original/ — trace PCAP files and crawled details in YAML format;

    ./traces-normalized — normalized PCAP files and details in YAML format;

    ./traces-adjusted — adjusted PCAP files using various timestamp generation settings, combination configuration in YAML format, and labels provided by ID2T in XML format.

    Extracted alerts (alerts.zip)

    ./alerts-original/ — extracted Suricata alerts, Suricata log, and full Suricata output for all original trace files;

    ./alerts-normalized/ — extracted Suricata alerts, Suricata log, and full Suricata output for all normalized trace files;

    ./alerts-adjusted/ — extracted Suricata alerts, Suricata log, and full Suricata output for all adjusted trace files.

    Evaluation results

    *.csv files in the root directory — the data contains extracted alert signatures and their count for each trace variant.
