100+ datasets found
  1. Online Retail Transaction Data

    • kaggle.com
    Updated Dec 21, 2023
    Cite
    The Devastator (2023). Online Retail Transaction Data [Dataset]. https://www.kaggle.com/datasets/thedevastator/online-retail-transaction-data
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    Online Retail Transaction Data

    UK Online Retail Sales and Customer Transaction Data

    By UCI [source]

    About this dataset

    Comprehensive Dataset on Online Retail Sales and Customer Data

    Welcome to this comprehensive dataset offering a wide array of information related to online retail sales. It provides an in-depth look at transactions, product details, and customer information documented by an online retail company based in the UK, ranging from granular details about each product sold to customer records from different countries.

    This transnational dataset meticulously catalogues all the transactions recorded during its span by a non-store online retail company based in the UK that sells unique all-occasion gifts. A considerable portion of its clientele are wholesalers, so the dataset is well suited to companies looking for patterns or studying purchasing trends among such businesses.

    The available attributes within this dataset offer valuable pieces of information:

    • InvoiceNo: This attribute refers to invoice numbers that are six-digit integral numbers uniquely assigned to every transaction logged in this system. Transactions marked with 'c' at the beginning signify cancellations - adding yet another dimension for purchase pattern analysis.

    • StockCode: Stock Code corresponds with specific items as they're represented within the inventory system via 5-digit integral numbers; these allow easy identification and distinction between products.

    • Description: This refers to product names, giving users qualitative knowledge about what kind of items are being bought and sold frequently.

    • Quantity: the number of units of each product per transaction, a key figure for understanding buying trends.

    • InvoiceDate: Invoice Dates detail when each transaction was generated down to precise timestamps – invaluable when conducting time-based trend analysis or segmentation studies.

    • UnitPrice: the price at which each unit retails, crucial for revenue calculations or cost-related analyses.

    • Country: this locational attribute shows where each customer hails from, adding geographical segmentation to your data investigation toolkit.

    This dataset was originally collated by Dr Daqing Chen, Director of the Public Analytics group based at the School of Engineering, London South Bank University. His research studies and business cases with this dataset have been published in various papers contributing to establishing a solid theoretical basis for direct, data and digital marketing strategies.

    Access to such records can support enriching explorations or help formulate insightful hypotheses about consumer behavior patterns among wholesalers. Whether the goal is managing inventory, studying transactional trends over time, or spotting cancellation patterns, this dataset is suited to many forms of retail analysis.

    How to use the dataset

    1. Sales Analysis:

    Sales data forms the backbone of this dataset, and it allows users to delve into various aspects of sales performance. You can use the Quantity and UnitPrice fields to calculate metrics like revenue, and further combine them with InvoiceNo information to understand sales over individual transactions (a short pandas sketch follows this list).

    2. Product Analysis:

    Each product in this dataset comes with its unique identifier (StockCode) and its name (Description). You could analyse which products are most popular based on Quantity sold or look at popularity per transaction by considering both Quantity and InvoiceNo.

    3. Customer Segmentation:

    By attaching business logic to the transactions (such as calculating total amounts per invoice), you could use standard machine learning methods or RFM (Recency, Frequency, Monetary) segmentation techniques, grouping by 'CustomerID', to understand customer behavior better. Counting distinct invoice numbers (each representing a separate transaction) per client will give insights about your clients as well; the RFM part of the sketch after this list illustrates this.

    4. Geographical Analysis:

    The Country column enables analysts to study purchase patterns across different geographical locations.
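
    The analyses sketched in items 1 and 3 above can be prototyped in a few lines of pandas. This is a minimal sketch, assuming the data has been exported locally; the file name online_retail.csv and the reference date are placeholders, not part of the dataset:

        import pandas as pd

        # Load the transactions (file name is a placeholder for a local export).
        df = pd.read_csv("online_retail.csv", parse_dates=["InvoiceDate"])

        # Drop cancellations: invoice numbers starting with 'c' mark cancelled transactions.
        df = df[~df["InvoiceNo"].astype(str).str.upper().str.startswith("C")]

        # 1. Sales analysis: revenue per transaction.
        df["Revenue"] = df["Quantity"] * df["UnitPrice"]
        revenue_per_invoice = df.groupby("InvoiceNo")["Revenue"].sum()

        # 3. RFM segmentation per customer.
        snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)  # arbitrary reference date
        rfm = df.groupby("CustomerID").agg(
            recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
            frequency=("InvoiceNo", "nunique"),
            monetary=("Revenue", "sum"),
        )
        print(revenue_per_invoice.head())
        print(rfm.head())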

    Practical applications

    Understand what products sell best where – it can help drive tailored marketing strategies. Anomaly detection – identify unusual behaviors that might lead frau...

  2. A multi-modal human neuroimaging dataset for data integration: simultaneous...

    • openneuro.org
    Updated Jun 7, 2021
    + more versions
    Cite
    Giulia Lioi; Claire Cury; Lorraine Perronnet; Marsel Mano; Elise Bannier; Anatole Lecuyer; Christian Barillot (2021). A multi-modal human neuroimaging dataset for data integration: simultaneous EEG and fMRI acquisition during a motor imagery neurofeedback task: XP1 [Dataset]. http://doi.org/10.18112/openneuro.ds002336.v2.0.2
    Explore at:
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Giulia Lioi; Claire Cury; Lorraine Perronnet; Marsel Mano; Elise Bannier; Anatole Lecuyer; Christian Barillot
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    ———————————————————————————————— ORIGINAL PAPERS ————————————————————————————————

    Lioi, G., Cury, C., Perronnet, L., Mano, M., Bannier, E., Lécuyer, A., & Barillot, C. (2019). Simultaneous MRI-EEG during a motor imagery neurofeedback task: an open access brain imaging dataset for multi-modal data integration. BioRxiv. https://doi.org/10.1101/862375

    Mano, Marsel, Anatole Lécuyer, Elise Bannier, Lorraine Perronnet, Saman Noorzadeh, and Christian Barillot. 2017. “How to Build a Hybrid Neurofeedback Platform Combining EEG and FMRI.” Frontiers in Neuroscience 11 (140). https://doi.org/10.3389/fnins.2017.00140

    Perronnet, Lorraine, Anatole Lécuyer, Marsel Mano, Elise Bannier, Maureen Clerc, and Christian Barillot. 2017. “Unimodal Versus Bimodal EEG-FMRI Neurofeedback of a Motor Imagery Task.” Frontiers in Human Neuroscience 11 (193). https://doi.org/10.3389/fnhum.2017.00193.

    This dataset, named XP1, can be combined with the dataset XP2, available here: https://openneuro.org/datasets/ds002338. Data acquisition methods have been described in Perronnet et al. (2017, Frontiers in Human Neuroscience). Simultaneous 64-channel EEG and fMRI were acquired during right-hand motor imagery and neurofeedback (NF) in this study (as well as in XP2). For this study, 10 subjects performed three types of NF runs (bimodal EEG-fMRI NF, unimodal EEG-NF and unimodal fMRI-NF).

    ———————————————————————————————— EXPERIMENTAL PARADIGM ————————————————————————————————
    Subjects were instructed to perform a kinaesthetic motor imagery of the right hand and to find their own strategy to control and bring the ball to the target. The experimental protocol consisted of 6 EEG-fMRI runs with a 20 s block design alternating rest and task:
    • motor localizer run (task-motorloc): 8 blocks x (20 s rest + 20 s task)
    • motor imagery run without NF (task-MIpre): 5 blocks x (20 s rest + 20 s task)
    • three NF runs with different NF conditions (task-eegNF, task-fmriNF, task-eegfmriNF), occurring in random order: 10 blocks x (20 s rest + 20 s task)
    • motor imagery run without NF (task-MIpost): 5 blocks x (20 s rest + 20 s task)

    ———————————————————————————————— EEG DATA ———————————————————————————————— EEG data was recorded using a 64-channel MR compatible solution from Brain Products (Brain Products GmbH, Gilching, Germany).

    RAW EEG DATA

    EEG was sampled at 5 kHz with FCz as the reference electrode and AFz as the ground electrode, and a resolution of 0.5 microV. Following the BIDS directory structure, raw EEG data for each task can be found for each subject in

    XP1/sub-xp1*/eeg

    in Brain Vision Recorder format (File Version 1.0). Each raw EEG recording includes three files: the data file (.eeg), the header file (.vhdr) and the marker file (*.vmrk). The header file contains information about acquisition parameters and amplifier setup. For each electrode, the impedance at the beginning of the recording is also specified. For all subjects, channel 32 is the ECG channel. The 63 other channels are EEG channels.

    The marker file contains the list of markers assigned to the EEG recordings and their properties (marker type, marker ID and position in data points). Three types of markers are relevant for the EEG processing:
    • R128 (Response): the fMRI volume marker, used to correct for the gradient artifact
    • S 99 (Stimulus): the protocol marker indicating the start of the Rest block
    • S 2 (Stimulus): the protocol marker indicating the start of the Task (Motor Execution, Motor Imagery or Neurofeedback)
    Warning: in a few EEG recordings, the first S 99 marker may be missing, but it can easily be added 20 s before the first S 2.
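
    As a hedged illustration of how one might check for the missing S 99 marker mentioned in the warning above, the BrainVision marker file (*.vmrk) can be parsed as plain text; at the raw 5 kHz sampling rate, 20 s corresponds to 100,000 data points (the file name below is a placeholder, not a verified path from the dataset):

        import re

        def read_markers(vmrk_path):
            # Parse "Mk<n>=<type>,<description>,<position>,..." lines from a BrainVision marker file.
            markers = []
            with open(vmrk_path, encoding="utf-8", errors="ignore") as f:
                for line in f:
                    m = re.match(r"Mk\d+=([^,]*),([^,]*),(\d+)", line.strip())
                    if m:
                        markers.append((m.group(1), m.group(2), int(m.group(3))))
            return markers

        markers = read_markers("sub-xp101_task-eegNF_eeg.vmrk")  # placeholder file name
        s99 = [pos for _, desc, pos in markers if desc.strip() == "S 99"]
        s2 = [pos for _, desc, pos in markers if desc.strip() in ("S 2", "S  2")]

        if not s99 and s2:
            # Raw EEG is sampled at 5 kHz, so 20 s = 100000 data points before the first S 2.
            print("First S 99 missing; inferred Rest onset at data point", min(s2) - 20 * 5000)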

    PREPROCESSED EEG DATA

    Following the BIDS directory structure, processed EEG data for each task and subject can be found in the pre-processed data folder:

    XP1/derivatives/sub-xp1*/eeg_pp/*eeg_pp.*

    in the BrainVision Analyzer format. Each processed EEG recording includes three files: the data file (*.dat), the header file (*.vhdr) and the marker file (*.vmrk), containing information similar to that described for the raw data. The header file of the preprocessed data also specifies the channel locations. The marker file additionally specifies the location, in data points, of the identified heart pulses (R markers).

    EEG data were pre-processed using BrainVision Analyzer II software, with the following steps:
    • Automatic gradient artifact correction using the artifact template subtraction method (sliding average calculation with 21 intervals for the sliding average and all channels enabled for correction).
    • Downsampling by a factor of 25 (to 200 Hz).
    • Low-pass FIR filter with a cut-off frequency of 50 Hz.
    • Ballistocardiogram (pulse) artifact correction using a semiautomatic procedure (pulse template searched between 40 s and 240 s in the ECG channel with the following parameters: Coherence Trigger = 0.5, Minimal Amplitude = 0.5, Maximal Amplitude = 1.3). The identified pulses were marked with R.
    • Segmentation relative to the first block marker (S 99) for the whole length of the training protocol (last S 2 + 20 s).

    EEG NF SCORES

    Neurofeedback scores can be found in the .mat structures in

    XP1/derivatives/sub-xp1*/NF_eeg/d_sub*NFeeg_scores.mat

    The structures, named NF_eeg, are composed of the following subfields:

    NF_eeg
    → .nf_laterality (NF score computed as for the real-time calculation, equation (1))
    → .filteegpow_left (band power of the filtered EEG signal in C1)
    → .filteegpow_right (band power of the filtered EEG signal in C2)
    → .nf (vector of NF scores, 4 per second, computed as in equation (3), for comparison with XP2)
    → .smoothed
    → .eegdata (64 x 200 x 400 matrix, with the pre-processed EEG signals according to the steps described above)
    → .method

    where the subfield .method contains information about the Laplacian filter used and the frequency band of interest.
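
    A minimal sketch for reading these NF_eeg structures in Python with SciPy, as an alternative to MATLAB (the path is a placeholder following the pattern above, and the exact field nesting returned by loadmat may differ slightly):

        from scipy.io import loadmat

        mat = loadmat("XP1/derivatives/sub-xp101/NF_eeg/d_sub-xp101NFeeg_scores.mat",  # placeholder path
                      squeeze_me=True, struct_as_record=False)

        nf_eeg = mat["NF_eeg"]              # MATLAB struct exposed as an object
        print(nf_eeg.nf_laterality)         # NF laterality scores (equation (1))
        print(nf_eeg.eegdata.shape)         # expected 64 x 200 x 400 pre-processed EEG
        print(nf_eeg.method)                # Laplacian filter and frequency band info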

    ———————————————————————————————— BOLD fMRI DATA ———————————————————————————————— All DICOM files were converted to NIfTI-1 and then to BIDS format (version 2.1.4) using the software dcm2niix (version v1.0.20190720 GVV7.4.0).

    fMRI acquisitions were performed using echo-planar imaging (EPI) covering the entire brain, with the following parameters:

    3T Siemens Verio scanner, EPI sequence, TR = 2 s, TE = 23 ms, resolution 2x2x4 mm3, FOV = 210×210 mm2, 32 slices, no slice gap.

    As specified by the onsets in the relative task event files (XP1\*events.tsv), the scanner began the EPI pulse sequence two seconds prior to the start of the protocol (first rest block), so the first two TRs should be discarded. The useful TRs for the runs are therefore:

    • task-motorloc: 320 s (2 to 322)
    • task-MIpre and task-MIpost: 200 s (2 to 202)
    • task-eegNF, task-fmriNF, task-eegfmriNF: 400 s (2 to 402)
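
    A minimal sketch (using nibabel, which is not part of the original pipeline) for dropping the first two TRs of a functional run before analysis; the file name is a placeholder following the XP1/sub-xp1*/func layout mentioned later in this description:

        import nibabel as nib

        # Placeholder file name; adapt to the actual run of interest.
        img = nib.load("XP1/sub-xp101/func/sub-xp101_task-eegfmriNF_bold.nii.gz")

        # Discard the first two TRs (volumes), as advised above, and keep the useful ones.
        trimmed = img.slicer[:, :, :, 2:]
        nib.save(trimmed, "sub-xp101_task-eegfmriNF_bold_trimmed.nii.gz")
        print(img.shape, "->", trimmed.shape)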

    In task events files for the different tasks, each column represents:

    • 'onset': onset time (sec) of an event
    • 'duration': duration (sec) of the event
    • 'trial_type': trial (block) type: rest or task (Rest, Task-ME, Task-MI, Task-NF)
    • 'stim_file': image presented in a stimulus block. During Rest, Motor Imagery (Task-MI) or Motor Execution (Task-ME) blocks, instructions were presented. During Neurofeedback blocks (Task-NF), the image presented was a ball moving in a square that the subject could control by self-regulating his or her EEG and/or fMRI brain activity.

    Following the BIDS directory structure, the functional data and related metadata are found for each subject in the following directory

    XP1/sub-xp1*/func

    BOLD-NF SCORES

    For each subject and NF session, a matlab structure with BOLD-NF features can be found in

    XP1/derivatives/sub-xp1*/NF_bold/


    In view of BOLD-NF scores computation, fMRI data were preprocessed using SPM8 and with the following steps: slice-time correction, spatial realignment and coregistration with the anatomical scan, spatial smoothing with a 6 mm Gaussian kernel and normalization to the Montreal Neurological Institute (MNI) template. For each session, a first level general linear model analysis was then performed. The resulting activation maps (voxel-wise Family-Wise error corrected at p < 0.05) were used to define two ROIs (9x9x3 voxels) around the maximum of activation in the left and right motor cortex. The BOLD-NF scores (fMRI laterality index) were calculated as the difference between percentage signal change in the left and right motor ROIs as for the online NF calculation. A smoothed and normalized version of the NF scores over the precedent three volumes was also computed. To allow for comparison and aggregation of the two datasets XP1 and XP2 we also computed NF scores considering the left motor cortex and a background as for online NF calculation in XP2.

    In the NF_bold folder, the Matlab files sub-xp1*_task-*_NFbold_scores.mat therefore have the following structure:

    NF_bold
    → .nf_laterality (calculated as for the online NF calculation)
    → .smoothnf_laterality
    → .normnf_laterality
    → .nf (calculated as for the online NF calculation in XP2)
    → .roimean_left (averaged BOLD signal in the left motor ROI)
    → .roimean_right (averaged BOLD signal in the right motor ROI)
    → .bgmean (averaged BOLD signal in the background slice)
    → .method

    where the subfield ".method" contains information about the ROI size (.roisize), the background mask (.bgmask) and the ROI masks (.roimask_left, .roimask_right). More details about signal processing and NF calculation can be

  3. UC_vs_US Statistic Analysis.xlsx

    • figshare.com
    xlsx
    Updated Jul 9, 2020
    Cite
    F. (Fabiano) Dalpiaz (2020). UC_vs_US Statistic Analysis.xlsx [Dataset]. http://doi.org/10.23644/uu.12631628.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    Utrecht University
    Authors
    F. (Fabiano) Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the measures described in the paper. For each subject, it includes multiple columns:
    A. a sequential student ID
    B. an ID that defines a random group label and the notation
    C. the used notation: User Story or Use Case
    D. the case they were assigned to: IFA, Sim, or Hos
    E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
    F. a categorical representation of the grade (L/M/H), where H is greater than or equal to 80, M is between 65 (included) and 80 (excluded), and L otherwise
    G. the total number of classes in the student's conceptual model
    H. the total number of relationships in the student's conceptual model
    I. the total number of classes in the expert's conceptual model
    J. the total number of relationships in the expert's conceptual model
    K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, and missing (see tagging scheme below)
    P. the researchers' judgement on how well the derivation process was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping), or not present

    Tagging scheme:
    • Aligned (AL): a concept is represented as a class in both models, either with the same name or using synonyms or clearly linkable names;
    • Wrongly represented (WR): a class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than a class, or (ii) using a generic term (e.g., 'user' instead of 'urban planner');
    • System-oriented (SO): a class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent a legacy system or the system under design (portal, simulator) are legitimate;
    • Omitted (OM): a class in CM-Expert that does not appear in any way in CM-Stud;
    • Missing (MI): a class in CM-Stud that does not appear in any way in CM-Expert.

    All the calculations and information provided in the following sheets originate from that raw data.

    Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection, including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.

    Sheet 3 (Size-Ratio): The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus in this study is on the number of classes. However, we also provided the size ratio for the number of relationships between student and expert model.

    Sheet 4 (Overall): Provides an overview of all subjects regarding the encountered situations, completeness, and correctness. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. Completeness is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR) and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
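
    The correctness and completeness definitions above reduce to two simple ratios over the per-subject counts in columns K-O; a minimal sketch (the example counts are hypothetical):

        def correctness(al, wr, so, om):
            # Aligned classes over aligned + omitted + system-oriented + wrongly represented.
            return al / (al + om + so + wr)

        def completeness(al, wr, om):
            # Correctly or incorrectly represented classes over all expert-model classes.
            return (al + wr) / (al + wr + om)

        # Hypothetical counts for one subject.
        print(correctness(al=12, wr=3, so=2, om=5))   # 0.545...
        print(completeness(al=12, wr=3, om=5))        # 0.75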

    For sheet 4, as well as for the following four sheets, diverging stacked bar charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (t-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:
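
    For reference, the t-test and Hedges' g reported at the bottom of each sheet can be approximated as follows (a sketch using SciPy; the online tool cited above may apply a slightly different small-sample correction, and the example scores are invented):

        import numpy as np
        from scipy import stats

        def hedges_g(x, y):
            # Standardized mean difference with the usual small-sample correction factor.
            x, y = np.asarray(x, float), np.asarray(y, float)
            nx, ny = len(x), len(y)
            pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
            d = (x.mean() - y.mean()) / pooled_sd
            return d * (1 - 3 / (4 * (nx + ny) - 9))

        # Invented correctness scores for two groups (e.g. UC vs US notation).
        uc = [0.61, 0.55, 0.72, 0.48, 0.66]
        us = [0.52, 0.47, 0.58, 0.44, 0.51]
        t, p = stats.ttest_ind(uc, us)
        print(f"t = {t:.2f}, p = {p:.3f}, Hedges' g = {hedges_g(uc, us):.2f}")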

    Sheet 5 (By-Notation): Model correctness and model completeness are compared by notation: UC, US.

    Sheet 6 (By-Case): Model correctness and model completeness are compared by case: SIM, HOS, IFA.

    Sheet 7 (By-Process): Model correctness and model completeness are compared by how well the derivation process is explained: well explained, partially explained, not present.

    Sheet 8 (By-Grade): Model correctness and model completeness are compared by exam grade, converted to the categorical values High, Medium, and Low.

  4. Pasadena Test Data Sets

    • catalog.data.gov
    • datahub.transportation.gov
    • +4more
    Updated Jun 16, 2025
    Cite
    US Department of Transportation (2025). Pasadena Test Data Sets [Dataset]. https://catalog.data.gov/dataset/pasadena-test-data-sets
    Explore at:
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    US Department of Transportation
    Area covered
    Pasadena
    Description

    The purpose of the data environment is to provide multi-modal data and contextual information (weather and incidents) that can be used to research and develop Intelligent Transportation System applications. This data set contains the following data for the two months of September and October 2011 in Pasadena, California: Highway network data, Demand data, Sample mobile sightings provided for a two-hour period, provided by AirSage (see note 1 below), Network performance data (measured and forecast), Work zone data, Weather data, and Changeable message sign data. This legacy dataset was created before data.transportation.gov and is only currently available via the attached file(s). Please contact the dataset owner if there is a need for users to work with this data using the data.transportation.gov analysis features (online viewing, API, graphing, etc.) and the USDOT will consider modifying the dataset to fully integrate in data.transportation.gov.

  5. Hazardous Waste Portal Manifest Metadata

    • data.ct.gov
    • datasets.ai
    application/rdfxml +5
    Updated May 7, 2020
    + more versions
    Cite
    Bureau of Materials Management and Compliance Assurance, Waste Engineering and Enforcement Division (2020). Hazardous Waste Portal Manifest Metadata [Dataset]. https://data.ct.gov/w/x2z6-swxe/wqz6-rhce?cur=7oNWyJm2JEZ
    Explore at:
    Available download formats: json, tsv, application/rdfxml, csv, application/rssxml, xml
    Dataset updated
    May 7, 2020
    Dataset authored and provided by
    Bureau of Materials Management and Compliance Assurance, Waste Engineering and Enforcement Division
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    Note: Please use the following view to be able to see the entire Dataset Description: https://data.ct.gov/Environment-and-Natural-Resources/Hazardous-Waste-Portal-Manifest-Metadata/x2z6-swxe

    Dataset Description Outline (5 sections)

    • INTRODUCTION

    • WHY USE THE CONNECTICUT OPEN DATA PORTAL MANIFEST METADATA DATASET INSTEAD OF THE DEEP DOCUMENT ONLINE SEARCH PORTAL ITSELF?

    • WHAT MANIFESTS ARE INCLUDED IN DEEP’S MANIFEST PERMANENT RECORDS AND ARE ALSO AVAILABLE VIA THE DEEP DOCUMENT SEARCH PORTAL AND CT OPEN DATA?

    • HOW DOES THE PORTAL MANIFEST METADATA DATASET RELATE TO THE OTHER TWO MANIFEST DATASETS PUBLISHED IN CT OPEN DATA?

    • IMPORTANT NOTES

    INTRODUCTION
    • All of DEEP’s paper hazardous waste manifest records were recently scanned and “indexed”.
    • Indexing consisted of 6 basic pieces of information or “metadata” taken from each manifest about the Generator and stored with the scanned image. The metadata enables searches by: Site Town, Site Address, Generator Name, Generator ID Number, Manifest ID Number and Date of Shipment.
    • All of the metadata and scanned images are available electronically via DEEP’s Document Online Search Portal at: https://filings.deep.ct.gov/DEEPDocumentSearchPortal/
    • Therefore, it is no longer necessary to visit the DEEP Records Center in Hartford for manifest records or information.
    • This CT Data dataset “Hazardous Waste Portal Manifest Metadata” (or “Portal Manifest Metadata”) was copied from the DEEP Document Online Search Portal, and includes only the metadata – no images.

    WHY USE THE CONNECTICUT OPEN DATA PORTAL MANIFEST METADATA DATASET INSTEAD OF THE DEEP DOCUMENT ONLINE SEARCH PORTAL ITSELF? The Portal Manifest Metadata is a good search tool to use along with the Portal. Searching the Portal Manifest Metadata can provide the following advantages over searching the Portal:
    • faster searches, especially for “large searches” - those with a large number of search returns;
    • an unlimited number of search returns (the Portal is limited to 500);
    • a larger display of search returns;
    • search returns can be sorted and filtered online in CT Data; and
    • search returns and the entire dataset can be downloaded from CT Data and used offline (e.g. downloaded to Excel format);
    • metadata from searches can be copied from CT Data and pasted into the Portal search fields to quickly find single scanned images.
    The main advantages of the Portal are:
    • it provides access to scanned images of manifest documents (CT Data does not); and
    • images can be downloaded one or multiple at a time.

    WHAT MANIFESTS ARE INCLUDED IN DEEP’S MANIFEST PERMANENT RECORDS AND ARE ALSO AVAILABLE VIA THE DEEP DOCUMENT SEARCH PORTAL AND CT OPEN DATA? All hazardous waste manifest records received and maintained by the DEEP Manifest Program, including:
    • manifests originating from a Connecticut Generator or sent to a Connecticut Destination Facility, including manifests accompanying an exported shipment
    • manifests with RCRA hazardous waste listed on them (such manifests may also have non-RCRA hazardous waste listed)
    • manifests from a Generator with a Connecticut Generator ID number (permanent or temporary number)
    • manifests with sufficient quantities of RCRA hazardous waste listed for DEEP to consider the Generator to be a Small or Large Quantity Generator
    • manifests with PCBs listed on them from 2016 to 6-29-2018
    • Note: manifests sent to a CT Destination Facility were indexed by the Connecticut or Out of State Generator. Searches by CT Designated Facility are not possible unless such facility is the Generator for the purposes of manifesting.

    All other manifests were considered “non-hazardous” manifests and were not scanned. They were discarded after 2 years in accordance with the DEEP records retention schedule. Non-hazardous manifests include:
    • manifests with only non-RCRA hazardous waste listed
    • manifests from generators that did not have a permanent or temporary Generator ID number
    • Sometimes non-hazardous manifests were considered “Hazardous Manifests” and kept on file if DEEP had reason to believe the generator should have had a permanent or temporary Generator ID number. These manifests were scanned and included in the Portal.

    Dates included: manifests with shipment dates from 1980 to present.
    • States were the primary keepers of manifest records until June 29, 2018. Any manifest regarding a Connecticut Generator or Destination Facility should have been sent to DEEP, and should be present in the Portal and CT Data.
    • June 30, 2018 was the start of the EPA e-Manifest program. Most manifests with a shipment date on and after this date are sent to, and maintained by, the EPA.
    • For information from EPA regarding these newer manifests:
      • Overview: https://rcrapublic.epa.gov/rcrainfoweb/action/modules/em/emoverview
      • To search by site, use EPA’s Sites List: https://rcrapublic.epa.gov/rcrainfoweb/action/modules/hd/handlerindex (Tip: change the Location field from “National” to “Connecticut”)
    • Manifests still sent to DEEP on or after 6-30-2018 include:
      • manifests from exported shipments; and
      • manifest copies submitted pursuant to discrepancy reports and unmanifested shipments.

    HOW DOES THE PORTAL MANIFEST METADATA RELATE TO THE OTHER TWO MANIFEST DATASETS PUBLISHED IN CT DATA?
    • DEEP has posted in CT Data two other datasets about the same hazardous waste documents which are the subject of the Portal and the Portal Manifest Metadata Copy.
    • There are likely some differences in the metadata between the Portal Manifest Metadata and the two others. DEEP recommends using all data sources for a complete search.
    • These two datasets were the best search tools DEEP had available to the public prior to the Portal and the Metadata Copy:
      • “Hazardous Waste Manifest Data (CT) 1984 – 2008”: https://data.ct.gov/Environment-and-Natural-Resources/Hazardous-Waste-Manifest-Data-CT-1984-2008/h6d8-qiar; and
      • “Hazardous Waste Manifest Data (CT) 1984 – 2008: Generator Summary View”: https://data.ct.gov/Environment-and-Natural-Resources/Hazardous-Waste-Manifest-Data-CT-1984-2008-Generat/72mi-3f82.
    • The only difference between these two datasets is:
      • the first dataset includes all of the metadata transcribed from the manifests;
      • the second “Generator Summary View” dataset is a smaller subset of the first, requested for convenience by the public.
    • Both of these datasets:
      • are copies of metadata from a manifest database maintained by DEEP; no scanned images are available as a companion to these datasets.
      • cover manifests with a date range of 1984 to approximately 2008.

    IMPORTANT NOTES (4):
    NOTE 1: Some manifest images are effectively unavailable via the Portal and the Portal Metadata due to incomplete or incorrect metadata. Such errors may be the result of unintentional data entry errors, errors on the manifests, or illegible manifests.
    • Incomplete or incorrect metadata may prevent a manifest from being found by a search. DEEP is currently working to complete the metadata as best it can.
    • Please report errors to the DEEP Manifest Program at deep.manifests@ct.gov.
    • DEEP will publish updates regarding this work here and through the DEEP Hazardous Waste Advisory Committee listserv. To sign up for this listserv, visit this webpage: https://portal.ct.gov/DEEP/Waste-Management-and-Disposal/Hazardous-Waste-Advisory-Committee/HWAC-Home.
    NOTE 2: This dataset does not replace the potential need for a full review of other files publicly available either on-line and/or at CT DEEP’s Records Center. For a complete review of agency records for this or other agency programs, you can perform your own search in our DEEP public file room located at 79 Elm Street, Hartford CT, or at our DEEP Online Search Portal at: https://filings.deep.ct.gov/DEEPDocumentSearchPortal/Home.
    NOTE 3: Other DEEP programs or state and federal agencies may maintain manifest records (e.g., DEEP Emergency Response, US Environmental Protection Agency, etc.). These other manifests were not scanned along with those from the Manifest Program files. However, most likely these other manifests are duplicate copies of manifests available via the Portal.
    NOTE 4: Search tips for using the Portal and CT Data:
    • If your search will yield a small number of search returns, try using the Portal for your search. “Small” here means fewer than the 500 maximum search returns allowed by the Portal.
    • Start your search as broadly as possible – try entering just the town and the street name, or a portion of the street name that is likely to be spelled correctly.
    • For searches yielding a large number of search returns, try first using the Portal Manifest Metadata in CT Data.
    • Try downloading the metadata and sorting, filtering, etc. to look for related spellings and similar variations.
    • Once you narrow down your research, copy the manifest number of a manifest you are interested in, and paste it into the Agency ID field of the Portal search page.
    • If you are using information from older information sources, for consistency you may want to search the two datasets copied from the older DEEP Manifest Database.

  6. ‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-top-1000-kaggle-datasets-658b/b992f64b/?iid=004-457&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Top 1000 Kaggle Datasets’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/notkrishna/top-1000-kaggle-datasets on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    From Wikipedia:

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was founding chair succeeded by Max Levchin. Equity was raised in 2011 valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]

    Source: Kaggle

    --- Original source retains full ownership of the source dataset ---

  7. Data from: Variable Selection with Multiply-Imputed Datasets: Choosing...

    • tandf.figshare.com
    pdf
    Updated Jun 3, 2023
    Cite
    Jiacong Du; Jonathan Boss; Peisong Han; Lauren J. Beesley; Michael Kleinsasser; Stephen A. Goutman; Stuart Batterman; Eva L. Feldman; Bhramar Mukherjee (2023). Variable Selection with Multiply-Imputed Datasets: Choosing Between Stacked and Grouped Methods [Dataset]. http://doi.org/10.6084/m9.figshare.19111441.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Jiacong Du; Jonathan Boss; Peisong Han; Lauren J. Beesley; Michael Kleinsasser; Stephen A. Goutman; Stuart Batterman; Eva L. Feldman; Bhramar Mukherjee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Penalized regression methods are used in many biomedical applications for variable selection and simultaneous coefficient estimation. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors. This article considers a general class of penalized objective functions which, by construction, force selection of the same variables across imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as “stacked” and “grouped” objective functions. Building on existing work, we (i) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for continuous and binary outcome data, (ii) incorporate adaptive shrinkage penalties, (iii) compare these methods through simulation, and (iv) develop an R package miselect. Simulations demonstrate that the “stacked” approaches are more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Biorepository aiming to identify the association between environmental pollutants and ALS risk. Supplementary materials for this article are available online.

  8. NYC STEW-MAP Staten Island organizations' website hyperlink webscrape

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 21, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). NYC STEW-MAP Staten Island organizations' website hyperlink webscrape [Dataset]. https://catalog.data.gov/dataset/nyc-stew-map-staten-island-organizations-website-hyperlink-webscrape
    Explore at:
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    Staten Island, New York
    Description

    The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis. For dataset 1: Organizations were selected from from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites and that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020). For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See "READ ME" file for further details. References: R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3. Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1. USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/. This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
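
    The study itself used the R snaWeb package; purely as an illustration of the general idea (scraping hyperlinks from a seed page into a node1/node2 edge array), a minimal Python sketch with a hypothetical seed URL and a search depth of one might look like this:

        import requests
        from bs4 import BeautifulSoup
        from urllib.parse import urljoin

        def scrape_edges(seed_url):
            # Return (source, target) hyperlink edges found on a single page (depth 1 only).
            edges = []
            try:
                html = requests.get(seed_url, timeout=10).text
            except requests.RequestException:
                return edges
            soup = BeautifulSoup(html, "html.parser")
            for a in soup.find_all("a", href=True):
                target = urljoin(seed_url, a["href"])
                if target.startswith("http"):
                    edges.append((seed_url, target))
            return edges

        # Hypothetical seed site, not one of the STEW-MAP organizations.
        for src, dst in scrape_edges("https://example.org")[:10]:
            print(src, "->", dst)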

  9. INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Nafiz Sadman (2024). INTRODUCTION OF COVID-NEWS-US-NNK AND COVID-NEWS-BD-NNK DATASET [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4047647
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Nafiz Sadman
    Kishor Datta Gupta
    Nishat Anjum
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States, Bangladesh
    Description

    Introduction

    There are several works based on Natural Language Processing of newspaper reports. Rameshbhai et al. [ 1 ] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al., in their paper [ 2 ], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types; the purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. in [ 3 ] implemented LDA, a topic modeling approach, to study bias present in online news media.

    However, there is not much NLP research invested in studying COVID-19. Most applications include classification of chest X-rays and CT scans to detect the presence of pneumonia in the lungs [ 4 ], a consequence of the virus. Other research areas include studying the genome sequence of the virus [ 5 ][ 6 ][ 7 ] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [ 8 ] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [ 9 ]. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.

    2 Data-set Introduction

    2.1 Data Collection

    We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We have named this collection “Covid-News-USA-NNK”. We also accumulated 50 online newspaper reports from Bangladesh on the issue and named it “Covid-News-BD-NNK”. The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are among the top providers and most read in the respective countries. The collection was done manually by 10 human data-collectors of age group 23- with university degrees. This approach was preferable to automation to ensure the news was highly relevant to the subject: the newspaper sites had dynamic content with advertisements in no particular order, so automated scrapers had a high chance of collecting inaccurate news reports. One of the challenges while collecting the data was the requirement of a subscription; each newspaper required $1 per subscription. Some criteria for collecting the news reports, provided as guidelines to the human data-collectors, were as follows:

    The headline must have one or more words directly or indirectly related to COVID-19.

    The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.

    The genre of the news can be anything as long as it is relevant to the topic. Political, social, economical genres are to be more prioritized.

    Avoid taking duplicate reports.

    Maintain a time frame for the above mentioned newspapers.

    To collect these data we used a Google Form for the USA and BD. Two human editors went through each entry to check for any spam or troll entries.

    2.2 Data Pre-processing and Statistics

    Some pre-processing steps performed on the newspaper report dataset are as follows:

    Remove hyperlinks.

    Remove non-English alphanumeric characters.

    Remove stop words.

    Lemmatize text.

    While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since changing sentence structures could result in the loss of valuable information. Although this was done with the help of a script, we also assigned the same human collectors to cross-check for the presence of the above-mentioned criteria. A rough sketch of these steps appears below.
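
    A minimal sketch of the four pre-processing steps listed above (using NLTK; it assumes the punkt, stopwords and wordnet resources have been downloaded, and it is not the authors' exact script):

        import re
        from nltk.corpus import stopwords
        from nltk.stem import WordNetLemmatizer
        from nltk.tokenize import word_tokenize

        STOP = set(stopwords.words("english"))
        LEMMATIZER = WordNetLemmatizer()

        def preprocess(text):
            text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove hyperlinks
            text = re.sub(r"[^A-Za-z0-9\s]", " ", text)           # keep English alphanumeric characters
            tokens = [t.lower() for t in word_tokenize(text)]
            tokens = [t for t in tokens if t not in STOP]         # remove stop words
            return [LEMMATIZER.lemmatize(t) for t in tokens]      # lemmatize

        print(preprocess("Officials said 120 new COVID-19 cases were reported: https://example.com"))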

    The primary data statistics of the two datasets are shown in Tables 1 and 2.

    Table 1: Covid-News-USA-NNK data statistics
    • No. of words per headline: 7 to 20
    • No. of words per body content: 150 to 2100

    Table 2: Covid-News-BD-NNK data statistics
    • No. of words per headline: 10 to 20
    • No. of words per body content: 100 to 1500

    2.3 Dataset Repository

    We used GitHub as our primary data repository under the account name NKK^1. Here, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.

    3 Literature Review

    Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.

    Some well-known applications of NLP include fraud detection on online media sites[ 10 ], using authorship attribution in fallback authentication systems[ 11 ], intelligent conversational agents or chatbots[ 12 ] and machine translations used by Google Translate[ 13 ]. While these are all downstream tasks, several exciting developments have been made in algorithms solely for Natural Language Processing tasks. The two most trending ones are BERT[ 14 ], which uses a bidirectional Transformer encoder and can do near-perfect classification tasks and masked-word predictions, and the GPT-3 models released by OpenAI[ 15 ] that can generate almost human-like texts. However, these are all pre-trained models since they carry a huge computation cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc.[ 16 ]. Information extraction in texts could be identifying named entities and locations or essential data. Topic modeling is a sub-task of NLP and also a process of information extraction. It clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic modeling method is Latent Dirichlet Allocation or LDA[ 17 ].

    Keyword extraction is a process of information extraction and a sub-task of NLP used to extract essential words and phrases from a text. TextRank [ 18 ] is an efficient keyword extraction technique that uses graphs to calculate the weight of each word and picks the words with the highest weights.

    Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.

    4 Our experiments and Result analysis

    We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2 and 3, we can note the following:

    In February, both newspapers talked about China and the source of the outbreak.

    StarTribune emphasized Minnesota as the most concerned state; in April, this concern appeared to grow.

    Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, and markets.

    Washington Post discussed global issues more than StarTribune.

    StarTribune in February mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.

    While both newspapers mentioned the outbreak in China in February, the spread in the United States is more highlighted throughout March to May, displaying the critical impact caused by the virus.

    We used a script to extract all numbers related to certain keywords like ’Deaths’, ’Infected’, ’Died’, ’Infections’, ’Quarantined’, ’Lock-down’, ’Diagnosed’, etc. from the news reports and created a count of cases for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, as the count gradually rose from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April. This is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack. We used VADER sentiment analysis to extract the sentiment of the headlines and the bodies. On average, the sentiments ranged from -0.5 to -0.9. The VADER sentiment scale ranges from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and the body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can help us sort the most concerning (most negative) news from the positive ones, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide us with information about how a state or country is reacting to the pandemic. We used the PageRank algorithm to extract keywords from the headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: ’China’, ’Government’, ’Masks’, ’Economy’, ’Crisis’, ’Theft’, ’Stock market’, ’Jobs’, ’Election’, ’Missteps’, ’Health’, ’Response’. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
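
    As a hedged illustration of the VADER scoring described above (using the vaderSentiment package rather than the authors' exact script; the example headline is invented):

        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

        analyzer = SentimentIntensityAnalyzer()
        headline = "Officials warn of rising infections as hospitals reach capacity"  # invented example
        scores = analyzer.polarity_scores(headline)

        # 'compound' is the normalized score in [-1, 1]; the description above reports
        # average values between -0.5 and -0.9 for the collected news items.
        print(scores["compound"])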

  10. Fostering cultures of open qualitative research: Dataset 2 – Interview...

    • orda.shef.ac.uk
    xlsx
    Updated Jun 28, 2023
    Cite
    Matthew Hanchard; Itzel San Roman Pineda (2023). Fostering cultures of open qualitative research: Dataset 2 – Interview Transcripts [Dataset]. http://doi.org/10.15131/shef.data.23567223.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 28, 2023
    Dataset provided by
    The University of Sheffield
    Authors
    Matthew Hanchard; Itzel San Roman Pineda
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset was created and deposited onto the University of Sheffield Online Research Data repository (ORDA) on 23-Jun-2023 by Dr. Matthew S. Hanchard, Research Associate at the University of Sheffield iHuman Institute. The dataset forms part of three outputs from a project titled ‘Fostering cultures of open qualitative research’ which ran from January 2023 to June 2023:

    · Fostering cultures of open qualitative research: Dataset 1 – Survey Responses
    · Fostering cultures of open qualitative research: Dataset 2 – Interview Transcripts
    · Fostering cultures of open qualitative research: Dataset 3 – Coding Book

    The project was funded with £13,913.85 of Research England monies held internally by the University of Sheffield - as part of their ‘Enhancing Research Cultures’ scheme 2022-2023.

    The dataset aligns with ethical approval granted by the University of Sheffield School of Sociological Studies Research Ethics Committee (ref: 051118) on 23-Jan-2021. This includes due concern for participant anonymity and data management.

    ORDA has full permission to store this dataset and to make it open access for public re-use on the basis that no commercial gain will be made from reuse. It has been deposited under a CC-BY-NC license. Overall, this dataset comprises:

    · 15 x Interview transcripts - in .docx file format which can be opened with Microsoft Word, Google Doc, or an open-source equivalent.

    All participants have read and approved their transcripts and have had an opportunity to retract details should they wish to do so.

    Participants chose whether to be pseudonymised or named directly. The pseudonym can be used to identify individual participant responses in the qualitative coding held within the ‘Fostering cultures of open qualitative research: Dataset 3 – Coding Book’ files.

    For recruitment, 14 x participants were selected based on their responses to the project survey, whilst one participant was recruited based on specific expertise.

    · 1 x Participant sheet – in .csv format which may be opened with Microsoft Excel, Google Sheets, or an open-source equivalent.

    This provides socio-demographic detail on each participant alongside their main field of research and career stage. It includes a RespondentID field/column which can be used to connect interview participants with their responses to the survey questions in the accompanying ‘Fostering cultures of open qualitative research: Dataset 1 – Survey Responses’ files.

    The project was undertaken by two staff:

    Co-investigator: Dr. Itzel San Roman Pineda
    ORCiD ID: 0000-0002-3785-8057
    i.sanromanpineda@sheffield.ac.uk
    Postdoctoral Research Assistant
    Labelled as ‘Researcher 1’ throughout the dataset

    Principal Investigator (corresponding dataset author): Dr. Matthew Hanchard
    ORCiD ID: 0000-0003-2460-8638
    m.s.hanchard@sheffield.ac.uk
    Research Associate, iHuman Institute, Social Research Institutes, Faculty of Social Science
    Labelled as ‘Researcher 2’ throughout the dataset

  11. QASPER: NLP Questions and Evidence

    • opendatabay.com
    Updated Jun 22, 2025
    Cite
    Datasimple (2025). QASPER: NLP Questions and Evidence [Dataset]. https://www.opendatabay.com/data/ai-ml/c030902d-7b02-48a2-b32f-8f7140dd1de7
    Explore at:
    Available download formats: not specified
    Dataset updated
    Jun 22, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    QASPER: NLP Questions and Evidence. Discovering Answers with Expertise. By Huggingface Hub [source]

    About this dataset

    QASPER is an incredible collection of over 5,000 questions and answers on a vast range of Natural Language Processing (NLP) papers, all crowdsourced from experienced NLP practitioners. Each question in the dataset is written based only on the title and abstract of the corresponding paper, providing an insight into how the experts understood and parsed various materials. The answers to each query have been expertly enriched by evidence taken directly from the full text of each paper. Moreover, QASPER comes with carefully crafted fields that contain relevant information, including 'qas' (questions and answers), 'evidence' (evidence provided for answering questions), title, abstract, figures_and_tables, and full_text. All this adds up to a remarkable dataset for researchers looking to gain insights into how practitioners interpret NLP topics, while providing effective validation for finding clear-cut solutions to problems encountered in the existing literature.


    How to use the dataset

    This guide provides instructions on how to use the QASPER dataset of Natural Language Processing (NLP) questions and evidence. The QASPER dataset contains 5,049 questions over 1,585 papers, crowdsourced from NLP practitioners. To get the most out of this dataset, we will show you how to access the questions and evidence, as well as provide tips for getting started.

    Step 1: Accessing the Dataset

    To access the data you can download it from Kaggle's website or through a code version control system like GitHub. Once downloaded, you will find five files: two test sets (test.csv and validation.csv) and two train sets (train-v2-0_lessons_only_.csv and trainv2-0_unsplit.csv) in .csv format, plus one figure data set (figures_and_tables_.json) in .json format. Each .csv file contains columns representing titles, abstracts, full texts, and Q&A fields with evidence for each paper mentioned in its rows.

    Step 2: Analyzing Your Data Sets

    Now is a good time to explore your datasets using basic descriptive statistics or more advanced predictive analytics, such as logistic regression or naive Bayes models, depending on the kind of analysis you would like to undertake. You can start simply by summarizing basic crosstabs between any two variables in your dataset (titles, abstracts, etc.). As an example, try correlating title lengths with the number of words in the corresponding abstracts, then check whether anything is worth investigating further, as in the sketch below.
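    As a rough illustration of Steps 1 and 2, the following Python sketch loads one of the train files named above and correlates title length with abstract word count; the column names "title" and "abstract" are assumptions and may differ in the actual files.

    import pandas as pd

    # Load one of the train splits listed in Step 1 (filename taken from the text above).
    df = pd.read_csv("train-v2-0_lessons_only_.csv")

    # Derive simple features: title length in characters, abstract length in words.
    # NOTE: the column names "title" and "abstract" are assumptions.
    df["title_len"] = df["title"].str.len()
    df["abstract_words"] = df["abstract"].str.split().str.len()

    # Correlate the two, as suggested in Step 2.
    print(df[["title_len", "abstract_words"]].corr())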

    Step 3: Define Your Research Questions and Perform Further Analysis

    Once satisfied with your initial exploration, it is time to dig deeper into the relationships among the variables that make up your main documents. One approach is text mining, such as topic modeling or other machine learning techniques, or automated processes that help summarize underlying patterns. Another approach is to filter terms relevant to a specific research hypothesis and then process those terms via web crawlers, search engines, document similarity algorithms, and so on.

    Finally, once all relevant parameters have been defined and analyzed, it makes sense to draw preliminary conclusions linking them back together before conducting replicable tests to ensure reproducible results.

    Research Ideas

    • Developing AI models to automatically generate questions and answers from paper titles and abstracts.
    • Enhancing machine learning algorithms by combining the answers with the evidence provided in the dataset to find relationships between papers.
    • Creating online forums for NLP practitioners that use questions from this dataset to spark discussion within the community.

    License

    CC0

    Original Data Source: QASPER: NLP Questions and Evidence

  12. Customer Shopping Trends Dataset

    • kaggle.com
    Updated Oct 5, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Sourav Banerjee
    Description

    Context

    The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.

    Content

    This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

    Dataset Glossary (Column-wise)

    • Customer ID - Unique identifier for each customer
    • Age - Age of the customer
    • Gender - Gender of the customer (Male/Female)
    • Item Purchased - The item purchased by the customer
    • Category - Category of the item purchased
    • Purchase Amount (USD) - The amount of the purchase in USD
    • Location - Location where the purchase was made
    • Size - Size of the purchased item
    • Color - Color of the purchased item
    • Season - Season during which the purchase was made
    • Review Rating - Rating given by the customer for the purchased item
    • Subscription Status - Indicates if the customer has a subscription (Yes/No)
    • Shipping Type - Type of shipping chosen by the customer
    • Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)
    • Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)
    • Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction
    • Payment Method - Customer's most preferred payment method
    • Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)
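    A minimal Python sketch using the columns listed above; the CSV filename ("shopping_trends.csv") is an assumption and should be replaced with the actual download name.

    import pandas as pd

    # Filename is hypothetical; adjust to the downloaded file.
    df = pd.read_csv("shopping_trends.csv")

    # Average purchase amount and review rating per product category.
    summary = (
        df.groupby("Category")[["Purchase Amount (USD)", "Review Rating"]]
          .mean()
          .sort_values("Purchase Amount (USD)", ascending=False)
    )
    print(summary)

    # Share of purchases using a promo code, split by subscription status.
    promo_rate = (
        df.assign(promo=df["Promo Code Used"].eq("Yes"))
          .groupby("Subscription Status")["promo"]
          .mean()
    )
    print(promo_rate)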

    Structure of the Dataset

    Dataset structure diagram: https://i.imgur.com/6UEqejq.png

    Acknowledgement

    This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

    Cover Photo by: Freepik

    Thumbnail by: Clothing icons created by Flat Icons - Flaticon

  13. ERA5 hourly data on single levels from 1940 to present

    • cds.climate.copernicus.eu
    • arcticdata.io
    grib
    Updated Jun 30, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ECMWF (2025). ERA5 hourly data on single levels from 1940 to present [Dataset]. http://doi.org/10.24381/cds.adbb2d47
    Explore at:
    gribAvailable download formats
    Dataset updated
    Jun 30, 2025
    Dataset provided by
    European Centre for Medium-Range Weather Forecastshttp://ecmwf.int/
    Authors
    ECMWF
    License

    https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdf

    Time period covered
    Jan 1, 1940 - Jun 24, 2025
    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis.

    Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product.

    ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

    ERA5 is updated daily with a latency of about 5 days. If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later; users are notified if this occurs.

    The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main sub-sets: hourly and monthly products, both on pressure levels (upper air fields) and single levels (atmospheric, ocean-wave and land surface quantities). The present entry is "ERA5 hourly data on single levels from 1940 to present".
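    For programmatic access, a minimal sketch using the cdsapi Python package is shown below. It assumes a registered CDS account with an API key configured in ~/.cdsapirc; the dataset name and request keys follow common usage for this entry but should be verified against the current CDS documentation.

    import cdsapi

    # Requires a free CDS account and an API key in ~/.cdsapirc.
    client = cdsapi.Client()

    # Dataset name and request keys follow common usage; verify against
    # the CDS documentation before running.
    client.retrieve(
        "reanalysis-era5-single-levels",
        {
            "product_type": "reanalysis",
            "variable": ["2m_temperature"],
            "year": "2024",
            "month": "01",
            "day": "01",
            "time": ["00:00", "12:00"],
            "format": "grib",
        },
        "era5_2m_temperature_20240101.grib",
    )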

  14. Data from: Dataset of the study: "Chatbots put to the test in math and logic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jiménez Rios, Alejandro (2024). Dataset of the study: "Chatbots put to the test in math and logic problems: A preliminary comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7940781
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Papazafeiropoulos, George
    Jiménez Rios, Alejandro
    Plevris, Vagelis
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the 30 questions that were posed to the chatbots (i) ChatGPT-3.5; (ii) ChatGPT-4; and (iii) Google Bard, in May 2023 for the study “Chatbots put to the test in math and logic problems: A preliminary comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard”. These 30 questions describe mathematics and logic problems that have a unique correct answer. The questions are fully described with plain text only, without the need for any images or special formatting. The questions are divided into two sets of 15 questions each (Set A and Set B). The questions of Set A are 15 “Original” problems that cannot be found online, at least in their exact wording, while Set B contains 15 “Published” problems that one can find online by searching on the internet, usually with their solution. Each question is posed three times to each chatbot. This dataset contains the following: (i) The full set of the 30 questions, A01-A15 and B01-B15; (ii) the correct answer for each one of them; (iii) an explanation of the solution, for the problems where such an explanation is needed, (iv) the 30 (questions) × 3 (chatbots) × 3 (answers) = 270 detailed answers of the chatbots. For the published problems of Set B, we also provide a reference to the source where each problem was taken from.

  15. Data from: KEYWORD SEARCH IN TEXT CUBE: FINDING TOP-K RELEVANT CELLS

    • catalog.data.gov
    • datasets.ai
    • +2more
    Updated Apr 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dashlink (2025). KEYWORD SEARCH IN TEXT CUBE: FINDING TOP-K RELEVANT CELLS [Dataset]. https://catalog.data.gov/dataset/keyword-search-in-text-cube-finding-top-k-relevant-cells
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    KEYWORD SEARCH IN TEXT CUBE: FINDING TOP-K RELEVANT CELLS

    Bolin Ding, Yintao Yu, Bo Zhao, Cindy Xide Lin, Jiawei Han, and ChengXiang Zhai

    Abstract. We study the problem of keyword search in a data cube with text-rich dimension(s) (a so-called text cube). The text cube is built on a multidimensional text database, where each row is associated with some text data (e.g., a document) and other structural dimensions (attributes). A cell in the text cube aggregates a set of documents with matching attribute values in a subset of dimensions. A cell document is the concatenation of all documents in a cell. Given a keyword query, our goal is to find the top-k most relevant cells (ranked according to the relevance scores of cell documents w.r.t. the given query) in the text cube. We define a keyword-based query language and apply an IR-style relevance model for scoring and ranking cell documents in the text cube. We propose two efficient approaches to find the top-k answers. The proposed approaches support a general class of IR-style relevance scoring formulas that satisfy certain basic and common properties. One of them uses more time for pre-processing and less time for answering online queries; the other is more efficient in pre-processing but consumes more time for online queries. Experimental studies on the ASRS dataset are conducted to verify the efficiency and effectiveness of the proposed approaches.
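    The following Python sketch is only a naive illustration of the text-cube idea described in the abstract, not the paper's optimized algorithms: every non-empty subset of dimensions defines cells, a cell document concatenates the matching rows' text, and cells are ranked by a simple term-frequency score. The toy rows are invented for illustration.

    from collections import Counter
    from itertools import combinations
    import heapq

    # Toy multidimensional text database (rows invented for illustration).
    rows = [
        {"airline": "A", "year": "2001", "text": "engine failure during climb"},
        {"airline": "A", "year": "2002", "text": "smoke in cabin engine shut down"},
        {"airline": "B", "year": "2001", "text": "runway incursion on taxi"},
    ]
    dimensions = ["airline", "year"]
    query = ["engine"]

    def score(doc_tokens, query_terms):
        # Simple term-frequency relevance; the paper supports a general
        # class of IR-style scoring formulas.
        tf = Counter(doc_tokens)
        return sum(tf[t] for t in query_terms)

    # Build cells: every non-empty subset of dimensions, keyed by its values.
    cells = {}
    for r in rows:
        for k in range(1, len(dimensions) + 1):
            for dims in combinations(dimensions, k):
                key = tuple((d, r[d]) for d in dims)
                cells.setdefault(key, []).append(r["text"])

    # Rank cells by the score of their concatenated cell document; keep top-k.
    top_k = heapq.nlargest(
        3,
        ((score(" ".join(docs).split(), query), key) for key, docs in cells.items()),
    )
    for s, key in top_k:
        print(s, dict(key))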

  16. PDMX

    • cseweb.ucsd.edu
    json
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    UCSD CSE Research Project, PDMX [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Explore at:
    jsonAvailable download formats
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.

  17. Data from: TDMentions: A Dataset of Technical Debt Mentions in Online Posts

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Morgan Ericsson (2020). TDMentions: A Dataset of Technical Debt Mentions in Online Posts [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2593141
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Anna Wingkvist
    Morgan Ericsson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TDMentions: A Dataset of Technical Debt Mentions in Online Posts (version 1.0)

    TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.

    Data collection and processing

    The dataset is mainly collected from existing datasets. We used data from:

    The data set currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01.

    We use the regular expression tech(nical)?[\s\-_]*?debt to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag technical-debt.
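    As a quick check of the pattern, here is a small Python sketch; case-insensitive matching is an assumption, since the text above does not specify it.

    import re

    # The regular expression quoted above for finding technical-debt mentions.
    TD_PATTERN = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)

    samples = [
        "We need to pay down our technical debt before the next release.",
        "Tech-debt keeps piling up in this module.",
        "TD is getting out of hand.",  # short form deliberately not matched
    ]
    for text in samples:
        print(repr(text), "->", bool(TD_PATTERN.search(text)))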

    Data Format

    The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.

    • id: the id used in the original source. We use the URL path to identify Medium posts.
    • body: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
    • created_utc: the time the item was posted in seconds since epoch in UTC.
    • author: the author of the item. We use the username or userid from the source.
    • source: where the item was posted. Valid sources are:
      • HackerNews Comment
      • HackerNews Job
      • HackerNews Submission
      • Reddit Comment
      • Reddit Submission
      • StackExchange Answer
      • StackExchange Comment
      • StackExchange Question
      • Medium Post
    • meta: Additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., score and num_comments for keys that have the same meaning/information across multiple sources.

    This is a sample item from Reddit:

    {
     "id": "ab8auf",
     "body": "Technical Debt Explained (x-post r/Eve)",
     "created_utc": 1546271789,
     "author": "totally_100_human",
     "source": "Reddit Submission",
     "meta": {
      "title": "Technical Debt Explained (x-post r/Eve)",
      "score": 1,
      "num_comments": 0,
      "url": "http://jestertrek.com/eve/technical-debt-2.png",
      "subreddit": "RCBRedditBot"
     }
    }
    

    Sample Analyses

    We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use jq to process the JSON.

    How many items are there for each source?

    lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
    

    How many submissions that mentioned technical debt were posted each month?

    lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | gmtime | strftime("%Y-%m")' | sort | uniq -c
    

    What are the titles of items that link (meta.url) to PDF documents?

    lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
    

    Please, I want CSV!

    lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
    

    Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.
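    As an alternative to jq, the compressed JSON-lines file can also be loaded directly with pandas; this sketch reproduces the first two jq examples above.

    import pandas as pd

    # pandas reads bzip2-compressed JSON-lines files directly.
    df = pd.read_json("postscomments.json.bz2", lines=True, compression="bz2")

    # Items per source (equivalent to the first jq example).
    print(df["source"].value_counts())

    # Reddit submissions mentioning technical debt, counted per month
    # (equivalent to the second jq example).
    reddit = df[df["source"] == "Reddit Submission"]
    per_month = (
        pd.to_datetime(reddit["created_utc"], unit="s")
          .dt.to_period("M")
          .value_counts()
          .sort_index()
    )
    print(per_month)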

    Please see https://github.com/sse-lnu/tdmentions for more analyses

    Limitations and Future updates

    The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.

  18. Online News Popularity Data Set

    • academictorrents.com
    bittorrent
    Updated Feb 11, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kelwin Fernandes and Pedro Vinagre and Paulo Cortez and Pedro Sernadela (2016). Online News Popularity Data Set [Dataset]. https://academictorrents.com/details/95d3b03397a0bafd74a662fe13ba3550c13b7ce1
    Explore at:
    bittorrent(7476401)Available download formats
    Dataset updated
    Feb 11, 2016
    Dataset authored and provided by
    Kelwin Fernandes and Pedro Vinagre and Paulo Cortez and Pedro Sernadela
    License

    https://academictorrents.com/nolicensespecified

    Description

    Data Set Information:

    • The articles were published by Mashable (www.mashable.com), and the rights to reproduce their content belong to them. Hence, this dataset does not share the original content but some statistics associated with it. The original content can be publicly accessed and retrieved using the provided URLs.
    • Acquisition date: January 8, 2015
    • The estimated relative performance values were estimated by the authors using a Random Forest classifier and a rolling window as the assessment method. See their article for more details on how the relative performance values were set.

    Attribute Information:

    Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)

    0. url: URL of the article (non-predictive)
    1. timedelta: Days between the article publication and the dataset acquisition (non-predictive)
    2. n_tokens_title: Number of words in the title
    3. n_tokens_content: Number of words in the content
    4. n_unique_tokens: Rate of unique words in the content

  19. O*NET Database

    • onetcenter.org
    • kaggle.com
    excel, mysql, oracle +2
    Updated May 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    National Center for O*NET Development (2025). O*NET Database [Dataset]. https://www.onetcenter.org/database.html
    Explore at:
    oracle, sql server, text, mysql, excelAvailable download formats
    Dataset updated
    May 20, 2025
    Dataset provided by
    Occupational Information Network
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Dataset funded by
    US Department of Labor, Employment and Training Administration
    Description

    The O*NET Database contains hundreds of standardized and occupation-specific descriptors on almost 1,000 occupations covering the entire U.S. economy. The database, which is available to the public at no cost, is continually updated by a multi-method data collection program. Sources of data include: job incumbents, occupational experts, occupational analysts, employer job postings, and customer/professional association input.

    Data content areas include:

    • Worker Characteristics (e.g., Abilities, Interests, Work Styles)
    • Worker Requirements (e.g., Education, Knowledge, Skills)
    • Experience Requirements (e.g., On-the-Job Training, Work Experience)
    • Occupational Requirements (e.g., Detailed Work Activities, Work Context)
    • Occupation-Specific Information (e.g., Job Titles, Tasks, Technology Skills)

  20. gsm8k

    • huggingface.co
    Updated Aug 11, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenAI (2022). gsm8k [Dataset]. https://huggingface.co/datasets/openai/gsm8k
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    OpenAIhttps://openai.com/
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for GSM8K

    Dataset Summary

    GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.

    These problems take between 2 and 8 steps to solve. Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − ×÷) to reach the… See the full description on the dataset page: https://huggingface.co/datasets/openai/gsm8k.
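    A minimal loading sketch with the Hugging Face datasets library; the "main" configuration and the "question"/"answer" fields follow the dataset card, but check the page linked above if they have changed.

    from datasets import load_dataset

    # GSM8K also ships a "socratic" configuration; "main" is the standard one.
    ds = load_dataset("openai/gsm8k", "main")

    example = ds["train"][0]
    print(example["question"])
    print(example["answer"])  # step-by-step solution ending in "#### <final answer>"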

Online Retail Transaction Data


How to use the dataset

1. Sales Analysis:

Sales data forms the backbone of this dataset, and it allows users to delve into various aspects of sales performance. You can use the Quantity and UnitPrice fields to calculate metrics like revenue, and further combine it with InvoiceNo information to understand sales over individual transactions.
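A minimal pandas sketch of this revenue calculation; the CSV filename ("online_retail.csv") is an assumption, while the column names follow the attribute list above.

import pandas as pd

# Filename is hypothetical; column names follow the dataset description.
df = pd.read_csv("online_retail.csv", parse_dates=["InvoiceDate"])

# Revenue per line item, aggregated per invoice.
df["Revenue"] = df["Quantity"] * df["UnitPrice"]
revenue_per_invoice = df.groupby("InvoiceNo")["Revenue"].sum()
print(revenue_per_invoice.describe())

# Monthly revenue trend from the invoice timestamps.
monthly = df.groupby(df["InvoiceDate"].dt.to_period("M"))["Revenue"].sum()
print(monthly)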

2. Product Analysis:

Each product in this dataset comes with its unique identifier (StockCode) and its name (Description). You could analyse which products are most popular based on Quantity sold or look at popularity per transaction by considering both Quantity and InvoiceNo.

3. Customer Segmentation:

If you apply business logic to the transactions (such as calculating total amounts per invoice), you can use standard machine learning methods, or RFM (Recency, Frequency, Monetary) segmentation combined with 'CustomerID', to understand customer behavior better, as in the sketch below. Aggregating invoice numbers (each of which represents a separate transaction) per customer also gives useful insight into purchase frequency.
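A hedged sketch of the RFM idea mentioned above, using the same assumed filename; cancelled invoices (those starting with 'c') are not filtered out here for brevity.

import pandas as pd

# Filename is hypothetical; CustomerID, InvoiceNo, InvoiceDate, Quantity and
# UnitPrice follow the dataset description.
df = pd.read_csv("online_retail.csv", parse_dates=["InvoiceDate"])
df = df.dropna(subset=["CustomerID"])
df["Revenue"] = df["Quantity"] * df["UnitPrice"]

# Recency: days since last purchase; Frequency: distinct invoices;
# Monetary: total spend.
snapshot = df["InvoiceDate"].max() + pd.Timedelta(days=1)
rfm = df.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("Revenue", "sum"),
)
print(rfm.head())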

4. Geographical Analysis:

The Country column enables analysts to study purchase patterns across different geographical locations.

Practical applications

• Understand what products sell best where - it can help drive tailored marketing strategies.
• Anomaly detection – identify unusual behaviors that might lead frau...
