10 datasets found
  1. Lisbon, Portugal, hotel’s customer dataset with three years of personal,...

    • data.mendeley.com
    Updated Nov 18, 2020
    Cite
    Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
    Explore at:
    Dataset updated
    Nov 18, 2020
    Authors
    Nuno Antonio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Portugal, Lisbon
    Description

    Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It covers three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset helps address the shortage of real-world business data available for educational and research purposes. It can be used in data mining, machine learning, and other analytical problems within the scope of data science. Because its unit of analysis is the customer, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
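
As a rough illustration of the RFM idea mentioned in the description, the sketch below computes recency, frequency, and monetary value per customer from a list of bookings. The record layout (customer id, booking date, amount), the sample rows, and the reference date are assumptions made for illustration; they are not the dataset's actual 31 variables.

```python
from collections import defaultdict
from datetime import date

# Hypothetical bookings: (customer_id, booking_date, amount_spent).
# These rows are illustrative only and are not taken from the dataset.
bookings = [
    ("c1", date(2018, 1, 10), 120.0),
    ("c1", date(2018, 6, 2), 80.0),
    ("c2", date(2017, 3, 15), 300.0),
]

def rfm(rows, reference_date):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_seen = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, booked_on, amount in rows:
        # Recency is measured from the most recent booking per customer.
        if customer_id not in last_seen or booked_on > last_seen[customer_id]:
            last_seen[customer_id] = booked_on
        frequency[customer_id] += 1
        monetary[customer_id] += amount
    return {
        cid: ((reference_date - last_seen[cid]).days, frequency[cid], monetary[cid])
        for cid in last_seen
    }

scores = rfm(bookings, reference_date=date(2018, 12, 31))
```

The resulting triples could then be binned into scores or passed to a clustering method for segmentation; which of the dataset's columns map onto each component would need to be checked against its documentation.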

  2. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before any work is done on the data, it has to be pre-processed, and this normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance', and in high dimensions Euclidean distance loses pretty much all meaning. Therefore "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information. From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we cluster on is well suited to it, clustering might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not fix the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. In short, our results were not much better than random when clustering was applied in data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
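
The remark about Euclidean distance in high dimensions can be illustrated with a small experiment. The sketch below is not the authors' analysis: it draws uniformly random points (an assumption, standing in for real features) and measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name and parameters are hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of Euclidean distances from one random query point
    to n_points random points drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

low_dim = relative_contrast(dim=2)     # nearest and farthest differ a lot
high_dim = relative_contrast(dim=500)  # distances bunch together

print(f"relative contrast, 2 dims:   {low_dim:.2f}")
print(f"relative contrast, 500 dims: {high_dim:.2f}")
```

When the contrast is small, "nearest cluster centre" carries little information, which is one reason distance-based clustering can behave unstably on wide feature sets.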

  3. Data Mining for IVHM using Sparse Binary Ensembles, Phase I

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Data Mining for IVHM using Sparse Binary Ensembles, Phase I [Dataset]. https://data.nasa.gov/dataset/Data-Mining-for-IVHM-using-Sparse-Binary-Ensembles/qfus-evzq
    Explore at:
    Available download formats: xml, tsv, csv, application/rssxml, application/rdfxml, json
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    In response to NASA SBIR topic A1.05, "Data Mining for Integrated Vehicle Health Management", Michigan Aerospace Corporation (MAC) asserts that our unique SPADE (Sparse Processing Applied to Data Exploitation) technology meets a significant fraction of the stated criteria and has functionality that enables it to handle many applications within the aircraft lifecycle. SPADE distills input data into highly quantized features and uses MAC's novel techniques for constructing Ensembles of Decision Trees to develop extremely accurate diagnostic/prognostic models for classification, regression, clustering, anomaly detection and semi-supervised learning tasks. These techniques are currently being employed to do Threat Assessment for satellites in conjunction with researchers at the Air Force Research Lab. Significant advantages to this approach include: 1) completely data driven; 2) training and evaluation are faster than conventional methods; 3) operates effectively on huge datasets (> billion samples X > million features), 4) proven to be as accurate as state-of-the-art techniques in many significant real-world applications. The specific goals for Phase 1 will be to work with domain experts at NASA and with our partners Boeing, SpaceX and GMV Space Systems to delineate a subset of problems that are particularly well-suited to this approach and to determine requirements for deploying algorithms on platforms of opportunity.

  4. Dataset for AIJProcess mining-based goal recognition: a running example in...

    • figshare.unimelb.edu.au
    application/bzip2
    Updated Aug 11, 2023
    Cite
    Zihang Su (2023). Dataset for AIJProcess mining-based goal recognition: a running example in an 11 by 11 gridDataset for AIJ [Dataset]. http://doi.org/10.26188/21749570.v2
    Explore at:
    Available download formats: application/bzip2
    Dataset updated
    Aug 11, 2023
    Dataset provided by
    The University of Melbourne
    Authors
    Zihang Su
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset for the paper "Fast and Accurate Data-Driven Goal Recognition Using Process Mining Techniques." It includes a running example, an evaluation dataset for synthetic domains, and real-world business logs.

  5. Table_1_Interactive process mining of cancer treatment sequences with...

    • frontiersin.figshare.com
    docx
    Updated Jun 4, 2023
    + more versions
    Cite
    Alexandre Wicky; Roberto Gatta; Sofiya Latifyan; Rita De Micheli; Camille Gerard; Sylvain Pradervand; Olivier Michielin; Michel A. Cuendet (2023). Table_1_Interactive process mining of cancer treatment sequences with melanoma real-world data.docx [Dataset]. http://doi.org/10.3389/fonc.2023.1043683.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    Alexandre Wicky; Roberto Gatta; Sofiya Latifyan; Rita De Micheli; Camille Gerard; Sylvain Pradervand; Olivier Michielin; Michel A. Cuendet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The growing availability of clinical real-world data (RWD) represents a formidable opportunity to complement evidence from randomized clinical trials and observe how oncological treatments perform in real-life conditions. In particular, RWD can provide insights on questions for which no clinical trials exist, such as comparing outcomes from different sequences of treatments. To this end, process mining is a particularly suitable methodology for analyzing different treatment paths and their associated outcomes. Here, we describe an implementation of process mining algorithms directly within our hospital information system with an interactive application that allows oncologists to compare sequences of treatments in terms of overall survival, progression-free survival and best overall response. As an application example, we first performed a RWD descriptive analysis of 303 patients with advanced melanoma and reproduced findings observed in two notorious clinical trials: CheckMate-067 and DREAMseq. Then, we explored the outcomes of an immune-checkpoint inhibitor rechallenge after a first progression on immunotherapy versus switching to a BRAF targeted treatment. By using interactive process-oriented RWD analysis, we observed that patients still derive long-term survival benefits from immune-checkpoint inhibitors rechallenge, which could have direct implications on treatment guidelines for patients able to carry on immune-checkpoint therapy, if confirmed by external RWD and randomized clinical trials. Overall, our results highlight how an interactive implementation of process mining can lead to clinically relevant insights from RWD with a framework that can be ported to other centers or networks of centers.

  6. Extended Stanford Natural Language Inference

    • opendatabay.com
    .other
    Updated Jun 24, 2025
    Cite
    Datasimple (2025). Extended Stanford Natural Language Inference [Dataset]. https://www.opendatabay.com/data/ai-ml/7b782e98-5caa-4987-b802-8450ce5765cd
    Explore at:
    Available download formats: .other
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    The dataset comprises several columns, including premise and hypothesis texts that are used to evaluate or infer information based on each other. Each sentence is accompanied by a label that classifies the entailment relation into one of three categories: entailment, contradiction, or neutral. Furthermore, there are three annotated explanations provided for each entailment relation to further support and clarify the relationship between the premise and hypothesis.

    The validation.csv file within this dataset contains a set of examples specifically designed for validation purposes. It includes premises, hypotheses, labels (entailment classification), and three annotated explanations per entailment relation. Similarly, the train.csv file provides training data for natural language inference tasks using SNLI annotations with premises, hypotheses' texts as well as corresponding labels and multiple annotated explanations supporting their connection.

    As part of this extended e-SNLI dataset package from Kaggle, you will also find the test.csv file, which features additional test data extracted from the SNLI database containing various sentences with their contextual background (premises), statements being evaluated (hypotheses), appropriate labels categorizing their relationships (entailments), along with three detailed justifications provided as explanatory notes supporting those relationships.

    Summing up important features offered by this comprehensive e-SNLI toolkit enriched with annotation assistance: An extensive range of premises generated from real-world textual data sources paired with well-established matching-oriented authorization encompassing alternate yet applicable hypothetical implications while adhering to the Entailment/Contradiction/Neutral labeling scheme. Taking advantage of the complete dataset, you can explore nuanced understanding and analysis of entailment relations between linguistic units, emerging from various domains and contexts enhanced by multiple explanations available per relationship.

    How to use the dataset

    The Extended Stanford Natural Language Inference (e-SNLI) Dataset with Explanations is a valuable resource for researchers and practitioners working in the field of natural language processing. This dataset builds upon the existing Stanford Natural Language Inference (SNLI) Dataset by including annotated explanations for entailment relations.

    Overview of the Dataset

    The e-SNLI dataset consists of three main files: train.csv, validation.csv, and test.csv. Each file contains a collection of examples, including premises, hypotheses, labels, and three annotated explanations for each entailment relation.

    • premise: This column represents the sentence or text that serves as the context or background information for the entailment relation.
    • hypothesis: This column contains the sentence or text that is being evaluated or inferred based on the premise.
    • label: The label column indicates whether there is an entailment relation between the premise and hypothesis. It can take one of three categories: entailment, contradiction, or neutral.
    • explanation_1, explanation_2, explanation_3: These columns provide additional annotated explanations or reasons to support the entailment relation between the premise and hypothesis.

    How to Utilize this Dataset

    When working with this dataset, there are several steps you can follow:

    • Importing Data: Load one of the provided CSV files using your preferred programming language or data analysis tool to access its content.
    • Exploring Premises and Hypotheses: Analyze both premises and hypotheses to gain an understanding of their relationship and create insights about how certain statements may lead to specific conclusions.
    • Examining Label Distribution: Observe how labels are distributed across different examples within each file. This analysis will help you understand potential biases in data collection.
    • Investigating Annotations: Read through the annotated explanations provided for each entailment relation. These explanations can offer valuable insights into the underlying reasoning behind each label. Consider using these annotations to build more comprehensive models or improve your existing ones.
    • Model Training and Evaluation: Utilize this dataset to train and evaluate models for natural language inference tasks, such as text classification or sentiment analysis. Evaluate the performance of your models based on the predefined labels.

    Potential Applications

    The e-SNLI dataset can be used in various natural language processing tasks, including but not limited to:

    • Natural Language Inference: Develop models capable of determining if a hypothesis is entailed

    Research Ideas

    • Natural Language Understanding: The e-SNLI dataset can be used to train and evaluate models for natural language understanding tasks such as textual entailment, contradiction detection, and neutral classification. With the annotated explanations provided for each entailment relation, models can learn
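
The "Importing Data" and "Examining Label Distribution" steps described above could look roughly like the following sketch. It reads from a tiny in-memory CSV so that it runs on its own; the rows are invented for illustration, and only the column names (premise, hypothesis, label, explanation_1) come from the description. The helper name label_distribution is hypothetical.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for train.csv; the rows are invented for
# illustration, only the column names come from the dataset description.
sample_csv = io.StringIO(
    "premise,hypothesis,label,explanation_1\n"
    "A man plays guitar.,A person makes music.,entailment,Playing guitar makes music.\n"
    "A man plays guitar.,The man is asleep.,contradiction,He cannot play while asleep.\n"
    "A man plays guitar.,The man is famous.,neutral,Fame is not mentioned.\n"
    "A dog runs outside.,An animal is moving.,entailment,A dog is an animal and running is moving.\n"
)

def label_distribution(handle):
    """Count how often each label occurs in a CSV with a 'label' column."""
    reader = csv.DictReader(handle)
    return Counter(row["label"] for row in reader)

counts = label_distribution(sample_csv)
for label, n in counts.most_common():
    print(label, n)
```

With the real files, the same function could be given a handle opened with open("train.csv", newline="", encoding="utf-8"), assuming the file sits in the working directory and uses the column names listed above.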

  7. Israel Geospatial Analytics Market Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Dec 14, 2024
    Cite
    Data Insights Market (2024). Israel Geospatial Analytics Market Report [Dataset]. https://www.datainsightsmarket.com/reports/israel-geospatial-analytics-market-13540
    Explore at:
    Available download formats: ppt, pdf, doc
    Dataset updated
    Dec 14, 2024
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Israel
    Variables measured
    Market Size
    Description

    The Israel geospatial analytics market is projected to grow from USD 1.69 million in 2025 to USD 2.69 million by 2033, at a CAGR of 5.93% during the forecast period. The growth of this market is attributed to increasing adoption of geospatial analytics in various end-user verticals, such as agriculture, utility and communication, defense and intelligence, government, mining and natural resources, automotive and transportation, healthcare, and real estate and construction. Geospatial analytics helps in better decision-making, improves operational efficiency, and enhances customer engagement. Key drivers of the Israel geospatial analytics market include increasing adoption of cloud-based geospatial platforms, rising demand for real-time insights, and growing investments in smart city development. However, factors such as high cost of implementation and skilled labor shortage may hinder the market growth. Major companies operating in the Israel geospatial analytics market include SAS Institute Inc., General Electric Company, Esri Inc. (Environmental Systems Research Institute), Harris Corporation, Microsoft Corporation, Autodesk Inc., Oracle Corporation, Trimble Inc., Bentley Systems Inc., and Google Inc.

    The Israel geospatial analytics market is estimated to grow from $170 million in 2023 to $320 million by 2029, at a CAGR of 9.5%. The market growth is mainly driven by the increasing adoption of geospatial technologies in various end-user verticals, such as agriculture, utility and communication, defense and intelligence, government, mining and natural resources, automotive and transportation, healthcare, and real estate and construction.

    Recent developments include: June 2023: Autodesk and Esri's partnership accelerates innovations in AEC. Autodesk's InfoWater Pro and Esri's ArcGIS Pro were integrated to make this possible, and there are many more examples of how their partnership with Esri enables BIM and GIS data to flow between respective solutions seamlessly. The result is that project stakeholders can now visualize, understand, and analyze infrastructure within its real-world context. February 2023: Mercedes-Benz and Google announced a long-term strategic partnership to accelerate auto innovation and create the industry's next-generation digital luxury car experience. With this partnership, Mercedes-Benz will be the first automaker to build its branded navigation experience based on new in-car data and navigation capabilities from the Google Maps Platform. This will give the luxury automaker access to Google's leading geospatial offering, including detailed information about places, real-time and predictive traffic information, automatic rerouting, and more. Key drivers for this market are: Increasing Demand for Location Intelligence, Advancements of Big Data Analytics. Potential restraints include: High Costs and Operational Concerns, Concerns related to Geoprivacy and Confidential Data. Notable trends are: Surface Analysis is Expected to Hold Significant Share of the Market.
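
For readers who want to check growth figures such as those quoted in the description, the standard compound annual growth rate formula is CAGR = (end / start)^(1 / years) - 1. The sketch below applies it to the two pairs of figures given above. How the report counts the number of compounding periods is not stated, so taking the difference of the two years is an assumption, and the computed rates need not match the quoted ones exactly.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Figures quoted in the description; the period length (difference of
# the two years) is an assumption about how the report counts years.
rate_2025_2033 = cagr(1.69, 2.69, 2033 - 2025)
rate_2023_2029 = cagr(170.0, 320.0, 2029 - 2023)

print(f"2025-2033: {rate_2025_2033:.2%}")
print(f"2023-2029: {rate_2023_2029:.2%}")
```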

  8. Sentiment Analysis Dataset

    • opendatabay.com
    .csv
    Updated Jun 8, 2025
    Cite
    Datasimple (2025). Sentiment Analysis Dataset [Dataset]. https://www.opendatabay.com/data/dataset/6323a1b5-7112-49bd-ad55-c1ef6968abc3
    Explore at:
    Available download formats: .csv
    Dataset updated
    Jun 8, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset is a large-scale collection of 241,000+ English-language comments sourced from various online platforms. Each comment is annotated with a sentiment label:

    • 0 = Negative
    • 1 = Neutral
    • 2 = Positive

    The data has been gathered from multiple websites, such as:

    • Hugging Face: https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
    • Kaggle: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
    • Kaggle: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

    The goal is to enable training and evaluation of multi-class sentiment analysis models for real-world text data. The dataset is already preprocessed (lowercased; cleaned of punctuation, URLs, numbers, and stopwords) and is ready for NLP pipelines.

    📊 Columns

    Column      Description
    Comment     User-generated text content
    Sentiment   Sentiment label (0=Negative, 1=Neutral, 2=Positive)

    🚀 Use Cases

    • 🧠 Train sentiment classifiers using LSTM, BiLSTM, CNN, BERT, or RoBERTa
    • 🔍 Evaluate preprocessing and tokenization strategies
    • 📈 Benchmark NLP models on multi-class classification tasks
    • 🎓 Educational projects and research in opinion mining or text classification
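
The preprocessing mentioned in the description (lowercasing and removal of punctuation, URLs, numbers, and stopwords) is not specified in detail. The sketch below shows one plausible way to apply comparable cleaning to new text before feeding it to a model trained on this data. The stopword list, the regular expressions, and the function name preprocess are assumptions, not the dataset author's pipeline.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; the list actually used
# to build the dataset is not stated in the description.
STOPWORDS = {"the", "a", "an", "of"}

def preprocess(comment):
    """Lowercase, then drop URLs, digits, punctuation, and stopwords."""
    text = comment.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"\d+", " ", text)            # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

cleaned = preprocess("The Apple Pay app is SO convenient, see https://example.com 24/7!")
print(cleaned)
```

Keeping inference-time cleaning close to the training-time cleaning generally matters more than the exact choice of rules, so the actual rules are worth confirming against the source datasets.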

    Original Data Source: Sentiment Analysis Dataset

    • 🧪 Fine-tune transformer models on a large and diverse sentiment dataset

    💬 Example

    Comment: "apple pay is so convenient secure and easy to use"
    Sentiment: 2 (Positive)

  9. BPI Challenge 2020: International Declarations

    • data.4tu.nl
    application/gzip
    Updated Jul 28, 2020
    + more versions
    Cite
    Boudewijn van Dongen (2020). BPI Challenge 2020: International Declarations [Dataset]. http://doi.org/10.4121/uuid:2bbf8f6a-fc50-48eb-aa9e-c4ea5ef7e8c5
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jul 28, 2020
    Dataset provided by
    4TU.ResearchData
    Authors
    Boudewijn van Dongen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This file contains the events related to International Declarations: 6,449 cases, 72,151 events.

    Parent item: BPI Challenge 2020

    The dataset contains events pertaining to two years of travel expense claims. In 2017, events were collected for two departments; in 2018, for the entire university. The various permits and declaration documents (domestic and international declarations, pre-paid travel costs and requests for payment) all follow a similar process flow. After submission by the employee, the request is sent for approval to the travel administration. If approved, the request is then forwarded to the budget owner and after that to the supervisor. If the budget owner and supervisor are the same person, then only one of these steps is taken. In some cases, the director also needs to approve the request. The process finishes with either the trip taking place or a payment being requested and paid.

    On a high level, we distinguish two types of trips, namely domestic and international. For domestic trips, no prior permission is needed, i.e. an employee can undertake these trips and ask for reimbursement of the costs afterwards. For international trips, permission is needed from the supervisor. This permission is obtained by filing a travel permit, and this travel permit should be approved before making any arrangements. To get the costs of a trip reimbursed, a claim is filed. This can be done as soon as costs are actually paid (for example for flights or conference registration fees), or within two months after the trip (for example hotel and food costs, which are usually paid on the spot).
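
The approval flow described above can be sketched as a small function that lists the expected steps for one request. This is only an illustration of the prose: the step labels are hypothetical and are not the activity names in the event log, and the description does not say which of the two steps is kept when budget owner and supervisor coincide, so keeping the budget owner step is an assumption.

```python
def approval_steps(budget_owner, supervisor, needs_director=False):
    """List the approval steps for one request, in order.

    Step names are hypothetical labels, not the activity names used in
    the event log.
    """
    steps = ["submitted by employee", "approved by travel administration"]
    steps.append("approved by budget owner")
    if budget_owner != supervisor:
        # Two different people, so both approval steps are taken.
        # (When they are the same person only one step is taken; keeping
        # the budget owner step in that case is an assumption.)
        steps.append("approved by supervisor")
    if needs_director:
        steps.append("approved by director")
    return steps

same_person = approval_steps("alice", "alice")
two_people = approval_steps("alice", "bob", needs_director=True)
```

A sequence generated this way could serve as a reference trace when checking which cases in the log deviate from the described flow.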

  10. Experimental test datasets for Fast and Scalable Implementation of the...

    • springernature.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Florian Wenzel; Théo Galy-Fajou; Matthäus Deutsch; Marius Kloft (2023). Experimental test datasets for Fast and Scalable Implementation of the Bayesian SVM [Dataset]. http://doi.org/10.6084/m9.figshare.5443621.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Florian Wenzel; Théo Galy-Fajou; Matthäus Deutsch; Marius Kloft
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This record contains seven real-world test datasets used in experiments with the Bayesian SVM algorithm in the ECML PKDD 2017 paper: Wenzel et al., Bayesian Nonlinear Support Vector Machines for Big Data. For code used in the related experiments please see https://doi.org/10.6084/m9.figshare.5443627

    The datasets are used in the related experiments to compare the prediction performance, the quality of the uncertainty estimates and run time of the various methods. Collectively these contain millions of samples. The datasets are all from the Rätsch benchmark datasets commonly used to test the accuracy of binary nonlinear classifiers.

    Data files are in the .data format used by Analysis Studio, a statistical analysis and data mining program. Each file contains mined data in a plain text, tab-delimited format, including an Analysis Studio file header. The raw data can be openly accessed with text-editing software.

    The data are from a range of disciplines that correspond to applications considered in the related publication:

    • Processed_BreastCancer.data
    • Processed_Diabetis.data
    • Processed_Flare.data
    • Processed_German.data
    • Processed_Heart.data
    • Processed_Splice.data
    • Processed_Waveform.data

    Background

    We propose a fast inference method for Bayesian nonlinear support vector machines that leverages stochastic variational inference and inducing points. Our experiments show that the proposed method is faster than competing Bayesian approaches and scales easily to millions of data points. It provides additional features over frequentist competitors such as accurate predictive uncertainty estimates and automatic hyperparameter search.

    Please also check out our github repository: https://github.com/theogf/BayesianSVM.jl


Cite
Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1

Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
