29 datasets found
  1. Survey Data - Entrepreneurs Data Mining

    • kaggle.com
    zip
    Updated Nov 21, 2024
    Cite
    Lay Christian (2024). Survey Data - Entrepreneurs Data Mining [Dataset]. https://www.kaggle.com/datasets/laychristian/survey-data-entrepreneurs-data-mining
    Explore at:
    zip (38815 bytes)
    Dataset updated
    Nov 21, 2024
    Authors
    Lay Christian
    Description

    Title: Identifying Factors that Affect Entrepreneurs’ Use of Data Mining for Analytics
    Authors: Edward Matthew Dominica, Feylin Wijaya, Andrew Giovanni Winoto, Christian
    Conference: The 4th International Conference on Electrical, Computer, Communications, and Mechatronics Engineering (ICECCME) https://www.iceccme.com/home

    This dataset was created to support research focused on understanding the factors influencing entrepreneurs’ adoption of data mining techniques for business analytics. The dataset contains carefully curated data points that reflect entrepreneurial behaviors, decision-making criteria, and the role of data mining in enhancing business insights.

    Researchers and practitioners can leverage this dataset to explore patterns, conduct statistical analyses, and build predictive models to gain a deeper understanding of entrepreneurial adoption of data mining.

    Intended Use: This dataset is designed for research and academic purposes, especially in the fields of business analytics, entrepreneurship, and data mining. It is suitable for conducting exploratory data analysis, hypothesis testing, and model development.

    Citation: If you use this dataset in your research or publication, please cite the paper presented at the ICECCME 2024 conference using the following format: Edward Matthew Dominica, Feylin Wijaya, Andrew Giovanni Winoto, Christian. Identifying Factors that Affect Entrepreneurs’ Use of Data Mining for Analytics. The 4th International Conference on Electrical, Computer, Communications, and Mechatronics Engineering (2024).

  2. Educational Attainment in North Carolina Public Schools: Use of statistical...

    • data.mendeley.com
    Updated Nov 14, 2018
    Cite
    Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
    Explore at:
    Dataset updated
    Nov 14, 2018
    Authors
    Scott Herford
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques, such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset: before doing any work on the data, the data have to be pre-processed, which normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, using clustering prior to classification did not improve performance much. The reason may be that the features we selected for clustering were not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

    From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance', and in high dimensions Euclidean distance loses pretty much all meaning. Therefore "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.

    From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When using clustering prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we cluster on is well suited for it, it might increase the overall classification performance; for example, if the features we run k-means on are numerical and the dimension is small, the overall classification performance may be better.

    We did not lock in the clustering outputs with a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods at all. In practice, the results we saw when applying clustering in preprocessing were not much better than random.

    Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to continue revising the models from time to time as things change.
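    As an illustration of the experiment described above, here is a minimal sketch; the file name, column names, and cluster count are placeholders rather than the actual schema of this dataset:

    ```python
    # Hypothetical sketch: does k-means as a preprocessing step help classification?
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("nc_schools.csv")                        # placeholder path
    X, y = df.drop(columns=["attainment"]), df["attainment"]  # placeholder target

    # Baseline: classify on the raw features.
    base = cross_val_score(RandomForestClassifier(), X, y, cv=5).mean()

    # Add a cluster label as a new feature; no random_state, so repeated runs
    # reveal whether the clustering (and the downstream score) is stable.
    scores = []
    for _ in range(5):
        X_aug = X.copy()
        X_aug["cluster"] = KMeans(n_clusters=8, n_init=10).fit_predict(X)
        scores.append(cross_val_score(RandomForestClassifier(), X_aug, y, cv=5).mean())

    print(f"baseline={base:.3f}, with-cluster runs={[round(s, 3) for s in scores]}")
    ```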

  3. Prediction of Online Orders

    • kaggle.com
    zip
    Updated May 23, 2023
    Cite
    Oscar Aguilar (2023). Prediction of Online Orders [Dataset]. https://www.kaggle.com/datasets/oscarm524/prediction-of-orders/versions/3
    Explore at:
    zip (6680913 bytes)
    Dataset updated
    May 23, 2023
    Authors
    Oscar Aguilar
    Description

    The visit of an online shop by a potential customer is called a session. During a session the visitor clicks on products to view the corresponding detail pages, and may add or remove products to/from the shopping basket. At the end of a session, one or several products from the shopping basket may be ordered. The activities of the user are called transactions. The goal of the analysis is to predict, on the basis of the transaction data collected during the session, whether the visitor will place an order or not.

    Tasks

    In the first task, historical shop data are given, consisting of the session activities together with the information whether an order was placed or not. These data can be used to subsequently make order forecasts for other session activities in the same shop, for which the real outcomes are of course not known. The first task can thus be understood as a classical data mining problem.

    The second task deals with the online scenario. Here the participants are to implement an agent that learns from transactions: the agent successively receives individual transactions and has to make a forecast for each of them with respect to the outcome of the shopping-cart transaction. This task maps the practical scenario in which a transaction-based forecast is required and the algorithm has to learn adaptively.
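    For task 2, a minimal online-learning agent can be sketched with an incremental classifier; the synthetic stream and feature layout below are assumptions, not the actual transaction format:

    ```python
    # Prequential evaluation: predict each transaction's outcome, then learn from it.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(0)
    stream = [(rng.normal(size=5), int(rng.integers(0, 2))) for _ in range(1000)]

    agent = SGDClassifier()
    classes = np.array([0, 1])          # order placed or not
    correct = 0

    for seen, (x, y) in enumerate(stream):
        x = x.reshape(1, -1)
        if seen > 0:                    # predict before learning
            correct += int(agent.predict(x)[0] == y)
        agent.partial_fit(x, [y], classes=classes if seen == 0 else None)

    print(f"online accuracy: {correct / (len(stream) - 1):.3f}")
    ```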

    The Data

    For the individual tasks, anonymised real shop data are provided in the form of structured text files consisting of individual data sets. Each data set represents transactions in the shop and may contain redundant information. In particular, the following applies to the data:

    1. Each data set is on an individual line, terminated by “LF” (“line feed”, 0xA), “CR” (“carriage return”, 0xD), or “CR” and “LF” (0xD and 0xA).
    2. The first line is structured analogously to the data sets but contains the names of the respective columns (data fields).
    3. The header and each data set contain several fields separated by the symbol “|”.
    4. There is no escape character, and no quoting is used.
    5. The character set is ASCII.
    6. There may be missing values, marked by the symbol “?”.

    Specifically, only the field names from the attached document “*features.pdf*”, in the order given there, are used as column headings. The corresponding value ranges are also listed there.

    The training file for task 1 (“*transact_train.txt*”) contains all data fields of the document, whereas the corresponding classification file (“*transact_class.txt*”) of course does not contain the target attribute “*order*”.

    In task 2, the data are passed to the participants' implementations as a string array via a method call. The individual fields of the array contain the same data fields that are listed in “*features.pdf*”, again without the target attribute “*order*”, and in exactly the sequence used there.
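    Under rules 1-6 above, the training file can be loaded in a few lines; only the file name and the target attribute *order* are taken from the text:

    ```python
    import pandas as pd

    transact = pd.read_csv(
        "transact_train.txt",
        sep="|",             # rule 3: fields separated by "|"
        na_values="?",       # rule 6: missing values are marked "?"
        encoding="ascii",    # rule 5
    )
    print(transact.shape)
    print(transact["order"].value_counts(dropna=False))  # target attribute (task 1 only)
    ```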

    Acknowledgement

    This dataset is publicly available on the Data Mining Cup website.

  4. DataSheet1_Monitoring Neutral Axis Position Using Monthly Sample Residuals...

    • figshare.com
    • frontiersin.figshare.com
    docx
    Updated Jun 5, 2023
    Cite
    Christos Aloupis; Harry W. Shenton; Michael J. Chajes (2023). DataSheet1_Monitoring Neutral Axis Position Using Monthly Sample Residuals as Estimated From a Data Mining Model.docx [Dataset]. http://doi.org/10.3389/fbuil.2021.625754.s001
    Explore at:
    docx
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    Frontiers
    Authors
    Christos Aloupis; Harry W. Shenton; Michael J. Chajes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Structural Health Monitoring (SHM) has enabled the condition of large structures, like bridges, to be evaluated in real time. In order to monitor behavioral changes, it is essential to identify parameters of the structure that are sensitive enough to capture damage as it develops while being stable enough during ambient behavior of the structure. Research has shown that monitoring the neutral axis (N.A.) position satisfies the first criterion of sensitivity; however, monitoring N.A. location is challenging because its position is affected by the loads applied to the structure. The motivation behind this research comes from the greater than expected impact of various load characteristics on observed N.A. location. This paper develops an indirect way to estimate the characteristics of vehicular loads (magnitude and lateral position of the load) and uses a data mining approach to predict the expected location of the N.A. Instead of monitoring the behavior of the N.A., in the proposed method the residuals between the monitored and predicted N.A. location are monitored. Using actual SHM data collected from a cable-stayed bridge, over a 2-year period, the paper presents the steps to be followed for creating a data mining model to predict N.A. location, the use of monthly sample residuals of N.A. to capture behavioral changes, the ability of the method to distinguish between changes in the load characteristics from behavioral changes of the structure (e.g. change in response due to cracking, bearings becoming frozen, cables losing tension, etc.), and the high sensitivity of the method that allows capturing of minor changes.
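    As a rough illustration of that pipeline (fit a model of N.A. position on estimated load characteristics during a baseline period, then track monthly sample residuals), here is a hedged sketch; every column name and threshold below is invented, not taken from the bridge dataset:

    ```python
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    df = pd.read_csv("shm_events.csv", parse_dates=["timestamp"])   # placeholder

    train = df[df["timestamp"] < "2019-01-01"]                      # baseline period
    model = GradientBoostingRegressor().fit(
        train[["load_magnitude", "lateral_position"]], train["na_location"]
    )

    # Monitor monthly sample residuals instead of the raw N.A. position.
    df["residual"] = df["na_location"] - model.predict(
        df[["load_magnitude", "lateral_position"]]
    )
    monthly = df.set_index("timestamp")["residual"].resample("M").mean()

    # Flag months whose mean residual drifts beyond the baseline control limits.
    mu, sigma = monthly.iloc[:12].mean(), monthly.iloc[:12].std()
    print(monthly[(monthly - mu).abs() > 3 * sigma])
    ```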

  5. Data from: Modeling of stem form and volume through machine learning

    • scielo.figshare.com
    jpeg
    Updated Jun 6, 2023
    Cite
    ANA B. SCHIKOWSKI; ANA P.D. CORTE; MARIELI S. RUZA; CARLOS R. SANQUETTA; RAZER A.N.R. MONTAÑO (2023). Modeling of stem form and volume through machine learning [Dataset]. http://doi.org/10.6084/m9.figshare.7244495.v1
    Explore at:
    jpeg
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    SciELO: http://www.scielo.org/
    Authors
    ANA B. SCHIKOWSKI; ANA P.D. CORTE; MARIELI S. RUZA; CARLOS R. SANQUETTA; RAZER A.N.R. MONTAÑO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: Taper functions and volume equations, which rest on consolidated theory, are essential for estimating individual tree volume. Mathematical innovation, on the other hand, is dynamic and may improve forestry modeling. The objective was to analyze the accuracy of machine learning (ML) techniques relative to a volumetric model and a taper function for acácia negra. We used cubing data and fitted the Schumacher and Hall volumetric model and the Hradetzky taper function, comparing them with three algorithms, k-nearest neighbors (k-NN), Random Forest (RF), and Artificial Neural Networks (ANN), for estimating total volume and diameter at relative heights. Models were ranked according to error statistics, and their dispersion was examined. The Schumacher and Hall model and the ANN showed the best results for volume estimation as a function of dap (diameter at breast height) and total height. The machine learning methods were more accurate than the Hradetzky polynomial for tree form estimation. ML models have proven appropriate as an alternative to traditional modeling in forest measurement; however, they must be applied with care because overfitting is likely.
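    A hedged sketch of the comparison the abstract describes, fitting Schumacher and Hall log-linearly against a Random Forest; the column names (dap, h, v) are assumptions about the cubing data:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("cubing.csv")       # placeholder for the cubing data
    X, v = df[["dap", "h"]], df["v"]

    # Schumacher & Hall: v = b0 * dap^b1 * h^b2, i.e. ln v = ln b0 + b1 ln dap + b2 ln h
    sh = LinearRegression().fit(np.log(X), np.log(v))
    v_sh = np.exp(sh.predict(np.log(X)))

    rf = RandomForestRegressor(n_estimators=300).fit(X, v)
    v_rf = rf.predict(X)

    for name, pred in [("Schumacher-Hall", v_sh), ("Random Forest", v_rf)]:
        print(name, "RMSE =", round(float(np.sqrt(np.mean((v - pred) ** 2))), 4))
    ```

    For a fair ranking, the errors should be estimated by cross-validation rather than in-sample, which is exactly the overfitting caveat the abstract raises.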

  6. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An Exploratory Data Analysis comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions; the mined facts support the arguments that influence the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, that you find in the repository. In effect, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing applied consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.

    Transformation. In this stage we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to use when tuning the explainable models.
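    A sketch of that stage under stated assumptions (a synthetic stand-in for the 35-feature matrix): project to 2 principal components, then scan candidate cluster counts for the point where the reduction in within-cluster variance levels off.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.default_rng(1).normal(size=(128, 35))   # stand-in for the 35 features
    X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

    # Elbow scan: inertia is the within-cluster sum of squared distances.
    for k in range(2, 9):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X2).inertia_
        print(k, round(float(inertia), 1))
    ```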

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships in the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A fuller explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, which produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = the number of occurrences in which the statement is true, divided by the total number of statements.
    Confidence = the support of the statement, divided by the number of occurrences of the premise.
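    In code, the two definitions translate directly; the toy records below only mirror the reading given for the figure:

    ```python
    def support(rows, statement):
        return sum(statement(r) for r in rows) / len(rows)

    def confidence(rows, premise, conclusion):
        return support(rows, lambda r: premise(r) and conclusion(r)) / support(rows, premise)

    # Toy example: "Supervised Learning => irreproducible".
    papers = [
        {"supervised": True,  "irreproducible": True},
        {"supervised": True,  "irreproducible": True},
        {"supervised": True,  "irreproducible": False},
        {"supervised": False, "irreproducible": False},
    ]
    print(support(papers, lambda p: p["supervised"] and p["irreproducible"]))   # 0.5
    print(confidence(papers, lambda p: p["supervised"],
                     lambda p: p["irreproducible"]))                            # 0.667
    ```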

  7. Generative AI In Data Analytics Market Analysis, Size, and Forecast...

    • technavio.com
    pdf
    Updated Jul 17, 2025
    Cite
    Technavio (2025). Generative AI In Data Analytics Market Analysis, Size, and Forecast 2025-2029: North America (US, Canada, and Mexico), Europe (France, Germany, and UK), APAC (China, India, and Japan), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/generative-ai-in-data-analytics-market-industry-analysis
    Explore at:
    pdf
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United States
    Description


    Generative AI In Data Analytics Market Size 2025-2029

    The generative AI in data analytics market is forecast to increase by USD 4.62 billion, at a CAGR of 35.5% from 2024 to 2029. Democratization of data analytics and increased accessibility will drive the market.

    Market Insights

    North America dominated the market and is expected to account for 37% of growth during 2025-2029.
    By Deployment - Cloud-based segment was valued at USD 510.60 billion in 2023
    By Technology - Machine learning segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 621.84 million 
    Market Future Opportunities 2024: USD 4624.00 million
    CAGR from 2024 to 2029: 35.5%
    

    Market Summary

    The market is experiencing significant growth as businesses worldwide seek to unlock new insights from their data through advanced technologies. This trend is driven by the democratization of data analytics and increased accessibility of AI models, which are now available in domain-specific and enterprise-tuned versions. Generative AI, a subset of artificial intelligence, uses deep learning algorithms to create new data based on existing data sets. This capability is particularly valuable in data analytics, where it can be used to generate predictions, recommendations, and even new data points. One real-world business scenario where generative AI is making a significant impact is in supply chain optimization. In this context, generative AI models can analyze historical data and generate forecasts for demand, inventory levels, and production schedules. This enables businesses to optimize their supply chain operations, reduce costs, and improve customer satisfaction. However, the adoption of generative AI in data analytics also presents challenges, particularly around data privacy, security, and governance. As businesses continue to generate and analyze increasingly large volumes of data, ensuring that it is protected and used in compliance with regulations is paramount. Despite these challenges, the benefits of generative AI in data analytics are clear, and its use is set to grow as businesses seek to gain a competitive edge through data-driven insights.

    What will be the size of the Generative AI In Data Analytics Market during the forecast period?

    Generative AI, a subset of artificial intelligence, is revolutionizing data analytics by automating data processing and analysis, enabling businesses to derive valuable insights faster and more accurately. Synthetic data generation, a key application of generative AI, allows for the creation of large, realistic datasets, addressing the challenge of insufficient data in analytics. Parallel processing methods and high-performance computing power the rapid analysis of vast datasets. Automated machine learning and hyperparameter optimization streamline model development, while model monitoring systems ensure continuous model performance. Real-time data processing and scalable data solutions facilitate data-driven decision-making, enabling businesses to respond swiftly to market trends. One significant trend in the market is the integration of AI-powered insights into business operations. For instance, probabilistic graphical models and backpropagation techniques are used to predict customer churn and optimize marketing strategies. Ensemble learning methods and transfer learning techniques enhance predictive analytics, leading to improved customer segmentation and targeted marketing. According to recent studies, businesses have achieved a 30% reduction in processing time and a 25% increase in predictive accuracy by implementing generative AI in their data analytics processes. This translates to substantial cost savings and improved operational efficiency. By embracing this technology, businesses can gain a competitive edge, making informed decisions with greater accuracy and agility.

    Unpacking the Generative AI In Data Analytics Market Landscape

    In the dynamic realm of data analytics, Generative AI algorithms have emerged as a game-changer, revolutionizing data processing and insights generation. Compared to traditional data mining techniques, Generative AI models can create new data points that mirror the original dataset, enabling more comprehensive data exploration and analysis (Source: Gartner). This innovation leads to a 30% increase in identified patterns and trends, resulting in improved ROI and enhanced business decision-making (IDC).

    Data security protocols are paramount in this context, with Classification Algorithms and Clustering Algorithms ensuring data privacy and compliance alignment. Machine Learning Pipelines and Deep Learning Frameworks facilitate seamless integration with Predictive Modeling Tools and Automated Report Generation on Cloud

  8. HUN GW Model Mines raw data v01

    • gimi9.com
    Cite
    HUN GW Model Mines raw data v01 | gimi9.com [Dataset]. https://gimi9.com/dataset/au_709cd6d1-8579-4aea-887d-b06d32a71961/
    Explore at:
    Description

    Abstract

    This dataset and its metadata statement were supplied to the Bioregional Assessment Programme by a third party and are presented here as originally supplied. Raw data used to build the groundwater model of the Hunter subregion. Various types of data are included, for instance PDFs downloaded from the internet, AutoCAD files, PowerPoint files describing mine plans, shape files, etc. The data are organised by mining company (e.g. "anglo"), by mine name (e.g. "drayton_south"), or by data type (e.g. "alluvium").

    Dataset History

    This is raw data, received directly from mining companies or downloaded from the internet. No data have been manipulated.

    Dataset Citation

    Bioregional Assessment Programme (2016) HUN GW Model Mines raw data v01. Bioregional Assessment Source Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/709cd6d1-8579-4aea-887d-b06d32a71961.

  9. Data from: Generation of Pairwise Potentials Using Multidimensional Data...

    • figshare.com
    • acs.figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Zheng Zheng; Jun Pei; Nupur Bansal; Hao Liu; Lin Frank Song; Kenneth M. Merz (2023). Generation of Pairwise Potentials Using Multidimensional Data Mining [Dataset]. http://doi.org/10.1021/acs.jctc.8b00516.s004
    Explore at:
    xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    ACS Publications
    Authors
    Zheng Zheng; Jun Pei; Nupur Bansal; Hao Liu; Lin Frank Song; Kenneth M. Merz
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The rapid development of molecular structural databases provides the chemistry community access to an enormous array of experimental data that can be used to build and validate computational models. Using radial distribution functions collected from experimentally available X-ray and NMR structures, a number of so-called statistical potentials have been developed over the years using the structural data mining strategy. These potentials have been developed within the context of the two-particle Kirkwood equation by extending its original use for isotropic monatomic systems to anisotropic biomolecular systems. However, the accuracy and the unclear physical meaning of statistical potentials have long formed the central arguments against such methods. In this work, we present a new approach to generate molecular energy functions using structural data mining. Instead of employing the Kirkwood equation and introducing the “reference state” approximation, we model the multidimensional probability distributions of the molecular system using graphical models and generate the target pairwise Boltzmann probabilities using the Bayesian field theory. Different from the current statistical potentials that mimic the “knowledge-based” PMF based on the 2-particle Kirkwood equation, the graphical-model-based structure-derived potential developed in this study focuses on the generation of lower-dimensional Boltzmann distributions of atoms through reduction of dimensionality. We have named this new scoring function GARF, and in this work we focus on the mathematical derivation of our novel approach followed by validation studies on its ability to predict protein–ligand interactions.
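    For context, the conventional "knowledge-based" PMF that this work moves away from is obtained by inverse-Boltzmann inversion of a mined pair distribution against a reference state; this is the textbook relation, not a formula taken from the dataset:

    ```latex
    % Statistical pair potential of mean force from mined structures:
    % g_obs(r) is the observed pair distribution, g_ref(r) the reference state.
    W(r) = -k_B T \,\ln\!\left[\frac{g_{\mathrm{obs}}(r)}{g_{\mathrm{ref}}(r)}\right]
    ```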

  10. RICO dataset

    • kaggle.com
    zip
    Updated Dec 1, 2021
    Cite
    Onur Gunes (2021). RICO dataset [Dataset]. https://www.kaggle.com/datasets/onurgunes1993/rico-dataset
    Explore at:
    zip (6703669364 bytes)
    Dataset updated
    Dec 1, 2021
    Authors
    Onur Gunes
    Description

    Context

    Data-driven models help mobile app designers understand best practices and trends, and can be used to make predictions about design performance and support the creation of adaptive UIs. This paper presents Rico, the largest repository of mobile app designs to date, created to support five classes of data-driven applications: design search, UI layout generation, UI code generation, user interaction modeling, and user perception prediction. To create Rico, we built a system that combines crowdsourcing and automation to scalably mine design and interaction data from Android apps at runtime. The Rico dataset contains design data from more than 9.3k Android apps spanning 27 categories. It exposes visual, textual, structural, and interactive design properties of more than 66k unique UI screens. To demonstrate the kinds of applications that Rico enables, we present results from training an autoencoder for UI layout similarity, which supports query-by-example search over UIs.

    Content

    Rico was built by mining Android apps at runtime via human-powered and programmatic exploration. Like its predecessor ERICA, Rico’s app mining infrastructure requires no access to — or modification of — an app’s source code. Apps are downloaded from the Google Play Store and served to crowd workers through a web interface. When crowd workers use an app, the system records a user interaction trace that captures the UIs visited and the interactions performed on them. Then, an automated agent replays the trace to warm up a new copy of the app and continues the exploration programmatically, leveraging a content-agnostic similarity heuristic to efficiently discover new UI states. By combining crowdsourcing and automation, Rico can achieve higher coverage over an app’s UI states than either crawling strategy alone. In total, 13 workers recruited on UpWork spent 2,450 hours using apps on the platform over five months, producing 10,811 user interaction traces. After collecting a user trace for an app, we ran the automated crawler on the app for one hour.

    Acknowledgements

    UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN https://interactionmining.org/rico

    Inspiration

    The Rico dataset is large enough to support deep learning applications. We trained an autoencoder to learn an embedding for UI layouts, and used it to annotate each UI with a 64-dimensional vector representation encoding visual layout. This vector representation can be used to compute structurally — and often semantically — similar UIs, supporting example-based search over the dataset. To create training inputs for the autoencoder that embed layout information, we constructed a new image for each UI capturing the bounding box regions of all leaf elements in its view hierarchy, differentiating between text and non-text elements. Rico’s view hierarchies obviate the need for noisy image processing or OCR techniques to create these inputs.
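    Example-based search over those 64-dimensional layout vectors reduces to nearest-neighbor lookup; a minimal sketch, where the vector file name is an assumption rather than a documented part of this Kaggle mirror:

    ```python
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    vectors = np.load("ui_layout_vectors.npy")     # assumed shape: (num_uis, 64)
    index = NearestNeighbors(n_neighbors=6).fit(vectors)

    dist, idx = index.kneighbors(vectors[0:1])     # query-by-example with one UI
    print(idx[0][1:])   # the 5 most structurally similar UIs (excluding the query)
    ```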

  11. Data for: Epidemiological landscape of Batrachochytrium dendrobatidis and...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Dec 14, 2023
    Cite
    M. Delia Basanta; Julián A. Velasco; Constantino González-Salazar (2023). Data for: Epidemiological landscape of Batrachochytrium dendrobatidis and its impact on amphibian diversity at global scale [Dataset]. http://doi.org/10.5061/dryad.83bk3j9zv
    Explore at:
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    Dryad Digital Repository
    Authors
    M. Delia Basanta; Julián A. Velasco; Constantino González-Salazar
    Time period covered
    Jan 1, 2023
    Description

    Chytridiomycosis, caused by the fungal pathogen Batrachochytrium dendrobatidis (Bd), is a major driver of amphibian decline worldwide. The global presence of Bd is driven by a synergy of factors, such as climate, species life history, and amphibian host susceptibility. Here, using a Bayesian data-mining approach, we modeled the epidemiological landscape of Bd to evaluate how infection varies across several spatial, ecological, and phylogenetic scales. We compiled global information on Bd occurrence, climate, species ranges, and phylogenetic diversity to infer the potential distribution and prevalence of Bd. By calculating the degree of co-distribution between Bd and our set of environmental and biological variables (e.g. climate and species), we identified the factors that could potentially be related to Bd presence and prevalence using a geographic correlation metric, epsilon (ε). We fitted five ecological models based on 1) amphibian species identity, 2) phylogenetic species varia...

    Usage notes

    These datasets include the geographic data used to build ecological and geographical models for Batrachochytrium dendrobatidis, as well as supplementary results of the following paper: Basanta et al. Epidemiological landscape of Batrachochytrium dendrobatidis and its impact on amphibian diversity at the global scale. Missing values are denoted by NA. Details for each dataset are provided in the README file. Datasets included:

    Information of Bd records. Table S1.xls contains Bd occurrence records and prevalence of infection from the Bd-Maps online database (http://www.bd-maps.net; Olson et al. 2013), accessed in 2013, together with records from a Google Scholar search for recent papers with Bd infection reports using the keywords *Batrachochytrium dendrobatidis*. We excluded records from studies of captive individuals and those without coordinates, keeping only records in which coordinates reflected site-specific sample locations. Supplementary figures: Supplementary information S1.docx cont...

    1. Title of Dataset: Epidemiological landscape of Batrachochytrium dendrobatidis and its impact on amphibian diversity at global scale

    2. Authors Information

    M. Delia Basanta Department of Biology, University of Nevada Reno. Reno, Nevada, USA.

    Julián A. Velasco Instituto de Ciencias de la Atmósfera y Cambio Climático, Universidad Nacional Autónoma de México. Ciudad de México, México.

    Constantino González-Salazar. Instituto de Ciencias de la Atmósfera y Cambio Climático, Universidad Nacional Autónoma de México. Ciudad de México, México.

    3. Date of data collection (single date, range, approximate date): 2019-2022

    4. Geographic location of data collection: Global

    DATA & FILE OVERVIEW

    1. File List:

    1. Table S1.xls
    2. Supplementary information S1.docx
    3. Table S2.xlsx
    4. Table S3.xlsx
    5. Table S4.xlsx

    DATA-SPECIFIC INFORMATION FOR: Table S1.xls

    Table S1.xls contains Bd occurrence records and prevalence of infection from the Bd-Maps online da...

  12. Data Science Platform Industry Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 30, 2025
    Cite
    Market Report Analytics (2025). Data Science Platform Industry Report [Dataset]. https://www.marketreportanalytics.com/reports/data-science-platform-industry-89665
    Explore at:
    ppt, pdf, doc
    Dataset updated
    Apr 30, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Data Science Platform market is experiencing robust growth, projected to reach $10.15 billion in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 23.50% from 2025 to 2033. This expansion is fueled by several key drivers. The increasing volume and complexity of data generated across diverse industries necessitates sophisticated platforms for analysis and insights extraction. Businesses are increasingly adopting cloud-based solutions for their scalability, cost-effectiveness, and accessibility, driving the growth of the cloud deployment segment. Furthermore, the rising demand for advanced analytics capabilities across sectors like BFSI (Banking, Financial Services, and Insurance), retail and e-commerce, and IT & Telecom is significantly boosting market demand. The availability of robust and user-friendly platforms is empowering businesses of all sizes, from SMEs to large enterprises, to leverage data science effectively for improved decision-making and competitive advantage. The market is witnessing the emergence of innovative solutions such as automated machine learning (AutoML) and integrated platforms that combine data preparation, model building, and deployment capabilities. The market segmentation reveals significant opportunities across various offerings and deployment models. While the platform segment holds a larger share, the services segment is poised for significant growth driven by the need for expert consulting and support in data science projects. Geographically, North America currently dominates the market, but the Asia-Pacific region is expected to witness faster growth due to increasing digitalization and technological advancements. Key players like IBM, Google, Microsoft, and Amazon are driving innovation and competition, with new entrants continuously emerging, adding to the market's dynamism. While challenges such as data security and privacy concerns remain, the overall market outlook is exceptionally positive, promising considerable growth over the forecast period. Continued technological innovation, coupled with rising adoption across a wider array of industries, will be central to the market's continued expansion. Recent developments include: November 2023 - Stagwell announced a partnership with Google Cloud and SADA, a Google Cloud premier partner, to develop generative AI (gen AI) marketing solutions that support Stagwell agencies, client partners, and product development within the Stagwell Marketing Cloud (SMC). The partnership will help in harnessing data analytics and insights by developing and training a proprietary Stagwell large language model (LLM) purpose-built for Stagwell clients, productizing data assets via APIs to create new digital experiences for brands, and multiplying the value of their first-party data ecosystems to drive new revenue streams using Vertex AI and open source-based models. May 2023 - IBM launched a new AI and data platform, watsonx, aimed at allowing businesses to accelerate advanced AI usage with trusted data, speed and governance. IBM also introduced GPU-as-a-service, which is designed to support AI-intensive workloads, with an AI dashboard to measure, track and help report on cloud carbon emissions. With watsonx, IBM offers an AI development studio with access to IBM-curated and trained foundation models and open-source models, and access to a data store to gather and clean up training and tuning data.
    Key drivers for this market are: the rapid increase in big data, emerging promising use cases of data science and machine learning, and the shift of organizations toward a data-intensive approach and decisions.
    Potential restraints include: the rapid increase in big data, emerging promising use cases of data science and machine learning, and the shift of organizations toward a data-intensive approach and decisions.
    Notable trends are: small and medium enterprises to witness major growth.

  13. MODFLOW-NWT and MODPATH models, capture zones and uncertainty analysis for...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Oct 22, 2025
    Cite
    U.S. Geological Survey (2025). MODFLOW-NWT and MODPATH models, capture zones and uncertainty analysis for the Partridge River Basin, Minnesota [Dataset]. https://catalog.data.gov/dataset/modflow-nwt-and-modpath-models-capture-zones-and-uncertainty-analysis-for-the-partridge-ri
    Explore at:
    Dataset updated
    Oct 22, 2025
    Dataset provided by
    United States Geological Survey: http://www.usgs.gov/
    Area covered
    Minnesota
    Description

    A MODFLOW-NWT model was used to simulate the groundwater/surface-water interactions in the Partridge River Basin, MN using the Streamflow Routing and Unsaturated Zone Flow packages. The base model represents 2011-2013 average mining conditions and was used to build five mining scenario models, as described in the report. The base model and mining scenarios were used to estimate the base flow at 6 stream locations, pit inflow rates for the new hypothetical pits, and the average depth to water in twelve wetlands. PEST utilities were used to estimate the uncertainty of each of these forecasts. Particle tracking was performed with the MODFLOW solution (using MODPATH 7) and Monte Carlo techniques to create probabilistic capture zones. This USGS data release contains all of the input and output files for the simulations described in the associated model documentation report (https://doi.org/10.3133/sir20215038).
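    Conceptually, the probabilistic capture zones amount to counting, for each model cell, how often particles started there are captured across the Monte Carlo realizations; a toy post-processing sketch follows (array shapes and the capture test are invented, not the actual MODPATH output format):

    ```python
    import numpy as np

    n_real, nrow, ncol = 200, 100, 120
    rng = np.random.default_rng(42)

    capture_count = np.zeros((nrow, ncol))
    for _ in range(n_real):
        # Stand-in for one MODPATH run under one parameter realization:
        # a boolean grid marking cells whose particles end at the simulated pit.
        captured = rng.random((nrow, ncol)) < 0.3
        capture_count += captured

    capture_probability = capture_count / n_real    # 0..1 per model cell
    print(capture_probability.mean())
    ```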

  14. HUN RMS Output Dat Files v01

    • data.wu.ac.at
    • researchdata.edu.au
    Updated Jun 21, 2018
    Cite
    Bioregional Assessment Programme (2018). HUN RMS Output Dat Files v01 [Dataset]. https://data.wu.ac.at/schema/data_gov_au/YmRiM2Y4ZDMtMTI1YS00NGI5LWIwZTctZGI3NDNmMGExYTE5
    Explore at:
    Dataset updated
    Jun 21, 2018
    Dataset provided by
    Bioregional Assessment Programme
    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    The dataset contains the raw .dat versions of the structural geological model for the Hunter subregion. RMS geomodelling was used to construct the geological model for the Hunter subregion. The data set contains the depth to basement horizons, reference horizons, eroded horizons, isochores and well markers extracted from the completed geological model. The model was built using data extracted from well completion reports published by mining companies and consultants, which record the depth of various formations encountered during drilling works. These data were compiled into model input files (see dataset: Hunter deep well completion reports - f2df86d5-6749-48c7-a445-d60067109f08) used to build the RMS model.

    Nine geological formations and their depths from the surface are included covering a grid across the Basin. The geological model is based on measured depths recorded in well completion reports published by mining companies and consultancies.

    The naming convention refers to the geological age and depth (TVD ss = total vertical depth subsea reported to the Australian Height Datum) of the various formations as follows:

    Regional horizon name | Age (geological stage) | Newcastle Coalfield | Hunter Coalfield | Western Coalfield | Central or Southern Coalfields
    M600 | Top Anisian | Top Hawkesbury Sandstone | Top Hawkesbury Sandstone | Top Hawkesbury Sandstone | Base Wianamatta Group
    M700 | Top Olenekian | Base Hawkesbury Sandstone | Base Hawkesbury Sandstone | Base Hawkesbury Sandstone | Base Hawkesbury Sandstone
    P000 | Top Changhsingian | Base Narrabeen Group | Base Narrabeen Group | Base Narrabeen Group | Base Narrabeen Group
    P100 | Upper Wuchiapingian | Base Newcastle Coal Measures | Base Newcastle Coal Measures | Top Watts Sandstone | Top Bargo Claystone
    P500 | Mid Capitanian | Base Tomago Coal Measures | Base Wittingham Coal Measures | Base Illawarra Coal Measures | Base Illawarra Coal Measures
    P550 | Top Wordian | Base Mulbring Siltstone | Base Mulbring Siltstone | Base Berry Siltstone | Base Berry Siltstone
    P600 | Mid Roadian | Base Maitland Group | Base Maitland Group | Base Shoalhaven Group | Base Shoalhaven Group
    P700 | Upper Kungurian | Top Base Greta Coal Measures | Top Base Greta Coal Measures | — | —
    P900 | Base Serpukhovian | Base Seaham Formation | Base Seaham Formation | — | —

    with 'M' referring to Mesozoic and 'P' to Paleozoic

    Dataset History

    RMS geomodelling was used to construct the geological model for the Hunter subregion. The data set contains the layers in the completed geological model. The model was built using data extracted from well completion reports published by mining companies and consultants, which record the depth of various formations encountered during drilling works. These data were compiled into model input files (see dataset: Hunter deep well completion reports - f2df86d5-6749-48c7-a445-d60067109f08) used to build the RMS model.

    This model has a horizontal resolution of 2000 x 2000 m (x, y), with 109 vertical layers, for a total of 511,118 cells. The depth ranges between 1185 m above sea level and 5062 m below sea level.

    Data originally sourced from 44 well completion reports and incorporated into the geological model. The reference horizons were exported from RMS software as .dat files.

    Dataset Citation

    Bioregional Assessment Programme (XXXX) HUN RMS Output Dat Files v01. Bioregional Assessment Derived Dataset. Viewed 22 June 2018, http://data.bioregionalassessments.gov.au/dataset/c975d250-b699-4585-b32f-cbfde4d8d436.

    Dataset Ancestors

  15. LongAlpaca-Yukang ML Instructional Outputs

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). LongAlpaca-Yukang ML Instructional Outputs [Dataset]. https://www.kaggle.com/datasets/thedevastator/longalpaca-yukang-ml-instructional-outputs
    Explore at:
    zip (168273444 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    LongAlpaca-Yukang ML Instructional Outputs

    Unlocking the Power of AI

    By Huggingface Hub [source]

    About this dataset

    This dataset contains 12000 instructional outputs from LongAlpaca-Yukang Machine Learning system, unlocking the cutting-edge power of Artificial Intelligence for users. With this data, researchers have an abundance of information to explore the mysteries behind AI and how it works. This dataset includes columns such as output, instruction, file and input which provide endless possibilities of analysis ripe for you to discover! Teeming with potential insights into AI’s functioning and implications for our everyday lives, let this data be your guide in unravelling the many secrets yet to be discovered in the world of AI


    How to use the dataset

    Exploring the Dataset:

    The dataset contains 12000 rows of information, with four columns containing output, instruction, file and input data. You can use these columns to explore the workings of a machine learning system, examine different instructional outputs for different inputs or instructions, study training data for specific ML systems, or analyze files being used by a machine learning system.

    Visualizing Data:

    Using built-in plotting tools within your chosen toolkit (such as Python), you can create powerful visualizations. Plotting outputs versus input instructions will give you an overview of what your machine learning system is capable of doing, and how it performs on different types of tasks or problems. You could also plot outputs alongside the files being used; this would help identify patterns in training data and areas that need improvement in your machine learning models.

    Analyzing Performance:

    Using statistical analysis techniques such as regression or clustering algorithms, you can measure performance metrics such as accuracy and understand how they vary across instruction types. Experimenting with hyperparameter tuning may help reveal which settings yield better results for a given situation. Additionally, correlations between input samples and output measurements can be examined to identify relationships, such as trends in accuracy over certain sets of instructions.

    Drawing Conclusions:

    By leveraging big data mining tools, you can build comprehensive predictive models that project future outcomes from past performance measurements across the various instruction types in the dataset, allowing you to determine whether certain changes improve outcomes over time for your AI model's capability and predictability!

    Research Ideas

    • Developing self-improving Artificial Intelligence algorithms by using the outputs and instructional data to identify correlations and feedback loop structures between instructions and output results.
    • Generating Machine Learning simulations using this dataset to optimize AI performance based on given instruction set.
    • Using the instructions, input, and output data in the dataset to build AI systems for natural language processing, enabling comprehensive understanding of user queries and providing more accurate answers accordingly

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:-------------------------------------------------------|
    | output      | The output of the instruction given. (String)          |
    | file        | The file used when executing the instruction. (String) |
    | input       | Additional context for the instruction. (String)       |
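    A quick way to start exploring the table above (the path of train.csv inside the download is an assumption):

    ```python
    import pandas as pd

    df = pd.read_csv("train.csv")
    print(df.shape)                          # expected: 12000 rows
    print(df.columns.tolist())               # output, instruction, file, input
    print(df["file"].value_counts().head())  # which files the instructions draw on
    ```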

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  16. 99 Little Orange, Technical Business Case

    • kaggle.com
    zip
    Updated Jun 13, 2022
    Cite
    IVAN CHAVEZ (2022). 99 Little Orange, Technical Business Case [Dataset]. https://www.kaggle.com/datasets/ivanchvez/99littleorange
    Explore at:
    zip (91998345 bytes)
    Dataset updated
    Jun 13, 2022
    Authors
    IVAN CHAVEZ
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    99 Little Orange, Technical Business Case

    Dear candidate, we are so excited about your interest in working with us! This challenge is an opportunity for us to get to know a bit of the great talent we know you have. It was built to simulate real-case scenarios that you would face while working at [Organization] and is organized in two parts:

      1. A technical part of close-ended questions with specific answers, meant to assess your ability to analyze large amounts of data with SQL to answer key questions.
      2. An analytical part of open-ended questions to assess your ability to build data-backed recommendations to support decision-making. Expect further questions and discussions on top of your answers in the next phase of our hiring process.

    Part I - Technical

    Provide both the answer and the SQL code used.

    1. What is the average trip cost on holidays? How does it compare to non-holidays?
    2. Find the average call time of the first time passengers make a trip.
    3. Find the average number of trips per driver for each day of the week.
    4. Which day of the week do drivers usually drive the most distance, on average?
    5. What was the growth percentage of rides month over month?
    6. Optional. List the top 5 drivers by number of trips in the top 5 largest cities.

    Part II - Analytical

    99 is a marketplace where drivers are the supply and passengers the demand. One of our main challenges is to keep this marketplace balanced. If there is too much demand, prices increase due to surge and passengers prefer not to ride. If there is too much supply, drivers spend more time idle, hurting their revenue.

    1. Let's say it's 2019-09-23 and a new Operations manager for The Shire was just hired. She has 5 minutes during the Ops weekly meeting to present an overview of the business in the city, and since she's just arrived, she asked your help to do it. What would you prepare for this 5-minute presentation? Please provide 1-2 slides with your idea.
    2. She also mentioned she has a budget to invest in promoting the business. What kind of metrics and performance indicators would you use to help her decide whether she should invest it into the passenger side or the driver side? Extra point if you provide data-backed recommendations.
    3. One month later, she comes back, grateful for all the helpful insights you have given her, and says she is anticipating a driver supply shortage due to a major concert taking place the next day, and also a 3-day city holiday coming next month. What would you do to help her analyze the best course of action to either prevent or minimize the problem in each case?
    4. Optional. We want to build a model to predict "Possible Churn Users" (e.g.: no trips in the past 4 weeks). List all features that you can think of and the data mining or machine learning model or other methods you might use for this case; a sketch of one possible approach follows.
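    For the optional churn question, one hedged sketch: label passengers with no trips in the past 4 weeks as churned, derive behavioral features from the earlier history only, and fit a standard classifier. All column and file names below are invented:

    ```python
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier

    trips = pd.read_csv("trips.csv", parse_dates=["trip_date"])   # placeholder
    cutoff = trips["trip_date"].max() - pd.Timedelta(weeks=4)

    # Features come only from trips before the cutoff, to avoid label leakage.
    history = trips[trips["trip_date"] < cutoff]
    feats = history.groupby("passenger_id").agg(
        total_trips=("trip_date", "count"),
        avg_cost=("trip_cost", "mean"),
        days_since_last=("trip_date", lambda d: (cutoff - d.max()).days),
    )

    # Label: "possible churn" = no trips at all in the past 4 weeks.
    active = set(trips.loc[trips["trip_date"] >= cutoff, "passenger_id"])
    feats["churn"] = (~feats.index.isin(active)).astype(int)

    model = GradientBoostingClassifier().fit(feats.drop(columns="churn"), feats["churn"])
    ```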

  17. Data from: A Local Scalable Distributed Expectation Maximization Algorithm...

    • datasets.ai
    • s.cnmilf.com
    • +2more
    Updated Nov 11, 2020
    Cite
    National Aeronautics and Space Administration (2020). A Local Scalable Distributed Expectation Maximization Algorithm for Large Peer-to-Peer Networks [Dataset]. https://datasets.ai/datasets/a-local-scalable-distributed-expectation-maximization-algorithm-for-large-peer-to-peer-net
    Explore at:
    Dataset updated
    Nov 11, 2020
    Dataset authored and provided by
    National Aeronautics and Space Administration
    Description

    This paper describes a local and distributed expectation maximization algorithm for learning parameters of Gaussian mixture models (GMM) in large peer-to-peer (P2P) environments. The algorithm can be used for a variety of well-known data mining tasks in distributed environments, such as clustering, anomaly detection, target tracking, and density estimation, necessary for many emerging P2P applications in bioinformatics, web mining, and sensor networks. Centralizing all or some of the data to build global models is impractical in such P2P environments because of the large number of data sources, the asynchronous nature of the P2P networks, and the dynamic nature of the data/network. The proposed algorithm takes a two-step approach. In the monitoring phase, the algorithm checks whether the model 'quality' is acceptable by using an efficient local algorithm. This is then used as a feedback loop to sample data from the network and rebuild the GMM when it is outdated. We present thorough experimental results to verify our theoretical claims.
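    The paper's local, distributed algorithm is not reproduced here, but the centralized EM baseline it monitors and rebuilds can be sketched compactly (synthetic 2-D data, diagonal covariances; all settings are illustrative):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(4, 1, (300, 2))])
    n, d, k = X.shape[0], X.shape[1], 2

    pi = np.full(k, 1 / k)                       # mixing weights
    mu = X[rng.choice(n, k, replace=False)]      # component means
    var = np.ones((k, d))                        # diagonal covariances

    for _ in range(50):
        # E-step: responsibilities r[i, j] proportional to pi_j * N(x_i | mu_j, var_j)
        log_p = -0.5 * (((X[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, and variances from responsibilities.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ X**2) / nk[:, None] - mu**2 + 1e-6

    print(pi.round(3), mu.round(2))
    ```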

  18. Predicting Returns of Discounted Articles Sales

    • kaggle.com
    zip
    Updated Jul 5, 2023
    Cite
    Oscar Aguilar (2023). Predicting Returns of Discounted Articles Sales [Dataset]. https://www.kaggle.com/datasets/oscarm524/predicting-returns-of-discounted-articles-sales/code
    Explore at:
    Available download formats: zip (30074240 bytes)
    Dataset updated
    Jul 5, 2023
    Authors
    Oscar Aguilar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A fashion distributor sells articles of particular sizes and colors to its customers. In some cases items are returned to the distributor for various reasons. The order data and the related return data were recorded over a two-year period. The aim is to use this data and machine learning to build a model which enables a good prediction of return rates.

    The Data

    For this task real anonymized shop data are provided in the form of structured text files consisting of individual data sets. Below are some points to note about the files:

    1. Each data set is on a single line ending with "CR" ("carriage return", 0xD) and "LF" ("line feed", 0xA).
    2. The first line has the same structure as the data sets, but contains the names of the respective columns (data fields).
    3. The header line and each data set contain multiple fields separated from each other by a semicolon (;).
    4. There is no escape character; quotes are not used.
    5. ASCII is the character set used.
    6. Missing values may occur. These are coded using the character string NA.

    Only the field names from the included document features.pdf can appear as column headings, in the order used in that document. The associated value ranges are also listed there.

    The file orders_train.txt contains all the data fields from the document, whereas the associated test file orders_class.txt does not contain the target variable "returnQuantity".
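    Under the six points above, a minimal pandas sketch for loading the training file might look like this (the file name is taken from the description; adjust the path to your local copy):

        import pandas as pd

        orders = pd.read_csv(
            "orders_train.txt",
            sep=";",                # fields are semicolon-separated
            na_values="NA",         # missing values are coded as the string NA
            keep_default_na=False,  # treat only the literal "NA" as missing
            encoding="ascii",       # ASCII is the character set used
        )
        print(orders.shape)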

    The Task

    The task is to use known historical data from January 2014 to September 2015 (approx. 2.33 million order positions) to build a model that makes predictions about return rates for order positions. The attribute returnQuantity in the given data indicates the number of articles returned for each order position (the value 0 means that the article will be kept, while a value larger than 0 means that the article will be returned). For sales in the period from October 2015 to December 2015 (approx. 340,000 order positions), the model should then predict the number of articles that will be returned per order position. The prediction has to be a natural number (including 0). The difference between the prediction and the actual value for an order position (i.e., the error) must be as low as possible.
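    One hedged way to frame this, given the non-negative integer target, is count regression; the sketch below uses Poisson regression on synthetic data with hypothetical feature names (the real features come from features.pdf):

        import numpy as np
        import pandas as pd
        from sklearn.linear_model import PoissonRegressor

        rng = np.random.default_rng(0)
        n = 5000
        X = pd.DataFrame({
            "price": rng.uniform(5.0, 200.0, n),
            "discount_pct": rng.uniform(0.0, 0.7, n),
            "quantity_ordered": rng.integers(1, 5, n),
        })
        y = rng.poisson(0.3 + 0.5 * X["discount_pct"])  # synthetic stand-in target

        model = PoissonRegressor().fit(X, y)
        # The task requires natural-number predictions, so round and clip at 0.
        pred = np.clip(np.rint(model.predict(X)), 0, None).astype(int)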

    Acknowledgement

    This dataset is publicly available on the Data Mining Cup website.

  19. Table_1_Development of an Agricultural Primary Productivity Decision Support Model: A Case Study in France.DOCX

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Taru Sandén; Aneta Trajanov; Heide Spiegel; Vladimir Kuzmanovski; Nicolas P. A. Saby; Calypso Picaud; Christian Bugge Henriksen; Marko Debeljak (2023). Table_1_Development of an Agricultural Primary Productivity Decision Support Model: A Case Study in France.DOCX [Dataset]. http://doi.org/10.3389/fenvs.2019.00058.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Taru Sandén; Aneta Trajanov; Heide Spiegel; Vladimir Kuzmanovski; Nicolas P. A. Saby; Calypso Picaud; Christian Bugge Henriksen; Marko Debeljak
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Agricultural soils provide society with several functions, one of which is primary productivity. This function is defined as the capacity of a soil to supply nutrients and water and to produce plant biomass for human use, providing food, feed, fiber, and fuel. For farmers, the productivity function delivers an economic basis and is a prerequisite for agricultural sustainability. Our study was designed to develop an agricultural primary productivity decision support model. To obtain a highly accurate decision support model that helps farmers and advisors to assess and manage the provision of the primary productivity soil function on their agricultural fields, we addressed the following specific objectives: (i) to construct a qualitative decision support model to assess the primary productivity soil function at the agricultural field level; (ii) to carry out verification, calibration, and sensitivity analysis of this model; and (iii) to validate the model based on empirical data. The result is a hierarchical qualitative model consisting of 25 input attributes describing soil properties, environmental conditions, cropping specifications, and management practices on each respective field. An extensive dataset from France containing data from 399 sites was used to calibrate and validate the model. The large amount of data enabled data mining to support model calibration. The accuracy of the decision support model prior to calibration supported by data mining was ~40%. The data mining approach improved the accuracy to 77%. The proposed methodology of combining decision modeling and data mining proved to be an important step forward. This iterative approach yielded an accurate, reliable, and useful decision support model for the assessment of the primary productivity soil function at the field level. This can assist farmers and advisors in selecting the most appropriate crop management practices. Embedding this decision support model in a set of complementary models for four adjacent soil functions, as endeavored in the H2020 LANDMARK project, will help take the integrated sustainability of arable cropping systems to a new level.
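    To illustrate in miniature what a hierarchical qualitative model of this kind looks like (the attribute names and the aggregation rule below are invented for illustration, not the 25-attribute LANDMARK model):

        # A toy hierarchical qualitative model: child ratings are aggregated
        # upward into a field-level primary productivity rating.
        SCALE = ["low", "medium", "high"]

        def aggregate(children):
            # Toy rule: the parent rating is the median of the child ratings.
            ranks = sorted(SCALE.index(c) for c in children)
            return SCALE[ranks[len(ranks) // 2]]

        field = {
            "soil": aggregate(["medium", "high", "medium"]),      # e.g. pH, SOC, texture
            "environment": aggregate(["high", "medium", "low"]),  # e.g. rainfall, temperature
            "management": aggregate(["high", "high", "medium"]),  # e.g. rotation, tillage
        }
        print("primary productivity:", aggregate(list(field.values())))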

  20. Data from: Forecasting the human development index and life expectancy in Latin American countries using data mining techniques

    • scielo.figshare.com
    jpeg
    Updated May 31, 2023
    Cite
    Celso Bilynkievycz dos Santos; Luiz Alberto Pilatti; Bruno Pedroso; Deborah Ribeiro Carvalho; Alaine Margarete Guimarães (2023). Forecasting the human development index and life expectancy in Latin American countries using data mining techniques [Dataset]. http://doi.org/10.6084/m9.figshare.7420340.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    May 31, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Celso Bilynkievycz dos Santos; Luiz Alberto Pilatti; Bruno Pedroso; Deborah Ribeiro Carvalho; Alaine Margarete Guimarães
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Latin America
    Description

    Abstract The predictability of epidemiological indicators can help estimate dependent variables, assist in decision-making to support public policies, and explain the scenarios experienced by different countries worldwide. This study aimed to forecast the Human Development Index (HDI) and life expectancy (LE) for Latin American countries for the period of 2015-2020 using data mining techniques. All stages of the process of knowledge discovery in databases were covered. The SMOReg data mining algorithm was used in the models with multivariate time series to make predictions; this algorithm performed the best in the tests developed during the evaluation period. The average HDI and LE for Latin American countries showed an increasing trend in the period evaluated, corresponding to 4.99 ± 3.90% and 2.65 ± 0.06 years, respectively. Multivariate models allow for a greater evaluation of algorithms, thus increasing their accuracy. Data mining techniques have a better predictive quality relative to the most popular technique, Autoregressive Integrated Moving Average (ARIMA). In addition, the predictions suggest that there will be a higher increase in the mean HDI and LE for Latin American countries compared to the mean values for the rest of the world.
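    SMOReg is Weka's support vector machine for regression; a rough Python analogue is scikit-learn's SVR. The sketch below forecasts a next-step HDI value from lagged HDI and life-expectancy features on synthetic series; it illustrates the multivariate setup only, not the authors' actual pipeline.

        import numpy as np
        from sklearn.svm import SVR

        rng = np.random.default_rng(0)
        years = 30
        hdi = 0.60 + np.cumsum(rng.normal(0.005, 0.002, years))  # synthetic series
        le = 65.0 + np.cumsum(rng.normal(0.30, 0.10, years))

        # Multivariate lag features: (HDI[t-1], LE[t-1]) -> HDI[t]
        X = np.column_stack([hdi[:-1], le[:-1]])
        y = hdi[1:]

        model = SVR(kernel="rbf", C=10.0).fit(X, y)
        next_hdi = model.predict([[hdi[-1], le[-1]]])
        print("one-step HDI forecast:", round(float(next_hdi[0]), 3))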
