Title: Identifying Factors that Affect Entrepreneurs’ Use of Data Mining for Analytics
Authors: Edward Matthew Dominica, Feylin Wijaya, Andrew Giovanni Winoto, Christian
Conference: The 4th International Conference on Electrical, Computer, Communications, and Mechatronics Engineering (https://www.iceccme.com/home)
This dataset was created to support research focused on understanding the factors influencing entrepreneurs’ adoption of data mining techniques for business analytics. The dataset contains carefully curated data points that reflect entrepreneurial behaviors, decision-making criteria, and the role of data mining in enhancing business insights.
Researchers and practitioners can leverage this dataset to explore patterns, conduct statistical analyses, and build predictive models to gain a deeper understanding of entrepreneurial adoption of data mining.
Intended Use: This dataset is designed for research and academic purposes, especially in the fields of business analytics, entrepreneurship, and data mining. It is suitable for conducting exploratory data analysis, hypothesis testing, and model development.
Citation: If you use this dataset in your research or publication, please cite the paper presented at the ICECCME 2024 conference using the following format: Edward Matthew Dominica, Feylin Wijaya, Andrew Giovanni Winoto, Christian. Identifying Factors that Affect Entrepreneurs’ Use of Data Mining for Analytics. The 4th International Conference on Electrical, Computer, Communications, and Mechatronics Engineering (2024).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance did not improve much. The reason may be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.

From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all of the information.

From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, and this brings uncertainty into the data. When using clustering prior to classification, the decision on the number of clusters will strongly affect the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.

We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the selected methods at all. The practical ramification we saw was that our results are not much better than random when applying clustering in the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
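As a rough illustration of the workflow discussed above, the sketch below (with synthetic data and arbitrary model choices, not the project's actual dataset) compares classification on raw features against classification with a k-means cluster label appended as an engineered feature:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Baseline: classification on the raw features.
baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

# Clustering prior to classification: append the k-means cluster label as a new feature.
# Fixing random_state makes the clustering reproducible; leaving it unfixed (as described
# above) is one way to probe how stable the clusters actually are.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, labels])
augmented = cross_val_score(RandomForestClassifier(random_state=0), X_aug, y, cv=5).mean()

print(f"raw features: {baseline:.3f} | with cluster feature: {augmented:.3f}")
```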
A visit to an online shop by a potential customer is called a session. During a session the visitor clicks on products in order to see the corresponding detail pages, and may add products to or remove products from the shopping basket. At the end of a session, one or several products from the shopping basket may be ordered. The activities of the user are also called transactions. The goal of the analysis is to predict, on the basis of the transaction data collected during the session, whether the visitor will place an order or not.
In the first task, historical shop data are given, consisting of the session activities together with the associated information on whether an order was placed or not. These data can be used to subsequently make order forecasts for other session activities in the same shop. Of course, the real outcome of the sessions in that set is not known. Thus, the first task can be understood as a classical data mining problem.
The second task deals with the online scenario. In this context, the participants are to implement an agent that learns on the basis of transactions. That means the agent successively receives the individual transactions and has to make a forecast for each of them with respect to the outcome of the shopping cart transaction. This task reflects the practical scenario in which a transaction-based forecast is required and the corresponding algorithm should learn adaptively.
For the individual tasks anonymised real shop data are provided in the form of structured text files consisting of individual data sets. The data sets represent in each case transactions in the shop and may contain redundant information. For the data, in particular the following applies:
In concrete terms, only the array names of the attached document “*features.pdf*” in their respective sequence will be used as column headings. The corresponding value ranges are listed there, too.
The training file for task 1 (“*transact_train.txt*”) contains all data arrays of the document, whereas the corresponding classification file (“*transact_class.txt*”) of course does not contain the target attribute “*order*”.
In task 2 data in the form of a string array are transferred to the implementations of the participants by means of a method. The individual fields of the array contain the same data arrays that are listed in “*features.pdf*”–also without the target attribute “*order*”–and exactly in the sequence used there.
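As a concrete illustration, task 1 can be approached as a standard classification problem. The sketch below is only indicative: the separator, the encoding of the target attribute “order”, and the model choice are assumptions, since “features.pdf” is not reproduced here.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Delimiter and target encoding are assumptions; the real field names and value
# ranges are defined in "features.pdf".
train = pd.read_csv("transact_train.txt", sep="|")

y = (train["order"] == "y").astype(int)              # assumed encoding of the target
X = pd.get_dummies(train.drop(columns=["order"]))    # one-hot encode categorical fields

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```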
This dataset is publicly available on the Data Mining Cup website.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Structural Health Monitoring (SHM) has enabled the condition of large structures, like bridges, to be evaluated in real time. In order to monitor behavioral changes, it is essential to identify parameters of the structure that are sensitive enough to capture damage as it develops while being stable enough during ambient behavior of the structure. Research has shown that monitoring the neutral axis (N.A.) position satisfies the first criterion of sensitivity; however, monitoring N.A. location is challenging because its position is affected by the loads applied to the structure. The motivation behind this research comes from the greater than expected impact of various load characteristics on observed N.A. location. This paper develops an indirect way to estimate the characteristics of vehicular loads (magnitude and lateral position of the load) and uses a data mining approach to predict the expected location of the N.A. Instead of monitoring the behavior of the N.A., in the proposed method the residuals between the monitored and predicted N.A. location are monitored. Using actual SHM data collected from a cable-stayed bridge, over a 2-year period, the paper presents the steps to be followed for creating a data mining model to predict N.A. location, the use of monthly sample residuals of N.A. to capture behavioral changes, the ability of the method to distinguish between changes in the load characteristics from behavioral changes of the structure (e.g. change in response due to cracking, bearings becoming frozen, cables losing tension, etc.), and the high sensitivity of the method that allows capturing of minor changes.
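The residual-monitoring idea can be illustrated with a small sketch (synthetic data and assumed column names, not the authors' implementation): a model predicts N.A. location from the estimated load characteristics, and the monthly mean residual between the monitored and predicted N.A. location is what gets tracked.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 5000
load = rng.uniform(5, 50, n)            # assumed load magnitude (e.g. tonnes)
lateral = rng.uniform(-2.0, 2.0, n)     # assumed lateral position (m from centreline)
month = rng.integers(1, 25, n)          # 2-year monitoring period

# Synthetic monitored N.A. location that depends on the load characteristics.
na_monitored = 0.55 - 0.001 * load + 0.01 * lateral + rng.normal(0, 0.005, n)

df = pd.DataFrame({"load": load, "lateral": lateral, "month": month, "na": na_monitored})

# Data mining model: predict the expected N.A. location from the load characteristics.
model = RandomForestRegressor(random_state=0).fit(df[["load", "lateral"]], df["na"])

# Monitor the residuals (monitored minus predicted), summarised per month; a sustained
# shift in these residuals would flag a behavioural change of the structure.
df["residual"] = df["na"] - model.predict(df[["load", "lateral"]])
print(df.groupby("month")["residual"].mean().head())
```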
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract Taper functions and volume equations are essential for estimation of individual tree volume, and their theory is well consolidated. On the other hand, mathematical innovation is dynamic and may improve forestry modeling. The objective was to analyze the accuracy of machine learning (ML) techniques in relation to a volumetric model and a taper function for acácia negra. We used cubing data and fitted equations with the Schumacher and Hall volumetric model and with the Hradetzky taper function, compared to the following algorithms: k nearest neighbor (k-NN), Random Forest (RF) and Artificial Neural Networks (ANN), for estimation of total volume and diameter at relative heights. Models were ranked according to error statistics, and their dispersion was verified. The Schumacher and Hall model and ANN showed the best results for volume estimation as a function of dbh and height. Machine learning methods were more accurate than the Hradetzky polynomial for tree form estimations. ML models have proven to be appropriate as an alternative to traditional modeling applications in forestry measurement; however, their application requires care because fit-based overtraining is likely.
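For illustration only, the sketch below mirrors the comparison described in the abstract using synthetic cubing data: the Schumacher and Hall model fitted in log-log form against Random Forest and k-NN regressors (an ANN is omitted for brevity; all choices are assumptions, not the study's exact setup).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
dbh = rng.uniform(5, 35, 300)       # diameter at breast height, cm (synthetic)
height = rng.uniform(6, 25, 300)    # total height, m (synthetic)
volume = 6e-5 * dbh**1.9 * height * rng.lognormal(0.0, 0.08, 300)  # total volume, m3

X = np.column_stack([dbh, height])

# Schumacher and Hall model: ln(V) = b0 + b1*ln(dbh) + b2*ln(h), i.e. linear in log space
# (scored in log space here, shown only to illustrate the fitting form).
sh = cross_val_score(LinearRegression(), np.log(X), np.log(volume), cv=5, scoring="r2").mean()

rf = cross_val_score(RandomForestRegressor(random_state=0), X, volume, cv=5, scoring="r2").mean()
knn = cross_val_score(KNeighborsRegressor(n_neighbors=5), X, volume, cv=5, scoring="r2").mean()

print(f"Schumacher-Hall R2 {sh:.2f} | Random Forest R2 {rf:.2f} | k-NN R2 {knn:.2f}")
```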
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts yield arguments that inform the Systematic Literature Review of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.
Transformation. In this stage, we did not apply any data transformation method, except for the clustering analysis. We performed a Principal Component Analysis to reduce the 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance. In other words, it helped us identify the number of clusters to be used when tuning the explainable models.
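A hedged sketch of this transformation step is shown below; the input file name and feature encoding are illustrative assumptions.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# One-hot encode the 35 nominal features extracted from the DL4SE papers
# ("dl4se_papers.csv" is a placeholder name for the repository's feature table).
papers = pd.get_dummies(pd.read_csv("dl4se_papers.csv"))

# Project to 2 principal components for visualisation.
coords = PCA(n_components=2).fit_transform(papers)

# Inspect how the within-cluster variance (inertia) drops as the number of clusters grows,
# as a guide to choosing the number of clusters for the explainable models.
for k in range(2, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(papers)
    print(k, round(km.inertia_, 1))
```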
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (correlations and association rules) and categorizing the DL4SE papers for a better segmentation of the state-of-the-art (clustering). A detailed explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.
Support = the number of occurrences in which the statement is true, divided by the total number of statements.
Confidence = the support of the statement divided by the number of occurrences of the premise.
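For clarity, the sketch below shows how these two quantities can be computed from the extracted paper features (column names are illustrative, not the repository's exact schema).

```python
import pandas as pd

# Toy feature table; in the study these columns come from the extracted paper attributes.
papers = pd.DataFrame({
    "learning_type": ["supervised", "supervised", "unsupervised", "supervised"],
    "reproducible":  [False, False, True, False],
})

premise = papers["learning_type"] == "supervised"       # e.g. "used Supervised Learning"
statement = premise & ~papers["reproducible"]           # premise AND "approach is irreproducible"

support = statement.mean()                              # rule occurrences / all records
confidence = statement.sum() / premise.sum()            # rule occurrences / premise occurrences
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```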
Generative AI In Data Analytics Market Size 2025-2029
The generative AI in data analytics market size is projected to increase by USD 4.62 billion, at a CAGR of 35.5% from 2024 to 2029. The democratization of data analytics and increased accessibility will drive the generative AI in data analytics market.
Market Insights
North America dominated the market and is estimated to account for 37% of the market's growth during 2025-2029.
By Deployment - Cloud-based segment was valued at USD 510.60 billion in 2023
By Technology - Machine learning segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 621.84 million
Market Future Opportunities 2024: USD 4624.00 million
CAGR from 2024 to 2029: 35.5%
Market Summary
The market is experiencing significant growth as businesses worldwide seek to unlock new insights from their data through advanced technologies. This trend is driven by the democratization of data analytics and increased accessibility of AI models, which are now available in domain-specific and enterprise-tuned versions. Generative AI, a subset of artificial intelligence, uses deep learning algorithms to create new data based on existing data sets. This capability is particularly valuable in data analytics, where it can be used to generate predictions, recommendations, and even new data points. One real-world business scenario where generative AI is making a significant impact is in supply chain optimization. In this context, generative AI models can analyze historical data and generate forecasts for demand, inventory levels, and production schedules. This enables businesses to optimize their supply chain operations, reduce costs, and improve customer satisfaction. However, the adoption of generative AI in data analytics also presents challenges, particularly around data privacy, security, and governance. As businesses continue to generate and analyze increasingly large volumes of data, ensuring that it is protected and used in compliance with regulations is paramount. Despite these challenges, the benefits of generative AI in data analytics are clear, and its use is set to grow as businesses seek to gain a competitive edge through data-driven insights.
What will be the size of the Generative AI In Data Analytics Market during the forecast period?
Generative AI, a subset of artificial intelligence, is revolutionizing data analytics by automating data processing and analysis, enabling businesses to derive valuable insights faster and more accurately. Synthetic data generation, a key application of generative AI, allows for the creation of large, realistic datasets, addressing the challenge of insufficient data in analytics. Parallel processing methods and high-performance computing power the rapid analysis of vast datasets. Automated machine learning and hyperparameter optimization streamline model development, while model monitoring systems ensure continuous model performance. Real-time data processing and scalable data solutions facilitate data-driven decision-making, enabling businesses to respond swiftly to market trends. One significant trend in the market is the integration of AI-powered insights into business operations. For instance, probabilistic graphical models and backpropagation techniques are used to predict customer churn and optimize marketing strategies. Ensemble learning methods and transfer learning techniques enhance predictive analytics, leading to improved customer segmentation and targeted marketing. According to recent studies, businesses have achieved a 30% reduction in processing time and a 25% increase in predictive accuracy by implementing generative AI in their data analytics processes. This translates to substantial cost savings and improved operational efficiency. By embracing this technology, businesses can gain a competitive edge, making informed decisions with greater accuracy and agility.
Unpacking the Generative AI In Data Analytics Market Landscape
In the dynamic realm of data analytics, Generative AI algorithms have emerged as a game-changer, revolutionizing data processing and insights generation. Compared to traditional data mining techniques, Generative AI models can create new data points that mirror the original dataset, enabling more comprehensive data exploration and analysis (Source: Gartner). This innovation leads to a 30% increase in identified patterns and trends, resulting in improved ROI and enhanced business decision-making (IDC).
Data security protocols are paramount in this context, with Classification Algorithms and Clustering Algorithms ensuring data privacy and compliance alignment. Machine Learning Pipelines and Deep Learning Frameworks facilitate seamless integration with Predictive Modeling Tools and Automated Report Generation on Cloud
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The rapid development of molecular structural databases provides the chemistry community access to an enormous array of experimental data that can be used to build and validate computational models. Using radial distribution functions collected from experimentally available X-ray and NMR structures, a number of so-called statistical potentials have been developed over the years using the structural data mining strategy. These potentials have been developed within the context of the two-particle Kirkwood equation by extending its original use for isotropic monatomic systems to anisotropic biomolecular systems. However, the accuracy and the unclear physical meaning of statistical potentials have long formed the central arguments against such methods. In this work, we present a new approach to generate molecular energy functions using structural data mining. Instead of employing the Kirkwood equation and introducing the “reference state” approximation, we model the multidimensional probability distributions of the molecular system using graphical models and generate the target pairwise Boltzmann probabilities using the Bayesian field theory. Different from the current statistical potentials that mimic the “knowledge-based” PMF based on the 2-particle Kirkwood equation, the graphical-model-based structure-derived potential developed in this study focuses on the generation of lower-dimensional Boltzmann distributions of atoms through reduction of dimensionality. We have named this new scoring function GARF, and in this work we focus on the mathematical derivation of our novel approach followed by validation studies on its ability to predict protein–ligand interactions.
Data-driven models help mobile app designers understand best practices and trends, and can be used to make predictions about design performance and support the creation of adaptive UIs. This paper presents Rico, the largest repository of mobile app designs to date, created to support five classes of data-driven applications: design search, UI layout generation, UI code generation, user interaction modeling, and user perception prediction. To create Rico, we built a system that combines crowdsourcing and automation to scalably mine design and interaction data from Android apps at runtime. The Rico dataset contains design data from more than 9.3k Android apps spanning 27 categories. It exposes visual, textual, structural, and interactive design properties of more than 66k unique UI screens. To demonstrate the kinds of applications that Rico enables, we present results from training an autoencoder for UI layout similarity, which supports query-by-example search over UIs.
Rico was built by mining Android apps at runtime via human-powered and programmatic exploration. Like its predecessor ERICA, Rico’s app mining infrastructure requires no access to — or modification of — an app’s source code. Apps are downloaded from the Google Play Store and served to crowd workers through a web interface. When crowd workers use an app, the system records a user interaction trace that captures the UIs visited and the interactions performed on them. Then, an automated agent replays the trace to warm up a new copy of the app and continues the exploration programmatically, leveraging a content-agnostic similarity heuristic to efficiently discover new UI states. By combining crowdsourcing and automation, Rico can achieve higher coverage over an app’s UI states than either crawling strategy alone. In total, 13 workers recruited on UpWork spent 2,450 hours using apps on the platform over five months, producing 10,811 user interaction traces. After collecting a user trace for an app, we ran the automated crawler on the app for one hour.
University of Illinois at Urbana-Champaign, https://interactionmining.org/rico
The Rico dataset is large enough to support deep learning applications. We trained an autoencoder to learn an embedding for UI layouts, and used it to annotate each UI with a 64-dimensional vector representation encoding visual layout. This vector representation can be used to compute structurally — and often semantically — similar UIs, supporting example-based search over the dataset. To create training inputs for the autoencoder that embed layout information, we constructed a new image for each UI capturing the bounding box regions of all leaf elements in its view hierarchy, differentiating between text and non-text elements. Rico’s view hierarchies obviate the need for noisy image processing or OCR techniques to create these inputs.
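The query-by-example idea can be sketched as follows; the architecture, image resolution, and training setup are simplified assumptions rather than the exact Rico autoencoder.

```python
import numpy as np
import tensorflow as tf

H, W = 100, 56                                   # assumed downsampled layout-image size
inputs = tf.keras.Input(shape=(H * W,))
hidden = tf.keras.layers.Dense(256, activation="relu")(inputs)
code = tf.keras.layers.Dense(64, activation="relu", name="embedding")(hidden)
outputs = tf.keras.layers.Dense(H * W, activation="sigmoid")(code)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Placeholder for layout images rendered from view-hierarchy bounding boxes
# (text vs. non-text leaf elements); real inputs would be built from the dataset.
layouts = np.random.rand(1000, H * W).astype("float32")
autoencoder.fit(layouts, layouts, epochs=1, batch_size=64, verbose=0)

# The 64-dimensional code supports query-by-example: nearest neighbours in the
# embedding space approximate structurally similar UIs.
encoder = tf.keras.Model(inputs, code)
emb = encoder.predict(layouts, verbose=0)
nearest = np.argsort(np.linalg.norm(emb - emb[0], axis=1))[:5]
print("UIs most similar to UI 0:", nearest)
```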
Chytridiomycosis, caused by the fungal pathogen Batrachochytrium dendrobatidis (Bd), is a major driver of amphibian decline worldwide. The global presence of Bd is driven by a synergy of factors, such as climate, species life history, and amphibian host susceptibility. Here, using a Bayesian data-mining approach, we modeled the epidemiological landscape of Bd to evaluate how infection varies across several spatial, ecological, and phylogenetic scales. We compiled global information on Bd occurrence, climate, species ranges, and phylogenetic diversity to infer the potential distribution and prevalence of Bd. By calculating the degree of co-distribution between Bd and our set of environmental and biological variables (e.g. climate and species), we identified the factors that could potentially be related to Bd presence and prevalence using a geographic correlation metric, epsilon (ε). We fitted five ecological models based on 1) amphibian species identity, 2) phylogenetic species varia...

Usage notes
These datasets include the geographic data used to build ecological and geographical models for Batrachochytrium dendrobatidis, as well as supplementary results of the following paper: Basanta et al. Epidemiological landscape of Batrachochytrium dendrobatidis and its impact on amphibian diversity at the global scale. Missing values are denoted by NA. Details for each dataset are provided in the README file. Datasets included:
Information on Bd records. Table S1.xls contains Bd occurrence records and prevalence of infection from the Bd-Maps online database (http://www.bd-maps.net; Olson et al. 2013), accessed in 2013; we also searched Google Scholar for recent papers with Bd infection reports using the keywords ‘*Batrachochytrium dendrobatidis*’. We excluded records from studies of captive individuals and those without coordinates, keeping only records in which coordinates reflected site-specific sample locations. Supplementary figures. Supplementary information S1.docx cont...

# 1. Title of Dataset: Epidemiological landscape of Batrachochytrium dendrobatidis and its impact on amphibian diversity at global scale
M. Delia Basanta Department of Biology, University of Nevada Reno. Reno, Nevada, USA.
Julián A. Velasco Instituto de Ciencias de la Atmósfera y Cambio Climático, Universidad Nacional Autónoma de México. Ciudad de México, México.
Constantino González-Salazar. Instituto de Ciencias de la Atmósfera y Cambio Climático, Universidad Nacional Autónoma de México. Ciudad de México, México.
The Data Science Platform market is experiencing robust growth, projected to reach $10.15 billion in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 23.50% from 2025 to 2033. This expansion is fueled by several key drivers. The increasing volume and complexity of data generated across diverse industries necessitates sophisticated platforms for analysis and insights extraction. Businesses are increasingly adopting cloud-based solutions for their scalability, cost-effectiveness, and accessibility, driving the growth of the cloud deployment segment. Furthermore, the rising demand for advanced analytics capabilities across sectors like BFSI (Banking, Financial Services, and Insurance), retail and e-commerce, and IT & Telecom is significantly boosting market demand. The availability of robust and user-friendly platforms is empowering businesses of all sizes, from SMEs to large enterprises, to leverage data science effectively for improved decision-making and competitive advantage. The market is witnessing the emergence of innovative solutions such as automated machine learning (AutoML) and integrated platforms that combine data preparation, model building, and deployment capabilities.

The market segmentation reveals significant opportunities across various offerings and deployment models. While the platform segment holds a larger share, the services segment is poised for significant growth, driven by the need for expert consulting and support in data science projects. Geographically, North America currently dominates the market, but the Asia-Pacific region is expected to witness faster growth due to increasing digitalization and technological advancements. Key players like IBM, Google, Microsoft, and Amazon are driving innovation and competition, with new entrants continuously emerging, adding to the market's dynamism. While challenges such as data security and privacy concerns remain, the overall market outlook is exceptionally positive, promising considerable growth over the forecast period. Continued technological innovation, coupled with rising adoption across a wider array of industries, will be central to the market's continued expansion.

Recent developments include: November 2023 - Stagwell announced a partnership with Google Cloud and SADA, a Google Cloud premier partner, to develop generative AI (gen AI) marketing solutions that support Stagwell agencies, client partners, and product development within the Stagwell Marketing Cloud (SMC). The partnership will help in harnessing data analytics and insights by developing and training a proprietary Stagwell large language model (LLM) purpose-built for Stagwell clients, productizing data assets via APIs to create new digital experiences for brands, and multiplying the value of their first-party data ecosystems to drive new revenue streams using Vertex AI and open source-based models. May 2023 - IBM launched a new AI and data platform, watsonx, aimed at allowing businesses to accelerate advanced AI usage with trusted data, speed, and governance. IBM also introduced GPU-as-a-service, which is designed to support AI-intensive workloads, with an AI dashboard to measure, track, and help report on cloud carbon emissions. With watsonx, IBM offers an AI development studio with access to IBM-curated and trained foundation models and open-source models, and access to a data store to gather and clean up training and tuning data.
Key drivers for this market are: Rapid Increase in Big Data, Emerging Promising Use Cases of Data Science and Machine Learning; Shift of Organizations Toward Data-intensive Approach and Decisions. Potential restraints include: Rapid Increase in Big Data, Emerging Promising Use Cases of Data Science and Machine Learning; Shift of Organizations Toward Data-intensive Approach and Decisions. Notable trends are: Small and Medium Enterprises to Witness Major Growth.
A MODFLOW-NWT model was used to simulate the groundwater/surface-water interactions in the Partridge River Basin, MN, using the Streamflow Routing and Unsaturated Zone Flow packages. The base model represents 2011-2013 average mining conditions and was used to build five mining scenario models, as described in the report. The base model and mining scenarios were used to estimate the base flow at 6 stream locations, pit inflow rates for the new hypothetical pits, and the average depth to water in twelve wetlands. PEST utilities were used to estimate the uncertainty associated with each of these forecasts. Particle tracking was performed with the MODFLOW solution (using MODPATH 7) and Monte Carlo techniques to create probabilistic capture zones. This USGS data release contains all of the input and output files for the simulations described in the associated model documentation report (https://doi.org/10.3133/sir20215038).
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
The dataset contains the raw .dat versions of the structural geological model for the Hunter subregion. RMS geomodelling was used to construct the geological model for the Hunter subregion. The dataset contains the depth to basement horizons, reference horizons, eroded horizons, isochores and well markers extracted from the completed geological model. The model was built using data extracted from well completion reports published by mining companies and consultants, which record the depth of various formations encountered during drilling works. These data were compiled into model input files (see dataset: Hunter deep well completion reports - f2df86d5-6749-48c7-a445-d60067109f08) used to build the RMS model.
Nine geological formations and their depths from the surface are included covering a grid across the Basin. The geological model is based on measured depths recorded in well completion reports published by mining companies and consultancies.
The naming convention refers to the geological age and depth (TVD ss = total vertical depth subsea reported to the Australian Height Datum) of the various formations as follows:
| Regional horizon name | Age (geological stage) | Newcastle Coalfield | Hunter Coalfield | Western Coalfield | Central or Southern coalfields |
|---|---|---|---|---|---|
| M600 | Top Anisian | Top Hawkesbury Sandstone | Top Hawkesbury Sandstone | Top Hawkesbury Sandstone | Base Wianamatta Group |
| M700 | Top Olenekian | Base Hawkesbury Sandstone | Base Hawkesbury Sandstone | Base Hawkesbury Sandstone | Base Hawkesbury Sandstone |
| P000 | Top Changhsingian | Base Narrabeen Group | Base Narrabeen Group | Base Narrabeen Group | Base Narrabeen Group |
| P100 | Upper Wuchiapingian | Base Newcastle Coal Measures | Base Newcastle Coal Measures | Top Watts Sandstone | Top Bargo Claystone |
| P500 | Mid Capitanian | Base Tomago Coal Measures | Base Wittingham Coal Measures | Base Illawarra Coal Measures | Base Illawarra Coal Measures |
| P550 | Top Wordian | Base Mulbring Siltstone | Base Mulbring Siltstone | Base Berry Siltstone | Base Berry Siltstone |
| P600 | Mid Roadian | Base Maitland Group | Base Maitland Group | Base Shoalhaven Group | Base Shoalhaven Group |
| P700 | Upper Kungurian | Top Base Greta Coal Measures | Top Base Greta Coal Measures | | |
| P900 | Base Serpukhovian | Base Seaham Formation | Base Seaham Formation | | |
with 'M' referring to Mesozoic and 'P' to Paleozoic
RMS geomodelling was used to construct the geological model for the Hunter subregion. The data set contains the layers in the completed geological model. The model was built using data extracted from well completion reports published by mining companies and consultants, which record the depth of various formations encountered during drilling works. These data were compiled into model input files (See dataset: Hunter deep well completion reports - f2df86d5-6749-48c7-a445-d60067109f08) used to build the RMS model.
This model has a horizontal resolution of 2000 x 2000 m (x y), with 109 vertical layers for a total of 511,118 cells. The depth ranges between 1185m above sea level and 5062 m below sea level.
Data originally sourced from 44 well completion reports and incorporated into the geological model. The reference horizons were exported from RMS software as .dat files.
Bioregional Assessment Programme (XXXX) HUN RMS Output Dat Files v01. Bioregional Assessment Derived Dataset. Viewed 22 June 2018, http://data.bioregionalassessments.gov.au/dataset/c975d250-b699-4585-b32f-cbfde4d8d436.
Derived From Geoscience Australia, 3 second SRTM Digital Elevation Model (DEM) v01
Derived From Australian Coal Basins
Derived From Bathymetry GA 2009 9sec v4
Derived From Hunter Groundwater Model extent
Derived From South Sydney Deep Well Completion Reports V02
Derived From Hunter deep well completion reports
Derived From HUN GW modelling DEM v01
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains 12000 instructional outputs from LongAlpaca-Yukang Machine Learning system, unlocking the cutting-edge power of Artificial Intelligence for users. With this data, researchers have an abundance of information to explore the mysteries behind AI and how it works. This dataset includes columns such as output, instruction, file and input which provide endless possibilities of analysis ripe for you to discover! Teeming with potential insights into AI’s functioning and implications for our everyday lives, let this data be your guide in unravelling the many secrets yet to be discovered in the world of AI
Exploring the Dataset:
The dataset contains 12000 rows of information, with four columns containing output, instruction, file and input data. You can use these columns to explore the workings of a machine learning system, examine different instructional outputs for different inputs or instructions, study training data for specific ML systems, or analyze files being used by a machine learning system.
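A minimal starting point for such exploration might look like the following sketch, which assumes the file train.csv and the four columns named above.

```python
import pandas as pd

df = pd.read_csv("train.csv")                # 12000 rows expected
print(df.shape)
print(df[["instruction", "input", "file", "output"]].head())

# A simple derived quantity as a starting point for exploration:
# how long are the generated outputs, with and without an attached file?
df["output_len"] = df["output"].str.len()
print(df.groupby(df["file"].notna())["output_len"].describe())
```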
Visualizing Data:
Using built-in plotting tools within your chosen toolkit (such as Python), you can create powerful visualizations. Plotting outputs versus input instructions will give you an overview of what your machine learning system is capable of doing and how it performs on different types of tasks or problems. You could also plot outputs alongside the files being used; this would help identify patterns in training data and areas that need improvement in your machine learning models.
Analyzing Performance:
Using statistical analysis techniques such as regression or clustering algorithms, you can measure performance metrics such as accuracy and understand how they vary across instruction types. Experimenting with hyperparameter tuning may help you see which settings yield better results for a given situation. Additionally, correlations between input samples and output measurements can be examined to identify relationships, such as trends in accuracy over certain sets of instructions.
Drawing Conclusions:
By leveraging big data mining tools, you can build comprehensive predictive models that project future outcomes based on past performance measurements from the various instruction types fed into the system, allowing you to determine whether certain changes improve the AI model's capability and predictability over time.
- Developing self-improving Artificial Intelligence algorithms by using the outputs and instructional data to identify correlations and feedback loop structures between instructions and output results.
- Generating Machine Learning simulations using this dataset to optimize AI performance based on given instruction set.
- Using the instructions, input, and output data in the dataset to build AI systems for natural language processing, enabling comprehensive understanding of user queries and providing more accurate answers accordingly
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:--------------|:-------------------------------------------------------|
| output | The output of the instruction given. (String) |
| file | The file used when executing the instruction. (String) |
| input | Additional context for the instruction. (String) |
If you use this dataset in your research, please credit the original authors and the Huggingface Hub.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Dear candidate, we are very excited about your interest in working with us! This challenge is an opportunity for us to get to know a bit of the great talent we know you have. It was built to simulate real-case scenarios that you would face while working at [Organization] and is organized in two parts:
Part I - Technical
Provide both the answer and the SQL code used.
1. What is the average trip cost of holidays? How does it compare to non-holidays?
2. Find the average call time of the first time passengers make a trip.
3. Find the average number of trips per driver for every weekday.
4. Which day of the week do drivers usually drive the most distance on average?
5. What was the growth percentage of rides month over month?
6. Optional. List the top 5 drivers per number of trips in the top 5 largest cities.
Part II - Analytical
99 is a marketplace, where drivers are the supply and passengers the demand. One of our main challenges is to keep this marketplace balanced. If there is too much demand, prices would increase due to surge and passengers would prefer not to ride. If there is too much supply, drivers would spend more time idle, impacting their revenue.
1. Let's say it's 2019-09-23 and a new Operations manager for The Shire was just hired. She has 5 minutes during the Ops weekly meeting to present an overview of the business in the city, and since she's just arrived, she asked your help to do it. What would you prepare for this 5-minute presentation? Please provide 1-2 slides with your idea.
2. She also mentioned she has a budget to invest in promoting the business. What kind of metrics and performance indicators would you use in order to help her decide if she should invest it into the passenger side or the driver side? Extra point if you provide data-backed recommendations.
3. One month later, she comes back, super grateful for all the helpful insights you have given her, and says she is anticipating a driver supply shortage due to a major concert that is going to take place the next day, and also a 3-day city holiday that is coming the next month. What would you do to help her analyze the best course of action to either prevent or minimize the problem in each case?
4. Optional. We want to build a model to predict “Possible Churn Users” (e.g.: no trips in the past 4 weeks). List all the features that you can think of and the data mining or machine learning model or other methods you may use for this case (see the sketch below for one possible approach).
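For the optional churn question, a hedged sketch of one possible approach is shown below; the table layout and column names are illustrative assumptions, not 99's actual schema.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

trips = pd.read_csv("trips.csv", parse_dates=["trip_date"])   # passenger_id, trip_date, fare, ...

cutoff = trips["trip_date"].max() - pd.Timedelta(weeks=4)
history = trips[trips["trip_date"] <= cutoff]
recent = trips[trips["trip_date"] > cutoff]

# Candidate features: frequency, recency, spend; label: no trip in the last 4 weeks.
feats = history.groupby("passenger_id").agg(
    n_trips=("trip_date", "count"),
    days_since_last=("trip_date", lambda s: (cutoff - s.max()).days),
    avg_fare=("fare", "mean"),
)
feats["churned"] = (~feats.index.isin(recent["passenger_id"])).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    feats.drop(columns="churned"), feats["churned"], test_size=0.25, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)
print("holdout accuracy:", model.score(X_te, y_te))
```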
This paper describes a local and distributed expectation maximization algorithm for learning the parameters of Gaussian mixture models (GMM) in large peer-to-peer (P2P) environments. The algorithm can be used for a variety of well-known data mining tasks in distributed environments, such as clustering, anomaly detection, target tracking, and density estimation, which are necessary for many emerging P2P applications in bioinformatics, web mining, and sensor networks. Centralizing all or some of the data to build global models is impractical in such P2P environments because of the large number of data sources, the asynchronous nature of P2P networks, and the dynamic nature of the data/network. The proposed algorithm takes a two-step approach. In the monitoring phase, the algorithm checks if the model ‘quality’ is acceptable by using an efficient local algorithm. This is then used as a feedback loop to sample data from the network and rebuild the GMM when it is outdated. We present thorough experimental results to verify our theoretical claims.
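The monitor-then-rebuild idea can be illustrated locally (ignoring the distributed, peer-to-peer aspects of the paper) with the following sketch, in which model quality is tracked via the average log-likelihood of incoming batches.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (1000, 2)), rng.normal(5, 1, (1000, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
baseline = gmm.score(data)                       # mean per-sample log-likelihood

for step in range(10):
    # New batch of data; after step 5 the distribution drifts, which the monitor should catch.
    shift = 3.0 if step >= 5 else 0.0
    batch = np.vstack([rng.normal(0 + shift, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
    if gmm.score(batch) < baseline - 1.0:        # quality threshold is an arbitrary assumption
        gmm = GaussianMixture(n_components=2, random_state=0).fit(batch)
        baseline = gmm.score(batch)
        print(f"step {step}: model quality degraded, GMM rebuilt")
```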
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
A fashion distributor sells articles of particular sizes and colors to its customers. In some cases items are returned to the distributor for various reasons. The order data and the related return data were recorded over a two-year period. The aim is to use this data and machine learning to build a model which enables a good prediction of return rates.
For this task real anonymized shop data are provided in the form of structured text files consisting of individual data sets. Below are some points to note about the files:
Only the field names from the included document features.pdf appear as column headings, in the order used in that document. The associated value ranges are also listed there.
The training file orders_train.txt contains all the data fields from the document, whereas the associated test file orders_class.txt does not contain the target variable “*returnQuantity*”.
The task is to use known historical data from January 2014 to September 2015 (approx. 2.33 million order positions) to build a model that makes predictions about return rates for order positions. The attribute returnQuantity in the given data indicates the number of articles for each order position (the value 0 means that the article will be kept, while a value larger than 0 means that the article will be returned). For sales in the period from October 2015 to December 2015 (approx. 340,000 order positions) the model should then provide predictions for the number of articles which will be returned per order position. The prediction has to be a value from the set of natural numbers including 0. The difference between the prediction and the actual rate for an order position (i.e. the error rate) must be as low as possible.
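As an indicative sketch only (the separator, feature handling, and model choice are assumptions, since features.pdf is not included here), the task can be framed as follows.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv("orders_train.txt", sep=";")     # separator is an assumption
test = pd.read_csv("orders_class.txt", sep=";")

y = train.pop("returnQuantity")
X = pd.get_dummies(train)
X_test = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Predictions must be natural numbers including 0, so round and clip at zero.
pred = np.clip(np.rint(model.predict(X_test)), 0, None).astype(int)
print(pred[:10])
```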
This dataset is publicly available on the Data Mining Cup website.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Agricultural soils provide society with several functions, one of which is primary productivity. This function is defined as the capacity of a soil to supply nutrients and water and to produce plant biomass for human use, providing food, feed, fiber, and fuel. For farmers, the productivity function delivers an economic basis and is a prerequisite for agricultural sustainability. Our study was designed to develop an agricultural primary productivity decision support model. To obtain a highly accurate decision support model that helps farmers and advisors to assess and manage the provision of the primary productivity soil function on their agricultural fields, we addressed the following specific objectives: (i) to construct a qualitative decision support model to assess the primary productivity soil function at the agricultural field level; (ii) to carry out verification, calibration, and sensitivity analysis of this model; and (iii) to validate the model based on empirical data. The result is a hierarchical qualitative model consisting of 25 input attributes describing soil properties, environmental conditions, cropping specifications, and management practices on each respective field. An extensive dataset from France containing data from 399 sites was used to calibrate and validate the model. The large amount of data enabled data mining to support model calibration. The accuracy of the decision support model prior to calibration supported by data mining was ~40%. The data mining approach improved the accuracy to 77%. The proposed methodology of combining decision modeling and data mining proved to be an important step forward. This iterative approach yielded an accurate, reliable, and useful decision support model for the assessment of the primary productivity soil function at the field level. This can assist farmers and advisors in selecting the most appropriate crop management practices. Embedding this decision support model in a set of complementary models for four adjacent soil functions, as endeavored in the H2020 LANDMARK project, will help take the integrated sustainability of arable cropping systems to a new level.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract The predictability of epidemiological indicators can help estimate dependent variables, assist in decision-making to support public policies, and explain the scenarios experienced by different countries worldwide. This study aimed to forecast the Human Development Index (HDI) and life expectancy (LE) for Latin American countries for the period of 2015-2020 using data mining techniques. All stages of the process of knowledge discovery in databases were covered. The SMOReg data mining algorithm was used in the models with multivariate time series to make predictions; this algorithm performed the best in the tests developed during the evaluation period. The average HDI and LE for Latin American countries showed an increasing trend in the period evaluated, corresponding to 4.99 ± 3.90% and 2.65 ± 0.06 years, respectively. Multivariate models allow for a greater evaluation of algorithms, thus increasing their accuracy. Data mining techniques have a better predictive quality relative to the most popular technique, Autoregressive Integrated Moving Average (ARIMA). In addition, the predictions suggest that there will be a higher increase in the mean HDI and LE for Latin American countries compared to the mean values for the rest of the world.
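As a rough illustration of this forecasting setup, the sketch below uses scikit-learn's SVR as a stand-in for Weka's SMOReg (both are support vector regression); the series and lag construction are placeholder assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVR

years = np.arange(1990, 2015)
df = pd.DataFrame({
    "hdi": np.linspace(0.60, 0.75, len(years)),              # placeholder HDI series
    "life_expectancy": np.linspace(68.0, 75.0, len(years)),  # placeholder LE series
}, index=years)

# Multivariate setup: predict next-year HDI from the current year's HDI and LE.
X = df[["hdi", "life_expectancy"]].values[:-1]
y = df["hdi"].values[1:]

model = SVR(kernel="rbf", C=10.0).fit(X, y)
next_step = model.predict(df[["hdi", "life_expectancy"]].values[-1:])
print("one-step-ahead HDI forecast:", round(float(next_step[0]), 3))
```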