MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Eric Demangel
Released under MIT
This is an electronic database detailing the types of geothermal exploration techniques, their various phases, best practices, and the cost and time associated with each. The groups of exploration techniques included in the database are Data and Modeling Techniques, Downhole Techniques, Drilling Techniques, Field Technologies, Geochemical Techniques, Geophysical Techniques, Lab Analysis Techniques, and Remote Sensing Techniques.
https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.15454/AGU4QE
WIDEa is R-based software aiming to provide users with a range of functionalities to explore, manage, clean and analyse "big" environmental and (in/ex situ) experimental data. These functionalities are the following: 1. loading/reading different data types: basic (called normal), temporal, and mid/near-region infrared spectra (called IR), with frequency (wavenumber) used as the unit (in cm-1); 2. interactive data visualization from a multitude of graph representations: 2D/3D scatter-plot, box-plot, hist-plot, bar-plot, correlation matrix; 3. manipulation of variables: concatenation of qualitative variables, transformation of quantitative variables by generic functions in R; 4. application of mathematical/statistical methods; 5. creation/management of data considered atypical (named flag data); 6. study of normal distribution model results for different strategies: calibration (checking assumptions on residuals) and validation (comparison between measured and fitted values). The model form can be more or less complex: mixed effects, main/interaction effects, weighted residuals.
This paper describes the methodology used to define the baseline exploration suite of techniques (baseline), as well as the approach that was used to create the cost and time data set that populates the baseline. The resulting product, an online tool for measuring impact, and the aggregated cost and time data are available on the Open Energy Information website (OpenEI, http://en.openei.org) for public access. The Department of Energy's Geothermal Technology Office (GTO) provides RD&D funding for geothermal exploration technologies with the goal of lowering the risks and costs of geothermal development and exploration. The National Renewable Energy Laboratory (NREL) developed this cost and time metric; the effort included collecting cost and time data for exploration techniques, creating a baseline suite of exploration techniques to which future exploration cost and time improvements can be compared, and developing an online tool for graphically showing potential project impacts (all available at http://en.openei.org/wiki/Gateway: Geothermal).
Data from Game of Thrones series
https://www.archivemarketresearch.com/privacy-policy
The global market for data lens (visualizations of data) is experiencing robust growth, driven by the increasing adoption of data analytics across diverse industries. This market, estimated at $50 billion in 2025, is projected to achieve a compound annual growth rate (CAGR) of 15% from 2025 to 2033. This expansion is fueled by several key factors. Firstly, the rising volume and complexity of data necessitate effective visualization tools for insightful analysis. Businesses are increasingly relying on interactive dashboards and data storytelling techniques to derive actionable intelligence from their data, fostering the demand for sophisticated data visualization solutions. Secondly, advancements in artificial intelligence (AI) and machine learning (ML) are enhancing the capabilities of data visualization platforms, enabling automated insights generation and predictive analytics. This creates new opportunities for vendors to offer more advanced and user-friendly tools. Finally, the growing adoption of cloud-based solutions is further accelerating market growth, offering enhanced scalability, accessibility, and cost-effectiveness.
The market is segmented across various types, including points, lines, and bars, and applications, ranging from exploratory data analysis and interactive data visualization to descriptive statistics and advanced data science techniques. Major players like Tableau, Sisense, and Microsoft dominate the market, constantly innovating to meet evolving customer needs and competitive pressures.
The geographical distribution of the market reveals strong growth across North America and Europe, driven by early adoption and technological advancements. However, emerging markets in Asia-Pacific and the Middle East & Africa are showing significant growth potential, fueled by increasing digitalization and investment in data analytics infrastructure. Restraints to growth include the high cost of implementation, the need for skilled professionals to effectively utilize these tools, and security concerns related to data privacy. Nonetheless, the overall market outlook remains positive, with continued expansion anticipated throughout the forecast period due to the fundamental importance of data visualization in informed decision-making across all sectors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geochemical data are frequently collected from mineral exploration drill-hole samples to more accurately define and characterise the geological units intersected by the drill hole. However, large multi-element data sets are slow and challenging to interpret without using some form of automated analysis, such as mathematical, statistical or machine learning techniques. Automated analysis techniques also have the advantage in that they are repeatable and can provide consistent results, even for very large data sets. In this paper, an automated litho-geochemical interpretation workflow is demonstrated, which includes data exploration and data preparation using appropriate compositional data-analysis techniques. Multiscale analysis using a modified wavelet tessellation has been applied to the data to provide coherent geological domains. Unsupervised machine learning (clustering) has been used to provide a first-pass classification. The results are compared with the detailed geologist’s logs. The comparison shows how the integration of automated analysis of geochemical data can be used to enhance traditional geological logging and demonstrates the identification of new geological units from the automated litho-geochemical logging that were not apparent from visual logging but are geochemically distinct. To reduce computational complexity and facilitate interpretation, a subset of geochemical elements is selected, and then a centred log-ratio transform is applied. The wavelet tessellation method is used to domain the drill holes into rock units at a range of scales. Several clustering methods were tested to identify distinct rock units in the samples and multiscale domains for classification. Results are compared with geologist’s logs to assess how geochemical data analysis can inform and improve traditional geology logs.
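As an illustration of the data-preparation and first-pass classification steps described above, the following Python sketch applies a centred log-ratio transform to a subset of elements and clusters the result with k-means. The file name, element columns, and number of clusters are assumptions for illustration only; the wavelet-tessellation domaining step is not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def clr_transform(X):
    """Centred log-ratio transform: log of each part minus the row mean of the logs.
    Assumes strictly positive compositional data."""
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

# Hypothetical drill-hole geochemistry table: one row per sample,
# columns are element/oxide concentrations plus depth.
df = pd.read_csv("drillhole_geochem.csv")            # assumed file name
elements = ["SiO2", "Al2O3", "Fe2O3", "MgO", "CaO"]  # assumed element subset

X_clr = clr_transform(df[elements].to_numpy())

# First-pass unsupervised classification; the paper compares several clustering
# methods, and k-means is used here purely as an illustration.
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X_clr)
df["cluster"] = labels
print(df.groupby("cluster")[elements].mean())
```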
https://spdx.org/licenses/CC0-1.0.html
Machine learning‐based behaviour classification using acceleration data is a powerful tool in bio‐logging research. Deep learning architectures such as convolutional neural networks (CNN), long short‐term memory (LSTM) and self‐attention mechanisms as well as related training techniques have been extensively studied in human activity recognition. However, they have rarely been used in wild animal studies. The main challenges of acceleration‐based wild animal behaviour classification include data shortages, class imbalance problems, various types of noise in data due to differences in individual behaviour and where the loggers were attached and complexity in data due to complex animal‐specific behaviours, which may have limited the application of deep learning techniques in this area. To overcome these challenges, we explored the effectiveness of techniques for efficient model training: data augmentation, manifold mixup and pre‐training of deep learning models with unlabelled data, using datasets from two species of wild seabirds and state‐of‐the‐art deep learning model architectures. Data augmentation improved the overall model performance when one of the various techniques (none, scaling, jittering, permutation, time‐warping and rotation) was randomly applied to each data during mini‐batch training. Manifold mixup also improved model performance, but not as much as random data augmentation. Pre‐training with unlabelled data did not improve model performance. The state‐of‐the‐art deep learning models, including a model consisting of four CNN layers, an LSTM layer and a multi‐head attention layer, as well as its modified version with shortcut connection, showed better performance among other comparative models. Using only raw acceleration data as inputs, these models outperformed classic machine learning approaches that used 119 handcrafted features. Our experiments showed that deep learning techniques are promising for acceleration‐based behaviour classification of wild animals and highlighted some challenges (e.g. effective use of unlabelled data). There is scope for greater exploration of deep learning techniques in wild animal studies (e.g. advanced data augmentation, multimodal sensor data use, transfer learning and self‐supervised learning). We hope that this study will stimulate the development of deep learning techniques for wild animal behaviour classification using time‐series sensor data.
This abstract is cited from the original article "Exploring deep learning techniques for wild animal behaviour classification using animal-borne accelerometers" in Methods in Ecology and Evolution (Otsuka et al., 2024). Please see README for the details of the datasets.
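A minimal NumPy sketch of the random per-sample augmentation strategy described in the abstract is given below. The noise levels, segment counts, and the exact time-warping and rotation formulations are assumptions for illustration, not the authors' settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def scale(x):        # multiply the whole sequence by a random factor
    return x * rng.normal(1.0, 0.1)

def jitter(x):       # add Gaussian noise to every sample
    return x + rng.normal(0.0, 0.05, size=x.shape)

def permute(x, n_segments=4):   # shuffle contiguous segments in time
    segs = np.array_split(x, n_segments, axis=0)
    rng.shuffle(segs)
    return np.concatenate(segs, axis=0)

def time_warp(x):    # resample each channel along a randomly warped time axis
    T = x.shape[0]
    t = np.linspace(0.0, 1.0, T)
    warp = t ** np.exp(rng.normal(0.0, 0.2))
    return np.stack([np.interp(warp, t, x[:, c]) for c in range(x.shape[1])], axis=1)

def rotate(x):       # rotation about the z-axis only, for illustration
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return x @ R.T

AUGMENTATIONS = [lambda x: x, scale, jitter, permute, time_warp, rotate]  # "none" first

def augment_batch(batch):
    """Apply one randomly chosen augmentation to each sequence in a mini-batch.
    batch: array of shape (batch, time, 3 acceleration axes)."""
    return np.stack([AUGMENTATIONS[rng.integers(len(AUGMENTATIONS))](x) for x in batch])

# Example: a mini-batch of 8 one-second windows sampled at 25 Hz (hypothetical rate)
dummy = rng.normal(size=(8, 25, 3))
print(augment_batch(dummy).shape)   # (8, 25, 3)
```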
Description: Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.
What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.
Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, enabling us to uncover hidden patterns, trends, and themes within the realm of cinema.
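A minimal Pandas sketch of the selection and cleaning step described above might look as follows; the file name and column names are hypothetical, since the actual scraped schema is not reproduced here.

```python
import pandas as pd

# Hypothetical file and column names; the real scraped table has 20+ fields.
movies = pd.read_csv("imdb_scraped.csv")

# Normalise vote counts scraped as text (e.g. "12,345" or "1.2K") into numbers.
movies["votes"] = (movies["votes"].astype(str)
                   .str.replace(",", "", regex=False)
                   .str.replace("K", "e3", regex=False)
                   .astype(float))

# Selection criteria described above: rating above 7 and more than 10,000 votes.
gems = movies[(movies["rating"] > 7) & (movies["votes"] > 10_000)]
gems = gems.drop_duplicates(subset="title").reset_index(drop=True)
print(len(gems), "movies retained")
```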
Note: The data is collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.
This paper provides a review of three different advanced machine learning algorithms for anomaly detection in continuous data streams from a ground-test firing of a subscale Solid Rocket Motor (SRM). This study compares Orca, one-class support vector machines, and the Inductive Monitoring System (IMS) for anomaly detection on the data streams. We measure the performance of each algorithm with respect to the detection horizon for situations where fault information is available. These algorithms have also been studied by the present authors (and other co-authors) as applied to liquid propulsion systems. The trade space between these algorithms will be explored for both types of propulsion systems.
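As a rough illustration of one of the three approaches, the sketch below fits a one-class support vector machine on nominal windows and flags later windows of a stream; the data, features, and kernel parameters are stand-ins, not the study's configuration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Hypothetical sensor stream: each row is a sliding window of sensor readings
# summarised as an 8-dimensional feature vector.
nominal = rng.normal(0.0, 1.0, size=(500, 8))             # assumed nominal training windows
stream = np.vstack([rng.normal(0.0, 1.0, size=(95, 8)),   # nominal operation
                    rng.normal(4.0, 1.0, size=(5, 8))])   # injected fault at the end

model = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(nominal)

# predict() returns +1 for inliers and -1 for anomalies; the index of the first -1
# gives a rough detection horizon for the injected fault.
flags = model.predict(stream)
first_alarm = int(np.argmax(flags == -1)) if (flags == -1).any() else -1
print("first anomalous window:", first_alarm)
```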
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset contains sales data for different regions. If you are a beginner, you can work with it; it is a distinctive dataset through which you can understand many new concepts. Take this as a challenge and work on it.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 3 rows and is filtered to the book Data cleaning and exploration with machine learning: clean data with machine learning algorithms and techniques. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
No Publication Abstract is Available
The OECD Programme for International Student Assessment (PISA) surveys collected data on students’ performances in reading, mathematics and science, as well as contextual information on students’ background, home characteristics and school factors which could influence performance. This publication includes detailed information on how to analyse the PISA data, enabling researchers both to reproduce the initial results and to undertake further analyses. In addition to the necessary techniques, the manual also includes a detailed account of the PISA 2006 database and worked examples providing full syntax in SPSS.
No Publication Abstract is Available
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason it has not improved could be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.
From the dimensionality reduction perspective: clustering is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension will lose a lot of information, since clustering techniques are based on a metric of 'distance'. In high dimensions, Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.
From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, and it brings uncertainties into the data. When using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, and in turn the performance of the classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.
We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and also to continue to revise the models from time to time as things change.
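The sketch below illustrates, on synthetic data, the kind of pipeline discussed above: classifying on raw features versus on distances to k-means cluster centres. All data and parameter choices are assumptions for illustration, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the project's data.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)

# Baseline: classify on the raw features.
baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

# Clustering-derived representation: replace the features with distances to
# k cluster centres (a crude dimensionality reduction, as discussed above).
km = KMeans(n_clusters=8, n_init=10)          # no random_state, mirroring the write-up
X_clustered = km.fit_transform(X)             # shape (n_samples, 8): distance to each centre
clustered = cross_val_score(RandomForestClassifier(random_state=0), X_clustered, y, cv=5).mean()

print(f"raw features: {baseline:.3f}  cluster distances: {clustered:.3f}")
```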
https://dataintelo.com/privacy-and-policy
The global AI Data Analysis Tool market size was valued at approximately USD 15.3 billion in 2023 and is projected to reach USD 57.2 billion by 2032, growing at a compound annual growth rate (CAGR) of 15.5% during the forecast period. The rapid growth of this market can be attributed to the increasing adoption of artificial intelligence and machine learning technologies across various industries to enhance data processing and analytics capabilities, driving the demand for advanced AI-powered data analysis tools.
One of the primary growth factors in the AI Data Analysis Tool market is the exponential increase in the volume of data generated by digital devices, social media, online transactions, and IoT sensors. This data deluge has created an urgent need for robust tools that can analyze and extract actionable insights from large datasets. AI data analysis tools, leveraging machine learning algorithms and deep learning techniques, facilitate real-time data processing, trend analysis, pattern recognition, and predictive analytics, making them indispensable for modern businesses looking to stay competitive in the data-driven era.
Another significant growth driver is the expanding application of AI data analysis tools in various industries such as healthcare, finance, retail, and manufacturing. In healthcare, for instance, these tools are utilized to analyze patient data for improved diagnostics, treatment plans, and personalized medicine. In finance, AI data analysis is employed for risk assessment, fraud detection, and investment strategies. Retailers use these tools to understand consumer behavior, optimize inventory management, and enhance customer experiences. In manufacturing, AI-driven data analysis enhances predictive maintenance, process optimization, and quality control, leading to increased efficiency and cost savings.
The surge in cloud computing adoption is also contributing to the growth of the AI Data Analysis Tool market. Cloud-based AI data analysis tools offer scalability, flexibility, and cost-effectiveness, allowing businesses to access powerful analytics capabilities without the need for substantial upfront investments in hardware and infrastructure. This shift towards cloud deployment is particularly beneficial for small and medium enterprises (SMEs) that aim to leverage advanced analytics without bearing the high costs associated with on-premises solutions. Additionally, the integration of AI data analysis tools with other cloud services, such as storage and data warehousing, further enhances their utility and appeal.
AI and Analytics Systems are becoming increasingly integral to the modern business landscape, offering unparalleled capabilities in data processing and insight generation. These systems leverage the power of artificial intelligence to analyze vast datasets, uncovering patterns and trends that were previously inaccessible. By integrating AI and Analytics Systems, companies can enhance their decision-making processes, improve operational efficiency, and gain a competitive edge in their respective industries. The ability to process and analyze data in real-time allows businesses to respond swiftly to market changes and customer demands, driving innovation and growth. As these systems continue to evolve, they are expected to play a crucial role in shaping the future of data-driven enterprises.
Regionally, North America holds a prominent share in the AI Data Analysis Tool market due to the early adoption of advanced technologies, presence of major tech companies, and significant investments in AI research and development. However, the Asia Pacific region is expected to exhibit the highest growth rate during the forecast period. This growth can be attributed to the rapid digital transformation across emerging economies, increasing government initiatives to promote AI adoption, and the rising number of tech startups focusing on AI and data analytics. The growing awareness of the benefits of AI-driven data analysis among businesses in this region is also a key factor propelling market growth.
The component segment of the AI Data Analysis Tool market is categorized into software, hardware, and services. Software is the largest segment, holding the majority share due to the extensive adoption of AI-driven analytics platforms and applications across various industries. These software solutions include machine learning algorithms and data visualization tools.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the “critical pair,” which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
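For illustration, the sketch below computes several of the listed summary statistics for a set of hypothetical spectra and compares them with PCA scores. PRE is approximated here as the Shannon entropy of each normalised spectrum, which is an assumption about its general form rather than the paper's exact formula.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
spectra = np.abs(rng.normal(size=(50, 300)))   # hypothetical stand-in: 50 spectra, 300 channels

def summary_statistics(X):
    """Per-spectrum summary statistics used as one-number descriptors."""
    p = X / X.sum(axis=1, keepdims=True)       # normalise each spectrum to unit sum
    logp = np.log(p, where=p > 0, out=np.zeros_like(p))
    return {
        "PRE": -(p * logp).sum(axis=1),        # entropy-style PRE (assumed form)
        "mean": X.mean(axis=1),
        "STD": X.std(axis=1),
        "1-norm": np.abs(X).sum(axis=1),
        "range": X.max(axis=1) - X.min(axis=1),
        "SSQ": (X ** 2).sum(axis=1),
    }

stats = summary_statistics(spectra)

# Factor-based comparison: first two principal component scores.
scores = PCA(n_components=2).fit_transform(spectra)

for name, values in stats.items():
    print(f"{name:>7}: first five values -> {np.round(values[:5], 3)}")
print("PCA scores shape:", scores.shape)
```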
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Improving the accuracy of prediction of future values based on past and current observations has been pursued by enhancing the prediction methods, combining those methods, or performing data pre-processing. In this paper, another approach is taken, namely increasing the number of inputs in the dataset. This approach is useful especially for shorter time series data. By filling the in-between values in the time series, the size of the training set can be increased, thus increasing the generalization capability of the predictor. The algorithm used to make predictions is a neural network, as it is widely used in the literature for time-series tasks. For comparison, Support Vector Regression is also employed. The dataset used in the experiment is the frequency of USPTO's patents and PubMed's scientific publications in the field of health, namely on Apnea, Arrhythmia, and Sleep Stages. Another time series dataset designated for the NN3 Competition, in the field of transportation, is also used for benchmarking. The experimental results show that the prediction performance can be significantly increased by filling in-between data in the time series. Furthermore, the use of detrending and deseasonalization, which separates the data into trend, seasonal and stationary components, also improves the prediction performance on both the original and filled datasets. The optimal expansion of the dataset in this experiment is about five times the length of the original dataset.
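A minimal sketch of the filling idea, assuming simple linear interpolation between observations and a sliding-window formulation for the predictor, is shown below; the actual interpolation scheme and window length used in the paper may differ.

```python
import numpy as np

def fill_in_between(series, factor=5):
    """Expand a short time series roughly `factor`-fold by linear interpolation
    between consecutive observations (a simple stand-in for the filling approach)."""
    t_old = np.arange(len(series))
    t_new = np.linspace(0, len(series) - 1, factor * (len(series) - 1) + 1)
    return np.interp(t_new, t_old, series)

def make_windows(series, window=6):
    """Turn a series into (inputs, target) pairs for a neural network or SVR."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Hypothetical yearly publication counts (a short series).
counts = np.array([3, 5, 4, 8, 9, 13, 12, 17, 21, 20], dtype=float)

filled = fill_in_between(counts, factor=5)      # roughly 5x more points to train on
X, y = make_windows(filled, window=6)
print(len(counts), "original points ->", len(filled), "filled points,", len(X), "training windows")
```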