Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thorough knowledge of the structure of the analyzed data makes it possible to formulate detailed scientific hypotheses and research questions. The structure of data can be revealed with methods for exploratory data analysis. Because of the multitude of available methods, selecting those that work well together and facilitate data interpretation is not an easy task. In this work we present a well-fitted set of tools for a complete exploratory analysis of a clinical dataset and perform a case-study analysis on a set of 515 patients. The proposed procedure comprises several steps: 1) robust data normalization, 2) outlier detection with Mahalanobis (MD) and robust Mahalanobis (rMD) distances, 3) hierarchical clustering with Ward’s algorithm, and 4) Principal Component Analysis with biplot vectors. The analyzed set comprised elderly patients who participated in the PolSenior project. Each patient was characterized by over 40 biochemical and socio-geographical attributes. Introductory analysis showed that the case-study dataset comprises two clusters separated along the axis of sex-hormone attributes, so further analysis was carried out separately for male and female patients. The optimal partitioning of the male set resulted in five subgroups, two of which were related to diseased patients: 1) diabetes and 2) hypogonadism. Analysis of the female set suggested that it was more homogeneous than the male set; no evidence of pathological patient subgroups was found. In the study we showed that outlier detection with MD and rMD not only identifies outliers but can also assess the heterogeneity of a dataset. The case study proved that our procedure is well suited for identification and visualization of biologically meaningful patient subgroups.
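A minimal Python sketch of the four-step procedure described above, built from common scikit-learn/SciPy equivalents rather than the authors' original code; the random matrix stands in for the 515-patient, 40-attribute table.

```python
# Hedged sketch of the exploratory pipeline described above (not the authors' code).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import RobustScaler
from sklearn.covariance import EmpiricalCovariance, MinCovDet
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(515, 40))          # placeholder for the patient attribute matrix

# 1) Robust normalization (median/IQR scaling).
Xs = RobustScaler().fit_transform(X)

# 2) Outlier detection: squared Mahalanobis distances, classical (MD) vs. robust (rMD).
md = EmpiricalCovariance().fit(Xs).mahalanobis(Xs)
rmd = MinCovDet(random_state=0).fit(Xs).mahalanobis(Xs)

# 3) Hierarchical clustering with Ward's algorithm (here cut into 5 subgroups).
labels = fcluster(linkage(Xs, method="ward"), t=5, criterion="maxclust")

# 4) PCA: scores for the scatter plot, loadings give the biplot vectors.
pca = PCA(n_components=2).fit(Xs)
scores, loadings = pca.transform(Xs), pca.components_.T
```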
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: The age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
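For context, a hedged sketch of how records matching this schema might be generated with Python's Faker library; the category values, age brackets, and discount names below are illustrative assumptions, not the exact vocabularies used in the published dataset.

```python
# Hedged sketch: one way to synthesize rows matching the schema above with Faker.
import random
from faker import Faker

fake = Faker("en_IN")

def make_transaction():
    gross = round(random.uniform(200, 20000), 2)
    discounted = random.random() < 0.4
    discount = round(gross * random.uniform(0.05, 0.5), 2) if discounted else 0.0
    return {
        "CID": fake.uuid4(),
        "TID": fake.uuid4(),
        "Gender": random.choice(["Male", "Female"]),
        "Age Group": random.choice(["18-25", "26-35", "36-45", "46-60", "60+"]),
        "Purchase Date": fake.date_time_between(start_date="-2y"),
        "Product Category": random.choice(["Electronics", "Apparel", "Groceries"]),
        "Discount Availed": "Yes" if discounted else "No",
        "Discount Name": "FESTIVE50" if discounted else None,
        "Discount Amount (INR)": discount,
        "Gross Amount": gross,
        "Net Amount": round(gross - discount, 2),
        "Purchase Method": random.choice(["Credit Card", "Debit Card", "UPI"]),
        "Location": fake.city(),
    }

rows = [make_transaction() for _ in range(5)]
```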
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After feature engineering, the dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts from the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset; it was generated using Python's Faker library for the sole purpose of learning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.
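A brief, hedged sketch of the kind of EDA comparison described here, using placeholder spectra rather than the actual IRMS data; umap-learn and scikit-learn stand in for whatever tooling the authors used.

```python
# Illustrative sketch: compare PCA and UMAP embeddings of spectral feature
# vectors, then look for data-driven groups ("clusters") in the embedding.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from umap import UMAP                      # umap-learn package

rng = np.random.default_rng(42)
spectra = rng.random((200, 300))           # placeholder for IRMS feature vectors

pca_embedding = PCA(n_components=2).fit_transform(spectra)
umap_embedding = UMAP(n_components=2, random_state=42).fit_transform(spectra)

# Clusters would later be mapped to experimental conditions such as
# seawater composition or CO2 concentration.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(umap_embedding)
```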
Custom license: https://entrepot.recherche.data.gouv.fr/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.15454/AGU4QE
WIDEa is R-based software that aims to provide users with a range of functionalities to explore, manage, clean and analyse "big" environmental and (in/ex situ) experimental data. These functionalities are the following:
1. Loading/reading different data types: basic (called normal), temporal, and mid/near-region infrared spectra (called IR) with frequency (wavenumber) used as unit (in cm-1);
2. Interactive data visualization from a multitude of graph representations: 2D/3D scatter-plot, box-plot, hist-plot, bar-plot, correlation matrix;
3. Manipulation of variables: concatenation of qualitative variables, transformation of quantitative variables by generic functions in R;
4. Application of mathematical/statistical methods;
5. Creation/management of data considered atypical (named flag data);
6. Study of normal distribution model results for different strategies: calibration (checking assumptions on residuals) and validation (comparison between measured and fitted values). The model form can be more or less complex: mixed effects, main/interaction effects, weighted residuals.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions; the mined facts supply arguments that influence the Systematic Literature Review (SLR) of DL4SE.
The SLR of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database, which was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
1. Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data scale, type of tuning, learning algorithm, SE data, and so on.
2. Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
3. Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis, where we performed a Principal Component Analysis to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance, i.e., the number of clusters to use when tuning the explainable models.
4. Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (correlations and association rules) and categorizing the DL4SE papers for a better segmentation of the state of the art (clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
5. Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, which produced an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful association rules: rectangles represent both premises and conclusions, and an arrow connecting a premise with a conclusion indicates that, given the premise, the conclusion is associated with it. For example, given that an author used supervised learning, we can conclude that their approach is irreproducible, with a certain support and confidence.
Support = the number of occurrences in which the statement (premise and conclusion together) holds, divided by the total number of statements.
Confidence = the number of occurrences in which the statement holds, divided by the number of occurrences of the premise.
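As a concrete illustration of these two definitions, the toy computation below uses hypothetical paper-level boolean features, not the actual DL4SE extraction.

```python
# Toy illustration of the Support/Confidence definitions above.
import pandas as pd

papers = pd.DataFrame({
    "supervised_learning": [1, 1, 1, 0, 1, 0],   # hypothetical premise feature
    "irreproducible":      [1, 1, 0, 0, 1, 1],   # hypothetical conclusion feature
})

rule = (papers["supervised_learning"] == 1) & (papers["irreproducible"] == 1)

support = rule.mean()                                                 # rule count / total papers
confidence = rule.sum() / (papers["supervised_learning"] == 1).sum()  # rule count / premise count
print(f"support={support:.2f}, confidence={confidence:.2f}")
```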
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Modern research projects incorporate data from several sources, and new insights are increasingly driven by the ability to interpret data in the context of other data. Glue is an interactive environment built on top of the standard Python science stack to visualize relationships within and between datasets. With Glue, users can load and visualize multiple related datasets simultaneously. Users specify the logical connections that exist between data, and Glue transparently uses this information as needed to enable visualization across files. This functionality makes it trivial, for example, to interactively overplot catalogs on top of images. The central philosophy behind Glue is that the structure of research data is highly customized and problem-specific. Glue aims to accommodate this and simplify the "data munging" process, so that researchers can more naturally explore what their data have to say. The result is a cleaner scientific workflow, faster interaction with data, and an easier avenue to insight.
https://cdla.io/permissive-1-0/
Case study: How does a bike-share navigate speedy success?
Scenario:
As a data analyst on Cyclistic's marketing team, my focus is on growing annual memberships to drive the company's success. We aim to analyze how casual riders and annual members use Cyclistic bikes differently in order to craft a marketing strategy for converting casual riders into annual members. Our recommendations, supported by data insights and professional visualizations, await approval from Cyclistic executives before we proceed.
About the company
In 2016, Cyclistic launched a bike-share program in Chicago, growing to 5,824 bikes and 692 stations. Initially, their marketing aimed at broad segments with flexible pricing plans attracting both casual riders (single-ride or full-day passes) and annual members. However, recognizing that annual members are more profitable, Cyclistic is shifting focus to convert casual riders into annual members. To achieve this, they plan to analyze historical bike trip data to understand the differences and preferences between the two user groups, aiming to tailor marketing strategies that encourage casual riders to purchase annual memberships.
Project Overview:
This capstone project is a culmination of the skills and knowledge acquired through the Google Professional Data Analytics Certification. It focuses on Track 1, which is centered around Cyclistic, a fictional bike-share company modeled to reflect real-world data analytics scenarios in the transportation and service industry.
Dataset Acknowledgment:
We are grateful to Motivate Inc. for providing the dataset that serves as the foundation of this capstone project. Their contribution has enabled us to apply practical data analytics techniques to a real-world dataset, mirroring the challenges and opportunities present in the bike-sharing sector.
Objective:
The primary goal of this project is to analyze the Cyclistic dataset to uncover actionable insights that could help the company optimize its operations, improve customer satisfaction, and increase its market share. Through comprehensive data exploration, cleaning, analysis, and visualization, we aim to identify patterns and trends that inform strategic business decisions.
Methodology:
Data Collection: Utilizing the dataset provided by Motivate Inc., which includes detailed information on bike usage, customer behavior, and operational metrics.
Data Cleaning and Preparation: Ensuring the dataset is accurate, complete, and ready for analysis by addressing any inconsistencies, missing values, or anomalies.
Data Analysis: Applying statistical methods and data analytics techniques to extract meaningful insights from the dataset (a brief sketch follows below).
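A hedged sketch of what the cleaning and analysis steps might look like in Python; the file name and column names (started_at, ended_at, member_casual) follow the public Divvy trip-data convention and are assumptions about the actual files used.

```python
# Hedged sketch of the Process/Analyze steps for trip data like Cyclistic's.
import pandas as pd

# Hypothetical file name; real Divvy exports are monthly CSVs.
trips = pd.read_csv("divvy_trips_2023.csv", parse_dates=["started_at", "ended_at"])

# Cleaning: derive ride length, drop non-positive durations and missing rider type.
trips["ride_length_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
trips = trips[(trips["ride_length_min"] > 0) & trips["member_casual"].notna()]

# Analysis: compare casual riders vs. annual members by weekday and duration.
trips["weekday"] = trips["started_at"].dt.day_name()
summary = (trips.groupby(["member_casual", "weekday"])["ride_length_min"]
                .agg(rides="count", avg_minutes="mean")
                .round(1))
print(summary)
```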
Visualization and Reporting:
Creating intuitive and compelling visualizations to present the findings clearly and effectively, facilitating data-driven decision-making.
Findings and Recommendations:
Conclusion:
The Cyclistic capstone project not only demonstrates the practical application of data analytics skills in a real-world scenario but also provides valuable insights that can drive strategic improvements for Cyclistic. The project showcases the power of data analytics in transforming data into actionable knowledge and underscores the importance of data-driven decision-making in today's competitive business landscape.
Acknowledgments:
Special thanks to Motivate Inc. for their support and for providing the dataset that made this project possible. Their contribution is immensely appreciated and has significantly enhanced the learning experience.
STRATEGIES USED
Case Study Roadmap - ASK
● What is the problem you are trying to solve? ● How can your insights drive business decisions?
Key Tasks ● Identify the business task ● Consider key stakeholders
Deliverable ● A clear statement of the business task
Case Study Roadmap - PREPARE
● Where is your data located? ● Are there any problems with the data?
Key tasks ● Download data and store it appropriately. ● Identify how it’s organized.
Deliverable ● A description of all data sources used
Case Study Roadmap - PROCESS
● What tools are you choosing and why? ● What steps have you taken to ensure that your data is clean?
Key tasks ● Choose your tools. ● Document the cleaning process.
Deliverable ● Documentation of any cleaning or manipulation of data
Case Study Roadmap - ANALYZE
● Has your data been properly formatted? ● How will these insights help answer your business questions?
Key tasks ● Perform calculations ● Formatting
Deliverable ● A summary of analysis
Case Study Roadmap - SHARE
● Were you able to answer all questions of stakeholders? ● Can Data visualization help you share findings?
Key tasks ● Present your findings ● Create effective data viz.
Deliverable ● Supporting viz and key findings
Case Study Roadmap - A...
https://www.technavio.com/content/privacy-notice
Big Data Market In Oil And Gas Sector Size 2025-2029
The big data market in oil and gas sector size is forecast to increase by USD 31.13 billion, at a CAGR of 29.7% between 2024 and 2029.
In the Oil and Gas sector, the adoption of Big Data is increasingly becoming a strategic priority to optimize production processes and enhance operational efficiency. The implementation of advanced analytics tools and technologies is enabling companies to gain valuable insights from vast volumes of data, leading to improved decision-making and operational excellence. However, the use of Big Data in the Oil and Gas industry is not without challenges. Security concerns are at the forefront of the Big Data landscape in the Oil and Gas sector. With the vast amounts of sensitive data being generated and shared, ensuring data security is crucial. The use of blockchain solutions is gaining traction as a potential answer to this challenge, offering enhanced security and transparency. Yet, the implementation of these solutions presents its own set of complexities, requiring significant investment and expertise. Despite these challenges, the potential benefits of Big Data in the Oil and Gas sector are significant, offering opportunities for increased productivity, cost savings, and competitive advantage. Companies seeking to capitalize on these opportunities must navigate the security challenges effectively, investing in the right technologies and expertise to secure their data and reap the rewards of Big Data analytics.
What will be the Size of the Big Data Market In Oil And Gas Sector during the forecast period?
Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
In the oil and gas sector, the application of big data continues to evolve, shaping market dynamics across various sectors. Predictive modeling and pipeline management are two areas where big data plays a pivotal role. Big data storage solutions ensure the secure handling of vast amounts of data, enabling data governance and natural gas processing. The integration of data from exploration and production, drilling optimization, and reservoir simulation enhances operational efficiency and cost optimization. Artificial intelligence, data mining, and automated workflows facilitate decision support systems and data visualization, enabling pattern recognition and risk management. Big data also optimizes upstream operations through real-time data processing, horizontal drilling, and hydraulic fracturing.
Downstream operations benefit from data analytics, asset management, process automation, and energy efficiency. Sensor networks and IoT devices facilitate environmental monitoring and carbon emissions tracking. Deep learning and machine learning algorithms optimize production and improve enhanced oil recovery. Digital twins and automated workflows streamline project management and supply chain operations. Edge computing and cloud computing enable data processing in real-time, ensuring data quality and security. Remote monitoring and health and safety applications enhance operational efficiency and ensure regulatory compliance. Big data's role in the oil and gas sector is ongoing and dynamic, continuously unfolding and shaping market patterns.
How is this Big Data In Oil And Gas Sector Industry segmented?
The big data in oil and gas sector industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023, for the following segments.
Application: Upstream, Midstream, Downstream
Type: Structured, Unstructured, Semi-structured
Deployment: On-premises, Cloud-based
Product Type: Services, Software
Geography: North America (US, Canada), Europe (France, Germany, Russia), APAC (China, India, Japan, South Korea), South America (Brazil), Rest of World (ROW)
By Application Insights
The upstream segment is estimated to witness significant growth during the forecast period. In the oil and gas industry's upstream sector, big data analytics significantly enhances exploration, drilling, and production activities. Big data storage and processing facilitate the analysis of extensive seismic data, well logs, geological information, and other relevant data. This information is crucial for identifying potential drilling sites, estimating reserves, and enhancing reservoir modeling. Real-time data processing from production operations allows for optimization, maximizing hydrocarbon recovery, and improving operational efficiency. Machine learning and artificial intelligence algorithms identify patterns and anomalies, providing valuable insights for drilling optimization, production forecasting, and risk management. Data integration and data governance ensure data quality and security, enabling effective decision-making through advanced decision support systems and data visual
https://dataintelo.com/privacy-and-policy
According to our latest research, the AI in Mining Exploration market size reached USD 1.28 billion in 2024 globally, demonstrating robust momentum driven by accelerated digital transformation across the mining sector. The market is expected to expand at a CAGR of 19.6% from 2025 to 2033, reaching a forecasted value of USD 6.24 billion by 2033. This remarkable growth is primarily attributed to the industry’s increasing demand for advanced analytics, automation, and real-time data insights to enhance exploration efficiency, reduce operational costs, and improve safety outcomes. As per our latest research, AI-driven solutions are rapidly becoming indispensable for mining companies seeking to remain competitive in a resource-constrained and sustainability-focused environment.
The growth of the AI in Mining Exploration market is strongly influenced by the sector’s urgent need to optimize exploration processes and maximize resource discovery. Traditional exploration methods are often time-consuming, costly, and prone to human error, making them less viable in the face of declining ore grades and complex geological formations. Artificial intelligence technologies, including machine learning, deep learning, and predictive analytics, are transforming how data is collected, processed, and interpreted. These innovations enable mining companies to analyze vast datasets from satellite imagery, geophysical surveys, and drilling logs more efficiently, leading to improved target identification and higher success rates in mineral discovery. The integration of AI not only accelerates exploration timelines but also reduces the risks and costs associated with fieldwork, making it a key driver of market expansion.
Another significant growth factor is the mining industry’s increasing focus on sustainability and environmental stewardship. Regulatory pressures and stakeholder expectations are compelling companies to adopt cleaner and more responsible exploration practices. AI-powered environmental monitoring tools help organizations track and mitigate the ecological impact of exploration activities by providing real-time insights into land use, water quality, and biodiversity. Furthermore, AI facilitates the optimization of drilling operations, reducing unnecessary drilling and minimizing land disturbance. These capabilities are crucial for mining companies aiming to comply with environmental regulations, secure permits, and maintain their social license to operate. As sustainability becomes a central theme in the industry, the adoption of AI in mining exploration is set to accelerate further.
The ongoing digital transformation and the advent of Industry 4.0 technologies are also propelling the AI in Mining Exploration market forward. Mining companies are increasingly investing in smart mining solutions that integrate AI with the Internet of Things (IoT), cloud computing, and automation. This convergence allows for seamless data collection, real-time analytics, and predictive maintenance, ultimately leading to safer and more efficient exploration operations. The rising adoption of cloud-based AI platforms is making advanced analytics accessible to both large enterprises and small and medium-sized exploration firms, democratizing innovation across the industry. The proliferation of partnerships between technology providers and mining companies is further fostering the development and deployment of AI-driven exploration solutions.
Regionally, the market exhibits strong growth potential across all major geographies, with particular momentum in North America and Asia Pacific. North America leads in AI adoption due to its advanced mining infrastructure, significant investments in digital technologies, and a well-established ecosystem of technology providers. Meanwhile, Asia Pacific is witnessing rapid growth, driven by the region’s expanding mining sector, increasing government support for digitalization, and a surge in mineral exploration activities in countries like Australia, China, and India. Europe and Latin America are also emerging as key markets, benefiting from favorable regulatory environments and growing demand for sustainable mining practices. The Middle East & Africa, while still nascent, is expected to experience steady growth as mining companies in the region begin to embrace digital transformation.
The AI in Mining Exploration market by component i
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is part of our Data Structures (Machine Learning) course project at the French University in Armenia (UFAR) under the supervision of PhD Varazdat Avetisyan. The dataset was collected through web scraping and contains valuable insights into the Armenian real estate market, covering apartments, houses, and commercial properties.
👥 Contributors:
• Vahe Mirzoyan
• Arsen Martirosyan
• Arman Nagdalyan
📌 Data Collection Process:
• Scraping Tools Used: Selenium & BeautifulSoup in Google Colab (a minimal sketch follows the feature list below)
• Source: Real estate website (Armenia)
• Storage: Data was structured and stored in Google Sheets & CSV format
📊 Dataset Features:
The dataset includes the following columns:
• ID – Unique identifier for each property
• Address – Property location
• Floors – Total number of floors
• Rooms – Number of rooms
• Area (sq.m) – Total square meters of the property
• Bathrooms – Number of bathrooms
• Building Type – Old or new construction
• Price (USD) – Listed price of the property
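A minimal, hedged sketch of the Selenium + BeautifulSoup collection step described above; the URL and CSS selectors are placeholders rather than the actual site that was scraped.

```python
# Hedged sketch of the scraping step (placeholder URL and selectors).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

driver.get("https://example-real-estate-site.am/listings")   # placeholder URL
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

rows = []
for card in soup.select("div.listing-card"):                 # placeholder selector
    rows.append({
        "Address": card.select_one(".address").get_text(strip=True),
        "Rooms": card.select_one(".rooms").get_text(strip=True),
        "Area (sq.m)": card.select_one(".area").get_text(strip=True),
        "Price (USD)": card.select_one(".price").get_text(strip=True),
    })
```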
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exploratory data analysis.
https://dataintelo.com/privacy-and-policy
The global exploration services market size was valued at approximately USD 15 billion in 2023 and is projected to reach around USD 25 billion by 2032, growing at a compound annual growth rate (CAGR) of about 6%. This growth can be attributed to the increasing demand for natural resources, technological advancements in exploration techniques, and the rising focus on sustainable and efficient resource management.
The primary growth driver for the exploration services market is the escalating global demand for energy and minerals. With the world economy consistently expanding, there is a heightened need for oil, gas, and minerals to power industries and provide materials for manufacturing. Exploration services, including geophysical, geological, and geochemical services, play a critical role in identifying and assessing these essential resources. Additionally, the transition to renewable energy sources and the increased exploration of resources such as lithium for batteries underscore the market's importance.
Technological advancements represent another significant growth factor. Innovations in exploration technologies, including remote sensing, 3D seismic imaging, and machine learning algorithms, have revolutionized the way resources are discovered and evaluated. These advanced techniques enhance the accuracy and efficiency of exploration activities, reducing costs and minimizing environmental impact. As technology continues to evolve, it will further drive the growth of the exploration services market by improving the success rates of exploration projects.
Sustainability and environmental concerns are also fueling market growth. Governments and organizations worldwide are placing greater emphasis on sustainable practices and environmental stewardship. Exploration services companies are increasingly adopting eco-friendly methods and technologies to minimize the environmental impact of their activities. This shift toward sustainability is not only a regulatory requirement but also a market differentiator, appealing to investors and stakeholders who prioritize environmental responsibility.
Regionally, the exploration services market is witnessing varied growth patterns. North America remains a dominant player, driven by substantial investments in oil and gas exploration and the presence of major mining companies. Meanwhile, Asia Pacific is experiencing rapid growth due to increasing demand for minerals and energy resources in countries like China and India. Europe is focusing on sustainable exploration practices and technological advancements, while Latin America and the Middle East & Africa are capitalizing on their abundant natural resources.
The exploration services market is segmented by service type into geophysical services, geological services, geochemical services, drilling services, and others. Geophysical services, which include seismic surveys, magnetic and gravity surveys, and remote sensing, are essential for understanding subsurface conditions. These services provide critical data for identifying potential resource deposits and assessing their viability. The adoption of advanced technologies in geophysical services, such as 3D and 4D seismic imaging, has significantly enhanced the accuracy and efficiency of exploration activities, making this segment a key growth driver in the market.
Geological services, encompassing field mapping, sample collection, and analysis, are integral to the exploration process. These services provide valuable insights into the geological characteristics of an area, aiding in the identification of resource-rich zones. The increasing deployment of geological information systems (GIS) and other digital tools has streamlined geological data management and interpretation, further propelling the growth of this segment. Additionally, the demand for experienced geologists and advanced analytical techniques is on the rise, driven by the complexity of modern exploration projects.
Geochemical services, which involve the analysis of soil, rock, and water samples to detect the presence of minerals and hydrocarbons, are gaining prominence. Innovations in geochemical analysis, including the use of portable X-ray fluorescence (XRF) analyzers and mass spectrometry, have improved the speed and accuracy of these services. The growing focus on sustainable exploration practices is also driving the adoption of non-invasive geochemical methods, minimizing environmental impact while providing reliable data.
DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments. As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), index models needed to be developed to map values in geoscientific exploration datasets to favorability index values. This GDR submission includes those index models. Index models were created by binning the values in each exploration dataset into chunks based on their favorability and then assigning a number between 0 and 5 to each chunk, where 0 represents very unfavorable data values and 5 represents very favorable data values. To account for differences in how exploration methods are used to detect each play component, separate index models are produced for each exploration method for each component of each play type. Index models were created using histograms of the distributions of each exploration dataset, in combination with literature and input from experts about which combinations of geophysical, geological, and geochemical signatures are considered favorable at Newberry. This is an attempt to create similarly sized bins based on the current understanding of how different anomalies map to favorable areas for the different types of geothermal plays (i.e., conventional hydrothermal, superhot EGS, and supercritical). For example, an area of partial melt would likely appear as an area of low density, high conductivity, low vp, and high vp/vs, so these target anomalies would be given high index values (4 or 5) for the purpose of imaging the heat source. Index models were produced for the following datasets:
- Geologic model
- Alteration model
- vp/vs
- vp
- vs
- Temperature model
- Seismicity (density*magnitude)
- Density
- Resistivity
- Fault distance
- Earthquake cutoff depth model
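An illustrative sketch of what such an index model looks like in code: values from one exploration dataset are binned and each bin is mapped to a 0-5 favorability index. The bin edges and the favorable direction below are hypothetical, not the DEEPEN Newberry values.

```python
# Illustrative index model: bin a dataset's values into favorability indices 0-5.
import numpy as np

resistivity_ohm_m = np.array([3, 8, 15, 40, 120, 600])     # placeholder data values

# Edges chosen from the dataset's histogram and expert input (assumption).
bin_edges = [5, 10, 25, 75, 250]                           # 5 edges -> 6 bins
favorability = {0: 5, 1: 4, 2: 3, 3: 2, 4: 1, 5: 0}        # low resistivity assumed favorable

index_values = np.array([favorability[b] for b in np.digitize(resistivity_ohm_m, bin_edges)])
print(index_values)    # [5 4 3 2 1 0]
```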
This data set contains example data for exploring the theory of regression-based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II database in order to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
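A hedged sketch of the regression idea the example data supports: an ordinary least squares fit relating the 90th-percentile streamflow to basin characteristics. The explanatory variables and values are synthetic placeholders, not actual GAGES-II columns.

```python
# Illustrative regression-based regionalization sketch (synthetic data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
drainage_area = rng.lognormal(mean=5, sigma=1, size=293)        # placeholder basin attribute
mean_annual_precip = rng.normal(1000, 200, size=293)            # placeholder basin attribute
q90 = 0.8 * drainage_area + 0.05 * mean_annual_precip + rng.normal(0, 25, 293)

# Multiple linear regression of the response on basin characteristics.
X = sm.add_constant(np.column_stack([np.log(drainage_area), mean_annual_precip]))
fit = sm.OLS(q90, X).fit()
print(fit.summary())
```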
Cell maps are created by the USGS as a method for illustrating the degree of petroleum exploration, the type of production as indicated by final well status, the distribution of production, and the well density in a given area; in this case, covering the SaMiRa project area of interest. Each cell represents a quarter-mile square of the land surface, and the cells are coded to indicate whether the oil and gas wells within the cell are predominantly oil-producing, gas-producing, both oil- and gas-producing, dry, or of unknown production type. The well information was initially retrieved from IHS Energy Enerdeq, a proprietary, commercial database containing information for most oil and gas wells in the United States. Cells were developed as a graphic solution to overcome the problem of displaying proprietary well data; no proprietary data are displayed or included in this dataset. The data from IHS were current as of February 2016. This particular cell dataset was created for the Sagebrush Mineral Resource Assessment project (SaMiRa). Please see the purpose section for further information on SaMiRa.
The Geothermal Exploration Artificial Intelligence project uses machine learning to spot geothermal identifiers in land maps, with the goal of remotely detecting geothermal sites for energy uses, including finding locations for viable enhanced geothermal system (EGS) sites. This submission includes the appendices and reports formerly attached to the Geothermal Exploration Artificial Intelligence quarterly and final reports. The appendices below include methodologies, results, and some of the data used to train the Geothermal Exploration AI. The methodology reports explain how specific anomaly detection modes were selected for use with the Geothermal Exploration AI and how each detection mode is useful for finding geothermal sites; some methodology reports also include small amounts of code. The results explain the accuracy of the methods used for the selected sites (Brady, Desert Peak, and Salton Sea). Data from these detection modes can be found in some of the reports, such as the Mineral Markers Maps, but most of the raw data is included in the DOE database, which covers the Brady, Desert Peak, and Salton Sea geothermal sites.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
environmental management
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geochemical data are frequently collected from mineral exploration drill-hole samples to more accurately define and characterise the geological units intersected by the drill hole. However, large multi-element data sets are slow and challenging to interpret without using some form of automated analysis, such as mathematical, statistical or machine learning techniques. Automated analysis techniques also have the advantage in that they are repeatable and can provide consistent results, even for very large data sets. In this paper, an automated litho-geochemical interpretation workflow is demonstrated, which includes data exploration and data preparation using appropriate compositional data-analysis techniques. Multiscale analysis using a modified wavelet tessellation has been applied to the data to provide coherent geological domains. Unsupervised machine learning (clustering) has been used to provide a first-pass classification. The results are compared with the detailed geologist’s logs. The comparison shows how the integration of automated analysis of geochemical data can be used to enhance traditional geological logging and demonstrates the identification of new geological units from the automated litho-geochemical logging that were not apparent from visual logging but are geochemically distinct. To reduce computational complexity and facilitate interpretation, a subset of geochemical elements is selected, and then a centred log-ratio transform is applied. The wavelet tessellation method is used to domain the drill holes into rock units at a range of scales. Several clustering methods were tested to identify distinct rock units in the samples and multiscale domains for classification. Results are compared with geologist’s logs to assess how geochemical data analysis can inform and improve traditional geology logs.
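A compact, hedged sketch of two of the steps named here, a centred log-ratio (CLR) transform followed by a first-pass unsupervised classification; the assay matrix is a random placeholder and k-means stands in for whichever clustering methods were actually tested.

```python
# Illustrative sketch (not the paper's code): CLR transform of multi-element
# drill-hole geochemistry, then a first-pass unsupervised classification.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
ppm = rng.uniform(1, 1000, size=(500, 8))        # placeholder multi-element assays

# CLR: log of each part minus the log geometric mean of its composition.
clr = np.log(ppm) - np.log(ppm).mean(axis=1, keepdims=True)

# Unsupervised first-pass lithology classification.
clusters = KMeans(n_clusters=5, n_init=10, random_state=7).fit_predict(clr)
```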
DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments. As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), weights needed to be developed for use in the weighted sum of the different favorability index models produced from geoscientific exploration datasets. This GDR submission includes those weights. The weighting was done using two different approaches: one based on expert opinions, and one based on statistical learning. The weights are intended to describe how useful a particular exploration method is for imaging each component of each play type. They may be adjusted based on the characteristics of the resource under investigation, knowledge of the quality of the dataset, or simply to reduce the impact a single dataset has on the resulting outputs. Within the DEEPEN PFA, separate sets of weights are produced for each component of each play type, since exploration methods hold different levels of importance for detecting each play component within each play type. The weights for conventional hydrothermal systems were based on the average of the normalized weights used in the DOE-funded PFA projects that focused on magmatic plays; this decision was made because conventional hydrothermal plays are already well studied and understood, so it is logical to use existing weights where possible. In contrast, a true PFA has never been applied to superhot EGS or supercritical plays, meaning that exploration methods have never been weighted in terms of their utility in imaging the components of these plays. To produce weights for superhot EGS and supercritical plays, two different approaches were used: one based on expert opinion and the analytical hierarchy process (AHP), and another using a statistical approach based on principal component analysis (PCA). The intent is to provide standardized sets of weights for each play type in all magmatic geothermal systems, and the two approaches were used to investigate whether a more data-centric approach might allow new insights into the datasets and to analyze how different weighting approaches affect the outcomes. The expert/AHP approach used an online tool (https://bpmsg.com/ahp/) with built-in forms to make pairwise comparisons that rank exploration methods against one another; the inputs are then combined quantitatively, ultimately producing a set of consensus-based weights. To minimize the burden on each individual participant, the forms were completed in group discussions. While the group setting means that some opinions may outweigh others, it also provides a venue for conversation, in theory leading the group to a more robust consensus than can be achieved on an individual basis. This exercise was done with two separate groups, one consisting of U.S.-based experts and one consisting of Iceland-based experts in magmatic geothermal systems, and the two sets of weights were averaged to produce what we refer to as the "expert opinion-based weights," or "expert weights" for short. While expert opinions allow us to include more nuanced information in the weights, they are subject to human bias; data-centric or statistical approaches help to overcome these potential biases by focusing on, and drawing conclusions from, the data alone.
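For clarity, a minimal sketch of how such weights enter the weighted sum of favorability index models; the grids and weight values below are hypothetical, not the DEEPEN expert or statistical weights.

```python
# Minimal sketch: combine per-method favorability index models (values 0-5)
# into a single favorability map using a weighted sum.
import numpy as np

index_models = {                       # per-method index grids (placeholder 2x2 maps)
    "resistivity": np.array([[5, 4], [2, 1]]),
    "vp_vs":       np.array([[3, 5], [1, 0]]),
    "temperature": np.array([[4, 4], [3, 2]]),
}
weights = {"resistivity": 0.5, "vp_vs": 0.3, "temperature": 0.2}   # hypothetical, sum to 1

favorability = sum(weights[name] * grid for name, grid in index_models.items())
print(favorability)
```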
More information on this approach along with the dataset used to produce the statistical weights may be found in the linked dataset below.
https://creativecommons.org/publicdomain/zero/1.0/
The data collection process commenced with web scraping of a selected higher education institution's website, collecting any data related to the admission topic of higher education institutions, during the period from July to September 2023. This resulted in a raw dataset primarily centered around admission-related content. Subsequently, meticulous data cleaning and organization procedures were implemented to refine the dataset. The primary data, in its raw form before annotation into a question-and-answer format, was predominantly in the Indonesian language. Following this, a comprehensive annotation process was conducted to enrich the dataset with specific admission-related information, transforming it into secondary data. Both primary and secondary data predominantly remained in Indonesian. To enhance data quality, we added filters to remove or exclude: 1) data not in the Indonesian language, 2) data unrelated to the admission topic, and 3) redundant entries. This curation culminated in a finalized dataset that is now readily available for research and analysis in the domain of higher education admission.
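A hedged sketch of the three filters described above; the langdetect check and the keyword list are illustrative assumptions (language detection can be unreliable on very short strings), not the project's actual implementation.

```python
# Illustrative curation filters: language, topic relevance, and deduplication.
import pandas as pd
from langdetect import detect

raw = pd.DataFrame({"text": ["Bagaimana cara mendaftar?", "How do I apply?",
                             "Bagaimana cara mendaftar?", "Jadwal wisuda 2023"]})

ADMISSION_KEYWORDS = ("daftar", "pendaftaran", "admisi", "seleksi")   # hypothetical keyword list

def keep(text: str) -> bool:
    is_indonesian = detect(text) == "id"                              # 1) language filter
    on_topic = any(k in text.lower() for k in ADMISSION_KEYWORDS)     # 2) topic filter
    return is_indonesian and on_topic

curated = raw[raw["text"].apply(keep)].drop_duplicates()              # 3) remove redundant entries
```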
Computer Science, Education, Marketing, Natural Language Processing
Emny Yossy,Derwin Suhartono,Agung Trisetyarso,Widodo Budiharto
Data Source: Mendeley Data