One of the key problems that arises in many areas is to estimate a potentially nonlinear function [tex]G(x, \theta)[/tex] given input and output samples [tex](x, y)[/tex] so that [tex]y \approx G(x, \theta)[/tex]. There are many approaches to this regression problem: neural networks, regression trees, and many other methods have been developed to estimate [tex]G[/tex] from the input-output pairs. One method that I have worked with is called Gaussian process regression. There are many good texts and papers on the subject; for more technical information on the method and its applications see: http://www.gaussianprocess.org/ A key problem that arises in developing these models on very large data sets is that training requires an [tex]O(N^3)[/tex] computation, where N is the number of data points in the training sample. Obviously this becomes very problematic when N is large. I discussed this problem with Leslie Foster, a mathematics professor at San Jose State University. He, along with some of his students, developed a method to address this problem based on Cholesky decomposition and pivoting. He also shows that this leads to a numerically stable result. If you're interested in some light reading, I'd suggest you take a look at his recent paper (which was accepted in the Journal of Machine Learning Research) posted on dashlink. We've also posted code for you to try it out. Let us know how it goes. If you are interested in applications of this method in the area of prognostics, check out our new paper on the subject, which was published in IEEE Transactions on Systems, Man, and Cybernetics.
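To make the cost concrete, here is a minimal NumPy sketch of standard Gaussian process regression via a Cholesky factorization of the kernel matrix, which is where the O(N^3) term comes from. It is a generic illustration with an arbitrary squared-exponential kernel and made-up data, not Foster's pivoted low-rank method.

```python
# Minimal sketch of Gaussian process regression using a Cholesky factorization
# of the kernel matrix. The O(N^3) cost discussed above comes from this
# factorization. Kernel choice and hyperparameters are illustrative only.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel between rows of A and rows of B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal_var * np.exp(-0.5 * sq_dists / length_scale**2)

def gp_predict(X_train, y_train, X_test, noise_var=1e-2):
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    L = np.linalg.cholesky(K)                      # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    K_s = rbf_kernel(X_train, X_test)
    mean = K_s.T @ alpha                           # predictive mean
    v = np.linalg.solve(L, K_s)
    cov = rbf_kernel(X_test, X_test) - v.T @ v     # predictive covariance
    return mean, np.diag(cov)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
mean, var = gp_predict(X, y, X_new)
print(mean, var)
```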
The OECD has initiated PISA for Development (PISA-D) in response to the rising need of developing countries to collect data about their education systems and the capacity of their student bodies. This report aims to compare and contrast approaches regarding the instruments that are used to collect data on (a) component skills and cognitive instruments, (b) contextual frameworks, and (c) the implementation of the different international assessments, as well as approaches to include children who are not at school, and the ways in which data are used. It then seeks to identify assessment practices in these three areas that will be useful for developing countries. This report reviews the major international and regional large-scale educational assessments: large-scale international surveys, school-based surveys and household-based surveys. For each of the issues discussed, there is a description of the prevailing international situation, followed by a consideration of the issue for developing countries and then a description of the relevance of the issue to PISA for Development.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. The formulation of this linear support vector machine performs excellently when applied to high-dimensional sparse feature vectors. An additional advantage is the average linear complexity of a prediction in the number of non-zero features. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted extensive benchmarking to evaluate the performance on large-scale problems up to a size of 175,000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach; LIBLINEAR outperformed these reference approaches in a direct comparison. A comparison to literature results showed that the LIBLINEAR performance is competitive, although it does not reach the results of the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches.
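To illustrate the general workflow, and assuming scikit-learn's LinearSVC as a stand-in interface to LIBLINEAR, here is a minimal sketch that trains a linear SVM on a synthetic, unbalanced, sparse problem and ranks samples by decision value for a ROC AUC score; the chemotype-weighted metrics, BEDROC score, and nested cross-validation from the study are not reproduced.

```python
# Minimal sketch: train a LIBLINEAR-backed linear SVM on an unbalanced,
# sparse binary classification problem and evaluate with ROC AUC.
# Synthetic data; the chemotype-weighted metrics and BEDROC score used
# in the study are not reproduced here.
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=20000, n_features=1000, n_informative=50,
                           weights=[0.98, 0.02], random_state=0)
X = sparse.csr_matrix(X)  # high-dimensional sparse feature vectors

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LinearSVC(C=1.0, class_weight="balanced", dual=True, max_iter=5000)
clf.fit(X_tr, y_tr)

scores = clf.decision_function(X_te)  # ranking scores, as used in virtual screening
print("ROC AUC:", roc_auc_score(y_te, scores))
```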
In this paper, we investigate the use of Bayesian networks to construct large-scale diagnostic systems. In particular, we consider the development of large-scale Bayesian networks by composition. This compositional approach reflects how (often redundant) subsystems are architected to form systems such as electrical power systems. We develop high-level specifications, Bayesian networks, clique trees, and arithmetic circuits representing 24 different electrical power systems. The largest among these 24 Bayesian networks contains over 1,000 random variables. Another BN represents the real-world electrical power system ADAPT, which is representative of electrical power systems deployed in aerospace vehicles. In addition to demonstrating the scalability of the compositional approach, we briefly report on experimental results from the diagnostic competition DXC, where the ProADAPT team, using techniques discussed here, obtained the highest scores in both Tier 1 (among 9 international competitors) and Tier 2 (among 6 international competitors) of the industrial track. While we consider diagnosis of power systems specifically, we believe this work is relevant to other system health management problems, in particular in dependable systems such as aircraft and spacecraft. Reference: O. J. Mengshoel, S. Poll, and T. Kurtoglu. "Developing Large-Scale Bayesian Networks by Composition: Fault Diagnosis of Electrical Power Systems in Aircraft and Spacecraft." Proc. of the IJCAI-09 Workshop on Self-* and Autonomous Systems (SAS): Reasoning and Integration Challenges, 2009. BibTeX Reference: @inproceedings{mengshoel09developing, title = {Developing Large-Scale {Bayesian} Networks by Composition: Fault Diagnosis of Electrical Power Systems in Aircraft and Spacecraft}, author = {Mengshoel, O. J. and Poll, S. and Kurtoglu, T.}, booktitle = {Proc. of the IJCAI-09 Workshop on Self-$\star$ and Autonomous Systems (SAS): Reasoning and Integration Challenges}, year = {2009} }
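To make the compositional idea concrete, the following plain-Python sketch merges hypothetical subsystem fragments (nodes plus directed edges) into a single system-level network structure; the fragment names and variables are illustrative and this is not the high-level specification language used in the paper.

```python
# Hypothetical sketch of building a system-level Bayesian network structure
# by composing subsystem fragments, in the spirit of the compositional
# approach described above. Node names and fragments are illustrative only.
def compose(fragments):
    """Merge subsystem fragments {name: (nodes, edges)} into one DAG spec."""
    nodes, edges = set(), set()
    for _name, (frag_nodes, frag_edges) in fragments.items():
        nodes.update(frag_nodes)
        edges.update(frag_edges)
    return {"nodes": sorted(nodes), "edges": sorted(edges)}

# Two (redundant) power subsystems sharing a common bus variable.
battery_a = ({"BatteryA", "RelayA", "Bus"}, {("BatteryA", "RelayA"), ("RelayA", "Bus")})
battery_b = ({"BatteryB", "RelayB", "Bus"}, {("BatteryB", "RelayB"), ("RelayB", "Bus")})
sensor    = ({"Bus", "VoltageSensor"}, {("Bus", "VoltageSensor")})

system_bn = compose({"A": battery_a, "B": battery_b, "sensor": sensor})
print(system_bn["nodes"])
print(system_bn["edges"])
```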
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This experiment is motivated by the need to preprocess large-scale CAD models for assembly-by-disassembly approaches. Assembly-by-disassembly is only suitable for assemblies with a small number of parts (n_{parts} < 22). When dealing with large-scale products of high complexity, however, the CAD models may not contain feasible subassemblies (e.g. with connected and interference-free parts) and have too many parts to be processed with assembly-by-disassembly. Product designers' preferences during the design phase might not be ideal for assembly-by-disassembly processing because they do not explicitly consider subassembly feasibility or the number of parts per subassembly. An automated preprocessing approach is proposed to address this issue by splitting the model into manageable partitions using community detection (see the sketch after Hypothesis 2 below). This will allow for parallelised, efficient and accurate assembly-by-disassembly of large-scale CAD models. However, applying community detection methods to automatically split CAD models into smaller subassemblies is a new concept, and research on its suitability for ASP needs to be conducted. Therefore, the following underlying research question will be answered in these experiments:
Underlying research question 2: Can automated preprocessing increase the suitability of CAD-based assembly-by-disassembly for large-scale products?
A hypothesis is formulated to answer this research question, which will be utilised to design experiments for hypothesis testing.
Hypothesis 2: Community detection algorithms can be applied to automatically split large-scale assemblies into suitable candidates for CAD-based AND/OR graph generation.
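The sketch below illustrates, under simplifying assumptions, how community detection could partition a part-connectivity (liaison) graph into candidate subassemblies using NetworkX's greedy modularity algorithm; the parts and contact relations are invented and no CAD geometry is parsed.

```python
# Hypothetical sketch: split a part-connectivity (liaison) graph of a large
# assembly into candidate subassemblies via community detection.
# The parts and contacts below are invented; no CAD model is parsed.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

liaison_graph = nx.Graph()
contacts = [
    ("bolt1", "bracket"), ("bracket", "frame"), ("frame", "panel"),
    ("panel", "hinge"), ("hinge", "door"), ("door", "handle"),
    ("frame", "axle"), ("axle", "wheel1"), ("axle", "wheel2"),
]
liaison_graph.add_edges_from(contacts)

# Each detected community becomes a candidate subassembly small enough for
# assembly-by-disassembly / AND-OR graph generation.
communities = greedy_modularity_communities(liaison_graph)
for i, parts in enumerate(communities):
    print(f"candidate subassembly {i}: {sorted(parts)}")
```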
The high performance computing (HPC) and big data (BD) communities traditionally have pursued independent trajectories in the world of computational science. HPC has been synonymous with modeling and simulation, and BD with ingesting and analyzing data from diverse sources, including from simulations. However, both communities are evolving in response to changing user needs and technological landscapes. Researchers are increasingly using machine learning (ML) not only for data analytics but also for modeling and simulation; science-based simulations are increasingly relying on embedded ML models not only to interpret results from massive data outputs but also to steer computations. Science-based models are being combined with data-driven models to represent complex systems and phenomena. There also is an increasing need for real-time data analytics, which requires large-scale computations to be performed closer to the data and data infrastructures, to adapt to HPC-like modes of operation. These new use cases create a vital need for HPC and BD systems to deal with simulations and data analytics in a more unified fashion. To explore this need, the NITRD Big Data and High-End Computing R&D Interagency Working Groups held a workshop, The Convergence of High-Performance Computing, Big Data, and Machine Learning, on October 29-30, 2018, in Bethesda, Maryland. The purposes of the workshop were to bring together representatives from the public, private, and academic sectors to share their knowledge and insights on integrating HPC, BD, and ML systems and approaches and to identify key research challenges and opportunities. The 58 workshop participants represented a balanced cross-section of stakeholders involved in or impacted by this area of research. Additional workshop information, including a webcast, is available at https://www.nitrd.gov/nitrdgroups/index.php?title=HPC-BD-Convergence.
https://www.technavio.com/content/privacy-notice
The cloud-based project portfolio management market share is expected to increase by USD 4.83 billion from 2020 to 2025, and the market’s growth momentum will accelerate at a CAGR of 18.26%.
This cloud-based project portfolio management market research report provides valuable insights on the post COVID-19 impact on the market, which will help companies evaluate their business approaches. Furthermore, this report extensively covers cloud-based project portfolio management market segmentations by end user (manufacturing, ICT, healthcare, BFSI, and others) and geography (North America, Europe, APAC, MEA, and South America). The cloud-based project portfolio management market report also offers information on several market vendors, including Atlassian Corp. Plc, Broadcom Inc., Mavenlink Inc., Micro Focus International Plc, Microsoft Corp., Oracle Corp., Planview Inc., SAP SE, ServiceNow Inc., and Upland Software, Inc. among others.
What will the Cloud-based Project Portfolio Management Market Size be During the Forecast Period?
Download the Free Report Sample to Unlock the Cloud-based Project Portfolio Management Market Size for the Forecast Period and Other Important Statistics
Cloud-based Project Portfolio Management Market: Key Drivers, Trends, and Challenges
The increasing requirements for large-scale project portfolio management are notably driving the cloud-based project portfolio management market growth, although factors such as challenges from open-source platforms may impede market growth. Our research analysts have studied the historical data and deduced the key market drivers and the COVID-19 pandemic impact on the cloud-based project portfolio management industry. The holistic analysis of the drivers will help in deducing end goals and refining marketing strategies to gain a competitive edge.
Key Cloud-based Project Portfolio Management Market Driver
The increasing requirements for large-scale project portfolio management are a major factor driving the global cloud-based project portfolio management market share growth. Currently, organizations are focusing on cultivating and managing the resources necessary for efficient product outputs, which increases the requirements for efficient solutions for large-scale project portfolio management. The primary purpose of cloud-based project portfolio management software is to automate processes to ensure maximum output by managing resources and maintaining regular follow-up. The main benefit of employing cloud-based project portfolio management software in large-scale project portfolio management is that automated services increase connectivity so that organizations can handle project-related inquiries easily and effectively. Also, automation decreases response time and increases productivity, which ensures efficient process management. Additionally, by using cloud-based project portfolio management software, revenue possibilities can be rapidly increased by calculating conversion ratios and running reports to track metrics as per customer demand. These features decrease operating time. Due to such reasons, the demand for the market will grow significantly during the forecast period.
Key Cloud-based Project Portfolio Management Market Trend
The interlinking of software with project portfolio management is another factor supporting the global cloud-based project portfolio management market share growth. Since the demand for project portfolio management software is rising in the market, stakeholders in several businesses are demanding new features in the software to increase their productivity. One of the main trends identified in the global cloud-based project portfolio management market is the interlinking of multiple software products to match the requirements of the business. Currently, cloud-based project portfolio management software is deployed by several enterprises to give people access to documents, data, and reports from multiple devices at multiple locations. With all the data accessible centrally by numerous users, the accountability of the system will increase, which will provide enterprises with an instant overview of what everyone is working on. Additionally, interlinked project portfolio management software will enable users to update data in real time and will end the complication of sending endless email attachments of the same document. Moreover, the implementation of cloud-based project portfolio management will enhance the company's assurance of up-to-date data. Therefore, all such factors will contribute to the growth of the market.
Key Cloud-based Project Portfolio Management Market Challenge
The rising challenges from open-source platforms will be a major challenge for the global cloud-based project portfolio management market share growth during the forecast period. With the rising demand for digitalization in the current market s
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set contains the coordinates of the plots in the thesis "Solving Large-Scale Dynamic Collaborative Vehicle Routing Problems - An Auction-Based Multi-Agent Approach" by Johan Los. It represents the results of various computational experiments in collaborative vehicle routing that were conducted to investigate to what extent an auction-based multi-agent system can be applied to solve dynamic large-scale collaborative vehicle routing problems. The data set indicates, among other things, the value of information sharing, the profits that can be obtained through cooperation under different circumstances, and the individual profits that can be obtained when strategic bidding is applied.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Geodetic imaging is revolutionizing geophysics, but the scope of discovery has been limited by labor-intensive technological implementation of the analyses. The Advanced Rapid Imaging and Analysis (ARIA) project has proven capability to automate SAR image analysis, having processed thousands of COSMO-SkyMed (CSK) scenes collected over California in the last year as part of a JPL/Caltech collaboration with the Italian Space Agency (ASI). The successful analysis of large volumes of SAR data has brought to the forefront the need for analytical tools for SAR quality assessment (QA) on large volumes of images, a critical step before higher-level time series and velocity products can be reliably generated. While single interferograms are useful for imaging episodic events such as earthquakes, in order to fully exploit the tsunami of SAR imagery that will be generated by current and future missions, we need to develop more agile and flexible methods for evaluating interferograms and coherence maps.
Our AIST-2011 Advanced Rapid Imaging & Analysis for Monitoring Hazards (ARIA-MH) data system has been providing data products to researchers working on a variety of earth science problems including glacial dynamics, tectonics, volcano dynamics, landslides and disaster response. A data system with agile analytics capability could reduce the amount of time researchers currently spend on analysis, quality assessment, and re-analysis of interferograms and time series analysis from months to hours. A key stage in analytics for SAR is the quality assessment stage, which is a necessary step before researchers can reliably use results for their interpretations and models, and we propose to develop machine learning tools to enable more automated quality assessment of complex imagery like interferograms, which will in turn enable greater science return by expanding the amount of data that can be applied to research problems.
Objectives: We will develop an advanced hybrid-cloud computing science data system for easily performing massive-scale analytics of geodetic data products, improving the quality of the InSAR and GPS products that are used for disaster monitoring and response. We will focus our analysis on Big Data-scale analytics that are needed to quickly and efficiently assess the quality of the increasing collections of geodetic data products being generated by existing and future missions.
Technology Innovations: Science is an iterative process that requires repeated exploration of the data through various what-if scenarios. By enabling faster turn-around of analytics and analysis processing of the increasing amount of geodetic data, we will enable new science that cannot currently be done. We will adapt machine learning approaches to quality assessment in order to improve the quality of geodetic data products. Furthermore, analytics such as assessing coherence measures of the InSAR data will be used to improve the quality of the data products that are already being used for disaster response. We will develop new approaches enabling users to quickly develop, deploy, run, and analyze their own custom analysis code across entire InSAR and GPS collections.
Expected Significance: To improve the impact of our generated data products for both the science and monitoring user communities, quality assessment (QA) techniques and metrics are needed to automatically analyze PB-scale data volumes and identify both problems and changes in the deformation and coherence time series. Automated QA techniques are currently underdeveloped within the InSAR analysis community, but they have already become strategically important for supporting the expected high data volumes of upcoming missions such as Sentinel, ALOS-2, and NASA-ISRO SAR (NISAR), and for delivering high-quality science and applications. The science data system technology will also enable NASA to support the high data volume needs of NISAR in addition to the analysis of the data products.
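As a rough illustration of the kind of features an automated QA step might derive from a coherence map, here is a hedged NumPy sketch that computes a few scalar statistics and applies a simple threshold rule; the thresholds and features are placeholders, not the machine learning approach proposed here.

```python
# Illustrative sketch only: summarize an interferogram coherence map with a
# few scalar features and flag low-quality scenes with a simple rule.
# Thresholds and features are placeholders, not the proposed ML-based QA.
import numpy as np

def coherence_features(coh):
    """coh: 2-D array of coherence values in [0, 1] (NaN = no data)."""
    valid = coh[np.isfinite(coh)]
    return {
        "mean_coherence": float(valid.mean()),
        "frac_low": float((valid < 0.3).mean()),   # fraction of decorrelated pixels
        "frac_valid": float(valid.size / coh.size),
    }

def passes_qa(features, min_mean=0.4, max_low=0.5):
    return features["mean_coherence"] >= min_mean and features["frac_low"] <= max_low

rng = np.random.default_rng(1)
coh_map = np.clip(rng.beta(2, 2, size=(512, 512)), 0, 1)  # stand-in coherence map
feats = coherence_features(coh_map)
print(feats, "->", "pass" if passes_qa(feats) else "flag for review")
```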
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset and replication package of the study "A continuous open source data collection platform for architectural technical debt assessment".
Abstract
Architectural decisions are the most important source of technical debt. In recent years, researchers have spent an increasing amount of effort investigating this specific category of technical debt, with quantitative methods, and in particular static analysis, being the most common approach.
However, quantitative studies are susceptible, to varying degrees, to external validity threats, which hinder the generalisation of their findings.
In response to this concern, researchers strive to expand the scope of their study by incorporating a larger number of projects into their analyses. This practice is typically executed on a case-by-case basis, necessitating substantial data collection efforts that have to be repeated for each new study.
To address this issue, this paper presents our initial attempt at enabling researchers to study architectural smells, a well-known indicator of architectural technical debt, at large scale. Specifically, we introduce a novel data collection pipeline that leverages Apache Airflow to continuously generate up-to-date, large-scale datasets using Arcan, a tool for architectural smell detection (or any other tool).
Finally, we present the publicly available dataset resulting from the first three months of execution of the pipeline, which includes over 30,000 analysed commits and releases from over 10,000 open source GitHub projects written in 5 different programming languages and amounting to over a billion lines of code analysed.
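As a rough illustration of how such a continuously scheduled pipeline can be expressed, here is a hedged sketch using the Airflow 2.4+ TaskFlow API with dynamic task mapping; the project list, the analysis step, and the storage step are hypothetical placeholders and do not reflect the actual DAGs or the Arcan invocation used by the authors.

```python
# Hypothetical sketch of a continuously scheduled collection DAG using the
# Airflow 2.4+ TaskFlow API. The project list, analyse() body, and storage
# step are placeholders, not the actual pipeline described in the paper.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False, tags=["atd"])
def arcan_collection():

    @task
    def list_projects() -> list[str]:
        # Placeholder: in practice this would query GitHub for candidate repos.
        return ["org/project-a", "org/project-b"]

    @task
    def analyse(repo: str) -> dict:
        # Placeholder for invoking an architectural-smell analysis on `repo`
        # and parsing its report.
        return {"repo": repo, "smells": 0}

    @task
    def store(results: list[dict]) -> None:
        # Placeholder: append results to the published dataset.
        print(results)

    store(analyse.expand(repo=list_projects()))

arcan_collection()
```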
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Two classical multivariate statistical problems, testing of multivariate normality and the k-sample problem, are explored by a novel analysis on several resolutions simultaneously. The presented methods do not invert any estimated covariance matrix. Thereby, the methods work in the High Dimension Low Sample Size situation, i.e. when n ≤ p. The output, a significance map, is produced by doing a one-dimensional test for all possible resolution/position pairs. The significance map shows for which resolution/position pairs the null hypothesis is rejected. For the testing of multinormality, the Anderson-Darling test is utilized to detect potential departures from multinormality at different combinations of resolutions and positions. In the k-sample case, it is tested whether k data sets can be said to originate from the same unspecified discrete or continuous multivariate distribution. This is done by testing the k vectors corresponding to the same resolution/position pair of the k different data sets through the k-sample Anderson-Darling test. Successful demonstrations of the new methodology on artificial and real data sets are presented, and a feature selection scheme is demonstrated.
https://dataintelo.com/privacy-and-policy
The global market size for Big Data Analytics in the BFSI sector was valued at approximately USD 20 billion in 2023 and is expected to reach nearly USD 60 billion by 2032, growing at a robust CAGR of 12.5% during the forecast period. This significant growth can be attributed to the increasing adoption of advanced data analytics techniques in the banking, financial services, and insurance (BFSI) sector to enhance decision-making processes, optimize operations, and improve customer experiences.
One of the primary growth factors for the Big Data Analytics market in the BFSI sector is the growing need for risk management and fraud detection. Financial institutions are increasingly harnessing big data analytics to detect anomalies and patterns that could indicate fraudulent activities, thereby protecting themselves and their customers from significant financial losses. With cyber threats becoming more sophisticated, the demand for advanced analytics solutions that can provide real-time insights and predictive analytics is on the rise.
Another critical driver of market growth is the increasing regulatory requirements and compliance standards that financial institutions must adhere to. Governments and regulatory bodies worldwide are imposing stricter regulations to ensure the stability and security of financial systems. Big data analytics solutions help organizations ensure compliance with these regulations by providing comprehensive data analysis and reporting capabilities, which can identify potential compliance issues before they become critical problems.
Customer analytics is also a significant growth factor, as financial institutions strive to understand their customers better and offer personalized services. By leveraging big data analytics, banks and insurers can analyze customer behavior, preferences, and transaction history to develop tailored products and services, thereby enhancing customer satisfaction and loyalty. This customer-centric approach not only helps in retaining existing customers but also attracts new ones, further driving market growth.
Regionally, North America holds the largest market share due to the early adoption of advanced technologies and the presence of major financial institutions that are keen on investing in big data analytics solutions. The region's strong technological infrastructure and supportive regulatory environment also contribute to market growth. Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by the rapid digital transformation in emerging economies such as China and India, and increasing investments in big data analytics by regional BFSI players.
The Big Data Analytics market in the BFSI sector can be segmented by components into software and services. The software segment encompasses various analytics tools and platforms that enable financial institutions to collect, process, and analyze large volumes of data. This segment is expected to witness substantial growth owing to the increasing demand for sophisticated analytics software that can handle the complexity and scale of financial data.
Within the software segment, solutions for data visualization, predictive analytics, and machine learning are gaining significant traction. These technologies empower organizations to uncover hidden patterns, predict future trends, and make data-driven decisions. For instance, predictive analytics can help banks forecast credit risk and optimize loan portfolios, while machine learning algorithms can enhance fraud detection systems by identifying unusual transaction patterns.
The services segment includes consulting, implementation, and maintenance services offered by vendors to help BFSI institutions effectively deploy and manage big data analytics solutions. As the adoption of big data analytics grows, the demand for professional services to support the implementation and ongoing management of these solutions is also expected to rise. Consulting services are particularly important as they enable financial institutions to develop tailored analytics strategies that align with their specific business goals and regulatory requirements.
Furthermore, managed services are becoming increasingly popular, as they allow organizations to outsource the management of their analytics infrastructure to specialized vendors. This not only reduces the burden on internal IT teams but also ensures that the analytics systems are maintained and updated regularly to
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Seven problems from previous literature are employed to test the performance of the proposed approach and to show its advantages by comparing the results with existing approaches. The problems are arranged from small to large scale. Their incidence matrices, data sets, and solutions can be found in the dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bug datasets play a vital role in advancing software engineering tasks, including bug detection, fault localization, and automated program repair. These datasets enable the development of more accurate algorithms, facilitate efficient fault identification, and drive the creation of reliable automated repair tools. However, the manual collection and curation of such data are labor-intensive and prone to inconsistency, which limits scalability and reliability. Current datasets often fail to provide detailed and accurate information, particularly regarding bug types, descriptions, and classifications, reducing their utility in diverse research and practical applications. To address these challenges, we introduce BugCatcher, a comprehensive approach for constructing large-scale, high-quality bug datasets. BugCatcher begins by enhancing PR-Issue linking mechanisms, extending data collection to 12 programming languages over a decade, and ensuring accurate linkage between pull requests and issues. It employs a two-stage filtering process, BugCurator, to refine data quality, and utilizes large language models with Zero-shot Chain-of-Thought prompting to generate precise bug types and detailed descriptions. Furthermore, BugCatcher incorporates a robust classification framework, fine-tuning models for improved categorization. The resulting dataset, BugCatcher-Data, includes 243,265 bug-fix entries with comprehensive fields such as code diffs, bug locations, detailed descriptions, and classifications, serving as a substantial resource for advancing software engineering research and practices.
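As a hedged illustration of zero-shot chain-of-thought prompting for bug typing, the sketch below formats a prompt for a bug-fix diff and sends it through an OpenAI-compatible chat client; the prompt wording, taxonomy, and model name are illustrative assumptions and are not the prompts or models used by BugCatcher.

```python
# Hypothetical sketch of zero-shot chain-of-thought prompting to assign a bug
# type and short description to a bug-fix diff. Prompt wording, taxonomy, and
# model name are illustrative; they are not the BugCatcher prompts.
from openai import OpenAI

PROMPT = """You are given a bug-fix code diff and the linked issue title.
Think step by step about what the defect was, then answer with:
BUG_TYPE: <one of: logic error, null/None handling, concurrency, resource leak, other>
DESCRIPTION: <one sentence describing the bug>

Issue title: {title}
Diff:
{diff}
"""

def classify_bug_fix(title: str, diff: str, model: str = "gpt-4o-mini") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(title=title, diff=diff)}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(classify_bug_fix("Crash on empty list",
                           "- return items[0]\n+ return items[0] if items else None"))
```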
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The "ensemble-first" strategy, while a popular heuristic for tabular regression, lacks a formal framework and fails on specific data challenges. This thesis introduces the Efficiency-Based Model Selection Framework (EMSF), a new methodology that aligns model architecture with a dataset's primary structural challenge. We benchmarked over 20 models across 100 real-world datasets, categorized into four novel cohorts: high row-to-size (computational efficiency), wide data (parameter efficiency), and messy data (data efficiency). This large-scale empirical study establishes three fundamental laws of applied regression. The Law of Ensemble Dominance confirms that ensembles are the most efficient choice in over 70% of standard cases. The Law of Anomaly Supremacy proves the critical exceptions: we provide the first large-scale evidence that K-Nearest Neighbors (KNN) excels on high-dimensional data, and that robust models like the Huber Regressor are "silver bullet" solutions for datasets with hidden outliers, winning with performance margins exceeding 1500%. Finally, the Law of Predictive Futility reframes benchmarking as a diagnostic tool for identifying datasets that lack predictive signal. The EMSF provides a practical, evidence-based playbook for practitioners to move beyond a one-size-fits-all approach to model selection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Computational hydrology and real-world decision-making increasingly rely on simulation-based, multi-scenario analyses. Enabling scientists to align their research with national-scale efforts is necessary to facilitate knowledge transfer and sharing between operational applications and those focused on local or regional water issues. Leveraging existing large-domain datasets with new and innovative modeling practices is vital for improving operational prediction systems. The scale of these large-domain datasets presents significant challenges when applying them at smaller spatial scales, specifically in data collection, pre-processing, post-processing, and reproducibly disseminating findings. Given these challenges, we propose a cloud-based data processing and modeling pipeline, leveraging existing open source tools and cloud technologies, to support common hydrologic data analysis and modeling procedures. Through this work we establish a scalable and flexible pattern for enabling efficient data processing and modeling in the cloud using workflows containing both publicly accessible and privately maintained cloud stores. By leveraging modern cloud computing technologies such as Kubernetes, Dask, Argo, and Analysis Ready Cloud Optimized data, we establish a computationally scalable solution that can be deployed for specific scientific studies, research projects, or communities. We present an approach for applying large-domain meteorological and hydrologic modeling datasets to local and regional applications using the NOAA National Water Model, the NOAA NextGen Hydrological Modeling Framework, and ParFlow. We discuss how this approach can be used to advance our collective understanding of hydrologic processes, create reusable workflows, and operate on large-scale data in the cloud.
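As a hedged sketch of the chunked, lazy access pattern such a pipeline relies on, the example below writes a small synthetic forcing cube to a local Zarr store, re-opens it lazily with xarray and Dask, and reduces it to a regional daily mean; in the real workflow the store would be an analysis-ready cloud-optimized dataset in object storage and the cluster would be Kubernetes-backed, and all names here are illustrative.

```python
# Hedged sketch of the lazy, chunked access pattern used in such pipelines:
# write a small synthetic "precipitation" cube to a local Zarr store, re-open
# it lazily with xarray + Dask, and compute a regional daily mean. Variable
# names, coordinates, and the store path are illustrative only.
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2023-01-01", periods=24 * 10, freq="h")
lat = np.linspace(35.0, 45.0, 50)
lon = np.linspace(-110.0, -100.0, 50)
ds = xr.Dataset(
    {"precipitation": (("time", "lat", "lon"),
                       np.random.default_rng(0).random((len(time), 50, 50)))},
    coords={"time": time, "lat": lat, "lon": lon},
)
ds.chunk({"time": 24}).to_zarr("forcing.zarr", mode="w")

# Lazily re-open the store and reduce a sub-region without loading the full cube.
lazy = xr.open_zarr("forcing.zarr")
subset = lazy["precipitation"].sel(lat=slice(39.0, 41.0), lon=slice(-106.0, -104.0))
daily_mean = subset.resample(time="1D").mean().compute()
print(daily_mean)
```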
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The forest height mosaics for the northeastern parts of China and the U.S. are generated with a global-to-local inversion approach proposed in (Yu et al., 2023), making use of spaceborne repeat-pass InSAR and spaceborne GEDI data. The sparsely but extensively distributed LiDAR samples provided by NASA's GEDI mission are used to parametrize the semi-empirical repeat-pass InSAR scattering model (Lei et al., 2017) and to obtain forest height estimates. Compared to our previous efforts (Lei et al., 2018; Lei and Siqueira, 2022), this work removes the assumptions that were made given the limited availability of calibration samples at that time and develops a new inversion approach based on a global-to-local two-stage inversion scheme. This approach allows better use of local GEDI samples to achieve finer characterization of the temporal decorrelation pattern and thus higher accuracy of forest height inversion, and it is fully automated to enable large-scale forest mapping. Two forest height mosaic maps were generated for the entire northeastern regions of the U.S. and China, with total areas of 18 million hectares and 112 million hectares, respectively. Validation of the forest height estimates demonstrates much improved accuracy compared to the previous efforts, i.e., reducing the RMSE from 3-4 m at a 3-6-hectare aggregated pixel size to 3-4 m at a 0.81-hectare pixel size. The proposed fusion approach not only addresses the sparse spatial sampling problem inherent to the GEDI mission, but also improves the accuracy of forest height estimates compared to GEDI-interpolated maps by 20% at 30-m resolution. Extensive evaluation of the forest height inversion against LVIS LiDAR data indicates an accuracy of 3-4 m at a 0.81-hectare pixel size over smooth areas and 4-5 m over hilly areas in the U.S., whereas the forest height estimates over northeastern China compare well with small-footprint LiDAR validation data, with an accuracy below 3.5 m and R2 mostly above 0.6. Such forest height inversion accuracy at sub-hectare pixel size holds promise for existing and future spaceborne LiDAR (JAXA's MOLI, NASA's GEDI, China's TECIS) and InSAR missions (NASA-ISRO's NISAR, JAXA's ALOS-4, and China's LuTan-1). This fusion prototype can serve as a cost-effective solution for public users to obtain wall-to-wall forest height maps at large scale when only spaceborne repeat-pass InSAR data are available and freely accessible.
https://www.law.cornell.edu/uscode/text/17/106
With the development of science and technology, large-scale optimization tasks have become integral to cutting-edge engineering. The challenges of solving these problems arise from ever-growing system sizes, intricate physical spaces, and the computational cost required to accurately model and optimize target objectives. Taking the design of advanced functional materials as an example, the high-dimensional parameter space and high-fidelity physical simulations can demand immense computational resources for searching and iteration. Although emerging machine learning techniques have been combined with conventional experimental and simulation approaches to explore the design space and identify high-performance solutions, these methods are still limited to a small part of the design space around materials that have already been well investigated.
Over the past several decades, continuous development of both hardware and algorithms has addressed some of these challenges. High-performance computing (HPC) architectures and heterogeneous systems have greatly expanded the capacity to perform large-scale calculations and optimizations; on the other hand, the emergence of machine learning frameworks and algorithms has dramatically facilitated the development of advanced models and enabled the integration of AI-driven techniques into traditional experiments and simulations more seamlessly. In recent years, quantum computing (QC) has received widespread attention due to its strong performance in finding global optima and is regarded as a promising solution to large-scale and non-linear optimization problems in the future; in the meantime, quantum computing principles also expand the capacity of classical algorithms to explore high-dimensional combinatorial spaces. In this dissertation, we show the power of integrating machine learning algorithms, quantum algorithms, and HPC architectures to tackle the challenges of solving large-scale optimization problems.
In the first part of this dissertation, we introduced an optimization algorithm based on a Quantum-inspired Genetic Algorithm (QGA) to design planar multilayer (PML) structures for transparent radiative cooler (TRC) applications. Numerical experiments showed that our QGA-facilitated optimization algorithm converges to solutions comparable to quantum annealing (QA), and that the QGA outperformed the classical genetic algorithm (CGA) in both convergence speed and global search capacity. Our work shows that quantum heuristic algorithms will become powerful tools for addressing the challenges traditional optimization algorithms face when solving large-scale optimization problems with complex search spaces.
In the second part of the dissertation, we proposed a quantum annealing-assisted lattice optimization (QALO) algorithm for high-entropy alloy (HEA) systems. The algorithm builds on an active learning framework that integrates the field-aware factorization machine (FFM), quantum annealing (QA), and a machine learning potential (MLP). When applied to optimizing the bulk grain configuration of the NbMoTaW alloy system, our algorithm quickly obtains low-energy microstructures, and the results successfully reproduce the thermodynamically driven Nb segregation and W enrichment in the bulk phase that are usually observed in experiments and MC/MD simulations. This work highlights the potential of quantum computing for exploring the large design space of HEA systems.
In the third part of the dissertation, we employed the Distributed Quantum Approximate Optimization Algorithm (DQAOA) to address large-scale combinatorial optimization problems that exceed the limits of conventional computational resources. This was achieved through a divide-and-conquer strategy, in which the original problem is decomposed into smaller sub-tasks that are solved in parallel on a high-performance computing (HPC) system. To further enhance convergence efficiency, we introduced an Impact Factor Directed (IFD) decomposition method. By calculating impact factors and leveraging a targeted traversal strategy, IFD captures local structural features of the problem, making it effective for both dense and sparse instances. Finally, we explored the integration of DQAOA with the Quantum Framework (QFw) on the Frontier HPC system, demonstrating the potential for efficient management of large-scale circuit execution workloads across CPUs and GPUs.
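The following heavily simplified sketch illustrates the divide-and-conquer idea: a QUBO matrix is partitioned into blocks, each block is solved independently (brute force here stands in for a QAOA call on a quantum backend), and the block solutions are concatenated; the impact-factor-directed decomposition and the QFw/HPC integration are not reproduced.

```python
# Heavily simplified sketch of the divide-and-conquer idea behind DQAOA:
# partition a QUBO into sub-problems, solve each independently (brute force
# stands in for a QAOA call on a quantum backend), and concatenate the
# sub-solutions. The impact-factor-directed decomposition and the HPC/QFw
# integration described above are not reproduced.
import itertools
import numpy as np

def solve_qubo_brute_force(Q):
    """Return the bit vector minimizing x^T Q x for a small QUBO matrix Q."""
    n = Q.shape[0]
    best_x, best_val = None, np.inf
    for bits in itertools.product((0, 1), repeat=n):
        x = np.array(bits)
        val = x @ Q @ x
        if val < best_val:
            best_x, best_val = x, val
    return best_x

def divide_and_conquer(Q, block_size=4):
    """Split variables into contiguous blocks, ignoring inter-block couplings."""
    n = Q.shape[0]
    solution = np.empty(n, dtype=int)
    for start in range(0, n, block_size):
        idx = np.arange(start, min(start + block_size, n))
        solution[idx] = solve_qubo_brute_force(Q[np.ix_(idx, idx)])
    return solution

rng = np.random.default_rng(0)
Q = rng.normal(size=(12, 12))
Q = (Q + Q.T) / 2                      # symmetric QUBO matrix
x = divide_and_conquer(Q)
print("approximate solution:", x, "objective:", x @ Q @ x)
```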
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The set of Knowledge Graphs (KGs) generated with automatic and manual approaches is constantly growing.
For an integrated view and usage, an alignment between these KGs is necessary on the schema as well as instance level.
There are already approaches which try to tackle this multi source knowledge graph matching problem,
but large gold standards are missing to evaluate their effectiveness and scalability.
In particular, most existing gold standards are fairly small and can be solved by matchers that match exactly two KGs (1:1), which constitute the majority of existing matching systems.
We close this gap by presenting Gollum -- a gold standard for large-scale multi source knowledge graph matching with over 275,000 correspondences between 4,149 different KGs.
They originate from knowledge graphs derived by applying the DBpedia extraction framework to a large wiki farm.
Three variations of the gold standard are made available:
(1) a version with all correspondences for evaluating unsupervised matching approaches, and two versions for evaluating supervised matching: (2) one where each KG is contained both in the train and test set, and (3) one where each KG is exclusively contained in the train or the test set.
We plan to extend our KG track at the Ontology Alignment Evaluation Initiative (OAEI) to allow for matching systems
which are specifically designed to solve the multi KG matching problem.
As a first step towards this direction, we evaluate multi source matching approaches which reuse two-KG (1:1) matchers from the past OAEI.
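As a hedged sketch of how a matcher's output could be scored against such a gold standard, the snippet below treats both as sets of (source entity, target entity) pairs and computes precision, recall, and F1; the tuples and identifiers are simplified placeholders, not the Gollum file format.

```python
# Hedged sketch: score a matcher's output against a gold standard, treating
# both as sets of (source entity, target entity) correspondences. The tuples
# below are simplified placeholders, not the Gollum file format.
def precision_recall_f1(predicted: set, gold: set):
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {("kgA:Hobbit", "kgB:Hobbit"), ("kgA:Shire", "kgB:The_Shire"), ("kgA:Gandalf", "kgC:Gandalf")}
pred = {("kgA:Hobbit", "kgB:Hobbit"), ("kgA:Gandalf", "kgC:Gandalf"), ("kgA:Mordor", "kgB:Mordor")}

p, r, f1 = precision_recall_f1(pred, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```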
Due to the size of the KG files, they are hosted at the institute:
http://data.dws.informatik.uni-mannheim.de/dbkwik/gollum/40K.tar (50.3 GB)
http://data.dws.informatik.uni-mannheim.de/dbkwik/gollum/all.tar (74.7 GB)
http://data.dws.informatik.uni-mannheim.de/dbkwik/gollum/gold.tar (25.3 GB)