Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction
This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper was submitted to the MSR 2022 Data Showcase Track.
The datasets are available under the directory dataset, which contains four datasets.
In addition to the dataset, we also provide the scripts we used to build the dataset. These scripts are written in Python 3.8, so Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in the file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11; for other languages, external tools are needed. An installation guide and more details can be found here.
The scripts comprise Python scripts under the directory src and Python notebooks under the directory notebooks. The Python scripts are mainly responsible for conducting GitHub searches via the GitHub Search API and for collecting commits through the PyDriller package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates are then filtered again using the gumtree.py script, which utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.
More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. The script collector.py performs the GitHub search. Tracing changed lines and git annotate are done in gitminer.py using PyDriller. Finally, gumtree.py applies four filtering steps (number of lines, number of files, language, and change significance).
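As a rough illustration of what the gitminer.py step involves, PyDriller can trace the lines deleted by a fixing commit back to the commits that last modified them, yielding SZZ-style bug-inducing candidates. This is a minimal sketch with hypothetical paths and hashes, not the actual script:

```python
# Minimal sketch (not the actual gitminer.py): trace the lines changed by a
# fixing commit back to the commits that last modified them with PyDriller.
from pydriller import Git

repo = Git("/path/to/local/apache-project")  # hypothetical clone location
fix = repo.get_commit("0123abc")             # hypothetical fixing-commit hash

# Maps each modified file to the set of commits that last touched the deleted
# lines; these commits are the bug-inducing candidates.
candidates = repo.get_commits_last_modified_lines(fix)
for path, hashes in candidates.items():
    print(path, sorted(hashes))
```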
References:
[1] Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering (ASE '14), Västerås, Sweden, September 15-19, 2014, 313–324.
[2] Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018), Lake Buena Vista, FL, USA. Association for Computing Machinery, New York, NY, USA, 908–911.
https://www.marketresearchforecast.com/privacy-policy
The Big Data Technology Market size was valued at USD 349.40 billion in 2023 and is projected to reach USD 918.16 billion by 2032, exhibiting a CAGR of 14.8% during the forecast period. Big data refers to larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software cannot manage them, but they can be used to address business problems that could not have been tackled before. Big data technology is defined as a software utility, designed primarily to analyze, process, and extract information from large data sets with extremely complex structures, which is very difficult for traditional data processing software to handle. Big data technologies are widely associated with other widely adopted technologies such as deep learning, machine learning, artificial intelligence (AI), and the Internet of Things (IoT). In combination with these technologies, big data technologies focus on analyzing and handling large amounts of real-time and batch data.
Recent developments include:
- February 2024: SQream, a GPU data analytics platform, partnered with Dataiku, an AI and machine learning platform, to deliver a comprehensive solution for efficiently generating big data analytics and business insights by handling complex data.
- October 2023: MultiversX (EGLD), a blockchain infrastructure firm, formed a partnership with Google Cloud to enhance Web3's presence by integrating big data analytics and artificial intelligence tools. The collaboration aims to offer new possibilities for developers and startups.
- May 2023: Vpon Big Data Group partnered with VIOOH, a digital out-of-home advertising (DOOH) supply-side platform, to display the unique advertising content generated by Vpon's AI visual content generator "InVnity" with VIOOH's digital outdoor advertising inventories. This partnership pioneers the future of outdoor advertising by using AI and big data solutions.
- May 2023: Salesforce launched the next generation of Tableau for users to automate data analysis and generate actionable insights.
- March 2023: SAP SE, a German multinational software company, entered a partnership with AI companies, including Databricks, Collibra NV, and DataRobot, Inc., to introduce the next generation of its data management portfolio.
- November 2022: Thai oil and retail corporation PTT Oil and Retail Business Public Company implemented the Cloudera Data Platform to deliver insights and enhance customer engagement. The implementation offered a unified and personalized experience across 1,900 gas stations and 3,000 retail branches.
- November 2022: IBM launched new software for enterprises to break down data and analytics silos, helping users make data-driven decisions. The software streamlines how users access and discover analytics and planning tools from multiple vendors in a single dashboard view.
- September 2022: ActionIQ, a global leader in CX solutions, and Teradata, a leading software company, entered a strategic partnership and integrated AIQ's new HybridCompute technology with the Teradata VantageCloud analytics and data platform.
Key drivers for this market are: Increasing Adoption of AI, ML, and Data Analytics to Boost Market Growth. Potential restraints include: Rising Concerns on Information Security and Privacy to Hinder Market Growth.
Notable trends are: Rising Adoption of Big Data and Business Analytics among End-use Industries.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Tabular data is a way to structure, organize, and present information conveniently and effectively. Real-world tables present data in two dimensions by arranging cells in matrices that summarize information and facilitate side-by-side comparisons. Recent research efforts aim to train large models with machine learning methods to understand structured tables, a process that enables knowledge transfer in various downstream tasks. Model pre-training, though, requires large tabular datasets, conveniently formatted to reflect cell and table properties and characteristics. This paper presents a financial dataset, called ENTRANT, that comprises millions of tables. The tables are transformed to reflect cell attributes, as well as positional and hierarchical information. Hence, they facilitate, among other things, pre-training tasks for table understanding with deep learning methods. The dataset provides table and cell information along with the corresponding metadata in a machine-readable JSON format. Furthermore, we have automated all data processing and curation in a free and open-access project. Moreover, we have technically validated the dataset through unit testing with high code coverage. Finally, we demonstrate the use of the dataset in a pre-training task of a state-of-the-art model, which is also used for downstream cell classification.
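As a purely illustrative aid, the record below shows the kind of cell-level JSON the description implies, combining cell attributes with positional and hierarchical information. Every field name here is an assumption, not ENTRANT's actual schema:

```python
import json

# Hypothetical ENTRANT-style cell record; all field names are assumptions
# made for illustration and may differ from the dataset's real schema.
cell = {
    "table_id": "filing-0042_table-3",       # made-up identifier
    "row": 4, "col": 2,                      # positional information
    "header_path": ["Revenue", "Products"],  # hierarchical information
    "value": "297,392",
    "attributes": {"is_numeric": True, "is_header": False},
}
print(json.dumps(cell, indent=2))
```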
This research study was conducted to analyze the (potential) relationship between hardware and data set sizes. 100 data scientists from France were interviewed between Jan-2016 and Aug-2016 in order to obtain exploitable data. Therefore, this sample might not be representative of the true population.
What can you do with the data?
I did not find any past research on a similar scale. You are free to play with this data set. For re-use of this data set outside of Kaggle, please contact the author directly on Kaggle (use "Contact User"). Please mention:
Arbitrarily, we chose characteristics to describe Data Scientists and data set sizes.
Data set size:
For the data, it uses the following fields (DS = Data Scientist, W = Workstation):
You should expect potential noise in the data set. It might not be "free" of internal contradictions, as with all research.
https://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies"
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each contains an .Rmd file with the same name, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.
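For the gist of that comparison step, the short Python sketch below mirrors what compare_seqs.py is described as doing: it pairs records by name across the sUMI and dUMI fasta files and flags pairs whose sequences differ. The file names are hypothetical, and the real script may differ in detail:

```python
# Hedged sketch of the comparison compare_seqs.py is described as performing.
def read_fasta(path):
    """Parse a fasta file into {record_name: sequence}."""
    seqs, name, parts = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    seqs[name] = "".join(parts)
                name, parts = line[1:].split()[0], []
            elif line:
                parts.append(line.upper())
    if name is not None:
        seqs[name] = "".join(parts)
    return seqs

sumi = read_fasta("sUMI_rank1.fasta")  # hypothetical file names
dumi = read_fasta("dUMI_rank1.fasta")
for name in sorted(sumi.keys() & dumi.keys()):
    if sumi[name] != dumi[name]:
        print(f"{name}: sUMI and dUMI consensus sequences differ")
```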
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Indentifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, then combined and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.
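The extraction and combining are done in R inside the Rmd; purely as an illustration, the equivalent combining step in pandas might look like this (the directory layout is assumed):

```python
# Illustrative pandas equivalent of the combining step performed in R:
# stack each sample's dUMI_ranked.csv into one table tagged by sample name.
import glob
import os
import pandas as pd

frames = []
for path in glob.glob("Pipeline_Outputs/*/dUMI_ranked.csv"):  # assumed layout
    df = pd.read_csv(path)
    df["sample"] = os.path.basename(os.path.dirname(path))
    frames.append(df)

dumi_df = pd.concat(frames, ignore_index=True)
dumi_df.to_csv("dUMI_df.csv", index=False)
```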
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
https://dataintelo.com/privacy-and-policy
The global storage in big data market size was estimated to be USD 57.5 billion in 2023, and it is projected to reach approximately USD 147.3 billion by 2032, growing at a compound annual growth rate (CAGR) of 11.0% during the forecast period. This growth can be attributed to the increasing volume of data generated by various industry verticals, advancements in data storage technologies, and the rising adoption of big data analytics across organizations worldwide. The rapid digital transformation across industries has necessitated efficient data storage solutions, paving the way for substantial growth in the big data storage market.
The proliferation of data generated from various sources such as social media, IoT devices, and enterprise applications is one of the major growth factors for the storage in big data market. The exponential increase in data volume has created a pressing need for effective storage solutions that can handle, manage, and analyze large datasets in real time. Organizations are increasingly relying on data-driven insights to inform their business strategies, optimize operations, and enhance customer experiences, thereby driving the demand for sophisticated storage solutions. Furthermore, the growing importance of data in decision-making processes has underscored the critical role of robust storage infrastructure to support big data initiatives.
Technological advancements in storage solutions, such as the development of high-performance storage systems and cloud-based storage platforms, have significantly contributed to the market's growth. Innovations in storage technologies, including the use of solid-state drives (SSDs), non-volatile memory express (NVMe), and software-defined storage (SDS), have enhanced storage efficiency and accessibility, meeting the demands of organizations dealing with massive data volumes. Additionally, cloud-based storage solutions have gained traction due to their scalability, flexibility, and cost-effectiveness, enabling businesses to manage their data resources more efficiently. These technological advancements are expected to drive the adoption of big data storage solutions over the forecast period.
The increasing investment in big data analytics by various industries is another key growth driver for the storage in big data market. Industries such as healthcare, retail, BFSI (banking, financial services, and insurance), and IT and telecommunications are leveraging big data analytics to derive valuable insights from their data reserves. As a result, there is a growing demand for advanced storage solutions capable of supporting complex data analytics processes. The integration of machine learning and artificial intelligence with big data analytics further emphasizes the need for efficient storage systems that can handle the processing of large datasets, thereby boosting the market growth.
The regional outlook for the storage in big data market indicates that North America is expected to hold a significant share of the market during the forecast period. This dominance can be attributed to the early adoption of advanced technologies, the presence of major market players, and the high investment in big data analytics in the region. Additionally, the Asia Pacific region is projected to witness the highest growth rate, driven by the increasing adoption of digital technologies, the expansion of the IT sector, and the growing focus on data-driven decision-making processes. Europe is also anticipated to experience substantial growth, supported by the rising demand for data storage solutions across various industries and increasing regulatory requirements for data management.
The component segment of the storage in big data market is divided into hardware, software, and services. Each component plays a critical role in the overall market ecosystem and contributes to the effective management and utilization of big data. Hardware components, which include storage devices and infrastructure, are essential for storing the vast amounts of data generated by organizations. With advancements in storage technologies, hardware components have evolved to offer higher storage capacities, faster data retrieval speeds, and better energy efficiency. Innovations such as SSDs and NVMe have revolutionized the storage landscape, providing organizations with robust solutions to meet their growing data storage needs.
Software components in the big data storage market are designed to enhance the functionality and management of stored data. They include data management software, data in
This module series covers how to import, manipulate, format, and plot time series data stored in .csv format in R. It was originally designed to teach researchers to use NEON plant phenology and air temperature data, and it has been used in undergraduate classrooms.
In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval, and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and the potentially high communication cost. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
CSVs with more than 1 million rows can be viewed using add-ons to existing software, such as the Microsoft PowerPivot add-on for Excel, to handle larger data sets. The Microsoft PowerPivot add-on for Excel is available using the link in the 'Related Links' section below. Once PowerPivot has been installed, to load the large files, please follow the instructions below. Note that it may take at least 20 to 30 minutes to load one monthly file.
1. Start Excel as normal.
2. Click on the PowerPivot tab.
One of the key problems that arises in many areas is to estimate a potentially nonlinear function $G(x, \theta)$ given input and output samples $(x_i, y_i)$ so that $y \approx G(x, \theta)$. There are many approaches to addressing this regression problem. Neural networks, regression trees, and many other methods have been developed to estimate $G$ given the input-output pairs $(x_i, y_i)$. One method that I have worked with is called Gaussian process regression. There are many good texts and papers on the subject. For more technical information on the method and its applications see: http://www.gaussianprocess.org/ A key problem that arises in developing these models on very large data sets is that training requires an $O(N^3)$ computation, where $N$ is the number of data points in the training sample. Obviously this becomes very problematic when $N$ is large. I discussed this problem with Leslie Foster, a mathematics professor at San Jose State University. He, along with some of his students, developed a method to address this problem based on Cholesky decomposition and pivoting. He also shows that this leads to a numerically stable result. If you're interested in some light reading, I'd suggest you take a look at his recent paper (which was accepted in the Journal of Machine Learning Research) posted on dashlink. We've also posted code for you to try it out. Let us know how it goes. If you are interested in applications of this method in the area of prognostics, check out our new paper on the subject, which was published in IEEE Transactions on Systems, Man, and Cybernetics.
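To make the cost concrete, here is a textbook Gaussian process regression sketch in Python. The Cholesky factorization of the $N \times N$ kernel matrix is the $O(N^3)$ step; Foster's pivoted, low-rank approach replaces exactly this step, so the sketch shows the baseline formulation, not his algorithm:

```python
# Textbook GP regression baseline, showing where the O(N^3) cost lives:
# the Cholesky factorization of the N x N kernel matrix.
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

noise = 1e-2
K = rbf_kernel(x, x) + noise * np.eye(x.size)
L = np.linalg.cholesky(K)                        # the O(N^3) step
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

x_test = np.linspace(-3, 3, 5)
y_pred = rbf_kernel(x_test, x) @ alpha           # posterior mean at test points
print(y_pred)
```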
https://dataintelo.com/privacy-and-policy
The global market size for artificial intelligence in big data analysis was valued at approximately $45 billion in 2023 and is projected to reach around $210 billion by 2032, growing at a remarkable CAGR of 18.7% during the forecast period. This phenomenal growth is driven by the increasing adoption of AI technologies across various sectors to analyze vast datasets, derive actionable insights, and make data-driven decisions.
The first significant growth factor for this market is the exponential increase in data generation from various sources such as social media, IoT devices, and business transactions. Organizations are increasingly leveraging AI technologies to sift through these massive datasets, identify patterns, and make informed decisions. The integration of AI with big data analytics provides enhanced predictive capabilities, enabling businesses to foresee market trends and consumer behaviors, thereby gaining a competitive edge.
Another critical factor contributing to the growth of AI in the big data analysis market is the rising demand for personalized customer experiences. Companies, especially in the retail and e-commerce sectors, are utilizing AI algorithms to analyze consumer data and deliver personalized recommendations, targeted advertising, and improved customer service. This not only enhances customer satisfaction but also boosts sales and customer retention rates.
Additionally, advancements in AI technologies, such as machine learning, natural language processing, and computer vision, are further propelling market growth. These technologies enable more sophisticated data analysis, allowing organizations to automate complex processes, improve operational efficiency, and reduce costs. The combination of AI and big data analytics is proving to be a powerful tool for gaining deeper insights and driving innovation across various industries.
From a regional perspective, North America holds a significant share of the AI in big data analysis market, owing to the presence of major technology companies and high adoption rates of advanced technologies. However, the Asia Pacific region is expected to exhibit the highest growth rate during the forecast period, driven by rapid digital transformation, increasing investments in AI and big data technologies, and the growing need for data-driven decision-making processes.
The AI in big data analysis market is segmented by components into software, hardware, and services. The software segment encompasses AI platforms and analytics tools that facilitate data analysis and decision-making. The hardware segment includes the computational infrastructure required to process large volumes of data, such as servers, GPUs, and storage devices. The services segment involves consulting, integration, and support services that assist organizations in implementing and optimizing AI and big data solutions.
The software segment is anticipated to hold the largest share of the market, driven by the continuous development of advanced AI algorithms and analytics tools. These solutions enable organizations to process and analyze large datasets efficiently, providing valuable insights that drive strategic decisions. The demand for AI-powered analytics software is particularly high in sectors such as finance, healthcare, and retail, where data plays a critical role in operations.
On the hardware front, the increasing need for high-performance computing to handle complex data analysis tasks is boosting the demand for powerful servers and GPUs. Companies are investing in robust hardware infrastructure to support AI and big data applications, ensuring seamless data processing and analysis. The rise of edge computing is also contributing to the growth of the hardware segment, as organizations seek to process data closer to the source.
The services segment is expected to grow at a significant rate, driven by the need for expertise in implementing and managing AI and big data solutions. Consulting services help organizations develop effective strategies for leveraging AI and big data, while integration services ensure seamless deployment of these technologies. Support services provide ongoing maintenance and optimization, ensuring that AI and big data solutions deliver maximum value.
Overall, the combination of software, hardware, and services forms a comprehensive ecosystem that supports the deployment and utilization of AI in big data analysis.
https://www.marketresearchforecast.com/privacy-policy
The Big Data Analytics Market in Energy Sector size was valued at USD 9.56 billion in 2023 and is projected to reach USD 13.81 billion by 2032, exhibiting a CAGR of 5.4% during the forecast period. Big Data Analytics in the energy sector can be defined as the application of sophisticated methods and tools to analyze the vast collections of information produced by the numerous entities within the energy industry. This process covers descriptive, predictive, and prescriptive analytics to provide valuable information on procedures, costs, and strategies: descriptive and real-time analytics describe what is happening now, predictive analytics estimates what is likely to happen in the future, and prescriptive analytics provides recommendations for action. Main characteristics of these tools include handling large datasets, compatibility with IoT data streams, and machine learning features for pattern detection. Applications range from grid control and load management to predicting customer demand, improving equipment reliability, and enhancing equipment efficiency. Big Data Analytics thus offers a significant advantage, helping global energy companies to increase performance, minimize downtime, and develop effective strategies to meet legal requirements. Key drivers for this market are: Growing Focus on Safety and Organization to Fuel Market Growth. Potential restraints include: Higher Cost of Geotechnical Services to Hinder Market Growth. Notable trends are: Growth of IT Infrastructure to Bolster the Demand for Modern Cable Tray Management Solutions.
https://dataintelo.com/privacy-and-policy
The global cluster computing market size was valued at approximately USD 35 billion in 2023 and is projected to reach around USD 70 billion by 2032, growing at a compound annual growth rate (CAGR) of about 8%. This robust growth is primarily driven by the increasing demand for computational power and data processing capabilities across various sectors. The proliferation of big data analytics, artificial intelligence, and machine learning applications necessitates powerful computing solutions, thus driving the adoption of cluster computing. Additionally, the ongoing advancements in technology and the emergence of new computational methods further bolster the market's growth trajectory.
One of the primary growth factors for the cluster computing market is the escalating need for data-intensive processing in scientific research. Fields such as genomics, astrophysics, and climate modeling require immense computational resources that cluster computing can provide. Researchers are attracted to cluster computing solutions due to their ability to efficiently process complex algorithms and manage large datasets. Furthermore, the cost-effectiveness of cluster computing compared to supercomputers makes it a preferred choice for research institutions operating under budget constraints. The accessibility of open-source cluster management software also enhances its appeal, enabling researchers to build and manage clusters with relative ease and reduced costs.
The industrial sector is another significant contributor to the growth of the cluster computing market. Industries such as automotive, aerospace, and energy increasingly rely on cluster computing for simulations, modeling, and design optimization. These applications require substantial computational power to deliver accurate results in a timely manner. The need for enhanced product development cycles and increased efficiency in complex processes further drives the adoption of cluster computing. Moreover, with the rise of Industry 4.0, there is a surge in demand for real-time data processing and analytics, which cluster computing systems are well-equipped to handle, thereby fueling market growth.
In the commercial sphere, businesses are leveraging cluster computing to improve their decision-making processes through advanced data analytics. The financial services sector, in particular, is investing in cluster computing systems to perform risk assessments, fraud detection, and customer analytics at unprecedented speeds. The ability to process large volumes of data in real-time gives businesses a competitive edge in today's fast-paced market environment. Furthermore, the increasing trend of digital transformation across industries is accelerating the implementation of cluster computing solutions, as organizations strive to harness the power of data analytics for strategic advantage and operational excellence.
Supercomputing has emerged as a pivotal element in the evolution of cluster computing, offering unprecedented computational capabilities that extend beyond traditional methods. As industries and research institutions strive to tackle increasingly complex problems, the integration of supercomputing resources within cluster environments is becoming more prevalent. This synergy allows for enhanced processing power and efficiency, enabling the execution of intricate simulations and data analyses that were previously unattainable. The role of supercomputing in advancing scientific research, particularly in fields requiring intensive data processing like genomics and climate modeling, underscores its significance in the cluster computing landscape. As technology continues to evolve, the collaboration between supercomputing and cluster computing is expected to drive further innovations and breakthroughs across various sectors.
Regionally, North America stands out as a dominant player in the cluster computing market, driven by a strong presence of technology companies and research institutions. The region's leadership in technological innovation and early adoption of advanced computing solutions contributes to its substantial market share. Meanwhile, Asia Pacific is anticipated to exhibit the highest growth rate over the forecast period. This growth is fueled by increasing investments in IT infrastructure, rising demand for data analytics, and government initiatives promoting technological advancements in countries like China and India. Europe also presents significant opportunities, with growing adoption in sectors such as automotive, manufacturing, and
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the code for Relevance and Redundancy ranking (RaR), an efficient filter-based feature ranking framework for evaluating relevance based on multi-feature interactions and redundancy on mixed datasets. Source code is in .scala and .sbt format and metadata in .xml, all of which can be accessed and edited in standard, openly accessible text editing software. Diagrams are in the openly accessible .png format.

- Supplementary_2.pdf: contains the results of experiments on multiple classifiers, along with parameter settings and a description of how KLD converges to mutual information based on its symmetricity.
- dataGenerator.zip: synthetic data generator inspired by the NIPS Workshop on Variable and Feature Selection (2001), http://www.clopinet.com/isabelle/Projects/NIPS2001/
- rar-mfs-master.zip: Relevance and Redundancy Framework containing the overview diagram, example datasets, source code, and metadata. Details on installing and running are provided below.

Background. Feature ranking is beneficial for gaining knowledge and identifying the relevant features in a high-dimensional dataset. However, in several datasets, a few features by themselves might have small correlation with the target classes, but combined with other features they can be strongly correlated with the target. This means that multiple features exhibit interactions among themselves. It is necessary to rank the features based on these interactions for better analysis and classifier performance, but evaluating these interactions on large datasets is computationally challenging. Furthermore, datasets often have features with redundant information. Using such redundant features hinders both the efficiency and the generalization capability of the classifier. The major challenge is to efficiently rank the features based on relevance and redundancy on mixed datasets. In the related publication, we propose a filter-based framework based on Relevance and Redundancy (RaR). RaR computes a single score that quantifies feature relevance by considering interactions between features and redundancy. The top-ranked features of RaR are characterized by maximum relevance and non-redundancy. The evaluation on synthetic and real-world datasets demonstrates that our approach outperforms several state-of-the-art feature selection techniques.

# Relevance and Redundancy Framework (rar-mfs)

rar-mfs is an algorithm for feature selection and can be employed to select features from labelled data sets. The Relevance and Redundancy Framework (RaR), which is the theory behind the implementation, is a novel feature selection algorithm that

- works on large data sets (polynomial runtime),
- can handle differently typed features (e.g. nominal features and continuous features), and
- handles multivariate correlations.

## Installation

The tool is written in Scala and uses the weka framework to load and handle data sets. You can either run it independently, providing the data as an .arff or .csv file, or you can include the algorithm as a (maven / ivy) dependency in your project. As an example data set we use heart-c.

### Project dependency

The project is published to maven central (link). To depend on the project use:

- maven:

```xml
<dependency>
  <groupId>de.hpi.kddm</groupId>
  <artifactId>rar-mfs_2.11</artifactId>
  <version>1.0.2</version>
</dependency>
```
- sbt:

```sbt
libraryDependencies += "de.hpi.kddm" %% "rar-mfs" % "1.0.2"
```
To run the algorithm use:

```scala
import de.hpi.kddm.rar._
// ...
val dataSet = de.hpi.kddm.rar.Runner.loadCSVDataSet(new File("heart-c.csv"), isNormalized = false, "")
val algorithm = new RaRSearch(
  HicsContrastPramsFA(numIterations = config.samples, maxRetries = 1, alphaFixed = config.alpha, maxInstances = 1000),
  RaRParamsFixed(k = 5, numberOfMonteCarlosFixed = 5000, parallelismFactor = 4))
algorithm.selectFeatures(dataSet)
```
### Command line tool

- EITHER download the prebuilt binary, which requires only an installation of a recent Java version (>= 6):
  1. download the prebuilt jar from the releases tab (latest)
  2. run `java -jar rar-mfs-1.0.2.jar --help`

  Using the prebuilt jar, here is an example usage:

  ```sh
  rar-mfs > java -jar rar-mfs-1.0.2.jar arff --samples 100 --subsetSize 5 --nonorm heart-c.arff
  Feature Ranking:
    1 - age (12)
    2 - sex (8)
    3 - cp (11)
    ...
  ```

- OR build the repository on your own:
  1. make sure sbt is installed
  2. clone the repository
  3. run `sbt run`

  Simple example using sbt directly after cloning the repository:

  ```sh
  rar-mfs > sbt "run arff --samples 100 --subsetSize 5 --nonorm heart-c.arff"
  Feature Ranking:
    1 - age (12)
    2 - sex (8)
    3 - cp (11)
    ...
  ```
### [Optional]

To speed up the algorithm, consider using a fast solver such as Gurobi (http://www.gurobi.com/). Install the solver and put the provided gurobi.jar into the java classpath.

## Algorithm

### Idea

Abstract overview of the different steps of the proposed feature selection algorithm: ![Algorithm Overview](https://github.com/tmbo/rar-mfs/blob/master/docu/images/algorithm_overview.png)

The Relevance and Redundancy ranking framework (RaR) is a method able to handle large scale data sets and data sets with mixed features. Instead of directly selecting a subset, a feature ranking gives a more detailed overview of the relevance of the features. The method consists of a multistep approach where we

1. repeatedly sample subsets from the whole feature space and examine their relevance and redundancy: exploration of the search space to gather more and more knowledge about the relevance and redundancy of features,
2. deduce scores for features based on the scores of the subsets, and
3. create the best possible ranking given the sampled insights.

### Parameters

| Parameter | Default value | Description |
| --------- | ------------- | ----------- |
| m - contrast iterations | 100 | Number of different slices to evaluate while comparing marginal and conditional probabilities |
| alpha - subspace slice size | 0.01 | Percentage of all instances to use as part of a slice which is used to compare distributions |
| n - sampling iterations | 1000 | Number of different subsets to select in the sampling phase |
| k - sample set size | 5 | Maximum size of the subsets to be selected in the sampling phase |
https://www.gnu.org/licenses/gpl-3.0.html
Feature selection is an important technique for data mining before a machine learning algorithm is applied. Despite its importance, most studies of feature selection are restricted to batch learning. Unlike traditional batch learning methods, online learning represents a promising family of efficient and scalable machine learning algorithms for large-scale applications. Most existing studies of online learning require accessing all the attributes/features of training instances. Such a classical setting is not always appropriate for real-world applications when data instances are of high dimensionality or it is expensive to acquire the full set of attributes/features. To address this limitation, we investigate the problem of Online Feature Selection (OFS), in which an online learner is only allowed to maintain a classifier involving only a small and fixed number of features. The key challenge of Online Feature Selection is how to make accurate predictions using a small and fixed number of active features. This is in contrast to the classical setup of online learning where all the features can be used for prediction. We attempt to tackle this challenge by studying sparsity regularization and truncation techniques. Specifically, this article addresses two different tasks of online feature selection: (1) learning with full input, where a learner is allowed to access all the features to decide the subset of active features, and (2) learning with partial input, where only a limited number of features is allowed to be accessed for each instance by the learner. We present novel algorithms to solve each of the two problems and give their performance analysis. We evaluate the performance of the proposed algorithms for online feature selection on several public datasets, and demonstrate their applications to real-world problems including image classification in computer vision and microarray gene expression analysis in bioinformatics. The encouraging results of our experiments validate the efficacy and efficiency of the proposed techniques. Related publications: Hoi, S. C., Wang, J., Zhao, P., & Jin, R. (2012). Online feature selection for mining big data. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (pp. 93-100). ACM. http://dx.doi.org/10.1145/2351316.2351329 (full text in InK: http://ink.library.smu.edu.sg/sis_research/2402/); Wang, J., Zhao, P., Hoi, S. C., & Jin, R. (2014). Online feature selection and its applications. IEEE Transactions on Knowledge and Data Engineering, 26(3), 698-710. http://dx.doi.org/10.1109/TKDE.2013.32 (full text in InK: http://ink.library.smu.edu.sg/sis_research/2277/)
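As a hedged illustration of the truncation idea, the sketch below maintains a classifier with at most B non-zero weights by zeroing all but the largest-magnitude entries after each online update. It captures the flavor of OFS with full input, not the paper's exact algorithms, which add regularization and careful scaling:

```python
# Simplified sketch of truncation-based online feature selection: after each
# online update, keep only the B largest-magnitude weights. This conveys the
# idea, not the paper's exact OFS algorithms.
import numpy as np

def truncate(w, B):
    """Zero out all but the B largest-magnitude entries of w."""
    if np.count_nonzero(w) <= B:
        return w
    keep = np.argsort(np.abs(w))[-B:]
    w_trunc = np.zeros_like(w)
    w_trunc[keep] = w[keep]
    return w_trunc

rng = np.random.default_rng(1)
d, B, eta = 100, 10, 0.1
w = np.zeros(d)
w_true = np.zeros(d)
w_true[:5] = 1.0                        # sparse ground-truth classifier

for _ in range(1000):
    x = rng.standard_normal(d)
    y = np.sign(w_true @ x)
    if y * (w @ x) <= 1:                # hinge-loss margin violation
        w = truncate(w + eta * y * x, B)

print("active features:", np.flatnonzero(w))
```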
Climate change has been shown to influence lake temperatures globally. To better understand the diversity of lake responses to climate change and give managers tools to manage individual lakes, we modelled daily water temperature profiles for 10,774 lakes in Michigan, Minnesota and Wisconsin for contemporary (1979-2015) and future (2020-2040 and 2080-2100) time periods with climate models based on the Representative Concentration Pathway 8.5, the worst-case emission scenario. From simulated temperatures, we derived commonly used, ecologically relevant annual metrics of thermal conditions for each lake. We included all available supporting metadata including satellite and in-situ observations of water clarity, maximum observed lake depth, land-cover based estimates of surrounding canopy height and observed water temperature profiles (used here for validation). This unique dataset offers landscape-level insight into the future impact of climate change on lakes. This data set contains the following parameters: time, wtr_{z}, which are defined below.
https://dataintelo.com/privacy-and-policy
The global supervised learning market size was valued at approximately USD 3.2 billion in 2023 and is projected to reach USD 12.5 billion by 2032, growing at a remarkable compound annual growth rate (CAGR) of 16.5%. This significant growth can be attributed to the increasing adoption of artificial intelligence (AI) and machine learning (ML) technologies across various industries, driven by the need for enhanced data analytics and predictive modeling capabilities.
One of the primary growth factors for the supervised learning market is the rising demand for automation and efficiency in business processes. As organizations increasingly rely on data to drive decision-making, supervised learning algorithms have become crucial in processing vast amounts of information to derive actionable insights. The ability to train models on labeled data enables businesses to predict outcomes, identify trends, and mitigate risks effectively. This has led to widespread adoption across sectors such as finance, healthcare, and retail, where data-driven decision-making is paramount.
Another key driver is the exponential growth of data generation from various sources, including IoT devices, social media, and enterprise systems. This data explosion necessitates advanced analytical tools to harness its full potential. Supervised learning algorithms, with their capability to handle large datasets and deliver precise predictions, are increasingly being integrated into data analytics platforms. This trend is further bolstered by advancements in computational power and the availability of sophisticated ML frameworks, making it easier for organizations to implement and scale supervised learning solutions.
The increasing focus on personalized customer experiences is also fueling the market's growth. In sectors like retail and e-commerce, businesses are leveraging supervised learning models to analyze customer behavior, preferences, and purchase history. This enables the creation of targeted marketing strategies, personalized recommendations, and improved customer service. Similarly, in the healthcare industry, supervised learning is being used to develop predictive models for patient outcomes, disease diagnosis, and treatment planning, thereby enhancing the quality of care and patient satisfaction.
Regionally, North America holds a significant share of the supervised learning market, driven by the presence of major tech companies and a robust infrastructure for AI and machine learning research. The region is home to numerous innovators and early adopters of advanced technologies, providing a conducive environment for the growth of supervised learning applications. Furthermore, supportive government initiatives and substantial investments in AI research and development are expected to sustain the market's upward trajectory in North America. Meanwhile, Asia Pacific is emerging as a high-growth region, with rapid digital transformation and increased adoption of AI technologies across industries.
Regression algorithms are a fundamental component of supervised learning, widely used for predictive modeling and forecasting. These algorithms help in identifying relationships between variables and predicting continuous outcomes. In industries such as finance and healthcare, regression models are employed to predict stock prices, patient outcomes, and other key metrics. The growth of big data and the need for precise forecasting are driving the adoption of regression algorithms, making them a vital part of the supervised learning landscape.
Classification algorithms play a crucial role in categorizing data into predefined classes or groups. They are extensively utilized in applications like spam detection, image recognition, and fraud detection. With the growing prevalence of digital transactions and online activities, the need for robust classification algorithms to identify fraudulent activities and enhance security measures has never been greater. The increasing demand for cybersecurity solutions and the advancement of classification techniques are contributing to the market's expansion.
Decision trees are another popular supervised learning algorithm known for their simplicity and interpretability. They are used to create models that predict the value of a target variable based on several input features. Decision trees are particularly favored in applications where decision-making rules are crucial, such as credit scoring and medical diagnosis. The ease of understanding and implementing
Success.ai’s Company Data Solutions provide businesses with powerful, enterprise-ready B2B company datasets, enabling you to unlock insights on over 28 million verified company profiles. Our solution is ideal for organizations seeking accurate and detailed B2B contact data, whether you’re targeting large enterprises, mid-sized businesses, or small business contact data.
Success.ai offers B2B marketing data across industries and geographies, tailored to fit your specific business needs. With our white-glove service, you’ll receive curated, ready-to-use company datasets without the hassle of managing data platforms yourself. Whether you’re looking for UK B2B data or global datasets, Success.ai ensures a seamless experience with the most accurate and up-to-date information in the market.
Why Choose Success.ai’s Company Data Solution? At Success.ai, we prioritize quality and relevancy. Every company profile is AI-validated for a 99% accuracy rate and manually reviewed to ensure you're accessing actionable and GDPR-compliant data. Our price match guarantee ensures you receive the best deal on the market, while our white-glove service provides personalized assistance in sourcing and delivering the data you need.
Why Choose Success.ai?
Our database spans 195 countries and covers 28 million public and private company profiles, with detailed insights into each company’s structure, size, funding history, and key technologies. We provide B2B company data for businesses of all sizes, from small business contact data to large corporations, with extensive coverage in regions such as North America, Europe, Asia-Pacific, and Latin America.
Comprehensive Data Points: Success.ai delivers in-depth information on each company, with over 15 data points, including:
- Company Name: the full legal name of the company.
- LinkedIn URL: a direct link to the company's LinkedIn profile.
- Company Domain: the website URL for more detailed research.
- Company Description: an overview of the company's services and products.
- Company Location: geographic location down to the city, state, and country.
- Company Industry: the sector or industry the company operates in.
- Employee Count: the number of employees, to help identify company size.
- Technologies Used: insights into key technologies employed by the company, valuable for tech-based outreach.
- Funding Information: total funding and the most recent funding dates, for investment opportunities.

Maximize Your Sales Potential: With Success.ai's B2B contact data and company datasets, sales teams can build tailored lists of target accounts, identify decision-makers, and access real-time company intelligence. Our curated datasets ensure you're always focused on high-value leads: those most likely to convert into clients. Whether you're conducting account-based marketing (ABM), expanding your sales pipeline, or looking to improve your lead generation strategies, Success.ai offers the resources you need to scale your business efficiently.
Tailored for Your Industry: Success.ai serves multiple industries, including technology, healthcare, finance, manufacturing, and more. Our B2B marketing data solutions are particularly valuable for businesses looking to reach professionals in key sectors. You’ll also have access to small business contact data, perfect for reaching new markets or uncovering high-growth startups.
From UK B2B data to contacts across Europe and Asia, our datasets provide global coverage to expand your business reach and identify new markets. With continuous data updates, Success.ai ensures you’re always working with the freshest information.
Key Use Cases:
https://www.archivemarketresearch.com/privacy-policy
The Alternative Data Market size was valued at USD 7.20 billion in 2023 and is projected to reach USD 126.50 billion by 2032, exhibiting a CAGR of 50.6% during the forecast period. The alternative data market concerns the use and processing of information that is not found in traditional financial databases. Such data includes posts on social networks, satellite images, credit card transactions, web traffic, and many other sources. It is mostly used in the financial field to make investment decisions, manage risks, and analyze competitors, giving a more general view of market trends as well as consumer attitudes. Demand for data from unconventional sources is increasing as firms strive to stay ahead in highly competitive markets. Current trends include the use of AI and machine learning to process large data sets and the broadening utilization of so-called alternative data across industries beyond finance.
Recent developments include:
- In April 2023, Thinknum Alternative Data launched new data fields for its employee sentiment datasets, allowing people analytics teams and investors to use them as an 'employee NPS' proxy and supporting highly rated employers in setting up interviews through employee referrals.
- In September 2022, Thinknum Alternative Data announced its plan to combine data from Similarweb, SensorTower, Thinknum, Caplight, and Pathmatics with Lagoon, a sophisticated infrastructure platform, to deliver an alternative data source for investment research, due diligence, deal sourcing and origination, and post-acquisition strategies in private markets.
- In May 2022, M Science LLC launched a consumer spending trends platform, providing daily, weekly, monthly, and semi-annual visibility into consumer behaviors and competitive benchmarking. The platform provided real-time insights into consumer spending patterns for Australian brands and an unparalleled business performance analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the rapid increase of large-scale datasets, biomedical data visualization is facing challenges. The data may be large, span different orders of magnitude, contain extreme values, and have an unclear distribution. Here we present an R package, ggbreak, that allows users to create broken axes using ggplot2 syntax. It can effectively use the plotting area to deal with large datasets (especially long sequential data), data with different magnitudes, and outliers. The ggbreak package increases the available visual space for a better presentation of the data and detailed annotation, thus improving our ability to interpret the data. The ggbreak package is fully compatible with ggplot2, and it is easy to superpose additional layers and apply scales and themes to adjust the plot using the ggplot2 syntax. The ggbreak package is open-source software released under the Artistic-2.0 license, and it is freely available on CRAN (https://CRAN.R-project.org/package=ggbreak) and GitHub (https://github.com/YuLab-SMU/ggbreak).