Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An Exploratory Data Analysis comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts yield arguments that inform the Systematic Literature Review (SLR) of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. These hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a structured DL4SE database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
1. Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features (attributes) found in the repository. In fact, we manually engineered these features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
2. Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to recover information missing after the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
3. Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes (see the illustrative sketch after this list). PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to be used when tuning the explainable models.
4. Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncovering hidden relationships among the extracted features (Correlations and Association Rules) and to categorizing the DL4SE papers for a better segmentation of the state of the art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
5. Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by a reasoning process on the data mining outcomes, which produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
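The statistical pipelines themselves were built in RapidMiner. Purely as an illustration of the PCA step described in the Transformation stage, and not the authors' actual pipeline, a comparable reduction could be sketched in Python as follows; the file name dl4se_features.csv and the one-hot encoding step are assumptions.

```python
# Illustrative sketch only; the actual pipelines were built in RapidMiner.
# Assumes a hypothetical CSV export with the 35 nominal features per paper.
import pandas as pd
from sklearn.decomposition import PCA

papers = pd.read_csv("dl4se_features.csv")      # hypothetical file name
X = pd.get_dummies(papers)                      # one-hot encode the nominal attributes

pca = PCA(n_components=2)                       # 2 components for visualization
components = pca.fit_transform(X)
print(pca.explained_variance_ratio_)            # variance explained by each component
```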
Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.
Support = the number of occurrences in which the statement (premise and conclusion together) is true, divided by the total number of statements.
Confidence = the support of the statement divided by the number of occurrences of the premise.
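To make these two measures concrete, the following is a minimal sketch, not the study's RapidMiner pipeline, that computes support and confidence for the example rule “Supervised Learning → Irreproducible”; the toy table and column names are hypothetical.

```python
# Minimal sketch: support and confidence for one association rule.
# The DataFrame and column names below are hypothetical.
import pandas as pd

papers = pd.DataFrame({
    "learning_algorithm": ["Supervised", "Supervised", "Unsupervised", "Supervised"],
    "reproducibility":    ["Irreproducible", "Irreproducible", "Reproducible", "Reproducible"],
})

premise    = papers["learning_algorithm"] == "Supervised"
conclusion = papers["reproducibility"] == "Irreproducible"

support    = (premise & conclusion).sum() / len(papers)    # rule true / all statements
confidence = (premise & conclusion).sum() / premise.sum()  # rule true / premise true

print(f"support={support:.2f}, confidence={confidence:.2f}")
```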
Resources for Advanced Data Analysis and Visualization
Researchers who have access to the latest analysis and visualization tools are able to use large amounts of complex data to find efficiencies in projects, designs, and resources. The Data Analysis and Assessment Center (DAAC) at ERDC's Information Technology Laboratory (ITL) provides visualization and analysis tools and support services to enable the analysis of an ever-increasing volume of data.
Simplify Data Analysis and Visualization Research
The resources provided by the DAAC enable any user to conduct important data analysis and visualization that provides valuable insight into projects and designs and helps to find ways to save resources. The DAAC provides new tools like ezVIZ, and services such as the DAAC website, a rich resource of news about the DAAC, training materials, a community forum, and tutorials on data analysis and other topics. The DAAC can perform collaborative work when users prefer to do the work themselves but need help in choosing a visualization program and/or technique and in using the visualization tools. The DAAC also carries out custom projects to produce high-quality animations of data, such as movies, which allow researchers to communicate their results to others.
Communicate Research in Context
The DAAC provides leading animation and modeling software that allows scientists and researchers to communicate all aspects of their research by setting their results in context through conceptual visualization and data analysis.
Success Stories
Wave Breaking and Associated Droplet and Bubble Formation
Wave breaking and associated droplet and bubble formation are among the most challenging problems in the field of free-surface hydrodynamics. Computational fluid dynamics (CFD) was used to solve this problem numerically for flow about naval vessels. The researchers wanted to animate the time-varying three-dimensional data sets using isosurfaces, but transferring the data back to the local site was a problem because the data sets were large. The DAAC visualization team solved the problem by using EnSight and ezVIZ to generate the isosurfaces, and photorealistic rendering software to produce the images for the animation.
Explosive Structure Interaction Effects in Urban Terrain
Known as the Breaching Project, this research studied the effects of high-explosive (HE) charges on brick or reinforced concrete walls. The results of this research will enable the war fighter to breach a wall to enter a building where enemy forces are conducting operations against U.S. interests. Images produced show computed damage caused by an HE charge on the outer and inner sides of a reinforced concrete wall. The ability to quickly and meaningfully analyze large simulation data sets helps guide further development of new HE package designs and better ways to deploy the HE packages. A large number of designs can be simulated and analyzed to find the best at breaching the wall.
The project saves money in greatly reduced field test costs by testing only the designs which were identified in analysis as the best performers.
Specifications
Amethyst, the seven-node Linux visualization cluster housed at the DAAC, is supported by the ParaView, EnSight, and ezVIZ visualization tools and configured as follows:
Six compute nodes, each with the following specifications:
- CPU: 8 dual-core 2.4 GHz, 64-bit AMD Opteron processors (16 effective cores)
- Memory: 128 GB RAM
- Video: NVIDIA Quadro 5500 with 1 GB memory
- Network: InfiniBand interconnect between nodes, and Gigabit Ethernet to the Defense Research and Engineering Network (DREN)
One storage node:
- Disk Space: 20 TB TerraGrid file system, mounted on all nodes as /viz and /work
Scientific investigation is of value only insofar as relevant results are obtained and communicated, a task that requires organizing, evaluating, analysing and unambiguously communicating the significance of data. In this context, working with ecological data, reflecting the complexities and interactions of the natural world, can be a challenge. Recent innovations for statistical analysis of multifaceted interrelated data make obtaining more accurate and meaningful results possible, but key decisions about which analyses to use, and which components to present in a scientific paper or report, may be overwhelming. We offer a 10-step protocol to streamline analysis of data that will enhance understanding of the data, the statistical models and the results, and optimize communication with the reader with respect to both the procedure and the outcomes. The protocol takes the investigator from study design and organization of data (formulating relevant questions, visualizing data collection, data...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The FragPipe computational proteomics platform is gaining widespread popularity among the proteomics research community because of its fast processing speed and user-friendly graphical interface. Although FragPipe produces well-formatted output tables that are ready for analysis, there is still a need for an easy-to-use and user-friendly downstream statistical analysis and visualization tool. FragPipe-Analyst addresses this need by providing an R shiny web server to assist FragPipe users in conducting downstream analyses of the resulting quantitative proteomics data. It supports major quantification workflows, including label-free quantification, tandem mass tags, and data-independent acquisition. FragPipe-Analyst offers a range of useful functionalities, such as various missing value imputation options, data quality control, unsupervised clustering, differential expression (DE) analysis using Limma, and gene ontology and pathway enrichment analysis using Enrichr. To support advanced analysis and customized visualizations, we also developed FragPipeAnalystR, an R package encompassing all FragPipe-Analyst functionalities that is extended to support site-specific analysis of post-translational modifications (PTMs). FragPipe-Analyst and FragPipeAnalystR are both open-source and freely available.
This dataset was created by samy ghebache
This webinar highlighted several research projects conducting secondary analysis on the National Survey on Child and Adolescent Well-Being (NSCAW), a study funded by ACF's Office of Planning Research and Evaluation (OPRE). This survey is a five-year longitudinal study of 5501 children who had contact with the child welfare system. Research using NSCAW data informs policy and practice in child welfare services and other services to maltreated children and their families, and advances the state of knowledge in child maltreatment, child welfare, child and family services, and/or child development for high-risk children.
Presenters: Barbara J. Burns, Ph.D., Duke University School of Medicine; Sandra Jee, Ph.D., University of Rochester Medical Center; and Dana Schultz, Ph.D., Rand Corporation
View the Webinar (WMV - 52MB)
Metadata-only record linking to the original dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Initial data analysis (IDA) is the part of the data pipeline that takes place between the end of data retrieval and the beginning of data analysis that addresses the research question. Systematic IDA and clear reporting of the IDA findings is an important step towards reproducible research. A general framework of IDA for observational studies includes data cleaning, data screening, and possible updates of pre-planned statistical analyses. Longitudinal studies, where participants are observed repeatedly over time, pose additional challenges, as they have special features that should be taken into account in the IDA steps before addressing the research question. We propose a systematic approach in longitudinal studies to examine data properties prior to conducting planned statistical analyses. In this paper we focus on the data screening element of IDA, assuming that the research aims are accompanied by an analysis plan, meta-data are well documented, and data cleaning has already been performed. IDA data screening comprises five types of explorations, covering the analysis of participation profiles over time, evaluation of missing data, presentation of univariate and multivariate descriptions, and the depiction of longitudinal aspects. Executing the IDA plan will result in an IDA report to inform data analysts about data properties and possible implications for the analysis plan—another element of the IDA framework. Our framework is illustrated focusing on hand grip strength outcome data from a data collection across several waves in a complex survey. We provide reproducible R code on a public repository, presenting a detailed data screening plan for the investigation of the average rate of age-associated decline of grip strength. With our checklist and reproducible R code we provide data analysts a framework to work with longitudinal data in an informed way, enhancing the reproducibility and validity of their work.
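The study ships its own reproducible R code in a public repository. Purely as an illustration of what the data-screening explorations described above might look like in practice, here is a minimal Python sketch on a hypothetical long-format table; the file name and the columns id, wave, and grip_strength are assumptions.

```python
# Illustrative sketch only (the paper provides its own R code).
# Assumes one row per participant per wave, with columns: id, wave, grip_strength.
import pandas as pd

df = pd.read_csv("grip_strength_waves.csv")        # hypothetical file name

# Participation profiles over time: how many waves each participant contributes.
waves_per_person = df.groupby("id")["wave"].nunique()
print(waves_per_person.value_counts().sort_index())

# Missing-data evaluation: proportion missing per variable within each wave.
print(df.isna().groupby(df["wave"]).mean())

# Univariate description of the outcome by wave (a first longitudinal view).
print(df.groupby("wave")["grip_strength"].describe())
```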
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python's Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset. This dataset was generated using Python's Faker library for the sole purpose of learning
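As a rough sketch of how rows with the fields described above could be generated with Faker, and not the generator actually used for this dataset, something like the following would work; the locale, category list, and discount logic are assumptions.

```python
# Minimal sketch of generating synthetic transactions with Python's Faker library.
# Locale, categories, and discount behaviour are made-up assumptions.
import random
import pandas as pd
from faker import Faker

fake = Faker("en_IN")                      # assumption: Indian locale, since amounts are in INR
categories = ["Electronics", "Apparel", "Home", "Beauty"]   # hypothetical category list

rows = []
for i in range(1000):                      # the real dataset has 55,000 rows
    gross = round(random.uniform(100, 50000), 2)
    discounted = random.random() < 0.4
    discount = round(gross * 0.1, 2) if discounted else 0.0
    rows.append({
        "CID": f"C{random.randint(1, 5000)}",              # customers repeat across transactions
        "TID": f"T{i:06d}",
        "Gender": random.choice(["Male", "Female"]),
        "Age Group": random.choice(["18-25", "26-35", "36-50", "51+"]),
        "Purchase Date": fake.date_time_between(start_date="-1y", end_date="now"),
        "Product Category": random.choice(categories),
        "Discount Availed": "Yes" if discounted else "No",
        "Discount Name": "FESTIVE50" if discounted else None,
        "Discount Amount (INR)": discount,
        "Gross Amount": gross,
        "Net Amount": round(gross - discount, 2),
        "Purchase Method": random.choice(["Credit Card", "Debit Card", "UPI", "Cash"]),
        "Location": fake.city(),
    })

df = pd.DataFrame(rows)
```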
According to our latest research, the global rocket engine test data analytics market size in 2024 stands at USD 1.42 billion. The market is experiencing robust expansion, with a compounded annual growth rate (CAGR) of 12.8% from 2025 to 2033. By 2033, the market is forecasted to reach a value of USD 4.19 billion. This growth is primarily fueled by the increasing demand for advanced data analytics to enhance the reliability, safety, and performance of rocket engines, as well as the rising frequency of space missions and test launches across both governmental and commercial sectors.
One of the key factors propelling the growth of the rocket engine test data analytics market is the rapid technological advancement in data acquisition and processing systems. Modern rocket engine tests generate colossal volumes of data, encompassing parameters such as thrust, temperature, vibration, and fuel flow. The integration of sophisticated analytics platforms enables stakeholders to derive actionable insights from this data, facilitating real-time monitoring, anomaly detection, and root-cause analysis. This technological leap not only shortens development cycles but also significantly reduces the risk of catastrophic failures, making it indispensable for organizations aiming to maintain a competitive edge in the aerospace and defense sector.
Another significant growth driver is the escalating investment in space exploration and commercial spaceflight activities. Both government agencies like NASA and ESA, as well as private players such as SpaceX and Blue Origin, are conducting more frequent and complex test campaigns. These organizations increasingly rely on data analytics to validate engine designs, optimize test procedures, and ensure compliance with stringent safety standards. The advent of reusable rocket technology further amplifies the need for predictive maintenance and performance analytics, as understanding wear and tear across multiple launches becomes critical to mission success and cost efficiency.
The convergence of artificial intelligence (AI) and machine learning (ML) with rocket engine test data analytics is also catalyzing market expansion. Advanced algorithms are now capable of identifying subtle patterns and correlations within vast datasets, enabling predictive maintenance and early fault detection with unprecedented accuracy. This capability is particularly valuable for commercial space companies and research institutes seeking to maximize engine uptime and minimize unplanned downtimes. Moreover, the growing adoption of cloud-based analytics platforms is democratizing access to high-performance computing resources, allowing smaller organizations and emerging space nations to participate in the market and drive further innovation.
From a regional perspective, North America continues to dominate the rocket engine test data analytics market, accounting for over 43% of the global revenue in 2024. This leadership is attributed to the presence of major aerospace companies, robust government funding, and a vibrant ecosystem of technology providers. However, Asia Pacific is emerging as the fastest-growing region, with countries like China and India ramping up their space programs and investing heavily in indigenous rocket engine development and testing infrastructure. Europe also remains a significant market, driven by collaborative initiatives and strong research capabilities. The Middle East & Africa and Latin America, while still nascent, are expected to witness steady growth as regional space ambitions intensify.
The component segment of the rocket engine test data analytics market is categorized into software, hardware, and services. The software component is witnessing the highest growth, driven by the increasing demand for advanced analytics platforms capable of handling large-scale, high-velocity data streams generated during engine tests. These so
ABSTRACT Meta-analysis is an appropriate statistical technique to combine results from different studies, and its use has been growing in the medical field. Thus, not only knowing how to interpret a meta-analysis, but also knowing how to perform one, is fundamental today. Therefore, the objective of this article is to present the basic concepts and serve as a guide for conducting a meta-analysis using the R and RStudio software. For this, the reader has access to the basic commands in R and RStudio necessary for conducting a meta-analysis. The advantage of R is that it is free software. For a better understanding of the commands, two examples are presented in a practical way, in addition to reviewing some basic concepts of this statistical technique. It is assumed that the data necessary for the meta-analysis have already been collected; that is, the description of methodologies for systematic review is not discussed here. Finally, it is worth remembering that there are many other techniques used in meta-analyses that were not addressed in this work. However, with the two examples used, the article already enables the reader to proceed with good and robust meta-analyses. Level of Evidence V, Expert Opinion.
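The article's worked examples are in R. Purely as a language-agnostic illustration of the core calculation behind a basic fixed-effect meta-analysis (inverse-variance pooling), a minimal sketch with made-up study results might look like this:

```python
# Illustrative sketch only: the article's examples use R/RStudio.
# Fixed-effect (inverse-variance) pooling with hypothetical effect sizes and variances.
import numpy as np

y = np.array([0.30, 0.45, 0.12, 0.50])   # per-study effect estimates (made up)
v = np.array([0.04, 0.09, 0.02, 0.06])   # per-study variances (made up)

w = 1.0 / v                               # inverse-variance weights
pooled = np.sum(w * y) / np.sum(w)        # pooled effect estimate
se = np.sqrt(1.0 / np.sum(w))             # standard error of the pooled effect
ci = (pooled - 1.96 * se, pooled + 1.96 * se)

print(f"pooled effect = {pooled:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```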
Most respondents used AI-enabled software to analyze qualitative data in 2023. However, expected use of AI shifted most heavily towards conducting data science or analytics in the future. Overall, survey data had the most expected AI usage both currently and in the future. The largest shift is expected in meta-analysis, with a ** percent increase between “have used” and “might use”.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the original datasets and corresponding analysis code files for five experiments. The xlsx files contain the raw data, while the R files include the code for data analysis.
- data_S1.xlsx contains the data for Study 1.
- data_S2.xlsx contains the data for Study 2.
- data_pre.xlsx contains the data for the pre-study.
- data_S3.xlsx contains the data for Study 3.
- data_S4.xlsx contains the data for Study 4.
- S1.R is the analysis code for Study 1, which conducts the data analysis on data_S1.xlsx.
- S2.R is the analysis code for Study 2, which conducts the data analysis on data_S2.xlsx.
- Spre.R is the analysis code for the pre-study, which conducts the data analysis on data_pre.xlsx.
- S3.R is the analysis code for Study 3, which conducts the data analysis on data_S3.xlsx.
- S4.R is the analysis code for Study 4, which conducts the data analysis on data_S4.xlsx.
- process.R is the bootstrap program file that contains the process functions used in S1.R, S2.R, S3.R, and S4.R. It must be run first so that these functions are imported.
The purpose of this project is to provide the resources and capabilities necessary to permit the State of Oklahoma to conduct Area of Review (AOR) variance analysis on a statewide level. The project allows for the analysis and identification of areas which may qualify for AOR variances, the correlation of information from various databases and automated systems to conduct AORs in areas which do not qualify for variances, the evaluation of the risk of pollution during permitting and monitoring using risk-based data analysis, and the ability to conduct spatial analysis of injection well data in conjunction with other geographically referenced information.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
‘Ease of use’ assessment self-completed questionnaire on self-testing in Zambia (N = 1125).
Title: Identifying Factors that Affect Entrepreneurs’ Use of Data Mining for Analytics
Authors: Edward Matthew Dominica, Feylin Wijaya, Andrew Giovanni Winoto, Christian
Conference: The 4th International Conference on Electrical, Computer, Communications, and Mechatronics Engineering, https://www.iceccme.com/home
This dataset was created to support research focused on understanding the factors influencing entrepreneurs’ adoption of data mining techniques for business analytics. The dataset contains carefully curated data points that reflect entrepreneurial behaviors, decision-making criteria, and the role of data mining in enhancing business insights.
Researchers and practitioners can leverage this dataset to explore patterns, conduct statistical analyses, and build predictive models to gain a deeper understanding of entrepreneurial adoption of data mining.
Intended Use: This dataset is designed for research and academic purposes, especially in the fields of business analytics, entrepreneurship, and data mining. It is suitable for conducting exploratory data analysis, hypothesis testing, and model development.
Citation: If you use this dataset in your research or publication, please cite the paper presented at the ICECCME 2024 conference using the following format: Edward Matthew Dominica, Feylin Wijaya, Andrew Giovanni Winoto, Christian. Identifying Factors that Affect Entrepreneurs’ Use of Data Mining for Analytics. The 4th International Conference on Electrical, Computer, Communications, and Mechatronics Engineering (2024).
File List: simulation.R, figures.R, redkite/model_monotonic.R, redkite/plot_fun.R, redkite/bva.Rda, redkite/variables.csv, redkite/XY.Rda
Description:
- "simulation.R": R code to set up and perform the simulations described in the Appendix.
- "figures.R": R code to produce the figures presented in the paper, based on the models fitted using "modelfit_monotonic.R".
- "modelfit_monotonic.R": R code for model fitting on the Red Kite breeding data, once with a monotonic constraint and once unconstrained.
- "plot_fun.R": plotting functions to be used in "figures.R".
- "bva.Rda", "variables.csv": R and csv data files containing the Red Kite breeding data for Bavaria and the variable description.
- "redkite/XY.Rda": R data file with data required for plotting purposes only.
Rda files can be loaded into R as in R> load("bva.Rda")
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Overview: This dataset contains transactional data of customers from an e-commerce platform, to analyze and understand their purchasing behavior. The dataset includes customer ID, product purchased, purchase amount, purchase date, and product category.
Purpose of the Dataset: The primary objective of this dataset is to provide an opportunity to perform data exploration and preprocessing, allowing users to practice and enhance their data cleaning and analysis skills. The dataset has been intentionally modified to simulate a "messy" scenario, where some values have been removed, and inconsistencies have been introduced, which provides a real-world challenge for users to handle during data preparation.
Key Features:
- CustomerID: Unique identifier for each customer.
- ProductID: Unique identifier for each product purchased.
- PurchaseAmount: Amount spent by the customer on a particular transaction.
- PurchaseDate: Date when the transaction took place.
- ProductCategory: Category of the purchased product.
Analysis Opportunities:
- Perform data cleaning and preprocessing to handle missing values, duplicates, and outliers.
- Conduct exploratory data analysis (EDA) to uncover trends and patterns in customer behavior.
- Apply machine learning models like clustering and association rule mining for segmenting customers and understanding purchasing patterns.
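As a minimal sketch of the cleaning and EDA steps listed above, assuming a hypothetical CSV export named ecommerce_transactions.csv with the key features described earlier:

```python
# Minimal sketch; the file name and cleaning choices are assumptions, not part of the dataset.
import pandas as pd

df = pd.read_csv("ecommerce_transactions.csv", parse_dates=["PurchaseDate"])

# Data cleaning: drop duplicate transactions and inspect missing values.
df = df.drop_duplicates()
print(df.isna().sum())

# Simple handling choices for the intentionally "messy" fields.
df["ProductCategory"] = df["ProductCategory"].fillna("Unknown")
df = df[df["PurchaseAmount"] > 0]            # drop non-positive amounts as outliers

# Exploratory analysis: spend per category and monthly purchase trend.
print(df.groupby("ProductCategory")["PurchaseAmount"].agg(["count", "mean", "sum"]))
print(df.set_index("PurchaseDate").resample("M")["PurchaseAmount"].sum())
```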
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data collected during the study "Smarter open government data for Society 5.0: are your open data smart enough" (Sensors. 2021; 21(15):5204) conducted by Anastasija Nikiforova (University of Latvia). It is being made public both to act as supplementary data for the "Smarter open government data for Society 5.0: are your open data smart enough" paper and to allow other researchers to use these data in their own work.
The data in this dataset were collected by inspecting 60 countries and their OGD portals (a total of 51 OGD portals in May 2021) to find out whether they meet the trends of Society 5.0 and Industry 4.0, which was determined by conducting an analysis of the relevant OGD portals.
Each portal was studied starting with a search for a data set of interest, i.e. “real-time”, “sensor” and “covid-19”, followed by asking a list of additional questions. These questions were formulated on the basis of a combination of (1) crucial open (government) data-related aspects, including open data principles, success factors, recent studies on the topic, the PSI Directive, etc., (2) trends and features of Society 5.0 and Industry 4.0, and (3) elements of the Technology Acceptance Model (TAM) and the Unified Theory of Acceptance and Use of Technology (UTAUT).
The method used belongs to the typical daily tasks performed on open data portals, sometimes called a “usability test”: keywords related to a research question are used to filter data sets, i.e. “real-time”, “real time”, “sensor”, “covid”, “covid-19”, “corona”, “coronavirus”, “virus”. In most cases, the “real-time”, “sensor” and “covid” keywords were sufficient.
For less user-friendly portals, the examination of the respective aspects was adapted to the particular case, based on the portal or data set specifics, by checking:
1. are the open data related to the topic under question ({sensor; real-time; Covid-19}) published, i.e. available?
2. are these data available in a machine-readable format?
3. are these data current, i.e. regularly updated? The currency criterion depends on the nature of the data; e.g., Covid-19 data on the number of cases per day are expected to be updated daily, which would not be sufficient for real-time data, as the title supposes.
4. is an API provided for these data? This is most important for real-time and sensor data;
5. have they been published in a timely manner? This was verified mainly for Covid-19-related data; timeliness is assessed by comparing the date of the first case identified in a given country with the date of the first release of open data on this topic.
6. what is the total number of available data sets?
7. does the open government data portal provide use cases / showcases?
8. does the open government portal provide an opportunity to gain insight into the popularity of the data, i.e. does the portal provide statistics of this nature, such as the number of views, downloads, reuses, rating etc.?
9. is there an opportunity to provide feedback, a comment, a suggestion or a complaint?
10. (9a) is the artifact, i.e. feedback, comment, suggestion or complaint, visible to other users?
Format of the file: .xls, .ods, .csv (for the first spreadsheet only)
Licenses or restrictions: CC-BY
For more info, see README.txt
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In studies of cognitive neuroscience, multivariate pattern analysis (MVPA) is widely used as it offers richer information than traditional univariate analysis. Representational similarity analysis (RSA), as one method of MVPA, has become an effective decoding method based on neural data by calculating the similarity between different representations in the brain under different conditions. Moreover, RSA is suitable for researchers to compare data from different modalities and even bridge data from different species. However, previous toolboxes have been made to fit specific datasets. Here, we develop NeuroRA, a novel and easy-to-use toolbox for representational analysis. Our toolbox aims at conducting cross-modal data analysis from multi-modal neural data (e.g., EEG, MEG, fNIRS, fMRI, and other sources of neuroelectrophysiological data), behavioral data, and computer-simulated data. Compared with previous software packages, our toolbox is more comprehensive and powerful. Using NeuroRA, users can not only calculate the representational dissimilarity matrix (RDM), which reflects the representational similarity among different task conditions, but also conduct a representational analysis among different RDMs to achieve a cross-modal comparison. In addition, users can calculate neural pattern similarity (NPS), spatiotemporal pattern similarity (STPS), and inter-subject correlation (ISC) with this toolbox. NeuroRA also provides users with functions for statistical analysis, storage, and visualization of results. We introduce the structure, modules, features, and algorithms of NeuroRA in this paper, as well as examples applying the toolbox to published datasets.
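As a generic illustration of the central RSA object described above, and not NeuroRA's own API, the following sketch builds a representational dissimilarity matrix (RDM) from made-up condition patterns and compares two RDMs:

```python
# Generic illustration (not NeuroRA's API) of an RDM: 1 - Pearson correlation
# between activity patterns for each pair of conditions. Data shapes are made up.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
patterns = rng.normal(size=(8, 100))       # 8 conditions x 100 voxels/channels (hypothetical)

rdm = 1.0 - np.corrcoef(patterns)          # dissimilarity = 1 - correlation
print(rdm.shape)                           # (8, 8), symmetric, zeros on the diagonal

# A typical cross-modal RSA step: correlate the upper triangles of two RDMs
# (e.g., a neural RDM vs. a model RDM) with Spearman's rho.
model_rdm = 1.0 - np.corrcoef(rng.normal(size=(8, 100)))
iu = np.triu_indices(8, k=1)
rho, p = spearmanr(rdm[iu], model_rdm[iu])
print(rho, p)
```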
https://www.reportsanddata.com/privacy-policy
Discover Conductive Fibc Bags Market size, share, and forecast data for informed decision-making. Actionable insights backed by research.