Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a ProgSnap2-based dataset containing anonymized logs of over 34,000 programming events produced by 81 students in Scratch, a visual programming environment, during the study described in the paper "Semi-Automatically Mining Students' Common Scratch Programming Behaviors." We also include a list of approximately 3,100 mined sequential patterns of programming processes. Each pattern is performed by at least 10% of the 62 novice programmers among the 81 students, and the patterns are the maximal patterns generated by the MG-FSM algorithm while allowing a gap of one programming event. The dataset comprises:
README.txt — overview of the dataset and its properties
mainTable.csv — main event table of the dataset, holding rows of programming events
codeState.csv — table holding XML representations of code snapshots at the time of each programming event
datasetMetadata.csv — describes features of the dataset
Scratch-SeqPatterns.txt — list of sequential patterns mined from the Main Event Table
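The one-event gap constraint can be illustrated with a small sketch (the event names and the function below are ours for illustration; this is not the MG-FSM implementation, which mines all frequent patterns rather than checking a single one):

```python
def matches_with_gap(pattern, sequence, max_gap=1):
    """Return True if `pattern` occurs in `sequence` in order, with at
    most `max_gap` non-matching events between consecutive matches."""
    def search(p, start):
        if p == len(pattern):
            return True  # every pattern item has been matched
        # the first pattern item may start anywhere in the sequence;
        # each later item must occur within max_gap events of the previous match
        stop = len(sequence) if p == 0 else min(start + max_gap + 1, len(sequence))
        return any(sequence[i] == pattern[p] and search(p + 1, i + 1)
                   for i in range(start, stop))
    return search(0, 0)

# Hypothetical Scratch event log for one student session
events = ["AddBlock", "RunProgram", "AddBlock", "DeleteBlock", "RunProgram"]
print(matches_with_gap(["AddBlock", "DeleteBlock"], events))   # True (gap of 0)
print(matches_with_gap(["RunProgram", "RunProgram"], events))  # False (gap of 2)
```

A pattern is then "performed by" a student if it matches at least one of that student's sessions under this constraint.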
A primary goal in designing smart homes is to provide automatic assistance that enables residents to live independently at home. Activity recognition is performed to achieve this goal, and to provide assistance we need three sorts of information: first, the resident's goal; second, the pattern the resident should follow to achieve that goal; and third, the deviations from previously known patterns. In the presented paper, spatiotemporal aspects of daily activities are surveyed to mine the activity patterns of smart home residents. The data needed to model these spatiotemporal aspects is provided by sensors embedded in the smart home. We believe that specific objects are used to accomplish daily activities, and that by analyzing the movement of objects and resident(s) we can obtain valuable information for modeling the daily activities of a smart home's residents.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform our conclusions. The mined facts yield arguments that inform the Systematic Literature Review (SLR) of DL4SE.
The SLR of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. These hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a structured DL4SE database. This database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
1. Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, found in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
2. Preprocessing. Preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to recover information lost in the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features, such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
3. Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis, for which we performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us choose the number of clusters to use when tuning the explainable models.
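A minimal NumPy sketch of this PCA step (the feature matrix below is random stand-in data; the actual pipeline reduces the 35 extracted features and was run in RapidMiner, not this code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in one-hot feature matrix: 60 "papers" x 8 binary attributes
X = rng.integers(0, 2, size=(60, 8)).astype(float)

# PCA via SVD of the mean-centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T  # projection onto the first two principal components

# Share of total variance captured by the 2-D view
explained = (S ** 2) / (S ** 2).sum()
print(X2.shape, round(float(explained[:2].sum()), 3))
```

The per-component variance ratios also support the elbow-style choice of cluster count mentioned above: one looks for the component (or cluster count) after which the reduction in variance levels off.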
4. Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be to uncover hidden relationships among the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state of the art (Clustering). A full explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
5. Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by a reasoning process over the data mining outcomes, which produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles represent both Premises and Conclusions; an arrow connecting a Premise to a Conclusion indicates that, given the premise, the conclusion is associated with it. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible, with a certain Support and Confidence.
Support = the number of occurrences in which the statement is true, divided by the total number of statements.
Confidence = the support of the statement divided by the support of the premise.
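These two definitions can be made concrete with a toy sketch (the attribute values below are invented for illustration, not rows from the actual DL4SE table):

```python
# Each row is the set of attribute values extracted for one paper.
papers = [
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Irreproducible"},
    {"Supervised Learning", "Reproducible"},
    {"Unsupervised Learning", "Irreproducible"},
]

def support(itemset, rows):
    # fraction of rows in which every item of the statement holds
    return sum(itemset <= row for row in rows) / len(rows)

def confidence(premise, conclusion, rows):
    # support of premise-and-conclusion relative to the support of the premise
    return support(premise | conclusion, rows) / support(premise, rows)

rule = ({"Supervised Learning"}, {"Irreproducible"})
print(support(rule[0] | rule[1], papers))   # 0.5
print(round(confidence(*rule, papers), 3))  # 0.667
```

So the rule "Supervised Learning ⇒ Irreproducible" holds in half of all papers (support) and in two thirds of the papers that use supervised learning (confidence).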
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The aim of this paper is the acquisition of geographic data from the Foursquare application, using data mining to perform exploratory and spatial analyses of the distribution of tourist attractions and their density in Rio de Janeiro city. In accordance with the Extraction, Transformation, and Load (ETL) methodology, three research algorithms were developed using a hierarchical tree structure to collect information for the categories Museums, Monuments and Landmarks, Historic Sites, Scenic Lookouts, and Trails from the Foursquare database. A quantitative analysis of check-ins per neighborhood of Rio de Janeiro city was performed, and kernel density (hot spot) maps were generated. The results show the need for the data filtering process - less than 50% of the mined data were used - and that a large part of the density of the Museums, Historic Sites, and Monuments and Landmarks categories is in the center of the city, while the Scenic Lookouts and Trails categories predominate in the south zone. This kind of analysis was shown to be a tool to support the city's tourist management in relation to the spatial localization of these categories, the tourists’ evaluations of the places, and the frequency of the target public.
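The hot spot idea behind those maps can be sketched with a plain Gaussian kernel density estimate (the coordinates below are synthetic stand-ins for check-in locations, not Foursquare data, and the bandwidth is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic check-in coordinates (lon, lat) clustered around one spot
pts = rng.normal(loc=[-43.2, -22.9], scale=0.05, size=(200, 2))

def kernel_density(grid, points, bandwidth=0.02):
    """Unnormalized Gaussian kernel density at each grid location."""
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).sum(axis=1)

# Evaluate on a coarse grid; the maximum marks the hot spot
xs = np.linspace(-43.4, -43.0, 41)
ys = np.linspace(-23.1, -22.7, 41)
grid = np.array([(x, y) for x in xs for y in ys])
dens = kernel_density(grid, pts)
print("hot spot near:", grid[dens.argmax()].round(2))
```

A GIS package would additionally project the coordinates and normalize the surface, but the ranking of hot and cold cells follows this same kernel-sum computation.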
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Ateeb Shamas
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Given that the metals, minerals and energy resources extracted through mining are fundamental to human society, it follows that accurate data describing mine production are equally important. Although there are often national statistical sources, these typically include data only for metals (e.g., gold), minerals (e.g., iron ore) or energy resources (e.g., coal). No previous study has compiled a national mine production data set that includes basic mining data such as ore processed, grades, extracted products (e.g., metals, concentrates, saleable ore) and waste rock. These data are crucial for geological assessments of mineable resources, environmental impacts, and material flows (including losses during mining, smelting-refining, use and disposal or recycling), as well as for more quantitative assessments of critical mineral potential (including possible extraction from tailings and/or waste rock left by mining). This data set meets these needs for Australia, providing a world-first, comprehensive review of a national mining industry and an exemplar of what can be achieved for other countries with mining industry sectors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary material for the article "Mutation Testing in the Wild: Findings from GitHub" submitted to the Empirical Software Engineering Journal. It includes:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set covers global extraction and production of coal and metal ores at the individual mine level. It covers 1171 individual mines, reporting mine-level production for 80 different materials in the period 2000-2021. It also includes data on mine coordinates, ownership, mineral reserves, mining waste, transportation of mining products, and mineral processing capacities (smelters and mineral refineries) and production. The data was gathered manually from more than 1900 openly available sources, such as annual or sustainability reports of mining companies, and all data points are linked to their respective sources. After manual screening and entry of the data, automatic cleaning, harmonization and data checking were conducted. Geoinformation was obtained either from coordinates available in company reports or by retrieving the coordinates via the Google Maps API with subsequent manual checking. For mines where no coordinates could be found, other geospatial attributes such as province, region, district or municipality were recorded and linked to the GADM data set, available at www.gadm.org.
The data set consists of 12 tables. The table “facilities” contains descriptive and spatial information on mines and processing facilities, and is available as a GeoPackage (GPKG) file. All other tables are available in comma-separated values (CSV) format. A schematic depiction of the database is provided in PNG format in the file database_model.png.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Peptides are biologically ubiquitous and important molecules that self-assemble into diverse structures. While extensive research has explored the effects of chemical composition and environmental conditions on self-assembly, a systematic study consolidating this data to uncover global rules is lacking. In this work, we curate a peptide assembly database through a combination of manual processing by human experts and literature mining with a large language model. As a result, we collect more than 1,000 experimental data entries with information about peptide sequence, experimental conditions and the corresponding self-assembly phases. Utilizing the data, machine learning models are trained and evaluated, demonstrating excellent accuracy (> 80%) and efficiency in assembly phase classification. Moreover, we fine-tune our GPT model for peptide literature mining with the developed dataset, and it exhibits markedly superior performance in extracting information from academic publications relative to the pre-trained model. This workflow can improve efficiency when exploring potential self-assembling peptide candidates by guiding experimental work, while also deepening our understanding of the mechanisms governing peptide self-assembly.
--- phase_data_clean.csv stores 1000+ peptide self-assembly data entries under different experimental conditions.
--- mined_paper_list.csv stores the corresponding papers we used to collect data.
--- trainset.jsonl and testset.jsonl are the data we used for fine-tuning the LLM.
--- fine-tuning.ipynb: code used to fine-tune the ChatGPT model.
--- pretrain.ipynb: code used to test the pretrained ChatGPT model.
--- train_and_inference.ipynb: code to use the mined data to train and test an ML predictor for phase classification.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data on the impact on national parliaments resulting from the Covid-19 pandemic mined from country reports published by the Lex-Atlas: Covid-19 project and the Oxford University Press. For more information see https://lexatlas-c19.org
Air pollution directly affects human health endpoints including growth, respiratory processes, cardiovascular health, fertility, pregnancy outcomes, and cancer. Therefore, the distribution of air pollution is a topic that is relevant to all, and of direct interest to many students. Air quality varies across space and time, often disproportionally affecting minority communities and impoverished neighborhoods. Air pollution is usually higher in locations where pollution sources are concentrated, such as industrial production facilities, highways, and coal-fired power plants. The United States Environmental Protection Agency manages a national air quality-monitoring program to measure and report air-pollutant levels across the United States. These data cover multiple decades and are publicly available via a website interface. For this lesson, students learn how to mine data from this website. They work in pairs to develop their own questions about air quality or air pollution that span spatial and/or temporal scales, and then gather the data needed to answer their question. The students analyze their data and write a scientific paper describing their work. This laboratory experience requires the students to generate their own questions, gather and interpret data, and draw conclusions, allowing for creativity and instilling ownership and motivation for deeper learning gains.
Subscribers can look up the export and import data of 23 countries by HS code or product name. This demo is helpful for market analysis.
Background: Everolimus is an inhibitor of the mammalian target of rapamycin and is used to treat various tumors. The presented study aimed to evaluate Everolimus-associated adverse events (AEs) through data mining of the US Food and Drug Administration Adverse Event Reporting System (FAERS).
Methods: The AE records were selected by searching the FAERS database from the first quarter of 2009 to the first quarter of 2022. Potential adverse event signals were mined using disproportionality analysis, including the reporting odds ratio (ROR), the proportional reporting ratio (PRR), the Bayesian confidence propagation neural network (BCPNN) and the empirical Bayes geometric mean (EBGM), and MedDRA was used to systematically classify the results.
Results: A total of 24,575 AE reports for Everolimus were obtained from the FAERS database, and Everolimus-induced AEs involved 24 system organ classes (SOCs) after conforming to the four algorithms simultaneously. The common significant SOCs identified included benign, malignant and unspecified neoplasms, reproductive system and breast disorders, etc. The significant AEs were then mapped to preferred terms such as stomatitis, pneumonitis and impaired insulin secretion, which are usually reported in patients receiving Everolimus. Of note, unexpected significant AEs not covered in the label, including biliary ischaemia, angiofibroma, and tuberous sclerosis complex, were uncovered.
Conclusion: This study provides novel insights into the monitoring, surveillance, and management of adverse drug reactions associated with Everolimus. The serious adverse events and their corresponding detection signals, as well as the unexpected significant adverse event signals, are worthy of attention in order to improve clinical medication safety during treatment with Everolimus.
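As an illustration of the disproportionality idea, the reporting odds ratio for one drug-event pair can be computed from a 2x2 contingency table (the counts below are made up for the sketch, not figures from this study):

```python
import math

# a: target drug & target AE reports   b: target drug, other AEs
# c: other drugs, target AE            d: other drugs, other AEs
a, b, c, d = 120, 24_455, 3_000, 972_425  # hypothetical counts

ror = (a / b) / (c / d)
se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(ROR)
ci_low = math.exp(math.log(ror) - 1.96 * se_log)
ci_high = math.exp(math.log(ror) + 1.96 * se_log)
print(f"ROR = {ror:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

A signal is commonly flagged when the lower bound of the 95% CI exceeds 1; the PRR, BCPNN and EBGM criteria differ in their statistics but start from the same 2x2 layout.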
This dataset includes locations and associated information about mines and mining activity in the contiguous United States. The database was developed by combining publicly available national datasets of mineral mines, uranium mines, and minor and major coal mine activities. The database was developed in 2013, but the temporal range of the mine data varies depending on the source. Uranium mine information came from the TENORM Uranium Location Database produced by the US Environmental Protection Agency (U.S. EPA) in 2003. Major and minor coal mine information came from the USTRAT (stratigraphic data related to coal) database (2012), and the mineral mine data came from the USGS Mineral Resources Program.
This data release includes GIS datasets supporting the Colorado Legacy Mine Lands Watershed Delineation and Scoring tool (WaDeS), a web mapping application available at https://geonarrative.usgs.gov/colmlwades/. Water chemistry data were compiled from the U.S. Geological Survey (USGS) National Water Information System (NWIS), the U.S. Environmental Protection Agency (EPA) STORET database, and the USGS Central Colorado Assessment Project (CCAP) (Church and others, 2009). The CCAP study area was used for this application. Samples were summarized at each monitoring station, and hardness-dependent chronic and acute toxicity thresholds for aquatic life protections under Colorado Regulation No. 31 (CDPHE, 5 CCR 1002-31) were calculated for cadmium, copper, lead, and/or zinc. Samples were scored according to how metal concentrations compared with the acute and chronic toxicity thresholds. The results were used in combination with remote-sensing-derived hydrothermal alteration (Rockwell and Bonham, 2017) and mine-related features (Horton and San Juan, 2016) to identify potential mine remediation sites within the headwaters of the central Colorado mineral belt. Headwaters were defined by watersheds delineated from a 10-meter digital elevation model (DEM), ranging from 5 to 35 square kilometers in size. Python and R scripts used to derive these products are included with this data release as documentation of the processing steps and to enable users to adapt the methods for their own applications.
References: Church, S.E., San Juan, C.A., Fey, D.L., Schmidt, T.S., Klein, T.L., DeWitt, E.H., Wanty, R.B., Verplanck, P.L., Mitchell, K.A., Adams, M.G., Choate, L.M., Todorov, T.I., Rockwell, B.W., McEachron, Luke, and Anthony, M.W., 2012, Geospatial database for regional environmental assessment of central Colorado: U.S. Geological Survey Data Series 614, 76 p., https://doi.org/10.3133/ds614.
Colorado Department of Public Health and Environment (CDPHE), Water Quality Control Commission 5 CCR 1002-31. Regulation No. 31 The Basic Standards and Methodologies for Surface Water. Effective 12/31/2021, accessed on July 28, 2023 at https://cdphe.colorado.gov/water-quality-control-commission-regulations. Horton, J.D., and San Juan, C.A., 2022, Prospect- and mine-related features from U.S. Geological Survey 7.5- and 15-minute topographic quadrangle maps of the United States (ver. 8.0, September 2022): U.S. Geological Survey data release, https://doi.org/10.5066/F78W3CHG. Rockwell, B.W. and Bonham, L.C., 2017, Digital maps of hydrothermal alteration type, key mineral groups, and green vegetation of the western United States derived from automated analysis of ASTER satellite data: U.S. Geological Survey data release, https://doi.org/10.5066/F7CR5RK7.
https://fred.stlouisfed.org/legal/#copyright-public-domain
Graph and download economic data for Gross Domestic Product: Mining, Quarrying, and Oil and Gas Extraction (21) in Oklahoma (OKMINNGSP) from 1997 to 2024 about OK, mining, GSP, private industries, private, industry, GDP, and USA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data on the use of emergency powers used to handle the Covid-19 pandemic mined from country reports published by the Lex-Atlas: Covid-19 project and the Oxford University Press. For more information see https://lexatlas-c19.org
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This work developed image dataset of underground longwall mining face (DsLMF+), which consists of 138004 images with annotation 6 categories of mine personnel, hydraulic support guard plate, large coal, towline, miners’ behaviour and mine safety helmet. All the labels of dataset are publicly available in YOLO format and COCO format.The dataset aims to support further research and advancement of the intelligent identification and classification of abnormal conditions for underground mining.
Global trade data of Mine under HS code 3811210010, covering trade data of Mine from 80+ countries.