Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScDC Word-Category RIG MatrixApril 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny MirkesGetting StartedThis file describes the Word-Category RIG Matrix for theLeicester Scientific Corpus (LSC) [1], the procedure to build the matrix and introduces the Leicester Scientific Thesaurus (LScT) with the construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category,word). Its value for the pair shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of Word-Category RIG Matrix in the published archive is presented with two additional columns of the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.This matrix is created to be used in future research on quantifying of meaning in scientific texts under the assumption that words have scientifically specific meanings in subject categories and the meaning can be estimated by information gains from word to categories. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We consider ordering the words of LScDC by the sum of their RIGs in categories. That is, words are arranged in their informativeness in the scientific corpus LSC. Therefore, meaningfulness of words evaluated by words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus. Words as a Vector of Frequencies in WoS CategoriesEach word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category.It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus.The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word,category). This table is build for the LScDC with 252 WoS categories and presented in published archive with this file. The value of each entry in the table shows how many times a word of LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category. Words as a Vector of Relative Information Gains Extracted for CategoriesIn this section, we introduce our approach to representation of a word as a vector of relative information gains for categories under the assumption that meaning of a word can be quantified by their information gained for categories.For each category, a function is defined on texts that takes the value 1, if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For the Boolean random variables, the joint probability distribution, the entropy and information gains are defined.The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG) providing a normalised measure of the Information Gain. This provides the ability of comparing information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the archive published. Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains. It is obvious that the dimension of vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories. We note that in the matrix, a column vector represents RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words in each category, words can be ordered by two criteria: sum and maximum of RIGs in categories. The top n words in this list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.RIGs for each word of LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.Leicester Scientific Thesaurus (LScT)Leicester Scientific Thesaurus (LScT) is a list of 5,000 words form the LScDC [2]. Words of LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, meaningfulness of words evaluated by words’ average informativeness in the categories and the list of these words are considered as a ‘thesaurus’ for science. The LScT with value of sum can be found as CSV file with the published archive. Published archive contains following files:1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of LScDC. Each entry in the first 252 columns is RIG from the word to the category. Words are ordered as in the LScDC.2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.3) LScT.csv: List of words of LScT with sum (S) values. 4) Text_No_in_Cat.csv: The number of texts in categories. 5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.6) README.txt: Description of Word-Category RIG Matrix, Word-Category Frequency Matrix and LScT and forming procedures.7) README.pdf (same as 6 in PDF format)References[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858. [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell system technical journal, 27(3), 379-423.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.
Transformation. In this stage, we omitted to use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships on the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR od DL4SE”. 5.Interpretation/Evaluation. We used the Knowledge Discover to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.
Support = Number of occurrences this statement is true divided by the amount of statements Confidence = The support of the statement divided by the number of occurrences of the premise
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns of the data using certain kind of techiques such as classification or regression. It is not always feasible to apply classification algorithms directly to dataset. Before doing any work on the data, the data has to be pre-processed and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: It is different from Principle Component Analysis which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique of reducing the data dimension will lose a lot of information since clustering techniques are based a metric of 'distance'. At high dimensions euclidean distance loses pretty much all meaning. Therefore using clustering as a "Reducing" dimensionality by mapping data points to cluster numbers is not always good since you may lose almost all the information. From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, it brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, then affect the performance of classification. If the part of features we use clustering techniques on is very suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state in the effort to see if they were stable. Our assumption was that if the results vary highly from run to run which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model real world effectiveness and also to continue to revise the models from time to time as things change.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScD (Leicester Scientific Dictionary)April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk/suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes[Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus) - Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in description of Version 2 below. We did not repeat the explanation. After pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are also same as described as for LScD Version 2 below.* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2[Version 2] Getting StartedThis document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can be also used for list of texts from other sources, amendments to the code may be required.LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.LScD is an ordered list of words from texts of abstracts in LSC.The dictionary stores 974,238 unique words, is sorted by the number of documents containing the word in descending order. All words in the LScD are in stemmed form of words. The LScD contains the following information:1.Unique words in abstracts2.Number of documents containing each word3.Number of appearance of a word in the entire corpusProcessing the LSCStep 1.Downloading the LSC Online: Use of the LSC is subject to acceptance of request of the link by email. To access the LSC for research purposes, please email to ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.Step 2.Importing the Corpus to R: The full R code for processing the corpus can be found in the GitHub [2].All following steps can be applied for arbitrary list of texts from any source with changes of parameter. The structure of the corpus such as file format and names (also the position) of fields should be taken into account to apply our code. The organisation of CSV files of LSC is described in README file for LSC [1].Step 3.Extracting Abstracts and Saving Metadata: Metadata that include all fields in a document excluding abstracts and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.Step 4.Text Pre-processing Steps on the Collection of Abstracts: In this section, we presented our approaches to pre-process abstracts of the LSC.1.Removing punctuations and special characters: This is the process of substitution of all non-alphanumeric characters by space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose the actual meaning of such words. A processing of uniting prefixes with words are performed in later steps of pre-processing.2.Lowercasing the text data: Lowercasing is performed to avoid considering same words like “Corpus”, “corpus” and “CORPUS” differently. Entire collection of texts are converted to lowercase.3.Uniting prefixes of words: Words containing prefixes joined with character “-” are united as a word. The list of prefixes united for this research are listed in the file “list_of_prefixes.csv”. The most of prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.4.Substitution of words: Some of words joined with “-” in the abstracts of the LSC require an additional process of substitution to avoid losing the meaning of the word before removing the character “-”. Some examples of such words are “z-test”, “well-known” and “chi-square”. These words have been substituted to “ztest”, “wellknown” and “chisquare”. Identification of such words is done by sampling of abstracts form LSC. The full list of such words and decision taken for substitution are presented in the file “list_of_substitution.csv”.5.Removing the character “-”: All remaining character “-” are replaced by space.6.Removing numbers: All digits which are not included in a word are replaced by space. All words that contain digits and letters are kept because alphanumeric characters such as chemical formula might be important for our analysis. Some examples are “co2”, “h2o” and “21st”.7.Stemming: Stemming is the process of converting inflected words into their word stem. This step results in uniting several forms of words with similar meaning into one form and also saving memory space and time [5]. All words in the LScD are stemmed to their word stem.8.Stop words removal: Stop words are words that are extreme common but provide little value in a language. Some common stop words in English are ‘I’, ‘the’, ‘a’ etc. We used ‘tm’ package in R to remove stop words [6]. There are 174 English stop words listed in the package.Step 5.Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written in the file “LScD.csv”.The Organisation of the LScDThe total number of words in the file “LScD.csv” is 974,238. Each field is described below:Word: It contains unique words from the corpus. All words are in lowercase and their stem forms. The field is sorted by the number of documents that contain words in descending order.Number of Documents Containing the Word: In this content, binary calculation is used: if a word exists in an abstract then there is a count of 1. If the word exits more than once in a document, the count is still 1. Total number of document containing the word is counted as the sum of 1s in the entire corpus.Number of Appearance in Corpus: It contains how many times a word occurs in the corpus when the corpus is considered as one large document.Instructions for R CodeLScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as RData file and in CSV format. Outputs of the code are:Metadata File: It includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.File of Abstracts: It contains all abstracts after pre-processing steps defined in the step 4.DTM: It is the Document Term Matrix constructed from the LSC[6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.LScD: An ordered list of words from LSC as defined in the previous section.The code can be used by:1.Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’2.Open LScD_Creation.R script3.Change parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files4.Run the full code.References[1]N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1[2]N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION[3]Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4]A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.[5]C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved porter’s stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.[6]I. Feinerer, "Introduction to the tm Package Text Mining in R," Accessible en ligne: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of the running time(in ms) of the three algorithms.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Iris data aggregation class effect.
Facebook
TwitterMotivation
Maus et al created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is imperative for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet, this dataset is temporally static. To tackle this flaw, we mined the Landsat archive to infer the first observable year of mining.
Approach
For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.
After completing the extraction, we estimate mean spectral profiles for each acquisition date, once for the samples “inside” the mining area, and another for those “outside” of it. In this process, we masked pixels afflicted by clouds and cloud shadows using Landsat's quality information.
Using the time-series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step aimed at emphasizing differences in the shape of the spectral profiles rather than on specific values, which can be related to radiometric innacuracies, or simply to differences in acquisition dates. This resulted in an RMSE time-series for each mining site.
We then used these data to infer the first mining year. To achieve this, we first derived a cumulative sum of the RMSE time-series with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, it would drive an increase in the RMSE due to the removal of trees, whereas the outskirts of the mine would remain forested. In this example, the accumulated values would tilt upwards. However, if a mining exploration was accompanied by the removal of vegetation along its outskirts where bare land was common, a downwards shift is RMSE values is more likely as the landscape becomes more homogenization.
To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of these metrics was the most adequate, we use the Area Under the Curve (AUC). An elbow is characterized by a convex shape of a time-series which makes the AUC greater than 50%. However, if the shape of the curve is concave, the knee is the most adequate metric. We limited the detection of shifts to time-series with at least 100 time steps. When below this threshold, we assumed the mine (or the the conditions to sustain it) were present since 1990.
Content
This repository contains the infrastructure used to infer the start of a mining operation, which is organized as following:
00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.
01_analysis - Contains several outputs of our analysis:
xy.tar.gz - Sample locations for each mining site.
sr.tar.gz - Spectral profiles for each sample location.
mine_start.csv - First year when we detected the start of mining.
02_code - Includes all code used in our analysis.
requirements.txt - Python module requirements that can be fed to pip to replicate our study.
config.yml - Configuration file, including information on the Landsat products used.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Descriptions of the datasets.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Malaysia Hours Worked: Mean: Mining & Quarrying data was reported at 48.000 Hour in 2017. This records a decrease from the previous number of 48.900 Hour for 2016. Malaysia Hours Worked: Mean: Mining & Quarrying data is updated yearly, averaging 49.650 Hour from Dec 2010 (Median) to 2017, with 8 observations. The data reached an all-time high of 50.400 Hour in 2012 and a record low of 48.000 Hour in 2017. Malaysia Hours Worked: Mean: Mining & Quarrying data remains active status in CEIC and is reported by Department of Statistics. The data is categorized under Global Database’s Malaysia – Table MY.G050: Labour Force Survey: Hours Worked: By Sex & Industry.
Facebook
TwitterAttribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.
The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:
We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.
We removed the reference sections of the articles, as well headings (in all caps) such as “ABSTRACT”.
We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example converting “per cent” to “percent”)
We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.
We removed all stop words, which are words without any semantic meaning on their own—“the”, “in,” “if”, “and”, “but”, etc.—and all single-letter words.
We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).
We detected and create bi-grams, sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore: for example, “problem_solving” and “high_school”.
After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
In addition to this file, we have also included the following files:
SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data
Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.
Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.
This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
Facebook
TwitterThis digital publication, GPR 2014-4, contains CGG's analysis and interpretation of data produced from airborne geophysical surveys published by DGGS in 2013 (GPR 2013-1) and 2014 (GPR 2014-5) for the Middle Styx and Farewell survey areas respectively. The two survey blocks Farewell and Middle Styx are referred to as the 'Farewell' survey in the project report as these blocks were flown under one contract. Elsewhere the separate block names, Farewell and Middle Styx, are typically used. CGG's frequency-domain DIGHEM V system was used for the EM data. GPR 2014-4 includes (1) CGG's project report with interpretation and detailed EM Anomalies, and (2) Multi-channel stacked profiles in pdf format. Interpretation maps and EM anomalies are provided in various formats (e.g., Geotiffs , KMZs, and others) and are listed in detail in gpr2014_004_readme (.txt and .pdf). Other supporting files include gpr2014_004_browsegraphic.pdf, farewell_middlestyx_emanomaly_readme.pdf, farewell_middlestyx_smprofiles_readme.pdf, and many legends as jpgs.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Taiwan FIES: DI: Mean: Mining & Quarrying and Manufacturing data was reported at 581,255.000 NTD in 2016. This records an increase from the previous number of 557,858.000 NTD for 2015. Taiwan FIES: DI: Mean: Mining & Quarrying and Manufacturing data is updated yearly, averaging 522,298.000 NTD from Dec 2007 (Median) to 2016, with 10 observations. The data reached an all-time high of 581,255.000 NTD in 2016 and a record low of 507,101.000 NTD in 2010. Taiwan FIES: DI: Mean: Mining & Quarrying and Manufacturing data remains active status in CEIC and is reported by Directorate-General of Budget, Accounting and Statistics, Executive Yuan. The data is categorized under Global Database’s Taiwan – Table TW.H019: Family Income and Expenditure Survey: Directorate General of Budget, Accounting and Statistics.
Facebook
Twitterhttps://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
According to our latest research, the global Subsurface Data Visualization market size reached USD 2.84 billion in 2024, and is expected to grow at a robust CAGR of 13.2% from 2025 to 2033. By the end of the forecast period, the market is projected to achieve a value of USD 8.38 billion. This impressive growth is primarily driven by the increasing demand for advanced visualization technologies in sectors such as oil & gas, mining, and environmental sciences, where accurate interpretation of subsurface data is crucial for operational efficiency and risk mitigation. As per our latest research, technological advancements, coupled with the rising adoption of cloud-based solutions and immersive visualization platforms, are further propelling the market forward.
A significant growth driver for the Subsurface Data Visualization market is the escalating complexity of subsurface data generated by modern exploration and monitoring technologies. With the proliferation of sensors and high-resolution imaging tools, industries like oil & gas and mining are now producing vast volumes of multidimensional data that require sophisticated visualization solutions for effective analysis. The ability to transform raw data into actionable insights through intuitive 2D and 3D models has become indispensable, enabling organizations to make informed decisions, optimize resource allocation, and minimize operational risks. This trend is further accentuated by the integration of artificial intelligence and machine learning algorithms, which enhance the analytical capabilities of visualization platforms, making them more adaptive and predictive.
Another key factor fueling the growth of the Subsurface Data Visualization market is the rapid adoption of cloud-based deployment models. Cloud solutions offer unparalleled scalability, flexibility, and cost-efficiency, allowing organizations to access advanced visualization tools without the need for significant upfront investments in hardware or infrastructure. This has democratized access to powerful analytics and visualization capabilities, particularly for small and medium enterprises (SMEs) and research institutions that previously faced budgetary constraints. In addition, the cloud facilitates seamless collaboration among geographically dispersed teams, accelerating project timelines and fostering innovation in subsurface data interpretation.
The emergence of immersive technologies such as virtual reality (VR) and augmented reality (AR) is also reshaping the Subsurface Data Visualization market. These cutting-edge visualization types enable users to interact with subsurface models in a highly intuitive and immersive manner, enhancing understanding and communication among stakeholders. For example, VR and AR solutions are increasingly being used in training, simulation, and remote operations, reducing the need for physical presence in hazardous environments. This not only improves safety but also enhances the efficiency of exploration, drilling, and monitoring activities across various sectors, thereby driving market growth.
Regionally, North America continues to dominate the Subsurface Data Visualization market due to its strong presence of leading technology providers, high investments in research and development, and advanced infrastructure in industries such as oil & gas and environmental science. However, the Asia Pacific region is witnessing the fastest growth, driven by increasing exploration activities, rapid industrialization, and government initiatives aimed at sustainable resource management. Europe also holds a significant share, supported by stringent environmental regulations and the adoption of innovative technologies in mining and construction. The Middle East & Africa and Latin America are emerging as promising markets, fueled by expanding energy and mining sectors and growing awareness of the benefits of advanced data visualization.
The Subsurface Data Visualization market is segmented by component into software, hardware, and services. The software segment holds the largest share, accounting for over 50% of the total market revenue in 2024. This dominance is attributed to the continuous innovation in visualization algorithms, user interfaces, and integration capabilities with other data management and analytics platforms. Modern software solutions offer comprehensive toolsets for data p
Facebook
TwitterMulti-source data - "ShanShuiLinTian HuCao" system expert knowledge data sets including source of qilian mountain areas, science and technology literature, such as database mapping knowledge base. 1, the qilian mountain SDGs and big data and social network science and technology literature data mapping knowledge base and the basin geographical relationship knowledge base: qilian mountain SDGs and big data and social network science and technology literature integration model and other models in the watershed are data mapping knowledge base including output variables according to the social networks, science and technology literature data and other subject classification in both English and Chinese.Basin geographical relation database including qinghai lake basin, qaidam basin, shule river basin, the chase - HuangShui river, heihe basin, shiyang river basin, state, county, township, village/community such as geographic database name. 2, qilian mountain science and technology literature data, including 1990-2022 and qilian mountain watershed related literature of science and technology of the sustainable development of the main information and the corresponding target (SDGs), two data tables contain data in Chinese and English data, Chinese data table contains 15 fields, English data table contains 19 fields, details see the file "data. TXT".2) data sources and processing methods: based on keywords qilian mountain basin respectively on hownet and WoS cnki database retrieval in China in both English and Chinese literature, through natural language processing, such as clean process data, using machine learning method to build data and the mapping relationship between SDGs.Application results and prospects: 3) data mining from the qilian mountain basin had the information, to support the scientific research on a particular issue of emotional judgment. 3, qilian mountain social network data: weibo sign in qilian mountain region in 2017-2020 data, using the BERT deep learning training model for text categorization, classification standard for SDG index of artificial judgment.Data field meaning is: the date for weibo check-in, mid for weibo id, userid as user id, SDGS for weibo after deep learning discriminant of SDG corresponding indexes, sa for micro blog this analysis results corresponding emotions.More detailed description of the source database documentation may refer to dataset.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is transformed into Matlab format. They are designed to be in cell formats. Each cell is a matrix which consists of a column representing the gene and row for the subject.Each dataset is organized in a separate directory. The directory contains four versions: a) Original dataset, b) Imputed dataset by MEAN,c) Imputed dataset by MEDIAN,d) Imputed dataset by Most Frequent,
Facebook
TwitterThis dataset contains information about posts made on Famous Cosmetic Brand's Facebook page from 1st of January to 31th of December of 2014. Each row represents a single post and includes the following attributes:
Citation: (Moro et al., 2016) S. Moro, P. Rita and B. Vala. Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, Elsevier, In press. Available at: http://dx.doi.org/10.1016/j.jbusres.2016.02.010
Facebook
Twitterhttps://www.wiseguyreports.com/pages/privacy-policyhttps://www.wiseguyreports.com/pages/privacy-policy
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 2.72(USD Billion) |
| MARKET SIZE 2025 | 3.06(USD Billion) |
| MARKET SIZE 2035 | 10.0(USD Billion) |
| SEGMENTS COVERED | Application, Deployment Type, Technology, End Use, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | Growing demand for data analytics, Increase in research funding, Advancements in machine learning, Rising focus on personalized medicine, Need for improved patient outcomes |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | MModal, Amazon, Linguamatics, Health Fidelity, Watson Health, CureMetrix, Google, Microsoft, Nuance Communications, Dolbey, Qventus, Clinical Architecture, Verily Life Sciences, IBM, Oracle, GNS Healthcare |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | Increased demand for patient data analysis, Automation of clinical trials processes, Enhanced drug discovery through text mining, Personalized medicine via genomic data interpretation, Integration with AI for predictive analytics |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 12.6% (2025 - 2035) |
Facebook
Twitterhttps://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
The global geology and seismic software market is experiencing robust growth, driven by the increasing demand for efficient and accurate data analysis in the energy, mining, and environmental sectors. The market's expansion is fueled by several key factors, including the rising adoption of cloud-based solutions offering enhanced accessibility and scalability, the proliferation of advanced data analytics techniques for interpreting complex geological and seismic data, and the growing need for precise subsurface imaging for exploration and resource management. Furthermore, government initiatives promoting sustainable resource exploration and environmental monitoring are creating new opportunities for market growth. The market is segmented by application (large enterprises and SMEs) and type (cloud-based and web-based), with cloud-based solutions gaining significant traction due to their flexibility and cost-effectiveness. While the market exhibits substantial growth potential, challenges such as high initial investment costs for software and hardware, the need for specialized expertise to operate the software, and the complexities associated with data integration from diverse sources could potentially impede growth. The competitive landscape is characterized by a mix of established players and emerging innovative companies, each offering specialized solutions catering to different market segments. The North American market currently holds a significant market share, primarily driven by the robust presence of major energy companies and a strong technological infrastructure. However, regions like Asia-Pacific are poised for significant growth in the coming years due to rising infrastructure investment and growing exploration activities. This robust growth is expected to continue throughout the forecast period (2025-2033), fueled by advancements in artificial intelligence and machine learning that are revolutionizing data interpretation processes. Increased adoption of 3D and 4D seismic technologies promises improved accuracy and detailed subsurface imaging. The integration of geological and seismic data with other geoscience datasets, such as geochemistry and remote sensing, will further enhance the value proposition of these software solutions. While the high cost of entry may remain a barrier for some smaller businesses, the emergence of more affordable and user-friendly software options is expected to broaden market penetration. Furthermore, strategic partnerships and mergers and acquisitions within the industry are anticipated to shape the competitive landscape and drive innovation, ultimately contributing to the overall market expansion.
Facebook
Twitter
According to our latest research, the global subsurface modeling AI market size reached USD 1.86 billion in 2024. The market is expected to grow at a robust CAGR of 19.3% from 2025 to 2033, projecting a value of USD 8.63 billion by 2033. This significant expansion is driven by the increasing adoption of artificial intelligence technologies to optimize subsurface data interpretation, enhance resource exploration efficiency, and reduce operational risks across sectors such as oil & gas, mining, environmental assessment, and civil engineering.
The primary growth driver for the subsurface modeling AI market is the escalating demand for advanced data analytics and modeling tools in the oil & gas and mining sectors. Traditional subsurface exploration methods are often time-consuming and prone to human error, leading to inefficiencies and increased costs. AI-powered subsurface modeling solutions enable organizations to rapidly process and analyze vast datasets, identify patterns, and generate high-precision models of subsurface structures. This not only accelerates exploration timelines but also improves decision-making accuracy, ultimately resulting in higher resource recovery rates and lower environmental impact. The integration of machine learning algorithms with geophysical and geological data is transforming how companies approach subsurface challenges, making AI an indispensable asset in resource-intensive industries.
Another significant factor contributing to the market's growth is the rising emphasis on sustainability and environmental stewardship. Regulatory pressures and public scrutiny are compelling companies to adopt technologies that minimize ecological disruption during exploration and development activities. Subsurface modeling AI enables more precise targeting of drilling and excavation sites, reducing unnecessary land disturbance and resource wastage. Environmental agencies and construction firms are increasingly leveraging AI-driven models for groundwater assessment, contamination prediction, and infrastructure planning, further expanding the addressable market. The ability of AI to synthesize multi-source data, such as seismic, geological, and environmental datasets, provides a holistic view of subsurface conditions, which is critical for sustainable project execution.
The proliferation of cloud computing and advancements in high-performance hardware have also played a pivotal role in accelerating the adoption of subsurface modeling AI solutions. Cloud-based platforms offer scalable computing resources that can handle the intensive processing demands of AI-driven modeling applications. This democratizes access to cutting-edge technology for small and medium enterprises (SMEs), which traditionally faced barriers due to high infrastructure costs. The growing ecosystem of AI software vendors, service providers, and hardware manufacturers is fostering innovation and competition, leading to more affordable and user-friendly solutions. As organizations across diverse sectors recognize the value proposition of AI-enabled subsurface modeling, the market is poised for sustained growth throughout the forecast period.
Machine Learning for Seismic Interpretation is becoming a transformative force in the subsurface modeling AI market. By leveraging machine learning algorithms, companies can enhance the interpretation of seismic data, leading to more accurate subsurface models. This technology allows for the rapid analysis of seismic waves, identifying subtle patterns and anomalies that traditional methods might overlook. As a result, geoscientists can achieve a deeper understanding of subsurface structures, improving the precision of resource exploration and extraction. The integration of machine learning with seismic interpretation not only accelerates the data processing timelines but also enhances the predictive accuracy of geological models, thereby reducing exploration risks and costs. As the industry continues to evolve, the role of machine learning in seismic interpretation is expected to expand, offering new opportunities for innovation and efficiency in subsurface exploration.
From a regional perspective, North America currently leads the subsurface modeling AI market, accounting for the largest share in 2024, driven by substantial investmen
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 25,000 synthetic healthcare records designed for machine learning models that classify diseases based on patient symptoms. It includes demographic attributes, symptom lists, and confirmed diagnoses across 30 common acute, chronic, infectious, and neurological diseases.
The dataset is well-suited for:
Multi-class disease classification Symptom pattern analysis Medical decision support modeling NLP feature extraction on symptom text Data mining and biomedical research
Each record corresponds to a unique patient with a generated combination of symptoms and diagnosis created from realistic patterns while maintaining anonymity.
This dataset is purely synthetic, meaning no real patient data is used.
📌 Column Descriptions
Patient_ID — A randomized unique identifier assigned to each synthetic patient. Age — Age of the patient (ranging from 1 to 90 years). Gender — Gender of the patient (Male, Female, or Other). Symptoms — A comma-separated list containing 3 to 7 symptoms. Symptom_Count — Total number of symptoms listed for the patient. Disease — The diagnosed condition; one of the 30 diseases included in the dataset.
🦠 List of Diseases Included
Common Cold, Influenza, COVID-19, Pneumonia, Tuberculosis, Diabetes, Hypertension, Asthma, Heart Disease, Chronic Kidney Disease, Gastritis, Food Poisoning, Irritable Bowel Syndrome (IBS), Liver Disease, Ulcer, Migraine, Epilepsy, Stroke, Dementia, Parkinson’s Disease, Allergy, Arthritis, Anemia, Thyroid Disorder, Obesity, Depression, Anxiety, Dermatitis, Sinusitis, Bronchitis.
🎯 Possible Use Cases
Multi-class disease prediction Symptom pattern analysis Clinical decision support prototypes NLP-based text classification Educational and academic projects
🔐 License
This dataset is released under CC0 Public Domain, meaning it is free to use, modify, and share without restrictions.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScDC Word-Category RIG MatrixApril 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)Supervised by Prof Alexander Gorban and Dr Evgeny MirkesGetting StartedThis file describes the Word-Category RIG Matrix for theLeicester Scientific Corpus (LSC) [1], the procedure to build the matrix and introduces the Leicester Scientific Thesaurus (LScT) with the construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category,word). Its value for the pair shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of Word-Category RIG Matrix in the published archive is presented with two additional columns of the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.This matrix is created to be used in future research on quantifying of meaning in scientific texts under the assumption that words have scientifically specific meanings in subject categories and the meaning can be estimated by information gains from word to categories. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We consider ordering the words of LScDC by the sum of their RIGs in categories. That is, words are arranged in their informativeness in the scientific corpus LSC. Therefore, meaningfulness of words evaluated by words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus. Words as a Vector of Frequencies in WoS CategoriesEach word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category.It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus.The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word,category). This table is build for the LScDC with 252 WoS categories and presented in published archive with this file. The value of each entry in the table shows how many times a word of LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category. Words as a Vector of Relative Information Gains Extracted for CategoriesIn this section, we introduce our approach to representation of a word as a vector of relative information gains for categories under the assumption that meaning of a word can be quantified by their information gained for categories.For each category, a function is defined on texts that takes the value 1, if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For the Boolean random variables, the joint probability distribution, the entropy and information gains are defined.The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG) providing a normalised measure of the Information Gain. This provides the ability of comparing information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the archive published. Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains. It is obvious that the dimension of vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories. We note that in the matrix, a column vector represents RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words in each category, words can be ordered by two criteria: sum and maximum of RIGs in categories. The top n words in this list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.RIGs for each word of LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.Leicester Scientific Thesaurus (LScT)Leicester Scientific Thesaurus (LScT) is a list of 5,000 words form the LScDC [2]. Words of LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, meaningfulness of words evaluated by words’ average informativeness in the categories and the list of these words are considered as a ‘thesaurus’ for science. The LScT with value of sum can be found as CSV file with the published archive. Published archive contains following files:1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of LScDC. Each entry in the first 252 columns is RIG from the word to the category. Words are ordered as in the LScDC.2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.3) LScT.csv: List of words of LScT with sum (S) values. 4) Text_No_in_Cat.csv: The number of texts in categories. 5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.6) README.txt: Description of Word-Category RIG Matrix, Word-Category Frequency Matrix and LScT and forming procedures.7) README.pdf (same as 6 in PDF format)References[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html [5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858. [6] Shannon, C. E. (1948). A mathematical theory of communication. Bell system technical journal, 27(3), 379-423.