47 datasets found

LSC (Leicester Scientific Corpus)
figshare.le.ac.uk
Updated Apr 15, 2020
Cite
Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v1
Unique identifier
https://doi.org/10.25392/leicester.data.9449639.v1
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
The LSC (Leicester Scientific Corpus)
August 2019, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

The data is extracted from the Web of Science® [1]. You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.

Getting Started

This text provides background information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for use in future work on the quantification of the sense of research texts. One of the goals of publishing the data is to make it available for further analysis and for use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected online in July 2018 and contains the number of citations from the publication date to July 2018.

Each document in the corpus contains the following parts:
1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of _Categories.txt’.
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
6. Total Times cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,824.

All documents in LSC have a nonempty abstract, title, categories, research areas and times cited in WoS databases. There are 119 documents with an empty authors list; we did not exclude these documents.

Data Processing

This section describes the steps taken to collect the LSC, clean it and make it available to researchers. Processing the data consists of six main steps:

Step 1: Downloading of the Data Online
This is the step of collecting the dataset online. This is done manually by exporting documents as tab-delimited files. All downloaded documents are available online.

Step 2: Importing the Dataset to R
This is the process of converting the collection to RData format for processing the data. The LSC was collected as TXT files. All documents are extracted to R.

Step 3: Cleaning the Data from Documents with Empty Abstract or without Category
Not all papers in the collection have an abstract and categories. As our research is based on the analysis of abstracts and categories, inaccurate documents were detected and removed in a preliminary step. All documents with empty abstracts and documents without categories are removed.

Step 4: Identification and Correction of Concatenated Words in Abstracts
Traditionally, abstracts are written in the format of an executive summary, with one paragraph of continuous writing, which is known as an ‘unstructured abstract’. However, medicine-related publications in particular use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc.

The tool used for extracting abstracts concatenates section headings with the first word of the section. As a result, some of the structured abstracts in the LSC require an additional correction step to split such concatenated words. For instance, we observe words such as ConclusionHigher and ConclusionsRT in the corpus. The detection and identification of concatenated words cannot be totally automated. Human intervention is needed to identify possible section headings. We note that we only consider concatenated words in section headings, as it is not possible to detect all concatenated words without deep knowledge of research areas. Identification of such words is done by sampling medicine-related publications. The section headings found in such abstracts are listed in List 1.

List 1: Headings of sections identified in structured abstracts
Background       Method(s)         Design
Theoretical      Measurement(s)    Location
Aim(s)           Methodology       Process
Abstract         Population        Approach
Objective(s)     Purpose(s)        Subject(s)
Introduction     Implication(s)    Patient(s)
Procedure(s)     Hypothesis        Measure(s)
Setting(s)       Limitation(s)     Discussion
Conclusion(s)    Result(s)         Finding(s)
Material(s)      Rationale(s)      Implications for health and nursing policy

All words that include a heading from List 1 are detected in the entire corpus, and each such word is then split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.

Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
After the correction of concatenated words is completed, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as the Microsoft Word ‘word count’ [5].

According to the APA style manual [6], an abstract should contain between 150 and 250 words. However, word limits vary from journal to journal. For instance, the Journal of Vascular Surgery recommends that ‘Clinical and basic research studies must include a structured abstract of 400 words or less’ [7].

In LSC, the length of abstracts varies from 1 to 3805 words. We decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis. Documents containing fewer than 30 or more than 500 words in their abstracts are removed.

Step 6: Saving the Dataset into CSV Format
Corrected and extracted documents are saved into 36 CSV files. The structure of the files is described in the following section.

The Structure of Fields in CSV Files

In the CSV files, the information is organised with one record per line; the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields.

To access the LSC for research purposes, please email ns433@le.ac.uk.

References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.
[7] P. Gloviczki and P. F. Lawrence, "Information for authors," Journal of Vascular Surgery, vol. 65, no. 1, pp. A16-A22, 2017.
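Steps 4 and 5 of the LSC data processing described above can be sketched in a few lines of Python. This is only an illustration, not the author's R code. The heading list is a small subset of List 1, the function names are hypothetical, and whitespace tokenisation is assumed as a stand-in for the Microsoft Word word-count rule.

```python
import re

# Illustrative subset of the List 1 headings (assumption: not the full list).
HEADINGS = ["Background", "Objective", "Objectives", "Method", "Methods",
            "Result", "Results", "Conclusion", "Conclusions"]

# Longer headings are tried first so that "Conclusions" wins over "Conclusion".
_HEADING_RE = re.compile(
    r"\b(" + "|".join(sorted(HEADINGS, key=len, reverse=True)) + r")(?=[A-Z])"
)

def split_heading_words(text):
    """Insert a space between a known heading and the capitalised word fused to it."""
    return _HEADING_RE.sub(r"\1 ", text)

def word_count(text):
    """Approximate word count: number of whitespace-separated tokens."""
    return len(text.split())

def keep_abstract(text, min_words=30, max_words=500):
    """Apply the 30 to 500 word limit from Step 5."""
    return min_words <= word_count(text) <= max_words

print(split_heading_words("ConclusionHigher doses were tolerated."))
```

A heading that is already followed by a space or a lowercase letter is left untouched, so ordinary sentences such as "Results show ..." are not altered.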
Results
researchdata.edu.au
bridges.monash.edu
Updated May 5, 2022
Cite
Chang Wei Tan (2022). Results [Dataset]. http://doi.org/10.26180/5c30a56c0bda8
Unique identifier
https://doi.org/10.26180/5c30a56c0bda8
Dataset updated
May 5, 2022
Dataset provided by
Monash University
Authors
Chang Wei Tan
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These are the results for the FastEE paper.
Semantic Knowledge Discovery Software Report
datainsightsmarket.com
doc, pdf, ppt
Updated May 29, 2025
Cite
Data Insights Market (2025). Semantic Knowledge Discovery Software Report [Dataset]. https://www.datainsightsmarket.com/reports/semantic-knowledge-discovery-software-1949491
Available download formats: pdf, ppt, doc
Dataset updated
May 29, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Semantic Knowledge Discovery Software market is experiencing robust growth, driven by the increasing need for organizations to extract actionable insights from complex and unstructured data. The market, estimated at $2 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching an estimated $6 billion by 2033. This growth is fueled by several key factors. The rising adoption of artificial intelligence (AI) and machine learning (ML) technologies across various industries is enabling more sophisticated semantic analysis, leading to improved decision-making. Furthermore, the proliferation of big data, coupled with the limitations of traditional data analysis methods, is driving the demand for solutions that can effectively uncover hidden patterns and relationships within vast datasets. The growing emphasis on data-driven decision-making across sectors like healthcare, finance, and research and development is also contributing significantly to market expansion. Major restraints to market growth include the high initial investment costs associated with implementing semantic knowledge discovery software, the complexity of integrating these solutions with existing IT infrastructure, and the scarcity of skilled professionals capable of managing and interpreting the results generated by these systems. However, these challenges are being addressed through the development of more user-friendly software, cloud-based deployment models that reduce upfront costs, and increased training and education programs focused on semantic technology. The market is segmented by deployment mode (cloud, on-premise), industry (healthcare, finance, manufacturing, etc.), and functionality (data integration, knowledge graph construction, semantic search). 
Key players like Expert System SpA, ChemAxon, Collexis (Elsevier), MAANA, OntoText, Cambridge Semantics, and Nervana (Intel) are actively shaping the market landscape through innovation and strategic partnerships. The North American market currently holds a significant share, but regions like Asia-Pacific are expected to witness rapid growth in the coming years.
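The growth figures quoted in this description can be checked with a short compound-growth calculation. The sketch below assumes the 15% rate compounds annually over the eight years from 2025 to 2033, starting from the $2 billion estimate; the function name is hypothetical.

```python
def project_value(start_value, annual_rate, years):
    """Compound a starting value at a fixed annual growth rate."""
    return start_value * (1 + annual_rate) ** years

# $2 billion in 2025, 15% CAGR, eight annual periods up to 2033 (assumed).
projected_2033 = project_value(2.0, 0.15, 2033 - 2025)
print(f"Projected 2033 market size: ${projected_2033:.2f} billion")  # about 6.12
```

Under these assumptions the result is roughly $6.1 billion, which agrees with the rounded $6 billion figure in the description.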
Triple random ensemble method for multi-label classification
dro.deakin.edu.au
researchdata.edu.au
pdf
Updated Sep 22, 2024
Cite
G Nasierding; G Tsoumakas; Abbas Kouzani (2024). Triple random ensemble method for multi-label classification [Dataset]. https://dro.deakin.edu.au/articles/dataset/Triple_random_ensemble_method_for_multi-label_classification/21031912
Available download formats: pdf
Dataset updated
Sep 22, 2024
Dataset provided by
Deakin University, http://www.deakin.edu.au/
Authors
G Nasierding; G Tsoumakas; Abbas Kouzani
License
https://www.rioxx.net/licenses/all-rights-reserved/
Description
Triple random ensemble method for multi-label classification
LScDC Word-Category RIG Matrix
figshare.le.ac.uk
pdf
Updated Apr 28, 2020
Cite
Neslihan Suzen (2020). LScDC Word-Category RIG Matrix [Dataset]. http://doi.org/10.25392/leicester.data.12133431.v2
Available download formats: pdf
Unique identifier
https://doi.org/10.25392/leicester.data.12133431.v2
Dataset updated
Apr 28, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
LScDC Word-Category RIG Matrix
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

Getting Started

This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1] and the procedure used to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) together with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to the 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word). Its value shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive has two additional columns: the sum of RIGs in categories and the maximum of RIGs over categories (the last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.

This matrix was created for use in future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that the meaning can be estimated by information gains from word to categories.

LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We order the words of the LScDC by the sum of their RIGs in categories. That is, words are arranged by their informativeness in the scientific corpus LSC. The meaningfulness of words is therefore evaluated by the words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus.

Words as a Vector of Frequencies in WoS Categories

Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector is the number of texts containing the word in the corresponding category.

It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories, and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where the dimensions are the categories in the corpus.

The collection of vectors, with all words and categories in the entire corpus, can be shown in a table where each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and is presented in the published archive with this file. The value of each entry in the table shows how many times a word of the LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of LSC texts in the category that contain the word.

Words as a Vector of Relative Information Gains Extracted for Categories

In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it gains for categories.

For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined.

The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG), which provides a normalised measure of the Information Gain. This makes it possible to compare information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the published archive.

Given a word, we created a vector where each component corresponds to a category. Each word is therefore represented as a vector of relative information gains, and the dimension of the vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word, and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories. We note that a column vector represents the RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words within each category, words can be ordered by two criteria: the sum and the maximum of RIGs in categories. The top n words in this list can be considered the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.

RIGs for each word of the LScDC in 252 categories are calculated and the vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (the last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.

Leicester Scientific Thesaurus (LScT)

The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs in categories, and the top 5,000 words are selected for inclusion in the LScT. We consider these 5,000 words to be the most meaningful words in the scientific corpus. In other words, the meaningfulness of words is evaluated by the words’ average informativeness in the categories, and the list of these words is considered a ‘thesaurus’ for science. The LScT with the values of the sum can be found as a CSV file in the published archive.

The published archive contains the following files:
1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are the 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are the 252 WoS categories and rows are words of the LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.
3) LScT.csv: List of words of the LScT with sum (S) values.
4) Text_No_in_Cat.csv: The number of texts in categories.
5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.
6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and the LScT, and the procedures used to form them.
7) README.pdf (same as 6 in PDF format)

References
[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858.
[6] Shannon, C. E. (1948). A mathematical theory of communication. Bell system technical journal, 27(3), 379-423.
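The RIG entries of the matrix described above can be illustrated with a small Python sketch. The exact formulas are given in the archive README and are not reproduced in this description, so the sketch assumes the common definitions: information gain IG = H(C) - H(C|W) for the Boolean category variable C and word variable W, and RIG = IG / H(C). All function and parameter names are hypothetical.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a Boolean variable with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def relative_information_gain(n_total, n_cat, n_word, n_both):
    """RIG about category membership from observing the word (assumed definition).

    n_total: texts in the corpus; n_cat: texts in the category;
    n_word: texts containing the word; n_both: texts in the category containing the word.
    """
    h_cat = entropy(n_cat / n_total)
    if h_cat == 0.0:
        return 0.0
    p_word = n_word / n_total
    n_no_word = n_total - n_word
    h_given_word = entropy(n_both / n_word) if n_word else 0.0
    h_given_no_word = entropy((n_cat - n_both) / n_no_word) if n_no_word else 0.0
    h_conditional = p_word * h_given_word + (1 - p_word) * h_given_no_word
    return (h_cat - h_conditional) / h_cat

def sum_and_max(rig_row):
    """The two extra columns, S and M, for one word's row of RIGs."""
    return sum(rig_row), max(rig_row)
```

With this definition a word that is independent of the category gets a RIG of 0 and a word that perfectly predicts membership gets a RIG of 1, which is what makes values comparable across categories of different sizes.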
Results
researchdata.edu.au
bridges.monash.edu
Updated May 5, 2022
Cite
Chang Wei Tan (2022). Results [Dataset]. http://doi.org/10.4225/03/59e302de4ad10
Unique identifier
https://doi.org/10.4225/03/59e302de4ad10
Dataset updated
May 5, 2022
Dataset provided by
Monash University
Authors
Chang Wei Tan
Description
This is the result folder for our SDM18 paper on "Efficient search of the best warping window for Dynamic Time Warping"
Large-scale International Study
bridges.monash.edu
researchdata.edu.au
txt
Updated Jun 1, 2023
Cite
Francois Petitjean (2023). Large-scale International Study [Dataset]. http://doi.org/10.26180/5be402942000e
Available download formats: txt
Unique identifier
https://doi.org/10.26180/5be402942000e
Dataset updated
Jun 1, 2023
Dataset provided by
Monash University
Authors
Francois Petitjean
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The Large-scale International Study shows comparative availability and terms for a much larger sample of almost 100,000 books across those same five jurisdictions.
Additional file 7 of FELLA: an R package to enrich metabolomics data
springernature.figshare.com
zip
Updated Jun 1, 2023
Cite
Sergio Picart-Armada; Francesc Fernández-Albert; Maria Vinaixa; Oscar Yanes; Alexandre Perera-Lluna (2023). Additional file 7 of FELLA: an R package to enrich metabolomics data [Dataset]. http://doi.org/10.6084/m9.figshare.7503230.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.7503230.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare, http://figshare.com/
Authors
Sergio Picart-Armada; Francesc Fernández-Albert; Maria Vinaixa; Oscar Yanes; Alexandre Perera-Lluna
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
R workspace from the mouse model study. (ZIP 829 kb)
Source Code
bridges.monash.edu
researchdata.edu.au
zip
Updated Oct 15, 2017
Cite
Chang Wei Tan (2017). Source Code [Dataset]. http://doi.org/10.4225/03/59e33dfb920f1
Available download formats: zip
Unique identifier
https://doi.org/10.4225/03/59e33dfb920f1
Dataset updated
Oct 15, 2017
Dataset provided by
Monash University
Authors
Chang Wei Tan
License
https://www.gnu.org/licenses/gpl-3.0.html
Description
This is the source code for the paper "Efficient search of the best warping window for Dynamic Time Warping". This work focused on fast learning/searching for the best warping window for Dynamic Time Warping and Time Series Classification. For more info, visit https://github.com/ChangWeiTan/FastWWSearch
Additional file 5 of FELLA: an R package to enrich metabolomics data
springernature.figshare.com
zip
Updated Jun 1, 2023
Cite
Sergio Picart-Armada; Francesc Fernández-Albert; Maria Vinaixa; Oscar Yanes; Alexandre Perera-Lluna (2023). Additional file 5 of FELLA: an R package to enrich metabolomics data [Dataset]. http://doi.org/10.6084/m9.figshare.7503221.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.7503221.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare, http://figshare.com/
Authors
Sergio Picart-Armada; Francesc Fernández-Albert; Maria Vinaixa; Oscar Yanes; Alexandre Perera-Lluna
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Descriptive files on the three human datasets: a summary of the inputs (descriptive_input.csv), input and reported subgraph in each dataset (dataset_input.csv, dataset_subgraph.csv and dataset_subgraph.pdf), hits discussed in the results section (descriptive_hits.csv). Also contains the database object (fella_data.RData) and metadata about the database (info_fella_data.txt), the KEGG version (info_kegg.txt) and the R session (info_session.txt). (ZIP 525 kb)
LScDC (Leicester Scientific Dictionary-Core)
figshare.le.ac.uk
docx
Updated Apr 15, 2020
Cite
Neslihan Suzen (2020). LScDC (Leicester Scientific Dictionary-Core) [Dataset]. http://doi.org/10.25392/leicester.data.9896579.v3
Available download formats: docx
Unique identifier
https://doi.org/10.25392/leicester.data.9896579.v3
Dataset updated
Apr 15, 2020
Dataset provided by
University of Leicester
Authors
Neslihan Suzen
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Leicester
Description
The LScDC (Leicester Scientific Dictionary-Core Dictionary)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

[Version 3]
The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary) - Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. The files provided with this description are also the same as those described for LScDC Version 2. The numbers of words in the third versions of LScD and LScDC are summarized below.

Dictionary    # of words
LScD (v3)     972,060
LScDC (v3)    103,998

* Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2

[Version 2]
Getting Started

This file describes a sorted and cleaned list of words from the LScD (Leicester Scientific Dictionary), explains the steps for sub-setting the LScD, and gives basic statistics of words in the LSC (Leicester Scientific Corpus), to be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing the words, and is available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary was created for use in future work on the quantification of the sense of research texts.

The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining, using an enormous amount of text data challenges the performance and accuracy of data mining applications. The performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Words that occur rarely in a collection are not useful for discriminating texts in large corpora, as rare words are likely to be non-informative signals (or noise) and redundant in the collection of texts. The selection of relevant words also holds out the possibility of more effective and faster operation of text mining algorithms.

To build the LScDC, we decided on the following process for the LScD: removing words that appear in no more than 10 documents (
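The one sub-setting rule visible in this truncated description, discarding words that appear in no more than 10 documents and ordering the rest by document count, can be sketched as follows. This is an illustration only: the function names are hypothetical, documents are assumed to be already tokenised into lemmas, and any further criteria in the cut-off part of the description are not modelled.

```python
from collections import Counter

def document_frequencies(documents):
    """Count, for each word, the number of documents containing it at least once."""
    counts = Counter()
    for words in documents:
        counts.update(set(words))  # set(): each document counts a word once
    return counts

def core_dictionary(documents, max_rare_docs=10):
    """Drop words found in no more than `max_rare_docs` documents; sort by document count."""
    counts = document_frequencies(documents)
    kept = [(word, n) for word, n in counts.items() if n > max_rare_docs]
    return sorted(kept, key=lambda item: (-item[1], item[0]))

# Toy corpus: "cell" occurs in 11 documents, "rare" in only 10.
docs = [["cell", "cell", "rare"]] * 10 + [["cell"]]
print(core_dictionary(docs))  # [('cell', 11)]
```

Counting documents rather than total occurrences matches the description's ordering "by the number of documents containing the words".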
International Journal of Artificial Intelligence Impact Factor 2024-2025 -...
researchhelpdesk.org
Updated Feb 23, 2022
Cite
Research Help Desk (2022). International Journal of Artificial Intelligence Impact Factor 2024-2025 - ResearchHelpDesk [Dataset]. https://www.researchhelpdesk.org/journal/impact-factor-if/586/international-journal-of-artificial-intelligence
Dataset updated
Feb 23, 2022
Dataset authored and provided by
Research Help Desk
Description
International Journal of Artificial Intelligence Impact Factor 2024-2025 - ResearchHelpDesk - The main aim of the International Journal of Artificial Intelligence™ (ISSN 0974-0635) is to publish refereed, well-written original research articles and studies that describe the latest research and developments in the area of Artificial Intelligence. This is a broad-based journal covering all branches of Artificial Intelligence and its application in the following topics: Technology & Computing; Fuzzy Logic; Neural Networks; Reasoning and Evolution; Automatic Control; Mechatronics; Robotics; Parallel Processing; Programming Languages; Software & Hardware Architectures; CAD Design & Testing; Web Intelligence Applications; Computer Vision and Speech Understanding; Multimedia & Cognitive Informatics; Data Mining and Machine Learning Tools; Heuristic and AI Planning Strategies and Tools; Computational Theories of Learning; Signal, Image & Speech Processing; Intelligent System Architectures; Knowledge Representation; Bioinformatics; Natural Language Processing; Mathematics & Physics. The International Journal of Artificial Intelligence (IJAI) is a peer-reviewed online journal published twice a year, in Spring and Autumn. The International Journal of Artificial Intelligence (ISSN 0974-0635) was reviewed, abstracted and indexed in the past by the INSPEC The IET, SCOPUS (Elsevier Bibliographic Databases), Zentralblatt MATH (io-port.net) of European Mathematical Society, Indian Science Abstracts, getCITED, SCImago Journal & Country Rank, Newjour, JournalSeek, Math-jobs.com’s Journal Index, Academic keys, Ulrich's Periodicals Directory, IndexCopernicus, and International Statistical Institute (ISI, Netherlands) Journal Index. The IJAI has applied to be reviewed, abstracted and indexed by the Clarivate Analytics Web of Science (also known as Thomson ISI Web of Knowledge SCI), Mathematical Reviews and MathSciNet of the American Mathematical Society, and other agencies.
Synthesis_NER_Tagger
figshare.com
bin
Updated Jun 24, 2021
Cite
Fusataka Kuniyoshi (2021). Synthesis_NER_Tagger [Dataset]. http://doi.org/10.6084/m9.figshare.14832798.v1
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.14832798.v1
Dataset updated
Jun 24, 2021
Dataset provided by
Figshare, http://figshare.com/
Authors
Fusataka Kuniyoshi
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
Trained model for NER. Description: https://github.com/BananaTonic/Material_Synthesis_Corpus
Data from: Enriching time series datasets using Nonparametric kernel...
figshare.com
pdf
Updated May 31, 2023
Cite
Mohamad Ivan Fanany (2023). Enriching time series datasets using Nonparametric kernel regression to improve forecasting accuracy [Dataset]. http://doi.org/10.6084/m9.figshare.1609661.v1
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.1609661.v1
Dataset updated
May 31, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Mohamad Ivan Fanany
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Improving the accuracy of predictions of future values based on past and current observations has been pursued by enhancing prediction methods, combining those methods, or performing data pre-processing. In this paper, another approach is taken, namely increasing the number of inputs in the dataset. This approach is especially useful for shorter time series. By filling in the in-between values in the time series, the size of the training set can be increased, thus increasing the generalization capability of the predictor. The algorithm used for prediction is a Neural Network, as it is widely used in the literature for time series tasks. For comparison, Support Vector Regression is also employed. The datasets used in the experiment are the frequencies of USPTO patents and PubMed scientific publications in the field of health, namely on Apnea, Arrhythmia, and Sleep Stages. Another time series, designated for the NN3 Competition in the field of transportation, is also used for benchmarking. The experimental results show that prediction performance can be significantly increased by filling in in-between data in the time series. Furthermore, the use of detrending and deseasonalization, which separates the data into trend, seasonal and stationary time series, also improves prediction performance on both the original and the filled datasets. The optimal increase in dataset size in this experiment is about five times the length of the original dataset.
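The idea of filling in-between values can be illustrated with a minimal sketch. The description names nonparametric kernel regression but gives no further detail, so the Nadaraya-Watson estimator, the Gaussian kernel, the default bandwidth, and the function names below are all assumptions made for illustration rather than the author's implementation.

```python
import math


def nadaraya_watson(x_train, y_train, x_query, bandwidth):
    """Gaussian-kernel Nadaraya-Watson estimate of y at each query point."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    estimates = []
    for xq in x_query:
        # Weight every observation by its kernel distance to the query point.
        weights = [
            math.exp(-0.5 * ((xq - xt) / bandwidth) ** 2) for xt in x_train
        ]
        total = sum(weights)
        estimates.append(sum(w * y for w, y in zip(weights, y_train)) / total)
    return estimates


def fill_in_between(series, factor, bandwidth=0.5):
    """Resample a series on a grid `factor` times denser than the original.

    Observations are assumed to sit at integer time steps 0, 1, 2, ...
    (an assumption of this sketch). Kernel regression smooths the data,
    so the dense series need not pass exactly through the original points.
    """
    n = len(series)
    t = list(range(n))
    n_dense = (n - 1) * factor + 1
    t_dense = [i / factor for i in range(n_dense)]
    return t_dense, nadaraya_watson(t, series, t_dense, bandwidth)


if __name__ == "__main__":
    # Made-up values; factor=5 echoes the "about five times" finding above.
    times, values = fill_in_between([3.0, 5.0, 4.0, 8.0], factor=5)
    for t, v in zip(times, values):
        print(f"{t:.1f}\t{v:.3f}")
```

The bandwidth controls how strongly neighbouring observations are blended: a very small bandwidth relative to the grid spacing can make all weights underflow to zero, so it would need to be tuned for real data.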
International Journal of Artificial Intelligence Acceptance Rate -...
researchhelpdesk.org
Updated Apr 27, 2022
Cite
Research Help Desk (2022). International Journal of Artificial Intelligence Acceptance Rate - ResearchHelpDesk [Dataset]. https://www.researchhelpdesk.org/journal/acceptance-rate/586/international-journal-of-artificial-intelligence
Explore at:
Dataset updated
Apr 27, 2022
Dataset authored and provided by
Research Help Desk
Description
International Journal of Artificial Intelligence Acceptance Rate - ResearchHelpDesk - The main aim of the International Journal of Artificial Intelligence™ (ISSN 0974-0635) is to publish refereed, well-written original research articles and studies that describe the latest research and developments in the area of Artificial Intelligence. This is a broad-based journal covering all branches of Artificial Intelligence and its applications in the following topics: Technology & Computing; Fuzzy Logic; Neural Networks; Reasoning and Evolution; Automatic Control; Mechatronics; Robotics; Parallel Processing; Programming Languages; Software & Hardware Architectures; CAD Design & Testing; Web Intelligence Applications; Computer Vision and Speech Understanding; Multimedia & Cognitive Informatics; Data Mining and Machine Learning Tools; Heuristic and AI Planning Strategies and Tools; Computational Theories of Learning; Signal, Image & Speech Processing; Intelligent System Architectures; Knowledge Representation; Bioinformatics; Natural Language Processing; Mathematics & Physics. The International Journal of Artificial Intelligence (IJAI) is a peer-reviewed online journal published twice a year, in Spring and Autumn. The International Journal of Artificial Intelligence (ISSN 0974-0635) was reviewed, abstracted and indexed in the past by INSPEC (The IET), SCOPUS (Elsevier Bibliographic Databases), Zentralblatt MATH (io-port.net) of the European Mathematical Society, Indian Science Abstracts, getCITED, SCImago Journal & Country Rank, Newjour, JournalSeek, Math-jobs.com’s Journal Index, Academic Keys, Ulrich's Periodicals Directory, IndexCopernicus, and the International Statistical Institute (ISI, Netherlands) Journal Index.
The IJAI is in the process of applying to be reviewed, abstracted and indexed by the Clarivate Analytics Web of Science (also known as Thomson ISI Web of Knowledge SCI), Mathematical Reviews and MathSciNet of the American Mathematical Society, and other agencies.
Business Online Services Sp. z o.o. - Autonomous Knowledge Extractor |...
gimi9.com
Updated Mar 24, 2023
Cite
(2023). Business Online Services Sp. z o.o. - Autonomous Knowledge Extractor | gimi9.com [Dataset]. https://gimi9.com/dataset/pl_3071_autonomiczny-ekstraktor-wiedzy/
Explore at:
Dataset updated
Mar 24, 2023
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Development of methods for obtaining data from various sources: the goal of the task is to develop an appropriate architecture and pipelines for processing data obtained from heterogeneous sources and formats in order to collect them in a coherent form in a central knowledge repository. It requires the use of an ETL/ESB type architecture based on a queuing system and distributed processing.
Development of a large-scale data processing architecture for the developed algorithms: the goal of the task is to develop an implementation architecture that would enable the developed algorithms to run on a large scale, e.g. on the basis of distributed processing systems such as Apache Spark.
Development of scalable data storage methods: the aim of the task is to select a data storage environment that enables effective representation of knowledge as a semantic network. The use of a graph database engine or a database that supports the RDF format will be required here.
Development of an API enabling data mining: the aim of the task is to develop an API enabling the use of semantic knowledge accumulated in the system by various types of algorithms for further data processing, machine learning and artificial intelligence. A probable solution here may be to create an interface based on the SPARQL standard.
Development of a prototype of a user interface for data mining: the aim of the task is to develop an ergonomic interface that allows domain users to explore and analyze the collected data. It is necessary to develop a method of generating an interface that automatically adapts to the type of data collected in the system, enabling data exploration through "Query By Example" queries, faceted search and traversing relationships between entities in the semantic model.
Loughran McDonald-SA-2020 Sentiment Word List
researchdata.up.ac.za
txt
Updated Aug 27, 2025
Cite
Michelle Terblanche; Vukosi Marivate (2025). Loughran McDonald-SA-2020 Sentiment Word List [Dataset]. http://doi.org/10.25403/UPresearchdata.14401178.v2
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.25403/UPresearchdata.14401178.v2
Dataset updated
Aug 27, 2025
Dataset provided by
University of Pretoria
Authors
Michelle Terblanche; Vukosi Marivate
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
Description
The Loughran and McDonald Sentiment Word Lists were developed using corporate 10-K reports between 1994 and 2008 [14]. These reports are relevant to companies in the United States of America and are required by the U.S. Securities and Exchange Commission (SEC) [14].
The motivation for building the LM-SA-2020 word list was based on an experiment using the above-mentioned original lists to detect sentiment-carrying words in South African financial article headlines. A corpus of 808 financial articles (relating to Sasol) was used, and only 37% of headlines had words whose sentiment in the Loughran and McDonald Sentiment Word Lists matched the ground truth labels. A gap was therefore identified in developing a method for predicting the sentiment of financial articles in a South African context. Due to the size of the data set, it was possible to manually examine the headlines to identify sentiment-carrying words to be included in the original word lists. Furthermore, synonyms were added for the existing words in the Loughran and McDonald Sentiment Word Lists using NLTK’s WordNet [16] interface. The sentiment detection/prediction accuracy improved by 29% using the new word list. This sentiment word list can be further expanded/improved in future by increasing the size of the data set and/or including data from other companies. It highlights the need for not only domain-specific sentiment prediction tools but also region-specific corporate.
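As a rough illustration of how a sentiment word list can be applied to headlines, the sketch below scores a headline by counting matches against positive and negative word sets. The description does not state the scoring rule that was used, so the count-difference rule, the function name, and the tiny word sets are hypothetical stand-ins, not entries from the Loughran and McDonald lists or the LM-SA-2020 list.

```python
import re

# Hypothetical miniature word lists, for illustration only.
POSITIVE = {"gain", "growth", "profit", "surge"}
NEGATIVE = {"loss", "decline", "strike", "plunge"}


def headline_sentiment(headline, positive=POSITIVE, negative=NEGATIVE):
    """Label a headline by the balance of positive and negative list words.

    Assumed rule: positive matches minus negative matches; a tie
    (including no matches at all) is reported as "neutral".
    """
    tokens = re.findall(r"[a-z]+", headline.lower())
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # Made-up headlines about a fictional company.
    print(headline_sentiment("Acme reports profit growth"))
    print(headline_sentiment("Analysts downgrade Acme"))
    # Extending the list with an extra word changes the second result.
    extended = NEGATIVE | {"downgrade"}
    print(headline_sentiment("Analysts downgrade Acme", negative=extended))
```

The last call shows the effect the description attributes to extending the list: a headline that matched nothing in the base list receives a label once a relevant word is added.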
International Journal of Artificial Intelligence FAQ - ResearchHelpDesk
researchhelpdesk.org
Updated Jun 22, 2022
+ more versions
Cite
Research Help Desk (2022). International Journal of Artificial Intelligence FAQ - ResearchHelpDesk [Dataset]. https://www.researchhelpdesk.org/journal/faq/586/international-journal-of-artificial-intelligence
Explore at:
Dataset updated
Jun 22, 2022
Dataset authored and provided by
Research Help Desk
Description
International Journal of Artificial Intelligence FAQ - ResearchHelpDesk - The main aim of the International Journal of Artificial Intelligence™ (ISSN 0974-0635) is to publish refereed, well-written original research articles and studies that describe the latest research and developments in the area of Artificial Intelligence. This is a broad-based journal covering all branches of Artificial Intelligence and its applications in the following topics: Technology & Computing; Fuzzy Logic; Neural Networks; Reasoning and Evolution; Automatic Control; Mechatronics; Robotics; Parallel Processing; Programming Languages; Software & Hardware Architectures; CAD Design & Testing; Web Intelligence Applications; Computer Vision and Speech Understanding; Multimedia & Cognitive Informatics; Data Mining and Machine Learning Tools; Heuristic and AI Planning Strategies and Tools; Computational Theories of Learning; Signal, Image & Speech Processing; Intelligent System Architectures; Knowledge Representation; Bioinformatics; Natural Language Processing; Mathematics & Physics. The International Journal of Artificial Intelligence (IJAI) is a peer-reviewed online journal published twice a year, in Spring and Autumn. The International Journal of Artificial Intelligence (ISSN 0974-0635) was reviewed, abstracted and indexed in the past by INSPEC (The IET), SCOPUS (Elsevier Bibliographic Databases), Zentralblatt MATH (io-port.net) of the European Mathematical Society, Indian Science Abstracts, getCITED, SCImago Journal & Country Rank, Newjour, JournalSeek, Math-jobs.com’s Journal Index, Academic Keys, Ulrich's Periodicals Directory, IndexCopernicus, and the International Statistical Institute (ISI, Netherlands) Journal Index.
The IJAI is in the process of applying to be reviewed, abstracted and indexed by the Clarivate Analytics Web of Science (also known as Thomson ISI Web of Knowledge SCI), Mathematical Reviews and MathSciNet of the American Mathematical Society, and other agencies.
Additional file 6 of FELLA: an R package to enrich metabolomics data
springernature.figshare.com
zip
Updated Jun 1, 2023
Cite
Sergio Picart-Armada; Francesc Fernández-Albert; Maria Vinaixa; Oscar Yanes; Alexandre Perera-Lluna (2023). Additional file 6 of FELLA: an R package to enrich metabolomics data [Dataset]. http://doi.org/10.6084/m9.figshare.7503227.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.7503227.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare (http://figshare.com/)
figshare
Authors
Sergio Picart-Armada; Francesc Fernández-Albert; Maria Vinaixa; Oscar Yanes; Alexandre Perera-Lluna
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
R workspace from the gilt-head bream datasets. (ZIP 590 kb)
Data from: Mining Coronavirus (COVID-19) Posts in Social Media
figshare.com
zip
Updated Jun 2, 2023
Cite
Negin Karisani; Payam Karisani (2023). Mining Coronavirus (COVID-19) Posts in Social Media [Dataset]. http://doi.org/10.6084/m9.figshare.12597755.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.12597755.v1
Dataset updated
Jun 2, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Negin Karisani; Payam Karisani
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
A short description of the files:
- tweet_ids.txt.zip: Contains the tweet IDs mentioned in the paper. It holds 9 million tweet IDs, one per line, for tweets published between Jan 27 and April 20, 2020; see the paper below for more details.
- bert-base-uncased-corona.zip: The pre-trained BERT model discussed in the paper. We used the PyTorch implementation of BERT, available on the Hugging Face GitHub.

Cite

Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v1

LSC (Leicester Scientific Corpus)

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

Unique identifier

https://doi.org/10.25392/leicester.data.9449639.v1

Dataset updated

Apr 15, 2020

Dataset provided by

University of Leicester

Authors

Neslihan Suzen

License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Area covered

Leicester

Description

The LSC (Leicester Scientific Corpus)
August 2019, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

The data is extracted from the Web of Science® [1]. You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.

Getting Started

This text provides background information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for use in future work on the quantification of the sense of research texts. One of the goals of publishing the data is to make it available for further analysis and use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. Each document contains the title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English.

The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018.

Each document in the corpus contains the following parts:
1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of _Categories.txt’.
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
6. Total Times cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,824.

All documents in LSC have a nonempty abstract, title, categories, research areas and times cited in WoS databases. There are 119 documents with an empty authors list; we did not exclude these documents.

Data Processing

This section describes all the steps taken to collect and clean the LSC and make it available to researchers. Processing the data consists of six main steps:

Step 1: Downloading of the Data Online
This is the step of collecting the dataset online. This is done manually by exporting documents as tab-delimited files. All downloaded documents are available online.

Step 2: Importing the Dataset to R
This is the process of converting the collection to RData format for processing the data. The LSC was collected as TXT files. All documents are extracted to R.

Step 3: Cleaning the Data from Documents with Empty Abstract or without Category
Not all papers in the collection have an abstract and categories. As our research is based on the analysis of abstracts and categories, such incomplete documents were first detected and removed. All documents with empty abstracts and documents without categories are removed.

Step 4: Identification and Correction of Concatenated Words in Abstracts
Traditionally, abstracts are written in the format of an executive summary, as one paragraph of continuous writing, which is known as an ‘unstructured abstract’. However, medicine-related publications in particular use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc.

The tool used for extracting abstracts concatenates section headings with the first word of the section. As a result, some of the structured abstracts in the LSC require an additional correction step to split such concatenated words. For instance, we observe words such as ConclusionHigher and ConclusionsRT in the corpus. The detection and identification of concatenated words cannot be fully automated. Human intervention is needed to identify possible section headings. We note that we only consider concatenated words in section headings, as it is not possible to detect all concatenated words without deep knowledge of the research areas. Identification of such words is done by sampling medicine-related publications. The section headings found in such abstracts are listed in List 1.

List 1 Headings of sections identified in structured abstracts
Background       Method(s)        Design
Theoretical      Measurement(s)   Location
Aim(s)           Methodology      Process
Abstract         Population       Approach
Objective(s)     Purpose(s)       Subject(s)
Introduction     Implication(s)   Patient(s)
Procedure(s)     Hypothesis       Measure(s)
Setting(s)       Limitation(s)    Discussion
Conclusion(s)    Result(s)        Finding(s)
Material(s)      Rationale(s)     Implications for health and nursing policy

All words that include a heading from List 1 are detected across the entire corpus and then split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.

Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
After the correction of concatenated words is completed, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as the Microsoft Word ‘word count’ [5].

According to the APA style manual [6], an abstract should contain between 150 and 250 words. However, word limits vary from journal to journal. For instance, the Journal of Vascular Surgery recommends that ‘Clinical and basic research studies must include a structured abstract of 400 words or less’ [7].

In LSC, the length of abstracts varies from 1 to 3805 words. We decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis. Documents with abstracts of fewer than 30 or more than 500 words are removed.

Step 6: Saving the Dataset into CSV Format
Corrected and extracted documents are saved into 36 CSV files. The structure of the files is described in the following section.

The Structure of Fields in CSV Files

In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields.

To access the LSC for research purposes, please email ns433@le.ac.uk.

References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.
[7] P. Gloviczki and P. F. Lawrence, "Information for authors," Journal of Vascular Surgery, vol. 65, no. 1, pp. A16-A22, 2017.
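Steps 4 and 5 of the data processing described above can be sketched in a few lines. The original processing was done in R and its code is not shown here, so this Python version is only an illustration under stated assumptions: the heading list is a subset of List 1, the regular expression is one possible detection rule, whitespace splitting stands in for the Microsoft Word counting rule, and all function names are hypothetical.

```python
import re

# A subset of the List 1 headings, with singular and plural forms spelled out.
HEADINGS = [
    "Background", "Introduction", "Aim", "Aims", "Objective", "Objectives",
    "Method", "Methods", "Methodology", "Result", "Results",
    "Finding", "Findings", "Discussion", "Conclusion", "Conclusions",
]

# Try longer headings first so that "Conclusions" wins over "Conclusion".
_ALTERNATIVES = "|".join(sorted(HEADINGS, key=len, reverse=True))

# A heading at the start of a word, immediately followed by an uppercase
# letter: the signature of a heading fused with the next word.
_FUSED_HEADING = re.compile(rf"\b({_ALTERNATIVES})(?=[A-Z])")


def split_fused_headings(text):
    """Insert a space between a known heading and the word fused to it."""
    return _FUSED_HEADING.sub(r"\1 ", text)


def word_count(text):
    """Approximate word count: whitespace-separated tokens (an assumption)."""
    return len(text.split())


def preprocess(abstracts, low=30, high=500):
    """Apply the Step 4 correction, then keep abstracts of low..high words."""
    cleaned = (split_fused_headings(a) for a in abstracts)
    return [a for a in cleaned if low <= word_count(a) <= high]


if __name__ == "__main__":
    print(split_fused_headings("ConclusionHigher doses were tolerated."))
    print(split_fused_headings("ConclusionsRT improved local control."))
```

The length filter runs after the heading correction, following the order given in Step 5, because splitting a fused word changes the word count of an abstract.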
