44 datasets found
  1. Data from: Results obtained in a data mining process applied to a database...

    • scielo.figshare.com
    jpeg
    Updated Jun 4, 2023
    Cite
    E.M. Ruiz Lobaina; C. P. Romero Suárez (2023). Results obtained in a data mining process applied to a database containing bibliographic information concerning four segments of science. [Dataset]. http://doi.org/10.6084/m9.figshare.20011798.v1
    Explore at:
    jpeg (available download formats)
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    E.M. Ruiz Lobaina; C. P. Romero Suárez
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract The objective of this work is to improve the quality of the information in the CubaCiencia database of the Institute of Scientific and Technological Information. This database holds bibliographic information referring to four segments of science and is the main database of the Library Management System. The applied methodology was based on Decision Trees, the Correlation Matrix, the 3D Scatter Plot, etc., which are data mining techniques used to study large volumes of information. The results achieved not only made it possible to improve the information in the database, but also provided patterns that proved truly useful in achieving the proposed objectives.

  2. Knowledge Discovery in Databases Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 22, 2025
    Cite
    Growth Market Reports (2025). Knowledge Discovery in Databases Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/knowledge-discovery-in-databases-market
    Explore at:
    pptx, csv, pdf (available download formats)
    Dataset updated
    Aug 22, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Knowledge Discovery in Databases Market Outlook



    According to our latest research, the global Knowledge Discovery in Databases (KDD) market size reached USD 8.7 billion in 2024, driven by the exponential growth of data across industries and increasing demand for advanced analytics solutions. The market is experiencing a robust expansion, registering a CAGR of 18.5% during the forecast period. By 2033, the Knowledge Discovery in Databases market is projected to attain a value of USD 44.9 billion. This remarkable growth is primarily attributed to the rising adoption of artificial intelligence (AI), machine learning (ML), and big data analytics, which are transforming how organizations extract actionable insights from vast and complex datasets.



    The surge in data generation from digital transformation initiatives, IoT devices, and cloud-based applications is a major growth driver for the Knowledge Discovery in Databases market. As organizations increasingly digitize their operations and customer interactions, the volume, variety, and velocity of data have soared, making traditional data analysis methods insufficient. KDD platforms and solutions are essential for uncovering hidden patterns, correlations, and trends within large datasets, enabling businesses to make data-driven decisions and gain a competitive edge. Furthermore, the proliferation of unstructured data from sources such as social media, emails, and multimedia content has heightened the need for advanced mining techniques, further fueling market growth.



    Another significant factor propelling the Knowledge Discovery in Databases market is the integration of AI and ML technologies into KDD solutions. These intelligent algorithms enhance the automation, accuracy, and scalability of data mining processes, allowing organizations to extract deeper insights in real time. The increasing availability of cloud-based KDD solutions has democratized access to advanced analytics, enabling small and medium enterprises (SMEs) to leverage sophisticated tools without the need for extensive infrastructure investments. Additionally, the growing emphasis on regulatory compliance, risk management, and fraud detection in sectors such as BFSI and healthcare is driving the adoption of KDD technologies to ensure data integrity and security.



    The evolving landscape of digital businesses and the rising importance of customer-centric strategies have also contributed to the expansion of the Knowledge Discovery in Databases market. Enterprises across retail, telecommunications, and manufacturing are harnessing KDD tools to personalize offerings, optimize supply chains, and enhance operational efficiency. The ability of KDD platforms to handle diverse data types, including text, images, and video, has broadened their applicability across various domains. Moreover, the increasing focus on predictive analytics and real-time decision-making is encouraging organizations to invest in KDD solutions that provide timely and actionable insights, thereby driving sustained market growth through 2033.



    From a regional perspective, North America continues to dominate the Knowledge Discovery in Databases market, supported by the presence of leading technology vendors, high digital adoption rates, and substantial investments in AI and analytics infrastructure. However, the Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding IT ecosystems, and government initiatives promoting data-driven innovation. Europe remains a significant market, characterized by strong regulatory frameworks and a focus on data privacy and security. Latin America and the Middle East & Africa are also emerging as promising markets, driven by increasing awareness of the benefits of KDD and growing investments in digital transformation across industries.





    Component Analysis



    The Knowledge Discovery in Databases market is segmented by component into Software, Services, and Platforms, each playing a crucial role in the overall ecosystem. Software solutions form the backbone of the KDD ma

  3. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Washington and Lee University
    College of William and Mary
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts support arguments that inform the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features, or attributes, that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.
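    The normalization described above can be sketched as a simple keyword mapping. This is an illustrative stand-in, not the authors' actual preprocessing code; the matching patterns are assumptions:

    ```python
    # Sketch (not the authors' code): normalize free-text "metrics" values into
    # the fixed categories listed above; anything unrecognized becomes "Other Metrics".
    CANONICAL = {
        "mrr": "MRR",
        "roc": "ROC or AUC",
        "auc": "ROC or AUC",
        "bleu": "BLEU Score",
        "accuracy": "Accuracy",
        "precision": "Precision",
        "recall": "Recall",
        "f1": "F1 Measure",
    }

    def normalize_metric(raw: str) -> str:
        """Map a raw metric string to its canonical category."""
        key = raw.strip().lower()
        for pattern, category in CANONICAL.items():
            if pattern in key:
                return category
        return "Other Metrics"

    print(normalize_metric("BLEU-4"))          # BLEU Score
    print(normalize_metric("top-k hit rate"))  # Other Metrics
    ```

    The same pattern would apply to the other normalized features, such as "SE Data" and "Reproducibility Types".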

    Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to use when tuning the explainable models.
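    The PCA projection to two components can be sketched with plain NumPy on synthetic stand-in data (the real feature matrix and the authors' tooling are not reproduced here):

    ```python
    # Sketch of the transformation step (assumed 0/1-encoded data, not the
    # actual SLR feature matrix): project 35 features onto the first two
    # principal components for visualization.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(60, 35)).astype(float)  # 60 "papers", 35 features

    Xc = X - X.mean(axis=0)                    # center each feature
    cov = (Xc.T @ Xc) / (len(Xc) - 1)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    components = eigvecs[:, ::-1][:, :2]       # top-2 principal directions
    X2 = Xc @ components                       # 2-D coordinates for plotting

    print(X2.shape)  # (60, 2)
    ```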

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncovering hidden relationships among the extracted features (Correlations and Association Rules) and to categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes, which produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = the number of occurrences in which the statement is true, divided by the total number of statements.
    Confidence = the support of the statement, divided by the number of occurrences of the premise.
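    These two definitions can be illustrated on a toy example (hypothetical feature sets, not the actual DL4SE extraction) for the rule "Supervised Learning → Irreproducible":

    ```python
    # Toy illustration of support and confidence for an association rule.
    papers = [
        {"Supervised Learning", "Irreproducible"},
        {"Supervised Learning", "Irreproducible"},
        {"Supervised Learning"},
        {"Reinforcement Learning", "Irreproducible"},
    ]

    premise, conclusion = "Supervised Learning", "Irreproducible"
    both = sum(1 for p in papers if premise in p and conclusion in p)
    premise_count = sum(1 for p in papers if premise in p)

    support = both / len(papers)       # 2/4 = 0.5
    confidence = both / premise_count  # 2/3

    print(support, round(confidence, 3))
    ```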

  4. Data from: Historical Data Mining Deep Dive into Machine Learning-Aided 2D...

    • acs.figshare.com
    • figshare.com
    xlsx
    Updated Jun 23, 2025
    Cite
    Krittapong Deshsorn; Panwad Chavalekvirat; Somrudee Deepaisarn; Ho-Chiao Chuang; Pawin Iamprasertkun (2025). Historical Data Mining Deep Dive into Machine Learning-Aided 2D Materials Research in Electrochemical Applications [Dataset]. http://doi.org/10.1021/acsmaterialsau.5c00030.s001
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    ACS Publications
    Authors
    Krittapong Deshsorn; Panwad Chavalekvirat; Somrudee Deepaisarn; Ho-Chiao Chuang; Pawin Iamprasertkun
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Machine learning transforms the landscape of 2D materials design, particularly in accelerating discovery, optimization, and screening processes. This review has delved into the historical and ongoing integration of machine learning in 2D materials for electrochemical energy applications, using the Knowledge Discovery in Databases (KDD) approach to guide the research through data mining from the Scopus database using analysis of citations, keywords, and trends. The topics will first focus on a “macro” scope, where hundreds of literature reports are computer analyzed for key insights, such as year analysis, publication origin, and word co-occurrence using heat maps and network graphs. Afterward, the focus will be narrowed down into a more specific “micro” scope obtained from the “macro” overview, which is intended to dive deep into machine learning usage. From the gathered insights, this work highlights how machine learning, density functional theory (DFT), and traditional experimentation are jointly advancing the field of materials science. Overall, the resulting review offers a comprehensive analysis, touching on essential applications such as batteries, fuel cells, supercapacitors, and synthesis processes while showcasing machine learning techniques that enhance the identification of critical material properties.
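    The word co-occurrence analysis described above can be sketched as a pairwise count over per-paper keyword lists, the raw material for a heat map or network graph. The keyword lists below are made up for illustration, not drawn from the Scopus data:

    ```python
    # Count pairwise keyword co-occurrence across papers (hypothetical data).
    from collections import Counter
    from itertools import combinations

    papers = [
        ["machine learning", "2d materials", "supercapacitor"],
        ["machine learning", "dft", "2d materials"],
        ["dft", "battery"],
    ]

    cooccurrence = Counter()
    for keywords in papers:
        # Sort so each unordered pair has a single canonical key.
        for a, b in combinations(sorted(set(keywords)), 2):
            cooccurrence[(a, b)] += 1

    print(cooccurrence[("2d materials", "machine learning")])  # 2
    ```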

  5. m

    Portuguese Examples (Semantic Migration)

    • data.mendeley.com
    Updated Jul 14, 2022
    Cite
    Dora Melo (2022). Portuguese Examples (Semantic Migration) [Dataset]. http://doi.org/10.17632/t2cx9stwfb.1
    Explore at:
    Dataset updated
    Jul 14, 2022
    Authors
    Dora Melo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An excerpt of the CIDOC-CRM Ontology Representation of the DigitArq records from Bragança District Archive. The dataset also includes two SPARQL query examples - "What are the locals and their parishes located in the county 'Bragança' between 1900 and 1910?" and "What is the number of children per couple, between 1800 and 1850?", to facilitate the ontology exploration. This dataset is part of the results obtained from the semantic migration process of DigitArq - Portuguese Archive Database - metadata into CIDOC-CRM Ontology representation. This work is done in the context of the R&D EPISA project (Entity and Property Inference for Semantic Archives), a research project financed by National Funds through the Portuguese funding agency, FCT (Fundação para a Ciência e a Tecnologia) - DSAIPA/DS/0023/2018.

  6. Data from: Forecasting the human development index and life expectancy in...

    • scielo.figshare.com
    jpeg
    Updated May 31, 2023
    Cite
    Celso Bilynkievycz dos Santos; Luiz Alberto Pilatti; Bruno Pedroso; Deborah Ribeiro Carvalho; Alaine Margarete Guimarães (2023). Forecasting the human development index and life expectancy in Latin American countries using data mining techniques [Dataset]. http://doi.org/10.6084/m9.figshare.7420340.v1
    Explore at:
    jpeg (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Celso Bilynkievycz dos Santos; Luiz Alberto Pilatti; Bruno Pedroso; Deborah Ribeiro Carvalho; Alaine Margarete Guimarães
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Latin America
    Description

    Abstract The predictability of epidemiological indicators can help estimate dependent variables, assist in decision-making to support public policies, and explain the scenarios experienced by different countries worldwide. This study aimed to forecast the Human Development Index (HDI) and life expectancy (LE) for Latin American countries for the period of 2015-2020 using data mining techniques. All stages of the process of knowledge discovery in databases were covered. The SMOReg data mining algorithm was used in the models with multivariate time series to make predictions; this algorithm performed the best in the tests developed during the evaluation period. The average HDI and LE for Latin American countries showed an increasing trend in the period evaluated, corresponding to 4.99 ± 3.90% and 2.65 ± 0.06 years, respectively. Multivariate models allow for a greater evaluation of algorithms, thus increasing their accuracy. Data mining techniques have a better predictive quality relative to the most popular technique, Autoregressive Integrated Moving Average (ARIMA). In addition, the predictions suggest that there will be a higher increase in the mean HDI and LE for Latin American countries compared to the mean values for the rest of the world.
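    The multivariate time-series setup can be sketched as follows. The study used Weka's SMOReg (SVM regression); here a plain least-squares model on lagged features stands in for it, and the numbers are synthetic, not real HDI/LE data:

    ```python
    # Illustrative one-step-ahead forecast from lagged multivariate features
    # (synthetic series; SMOReg itself is a Weka SVM-regression algorithm).
    import numpy as np

    hdi = np.array([0.70, 0.71, 0.72, 0.74, 0.75, 0.77, 0.78])
    le  = np.array([72.0, 72.4, 72.9, 73.3, 73.8, 74.1, 74.6])  # life expectancy

    # Predict next-year HDI from this year's (HDI, LE) pair.
    X = np.column_stack([hdi[:-1], le[:-1], np.ones(len(hdi) - 1)])
    y = hdi[1:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    forecast = np.array([hdi[-1], le[-1], 1.0]) @ coef  # one-step-ahead forecast
    print(round(float(forecast), 3))
    ```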

  7. Synthetic Process Execution Trace

    • kaggle.com
    zip
    Updated May 22, 2022
    Cite
    Asjad K (2022). Synthetic Process Execution Trace [Dataset]. https://www.kaggle.com/datasets/asjad99/process-trace
    Explore at:
    zip (55873943 bytes; available download formats)
    Dataset updated
    May 22, 2022
    Authors
    Asjad K
    License

    Public Domain Dedication (CC0 1.0), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Background

    Any set of related activities that are executed in a repeatable manner toward a defined goal can be seen as a process.

    Process analytic approaches allow organizations to support the practice of Business Process Management and continuous improvement by leveraging all process-related data to extract knowledge, improve process performance, and support managerial decision-making across the organization.

    For organisations interested in continuous improvement, such datasets enable a data-driven approach to identifying performance bottlenecks, reducing costs, extracting insights, and optimizing the utilization of available resources. Understanding the properties of the currently deployed process (whose execution trace is available) is critical to knowing whether it is worth investing in improvements, where performance problems exist, how much variation there is in the process across instances, and what the root causes are.

    What is Process Mining (PM)?

    → the process of extracting valuable information from event logs/databases that are generated by processes.

    Two topics are important: i) process discovery, where a process model describing the control flow is inferred from the data, and ii) conformance checking, which deals with verifying that the behavior in the event log adheres to a set of business rules, e.g., defined as a process model. These two use cases focus on the control-flow perspective.
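    As a minimal sketch of the discovery side, the directly-follows relation that discovery algorithms (such as the alpha miner) start from can be computed from traces. The hospital-style traces below are hypothetical, not events from this dataset:

    ```python
    # Build a directly-follows graph from event traces (hypothetical data).
    from collections import Counter

    traces = [
        ["register", "triage", "treat", "discharge"],
        ["register", "triage", "discharge"],
        ["register", "treat", "discharge"],
    ]

    directly_follows = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):  # consecutive event pairs
            directly_follows[(a, b)] += 1

    print(directly_follows[("register", "triage")])  # 2
    ```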

    Why Process Mining?

    → identifying hidden nodes and bottlenecks in business processes.

    About the Dataset

    A synthetic event log with 100,000 traces and 900,000 events, generated by simulating a simple artificial process model. There are three data attributes in the event log: Priority, Nurse, and Type. Some paths in the model are recorded infrequently based on the value of these attributes.

    Noise is added by randomly adding one additional event to an increasing number of traces. CPN Tools (http://cpntools.org) was used to generate the event log and inject the noise. The amount of noise can be controlled with the constant 'noise'.
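    The noise-injection idea can be sketched in Python as a stand-in for the CPN Tools simulation; the 'noise' fraction and event names below are assumptions for illustration:

    ```python
    # Randomly add one extra, out-of-place event to a fraction of traces.
    import random

    random.seed(42)
    NOISE = 0.2  # fraction of traces that receive one extra event (assumed value)

    def inject_noise(traces, noise=NOISE):
        noisy = []
        for trace in traces:
            trace = list(trace)  # copy so the clean log is untouched
            if random.random() < noise:
                # insert one duplicate event at a random position
                trace.insert(random.randrange(len(trace) + 1), random.choice(trace))
            noisy.append(trace)
        return noisy

    clean = [["a", "b", "c"]] * 100
    noisy = inject_noise(clean)
    print(sum(len(t) > 3 for t in noisy))  # roughly NOISE * 100 traces got longer
    ```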

    Smaller dataset:

    The files test0 to test5 represent process traces and may be used for debugging and sanity-check purposes.

  8. Veterans Affairs Corporate Data Warehouse

    • atlaslongitudinaldatasets.ac.uk
    url
    Updated Oct 21, 2024
    Cite
    United States Department of Veterans Affairs (VA) (2024). Veterans Affairs Corporate Data Warehouse [Dataset]. https://atlaslongitudinaldatasets.ac.uk/datasets/va-cdw
    Explore at:
    url (available download formats)
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    Atlas of Longitudinal Datasets
    Authors
    United States Department of Veterans Affairs (VA)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States of America
    Variables measured
    None
    Measurement technique
    Healthcare records, Secondary data, Registry, None
    Dataset funded by
    National Institutes of Health (NIH)
    Description

    VA CDW is a repository comprising data from multiple Veterans Health Administration (VHA) clinical and administrative systems. VHA is one of the largest integrated healthcare systems in the United States, with data from over 20 years of sustained electronic health record (EHR) use. VA CDW was developed in 2006 to accommodate the massive amounts of data being generated and to streamline the path from knowledge discovery to application. The registry consists of approximately 7,500 databases hosted across 86 servers. Information in the VA CDW includes demographic information, medication dispensing from VA pharmacies, laboratory test results, free text from progress notes and radiology reports, as well as billing and claims-related data.

  9. Sample of database transaction.

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Iyad Aqra; Tutut Herawan; Norjihan Abdul Ghani; Adnan Akhunzada; Akhtar Ali; Ramdan Bin Razali; Manzoor Ilahi; Kim-Kwang Raymond Choo (2023). Sample of database transaction. [Dataset]. http://doi.org/10.1371/journal.pone.0179703.t016
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Iyad Aqra; Tutut Herawan; Norjihan Abdul Ghani; Adnan Akhunzada; Akhtar Ali; Ramdan Bin Razali; Manzoor Ilahi; Kim-Kwang Raymond Choo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample of database transaction.

  10. Discovery of Possible Gene Relationships through the Application of...

    • plos.figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Rocio Chavez-Alvarez; Arturo Chavoya; Andres Mendez-Vazquez (2023). Discovery of Possible Gene Relationships through the Application of Self-Organizing Maps to DNA Microarray Databases [Dataset]. http://doi.org/10.1371/journal.pone.0093233
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rocio Chavez-Alvarez; Arturo Chavoya; Andres Mendez-Vazquez
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DNA microarrays and cell cycle synchronization experiments have made possible the study of the mechanisms of cell cycle regulation of Saccharomyces cerevisiae by simultaneously monitoring the expression levels of thousands of genes at specific time points. On the other hand, pattern recognition techniques can contribute to the analysis of such massive measurements, providing a model of gene expression level evolution through the cell cycle process. In this paper, we propose the use of one of such techniques –an unsupervised artificial neural network called a Self-Organizing Map (SOM)–which has been successfully applied to processes involving very noisy signals, classifying and organizing them, and assisting in the discovery of behavior patterns without requiring prior knowledge about the process under analysis. As a test bed for the use of SOMs in finding possible relationships among genes and their possible contribution in some biological processes, we selected 282 S. cerevisiae genes that have been shown through biological experiments to have an activity during the cell cycle. The expression level of these genes was analyzed in five of the most cited time series DNA microarray databases used in the study of the cell cycle of this organism. With the use of SOM, it was possible to find clusters of genes with similar behavior in the five databases along two cell cycles. This result suggested that some of these genes might be biologically related or might have a regulatory relationship, as was corroborated by comparing some of the clusters obtained with SOMs against a previously reported regulatory network that was generated using biological knowledge, such as protein-protein interactions, gene expression levels, metabolism dynamics, promoter binding, and modification, regulation and transport of proteins. The methodology described in this paper could be applied to the study of gene relationships of other biological processes in different organisms.
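    A minimal self-organizing map can be sketched in NumPy to show the clustering idea: vectors with similar time-series profiles end up mapped to nearby map nodes. The synthetic "expression profiles" below are stand-ins, not the S. cerevisiae microarray data:

    ```python
    # Tiny 1-D SOM trained on two groups of synthetic expression profiles.
    import numpy as np

    rng = np.random.default_rng(1)

    # Two synthetic profile groups over 8 time points (sine-like vs cosine-like).
    t = np.linspace(0, 2 * np.pi, 8)
    group_a = np.sin(t) + rng.normal(0, 0.1, (20, 8))
    group_b = np.cos(t) + rng.normal(0, 0.1, (20, 8))
    genes = np.vstack([group_a, group_b])

    n_nodes = 6
    weights = rng.normal(0, 0.5, (n_nodes, 8))  # 1-D map of 6 nodes

    for epoch in range(50):
        lr = 0.5 * (1 - epoch / 50)               # decaying learning rate
        radius = max(1.0, 3 * (1 - epoch / 50))   # decaying neighborhood radius
        for x in genes[rng.permutation(len(genes))]:
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best-matching unit
            dist = np.abs(np.arange(n_nodes) - bmu)               # distance on the map
            influence = np.exp(-(dist ** 2) / (2 * radius ** 2))
            weights += lr * influence[:, None] * (x - weights)

    bmus = [int(np.argmin(np.linalg.norm(weights - x, axis=1))) for x in genes]
    print(set(bmus[:20]), set(bmus[20:]))  # ideally little overlap between groups
    ```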

  11. Author-ity 2009 - PubMed author name disambiguated dataset

    • databank.illinois.edu
    Updated Apr 23, 2018
    Cite
    Vetle I. Torvik; Neil R. Smalheiser (2018). Author-ity 2009 - PubMed author name disambiguated dataset [Dataset]. http://doi.org/10.13012/B2IDB-4222651_V1
    Explore at:
    Dataset updated
    Apr 23, 2018
    Authors
    Vetle I. Torvik; Neil R. Smalheiser
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    Description

    Author-ity 2009 baseline dataset. Prepared by Vetle Torvik, 2009-12-03. The dataset comes in the form of 18 compressed (.gz) Linux text files named authority2009.part00.gz - authority2009.part17.gz. The total size is ~17.4 GB uncompressed.

    • How was the dataset created?
    The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in July 2009: a total of 19,011,985 Article records and 61,658,514 author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. Details are described in:
    Torvik, V., & Smalheiser, N. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3). doi:10.1145/1552303.1552304
    Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation. Journal of the American Society for Information Science & Technology, 56(2), 140-158. doi:10.1002/asi.20105
    Note that for Author-ity 2009, some new predictive features (e.g., grants, citation matches, temporal, affiliation phrases) and a post-processing merging procedure were applied (to capture name variants not captured during blocking, e.g., matches for subsets of compound last names, and nicknames with a different first initial like Bill and William), and a temporal feature was used; this has not yet been written up for publication.

    • How accurate is the 2009 dataset (compared to 2006 and 2008)?
    The recall reported for 2006 of 98.8% has been much improved in 2009 (because common last name variants are now captured). Compared to 2006, both 2008 and 2009 seem to exhibit a higher rate of splitting errors overall but a lower rate of lumping errors. This reflects an overall decrease in prior probabilities, possibly because a) a new prior estimation procedure avoids wild estimates (by dampening the magnitude of iterative changes); b) 2008 and 2009 included items in PubMed-not-Medline (including in-process items); and c) the frequencies of some names increased dramatically (exponentially; J. Lee went from ~16,000 occurrences in 2006 to 26,000 in 2009). Splitting is, however, reduced in 2009 for some special cases, such as NIH-funded investigators who list their grant number on their papers. Compared to 2008, splitting errors were reduced overall in 2009 while maintaining the same level of lumping errors.

    • What is the format of the dataset?
    The cluster summaries for 2009 are much more extensive than in the 2008 dataset. Each line corresponds to a predicted author-individual, represented by a cluster of author name instances and a summary of all the corresponding papers and author name variants (and, if there are > 10 papers in the cluster, an identical summary of the 10 most recent papers). Each cluster has a unique Author ID, which is uniquely identified by the PMID of the earliest paper in the cluster and the author name position. The summary has the following tab-delimited fields:
    1. Blocks separated by '||'; each block may consist of multiple lastname-first initial variants separated by '|'
    2. Prior probabilities of the respective blocks, separated by '|'
    3. Cluster number relative to the block, ordered by cluster size (some are listed as 'CLUSTER X' when they were derived from multiple blocks)
    4. Author ID (or cluster ID); e.g., bass_c_9731334_2 represents a cluster where 9731334_2 is the earliest author name instance. Although not needed for uniqueness, the ID also includes the most frequent lastname_firstinitial (lowercased).
    5. Cluster size (number of author name instances on papers)
    6. Name variants separated by '|' with counts in parentheses; each variant has the format lastname_firstname middleinitial, suffix
    7. Last name variants separated by '|'
    8. First name variants separated by '|'
    9. Middle initial variants separated by '|' ('-' if none)
    10. Suffix variants separated by '|' ('-' if none)
    11. Email addresses separated by '|' ('-' if none)
    12. Range of years (e.g., 1997-2009)
    13. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parentheses, separated by '|' ('-' if none)
    14. Top 20 most frequent MeSH terms (after stoplisting) with counts in parentheses, separated by '|' ('-' if none)
    15. Journals with counts in parentheses, separated by '|'
    16. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parentheses, separated by '|' ('-' if none)
    17. Co-author names (lowercased last name and first/middle initials) with counts in parentheses, separated by '|' ('-' if none)
    18. Co-author IDs with counts in parentheses, separated by '|' ('-' if none)
    19. Author name instances (PMID_auno, separated by '|')
    20. Grant IDs (after normalization; '-' if none), separated by '|'
    21. Total number of times cited (citations are based on references extracted from PMC)
    22. h-index
    23. Citation counts (e.g., for the h-index): PMIDs by the author that have been cited (with total citation counts in parentheses), separated by '|'
    24. Cited: PMIDs that the author cited (with counts in parentheses), separated by '|'
    25. Cited-by: PMIDs that cited the author (with counts in parentheses), separated by '|'
    26-47. The same summary as fields 4-25, except that only the 10 most recent papers were used (based on year; if papers 10, 11, 12, ... have the same year, one is selected arbitrarily)
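    A hedged sketch of reading such a line: split on tabs and name the first 25 fields (the repeated per-10-papers fields are omitted here, and the sample line is fabricated for shape only, not a real record):

    ```python
    # Parse one tab-delimited cluster-summary line into named fields.
    FIELDS = [
        "blocks", "block_priors", "cluster_number", "author_id", "cluster_size",
        "name_variants", "last_names", "first_names", "middle_initials", "suffixes",
        "emails", "year_range", "affiliation_words", "mesh_terms", "journals",
        "title_words", "coauthor_names", "coauthor_ids", "instances", "grant_ids",
        "times_cited", "h_index", "citation_counts", "cited", "cited_by",
    ]

    def parse_summary(line: str) -> dict:
        values = line.rstrip("\n").split("\t")
        record = dict(zip(FIELDS, values))
        # Author name instances are '|'-separated PMID_auno tokens.
        record["instances"] = record["instances"].split("|")
        return record

    sample = "\t".join(["bass_b"] + ["-"] * 17 + ["9731334_2|10786286_3"] + ["-"] * 6)
    rec = parse_summary(sample)
    print(rec["instances"])  # ['9731334_2', '10786286_3']
    ```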

  12. Dataset, transaction database.

    • figshare.com
    xls
    Updated Jun 21, 2023
    Iyad Aqra; Tutut Herawan; Norjihan Abdul Ghani; Adnan Akhunzada; Akhtar Ali; Ramdan Bin Razali; Manzoor Ilahi; Kim-Kwang Raymond Choo (2023). Dataset, transaction database. [Dataset]. http://doi.org/10.1371/journal.pone.0179703.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Iyad Aqra; Tutut Herawan; Norjihan Abdul Ghani; Adnan Akhunzada; Akhtar Ali; Ramdan Bin Razali; Manzoor Ilahi; Kim-Kwang Raymond Choo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset, transaction database.

  13. Match Video Tensors

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Nov 7, 2020
    Luca Pappalardo; Paolo Cintia; Danilo Sorano (2020). Match Video Tensors [Dataset]. http://doi.org/10.6084/m9.figshare.12562382.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 7, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Luca Pappalardo; Paolo Cintia; Danilo Sorano
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    If you use these data, please remember to cite the following paper: Sorano, D., Carrara, F., Cintia, P., Falchi, F., Pappalardo, L. (2020) Automatic Pass Annotation from Soccer Video Streams Based on Object Detection and LSTM. In: Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2020.
    Tensors extracted from soccer video broadcasts. Each file is a zip of a folder and corresponds to a single half of a match. Each file in the folder (in .pickle format) corresponds to a frame of the video. This item contains the following files/matches:
    - roma_juve_1H_tensors.zip: tensors/frames of the first half of match Roma vs Juventus
    - roma_juve_2H_tensors.zip: tensors/frames of the second half of match Roma vs Juventus
    - roma_lazio_1H_tensors.zip: tensors/frames of the first half of match Roma vs Lazio
    - sassuolo_inter_1H_tensors.zip: tensors/frames of the first half of match Sassuolo vs Inter
    - sassuolo_inter_2H_tensors.zip: tensors/frames of the second half of match Sassuolo vs Inter
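A minimal sketch of loading the per-frame pickles once one of the zips above has been extracted to a folder. The function name, the frame-filename ordering, and the usage path are assumptions, not part of the dataset documentation:

```python
import os
import pickle

def load_frame_tensors(folder):
    """Load every per-frame .pickle file in an extracted match-half folder.
    Files are sorted by name on the assumption that filenames encode frame order."""
    frames = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".pickle"):
            with open(os.path.join(folder, name), "rb") as fh:
                frames.append(pickle.load(fh))
    return frames

# Hypothetical usage, assuming roma_juve_1H_tensors.zip was extracted here:
# frames = load_frame_tensors("roma_juve_1H_tensors")
# print(len(frames))
```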

  14. Cognitive Search for Medical Literature Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 4, 2025
    Growth Market Reports (2025). Cognitive Search for Medical Literature Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/cognitive-search-for-medical-literature-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Aug 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Cognitive Search for Medical Literature Market Outlook



    According to our latest research, the global market size for Cognitive Search for Medical Literature reached USD 1.42 billion in 2024, demonstrating robust adoption across healthcare and research institutions. The market is expanding at a CAGR of 17.6% and is forecasted to reach USD 6.24 billion by 2033. This impressive growth rate is primarily attributed to the surging demand for advanced data analytics and AI-powered search solutions that can efficiently navigate the ever-increasing volume of medical literature. As per our latest research, key drivers include the acceleration of digital transformation in healthcare, the imperative for evidence-based medicine, and the rising importance of rapid knowledge discovery in clinical and pharmaceutical settings.




    The exponential growth of the Cognitive Search for Medical Literature Market is underpinned by the relentless expansion of medical literature and scientific publications. With thousands of new studies, clinical trial reports, and peer-reviewed articles published every week, traditional manual search methods are no longer sufficient for healthcare professionals, researchers, and clinicians. Cognitive search solutions, leveraging artificial intelligence, machine learning, and natural language processing, enable users to extract relevant, contextually accurate information from massive datasets in real time. This capability is revolutionizing clinical decision-making, accelerating research timelines, and supporting the development of novel therapeutics. The increasing emphasis on evidence-based practice and the need to stay updated with the latest medical advancements are key factors propelling the adoption of cognitive search platforms.




    Another significant growth factor for the Cognitive Search for Medical Literature Market is the rapid digital transformation of healthcare systems worldwide. Hospitals, research centers, pharmaceutical companies, and academic institutions are increasingly investing in digital infrastructure to streamline operations, enhance patient care, and foster innovation. Cognitive search tools are becoming integral components of this digital ecosystem, enabling seamless integration with electronic health records, research databases, and knowledge management systems. The ability to unify disparate data sources and deliver actionable insights is driving widespread deployment, particularly in developed regions with advanced healthcare IT frameworks. Moreover, the integration of cognitive search with other emerging technologies, such as predictive analytics and personalized medicine, is further amplifying its value proposition.




    The market is also benefiting from the growing focus on drug discovery and clinical research, especially in the wake of global health crises and the increasing prevalence of complex diseases. Pharmaceutical and biotechnology companies are leveraging cognitive search to accelerate literature reviews, identify research gaps, and uncover potential therapeutic targets. The technology’s ability to rapidly analyze and synthesize vast amounts of scientific information is shortening drug development cycles and enhancing the efficiency of research teams. Additionally, academic institutions are incorporating cognitive search tools to support faculty and students in conducting comprehensive literature reviews, fostering innovation, and ensuring the rigor of scientific inquiry. These trends collectively underscore the pivotal role of cognitive search in shaping the future of medical research and healthcare delivery.




    Regionally, North America continues to dominate the Cognitive Search for Medical Literature Market, accounting for the largest share in 2024. This leadership is driven by the presence of leading healthcare institutions, robust investments in health IT, and a favorable regulatory environment supporting innovation. Europe follows closely, with strong adoption in countries such as Germany, the UK, and France, where research excellence and digital health initiatives are well established. The Asia Pacific region is emerging as a high-growth market, fueled by expanding healthcare infrastructure, increasing research activity, and government initiatives to modernize healthcare systems. Latin America and the Middle East & Africa are also witnessing gradual adoption, primarily through collaborations with global technology providers and investments in healthcare modernization.



  15. Data from: Exploring Dance Movement Data Using Sequence Alignment Methods

    • datasetcatalog.nlm.nih.gov
    Updated Jul 16, 2015
    Neutens, Tijs; Van de Weghe, Nico; Chavoshi, Seyed Hossein; De Baets, Bernard; De Tré, Guy (2015). Exploring Dance Movement Data Using Sequence Alignment Methods [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001875348
    Explore at:
    Dataset updated
    Jul 16, 2015
    Authors
    Neutens, Tijs; Van de Weghe, Nico; Chavoshi, Seyed Hossein; De Baets, Bernard; De Tré, Guy
    Description

    Despite the abundance of research on knowledge discovery from moving object databases, only a limited number of studies have examined the interaction between moving point objects in space over time. This paper describes a novel approach for measuring similarity in the interaction between moving objects. The proposed approach consists of three steps. First, we transform movement data into sequences of successive qualitative relations based on the Qualitative Trajectory Calculus (QTC). Second, sequence alignment methods are applied to measure the similarity between movement sequences. Finally, movement sequences are grouped based on similarity by means of an agglomerative hierarchical clustering method. The applicability of this approach is tested using movement data from samba and tango dancers.
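The three-step pipeline described above can be sketched in a simplified form. This is not the authors' implementation: plain symbol strings stand in for QTC relation sequences, `difflib`'s similarity ratio stands in for a full sequence-alignment score, and the grouping is a naive single-linkage merge with an assumed threshold.

```python
from difflib import SequenceMatcher

def similarity(seq_a, seq_b):
    """Alignment-style similarity in [0, 1] between two relation sequences."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()

def agglomerate(sequences, threshold=0.6):
    """Naive single-linkage agglomerative grouping: merge two clusters while
    any cross-cluster pair of sequences is at least `threshold` similar."""
    clusters = [[s] for s in sequences]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(similarity(a, b) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters

# Hypothetical QTC-like symbol sequences for four dancers:
moves = ["++0-", "++00", "--+0", "--++"]
print(agglomerate(moves, threshold=0.6))
```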

  16. Text Mining for Literature Review and Knowledge Discovery in Cancer Risk...

    • plos.figshare.com
    txt
    Updated May 30, 2023
    Anna Korhonen; Diarmuid Ó Séaghdha; Ilona Silins; Lin Sun; Johan Högberg; Ulla Stenius (2023). Text Mining for Literature Review and Knowledge Discovery in Cancer Risk Assessment and Research [Dataset]. http://doi.org/10.1371/journal.pone.0033427
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Anna Korhonen; Diarmuid Ó Séaghdha; Ilona Silins; Lin Sun; Johan Högberg; Ulla Stenius
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Research in biomedical text mining is starting to produce technology which can make information in biomedical literature more accessible for bio-scientists. One of the current challenges is to integrate and refine this technology to support real-life scientific tasks in biomedicine, and to evaluate its usefulness in the context of such tasks. We describe CRAB – a fully integrated text mining tool designed to support chemical health risk assessment. This task is complex and time-consuming, requiring a thorough review of existing scientific data on a particular chemical. Covering human, animal, cellular and other mechanistic data from various fields of biomedicine, this is highly varied and therefore difficult to harvest from literature databases via manual means. Our tool automates the process by extracting relevant scientific data in published literature and classifying it according to multiple qualitative dimensions. Developed in close collaboration with risk assessors, the tool allows navigating the classified dataset in various ways and sharing the data with other users. We present a direct and user-based evaluation which shows that the technology integrated in the tool is highly accurate, and report a number of case studies which demonstrate how the tool can be used to support scientific discovery in cancer risk assessment and research. Our work demonstrates the usefulness of a text mining pipeline in facilitating complex research tasks in biomedicine. We discuss further development and application of our technology to other types of chemical risk assessment in the future.

  17. Second intermediate itemset.

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Iyad Aqra; Tutut Herawan; Norjihan Abdul Ghani; Adnan Akhunzada; Akhtar Ali; Ramdan Bin Razali; Manzoor Ilahi; Kim-Kwang Raymond Choo (2023). Second intermediate itemset. [Dataset]. http://doi.org/10.1371/journal.pone.0179703.t018
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Iyad Aqra; Tutut Herawan; Norjihan Abdul Ghani; Adnan Akhunzada; Akhtar Ali; Ramdan Bin Razali; Manzoor Ilahi; Kim-Kwang Raymond Choo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Second intermediate itemset.

  18. Data from: Evaluation of classification techniques for identifying fake...

    • scielo.figshare.com
    jpeg
    Updated May 30, 2023
    Andrey Schmidt dos Santos; Luis Felipe Riehs Camargo; Daniel Pacheco Lacerda (2023). Evaluation of classification techniques for identifying fake reviews about products and services on the internet [Dataset]. http://doi.org/10.6084/m9.figshare.14283143.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    May 30, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Andrey Schmidt dos Santos; Luis Felipe Riehs Camargo; Daniel Pacheco Lacerda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: With the growth of e-commerce, more people are buying products over the internet. To increase customer satisfaction, merchants provide spaces for product and service reviews. Products with positive reviews attract customers, while products with negative reviews lose them. Exploiting this, some individuals and corporations write fake reviews to promote their own products and services or to defame their competitors. The difficulty in finding these reviews lies in the large amount of information available. One solution is to use data mining techniques and tools, such as the classification function. Addressing this problem, the present work evaluates classification techniques for identifying fake reviews about products and services on the internet. The research also presents a systematic literature review on fake reviews. The research used 8 classification algorithms, which were trained and tested on a hotel-reviews database. The CONCENSO algorithm presented the best result, with 88% on the precision indicator. After the first test, the algorithms classified reviews from another hotel database. To compare the results of this new classification, the Review Skeptic algorithm was used. The SVM and GLMNET algorithms presented the highest convergence with Review Skeptic, classifying 83% of reviews with the same result. The research contributes by demonstrating the algorithms' ability to distinguish real consumer reviews of products and services on the internet. A further contribution is that it pioneers the investigation of fake reviews in Brazil and in production engineering.
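The precision indicator used above to compare classifiers can be computed directly from predicted and true labels. This is a generic sketch with hypothetical labels (1 = fake review), not data from the hotel database:

```python
def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are truly positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0
    true_pos = sum(1 for t in predicted_pos if t == positive)
    return true_pos / len(predicted_pos)

# Hypothetical labels: 4 reviews predicted fake, 3 of them actually fake.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
print(precision(y_true, y_pred))  # 3 of 4 predicted positives are correct
```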

  19. First intermediate itemset.

    • plos.figshare.com
    xls
    Updated Jun 2, 2023
    Iyad Aqra; Tutut Herawan; Norjihan Abdul Ghani; Adnan Akhunzada; Akhtar Ali; Ramdan Bin Razali; Manzoor Ilahi; Kim-Kwang Raymond Choo (2023). First intermediate itemset. [Dataset]. http://doi.org/10.1371/journal.pone.0179703.t017
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Iyad Aqra; Tutut Herawan; Norjihan Abdul Ghani; Adnan Akhunzada; Akhtar Ali; Ramdan Bin Razali; Manzoor Ilahi; Kim-Kwang Raymond Choo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    First intermediate itemset.

  20. Sample transaction.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Iyad Aqra; Tutut Herawan; Norjihan Abdul Ghani; Adnan Akhunzada; Akhtar Ali; Ramdan Bin Razali; Manzoor Ilahi; Kim-Kwang Raymond Choo (2023). Sample transaction. [Dataset]. http://doi.org/10.1371/journal.pone.0179703.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Iyad Aqra; Tutut Herawan; Norjihan Abdul Ghani; Adnan Akhunzada; Akhtar Ali; Ramdan Bin Razali; Manzoor Ilahi; Kim-Kwang Raymond Choo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample transaction.
