100+ datasets found
  1. Data from: Peer-to-Peer Data Mining, Privacy Issues, and Games

    • data.nasa.gov
    • s.cnmilf.com
    • +2 more
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Peer-to-Peer Data Mining, Privacy Issues, and Games [Dataset]. https://data.nasa.gov/dataset/peer-to-peer-data-mining-privacy-issues-and-games
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Peer-to-Peer (P2P) networks are gaining increasing popularity in many distributed applications such as file-sharing, network storage, web caching, searching and indexing of relevant documents and P2P network-threat analysis. Many of these applications require scalable analysis of data over a P2P network. This paper starts by offering a brief overview of distributed data mining applications and algorithms for P2P environments. Next it discusses some of the privacy concerns with P2P data mining and points out the problems of existing privacy-preserving multi-party data mining techniques. It further points out that most of the nice assumptions of these existing privacy preserving techniques fall apart in real-life applications of privacy-preserving distributed data mining (PPDM). The paper offers a more realistic formulation of the PPDM problem as a multi-party game and points out some recent results.

  2. IGDB Dataset for Data Mining Projects

    • kaggle.com
    zip
    Updated Jul 26, 2025
    Cite
    Emir Şahin (2025). IGDB Dataset for Data Mining Projects [Dataset]. https://www.kaggle.com/datasets/emirshn/igdb-dataset-for-data-mining-projects
    Available download formats: zip (56776900 bytes)
    Dataset updated
    Jul 26, 2025
    Authors
    Emir Şahin
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset contains detailed metadata for over 240,000 video games sourced from the IGDB API. It includes information about each game's release, genres, themes, platforms, developers, publishers, player perspectives, game modes, ratings, summaries, media assets (screenshots, artworks, covers), and more. This dataset is ideal for projects in game recommendation, clustering, tagging, genre analysis, and player preference modeling.

  3. Data from: Data Mining at NASA: From Theory to Applications

    • catalog.data.gov
    • s.cnmilf.com
    • +1 more
    Updated Aug 23, 2025
    Cite
    Dashlink (2025). Data Mining at NASA: From Theory to Applications [Dataset]. https://catalog.data.gov/dataset/data-mining-at-nasa-from-theory-to-applications
    Dataset updated
    Aug 23, 2025
    Dataset provided by
    Dashlink
    Description

    NASA has some of the largest and most complex data sources in the world, with data sources ranging from the earth sciences, space sciences, and massive distributed engineering data sets from commercial aircraft and spacecraft. This talk will discuss some of the issues and algorithms developed to analyze and discover patterns in these data sets. We will also provide an overview of a large research program in Integrated Vehicle Health Management. The goal of this program is to develop advanced technologies to automatically detect, diagnose, predict, and mitigate adverse events during the flight of an aircraft. A case study will be presented on a recent data mining analysis performed to support the Flight Readiness Review of the Space Shuttle Mission STS-119.

  4. Airbnb Berlin 2020

    • kaggle.com
    zip
    Updated Sep 22, 2020
    Cite
    MrRaghav (2020). Airbnb Berlin 2020 [Dataset]. https://www.kaggle.com/raghavs1003/airbnb-berlin-2020
    Available download formats: zip (112994067 bytes)
    Dataset updated
    Sep 22, 2020
    Authors
    MrRaghav
    Area covered
    Berlin
    Description

    Acknowledgements

    http://insideairbnb.com/get-the-data.html

    Inspiration

    A. Is there seasonality in the prices of properties listed in Airbnb-Berlin?
    B. Which are the popular areas of Berlin among the tourists?
    C. An analysis of reviews – using text mining
    D. Which are the most commonly available amenities in the properties of Berlin?
    E. Can we predict the price of properties in Berlin by analyzing other column values?

  5. Text and Data Mining (TDM) Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jul 27, 2025
    Cite
    Data Insights Market (2025). Text and Data Mining (TDM) Report [Dataset]. https://www.datainsightsmarket.com/reports/text-and-data-mining-tdm-1403420
    Available download formats: ppt, pdf, doc
    Dataset updated
    Jul 27, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Unlock the power of your unstructured data! Explore the booming Text & Data Mining market, projected to reach significant growth by 2033. Discover key trends, leading companies like IBM & SAS, and regional market insights in this comprehensive analysis.

  6. Journal of Computational Design and Engineering Impact Factor 2024-2025 -...

    • researchhelpdesk.org
    Updated Feb 23, 2022
    Cite
    Research Help Desk (2022). Journal of Computational Design and Engineering Impact Factor 2024-2025 - ResearchHelpDesk [Dataset]. https://www.researchhelpdesk.org/journal/impact-factor-if/293/journal-of-computational-design-and-engineering
    Dataset updated
    Feb 23, 2022
    Dataset authored and provided by
    Research Help Desk
    Description

    Journal of Computational Design and Engineering Impact Factor 2024-2025 - ResearchHelpDesk - Journal of Computational Design and Engineering is an international journal that aims to provide academia and industry with a venue for rapid publication of research papers reporting innovative computational methods and applications to achieve a major breakthrough, practical improvements, and bold new research directions within a wide range of design and engineering: theory and its progress in computational advancement for design and engineering; development of computational framework to support large scale design and engineering; interaction issues among human, designed artifacts, and systems; knowledge-intensive technologies for intelligent and sustainable systems; emerging technology and convergence of technology fields presented with convincing design examples; educational issues for academia, practitioners, and future generation; proposal on new research directions as well as survey and retrospectives on mature field.

    Relevant topics include, but are not limited to, traditional and emerging issues in design and engineering: field specific issues in mechanical, aerospace, shipbuilding, industrial, architectural, plant, and civil engineering as well as industrial design; geometric modeling and processing, solid and heterogeneous modeling, computational geometry, features, and virtual prototyping; computer graphics, virtual and augmented reality, and scientific visualization; human modeling and engineering, user interaction and experience, HCI, HMI, human-vehicle interaction (HVI), cognitive engineering, and human factors and ergonomics with computers; knowledge-based engineering, intelligent CAD, AI and machine learning in design, and ontology; product data exchange and management, PDM/PLM/CPC, PDX/PDQ, interoperability, data mining, and database issues; design theory and methodology, sustainable design and engineering, concurrent engineering, and collaborative engineering; digital/virtual manufacturing, rapid prototyping and tooling, and CNC machining; computer aided inspection, geometric and engineering tolerancing, and reverse engineering; finite element analysis, optimization, meshes and discretization, and virtual engineering; Bio-CAD, Nano-CAD, and medical applications; industrial design, aesthetic design, new media, and design education; survey and benchmark reports.

  7. Data Mining Kel 11 Dataset

    • universe.roboflow.com
    zip
    Updated Oct 29, 2025
    Cite
    Data Mining (2025). Data Mining Kel 11 Dataset [Dataset]. https://universe.roboflow.com/data-mining-mtwls/data-mining-kel-11-zp4xe
    Available download formats: zip
    Dataset updated
    Oct 29, 2025
    Dataset authored and provided by
    Data Mining
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Beras
    Description

    Data Mining Kel 11

    ## Overview
    
    Data Mining Kel 11 is a dataset for classification tasks - it contains Beras annotations for 59,785 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  8. Data from: The use of project portfolios in effective strategy execution to...

    • researchdata.up.ac.za
    zip
    Updated May 31, 2023
    Cite
    Palesa Agnes Ramashala (2023). The use of project portfolios in effective strategy execution to improve business value [Dataset]. http://doi.org/10.25403/UPresearchdata.13280141.v3
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    University of Pretoria
    Authors
    Palesa Agnes Ramashala
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Qualitative data gathered from interviews conducted with four case organisations. The data is analysed using a qualitative data analysis tool (Atlas.ti) to code responses and generate network diagrams; software such as Atlas.ti 8 for Windows is useful for viewing these results. The details of the respondents' answers are captured in tabular form, and graphs were created to identify trends. The study also includes a desktop review of the case organisations, based on published annual reports covering a period of more than seven years. The analysis was done given the scope of the project and its constructs.

  9. GitHub training and test data-sets

    • data.mendeley.com
    Updated Jul 31, 2019
    Cite
    Youcef Bouziane (2019). GitHub training and test data-sets [Dataset]. http://doi.org/10.17632/gt3f4jnbvn.1
    Dataset updated
    Jul 31, 2019
    Authors
    Youcef Bouziane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the SQL tables of the training and test datasets used in our experimentation. These tables contain the preprocessed textual data (in a form of tokens) extracted from each training and test project. Besides the preprocessed textual data, this dataset also contains meta-data about the projects, GitHub topics, and GitHub collections. The GitHub projects are identified by the tuple “Owner” and “Name”. The descriptions of the table fields are attached to their respective data descriptions.

  10. Text Mining Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Nov 9, 2025
    Cite
    Data Insights Market (2025). Text Mining Report [Dataset]. https://www.datainsightsmarket.com/reports/text-mining-1436427
    Available download formats: pdf, ppt, doc
    Dataset updated
    Nov 9, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Explore the dynamic Text Mining market forecast, key drivers, and trends driving its expansion to USD 56 billion by 2033. Discover insights into data analysis, fraud detection, and CRM applications.

  11. Data from: Mining significant crisp-fuzzy spatial association rules

    • tandf.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Wenzhong Shi; Anshu Zhang; Geoffrey I. Webb (2023). Mining significant crisp-fuzzy spatial association rules [Dataset]. http://doi.org/10.6084/m9.figshare.5873139.v1
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Wenzhong Shi; Anshu Zhang; Geoffrey I. Webb
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Spatial association rule mining (SARM) is an important data mining task for understanding implicit and sophisticated interactions in spatial data. The usefulness of SARM results, represented as sets of rules, depends on their reliability: the abundance of rules, control over the risk of spurious rules, and accuracy of rule interestingness measure (RIM) values. This study presents crisp-fuzzy SARM, a novel SARM method that can enhance the reliability of resultant rules. The method firstly prunes dubious rules using statistically sound tests and crisp supports for the patterns involved, and then evaluates RIMs of accepted rules using fuzzy supports. For the RIM evaluation stage, the study also proposes a Gaussian-curve-based fuzzy data discretization model for SARM with improved design for spatial semantics. The proposed techniques were evaluated by both synthetic and real-world data. The synthetic data was generated with predesigned rules and RIM values, thus the reliability of SARM results could be confidently and quantitatively evaluated. The proposed techniques showed high efficacy in enhancing the reliability of SARM results in all three aspects. The abundance of resultant rules was improved by 50% or more compared with using conventional fuzzy SARM. Minimal risk of spurious rules was guaranteed by statistically sound tests. The probability that the entire result contained any spurious rules was below 1%. The RIM values also avoided large positive errors committed by crisp SARM, which typically exceeded 50% for representative RIMs. The real-world case study on New York City points of interest reconfirms the improved reliability of crisp-fuzzy SARM results, and demonstrates that such improvement is critical for practical spatial data analytics and decision support.
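
The difference between the crisp and fuzzy supports mentioned in the description above can be illustrated with a small, generic sketch. It is not the authors' model: the "near" predicate, the 500 m threshold, the Gaussian width `sigma_m`, and the sample distances are all assumptions made up for this example. Support is taken here as the mean membership degree over all records, which is a common textbook definition.

```python
import math

def crisp_near(distance_m: float, threshold_m: float = 500.0) -> float:
    """Crisp membership: a record is either 'near' (1.0) or not (0.0)."""
    return 1.0 if distance_m <= threshold_m else 0.0

def fuzzy_near(distance_m: float, sigma_m: float = 500.0) -> float:
    """Gaussian membership: 1.0 at distance 0, decaying smoothly with distance."""
    return math.exp(-(distance_m ** 2) / (2.0 * sigma_m ** 2))

def support(memberships) -> float:
    """Support of a pattern: mean membership degree over all records."""
    values = list(memberships)
    return sum(values) / len(values)

# Hypothetical distances (in metres) between paired spatial objects.
distances = [100.0, 450.0, 550.0, 2000.0]

crisp_support = support(crisp_near(d) for d in distances)
fuzzy_support = support(fuzzy_near(d) for d in distances)
```

With a hard threshold, the records at 450 m and 550 m fall on opposite sides of the cut even though they are almost equally close; the Gaussian membership gives them similar, graded contributions instead.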

  12. Data from: A large-scale comparative analysis of Coding Standard conformance...

    • figshare.com
    application/x-gzip
    Updated Oct 4, 2021
    Cite
    Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa (2021). A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects [Dataset]. http://doi.org/10.6084/m9.figshare.12377237.v3
    Available download formats: application/x-gzip
    Dataset updated
    Oct 4, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study investigates the extent to which data science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ to traditional software projects? We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity.

    results.tar.gz: Extracted data for each project, including raw logs of all detected code violations.
    notebooks_out.tar.gz: Tables and figures generated by notebooks.
    source_code_anonymized.tar.gz: Anonymized source code (at time of publication) to identify, clone, and analyse the projects. Also includes Jupyter notebooks used to produce figures in the paper.

    The latest source code can be found at: https://github.com/a2i2/mining-data-science-repositories
    Published in ESEM 2020: https://doi.org/10.1145/3382494.3410680
    Preprint: https://arxiv.org/abs/2007.08978

  13. Locations and numbers of past producing metal and coal mining projects

    • catalog.data.gov
    • s.cnmilf.com
    Updated Aug 14, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Locations and numbers of past producing metal and coal mining projects [Dataset]. https://catalog.data.gov/dataset/locations-and-numbers-of-past-producing-metal-and-coal-mining-projects
    Dataset updated
    Aug 14, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Locations and numbers of past producing metal and coal mining projects in NW US and Canada. This dataset is associated with the following publication: Sergeant, C., E. Sexton, J. Moore, A. Westwood, S. Nagorski, J. Ebersole, D.M. Chambers, S.L. O'Neal, R.L. Malison, R. Hauer, D.C. Whited, J. Weitz, J. Caldwell, M. Capito, M. Connor, C.A. Frissell, G. Knox, E.D. Lowery, R. Macnair, V. Marlatt, J. McIntyre, M.V. McPhee, and N. Skuce. Risks of mining to salmonid-bearing watersheds. Science Advances. American Association for the Advancement of Science (AAAS), Washington, DC, USA, 8(26): eabn0929, (2022).

  14. Data from: Inference of topics with Latent Dirichlet Allocation for Open...

    • figshare.com
    tiff
    Updated Jun 5, 2023
    Cite
    Nádia Felix Felipe da Silva; Núbia Rosa da Silva; Kátia Kelvis Cassiano; Douglas Farias Cordeiro (2023). Inference of topics with Latent Dirichlet Allocation for Open Government Data [Dataset]. http://doi.org/10.6084/m9.figshare.20006430.v1
    Available download formats: tiff
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    SciELO journals
    Authors
    Nádia Felix Felipe da Silva; Núbia Rosa da Silva; Kátia Kelvis Cassiano; Douglas Farias Cordeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT Open government data can be considered as an important initiative of institutions of civil society, promoting transparency and allowing its reuse as an input in the development of innovation projects. However, it is common for certain databases to require the application of specific treatments, so that the data can be used more efficiently, such as the case of classification using Data Mining. In this scenario, this paper presents an automatic topic inference proposal using the Latent Dirichlet Allocation method to classify cultural projects in their thematic areas, by identifying the similarity in their data. The results demonstrate the feasibility of the approach in the context of open government data.

  15. The top 10 clusters of innovativeness research named using the dominant...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Yousif Elsamani; Cristian Mejia; Yuya Kajikawa (2023). The top 10 clusters of innovativeness research named using the dominant theme with the most important quantitative data (number of articles, average publication year, top three journals, and number of articles in each journal) until 2021. [Dataset]. http://doi.org/10.1371/journal.pone.0280005.t003
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yousif Elsamani; Cristian Mejia; Yuya Kajikawa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The top 10 clusters of innovativeness research named using the dominant theme with the most important quantitative data (number of articles, average publication year, top three journals, and number of articles in each journal) until 2021.

  16. Data Mining Test Dataset

    • universe.roboflow.com
    zip
    Updated Oct 20, 2025
    Cite
    ons (2025). Data Mining Test Dataset [Dataset]. https://universe.roboflow.com/ons-eykpy/data-mining-test-fjlw4/dataset/1
    Available download formats: zip
    Dataset updated
    Oct 20, 2025
    Dataset authored and provided by
    ons
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cars Damage Cars Bounding Boxes
    Description

    Data Mining Test

    ## Overview
    
    Data Mining Test is a dataset for object detection tasks - it contains Cars Damage Cars annotations for 382 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  17. International Journal of Computational Intelligence Systems Impact Factor...

    • researchhelpdesk.org
    Updated Feb 23, 2022
    Cite
    Research Help Desk (2022). International Journal of Computational Intelligence Systems Impact Factor 2024-2025 - ResearchHelpDesk [Dataset]. https://www.researchhelpdesk.org/journal/impact-factor-if/359/international-journal-of-computational-intelligence-systems
    Dataset updated
    Feb 23, 2022
    Dataset authored and provided by
    Research Help Desk
    Description

    International Journal of Computational Intelligence Systems Impact Factor 2024-2025 - ResearchHelpDesk - The International Journal of Computational Intelligence Systems is an international peer reviewed journal and the official publication of the European Society for Fuzzy Logic and Technologies (EUSFLAT). The journal publishes original research on all aspects of applied computational intelligence, especially targeting papers demonstrating the use of techniques and methods originating from computational intelligence theory. This is an open access journal, i.e. all articles are immediately and permanently free to read, download, copy & distribute. The journal is published under the CC BY-NC 4.0 user license which defines the permitted 3rd-party reuse of its articles.

    Aims & Scope

    The International Journal of Computational Intelligence Systems publishes original research on all aspects of applied computational intelligence, especially targeting papers demonstrating the use of techniques and methods originating from computational intelligence theory. The core theories of computational intelligence are fuzzy logic, neural networks, evolutionary computation and probabilistic reasoning.

    The journal publishes only articles related to the use of computational intelligence and broadly covers the following topics: Autonomous reasoning; Bio-informatics; Cloud computing; Condition monitoring; Data science; Data mining; Data visualization; Decision support systems; Fault diagnosis; Intelligent information retrieval; Human-machine interaction and interfaces; Image processing; Internet and networks; Noise analysis; Pattern recognition; Prediction systems; Power (nuclear) safety systems; Process and system control; Real-time systems; Risk analysis and safety-related issues; Robotics; Signal and image processing; IoT and smart environments; Systems integration; System control; System modelling and optimization; Telecommunications; Time series prediction; Warning systems; Virtual reality; Web intelligence; Deep learning.

  18. Tourism research from its inception to present day: Subject area, geography,...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated May 30, 2023
    Cite
    Andrei P. Kirilenko; Svetlana Stepchenkova (2023). Tourism research from its inception to present day: Subject area, geography, and gender distributions [Dataset]. http://doi.org/10.1371/journal.pone.0206820
    Available download formats: xlsx
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Andrei P. Kirilenko; Svetlana Stepchenkova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper uses text data mining to identify long-term developments in tourism academic research from the perspectives of thematic focus, geography, and gender of tourism authorship. Abstracts of papers published in the period of 1970–2017 in high-ranking tourist journals were extracted from the Scopus database and served as data source for the analysis. Fourteen subject areas were identified using the Latent Dirichlet Allocation (LDA) text mining approach. LDA integrated with GIS information allowed to obtain geography distribution and trends of scholarly output, while probabilistic methods of gender identification based on social network data mining were used to track gender dynamics with sufficient confidence. The findings indicate that, while all 14 topics have been prominent from the inception of tourism studies to the present day, the geography of scholarship has notably expanded and the share of female authorship has increased through time and currently almost equals that of male authorship.

  19. LSC (Leicester Scientific Corpus)

    • figshare.le.ac.uk
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LSC (Leicester Scientific Corpus) [Dataset]. http://doi.org/10.25392/leicester.data.9449639.v2
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LSC (Leicester Scientific Corpus)

    April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    [Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of cleaning procedure are explained in Step 6.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1.

    Getting Started

    This text provides the information on the LSC (Leicester Scientific Corpus) and pre-processing steps on abstracts, and describes the structure of files to organise the corpus. This corpus is created to be used in future work on the quantification of the meaning of research texts and make it available for use in Natural Language Processing projects.

    LSC is a collection of abstracts of articles and proceeding papers published in 2014, and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:

    1. Authors: The list of authors of the paper
    2. Title: The title of the paper
    3. Abstract: The abstract of the paper
    4. Categories: One or more category from the list of categories [2]. Full list of categories is presented in file ‘List_of _Categories.txt’.
    5. Research Areas: One or more research area from the list of research areas [3]. Full list of research areas is presented in file ‘List_of_Research_Areas.txt’.
    6. Total Times cited: The number of times the paper was cited by other items from all databases within Web of Science platform [4]
    7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

    The corpus was collected in July 2018 online and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

    Data Processing

    Step 1: Downloading of the Data Online

    The dataset is collected manually by exporting documents as tab-delimited files online. All documents are available online.

    Step 2: Importing the Dataset to R

    The LSC was collected as TXT files. All documents are extracted to R.

    Step 3: Cleaning the Data from Documents with Empty Abstract or without Category

    As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories are removed.

    Step 4: Identification and Correction of Concatenate Words in Abstracts

    Medicine-related publications in particular use ‘structured abstracts’. Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion etc. The tool used for extracting abstracts concatenates section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT etc. The detection and identification of such words is done by sampling of medicine-related publications with human intervention. Detected concatenate words are split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.

    The section headings in such abstracts are listed below:

    Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

    Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts

    After correction, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as for Microsoft Word ‘word count’ [5].

    According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis.
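    The heading split in Step 4 and the length filter in Step 5 can be sketched in Python as follows. This is not the authors' code: the corpus was processed in R, and the fused words were found by sampling with human intervention. The regular expression, the short heading subset, the whitespace-based word count (only an approximation of the Microsoft Word rule) and the inclusive bounds are all assumptions made for illustration.

```python
import re

# Illustrative subset of the section headings listed above. Plural forms come
# first so that "Conclusions" is tried before "Conclusion".
HEADINGS = [
    "Conclusions", "Conclusion", "Results", "Result", "Methods", "Method",
    "Objectives", "Objective", "Background", "Introduction",
]

# A heading at a word boundary that is immediately followed by an uppercase
# letter, i.e. fused with the first word of the section (assumed heuristic).
_FUSED = re.compile(r"\b(" + "|".join(HEADINGS) + r")(?=[A-Z])")


def split_fused_headings(text: str) -> str:
    """Insert a space between a section heading and the word fused to it."""
    return _FUSED.sub(r"\1 ", text)


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (approximation of a word count)."""
    return len(text.split())


def within_length(text: str, lo: int = 30, hi: int = 500) -> bool:
    """Keep abstracts of lo..hi words; inclusive bounds are an assumption."""
    return lo <= word_count(text) <= hi


if __name__ == "__main__":
    print(split_fused_headings("ConclusionHigher doses were tolerated."))
    print(within_length("too short"))
```

    A real pipeline would need the full heading list and manual review, since a purely pattern-based split can also break legitimate words that happen to start with a heading.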

    Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1

    Conferences and journals can add a footer below the text of the abstract containing a copyright notice, permission policy, journal name, licence, authors’ rights or conference name. The tool used for extracting and processing abstracts in the WoS database attaches such footers to the text. For example, our casual observation is that copyright notices such as ‘Published by Elsevier ltd.’ appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculation, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC Version 1. We removed copyright notices, names of conferences, names of journals, authors’ rights, licences and permission policies identified by sampling of abstracts.

    Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts

    The cleaning procedure described in the previous step left some abstracts shorter than our minimum length criterion (30 words). 474 texts were removed.

    Step 8: Saving the Dataset into CSV Format

    Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record on each line, and the parts (abstract, title, list of authors, list of categories, list of research areas, and times cited) are recorded in fields.

    To access the LSC for research purposes, please email ns433@le.ac.uk.

    References

    [1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
    [3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
    [4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
    [5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
    [6] A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.
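    Steps 6 and 7 above (removing footers, then re-applying the 30-word minimum) can be sketched in Python as follows. This is only an illustration: the authors worked in R and identified footers by sampling abstracts, so the two patterns below are hypothetical examples (the source names only ‘Published by Elsevier ltd.’), and the whitespace word count is an approximation.

```python
import re

# Hypothetical footer patterns, anchored at the end of the abstract. The real
# cleaning list was built by sampling abstracts and is not given in the source.
FOOTER_PATTERNS = [
    re.compile(r"\s*Published by Elsevier Ltd\.?\s*$", re.IGNORECASE),
    re.compile(r"\s*(?:\(c\)|©)\s*\d{4}\b.*$", re.IGNORECASE),
]


def strip_footers(abstract: str) -> str:
    """Remove trailing footers repeatedly until no pattern matches."""
    cleaned = abstract
    changed = True
    while changed:
        changed = False
        for pattern in FOOTER_PATTERNS:
            shorter = pattern.sub("", cleaned)
            if shorter != cleaned:
                cleaned, changed = shorter, True
    return cleaned.strip()


def refilter(abstracts, min_words: int = 30):
    """Clean each abstract and keep those that still have >= min_words words."""
    kept = []
    for abstract in abstracts:
        cleaned = strip_footers(abstract)
        if len(cleaned.split()) >= min_words:
            kept.append(cleaned)
    return kept


if __name__ == "__main__":
    print(strip_footers("We study X. (C) 2014 Published by Elsevier Ltd."))
```

    Anchoring the patterns at the end of the text is a deliberate choice in this sketch: it lowers the risk of deleting a matching phrase that appears in the body of an abstract.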

  20. Discovering Anomalous Aviation Safety Events Using Scalable Data Mining...

    • data.nasa.gov
    • s.cnmilf.com
    • +3more
    Updated Mar 31, 2025
    nasa.gov (2025). Discovering Anomalous Aviation Safety Events Using Scalable Data Mining Algorithms [Dataset]. https://data.nasa.gov/dataset/discovering-anomalous-aviation-safety-events-using-scalable-data-mining-algorithms
    Dataset updated: Mar 31, 2025
    Dataset provided by: NASA (http://nasa.gov/)
    Description

    The worldwide civilian aviation system is one of the most complex dynamical systems created. Most modern commercial aircraft have onboard flight data recorders that record several hundred discrete and continuous parameters at approximately 1 Hz for the entire duration of the flight. These data contain information about the flight control systems, actuators, engines, landing gear, avionics, and pilot commands. In this paper, recent advances in the development of a novel knowledge discovery process consisting of a suite of data mining techniques for identifying precursors to aviation safety incidents are discussed. The data mining techniques include scalable multiple-kernel learning for large-scale distributed anomaly detection. A novel multivariate time-series search algorithm is used to search for signatures of discovered anomalies on massive datasets. The process can identify operationally significant events due to environmental, mechanical, and human-factors issues in the high-dimensional flight operations quality assurance data. All discovered anomalies are validated by a team of independent domain experts. This novel automated knowledge discovery process is aimed at complementing the state-of-the-art human-generated exceedance-based analysis, which fails to discover previously unknown aviation safety incidents. In this paper, the discovery pipeline, the methods used, and some of the significant anomalies detected on real-world commercial aviation data are discussed.

