PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM AND A CASE STUDY TUSHAR MAHULE, KIRK BORNE, SANDIPAN DEY, SUGANDHA ARORA, AND HILLOL KARGUPTA** Abstract. Peer-to-Peer (P2P) networks are appealing for astronomy data mining from virtual observatories because of the large volume of the data, compute-intensive tasks, potentially large number of users, and distributed nature of the data analysis process. This paper offers a brief overview of PADMINI—a Peer-to-Peer Astronomy Data MINIng system. It also presents a case study on PADMINI for distributed outlier detection using astronomy data. PADMINI is a webbased system powered by Google Sky and distributed data mining algorithms that run on a collection of computing nodes. This paper offers a case study of the PADMINI evaluating the architecture and the performance of the overall system. Detailed experimental results are presented in order to document the utility and scalability of the system.
NASA has some of the largest and most complex data sources in the world, with data sources ranging from the earth sciences, space sciences, and massive distributed engineering data sets from commercial aircraft and spacecraft. This talk will discuss some of the issues and algorithms developed to analyze and discover patterns in these data sets. We will also provide an overview of a large research program in Integrated Vehicle Health Management. The goal of this program is to develop advanced technologies to automatically detect, diagnose, predict, and mitigate adverse events during the flight of an aircraft. A case study will be presented on a recent data mining analysis performed to support the Flight Readiness Review of the Space Shuttle Mission STS-119.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Results of running KHC on our case study.
This chapter presents theoretical and practical aspects associated to the implementation of a combined model-based/data-driven approach for failure prognostics based on particle filtering algorithms, in which the current esti- mate of the state PDF is used to determine the operating condition of the system and predict the progression of a fault indicator, given a dynamic state model and a set of process measurements. In this approach, the task of es- timating the current value of the fault indicator, as well as other important changing parameters in the environment, involves two basic steps: the predic- tion step, based on the process model, and an update step, which incorporates the new measurement into the a priori state estimate. This framework allows to estimate of the probability of failure at future time instants (RUL PDF) in real-time, providing information about time-to- failure (TTF) expectations, statistical confidence intervals, long-term predic- tions; using for this purpose empirical knowledge about critical conditions for the system (also referred to as the hazard zones). This information is of paramount significance for the improvement of the system reliability and cost-effective operation of critical assets, as it has been shown in a case study where feedback correction strategies (based on uncertainty measures) have been implemented to lengthen the RUL of a rotorcraft transmission system with propagating fatigue cracks on a critical component. Although the feed- back loop is implemented using simple linear relationships, it is helpful to provide a quick insight into the manner that the system reacts to changes on its input signals, in terms of its predicted RUL. The method is able to manage non-Gaussian pdf’s since it includes concepts such as nonlinear state estimation and confidence intervals in its formulation. Real data from a fault seeded test showed that the proposed framework was able to anticipate modifications on the system input to lengthen its RUL. Results of this test indicate that the method was able to successfully suggest the correction that the system required. In this sense, future work will be focused on the development and testing of similar strategies using different input-output uncertainty metrics.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The coupling values of the classes in our case study.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Results of running KSA for Extract Message Refactoring on our case study.
Data from a study to critically examine some of the issues of using data from ToxRefDB, a database largely composed of guideline studies for pesticidal active ingredients, using a case study focusing on chemically-induced anemia.
This dataset is associated with the following publication: Judson, R.S., M. Martin, G. Patlewicz, and C.E. Wood. (Reg. Tox. Pharm.) Retrospective Mining of Toxicology Data to Discover Multispecies and Chemical Class Effects: Anemia as a Case Study. REGULATORY TOXICOLOGY AND PHARMACOLOGY. Elsevier Science Ltd, New York, NY, USA, 86: 74-92, (2017).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Qualitative data gathered from interviews that were conducted with case organisations. The data is analysed using a qualitative data analysis tool (AtlasTi) to code and generate network diagrams. Software such as Atlas.ti 8 Windows will be a great advantage to use in order to view these results. Interviews were conducted with four case organisations. The details of the responses from the respondents from case organisations are captured. The data gathered during the interview sessions is captured in a tabular form and graphs were also created to identify trends. Also in this study is desktop review of the case organisations that formed part of the study. The desktop study was done using published annual reports over a period of more than seven years. The analysis was done given the scope of the project and its constructs.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A summary of related hashtags (top group) and related place mentions (bottom group) identified with each particular meta-path.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Ethical Clearance reference number: 2020FOBREC758
The purpose of this study was to investigate the awareness and adoption levels of digital project management solutions by project businesses at a selected platinum mine in the Limpopo Province. The study further investigated their project management challenges. It examined the impediments of adoption and the readiness of project businesses that had not yet adopted digital project management solutions, to adopt these solutions. The quantitative research survey was distributed and emailed to project businesses at a selected mine in the Limpopo Province, where 110 project businesses participated in this study. SPSS v26.0 software was utilised to process data.
The study revealed that the level of awareness of digital project management tools is higher than the level of adoption. It was found that project businesses that have not employed digital project management tools experience several internal challenges, while external challenges were faced by various project businesses. This study identified several barriers which affected project businesses that had not employed digital project management tools, for example, lack of knowledge and the high costs associated with the use of digital project management solutions. These barriers were also identified as factors that affected the readiness of project businesses to adopt digital project management tools.
The study proposed several recommendations, including that software developers should consider integrating their software packages with external systems that could enhance project management processes. It was further recommended that the South African mining sector should set a minimum standard requirement of project management training and the South African business development agencies should consider offering ongoing training programmes that are focused on digitalisation and knowledge building in the discipline of project management.
International Journal of Engineering and Advanced Technology FAQ - ResearchHelpDesk - International Journal of Engineering and Advanced Technology (IJEAT) is having Online-ISSN 2249-8958, bi-monthly international journal, being published in the months of February, April, June, August, October, and December by Blue Eyes Intelligence Engineering & Sciences Publication (BEIESP) Bhopal (M.P.), India since the year 2011. It is academic, online, open access, double-blind, peer-reviewed international journal. It aims to publish original, theoretical and practical advances in Computer Science & Engineering, Information Technology, Electrical and Electronics Engineering, Electronics and Telecommunication, Mechanical Engineering, Civil Engineering, Textile Engineering and all interdisciplinary streams of Engineering Sciences. All submitted papers will be reviewed by the board of committee of IJEAT. Aim of IJEAT Journal disseminate original, scientific, theoretical or applied research in the field of Engineering and allied fields. dispense a platform for publishing results and research with a strong empirical component. aqueduct the significant gap between research and practice by promoting the publication of original, novel, industry-relevant research. seek original and unpublished research papers based on theoretical or experimental works for the publication globally. publish original, theoretical and practical advances in Computer Science & Engineering, Information Technology, Electrical and Electronics Engineering, Electronics and Telecommunication, Mechanical Engineering, Civil Engineering, Textile Engineering and all interdisciplinary streams of Engineering Sciences. impart a platform for publishing results and research with a strong empirical component. create a bridge for a significant gap between research and practice by promoting the publication of original, novel, industry-relevant research. solicit original and unpublished research papers, based on theoretical or experimental works. Scope of IJEAT International Journal of Engineering and Advanced Technology (IJEAT) covers all topics of all engineering branches. Some of them are Computer Science & Engineering, Information Technology, Electronics & Communication, Electrical and Electronics, Electronics and Telecommunication, Civil Engineering, Mechanical Engineering, Textile Engineering and all interdisciplinary streams of Engineering Sciences. The main topic includes but not limited to: 1. Smart Computing and Information Processing Signal and Speech Processing Image Processing and Pattern Recognition WSN Artificial Intelligence and machine learning Data mining and warehousing Data Analytics Deep learning Bioinformatics High Performance computing Advanced Computer networking Cloud Computing IoT Parallel Computing on GPU Human Computer Interactions 2. Recent Trends in Microelectronics and VLSI Design Process & Device Technologies Low-power design Nanometer-scale integrated circuits Application specific ICs (ASICs) FPGAs Nanotechnology Nano electronics and Quantum Computing 3. Challenges of Industry and their Solutions, Communications Advanced Manufacturing Technologies Artificial Intelligence Autonomous Robots Augmented Reality Big Data Analytics and Business Intelligence Cyber Physical Systems (CPS) Digital Clone or Simulation Industrial Internet of Things (IIoT) Manufacturing IOT Plant Cyber security Smart Solutions – Wearable Sensors and Smart Glasses System Integration Small Batch Manufacturing Visual Analytics Virtual Reality 3D Printing 4. Internet of Things (IoT) Internet of Things (IoT) & IoE & Edge Computing Distributed Mobile Applications Utilizing IoT Security, Privacy and Trust in IoT & IoE Standards for IoT Applications Ubiquitous Computing Block Chain-enabled IoT Device and Data Security and Privacy Application of WSN in IoT Cloud Resources Utilization in IoT Wireless Access Technologies for IoT Mobile Applications and Services for IoT Machine/ Deep Learning with IoT & IoE Smart Sensors and Internet of Things for Smart City Logic, Functional programming and Microcontrollers for IoT Sensor Networks, Actuators for Internet of Things Data Visualization using IoT IoT Application and Communication Protocol Big Data Analytics for Social Networking using IoT IoT Applications for Smart Cities Emulation and Simulation Methodologies for IoT IoT Applied for Digital Contents 5. Microwaves and Photonics Microwave filter Micro Strip antenna Microwave Link design Microwave oscillator Frequency selective surface Microwave Antenna Microwave Photonics Radio over fiber Optical communication Optical oscillator Optical Link design Optical phase lock loop Optical devices 6. Computation Intelligence and Analytics Soft Computing Advance Ubiquitous Computing Parallel Computing Distributed Computing Machine Learning Information Retrieval Expert Systems Data Mining Text Mining Data Warehousing Predictive Analysis Data Management Big Data Analytics Big Data Security 7. Energy Harvesting and Wireless Power Transmission Energy harvesting and transfer for wireless sensor networks Economics of energy harvesting communications Waveform optimization for wireless power transfer RF Energy Harvesting Wireless Power Transmission Microstrip Antenna design and application Wearable Textile Antenna Luminescence Rectenna 8. Advance Concept of Networking and Database Computer Network Mobile Adhoc Network Image Security Application Artificial Intelligence and machine learning in the Field of Network and Database Data Analytic High performance computing Pattern Recognition 9. Machine Learning (ML) and Knowledge Mining (KM) Regression and prediction Problem solving and planning Clustering Classification Neural information processing Vision and speech perception Heterogeneous and streaming data Natural language processing Probabilistic Models and Methods Reasoning and inference Marketing and social sciences Data mining Knowledge Discovery Web mining Information retrieval Design and diagnosis Game playing Streaming data Music Modelling and Analysis Robotics and control Multi-agent systems Bioinformatics Social sciences Industrial, financial and scientific applications of all kind 10. Advanced Computer networking Computational Intelligence Data Management, Exploration, and Mining Robotics Artificial Intelligence and Machine Learning Computer Architecture and VLSI Computer Graphics, Simulation, and Modelling Digital System and Logic Design Natural Language Processing and Machine Translation Parallel and Distributed Algorithms Pattern Recognition and Analysis Systems and Software Engineering Nature Inspired Computing Signal and Image Processing Reconfigurable Computing Cloud, Cluster, Grid and P2P Computing Biomedical Computing Advanced Bioinformatics Green Computing Mobile Computing Nano Ubiquitous Computing Context Awareness and Personalization, Autonomic and Trusted Computing Cryptography and Applied Mathematics Security, Trust and Privacy Digital Rights Management Networked-Driven Multicourse Chips Internet Computing Agricultural Informatics and Communication Community Information Systems Computational Economics, Digital Photogrammetric Remote Sensing, GIS and GPS Disaster Management e-governance, e-Commerce, e-business, e-Learning Forest Genomics and Informatics Healthcare Informatics Information Ecology and Knowledge Management Irrigation Informatics Neuro-Informatics Open Source: Challenges and opportunities Web-Based Learning: Innovation and Challenges Soft computing Signal and Speech Processing Natural Language Processing 11. Communications Microstrip Antenna Microwave Radar and Satellite Smart Antenna MIMO Antenna Wireless Communication RFID Network and Applications 5G Communication 6G Communication 12. Algorithms and Complexity Sequential, Parallel And Distributed Algorithms And Data Structures Approximation And Randomized Algorithms Graph Algorithms And Graph Drawing On-Line And Streaming Algorithms Analysis Of Algorithms And Computational Complexity Algorithm Engineering Web Algorithms Exact And Parameterized Computation Algorithmic Game Theory Computational Biology Foundations Of Communication Networks Computational Geometry Discrete Optimization 13. Software Engineering and Knowledge Engineering Software Engineering Methodologies Agent-based software engineering Artificial intelligence approaches to software engineering Component-based software engineering Embedded and ubiquitous software engineering Aspect-based software engineering Empirical software engineering Search-Based Software engineering Automated software design and synthesis Computer-supported cooperative work Automated software specification Reverse engineering Software Engineering Techniques and Production Perspectives Requirements engineering Software analysis, design and modelling Software maintenance and evolution Software engineering tools and environments Software engineering decision support Software design patterns Software product lines Process and workflow management Reflection and metadata approaches Program understanding and system maintenance Software domain modelling and analysis Software economics Multimedia and hypermedia software engineering Software engineering case study and experience reports Enterprise software, middleware, and tools Artificial intelligent methods, models, techniques Artificial life and societies Swarm intelligence Smart Spaces Autonomic computing and agent-based systems Autonomic computing Adaptive Systems Agent architectures, ontologies, languages and protocols Multi-agent systems Agent-based learning and knowledge discovery Interface agents Agent-based auctions and marketplaces Secure mobile and multi-agent systems Mobile agents SOA and Service-Oriented Systems Service-centric software engineering Service oriented requirements engineering Service oriented architectures Middleware for service based systems Service discovery and composition Service level agreements (drafting,
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection comprises the manifest and the two generated XES logs from the case study in [1]. The manifest specifies which data related to the execution of the CryptoKitties smart contracts was retrieved from the transaction log of the public Ethereum blockchain. The manifest also specifies how this data was transformed into the two XES logs. Details regarding the approach to extracting XES logs from the Ethereum blockchain and the case study are provided in [1]. For more information on the XES standard see http://www.xes-standard.org.
[1] Klinkmüller, C., Ponomarev, A., Tran, A., Weber, I., and van der Aalst, W.: "Mining Blockchain Processes: Extracting Process Mining Data from Blockchain Applications", Blockchain Forum at BPM 2019.
About Dataset
This case study is a part of Google Data Analytics course. Cyclistic is a fictional bike-sharing company, however, the data is real. It encompasses information about bike-sharing stations in Chicago and total rides with rented bikes during more than 10 years, from 2013 until February 2023.
The business task is to help design the marketing strategy. The project owner aims at converting casual riders into annual members. To achieve that goal the marketing team needs to better understand how annual members and casual riders differ in using rented bikes.
My specific task was to analyze the available data of rides and provide 3 main recommendation for the marketing strategy, based on the data analysis.
The requirement was to analyze the data for the last 12 months. However, I decided to use the whole dataset, since it was openly available for the whole period of operations.
Data License Agreement
Lyft Bikes and Scooters, LLC (“Bikeshare”) operates the City of Chicago’s (“City”) Divvy bicycle sharing service. Bikeshare and the City are committed to supporting bicycling as an alternative transportation option. As part of that commitment, the City permits Bikeshare to make certain Divvy system data owned by the City (“Data”) available to the public, subject to the terms and conditions of this License Agreement (“Agreement”). By accessing or using any of the Data, you agree to all of the terms and conditions of this Agreement.
License. Bikeshare hereby grants to you a non-exclusive, royalty-free, limited, perpetual license to access, reproduce, analyze, copy, modify, distribute in your product or service and use the Data for any lawful purpose (“License”). Prohibited Conduct. The License does not authorize you to do, and you will not do or assist others in doing, any of the following
Use the Data in any unlawful manner or for any unlawful purpose; Host, stream, publish, distribute, sublicense, or sell the Data as a stand-alone dataset; provided, however, you may include the Data as source material, as applicable, in analyses, reports, or studies published or distributed for non-commercial purposes; Access the Data by means other than the interface Bikeshare provides or authorizes for that purpose; Circumvent any access restrictions relating to the Data; Use data mining or other extraction methods in connection with Bikeshare's website or the Data; Attempt to correlate the Data with names, addresses, or other information of customers or Members of Bikeshare; and State or imply that you are affiliated, approved, endorsed, or sponsored by Bikeshare. Use or authorize others to use, without the written permission of the applicable owners, the trademarks or trade names of Lyft Bikes and Scooters, LLC, the City of Chicago or any sponsor of the Divvy service. These marks include, but are not limited to DIVVY, and the DIVVY logo, which are owned by the City of Chicago. No Warranty. THE DATA IS PROVIDED “AS IS,” AS AVAILABLE (AT BIKESHARE’S SOLE DISCRETION) AND AT YOUR SOLE RISK. TO THE MAXIMUM EXTENT PROVIDED BY LAW BIKESHARE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. BIKESHARE FURTHER DISCLAIMS ANY WARRANTY THAT THE DATA WILL MEET YOUR NEEDS OR WILL BE OR CONTINUE TO BE AVAILABLE, COMPLETE, ACCURATE, TIMELY, SECURE, OR ERROR FREE.
Limitation of Liability and Covenant Not to Sue. Bikeshare, its parent, affiliates and sponsors, and their respective directors, officers, employees, or agents will not be liable to you or anyone else for any loss or damage, including any direct, indirect, incidental, and consequential damages, whether foreseeable or not, based on any theory of liability, resulting in whole or in part from your access to or use of the Data. You will not bring any claim for damages against any of those persons or entities in any court or otherwise arising out of or relating to this Agreement, the Data, or your use of the Data. In any event, if you were to bring and prevail on such a claim, your maximum recovery is limited to $100 in the aggregate even if you or they had been advised of the possibility of liability exceeding that amount. Ownership and Provision of Data. The City of Chicago owns all right, title, and interest in the Data. Bikeshare may modify or cease providing any or all of the Data at any time, without notice, in its sole discretion. No Waiver. Nothing in this Agreement is or implies a waiver of any rights Bikeshare or the City of Chicago has in the Data or in any copyrights, patents, or trademarks owned or licensed by Bikeshare, its parent, affiliates or sponsors. The DIVVY trademarks are owned by the City of Chicago. Termination of Agreement. Bikeshare may terminate this Agreement at any time and for any reason in its sole discretion. Termination will be effective ...
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
FANTASIAThis repository contains the data related to image descriptors and sound associated with a selection of frames of the films Fantasia and Fantasia 2000 produced by DisneyAboutThis repository contains the data used in the article Automatic composition of descriptive music: A case study of the relationship between image and sound published in the 6th International Workshop on Computational Creativity, Concept Invention, and General Intelligence (C3GI). Data structure is explained in detail in the article. AbstractHuman beings establish relationships with the environment mainly through sight and hearing. This work focuses on the concept of descriptive music, which makes use of sound resources to narrate a story. The Fantasia film, produced by Walt Disney was used in the case study. One of its musical pieces is analyzed in order to obtain the relationship between image and music. This connection is subsequently used to create a descriptive musical composition from a new video. Naive Bayes, Support Vector Machine and Random Forest are the three classifiers studied for the model induction process. After an analysis of their performance, it was concluded that Random Forest provided the best solution; the produced musical composition had a considerably high descriptive quality. DataNutcracker_data.arff: Image descriptors and the most important sound of each frame from the fragment "The Nutcracker Suite" in film Fantasia. Data stored into ARFF format.Firebird_data.arff: Image descriptors of each frame from the fragment "The Firebird" in film Fantasia 2000. Data stored into ARFF format.Firebird_midi_prediction.csv: Frame number of the fragment "The Firebird" in film Fantasia 2000 and the sound predicted by the system encoded in MIDI. Data stored into CSV format.Firebird_prediction.mp3: Audio file with the synthesizing of the prediction data for the fragment "The Firebird" of film Fantasia 2000.LicenseData is available under MIT License. To make use of the data the article must be cited.
Journal of Big Data Impact Factor 2024-2025 - ResearchHelpDesk - The Journal of Big Data publishes high-quality, scholarly research papers, methodologies and case studies covering a broad range of topics, from big data analytics to data-intensive computing and all applications of big data research. The journal examines the challenges facing big data today and going forward including, but not limited to: data capture and storage; search, sharing, and analytics; big data technologies; data visualization; architectures for massively parallel processing; data mining tools and techniques; machine learning algorithms for big data; cloud computing platforms; distributed file systems and databases; and scalable storage systems. Academic researchers and practitioners will find the Journal of Big Data to be a seminal source of innovative material. All articles published by the Journal of Big Data are made freely and permanently accessible online immediately upon publication, without subscription charges or registration barriers. As authors of articles published in the Journal of Big Data you are the copyright holders of your article and have granted to any third party, in advance and in perpetuity, the right to use, reproduce or disseminate your article, according to the SpringerOpen copyright and license agreement. For those of you who are US government employees or are prevented from being copyright holders for similar reasons, SpringerOpen can accommodate non-standard copyright lines.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Parameters for case study: Corrosive Sulphur in power and distribution transformers
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provide the ".lp" and the ".sol" files for the case study published in: J.L Yarmuch, Grigaliunas M.C., Munizaga-Rosas J.C. "Optimum Layout of Sublevel Stoping Mines"
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document includes the underlying data of the case study conducted and presented in the confrence paper titled "Exploring Customer Journey Mining and RPA: Prediction of Customers’ Next Touchpoint".
Data Source: https://www.kaggle.com/datasets/kishlaya18/customer-purchase-journey-netherlands
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
The data was generated randomly, based on different combinations of possible actions in the dynamic simulator K-Spice from Kongsberg Digital. In case study 1, the aim was to increase +10 % of the oil production with respect to the initial condition value. In case study 2, the aim was to decrease -10 % of the gas production with respect to the initial condition value. The data show examples of possible correct and incorrect paths that a trainee could follow trying to solve the scenarios.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.
The code to derive the dataset is given as follows:
library(tesseract) library(pdftools)
pdfs <- list.files("[path to your output directory containing all PDF files]")
meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv",header = T,sep = ',') #the meta file containing all metadata for the PDF files (e.g. publication date)
meta$Doc.Date <- as.character(meta$Doc.Date)
meta.clean <- meta[-which(meta$Doc.Date=="" | grepl("/0000",meta$Doc.Date)),] for(i in 1:nrow(meta.clean)){ meta.clean$Doc.Date[i] <- gsub("00","01",meta.clean$Doc.Date[i])
if(nchar(meta.clean$Doc.Date[i])<10){ meta.clean$Doc.Date[i]<-format(strptime(meta.clean$Doc.Date[i],format = "%d/%m/%y"),"%m/%d/%Y") }
}
meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date,format = "%m/%d/%Y")
meta.clean <- meta.clean[order(meta.clean$Doc.Date),]
docs <- data.frame(content=character(0),dpub=character(0),stringsAsFactors = F) for(i in 1:nrow(meta.clean)){
pdf_prop <- pdftools::pdf_info(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i]))) tmp_files <- c() for(k in 1:pdf_prop$pages){ tmp_files <- c(tmp_files,paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/",k)) }
img_file <- pdftools::pdf_convert(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i])), format = 'tiff', pages = NULL, dpi = 700,filenames = tmp_files)
txt <- ""
for(j in 1:length(img_file)){ extract <- ocr(img_file[j], engine = tesseract("eng")) #unlink(img_file) txt <- paste(txt,extract,collapse = " ") }
docs <- rbind(docs,data.frame(content=iconv(tolower(gsub("\s+"," ",gsub("[[:punct:]]|[ ]"," ",txt))),to="UTF-8"),dpub=format(meta.clean$Doc.Date[i],"%Y/%m/%d"),stringsAsFactors = F),stringsAsFactors = F) }
write.table(docs,"[path to your output directory]/documents.csv", row.names = F)
PADMINI: A PEER-TO-PEER DISTRIBUTED ASTRONOMY DATA MINING SYSTEM AND A CASE STUDY TUSHAR MAHULE, KIRK BORNE, SANDIPAN DEY, SUGANDHA ARORA, AND HILLOL KARGUPTA** Abstract. Peer-to-Peer (P2P) networks are appealing for astronomy data mining from virtual observatories because of the large volume of the data, compute-intensive tasks, potentially large number of users, and distributed nature of the data analysis process. This paper offers a brief overview of PADMINI—a Peer-to-Peer Astronomy Data MINIng system. It also presents a case study on PADMINI for distributed outlier detection using astronomy data. PADMINI is a webbased system powered by Google Sky and distributed data mining algorithms that run on a collection of computing nodes. This paper offers a case study of the PADMINI evaluating the architecture and the performance of the overall system. Detailed experimental results are presented in order to document the utility and scalability of the system.