Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns of the data using certain kind of techiques such as classification or regression. It is not always feasible to apply classification algorithms directly to dataset. Before doing any work on the data, the data has to be pre-processed and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: It is different from Principle Component Analysis which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique of reducing the data dimension will lose a lot of information since clustering techniques are based a metric of 'distance'. At high dimensions euclidean distance loses pretty much all meaning. Therefore using clustering as a "Reducing" dimensionality by mapping data points to cluster numbers is not always good since you may lose almost all the information. From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, it brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, then affect the performance of classification. If the part of features we use clustering techniques on is very suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state in the effort to see if they were stable. Our assumption was that if the results vary highly from run to run which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model real world effectiveness and also to continue to revise the models from time to time as things change.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Due to increasing use of technology-enhanced educational assessment, data mining methods have been explored to analyse process data in log files from such assessment. However, most studies were limited to one data mining technique under one specific scenario. The current study demonstrates the usage of four frequently used supervised techniques, including Classification and Regression Trees (CART), gradient boosting, random forest, support vector machine (SVM), and two unsupervised methods, Self-organizing Map (SOM) and k-means, fitted to one assessment data. The USA sample (N = 426) from the 2012 Program for International Student Assessment (PISA) responding to problem-solving items is extracted to demonstrate the methods. After concrete feature generation and feature selection, classifier development procedures are implemented using the illustrated techniques. Results show satisfactory classification accuracy for all the techniques. Suggestions for the selection of classifiers are presented based on the research questions, the interpretability and the simplicity of the classifiers. Interpretations for the results from both supervised and unsupervised learning methods are provided.
Facebook
Twitterhttps://paper.erudition.co.in/termshttps://paper.erudition.co.in/terms
Question Paper Solutions of chapter Supervised Learning for classification of Data Mining, 6th Semester , B.Tech in Computer Science & Engineering (Artificial Intelligence and Machine Learning)
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data, programs, results, and analysis software for the paper "Comparison of 14 different families of classification algorithms on 115 binary data sets" https://arxiv.org/abs/1606.00930
Facebook
Twitterhttps://paper.erudition.co.in/termshttps://paper.erudition.co.in/terms
Question Paper Solutions of chapter Classification and Prediction of Data Warehousing and Data Mining, 6th Semester , Computer Science and Engineering
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023
This ZIP-File contains the data the thesis is based on, interim exports of the results and the R script with all pre-processing, data merging and analyses carried out. The documentation of the additional, explorative analysis is also available. The actual PDFs and text files of the scientific papers used are not included as they are published open access.
The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analyses approach, please refer to the master's thesis (publication following soon).
## Data sources
Folder 01_SourceData/
- PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)
- ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)
- ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)
- Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)
## Automatic classification
Folder 02_AutomaticClassification/
- (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)
- (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)
- PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)
- oddpub_results_wDOIs.csv (results file of the ODDPub classification)
- PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)
## Manual coding
Folder 03_ManualCheck/
- CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)
- ManualCheck_2023-06-08.csv (Manual coding results file)
- PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)
## Explorative analysis for the discoverability of open data
Folder04_FurtherAnalyses
Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher) - in German
## R-Script
Analyses_MA_OpenDataMonitoring.R (R-Script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)
Facebook
TwitterThe visit of an online shop by a possible customer is also called a session. During a session the visitor clicks on products in order to see the corresponding detail page. Furthermore, he possibly will add or remove products to/from his shopping basket. At the end of a session it is possible that one or several products from the shopping basket will be ordered. The activities of the user are also called transactions. The goal of the analysis is to predict whether the visitor will place an order or not on the basis of the transaction data collected during the session.
In the first task historical shop data are given consisting of the session activities inclusive of the associated information whether an order was placed or not. These data can be used in order to subsequently make order forecasts for other session activities in the same shop. Of course, the real outcome of the sessions for this set is not known. Thus, the first task can be understood as a classical data mining problem.
The second task deals with the online scenario. In this context the participants are to implement an agent learning on the basis of transactions. That means that the agent successively receives the individual transactions and has to make a forecast for each of them with respect to the outcome of the shopping cart transaction. This task maps the practice scenario in the best possible way in the case that a transaction-based forecast is required and a corresponding algorithm should learn in an adaptive manner.
For the individual tasks anonymised real shop data are provided in the form of structured text files consisting of individual data sets. The data sets represent in each case transactions in the shop and may contain redundant information. For the data, in particular the following applies:
In concrete terms, only the array names of the attached document “*features.pdf*” in their respective sequence will be used as column headings. The corresponding value ranges are listed there, too.
The training file for task 1 is “*transact_train.txt*“) contains all data arrays of the document, whereas the corresponding classification file (“*transact_class.txt*”) of course does not contain the target attribute “*order*”.
In task 2 data in the form of a string array are transferred to the implementations of the participants by means of a method. The individual fields of the array contain the same data arrays that are listed in “*features.pdf*”–also without the target attribute “*order*”–and exactly in the sequence used there.
This dataset is publicly available in the data-mining-cup-website.
Facebook
Twitterhttps://paper.erudition.co.in/termshttps://paper.erudition.co.in/terms
Question Paper Solutions of chapter Classification and Prediction of Data Warehousing and Data Mining, 3rd Semester , Master of Computer Applications (2 Years)
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the results for the FastEE paper.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Image classification features and examples of statistical results for the data mining approach using a one-versus-one strategy to implement a SVM (support vector machine) multi-class classifier. Data published in: Fefilatyev, S., K. Kramer, L. Hall, D. Goldgof, R. Kasturi, A. Remsen, K. Daly. 2011. Detection of Anomalous Particles from the Deepwater Horizon Oil Spill Using the SIPPER3 Underwater Imaging Platform. Proceedings of International Conference on Data Mining Workshops, p. 741-748. Awarded Data Mining Practice Prize at the IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, December 11-14, 2011. DOI 10.1109/ICDMW.2011.65.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains Zenodo's published open access records' metadata, including also records that have been marked by the Zenodo staff as spam and deleted.
The dataset is a gzipped compressed JSON-lines file, where each line is a JSON object representation of a Zenodo record.
Each object contains the terms:
part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date
which are corresponding to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.
In addition, some terms have been altered:
The term files contains a list of dictionaries containing filetype, size, and filename only.
The term license contains a short Zenodo ID of the license (e.g "cc-by").
The term spam contains a boolean value, determining whether a given record was marked as a spam record by Zenodo staff.
Some values for the top-level terms, which were missing in the metadata may contain a null value.
A smaller uncompressed random sample of 200 JSON lines is also included to allow for testing and getting familiar with the format without having to download the entire dataset.
Facebook
TwitterThis is the required files to run the experiment published in the paper "Indexing and classifying gigabytes of time series under time warping". It contains the nearest neighbour indices for each query in each dataset.
Facebook
TwitterHepatocellular organic anion transporting polypeptides (OATP1B1, OATP1B3, and OATP2B1) are important for proper liver function and the regulation of the drug elimination process. Understanding their roles in different conditions of liver toxicity and cancer requires an in-depth investigation of hepatic OATP–ligand interactions and selectivity. However, such studies are impeded by the lack of crystal structures, the promiscuous nature of these transporters, and the limited availability of reliable bioactivity data, which are spread over different data sources in the open domain. To this end, we integrated ligand bioactivity data for hepatic OATPs from five open data sources (ChEMBL, the UCSF–FDA TransPortal database, DrugBank, Metrabase, and IUPHAR) in a semiautomatic KNIME workflow. Highly curated data sets were analyzed with respect to enriched scaffolds, and their activity profiles and interesting scaffold series providing indication for selective, dual-, or pan-inhibitory activity toward hepatic OATPs could be extracted. In addition, a sequential binary modeling approach revealed common and distinctive ligand features for inhibitory activity toward the individual transporters. The workflows designed for integrating data from open sources, data curation, and subsequent substructure analyses are freely available and fully adaptable. The new data sets for inhibitors and substrates of hepatic OATPs as well as the insights provided by the feature and substructure analyses will guide future structure-based studies on hepatic OATP–ligand interactions and selectivity.
Facebook
TwitterAttribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
A bilingual corpus composed of biomedical abstracts written in English and Spanish, extracted from MEDLINE.
Facebook
TwitterTriple random ensemble method for multi-label classification
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and programs for the paper "Nested cross-validation when selecting machine learning algorithms is overzealous"
Facebook
Twitterhttps://www.gnu.org/licenses/gpl-3.0.htmlhttps://www.gnu.org/licenses/gpl-3.0.html
This are the results for the work published in "Indexing and classifying gigabytes of time series under time warping"
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
the dataset is collected from social media such as facebook and telegram. the dataset is further processed. the collection are orginal_cleaned: this dataset is neither stemed nor stopword are remove: stopword_removed: in this dataset stopwords are removed but not stemmed and in stemed datset is stemmed and stopwords are removed. stemming is done using hornmorpho developed by Michael Gesser( available at https://github.com/hltdi/HornMorpho) all datasets are normalized and free from noise such as punctuation marks and emojs.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the result for the paper "Elastic band across the path: A new framework to lower bound DTW"
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MusicOSet is an open and enhanced dataset of musical elements (artists, songs and albums) based on musical popularity classification. Provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical features sources. Data from all three categories were initially collected between January and May 2019. Nevertheless, the update and enhancement of the data happened in June 2019.
The attractive features of MusicOSet include:
| Data | # Records |
|:-----------------:|:---------:|
| Songs | 20,405 |
| Artists | 11,518 |
| Albums | 26,522 |
| Lyrics | 19,664 |
| Acoustic Features | 20,405 |
| Genres | 1,561 |
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns of the data using certain kind of techiques such as classification or regression. It is not always feasible to apply classification algorithms directly to dataset. Before doing any work on the data, the data has to be pre-processed and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: It is different from Principle Component Analysis which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique of reducing the data dimension will lose a lot of information since clustering techniques are based a metric of 'distance'. At high dimensions euclidean distance loses pretty much all meaning. Therefore using clustering as a "Reducing" dimensionality by mapping data points to cluster numbers is not always good since you may lose almost all the information. From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, it brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, then affect the performance of classification. If the part of features we use clustering techniques on is very suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state in the effort to see if they were stable. Our assumption was that if the results vary highly from run to run which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model real world effectiveness and also to continue to revise the models from time to time as things change.