100+ datasets found

Educational Attainment in North Carolina Public Schools: Use of statistical...
data.mendeley.com
Updated Nov 14, 2018
Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
Explore at:
Unique identifier
https://doi.org/10.17632/6cm9wyd5g5.1
Dataset updated
Nov 14, 2018
Authors
Scott Herford
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before any modeling, the data has to be pre-processed, and this normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and to create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which is guaranteed to find the best linear transformation that reduces the number of dimensions with minimum loss of information (measured as retained variance). Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a distance metric, and in high dimensions Euclidean distance loses most of its meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always a good idea, since you may lose almost all the information. From the feature creation perspective: clustering creates labels based on patterns in the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering and, in turn, the performance of the classification. If the subset of features we cluster on is well suited to clustering, it might improve overall classification performance.
For example, if the features we run k-means on are numerical and low-dimensional, the overall classification performance may be better. We deliberately did not fix the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results varied widely from run to run, which they definitely did, the data may simply not cluster well with the selected methods. In short, our results were not much better than random when clustering was applied in data preprocessing. Finally, it is important to have a feedback loop in place that continuously collects the same data, in the same format, as the data from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
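The remark about Euclidean distance in high dimensions can be illustrated with a small experiment. The sketch below is not taken from the dataset or its analysis; it uses uniformly random points (an assumption made purely for illustration) and a hypothetical helper, `relative_contrast`, that compares the nearest and farthest neighbour of a query point. As the dimension grows, the two distances become nearly equal, which is one reason distance-based clustering can carry little information there.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for one random query point.

    Points are drawn uniformly from the unit hypercube; this is an
    illustrative assumption, not the school dataset described above.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


low_dim_contrast = relative_contrast(dim=2)
high_dim_contrast = relative_contrast(dim=1000)

print(f"relative contrast in    2 dimensions: {low_dim_contrast:.3f}")
print(f"relative contrast in 1000 dimensions: {high_dim_contrast:.3f}")
```

A large contrast means the nearest neighbour is much closer than the farthest one; a contrast near zero means all points look roughly equally far away.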
Table_1_Data Mining Techniques in Analyzing Process Data: A Didactic.pdf
frontiersin.figshare.com
pdf
Updated Jun 7, 2023
Cite
Xin Qiao; Hong Jiao (2023). Table_1_Data Mining Techniques in Analyzing Process Data: A Didactic.pdf [Dataset]. http://doi.org/10.3389/fpsyg.2018.02231.s001
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fpsyg.2018.02231.s001
Dataset updated
Jun 7, 2023
Dataset provided by
Frontiers Media http://www.frontiersin.org/
Authors
Xin Qiao; Hong Jiao
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Due to the increasing use of technology-enhanced educational assessment, data mining methods have been explored to analyse process data in log files from such assessments. However, most studies were limited to one data mining technique under one specific scenario. The current study demonstrates the usage of four frequently used supervised techniques, namely Classification and Regression Trees (CART), gradient boosting, random forest, and support vector machine (SVM), and two unsupervised methods, Self-organizing Map (SOM) and k-means, fitted to a single assessment dataset. The USA sample (N = 426) from the 2012 Program for International Student Assessment (PISA) responding to problem-solving items is extracted to demonstrate the methods. After concrete feature generation and feature selection, classifier development procedures are implemented using the illustrated techniques. Results show satisfactory classification accuracy for all the techniques. Suggestions for the selection of classifiers are presented based on the research questions, the interpretability and the simplicity of the classifiers. Interpretations for the results from both supervised and unsupervised learning methods are provided.
Data from: Supervised Learning for classification
paper.erudition.co.in
html
Updated Dec 3, 2025
+ more versions
Cite
Einetic (2025). Supervised Learning for classification [Dataset]. https://paper.erudition.co.in/makaut/btech-in-computer-science-and-engineering-artificial-intelligence-and-machine-learning/6/data-mining
Explore at:
Available download formats: html
Dataset updated
Dec 3, 2025
Dataset authored and provided by
Einetic
License
https://paper.erudition.co.in/terms
Description
Question paper solutions for the chapter "Supervised Learning for classification" of Data Mining, 6th Semester, B.Tech in Computer Science & Engineering (Artificial Intelligence and Machine Learning)
Comparison of 14 classifiers
figshare.com
application/gzip
Updated Jun 11, 2023
Cite
Jacques Wainer (2023). Comparison of 14 classifiers [Dataset]. http://doi.org/10.6084/m9.figshare.3407932.v2
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.3407932.v2
Dataset updated
Jun 11, 2023
Dataset provided by
Figshare http://figshare.com/
Authors
Jacques Wainer
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data, programs, results, and analysis software for the paper "Comparison of 14 different families of classification algorithms on 115 binary data sets" https://arxiv.org/abs/1606.00930
Classification and Prediction
paper.erudition.co.in
html
Updated Jan 7, 2022
+ more versions
Cite
Einetic (2022). Classification and Prediction [Dataset]. https://paper.erudition.co.in/1/btech-in-computer-science-and-engineering/6/data-warehousing-and-data-mining
Explore at:
Available download formats: html
Dataset updated
Jan 7, 2022
Dataset authored and provided by
Einetic
License
https://paper.erudition.co.in/terms
Description
Question paper solutions for the chapter "Classification and Prediction" of Data Warehousing and Data Mining, 6th Semester, Computer Science and Engineering
Data supporting the Master thesis "Monitoring von Open Data Praktiken -...
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Nov 21, 2024
Cite
Katharina Zinke (2024). Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" [Dataset]. http://doi.org/10.5281/zenodo.14196539
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.14196539
Dataset updated
Nov 21, 2024
Dataset provided by
Zenodo http://zenodo.org/
Authors
Katharina Zinke
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Dresden
Description
Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023

This ZIP-File contains the data the thesis is based on, interim exports of the results and the R script with all pre-processing, data merging and analyses carried out. The documentation of the additional, explorative analysis is also available. The actual PDFs and text files of the scientific papers used are not included as they are published open access.

The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analysis approach, please refer to the master's thesis (publication to follow soon).

## Data sources

Folder 01_SourceData/

- PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)

- ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)

- ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)

- Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)

## Automatic classification

Folder 02_AutomaticClassification/

- (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)

- (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)

- PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)

- oddpub_results_wDOIs.csv (results file of the ODDPub classification)

- PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)

## Manual coding

Folder 03_ManualCheck/

- CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)

- ManualCheck_2023-06-08.csv (Manual coding results file)

- PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)

## Explorative analysis for the discoverability of open data

Folder 04_FurtherAnalyses/

- Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher) - in German

## R-Script

Analyses_MA_OpenDataMonitoring.R (R-Script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)
Prediction of Online Orders
kaggle.com
zip
Updated May 23, 2023
Cite
Oscar Aguilar (2023). Prediction of Online Orders [Dataset]. https://www.kaggle.com/datasets/oscarm524/prediction-of-orders/versions/3
Explore at:
Available download formats: zip (6680913 bytes)
Dataset updated
May 23, 2023
Authors
Oscar Aguilar
Description
The visit of an online shop by a possible customer is also called a session. During a session the visitor clicks on products in order to see the corresponding detail page. Furthermore, he possibly will add or remove products to/from his shopping basket. At the end of a session it is possible that one or several products from the shopping basket will be ordered. The activities of the user are also called transactions. The goal of the analysis is to predict whether the visitor will place an order or not on the basis of the transaction data collected during the session.

Tasks

In the first task historical shop data are given consisting of the session activities inclusive of the associated information whether an order was placed or not. These data can be used in order to subsequently make order forecasts for other session activities in the same shop. Of course, the real outcome of the sessions for this set is not known. Thus, the first task can be understood as a classical data mining problem.

The second task deals with the online scenario. In this context the participants are to implement an agent learning on the basis of transactions. That means that the agent successively receives the individual transactions and has to make a forecast for each of them with respect to the outcome of the shopping cart transaction. This task maps the practice scenario in the best possible way in the case that a transaction-based forecast is required and a corresponding algorithm should learn in an adaptive manner.

The Data

For the individual tasks anonymised real shop data are provided in the form of structured text files consisting of individual data sets. The data sets represent in each case transactions in the shop and may contain redundant information. For the data, in particular the following applies:

Each data set is in an individual line that is closed by “LF”(“line feed”, 0xA), “CR”(“carriage return”, 0xD), or “CR”and “LF”(“carriage return”and “line feed”, 0xD and 0xA).

The first line is structured analogously to the data sets but contains the names of the respective columns (data arrays).

The header and each data set contain several arrays separated by the symbol “|”.

There is no escape character, and no quoting is used.

ASCII is used as character set.

There may be missing values. These are marked by the symbol “?”.

In concrete terms, only the array names of the attached document “*features.pdf*” in their respective sequence will be used as column headings. The corresponding value ranges are listed there, too.

The training file for task 1 (“*transact_train.txt*”) contains all data arrays of the document, whereas the corresponding classification file (“*transact_class.txt*”) of course does not contain the target attribute “*order*”.

In task 2 data in the form of a string array are transferred to the implementations of the participants by means of a method. The individual fields of the array contain the same data arrays that are listed in “*features.pdf*”–also without the target attribute “*order*”–and exactly in the sequence used there.
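The file format rules above (pipe-separated fields, a header line, no quoting or escaping, “?” for missing values, and LF, CR, or CR+LF line endings) can be read with Python's standard library. The following is a minimal sketch under stated assumptions: the function name `parse_transactions` and the sample columns `sessionNo` and `duration` are hypothetical and only illustrate the layout; the real column names are those listed in “*features.pdf*”.

```python
import csv

MISSING_MARKER = "?"  # missing values are marked by "?" in the files


def parse_transactions(text):
    """Parse pipe-separated transaction text into a header and row dicts.

    str.splitlines() accepts LF, CR and CR+LF line endings (it also splits
    on a few rarer control characters). QUOTE_NONE tells the csv module
    that no quoting is used.
    """
    reader = csv.reader(text.splitlines(), delimiter="|", quoting=csv.QUOTE_NONE)
    header = next(reader)
    rows = []
    for fields in reader:
        if not fields:  # skip empty lines, e.g. a trailing line break
            continue
        rows.append(
            {
                name: (None if value == MISSING_MARKER else value)
                for name, value in zip(header, fields)
            }
        )
    return header, rows


# Hypothetical sample with mixed line endings; only "order" is a column
# name taken from the description above.
sample = "sessionNo|duration|order\r\n1|12.5|y\n2|?|n\r3|40.0|?"
header, rows = parse_transactions(sample)
print(header)
print(rows)
```

All values are kept as strings here; converting them to numbers would require the value ranges documented in “*features.pdf*”.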

Acknowledgement

This dataset is publicly available on the Data Mining Cup website.
Classification and Prediction
paper.erudition.co.in
html
Updated Dec 3, 2025
Cite
Einetic (2025). Classification and Prediction [Dataset]. https://paper.erudition.co.in/makaut/master-of-computer-applications-2-years/3/data-warehousing-and-data-mining
Explore at:
Available download formats: html
Dataset updated
Dec 3, 2025
Dataset authored and provided by
Einetic
License
https://paper.erudition.co.in/terms
Description
Question paper solutions for the chapter "Classification and Prediction" of Data Warehousing and Data Mining, 3rd Semester, Master of Computer Applications (2 Years)
Results
bridges.monash.edu
researchdata.edu.au
xlsx
Updated Jun 10, 2019
Cite
Chang Wei Tan (2019). Results [Dataset]. http://doi.org/10.26180/5c30a56c0bda8
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.26180/5c30a56c0bda8
Dataset updated
Jun 10, 2019
Dataset provided by
Monash University
Authors
Chang Wei Tan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These are the results for the FastEE paper.
Application of image processing and machine learning techniques to...
data.griidc.org
search.dataone.org
Updated Oct 26, 2015
Cite
Kendra Daly (2015). Application of image processing and machine learning techniques to distinguish suspected oil droplets from plankton and other particles for the SIPPER imaging system [Dataset]. http://doi.org/10.7266/N74X55RS
Explore at:
Unique identifier
https://doi.org/10.7266/N74X55RS
Dataset updated
Oct 26, 2015
Dataset provided by
GRIIDC
Authors
Kendra Daly
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Image classification features and examples of statistical results for the data mining approach using a one-versus-one strategy to implement a SVM (support vector machine) multi-class classifier. Data published in: Fefilatyev, S., K. Kramer, L. Hall, D. Goldgof, R. Kasturi, A. Remsen, K. Daly. 2011. Detection of Anomalous Particles from the Deepwater Horizon Oil Spill Using the SIPPER3 Underwater Imaging Platform. Proceedings of International Conference on Data Mining Workshops, p. 741-748. Awarded Data Mining Practice Prize at the IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, December 11-14, 2011. DOI 10.1109/ICDMW.2011.65.
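The one-versus-one strategy mentioned in the description trains one binary classifier per pair of classes and lets the pairwise winners vote. The sketch below shows only that voting scheme. It is not the published method: a nearest-centroid rule stands in for the SVM, and the two-dimensional feature values are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    count = len(points)
    return tuple(sum(p[i] for p in points) / count for i in range(len(points[0])))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_one_vs_one(samples):
    """Build one binary model per pair of classes.

    samples maps a class label to its training feature tuples. Each "model"
    here is just a pair of centroids (a stand-in for a binary SVM).
    """
    centroids = {label: centroid(points) for label, points in samples.items()}
    return [
        (a, centroids[a], b, centroids[b])
        for a, b in combinations(sorted(centroids), 2)
    ]


def predict(models, features):
    """Each pairwise model casts one vote; the most voted class wins."""
    votes = Counter()
    for a, centroid_a, b, centroid_b in models:
        closer_to_a = squared_distance(features, centroid_a) <= squared_distance(
            features, centroid_b
        )
        votes[a if closer_to_a else b] += 1
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training data for three particle classes.
training = {
    "oil": [(0.9, 0.1), (1.0, 0.2)],
    "plankton": [(0.1, 0.9), (0.2, 1.0)],
    "other": [(0.5, 0.5), (0.6, 0.4)],
}
models = train_one_vs_one(training)
label = predict(models, (0.95, 0.15))
print(len(models), label)
```

With k classes this scheme needs k(k-1)/2 binary models, so three classes give three pairwise models.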
Zenodo Open Metadata snapshot - Training dataset for records classifier...
zenodo.org
application/gzip, bin
Updated Dec 14, 2022
+ more versions
Cite
Alex Ioannidis (2022). Zenodo Open Metadata snapshot - Training dataset for records classifier building [Dataset]. http://doi.org/10.5281/zenodo.1255786
Explore at:
Available download formats: bin, application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.1255786
Dataset updated
Dec 14, 2022
Dataset provided by
Zenodo http://zenodo.org/
Authors
Alex Ioannidis
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains Zenodo's published open access records' metadata, including also records that have been marked by the Zenodo staff as spam and deleted.

The dataset is a gzip-compressed JSON-lines file, where each line is a JSON object representation of a Zenodo record.

Each object contains the terms:
part_of, thesis, description, doi, meeting, imprint, references, recid, alternate_identifiers, resource_type, journal, related_identifiers, title, subjects, notes, creators, communities, access_right, keywords, contributors, publication_date

which correspond to the fields with the same name available in Zenodo's record JSON Schema at https://zenodo.org/schemas/records/record-v1.0.0.json.

In addition, some terms have been altered:

The term files contains a list of dictionaries containing filetype, size, and filename only.
The term license contains a short Zenodo ID of the license (e.g. "cc-by").
The term spam contains a boolean value indicating whether a given record was marked as spam by Zenodo staff.

Top-level terms that were missing in the metadata may contain a null value.

A smaller uncompressed random sample of 200 JSON lines is also included to allow for testing and getting familiar with the format without having to download the entire dataset.
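A gzip-compressed JSON-lines file like the one described can be streamed record by record with Python's standard library. The following is a minimal sketch, not an official loader: the helper names `iter_records` and `spam_counts` are hypothetical, and the two sample records are invented and kept in memory so the example does not depend on the real download.

```python
import gzip
import io
import json


def iter_records(source):
    """Yield one dict per line from a gzip-compressed JSON-lines source.

    source may be a file path or a binary file object; gzip.open accepts both.
    """
    with gzip.open(source, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def spam_counts(records):
    """Count records by the boolean "spam" term described above."""
    counts = {"spam": 0, "not_spam": 0}
    for record in records:
        counts["spam" if record.get("spam") else "not_spam"] += 1
    return counts


# Invented sample records; null values become None after json.loads.
sample_records = [
    {"recid": 1, "title": "Example record", "license": "cc-by", "spam": False},
    {"recid": 2, "title": None, "license": None, "spam": True},
]
raw = "\n".join(json.dumps(record) for record in sample_records)
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

counts = spam_counts(iter_records(buffer))
print(counts)
```

Streaming line by line keeps memory use flat, which matters for the full snapshot; the uncompressed 200-line sample could be read with a plain `open` instead of `gzip.open`.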
Index1NN: Time Series Indexing (TSI)
researchdata.edu.au
bridges.monash.edu
Updated May 5, 2022
Cite
Chang Wei Tan (2022). Index1NN: Time Series Indexing (TSI) [Dataset]. http://doi.org/10.4225/03/587db15ba0852
Explore at:
Unique identifier
https://doi.org/10.4225/03/587db15ba0852
Dataset updated
May 5, 2022
Dataset provided by
Monash University
Authors
Chang Wei Tan
Description
These are the files required to run the experiment published in the paper "Indexing and classifying gigabytes of time series under time warping". They contain the nearest neighbour indices for each query in each dataset.
Data from: Integrative Data Mining, Scaffold Analysis, and Sequential Binary...
datasetcatalog.nlm.nih.gov
Updated Nov 8, 2018
Cite
Zdrazil, Barbara; Türková, Alžběta; Jain, Sankalp (2018). Integrative Data Mining, Scaffold Analysis, and Sequential Binary Classification Models for Exploring Ligand Profiles of Hepatic Organic Anion Transporting Polypeptides [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000696733
Explore at:
Dataset updated
Nov 8, 2018
Authors
Zdrazil, Barbara; Türková, Alžběta; Jain, Sankalp
Description
Hepatocellular organic anion transporting polypeptides (OATP1B1, OATP1B3, and OATP2B1) are important for proper liver function and the regulation of the drug elimination process. Understanding their roles in different conditions of liver toxicity and cancer requires an in-depth investigation of hepatic OATP–ligand interactions and selectivity. However, such studies are impeded by the lack of crystal structures, the promiscuous nature of these transporters, and the limited availability of reliable bioactivity data, which are spread over different data sources in the open domain. To this end, we integrated ligand bioactivity data for hepatic OATPs from five open data sources (ChEMBL, the UCSF–FDA TransPortal database, DrugBank, Metrabase, and IUPHAR) in a semiautomatic KNIME workflow. Highly curated data sets were analyzed with respect to enriched scaffolds, and their activity profiles and interesting scaffold series providing indication for selective, dual-, or pan-inhibitory activity toward hepatic OATPs could be extracted. In addition, a sequential binary modeling approach revealed common and distinctive ligand features for inhibitory activity toward the individual transporters. The workflows designed for integrating data from open sources, data curation, and subsequent substructure analyses are freely available and fully adaptable. The new data sets for inhibitors and substrates of hepatic OATPs as well as the insights provided by the feature and substructure analyses will guide future structure-based studies on hepatic OATP–ligand interactions and selectivity.
CL-UvigoMED
data.mendeley.com
narcis.nl
Updated Oct 24, 2016
Cite
Marcos Mouriño García (2016). CL-UvigoMED [Dataset]. http://doi.org/10.17632/7ph4hhh429.1
Explore at:
Unique identifier
https://doi.org/10.17632/7ph4hhh429.1
Dataset updated
Oct 24, 2016
Authors
Marcos Mouriño García
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
A bilingual corpus composed of biomedical abstracts written in English and Spanish, extracted from MEDLINE.
Triple random ensemble method for multi-label classification
researchdata.edu.au
dro.deakin.edu.au
Updated Sep 25, 2024
Cite
G Tsoumakas; G Nasierding; Abbas Z. Kouzani (2024). Triple random ensemble method for multi-label classification [Dataset]. https://researchdata.edu.au/triple-random-ensemble-label-classification/3385179
Explore at:
Dataset updated
Sep 25, 2024
Dataset provided by
Deakin University
Authors
G Tsoumakas; G Nasierding; Abbas Z. Kouzani
Description
Triple random ensemble method for multi-label classification
Nested cross validation is overzelous
figshare.com
txt
Updated Feb 27, 2021
Cite
Jacques Wainer (2021). Nested cross validation is overzelous [Dataset]. http://doi.org/10.6084/m9.figshare.3457238.v2
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.3457238.v2
Dataset updated
Feb 27, 2021
Dataset provided by
figshare
Authors
Jacques Wainer
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data and programs for the paper "Nested cross-validation when selecting machine learning algorithms is overzealous"
Results: Time Series Indexing (TSI)
bridges.monash.edu
researchdata.edu.au
application/x-rar
Updated Feb 20, 2017
Cite
Chang Wei Tan (2017). Results: Time Series Indexing (TSI) [Dataset]. http://doi.org/10.4225/03/587db0d0b3770
Explore at:
Available download formats: application/x-rar
Unique identifier
https://doi.org/10.4225/03/587db0d0b3770
Dataset updated
Feb 20, 2017
Dataset provided by
Monash University
Authors
Chang Wei Tan
License
https://www.gnu.org/licenses/gpl-3.0.html
Description
These are the results for the work published in "Indexing and classifying gigabytes of time series under time warping".
Amharic text dataset extracted from memes for hate speech detection or...
data.mendeley.com
Updated Jun 8, 2023
+ more versions
Cite
Mequanent Degu (2023). Amharic text dataset extracted from memes for hate speech detection or classification [Dataset]. http://doi.org/10.17632/gw3fdtw5v7.2
Explore at:
Unique identifier
https://doi.org/10.17632/gw3fdtw5v7.2
Dataset updated
Jun 8, 2023
Authors
Mequanent Degu
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset was collected from social media such as Facebook and Telegram and then further processed. The collection comprises three versions: orginal_cleaned, which is neither stemmed nor stripped of stopwords; stopword_removed, in which stopwords are removed but the text is not stemmed; and stemed, in which the text is stemmed and stopwords are removed. Stemming was done using HornMorpho, developed by Michael Gasser (available at https://github.com/hltdi/HornMorpho). All datasets are normalized and free from noise such as punctuation marks and emojis.
Results
bridges.monash.edu
figshare.com
xlsx
Updated Nov 9, 2018
+ more versions
Cite
Chang Wei Tan (2018). Results [Dataset]. http://doi.org/10.26180/5be4d0a9b1937
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.26180/5be4d0a9b1937
Dataset updated
Nov 9, 2018
Dataset provided by
Monash University
Authors
Chang Wei Tan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the result for the paper "Elastic band across the path: A new framework to lower bound DTW"
Data from: MusicOSet: An Enhanced Open Dataset for Music Data Mining
zenodo.org
data.niaid.nih.gov
+1more
bin, zip
Updated Jun 7, 2021
Cite
Mariana O. Silva; Laís Mota; Mirella M. Moro (2021). MusicOSet: An Enhanced Open Dataset for Music Data Mining [Dataset]. http://doi.org/10.5281/zenodo.4904639
Explore at:
Available download formats: zip, bin
Unique identifier
https://doi.org/10.5281/zenodo.4904639
Dataset updated
Jun 7, 2021
Dataset provided by
Zenodo http://zenodo.org/
Authors
Mariana O. Silva; Laís Mota; Mirella M. Moro
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MusicOSet is an open and enhanced dataset of musical elements (artists, songs and albums) based on musical popularity classification. It provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical features sources. Data from all three categories were initially collected between January and May 2019; the data were then updated and enhanced in June 2019.

The attractive features of MusicOSet include:

Integration and centralization of different musical data sources

Calculation of popularity scores and classification of hits and non-hits musical elements, varying from 1962 to 2018

Enriched metadata for music, artists, and albums from the US popular music industry

Availability of acoustic and lyrical resources

Unrestricted access in two formats: SQL database and compressed .csv files

| Data | # Records |
|:-----------------:|:---------:|
| Songs | 20,405 |
| Artists | 11,518 |
| Albums | 26,522 |
| Lyrics | 19,664 |
| Acoustic Features | 20,405 |
| Genres | 1,561 |
