100+ datasets found
  1. Input data considered for the biomedical research assimilator context.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Georgia Tsiliki; Nikos Karacapilidis; Spyros Christodoulou; Manolis Tzagarakis (2023). Input data considered for the biomedical research assimilator context. [Dataset]. http://doi.org/10.1371/journal.pone.0108600.t001
    Explore at: xls
    Dataset provided by
    PLOS ONE
    Authors
    Georgia Tsiliki; Nikos Karacapilidis; Spyros Christodoulou; Manolis Tzagarakis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Input data considered for the biomedical research assimilator context.

  2. [Research Data] Mining Relevant Solutions for Programming Tasks from Search...

    • explore.openaire.eu
    Updated Apr 18, 2022
    Cite
    Adriano Mendonça Rocha; Marcelo De Almeida Maia (2022). [Research Data] Mining Relevant Solutions for Programming Tasks from Search Engine Results [Dataset]. http://doi.org/10.5281/zenodo.6467629
    Authors
    Adriano Mendonça Rocha; Marcelo De Almeida Maia
    Description

    [Abstract] Software development is a knowledge-intensive activity. Official documentation may not cover all developer needs, so searching for information on the Internet is common practice; however, finding genuinely useful information can be challenging, because the best solutions are not always among the first-ranked pages. Developers therefore have to read and discard irrelevant pages, that is, pages that lack code examples or whose content has little focus on the desired solution. This work proposes an approach to mine relevant solutions for programming tasks from search engine results by removing irrelevant pages. The approach works as follows: a query related to the programming task is prepared and given as input to a search engine, and the returned pages pass through an automatic filter that selects relevant pages. We evaluated the top-20 pages returned by the Google search engine for 10 different queries and observed that only 31% of the evaluated pages are relevant to developers. We then proposed and evaluated three different approaches to mine the relevant pages returned by the search engine. Google's search engine was used as a baseline, and our results show that it returns a considerable number of pages irrelevant to developers, and that an effective approach can remove irrelevant pages, suggesting that developers could benefit from a customized web search filter for development content.

    [Contents of Research Data.rar file] The Research Data.rar file has a folder called Research Data that contains 3 subfolders: “01 – Source Code”, “02 - Data” and “03 – Preprocessing rules”. The folder “01 – Source Code” contains the Java source code of the implementations of the proposed approaches. The folder “02 - Data” contains the data of the evaluations carried out in the work, in the subfolders “01 - Evaluation results of pages returned by Google” and “02 - Results of approaches comparisons”. The former holds the evaluations carried out on the first 20 pages returned by Google, following the criteria defined in the work, for the 10 queries considered. The latter contains the results of the evaluation of the proposed approaches for the same 10 queries; in this evaluation, the number of pages given as input to the approaches was increased from 3 to 20, and a folder with results was generated for each number of pages. Besides the Precision, Recall and F-Measure results in the file Results Approaches.txt, other files were generated for analysis: for example, Instances_without_outliers.txt shows which pages were filtered out after applying the outlier-page removal filter, while Selected Pages Approach 4.txt shows which pages were filtered after applying the filters of the GORCUO approach. The folder “03 - Preprocessing rules” contains Rules.java, the commented Java source code implementing the rules created in the pre-processing stage of the proposed approach.
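
    As a toy illustration of the evaluation metrics mentioned above (the Precision, Recall and F-Measure reported in Results Approaches.txt), a short R sketch; the page IDs and relevance judgments are invented, not taken from the dataset:

      # Hypothetical example: score a page-relevance filter against developer judgments
      relevant <- c("p1", "p3", "p5", "p8")   # pages a developer judged relevant
      selected <- c("p1", "p3", "p4", "p5")   # pages kept by a filtering approach

      tp <- length(intersect(selected, relevant))   # relevant pages that were kept
      precision <- tp / length(selected)            # 3/4 = 0.75
      recall    <- tp / length(relevant)            # 3/4 = 0.75
      f_measure <- 2 * precision * recall / (precision + recall)
      c(precision = precision, recall = recall, f_measure = f_measure)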

  3. Data from: Privacy Preserving Outlier Detection through Random Nonlinear...

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Privacy Preserving Outlier Detection through Random Nonlinear Data Distortion [Dataset]. https://catalog.data.gov/dataset/privacy-preserving-outlier-detection-through-random-nonlinear-data-distortion
    Dataset provided by
    Dashlink
    Description

    Consider a scenario in which the data owner has some private/sensitive data and wants a data miner to access it for studying important patterns without revealing the sensitive information. Privacy preserving data mining aims to solve this problem by randomly transforming the data prior to its release to data miners. Previous work only considered the case of linear data perturbations — additive, multiplicative or a combination of both for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy preserving anomaly detection from sensitive datasets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that for specific cases it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. Experiments conducted on real-life datasets demonstrate the effectiveness of the approach.
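
    For intuition only, here is a small R sketch of the general idea of distorting data with a randomly parameterized sigmoid before outlier analysis; it is a simplified illustration under assumed parameters, not the transformation, bounds, or privacy quantification defined in the paper:

      # Toy sketch: mostly inliers plus three clear outliers
      set.seed(42)
      x <- c(rnorm(100), 8, 9, 10)

      sigmoid_distort <- function(x, a = runif(1, 0.5, 2), b = runif(1, -1, 1)) {
        # random slope a and shift b control the degree of nonlinearity
        1 / (1 + exp(-a * (x - b)))
      }

      y <- sigmoid_distort(x)
      # The monotone distortion preserves ranks, so extreme points stay extreme
      # and can be flagged without releasing the original x.
      order(y, decreasing = TRUE)[1:3]   # indices of the three largest distorted values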

  4. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at: docx
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). The dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from the LSC, and instructions for its usage, are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains a title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from the publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing the word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:

    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of the request of the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All following steps can be applied to an arbitrary list of texts from any source with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account when applying our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

    Step 3. Extracting Abstracts and Saving Metadata: Metadata, which include all fields in a document except the abstract, are separated from the abstracts and saved as MetaData.R. The fields of the metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.

    1. Removing punctuation and special characters: all non-alphanumeric characters are substituted by a space. We did not substitute the character “-” in this step, because we need to keep words like “z-score”, “non-payment” and “pre-processing” in order not to lose their actual meaning. Uniting prefixes with words is performed in later steps of pre-processing.
    2. Lowercasing the text data: lowercasing is performed to avoid treating words like “Corpus”, “corpus” and “CORPUS” differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: words containing prefixes joined with the character “-” are united into one word. The prefixes united for this research are listed in the file “list_of_prefixes.csv”. Most of the prefixes are extracted from [4]. We also added commonly used prefixes: ‘e’, ‘extra’, ‘per’, ‘self’ and ‘ultra’.
    4. Substitution of words: some words joined with “-” in the abstracts of the LSC require an additional substitution step, to avoid losing their meaning before removing the character “-”. Examples of such words are “z-test”, “well-known” and “chi-square”; these have been substituted by “ztest”, “wellknown” and “chisquare”. Identification of such words was done by sampling abstracts from LSC. The full list of such words and the decisions taken for substitution are presented in the file “list_of_substitution.csv”.
    5. Removing the character “-”: all remaining “-” characters are replaced by a space.
    6. Removing numbers: all digits that are not part of a word are replaced by a space. Words that contain both digits and letters are kept, because alphanumeric terms such as chemical formulas might be important for our analysis; examples are “co2”, “h2o” and “21st”.
    7. Stemming: stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: stop words are words that are extremely common but provide little value in a language; some common stop words in English are ‘I’, ‘the’ and ‘a’. We used the ‘tm’ package in R to remove stop words [6]; there are 174 English stop words listed in the package.

    Step 5. Writing the LScD into CSV Format: there are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file “LScD.csv”.

    The Organisation of the LScD

    The total number of words in the file “LScD.csv” is 974,238. Each field is described below:

    Word: unique words from the corpus, in lowercase and in their stemmed forms. The field is sorted by the number of documents containing each word, in descending order.
    Number of Documents Containing the Word: a binary count is used: if a word exists in an abstract, it counts as 1, even if it appears more than once in that document. The total number of documents containing the word is the sum of these 1s over the entire corpus.
    Number of Appearances in Corpus: how many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. The outputs are:

    Metadata File: all fields in a document excluding abstracts (List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection).
    File of Abstracts: all abstracts after the pre-processing steps defined in Step 4.
    DTM: the Document Term Matrix constructed from the LSC [6]; each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: an ordered list of words from LSC as defined in the previous section.

    The code can be used as follows:

    1. Download the folder ‘LSC’, ‘list_of_prefixes.csv’ and ‘list_of_substitution.csv’.
    2. Open the LScD_Creation.R script.
    3. Change the parameters in the script: replace them with the full path of the directory with the source files and the full path of the directory to write the output files to.
    4. Run the full code.

    References

    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," available online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
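
    As an illustration of the Step 4 pipeline, a rough R sketch using the ‘tm’ package mentioned above; the authors' full pipeline (prefix uniting, substitutions, number handling) lives in the GitHub repository [2], and the two toy abstracts below are invented:

      library(tm)

      abstracts <- c("Pre-processing of z-score data: 21st-century corpora, co2 and h2o.",
                     "The corpus, the Corpus and the CORPUS are the same word.")

      corp <- VCorpus(VectorSource(abstracts))
      corp <- tm_map(corp, content_transformer(tolower))                                   # lowercasing
      corp <- tm_map(corp, content_transformer(function(x) gsub("[^a-z0-9-]", " ", x)))    # keep "-" for now
      corp <- tm_map(corp, content_transformer(function(x) gsub("-", " ", x)))             # then drop "-"
      corp <- tm_map(corp, removeWords, stopwords("en"))                                   # 174 stop words
      corp <- tm_map(corp, stemDocument)                                                   # stemming
      corp <- tm_map(corp, stripWhitespace)

      dtm <- DocumentTermMatrix(corp)   # the DTM output described above
      Terms(dtm)                        # the surviving unique (stemmed) words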

  5. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a retailer's dataset in which the transaction data covers all transactions that occurred over a period of time. The retailer will use the results to grow its business: by suggesting relevant itemsets to customers, it can increase customer engagement, improve the customer experience, and identify customer behavior. I will approach this problem with association rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association rules are most often used when you are planning to build associations between different objects in a set, i.e., to find frequent patterns in a transaction database. They can tell you what items customers frequently buy together, allowing the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(mouse) = 0.08/0.10 = 0.80; lift = confidence / P(mat) = 0.80/0.09 ≈ 8.9. This is just a simple example; in practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
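
    The same toy numbers can be checked in R (a worked illustration only; nothing here touches the dataset):

      n <- 100
      p_mouse <- 10 / n      # P(computer mouse)
      p_mat   <- 9 / n       # P(mouse mat)
      p_both  <- 8 / n       # P(mouse & mat)

      support    <- p_both               # 0.08
      confidence <- p_both / p_mouse     # 0.80, for the rule mouse => mat
      lift       <- confidence / p_mat   # ~8.9; equivalently p_both / (p_mouse * p_mat)
      c(support = support, confidence = confidence, lift = lift)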

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    [Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png]

    Libraries in R

    First, we need to load the required libraries. Each library is described briefly below.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets; the package also includes several interactive visualizations for rule exploration.
    • tidyverse - An opinionated collection of R packages designed for data science; the package makes it easy to install and load multiple 'tidyverse' packages in a single step.
    • readxl - Reads Excel files in R.
    • plyr - Tools for splitting, applying and combining data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics": you provide the data and tell 'ggplot2' how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator forwards a value, or the result of an expression, into the next function call or expression; there is flexible support for the type of right-hand-side expressions.
    • dplyr - A fast, consistent tool for working with data-frame-like objects, both in memory and out of memory.

    [Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png]

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    [Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png] [Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png]

    Next, we clean our data frame by removing missing values.

    [Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png]

    To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
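
    A minimal R sketch of this conversion and of running the Apriori algorithm with 'arules'; the column names follow the dataset description above, while the file path and the support/confidence thresholds are assumptions, not the author's exact settings:

      library(readxl)
      library(arules)

      retail <- read_excel("Assignment-1_Data.xlsx")
      retail <- retail[!is.na(retail$Itemname) & !is.na(retail$BillNo), ]   # drop missing values

      # Group item names by invoice so items bought together form one transaction
      baskets <- lapply(split(retail$Itemname, retail$BillNo), unique)
      trans   <- as(baskets, "transactions")

      rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))
      inspect(head(sort(rules, by = "lift"), 5))   # strongest rules first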

  6. Orange dataset table

    • figshare.com
    xlsx
    Updated Mar 4, 2022
    Cite
    Rui Simões (2022). Orange dataset table [Dataset]. http://doi.org/10.6084/m9.figshare.19146410.v1
    Explore at: xlsx
    Dataset provided by
    figshare
    Authors
    Rui Simões
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity; MitoTracker Red CMXRos area and intensity (3 h and 24 h incubations with both compounds); MitoSOX oxidation (3 h incubation with the referred compounds) and oxidation rate; DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate; and DQ-BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA, and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, contains no missing values, and was standardized across features. The small number of samples prevented a full and strong statistical analysis of the results; nevertheless, it allowed the identification of relevant hidden patterns and trends.

    Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using the Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments were performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.
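
    For readers wanting to reproduce a comparable analysis outside Orange, a small R sketch of hierarchical clustering with Euclidean distances and weighted (WPGMA) linkage, as described above; the feature matrix here is random stand-in data, not the study's 36 samples:

      set.seed(1)
      features <- matrix(rnorm(36 * 11), nrow = 36,
                         dimnames = list(paste0("sample", 1:36), paste0("feat", 1:11)))

      d  <- dist(scale(features), method = "euclidean")   # standardize, then distances
      hc <- hclust(d, method = "mcquitty")                # "mcquitty" = weighted (WPGMA) linkage
      plot(hc, cex = 0.6)                                 # dendrogram of the 36 samples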

  7. Data from: Client-side Web Mining for Community Formation in Peer-to-Peer...

    • catalog.data.gov
    • data.nasa.gov
    • +1more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Client-side Web Mining for Community Formation in Peer-to-Peer Environments [Dataset]. https://catalog.data.gov/dataset/client-side-web-mining-for-community-formation-in-peer-to-peer-environments
    Dataset provided by
    Dashlink
    Description

    In this paper we present a framework for forming interests-based Peer-to-Peer communities using client-side web browsing history. At the heart of this framework is the use of an order statistics-based approach to build communities with hierarchical structure. We have also carefully considered privacy concerns of the peers and adopted cryptographic protocols to measure similarity between them without disclosing their personal profiles. We evaluated our framework on a distributed data mining platform we have developed. The experimental results show that our framework could effectively build interests-based communities.

  8. Main technologies per senior business executives globally in 2023

    • ai-chatbox.pro
    • statista.com
    Updated Apr 25, 2025
    Cite
    Bergur Thormundsson (2025). Main technologies per senior business executives globally in 2023 [Dataset]. https://www.ai-chatbox.pro/?_=%2Fstudy%2F39859%2Fblockchain-statista-dossier%2F%23XgboD02vawLYpGJjSPEePEUG%2FVFd%2Bik%3D
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Bergur Thormundsson
    Description

    During a 2023 survey conducted in a variety of countries across the globe, it was found that 50 percent of respondents considered artificial intelligence (AI) to be a technology of strategic importance and would prioritize it in the coming year. 5G came in hot on the heels of AI, with 46 percent of respondents saying they would prioritize it.

    Artificial intelligence

    Artificial intelligence refers to the development of computer and machine capabilities that mimic those of the human mind, such as problem-solving and decision-making. In particular, AI learns from previous experience to understand and respond to language, decisions, and problems. In recent years, more and more industries, from automotive to retail to healthcare, have adopted AI, deploying it to perform a variety of tasks, including service operations and supply chain management. However, given its fast development, AI is not only affecting industries and job markets but is also impacting our everyday life.

    Big data analytics

    The expression “big data” indicates extremely large data sets that are difficult to process using traditional data-processing application software. In recent years, the size of the big data analytics market has increased and is forecast to amount to over 308 billion U.S. dollars in 2023. The growth of the big data analytics market has been fueled by the exponential growth in the volume of data exchanged online via a variety of sources, ranging from healthcare to social media. Tech giants like Oracle, Microsoft, and IBM form part of the market, providing big data analytics software tools for predictive analytics, forecasting, data mining, and optimization.

  9. Student oriented subset of the Open University Learning Analytics dataset

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Sep 30, 2021
    Cite
    Gabriella Casalino; Gabriella Casalino; Giovanna Castellano; Giovanna Castellano; Gennaro Vessio; Gennaro Vessio (2021). Student oriented subset of the Open University Learning Analytics dataset [Dataset]. http://doi.org/10.5281/zenodo.4264397
    Explore at: csv
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gabriella Casalino; Gabriella Casalino; Giovanna Castellano; Giovanna Castellano; Gennaro Vessio; Gennaro Vessio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Open University (OU) dataset is an open database containing student demographic and click-stream interaction with the virtual learning platform. The available data are structured in different CSV files. You can find more information about the original dataset at the following link: https://analyse.kmi.open.ac.uk/open_dataset.

    We extracted a subset of the original dataset that focuses on student information: 25,819 records were collected, each referring to a specific student, course, and semester. Each record is described by the following 20 attributes: code_module, code_presentation, gender, highest_education, imd_band, age_band, num_of_prev_attempts, studies_credits, disability, resource, homepage, forum, glossary, outcontent, subpage, url, outcollaborate, quiz, AvgScore, count.

    Two target classes were considered, namely Fail and Pass, obtained by combining the original four classes (Fail with Withdrawn, and Pass with Distinction, respectively). The final_result attribute contains the target values.

    All features have been converted to numbers for automatic processing.

    Below is the mapping used to convert categorical values to numeric (a short R sketch applying it follows the list):

    • code_module: 'AAA'=0, 'BBB'=1, 'CCC'=2, 'DDD'=3, 'EEE'=4, 'FFF'=5, 'GGG'=6
    • code_presentation: '2013B'=0, '2013J'=1, '2014B'=2, '2014J'=3
    • gender: 'F'=0, 'M'=1
    • highest_education: 'No_Formal_quals'=0, 'Post_Graduate_Qualification'=1, 'HE_Qualification'=2, 'Lower_Than_A_Level'=3, 'A_level_or_Equivalent'=4
    • imd_band: 'unknown'=0, 'between_0_and_10_percent'=1, 'between_10_and_20_percent'=2, 'between_20_and_30_percent'=3, 'between_30_and_40_percent'=4, 'between_40_and_50_percent'=5, 'between_50_and_60_percent'=6, 'between_60_and_70_percent'=7, 'between_70_and_80_percent'=8, 'between_80_and_90_percent'=9, 'between_90_and_100_percent'=10
    • age_band: 'between_0_and_35'=0, 'between_35_and_55'=1, 'higher_than_55'=2
    • disability: 'N'=0, 'Y'=1
    • student's outcome: 'Fail'=0, 'Pass'=1
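
    A minimal R sketch of this recoding, using the mapping above; the data frame and the raw string values are hypothetical stand-ins for the original OULAD encodings:

      students <- data.frame(gender = c("F", "M", "M"),
                             age_band = c("0-35", "35-55", "55<="),
                             final_result = c("Pass", "Fail", "Distinction"))

      gender_map <- c(F = 0, M = 1)
      age_map    <- c("0-35" = 0, "35-55" = 1, "55<=" = 2)
      result_map <- c(Fail = 0, Withdrawn = 0, Pass = 1, Distinction = 1)  # four classes to two

      students$gender       <- gender_map[students$gender]
      students$age_band     <- age_map[students$age_band]
      students$final_result <- result_map[students$final_result]
      students   # all-numeric records, as in the published subset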

    For more detailed information, please refer to:


    Casalino G., Castellano G., Vessio G. (2021) Exploiting Time in Adaptive Learning from Educational Data. In: Agrati L.S. et al. (eds) Bridges and Mediation in Higher Distance Education. HELMeTO 2020. Communications in Computer and Information Science, vol 1344. Springer, Cham. https://doi.org/10.1007/978-3-030-67435-9_1

  10. Privacy Preservation through Random Nonlinear Distortion

    • s.cnmilf.com
    • catalog.data.gov
    Updated Apr 9, 2025
    Cite
    Dashlink (2025). Privacy Preservation through Random Nonlinear Distortion [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/privacy-preservation-through-random-nonlinear-distortion
    Dataset provided by
    Dashlink
    Description

    Consider a scenario in which the data owner has some private or sensitive data and wants a data miner to access them for studying important patterns without revealing the sensitive information. Privacy-preserving data mining aims to solve this problem by randomly transforming the data prior to their release to the data miners. Previous works only considered the case of linear data perturbations - additive, multiplicative, or a combination of both - for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy-preserving anomaly detection from sensitive data sets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that, for specific cases, it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. The experiments conducted on real-life data sets demonstrate the effectiveness of the approach.

  11. Supporting data for "CoVEffect: Interactive System for Mining the Effects of...

    • zenodo.org
    • data.niaid.nih.gov
    bin, tsv, txt
    Updated Apr 20, 2023
    Cite
    Giuseppe Serna Garcia; Giuseppe Serna Garcia; Ruba Al Khalaf; Ruba Al Khalaf; Francesco Invernici; Francesco Invernici; Anna Bernasconi; Anna Bernasconi; Stefano Ceri; Stefano Ceri (2023). Supporting data for "CoVEffect: Interactive System for Mining the Effects of SARS-CoV-2 Mutations and Variants Based on Deep Learning" [Dataset]. http://doi.org/10.5281/zenodo.7817520
    Explore at: tsv, bin, txt
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Giuseppe Serna Garcia; Giuseppe Serna Garcia; Ruba Al Khalaf; Ruba Al Khalaf; Francesco Invernici; Francesco Invernici; Anna Bernasconi; Anna Bernasconi; Stefano Ceri; Stefano Ceri
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains the datasets created and extracted for the paper:

    Giuseppe Serna García, Ruba Al Khalaf, Francesco Invernici, Stefano Ceri, and Anna Bernasconi. 2022.
    "CoVEffect: Interactive System for Mining the Effects of SARS-CoV-2 Mutations and Variants Based on Deep Learning". (Available online at http://gmql.eu/coveffect)

    --------------------------------------------------------------------------------
    LIST OF FILES WITH DESCRIPTION:
    --------------------------------------------------------------------------------

    AdditionalFile1-effects-taxonomy:
    Descriptions of legal values for the 'Effect' field, based on a categorized taxonomy.

    AdditionalFile2-levels-taxonomy:
    Descriptions of legal values for the 'Level' field.

    AdditionalFile3-training_dataset_target:
    List of target tuples (manually annotated) of 221 abstracts considered for training the model. For each abstract, target tuples follow the schema ID, DOI, title, entity, effect, level, type (mutation or variant), tuples_count (>1 when an effect/level is shared by multiple entities, #abstracts containing the same effect described in the tuple).

    AdditionalFile4-validation_dataset_target:
    List of target tuples (manually annotated) of 50 abstracts considered for validating the prepared prediction model.
    For each abstract, target tuples follow the schema defined for AdditionalFile3.

    AdditionalFile5-validation_dataset_highlighted:
    Textual abstracts of the 50 manuscripts considered for validation; the text used to support the manual target annotations has been highlighted in yellow.

    AdditionalFile6-validation_dataset_prediction:
    List of predicted annotations of the 50 abstracts considered for validating the prepared prediction model. The file is split into four TSV files, for entity (a), effect (b), level (c), and whole-tuple (d) predictions, respectively.

    AdditionalFile7-keywords_query_list:
    Keyword-based search run on the CORD-19 dataset to extract a relevant subset of abstracts regarding the scope of interest of CoVEffect. The Boolean logic used to combine keywords is explained in the section 'Annotations of the biology-related CORD-19 cluster'.

    AdditionalFile8-CORD-19_batch_dataset_metadata:
    Metadata of the 7,230 papers extracted by the keyword-based query in AdditionalFile7.
    These abstracts have been annotated by the prediction framework.

    AdditionalFile9-CORD-19_batch_dataset_prediction:
    List of predicted annotations of 7,230 abstracts extracted from the biology-related cluster of CORD-19.

    AdditionalFile10-test_dataset_target:
    List of target tuples (manually annotated) of 100 abstracts randomly selected from the 7,230 extracted as in AdditionalFile8.
    For each abstract, target tuples follow the schema defined for AdditionalFile3.

    AdditionalFile11-test_dataset_prediction:
    List of predicted annotations of the 100 abstracts considered for testing the prediction model on a subset of the CORD-19 biology-related cluster. Like AdditionalFile6, it is split into four TSV files, for entity (a), effect (b), level (c), and whole-tuple (d) predictions, respectively.

  12. WV Mining - Mineral Operations

    • data.amerigeoss.org
    • cloud.csiss.gmu.edu
    • +1more
    html
    Updated Aug 9, 2019
    Cite
    Energy Data Exchange (2019). WV Mining - Mineral Operations [Dataset]. https://data.amerigeoss.org/dataset/wv-mining-mineral-operations
    Explore at: html
    Dataset provided by
    Energy Data Exchange
    Area covered
    West Virginia
    Description

    From the site: "Commodities covered by the Minerals Information Team (MIT) of the U.S. Geological Survey. Included are crushed stone operations considered active in 1998 with production greater than 30,000 tons; ferrous metal processing plants considered active in 1997; miscellaneous industrial minerals plants and/or mines considered active in 1997, or in 2001 for fullers earth and kaolin (this file is an update of the 1998 data set); nonferrous metal processing plants considered active in 1997; refractory, abrasive, and other industrial minerals plants and/or mines considered active in 1997, or in 2001 for bentonite and fire clay; and sand and gravel operations considered active in 1998 with production greater than 30,000 tons. Companies represented were surveyed by the MIT.

    Shapefiles representing six mineral operations nationwide were downloaded from the National Atlas of the United States website, clipped individually to a 1:24,000 WV boundary, then merged into a single shapefile. Published April 2002."

  13. Data from: Semi-supervised Multi-View Learning for Gene Network...

    • figshare.com
    zip
    Updated Jan 20, 2016
    Cite
    Gianvito Pio (2016). Semi-supervised Multi-View Learning for Gene Network Reconstruction [Dataset]. http://doi.org/10.6084/m9.figshare.1604827.v8
    Explore at: zip
    Dataset provided by
    figshare
    Authors
    Gianvito Pio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Semi-supervised Multi-View Learning for Gene Network Reconstruction

    SynTReN Data:
    • E. coli and Yeast sub-networks, generated expression data and gold standards (Input_Datasets.zip)
    • Interactions predicted by base methods (Base_Method_Predictions.zip)
    • Interactions predicted by our approach, clustering performed with PCA (Predictions.zip)
    • Interactions predicted by our approach, clustering performed with K-means (PredictionsK.zip)

    Dream5 Data:
    • Expression data and gold standards provided by Marbach et al. 2012 [1]
    • Interactions predicted by the considered DREAM5 base methods, provided by Marbach et al. 2012 [1]
    • Interactions predicted by our approach, clustering performed with PCA (Predictions_D5.zip)
    • Interactions predicted by our approach, clustering performed with K-means (PredictionsK_D5.zip)

    [1] Marbach, D., Costello, J. C., Kuffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., Allison, K. R., Kellis, M., Collins, J. J., and Stolovitzky, G., Wisdom of crowds for robust gene network inference, Nature Methods, 9, 796-804, 2012.

  14. Data Mining and Unsupervised Machine Learning in Canadian In Situ Oil Sands...

    • data.mendeley.com
    Updated Aug 30, 2020
    Cite
    Minxing Si (2020). Data Mining and Unsupervised Machine Learning in Canadian In Situ Oil Sands Database for Knowledge Discovery and Carbon Cost Analysis [Dataset]. http://doi.org/10.17632/8ngkgz69zb.3
    Authors
    Minxing Si
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A better understanding of greenhouse gas (GHG) emissions resulting from oil sands (bitumen) extraction can help to meet global oil demands, identify potential mitigation measures, and design effective carbon policies. While several studies have attempted to model GHG emissions from oil sands extraction, these studies have encountered data availability challenges, particularly with respect to actual fuel use data, and have thus struggled to accurately quantify GHG emissions. In this study, we extracted operating data from a public database, Petrinex, containing over 35 million records for 20 in situ oil sands extraction schemes. From 2015 to 2019, the weighted averages of fuel use for schemes employing steam-assisted gravity drainage (SAGD) and cyclic steam stimulation (CSS) were 0.21 × 10³ m³ of fuel to produce 1 m³ of bitumen (0.24 × 10³ m³/m³) and 0.34 × 10³ m³ of fuel to produce 1 m³ of bitumen (0.34 × 10³ m³/m³), respectively. The weighted average emission intensity (EI) for SAGD was 0.39 t CO2e/m³ of undiluted bitumen (62 kg CO2e/bbl), and the weighted average EI for CSS was 0.65 t CO2e/m³ of undiluted bitumen (103 kg CO2e/bbl). At a carbon price of CAD $30/t CO2e and an undiluted bitumen price of CAD $326/m³ (USD $39/bbl), the average carbon cost accounted for 2% (carbon cost in $/t CO2e relative to the value of undiluted bitumen in $/m³). A single emission cap for the entire in situ oil sands sector is not appropriate, because carbon costs vary significantly. To prevent carbon leakage due to competitiveness migration, facility-specific or recovery-method-specific emission caps should be considered when designing carbon pricing programs. A single intensity-based emissions cap was ineffective in reducing emissions: the annual average emission intensity for the 20 in situ oil sands operations decreased by 15%, while absolute emissions increased by 120%. The combination of an intensity-based emissions cap and an absolute emissions cap should be considered as a means of bending the emissions curve.

  15. Hybrid models based on genetic algorithm and deep learning algorithms for...

    • data.mendeley.com
    Updated Oct 18, 2022
    Cite
    Serhat KILIÇARSLAN (2022). Hybrid models based on genetic algorithm and deep learning algorithms for nutritional Anemia disease classification. Biomedical Signal Processing and Control, 63, 102231. https://doi.org/10.1016/j.bspc.2020.102231 [Dataset]. http://doi.org/10.17632/dt89jydgnv.1
    Authors
    Serhat KILIÇARSLAN
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The anemia dataset used in this study was obtained from the Faculty of Medicine, Tokat Gaziosmanpaşa University, Turkey. The data contain the complete blood count test results of 15,300 patients over the 5-year interval between 2013 and 2018. Data from pregnant women, children, and patients with cancer were excluded from the study. The noise in the dataset was eliminated, and parameters considered insignificant for the diagnosis of anemia were excluded with the help of experts. Some records had missing parameter values or values outside the reference ranges of the parameters, which specialist doctors marked as noise in our study; such records were removed from the dataset. The Pearson correlation method was used to check whether there is any relationship between the parameters. The relationships between the parameters in the dataset are generally weak (below 0.4) [59]; for this reason, none of the parameters were excluded from the dataset. Twenty-four features (Table 1) and 5 classes (Table 2) were used in the study. Since the differences between the parameters in the dataset were very large, a linear transformation was performed on the data with min-max normalization [30]. The dataset consists of data from 15,300 patients, of which 10,379 were female and 4,921 were male: 1,019 (7%) patients with HGB-anemia, 4,182 (27%) with iron deficiency, 199 (1%) with B12 deficiency, 153 (1%) with folate deficiency, and 9,747 (64%) with no anemia (Table 2). The transferrin saturation in the dataset was obtained as the "SDTSD" feature using Eq. (1), which was developed with the help of a specialist physician. Saturation is the ratio of serum iron to total serum iron; in the equation, SD represents serum iron and TSD represents total serum iron.
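
    A small R sketch of the two transformations described above (min-max normalization and the SDTSD saturation feature); the data frame and its values are hypothetical, and SD/TSD follow the text's naming:

      min_max <- function(x) (x - min(x)) / (max(x) - min(x))   # linear map to [0, 1]

      cbc <- data.frame(HGB = c(11.2, 14.1, 9.8),
                        SD  = c(40, 90, 25),     # serum iron
                        TSD = c(300, 280, 350))  # total serum iron

      cbc$SDTSD <- cbc$SD / cbc$TSD    # transferrin saturation feature (Eq. (1))
      cbc[] <- lapply(cbc, min_max)    # min-max normalization of every parameter
      cbc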

  16. Data from: Fleet Level Anomaly Detection of Aviation Safety Data

    • catalog.data.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    • +1more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Fleet Level Anomaly Detection of Aviation Safety Data [Dataset]. https://catalog.data.gov/dataset/fleet-level-anomaly-detection-of-aviation-safety-data
    Dataset provided by
    Dashlink
    Description

    For the purposes of this paper, the National Airspace System (NAS) encompasses the operations of all aircraft that are subject to air traffic control procedures. The NAS is a highly complex dynamic system that is sensitive to aeronautical decision-making and risk management skills. In order to ensure a healthy system with safe flights, a systematic approach to anomaly detection is very important when evaluating a given set of circumstances and determining the best possible course of action. Given that the NAS is a vast and loosely integrated network of systems, it requires improved safety assurance capabilities to maintain an extremely low accident rate under increasingly dense operating conditions. Data mining based tools and techniques are required to support and aid the overall decision-making capacity of operators (such as pilots, management, or policy makers). Within the NAS, the ability to analyze fleetwide aircraft data autonomously is still considered a significantly challenging task. For our purposes, a fleet is defined as a group of aircraft sharing generally compatible parameter lists. In this effort, we aim to develop a system-level analysis scheme. In this paper we address the capability to detect fleetwide anomalies as they occur, which is itself an important initiative toward the safety of real-world flight operations. Flight data recorders archive millions of data points with valuable information on flights every day. The operational parameters consist of both continuous and discrete (binary and categorical) data from several critical subsystems and numerous complex procedures. In this paper, we discuss a system-level anomaly detection approach based on the theory of kernel learning to detect potential safety anomalies in a very large database of commercial aircraft. We also demonstrate that the proposed approach uncovers some operationally significant events due to environmental, mechanical, and human factors issues in high-dimensional, multivariate Flight Operations Quality Assurance (FOQA) data. We present the results of our detection algorithms on real FOQA data from a regional carrier.
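
    As a rough illustration of kernel-based anomaly detection in the spirit described above (not the paper's algorithm, and not FOQA data), a one-class SVM sketch in R with the 'e1071' package on synthetic "flights":

      library(e1071)

      set.seed(7)
      flights <- rbind(matrix(rnorm(200 * 5), ncol = 5),          # nominal operations
                       matrix(rnorm(5 * 5, mean = 4), ncol = 5))  # a few anomalous flights

      fit <- svm(flights, type = "one-classification",
                 kernel = "radial", nu = 0.05)    # RBF kernel; expect roughly 5% outliers
      which(!predict(fit, flights))               # indices flagged as anomalous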

  17. Privacy Preservation through Random Nonlinear Distortion - Dataset - NASA...

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Privacy Preservation through Random Nonlinear Distortion - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/privacy-preservation-through-random-nonlinear-distortion
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Consider a scenario in which the data owner has some private or sensitive data and wants a data miner to access them for studying important patterns without revealing the sensitive information. Privacy-preserving data mining aims to solve this problem by randomly transforming the data prior to their release to the data miners. Previous works only considered the case of linear data perturbations - additive, multiplicative, or a combination of both - for studying the usefulness of the perturbed output. In this paper, we discuss nonlinear data distortion using potentially nonlinear random data transformation and show how it can be useful for privacy-preserving anomaly detection from sensitive data sets. We develop bounds on the expected accuracy of the nonlinear distortion and also quantify privacy by using standard definitions. The highlight of this approach is to allow a user to control the amount of privacy by varying the degree of nonlinearity. We show how our general transformation can be used for anomaly detection in practice for two specific problem instances: a linear model and a popular nonlinear model using the sigmoid function. We also analyze the proposed nonlinear transformation in full generality and then show that, for specific cases, it is distance preserving. A main contribution of this paper is the discussion between the invertibility of a transformation and privacy preservation and the application of these techniques to outlier detection. The experiments conducted on real-life data sets demonstrate the effectiveness of the approach.

  18. Dataset of 30 energy customers with flexibility data, and distributed...

    • data.niaid.nih.gov
    Updated Apr 1, 2024
    Cite
    Vale, Zita (2024). Dataset of 30 energy customers with flexibility data, and distributed generation, considering residential, small commerce, large commerce, and industrial customers [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6783288
    Dataset provided by
    Vale, Zita
    Gomes, Luis
    Morais, Hugo
    Pereira, Helder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset has 30 customers: ten residential, ten small commerce, five large commerce, and five industrial customers. The combination of several energy customer types allows the creation of a dataset with different types of consumption profiles, generation, and flexibility, and, therefore, different values of participation in demand response events.

    The residential profiles of the considered customers use the data available in the Working Group on Intelligent Data Mining and Analysis (IDMA): https://site.ieee.org/pes-iss/data-sets/

    The values represent a one-week period using 15-minute reading periods. All values are expressed in kWh, and the matrices were created as [customer x time_period].
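
    A short R sketch of the stated layout (a shape check only; the variable names are assumptions):

      n_customers <- 30
      n_periods   <- 7 * 24 * 4   # one week of 15-minute periods = 672 columns

      consumption <- matrix(0, nrow = n_customers, ncol = n_periods)  # [customer x time_period], kWh
      weekly_kwh  <- rowSums(consumption)   # total weekly energy per customer
      dim(consumption)                      # 30 x 672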

    We would be grateful if you could acknowledge the use of this dataset in your publications. Please use the Zenodo publication to cite this work.

  19. Replication Data for: Detecting Rocks in Challenging Mining Environments...

    • datos.uchile.cl
    zip
    Updated Apr 12, 2023
    Cite
    Patricio Loncomilla; Samtani, Pavan; Javier Ruiz-del-Solar; Patricio Loncomilla; Samtani, Pavan; Javier Ruiz-del-Solar (2023). Replication Data for: Detecting Rocks in Challenging Mining Environments using Convolutional Neural Networks and Ellipses as an alternative to Bounding Boxes [Dataset]. http://doi.org/10.34691/FK2/1GQBHK
    Explore at: zip (372432555), zip (86831735)
    Dataset provided by
    Repositorio de datos de investigación de la Universidad de Chile
    Authors
    Patricio Loncomilla; Samtani, Pavan; Javier Ruiz-del-Solar; Patricio Loncomilla; Samtani, Pavan; Javier Ruiz-del-Solar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The automation of heavy-duty machinery and vehicles used in underground mines is a growing tendency that requires addressing several challenges, such as the robust detection of rocks in the production areas of mines. For instance, human assistance must be requested when using autonomous LHD (Load-Haul-Dump) loaders in case rocks are too big to be loaded into the bucket. Also, in the case of autonomous rock-breaking hammers, oversized rocks need to be identified and located, to then be broken into smaller sections. In this work, a novel approach called Rocky-CenterNet is proposed for detecting rocks. Unlike other object detectors, Rocky-CenterNet uses ellipses to enclose a rock's bounds, enabling a better description of the shape of the rocks than the classical approach based on bounding boxes. The performance of Rocky-CenterNet is compared with that of CenterNet and Mask R-CNN, which use bounding boxes and segmentation masks, respectively. The comparisons were performed on two datasets: the Hammer-Rocks dataset (introduced in this work) and the Scaled Front View dataset. The Hammer-Rocks dataset was captured in an underground ore pass while a rock-breaking hammer was operating, and includes challenging conditions such as the presence of dust in the air and occluded rocks. The metrics considered relate to the quality of the detections and the processing times involved. The results show that ellipses provide a better approximation of the rocks' shapes than bounding boxes. Moreover, when rocks are annotated using ellipses, Rocky-CenterNet offers the best performance while requiring shorter processing times than Mask R-CNN (4x faster). Thus, using ellipses to describe rocks is a reliable alternative. Both the datasets and the code are available for research purposes.

  20. Data from: Inference of topics with Latent Dirichlet Allocation for Open...

    • figshare.com
    tiff
    Updated Jun 5, 2023
    Cite
    Nádia Felix Felipe da Silva; Núbia Rosa da Silva; Kátia Kelvis Cassiano; Douglas Farias Cordeiro (2023). Inference of topics with Latent Dirichlet Allocation for Open Government Data [Dataset]. http://doi.org/10.6084/m9.figshare.20006430.v1
    Explore at: tiff
    Dataset provided by
    SciELO journals
    Authors
    Nádia Felix Felipe da Silva; Núbia Rosa da Silva; Kátia Kelvis Cassiano; Douglas Farias Cordeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT: Open government data can be considered an important initiative of civil society institutions, promoting transparency and allowing their reuse as input in the development of innovation projects. However, it is common for certain databases to require specific treatment so that the data can be used more efficiently, as in the case of classification using data mining. In this scenario, this paper presents an automatic topic-inference proposal using the Latent Dirichlet Allocation method to classify cultural projects into their thematic areas by identifying the similarity in their data. The results demonstrate the feasibility of the approach in the context of open government data.
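
    For illustration, a minimal R sketch of LDA topic inference with the 'topicmodels' package; the three project descriptions are placeholders, not the paper's corpus of cultural projects:

      library(tm)
      library(topicmodels)

      docs <- c("museum exhibition painting heritage",
                "theater play stage drama festival",
                "painting gallery art heritage museum")

      dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
      lda <- LDA(dtm, k = 2, control = list(seed = 123))   # infer 2 latent topics

      terms(lda, 3)   # top terms per topic
      topics(lda)     # most likely topic per project description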
