18 datasets found
  1. LScDC (Leicester Scientific Dictionary-Core)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScDC (Leicester Scientific Dictionary-Core) [Dataset]. http://doi.org/10.25392/leicester.data.9896579.v3
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    The LScDC (Leicester Scientific Dictionary-Core), April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary), Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we do not repeat the explanation here. The files provided with this description are also the same as those described for LScDC Version 2. The numbers of words in the third versions of LScD and LScDC are summarized below.

    Number of words: LScD (v3): 972,060; LScDC (v3): 103,998

    * Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
    ** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2

    [Version 2] Getting Started. This file describes a sorted and cleaned list of words from the LScD (Leicester Scientific Dictionary), explains the steps for sub-setting the LScD, and gives basic statistics of words in the LSC (Leicester Scientific Corpus), to be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing them, and is available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary was created to be used in future work on the quantification of the sense of research texts.

    The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining algorithms, the use of enormous amounts of text data challenges the performance and accuracy of data mining applications, and the performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Rare words are not useful for discriminating texts in large corpora, as they are likely to be non-informative signals (noise) and redundant in the collection of texts. The selection of relevant words also holds out the possibility of more effective and faster operation of text mining algorithms. To build the LScDC, we decided on the following process on the LScD: removing words that appear in no more than 10 documents (
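
    As a purely illustrative R sketch of that sub-setting rule (the file name and the column names "word" and "doc_count" are assumptions, not the published file's actual headers):

      # Hypothetical sketch: keep only words appearing in more than 10 documents.
      lscd  <- read.csv("LScD.csv")
      lscdc <- lscd[lscd$doc_count > 10, ]
      lscdc <- lscdc[order(-lscdc$doc_count), ]  # order by document frequency
      nrow(lscdc)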

  2. 2018 Database of Effect Sizes (ES)

    • figshare.com
    txt
    Updated Sep 15, 2018
    Cite
    Paul Monsarrat; Jean-Noel Vergnes (2018). 2018 Database of Effect Sizes (ES) [Dataset]. http://doi.org/10.6084/m9.figshare.7066397.v3
    Dataset provided by
    figshare
    Authors
    Paul Monsarrat; Jean-Noel Vergnes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The database contains 18 fields:

    • xmlfile: identification of the xml file from which the ES was extracted.
    • id: a unique identifier of the extracted ES within a given xmlfile.
    • pmid: the PMID identifier of the abstract from which ES were extracted.
    • year: year of publication of the abstract.
    • month: month of publication of the abstract.
    • or: value of the ES (T#0).
    • lci: value of the lower bound of the confidence interval.
    • hci: value of the upper bound of the confidence interval.
    • orhrrr: classification of the type of ES (OR, HR, RR, or PR). PR was treated as OR at the statistical-script stage.
    • ci: type of confidence interval (90%, 95%, or 99%). If no CI was provided, 95% was assumed at the statistical-script stage.
    • counteies: continents extracted from authors' affiliations, using MapAffil and text mining (pipe '|' delimited).
    • adjusted: 1 if an adjusted ES was detected (e.g., adjusted OR, aOR), 0 otherwise.
    • multivariate: 1 if a multivariate analysis was found within the abstract, 0 otherwise.
    • sr: 1 if the abstract was a "Review" (review, systematic review, or meta-analysis), 0 otherwise.
    • ccj: 1 if the journal is in the Core Clinical Journals list, 0 otherwise.
    • doaj: 1 if the journal is in the Directory of Open Access Journals, 0 otherwise.
    • pmc: PubMed Central identifier, if it exists.
    • nlm: unique identifier of the journal abstract.
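
    A hypothetical R loading sketch (the file name and the tab delimiter are assumptions; the field names follow the list above):

      es <- read.delim("effect_sizes.txt")

      # Example query: adjusted odds ratios reported in reviews/meta-analyses.
      subset(es, adjusted == 1 & sr == 1 & orhrrr == "OR")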

  3. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with the Apriori algorithm

    The retailer wants to target customers with suggestions for itemsets that a customer is most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and to provide customers with itemset suggestions, so that we can increase customer engagement, improve customer experience, and identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association rules are most often used when you are planning to discover associations between different objects in a set, i.e., to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.80; lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
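
    These numbers can be reproduced with the arules package on toy data mirroring the counts above (item and object names are illustrative):

      library(arules)

      # Toy transactions: 100 baskets, 10 with a mouse, 9 with a mat, 8 with both.
      baskets <- c(
        rep(list(c("mouse", "mat")), 8),  # bought both
        rep(list("mouse"), 2),            # mouse only
        rep(list("mat"), 1),              # mat only
        rep(list("other"), 89)            # neither
      )
      trans <- as(baskets, "transactions")

      rules <- apriori(trans,
                       parameter = list(supp = 0.05, conf = 0.5, minlen = 2))
      inspect(rules)  # {mouse} => {mat}: support 0.08, confidence 0.80, lift ~8.9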

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data, so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rules

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522,065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    (Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png)

    Libraries in R

    First, we need to load the required libraries. Each library is briefly described below, and a loading snippet follows the list.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets, including several interactive visualizations for rule exploration.
    • tidyverse - An opinionated collection of R packages designed for data science; the package makes it easy to install and load multiple 'tidyverse' packages in a single step.
    • readxl - Read Excel files in R.
    • plyr - Tools for splitting, applying and combining data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics and what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator forwards a value, or the result of an expression, into the next function call or expression, with flexible support for the type of right-hand-side expressions.
    • dplyr - A fast, consistent tool for working with data-frame-like objects, both in memory and out of memory.
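
    A minimal snippet that loads all of the packages just listed (assuming they are already installed, e.g. via install.packages()):

      library(arules)
      library(arulesViz)
      library(tidyverse)
      library(readxl)
      library(plyr)
      library(ggplot2)
      library(knitr)
      library(magrittr)
      library(dplyr)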

    (Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png)

    Data Pre-processing

    Next, we upload Assignment-1_Data.xlsx into R and read the dataset. Now we can see our data in R.

    (Images: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png, https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png)

    Next, we clean the data frame by removing missing values.

    (Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png)

    To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together on one invoice will be in ...
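
    A compact sketch of these pre-processing steps, assuming the column names from the dataset description above (the intermediate basket file name is illustrative):

      library(readxl)
      library(plyr)
      library(arules)

      # Read the dataset and drop rows with missing values.
      retaildata <- read_excel("Assignment-1_Data.xlsx")
      retaildata <- retaildata[complete.cases(retaildata), ]

      # Collapse all items on the same invoice into one comma-separated basket.
      baskets <- ddply(retaildata, "BillNo",
                       function(df) paste(df$Itemname, collapse = ","))

      # Write the baskets out and read them back as arules transaction data.
      writeLines(baskets$V1, "baskets.csv")  # file name is illustrative
      trans <- read.transactions("baskets.csv", format = "basket", sep = ",")
      summary(trans)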

  4. Multivariate Time Series Search

    • catalog.data.gov
    • data.wu.ac.at
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). Multivariate Time Series Search [Dataset]. https://catalog.data.gov/dataset/multivariate-time-series-search
    Dataset provided by
    Dashlink
    Description

    Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns in these MTS databases, which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem: (1) an R-tree Based Search (RBS), which uses Minimum Bounding Rectangles (MBRs) to organize the subsequences, and (2) a List Based Search (LBS) algorithm, which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several million observations. Both tests show that our algorithms have very high prune rates (>95%), requiring actual disk access for less than 5% of the observations. To the best of our knowledge, this is the first flexible MTS search algorithm capable of subsequence search on any subset of variables. Moreover, MTS subsequence search has never been attempted on datasets of the size used in this paper.
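
    As a toy illustration of the MBR idea (not the paper's RBS algorithm): under a pointwise max-norm matching threshold eps, a candidate subsequence whose stored minimum or maximum differs from the query's by more than eps can never match, so it can be pruned without reading its raw data. A minimal R sketch, with all names ours:

      can_prune <- function(query, cand_min, cand_max, eps) {
        # Bounding values alone rule out a max-norm match when either
        # extreme is more than eps away from the query's extreme.
        abs(min(query) - cand_min) > eps || abs(max(query) - cand_max) > eps
      }

      query <- sin(seq(0, 2 * pi, length.out = 50))
      set.seed(1)
      cand <- query + rnorm(50, sd = 0.05)                 # a near match
      can_prune(query, min(cand), max(cand), eps = 0.2)    # FALSE: examine it
      can_prune(query, min(cand) + 1, max(cand) + 1, 0.2)  # TRUE: safely skipped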

  5. List of classifiers employed in the analysis.

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Diego Raphael Amancio; Cesar Henrique Comin; Dalcimar Casanova; Gonzalo Travieso; Odemir Martinez Bruno; Francisco Aparecido Rodrigues; Luciano da Fontoura Costa (2023). List of classifiers employed in the analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0094137.t001
    Dataset provided by
    PLOS ONE
    Authors
    Diego Raphael Amancio; Cesar Henrique Comin; Dalcimar Casanova; Gonzalo Travieso; Odemir Martinez Bruno; Francisco Aparecido Rodrigues; Luciano da Fontoura Costa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    List of classifiers evaluated in our study. The abbreviated names used for some classifiers are indicated after the respective name.

  6. Bluesky Social Dataset

    • zenodo.org
    application/gzip, csv
    Updated Jan 16, 2025
    + more versions
    Cite
    Andrea Failla; Giulio Rossetti (2025). Bluesky Social Dataset [Dataset]. http://doi.org/10.5281/zenodo.14669616
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrea Failla; Giulio Rossetti
    License

    https://bsky.social/about/support/tos

    Description

    Bluesky Social Dataset

    Pollution of online social spaces caused by rampant dis/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue.

    The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

    Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their “like” interactions and time of bookmarking.

    Dataset

    Here is a description of the dataset files; a small loading sketch follows the list.

    • followers.csv.gz. This compressed file contains the anonymized follower edge list. Once decompressed, each row consists of two comma-separated integers representing a directed following relation (i.e., user u follows user v).
    • user_posts.tar.gz. This compressed folder contains data on the individual posts collected. Decompressing this file results in a collection of files, each containing the post of an anonymized user. Each post is stored as a JSON-formatted line.
    • interactions.csv.gz. This compressed file contains the anonymized interactions edge list. Once decompressed, each row consists of six comma-separated integers representing a comment, repost, or quote interaction. These integers correspond to the following fields, in this order: user_id, replied_author, thread_root_author, reposted_author, quoted_author, and date.
    • graphs.tar.gz. This compressed folder contains edge list files for the graphs emerging from reposts, quotes, and replies. Each interaction is timestamped. The folder also contains timestamped higher-order interactions emerging from discussion threads, each containing all users participating in a thread.
    • feed_posts.tar.gz. This compressed folder contains posts that appear in 11 thematic feeds. Decompressing this folder results in 11 files, each containing posts from one feed. Posts are stored as JSON-formatted lines. Fields correspond to those in user_posts.tar.gz, except for those related to sentiment analysis (sent_label, sent_score) and reposts (repost_from, reposted_author).
    • feed_bookmarks.csv. This file contains users who bookmarked any of the collected feeds. Each record contains three comma-separated values: the feed name, user id, and timestamp.
    • feed_post_likes.tar.gz. This compressed folder contains data on likes to posts appearing in the feeds, one file per feed. Each record in the files contains the following information, in this order: the id of the "liker", the id of the post's author, the id of the liked post, and the like timestamp.
    • scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data, and to perform experiments. These scripts are detailed in a document released within the folder.
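
    A minimal R loading sketch, assuming the header-less, comma-separated layout described above (the column names chosen here are ours):

      followers <- read.csv(gzfile("followers.csv.gz"), header = FALSE,
                            col.names = c("follower", "followed"))

      interactions <- read.csv(gzfile("interactions.csv.gz"), header = FALSE,
                               col.names = c("user_id", "replied_author",
                                             "thread_root_author",
                                             "reposted_author",
                                             "quoted_author", "date"))

      # Posts are JSON lines; a decompressed per-user file can be parsed
      # with, e.g., jsonlite::stream_in(file("<user_file>.json")).
      head(followers)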

    Citation

    If used for research purposes, please cite the following paper describing the dataset details:

    Andrea Failla and Giulio Rossetti. "I'm in the Bluesky Tonight: Insights from a Year's Worth of Social Data." PLOS ONE (2024). https://doi.org/10.1371/journal.pone.0310330

    Right to Erasure (Right to be forgotten)

    Note: If your account was created after March 21st, 2024, or if you did not post on Bluesky before that date, no data about your account exists in the dataset. Before sending a data removal request, please make sure that you were active and posting on Bluesky before March 21st, 2024.

    Users included in the Bluesky Social dataset have the right to opt-out and request the removal of their data, per GDPR provisions (Article 17).

    We emphasize that the released data has been thoroughly pseudonymized in compliance with GDPR (Article 4(5)). Specifically, usernames and object identifiers (e.g., URIs) have been removed, and object timestamps have been coarsened to protect individual privacy further and minimize reidentification risk. Moreover, it should be noted that the dataset was created for scientific research purposes, thereby falling under the scenarios for which GDPR provides opt-out derogations (Article 17(3)(d) and Article 89).

    Nonetheless, if you wish to have your activities excluded from this dataset, please submit your request to blueskydatasetmoderation@gmail.com (with the subject "Removal request: [username]"). We will process your request within a reasonable timeframe - updates will occur monthly, if necessary, and access to previous versions will be restricted.

    Acknowledgments:

    This work is supported by:

    • the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”,
      Grant Agreement n.871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (http://www.sobigdata.eu);
    • SoBigData.it which receives funding from the European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021;
    • EU NextGenerationEU programme under the funding schemes PNRR-PE-AI FAIR (Future Artificial Intelligence Research).
  7. List of the variables used in the models.

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    + more versions
    Cite
    Sylvain Delerce; Hugo Dorado; Alexandre Grillon; Maria Camila Rebolledo; Steven D. Prager; Victor Hugo Patiño; Gabriel Garcés Varón; Daniel Jiménez (2023). List of the variables used in the models. [Dataset]. http://doi.org/10.1371/journal.pone.0161620.t002
    Dataset provided by
    PLOS ONE
    Authors
    Sylvain Delerce; Hugo Dorado; Alexandre Grillon; Maria Camila Rebolledo; Steven D. Prager; Victor Hugo Patiño; Gabriel Garcés Varón; Daniel Jiménez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    List of the variables used in the models.

  8. Data underlying the publication: A Ground Truth Approach for Assessing...

    • data.4tu.nl
    zip
    Updated Feb 4, 2025
    Cite
    Dominique Sommers (2025). Data underlying the publication: A Ground Truth Approach for Assessing Process Mining Techniques [Dataset]. http://doi.org/10.4121/bc43e334-74e1-44ff-abf1-ed32847250c9.v1
    Dataset provided by
    4TU.ResearchData
    Authors
    Dominique Sommers
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This folder contains the synthetically generated dataset (process model and event logs) containing process data of a synthetically designed package delivery process, as described in [1]. The event logs present simulations of a process model, each with an incorporated issue, be it a behavioral deviation, i.e., where the process is differently exhibited with regard to the expected behavior described by the process model, or a recording error, i.e., where the execution of the process is recorded differently with regard to how it is exhibited. Each issue is added to the process model through a model transformation providing ground truth to the discrepancies introduced in the simulated event log.


    The package delivery process starts with the choice of home or depot delivery, after which the package queues for a warehouse employee to pick and load it into a van. In the case of home delivery, a courier drives off and rings a door, after which he either immediately hands over the package, or delivers it to the corresponding depot after registration, where it is left for collection. Alternatively, for depot delivery, "ringing" and therefore also "deliver at home" are omitted from the subprocess.

    models/delivery_base_model.json contains the specification of the process model that incorporates this "expected behavior", and is depicted in models/delivery_base_model.pdf.
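
    Since the JSON schema is not documented in this description, here is a quick, hypothetical way to inspect the model specification in R:

      library(jsonlite)

      # Peek at the top-level structure of the base model specification.
      model <- fromJSON("models/delivery_base_model.json")
      str(model, max.level = 1)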


    On top of this, six patterns of behavioral deviations (BI) and six patterns of recording errors (RI) are applied to the base model:

    BI5: Overtaking in the FIFO queue for picking packages;

    BI7: Switching roles from a courier to that of a warehouse employee;

    BI10: Batching is ignored, leaving with a delivery van before it was fully loaded;

    BI3: Skipping the activity of ringing, modeling behavior where e.g., the door was already opened upon arrival;

    BI9: Different resource memory where the package is delivered to a different depot than where it is registered;

    BI2: Multitasking of couriers during the delivery of multiple packages, modeling interruption of a delivery;

    RI1: Incorrect event, recording an order for depot delivery when it was intended for home delivery;

    RI2: Incorrect event, vice versa, i.e., recording an order for home delivery when it was intended for depot delivery;

    RI3: Missing event for the activity of loading a package in a truck;

    RI4: Missing object of the involved van for loading, e.g., due to a temporary connection failure of a recording device;

    RI5: Incorrect object of the involved courier when ringing, e.g., due to not logging out by the courier on the previous shift;

    RI6: Missing positions for the recording of the delivery and the collection at a depot, e.g., due to coarse timestamp logging.


    The behavior of each deviation pattern is added separately to the base model, resulting in twelve process models, accordingly named models/package_delivery_

    Each model is simulated, resulting in twelve logs, accordingly named logs/package_delivery_


    All models and corresponding generated logs with the applied patterns are also available at gitlab.com/dominiquesommers/mira/-/tree/main/mira/simulation, which additionally includes scripts to load and process the data.


    We refer to [1] for more information on the dataset.


    [1] Dominique Sommers, Natalia Sidorova, Boudewijn F. van Dongen. A ground truth approach for assessing process mining techniques. arXiv preprint, https://doi.org/10.48550/arXiv.2501.14345, 2025.

  9. Discovering Hidden Connections among Diseases, Genes and Drugs Based on...

    • figshare.com
    tiff
    Updated Jun 1, 2023
    Cite
    Jain-Shing Wu; E-Fong Kao; Chung-Nan Lee (2023). Discovering Hidden Connections among Diseases, Genes and Drugs Based on Microarray Expression Profiles with Negative-Term Filtering [Dataset]. http://doi.org/10.1371/journal.pone.0098826
    Dataset provided by
    PLOS ONE
    Authors
    Jain-Shing Wu; E-Fong Kao; Chung-Nan Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Microarrays based on gene expression profiles (GEPs) can be tailored specifically for a variety of topics to provide a precise and efficient means with which to discover hidden information. This study proposes a novel means of employing existing GEPs to reveal hidden relationships among diseases, genes, and drugs within a rich biomedical database, PubMed. Unlike the co-occurrence method, which considers only the appearance of keywords, the proposed method also takes into account negative relationships and non-relationships among keywords, the importance of which has been demonstrated in previous studies. Three scenarios were conducted to verify the efficacy of the proposed method. In Scenario 1, disease and drug GEPs (disease: lymphoma cancer, lymph node cancer, and drug: cyclophosphamide) were used to obtain lists of disease- and drug-related genes. Fifteen hidden connections were identified between the diseases and the drug. In Scenario 2, we adopted different diseases and drug GEPs (disease: AML-ALL dataset and drug: Gefitinib) to obtain lists of important diseases and drug-related genes. In this case, ten hidden connections were identified. In Scenario 3, we obtained a list of disease-related genes from the disease-related GEP (liver cancer) and the drug (Capecitabine) on the PharmGKB website, resulting in twenty-two hidden connections. Experimental results demonstrate the efficacy of the proposed method in uncovering hidden connections among diseases, genes, and drugs. Following implementation of the weight function in the proposed method, a large number of the documents obtained in each of the scenarios were judged to be related: 834 of 4028 documents, 789 of 1216 documents, and 1928 of 3791 documents in Scenarios 1, 2, and 3, respectively. The negative-term filtering scheme also uncovered a large number of negative relationships as well as non-relationships among these connections: 97 of 834, 38 of 789, and 202 of 1928 in Scenarios 1, 2, and 3, respectively.

  10. List of false positive rates of the proposed DBPPred and the existing...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Wangchao Lou; Xiaoqing Wang; Fan Chen; Yixiao Chen; Bo Jiang; Hua Zhang (2023). List of false positive rates of the proposed DBPPred and the existing iDNA-Prot, DNA-Prot, DNAbinder and DNABIND on datasets NDBP4025, RB174, RB256 and RB430. [Dataset]. http://doi.org/10.1371/journal.pone.0086703.t004
    Dataset provided by
    PLOS ONE
    Authors
    Wangchao Lou; Xiaoqing Wang; Fan Chen; Yixiao Chen; Bo Jiang; Hua Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    List of false positive rates of the proposed DBPPred and the existing iDNA-Prot, DNA-Prot, DNAbinder and DNABIND on datasets NDBP4025, RB174, RB256 and RB430.

  11. Data Analytics Market Analysis, Size, and Forecast 2025-2029: North America...

    • technavio.com
    Updated Jun 23, 2024
    Cite
    Technavio (2024). Data Analytics Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, and UK), Middle East and Africa (UAE), APAC (China, India, Japan, and South Korea), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/data-analytics-market-industry-analysis
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Global
    Description


    Data Analytics Market Size 2025-2029

    The data analytics market size is forecast to increase by USD 288.7 billion, at a CAGR of 14.7% between 2024 and 2029.

    The market is driven by the extensive use of modern technology in company operations, enabling businesses to extract valuable insights from their data. The prevalence of the Internet and the increased use of linked and integrated technologies have facilitated the collection and analysis of vast amounts of data from various sources. This trend is expected to continue as companies seek to gain a competitive edge by making data-driven decisions. However, the integration of data from different sources poses significant challenges. Ensuring data accuracy, consistency, and security is crucial as companies deal with large volumes of data from various internal and external sources. Additionally, the complexity of data analytics tools and the need for specialized skills can hinder adoption, particularly for smaller organizations with limited resources. Companies must address these challenges by investing in robust data management systems, implementing rigorous data validation processes, and providing training and development opportunities for their employees. By doing so, they can effectively harness the power of data analytics to drive growth and improve operational efficiency.

    What will be the Size of the Data Analytics Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    In the dynamic and ever-evolving data analytics market, entities such as explainable AI, time series analysis, data integration, data lakes, algorithm selection, feature engineering, marketing analytics, computer vision, data visualization, financial modeling, real-time analytics, data mining tools, and KPI dashboards continue to unfold and intertwine, shaping the industry's landscape. The application of these technologies spans various sectors, from risk management and fraud detection to conversion rate optimization and social media analytics.

    ETL processes, data warehousing, statistical software, data wrangling, and data storytelling are integral components of the data analytics ecosystem, enabling organizations to extract insights from their data. Cloud computing, deep learning, and data visualization tools further enhance the capabilities of data analytics platforms, allowing for advanced data-driven decision making and real-time analysis. Marketing analytics, clustering algorithms, and customer segmentation are essential for businesses seeking to optimize their marketing strategies and gain a competitive edge. Regression analysis, data visualization tools, and machine learning algorithms are instrumental in uncovering hidden patterns and trends, while predictive modeling and causal inference help organizations anticipate future outcomes and make informed decisions.

    Data governance, data quality, and bias detection are crucial aspects of the data analytics process, ensuring the accuracy, security, and ethical use of data. Supply chain analytics, healthcare analytics, and financial modeling are just a few examples of the diverse applications of data analytics, demonstrating the industry's far-reaching impact. Data pipelines, data mining, and model monitoring are essential for maintaining the continuous flow of data and ensuring the accuracy and reliability of analytics models. The integration of various data analytics tools and techniques continues to evolve as the industry adapts to the ever-changing needs of businesses and consumers alike.

    How is this Data Analytics Industry segmented?

    The data analytics industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023, for the following segments.

    • Component: Services, Software, Hardware
    • Deployment: Cloud, On-premises
    • Type: Prescriptive Analytics, Predictive Analytics, Customer Analytics, Descriptive Analytics, Others
    • Application: Supply Chain Management, Enterprise Resource Planning, Database Management, Human Resource Management, Others
    • Geography: North America (US, Canada), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (China, India, Japan, South Korea), South America (Brazil), Rest of World (ROW)

    By Component Insights

    The services segment is estimated to witness significant growth during the forecast period. The market is experiencing significant growth as businesses increasingly rely on advanced technologies to gain insights from their data. Natural language processing is a key component of this trend, enabling more sophisticated analysis of unstructured data. Fraud detection and data security solutions are also in high demand, as companies seek to protect against threats and maintain customer trust. Data analytics platforms, including cloud-based offeri

  12. List of Demographic & clinical variables used in PM-BMII.

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    Cite
    Zhuo Zhang; Yanwu Xu; Jiang Liu; Damon Wing Kee Wong; Chee Keong Kwoh; Seang-Mei Saw; Tien Yin Wong (2023). List of Demographic & clinical variables used in PM-BMII. [Dataset]. http://doi.org/10.1371/journal.pone.0065736.t003
    Dataset provided by
    PLOS ONE
    Authors
    Zhuo Zhang; Yanwu Xu; Jiang Liu; Damon Wing Kee Wong; Chee Keong Kwoh; Seang-Mei Saw; Tien Yin Wong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    List of Demographic & clinical variables used in PM-BMII.

  13. Data from: In Silico Investigation into H2 Uptake in MOFs: Combined...

    • acs.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Omer Tayfuroglu; Abdulkadir Kocak; Yunus Zorlu (2023). In Silico Investigation into H2 Uptake in MOFs: Combined Text/Data Mining and Structural Calculations [Dataset]. http://doi.org/10.1021/acs.langmuir.9b03618.s003
    Dataset provided by
    ACS Publications
    Authors
    Omer Tayfuroglu; Abdulkadir Kocak; Yunus Zorlu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Metal-organic frameworks (MOFs), with high surface areas and adjustable lattice structures, are attractive for gas storage and have thus been of great interest in research. Although a tremendous amount of data on MOFs is available in the literature, there are very few studies taking a methodological approach to the H2 uptake properties of MOFs. In this study, we systematically investigated the H2 uptake capabilities of MOFs by means of text and data mining (TDM), retrieving data on surface areas (SA) and pore volumes (PV) from published manuscripts. In addition, we calculated theoretical SA and PV values of all real MOFs available in the Cambridge Structural Database (CSD). Prior to calculation, we applied an automated structure analysis algorithm that loads the coordinates of molecules from CSD experimental X-ray single-crystal structures and removes guest/solvent contaminants from the structure. We compared SA, PV, and H2 uptake data from both TDM and structural calculation techniques and unraveled a list of MOFs with H2 uptakes predicted from both experimental and theoretical SA/PV values that may be regarded as the most promising candidates for H2 storage. The extensive and systematic TDM strategy yields 5975 experimental SA and 7748 experimental PV values (2080 MOFs with SA + PV values) with a 78-82% success rate. In addition, structural calculations reveal the theoretical SA and PV values along with a theoretical H2 adsorption limit of MOFs in the absence of guest molecules. The combination of both TDM and structural calculation strategies provides a more comprehensive perspective for the investigation of hydrogen storage capacities in MOFs, elucidating the plausibility of new compounds as candidates for H2 storage materials.

  14. The Utility-Linked (UL)-list structure of S4.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Min Shi; Yongshun Gong; Tiantian Xu; Long Zhao (2023). The Utility-Linked (UL)-list structure of S4. [Dataset]. http://doi.org/10.1371/journal.pone.0283365.t003
    Dataset provided by
    PLOS ONE
    Authors
    Min Shi; Yongshun Gong; Tiantian Xu; Long Zhao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High utility sequential pattern (HUSP) mining aims to mine actionable patterns with high utilities, widely applied in real-world learning scenarios such as market basket analysis, scenic route planning and click-stream analysis. The existing HUSP mining algorithms mainly attempt to improve computation efficiency while maintaining the algorithm stability in the setting of large-scale data. Although these methods have made some progress, they ignore the relationship between additional items and underlying sequences, which directly leads to the generation of redundant sequential patterns sharing the same underlying sequence. Hence, the mined patterns’ actionability is limited, which significantly compromises the performance of patterns in real-world applications. To address this problem, we present a new method named Combined Utility-Association Sequential Pattern Mining (CUASPM) by incorporating item/sequence relations, which can effectively remove redundant patterns and extract high discriminative and strongly associated sequential pattern combinations with high utilities. Specifically, we introduce the concept of actionable combined mining into HUSP mining for the first time and develop a novel tree structure to select discriminative high utility sequential patterns (HUSPs) for downstream tasks. Furthermore, two efficient strategies (i.e., global and local strategies) are presented to facilitate mining HUSPs while guaranteeing utility growth and high levels of association. Last, two parameters are introduced to evaluate the interestingness of patterns to choose the most useful actionable combined HUSPs (ACHUSPs). Extensive experimental results demonstrate that the proposed CUASPM outperforms the baselines in terms of execution time, memory usage, mining high discriminative and strongly associated HUSPs.

  15. Bengalese Finch song repository

    • figshare.com
    tiff
    Updated Jul 24, 2023
    Cite
    David Nicholson; Jonah E. Queen; Samuel J. Sober (2023). Bengalese Finch song repository [Dataset]. http://doi.org/10.6084/m9.figshare.4805749.v9
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    David Nicholson; Jonah E. Queen; Samuel J. Sober
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please see this website for information about the dataset: https://nickledave.github.io/bfsongrepo/ The site includes:

    • methodology
    • usage
    • terminal commands to download the dataset
    • citations in common formats
    • list of works that cite this dataset

  16. List of dynamic features.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Margherita Rosnati; Vincent Fortuin (2023). List of dynamic features. [Dataset]. http://doi.org/10.1371/journal.pone.0251248.t001
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Margherita Rosnati; Vincent Fortuin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    List of dynamic features.

  17. datasheet1_Q-Finder: An Algorithm for Credible Subgroup Discovery in...

    • frontiersin.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Cyril Esnault; May-Line Gadonna; Maxence Queyrel; Alexandre Templier; Jean-Daniel Zucker (2023). datasheet1_Q-Finder: An Algorithm for Credible Subgroup Discovery in Clinical Data Analysis — An Application to the International Diabetes Management Practice Study.pdf [Dataset]. http://doi.org/10.3389/frai.2020.559927.s001
    Dataset provided by
    Frontiers
    Authors
    Cyril Esnault; May-Line Gadonna; Maxence Queyrel; Alexandre Templier; Jean-Daniel Zucker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Addressing the heterogeneity of both the outcome of a disease and the treatment response to an intervention is a mandatory pathway for regulatory approval of medicines. In randomized clinical trials (RCTs), confirmatory subgroup analyses focus on the assessment of drugs in predefined subgroups, while exploratory ones allow a posteriori the identification of subsets of patients who respond differently. Within the latter area, the subgroup discovery (SD) data mining approach is widely used—particularly in precision medicine—to evaluate treatment effect across different groups of patients from various data sources (be it from clinical trials or real-world data). However, both the limited consideration by standard SD algorithms of recommended criteria to define credible subgroups and the lack of statistical power of the findings after correcting for multiple testing hinder the generation of hypotheses and their acceptance by healthcare authorities and practitioners. In this paper, we present the Q-Finder algorithm, which aims to generate statistically credible subgroups to answer clinical questions, such as finding drivers of natural disease progression or treatment response. It combines an exhaustive search with a cascade of filters based on metrics assessing key credibility criteria, including relative risk reduction assessment, adjustment on confounding factors, individual feature's contribution to the subgroup's effect, interaction tests for assessing between-subgroup treatment effect interactions, and tests adjustment (multiple testing). This allows Q-Finder to directly target and assess subgroups on recommended credibility criteria. The top-k credible subgroups are then selected, while accounting for subgroups' diversity and, possibly, clinical relevance. Those subgroups are tested on independent data to assess their consistency across databases, while preserving statistical power by limiting the number of tests. To illustrate this algorithm, we applied it to the database of the International Diabetes Management Practice Study (IDMPS) to better understand the drivers of improved glycemic control and the rate of episodes of hypoglycemia in type 2 diabetes patients. We compared Q-Finder with state-of-the-art approaches from both the Subgroup Identification and the Knowledge Discovery in Databases literature. The results demonstrate its ability to identify and support a short list of highly credible and diverse data-driven subgroups for both prognostic and predictive tasks.

  18. Comparison of ranks of authors who are SIGKDD innovation award winners based...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Chao Gao; Zhen Wang; Xianghua Li; Zili Zhang; Wei Zeng (2023). Comparison of ranks of authors who are SIGKDD innovation award winners based on different indicators. [Dataset]. http://doi.org/10.1371/journal.pone.0161755.t003
    Dataset provided by
    PLOS ONE
    Authors
    Chao Gao; Zhen Wang; Xianghua Li; Zili Zhang; Wei Zeng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The boldface indicates the minimum number in each row. PR_AC gives the highest ranking to awarded authors. Leo Breiman, who was awarded the SIGKDD Innovation Award in 2005, is not listed in this table: our dataset was crawled using the keyword "data mining", and Prof. Breiman's work focused mainly on statistics and machine learning, so there are only a few records of Leo Breiman in our dataset.
