5 datasets found

Additional file 2: of Secure and scalable deduplication of horizontally...
springernature.figshare.com
text/x-shellscript
Updated Jun 1, 2023
Cite
Kassaye Yigzaw; Antonis Michalas; Johan Bellika (2023). Additional file 2: of Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation [Dataset]. http://doi.org/10.6084/m9.figshare.c.3656951_D1.v1
Available download formats: text/x-shellscript
Unique identifier
https://doi.org/10.6084/m9.figshare.c.3656951_D1.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Kassaye Yigzaw; Antonis Michalas; Johan Bellika
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
It is a Bash script that implements the algorithm we used to generate the simulated microbiology datasets. The script is discussed in Additional file 1. (SH 9 kb)
Document Duplication Detection Software Report
datainsightsmarket.com
doc, pdf, ppt
Updated Jun 3, 2025
Cite
Data Insights Market (2025). Document Duplication Detection Software Report [Dataset]. https://www.datainsightsmarket.com/reports/document-duplication-detection-software-1421242
Available download formats: doc, ppt, pdf
Dataset updated
Jun 3, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global market for Document Duplication Detection Software is experiencing robust growth, driven by the increasing need for efficient data management and enhanced security across various industries. The rising volume of digital documents, coupled with stricter regulatory compliance requirements (like GDPR and CCPA), is fueling the demand for solutions that can quickly and accurately identify duplicate files. This reduces storage costs, improves data quality, and minimizes the risk of data breaches. The market's expansion is further propelled by advancements in artificial intelligence (AI) and machine learning (ML) technologies, which enable more sophisticated and accurate duplicate detection. We estimate the current market size to be around $800 million in 2025, with a Compound Annual Growth Rate (CAGR) of 15% projected through 2033. This growth is expected across various segments, including cloud-based and on-premise solutions, catering to diverse industry verticals such as legal, finance, healthcare, and government. Major players like Microsoft, IBM, and Oracle are contributing to market growth through their established enterprise solutions. However, the market also features several specialized players, like Hyper Labs and Auslogics, offering niche solutions catering to specific needs. While the increasing adoption of cloud-based solutions is a key trend, potential restraints include the initial investment costs for software implementation and the need for ongoing training and support. The integration challenges with existing systems and the potential for false positives can also impede wider adoption. The market's regional distribution is expected to see a significant contribution from North America and Europe, while the Asia-Pacific region is projected to exhibit substantial growth potential driven by increasing digitalization. 
The forecast period (2025-2033) presents significant opportunities for market expansion, driven by technological innovation and the growing awareness of data management best practices.
Us Purpose Built Backup Appliance Market Research Report By Product Type...
exactitudeconsultancy.com
Updated Mar 2025
Cite
Exactitude Consultancy (2025). Us Purpose Built Backup Appliance Market Research Report By Product Type (Disk-Based, Cloud-Based, Hybrid), By Application (Data Backup, Data Recovery, Disaster Recovery), By End User (Small and Medium Enterprises, Large Enterprises), By Technology (Virtualization, Deduplication, Encryption), By Distribution Channel (Online, Offline) – Forecast to 2034. [Dataset]. https://exactitudeconsultancy.com/reports/48975/us-purpose-built-backup-appliance-market
Dataset updated
Mar 2025
Dataset authored and provided by
Exactitude Consultancy
License
https://exactitudeconsultancy.com/privacy-policy
Description
The Purpose Built Backup Appliance market is projected to be valued at $5 billion in 2024, driven by factors such as increasing consumer awareness and the rising prevalence of industry-specific trends. The market is expected to grow at a CAGR of 12%, reaching approximately $15 billion by 2034.
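The figures quoted above can be sanity-checked with the standard compound-growth formula. The sketch below is only an illustration, not part of the report: the function names are hypothetical, and it assumes annual compounding over the 10 years from 2024 to 2034.

```python
def project_value(start_value: float, cagr: float, years: int) -> float:
    """Compound start_value forward for `years` at an annual growth rate `cagr`."""
    return start_value * (1.0 + cagr) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Annual growth rate that turns start_value into end_value over `years`."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures from the description, in billions of USD.
# Assumption: the forecast spans 10 compounding periods (2024 to 2034).
projected_2034 = project_value(5.0, 0.12, 10)
rate_for_15bn = implied_cagr(5.0, 15.0, 10)

print(f"$5bn compounded at 12% for 10 years: ${projected_2034:.2f}bn")
print(f"CAGR implied by $5bn -> $15bn over 10 years: {rate_for_15bn:.2%}")
```

Under these assumptions the 12% rate gives a little over $15 billion, and the rounded $15 billion endpoint corresponds to a rate slightly below 12%, so the two stated figures are consistent to within rounding.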
Data from: Identification of factors associated with duplicate rate in...
omicsdi.org
Updated Jul 19, 2023
Cite
(2023). Identification of factors associated with duplicate rate in ChIP-seq data. [Dataset]. https://www.omicsdi.org/dataset/biostudies/S-EPMC6447195
Dataset updated
Jul 19, 2023
Variables measured
Unknown
Description
Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contains redundant reads termed duplicates, referring to those mapping to the same genomic location and strand. There are two main sources of duplicates: polymerase chain reaction (PCR) duplicates and natural duplicates. Unlike natural duplicates that represent true signals from sequencing of independent DNA templates, PCR duplicates are artifacts originating from sequencing of identical copies amplified from the same DNA template. In analysis, duplicates are removed from peak calling and signal quantification. Nevertheless, a significant portion of the duplicates is believed to represent true signals. Obviously, removing all duplicates will underestimate the signal level in peaks and impact the identification of signal changes across samples. Therefore, an in-depth evaluation of the impact from duplicate removal is needed. Using eight public ChIP-seq datasets from three narrow-peak and two broad-peak marks, we tried to understand the distribution of duplicates in the genome, the extent by which duplicate removal impacts peak calling and signal estimation, and the factors associated with duplicate level in peaks. The three PCR-free histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq data had about 40% duplicates and 97% of them were within peaks. For the other datasets generated with PCR amplification of ChIP DNA, as expected, the narrow-peak marks have a much higher proportion of duplicates than the broad-peak marks. We found that duplicates are enriched in peaks and largely represent true signals, more conspicuous in those with high confidence. Furthermore, duplicate level in peaks is strongly correlated with the target enrichment level estimated using nonredundant reads, which provides the basis to properly allocate duplicates between noise and signal. 
Our analysis supports the feasibility of retaining the portion of signal duplicates into downstream analysis, thus alleviating the limitation of complete deduplication.
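The definition of a duplicate used in this description (reads mapping to the same genomic location and strand) can be illustrated with a small sketch. This is not the authors' pipeline: the `Read` type and `duplicate_rate` function are hypothetical, and the sketch assumes that the first read at each position counts as nonredundant while every further read at that position counts as a duplicate.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Read(NamedTuple):
    """Minimal stand-in for an aligned read (hypothetical type)."""
    chrom: str
    position: int  # mapped start coordinate
    strand: str    # "+" or "-"


def duplicate_rate(reads: Iterable[Read]) -> float:
    """Fraction of reads that repeat an already-seen (chrom, position, strand)."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    nonredundant = len(counts)  # one representative per distinct location
    return (total - nonredundant) / total


example_reads = [
    Read("chr1", 1000, "+"),
    Read("chr1", 1000, "+"),  # duplicate of the first read
    Read("chr1", 1000, "+"),  # duplicate of the first read
    Read("chr1", 1000, "-"),  # same position, opposite strand: not a duplicate
    Read("chr2", 500, "+"),
]
rate = duplicate_rate(example_reads)
print(f"duplicate rate: {rate:.0%}")
```

A count like this cannot tell PCR duplicates from natural duplicates, which is the distinction the description argues matters for downstream analysis.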
SlimPajama-627B
huggingface.co
opendatalab.com
Updated Oct 2, 2012
Cite
Cerebras (2012). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B
Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Oct 2, 2012
Dataset authored and provided by
Cerebras
Description
The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.

Getting Started

You can download the dataset using Hugging Face datasets:

    from datasets import load_dataset
    ds = load_dataset("cerebras/SlimPajama-627B")

Background

Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.