5 datasets found
  1. Additional file 2: of Secure and scalable deduplication of horizontally...

    • springernature.figshare.com
    text/x-shellscript
    Updated Jun 1, 2023
    Cite
    Kassaye Yigzaw; Antonis Michalas; Johan Bellika (2023). Additional file 2: of Secure and scalable deduplication of horizontally partitioned health data for privacy-preserving distributed statistical computation [Dataset]. http://doi.org/10.6084/m9.figshare.c.3656951_D1.v1
    Available download formats
    text/x-shellscript
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kassaye Yigzaw; Antonis Michalas; Johan Bellika
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    It is a Bash script that implements the algorithm we used to generate the simulated microbiology datasets. The script is discussed in Additional file 1. (SH 9 kb)

  2. Document Duplication Detection Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 3, 2025
    Cite
    Data Insights Market (2025). Document Duplication Detection Software Report [Dataset]. https://www.datainsightsmarket.com/reports/document-duplication-detection-software-1421242
    Available download formats
    doc, ppt, pdf
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global market for Document Duplication Detection Software is experiencing robust growth, driven by the increasing need for efficient data management and enhanced security across various industries. The rising volume of digital documents, coupled with stricter regulatory compliance requirements (like GDPR and CCPA), is fueling the demand for solutions that can quickly and accurately identify duplicate files. This reduces storage costs, improves data quality, and minimizes the risk of data breaches. The market's expansion is further propelled by advancements in artificial intelligence (AI) and machine learning (ML) technologies, which enable more sophisticated and accurate duplicate detection. We estimate the current market size to be around $800 million in 2025, with a Compound Annual Growth Rate (CAGR) of 15% projected through 2033. This growth is expected across various segments, including cloud-based and on-premise solutions, catering to diverse industry verticals such as legal, finance, healthcare, and government. Major players like Microsoft, IBM, and Oracle are contributing to market growth through their established enterprise solutions. However, the market also features several specialized players, like Hyper Labs and Auslogics, offering niche solutions catering to specific needs. While the increasing adoption of cloud-based solutions is a key trend, potential restraints include the initial investment costs for software implementation and the need for ongoing training and support. The integration challenges with existing systems and the potential for false positives can also impede wider adoption. The market's regional distribution is expected to see a significant contribution from North America and Europe, while the Asia-Pacific region is projected to exhibit substantial growth potential driven by increasing digitalization. 
    The forecast period (2025-2033) presents significant opportunities for market expansion, driven by technological innovation and the growing awareness of data management best practices.

  3. Us Purpose Built Backup Appliance Market Research Report By Product Type...

    • exactitudeconsultancy.com
    Updated Mar 2025
    Cite
    Exactitude Consultancy (2025). Us Purpose Built Backup Appliance Market Research Report By Product Type (Disk-Based, Cloud-Based, Hybrid), By Application (Data Backup, Data Recovery, Disaster Recovery), By End User (Small and Medium Enterprises, Large Enterprises), By Technology (Virtualization, Deduplication, Encryption), By Distribution Channel (Online, Offline) – Forecast to 2034. [Dataset]. https://exactitudeconsultancy.com/reports/48975/us-purpose-built-backup-appliance-market
    Dataset updated
    Mar 2025
    Dataset authored and provided by
    Exactitude Consultancy
    License

    https://exactitudeconsultancy.com/privacy-policy

    Description

    The Purpose Built Backup Appliance market is projected to be valued at $5 billion in 2024, driven by factors such as increasing consumer awareness and the rising prevalence of industry-specific trends. The market is expected to grow at a CAGR of 12%, reaching approximately $15 billion by 2034.
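
The two endpoints quoted in this description can be checked against each other with the standard compound-growth formula. The following is only an illustrative sketch: the function name `project_value` is hypothetical, and it assumes the 12% CAGR compounds annually over the ten years from 2024 to 2034.

```python
def project_value(start_value: float, cagr: float, years: int) -> float:
    """Compound start_value at the annual growth rate cagr for the given number of years."""
    return start_value * (1.0 + cagr) ** years


# Figures quoted in the description: $5 billion in 2024, 12% CAGR, horizon 2034.
projected_2034 = project_value(5.0, 0.12, 2034 - 2024)
print(f"Projected 2034 value: ${projected_2034:.1f} billion")
```

Under these assumptions the projection comes to roughly $15.5 billion, which is in line with the "approximately $15 billion" stated in the description.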

  4. Data from: Identification of factors associated with duplicate rate in...

    • omicsdi.org
    Updated Jul 19, 2023
    Cite
    (2023). Identification of factors associated with duplicate rate in ChIP-seq data. [Dataset]. https://www.omicsdi.org/dataset/biostudies/S-EPMC6447195
    Dataset updated
    Jul 19, 2023
    Variables measured
    Unknown
    Description

    Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contains redundant reads termed duplicates, referring to those mapping to the same genomic location and strand. There are two main sources of duplicates: polymerase chain reaction (PCR) duplicates and natural duplicates. Unlike natural duplicates that represent true signals from sequencing of independent DNA templates, PCR duplicates are artifacts originating from sequencing of identical copies amplified from the same DNA template. In analysis, duplicates are removed from peak calling and signal quantification. Nevertheless, a significant portion of the duplicates is believed to represent true signals. Obviously, removing all duplicates will underestimate the signal level in peaks and impact the identification of signal changes across samples. Therefore, an in-depth evaluation of the impact from duplicate removal is needed. Using eight public ChIP-seq datasets from three narrow-peak and two broad-peak marks, we tried to understand the distribution of duplicates in the genome, the extent by which duplicate removal impacts peak calling and signal estimation, and the factors associated with duplicate level in peaks. The three PCR-free histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq data had about 40% duplicates and 97% of them were within peaks. For the other datasets generated with PCR amplification of ChIP DNA, as expected, the narrow-peak marks have a much higher proportion of duplicates than the broad-peak marks. We found that duplicates are enriched in peaks and largely represent true signals, more conspicuous in those with high confidence. Furthermore, duplicate level in peaks is strongly correlated with the target enrichment level estimated using nonredundant reads, which provides the basis to properly allocate duplicates between noise and signal. 
    Our analysis supports the feasibility of retaining the portion of signal duplicates into downstream analysis, thus alleviating the limitation of complete deduplication.
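
The definition of a duplicate used in this description (reads mapping to the same genomic location and strand) can be illustrated with a short sketch. This is not the authors' pipeline: the `(chromosome, position, strand)` tuple layout and the function name `duplicate_rate` are assumptions made for illustration, and the sketch does not attempt to separate PCR duplicates from natural ones.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical read layout: (chromosome, mapped start position, strand)
Read = Tuple[str, int, str]


def duplicate_rate(reads: Iterable[Read]) -> float:
    """Fraction of reads that are extra copies at an already-seen (chrom, pos, strand)."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first at a given location and strand counts as a duplicate.
    return (total - len(counts)) / total


reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # duplicate of the read above
    ("chr1", 100, "-"),  # same position, opposite strand: not a duplicate
    ("chr2", 250, "+"),
]
print(duplicate_rate(reads))  # one extra copy out of four reads
```

Removing every read counted here corresponds to the "complete deduplication" that the description argues can underestimate true signal.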

  5. SlimPajama-627B

    • huggingface.co
    • opendatalab.com
    Updated Oct 2, 2012
    Cite
    Cerebras (2012). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B
    Dataset updated
    Oct 2, 2012
    Dataset authored and provided by
    Cerebras
    Description

    The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.
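
To make the idea of deduplicating jsonl records concrete, here is a minimal sketch of exact-match deduplication. It is not the method used to build SlimPajama (the description points to the blog post and GitHub code for that); the function name `unique_records` is hypothetical, and the sketch assumes each record stores its content in a `text` field.

```python
import hashlib
import io
import json
from typing import Iterator, TextIO


def unique_records(lines: TextIO, field: str = "text") -> Iterator[dict]:
    """Yield JSONL records whose `field` value has not been seen before (exact match only)."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Hash the text so only fixed-size digests are kept in memory.
        digest = hashlib.sha256(record[field].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record


# Small in-memory stand-in for a jsonl file.
sample = io.StringIO(
    '{"text": "hello world"}\n'
    '{"text": "hello world"}\n'
    '{"text": "something else"}\n'
)
kept = list(unique_records(sample))
print(len(kept))  # two distinct texts remain
```

Exact hashing only catches byte-identical texts; near-duplicate detection needs a fuzzier similarity measure and is out of scope for this sketch.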

    Getting Started

    You can download the dataset using Hugging Face datasets:

        from datasets import load_dataset
        ds = load_dataset("cerebras/SlimPajama-627B")

    Background

    Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.

