100+ datasets found
  1. Papers on duplicate records

    • search.dataone.org
    Updated Nov 21, 2023
    Cite
    Slomczynski, Kazimierz M.; Powałko, Przemek; Krauze, Tadeusz (2023). Papers on duplicate records [Dataset]. http://doi.org/10.7910/DVN/TK1U7E
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Slomczynski, Kazimierz M.; Powałko, Przemek; Krauze, Tadeusz
    Description

    Papers on duplicate records. Visit https://dataone.org/datasets/sha256%3A8af3814a53c4db3d7260b3e66c119db96471654505927ce8a87df16ddf4592ab for complete metadata about this dataset.

  2. Data Quality Assurance - Laboratory duplicates

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Data Quality Assurance - Laboratory duplicates [Dataset]. https://catalog.data.gov/dataset/data-quality-assurance-laboratory-duplicates
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This dataset includes data quality assurance information concerning the Relative Percent Difference (RPD) of laboratory duplicates. No laboratory duplicate information exists for 2010. The formula for calculating relative percent difference is: ABS(2*[(A-B)/(A+B)]). An RPD of less than 10% is considered acceptable.
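The RPD formula in the description can be sketched as a small helper (the function name and the ×100 scaling to express the result as a percent are my additions; the dataset states the formula as a fraction):

```python
def relative_percent_difference(a, b):
    """Relative Percent Difference between a result and its laboratory
    duplicate: ABS(2 * (A - B) / (A + B)), scaled here to a percent."""
    return abs(2 * (a - b) / (a + b)) * 100

# Per the dataset description, an RPD of less than 10% is acceptable.
rpd = relative_percent_difference(10.0, 9.5)
print(rpd)  # ~5.13%
```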

  3. Potential Duplicate Products Report

    • catalog.data.gov
    Updated Feb 16, 2023
    Cite
    DHS (2023). Potential Duplicate Products Report [Dataset]. https://catalog.data.gov/dataset/potential-duplicate-products-report
    Explore at:
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    U.S. Department of Homeland Security (http://www.dhs.gov/)
    Description

    Displays potential software and hardware product duplicates within a manufacturer. Product duplicates have the same name, component, and manufacturer. Also displays duplicate software versions (patch level and edition must be the same) and hardware models within a product.

  4. Duplicate Analysis

    • kaggle.com
    Updated Jun 2, 2025
    Cite
    Alinaswe Simfukwe (2025). Duplicate Analysis [Dataset]. https://www.kaggle.com/datasets/alinaswesimfukwe/duplicate-analysis/data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alinaswe Simfukwe
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Overview:

    Total Records: 749
    Original Records: 700
    Duplicate Records: 49 (7% of total)
    File Name: synthetic_claims_with_duplicates.csv

    Key Features:

    Claim Information: Unique claim IDs (CLAIM000001 to CLAIM000700); Employee IDs (EMP0001 to EMP0700); realistic employee names
    Financial Data: Amounts range from 100.00 to 20,000.00; service codes SVC001, SVC002, SVC003, SVC004; departments: Finance, HR, IT, Marketing, Operations
    Transaction Details: Dates within the last 2 years; timestamps for submission; statuses: Submitted, Approved, Paid; random UUIDs for submitter IDs
    Fraud Detection: 49 exact duplicates (7%); random distribution throughout the dataset; Boolean is_duplicate flag for identification

    Purpose: The dataset is designed to test fraud detection systems, particularly for identifying duplicate transactions. It simulates real-world scenarios where duplicate entries might occur due to fraud or data entry errors.

    Usage:

    Testing duplicate transaction detection
    Training fraud detection models
    Data validation and cleaning
    Algorithm benchmarking

    The dataset is now ready for analysis in your fraud detection system.
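A minimal sketch of the exact-duplicate check this dataset is built to exercise, in plain Python (the field values are illustrative, not taken from the actual CSV):

```python
from collections import Counter

# Represent each claim by the fields that should uniquely identify it
# (employee, service code, date, amount); values here are made up.
claims = [
    ("EMP0001", "SVC001", "2024-05-01", 1500.00),
    ("EMP0002", "SVC002", "2024-05-02", 220.50),
    ("EMP0001", "SVC001", "2024-05-01", 1500.00),  # exact duplicate
]

counts = Counter(claims)
duplicates = [claim for claim, n in counts.items() if n > 1]
print(duplicates)  # the EMP0001 claim appears twice
```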

  5. quora-duplicates

    • huggingface.co
    Updated Apr 27, 2024
    Cite
    Sentence Transformers (2024). quora-duplicates [Dataset]. https://huggingface.co/datasets/sentence-transformers/quora-duplicates
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 27, 2024
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for Quora Duplicate Questions

    This dataset contains the Quora Question Pairs dataset in four formats that are easily used with Sentence Transformers to train embedding models. The data was originally created by Quora for this Kaggle Competition.

      Dataset Subsets

      pair-class subset

    Columns: "sentence1", "sentence2", "label"
    Column types: str, str, class with {"0": "different", "1": "duplicate"}
    Examples: { 'sentence1': 'What is the step by step… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/quora-duplicates.

  6. Data from: The comparative landscape of duplications in Heliconius melpomene and Heliconius cydno

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Sep 23, 2016
    Cite
    Ana Pinharanda; Simon H. Martin; Sarah L. Barker; John W. Davey; Chris D. Jiggins (2016). The comparative landscape of duplications in Heliconius melpomene and Heliconius cydno [Dataset]. http://doi.org/10.5061/dryad.8jv30
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 23, 2016
    Dataset provided by
    University of Cambridge
    Authors
    Ana Pinharanda; Simon H. Martin; Sarah L. Barker; John W. Davey; Chris D. Jiggins
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Central America
    Description

    Gene duplications can facilitate adaptation and may lead to inter-population divergence, causing reproductive isolation. We used whole-genome re-sequencing data from 34 butterflies to detect duplications in two Heliconius species, H. cydno and H. melpomene. Taking advantage of three distinctive signals of duplication in short-read sequencing data, we identified 744 duplicated loci in H. cydno and H. melpomene, 96% of which were validated with single molecule sequencing. We found that duplications overlap genes significantly less than expected at random in H. melpomene, consistent with the action of background selection against duplicates in functional regions of the genome. Duplicate loci that are highly differentiated between H. melpomene and H. cydno map to four different chromosomes. Four duplications were identified with a strong signal of divergent selection, including an odorant binding protein and another in close proximity to a known wing colour pattern locus that differs between the two species.

  7. Mobile Location Data | Asia | +300M Unique Devices | +100M Daily Users | +200B Events / Month

    • datarade.ai
    .json, .csv, .xls
    Updated Mar 21, 2025
    Cite
    Quadrant (2025). Mobile Location Data | Asia | +300M Unique Devices | +100M Daily Users | +200B Events / Month [Dataset]. https://datarade.ai/data-products/mobile-location-data-asia-300m-unique-devices-100m-da-quadrant
    Explore at:
    Available download formats: .json, .csv, .xls
    Dataset updated
    Mar 21, 2025
    Dataset authored and provided by
    Quadrant
    Area covered
    Asia, Oman, Korea (Democratic People's Republic of), Israel, Palestine, Iran (Islamic Republic of), Georgia, Armenia, Bahrain, Kyrgyzstan, Philippines
    Description

    Quadrant provides insightful, accurate, and reliable mobile location data.

    Our privacy-first mobile location data unveils hidden patterns and opportunities, provides actionable insights, and fuels data-driven decision-making at the world's biggest companies.

    These companies rely on our privacy-first Mobile Location and Points-of-Interest Data to unveil hidden patterns and opportunities, provide actionable insights, and fuel data-driven decision-making. They build better AI models, uncover business insights, and enable location-based services using our robust and reliable real-world data.

    We conduct stringent evaluations on data providers to ensure authenticity and quality. Our proprietary algorithms detect and cleanse corrupted and duplicated data points, allowing you to leverage our datasets rapidly with minimal processing or cleaning. During the ingestion process, our proprietary Data Filtering Algorithms remove events based on a number of qualitative factors, as well as latency and other integrity variables, to provide more efficient data delivery. The deduplicating algorithm focuses on a combination of four important attributes: Device ID, Latitude, Longitude, and Timestamp. It scours our data and identifies rows that contain the same combination of these four attributes. Post-identification, it retains a single copy and eliminates duplicate values to ensure our customers only receive complete and unique datasets.
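The deduplication rule described, keeping one copy per (Device ID, Latitude, Longitude, Timestamp) combination, can be sketched in plain Python (the event records are illustrative, not Quadrant's actual pipeline):

```python
# Deduplicate location events on the four attributes named above:
# Device ID, Latitude, Longitude, Timestamp. First occurrence is kept.
events = [
    {"device_id": "abc", "lat": 1.30, "lon": 103.80, "ts": 1700000000},
    {"device_id": "abc", "lat": 1.30, "lon": 103.80, "ts": 1700000000},  # duplicate
    {"device_id": "abc", "lat": 1.31, "lon": 103.80, "ts": 1700000060},
]

seen = set()
unique_events = []
for e in events:
    key = (e["device_id"], e["lat"], e["lon"], e["ts"])
    if key not in seen:
        seen.add(key)
        unique_events.append(e)

print(len(unique_events))  # 2
```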

    We actively identify overlapping values at the provider level to determine the value each offers. Our data science team has developed a sophisticated overlap analysis model that helps us maintain a high-quality data feed by qualifying providers based on unique data values rather than volumes alone – measures that provide significant benefit to our end-use partners.

    Quadrant mobility data contains all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and IP Address, and non-standard attributes such as Geohash and H3. In addition, we have historical data available back through 2022.

    Through our in-house data science team, we offer sophisticated technical documentation, location data algorithms, and queries that help data buyers get a head start on their analyses. Our goal is to provide you with data that is “fit for purpose”.

  8. Catalog of natural and induced earthquakes without duplicates

    • datasets.ai
    • search.dataone.org
    • +2more
    55
    Updated Sep 11, 2024
    + more versions
    Cite
    Department of the Interior (2024). Catalog of natural and induced earthquakes without duplicates [Dataset]. https://datasets.ai/datasets/catalog-of-natural-and-induced-earthquakes-without-duplicates
    Explore at:
    Available download formats: 55
    Dataset updated
    Sep 11, 2024
    Dataset authored and provided by
    Department of the Interior
    Description

    The U. S. Geological Survey (USGS) makes long-term seismic hazard forecasts that are used in building codes. The hazard models usually consider only natural seismicity; non-tectonic (man-made) earthquakes are excluded because they are transitory or too small. In the past decade, however, thousands of earthquakes related to underground fluid injection have occurred in the central and eastern U.S. (CEUS), and some have caused damage. In response, the USGS is now also making short-term forecasts that account for the hazard from these induced earthquakes. A uniform earthquake catalog is assembled by combining and winnowing pre-existing source catalogs. Seismicity statistics are analyzed to develop recurrence models, accounting for catalog completeness. In the USGS hazard modeling methodology, earthquakes are counted on a map grid, recurrence models are applied to estimate the rates of future earthquakes in each grid cell, and these rates are combined with maximum-magnitude models and ground-motion models to compute the hazard. The USGS published a forecast for the years 2016 and 2017. This data set is the catalog of natural and induced earthquakes without duplicates. Duplicate events have been removed based on a hierarchy of the source catalogs. Explosions and mining related events have been deleted.

  9. Data from: Identification of factors associated with duplicate rate in ChIP-seq data

    • omicsdi.org
    Updated Jul 19, 2023
    + more versions
    Cite
    (2023). Identification of factors associated with duplicate rate in ChIP-seq data. [Dataset]. https://www.omicsdi.org/dataset/biostudies/S-EPMC6447195
    Explore at:
    Dataset updated
    Jul 19, 2023
    Variables measured
    Unknown
    Description

    Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contains redundant reads termed duplicates, referring to those mapping to the same genomic location and strand. There are two main sources of duplicates: polymerase chain reaction (PCR) duplicates and natural duplicates. Unlike natural duplicates that represent true signals from sequencing of independent DNA templates, PCR duplicates are artifacts originating from sequencing of identical copies amplified from the same DNA template. In analysis, duplicates are removed from peak calling and signal quantification. Nevertheless, a significant portion of the duplicates is believed to represent true signals. Obviously, removing all duplicates will underestimate the signal level in peaks and impact the identification of signal changes across samples. Therefore, an in-depth evaluation of the impact from duplicate removal is needed. Using eight public ChIP-seq datasets from three narrow-peak and two broad-peak marks, we tried to understand the distribution of duplicates in the genome, the extent by which duplicate removal impacts peak calling and signal estimation, and the factors associated with duplicate level in peaks. The three PCR-free histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq data had about 40% duplicates and 97% of them were within peaks. For the other datasets generated with PCR amplification of ChIP DNA, as expected, the narrow-peak marks have a much higher proportion of duplicates than the broad-peak marks. We found that duplicates are enriched in peaks and largely represent true signals, more conspicuous in those with high confidence. Furthermore, duplicate level in peaks is strongly correlated with the target enrichment level estimated using nonredundant reads, which provides the basis to properly allocate duplicates between noise and signal. 
Our analysis supports the feasibility of retaining the portion of signal duplicates into downstream analysis, thus alleviating the limitation of complete deduplication.
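The duplicate definition used above, reads mapping to the same genomic location and strand, can be sketched as follows (the read coordinates are illustrative):

```python
from collections import Counter

# A read is a duplicate of another if it maps to the same
# (chromosome, position, strand); values here are made up.
reads = [
    ("chr1", 1000, "+"),
    ("chr1", 1000, "+"),  # duplicate (PCR or natural; indistinguishable here)
    ("chr1", 1000, "-"),  # same position, opposite strand: not a duplicate
    ("chr2", 5000, "+"),
]

counts = Counter(reads)
n_duplicates = sum(n - 1 for n in counts.values())
print(n_duplicates)  # 1
```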

  10. Data from: Duplicated Dataset

    • universe.roboflow.com
    zip
    Updated Dec 31, 2024
    Cite
    AVANTHIKA S (2024). Duplicated Dataset [Dataset]. https://universe.roboflow.com/avanthika-s-nfpex/duplicated/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 31, 2024
    Dataset authored and provided by
    AVANTHIKA S
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Non_emergency 9m2i Bounding Boxes
    Description

    Duplicated

    ## Overview
    
    Duplicated is a dataset for object detection tasks - it contains Non_emergency 9m2i annotations for 942 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  11. Label quantities of the non-duplicate entries and duplicate entries compared to the original dataset

    • plos.figshare.com
    xls
    Updated May 2, 2024
    Cite
    Erik D. Huckvale; Hunter N. B. Moseley (2024). Label quantities of the non-duplicate entries and duplicate entries compared to the original dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0299583.t008
    Explore at:
    Available download formats: xls
    Dataset updated
    May 2, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Erik D. Huckvale; Hunter N. B. Moseley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Label quantities of the non-duplicate entries and duplicate entries compared to the original dataset.

  12. 311 Service Requests - Graffiti Removal - No Duplicates

    • data.cityofchicago.org
    • data.wu.ac.at
    Updated Mar 6, 2019
    Cite
    City of Chicago (2019). 311 Service Requests - Graffiti Removal - No Duplicates [Dataset]. https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-No-Duplicate/8tus-apua
    Explore at:
    Available download formats: application/rdfxml, csv, tsv, application/rssxml, xml, kml, application/geo+json, kmz
    Dataset updated
    Mar 6, 2019
    Dataset authored and provided by
    City of Chicago
    Description

    Note: This filtered view shows only those service requests from the underlying dataset that are not marked as duplicates. -- All open graffiti removal requests made to 311 and all requests completed since January 1, 2011. The Department of Streets & Sanitation's Graffiti Blasters crews offer a vandalism removal service to private property owners. Graffiti Blasters employ "blast" trucks that use baking soda under high water pressure to erase painted graffiti from brick, stone and other mineral surfaces. They also use paint trucks to cover graffiti on the remaining surfaces. Organizations and residents may report graffiti and request its removal. 311 sometimes receives duplicate requests for graffiti removal. Requests that have been labeled as Duplicates are in the same geographic area and have been entered into 311’s Customer Service Requests (CSR) system at around the same time as a previous request. Duplicate reports/requests are labeled as such in the Status field, as either "Open - Dup" or "Completed - Dup." Data is updated daily.
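Filtering on the Status field as this view does can be sketched in plain Python (the service-request numbers are invented; the status labels follow the description):

```python
# The Status field marks duplicates as "Open - Dup" or "Completed - Dup";
# the filtered view keeps only non-duplicate requests.
requests = [
    {"sr_number": "19-001", "status": "Open"},
    {"sr_number": "19-002", "status": "Open - Dup"},
    {"sr_number": "19-003", "status": "Completed"},
    {"sr_number": "19-004", "status": "Completed - Dup"},
]

non_duplicates = [r for r in requests if not r["status"].endswith("- Dup")]
print([r["sr_number"] for r in non_duplicates])  # ['19-001', '19-003']
```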

  13. Additional file 1: of A proficient cost reduction framework for de-duplication of records in data integration

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    + more versions
    Cite
    Asif Sohail; Muhammad Yousaf (2023). Additional file 1: of A proficient cost reduction framework for de-duplication of records in data integration [Dataset]. http://doi.org/10.6084/m9.figshare.c.3637745_D1.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Asif Sohail; Muhammad Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset-A with one duplicate against an original record and one modification per duplicate record. (CSV 92 kb)

  14. Unique entry occurrence compared to label count.

    • plos.figshare.com
    xls
    Updated May 2, 2024
    + more versions
    Cite
    Erik D. Huckvale; Hunter N. B. Moseley (2024). Unique entry occurrence compared to label count. [Dataset]. https://plos.figshare.com/articles/dataset/Unique_entry_occurrence_compared_to_label_count_/25740470
    Explore at:
    Available download formats: xls
    Dataset updated
    May 2, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Erik D. Huckvale; Hunter N. B. Moseley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
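The failure mode described, duplicates straddling the train/test boundary and inflating cross-validation scores, is avoided by de-duplicating on the molecular representation before splitting. A sketch in plain Python (the SMILES strings and labels are illustrative):

```python
# De-duplicate entries on their SMILES string before k-fold splitting,
# so identical molecules cannot appear in both train and test folds.
entries = [
    ("CCO", "pathway_A"),
    ("CCO", "pathway_A"),     # duplicate entry; would leak across folds
    ("c1ccccc1", "pathway_B"),
]

seen = set()
deduplicated = []
for smiles, label in entries:
    if smiles not in seen:
        seen.add(smiles)
        deduplicated.append((smiles, label))

print(len(deduplicated))  # 2
```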

  15. Supplementary materials - The Large Number of Duplicate Records in International Survey Projects: The Need for Data Quality Control

    • dataverse.harvard.edu
    docx, xlsx
    Updated Jul 17, 2015
    Cite
    Harvard Dataverse (2015). Supplementary materials - The Large Number of Duplicate Records in International Survey Projects: The Need for Data Quality Control [Dataset]. http://doi.org/10.7910/DVN/HPXFA1
    Explore at:
    Available download formats: docx(29983), docx(16124), docx(16622), xlsx(118576), docx(12326), docx(39473), docx(152667), docx(12327), docx(14177), xlsx(15968), xlsx(23707), xlsx(18164)
    Dataset updated
    Jul 17, 2015
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    1966 - 2013
    Dataset funded by
    National Science Centre
    Description

    These materials provide detailed information about our findings and allow researchers to replicate analyses presented in the paper.

  16. Duplicate Image Detection

    • kaggle.com
    Updated Dec 17, 2021
    Cite
    Gerry (2021). Duplicate Image Detection [Dataset]. https://www.kaggle.com/datasets/gpiosenka/duplicate-image-detection
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 17, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gerry
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Gerry

    Released under CC0: Public Domain

    Contents

  17. SoundDesc: Cleaned and Group-Filtered Splits

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Aug 26, 2023
    Cite
    Benno Weck; Benno Weck; Xavier Serra; Xavier Serra (2023). SoundDesc: Cleaned and Group-Filtered Splits [Dataset]. http://doi.org/10.5281/zenodo.7665917
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 26, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Benno Weck; Benno Weck; Xavier Serra; Xavier Serra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload contains dataset splits of SoundDesc [1] and other supporting material for our paper:

    Data leakage in cross-modal retrieval training: A case study [arXiv] [ieeexplore]

    In our paper, we demonstrated that a data leakage problem in the previously published splits of SoundDesc leads to overly optimistic retrieval results.
    Using an off-the-shelf audio fingerprinting software, we identified that the data leakage stems from duplicates in the dataset.
    We define two new splits for the dataset: a cleaned split to remove the leakage and a group-filtered split to avoid other kinds of weak contamination of the test data.
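The idea behind a group-filtered split can be sketched in plain Python: items that share a group identifier must all land on the same side of the split (the clip names and group labels are illustrative; this is not the paper's actual split-generation code):

```python
# Group-aware split: all items sharing a group id (e.g. recordings from
# the same source) go to the same side, avoiding leakage across the split.
items = [
    ("clip1", "groupA"), ("clip2", "groupA"),
    ("clip3", "groupB"), ("clip4", "groupC"),
]
test_groups = {"groupB"}

train = [name for name, g in items if g not in test_groups]
test = [name for name, g in items if g in test_groups]
print(train, test)  # ['clip1', 'clip2', 'clip4'] ['clip3']
```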

    SoundDesc is a dataset which was automatically sourced from the BBC Sound Effects web page [2]. The results from our paper can be reproduced using clean_split01 and group_filtered_split01.

    If you use the splits, please cite our work:

    Benno Weck, Xavier Serra, "Data Leakage in Cross-Modal Retrieval Training: A Case Study," ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10094617.

    @INPROCEEDINGS{10094617,
     author={Weck, Benno and Serra, Xavier},
     booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
     title={Data Leakage in Cross-Modal Retrieval Training: A Case Study}, 
     year={2023},
     volume={},
     number={},
     pages={1-5},
     doi={10.1109/ICASSP49357.2023.10094617}}
    

    References:

    [1] A. S. Koepke, A. -M. Oncescu, J. Henriques, Z. Akata and S. Albanie, "Audio Retrieval with Natural Language Queries: A Benchmark Study," in IEEE Transactions on Multimedia, doi: 10.1109/TMM.2022.3149712.

    [2] https://sound-effects.bbcrewind.co.uk/

  18. D

    Document Duplication Detection Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 3, 2025
    Cite
    Data Insights Market (2025). Document Duplication Detection Software Report [Dataset]. https://www.datainsightsmarket.com/reports/document-duplication-detection-software-1421242
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global market for Document Duplication Detection Software is experiencing robust growth, driven by the increasing need for efficient data management and enhanced security across various industries. The rising volume of digital documents, coupled with stricter regulatory compliance requirements (like GDPR and CCPA), is fueling the demand for solutions that can quickly and accurately identify duplicate files. This reduces storage costs, improves data quality, and minimizes the risk of data breaches. The market's expansion is further propelled by advancements in artificial intelligence (AI) and machine learning (ML) technologies, which enable more sophisticated and accurate duplicate detection. We estimate the current market size to be around $800 million in 2025, with a Compound Annual Growth Rate (CAGR) of 15% projected through 2033. This growth is expected across various segments, including cloud-based and on-premise solutions, catering to diverse industry verticals such as legal, finance, healthcare, and government. Major players like Microsoft, IBM, and Oracle are contributing to market growth through their established enterprise solutions. However, the market also features several specialized players, like Hyper Labs and Auslogics, offering niche solutions catering to specific needs. While the increasing adoption of cloud-based solutions is a key trend, potential restraints include the initial investment costs for software implementation and the need for ongoing training and support. The integration challenges with existing systems and the potential for false positives can also impede wider adoption. The market's regional distribution is expected to see a significant contribution from North America and Europe, while the Asia-Pacific region is projected to exhibit substantial growth potential driven by increasing digitalization. 
The forecast period (2025-2033) presents significant opportunities for market expansion, driven by technological innovation and the growing awareness of data management best practices.

  19. schaaf summaries - duplicates!?

    • data.oaklandca.gov
    application/rdfxml +5
    Updated Jul 10, 2025
    Cite
    City of Oakland Public Ethics Commission (2025). schaaf summaries - duplicates!? [Dataset]. https://data.oaklandca.gov/dataset/schaaf-summaries-duplicates-/e5wg-z7ke
    Explore at:
    Available download formats: application/rssxml, application/rdfxml, csv, tsv, json, xml
    Dataset updated
    Jul 10, 2025
    Authors
    City of Oakland Public Ethics Commission
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset includes all summary totals e-filed on Fair Political Practices Commission (FPPC) Form 460 Summary Page from 2011 to the present. The data is current as of the last modified date on this dataset. See the data key for column definitions: https://data.sfgov.org/Ethics/Campaign-Finance-Data-Key/wygs-cc76

  20. 311 Service Requests - Alley Lights Out - No Duplicates

    • data.cityofchicago.org
    • data.wu.ac.at
    Updated Mar 6, 2019
    Cite
    City of Chicago (2018). 311 Service Requests - Alley Lights Out - No Duplicates [Dataset]. https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Alley-Lights-Out-No-Duplicate/up7z-t43p
    Explore at:
    Available download formats: csv, xml, tsv, application/rssxml, application/rdfxml, kmz, application/geo+json, kml
    Dataset updated
    Mar 6, 2019
    Dataset authored and provided by
    City of Chicago
    Description

    Note: This filtered view shows only those service requests from the underlying dataset that are not marked as duplicates. -- This dataset contains all open 311 reports of one or more lights out on a wooden pole in the alley and all completed requests since January 1, 2011. If two requests regarding the same address are made within 30 calendar days of each other, the newest CSR is automatically given the status of “Duplicate (Open)”. Once the alley light is repaired, the CSR status will read “Completed” for the original request and “Duplicate (Closed)” for any duplicate requests. Data is updated daily.
