100+ datasets found
  1. Papers on duplicate records

    • search.dataone.org
    Updated Nov 21, 2023
    Cite
    Slomczynski, Kazimierz M.; Powałko, Przemek; Krauze, Tadeusz (2023). Papers on duplicate records [Dataset]. http://doi.org/10.7910/DVN/TK1U7E
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Slomczynski, Kazimierz M.; Powałko, Przemek; Krauze, Tadeusz
    Description

    Papers on duplicate records. Visit https://dataone.org/datasets/sha256%3A8af3814a53c4db3d7260b3e66c119db96471654505927ce8a87df16ddf4592ab for complete metadata about this dataset.

  2. Data Quality Assurance - Laboratory duplicates

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Data Quality Assurance - Laboratory duplicates [Dataset]. https://catalog.data.gov/dataset/data-quality-assurance-laboratory-duplicates
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This dataset includes data quality assurance information concerning the Relative Percent Difference (RPD) of laboratory duplicates. No laboratory duplicate information exists for 2010. The formula for calculating relative percent difference is: ABS(2*[(A-B)/(A+B)]). An RPD of less than 10% is considered acceptable.
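The RPD formula in the description can be sketched as a small helper (the function name and the ×100 scaling to express the result as a percent are my additions; the dataset states the formula as a fraction):

```python
def relative_percent_difference(a, b):
    """Relative Percent Difference between a result and its laboratory
    duplicate: ABS(2 * (A - B) / (A + B)), scaled here to a percent."""
    return abs(2 * (a - b) / (a + b)) * 100

# Per the dataset description, an RPD of less than 10% is acceptable.
rpd = relative_percent_difference(10.0, 9.5)
print(rpd)  # ~5.13%
```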

  3. Potential Duplicate Products Report

    • catalog.data.gov
    Updated Feb 16, 2023
    Cite
    DHS (2023). Potential Duplicate Products Report [Dataset]. https://catalog.data.gov/dataset/potential-duplicate-products-report
    Explore at:
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    U.S. Department of Homeland Security (http://www.dhs.gov/)
    Description

    Displays potential software and hardware product duplicates within a manufacturer. Product duplicates have the same name, component, and manufacturer. Also displays duplicate software versions (patch level and edition must be the same) and hardware models within a product.

  4. Duplicate Analysis

    • kaggle.com
    Updated Jun 2, 2025
    Cite
    Alinaswe Simfukwe (2025). Duplicate Analysis [Dataset]. https://www.kaggle.com/datasets/alinaswesimfukwe/duplicate-analysis/data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alinaswe Simfukwe
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Overview:

    Total Records: 749
    Original Records: 700
    Duplicate Records: 49 (7% of total)
    File Name: synthetic_claims_with_duplicates.csv

    Key Features:

    Claim Information: Unique claim IDs (CLAIM000001 to CLAIM000700); Employee IDs (EMP0001 to EMP0700); realistic employee names
    Financial Data: Amounts range from 100.00 to 20,000.00; service codes SVC001, SVC002, SVC003, SVC004; departments: Finance, HR, IT, Marketing, Operations
    Transaction Details: Dates within the last 2 years; timestamps for submission; statuses: Submitted, Approved, Paid; random UUIDs for submitter IDs
    Fraud Detection: 49 exact duplicates (7%); random distribution throughout the dataset; Boolean is_duplicate flag for identification

    Purpose: The dataset is designed to test fraud detection systems, particularly for identifying duplicate transactions. It simulates real-world scenarios where duplicate entries might occur due to fraud or data entry errors.

    Usage:

    Testing duplicate transaction detection
    Training fraud detection models
    Data validation and cleaning
    Algorithm benchmarking

    The dataset is now ready for analysis in your fraud detection system.
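A minimal sketch of the exact-duplicate check this dataset is built to exercise, in plain Python (the field values are illustrative, not taken from the actual CSV):

```python
from collections import Counter

# Represent each claim by the fields that should uniquely identify it
# (employee, service code, date, amount); values here are made up.
claims = [
    ("EMP0001", "SVC001", "2024-05-01", 1500.00),
    ("EMP0002", "SVC002", "2024-05-02", 220.50),
    ("EMP0001", "SVC001", "2024-05-01", 1500.00),  # exact duplicate
]

counts = Counter(claims)
duplicates = [claim for claim, n in counts.items() if n > 1]
print(duplicates)  # the EMP0001 claim appears twice
```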

  5. quora-duplicates

    • huggingface.co
    Updated Apr 27, 2024
    Cite
    Sentence Transformers (2024). quora-duplicates [Dataset]. https://huggingface.co/datasets/sentence-transformers/quora-duplicates
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 27, 2024
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for Quora Duplicate Questions

    This dataset contains the Quora Question Pairs dataset in four formats that are easily used with Sentence Transformers to train embedding models. The data was originally created by Quora for this Kaggle Competition.

      Dataset Subsets

      pair-class subset

    Columns: "sentence1", "sentence2", "label"
    Column types: str, str, class with {"0": "different", "1": "duplicate"}
    Examples: { 'sentence1': 'What is the step by step… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/quora-duplicates.

  6. Data from: The comparative landscape of duplications in Heliconius melpomene and Heliconius cydno

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Sep 23, 2016
    Cite
    Ana Pinharanda; Simon H. Martin; Sarah L. Barker; John W. Davey; Chris D. Jiggins (2016). The comparative landscape of duplications in Heliconius melpomene and Heliconius cydno [Dataset]. http://doi.org/10.5061/dryad.8jv30
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 23, 2016
    Dataset provided by
    University of Cambridge
    Authors
    Ana Pinharanda; Simon H. Martin; Sarah L. Barker; John W. Davey; Chris D. Jiggins
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Central America
    Description

    Gene duplications can facilitate adaptation and may lead to inter-population divergence, causing reproductive isolation. We used whole-genome re-sequencing data from 34 butterflies to detect duplications in two Heliconius species, H. cydno and H. melpomene. Taking advantage of three distinctive signals of duplication in short-read sequencing data, we identified 744 duplicated loci in H. cydno and H. melpomene, 96% of which were validated with single molecule sequencing. We found that duplications overlap genes significantly less than expected at random in H. melpomene, consistent with the action of background selection against duplicates in functional regions of the genome. Duplicate loci that are highly differentiated between H. melpomene and H. cydno map to four different chromosomes. Four duplications were identified with a strong signal of divergent selection, including an odorant binding protein and another in close proximity to a known wing colour pattern locus that differs between the two species.

  7. Mobile Location Data | Asia | +300M Unique Devices | +100M Daily Users | +200B Events / Month

    • datarade.ai
    .json, .csv, .xls
    Updated Mar 21, 2025
    Cite
    Quadrant (2025). Mobile Location Data | Asia | +300M Unique Devices | +100M Daily Users | +200B Events / Month [Dataset]. https://datarade.ai/data-products/mobile-location-data-asia-300m-unique-devices-100m-da-quadrant
    Explore at:
    Available download formats: .json, .csv, .xls
    Dataset updated
    Mar 21, 2025
    Dataset authored and provided by
    Quadrant
    Area covered
    Asia, Oman, Korea (Democratic People's Republic of), Israel, Palestine, Iran (Islamic Republic of), Georgia, Armenia, Bahrain, Kyrgyzstan, Philippines
    Description

    Quadrant provides insightful, accurate, and reliable mobile location data.

    Our privacy-first mobile location data unveils hidden patterns and opportunities, provides actionable insights, and fuels data-driven decision-making at the world's biggest companies.

    These companies rely on our privacy-first Mobile Location and Points-of-Interest Data to unveil hidden patterns and opportunities, provide actionable insights, and fuel data-driven decision-making. They build better AI models, uncover business insights, and enable location-based services using our robust and reliable real-world data.

    We conduct stringent evaluations on data providers to ensure authenticity and quality. Our proprietary algorithms detect and cleanse corrupted and duplicated data points, allowing you to leverage our datasets rapidly with minimal processing or cleaning. During the ingestion process, our proprietary Data Filtering Algorithms remove events based on a number of qualitative factors, as well as latency and other integrity variables, to provide more efficient data delivery. The deduplicating algorithm focuses on a combination of four important attributes: Device ID, Latitude, Longitude, and Timestamp. It scours our data and identifies rows that contain the same combination of these four attributes. Post-identification, it retains a single copy and eliminates duplicate values to ensure our customers only receive complete and unique datasets.
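The deduplication rule described, keeping one copy per (Device ID, Latitude, Longitude, Timestamp) combination, can be sketched in plain Python (the event records are illustrative, not Quadrant's actual pipeline):

```python
# Deduplicate location events on the four attributes named above:
# Device ID, Latitude, Longitude, Timestamp. First occurrence is kept.
events = [
    {"device_id": "abc", "lat": 1.30, "lon": 103.80, "ts": 1700000000},
    {"device_id": "abc", "lat": 1.30, "lon": 103.80, "ts": 1700000000},  # duplicate
    {"device_id": "abc", "lat": 1.31, "lon": 103.80, "ts": 1700000060},
]

seen = set()
unique_events = []
for e in events:
    key = (e["device_id"], e["lat"], e["lon"], e["ts"])
    if key not in seen:
        seen.add(key)
        unique_events.append(e)

print(len(unique_events))  # 2
```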

    We actively identify overlapping values at the provider level to determine the value each offers. Our data science team has developed a sophisticated overlap analysis model that helps us maintain a high-quality data feed by qualifying providers based on unique data values rather than volumes alone – measures that provide significant benefit to our end-use partners.

    Quadrant mobility data contains all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and IP Address, and non-standard attributes such as Geohash and H3. In addition, we have historical data available back through 2022.

    Through our in-house data science team, we offer sophisticated technical documentation, location data algorithms, and queries that help data buyers get a head start on their analyses. Our goal is to provide you with data that is “fit for purpose”.

  8. Catalog of natural and induced earthquakes without duplicates

    • datasets.ai
    • search.dataone.org
    • +2more
    55
    Updated Sep 11, 2024
    + more versions
    Cite
    Department of the Interior (2024). Catalog of natural and induced earthquakes without duplicates [Dataset]. https://datasets.ai/datasets/catalog-of-natural-and-induced-earthquakes-without-duplicates
    Explore at:
    Available download formats: 55
    Dataset updated
    Sep 11, 2024
    Dataset authored and provided by
    Department of the Interior
    Description

    The U. S. Geological Survey (USGS) makes long-term seismic hazard forecasts that are used in building codes. The hazard models usually consider only natural seismicity; non-tectonic (man-made) earthquakes are excluded because they are transitory or too small. In the past decade, however, thousands of earthquakes related to underground fluid injection have occurred in the central and eastern U.S. (CEUS), and some have caused damage. In response, the USGS is now also making short-term forecasts that account for the hazard from these induced earthquakes. A uniform earthquake catalog is assembled by combining and winnowing pre-existing source catalogs. Seismicity statistics are analyzed to develop recurrence models, accounting for catalog completeness. In the USGS hazard modeling methodology, earthquakes are counted on a map grid, recurrence models are applied to estimate the rates of future earthquakes in each grid cell, and these rates are combined with maximum-magnitude models and ground-motion models to compute the hazard. The USGS published a forecast for the years 2016 and 2017. This data set is the catalog of natural and induced earthquakes without duplicates. Duplicate events have been removed based on a hierarchy of the source catalogs. Explosions and mining related events have been deleted.

  9. Data from: Identification of factors associated with duplicate rate in ChIP-seq data

    • omicsdi.org
    Updated Jul 19, 2023
    + more versions
    Cite
    (2023). Identification of factors associated with duplicate rate in ChIP-seq data. [Dataset]. https://www.omicsdi.org/dataset/biostudies/S-EPMC6447195
    Explore at:
    Dataset updated
    Jul 19, 2023
    Variables measured
    Unknown
    Description

    Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contains redundant reads termed duplicates, referring to those mapping to the same genomic location and strand. There are two main sources of duplicates: polymerase chain reaction (PCR) duplicates and natural duplicates. Unlike natural duplicates that represent true signals from sequencing of independent DNA templates, PCR duplicates are artifacts originating from sequencing of identical copies amplified from the same DNA template. In analysis, duplicates are removed from peak calling and signal quantification. Nevertheless, a significant portion of the duplicates is believed to represent true signals. Obviously, removing all duplicates will underestimate the signal level in peaks and impact the identification of signal changes across samples. Therefore, an in-depth evaluation of the impact from duplicate removal is needed. Using eight public ChIP-seq datasets from three narrow-peak and two broad-peak marks, we tried to understand the distribution of duplicates in the genome, the extent by which duplicate removal impacts peak calling and signal estimation, and the factors associated with duplicate level in peaks. The three PCR-free histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq data had about 40% duplicates and 97% of them were within peaks. For the other datasets generated with PCR amplification of ChIP DNA, as expected, the narrow-peak marks have a much higher proportion of duplicates than the broad-peak marks. We found that duplicates are enriched in peaks and largely represent true signals, more conspicuous in those with high confidence. Furthermore, duplicate level in peaks is strongly correlated with the target enrichment level estimated using nonredundant reads, which provides the basis to properly allocate duplicates between noise and signal. 
Our analysis supports the feasibility of retaining the portion of signal duplicates into downstream analysis, thus alleviating the limitation of complete deduplication.
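The duplicate definition used above, reads mapping to the same genomic location and strand, can be sketched as follows (the read coordinates are illustrative):

```python
from collections import Counter

# A read is a duplicate of another if it maps to the same
# (chromosome, position, strand); values here are made up.
reads = [
    ("chr1", 1000, "+"),
    ("chr1", 1000, "+"),  # duplicate (PCR or natural; indistinguishable here)
    ("chr1", 1000, "-"),  # same position, opposite strand: not a duplicate
    ("chr2", 5000, "+"),
]

counts = Counter(reads)
n_duplicates = sum(n - 1 for n in counts.values())
print(n_duplicates)  # 1
```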

  10. Data from: Duplicated Dataset

    • universe.roboflow.com
    zip
    Updated Dec 31, 2024
    Cite
    AVANTHIKA S (2024). Duplicated Dataset [Dataset]. https://universe.roboflow.com/avanthika-s-nfpex/duplicated/model/1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 31, 2024
    Dataset authored and provided by
    AVANTHIKA S
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Non_emergency 9m2i Bounding Boxes
    Description

    Duplicated

    ## Overview
    
    Duplicated is a dataset for object detection tasks - it contains Non_emergency 9m2i annotations for 942 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  11. Label quantities of the non-duplicate entries and duplicate entries compared to the original dataset

    • plos.figshare.com
    xls
    Updated May 2, 2024
    Cite
    Erik D. Huckvale; Hunter N. B. Moseley (2024). Label quantities of the non-duplicate entries and duplicate entries compared to the original dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0299583.t008
    Explore at:
    Available download formats: xls
    Dataset updated
    May 2, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Erik D. Huckvale; Hunter N. B. Moseley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Label quantities of the non-duplicate entries and duplicate entries compared to the original dataset.

  12. 311 Service Requests - Graffiti Removal - No Duplicates

    • data.cityofchicago.org
    • data.wu.ac.at
    Updated Mar 6, 2019
    Cite
    City of Chicago (2019). 311 Service Requests - Graffiti Removal - No Duplicates [Dataset]. https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-No-Duplicate/8tus-apua
    Explore at:
    Available download formats: application/rdfxml, csv, tsv, application/rssxml, xml, kml, application/geo+json, kmz
    Dataset updated
    Mar 6, 2019
    Dataset authored and provided by
    City of Chicago
    Description

    Note: This filtered view shows only those service requests from the underlying dataset that are not marked as duplicates. -- All open graffiti removal requests made to 311 and all requests completed since January 1, 2011. The Department of Streets & Sanitation's Graffiti Blasters crews offer a vandalism removal service to private property owners. Graffiti Blasters employ "blast" trucks that use baking soda under high water pressure to erase painted graffiti from brick, stone and other mineral surfaces. They also use paint trucks to cover graffiti on the remaining surfaces. Organizations and residents may report graffiti and request its removal. 311 sometimes receives duplicate requests for graffiti removal. Requests that have been labeled as Duplicates are in the same geographic area and have been entered into 311’s Customer Service Requests (CSR) system at around the same time as a previous request. Duplicate reports/requests are labeled as such in the Status field, as either "Open - Dup" or "Completed - Dup." Data is updated daily.
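Filtering on the Status field as this view does can be sketched in plain Python (the service-request numbers are invented; the status labels follow the description):

```python
# The Status field marks duplicates as "Open - Dup" or "Completed - Dup";
# the filtered view keeps only non-duplicate requests.
requests = [
    {"sr_number": "19-001", "status": "Open"},
    {"sr_number": "19-002", "status": "Open - Dup"},
    {"sr_number": "19-003", "status": "Completed"},
    {"sr_number": "19-004", "status": "Completed - Dup"},
]

non_duplicates = [r for r in requests if not r["status"].endswith("- Dup")]
print([r["sr_number"] for r in non_duplicates])  # ['19-001', '19-003']
```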

  13. Additional file 1: of A proficient cost reduction framework for de-duplication of records in data integration

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    + more versions
    Cite
    Asif Sohail; Muhammad Yousaf (2023). Additional file 1: of A proficient cost reduction framework for de-duplication of records in data integration [Dataset]. http://doi.org/10.6084/m9.figshare.c.3637745_D1.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Asif Sohail; Muhammad Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset-A with one duplicate against an original record and one modification per duplicate record. (CSV 92 kb)

  14. Unique entry occurrence compared to label count.

    • plos.figshare.com
    xls
    Updated May 2, 2024
    + more versions
    Cite
    Erik D. Huckvale; Hunter N. B. Moseley (2024). Unique entry occurrence compared to label count. [Dataset]. https://plos.figshare.com/articles/dataset/Unique_entry_occurrence_compared_to_label_count_/25740470
    Explore at:
    Available download formats: xls
    Dataset updated
    May 2, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Erik D. Huckvale; Hunter N. B. Moseley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
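The failure mode described, duplicates straddling the train/test boundary and inflating cross-validation scores, is avoided by de-duplicating on the molecular representation before splitting. A sketch in plain Python (the SMILES strings and labels are illustrative):

```python
# De-duplicate entries on their SMILES string before k-fold splitting,
# so identical molecules cannot appear in both train and test folds.
entries = [
    ("CCO", "pathway_A"),
    ("CCO", "pathway_A"),     # duplicate entry; would leak across folds
    ("c1ccccc1", "pathway_B"),
]

seen = set()
deduplicated = []
for smiles, label in entries:
    if smiles not in seen:
        seen.add(smiles)
        deduplicated.append((smiles, label))

print(len(deduplicated))  # 2
```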

  15. Supplementary materials - The Large Number of Duplicate Records in International Survey Projects: The Need for Data Quality Control

    • dataverse.harvard.edu
    docx, xlsx
    Updated Jul 17, 2015
    Cite
    Harvard Dataverse (2015). Supplementary materials - The Large Number of Duplicate Records in International Survey Projects: The Need for Data Quality Control [Dataset]. http://doi.org/10.7910/DVN/HPXFA1
    Explore at:
    Available download formats: docx(29983), docx(16124), docx(16622), xlsx(118576), docx(12326), docx(39473), docx(152667), docx(12327), docx(14177), xlsx(15968), xlsx(23707), xlsx(18164)
    Dataset updated
    Jul 17, 2015
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    1966 - 2013
    Dataset funded by
    National Science Centre
    Description

    These materials provide detailed information about our findings and allow researchers to replicate analyses presented in the paper.

  16. Duplicate Image Detection

    • kaggle.com
    Updated Dec 17, 2021
    Cite
    Gerry (2021). Duplicate Image Detection [Dataset]. https://www.kaggle.com/datasets/gpiosenka/duplicate-image-detection
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 17, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gerry
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Gerry

    Released under CC0: Public Domain

    Contents

  17. SoundDesc: Cleaned and Group-Filtered Splits

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Aug 26, 2023
    Cite
    Benno Weck; Benno Weck; Xavier Serra; Xavier Serra (2023). SoundDesc: Cleaned and Group-Filtered Splits [Dataset]. http://doi.org/10.5281/zenodo.7665917
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 26, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Benno Weck; Benno Weck; Xavier Serra; Xavier Serra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This upload contains dataset splits of SoundDesc [1] and other supporting material for our paper:

    Data leakage in cross-modal retrieval training: A case study [arXiv] [ieeexplore]

    In our paper, we demonstrated that a data leakage problem in the previously published splits of SoundDesc leads to overly optimistic retrieval results.
    Using an off-the-shelf audio fingerprinting software, we identified that the data leakage stems from duplicates in the dataset.
    We define two new splits for the dataset: a cleaned split to remove the leakage and a group-filtered split to avoid other kinds of weak contamination of the test data.
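The idea behind a group-filtered split can be sketched in plain Python: items that share a group identifier must all land on the same side of the split (the clip names and group labels are illustrative; this is not the paper's actual split-generation code):

```python
# Group-aware split: all items sharing a group id (e.g. recordings from
# the same source) go to the same side, avoiding leakage across the split.
items = [
    ("clip1", "groupA"), ("clip2", "groupA"),
    ("clip3", "groupB"), ("clip4", "groupC"),
]
test_groups = {"groupB"}

train = [name for name, g in items if g not in test_groups]
test = [name for name, g in items if g in test_groups]
print(train, test)  # ['clip1', 'clip2', 'clip4'] ['clip3']
```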

    SoundDesc is a dataset which was automatically sourced from the BBC Sound Effects web page [2]. The results from our paper can be reproduced using clean_split01 and group_filtered_split01.

    If you use the splits, please cite our work:

    Benno Weck, Xavier Serra, "Data Leakage in Cross-Modal Retrieval Training: A Case Study," ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5, doi: 10.1109/ICASSP49357.2023.10094617.

    @INPROCEEDINGS{10094617,
     author={Weck, Benno and Serra, Xavier},
     booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
     title={Data Leakage in Cross-Modal Retrieval Training: A Case Study}, 
     year={2023},
     volume={},
     number={},
     pages={1-5},
     doi={10.1109/ICASSP49357.2023.10094617}}
    

    References:

    [1] A. S. Koepke, A. -M. Oncescu, J. Henriques, Z. Akata and S. Albanie, "Audio Retrieval with Natural Language Queries: A Benchmark Study," in IEEE Transactions on Multimedia, doi: 10.1109/TMM.2022.3149712.

    [2] https://sound-effects.bbcrewind.co.uk/

  18. D

    Document Duplication Detection Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 3, 2025
    Cite
    Data Insights Market (2025). Document Duplication Detection Software Report [Dataset]. https://www.datainsightsmarket.com/reports/document-duplication-detection-software-1421242
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global market for Document Duplication Detection Software is experiencing robust growth, driven by the increasing need for efficient data management and enhanced security across various industries. The rising volume of digital documents, coupled with stricter regulatory compliance requirements (like GDPR and CCPA), is fueling the demand for solutions that can quickly and accurately identify duplicate files. This reduces storage costs, improves data quality, and minimizes the risk of data breaches. The market's expansion is further propelled by advancements in artificial intelligence (AI) and machine learning (ML) technologies, which enable more sophisticated and accurate duplicate detection. We estimate the current market size to be around $800 million in 2025, with a Compound Annual Growth Rate (CAGR) of 15% projected through 2033. This growth is expected across various segments, including cloud-based and on-premise solutions, catering to diverse industry verticals such as legal, finance, healthcare, and government. Major players like Microsoft, IBM, and Oracle are contributing to market growth through their established enterprise solutions. However, the market also features several specialized players, like Hyper Labs and Auslogics, offering niche solutions catering to specific needs. While the increasing adoption of cloud-based solutions is a key trend, potential restraints include the initial investment costs for software implementation and the need for ongoing training and support. The integration challenges with existing systems and the potential for false positives can also impede wider adoption. The market's regional distribution is expected to see a significant contribution from North America and Europe, while the Asia-Pacific region is projected to exhibit substantial growth potential driven by increasing digitalization. 
The forecast period (2025-2033) presents significant opportunities for market expansion, driven by technological innovation and the growing awareness of data management best practices.

  19. schaaf summaries - duplicates!?

    • data.oaklandca.gov
    application/rdfxml +5
    Updated Jul 10, 2025
    Cite
    City of Oakland Public Ethics Commission (2025). schaaf summaries - duplicates!? [Dataset]. https://data.oaklandca.gov/dataset/schaaf-summaries-duplicates-/e5wg-z7ke
    Explore at:
    Available download formats: application/rssxml, application/rdfxml, csv, tsv, json, xml
    Dataset updated
    Jul 10, 2025
    Authors
    City of Oakland Public Ethics Commission
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset includes all summary totals e-filed on Fair Political Practices Commission (FPPC) Form 460 Summary Page from 2011 to the present. The data is current as of the last modified date on this dataset. See the data key for column definitions: https://data.sfgov.org/Ethics/Campaign-Finance-Data-Key/wygs-cc76

  20. 311 Service Requests - Alley Lights Out - No Duplicates

    • data.cityofchicago.org
    • data.wu.ac.at
    Updated Mar 6, 2019
    Cite
    City of Chicago (2018). 311 Service Requests - Alley Lights Out - No Duplicates [Dataset]. https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Alley-Lights-Out-No-Duplicate/up7z-t43p
    Explore at:
    Available download formats: csv, xml, tsv, application/rssxml, application/rdfxml, kmz, application/geo+json, kml
    Dataset updated
    Mar 6, 2019
    Dataset authored and provided by
    City of Chicago
    Description

    Note: This filtered view shows only those service requests from the underlying dataset that are not marked as duplicates. -- This dataset contains all open 311 reports of one or more lights out on a wooden pole in the alley and all completed requests since January 1, 2011. If two requests regarding the same address are made within 30 calendar days of each other, the newest CSR is automatically given the status of “Duplicate (Open)”. Once the alley light is repaired, the CSR status will read “Completed” for the original request and “Duplicate (Closed)” for any duplicate requests. Data is updated daily.
