Terms: https://webtechsurvey.com/terms
A complete list of live websites using the Gf Prevent Duplicates technology, compiled through global website indexing conducted by WebTechSurvey.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Basic Information:
Number of entries: 374,661
Number of features: 19
Data types: 15 integer columns, 3 float columns, 1 object column (label)
Column names: id, Time, Is_CH, who CH, Dist_To_CH, ADV_S, ADV_R, JOIN_S, JOIN_R, SCH_S, SCH_R, Rank, DATA_S, DATA_R, Data_Sent_To_BS, dist_CH_To_BS, send_code, Consumed Energy, label

Explore the Dataset
First Five Rows:
    id  Time  Is_CH  who CH  Dist_To_CH  ADV_S  ADV_R  JOIN_S  JOIN_R  SCH_S  SCH_R  Rank  DATA_S  DATA_R  Data_Sent_To_BS  dist_CH_To_BS  send_code  Consumed Energy  label
    101000  50  1  101000  0.00000  1  0  0  25  1  0  0  0  1200  48  0.00000  1  0.00000  Attack
    101001  50  0  101044  75.32345  0  4  1  0  0  1  2  38  0  0  0.00000  1  0.09797  Normal
    101002  50  0  101010  46.95453  0  4  1  0  0  1  19  41  0  0  0.00000  1  0.09797  Normal
    101003  50  0  101044  64.85231  0  4  1  0  0  1  16  38  0  0  0.00000  1  0.09797  Normal
    101004  50  0  101010  4.83341  0  4  1  0  0  1  0  41  0  0  0.00000  1  0.09797  Normal

Missing Values: No missing values detected in the dataset.
Statistical Summary:
The dataset includes various features related to network operations such as Time, Is_CH, who CH, Dist_To_CH, and more. The target variable label contains categorical values such as "Attack" and "Normal".

Analyze Class Distribution:
Let's analyze the distribution of the classes within the dataset.
    class_distribution = dataset['label'].value_counts()
    class_distribution

Handle Class Imbalance:
If a class imbalance is identified, appropriate techniques such as undersampling, oversampling, or SMOTE (Synthetic Minority Over-sampling Technique) will be applied to balance the dataset.
Next Steps:
1. Identify the class distribution.
2. Apply balancing techniques if necessary.
3. Continue with data preprocessing and feature engineering.
We will perform the class distribution analysis and balancing in the subsequent step.
I have found some duplicate values and dropped them:
    dataset.duplicated().sum()
    dataset.drop_duplicates(inplace=True)
ChatGPT:
Duplicate Handling
Initial duplicate count: 8,873
Action taken: duplicates dropped. The dataset has been cleaned by removing all duplicate entries.
Verification: duplicates after cleaning: 0
The dataset is now free of duplicate entries. We can proceed with the analysis of the class distribution and any necessary steps to handle class imbalance.
Analyze Class Distribution:
Let's analyze the distribution of the label column to understand the balance between the classes.
    class_distribution = dataset['label'].value_counts()
    class_distribution
I'll compute this now.
Class Distribution Analysis
The distribution of the classes within the dataset is as follows:
Normal: 332,040
Grayhole: 13,909
Blackhole: 10,049
TDMA: 6,633
Flooding: 3,157
Observations:
There is a significant class imbalance, with the majority class being "Normal" (332,040 instances). Minority classes include "Grayhole", "Blackhole", "TDMA", and "Flooding".
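As a sketch of the balancing step mentioned earlier, the snippet below randomly oversamples each minority class up to the size of the majority class using pandas only; SMOTE from the imbalanced-learn package would be an alternative if synthetic samples are preferred. It assumes the dataset DataFrame and the 'label' column used in the snippets above.

    import pandas as pd

    # Size of the largest class ("Normal" in this dataset).
    majority_size = dataset['label'].value_counts().max()

    # Randomly resample every minority class (with replacement) up to the majority size.
    balanced = pd.concat(
        [
            group if len(group) == majority_size
            else group.sample(n=majority_size, replace=True, random_state=42)
            for _, group in dataset.groupby('label')
        ],
        ignore_index=True,
    )

    balanced['label'].value_counts()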
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The beginner dataset files contain duplicate entries. Duplicate data can lead to errors in analysis and reporting, making it essential to identify and remove them.
Duplicate File: The file pretty_dd_automobile.json includes the duplicate entries found in automobile.csv.
Steps to Identify Duplicates:
1. Load the data from automobile.csv.
2. Analyze the data for duplicates with KnoDL.
3. Save the identified duplicates to the file pretty_dd_automobile.json.
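For readers without KnoDL, a minimal pandas sketch of the same workflow (not the KnoDL tool itself) could look like this; the file names follow the ones above, although the exact structure of the JSON written by KnoDL may differ.

    import pandas as pd

    # 1. Load the data.
    df = pd.read_csv("automobile.csv")

    # 2. Analyze for duplicates; keep=False flags every member of a duplicate group.
    duplicates = df[df.duplicated(keep=False)]
    print(f"{df.duplicated().sum()} duplicate rows found")

    # 3. Save the identified duplicates as pretty-printed JSON.
    duplicates.to_json("pretty_dd_automobile.json", orient="records", indent=2)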
Video Tutorial:
For a visual example of finding duplicates, you can watch the following YouTube video: Duplicate Detection in Kaggle's Automobile Dataset Using KnoDL
These steps and examples will help you correctly document the duplicate entries and provide a clear tutorial for users.
dimonds.csv 88 positions
employee.csv 2673 positions
facebook.csv 51 positions
forest.csv 4 positions
france.csv 16 positions
germany.csv 15 positions
income.csv 2762 positions
insurance.csv 1 position
iris.csv 4 positions
traffic.csv 253 positions
tweets.csv 26 positions
The U.S. Geological Survey (USGS) makes long-term seismic hazard forecasts that are used in building codes. The hazard models usually consider only natural seismicity; non-tectonic (man-made) earthquakes are excluded because they are transitory or too small. In the past decade, however, thousands of earthquakes related to underground fluid injection have occurred in the central and eastern U.S. (CEUS), and some have caused damage. In response, the USGS is now also making short-term forecasts that account for the hazard from these induced earthquakes. A uniform earthquake catalog is assembled by combining and winnowing pre-existing source catalogs. Seismicity statistics are analyzed to develop recurrence models, accounting for catalog completeness. In the USGS hazard modeling methodology, earthquakes are counted on a map grid, recurrence models are applied to estimate the rates of future earthquakes in each grid cell, and these rates are combined with maximum-magnitude models and ground-motion models to compute the hazard. The USGS published a forecast for the years 2016 and 2017. This data set is the catalog of natural and induced earthquakes without duplicates. Duplicate events have been removed based on a hierarchy of the source catalogs. Explosions and mining-related events have been deleted.
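The published catalog is already de-duplicated; purely to illustrate hierarchy-based duplicate removal (this is not the USGS code), the sketch below keeps, for each group of matched events, the record from the most preferred source catalog. The catalog names, the file name, and the event_key grouping column are hypothetical placeholders.

    import pandas as pd

    # Hypothetical preference order: lower rank = more trusted source catalog.
    priority = {"catalog_A": 0, "catalog_B": 1, "catalog_C": 2}

    events = pd.read_csv("combined_catalog.csv")
    events["rank"] = events["source_catalog"].map(priority)

    # Within each group of matched events, keep only the highest-priority record.
    deduped = (
        events.sort_values("rank")
              .drop_duplicates(subset="event_key", keep="first")
              .drop(columns="rank")
    )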
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TO DO Checklist:
- Clean data
- Remove duplicates
- Handle missing values
- Standardize data formats
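A minimal pandas sketch of the checklist, assuming a generic tabular file named data.csv and a hypothetical 'date' column for the format-standardization step:

    import pandas as pd

    df = pd.read_csv("data.csv")  # hypothetical input file

    # Remove duplicates.
    df = df.drop_duplicates()

    # Handle missing values: drop fully empty rows, fill numeric gaps with the median.
    df = df.dropna(how="all")
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Standardize data formats, e.g. parse a hypothetical 'date' column.
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"], errors="coerce")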
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Blockchain data query: total distinct count (duplicates removed).
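The query language used is not stated; as a generic pandas illustration, a total distinct count after removing duplicates could be computed as follows (the transactions.csv file and the address column are hypothetical):

    import pandas as pd

    tx = pd.read_csv("transactions.csv")

    # nunique() counts each value once, i.e. a distinct count with duplicates removed.
    total_distinct = tx["address"].nunique()
    print(total_distinct)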
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The presence of duplicates introduced by PCR amplification is a major issue in paired short reads from next-generation sequencing platforms. These duplicates might have a serious impact on research applications, such as scaffolding in whole-genome sequencing and discovering large-scale genome variations, and are usually removed. We present FastUniq as a fast de novo tool for removal of duplicates in paired short reads. FastUniq identifies duplicates by comparing sequences between read pairs and does not require complete genome sequences as prerequisites. FastUniq is capable of simultaneously handling reads with different lengths and results in highly efficient running time, which increases linearly at an average speed of 87 million reads per 10 minutes. FastUniq is freely available at http://sourceforge.net/projects/fastuniq/.
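FastUniq itself is written in C and is available at the link above; as a conceptual sketch only (not the FastUniq algorithm), duplicate removal by comparing read-pair sequences can be expressed in Python as keeping the first occurrence of each (read 1 sequence, read 2 sequence) pair from two FASTQ files:

    def read_fastq(path):
        """Yield (header, sequence, plus, quality) records from a FASTQ file."""
        with open(path) as handle:
            while True:
                record = [handle.readline().rstrip() for _ in range(4)]
                if not record[0]:
                    break
                yield tuple(record)

    def dedup_pairs(fq1, fq2, out1, out2):
        seen = set()
        with open(out1, "w") as o1, open(out2, "w") as o2:
            for r1, r2 in zip(read_fastq(fq1), read_fastq(fq2)):
                key = (r1[1], r2[1])  # compare only the two sequences
                if key in seen:
                    continue          # duplicate pair: skip it
                seen.add(key)
                o1.write("\n".join(r1) + "\n")
                o2.write("\n".join(r2) + "\n")

    dedup_pairs("reads_1.fastq", "reads_2.fastq", "unique_1.fastq", "unique_2.fastq")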
Introduction: Do you use Windows Live Mail or the eM Client email application? Are you looking for an advanced solution to the question of how to delete duplicate files in Windows Live Mail? De-duplication is a necessary task: a large number of duplicate files can cause trouble by consuming space on your local drive, and it may also reduce the efficiency of the email client. In this article I will describe an accurate approach that can help you de-duplicate EML files efficiently. Let's start now.
CubexSoft EML Duplicate Remover Tool is a straightforward way to delete multiple duplicate EML files in batch mode. The software de-duplicates while retaining the formatting properties and other components of the data, and no restriction or constraint is placed on file size. Users can de-duplicate EML files from Windows Live Mail, Thunderbird, eM Client, Apple Mail, DreamMail, etc., without installing any of these email apps, as this is a full-fledged standalone application. Users can also understand the workings of the software without any technical skill. There are separate options, "search duplicates within the folders" and "search duplicate emails across the folders", which play a significant role in detecting all duplicate files on the system. Filter options are available for better specification of files according to date, to, from, subject, and root folder. Users can also specify a destination path of their choice, and they receive a complete report of the de-duplication process in Notepad when it finishes.
Follow these easy steps to remove duplicate email files in batch:
Step 1: Launch EML Duplicates Remover.
Step 2: Upload files using the "Select Files" or "Select Folder" option.
Step 3: Choose specific files from the listed files using the checkboxes; check or uncheck them as required.
Step 4: To search for duplicates, two options are available: "search duplicate email within the folder" and "search duplicates across the folder".
Step 5: Add filters for a more specific run, such as date range, to, email attachments, and root folders.
Step 6: Browse to the desired path, then click the "Remove" button.
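The steps above refer to the CubexSoft GUI; for comparison only, a rough Python sketch of the same idea, flagging duplicate .eml files by their Message-ID plus a hash of the body text, is shown below. It is not part of the tool, and the folder path is a placeholder.

    import hashlib
    from email import policy
    from email.parser import BytesParser
    from pathlib import Path

    def fingerprint(eml_path):
        """Identify an email by its Message-ID and a hash of its body text."""
        with open(eml_path, "rb") as fh:
            msg = BytesParser(policy=policy.default).parse(fh)
        body = msg.get_body(preferencelist=("plain", "html"))
        body_text = body.get_content() if body else ""
        digest = hashlib.sha256(body_text.encode("utf-8", "ignore")).hexdigest()
        return (msg.get("Message-ID", ""), digest)

    seen = {}
    for path in Path("C:/mail_export").rglob("*.eml"):  # placeholder folder
        key = fingerprint(path)
        if key in seen:
            print(f"Duplicate: {path} (same as {seen[key]})")
            # path.unlink()  # uncomment to actually delete the duplicate
        else:
            seen[key] = path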
Will this utility allow me to remove duplicate emails from eM Client as well? Answer: Yes, users can de-duplicate from all EML-based email clients such as Windows Live Mail, eM Client, etc.
Can I take a free trial before purchasing a license key? Answer: Yes, a free demo of the EML Duplicate Remover is open to all.
As a user from a non-technical background, will I be able to learn the software's functioning easily? Answer: Yes, you will not face any trouble; it is a user-friendly application.
Summing Up: Users are advised to try the free demo edition of this tool. All Windows operating system versions are supported, for example Windows 10, 11, 8.1, 8, 7, XP, and Vista. The free trial allows de-duplication of up to 25 emails without charge.
Terms of use: https://doi.org/10.4121/resource:terms_of_use
A collection of 800 synthesized models with duplicated tasks and their corresponding logs. Used in the experiments for the paper "Handling Duplicated Tasks in Process Discovery by Refining Event Labels", which is accepted in BPM 2016.
dapo dataset processed with community instructions.

    import pandas as pd
    import polars as pl

    # Load the original parquet file with pandas.
    df = pd.read_parquet('DAPO-Math-17k/data/dapo-math-17k.parquet')

    # Convert to polars and drop exact duplicates on the key columns.
    pl_df = pl.from_pandas(df).unique(subset=["data_source", "prompt", "ability", "reward_model"])

    # Count how many distinct reward_model values each prompt has.
    pl_df = pl_df.with_columns(
        pl.col("reward_model").n_unique().over("prompt").alias("n_rm")
    )

    cleaned = pl_df.filter(pl.col("n_rm") ==…

See the full description on the dataset page: https://huggingface.co/datasets/fengyao1909/dapo-math-17k-deduplicated.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains details of 2,495,958 unique public tags added to 10,403,650 resources in Trove between August 2008 and June 2024. I harvested the data using the Trove API and saved it as a CSV file with the following columns:
tag – lower-cased text tag
date – date the tag was added
zone – API zone containing the tagged resource
record_id – the identifier of the tagged resource
I've documented the method used to harvest the tags in this notebook.
Using the zone and record_id you can find more information about a tagged item. To create URLs to the resources in Trove (see the sketch after this list):
for resources in the 'book', 'article', 'picture', 'music', 'map', and 'collection' zones add the record_id to https://trove.nla.gov.au/work/
for resources in the 'newspaper' and 'gazette' zones add the record_id to https://trove.nla.gov.au/article/
for resources in the 'list' zone add the record_id to https://trove.nla.gov.au/list/
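As a quick sketch of the rules above, the following function maps a zone and record_id to a Trove URL; the zone names and base URLs come straight from the list, while the example record_id is made up.

    WORK_ZONES = {"book", "article", "picture", "music", "map", "collection"}

    def trove_url(zone, record_id):
        """Build a Trove URL from the zone and record_id columns of the tag dataset."""
        if zone in WORK_ZONES:
            return f"https://trove.nla.gov.au/work/{record_id}"
        if zone in {"newspaper", "gazette"}:
            return f"https://trove.nla.gov.au/article/{record_id}"
        if zone == "list":
            return f"https://trove.nla.gov.au/list/{record_id}"
        raise ValueError(f"Unknown zone: {zone}")

    print(trove_url("newspaper", "12345"))  # hypothetical record_id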
Notes:
Works (such as books) in Trove can have tags attached at either work or version level. This dataset aggregates all tags at the work level, removing any duplicates.
A single resource in Trove can appear in multiple zones – for example, a book that includes maps and illustrations might appear in the 'book', 'picture', and 'map' zones. This means that some of the tags will essentially be duplicates – harvested from different zones, but relating to the same resource. Depending on your needs, you might want to remove these duplicates.
While most of the tags were added by Trove users, more than 500,000 tags were added by Trove itself in November 2009. I think these tags were automatically generated from related Wikipedia pages. Depending on your needs, you might want to exclude these by limiting the date range or zones.
User content added to Trove, including tags, is available for reuse under a CC-BY-NC licence.
See this notebook for some examples of how you can manipulate, analyse, and visualise the tag data.
Privacy policy: https://dataintelo.com/privacy-and-policy
According to our latest research, the global Duplicate Folder Cleanup Tools market size reached USD 1.24 billion in 2024, with a robust growth trajectory expected throughout the forecast period. The market is projected to expand at a CAGR of 11.2% from 2025 to 2033, reaching a forecasted value of USD 3.13 billion by 2033. This significant growth is fueled by the increasing demand for efficient data management solutions across enterprises and individuals, driven by the exponential rise in digital content and the need to optimize storage resources.
The primary growth factor for the Duplicate Folder Cleanup Tools market is the unprecedented surge in digital data generation across all sectors. Organizations and individuals alike are grappling with vast amounts of redundant files and folders that not only consume valuable storage space but also hinder operational efficiency. As businesses undergo digital transformation and migrate to cloud platforms, the risk of data duplication escalates, necessitating advanced duplicate folder cleanup tools. These solutions play a pivotal role in reducing storage costs, enhancing data accuracy, and streamlining workflows, making them indispensable in today’s data-driven landscape.
Another critical driver contributing to the market’s expansion is the increasing adoption of cloud computing and hybrid IT environments. As enterprises shift their infrastructure to cloud-based platforms, the complexity of managing and organizing data multiplies. Duplicate folder cleanup tools, especially those with robust automation and AI-powered features, are being rapidly integrated into cloud ecosystems to address these challenges. The ability to seamlessly identify, analyze, and remove redundant folders across diverse environments is a compelling value proposition for organizations aiming to maintain data hygiene and regulatory compliance.
Furthermore, the growing emphasis on data security and compliance is accelerating the uptake of duplicate folder cleanup solutions. Regulatory frameworks such as GDPR, HIPAA, and CCPA mandate stringent data management practices, including the elimination of unnecessary or duplicate records. Failure to comply can result in substantial penalties and reputational damage. As a result, organizations are investing in advanced duplicate folder cleanup tools that not only enhance storage efficiency but also ensure adherence to legal and industry standards. The integration of these tools with enterprise data governance strategies is expected to further propel market growth in the coming years.
Regionally, North America continues to dominate the Duplicate Folder Cleanup Tools market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The high adoption rate of digital technologies, coupled with the presence of leading software vendors and tech-savvy enterprises, positions North America as a key growth engine. Meanwhile, Asia Pacific is witnessing the fastest CAGR, driven by rapid digitalization, expanding IT infrastructure, and increasing awareness about efficient data management solutions. Latin America and Middle East & Africa are also emerging as promising markets, supported by growing investments in digital transformation initiatives.
The Component segment of the Duplicate Folder Cleanup Tools market is bifurcated into Software and Services, both of which play integral roles in addressing the challenges of data redundancy. Software solutions form the backbone of this segment, encompassing standalone applications, integrated modules, and AI-powered platforms designed to automate the detection and removal of duplicate folders. The software segment leads the market, owing to its scalability, ease of deployment, and continuous innovation in features such as real-time monitoring, advanced analytics, and seamless integration with existing IT ecosystems. Organizations are increasingly prioritizing software that offers intuitive user interfaces and robust security protocols, ensuring both efficiency and compliance.
On the other hand, the Services segment includes consulting, implementation, customization, and support services that complement software offerings. As enterprises grapple with complex IT environments, the demand for specialized services to tailor duplicate folder cleanup solutions to uniqu
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Weighted contribution of included publications related to results of preanalytical procedures.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This Web page outlines a process to use and understand ("read") the whole of a HathiTrust collection. Such a process is outlined here:
1) Articulate a research question.
2) Search the 'Trust and create a collection.
3) Download the collection file and refine it, or at the least, remove duplicates.
4) Use the result as input to htid2books; download the full text of each item.
5) Use Reader Toolbox to build a "study carrel"; create a data set.
6) Compute against the data set to address the research question.
7) Go to Step 1; repeat iteratively.
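For step 3, a minimal pandas sketch of removing duplicates from the downloaded collection file is shown below; it assumes a tab-separated export with an 'htid' column, which may need adjusting to match the actual HathiTrust collection download.

    import pandas as pd

    # Hypothetical file name and column; adjust to the actual collection export.
    collection = pd.read_csv("hathitrust_collection.tsv", sep="\t")
    deduped = collection.drop_duplicates(subset="htid")
    deduped.to_csv("hathitrust_collection_deduped.tsv", sep="\t", index=False)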
Quadrant provides insightful, accurate, and reliable mobile location data.
Our privacy-first mobile location data unveils hidden patterns and opportunities, provides actionable insights, and fuels data-driven decision-making at the world's biggest companies.
These companies rely on our privacy-first Mobile Location and Points-of-Interest Data to unveil hidden patterns and opportunities, provide actionable insights, and fuel data-driven decision-making. They build better AI models, uncover business insights, and enable location-based services using our robust and reliable real-world data.
We conduct stringent evaluations of data providers to ensure authenticity and quality. Our proprietary algorithms detect and cleanse corrupted and duplicated data points, allowing you to leverage our datasets rapidly with minimal processing or cleaning. During the ingestion process, our proprietary Data Filtering Algorithms remove events based on a number of qualitative factors, as well as latency and other integrity variables, to provide more efficient data delivery. The deduplicating algorithm focuses on a combination of four important attributes: Device ID, Latitude, Longitude, and Timestamp. It scours our data, identifies rows that contain the same combination of these four attributes, retains a single copy, and eliminates the duplicate values to ensure our customers only receive complete and unique datasets.
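The proprietary pipeline is not public; as a simplified illustration of the four-attribute rule described above, a pandas sketch might look like the following (the file and column names are assumptions, not Quadrant's schema):

    import pandas as pd

    # Hypothetical extract of the location feed.
    pings = pd.read_csv("mobility_feed.csv")

    # Keep a single row per (Device ID, Latitude, Longitude, Timestamp) combination.
    unique_pings = pings.drop_duplicates(
        subset=["device_id", "latitude", "longitude", "timestamp"],
        keep="first",
    )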
We actively identify overlapping values at the provider level to determine the value each offers. Our data science team has developed a sophisticated overlap analysis model that helps us maintain a high-quality data feed by qualifying providers based on unique data values rather than volumes alone – measures that provide significant benefit to our end-use partners.
Quadrant mobility data contains all standard attributes such as Device ID, Latitude, Longitude, Timestamp, Horizontal Accuracy, and IP Address, and non-standard attributes such as Geohash and H3. In addition, we have historical data available back through 2022.
Through our in-house data science team, we offer sophisticated technical documentation, location data algorithms, and queries that help data buyers get a head start on their analyses. Our goal is to provide you with data that is “fit for purpose”.
Lineworks copied directly from the NHDHighRes data that is present in the SGID10 database. UDWR water names and water IDs have been assigned to the features. Original NHD features were copied from the NHDHighRes feature class around 2014. Please note that some of the linework could have been captured prior to 2014 and be from an earlier version of the NHDHighRes data set. Permanent_Identifier and ReachCode were copied directly from the NHDHighRes data set. Updated on 10/01/2019 to remove duplicate linework.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is a dataset of randomly generated 8-dimensional Q-factorial Fano toric varieties of Picard rank 2.
The data is divided into four plain text files:
The numbers 7 and 10 in the file names indicate the bound on the weights used when generating the data. Those varieties with at worst terminal singularities are in the files "bound_N_terminal.txt", and those with non-terminal singularities are in the files "bound_N_non_terminal.txt". The data within each file is de-duplicated, however the data in different files may contain duplicates (for example, it is possible that "bound_7_terminal.txt" and "bound_10_terminal.txt" contain some identical entries).
Each line of a file specifies the entries of a (2 x 10)-matrix. For example, the first line of "bound_7_terminal.txt" is:
[[5,6,7,7,5,2,5,3,2,2],[0,0,0,1,1,2,6,4,3,3]]
and this corresponds to the 8-dimensional Q-factorial Fano toric variety with weight matrix
5 6 7 7 5 2 5 3 2 2
0 0 0 1 1 2 6 4 3 3
and stability condition given by the sum of the columns, which in this case is
44
20
It can be checked that, in this case, the corresponding variety has at worst terminal singularities. In this example the largest occurring weight in the matrix is 7.
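As a small check of the example above, the following Python snippet parses the first line of "bound_7_terminal.txt" and sums the columns of the weight matrix to recover the stability condition (44, 20):

    from ast import literal_eval

    with open("bound_7_terminal.txt") as fh:
        weights = literal_eval(fh.readline())  # [[5,6,7,7,5,2,5,3,2,2],[0,0,0,1,1,2,6,4,3,3]]

    # The stability condition is the componentwise sum of the ten columns,
    # i.e. the sum of each row of the (2 x 10) weight matrix.
    stability = [sum(row) for row in weights]
    print(stability)  # [44, 20]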
The number of entries in each file is:
For details, see the paper:
"Machine learning detects terminal singularities", Tom Coates, Alexander M. Kasprzyk, and Sara Veneziale. Neural Information Processing Systems (NeurIPS), 2023.
Magma code capable of generating this dataset is in the file "terminal_dim_8.m". The bound on the weights is set on line 142 by adjusting the value of 'k' (currently set to 10). The target dimension is set on line 143 by adjusting the value of 'dim' (currently set to 8). It is important to note that this code does not attempt to remove duplicates. The code also does not guarantee that the resulting variety has dimension 8. Deduplication and verification of the dimension need to be done separately, after the data has been generated.
If you make use of this data, please cite the above paper and the DOI for this data:
doi:10.5281/zenodo.10046893
Privacy policy: https://dataintelo.com/privacy-and-policy
According to our latest research, the global duplicate payment detection market size reached USD 1.12 billion in 2024, driven by the increasing adoption of automated financial controls and advanced analytics across enterprises. The market is expected to witness a robust CAGR of 13.2% from 2025 to 2033, with the value projected to reach USD 3.36 billion by 2033. This impressive growth is primarily fueled by the rising need to reduce financial leakages, enhance compliance, and improve operational efficiency in financial processes worldwide.
The expansion of the duplicate payment detection market is strongly influenced by the rapid digital transformation across industries. As organizations transition from manual to automated financial processes, the risk of duplicate payments due to system integration issues, data entry errors, and complex vendor relationships becomes more pronounced. This has heightened the demand for advanced duplicate payment detection solutions that leverage artificial intelligence (AI), machine learning (ML), and data analytics to identify and prevent duplicate transactions in real-time. Furthermore, the increasing regulatory scrutiny and the need for transparent financial reporting have compelled organizations to invest in robust payment control systems, further propelling market growth.
Another significant growth driver is the proliferation of cloud-based financial management systems. Cloud deployment offers scalability, flexibility, and cost-effectiveness, making it particularly attractive to small and medium enterprises (SMEs) that lack the resources for extensive on-premises infrastructure. The integration of duplicate payment detection capabilities within cloud-based enterprise resource planning (ERP) and accounts payable (AP) solutions enables organizations to centralize financial data, streamline workflows, and ensure consistent application of controls across multiple business units and geographies. This shift towards cloud solutions is expected to accelerate market growth, especially in emerging economies where digital adoption is on the rise.
Additionally, the evolving landscape of global business operations, characterized by complex supply chains and multi-currency transactions, has amplified the risk of payment errors and fraud. Organizations are increasingly recognizing the financial and reputational risks associated with duplicate payments, prompting a surge in the adoption of specialized detection tools. These tools not only help in identifying duplicate invoices and payments but also provide actionable insights for process improvement and fraud prevention. The growing emphasis on cost optimization and the need to safeguard against financial losses are expected to sustain the demand for duplicate payment detection solutions in the coming years.
From a regional perspective, North America continues to dominate the duplicate payment detection market, accounting for the largest revenue share in 2024. This is attributed to the presence of large enterprises, stringent regulatory frameworks, and early adoption of advanced financial technologies in the region. However, the Asia Pacific region is anticipated to witness the highest growth rate during the forecast period, driven by the rapid digitalization of financial processes in countries such as China, India, and Japan. The increasing focus on compliance, coupled with the expanding presence of multinational corporations, is expected to create lucrative opportunities for market players in this region.
The duplicate payment detection market by component is segmented into software and services. The software segment currently accounts for the largest share, as organizations increasingly deploy advanced solutions to automate payment auditing and control processes. Modern duplicate payment detection software incorporates sophisticated algorithms, AI, and ML to analyze vast volumes of transactional data, identify anomalies, and flag potential duplicate entries with high accuracy. These solutions are often integrated with existing ERP and financial management systems, providing seamless workflows and real-time alerts. The growing complexity of business operations, coupled with the need for continuous monitoring, has made software solutions indispensable for organizations aiming to minimize payment errors and improve financial governance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Lightning Talk at the International Digital Curation Conference 2025. The presentation examines OpenAIRE's solution to the "entity disambiguation" problem, presenting a hybrid data curation method that combines deduplication algorithms with the expertise of human curators to ensure high-quality, interoperable scholarly information.

Entity disambiguation is invaluable to building a robust and interconnected open scholarly communication system. It involves accurately identifying and differentiating entities such as authors, organisations, data sources and research results across various entity providers. This task is particularly complex in contexts like the OpenAIRE Graph, where metadata is collected from over 100,000 data sources. Different metadata describing the same entity can be collected multiple times, potentially providing different information, such as different Persistent Identifiers (PIDs) or names, for the same entity. This heterogeneity poses several challenges to the disambiguation process. For example, the same organisation may be referenced using different names in different languages, or abbreviations. In some cases, even the use of PIDs might not be effective, as different identifiers may be assigned by different data providers. Therefore, accurate entity disambiguation is essential for ensuring data quality, improving search and discovery, facilitating knowledge graph construction, and supporting reliable research impact assessment.

To address this challenge, OpenAIRE employs a deduplication algorithm to identify and merge duplicate entities, configured to handle different entity types. While the algorithm proves effective for research results, when applied to organisations and data sources it needs to be complemented with human curation and validation, since additional information may be needed. OpenAIRE's data source disambiguation relies primarily on the OpenAIRE technical team overseeing the deduplication process and ensuring accurate matches across the DRIS, FAIRsharing, re3data, and OpenDOAR registries. While the algorithm automates much of the process, human experts verify matches, address discrepancies, and actively search for matches not proposed by the algorithm. External stakeholders, such as data source managers, can also contribute by submitting suggestions through a dedicated ticketing system. So far OpenAIRE has curated almost 3,935 groups for a total of 8,140 data sources.

To address organisational disambiguation, OpenAIRE developed OpenOrgs, a hybrid system combining automated processes and human expertise. The tool works on organisational data aggregated from multiple sources (the ROR registry, funder databases, CRIS systems, and others) by the OpenAIRE infrastructure, automatically compares metadata, and suggests potential merged entities to human curators. These curators, authorised experts in their respective research landscapes, validate merged entities, identify additional duplicates, and enrich organisational records with missing information such as PIDs, alternative names, and hierarchical relationships. With over 100 curators from 40 countries, OpenOrgs has curated more than 100,000 organisations to date. A dataset containing all the OpenOrgs organisations can be found on Zenodo (https://doi.org/10.5281/zenodo.13271358).

This presentation demonstrates how OpenAIRE's entity disambiguation techniques and OpenOrgs aim to be game-changers for the research community by building and maintaining an integrated open scholarly communication system in the years to come.