Photographic capture–recapture is a valuable tool for obtaining demographic information on wildlife populations due to its noninvasive nature and cost-effectiveness. Recently, several computer-aided photo-matching algorithms have been developed to more efficiently match images of unique individuals in databases with thousands of images. However, the identification accuracy of these algorithms can severely bias estimates of vital rates and population size. Therefore, it is important to understand the performance and limitations of state-of-the-art photo-matching algorithms prior to implementation in capture–recapture studies involving possibly thousands of images. Here, we compared the performance of four photo-matching algorithms, Wild-ID, I3S Pattern+, APHIS, and AmphIdent, using multiple amphibian databases of varying image quality. We measured the performance of each algorithm and evaluated it in relation to database size and the number of matching images in the database....
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of the Patient Matching Algorithm Challenge is to bring about greater transparency and data on the performance of existing patient matching algorithms, spur the adoption of performance metrics for patient data matching algorithm vendors, and positively impact other aspects of patient matching such as deduplication and linking to clinical data. Participants will be provided a data set and will have their answers evaluated and scored against a master key. Up to 6 cash prizes will be awarded with a total purse of up to $75,000.00 (https://www.patientmatchingchallenge.com/).
The test dataset used in the ONC Patient Matching Algorithm Challenge is available for download by students, researchers, or anyone else interested in additional analysis and patient matching algorithm development. More information about the Patient Matching Algorithm Challenge can be found at https://www.patientmatchingchallenge.com/.
The dataset containing 1 million patients was split into eight files of alphabetical groupings by the patient's last name, plus an additional file containing test patients with no last name recorded (Null). All files should be downloaded and merged for analysis: https://github.com/onc-healthit/patient-matching
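The merge step is straightforward; a minimal sketch, assuming the split files have been downloaded locally as CSVs (the directory name and file layout below are assumptions, not the challenge's official structure):

```python
# Hypothetical merge of the alphabetically split challenge files into one table.
from pathlib import Path

import pandas as pd

data_dir = Path("patient-matching-data")   # assumed local download location
parts = sorted(data_dir.glob("*.csv"))     # the eight alphabetical splits plus the Null file

patients = pd.concat((pd.read_csv(p, dtype=str) for p in parts), ignore_index=True)
print(f"Merged {len(parts)} files into {len(patients)} patient records")
```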
The same product could have different titles, descriptions and product IDs on different sites, depending on the structure of each site.
Our algorithm allows our clients to automatically match and track the performance of the same products across multiple platforms such as eBay, Amazon, and DTC sites.
In this research, we propose a variant of the classical Matching Pursuit Decomposition (MPD) algorithm with significantly improved scalability and computational performance. MPD is a powerful iterative algorithm that decomposes a signal into linear combinations of its dictionary elements or “atoms”. A best-fit atom from an arbitrarily defined dictionary is determined through cross-correlation. The selected atom is subtracted from the signal, and this procedure is repeated on the residual in subsequent iterations until a stopping criterion is met. A sufficiently large dictionary is required for an accurate reconstruction; this in turn increases the computational burden of the algorithm, thus limiting its applicability and level of adoption. Our main contribution lies in improving the computational efficiency of the algorithm to allow faster decomposition while maintaining a similar level of accuracy. The Correlation Thresholding and Multiple Atom Extraction techniques are proposed to decrease the computational burden of the algorithm. Correlation thresholds prune insignificant atoms from the dictionary. The ability to extract multiple atoms within a single iteration enhances the effectiveness and efficiency of each iteration. The proposed algorithm, entitled MPD++, was demonstrated using a real-world data set.
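As a rough illustration of these two ideas (not the exact MPD++ rules from the paper), a minimal NumPy sketch of greedy pursuit with correlation-based pruning and batched atom extraction could look as follows; the pruning rule, parameter names, and defaults are assumptions:

```python
import numpy as np

def matching_pursuit(x, D, n_iter=50, corr_threshold=0.05, atoms_per_iter=3, tol=1e-6):
    """Greedy sparse decomposition of x over a dictionary D with unit-norm columns.

    Simplified illustration: atoms whose correlation stays below a relative
    threshold are pruned, and up to `atoms_per_iter` atoms are extracted per
    iteration (correlations are not recomputed within a batch, a deliberate
    simplification for non-orthogonal dictionaries).
    """
    r = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    active = np.ones(D.shape[1], dtype=bool)            # atoms still under consideration

    for _ in range(n_iter):
        c = D.T @ r                                      # cross-correlation with the residual
        c[~active] = 0.0
        active &= np.abs(c) >= corr_threshold * np.abs(c).max()   # prune weak atoms
        for i in np.argsort(np.abs(c))[::-1][:atoms_per_iter]:    # top-k atoms this iteration
            coeffs[i] += c[i]
            r -= c[i] * D[:, i]
        if np.linalg.norm(r) < tol:
            break
    return coeffs, r
```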
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Test code to reproduce the results of the paper. This work presents a robust stereo matching method for occluded regions. First, we generate cost volumes using the CENSUS transform and the scale-invariant feature transform (SIFT). Then, label-based cost volumes are aggregated from the two generated cost volumes using adaptive support weights and the SLIC scheme. To obtain the optimal disparity from the two label-based cost volumes, we select the disparity corresponding to the high-confidence similarity of CENSUS or SIFT at the minimum-cost point.
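A minimal sketch of the CENSUS-based half of such a pipeline (window size, cost-volume layout, and the winner-takes-all readout are generic choices, not the paper's implementation):

```python
import numpy as np

def census_transform(img, window=5):
    """Per-pixel binary census descriptor: each bit records whether a neighbour
    is darker than the window centre (a standard formulation)."""
    h, w = img.shape
    r = window // 2
    padded = np.pad(img, r, mode="edge")
    desc = np.zeros((h, w, window * window), dtype=bool)
    k = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            desc[..., k] = padded[r + dy : r + dy + h, r + dx : r + dx + w] < img
            k += 1
    return desc

def census_cost_volume(left, right, max_disp):
    """Hamming-distance matching cost for every candidate disparity."""
    cl, cr = census_transform(left), census_transform(right)
    h, w, _ = cl.shape
    cost = np.full((h, w, max_disp), np.inf)
    for d in range(max_disp):
        cost[:, d:, d] = np.count_nonzero(cl[:, d:] ^ cr[:, : w - d], axis=-1)
    return cost

# np.argmin(cost, axis=2) would give a simple winner-takes-all disparity map.
```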
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The name of each file indicates the following information: {type of sequence}_{type of measure}_{sequence properties}_{additional information}.csv
{type of sequence} - 'synth' for synthetic data or 'london' for real mobility data from London, UK.
{type of measure} - 'r2' for the R-squared measure or 'corr' for Spearman's correlation.
{sequence properties} - for synthetic data there are three types of sequences, described in the research article (random, markovian, nonstationary). For real mobility data this part includes information about data processing parameters: (...)_london_{type of mobility sequence}_{DBSCAN epsilon value}_{DBSCAN min_pts value}. {type of mobility sequence} is 'seq' for next-place sequences and '30min' or '1H' for next time-bin sequences, indicating the size of the time bin.
Files with 'predictability' at the end of the file name contain R-squared and Spearman's correlation of measures calculated in relation to the predictability measure.
R2 files include the values of R-squared for all types of modelled regression functions:
- 'line' indicates {y = ax + b} for a single variable and {y = ax + by + c} for two variables.
- 'expo' indicates {y = a*x^b + c} for a single variable and {y = a*x^b + c*y^d + e} for two variables.
- 'log' indicates {y = a*log(x*b) + c} for a single variable and {y = a*x + c*log(y) + e + d*x*log(y)} for two variables.
- 'logf' indicates {y = a*log(x) + c*log(y) + e + b*log(x)*log(y)} for two variables.
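As a hedged illustration of how such R-squared values can be computed for the single-variable forms (the exact fitting procedure used in the article is not reproduced here; model names, starting values, and the synthetic example are assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

# The three single-variable regression families listed above.
models = {
    "line": lambda x, a, b: a * x + b,
    "expo": lambda x, a, b, c: a * np.power(x, b) + c,
    "log":  lambda x, a, b, c: a * np.log(x * b) + c,
}

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic example data, for illustration only.
x = np.linspace(1, 10, 50)
y = 2.0 * np.log(x) + 1.0 + np.random.default_rng(0).normal(0, 0.1, x.size)

for name, f in models.items():
    n_params = f.__code__.co_argcount - 1
    params, _ = curve_fit(f, x, y, p0=np.ones(n_params), maxfev=10000)
    print(name, round(r_squared(y, f(x, *params)), 3))
```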
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data contains results of simulations from the existing and the modified algorithms used in the paper.
These are the data files for the PLOS ONE journal article "Two-Sided Matching for mentor-mentee allocations - Algorithms and manipulation strategies". Three files are provided:
- Data.xlsx: An overview of the original preferences of mentors and mentees, a data dictionary, and two summary tables used to create figures in the manuscript
- MatchingTables.csv: The outcome matching tables for each simulated scenario and repetition
- Preferences.csv: The (un)manipulated preferences that were used as input to calculate the solution for each simulated scenario and repetition.
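For readers unfamiliar with two-sided matching, a minimal deferred-acceptance (Gale-Shapley) sketch shows the kind of mechanism studied; it is a generic textbook version with one mentee per mentor, not necessarily the algorithm or capacities used in the paper:

```python
def deferred_acceptance(mentee_prefs, mentor_prefs):
    """Mentee-proposing deferred acceptance with one slot per mentor."""
    rank = {m: {me: i for i, me in enumerate(p)} for m, p in mentor_prefs.items()}
    next_choice = {me: 0 for me in mentee_prefs}   # index of the next mentor to propose to
    match = {}                                     # mentor -> mentee
    free = list(mentee_prefs)

    while free:
        mentee = free.pop()
        mentor = mentee_prefs[mentee][next_choice[mentee]]
        next_choice[mentee] += 1
        current = match.get(mentor)
        if current is None:
            match[mentor] = mentee                 # mentor was unmatched, accept
        elif rank[mentor][mentee] < rank[mentor][current]:
            match[mentor] = mentee                 # mentor prefers the new proposer
            free.append(current)
        else:
            free.append(mentee)                    # proposal rejected
    return match

print(deferred_acceptance(
    {"alice": ["m1", "m2"], "bob": ["m1", "m2"]},
    {"m1": ["bob", "alice"], "m2": ["alice", "bob"]},
))
```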
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparing outcomes across different levels of trauma centers is vital in evaluating regionalized trauma care. With observational data, it is critical to adjust for patient characteristics to render valid causal comparisons. Propensity score matching is a popular method to infer causal relationships in observational studies with two treatment arms. Few studies, however, have used matching designs with more than two groups, due to the complexity of matching algorithms. We fill the gap by developing an iterative matching algorithm for the three-group setting. Our algorithm outperforms the nearest neighbor algorithm and is shown to produce matched samples with total distance no larger than twice the optimal distance. We implement the evidence factors method for binary outcomes, which includes a randomization-based testing strategy and a sensitivity analysis for hidden bias in three-group matched designs. We apply our method to the Nationwide Emergency Department Sample data to compare emergency department mortality among non-trauma, level I, and level II trauma centers. Our tests suggest that the admission to a trauma center has a beneficial effect on mortality, assuming no unmeasured confounding. A sensitivity analysis for hidden bias shows that unmeasured confounders, moderately associated with the type of care received, may change the result qualitatively. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
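For orientation, a minimal two-group propensity score matching sketch (greedy 1:1 nearest neighbour with a caliper, matching controls with replacement) conveys the basic idea; it is a generic illustration, not the three-group iterative algorithm developed in the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_match(X, treated, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on the estimated propensity score.

    X: covariate matrix, treated: 0/1 indicator. Controls may be reused
    (matching with replacement); the caliper is on the propensity scale.
    """
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    t_idx, c_idx = np.where(treated == 1)[0], np.where(treated == 0)[0]
    nn = NearestNeighbors(n_neighbors=1).fit(ps[c_idx].reshape(-1, 1))
    dist, pos = nn.kneighbors(ps[t_idx].reshape(-1, 1))
    return [(t, c_idx[p[0]]) for t, d, p in zip(t_idx, dist, pos) if d[0] <= caliper]
```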
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exact pattern matching algorithms are popular and used widely in several applications, such as molecular biology, text processing, image processing, web search engines, network intrusion detection systems and operating systems. The focus of these algorithms is to achieve time efficiency according to the application, but not memory consumption. In this work, we propose a novel idea to achieve both time efficiency and low memory consumption by splitting the query string for searching in the corpus. For a given text, the proposed algorithm splits the query pattern into two equal halves and considers the second (right) half as the query string for searching in the corpus. Once a match is found for the second half, the proposed algorithm applies a brute-force procedure to find the remaining match by referring to the location of the right half. Experimental results on the different S1 Dataset text databases, namely Arabic, English, Chinese, Italian and French, show that the proposed algorithm outperforms the existing S1 Algorithm in terms of time efficiency and memory consumption as the length of the query pattern increases.
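A minimal sketch of the split-and-verify idea (using Python's built-in substring search to locate the right half, which stands in for the exact-matching routine of the paper):

```python
def split_half_search(text, pattern):
    """Locate `pattern` by searching for its right half, then brute-force
    checking the left half immediately before each candidate position."""
    mid = len(pattern) // 2
    left, right = pattern[:mid], pattern[mid:]
    hits, pos = [], text.find(right)
    while pos != -1:
        start = pos - len(left)
        if start >= 0 and text[start:pos] == left:   # verify the left half
            hits.append(start)
        pos = text.find(right, pos + 1)
    return hits

print(split_half_search("abracadabra", "cada"))      # -> [4]
```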
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The four datasets represent the results of two sequential processes. The first process consists of an automatic matching between landmarks from different data sources and landmarks in a reference dataset (the French national topographic database, BDTOPO). In the second process, the 1:1 links are manually validated by experts.
The four different datasets and the BDTOPO dataset are archived here. The data matching algorithm is described in this paper.
Each file contains the matching results for the features belonging to one data source:
- the name of the file depends on the data source
- the column "id_source" corresponds to the identifier of the landmark in the data source
- the column "types_of_matching_results" describes the type of matching result:
  « 1:0 »: the landmark from the data source (e.g. Camptocamp) has no homologous landmark in BDTOPO
  « 1:1 validated »: a homologous feature exists in BDTOPO and the link was validated
  « 1:1 non validated »: the matching link was not validated
  « without candidates »: the landmark was not matched, either because there are no candidates in BDTOPO or because the landmark in the data source is far away from its homologue in BDTOPO
  « uncertain »: complex cases for which no decision is taken by the data matching algorithm
- the columns "id_bdtopo" and "id_candidat" correspond to the identifier of the landmark in BDTOPO if and only if there is a validated matching link
- the column "samal" corresponds to the Samal distance
The matching results are obtained using an ontology application named OOR. These specific results were obtained with OOR version 1.0.1, an improved version that contains new concepts compared to the first release, 1.0.0. The new version of OOR (i.e. 1.0.1) will be released by May 31, 2022, and the link will be added here. This archive is released for transparency and reproducibility purposes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the input data and results used in the paper "Comparative study on matching methods for the distinction of building modifications and replacements based on multi-temporal building footprint data".
License information: The LoD1 data used as input in this study are openly available at Transparenzportal Hamburg (https://transparenz.hamburg.de/), from Freie und Hansestadt Hamburg, Landesbetrieb Geoinformation und Vermessung (LGV), in compliance with the licence dl-de/by-2-0 (https://www.govdata.de/dl-de/by-2-0).
Content:
1. Input footprints of non-identical pairs: input_reference_objects.zip
2. Results without additional position deviation: results_without_deviation.zip
3. Results with generated position deviation, including geometries: results_with_deviation.zip
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data corresponds to the data and experiments described in Section 5 of the following paper:
Algorithms for new types of fair stable matchings
Authors: Frances Cooper and David Manlove
The paper is located at: https://arxiv.org/abs/2001.10875
The software is located at: https://zenodo.org/record/3630383
The data is located at: https://zenodo.org/record/3630349
See the README for more information.
Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.
What Makes Our Data Unique?
Scale and Coverage: - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies. - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.
Rich Attributes for Training Models: - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights. - Tailored for training models in NLP, recommendation systems, and predictive algorithms.
Compliance and Quality: - Fully GDPR and CCPA compliant, providing secure and ethically sourced data. - Extensive data cleaning and validation processes ensure reliability and accuracy.
Annotation-Ready: - Pre-structured and formatted datasets that are easily ingestible into AI workflows. - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.
How Is the Data Sourced? - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques. - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets. This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.
Primary Use Cases and Verticals
Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.
Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.
B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.
HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.
How This Product Fits Into Xverum’s Broader Data Offering Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.
Why Choose Xverum? - Experience and Expertise: A trusted name in structured web data with a proven track record. - Flexibility: Datasets can be tailored for any AI/ML application. - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data. - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.
Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.
Contact us for sample datasets or to discuss your specific needs.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This repository includes datasets for the experimental case studies and analysis of the research "Programmable content and a pattern-matching algorithm for automatic adaptive authoring in Augmented Reality for maintenance".
DOI:
Abstract: "Augmented Reality (AR) can increase efficiency and safety of maintenance operations, but costs of augmented content creation (authoring) are hindering its industrial deployment. A relevant research gap involves the ability of authoring solutions to automatically generate content for multiple operations. Hence, this paper offers programmable content formats and a pattern-matching algorithm for automatic adaptive authoring of ontology-based maintenance data. The proposed solution is validated against common authoring tools for repair and remote diagnosis AR applications in terms of operational efficiency gains achieved with the content they produce. Experimental results show that content from all authoring solutions attain the same time reductions (42%) in comparison with non-AR information delivery tools. Survey results suggest alike perceived usability of all authoring solutions and better content adaptiveness and user’s performance tracking of this authoring proposal."
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains replication code and data for all analysis carried out in the paper along with any anonymized datasets used to run the language-matching algorithm. It also contains the code used to preprocess the administrative data used in the language-matching algorithm (along with the preprocessed versions of the data) and the set of cutoffs we deployed in our pilot with Santa Clara County.
This data corresponds to the data and experiments described in Section 5 of the following paper submitted to the SEA conference 2018:
A 3/2-approximation algorithm for the Student-Project Allocation problem
Authors: Frances Cooper and David Manlove
The data is located at: https://doi.org/10.5281/zenodo.1186823
The software is located at: https://doi.org/10.5281/zenodo.1183221
See the README for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ARG Database is a huge collection of labeled and unlabeled graphs realized by the MIVIA Group.
The aim of this collection is to provide the graph research community with a standard test ground for the benchmarking of graph matching algorithms. The database is organized in two sections: labeled and unlabeled graphs.
Both labeled and unlabeled graphs have been randomly generated according to six different generation models, each involving different possible parameter settings. As a result, 168 diverse kinds of graphs are contained in the database. Each type of unlabeled graph is represented by thousands of pairs of graphs for which an isomorphism or a graph-subgraph isomorphism relation holds, for a total of 143,600 graphs. Furthermore, each type of labeled graph is represented by thousands of pairs of graphs sharing a non-trivial common subgraph, for a total of 166,000 graphs.
For more details follow this link: https://mivia.unisa.it/datasets/graph-database/arg-database/documentation/
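As a hedged usage sketch (the ARG files use their own binary format, and parsing them is not shown here), once a pair of graphs has been loaded into networkx, an off-the-shelf VF2 matcher can serve as a correctness baseline when benchmarking a new graph matching algorithm:

```python
import networkx as nx

def check_pair(g_small, g_large):
    """Report whether the pair is isomorphic and whether g_small appears as an
    induced subgraph of g_large, using networkx's VF2 implementation."""
    gm = nx.algorithms.isomorphism.GraphMatcher(g_large, g_small)
    return {
        "isomorphic": nx.is_isomorphic(g_small, g_large),
        "subgraph_isomorphic": gm.subgraph_is_isomorphic(),
    }

# Toy stand-ins for a parsed ARG pair.
print(check_pair(nx.path_graph(3), nx.cycle_graph(4)))
```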
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Currently, the repository provides code for two such methods:
The ABE fully automated approach: This approach is a fully automated method for linking historical datasets (e.g. complete-count Censuses) by first name, last name and age. The approach was first developed by Ferrie (1996) and adapted and scaled for the computer by Abramitzky, Boustan and Eriksson (2012, 2014, 2017). Because names are often misspelled or mistranscribed, our approach suggests testing robustness to alternative name matching (using raw names, NYSIIS standardization, and Jaro-Winkler distance). To reduce the chances of false positives, our approach suggests testing robustness by requiring names to be unique within a five-year window and/or requiring the match on age to be exact.
A fully automated probabilistic approach (EM): This approach (Abramitzky, Mill, and Perez 2019) suggests a fully automated probabilistic method for linking historical datasets. We combine distances in reported names and ages between each two potential records into a single score, roughly corresponding to the probability that both records belong to the same individual. We estimate these probabilities using the Expectation-Maximization (EM) algorithm, a standard technique in the statistical literature. We suggest a number of decision rules that use these estimated probabilities to determine which records to use in the analysis.
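As a toy illustration of the name-based robustness checks mentioned above (NYSIIS standardization, Jaro-Winkler similarity, and an age band), assuming the jellyfish library is available; the thresholds, record layout, and all-pairs blocking are assumptions, not the ABE defaults:

```python
import jellyfish

def candidate_links(records_a, records_b, name_sim=0.9, max_age_gap=2):
    """Return (id_a, id_b) pairs whose NYSIIS last names agree, whose first names
    are close under Jaro-Winkler, and whose reported ages are within a band."""
    links = []
    for a in records_a:
        for b in records_b:
            if jellyfish.nysiis(a["last"]) != jellyfish.nysiis(b["last"]):
                continue
            if jellyfish.jaro_winkler_similarity(a["first"], b["first"]) < name_sim:
                continue
            if abs(a["age"] - b["age"]) > max_age_gap:
                continue
            links.append((a["id"], b["id"]))
    return links

print(candidate_links(
    [{"id": 1, "first": "Jon", "last": "Smith", "age": 32}],
    [{"id": 7, "first": "John", "last": "Smith", "age": 34}],
))
```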