25 datasets found
  1. SPIDER (v2): Synthetic Person Information Dataset for Entity Resolution

    • figshare.com
    csv
    Updated Oct 29, 2025
    + more versions
    Cite
    Praveen Chinnappa; Rose Mary Arokiya Dass; yash mathur (2025). SPIDER (v2): Synthetic Person Information Dataset for Entity Resolution [Dataset]. http://doi.org/10.6084/m9.figshare.30472712.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Praveen Chinnappa; Rose Mary Arokiya Dass; yash mathur
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SPIDER (v2) – Synthetic Person Information Dataset for Entity Resolution provides researchers with ready-to-use data for benchmarking duplicate-detection and entity resolution algorithms. The dataset focuses on person-level fields typical of customer or citizen records. Because real-world person-level data is restricted by Personally Identifiable Information (PII) constraints, publicly available synthetic datasets are limited in scope, volume, or realism. SPIDER addresses these limitations with a large-scale, realistic dataset containing first name, last name, email, phone, address, and date of birth (DOB) attributes. Using the Python Faker library, 40,000 unique synthetic person records were generated, followed by 10,000 controlled duplicate records derived using seven real-world transformation rules. Each duplicate record is linked to its original base record and rule through the fields is_duplicate_of and duplication_rule.

    Version 2 introduces major realism and structural improvements to both the dataset and the generation framework.

    Enhancements in Version 2

    - New cluster_id column to group base and duplicate records for improved entity-level benchmarking.
    - Improved data realism with consistent field relationships: state and ZIP codes now match correctly, phone numbers are generated based on state codes, and email addresses are logically related to name components.
    - Refined duplication logic: Rule 4 updated for realistic address variation; Rule 7 enhanced to simulate shared accounts among different individuals (with distinct DOBs).
    - Improved data validation and formatting for address, email, and date fields.
    - Updated Python generation script for modular configuration, reproducibility, and extensibility.

    Duplicate Rules (with real-world use cases)

    1. Variation in email address. Use case: same person using multiple email accounts.
    2. Variation in phone number. Use case: same person using multiple contact numbers.
    3. Last-name variation. Use case: name changes or data entry inconsistencies.
    4. Address variation. Use case: same person maintaining multiple addresses or moving residences.
    5. Nickname. Use case: same person using formal and informal names (Robert → Bob, Elizabeth → Liz).
    6. Minor spelling variations in the first name. Use case: legitimate entry or migration errors (Sara → Sarah).
    7. Multiple individuals sharing the same email and last name but different DOBs. Use case: realistic shared accounts among family members or households (benefits, tax, or insurance portals).

    Output Format

    The dataset is available in both CSV and JSON formats for direct use in data-processing, machine-learning, and record-linkage frameworks.

    Data Regeneration

    The included Python script can fully regenerate the dataset and supports:

    - Addition of new duplication rules
    - Regional, linguistic, or domain-specific variations
    - Volume scaling for large-scale testing scenarios

    Files Included

    - spider_dataset_v2_6_20251027_022215.csv
    - spider_dataset_v2_6_20251027_022215.json
    - spider_readme_v2.md
    - SPIDER_generation_script_v2.py
    - SupportingDocuments/ folder containing:
      - benchmark_comparison_script.py – script used to derive the F1 score.
      - Public_census_data_surname.csv – sample U.S. Census name and demographic data used for comparison.
      - ssa_firstnames.csv – Social Security Administration first-names dataset.
      - simplemaps_uszips.csv – ZIP-to-state mapping data used for phone and address validation.
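    The gold links described above make pairwise scoring straightforward. Below is a minimal sketch in plain Python; the record_id field name is an assumption, while is_duplicate_of and duplication_rule come from the dataset description:

```python
# Sketch: scoring an entity-resolution run against SPIDER-style gold links.
# Field names: record_id is assumed; is_duplicate_of / duplication_rule
# are taken from the dataset description.

def gold_pairs(rows):
    """Collect unordered (base_id, duplicate_id) pairs from is_duplicate_of."""
    pairs = set()
    for row in rows:
        base = row.get("is_duplicate_of")
        if base:  # empty for base records
            pairs.add(frozenset((base, row["record_id"])))
    return pairs

def f1_score(predicted, gold):
    """Pairwise precision/recall/F1 over unordered id pairs."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

rows = [
    {"record_id": "r1", "is_duplicate_of": ""},
    {"record_id": "r2", "is_duplicate_of": "r1", "duplication_rule": "5"},  # nickname rule
    {"record_id": "r3", "is_duplicate_of": ""},
]
gold = gold_pairs(rows)
pred = {frozenset(("r1", "r2")), frozenset(("r1", "r3"))}  # one true, one false positive
print(f1_score(pred, gold))
```

    The same pair set can be derived from cluster_id instead (all pairs within a cluster), which is what the v2 column is intended for.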

  2. US B2B Phone Number Data | 148MM Phone Numbers, Verified Data

    • datarade.ai
    Updated Feb 20, 2024
    Cite
    Salutary Data (2024). US B2B Phone Number Data | 148MM Phone Numbers, Verified Data [Dataset]. https://datarade.ai/data-products/salutary-data-b2b-data-phone-number-data-mobile-phone-72-salutary-data
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset updated
    Feb 20, 2024
    Dataset authored and provided by
    Salutary Data
    Area covered
    United States of America
    Description

    Discover the ultimate resource for your B2B needs with our meticulously curated dataset, featuring 148MM+ highly relevant US B2B Contact Data records and associated company information.

    Very high fill rates for Phone Number, including for Mobile Phone!

    This encompasses a diverse range of fields, including Contact Name (First & Last), Work Address, Work Email, Personal Email, Mobile Phone, Direct-Dial Work Phone, Job Title, Job Function, Job Level, LinkedIn URL, Company Name, Domain, Email Domain, HQ Address, Employee Size, Revenue Size, Industry, NAICS and SIC Codes + Descriptions, ensuring you have the most detailed insights for your business endeavors.

    Key Features:

    Extensive Data Coverage: Access a vast pool of B2B Contact Data records, providing valuable information on where the contacts work now, empowering your sales, marketing, recruiting, and research efforts.

    Versatile Applications: Leverage this robust dataset for Sales Prospecting, Lead Generation, Marketing Campaigns, Recruiting initiatives, Identity Resolution, Analytics, Research, and more.

    Phone Number Data Inclusion: Benefit from our comprehensive Phone Number Data, ensuring you have direct and effective communication channels. Explore our Phone Number Datasets and Phone Number Databases for an even more enriched experience.

    Flexible Pricing Models: Tailor your investment to match your unique business needs, data use-cases, and specific requirements. Choose from targeted lists, CSV enrichment, or licensing our entire database or subsets to seamlessly integrate this data into your products, platform, or service offerings.

    Strategic Utilization of B2B Intelligence:

    Sales Prospecting: Identify and engage with the right decision-makers to drive your sales initiatives.

    Lead Generation: Generate high-quality leads with precise targeting based on specific criteria.

    Marketing Campaigns: Amplify your marketing strategies by reaching the right audience with targeted campaigns.

    Recruiting: Streamline your recruitment efforts by connecting with qualified candidates.

    Identity Resolution: Enhance your data quality and accuracy by resolving identities with our reliable dataset.

    Analytics and Research: Fuel your analytics and research endeavors with comprehensive and up-to-date B2B insights.

    Access Your Tailored B2B Data Solution:

    Reach out to us today to explore flexible pricing options and discover how Salutary Data Company Data, B2B Contact Data, B2B Marketing Data, B2B Email Data, Phone Number Data, Phone Number Datasets, and Phone Number Databases can transform your business strategies. Elevate your decision-making with top-notch B2B intelligence.

  3. Entity Resolution Graph for Investigations Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 7, 2025
    Cite
    Growth Market Reports (2025). Entity Resolution Graph for Investigations Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/entity-resolution-graph-for-investigations-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Oct 7, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Entity Resolution Graph for Investigations Market Outlook



    According to our latest research, the global Entity Resolution Graph for Investigations market size stood at USD 2.41 billion in 2024, underlining the sector’s robust presence in the global analytics and investigation ecosystem. The market is anticipated to expand at a compound annual growth rate (CAGR) of 18.2% from 2025 to 2033, reaching a forecasted size of USD 12.26 billion by 2033. This remarkable growth trajectory is primarily driven by the rising need for advanced data analytics, the proliferation of digital fraud, and increasing regulatory scrutiny across industries. As organizations face mounting pressure to manage complex data relationships and uncover hidden connections, the Entity Resolution Graph for Investigations market is poised for significant expansion over the coming decade.




    One of the principal growth factors for the Entity Resolution Graph for Investigations market is the escalating volume and complexity of data generated by modern enterprises. As businesses digitize their operations, the data landscape has become fragmented, making it difficult to establish clear relationships between entities such as individuals, organizations, and transactions. Entity resolution graph solutions offer a sophisticated approach to integrating disparate datasets, enabling investigators to identify patterns, detect anomalies, and uncover hidden relationships. This capability is increasingly vital for sectors such as BFSI, government, and healthcare, where the accuracy of entity identification directly impacts risk management, compliance, and investigative outcomes. The integration of artificial intelligence and machine learning algorithms into these solutions further enhances their ability to deliver real-time insights, driving adoption across industries.




    Another significant driver is the surge in regulatory requirements and compliance mandates globally. Financial institutions, healthcare providers, and government agencies are under unprecedented pressure to comply with anti-money laundering (AML), know your customer (KYC), and data privacy regulations. Entity resolution graph technology enables these organizations to efficiently reconcile and validate data from multiple sources, ensuring compliance while minimizing manual intervention. The technology’s ability to provide a unified view of entities across vast datasets is critical for timely and accurate reporting, audit readiness, and risk mitigation. As regulatory frameworks continue to evolve and become more stringent, demand for robust entity resolution solutions is expected to intensify, further propelling market growth.




    The rise of sophisticated fraud schemes and cyber threats is also fueling demand for entity resolution graph solutions. Fraud detection and risk management applications rely heavily on the ability to correlate seemingly unrelated data points to uncover fraudulent activities. Entity resolution graphs empower organizations to visualize and analyze complex networks of relationships, making it easier to detect fraud rings, insider threats, and other malicious activities. The growing adoption of digital channels in banking, retail, and other sectors has expanded the attack surface for fraudsters, necessitating advanced investigative tools. As organizations invest in strengthening their security postures, the adoption of entity resolution graph technology is set to accelerate, underpinning the market’s sustained growth.




    From a regional perspective, North America currently dominates the Entity Resolution Graph for Investigations market, driven by the early adoption of advanced analytics, a strong regulatory environment, and significant investments in digital transformation. However, Asia Pacific is emerging as a high-growth region, fueled by rapid digitization, increasing awareness of data-driven investigations, and expanding regulatory frameworks. Europe also represents a substantial share of the market, with stringent data protection laws and a mature financial services sector contributing to steady demand. As organizations across these regions continue to grapple with complex data challenges and evolving threats, the adoption of entity resolution graph solutions is expected to rise, supporting robust market growth globally.



  4. ORBITAAL: cOmpRehensive BItcoin daTaset for temporAl grAph anaLysis - Dataset...

    • cryptodata.center
    Updated Dec 4, 2024
    + more versions
    Cite
    (2024). ORBITAAL: cOmpRehensive BItcoin daTaset for temporAl grAph anaLysis - Dataset - CryptoData Hub [Dataset]. https://cryptodata.center/dataset/orbitaal-comprehensive-bitcoin-dataset-for-temoral-graph-analysis
    Explore at:
    Dataset updated
    Dec 4, 2024
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Construction

    This dataset captures the temporal network of Bitcoin (BTC) flows exchanged between entities at the finest time resolution (UNIX timestamps). It is built from the blockchain covering the period from January 3rd, 2009 to January 25th, 2021. The blockchain extraction was performed with the bitcoin-etl Python package (https://github.com/blockchain-etl/bitcoin-etl). The entity-entity network is built by aggregating Bitcoin addresses using the common-input heuristic [1] as well as popular Bitcoin users' addresses provided by https://www.walletexplorer.com/

    [1] M. Harrigan and C. Fretter, "The Unreasonable Effectiveness of Address Clustering," 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), Toulouse, France, 2016, pp. 368-373, doi: 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071.

    Dataset Description

    Temporal coverage: from 03 January 2009 to 25 January 2021.

    Overview: This dataset provides a comprehensive representation of Bitcoin exchanges between entities over a significant temporal span, from the inception of Bitcoin to recent years. It offers various temporal resolutions and representations to facilitate Bitcoin transaction network analysis in the context of temporal graphs. All dates are derived from block UNIX timestamps in the GMT timezone.

    Contents: The dataset is distributed across several compressed archives. All data are stored in the Apache Parquet file format, a columnar storage format optimized for analytical queries that can be read with, e.g., the pyspark Python package.

    - orbitaal-stream_graph.tar.gz (root directory STREAM_GRAPH/): a stream graph representation of Bitcoin exchanges at the finest temporal scale, corresponding to the validation time of each block (roughly 10 minutes on average). The stream graph is divided into 13 Parquet files, one per year, named orbitaal-stream_graph-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] is the year and [ID] is an integer from 1 to N such that sorting by increasing [ID] sorts by increasing year. Located in STREAM_GRAPH/EDGES/.
    - orbitaal-snapshot-all.tar.gz (root directory SNAPSHOT/): the snapshot network aggregating all transactions over the whole period (Jan. 2009 to Jan. 2021), in a Parquet file named orbitaal-snapshot-all.snappy.parquet. Located in SNAPSHOT/EDGES/ALL/.
    - orbitaal-snapshot-year.tar.gz (root directory SNAPSHOT/): yearly snapshot networks, in Parquet files named orbitaal-snapshot-date-[YYYY]-file-id-[ID].snappy.parquet, with [YYYY] and [ID] as above. Located in SNAPSHOT/EDGES/year/.
    - orbitaal-snapshot-month.tar.gz (root directory SNAPSHOT/): monthly snapshot networks, named orbitaal-snapshot-date-[YYYY]-[MM]-file-id-[ID].snappy.parquet, where [YYYY] and [MM] are the year and month. Located in SNAPSHOT/EDGES/month/.
    - orbitaal-snapshot-day.tar.gz (root directory SNAPSHOT/): daily snapshot networks, named orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-file-id-[ID].snappy.parquet, where [YYYY], [MM], and [DD] are the year, month, and day. Located in SNAPSHOT/EDGES/day/.
    - orbitaal-snapshot-hour.tar.gz (root directory SNAPSHOT/): hourly snapshot networks, named orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-[hh]-file-id-[ID].snappy.parquet, where [hh] is the hour. Located in SNAPSHOT/EDGES/hour/.
    - orbitaal-nodetable.tar.gz (root directory NODE_TABLE/): two Parquet files, one with node-level information for the stream graphs and snapshots (period of activity, associated global Bitcoin balance), the other listing all associated Bitcoin addresses.
    - Small samples in CSV format: orbitaal-stream_graph-2016_07_08.csv and orbitaal-stream_graph-2016_07_09.csv, stream graph representations around a halving event that happened in 2016.
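    The file-naming convention above is regular enough to parse programmatically. A minimal sketch using a Python regex (illustrative only, not an official ORBITAAL tool):

```python
import re

# Parse ORBITAAL snapshot file names into their date parts and file id,
# following the naming convention quoted in the dataset description.
NAME_RE = re.compile(
    r"orbitaal-snapshot-date-"
    r"(?P<year>\d{4})(?:-(?P<month>\d{2}))?(?:-(?P<day>\d{2}))?(?:-(?P<hour>\d{2}))?"
    r"-file-id-(?P<id>\d+)\.snappy\.parquet$"
)

def parse_snapshot_name(name):
    """Return the date components and file id encoded in a snapshot file name."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError(f"not a snapshot file name: {name}")
    parts = {k: v for k, v in m.groupdict().items() if v is not None}
    parts["id"] = int(parts["id"])
    return parts

print(parse_snapshot_name("orbitaal-snapshot-date-2016-07-09-file-id-1.snappy.parquet"))
```

    Sorting files by the parsed date parts and then by id reproduces the chronological ordering the description guarantees.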

  5. Email Address Data| Email Database | US Consumers | 650 million Consumer...

    • datarade.ai
    .csv, .txt
    Cite
    Stirista, Email Address Data| Email Database | US Consumers | 650 million Consumer Email Addresses [Dataset]. https://datarade.ai/data-products/email-address-data-email-database-us-consumers-564-milli-stirista
    Explore at:
    Available download formats: .csv, .txt
    Dataset authored and provided by
    Stirista
    Area covered
    United States of America
    Description

    Andrews Wharton's Actionable US Consumer Email Database hosts over 650 million email addresses that have been active within the last 36 months. This database is fully CAN-SPAM compliant and 100% opted-in for third-party use.

    This email address database connects you with your customers and/or prospects at their most recent deliverable online address, increasing impression rates, deliverability, and engagement in your digital campaigns.

    The Email Address Data is 100% populated with email address, HEMs (MD5, SHA-1, SHA-256), first name, last name, postal address (primary and secondary), IP address, and timestamps for last registration, verification, and first seen. An enhanced version of the database is available with date of birth (where available), phone (mobile and landline), and MAID-to-hashed-email conversion.
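    Hashed emails (HEMs) like those listed are typically computed over a normalized address. A sketch with Python's hashlib, assuming the common strip-and-lowercase normalization (the dataset's exact normalization rule is not documented here):

```python
import hashlib

def hashed_emails(email):
    """Compute MD5 / SHA-1 / SHA-256 HEMs of a normalized email address.

    Normalization here (strip + lowercase) is a common industry convention,
    not something specified by this dataset's documentation.
    """
    norm = email.strip().lower().encode("utf-8")
    return {
        "md5": hashlib.md5(norm).hexdigest(),
        "sha1": hashlib.sha1(norm).hexdigest(),
        "sha256": hashlib.sha256(norm).hexdigest(),
    }

hems = hashed_emails("  Jane.Doe@Example.com ")
print(hems["md5"])
```

    Consistent normalization matters: two records for the same address only join on their HEMs if both sides hashed the same normalized string.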

    The Andrews Wharton Actionable US Consumer Email Database is updated monthly. A complete replacement database or new adds are available as update files.

    Contact us at successdelivered@andrewswharton.com or visit us at www.andrewswharton.com to learn more about this dataset.

  6. A circa 2010 global land cover reference dataset from commercial high...

    • data.usgs.gov
    • s.cnmilf.com
    • +1more
    + more versions
    Cite
    Bruce Pengra; Jordan Long; Devendra Dahal; Steve Stehman; Thomas Loveland, A circa 2010 global land cover reference dataset from commercial high resolution satellite data [Dataset]. http://doi.org/10.5066/P96FKANW
    Explore at:
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Bruce Pengra; Jordan Long; Devendra Dahal; Steve Stehman; Thomas Loveland
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Time period covered
    May 19, 2002 - May 29, 2014
    Description

    The data are 475 thematic land cover rasters at 2 m resolution. Pixels were classified into the land cover classes Tree (1), Water (2), Barren (3), Other Vegetation (4), and Ice & Snow (8). Cloud cover and shadow were sometimes coded as Cloud (5) and Shadow (6); for any land cover application these would be considered NoData, and some rasters may have Cloud and Shadow pixels coded or recoded to NoData already. Commercial high-resolution satellite data were used to create the classifications. Usable image data for the target year (2010) was acquired for 475 of the 500 primary sample locations, with 90% of images acquired within ±2 years of the 2010 target; the remaining 25 sample blocks had no usable data and could not be mapped. Tabular data is included with the raster classifications indicating the specific high-resolution sensor and date of acquisition for the source imagery, as well as the stratum to which each sample block belonged. Methods for this classifi ...
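    Recoding Cloud (5) and Shadow (6) to NoData, as the description suggests, can be sketched as follows (plain Python on a toy grid for illustration; a real workflow would use a raster library, and the NoData value 255 is an assumed sentinel, not one specified by the dataset):

```python
# Class codes from the dataset description:
# 1 Tree, 2 Water, 3 Barren, 4 Other Vegetation, 8 Ice & Snow,
# 5 Cloud, 6 Shadow (treated as NoData for land cover work).
NODATA = 255  # assumed sentinel value outside the class range

def recode_clouds(grid, nodata=NODATA):
    """Return a copy of a 2-D class grid with Cloud (5) and Shadow (6) as NoData."""
    return [[nodata if v in (5, 6) else v for v in row] for row in grid]

tile = [
    [1, 1, 5],
    [2, 6, 4],
]
print(recode_clouds(tile))
```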

  7. Effective Crowdsourcing of Multiple Tasks for Comprehensive Information...

    • figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Sangha Nam (2023). Effective Crowdsourcing of Multiple Tasks for Comprehensive Information Extraction [Dataset]. http://doi.org/10.6084/m9.figshare.7935185.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sangha Nam
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This dataset aims to propose a Korean information extraction standard and promote research in this field by presenting crowdsourced data collected for four information extraction tasks from the same corpus, along with training and evaluation results for each task using a state-of-the-art model. These machine learning data for Korean information extraction are the first of their kind, and there are plans to continuously increase the data volume. The test results will serve as a standard for each Korean information extraction task and as a comparison target for future studies on Korean information extraction using this data. The dataset is available for research purposes.

    Description

    - There are two crowdsourcing .zip files: wiki-10000-part1&2.zip. Each file contains:
      1) task1-1: entity detection
      2) task1-2: entity linking
      3) task2: coreference resolution
      4) task4: relation extraction
    - For the entity linking model (https://github.com/machinereading/eld-2018), pre-trained embedding files are provided in el-korean.tar.gz
    - For the coreference resolution model (https://github.com/machinereading/CR), pre-trained embedding files are provided in cr-korean.tar.gz
    - For one relation extraction model (https://github.com/machinereading/re-gan), a corpus, dataset, and pre-trained embedding files are provided in ko-gan-data.zip
    - For another relation extraction model (https://github.com/machinereading/re-re-RL-Crowd), pre-trained embedding files are provided in rerl-korean.tar.gz

    How to use

    All crowdsourcing files are in JSON format. Detailed examples and usage are available at https://github.com/machinereading/okbqa-7-task4

  8. polyglot_ner

    • huggingface.co
    • opendatalab.com
    Updated May 17, 2024
    Cite
    Rami Al-Rfou (2024). polyglot_ner [Dataset]. https://huggingface.co/datasets/rmyeid/polyglot_ner
    Explore at:
    Dataset updated
    May 17, 2024
    Authors
    Rami Al-Rfou
    License

    Unknown (https://choosealicense.com/licenses/unknown/)

    Description

    Polyglot-NER: a training dataset automatically generated from Wikipedia and Freebase for the task of named entity recognition. The dataset contains the basic Wikipedia-based training data for the 40 languages we have (with coreference resolution). The procedure for generating the data is outlined in Section 3 of the paper (https://arxiv.org/abs/1410.3791). Each config contains the data for a different language; for example, "es" includes only Spanish examples.
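    Since each example pairs a words list with a parallel ner label list, entity phrases can be recovered by grouping contiguous identical non-"O" labels. A minimal sketch (Polyglot-NER uses flat PER/ORG/LOC tags rather than BIO prefixes, per the paper; the sample record below is illustrative, not from the dataset):

```python
def entity_spans(words, labels):
    """Group contiguous identical non-'O' labels into (label, phrase) spans.

    Assumes a flat tag scheme (PER/ORG/LOC with no B-/I- prefixes),
    so adjacent same-label tokens are merged into one span.
    """
    spans, current, tokens = [], None, []
    for word, label in zip(words, labels):
        if label == "O" or label != current:
            if current is not None and tokens:
                spans.append((current, " ".join(tokens)))
            current, tokens = (None, []) if label == "O" else (label, [word])
        else:
            tokens.append(word)
    if current is not None and tokens:
        spans.append((current, " ".join(tokens)))
    return spans

# Illustrative record shaped like a Polyglot-NER example.
words = ["Barack", "Obama", "visited", "Madrid", "."]
labels = ["PER", "PER", "O", "LOC", "O"]
print(entity_spans(words, labels))
```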

  9. Data from: Slovene coreference resolution corpus coref149

    • live.european-language-grid.eu
    binary format
    Updated Mar 18, 2018
    Cite
    (2018). Slovene coreference resolution corpus coref149 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8274
    Explore at:
    Available download formats: binary format
    Dataset updated
    Mar 18, 2018
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This corpus contains a subset of the ssj500k v1.4 corpus, http://hdl.handle.net/11356/1052. Each of 149 documents contains a paragraph from ssj500k that contains at least 100 words and at least 6 named entities. The data is in TCF format, exported from the WebAnno tool, https://webanno.github.io/webanno/.

    The annotated entities are of type person, organization, or location. Mentions are annotated as coreference chains without additional classification of coreference types. Annotations also include implicit mentions that are specific to the Slovene language; in this case, a verb is tagged. The corpus consists of 1277 entities, 2329 mentions, 831 singleton entities, 40 appositions, and 215 overlapping mentions. We also annotated overlapping mentions of the same entity: for example, in the text [strokovnega direktorja KC [Zorana Arneža]] we annotate two overlapping mentions that refer to the same entity. There are 97 such mentions in the corpus.
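    The chain-level statistics quoted above (entities, mentions, singletons) can be computed from any coreference-chain representation. A minimal sketch on toy data (the (start, end) span encoding is hypothetical, not the corpus's TCF layout):

```python
def chain_stats(chains):
    """Summarize coreference chains, where each chain is a list of mention spans."""
    mentions = sum(len(chain) for chain in chains)
    singletons = sum(1 for chain in chains if len(chain) == 1)
    return {"entities": len(chains), "mentions": mentions, "singletons": singletons}

# Toy chains over (start, end) token offsets -- illustrative, not coref149 data.
chains = [
    [(0, 1), (5, 5)],   # one entity with two mentions
    [(8, 9)],           # a singleton entity
]
print(chain_stats(chains))
```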

    In the public source code repository https://bitbucket.org/szitnik/nutie-core class TEIP5Importer contains an additional function to read the dataset and merge it together with the ssj500k dataset.

  10. 1datapipe | Identity & Lifestyle Data | Indonesia | 208M Dataset | Complete...

    • datarade.ai
    .csv
    + more versions
    Cite
    1datapipe, 1datapipe | Identity & Lifestyle Data | Indonesia | 208M Dataset | Complete Insights [Dataset]. https://datarade.ai/data-products/identity-lifestyle-data-indonesia-187m-dataset-comple-1datapipe
    Explore at:
    Available download formats: .csv
    Dataset authored and provided by
    1datapipe
    Area covered
    Indonesia
    Description

    Living Identity™ is 1datapipe's flagship dataset: an exclusive, high-integrity identity graph providing 1.35B+ real-world verified profiles across 18 of the world's most dynamic and data-scarce emerging markets. With 90–95% adult population coverage per country, it is the deepest identity dataset available outside traditional credit bureaus, and it is fully compliant, privacy-first, and legally licensed for enterprise use. Each identity is tied to core national attributes (government-issued ID numbers, full names, addresses, phone numbers, emails, and date of birth) and cross-verified against telecom, financial, and commercial records. With continuous updates and strict normalization processes, Living Identity delivers AI-ready, structured data that powers decisioning where traditional data falls short. This dataset helps organizations confidently verify, onboard, and model users across regions where fraud risk is high and legacy data is fragmented or unavailable. It offers a single source of truth for resolving identities at scale, unlocking new revenue, reducing regulatory exposure, and enabling inclusive growth.

    DESIGNED FOR:

    • Banks, Fintechs & Credit Bureaus: enable real-time onboarding, digital KYC, thin-file scoring, and cross-border credit modeling with verified identity data in hard-to-penetrate markets.

    • Fraud & Identity Verification Platforms: detect synthetic identities, verify identity claims, and prevent account takeovers with population-scale data tied to official and telecom-based sources.

    • Risk & Compliance Teams: automate regulatory KYC/AML compliance across jurisdictions with datasets built to align with LGPD, PDPA, GDPR, and country-specific standards.

    • AI & Machine Learning Labs: train fraud, credit, and segmentation models using ground-truth data with verified input variables, improving performance, reducing bias, and boosting explainability.

    • Digital Ecosystems & Superapps: power seamless identity resolution for users across banking, e-commerce, remittances, and payments, enabling inclusive onboarding at scale in low-data environments.

    OPTIMIZED FOR:

    • Real-time digital onboarding with verified, high-coverage identity data

    • KYC/AML automation aligned with LGPD, PDPA, GDPR, and regional frameworks

    • Cross-border credit risk modeling and thin-file scoring in underserved markets

    • Synthetic fraud detection and account takeover prevention using telecom-verified identity resolution

    • AI training datasets for segmentation, risk scoring, and fraud analytics

    • Inclusive identity verification for superapps, payments, and remittance ecosystems

    Living Identity™ transforms identity from a barrier into an enabler, delivering trust, precision, and regulatory-grade intelligence to the organizations shaping the future of digital economies.

  11. WinoBias Coreference Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). WinoBias Coreference Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/winobias-coreference-dataset/discussion
    Explore at:
    Available download formats: zip (152896 bytes)
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    Public Domain (CC0 1.0), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    WinoBias Coreference Dataset

    Gender-biased coreference dataset focused on occupation stereotypes in WinoBias

    By wino_bias (From Huggingface) [source]

    About this dataset

    The WinoBias dataset is a comprehensive and valuable resource designed specifically for coreference resolution, with a special emphasis on addressing gender bias. The dataset is centered around Winograd-schema style sentences where various entities are referred to by their respective occupations, such as the nurse, the doctor, or the carpenter.

    The primary objective of this groundbreaking dataset is to facilitate the accurate and effective resolution of coreference in these sentences, particularly when it comes to gender-related biases. By examining the relationships between words and their referents in context, coreference resolution models have the opportunity to uncover and address instances where gender stereotypes might be perpetuated.

    Each entry in the dataset includes multiple attributes that enhance its usefulness and versatility. These attributes encompass crucial linguistic elements such as part-of-speech tags, parse bits (syntactic structure annotations), word senses, speaker information, named entity recognition tags (identifying entities like persons or locations), verbal predicates, lemma forms of predicates (verb base forms), and coreference clusters.

    With its diverse range of occupation-related sentences containing subtle gender biases, the WinoBias dataset provides an invaluable resource for researchers, developers, and evaluators working on improving coreference resolution systems. By evaluating model performance using this data, stakeholders can gain insights into potential areas of bias within their algorithms while striving towards more equitable language processing technologies.

    In summary, the WinoBias dataset represents a vital contribution to addressing gender bias in natural language processing tasks by focusing specifically on coreference resolution. Its rich collection of meticulously annotated sentences offers an opportunity for developing more robust models capable of mitigating biased assumptions related to occupations based on gender stereotypes.

    How to use the dataset

    Overview

    The dataset consists of Winograd-schema style sentences where entities are referred to by their occupation, such as the nurse, the doctor, or the carpenter. The main goal is to resolve the coreference within these sentences.

    File Description

    The dataset includes several CSV files with different purposes:

    • type2_anti_validation.csv: This file contains validation data for evaluating the performance of coreference resolution models on gender-biased sentences in the WinoBias dataset related to occupations.

    • type2_pro_test.csv: A test data file that evaluates the performance of coreference resolution models specifically on gender-biased sentences related to occupations.

    • type1_pro_validation.csv: Here you will find validation data for evaluating the performance of a coreference resolution model on gender bias in occupations within the WinoBias dataset.

    Each CSV file contains multiple columns representing different features and information about each sentence, such as part number, word number, tokens (words), part-of-speech tags (POS tags), parse bit for each token, predicate lemma (verb lemma), word sense, speaker information, named entity recognition tags (NER tags), verbal predicates used in a sentence, and coreference clusters.

    It is important to note that some columns may be repeated multiple times across different files with shared information. For example, part_number may appear more than once but represents different parts or sections within a sentence.

    Instructions

    To utilize this dataset effectively:

    • Import one or more relevant CSV files into your preferred programming environment or tool that supports handling tabular data (e.g., Python pandas).

    • Explore the columns and understand their meanings by referring to the column descriptions provided in this guide.

    • Analyze the data and perform necessary pre-processing steps based on your specific research or analysis goals. You can consider tasks such as gender bias detection, coreference resolution model development, or evaluation of existing models.

    • Choose appropriate features/columns for your task and utilize them accordingly.

    • Leverage the insights from this dataset to gain a better understanding of gender biases present in coreference resolution and find ways to mitigate such biases.

    Remember that proper data cleaning, preparation, and feature engineering are crucial s...
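    The import step above can be sketched with only the Python standard library. Note that the column names and the single sample row below are hypothetical stand-ins; check the header of the real CSV files for the exact schema.

```python
import csv
import io

# Hypothetical miniature of a WinoBias CSV; the real files may use
# different column names and contain many more annotation columns.
sample = io.StringIO(
    "tokens,pos_tags,coreference_clusters\n"
    '"The nurse greeted the doctor because she was friendly .",'
    '"DT NN VBD DT NN IN PRP VBD JJ .","(0)|(1)"\n'
)

rows = list(csv.DictReader(sample))
for row in rows:
    # Space-separated annotation columns align token-by-token.
    tokens = row["tokens"].split()
    pos_tags = row["pos_tags"].split()
    assert len(tokens) == len(pos_tags)
```

For the real dataset, replace the `io.StringIO` sample with `open("type1_pro_validation.csv")` (or any of the other files listed above).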

  12. Database on Ideology, Money in Politics, and Elections (DIME)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Bonica, Adam (2023). Database on Ideology, Money in Politics, and Elections (DIME) [Dataset]. http://doi.org/10.7910/DVN/O5PX0B
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Bonica, Adam
    Time period covered
    Jan 1, 1979 - Jan 1, 2014
    Description

    Abstract: The Database on Ideology, Money in Politics, and Elections (DIME) is intended as a general resource for the study of campaign finance and ideology in American politics. The database was developed as part of the project on Ideology in the Political Marketplace, which is an ongoing effort to perform a comprehensive ideological mapping of political elites, interest groups, and donors using the common-space CFscore scaling methodology (Bonica 2014). Constructing the database required a large-scale effort to compile, clean, and process data on contribution records, candidate characteristics, and election outcomes from various sources. The resulting database contains over 130 million political contributions made by individuals and organizations to local, state, and federal elections spanning a period from 1979 to 2014. A corresponding database of candidates and committees provides additional information on state and federal elections. The DIME+ data repository on congressional activity extends DIME to cover detailed data on legislative voting, lawmaking, and political rhetoric. (See http://dx.doi.org/10.7910/DVN/BO7WOW for details.) The DIME data is available for download as a standalone SQLite database. The SQLite database is stored on disk and can be accessed using a SQLite client or queried directly from R using the RSQLite package. SQLite is particularly well-suited for tasks that require searching through the database for specific individuals or contribution records.

    Overview: The database is intended to make data on campaign finance and elections (1) more centralized and accessible, (2) easier to work with, and (3) more versatile in terms of the types of questions that can be addressed. The main value-added features of the database are listed below:

    • Data processing: Names, addresses, and occupation and employer titles have been cleaned and standardized.

    • Unique identifiers: Entity resolution techniques were used to assign unique identifiers to all individual and institutional donors included in the database. The contributor IDs make it possible to track giving by individuals across election cycles and levels of government.

    • Geocoding: Each record has been geocoded and placed into a congressional district. The geocoding scheme relies on the contributor IDs to assign a complete set of consistent geo-coordinates to donors that report their full address in some records but not in others; this is accomplished by combining information on self-reported addresses across records. The scheme also accounts for donors with multiple addresses. Geocoding was performed using the Data Science Toolkit maintained by Pete Warden and hosted at http://www.datasciencetoolkit.org/. Shape files for congressional districts are from Census.gov (http://www.census.gov/rdo/data).

    • Ideological measures: The common-space CFscores allow for direct distance comparisons of the ideal points of a wide range of political actors from state and federal politics spanning a 35-year period. In total, the database includes ideal point estimates for 70,871 candidates and 12,271 political committees as recipients and 14.7 million individuals and 1.7 million organizations as donors.

    • Corresponding data on candidates, committees, and elections: The recipient database includes information on voting records, fundraising statistics, election outcomes, gender, and other candidate characteristics. All candidates are assigned unique identifiers that make it possible to track candidates who campaign for different offices. The recipient IDs can also be used to match against the database of contribution records. The database also includes entries for PACs, super PACs, party committees, leadership PACs, 527s, state ballot campaigns, and other committees that engage in fundraising activities.

    • Identifying sets of important political actors: Contribution records have been matched onto other publicly available databases of important political actors, including Fortune 500 directors and CEOs, federal court judges, state supreme court justices, executive appointees to federal agencies, and medical professionals.
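    The SQLite workflow described above can be sketched from Python as well (the description mentions SQLite clients and R's RSQLite; `sqlite3` is the Python equivalent). The miniature `contribDB` table and its columns below are hypothetical stand-ins; consult the actual DIME schema before querying the real file.

```python
import sqlite3

# In-memory stand-in; point this at the downloaded DIME file instead.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE contribDB "
    "(contributor_id TEXT, contributor_name TEXT, amount REAL, cycle INTEGER)"
)
con.executemany(
    "INSERT INTO contribDB VALUES (?, ?, ?, ?)",
    [
        ("p1", "doe, jane", 250.0, 2012),
        ("p1", "doe, jane", 500.0, 2014),
        ("p2", "roe, richard", 100.0, 2014),
    ],
)

# The contributor IDs let you track one donor's giving across election cycles.
rows = con.execute(
    "SELECT cycle, SUM(amount) FROM contribDB WHERE contributor_id = ? "
    "GROUP BY cycle ORDER BY cycle",
    ("p1",),
).fetchall()
print(rows)  # [(2012, 250.0), (2014, 500.0)]
```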

  13. ORBITAAL: cOmpRehensive BItcoin daTaset for temporAl grAph anaLysis

    • nde-dev.biothings.io
    • data-staging.niaid.nih.gov
    • +1more
    Updated Nov 27, 2024
    Cite
    Cazabet, Remy (2024). ORBITAAL: cOmpRehensive BItcoin daTaset for temporAl grAph anaLysis [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10844224
    Explore at:
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    Coquidé, Célestin
    Cazabet, Remy
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Construction

    This dataset captures the temporal network of Bitcoin (BTC) flows exchanged between entities at the finest time resolution (UNIX timestamps). Its construction is based on the blockchain, covering the period from January 3, 2009 to January 25, 2021. The blockchain extraction was done with the bitcoin-etl (https://github.com/blockchain-etl/bitcoin-etl) Python package. The entity-entity network is built by aggregating Bitcoin addresses using the common-input heuristic [1], as well as popular Bitcoin users' addresses provided by https://www.walletexplorer.com/.

    [1] M. Harrigan and C. Fretter, "The Unreasonable Effectiveness of Address Clustering," 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), Toulouse, France, 2016, pp. 368-373, doi: 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071.

    Dataset Description

    Bitcoin Activity Temporal Coverage: From 03 January 2009 to 25 January 2021

    Overview:

    This dataset provides a comprehensive representation of Bitcoin exchanges between entities over a significant temporal span, spanning from the inception of Bitcoin to recent years. It encompasses various temporal resolutions and representations to facilitate Bitcoin transaction network analysis in the context of temporal graphs.

    All dates were derived from block UNIX timestamps, in the GMT timezone.

    Contents:

    The dataset is distributed across several compressed archives:

    All data are stored in the Apache Parquet file format, a columnar storage format optimized for analytical queries. The files can be read with, for example, the pyspark Python package.

    orbitaal-stream_graph.tar.gz:

    The root directory is STREAM_GRAPH/

    Contains a stream graph representation of Bitcoin exchanges at the finest temporal scale, corresponding to the validation time of each block (averaging approximately 10 minutes).

    The stream graph is divided into 13 files, one for each year

    File format is parquet

    Name format is orbitaal-stream_graph-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] is the corresponding year and [ID] is an integer from 1 to N (the number of files) such that sorting by increasing [ID] is equivalent to sorting by increasing year

    These files are in the subdirectory STREAM_GRAPH/EDGES/

    orbitaal-snapshot-all.tar.gz:

    The root directory is SNAPSHOT/

    Contains the snapshot network representing all transactions aggregated over the whole dataset period (from Jan. 2009 to Jan. 2021).

    File format is parquet

    Name format is orbitaal-snapshot-all.snappy.parquet.

    These files are in the subdirectory SNAPSHOT/EDGES/ALL/

    orbitaal-snapshot-year.tar.gz:

    The root directory is SNAPSHOT/

    Contains the yearly resolution of snapshot networks

    File format is parquet

    Name format is orbitaal-snapshot-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] is the corresponding year and [ID] is an integer from 1 to N (the number of files) such that sorting by increasing [ID] is equivalent to sorting by increasing year

    These files are in the subdirectory SNAPSHOT/EDGES/year/

    orbitaal-snapshot-month.tar.gz:

    The root directory is SNAPSHOT/

    Contains the snapshot networks at monthly resolution

    File format is parquet

    Name format is orbitaal-snapshot-date-[YYYY]-[MM]-file-id-[ID].snappy.parquet, where [YYYY] and [MM] stand for the corresponding year and month, and [ID] is an integer from 1 to N (the number of files) such that sorting by increasing [ID] is equivalent to sorting by increasing year and month

    These files are in the subdirectory SNAPSHOT/EDGES/month/

    orbitaal-snapshot-day.tar.gz:

    The root directory is SNAPSHOT/

    Contains the snapshot networks at daily resolution

    File format is parquet

    Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-file-id-[ID].snappy.parquet, where [YYYY], [MM], and [DD] stand for the corresponding year, month, and day, and [ID] is an integer from 1 to N (the number of files) such that sorting by increasing [ID] is equivalent to sorting by increasing year, month, and day

    These files are in the subdirectory SNAPSHOT/EDGES/day/

    orbitaal-snapshot-hour.tar.gz:

    The root directory is SNAPSHOT/

    Contains the snapshot networks at hourly resolution

    File format is parquet

    Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-[hh]-file-id-[ID].snappy.parquet, where [YYYY], [MM], [DD], and [hh] stand for the corresponding year, month, day, and hour, and [ID] is an integer from 1 to N (the number of files) such that sorting by increasing [ID] is equivalent to sorting by increasing year, month, day, and hour

    These files are in the subdirectory SNAPSHOT/EDGES/hour/
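    The file-name convention above can be decoded programmatically. A minimal sketch, assuming names follow the orbitaal-snapshot-date-[YYYY](-[MM](-[DD](-[hh])))-file-id-[ID].snappy.parquet patterns quoted in this description (the `sort_key` helper is illustrative, not part of the dataset):

```python
import re

# Optional month/day/hour groups cover the yearly, monthly, daily,
# and hourly snapshot naming variants described above.
PATTERN = re.compile(
    r"orbitaal-snapshot-date-(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?(?:-(\d{2}))?"
    r"-file-id-(\d+)\.snappy\.parquet"
)

def sort_key(name):
    """Map a file name to a (year, month, day, hour, id) tuple; absent parts become 0."""
    m = PATTERN.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    return tuple(int(g) if g else 0 for g in m.groups())

files = [
    "orbitaal-snapshot-date-2016-07-09-file-id-2.snappy.parquet",
    "orbitaal-snapshot-date-2016-07-08-file-id-1.snappy.parquet",
]
print(sorted(files, key=sort_key))
```

Sorting by this key orders the files chronologically regardless of which temporal resolution a given archive uses.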

    orbitaal-nodetable.tar.gz:

    The root directory is NODE_TABLE/

    Contains two files in parquet format, the first one gives information related to nodes present in stream graphs and snapshots such as period of activity and associated global Bitcoin balance, and the other one contains the list of all associated Bitcoin addresses.

    Small samples in CSV format

    orbitaal-stream_graph-2016_07_08.csv and orbitaal-stream_graph-2016_07_09.csv

    These two CSV files are stream graph representations covering the halving that occurred in July 2016.

    orbitaal-snapshot-2016_07_08.csv and orbitaal-snapshot-2016_07_09.csv

    These two CSV files are daily snapshot representations covering the halving that occurred in July 2016.

  14. ChokePoint Dataset

    • zenodo.org
    • data.niaid.nih.gov
    txt, xz
    Updated Jan 24, 2020
    Cite
    Yongkang Wong; Shaokang Chen; Sandra Mau; Conrad Sanderson; Brian Lovell; Yongkang Wong; Shaokang Chen; Sandra Mau; Conrad Sanderson; Brian Lovell (2020). ChokePoint Dataset [Dataset]. http://doi.org/10.5281/zenodo.815657
    Explore at:
    xz, txtAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Yongkang Wong; Shaokang Chen; Sandra Mau; Conrad Sanderson; Brian Lovell; Yongkang Wong; Shaokang Chen; Sandra Mau; Conrad Sanderson; Brian Lovell
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The ChokePoint dataset is designed for experiments in person identification/verification under real-world surveillance conditions using existing technologies. An array of three cameras was placed above several portals (natural choke points in terms of pedestrian traffic) to capture subjects walking through each portal in a natural way. While a person is walking through a portal, a sequence of face images (i.e., a face set) can be captured. Faces in such sets will have variations in terms of illumination conditions, pose, sharpness, as well as misalignment due to automatic face localisation/detection. Due to the three-camera configuration, one of the cameras is likely to capture a face set where a subset of the faces is near-frontal.

    The dataset consists of 25 subjects (19 male and 6 female) in portal 1 and 29 subjects (23 male and 6 female) in portal 2. The recordings of portal 1 and portal 2 took place one month apart. The dataset has a frame rate of 30 fps and an image resolution of 800×600 pixels. In total, the dataset consists of 48 video sequences and 64,204 face images. In all sequences, only one subject is present in the image at a time. The first 100 frames of each sequence are for background modelling, with no foreground objects present.

    Each sequence was named according to the recording conditions (e.g., P2E_S1_C3), where P, S, and C stand for portal, sequence, and camera, respectively. E and L indicate subjects either entering or leaving the portal. The numbers indicate the respective portal, sequence, and camera label. For example, P2L_S1_C3 indicates that the recording was done in portal 2, with people leaving the portal, and captured by camera 3 in the first recorded sequence.
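    A minimal sketch of decoding this naming convention (the `parse_sequence` helper is illustrative, not part of the dataset):

```python
import re

# P<portal><E|L>_S<sequence>_C<camera>, per the convention described above.
NAME_RE = re.compile(r"P(?P<portal>\d)(?P<direction>[EL])_S(?P<sequence>\d)_C(?P<camera>\d)")

def parse_sequence(name):
    m = NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a ChokePoint sequence name: {name}")
    return {
        "portal": int(m["portal"]),
        "direction": "entering" if m["direction"] == "E" else "leaving",
        "sequence": int(m["sequence"]),
        "camera": int(m["camera"]),
    }

print(parse_sequence("P2L_S1_C3"))
```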

    To pose more challenging real-world surveillance problems, two sequences (P2E_S5 and P2L_S5) were recorded in a crowded scenario. In addition to the aforementioned variations, these sequences contain continuous occlusion, which presents challenges for identity tracking and face verification.

    This dataset can be applied, but not limited, to the following research areas:

    • person re-identification
    • image set matching
    • face quality measurement
    • face clustering
    • 3D face reconstruction
    • pedestrian/face tracking
    • background estimation and subtraction

    Please cite the following paper if you use the ChokePoint dataset in your work (papers, articles, reports, books, software, etc):

    • Y. Wong, S. Chen, S. Mau, C. Sanderson, B.C. Lovell
      Patch-based Probabilistic Image Quality Assessment for Face Selection and Improved Video-based Face Recognition
      IEEE Biometrics Workshop, Computer Vision and Pattern Recognition (CVPR) Workshops, pages 81-88, 2011.
      http://doi.org/10.1109/CVPRW.2011.5981881

  15. Database Infrastructure for Mass Spectrometry - Per- and Polyfluoroalkyl...

    • data.nist.gov
    • nist.gov
    • +1more
    Updated Jul 5, 2023
    + more versions
    Cite
    National Institute of Standards and Technology (2023). Database Infrastructure for Mass Spectrometry - Per- and Polyfluoroalkyl Substances [Dataset]. http://doi.org/10.18434/mds2-2905
    Explore at:
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    National Institute of Standards and Technologyhttp://www.nist.gov/
    License

    https://www.nist.gov/open/license

    Description

    Data here contain and describe an open-source structured query language (SQLite) portable database containing high resolution mass spectrometry data (MS1 and MS2) for per- and polyfluoroalkyl substances (PFAS) and associated metadata regarding their measurement techniques, quality assurance metrics, and the samples from which they were produced. These data are stored in a format adhering to the Database Infrastructure for Mass Spectrometry (DIMSpec) project. That project produces and uses databases like this one, providing a complete toolkit for non-targeted analysis. See more information about the full DIMSpec code base - as well as these data for demonstration purposes - at GitHub (https://github.com/usnistgov/dimspec) or view the full User Guide for DIMSpec (https://pages.nist.gov/dimspec/docs). Files of most interest contained here include the database file itself (dimspec_nist_pfas.sqlite) as well as an entity relationship diagram (ERD.png) and data dictionary (DIMSpec for PFAS_1.0.1.20230615_data_dictionary.json) to elucidate the database structure and assist in interpretation and use.

  16. Full US Phone Number and Telecom Data | 387,543,864 Phones | Full USA...

    • datarade.ai
    .json, .csv, .xls
    Updated Aug 12, 2023
    Cite
    CompCurve (2023). Full US Phone Number and Telecom Data | 387,543,864 Phones | Full USA Coverage | Mobile and Landline with Carrier | 100% Verifiable Data [Dataset]. https://datarade.ai/data-products/full-us-phone-number-and-telecom-data-387-543-864-phones-compcurve
    Explore at:
    .json, .csv, .xlsAvailable download formats
    Dataset updated
    Aug 12, 2023
    Dataset authored and provided by
    CompCurve
    Area covered
    United States
    Description

    This comprehensive dataset delivers 387M+ U.S. phone numbers enriched with deep telecom intelligence and granular geographic metadata, providing one of the most complete national phone data assets available today. Designed for data enrichment, verification, identity resolution, analytics, risk modeling, telecom research, and large-scale customer intelligence, this file combines broad coverage with highly structured attributes and reliable carrier-grade metadata. It is a powerful resource for any organization that needs accurate, up-to-date U.S. phone number data supported by robust telecom identifiers.

    Our dataset includes mobile, landline, and VOIP numbers, paired with detailed fields such as carrier, line type, city, state, ZIP code, county, latitude/longitude, time zone, rate center, LATA, and OCN. These attributes make the file suitable for a wide range of applications, from consumer analytics and segmentation to identity graph construction and marketing audience modeling. Updated regularly and validated for completeness, this dataset offers high-confidence coverage across all 50 states, major metros, rural areas, and underserved regions.

    Field Coverage & Schema Overview

    The dataset contains a rich set of fields commonly required for telecom analysis, identity resolution, and large-scale data cleansing:

    Phone Number – Standardized 10-digit U.S. number

    Line Type – Wireless, Landline, VOIP, fixed-wireless, etc.

    Carrier / Provider – Underlying or current carrier assignment

    City & State – Parsed from rate center and location metadata

    ZIP Code – Primary ZIP associated with the phone block

    County – County name mapped to geographic area

    Latitude / Longitude – Approximate geo centroid for the assigned location

    Time Zone – Automatically mapped; useful for outbound compliance

    Rate Center – Telco rate center tied to number blocks

    LATA – Local Access and Transport Area for telecom routing

    OCN (Operating Company Number) – Carrier identifier for precision analytics

    Additional metadata such as region codes, telecom identifiers, and national routing attributes depending on the number block

    These data points provide a complete snapshot of the phone number’s telecom context and geographic footprint.
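    The "standardized 10-digit U.S. number" field above implies the kind of structural validation described later under Data Quality, Cleansing & Validation. A hedged sketch (the `normalize_us_phone` helper is illustrative, not part of the product): strip formatting, drop a leading +1, and enforce the NANP rule that area code and exchange start with 2-9.

```python
import re
from typing import Optional

def normalize_us_phone(raw: str) -> Optional[str]:
    """Return the bare 10 digits of a structurally valid NANP number, else None."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    # NANP structure: area code and exchange must each start with 2-9.
    if re.fullmatch(r"[2-9]\d{2}[2-9]\d{6}", digits):
        return digits
    return None

print(normalize_us_phone("(415) 555-0123"))   # "4155550123"
print(normalize_us_phone("123-456-7890"))     # None: area codes cannot start with 1
```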

    Key Features

    387M+ fully structured U.S. phone numbers

    Mobile, landline, and VOIP line types

    Accurate carrier and OCN information

    Geo-enriched records with city, state, ZIP, county, lat/long

    Telecom routing metadata including rate center and LATA

    Ideal for large-scale analytics, enrichment, and modeling

    Nationwide coverage with consistent formatting and schema

    Primary Use Cases

    1. Data Enrichment & Appending

    Enhance customer databases by adding carrier information, line type, geographic attributes, and telecom routing fields to improve downstream analytics and segmentation.

    2. Identity Resolution & Profile Matching

    Use carrier, OCN, and geographic fields to strengthen your identity graph, resolve duplicate entities, confirm telephone types, or enrich cross-channel identifiers.

    3. Lead Scoring & Consumer Modeling

    Build predictive models based on:

    Line type (mobile vs landline)

    Geography (state, county, ZIP)

    Telecom infrastructure and regional carrier assignments

    Useful for ML/AI scoring, propensity models, risk analysis, and customer lifetime value studies.

    4. Compliance-Aware Outreach Planning

    Fields like time zone, rate center, and line type support compliant outbound operations, call scheduling, and segmentation of mobile vs landline users for regulated environments.

    5. Data Quality, Cleansing & Validation

    Normalize customer files, detect outdated or mismatched phone metadata, resolve carrier inconsistencies, and remove non-U.S. or structurally invalid numbers.

    6. Telecom Market Analysis

    Researchers and telecom analysts can use the dataset to understand national carrier distribution, regional line-type patterns, infrastructure growth, and switching behavior.

    7. Fraud Detection & Risk Intelligence

    Carrier metadata, OCN patterns, and geographic context support:

    Synthetic identity detection

    Fraud scoring models

    Device/number reputation systems

    VOIP risk modeling

    8. Location-Based Analytics & Mapping

    Lat/long and geographic context fields allow integration into GIS systems, heat-mapping, regional modeling, and ZIP- or county-level segmentation.

    9. Customer Acquisition & Audience Building

    Build highly targeted audiences for:

    Marketing analytics

    Look-alike modeling

    Cross-channel segmentation

    Regional consumer insights

    10. Enterprise-Scale ETL & Data Infrastructure

    The structured, normalized schema makes this file easy to integrate into:

    Data lakes

    Snowflake / BigQuery warehouses

    ID graphs

    Customer 360 platforms

    Telecom research systems

    Ideal Users

    Marketing analytics teams

    Data science groups

    Identity resolution providers

    Fraud & risk intelligence platforms

    Telecom analysts

    Consumer data platforms

    Credit, insurance, and fintech modeling teams

    Data brokers & a...

  17. Global Email to Phone Data | 850M+ Verified Matches | Identity Resolution &...

    • datarade.ai
    Updated Nov 21, 2025
    + more versions
    Cite
    CompCurve (2025). Global Email to Phone Data | 850M+ Verified Matches | Identity Resolution & Enrichment Hashed & Plain Text | Real-Time API & Batch [Dataset]. https://datarade.ai/data-products/global-email-to-phone-data-850m-verified-matches-identit-compcurve
    Explore at:
    .json, .csv, .xls, .txt, .jsonl, .pdfAvailable download formats
    Dataset updated
    Nov 21, 2025
    Dataset authored and provided by
    CompCurve
    Area covered
    Algeria, Western Sahara, Saint Barthélemy, Wallis and Futuna, Guatemala, Iran (Islamic Republic of), Senegal, Gabon, Virgin Islands (U.S.), Burundi
    Description

    Product Overview

    Scale your Identity Resolution and Contact Enrichment capabilities with the world’s largest commercially available Email-to-Phone linkage dataset. Covering over 850 million verified pairs across 190+ countries, this dataset bridges the gap between digital identifiers (Email) and physical reachability (Mobile/Phone).

    We provide a deterministic link between email addresses and phone numbers, enabling enterprises to resolve customer identities, prevent fraud, and enrich CRM records with high-accuracy mobile data. Unlike regional providers, our Global Identity Graph aggregates data from telco partnerships, e-commerce signals, and opt-in consortiums to deliver a single, unified solution for global operations.

    Key Questions This Data Answers

    Identity & Risk Teams:

    Is this email address associated with a valid, active mobile number?

    Does the phone number country match the user's IP location? (Critical for Fraud Detection)

    Is this a VOIP/Burner line or a legitimate contract mobile number?

    Marketing & Sales Teams:

    What is the direct mobile number for this prospect?

    How can I reactivate dormant email leads via SMS or Telemarketing?

    Which records in my CRM are missing phone numbers?

    Common Use Cases

    1. Fraud Prevention & Risk Scoring

    Stop synthetic fraud at the gate. By validating that an incoming email is tied to a legitimate, long-standing mobile number, you can drastically reduce account takeover (ATO) and fake sign-ups.

    Signal: Match status (Match/No Match) acts as a strong trust signal.

    Line Type: Flag risky VOIP or non-fixed VOIP lines immediately.

    2. CRM Enrichment & Reverse Append

    Breathe life into legacy databases. Take a list of email addresses (hashed or plain text) and append verified phone numbers to unlock new channels like SMS, WhatsApp, or Direct Sales calls.

    Fill Rates: Achieve industry-leading match rates (30-60% depending on region).

    Refresh: Update old landlines to current mobile numbers.

    3. Identity Verification (KYC/AML)

    Strengthen Know Your Customer (KYC) workflows by adding a passive layer of verification. Confirm that the user providing an email owns the associated mobile device without adding friction to the UX.

    4. Omnichannel Marketing

    Create a unified customer view. Link a user's email activity (newsletter opens) with their mobile identity to orchestrate synchronized Email + SMS campaigns.

    Data Dictionary & Schema Attributes

    We provide a rich output schema. You send us an Email (Plain Text, MD5, SHA1, or SHA256); we return the following:

    Core Identity Fields:

    email_address: The input email (or hash).

    phone_number: The matched phone number in E.164 format (e.g., +14155550123).

    match_score: Confidence score of the linkage (0-100).

    last_seen_date: Timestamp of the most recent signal validating this link.

    Phone Metadata:

    country_code: ISO 2-letter country code (e.g., US, GB, DE).

    carrier_name: Name of the telecom provider (e.g., Verizon, Vodafone).

    line_type: Classification of the number (Mobile, Landline, Fixed VOIP, Non-Fixed VOIP, Toll-Free).

    is_active: Boolean flag indicating if the line has shown recent activity.

    Linkage Metadata:

    linkage_type: Source of the match (Deterministic vs. Probabilistic).

    source_category: Aggregated source type (e.g., E-commerce, Telco, Utility).
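    As an illustration, a client could consume the schema above roughly as follows. This is a minimal sketch, not the provider's SDK: the field names come from the data dictionary, but the sample values, the `min_score` threshold, and the `is_risky` helper are assumptions for illustration.

    ```python
    import re

    # Hypothetical response record following the data dictionary above;
    # all values are invented for illustration.
    record = {
        "email_address": "jane.doe@example.com",
        "phone_number": "+14155550123",
        "match_score": 92,
        "last_seen_date": "2025-06-01T12:00:00Z",
        "country_code": "US",
        "carrier_name": "Verizon",
        "line_type": "Mobile",
        "is_active": True,
        "linkage_type": "Deterministic",
        "source_category": "E-commerce",
    }

    # E.164: a leading "+", a first digit 1-9, then up to 14 more digits.
    E164 = re.compile(r"^\+[1-9]\d{1,14}$")

    def is_risky(rec, min_score=80):
        """Flag records that should not be trusted for SMS outreach."""
        if not E164.match(rec["phone_number"]):
            return True
        if rec["match_score"] < min_score or not rec["is_active"]:
            return True
        # Per the fraud-prevention guidance above, VOIP lines are risky.
        return "VOIP" in rec["line_type"]

    print(is_risky(record))  # → False: a strong match on an active mobile line
    ```

    The same check flags `Non-Fixed VOIP` lines and low-confidence matches, mirroring the trust signals described in the fraud-prevention use case.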

    Global Coverage & Scale

    Our 850M+ matches are not just US-centric. We offer significant density in key global markets:

    North America: ~350M Matches

    Europe (GDPR Compliant): ~250M Matches

    APAC: ~150M Matches

    LATAM: ~100M Matches

    Methodology & Compliance

    Privacy First: We strictly adhere to GDPR, CCPA, and TCPA regulations. All European data is sourced from consent-based frameworks.

    Hashing Supported: We accept and return hashed data (MD5/SHA256) for privacy-safe mapping in clean rooms (Snowflake/AWS).

    Verification: Our "Active Line" check pings the HLR (Home Location Register) to ensure the number is currently in service, reducing SMS bounce rates.
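    For the hashed-matching workflow mentioned above, a common convention is to normalize the email (trim whitespace, lowercase) before hashing so that both sides of a clean-room join produce identical digests. A minimal sketch using the Python standard library; confirm the provider's exact normalization rules before relying on it:

    ```python
    import hashlib

    def hash_email(email: str, algo: str = "sha256") -> str:
        """Normalize an email, then return its hex digest for privacy-safe matching.

        Trimming and lowercasing before hashing is an assumed convention here,
        not a documented requirement of any specific provider.
        """
        normalized = email.strip().lower()
        return hashlib.new(algo, normalized.encode("utf-8")).hexdigest()

    # The same address in different casings hashes identically after normalization.
    print(hash_email("Jane.Doe@Example.com") == hash_email(" jane.doe@example.com "))  # → True

    # MD5 is also accepted per the section above, though SHA-256 is preferred today.
    print(len(hash_email("a@b.c", algo="md5")))  # → 32 hex characters
    ```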

    Delivery & Formats

    Real-Time API: <100ms latency for live verification at checkout.

    Batch Upload: Secure SFTP or S3 bucket transfer for large-scale CRM enrichment.

    Formats: JSON, CSV, Parquet.

  18. B2B Marketing Data | USA Coverage

    • datarade.ai
    Updated Nov 21, 2025
    Cite
    Archetype Data (2025). B2B Marketing Data | USA Coverage [Dataset]. https://datarade.ai/data-products/b2b-marketing-data-usa-coverage-archetype-data
    Explore at:
    Dataset updated
    Nov 21, 2025
    Dataset authored and provided by
    Archetype Data
    Area covered
    United States of America
    Description

    Archetype Data’s B2B dataset provides a comprehensive, high-fidelity view of the U.S. business landscape, encompassing over 20 million verified business entities across every industry, company size, and geography. Designed to empower marketers, analysts, and enterprise data teams, this dataset delivers the scale, depth, and accuracy needed to identify, segment, and engage decision-makers with precision.

    Each business record is built from verified commercial, public, and proprietary data sources and continuously refreshed to maintain accuracy and recency. The dataset includes key firmographic attributes such as company name, address, industry (SIC/NAICS), revenue range and employee count, alongside advanced linkage attributes that connect businesses to their owners, executives, and affiliated professionals.

    Archetype Data’s proprietary entity resolution and normalization process eliminates duplicates, harmonizes naming conventions, and links related records, ensuring clean, standardized, and activation-ready data. This structure allows for enhanced segmentation and audience building, making it easy to target industries, business sizes, or professional roles that align with campaign objectives.

  19. Identity Data | Europe | Email-Device Matching with Login Metadata and...

    • datarade.ai
    Updated Aug 14, 2025
    Cite
    Irys (2025). Identity Data | Europe | Email-Device Matching with Login Metadata and Hashed Emails [Dataset]. https://datarade.ai/data-products/identity-data-europe-email-device-matching-with-login-met-irys
    Explore at:
    Available download formats: .json, .csv, .xls, .sql
    Dataset updated
    Aug 14, 2025
    Dataset authored and provided by
    Irys
    Area covered
    Europe, Guernsey, Holy See, Liechtenstein, Faroe Islands, Italy, United Kingdom, Lithuania, Switzerland, Poland, Greece
    Description

    This European-focused dataset enables identity resolution through hashed emails and mobile device identifiers. It includes login time (Unix), hashed emails (MD5, SHA-1, SHA-256), IPs, device model, and country of activity.

    Designed to support GDPR-compliant personalization, this dataset is perfect for CRM onboarding, identity mapping, and programmatic advertising across EU markets.

    It is highly useful for analytics teams, marketers, and fintech players seeking compliant, high-fidelity identity linkages.

  20. Alesco Email Database - Identity Data - 2.3+ Billion US email records -...

    • datarade.ai
    .csv, .xls, .txt
    + more versions
    Cite
    Alesco Data, Alesco Email Database - Identity Data - 2.3+ Billion US email records - available for identify resolution and appending! [Dataset]. https://datarade.ai/data-products/alesco-email-database-identity-data-1-8-billion-us-email-alesco-data
    Explore at:
    Available download formats: .csv, .xls, .txt
    Dataset authored and provided by
    Alesco Data
    Area covered
    United States of America
    Description

    Alesco’s aggregated consumer email database consists of over 2.3 billion U.S. records with name, address, and email. The database is fully CAN-SPAM and privacy compliant, and records include referring URL, IP address, and date stamp. Postal addresses are standardized and processed through the U.S. Postal Service National Change of Address (NCOA) service. Available for licensing!

    File size: 2.3 Billion
    IP Address: 1.9 Billion
    eAppend data: 1.48 Billion (full name/postal)
    Acquisition: 269 Million (full demos)

    Fields Included:
    - Name
    - Address
    - Email
    - Phone
    - IP Address

Cite
Praveen Chinnappa; Rose Mary Arokiya Dass; yash mathur (2025). SPIDER (v2): Synthetic Person Information Dataset for Entity Resolution [Dataset]. http://doi.org/10.6084/m9.figshare.30472712.v1

SPIDER (v2): Synthetic Person Information Dataset for Entity Resolution

Explore at:
Available download formats: csv
Dataset updated
Oct 29, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Praveen Chinnappa; Rose Mary Arokiya Dass; yash mathur
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

SPIDER (v2) – Synthetic Person Information Dataset for Entity Resolution provides researchers with ready-to-use data for benchmarking duplicate-detection and entity resolution algorithms. The dataset focuses on person-level fields typical of customer or citizen records. Because real-world person-level data is restricted by Personally Identifiable Information (PII) constraints, publicly available synthetic datasets are limited in scope, volume, or realism. SPIDER addresses these limitations with a large-scale, realistic dataset containing first name, last name, email, phone, address, and date of birth (DOB) attributes. Using the Python Faker library, 40,000 unique synthetic person records were generated, followed by 10,000 controlled duplicate records derived using seven real-world transformation rules. Each duplicate record is linked to its original base record and rule through the fields is_duplicate_of and duplication_rule. Version 2 introduces major realism and structural improvements to both the dataset and the generation framework.

Enhancements in Version 2:
- New cluster_id column to group base and duplicate records for improved entity-level benchmarking.
- Improved data realism with consistent field relationships: state and ZIP codes now match correctly, phone numbers are generated based on state codes, and email addresses are logically related to name components.
- Refined duplication logic: Rule 4 updated for realistic address variation; Rule 7 enhanced to simulate shared accounts among different individuals (with distinct DOBs).
- Improved data validation and formatting for address, email, and date fields.
- Updated Python generation script for modular configuration, reproducibility, and extensibility.

Duplicate Rules (with real-world use cases):
1. Variation in email address. Use case: same person using multiple email accounts.
2. Variation in phone number. Use case: same person using multiple contact numbers.
3. Last-name variation. Use case: name changes or data-entry inconsistencies.
4. Address variation. Use case: same person maintaining multiple addresses or moving residences.
5. Nickname in place of the first name. Use case: same person using formal and informal names (Robert → Bob, Elizabeth → Liz).
6. Minor spelling variation in the first name. Use case: legitimate entry or migration errors (Sara → Sarah).
7. Multiple individuals sharing the same email and last name but with different DOBs. Use case: realistic shared accounts among family members or households (benefits, tax, or insurance portals).

Output Format: The dataset is available in both CSV and JSON formats for direct use in data-processing, machine-learning, and record-linkage frameworks.

Data Regeneration: The included Python script can fully regenerate the dataset and supports the addition of new duplication rules; regional, linguistic, or domain-specific variations; and volume scaling for large-scale testing scenarios.

Files Included:
- spider_dataset_v2_6_20251027_022215.csv
- spider_dataset_v2_6_20251027_022215.json
- spider_readme_v2.md
- SPIDER_generation_script_v2.py
- SupportingDocuments/ folder containing:
  - benchmark_comparison_script.py – script used to derive the F1 score.
  - Public_census_data_surname.csv – sample U.S. Census surname and demographic data used for comparison.
  - ssa_firstnames.csv – Social Security Administration first-names dataset.
  - simplemaps_uszips.csv – ZIP-to-state mapping data used for phone and address validation.
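The linkage fields described above (cluster_id and is_duplicate_of) give a ground truth against which a matcher can be scored. A minimal sketch of that evaluation, assuming an identifier column named record_id (the actual SPIDER column name may differ) and using invented stand-in rows rather than the real CSV:

```python
from itertools import combinations

# Invented stand-in rows mirroring SPIDER v2's linkage fields.
rows = [
    {"record_id": "r1", "cluster_id": "c1", "is_duplicate_of": None},
    {"record_id": "r2", "cluster_id": "c1", "is_duplicate_of": "r1"},
    {"record_id": "r3", "cluster_id": "c2", "is_duplicate_of": None},
    {"record_id": "r4", "cluster_id": "c2", "is_duplicate_of": "r3"},
]

def truth_pairs(rows):
    """Ground-truth match pairs: all within-cluster pairs via cluster_id."""
    clusters = {}
    for r in rows:
        clusters.setdefault(r["cluster_id"], []).append(r["record_id"])
    pairs = set()
    for ids in clusters.values():
        pairs.update(frozenset(p) for p in combinations(sorted(ids), 2))
    return pairs

def f1(predicted, truth):
    """Pairwise F1 of a matcher's predicted pairs against the ground truth."""
    tp = len(predicted & truth)
    if not predicted or not truth or tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

truth = truth_pairs(rows)
# A hypothetical matcher's output: one correct pair, one false positive.
predicted = {frozenset({"r1", "r2"}), frozenset({"r1", "r3"})}
print(f1(predicted, truth))  # → 0.5 (precision 0.5, recall 0.5)
```

The same pairwise scoring can be run against the full CSV by loading the real columns; the included benchmark_comparison_script.py is the dataset's own scoring tool.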
