100+ datasets found
  1. Example Datasets for Functional Enrichment Analysis

    • zenodo.org
    • data.niaid.nih.gov
    bin, txt
    Updated Jan 24, 2020
    Cite
    Daniel (2020). Example Datasets for Functional Enrichment Analysis [Dataset]. http://doi.org/10.5281/zenodo.2564088
    Explore at:
    txt, bin. Available download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a set of example data for a functional enrichment tutorial.

  2. Additional file 1 of SEDE-GPS: socio-economic data enrichment based on GPS...

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Theodor Sperlea; Stefan Füser; Jens Boenigk; Dominik Heider (2023). Additional file 1 of SEDE-GPS: socio-economic data enrichment based on GPS information [Dataset]. http://doi.org/10.6084/m9.figshare.7405250.v1
    Explore at:
    txt. Available download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Theodor Sperlea; Stefan Füser; Jens Boenigk; Dominik Heider
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This table contains names, positions, and references for the samples contained in the sequence dataset and whether Prokaryotes and/or Eukaryotes were analyzed from the sample in this study. (CSV 3 kb)

  3. Health administrative data enrichment using cohort information: Comparative...

    • plos.figshare.com
    docx
    Updated May 30, 2023
    Cite
    Bernard C. Silenou; Marta Avalos; Catherine Helmer; Claudine Berr; Antoine Pariente; Helene Jacqmin-Gadda (2023). Health administrative data enrichment using cohort information: Comparative evaluation of methods by simulation and application to real data [Dataset]. http://doi.org/10.1371/journal.pone.0211118
    Explore at:
    docx. Available download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Bernard C. Silenou; Marta Avalos; Catherine Helmer; Claudine Berr; Antoine Pariente; Helene Jacqmin-Gadda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Studies using health administrative databases (HAD) may lead to biased results since information on potential confounders is often missing. Methods that integrate confounder data from cohort studies, such as multivariate imputation by chained equations (MICE) and two-stage calibration (TSC), aim to reduce confounding bias. We provide new insights into their behavior under different deviations from representativeness of the cohort.

    Methods: We conducted an extensive simulation study to assess the performance of these two methods under different deviations from representativeness of the cohort. We illustrate these approaches by studying the association between benzodiazepine use and fractures in the elderly, using the general sample of French health insurance beneficiaries (EGB) as the main database and two French cohorts (Paquid and 3C) as validation samples.

    Results: When the cohort was representative of the same population as the HAD, the two methods were unbiased. TSC was more efficient and faster, but its variance could be slightly underestimated when confounders were non-Gaussian. If the cohort was a subsample of the HAD (internal validation), with the probability of a subject being included in the cohort depending on both exposure and outcome, MICE was unbiased while TSC was biased. Both methods appeared biased when the inclusion probability in the cohort depended on unobserved confounders.

    Conclusion: When choosing the most appropriate method, epidemiologists should consider the origin of the cohort (internal or external validation) as well as the (anticipated or observed) selection biases of the validation sample.

  4. Enriched NYTimes COVID19 U.S. County Dataset

    • kaggle.com
    zip
    Updated Jun 14, 2020
    Cite
    ringhilterra17 (2020). Enriched NYTimes COVID19 U.S. County Dataset [Dataset]. https://www.kaggle.com/ringhilterra17/enrichednytimescovid19
    Explore at:
    zip (11291611 bytes). Available download formats
    Dataset updated
    Jun 14, 2020
    Authors
    ringhilterra17
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Overview and Inspiration

    I wanted to make some geospatial visualizations to convey the current severity of COVID-19 in different parts of the U.S.

    I liked the NYTimes COVID dataset, but it lacked county boundary shape data, population per county, new cases/deaths per day, per capita calculations, and county demographics.

    After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in python, I wanted to open-source the final enriched data set in order to give others a head start in their COVID-19 related analytic, modeling, and visualization efforts.

    This dataset is enriched with county shapes, county center point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per-day cases/deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.

    UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.

    How this data can be used

    Geospatial analysis and visualization:
    • Which counties are currently getting hit the hardest (per capita and totals)?
    • What patterns are there in the spread of the virus across counties? (network-based spread simulations using county center lat/lons)
    • Do county population densities play a role in how quickly the virus spreads?
    • How do a specific county's or state's cases and deaths compare to other counties/states?
    • Join with other county-level datasets easily (via the fips code column)
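    The FIPS-keyed join mentioned above can be sketched in pure Python; the column names and values below are illustrative stand-ins for the real enriched table and a second county-level dataset.

```python
import csv
from io import StringIO

# Tiny illustrative extracts: the enriched COVID table and a second
# county-level table, both keyed on a `fips` column (values made up).
covid_csv = """fips,county,cases,population
36061,New York City,1000,8336817
06075,San Francisco,200,881549
"""
income_csv = """fips,median_income
36061,64000
06075,112000
"""

covid = {row["fips"]: row for row in csv.DictReader(StringIO(covid_csv))}
income = {row["fips"]: row for row in csv.DictReader(StringIO(income_csv))}

# Join on the FIPS code and derive a per-capita metric.
joined = []
for fips, row in covid.items():
    merged = {**row, **income.get(fips, {})}
    merged["cases_per_100k"] = 100_000 * int(row["cases"]) / int(row["population"])
    joined.append(merged)
```

    The same pattern extends to any county-level table that carries a FIPS column.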

    Content Details

    See the column descriptions for more details on the dataset

    Visualizations and Analysis Examples

    COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)

    Example animation: https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif

    Other Data Notes

    • Please review the NYTimes README for detailed notes on the COVID-19 data: https://github.com/nytimes/covid-19-data/
    • The only change I made regarding 'Geographic Exceptions' concerns the 'New York City' county in the COVID-19 data, which aggregates all cases for the five boroughs of New York City (New York, Kings, Queens, Bronx, and Richmond counties). I replaced the missing FIPS for those rows with the 'New York County' FIPS code 36061 so I could join to a geometry, and I used the sum of the five boroughs' population estimates as the 'New York City' estimate, which allowed me to calculate per capita metrics for 'New York City' entries in the COVID-19 dataset.
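    The 'New York City' exception amounts to a small fill-and-sum step; here is a sketch with approximate 2019 borough population estimates, shown for illustration only.

```python
# Sketch of the 'New York City' geographic-exception fix described above.
# Borough populations are approximate 2019 estimates, for illustration only.
BOROUGH_POP = {
    "New York": 1_628_706,   # Manhattan
    "Kings": 2_559_903,      # Brooklyn
    "Queens": 2_253_858,
    "Bronx": 1_418_207,
    "Richmond": 476_143,     # Staten Island
}

rows = [
    {"county": "New York City", "fips": "", "cases": 50_000},
    {"county": "Los Angeles", "fips": "06037", "cases": 40_000},
]

nyc_population = sum(BOROUGH_POP.values())
for row in rows:
    if row["county"] == "New York City" and not row["fips"]:
        row["fips"] = "36061"             # New York County FIPS, per the note
        row["population"] = nyc_population
```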

    Acknowledgements

  5. Matlab example for Local Enrichment Analysis (LEA) analysis with real data

    • datadryad.org
    • data.niaid.nih.gov
    • +2more
    zip
    Updated Aug 29, 2022
    Cite
    Berend Snijder; Yannik Severin (2022). Matlab example for Local Enrichment Analysis (LEA) analysis with real data [Dataset]. http://doi.org/10.5061/dryad.2jm63xssk
    Explore at:
    zip. Available download formats
    Dataset updated
    Aug 29, 2022
    Dataset provided by
    Dryad
    Authors
    Berend Snijder; Yannik Severin
    Time period covered
    Aug 26, 2022
    Description

    Code is compatible with Matlab v2020. The corresponding open-source alternative is Octave (https://octave.org/).

  6. Data from: Argon data for enriched MORB from the 8°20' N seamount chain

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 26, 2025
    Cite
    U.S. Geological Survey (2025). Argon data for enriched MORB from the 8°20' N seamount chain [Dataset]. https://catalog.data.gov/dataset/argon-data-for-enriched-morb-from-the-820-n-seamount-chain
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Hawaiian–Emperor seamount chain
    Description

    This dataset accompanies planned publication 'Near-Ridge Magmatism Constrained Using 40Ar/39Ar Dating of Enriched MORB from the 8°20' N Seamount Chain'. The Ar/Ar data are for samples that record the volcanic history of the area. The geochronology provides time constraints for the eruption of rocks studied in the manuscript. Samples were collected from the 8°20' N seamount chain by Molly Anderson (University of Florida), who sent them to the USGS Denver Argon Geochronology Laboratory for Ar/Ar analysis.

  7. LinkedIn Datasets

    • brightdata.com
    .json, .csv, .xlsx
    Updated Dec 17, 2021
    Cite
    Bright Data (2021). LinkedIn Datasets [Dataset]. https://brightdata.com/products/datasets/linkedin
    Explore at:
    .json, .csv, .xlsx. Available download formats
    Dataset updated
    Dec 17, 2021
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions.

    Dataset Features

    • Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month.
    • Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records.
    • Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and job market dynamics.

    Customizable Subsets for Specific Needs

    Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications.

    Popular Use Cases

    • Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data.
    • Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities.
    • Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies.
    • Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis.
    • AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.

    Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.

  8. Data from: Assessment of targeted enrichment locus capture across time and...

    • datadryad.org
    • nde-dev.biothings.io
    • +4more
    zip
    Updated May 18, 2023
    Cite
    Rhema Uche-Dike; Aaron Goodman; Ethan Tolman; John Abbott; Jesse Breinholt; Seth Bybee; Paul Frandsen; Stephen Gosnell; Rob Guralnick; Vincent Kalkman; Manpreet Kohli; Judicael Fomekong-Lontchi; Pungki Lupiyaningdyah; Lacie Newton; Jessica Ware (2023). Assessment of targeted enrichment locus capture across time and museums using odonate specimens [Dataset]. http://doi.org/10.5061/dryad.kprr4xh8z
    Explore at:
    zip. Available download formats
    Dataset updated
    May 18, 2023
    Dataset provided by
    Dryad
    Authors
    Rhema Uche-Dike; Aaron Goodman; Ethan Tolman; John Abbott; Jesse Breinholt; Seth Bybee; Paul Frandsen; Stephen Gosnell; Rob Guralnick; Vincent Kalkman; Manpreet Kohli; Judicael Fomekong-Lontchi; Pungki Lupiyaningdyah; Lacie Newton; Jessica Ware
    Time period covered
    May 15, 2023
    Description

    • IQ-Tree v.2.1.3 (data matrix: fasta file)
    • UNIX/command line or a text editor for viewing (fastq files: raw data)
    • FigTree (tree file: .treefile)
    • BBEdit (partition files: Nexus)

  9. Additional file 20: of MGSEA – a multivariate Gene set enrichment analysis...

    • springernature.figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Khong-Loon Tiong; Chen-Hsiang Yeang (2023). Additional file 20: of MGSEA – a multivariate Gene set enrichment analysis [Dataset]. http://doi.org/10.6084/m9.figshare.7861256.v1
    Explore at:
    zip. Available download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Khong-Loon Tiong; Chen-Hsiang Yeang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary File S1. The R source codes of the MGSEA program, a toy example dataset, and a brief explanation for running the program. (ZIP 1832 kb)

  10. vdjHopper Example Data

    • zenodo.org
    application/gzip, bin
    Updated Aug 22, 2025
    Cite
    Nicholas Borcherding (2025). vdjHopper Example Data [Dataset]. http://doi.org/10.5281/zenodo.16929319
    Explore at:
    bin, application/gzip. Available download formats
    Dataset updated
    Aug 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nicholas Borcherding
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 22, 2025
    Description

    vdjHopper Example Data

    This repository provides example single-cell V(D)J sequencing files used by the vdjHopper R package. The data are intended for demonstration and teaching purposes only, allowing users to test the package’s decontamination, IgBLAST integration, and chain pairing workflows without requiring large raw datasets.

    Contents

    The dataset contains representative files derived from the 10x Genomics NSCLC tumor TCR enrichment dataset.
    All files have been subset or downsampled to reduce size and to keep the total archive under 5 MB.

    • all_contig_annotations.csv.gz
      Filtered contig annotations (CSV, compressed). Contains selected columns such as barcode, chain, cdr3, v_gene, and j_gene.

    • all_contig.fasta.gz
      Representative TCR sequences in FASTA format.

    • all_contig.fastq.gz
      A small subset of raw sequencing reads in FASTQ format, provided for demonstration only.

    • all_contig.bam
      Full BAM alignment file (~3.5 GB). This file is not included in the CRAN package build but can be downloaded from this Zenodo record if required for advanced tutorials. Users should call vdjHopper::fetch_example_data() to retrieve and cache this file programmatically.
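    Reading the compressed annotation table needs only the standard library; the following sketch uses a tiny in-memory stand-in for all_contig_annotations.csv.gz (the row values are invented, but the column names follow the list above).

```python
import csv
import gzip
import io

# In-memory stand-in for all_contig_annotations.csv.gz, using the
# columns listed above (barcode, chain, cdr3, v_gene, j_gene).
raw = b"barcode,chain,cdr3,v_gene,j_gene\nAAACCTG-1,TRB,CASSLG,TRBV19,TRBJ2-7\n"
blob = gzip.compress(raw)

# gzip.open with mode="rt" yields a text stream suitable for csv.DictReader.
with gzip.open(io.BytesIO(blob), mode="rt") as handle:
    contigs = list(csv.DictReader(handle))

# Example downstream step: keep only beta-chain (TRB) contigs.
trb = [contig for contig in contigs if contig["chain"] == "TRB"]
```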

    License and Source

    When using this data, please cite 10x Genomics as the original data source.

  11. Teosto Open Api – open interface for live music data

    • data.europa.eu
    unknown
    Updated Dec 6, 2023
    Cite
    Yhdistykset ja säätiöt (2023). Teosto Open Api – open interface for live music data [Dataset]. https://data.europa.eu/data/datasets/3c7de080-ea97-4ddb-9a26-218579825170?locale=en
    Explore at:
    unknown. Available download formats
    Dataset updated
    Dec 6, 2023
    Dataset authored and provided by
    Yhdistykset ja säätiöt
    Description

    The live music data collected by Teosto is the largest and most comprehensive in Finland. The data opened through the open interface now includes all live gigs reported to Teosto in Finland last year (2014): the dates of the gigs, the venues with their locations and coordinates, the performers, the songs performed, and the authors of the songs.

    We challenge developers to enrich live music spatial data and develop new, innovative uses for it. Examples of data enrichment include combining other open spatial datasets with event data or music-related metadata with song-specific data.

    The development of live data is part of the Open Finland Challenge competition and the Ultrahack event.

  12. Paintings Gemma-Enriched Dataset. Fotothek - Bibliotheca Hertziana

    • edmond.mpg.de
    zip
    Updated Nov 24, 2025
    Cite
    Pietro Maria Liuzzo (2025). Paintings Gemma-Enriched Dataset. Fotothek - Bibliotheca Hertziana [Dataset]. http://doi.org/10.17617/3.Z8W2JR
    Explore at:
    zip (221010765 bytes). Available download formats
    Dataset updated
    Nov 24, 2025
    Dataset provided by
    Edmond
    Authors
    Pietro Maria Liuzzo
    License

    https://edmond.mpg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.17617/3.Z8W2JR

    Description

    Gemälde Dataset: AI-Enhanced Art Historical Descriptions

    This dataset contains 224x224 images and associated metadata extracted from the MIDAS XML of the Catalogue of the Photographic Collection of the Bibliotheca Hertziana, enriched with AI-generated prose texts. The dataset is limited to photographs of objects classified as paintings (Gemälde) and has been processed using the Google Gemma 2 9B Instruct large language model on the KISSKI HPC cluster of the GWDG. Scripts to process the data on KISSKI were elaborated with Claude Code in Visual Studio Code.

    Dataset Overview

    Source data:
    • Original dataset: gemalde.tsv (19,051 rows)
    • Extracted from: MIDAS XML format (combined.xml)
    • Institution: Photographic Collection, Bibliotheca Hertziana - Max Planck Institute for Art History
    • Catalogue: Fotothek der Bibliotheca Hertziana

    Output:
    • Enriched metadata: TSV files with AI-generated German and English descriptions
    • 224x224 images downloaded from the IIIF Image API of the Photographic Collection

    Processing Pipeline

    1. Data Extraction

    Source data was extracted with gemalde.xql from the MIDAS XML file combined.xml, which contains structured art historical metadata including:
    • Object titles and descriptions (textobj, textfoto)
    • Artist information (aob30)
    • Location data (aob26, aob28)
    • Dating and provenance
    • Image references (a8540)

    Images download: 224x224 images were downloaded in advance from the IIIF service based on gemalde.tsv. The AI text enrichment script checks that each image has been downloaded, so every output row is guaranteed to have a matching image. 17,657 images were downloaded from 19,051 rows; the difference is due to known missing digital images. The dataset corresponds to published data, and each row contains the licence and accessibility of the single image, plus the date of creation and last update of the catalogue object.

    2. AI Text Generation

    Model used:
    • Name: Google Gemma 2 9B Instruct
    • Parameters: 9 billion
    • Quantization: FP16 (no quantization)
    • Context window: 8,192 tokens
    • License: Gemma Terms of Use

    Processing workflow:
    • Input cleaning: removal of numeric codes, normalization of Unicode characters
    • Paragraph generation: German text from structured metadata
    • Translation: German → English

    Categories processed:
    • paragraph foto DE/EN - photograph description
    • paragraph obj DE/EN - object/artwork description
    • paragraph verwalter DE/EN - collection/custodian information
    • paragraph standort DE/EN - location information

    AI Prompts Used

    Paragraph generation prompt:

    Convert the following structured information into a coherent text in German. The text contains field data that should be transformed into flowing prose while preserving all information.
    IMPORTANT:
    - Write a MAXIMUM of 2 paragraphs
    - Do NOT include any URLs or web links
    - Do NOT include reference codes or numerical codes
    - Do NOT add any comments or explanations
    - Only output the paragraph text itself
    Field: {field_name}
    Text: {cleaned_text}
    German text (maximum 2 paragraphs):

    Example input:
    Field: textobj
    Text: Bildnis Filippo Neri Hl. Filippo Neri geboren 1515 Florenz gestorben 1595 Rom Priester Ordensgründer Gründer Oratorium Kongregation des Oratoriums

    Example output:
    Filippo Neri, geboren 1515 in Florenz und gestorben 1595 in Rom, war ein Priester und bedeutender Ordensgründer. Er gründete das Oratorium und die Kongregation des Oratoriums, die bis heute eine wichtige Rolle in der katholischen Kirche spielen.

    Translation prompt:

    Translate the following German text to English. Preserve the meaning and style as much as possible.
    IMPORTANT:
    - Do NOT include any URLs or web links in the translation
    - Do NOT include reference codes starting with "bh" followed by numbers
    - Do NOT include numerical codes like 08012353
    - Do NOT add any comments or explanations
    - Only output the translated text itself
    German text: {text}
    English translation:

    Example translation:
    Input (DE): Filippo Neri, geboren 1515 in Florenz und gestorben 1595 in Rom, war ein Priester und bedeutender Ordensgründer.
    Output (EN): Filippo Neri, born 1515 in Florence and died 1595 in Rome, was a priest and important founder of a religious order.

    KISSKI Cluster Resources

    Hardware configuration:
    • GPU: NVIDIA A100 (80GB VRAM)
    • Architecture: Ampere
    • Tensor cores: 432
    • FP16 performance: ~312 TFLOPS
    • Memory bandwidth: 2 TB/s

    Allocation per job:
    • GPUs: 1× A100
    • CPUs: 4 cores
    • RAM: 64 GB
    • Time limit: 6 hours per job

    Job array configuration:
    • Total jobs: 38 (indices 0-37)
    • Chunk size: 500 rows per job
    • Parallel jobs: 10 simultaneous
    • Total rows processed: 19,000 (rows 0-18,999)

    Performance metrics, AI operations per row:
    • 4 paragraph generations (foto, obj, verwalter, standort)
    • 4 translations (DE → EN)
    • Total: 8 LLM inference calls per row

    Resource consumption:
    • GPU hours: ~125 total (38 jobs × 3.3 hours)
    • Model size in memory: ~18 GB (FP16)
    • Peak VRAM usage: ~25 GB per job

    Output structure:
    data_gemalde/
    ├── enriched_data/
    │   ├── data_0-499.tsv    # Rows 0-499
    │   ├── data_500-999.tsv  #...
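    The input-cleaning and prompt-assembly steps can be sketched as follows; the cleaning regexes are assumptions modelled on the stated rules (strip "bh"-prefixed reference codes and long numeric codes like 08012353), not the project's actual script.

```python
import re

# Paragraph-generation prompt template, as quoted above.
PROMPT = (
    "Convert the following structured information into a coherent text in German. "
    "The text contains field data that should be transformed into flowing prose "
    "while preserving all information.\n"
    "IMPORTANT:\n"
    "- Write a MAXIMUM of 2 paragraphs\n"
    "- Do NOT include any URLs or web links\n"
    "- Do NOT include reference codes or numerical codes\n"
    "- Do NOT add any comments or explanations\n"
    "- Only output the paragraph text itself\n"
    "Field: {field_name}\n"
    "Text: {cleaned_text}\n"
    "German text (maximum 2 paragraphs):"
)

def clean(text: str) -> str:
    """Assumed cleaning step: drop reference and numeric codes, squeeze spaces."""
    text = re.sub(r"\bbh\d+\b", "", text)   # "bh"-prefixed reference codes
    text = re.sub(r"\b\d{6,}\b", "", text)  # long numeric codes, e.g. 08012353
    return re.sub(r"\s+", " ", text).strip()

prompt = PROMPT.format(field_name="textobj",
                       cleaned_text=clean("Bildnis Filippo Neri bh123 08012353"))
```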

  13. Phone Number Data | Global Coverage | 100M+ B2B Mobile Phone Numbers | 95%+...

    • datarade.ai
    .json, .csv
    + more versions
    Cite
    Forager.ai, Phone Number Data | Global Coverage | 100M+ B2B Mobile Phone Numbers | 95%+ Accuracy [Dataset]. https://datarade.ai/data-products/global-mobile-phone-number-data-90m-95-accuracy-api-b-forager-ai-905f
    Explore at:
    .json, .csv. Available download formats
    Dataset provided by
    Forager.ai
    Area covered
    Japan, Moldova (Republic of), Botswana, Martinique, United Arab Emirates, Uruguay, Macedonia (the former Yugoslav Republic of), South Georgia and the South Sandwich Islands, Cambodia, Colombia
    Description

    Global B2B Mobile Phone Number Database | 100M+ Verified Contacts | 95% Accuracy Forager.ai provides the world’s most reliable mobile phone number data for businesses that refuse to compromise on quality. With 100 million+ professionally verified mobile numbers refreshed every 3 weeks, our database ensures 95% accuracy – so your teams never waste time on dead-end leads.

    Why Our Data Wins

    ✅ Accuracy You Can Trust: 95% of mobile numbers are verified against live carrier records and tied to current job roles. Say goodbye to “disconnected number” voicemails.

    ✅ Depth Beyond Digits Each contact includes 150+ data points:

    Direct mobile numbers

    Current job title, company, and department

    Full career history + education background

    Location data + LinkedIn profiles

    Company size, industry, and revenue

    ✅ Freshness Guaranteed Bi-weekly updates combat job-hopping and role changes – critical for sales teams targeting decision-makers.

    ✅ Ethically Sourced & Compliant First-party collected data with full GDPR/CCPA compliance.

    Who Uses This Data?

    Sales Teams: Cold-call C-suite prospects with verified mobile numbers.

    Marketers: Run hyper-personalized SMS/WhatsApp campaigns.

    Recruiters: Source passive candidates with up-to-date contact intel.

    Data Vendors: License premium datasets to enhance your product.

    Tech Platforms: Power your SaaS tools via API with enterprise-grade B2B data.

    Flexible Delivery, Instant Results

    API (REST): Real-time integration for CRMs, dialers, or marketing stacks

    CSV/JSON: Campaign-ready files.

    PostgreSQL: Custom databases for large-scale enrichment

    Compliance: Full audit trails + opt-out management

    Why Forager.ai? → Proven ROI: Clients see 62% higher connect rates vs. industry averages (request case studies). → No Guesswork: Test-drive free samples before committing. → Scalable Pricing: Pay per record, license datasets, or get unlimited API access.

    B2B Mobile Phone Data | Verified Contact Database | Sales Prospecting Lists | CRM Enrichment | Recruitment Phone Numbers | Marketing Automation | Phone Number Datasets | GDPR-Compliant Leads | Direct Dial Contacts | Decision-Maker Data

    Need Proof? Contact us to see why Fortune 500 companies and startups alike trust Forager.ai for mission-critical outreach.

  14. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonymous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    csv, txt. Available download formats
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index - Global index linking code blocks to markup_data.csv.
    • kernel_id - Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id - Position of the code block within the notebook.
    • code_block - The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id - Identifier for the Kaggle Jupyter notebook.
    • kaggle_score - Performance metric of the notebook.
    • kaggle_comments - Number of comments on the notebook.
    • kaggle_upvotes - Number of upvotes the notebook received.
    • kernel_link - URL to the notebook.
    • comp_name - Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name - Name of the Kaggle competition.
    • description - Overview of the competition task.
    • data_type - Type of data used in the competition.
    • comp_type - Classification of the competition.
    • subtitle - Short description of the task.
    • EvaluationAlgorithmAbbreviation - Metric used for assessing competition submissions.
    • data_sources - Links to datasets used.
    • metric type - Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block - Machine learning code block.
    • too_long - Flag indicating whether the block spans multiple semantic types.
    • marks - Confidence level of the annotation.
    • graph_vertex_id - ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
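    The table links described above can be sketched with pandas merges. The miniature frames below are invented stand-ins (two code blocks, one kernel, one competition); only the column names are taken from the tables above:

```python
import pandas as pd

# Invented miniature versions of the corpus tables; column names follow
# the dataset description, the row values are illustrative only.
code_blocks = pd.DataFrame({
    "code_blocks_index": [0, 1],
    "kernel_id": ["kernel-a", "kernel-a"],
    "code_block_id": [0, 1],
    "code_block": ["import pandas as pd", "df = pd.read_csv('train.csv')"],
})
kernels_meta = pd.DataFrame({
    "kernel_id": ["kernel-a"],
    "kaggle_score": [0.97],
    "comp_name": ["some-competition"],
})
competitions_meta = pd.DataFrame({
    "comp_name": ["some-competition"],
    "comp_type": ["classification"],
})

# code_blocks.csv -> kernels_meta.csv via the kernel_id column
blocks_with_kernels = code_blocks.merge(kernels_meta, on="kernel_id")

# kernels_meta.csv -> competitions_meta.csv via the comp_name column
full = blocks_with_kernels.merge(competitions_meta, on="comp_name")
```

    With the real CSVs the same two `merge` calls apply after `pd.read_csv` on each file.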

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the help of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  15. DGE GO Enrichment Analysis Microarray Data GDS2778

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Dr. Nagendra (2025). DGE GO Enrichment Analysis Microarray Data GDS2778 [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/dge-go-enrichment-analysis-microarray-data-gds2778
    Explore at:
    zip(6820264 bytes)Available download formats
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is based on National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) DataSet accession GDS2778.

    The dataset originates from a microarray experiment measuring global gene expression under specific experimental conditions.

    Raw and processed expression data (for all probes/genes) are included, enabling downstream analysis such as normalization, differential expression, and clustering.

    The dataset has been used to perform differential gene expression (DGE) analysis to identify genes that are up- or down-regulated under the experimental condition compared to control.

    Data processing steps typically include normalization (e.g., log-transformation), quality control, probe-to-gene mapping, and statistical testing for significance (e.g., using packages such as limma or other DGE tools).

    Resulting differentially expressed genes (DEGs) include statistics such as log fold change (logFC), adjusted p‑values (adj.P.Val), and possibly other metrics (e.g., B-statistic), allowing assessment of both magnitude and significance of changes.

    The dataset also includes a visualization file (heatmap image) that displays expression patterns of DEGs (or top variable genes) across samples — enabling clustering and pattern recognition across samples and genes.

    The heatmap helps illustrate sample-wise and gene-wise expression variation: clustering groups together samples (e.g. control vs treatment) and genes with similar expression dynamics.

    This dataset is suitable for further bioinformatics analysis: e.g. functional enrichment (GO/Pathway), co‑expression analysis, gene signature identification, or integration with other datasets.

    Users who download this dataset can reproduce or extend analyses, such as re-normalization, alternative clustering, custom DEG thresholds, or downstream biological interpretation (pathway, network analysis).
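    The analysis steps above (log-scale expression, per-gene testing, fold changes, adjusted p-values) can be sketched in Python on synthetic data. This is a simplified analogue of a limma-style workflow, not the actual pipeline used for GDS2778; every number below is invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_genes, n_per_group = 200, 6

# Synthetic log2-scale expression: rows = genes, columns = samples.
ctrl = rng.normal(8.0, 1.0, size=(n_genes, n_per_group))
treat = rng.normal(8.0, 1.0, size=(n_genes, n_per_group))
treat[:10] += 3.0  # spike in 10 clearly up-regulated genes

# log fold change (difference of means on the log2 scale)
logfc = treat.mean(axis=1) - ctrl.mean(axis=1)

# Per-gene two-sample t-test (a stand-in for limma's moderated test)
_, pvals = stats.ttest_ind(treat, ctrl, axis=1)

# Benjamini-Hochberg adjustment -> analogue of limma's adj.P.Val
order = np.argsort(pvals)
scaled = pvals[order] * n_genes / (np.arange(n_genes) + 1)
adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
adj_pvals = np.empty(n_genes)
adj_pvals[order] = np.clip(adj_sorted, 0.0, 1.0)

# DEGs: significant after adjustment and with |logFC| > 1
degs = np.where((adj_pvals < 0.05) & (np.abs(logfc) > 1.0))[0]
```

    Re-running with different DEG thresholds (the 0.05 and 1.0 cut-offs above) is one of the "custom DEG thresholds" extensions the description mentions.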

  16. Data from: Enriching the ant tree of life: enhanced UCE bait set for genome-scale phylogenetics of ants and other Hymenoptera

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Feb 14, 2017
    Michael G. Branstetter; John T. Longino; Philip S. Ward; Brant C. Faircloth (2017). Enriching the ant tree of life: enhanced UCE bait set for genome-scale phylogenetics of ants and other Hymenoptera [Dataset]. http://doi.org/10.5061/dryad.89n87
    Explore at:
    zipAvailable download formats
    Dataset updated
    Feb 14, 2017
    Dataset provided by
    Dryad
    Authors
    Michael G. Branstetter; John T. Longino; Philip S. Ward; Brant C. Faircloth
    Time period covered
    Feb 12, 2017
    Area covered
    Global
    Description
    1. Targeted enrichment of conserved genomic regions (e.g., ultraconserved elements or UCEs) has emerged as a promising tool for inferring evolutionary history in many organismal groups. Because the UCE approach is still relatively new, much remains to be learned about how best to identify UCE loci and design baits to enrich them.

    2. We test an updated UCE identification and bait design workflow for the insect order Hymenoptera, with a particular focus on ants. The new strategy augments a previous bait design for Hymenoptera by (a) changing the parameters by which conserved genomic regions are identified and retained, and (b) increasing the number of genomes used for locus identification and bait design. We perform in vitro validation of the approach in ants by synthesizing an ant-specific bait set that targets UCE loci and a set of “legacy” phylogenetic markers. Using this bait set, we generate new data for 84 taxa (16/17 ant subfamilies) and extract loci from an additional 17 genome-e...

  17. Photovoltaik Installations by Zipcode in Germany

    • kaggle.com
    zip
    Updated Dec 29, 2022
    Klaus Langemann (2022). Photovoltaik Installations by Zipcode in Germany [Dataset]. https://www.kaggle.com/datasets/klauslangemann/data-by-zip-code-germany
    Explore at:
    zip(19547881 bytes)Available download formats
    Dataset updated
    Dec 29, 2022
    Authors
    Klaus Langemann
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    Germany
    Description

    This dataset compiles, in two CSV files (one with absolute values and one min-max-scaled), a variety of data for 8,170 different zip code areas of Germany. Examples of such data include average sunshine hours per year, average annual income per person, number of crimes committed per year, percentage of the population below and above 60 years of age, and share of voters for the Green party.
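    Producing the min-max-scaled variant from the absolute one can be sketched as below; the column names and values are hypothetical stand-ins, not the dataset's actual schema:

```python
import pandas as pd

# Hypothetical zip-code-level data (invented values).
df = pd.DataFrame({
    "zipcode": ["10115", "20095", "80331"],
    "sunshine_hours": [1650.0, 1600.0, 1800.0],
    "avg_income": [31000.0, 34000.0, 38000.0],
})

# Min-max scale every numeric column to [0, 1],
# leaving the zip code identifier untouched.
numeric = df.select_dtypes("number")
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())
df_scaled = pd.concat([df[["zipcode"]], scaled], axis=1)
```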

  18. Drawings Gemma-Enriched Dataset. Fotothek - Bibliotheca Hertziana

    • edmond.mpg.de
    zip
    Updated Nov 24, 2025
    Pietro Maria Liuzzo; Pietro Maria Liuzzo (2025). Drawings Gemma-Enriched Dataset. Fotothek - Bibliotheca Hertziana [Dataset]. http://doi.org/10.17617/3.1GN3OL
    Explore at:
    zip(458644918)Available download formats
    Dataset updated
    Nov 24, 2025
    Dataset provided by
    Edmond
    Authors
    Pietro Maria Liuzzo; Pietro Maria Liuzzo
    License

    https://edmond.mpg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.17617/3.1GN3OLhttps://edmond.mpg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.17617/3.1GN3OL

    Description

    Zeichnungen Dataset - AI-Enhanced Art Historical Descriptions with Iconography

    This dataset contains 224x224 images and related metadata extracted from the MIDAS XML of the Catalogue of the Photographic Collection of the Bibliotheca Hertziana, enriched with AI-generated prose texts and iconographic analysis. The dataset is limited to photographs of objects classified as drawings (Zeichnungen) and has been processed using the Google Gemma 2 9B Instruct large language model on the KISSKI HPC cluster of the GWDG. Scripts to process the data on KISSKI were elaborated with Claude Code in Visual Studio Code.

    Dataset Overview

    Source data:
    • Original dataset: zeichnungen.tsv (30,000 rows / 29,999 data rows)
    • Extracted from: MIDAS XML format (combined.xml)
    • Source institution: Bibliotheca Hertziana - Max Planck Institute for Art History
    • Image repository: Fotothek der Bibliotheca Hertziana

    Output:
    • Enriched metadata: TSV files with AI-generated German and English descriptions
    • Iconographic analysis: descriptions based on ICONCLASS classification
    • 224x224 images downloaded from the IIIF Image API of the Photographic Collection

    Processing Pipeline

    1. Data Extraction. Source data was extracted with zeichnungen.xql from the MIDAS XML file combined.xml, which contains structured art historical metadata including object titles and descriptions (textobj, textfoto), artist information (aob30), location data (aob26, aob28), ICONCLASS codes (a5500, the standardized iconographic classification), dating and provenance, and image references (a8540). The set was limited to 30,000 entries.

    2. ICONCLASS Cache Preparation. ICONCLASS system: ICONCLASS.org, a multilingual classification system for cultural content; GitHub repository: https://github.com/iconclass/data

    Images Download. 224x224 images were downloaded in advance from the IIIF service based on gemalde.tsv. The script performing AI text enrichment from the metadata checks that the image has been downloaded, so every row of the output data is guaranteed to have a matching image. 28,165 images were downloaded for the 29,999 rows; the shortfall is due to known missing digital images. The dataset corresponds to published data, and each row records the licence and accessibility of the single image, plus the date of creation and last update of the catalogue object.

    3. AI Text Generation. Model used: Google Gemma 2 9B Instruct (9 billion parameters, FP16 with no quantization, 8,192-token context window, Gemma Terms of Use license).

    Processing workflow:
    • Input cleaning: removal of numeric codes, normalization of Unicode characters, increased CSV field size limit (10 MB)
    • Paragraph generation: German text from structured metadata
    • ICONCLASS lookup: offline cache-based iconographic description retrieval
    • Iconographic synthesis: AI-generated description from ICONCLASS codes
    • Translation: German → English

    Categories processed: paragraph foto DE/EN (photograph description), paragraph obj DE/EN (object/artwork description), paragraph verwalter DE/EN (collection/custodian information), paragraph standort DE/EN (location information), paragraph iconclass DE/EN (iconographic content description, new).

    AI Prompts Used

    Paragraph generation prompt:
      Convert the following structured information into a coherent text in German. The text contains field data that should be transformed into flowing prose while preserving all information.
      IMPORTANT:
      - Write a MAXIMUM of 2 paragraphs
      - Do NOT include any URLs or web links
      - Do NOT include reference codes or numerical codes
      - Do NOT add any comments or explanations
      - Only output the paragraph text itself
      Field: {field_name}
      Text: {cleaned_text}
      German text (maximum 2 paragraphs):

    ICONCLASS paragraph prompt:
      Based on the following Iconclass descriptions, write a brief German paragraph describing what the image depicts.
      Descriptions: {'; '.join(descriptions)}
      IMPORTANT:
      - Start with "Das Bild zeigt" or similar phrasing
      - Combine all descriptions into a flowing text
      - Maximum 1-2 sentences
      - Do NOT include iconclass codes or numbers
      - Do NOT include reference codes starting with "bh"
      - Only output the descriptive German text
      German description:

    Example ICONCLASS processing:
    • Input from data: a5500: 31 A 23 1 | 31 A 25 11 | 31 B 62 11
    • ICONCLASS lookup (from cache): 31 A 23 1 → "standing figure"; 31 A 25 11 → "arm raised upward"; 31 B 62 11 → "looking upwards"
    • AI-generated output (DE): Das Bild zeigt eine stehende Figur mit erhobenem Arm, die nach oben blickt. (EN: The image shows a standing figure with raised arm, looking upwards.)

    Translation prompt:
      Translate the following German text to English. Preserve the meaning and style as much as possible.
      IMPORTANT:
      - Do NOT include any URLs or web...
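    The offline cache lookup and prompt assembly described above can be sketched as follows. The cache entries are the three codes from the worked example; the function names and the in-memory dict are illustrative assumptions, not the pipeline's actual code:

```python
# Toy stand-in for the offline ICONCLASS cache; the real cache is
# built from the iconclass/data dump.
ICONCLASS_CACHE = {
    "31A231": "standing figure",
    "31A2511": "arm raised upward",
    "31B6211": "looking upwards",
}

def lookup(a5500_field):
    """Resolve a pipe-separated a5500 field against the offline cache."""
    labels = []
    for raw in a5500_field.split("|"):
        key = raw.replace(" ", "")  # "31 A 23 1" -> "31A231"
        if key in ICONCLASS_CACHE:
            labels.append(ICONCLASS_CACHE[key])
    return labels

def iconclass_prompt(descriptions):
    """Assemble the iconography prompt from the resolved labels."""
    return (
        "Based on the following Iconclass descriptions, write a brief "
        "German paragraph describing what the image depicts.\n"
        f"Descriptions: {'; '.join(descriptions)}"
    )

descs = lookup("31 A 23 1 | 31 A 25 11 | 31 B 62 11")
prompt = iconclass_prompt(descs)
```

    The assembled prompt would then be sent to the Gemma model; codes missing from the cache are silently skipped in this sketch.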

  19. Data from: A target enrichment method for gathering phylogenetic information from hundreds of loci: an example from the Compositae

    • datasetcatalog.nlm.nih.gov
    • data.niaid.nih.gov
    • +2more
    Updated Jan 2, 2015
    + more versions
    Kozik, Alex; Dikow, Rebecca B.; Michelmore, Richard W.; Mandel, Jennifer R.; Funk, Vicki A.; Masalia, Rishi R.; Rieseberg, Loren H.; Burke, John M.; Staton, S. Evan (2015). A target enrichment method for gathering phylogenetic information from hundreds of loci: an example from the Compositae [Dataset]. http://doi.org/10.5061/dryad.gr93t
    Explore at:
    Dataset updated
    Jan 2, 2015
    Authors
    Kozik, Alex; Dikow, Rebecca B.; Michelmore, Richard W.; Mandel, Jennifer R.; Funk, Vicki A.; Masalia, Rishi R.; Rieseberg, Loren H.; Burke, John M.; Staton, S. Evan
    Description

    Premise of the study: The Compositae (Asteraceae) are a large and diverse family of plants, and the most comprehensive phylogeny to date is a meta-tree based on 10 chloroplast loci that has several major unresolved nodes. We describe the development of an approach that enables the rapid sequencing of large numbers of orthologous nuclear loci to facilitate efficient phylogenomic analyses. Methods and Results: We designed a set of sequence capture probes that target conserved orthologous sequences in the Compositae. We also developed a bioinformatic and phylogenetic workflow for processing and analyzing the resulting data. Application of our approach to 15 species from across the Compositae resulted in the production of phylogenetically informative sequence data from 763 loci and the successful reconstruction of known phylogenetic relationships across the family. Conclusions: These methods should be of great use to members of the broader Compositae community, and the general approach should also be of use to researchers studying other families.

  20. Goodreads Spoilers

    • kaggle.com
    zip
    Updated Oct 30, 2023
    Ahmad (2023). Goodreads Spoilers [Dataset]. https://www.kaggle.com/datasets/pypiahmad/goodreads-book-reviews
    Explore at:
    zip(6776059711 bytes)Available download formats
    Dataset updated
    Oct 30, 2023
    Authors
    Ahmad
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Goodreads Spoilers dataset comprises a trove of reviews from the Goodreads book review platform, with a special emphasis on annotated "spoiler" information in each review. This dataset is an invaluable asset for those keen on delving into spoiler detection, sentiment analysis related to spoilers, and understanding user behavior in the context of revealing or discussing plot twists.

    Basic Statistics:

    • Books: 25,475
    • Users: 18,892
    • Reviews: 1,378,033

    Metadata:

    • Reviews: The text of the reviews provided by users.
    • Ratings: Ratings assigned to books by users.
    • Spoilers: Annotated spoilers within the review text.
    • Additionally, metadata from the complete Goodreads dataset can be utilized to enrich analysis.

    Example (spoiler data):

      {
        'user_id': '01ec1a320ffded6b2dd47833f2c8e4fb',
        'timestamp': '2013-12-28',
        'review_sentences': [
          [0, 'First, be aware that this book is not for the faint of heart.'],
          [0, 'Human trafficking, drugs, kidnapping, abuse in all forms - this story contains all of this and more.'],
          ...,
          [0, '(ARC provided by the author in return for an honest review.)']
        ],
        'rating': 5,
        'has_spoiler': False,
        'book_id': '18398089',
        'review_id': '4b3ffeaf14310ac6854f140188e191cd'
      }
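    A minimal sketch of consuming such records in Python, assuming the gzipped JSON-lines layout and the field names shown in the example (the helper names are invented). Each entry in review_sentences pairs a 0/1 spoiler flag with the sentence text:

```python
import gzip
import json

def spoiler_sentences(review):
    """Sentences flagged as spoilers (flag == 1) in one parsed review."""
    return [text for flag, text in review["review_sentences"] if flag == 1]

def iter_reviews(path):
    """Stream parsed reviews from a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

# Record mirroring the example above (sentences abridged; all flags
# are 0, consistent with has_spoiler being False).
example = {
    "review_sentences": [
        [0, "First, be aware that this book is not for the faint of heart."],
        [0, "(ARC provided by the author in return for an honest review.)"],
    ],
    "rating": 5,
    "has_spoiler": False,
    "book_id": "18398089",
}
flagged = spoiler_sentences(example)  # empty for this spoiler-free review
```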

    Use Cases:

    • Spoiler Detection: Developing algorithms to automatically detect spoilers in review text.
    • Sentiment Analysis: Analyzing the sentiment of reviews and examining how the presence of spoilers affects sentiment.
    • User Behavior Analysis: Understanding how users interact with books that have spoilers and how they disclose such information in reviews.
    • Natural Language Processing: Training models to understand and process user-generated text which contains spoilers.

    Citation: Please cite the following if you use the data: Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley. Fine-grained spoiler detection from large-scale review corpora. ACL, 2019. [PDF](https://cseweb.ucsd.edu/~jmcauley/pdfs/acl19a.pdf)

    Code Samples: The datasets are accompanied by a series of code samples housed in the dataset's GitHub repository:

    • Downloading datasets without GUI: A notebook to facilitate dataset downloading sans graphical user interface.
    • Displaying sample records: A notebook to showcase sample records from the dataset.
    • Calculating basic statistics: A notebook to calculate and understand basic statistics of the dataset.
    • Exploring the interaction data: A notebook to explore interaction data and understand user-book interactions.
    • Exploring the review data: A notebook to delve into the review data and extract insights from user reviews.

    Datasets:

    Meta-Data of Books:

    • Detailed Book Graph (goodreads_books.json.gz): A comprehensive graph detailing around 2.3 million books, acting as a rich source of book attributes and metadata.
    • Detailed Information of Authors (goodreads_book_authors.json.gz):
      • An extensive dataset containing detailed information about book authors, essential for understanding author-centric trends and insights.
      • Download Link
    • Detailed Information of Works (goodreads_book_works.json.gz):
      • This dataset provides abstract information about a book disregarding any particular editions, facilitating a high-level understanding of each work.
      • Download Link
    • Detailed Information of Book Series (goodreads_book_series.json.gz):
      • A dataset encompassing detailed information about book series, aiding in understanding series-related trends and insights. Note that the series id included here cannot be used for URL hack.
      • [Download Link](https://d...