Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a set of example data for a functional enrichment tutorial.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table contains names, positions, and references for the samples contained in the sequence dataset and whether Prokaryotes and/or Eukaryotes were analyzed from the sample in this study. (CSV 3 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Studies using health administrative databases (HAD) may lead to biased results, since information on potential confounders is often missing. Methods that integrate confounder data from cohort studies, such as multivariate imputation by chained equations (MICE) and two-stage calibration (TSC), aim to reduce confounding bias. We provide new insights into their behavior under different deviations from representativeness of the cohort.
Methods: We conducted an extensive simulation study to assess the performance of these two methods under different deviations from representativeness of the cohort. We illustrate these approaches by studying the association between benzodiazepine use and fractures in the elderly, using the general sample of French health insurance beneficiaries (EGB) as the main database and two French cohorts (Paquid and 3C) as validation samples.
Results: When the cohort was representative of the same population as the HAD, the two methods were unbiased. TSC was more efficient and faster, but its variance could be slightly underestimated when confounders were non-Gaussian. If the cohort was a subsample of the HAD (internal validation), with the probability of a subject being included in the cohort depending on both exposure and outcome, MICE was unbiased while TSC was biased. Both methods appeared biased when the inclusion probability in the cohort depended on unobserved confounders.
Conclusion: When choosing the most appropriate method, epidemiologists should consider the origin of the cohort (internal or external validation) as well as the (anticipated or observed) selection biases of the validation sample.
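For readers unfamiliar with MICE, here is a minimal, hypothetical Python sketch of the chained-equation idea: impute a partially observed confounder several times and pool the exposure estimate. It uses scikit-learn's IterativeImputer and is only an illustration of the general approach, not the authors' method or data; all variable names and numbers are invented.

```python
# Hypothetical sketch of MICE-style multiple imputation for a partially observed confounder,
# followed by naive pooling of the exposure coefficient (full Rubin's rules would also pool variances).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
confounder = rng.normal(size=n)                                    # confounder observed only in the cohort
exposure = rng.binomial(1, 1 / (1 + np.exp(-confounder)))          # e.g. benzodiazepine use
outcome = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * exposure + confounder))))  # e.g. fracture

df = pd.DataFrame({"exposure": exposure, "outcome": outcome, "confounder": confounder})
df.loc[rng.random(n) > 0.2, "confounder"] = np.nan                 # confounder missing for most HAD subjects

coefs = []
for m in range(5):  # five imputed datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    imputed = df.copy()
    imputed[:] = imputer.fit_transform(df)
    fit = LogisticRegression().fit(imputed[["exposure", "confounder"]], imputed["outcome"])
    coefs.append(fit.coef_[0][0])

print("pooled exposure log-odds ratio:", np.mean(coefs))
```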
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
I wanted to make some geospatial visualizations to convey the current severity of COVID-19 in different parts of the U.S.
I liked the NYTimes COVID dataset, but it lacked county boundary shape data, population per county, new cases/deaths per day, per capita calculations, and county demographics.
After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in Python, I wanted to open-source the final enriched dataset to give others a head start in their COVID-19 analysis, modeling, and visualization efforts.
This dataset is enriched with county shapes, county center point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per-day cases/deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.
UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.
Example uses:
- Geospatial analysis and visualization: which counties are currently getting hit the hardest (per capita and totals)?
- What patterns are there in the spread of the virus across counties (e.g., network-based spread simulations using county center lat/lons)?
- Do county population densities play a role in how quickly the virus spreads?
- How do a specific county's or state's cases and deaths compare to other counties/states?
- Join with other county-level datasets easily (via the fips code column), as in the sketch after the example visualization below.
See the column descriptions for more details on the dataset
COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)
https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif?raw=true
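As a hedged illustration of the kind of join the fips code column enables, here is a small pandas sketch. The file names and column names (cases, deaths, population, date, county, state) are assumptions and may not match the published schema exactly; consult the column descriptions above for the real names.

```python
# Hypothetical sketch: join the enriched COVID-19 county data with another county-level
# dataset on the fips code and compute a per-capita metric. File/column names are assumed.
import pandas as pd

covid = pd.read_csv("enriched_covid19_counties.csv", dtype={"fips": str})   # assumed filename
hospitals = pd.read_csv("county_hospital_beds.csv", dtype={"fips": str})    # some other county-level table

merged = covid.merge(hospitals, on="fips", how="left")

# Per-capita style metric (cases per 100k residents), assuming these columns exist
merged["cases_per_100k"] = 1e5 * merged["cases"] / merged["population"]

# Latest snapshot per county, then the ten hardest-hit counties per capita
latest = merged.sort_values("date").groupby("fips").tail(1)
print(latest.nlargest(10, "cases_per_100k")[["county", "state", "cases_per_100k"]])
```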
Code is compatible with Matlab v2020. The corresponding open-source alternative is Octave (https://octave.org/).
This dataset accompanies planned publication 'Near-Ridge Magmatism Constrained Using 40Ar/39Ar Dating of Enriched MORB from the 8°20' N Seamount Chain'. The Ar/Ar data are for samples that record the volcanic history of the area. The geochronology provides time constraints for the eruption of rocks studied in the manuscript. Samples were collected from the 8°20' N seamount chain by Molly Anderson (University of Florida), who sent them to the USGS Denver Argon Geochronology Laboratory for Ar/Ar analysis.
https://brightdata.com/license
Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions.

Dataset Features
- Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month.
- Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records.
- Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and the job market dynamics.

Customizable Subsets for Specific Needs
Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications.

Popular Use Cases
- Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data.
- Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities.
- Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies.
- Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis.
- AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.
Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.
IQ-TREE v2.1.3 (data matrix - FASTA file); UNIX/command line or a text editor for viewing (FASTQ files - raw data); FigTree (tree file - .treefile); BBEdit (partition files - Nexus)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary File S1. The R source codes of the MGSEA program, a toy example dataset, and a brief explanation for running the program. (ZIP 1832 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides example single-cell V(D)J sequencing files used by the vdjHopper R package. The data are intended for demonstration and teaching purposes only, allowing users to test the package’s decontamination, IgBLAST integration, and chain pairing workflows without requiring large raw datasets.
The dataset contains representative files derived from the 10x Genomics NSCLC tumor TCR enrichment dataset.
All files have been subset or downsampled to reduce size and ensure the total archive remains <5 MB.
all_contig_annotations.csv.gz
Filtered contig annotations (CSV, compressed). Contains selected columns such as barcode, chain, cdr3, v_gene, and j_gene.
all_contig.fasta.gz
Representative TCR sequences in FASTA format.
all_contig.fastq.gz
A small subset of raw sequencing reads in FASTQ format, provided for demonstration only.
all_contig.bam
Full BAM alignment file (~3.5 GB). This file is not included in the CRAN package build but can be downloaded from this Zenodo record if required for advanced tutorials. Users should call vdjHopper::fetch_example_data() to retrieve and cache this file programmatically.
Original dataset: 10x Genomics – NSCLC tumor TCR enrichment dataset
License: CC BY 4.0
When using this data, please cite both 10x Genomics (the original data source) and this Zenodo record.
The live music data collected by Teosto is the largest and most comprehensive in Finland. The data opened through the open interface now includes all live gigs announced to Teosto in Finland last year (2014): the dates of the gigs, the venues with their location and coordinates, the performers, the songs performed, and the authors of the songs.
We challenge developers to enrich live music spatial data and develop new, innovative uses for it. Examples of data enrichment include combining other open spatial datasets with event data or music-related metadata with song-specific data.
The development of live data is part of the Open Finland Challenge competition and the Ultrahack event.
https://edmond.mpg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.17617/3.Z8W2JR
Gemälde Dataset - AI-Enhanced Art Historical Descriptions

This dataset contains 224x224 images and associated metadata extracted from the MIDAS XML of the Catalogue of the Photographic Collection of the Bibliotheca Hertziana, enriched with AI-generated prose texts. The dataset is limited to photographs of objects classified as painting (Gemälde), and has been processed using the Google Gemma 2 9B Instruct large language model on the KISSKI HPC cluster of the GWDG. Scripts to process the data on KISSKI were developed with Claude Code in Visual Studio Code.

Dataset Overview

Source data:
- Original dataset: gemalde.tsv (19,051 rows)
- Extracted from: MIDAS XML format (combined.xml)
- Institution: Photographic Collection, Bibliotheca Hertziana - Max Planck Institute for Art History
- Photographic Collection Catalogue: Fotothek der Bibliotheca Hertziana

Output:
- Enriched metadata: TSV files with AI-generated German and English descriptions
- 224x224 images downloaded from the IIIF Image API of the Photographic Collection

Processing Pipeline

1. Data Extraction
Source data was extracted with gemalde.xql from the MIDAS XML file combined.xml, containing structured art historical metadata including:
- Object titles and descriptions (textobj, textfoto)
- Artist information (aob30)
- Location data (aob26, aob28)
- Dating and provenance
- Image references (a8540)

Images Download
224x224 images were downloaded in advance from the IIIF Service based on gemalde.tsv. The script performing AI text enrichment from the metadata checks that the image has been downloaded, so the output data has a 100% certainty of having a matching image. 17,657 images were downloaded out of 19,051 rows; the difference is due to known missing digital images. The dataset corresponds to published data, and each row contains the licence and accessibility of the single image, plus the date of creation and last update of the catalogue object.

2. AI Text Generation
Model used:
- Name: Google Gemma 2 9B Instruct
- Parameters: 9 billion
- Quantization: FP16 (no quantization)
- Context window: 8,192 tokens
- License: Gemma Terms of Use

Processing workflow:
- Input cleaning: removal of numeric codes, normalization of Unicode characters
- Paragraph generation: German text from structured metadata
- Translation: German → English

Categories processed:
- paragraph foto DE/EN - Photograph description
- paragraph obj DE/EN - Object/artwork description
- paragraph verwalter DE/EN - Collection/custodian information
- paragraph standort DE/EN - Location information

AI Prompts Used

Paragraph Generation Prompt:
Convert the following structured information into a coherent text in German. The text contains field data that should be transformed into flowing prose while preserving all information.
IMPORTANT:
- Write a MAXIMUM of 2 paragraphs
- Do NOT include any URLs or web links
- Do NOT include reference codes or numerical codes
- Do NOT add any comments or explanations
- Only output the paragraph text itself
Field: {field_name}
Text: {cleaned_text}
German text (maximum 2 paragraphs):

Example Input:
Field: textobj
Text: Bildnis Filippo Neri Hl. Filippo Neri geboren 1515 Florenz gestorben 1595 Rom Priester Ordensgründer Gründer Oratorium Kongregation des Oratoriums

Example Output:
Filippo Neri, geboren 1515 in Florenz und gestorben 1595 in Rom, war ein Priester und bedeutender Ordensgründer. Er gründete das Oratorium und die Kongregation des Oratoriums, die bis heute eine wichtige Rolle in der katholischen Kirche spielen.

Translation Prompt:
Translate the following German text to English. Preserve the meaning and style as much as possible.
IMPORTANT:
- Do NOT include any URLs or web links in the translation
- Do NOT include reference codes starting with "bh" followed by numbers
- Do NOT include numerical codes like 08012353
- Do NOT add any comments or explanations
- Only output the translated text itself
German text: {text}
English translation:

Example Translation:
Input (DE): Filippo Neri, geboren 1515 in Florenz und gestorben 1595 in Rom, war ein Priester und bedeutender Ordensgründer.
Output (EN): Filippo Neri, born 1515 in Florence and died 1595 in Rome, was a priest and important founder of a religious order.

KISSKI Cluster Resources

Hardware configuration:
- GPU: NVIDIA A100 (80GB VRAM)
- Architecture: Ampere
- Tensor Cores: 432
- FP16 performance: ~312 TFLOPS
- Memory bandwidth: 2 TB/s

Allocation per job:
- GPUs: 1× A100
- CPUs: 4 cores
- RAM: 64 GB
- Time limit: 6 hours per job

Job array configuration:
- Total jobs: 38 (indices 0-37)
- Chunk size: 500 rows per job
- Parallel jobs: 10 simultaneous
- Total rows processed: 19,000 (rows 0-18,999)

Performance metrics:
- AI operations per row: 4 paragraph generations (foto, obj, verwalter, standort) and 4 translations (DE → EN), for a total of 8 LLM inference calls per row

Resource consumption:
- GPU hours: ~125 GPU hours total (38 jobs × 3.3 hours)
- Model size in memory: ~18 GB (FP16)
- Peak VRAM usage: ~25 GB per job

Output structure:
data_gemalde/
├── enriched_data/
│   ├── data_0-499.tsv    # Rows 0-499
│   ├── data_500-999.tsv  # ...
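For orientation, here is a minimal, hypothetical Python sketch of how the paragraph-generation prompt above could be driven with the Hugging Face transformers library. It is not the authors' KISSKI script; the generation settings are illustrative, and only the model name and prompt text follow the description above.

```python
# Hypothetical sketch: apply the documented paragraph-generation prompt with Hugging Face transformers.
# Not the authors' KISSKI pipeline; generation parameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2-9b-it"  # Gemma 2 9B Instruct, run in FP16 as described above

PROMPT_TEMPLATE = (
    "Convert the following structured information into a coherent text in German. "
    "The text contains field data that should be transformed into flowing prose while "
    "preserving all information.\n"
    "IMPORTANT:\n"
    "- Write a MAXIMUM of 2 paragraphs\n"
    "- Do NOT include any URLs or web links\n"
    "- Do NOT include reference codes or numerical codes\n"
    "- Do NOT add any comments or explanations\n"
    "- Only output the paragraph text itself\n\n"
    "Field: {field_name}\nText: {cleaned_text}\n\n"
    "German text (maximum 2 paragraphs):"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")

def generate_paragraph(field_name: str, cleaned_text: str) -> str:
    """Generate a German prose paragraph for one metadata field."""
    prompt = PROMPT_TEMPLATE.format(field_name=field_name, cleaned_text=cleaned_text)
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}], return_tensors="pt", add_generation_prompt=True
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=300, do_sample=False)
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True).strip()

# Example call mirroring the documented textobj example
print(generate_paragraph("textobj", "Bildnis Filippo Neri Hl. Filippo Neri geboren 1515 Florenz ..."))
```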
Global B2B Mobile Phone Number Database | 100M+ Verified Contacts | 95% Accuracy

Forager.ai provides the world’s most reliable mobile phone number data for businesses that refuse to compromise on quality. With 100 million+ professionally verified mobile numbers refreshed every 3 weeks, our database ensures 95% accuracy – so your teams never waste time on dead-end leads.
Why Our Data Wins

✅ Accuracy You Can Trust: 95% of mobile numbers are verified against live carrier records and tied to current job roles. Say goodbye to “disconnected number” voicemails.

✅ Depth Beyond Digits: Each contact includes 150+ data points:
Direct mobile numbers
Current job title, company, and department
Full career history + education background
Location data + LinkedIn profiles
Company size, industry, and revenue
✅ Freshness Guaranteed: Bi-weekly updates combat job-hopping and role changes – critical for sales teams targeting decision-makers.

✅ Ethically Sourced & Compliant: First-party collected data with full GDPR/CCPA compliance.
Who Uses This Data?
Sales Teams: Cold-call C-suite prospects with verified mobile numbers.
Marketers: Run hyper-personalized SMS/WhatsApp campaigns.
Recruiters: Source passive candidates with up-to-date contact intel.
Data Vendors: License premium datasets to enhance your product.
Tech Platforms: Power your SaaS tools via API with enterprise-grade B2B data.
Flexible Delivery, Instant Results
API (REST): Real-time integration for CRMs, dialers, or marketing stacks
CSV/JSON: Campaign-ready files.
PostgreSQL: Custom databases for large-scale enrichment
Compliance: Full audit trails + opt-out management
Why Forager.ai?
→ Proven ROI: Clients see 62% higher connect rates vs. industry averages (request case studies).
→ No Guesswork: Test-drive free samples before committing.
→ Scalable Pricing: Pay per record, license datasets, or get unlimited API access.
B2B Mobile Phone Data | Verified Contact Database | Sales Prospecting Lists | CRM Enrichment | Recruitment Phone Numbers | Marketing Automation | Phone Number Datasets | GDPR-Compliant Leads | Direct Dial Contacts | Decision-Maker Data
Need Proof? Contact us to see why Fortune 500 companies and startups alike trust Forager.ai for mission-critical outreach.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure

| Column | Description |
| --- | --- |
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |

Table 2. kernels_meta.csv structure

| Column | Description |
| --- | --- |
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |

Table 3. competitions_meta.csv structure

| Column | Description |
| --- | --- |
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |

Table 4. markup_data.csv structure

| Column | Description |
| --- | --- |
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example:
code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column, and kernels_meta.csv to competitions_meta.csv via comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
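A hedged pandas sketch of this mapping, assuming the CSV files are available locally under the names given in the tables above:

```python
# Hypothetical sketch: link Code4ML tables via kernel_id and comp_name. File paths are assumptions.
import pandas as pd

code_blocks = pd.read_csv("code_blocks.csv")
kernels = pd.read_csv("kernels_meta.csv")
competitions = pd.read_csv("competitions_meta.csv")

# code block -> notebook metadata -> competition metadata
blocks_with_meta = (
    code_blocks
    .merge(kernels, on="kernel_id", how="inner")
    .merge(competitions, on="comp_name", how="left")
)
print(blocks_with_meta[["code_block_id", "kernel_id", "comp_name", "kaggle_score"]].head())
```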
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the help of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is based on National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) DataSet accession GDS2778.
The dataset originates from a microarray experiment measuring global gene expression under specific experimental conditions.
Raw and processed expression data (for all probes/genes) are included, enabling downstream analysis such as normalization, differential expression, and clustering.
The dataset has been used to perform differential gene expression (DGE) analysis to identify genes that are up- or down-regulated under the experimental condition compared to control.
Data processing steps typically include normalization (e.g., log-transformation), quality control, probe-to-gene mapping, and statistical testing for significance (e.g., using packages such as limma or other DGE tools).
Resulting differentially expressed genes (DEGs) include statistics such as log fold change (logFC), adjusted p‑values (adj.P.Val), and possibly other metrics (e.g., B-statistic), allowing assessment of both magnitude and significance of changes.
The dataset also includes a visualization file (heatmap image) that displays expression patterns of DEGs (or top variable genes) across samples — enabling clustering and pattern recognition across samples and genes.
The heatmap helps illustrate sample-wise and gene-wise expression variation: clustering groups together samples (e.g. control vs treatment) and genes with similar expression dynamics.
This dataset is suitable for further bioinformatics analysis: e.g. functional enrichment (GO/Pathway), co‑expression analysis, gene signature identification, or integration with other datasets.
Users who download this dataset can reproduce or extend analyses, such as re-normalization, alternative clustering, custom DEG thresholds, or downstream biological interpretation (pathway, network analysis).
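As an illustration of a typical downstream step a user might reproduce, the following hedged pandas sketch filters a DEG table by log fold change and adjusted p-value. The file name and exact column labels are assumptions modeled on limma-style output, not the dataset's documented layout.

```python
# Hypothetical sketch: filter differentially expressed genes by effect size and significance.
# File name and column labels (logFC, adj.P.Val) are assumed, following limma-style output.
import pandas as pd

deg = pd.read_csv("GDS2778_deg_table.csv")  # assumed export of the DGE results

up = deg[(deg["logFC"] >= 1) & (deg["adj.P.Val"] < 0.05)]
down = deg[(deg["logFC"] <= -1) & (deg["adj.P.Val"] < 0.05)]

print(f"up-regulated: {len(up)}, down-regulated: {len(down)}")
print(up.sort_values("adj.P.Val").head(10))
```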
Targeted enrichment of conserved genomic regions (e.g., ultraconserved elements or UCEs) has emerged as a promising tool for inferring evolutionary history in many organismal groups. Because the UCE approach is still relatively new, much remains to be learned about how best to identify UCE loci and design baits to enrich them.
We test an updated UCE identification and bait design workflow for the insect order Hymenoptera, with a particular focus on ants. The new strategy augments a previous bait design for Hymenoptera by (a) changing the parameters by which conserved genomic regions are identified and retained, and (b) increasing the number of genomes used for locus identification and bait design. We perform in vitro validation of the approach in ants by synthesizing an ant-specific bait set that targets UCE loci and a set of “legacy” phylogenetic markers. Using this bait set, we generate new data for 84 taxa (16/17 ant subfamilies) and extract loci from an additional 17 genome-e...
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset compiles, in two CSV files (one with absolute values and one min-max-scaled), a variety of data for 8,170 different zip code areas of Germany. Examples of such data include average sunshine hours per year, average annual income per person, number of crimes committed per year, percentage of the population below and above 60 years of age, and the share of voters for the Green Party.
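If one wanted to reproduce the scaled file from the absolute one, a minimal pandas sketch could look like this; the file names and index column are assumptions, not the dataset's documented schema.

```python
# Hypothetical sketch: derive a min-max-scaled table from the absolute-value table.
import pandas as pd

absolute = pd.read_csv("zip_code_features_absolute.csv", index_col="zip_code")  # assumed layout
scaled = (absolute - absolute.min()) / (absolute.max() - absolute.min())        # scale each column to [0, 1]
scaled.to_csv("zip_code_features_minmax.csv")
```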
https://edmond.mpg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.17617/3.1GN3OL
Zeichnungen Dataset - AI-Enhanced Art Historical Descriptions with Iconography

This dataset contains 224x224 images and associated metadata extracted from the MIDAS XML of the Catalogue of the Photographic Collection of the Bibliotheca Hertziana, enriched with AI-generated prose texts and iconographic analysis. The dataset is limited to photographs of objects classified as drawing (Zeichnungen), and has been processed using the Google Gemma 2 9B Instruct large language model on the KISSKI HPC cluster of the GWDG. Scripts to process the data on KISSKI were developed with Claude Code in Visual Studio Code.

Dataset Overview

Source data:
- Original dataset: zeichnungen.tsv (30,000 rows / 29,999 data rows)
- Extracted from: MIDAS XML format (combined.xml)
- Source institution: Bibliotheca Hertziana - Max Planck Institute for Art History
- Image repository: Fotothek der Bibliotheca Hertziana

Output:
- Enriched metadata: TSV files with AI-generated German and English descriptions
- Iconographic analysis: descriptions based on ICONCLASS classification
- 224x224 images downloaded from the IIIF Image API of the Photographic Collection

Processing Pipeline

1. Data Extraction
Source data was extracted with zeichnungen.xql from the MIDAS XML file combined.xml, containing structured art historical metadata including:
- Object titles and descriptions (textobj, textfoto)
- Artist information (aob30)
- Location data (aob26, aob28)
- ICONCLASS codes (a5500) - standardized iconographic classification
- Dating and provenance
- Image references (a8540)
The set was limited to 30,000 entries.

2. ICONCLASS Cache Preparation
ICONCLASS system:
- Source: ICONCLASS.org - multilingual classification system for cultural content
- GitHub repository: https://github.com/iconclass/data

Images Download
224x224 images were downloaded in advance from the IIIF Service based on gemalde.tsv. The script performing AI text enrichment from the metadata checks that the image has been downloaded, so the output data has a 100% certainty of having a matching image. 28,165 images were downloaded out of 29,999 rows; the difference is due to known missing digital images. The dataset corresponds to published data, and each row contains the licence and accessibility of the single image, plus the date of creation and last update of the catalogue object.

3. AI Text Generation
Model used:
- Name: Google Gemma 2 9B Instruct
- Parameters: 9 billion
- Quantization: FP16 (no quantization)
- Context window: 8,192 tokens
- License: Gemma Terms of Use

Processing workflow:
- Input cleaning: removal of numeric codes, normalization of Unicode characters, increased CSV field size limit (10 MB)
- Paragraph generation: German text from structured metadata
- ICONCLASS lookup: offline cache-based iconographic description retrieval
- Iconographic synthesis: AI-generated description from ICONCLASS codes
- Translation: German → English

Categories processed:
- paragraph foto DE/EN - Photograph description
- paragraph obj DE/EN - Object/artwork description
- paragraph verwalter DE/EN - Collection/custodian information
- paragraph standort DE/EN - Location information
- paragraph iconclass DE/EN - Iconographic content description (new)

AI Prompts Used

Paragraph Generation Prompt:
Convert the following structured information into a coherent text in German. The text contains field data that should be transformed into flowing prose while preserving all information.
IMPORTANT:
- Write a MAXIMUM of 2 paragraphs
- Do NOT include any URLs or web links
- Do NOT include reference codes or numerical codes
- Do NOT add any comments or explanations
- Only output the paragraph text itself
Field: {field_name}
Text: {cleaned_text}
German text (maximum 2 paragraphs):

ICONCLASS Paragraph Prompt:
Based on the following Iconclass descriptions, write a brief German paragraph describing what the image depicts.
Descriptions: {'; '.join(descriptions)}
IMPORTANT:
- Start with "Das Bild zeigt" or similar phrasing
- Combine all descriptions into a flowing text
- Maximum 1-2 sentences
- Do NOT include iconclass codes or numbers
- Do NOT include reference codes starting with "bh"
- Only output the descriptive German text
German description:

Example ICONCLASS processing:
- Input from data: a5500: 31 A 23 1 | 31 A 25 11 | 31 B 62 11
- ICONCLASS lookup (from cache): 31 A 23 1 → "standing figure"; 31 A 25 11 → "arm raised upward"; 31 B 62 11 → "looking upwards"
- AI-generated output (DE): Das Bild zeigt eine stehende Figur mit erhobenem Arm, die nach oben blickt.
- Translation (EN): The image shows a standing figure with raised arm, looking upwards.

Translation Prompt:
Translate the following German text to English. Preserve the meaning and style as much as possible.
IMPORTANT:
- Do NOT include any URLs or web...
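A minimal, hypothetical Python sketch of the offline ICONCLASS lookup step described above: the cache file name and format are assumptions, and only the handling of the pipe-separated a5500 codes follows the documented example.

```python
# Hypothetical sketch: resolve pipe-separated ICONCLASS codes from field a5500 against an
# offline cache (code -> label), producing the description list fed to the iconclass prompt.
import json

with open("iconclass_cache.json", encoding="utf-8") as fh:   # assumed cache built from iconclass/data
    cache = json.load(fh)                                    # e.g. {"31A231": "standing figure", ...}

def iconclass_descriptions(a5500: str) -> list[str]:
    """Split a5500 values such as '31 A 23 1 | 31 A 25 11' and look each code up in the cache."""
    descriptions = []
    for raw_code in a5500.split("|"):
        code = raw_code.replace(" ", "").strip()
        label = cache.get(code)
        if label:
            descriptions.append(label)
    return descriptions

print("; ".join(iconclass_descriptions("31 A 23 1 | 31 A 25 11 | 31 B 62 11")))
```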
Premise of the study: The Compositae (Asteraceae) are a large and diverse family of plants, and the most comprehensive phylogeny to date is a meta-tree based on 10 chloroplast loci that has several major unresolved nodes. We describe the development of an approach that enables the rapid sequencing of large numbers of orthologous nuclear loci to facilitate efficient phylogenomic analyses. Methods and Results: We designed a set of sequence capture probes that target conserved orthologous sequences in the Compositae. We also developed a bioinformatic and phylogenetic workflow for processing and analyzing the resulting data. Application of our approach to 15 species from across the Compositae resulted in the production of phylogenetically informative sequence data from 763 loci and the successful reconstruction of known phylogenetic relationships across the family. Conclusions: These methods should be of great use to members of the broader Compositae community, and the general approach should also be of use to researchers studying other families.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Goodreads Spoilers dataset embodies a trove of reviews from the Goodreads book review platform, with a special emphasis on annotated "spoiler" information from each review. This dataset is an invaluable asset for those keen on delving into spoiler detection, sentiment analysis related to spoilers, and understanding user behavior in the context of revealing or discussing plot twists.
Basic Statistics:
- Books: 25,475
- Users: 18,892
- Reviews: 1,378,033

Metadata:
- Reviews: The text of the reviews provided by users.
- Ratings: Ratings assigned to books by users.
- Spoilers: Annotated spoilers within the review text.
- (Additionally, metadata from the complete Goodreads dataset can be utilized to enrich analysis.)
Example (spoiler data):
```json
{
  'user_id': '01ec1a320ffded6b2dd47833f2c8e4fb',
  'timestamp': '2013-12-28',
  'review_sentences': [[0, 'First, be aware that this book is not for the faint of heart.'],
    [0, 'Human trafficking, drugs, kidnapping, abuse in all forms - this story contains all of this and more.'],
    ...,
    [0, '(ARC provided by the author in return for an honest review.)']],
  'rating': 5,
  'has_spoiler': False,
  'book_id': '18398089',
  'review_id': '4b3ffeaf14310ac6854f140188e191cd'
}
```
Use Cases:
- Spoiler Detection: Developing algorithms to automatically detect spoilers in review text.
- Sentiment Analysis: Analyzing the sentiment of reviews and examining how the presence of spoilers affects sentiment.
- User Behavior Analysis: Understanding how users interact with books that have spoilers and how they disclose such information in reviews.
- Natural Language Processing: Training models to understand and process user-generated text which contains spoilers.
Citation:
Please cite the following if you use the data:
Fine-grained spoiler detection from large-scale review corpora
Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley
ACL, 2019
[PDF](https://cseweb.ucsd.edu/~jmcauley/pdfs/acl19a.pdf)
Code Samples: The datasets are accompanied by a series of code samples housed in the dataset's GitHub repository. These code samples include:
- Downloading datasets without GUI: A notebook to facilitate dataset downloading sans graphical user interface.
- Displaying sample records: A notebook to showcase sample records from the dataset.
- Calculating basic statistics: A notebook to calculate and understand basic statistics of the dataset.
- Exploring the interaction data: A notebook to explore interaction data and understand user-book interactions.
- Exploring the review data: A notebook to delve into the review data and extract insights from user reviews.
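As a hedged illustration of the kind of loading and basic-statistics step those notebooks cover, the following Python sketch streams a gzipped JSON-lines review file; the file name is hypothetical and the fields follow the example record shown above.

```python
# Hypothetical sketch: stream a gzipped JSON-lines spoiler file and compute basic statistics.
# File name is assumed; fields follow the example record shown above.
import gzip
import json
from collections import Counter

n_reviews = 0
spoiler_counts = Counter()
with gzip.open("goodreads_reviews_spoiler.json.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        n_reviews += 1
        spoiler_counts[record["has_spoiler"]] += 1

print(f"reviews: {n_reviews}, with spoilers: {spoiler_counts[True]}")
```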
Datasets: