Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a set of example data for a functional enrichment tutorial.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table contains names, positions, and references for the samples contained in the sequence dataset and whether Prokaryotes and/or Eukaryotes were analyzed from the sample in this study. (CSV 3 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Studies using health administrative databases (HAD) may lead to biased results, since information on potential confounders is often missing. Methods that integrate confounder data from cohort studies, such as multivariate imputation by chained equations (MICE) and two-stage calibration (TSC), aim to reduce confounding bias. We provide new insights into their behavior under different deviations from representativeness of the cohort.
Methods: We conducted an extensive simulation study to assess the performance of these two methods under different deviations from representativeness of the cohort. We illustrate these approaches by studying the association between benzodiazepine use and fractures in the elderly, using the general sample of French health insurance beneficiaries (EGB) as the main database and two French cohorts (Paquid and 3C) as validation samples.
Results: When the cohort was representative of the same population as the HAD, the two methods were unbiased. TSC was more efficient and faster, but its variance could be slightly underestimated when confounders were non-Gaussian. If the cohort was a subsample of the HAD (internal validation), with the probability of a subject being included in the cohort depending on both exposure and outcome, MICE was unbiased while TSC was biased. Both methods appeared biased when the inclusion probability in the cohort depended on unobserved confounders.
Conclusion: When choosing the most appropriate method, epidemiologists should consider the origin of the cohort (internal or external validation) as well as the (anticipated or observed) selection biases of the validation sample.
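For readers unfamiliar with MICE, here is a minimal, hypothetical Python sketch of the chained-equation idea: impute a partially observed confounder several times and pool the exposure estimate. It uses scikit-learn's IterativeImputer and is only an illustration of the general approach, not the authors' method or data; all variable names and numbers are invented.

```python
# Hypothetical sketch of MICE-style multiple imputation for a partially observed confounder,
# followed by naive pooling of the exposure coefficient (full Rubin's rules would also pool variances).
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
confounder = rng.normal(size=n)                                    # confounder observed only in the cohort
exposure = rng.binomial(1, 1 / (1 + np.exp(-confounder)))          # e.g. benzodiazepine use
outcome = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * exposure + confounder))))  # e.g. fracture

df = pd.DataFrame({"exposure": exposure, "outcome": outcome, "confounder": confounder})
df.loc[rng.random(n) > 0.2, "confounder"] = np.nan                 # confounder missing for most HAD subjects

coefs = []
for m in range(5):  # five imputed datasets
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    imputed = df.copy()
    imputed[:] = imputer.fit_transform(df)
    fit = LogisticRegression().fit(imputed[["exposure", "confounder"]], imputed["outcome"])
    coefs.append(fit.coef_[0][0])

print("pooled exposure log-odds ratio:", np.mean(coefs))
```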
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
I wanted to make some geospatial visualizations to convey the current severity of COVID-19 in different parts of the U.S.
I liked the NYTimes COVID dataset, but it lacked county boundary shape data, population per county, new cases/deaths per day, per capita calculations, and county demographics.
After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in Python, I wanted to open-source the final enriched dataset to give others a head start in their COVID-19 analysis, modeling, and visualization efforts.
This dataset is enriched with county shapes, county center point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per-day cases/deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.
UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.
Example uses:
- Geospatial analysis and visualization: which counties are currently getting hit the hardest (per capita and totals)?
- What patterns are there in the spread of the virus across counties (e.g., network-based spread simulations using county center lat/lons)?
- Do county population densities play a role in how quickly the virus spreads?
- How do a specific county's or state's cases and deaths compare to other counties/states?
- Join with other county-level datasets easily (via the fips code column), as in the sketch after the example visualization below.
See the column descriptions for more details on the dataset
COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)
https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif?raw=true
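As a hedged illustration of the kind of join the fips code column enables, here is a small pandas sketch. The file names and column names (cases, deaths, population, date, county, state) are assumptions and may not match the published schema exactly; consult the column descriptions above for the real names.

```python
# Hypothetical sketch: join the enriched COVID-19 county data with another county-level
# dataset on the fips code and compute a per-capita metric. File/column names are assumed.
import pandas as pd

covid = pd.read_csv("enriched_covid19_counties.csv", dtype={"fips": str})   # assumed filename
hospitals = pd.read_csv("county_hospital_beds.csv", dtype={"fips": str})    # some other county-level table

merged = covid.merge(hospitals, on="fips", how="left")

# Per-capita style metric (cases per 100k residents), assuming these columns exist
merged["cases_per_100k"] = 1e5 * merged["cases"] / merged["population"]

# Latest snapshot per county, then the ten hardest-hit counties per capita
latest = merged.sort_values("date").groupby("fips").tail(1)
print(latest.nlargest(10, "cases_per_100k")[["county", "state", "cases_per_100k"]])
```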
Code is compatible with Matlab v2020. The corresponding open-source alternative is Octave (https://octave.org/).
This dataset accompanies planned publication 'Near-Ridge Magmatism Constrained Using 40Ar/39Ar Dating of Enriched MORB from the 8°20' N Seamount Chain'. The Ar/Ar data are for samples that record the volcanic history of the area. The geochronology provides time constraints for the eruption of rocks studied in the manuscript. Samples were collected from the 8°20' N seamount chain by Molly Anderson (University of Florida), who sent them to the USGS Denver Argon Geochronology Laboratory for Ar/Ar analysis.
https://brightdata.com/license
Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions.

Dataset Features
- Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month.
- Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records.
- Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and the job market dynamics.

Customizable Subsets for Specific Needs
Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications.

Popular Use Cases
- Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data.
- Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities.
- Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies.
- Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis.
- AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.
Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.
IQ-TREE v2.1.3 (data matrix - FASTA file); UNIX/command line or a text editor for viewing (FASTQ files - raw data); FigTree (tree file - .treefile); BBEdit (partition files - Nexus)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary File S1. The R source codes of the MGSEA program, a toy example dataset, and a brief explanation for running the program. (ZIP 1832 kb)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides example single-cell V(D)J sequencing files used by the vdjHopper R package. The data are intended for demonstration and teaching purposes only, allowing users to test the package’s decontamination, IgBLAST integration, and chain pairing workflows without requiring large raw datasets.
The dataset contains representative files derived from the 10x Genomics NSCLC tumor TCR enrichment dataset.
All files have been subset or downsampled to reduce size and ensure the total archive remains <5 MB.
all_contig_annotations.csv.gz
Filtered contig annotations (CSV, compressed). Contains selected columns such as barcode, chain, cdr3, v_gene, and j_gene.
all_contig.fasta.gz
Representative TCR sequences in FASTA format.
all_contig.fastq.gz
A small subset of raw sequencing reads in FASTQ format, provided for demonstration only.
all_contig.bam
Full BAM alignment file (~3.5 GB). This file is not included in the CRAN package build but can be downloaded from this Zenodo record if required for advanced tutorials. Users should call vdjHopper::fetch_example_data() to retrieve and cache this file programmatically.
Original dataset: 10x Genomics – NSCLC tumor TCR enrichment dataset
License: CC BY 4.0
When using this data, please cite both 10x Genomics (the original data source) and this Zenodo record.
The live music data collected by Teosto is the largest and most comprehensive in Finland. The data opened through the open interface now includes all live gigs announced to Teosto in Finland last year (2014): the dates of the gigs, the venues with their location and coordinates, the performers, the songs performed, and the authors of the songs.
We challenge developers to enrich live music spatial data and develop new, innovative uses for it. Examples of data enrichment include combining other open spatial datasets with event data or music-related metadata with song-specific data.
The development of live data is part of the Open Finland Challenge competition and the Ultrahack event.
https://edmond.mpg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.17617/3.Z8W2JR
Gemälde Dataset - AI-Enhanced Art Historical Descriptions

This dataset contains 224x224 images and associated metadata extracted from the MIDAS XML of the Catalogue of the Photographic Collection of the Bibliotheca Hertziana, enriched with AI-generated prose texts. The dataset is limited to photographs of objects classified as painting (Gemälde), and has been processed using the Google Gemma 2 9B Instruct large language model on the KISSKI HPC cluster of the GWDG. Scripts to process the data on KISSKI were developed with Claude Code in Visual Studio Code.

Dataset Overview

Source data:
- Original dataset: gemalde.tsv (19,051 rows)
- Extracted from: MIDAS XML format (combined.xml)
- Institution: Photographic Collection, Bibliotheca Hertziana - Max Planck Institute for Art History
- Photographic Collection Catalogue: Fotothek der Bibliotheca Hertziana

Output:
- Enriched metadata: TSV files with AI-generated German and English descriptions
- 224x224 images downloaded from the IIIF Image API of the Photographic Collection

Processing Pipeline

1. Data Extraction
Source data was extracted with gemalde.xql from the MIDAS XML file combined.xml, containing structured art historical metadata including:
- Object titles and descriptions (textobj, textfoto)
- Artist information (aob30)
- Location data (aob26, aob28)
- Dating and provenance
- Image references (a8540)

Images Download
224x224 images were downloaded in advance from the IIIF Service based on gemalde.tsv. The script performing AI text enrichment from the metadata checks that the image has been downloaded, so the output data has a 100% certainty of having a matching image. 17,657 images were downloaded out of 19,051 rows; the difference is due to known missing digital images. The dataset corresponds to published data, and each row contains the licence and accessibility of the single image, plus the date of creation and last update of the catalogue object.

2. AI Text Generation
Model used:
- Name: Google Gemma 2 9B Instruct
- Parameters: 9 billion
- Quantization: FP16 (no quantization)
- Context window: 8,192 tokens
- License: Gemma Terms of Use

Processing workflow:
- Input cleaning: removal of numeric codes, normalization of Unicode characters
- Paragraph generation: German text from structured metadata
- Translation: German → English

Categories processed:
- paragraph foto DE/EN - Photograph description
- paragraph obj DE/EN - Object/artwork description
- paragraph verwalter DE/EN - Collection/custodian information
- paragraph standort DE/EN - Location information

AI Prompts Used

Paragraph Generation Prompt:
Convert the following structured information into a coherent text in German. The text contains field data that should be transformed into flowing prose while preserving all information.
IMPORTANT:
- Write a MAXIMUM of 2 paragraphs
- Do NOT include any URLs or web links
- Do NOT include reference codes or numerical codes
- Do NOT add any comments or explanations
- Only output the paragraph text itself
Field: {field_name}
Text: {cleaned_text}
German text (maximum 2 paragraphs):

Example Input:
Field: textobj
Text: Bildnis Filippo Neri Hl. Filippo Neri geboren 1515 Florenz gestorben 1595 Rom Priester Ordensgründer Gründer Oratorium Kongregation des Oratoriums

Example Output:
Filippo Neri, geboren 1515 in Florenz und gestorben 1595 in Rom, war ein Priester und bedeutender Ordensgründer. Er gründete das Oratorium und die Kongregation des Oratoriums, die bis heute eine wichtige Rolle in der katholischen Kirche spielen.

Translation Prompt:
Translate the following German text to English. Preserve the meaning and style as much as possible.
IMPORTANT:
- Do NOT include any URLs or web links in the translation
- Do NOT include reference codes starting with "bh" followed by numbers
- Do NOT include numerical codes like 08012353
- Do NOT add any comments or explanations
- Only output the translated text itself
German text: {text}
English translation:

Example Translation:
Input (DE): Filippo Neri, geboren 1515 in Florenz und gestorben 1595 in Rom, war ein Priester und bedeutender Ordensgründer.
Output (EN): Filippo Neri, born 1515 in Florence and died 1595 in Rome, was a priest and important founder of a religious order.

KISSKI Cluster Resources

Hardware configuration:
- GPU: NVIDIA A100 (80GB VRAM)
- Architecture: Ampere
- Tensor Cores: 432
- FP16 performance: ~312 TFLOPS
- Memory bandwidth: 2 TB/s

Allocation per job:
- GPUs: 1× A100
- CPUs: 4 cores
- RAM: 64 GB
- Time limit: 6 hours per job

Job array configuration:
- Total jobs: 38 (indices 0-37)
- Chunk size: 500 rows per job
- Parallel jobs: 10 simultaneous
- Total rows processed: 19,000 (rows 0-18,999)

Performance metrics:
- AI operations per row: 4 paragraph generations (foto, obj, verwalter, standort) and 4 translations (DE → EN), for a total of 8 LLM inference calls per row

Resource consumption:
- GPU hours: ~125 GPU hours total (38 jobs × 3.3 hours)
- Model size in memory: ~18 GB (FP16)
- Peak VRAM usage: ~25 GB per job

Output structure:
data_gemalde/
├── enriched_data/
│   ├── data_0-499.tsv    # Rows 0-499
│   ├── data_500-999.tsv  # ...
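For orientation, here is a minimal, hypothetical Python sketch of how the paragraph-generation prompt above could be driven with the Hugging Face transformers library. It is not the authors' KISSKI script; the generation settings are illustrative, and only the model name and prompt text follow the description above.

```python
# Hypothetical sketch: apply the documented paragraph-generation prompt with Hugging Face transformers.
# Not the authors' KISSKI pipeline; generation parameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2-9b-it"  # Gemma 2 9B Instruct, run in FP16 as described above

PROMPT_TEMPLATE = (
    "Convert the following structured information into a coherent text in German. "
    "The text contains field data that should be transformed into flowing prose while "
    "preserving all information.\n"
    "IMPORTANT:\n"
    "- Write a MAXIMUM of 2 paragraphs\n"
    "- Do NOT include any URLs or web links\n"
    "- Do NOT include reference codes or numerical codes\n"
    "- Do NOT add any comments or explanations\n"
    "- Only output the paragraph text itself\n\n"
    "Field: {field_name}\nText: {cleaned_text}\n\n"
    "German text (maximum 2 paragraphs):"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")

def generate_paragraph(field_name: str, cleaned_text: str) -> str:
    """Generate a German prose paragraph for one metadata field."""
    prompt = PROMPT_TEMPLATE.format(field_name=field_name, cleaned_text=cleaned_text)
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}], return_tensors="pt", add_generation_prompt=True
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=300, do_sample=False)
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True).strip()

# Example call mirroring the documented textobj example
print(generate_paragraph("textobj", "Bildnis Filippo Neri Hl. Filippo Neri geboren 1515 Florenz ..."))
```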
Global B2B Mobile Phone Number Database | 100M+ Verified Contacts | 95% Accuracy

Forager.ai provides the world’s most reliable mobile phone number data for businesses that refuse to compromise on quality. With 100 million+ professionally verified mobile numbers refreshed every 3 weeks, our database ensures 95% accuracy – so your teams never waste time on dead-end leads.
Why Our Data Wins

✅ Accuracy You Can Trust: 95% of mobile numbers are verified against live carrier records and tied to current job roles. Say goodbye to “disconnected number” voicemails.

✅ Depth Beyond Digits: Each contact includes 150+ data points:
Direct mobile numbers
Current job title, company, and department
Full career history + education background
Location data + LinkedIn profiles
Company size, industry, and revenue
✅ Freshness Guaranteed: Bi-weekly updates combat job-hopping and role changes – critical for sales teams targeting decision-makers.

✅ Ethically Sourced & Compliant: First-party collected data with full GDPR/CCPA compliance.
Who Uses This Data?
Sales Teams: Cold-call C-suite prospects with verified mobile numbers.
Marketers: Run hyper-personalized SMS/WhatsApp campaigns.
Recruiters: Source passive candidates with up-to-date contact intel.
Data Vendors: License premium datasets to enhance your product.
Tech Platforms: Power your SaaS tools via API with enterprise-grade B2B data.
Flexible Delivery, Instant Results
API (REST): Real-time integration for CRMs, dialers, or marketing stacks
CSV/JSON: Campaign-ready files.
PostgreSQL: Custom databases for large-scale enrichment
Compliance: Full audit trails + opt-out management
Why Forager.ai?
→ Proven ROI: Clients see 62% higher connect rates vs. industry averages (request case studies).
→ No Guesswork: Test-drive free samples before committing.
→ Scalable Pricing: Pay per record, license datasets, or get unlimited API access.
B2B Mobile Phone Data | Verified Contact Database | Sales Prospecting Lists | CRM Enrichment | Recruitment Phone Numbers | Marketing Automation | Phone Number Datasets | GDPR-Compliant Leads | Direct Dial Contacts | Decision-Maker Data
Need Proof? Contact us to see why Fortune 500 companies and startups alike trust Forager.ai for mission-critical outreach.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure

| Column | Description |
| --- | --- |
| code_blocks_index | Global index linking code blocks to markup_data.csv. |
| kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
| code_block_id | Position of the code block within the notebook. |
| code_block | The actual machine learning code snippet. |

Table 2. kernels_meta.csv structure

| Column | Description |
| --- | --- |
| kernel_id | Identifier for the Kaggle Jupyter notebook. |
| kaggle_score | Performance metric of the notebook. |
| kaggle_comments | Number of comments on the notebook. |
| kaggle_upvotes | Number of upvotes the notebook received. |
| kernel_link | URL to the notebook. |
| comp_name | Name of the associated Kaggle competition. |

Table 3. competitions_meta.csv structure

| Column | Description |
| --- | --- |
| comp_name | Name of the Kaggle competition. |
| description | Overview of the competition task. |
| data_type | Type of data used in the competition. |
| comp_type | Classification of the competition. |
| subtitle | Short description of the task. |
| EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
| data_sources | Links to datasets used. |
| metric type | Class label for the assessment metric. |

Table 4. markup_data.csv structure

| Column | Description |
| --- | --- |
| code_block | Machine learning code block. |
| too_long | Flag indicating whether the block spans multiple semantic types. |
| marks | Confidence level of the annotation. |
| graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example:
code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column, and kernels_meta.csv to competitions_meta.csv via comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
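A hedged pandas sketch of this mapping, assuming the CSV files are available locally under the names given in the tables above:

```python
# Hypothetical sketch: link Code4ML tables via kernel_id and comp_name. File paths are assumptions.
import pandas as pd

code_blocks = pd.read_csv("code_blocks.csv")
kernels = pd.read_csv("kernels_meta.csv")
competitions = pd.read_csv("competitions_meta.csv")

# code block -> notebook metadata -> competition metadata
blocks_with_meta = (
    code_blocks
    .merge(kernels, on="kernel_id", how="inner")
    .merge(competitions, on="comp_name", how="left")
)
print(blocks_with_meta[["code_block_id", "kernel_id", "comp_name", "kaggle_score"]].head())
```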
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the help of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is based on National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) DataSet accession GDS2778.
The dataset originates from a microarray experiment measuring global gene expression under specific experimental conditions.
Raw and processed expression data (for all probes/genes) are included, enabling downstream analysis such as normalization, differential expression, and clustering.
The dataset has been used to perform differential gene expression (DGE) analysis to identify genes that are up- or down-regulated under the experimental condition compared to control.
Data processing steps typically include normalization (e.g., log-transformation), quality control, probe-to-gene mapping, and statistical testing for significance (e.g., using packages such as limma or other DGE tools).
Resulting differentially expressed genes (DEGs) include statistics such as log fold change (logFC), adjusted p‑values (adj.P.Val), and possibly other metrics (e.g., B-statistic), allowing assessment of both magnitude and significance of changes.
The dataset also includes a visualization file (heatmap image) that displays expression patterns of DEGs (or top variable genes) across samples — enabling clustering and pattern recognition across samples and genes.
The heatmap helps illustrate sample-wise and gene-wise expression variation: clustering groups together samples (e.g. control vs treatment) and genes with similar expression dynamics.
This dataset is suitable for further bioinformatics analysis: e.g. functional enrichment (GO/Pathway), co‑expression analysis, gene signature identification, or integration with other datasets.
Users who download this dataset can reproduce or extend analyses, such as re-normalization, alternative clustering, custom DEG thresholds, or downstream biological interpretation (pathway, network analysis).
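As an illustration of a typical downstream step a user might reproduce, the following hedged pandas sketch filters a DEG table by log fold change and adjusted p-value. The file name and exact column labels are assumptions modeled on limma-style output, not the dataset's documented layout.

```python
# Hypothetical sketch: filter differentially expressed genes by effect size and significance.
# File name and column labels (logFC, adj.P.Val) are assumed, following limma-style output.
import pandas as pd

deg = pd.read_csv("GDS2778_deg_table.csv")  # assumed export of the DGE results

up = deg[(deg["logFC"] >= 1) & (deg["adj.P.Val"] < 0.05)]
down = deg[(deg["logFC"] <= -1) & (deg["adj.P.Val"] < 0.05)]

print(f"up-regulated: {len(up)}, down-regulated: {len(down)}")
print(up.sort_values("adj.P.Val").head(10))
```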
Targeted enrichment of conserved genomic regions (e.g., ultraconserved elements or UCEs) has emerged as a promising tool for inferring evolutionary history in many organismal groups. Because the UCE approach is still relatively new, much remains to be learned about how best to identify UCE loci and design baits to enrich them.
We test an updated UCE identification and bait design workflow for the insect order Hymenoptera, with a particular focus on ants. The new strategy augments a previous bait design for Hymenoptera by (a) changing the parameters by which conserved genomic regions are identified and retained, and (b) increasing the number of genomes used for locus identification and bait design. We perform in vitro validation of the approach in ants by synthesizing an ant-specific bait set that targets UCE loci and a set of “legacy” phylogenetic markers. Using this bait set, we generate new data for 84 taxa (16/17 ant subfamilies) and extract loci from an additional 17 genome-e...
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset compiles, in two CSV files (one with absolute values and one min-max-scaled), a variety of data for 8,170 different zip code areas of Germany. Examples of such data include average sunshine hours per year, average annual income per person, number of crimes committed per year, percentage of the population below and above 60 years of age, and the share of voters for the Green Party.
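If one wanted to reproduce the scaled file from the absolute one, a minimal pandas sketch could look like this; the file names and index column are assumptions, not the dataset's documented schema.

```python
# Hypothetical sketch: derive a min-max-scaled table from the absolute-value table.
import pandas as pd

absolute = pd.read_csv("zip_code_features_absolute.csv", index_col="zip_code")  # assumed layout
scaled = (absolute - absolute.min()) / (absolute.max() - absolute.min())        # scale each column to [0, 1]
scaled.to_csv("zip_code_features_minmax.csv")
```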
https://edmond.mpg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.17617/3.1GN3OL
Zeichnungen Dataset - AI-Enhanced Art Historical Descriptions with Iconography

This dataset contains 224x224 images and associated metadata extracted from the MIDAS XML of the Catalogue of the Photographic Collection of the Bibliotheca Hertziana, enriched with AI-generated prose texts and iconographic analysis. The dataset is limited to photographs of objects classified as drawing (Zeichnungen), and has been processed using the Google Gemma 2 9B Instruct large language model on the KISSKI HPC cluster of the GWDG. Scripts to process the data on KISSKI were developed with Claude Code in Visual Studio Code.

Dataset Overview

Source data:
- Original dataset: zeichnungen.tsv (30,000 rows / 29,999 data rows)
- Extracted from: MIDAS XML format (combined.xml)
- Source institution: Bibliotheca Hertziana - Max Planck Institute for Art History
- Image repository: Fotothek der Bibliotheca Hertziana

Output:
- Enriched metadata: TSV files with AI-generated German and English descriptions
- Iconographic analysis: descriptions based on ICONCLASS classification
- 224x224 images downloaded from the IIIF Image API of the Photographic Collection

Processing Pipeline

1. Data Extraction
Source data was extracted with zeichnungen.xql from the MIDAS XML file combined.xml, containing structured art historical metadata including:
- Object titles and descriptions (textobj, textfoto)
- Artist information (aob30)
- Location data (aob26, aob28)
- ICONCLASS codes (a5500) - standardized iconographic classification
- Dating and provenance
- Image references (a8540)
The set was limited to 30,000 entries.

2. ICONCLASS Cache Preparation
ICONCLASS system:
- Source: ICONCLASS.org - multilingual classification system for cultural content
- GitHub repository: https://github.com/iconclass/data

Images Download
224x224 images were downloaded in advance from the IIIF Service based on gemalde.tsv. The script performing AI text enrichment from the metadata checks that the image has been downloaded, so the output data has a 100% certainty of having a matching image. 28,165 images were downloaded out of 29,999 rows; the difference is due to known missing digital images. The dataset corresponds to published data, and each row contains the licence and accessibility of the single image, plus the date of creation and last update of the catalogue object.

3. AI Text Generation
Model used:
- Name: Google Gemma 2 9B Instruct
- Parameters: 9 billion
- Quantization: FP16 (no quantization)
- Context window: 8,192 tokens
- License: Gemma Terms of Use

Processing workflow:
- Input cleaning: removal of numeric codes, normalization of Unicode characters, increased CSV field size limit (10 MB)
- Paragraph generation: German text from structured metadata
- ICONCLASS lookup: offline cache-based iconographic description retrieval
- Iconographic synthesis: AI-generated description from ICONCLASS codes
- Translation: German → English

Categories processed:
- paragraph foto DE/EN - Photograph description
- paragraph obj DE/EN - Object/artwork description
- paragraph verwalter DE/EN - Collection/custodian information
- paragraph standort DE/EN - Location information
- paragraph iconclass DE/EN - Iconographic content description (new)

AI Prompts Used

Paragraph Generation Prompt:
Convert the following structured information into a coherent text in German. The text contains field data that should be transformed into flowing prose while preserving all information.
IMPORTANT:
- Write a MAXIMUM of 2 paragraphs
- Do NOT include any URLs or web links
- Do NOT include reference codes or numerical codes
- Do NOT add any comments or explanations
- Only output the paragraph text itself
Field: {field_name}
Text: {cleaned_text}
German text (maximum 2 paragraphs):

ICONCLASS Paragraph Prompt:
Based on the following Iconclass descriptions, write a brief German paragraph describing what the image depicts.
Descriptions: {'; '.join(descriptions)}
IMPORTANT:
- Start with "Das Bild zeigt" or similar phrasing
- Combine all descriptions into a flowing text
- Maximum 1-2 sentences
- Do NOT include iconclass codes or numbers
- Do NOT include reference codes starting with "bh"
- Only output the descriptive German text
German description:

Example ICONCLASS processing:
- Input from data: a5500: 31 A 23 1 | 31 A 25 11 | 31 B 62 11
- ICONCLASS lookup (from cache): 31 A 23 1 → "standing figure"; 31 A 25 11 → "arm raised upward"; 31 B 62 11 → "looking upwards"
- AI-generated output (DE): Das Bild zeigt eine stehende Figur mit erhobenem Arm, die nach oben blickt.
- Translation (EN): The image shows a standing figure with raised arm, looking upwards.

Translation Prompt:
Translate the following German text to English. Preserve the meaning and style as much as possible.
IMPORTANT:
- Do NOT include any URLs or web...
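A minimal, hypothetical Python sketch of the offline ICONCLASS lookup step described above: the cache file name and format are assumptions, and only the handling of the pipe-separated a5500 codes follows the documented example.

```python
# Hypothetical sketch: resolve pipe-separated ICONCLASS codes from field a5500 against an
# offline cache (code -> label), producing the description list fed to the iconclass prompt.
import json

with open("iconclass_cache.json", encoding="utf-8") as fh:   # assumed cache built from iconclass/data
    cache = json.load(fh)                                    # e.g. {"31A231": "standing figure", ...}

def iconclass_descriptions(a5500: str) -> list[str]:
    """Split a5500 values such as '31 A 23 1 | 31 A 25 11' and look each code up in the cache."""
    descriptions = []
    for raw_code in a5500.split("|"):
        code = raw_code.replace(" ", "").strip()
        label = cache.get(code)
        if label:
            descriptions.append(label)
    return descriptions

print("; ".join(iconclass_descriptions("31 A 23 1 | 31 A 25 11 | 31 B 62 11")))
```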
Premise of the study: The Compositae (Asteraceae) are a large and diverse family of plants, and the most comprehensive phylogeny to date is a meta-tree based on 10 chloroplast loci that has several major unresolved nodes. We describe the development of an approach that enables the rapid sequencing of large numbers of orthologous nuclear loci to facilitate efficient phylogenomic analyses. Methods and Results: We designed a set of sequence capture probes that target conserved orthologous sequences in the Compositae. We also developed a bioinformatic and phylogenetic workflow for processing and analyzing the resulting data. Application of our approach to 15 species from across the Compositae resulted in the production of phylogenetically informative sequence data from 763 loci and the successful reconstruction of known phylogenetic relationships across the family. Conclusions: These methods should be of great use to members of the broader Compositae community, and the general approach should also be of use to researchers studying other families.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Goodreads Spoilers dataset embodies a trove of reviews from the Goodreads book review platform, with a special emphasis on annotated "spoiler" information from each review. This dataset is an invaluable asset for those keen on delving into spoiler detection, sentiment analysis related to spoilers, and understanding user behavior in the context of revealing or discussing plot twists.
Basic Statistics:
- Books: 25,475
- Users: 18,892
- Reviews: 1,378,033

Metadata:
- Reviews: The text of the reviews provided by users.
- Ratings: Ratings assigned to books by users.
- Spoilers: Annotated spoilers within the review text.
- (Additionally, metadata from the complete Goodreads dataset can be utilized to enrich analysis.)
Example (spoiler data):
```json
{
  'user_id': '01ec1a320ffded6b2dd47833f2c8e4fb',
  'timestamp': '2013-12-28',
  'review_sentences': [[0, 'First, be aware that this book is not for the faint of heart.'],
    [0, 'Human trafficking, drugs, kidnapping, abuse in all forms - this story contains all of this and more.'],
    ...,
    [0, '(ARC provided by the author in return for an honest review.)']],
  'rating': 5,
  'has_spoiler': False,
  'book_id': '18398089',
  'review_id': '4b3ffeaf14310ac6854f140188e191cd'
}
```
Use Cases:
- Spoiler Detection: Developing algorithms to automatically detect spoilers in review text.
- Sentiment Analysis: Analyzing the sentiment of reviews and examining how the presence of spoilers affects sentiment.
- User Behavior Analysis: Understanding how users interact with books that have spoilers and how they disclose such information in reviews.
- Natural Language Processing: Training models to understand and process user-generated text which contains spoilers.
Citation:
Please cite the following if you use the data:
Fine-grained spoiler detection from large-scale review corpora
Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley
ACL, 2019
[PDF](https://cseweb.ucsd.edu/~jmcauley/pdfs/acl19a.pdf)
Code Samples: The datasets are accompanied by a series of code samples housed in the dataset's GitHub repository. These code samples include:
- Downloading datasets without GUI: A notebook to facilitate dataset downloading sans graphical user interface.
- Displaying sample records: A notebook to showcase sample records from the dataset.
- Calculating basic statistics: A notebook to calculate and understand basic statistics of the dataset.
- Exploring the interaction data: A notebook to explore interaction data and understand user-book interactions.
- Exploring the review data: A notebook to delve into the review data and extract insights from user reviews.
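As a hedged illustration of the kind of loading and basic-statistics step those notebooks cover, the following Python sketch streams a gzipped JSON-lines review file; the file name is hypothetical and the fields follow the example record shown above.

```python
# Hypothetical sketch: stream a gzipped JSON-lines spoiler file and compute basic statistics.
# File name is assumed; fields follow the example record shown above.
import gzip
import json
from collections import Counter

n_reviews = 0
spoiler_counts = Counter()
with gzip.open("goodreads_reviews_spoiler.json.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        n_reviews += 1
        spoiler_counts[record["has_spoiler"]] += 1

print(f"reviews: {n_reviews}, with spoilers: {spoiler_counts[True]}")
```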
Datasets: