- Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.
- Quality: Multiple rounds of quality inspection ensure high-quality data output, certified to ISO 9001.
The use of artificial intelligence in public services is often identified as an opportunity to query documentary corpora and build automatic question-answering (Q&A) tools for users. Querying the French labour code in natural language, providing a conversational agent for a given service, developing efficient search engines, improving knowledge management: all of these require a body of quality training data in order to develop Q&A algorithms. The PIAF dataset is a public and open French-language training dataset for training these algorithms.
Inspired by SQuAD, the well-known English question-answering dataset, our ambition was to build a similar dataset that would be open to all. The protocol we followed is very similar to that of the first version of SQuAD (SQuAD v1.1); however, some changes had to be made to adapt it to the characteristics of the French Wikipedia. Another big difference is that we do not employ micro-workers via crowd-sourcing platforms.
After several months of annotation, we have a robust and free annotation platform, a sufficient amount of annotations, and a well-founded, innovative approach to community animation and collaborative participation within the French administration.
In March 2018, France launched its national strategy for artificial intelligence. Piloted within the Interdepartmental Digital Branch, this strategy has three components: research, the economy and public transformation.
Given that data policy is a major focus of the development of artificial intelligence, the Etalab mission is piloting the establishment of an interministerial "Lab IA", whose mission is to accelerate the deployment of AI in administrations via three main activities:
The PIAF project is one of the shared tools of the Lab IA.
The dataset follows the SQuAD v1.1 format. PIAF v1.2 contains 9,225 question-answer pairs, distributed as a JSON file. A text file illustrating the schema is included below. The file can be used to train and evaluate question-answering models, for example by following these instructions.
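For orientation, here is a minimal sketch of the SQuAD v1.1-style layout that PIAF follows, together with a small helper that counts question-answer pairs. The article, question, and answer texts below are placeholders rather than actual PIAF content, and the file name passed to the helper is illustrative.

```python
import json

# Illustrative SQuAD v1.1-style record (placeholder content, not real PIAF data).
example = {
    "version": "1.2",
    "data": [
        {
            "title": "Exemple d'article Wikipédia",
            "paragraphs": [
                {
                    "context": "La réponse se trouve dans ce paragraphe d'exemple.",
                    "qas": [
                        {
                            "id": "exemple-0001",
                            "question": "Où se trouve la réponse ?",
                            "answers": [
                                {"text": "ce paragraphe d'exemple", "answer_start": 26}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

def count_qa_pairs(path):
    """Count question-answer pairs in a PIAF/SQuAD-style JSON file."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    return sum(
        len(paragraph["qas"])
        for article in dataset["data"]
        for paragraph in article["paragraphs"]
    )

# count_qa_pairs("piaf-v1.2.json")  # file name is illustrative
```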
We warmly thank our contributors, who have kept this project alive on a voluntary basis to this day.
Information on the protocol followed, the project news, the annotation platform and the related code are here:
v1.0
2024-10-15
Personnel requirements: professional and ordinary speakers, half male and half female
Collection equipment: professional recording booth
Data format: .wav, .txt, .TextGrid
Data features: ordinary and professional speakers; the text content covers all Chinese phonemes as well as the main Chinese phonetic contexts; the recorded content is completely consistent with the text
Annotation content: consonant-vowel (initial/final) segmentation, prosody annotation, and pinyin annotation
All speakers, except those aged 0-15 and over 50, have some broadcasting training and hold a Class 1 (Grade A or Grade B) Putonghua proficiency certificate.
Speakers without a Putonghua certificate should have natural, friendly voices and relatively standard Putonghua.
Speaking speed should be natural, and volume and speed should be kept as consistent as possible.
Speakers should be in good condition during recording; breath sounds and excessive saliva sounds in silent segments should be avoided.
Each speaker reads 1,500 sentences, delivered as one complete file per person; the order of sentences in the file is exactly the same as in the text, with 700 ms-1 s of silence between every two sentences.
The audio format is .wav, with a 48 kHz sampling rate, 16-bit depth, and a single channel. The background noise of the audio is below -60 dB, and the signal-to-noise ratio reaches 35 dB. The peak level of a single sentence is between -2 dB and 9 dB, with no clipping.
High-frequency information of the audio is complete.
There is no obvious noise in the audio, including but not limited to background noise, electrical hum, key-pressing sounds, breathing sounds, saliva sounds, etc.
The sound preserves the real human voice as faithfully as possible.
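A rough way to check delivered recordings against the stated format (48 kHz, 16-bit, mono, no clipping) is sketched below using only the Python standard library plus NumPy. The noise-floor and SNR checks are omitted since they require segmenting silence, and the thresholds come from the description above rather than any official QA tool.

```python
import wave
import numpy as np

def check_wav_spec(path):
    """Rough check of a recording against the stated delivery spec (illustrative only)."""
    with wave.open(path, "rb") as wf:
        ok_rate = wf.getframerate() == 48000
        ok_depth = wf.getsampwidth() == 2          # 16-bit == 2 bytes per sample
        ok_mono = wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())

    samples = np.frombuffer(frames, dtype=np.int16).astype(np.float64) / 32768.0
    peak_dbfs = 20 * np.log10(np.max(np.abs(samples)) + 1e-12)
    clipped = bool(np.any(np.abs(samples) >= 32767 / 32768.0))

    return {
        "sample_rate_48k": ok_rate,
        "bit_depth_16": ok_depth,
        "mono": ok_mono,
        "peak_dbfs": round(peak_dbfs, 2),
        "clipping_detected": clipped,
    }

# check_wav_spec("audio1.wav")  # file name taken from the directory structure below
```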
Gender distribution: 50 men, 50 women, total 100 people
Age & quantity distribution:

| Age | Gender | Number | Quantity |
| --- | --- | --- | --- |
| 0-15 years old | Male | 5 | Effective time 1.5-2 hours/person, total effective time 142 hours, 102,400 sentences in total |
| 0-15 years old | Female | 6 | |
| 16-50 years old | Male | 40 | |
| 16-50 years old | Female | 39 | |
| Over 50 years old | Male | 5 | |
| Over 50 years old | Female | 5 | |
## Directory Structure
```
root_directory/
├── audio/
│   ├── audio1.wav
│   ├── text1.txt
```
The screenwriting and annotation software market is experiencing robust growth, driven by the increasing demand for efficient and collaborative tools in the film, television, and animation industries. The market size in 2025 is estimated at $500 million, exhibiting a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This growth is fueled by several key factors. The rising adoption of cloud-based solutions facilitates seamless collaboration among writers, editors, and directors, regardless of geographical location. Furthermore, the integration of advanced features like real-time annotation, version control, and script breakdown tools enhances productivity and streamlines the entire screenwriting process. The market is segmented by deployment (on-premise and cloud-based) and user type (large enterprises, medium enterprises), with cloud-based solutions gaining significant traction due to their accessibility and cost-effectiveness. While the prevalence of free or open-source alternatives presents a restraint, the increasing need for professional-grade features, robust security, and seamless integration with other production tools is driving adoption of paid software. Major players like Final Draft and Celtx dominate the market, but a growing number of smaller companies are offering niche solutions, fostering innovation and competition. The North American market currently holds the largest share, followed by Europe and Asia Pacific, with significant growth potential in emerging markets. The forecast period from 2025 to 2033 projects continued expansion, primarily driven by technological advancements and increasing content creation across various media platforms. The shift towards digital workflows and the growing preference for collaborative writing environments are strengthening the market outlook. However, challenges such as the high initial investment for professional software and the need for continuous training and support might slightly hinder market growth in certain regions. Nevertheless, the overall trajectory points towards a significant increase in market value, with the potential for niche players to gain a stronger foothold through specialized functionalities and targeted marketing strategies within specific segments of the industry. The market's continued growth underscores its importance in the evolving media landscape, where efficiency and collaboration are increasingly crucial for success.
Pinot Noir, one of the world's most famous grapes with its late-ripening, light ruby-red fruit, was used for T2T genome assembly. A total of 33,349,412,693 bp of HiFi reads with ~65x coverage were generated on the PacBio platform. The preliminary assembly was produced from the HiFi reads using Hifiasm. A total of 83,958,664,800 bp of Hi-C reads with ~160x coverage were used to anchor and remove some short contigs using Juicer, and 3D-DNA was then used to obtain the genome at the scaffold level. This yielded the two gap-free T2T haplotypes of the hybrid Pinot Noir: PN1 (473.43 Mb) and PN2 (467.09 Mb).
To validate the quality of our assembly, k-mer and BUSCO analyses were conducted. The k-mer analysis estimated genomic heterozygosity at 1.43%, and BUSCO was used to evaluate genomic completeness: about 98.3% (PN1) and 98.4% (PN2) of the core conserved plant genes were found complete in the genome assembly. For genome annotation, the numbers of genes identified in the two haplotypes are similar, with more than 37,000 genes found for Pinot Noir: 37,037 in PN1 and 37,350 in PN2.
The PN1 genome assembly: PNhap1.fa
The PN1 gene annotation: PNhap1.gff3
The PN1 TE annotation: PNhap1.TE.gff
The PN2 genome assembly: PNhap2.fa
The PN2 gene annotation: PNhap2.gff3
The PN2 TE annotation: PNhap2.TE.gff
The global PDF reader software market is experiencing robust growth, driven by the increasing adoption of digital documents across various sectors. While precise market size figures for the base year (2025) are unavailable, considering industry trends and the presence of major players like Adobe, Foxit, and Google, a reasonable estimate places the market value at approximately $2.5 billion in 2025. This market is projected to exhibit a Compound Annual Growth Rate (CAGR) of 8% from 2025 to 2033, indicating a substantial expansion over the forecast period. This growth is fueled by several key factors including the rising demand for collaborative document editing, enhanced security features in PDF readers, and the increasing use of PDFs across diverse platforms and devices. The market's segmentation includes various pricing models (free, subscription-based), functionalities (basic viewing, editing, annotation), and deployment methods (cloud, on-premise). The competitive landscape is dominated by established players like Adobe Acrobat Reader, but also features strong competition from smaller, specialized providers offering niche solutions. The market's future trajectory hinges on factors like technological advancements leading to more intuitive and feature-rich software, further integration with cloud services and collaboration tools, and ongoing concerns around PDF security and data privacy. The market's growth is also influenced by regional variations in technology adoption and digitalization. North America and Europe currently hold a significant market share, attributed to high digital literacy and established IT infrastructure. However, rapid growth is anticipated in emerging economies like Asia-Pacific, driven by increasing smartphone penetration and a growing demand for cost-effective PDF solutions. Constraints on market growth include the availability of free, basic PDF readers, which may limit the adoption of premium, feature-rich options. Furthermore, concerns about data security and the potential for vulnerabilities within PDF software remain a crucial consideration influencing both consumer and enterprise adoption. Future growth depends on addressing these security concerns and introducing innovations that enhance user experience and productivity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1.1 Library preparation
The total RNA from pre-aestivation (5-day-old), aestivation (30-day-old), and post-aestivation (55-day-old) female beetles was extracted using the ZYMO Quick-RNA Tissue/Insect Kit (ZYMO Research, Irvine, CA, USA) and cleaned using the TURBO DNA-free™ kit (Thermo Fisher Scientific, Langenselbold, Germany) according to the manufacturer’s instructions. We opted to sample only the females to eliminate sex-related variations. RNA quantity was determined using a Nanodrop ND-1000 UV/Vis spectrophotometer (Thermo Fisher Scientific). The integrity of the RNA samples was determined using the Agilent 2100 Bioanalyzer and an RNA 6000 Nano Kit (Agilent Technologies, Santa Clara, CA, USA). RIN values ≥ 7.0 were considered appropriate for mRNA library preparation. In total, 10 libraries (4, 3, and 3 libraries respectively per pre-aestivation, aestivation, and post-aestivation stages) were prepared using the NEBNext® Poly(A) mRNA Magnetic Isolation Module kit (NEB E7490, New England Biolabs) according to the manufacturer’s instructions. The quality of the libraries was checked via RNA fragment analysis conducted on the Agilent 2100 Bioanalyzer using the Agilent DNF-935 Reagent Kit (Agilent Technologies). The libraries were pooled based on their concentration, and an overall concentration of 3.4 ng/µL was obtained. The sequencing service was provided by BGI Genomics Tech Solutions Co. Ltd (Hong Kong) on a DNBSEQ-T7 platform.
1.2 De novo assembly and functional annotation
Erroneous k-mers from paired read ends were removed using Rcorrector (v1.0.5) with default options (Song & Florea, 2015), and the unfixable reads were discarded using the “FilterUncorrectabledPEfastq.py” function in Transcriptome Assembly Tools (Song & Florea, 2015). The adaptor sequences from the reads were removed, and reads having a quality score above 30 were retained using TrimGalore! (v0.6.7). The cleaned reads (n = 3 per three adult phases) were de novo assembled using Trinity with default options. The completeness of the transcriptome was quantified using BUSCO (v5.4.2) via a comparison against the endopterygota dataset (BUSCO.v4 datasets). The transcriptome (including isoforms) was annotated using Trinotate (v3.2.2), which combines the outputs of the NCBI BLAST+ (v2.13.0; nucleotide and predicted protein BLAST), TransDecoder (v5.5.0; coding region prediction), SignalP (v4.0; signal peptide prediction), TMHMM (v2.0; transmembrane domain prediction), and HMMER (v3.3.2; homology search) packages into an SQLite annotation database. The latest uniprot_sprot (04/2022) and Pfam-A (11/2015) databases were downloaded using Trinotate, and the default E-value thresholds were used during the searches with BLAST+ and HMMER, respectively. The obtained annotation database was used to extract gene ontology (GO) terms associated with individual genes using the “extract_GO_assignments_from_Trinotate_xls.pl” script, whereas the SignalP and TMHMM outputs were manually extracted using Excel spreadsheets. The longest protein-coding regions in the super transcript data predicted by TransDecoder were subjected to Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway annotation via GhostKoala v2.2 (https://www.kegg.jp/ghostkoala/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of structural annotation changes for curated organisms since the initial integration of genomes in TriTrypDB.
https://spdx.org/licenses/CC0-1.0.html
The gray mangrove [Avicennia marina (Forsk.) Vierh.] is the most widely distributed mangrove species, ranging throughout the Indo-West Pacific. It presents remarkable levels of geographic variation both in phenotypic traits and habitat, often occupying extreme environments at the edges of its distribution. However, subspecific evolutionary relationships and adaptive mechanisms remain understudied, especially across populations of the West Indian Ocean. High-quality genomic resources accounting for such variability are also sparse. Here we report the first chromosome-level assembly of the genome of A. marina. We used a previously released draft assembly and the proximity ligation libraries Chicago and Dovetail HiC for scaffolding, producing a 456,526,188 bp long genome. The largest 32 scaffolds (22.4 Mb to 10.5 Mb) accounted for 98% of the genome assembly, with the remaining 2% distributed among 3,759 much shorter scaffolds (62.4 Kb to 1 Kb). We annotated 45,032 protein-coding genes using tissue-specific RNA-seq data in combination with de novo gene prediction, of which 34,442 were associated with GO terms. The genome assembly and the annotated gene set yield completeness scores of 96.7% and 95.1%, respectively, when compared with the eudicots BUSCO dataset. Furthermore, an FST survey based on resequencing data successfully identified a set of candidate genes potentially involved in local adaptation, and revealed patterns of adaptive variability correlating with a temperature gradient in Arabian mangrove populations. Our A. marina genomic assembly provides a highly valuable resource for genome evolution analysis, as well as for identifying functional genes involved in adaptive processes and speciation.
Methods

Genome sequencing and assembly

The sequenced sample was leaf tissue obtained from an individual located at Ras Ghurab Island in the Arabian Gulf (Abu Dhabi, United Arab Emirates; 24.601°N, 4.566°E), corresponding to the A. m. marina variety. A high-quality genome was produced using proximity ligation libraries and the software pipeline HiRise at Dovetail Genomics, LLC. Briefly, for Chicago and the Dovetail HiC library preparation, chromatin was fixed with formaldehyde. Fixed chromatin was then digested with DpnII and free blunt ends were ligated. Crosslinks were reversed, and the DNA was purified from protein and then sheared to ~350 bp mean fragment size. Libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters, and sequencing was carried out on an Illumina HiSeq X platform. Chicago and Dovetail HiC library reads were then used as input data for genome assembly with HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies. A previously reported draft genome of Avicennia marina (GenBank accession: GCA_900003535.1) was used in the assembly pipeline, excluding scaffolds shorter than 1 Kb since HiRise does not assemble them.
The mitochondrial genome was assembled using NOVOplasty2.7.2 and resequencing data based on Illumina paired-end 150 bp libraries from a conspecific individual (See below; Supplementary Information). The maturase (matR) mitochondrial gene available in NCBI (GenBank accession no. AY289666.1) was used for the input seed sequence.
Genome annotation

We performed the annotation of the A. marina genome using mRNA data from a set of tissues of conspecific individuals in combination with de novo gene prediction using BRAKER2 v2.1.5 (Hoff et al. 2016). Samples were collected on the coast of the Eastern Central Red Sea north of Jeddah in the Kingdom of Saudi Arabia (22.324 °N, 39.100 °E; Figure 1A). Total RNA was isolated from root, stem, leaf, flower, and seed using TRIzol reagent (Invitrogen, USA). RNA-seq libraries were prepared using the TruSeq RNA sample prep kit (Illumina, Inc.), with inserts that range in size from approximately 100-400 bp. Library quality control and quantification were performed with a Bioanalyzer Chip DNA 1000 series II (Agilent), and sequencing was carried out on a HiSeq 2000 platform (Illumina, Inc.). First, repetitive regions were modelled ab initio using RepeatModeler v2.0.1 (Flynn et al. 2019) in all scaffolds longer than 100 Kb with default options. The resulting repeat library was used to annotate and soft-mask repeats in the genome assembly with RepeatMasker 4.0.9 (Smit et al. 2015). Next, messenger RNA reads were mapped against the soft-masked genome assembly with HISAT2 (Kim et al. 2015). Gene prediction was conducted with BRAKER2 using both the RNA-seq data and the conserved orthologous genes from BUSCO Eudicots_odb10 as proteins from short evolutionary distance to provide hints and train GeneMark-ETP and Augustus (--etpmode; Hoff et al. 2019; Bruna et al. 2020; Lomsadze et al. 2005; Buchfink et al. 2015; Gotoh 2008; Iwata and Gotoh 2012; Li et al. 2009; Barnett et al. 2011; Lomsadze et al. 2014; Stanke et al. 2008; Stanke et al. 2006). The obtained gene annotation gff3 file was validated and used to generate the reported gene annotation statistics with GenomeTools (Gremme et al. 2013) and in-house Perl scripts. Finally, we conducted a similarity-based approach to assist the functional annotation of the predicted proteins. We integrated InterProScan v5.31 (Jones et al. 2014) and BLAST (Tatusova and Madden 1999) searches using the UniProt Swiss-Prot database and the annotated proteins from the Arabidopsis thaliana genome (UniProt Consortium 2019) to generate a final set of annotated functional genes.
Variant calling from resequencing data

Whole genome resequencing was carried out for the 60 individuals from 6 populations around the Arabian Peninsula at Novogene facilities. Illumina paired-end 150 bp libraries with an insert size of 350 bp were prepared and sequenced on a NovaSeq platform. A total of 2.4G reads were produced, resulting in a mean coverage per site and sample of 85X before filtering. Read quality was evaluated using FASTQC after sorting reads by individual with AXE. Trimming and quality filtering were conducted using Trim Galore, resulting in a set of reads ranging between 90 and 138 bp long. Reads were then mapped against the A. marina reference genome using the mem algorithm in the Burrows-Wheeler Aligner. Read groups were assigned and BAM files generated with Picard Tools version 1.126. We used the HaplotypeCaller + GenotypeGVCFs tools from the Genome Analysis Toolkit version 3.6-0 to produce a set of single nucleotide polymorphisms (SNPs) in the variant call format (VCF). Genotype quality and missing data filters for downstream analyses were implemented with vcftools. Samples with less than 25% of the sites genotyped were discarded. Then, a SNP matrix was constructed excluding SNPs outside a coverage range of 4 to 50 or with a genotyping phred quality score below 40. Positions for which one or more samples were not genotyped were removed, along with those presenting a minor allele count (MAC) below 3. Only the SNPs from the 32 major scaffolds were retained. A threshold for SNPs showing highly significant deviations from Hardy-Weinberg equilibrium (HWE) with a p-value of 10⁻⁴ was also implemented to filter out false variants arising from the alignment of paralogous loci. The final dataset consisted of 56 samples and 538,185 SNPs.
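For orientation only, the sketch below applies filters with the thresholds described above using vcftools (invoked from Python to keep the examples in one language). The input/output names and scaffold identifiers are placeholders, and this is a reconstruction of the stated thresholds, not the authors' exact command.

```python
import subprocess

# Thresholds as described above: per-genotype depth 4-50, genotype quality >= 40,
# minor allele count >= 3, no missing genotypes, HWE p-value >= 1e-4,
# restricted to the 32 major scaffolds. All names below are placeholders.
major_scaffolds = [f"Scaffold_{i}" for i in range(1, 33)]  # hypothetical scaffold IDs

cmd = [
    "vcftools", "--gzvcf", "raw_snps.vcf.gz",
    "--minDP", "4", "--maxDP", "50",
    "--minGQ", "40",
    "--mac", "3",
    "--max-missing", "1",
    "--hwe", "0.0001",
    "--recode", "--recode-INFO-all",
    "--out", "avicennia_filtered",
]
for scaffold in major_scaffolds:
    cmd += ["--chr", scaffold]

subprocess.run(cmd, check=True)
```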
The global PDF reader software market is experiencing robust growth, driven by the increasing adoption of digital documents across various sectors. The market's expansion is fueled by several key factors, including the rising demand for efficient document management solutions in businesses of all sizes, the increasing need for cross-platform compatibility, and the growing popularity of cloud-based PDF editing and annotation tools. The market is segmented by functionality (with and without additional editing features) and operating system (Windows, macOS, and others), with the "with additional editing features" segment showing significantly higher growth due to its enhanced productivity capabilities. Major players like Adobe, Foxit Software, and Nitro Software dominate the market, leveraging their established brand recognition and extensive feature sets. However, smaller players are emerging, offering specialized features or competitive pricing, thereby intensifying the competition. Geographic analysis reveals strong market penetration in North America and Europe, owing to high digital literacy rates and advanced technological infrastructure. However, rapid growth is projected in the Asia-Pacific region, driven by increasing internet penetration and smartphone usage. The market faces certain restraints, such as the availability of free or open-source alternatives and the potential for security vulnerabilities associated with PDF files. Despite these challenges, the overall market outlook remains positive, indicating sustained growth in the coming years, driven by continuous technological advancements and increasing demand for streamlined document workflows. The forecast for the PDF reader software market indicates a continued upward trajectory. While precise figures are not provided, based on general market trends and the indicated presence of established players like Adobe, a conservative estimate suggests a market size of approximately $5 billion in 2025, with a Compound Annual Growth Rate (CAGR) between 7-10% over the forecast period (2025-2033). This growth is attributed to expanding business adoption of digital document solutions, increasing government mandates for digitalization, and a continuous rise in remote work and collaboration. The market will likely see further innovation in areas like AI-powered features, enhanced security protocols, and seamless integration with other productivity tools. Competitive pressure is expected to remain high, with companies vying to offer the most comprehensive and user-friendly solutions. This will drive continuous improvement in features, performance, and pricing models, benefiting both enterprise and individual users.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
piRNA data analysis shows the composition of evaluated reads from nine animals, generated by a computational data-analysis pipeline using free software tools.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present visual data of bottom macrofauna filmed in the sublittoral of the European Arctic – Svalbard. Some of the areas (Burgerbukta, Borebukta, Dahlbrebukta, St. Johnsfjorden, Trygghamna) are in the vicinity of melting glaciers, others are in ice-free areas (Adriabukta, Eidembukta, Gipsvika). The dataset consists of three types of data: video samples, 2D mosaics and annotations of objects found. In total, 22 min 51 s of video footage was collected and split into 10-30 s segments, resulting in 47 video samples. The labels of the samples indicate the site, the transect, the part of the raw video from which the video sample was cropped, and the duration of the video sample. For example, in a file named B1_0332_30s.bmp, B stands for Borebukta bay (AD – Adriabukta, B – Borebukta, D – Dahlbrebukta, E – Eidembukta, G – Gipsvika, HB – Burgerbukta, SJ – St. Johnsfjorden, T – Trygghamna), 1 is the transect number, 0332 is the 3 min 32 s start time within the raw video, and 30s is the length of the video sample from which this mosaic file was made. All video samples were converted into still images (video mosaics), which were manually analysed by marine biologists – specialists in Arctic biota – who identified visible biological objects at the lowest possible taxonomic level. Twelve taxa were targeted for annotation: brown alga – kelp Laminaria, spider crab Hyas sp., brittlestars Ophiuroidea, burrowing anemone Halcampa sp., sea squirts Tunicata, tube anemone Ceriantharia sp., sea star Urasterias lincki, tube-dwelling Polychaeta, snailfishes Liparidae, flatfishes Pleuronectiformes, shrimps, and benthic trachymedusa Ptychogastria polaris. The annotation process, in which 4 experts performed manual pixel-wise segmentation (with the Labelbox tool) and a mask refinement survey (on the SurveyJS platform), resulted in 2,242 annotated objects, with Ophiuroidea the most frequent category.
The dataset consists of three directories: video samples (3-5 fps, .AVI format), video mosaics (.JPG format), and annotated categories with/without background (.PNG format). There are 47 video samples and 47 resulting 2D mosaics with corresponding annotations (masks and mask overlays) for 2,242 objects in 12 categories.
Geographic information. The following bays of Svalbard archipelago: Adriabukta (77.000100, 16.192216), Burgerbukta (77.057108, 16.007882), Borebukta (78.38859, 14.28120), Dahlbrebukta (78.566666, 12.368533), Eidembukta (78.360133, 12.779950), Gipsvika (78.42591, 16.52873), St. Johnsfjord (78.506766, 12.931066), Trygghamna (78.254050, 13.761500) .
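As an illustration of the sample-label convention described above, a small parsing sketch follows. The exact field widths are inferred from the single example (B1_0332_30s), so the pattern should be treated as an assumption.

```python
import re

# Site codes as listed in the dataset description above.
SITES = {"AD": "Adriabukta", "B": "Borebukta", "D": "Dahlbrebukta",
         "E": "Eidembukta", "G": "Gipsvika", "HB": "Burgerbukta",
         "SJ": "St. Johnsfjorden", "T": "Trygghamna"}

# Assumed pattern: site code + transect, MMSS start time, duration in seconds.
LABEL_RE = re.compile(
    r"^(?P<site>AD|HB|SJ|[BDEGT])(?P<transect>\d+)_(?P<mm>\d{2})(?P<ss>\d{2})_(?P<dur>\d+)s$"
)

def parse_label(stem):
    m = LABEL_RE.match(stem)
    if not m:
        raise ValueError(f"Unrecognised sample label: {stem}")
    return {
        "site": SITES[m.group("site")],
        "transect": int(m.group("transect")),
        "start_seconds": int(m.group("mm")) * 60 + int(m.group("ss")),
        "duration_seconds": int(m.group("dur")),
    }

print(parse_label("B1_0332_30s"))
# {'site': 'Borebukta', 'transect': 1, 'start_seconds': 212, 'duration_seconds': 30}
```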
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PN40024, a highly homozygous Pinot Noir inbred line, was used for T2T genome assembly. In total, we generated 39.12 Gb (~65x coverage) of HiFi reads on the PacBio platform. The preliminary assembly was produced from the HiFi reads using Hifiasm, and MUMmer was used to order and orient the contig-level assemblies against the PN40024.v3 reference genome, forming 169 contigs representing 19 chromosomes.
The PN_T2T genome (494.87 Mb) is longer than PN40024.v3 (426.18 Mb). Owing to the accuracy of HiFi long reads, the N50 length of PN_T2T (26.89 Mb) is about 260 times that of PN_v3 (~102 kb), and whereas the PN_v3 assembly contains 9,423 gaps, PN_T2T is a gap-free grape genome. To validate the quality of our assembly, k-mer and BUSCO analyses were conducted. K-mer analysis was used to evaluate genomic heterozygosity (estimated at 99.8%), and BUSCO was used to evaluate genomic completeness: about 98.5% of the core conserved plant genes were found complete in the genome assembly.
The PN40024.T2T genome assembly: PN40024.T2T.fa
The PN40024.T2T gene annotation: PN40024.gff
The PN40024.T2T TE annotation: PN40024.TE.gff
The PN40024.T2T centromere annotation: PN40024.trf.gff3
Comparison of gene annotation among PN_T2T and 12X.v0, 12X.v2, PN40024.v4, PN40024.v4.1: Gene connected list .xlsx
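As a quick sanity check on assemblies like those listed above, the sketch below computes scaffold N50 and counts runs of 'N' (gaps) in a FASTA file; a gap-free T2T assembly should report zero gap runs. It assumes Biopython is available, and the file name is taken from the listing above purely as an example.

```python
import re
from Bio import SeqIO

def assembly_stats(fasta_path):
    """Compute total size, sequence count, N50 and number of gap (N) runs."""
    lengths, gap_runs = [], 0
    for record in SeqIO.parse(fasta_path, "fasta"):
        seq = str(record.seq).upper()
        lengths.append(len(seq))
        gap_runs += len(re.findall(r"N+", seq))

    lengths.sort(reverse=True)
    half = sum(lengths) / 2
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running >= half:
            n50 = length
            break
    return {"total_bp": sum(lengths), "n_sequences": len(lengths),
            "N50": n50, "gap_runs": gap_runs}

# assembly_stats("PN40024.T2T.fa")  # expect gap_runs == 0 for a gap-free T2T assembly
```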
https://creativecommons.org/publicdomain/zero/1.0/
By hate_speech_offensive (From Huggingface) [source]
This dataset, named hate_speech_offensive, is a meticulously curated collection of annotated tweets with the specific purpose of detecting hate speech and offensive language. The dataset primarily consists of English tweets and is designed to train machine learning models or algorithms in the task of hate speech detection. It should be noted that the dataset has not been divided into multiple subsets, and only the train split is currently available for use.
The dataset includes several columns that provide valuable information for understanding each tweet's classification. The column count represents the total number of annotations provided for each tweet, whereas hate_speech_count signifies how many annotations classified a particular tweet as hate speech. On the other hand, offensive_language_count indicates the number of annotations categorizing a tweet as containing offensive language. Additionally, neither_count denotes how many annotations identified a tweet as neither hate speech nor offensive language.
For researchers and developers aiming to create effective models or algorithms capable of detecting hate speech and offensive language on Twitter, this comprehensive dataset offers a rich resource for training and evaluation purposes.
Introduction:
Dataset Overview:
- The dataset is presented in a CSV file format named 'train.csv'.
- It consists of annotated tweets with information about their classification as hate speech, offensive language, or neither.
- Each row represents a tweet along with the corresponding annotations provided by multiple annotators.
- The main columns that will be essential for your analysis are: count (total number of annotations), hate_speech_count (number of annotations classifying a tweet as hate speech), offensive_language_count (number of annotations classifying a tweet as offensive language), neither_count (number of annotations classifying a tweet as neither hate speech nor offensive language).
Data Collection Methodology: The data collection methodology used to create this dataset involved obtaining tweets from Twitter's public API using specific search terms related to hate speech and offensive language. These tweets were then manually labeled by multiple annotators who reviewed them for classification purposes.
Data Quality: Although efforts have been made to ensure the accuracy of the data, it is important to acknowledge that annotations are subjective opinions provided by individual annotators. As such, there may be variations in classifications between annotators.
Preprocessing Techniques: Prior to training machine learning models or algorithms on this dataset, it is recommended to apply standard preprocessing techniques such as removal of URLs, usernames/handles, and special characters/punctuation marks, stop-word removal, tokenization, and stemming/lemmatization, depending on your analysis requirements.
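As one illustration of such preprocessing (not a prescribed pipeline), a minimal cleaning function might look like the sketch below; the regexes are examples only, and stop-word removal and stemming/lemmatization are left out.

```python
import re

def clean_tweet(text):
    """Very basic tweet cleaning along the lines suggested above (illustrative only)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove usernames/handles
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # remove punctuation/special characters
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

print(clean_tweet("@user Check this out!!! https://example.com #great"))
# -> "check this out great"
```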
Exploratory Data Analysis (EDA): Conducting EDA on the dataset will help you gain insights and understand the underlying patterns in hate speech and offensive language. Some potential analysis ideas include:
- Distribution of tweet counts per classification category (hate speech, offensive language, neither).
- Most common words/phrases associated with each class.
- Co-occurrence analysis to identify correlations between hate speech and offensive language.
Building Machine Learning Models: To train models for automatic detection of hate speech and offensive language, you can follow these steps: a) Split the dataset into training and testing sets for model evaluation purposes. b) Choose appropriate features/
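Following steps (a) and (b) above, a minimal baseline sketch is shown below. It assumes the raw tweet text is stored in a column named tweet (the column name is not stated above) and derives a single label per tweet from the annotation counts.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")

# Derive a majority label from the annotation counts described above;
# the "tweet" text column name is an assumption.
label_cols = ["hate_speech_count", "offensive_language_count", "neither_count"]
df["label"] = df[label_cols].idxmax(axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    df["tweet"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Simple TF-IDF + logistic-regression baseline.
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
```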
- Sentiment Analysis: This dataset can be used to train models for sentiment analysis on Twitter data. By classifying tweets as hate speech, offensive language, or neither, the dataset can help in understanding the sentiment behind different tweets and identifying patterns of negative or offensive language.
- Hate Speech Detection: The dataset can be used to develop models that automatically detect hate speech on Twitter. By training machine learning algorithms on this annotated dataset, it becomes possible to create systems that can identify and flag hate speech in real-time, making social media platforms safer and more inclusive.
- Content Moderation: Social media platforms can use this dataset to improve their content m...
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research on multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
Options to access the dataset
There are two ways to get access to the dataset:
To obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform, author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria}, booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)}, pages = {1--7}, title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior}, year = {2019} }
@inproceedings{SrbaMonantMedicalDataset, author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria}, booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)}, numpages = {11}, title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims}, year = {2022}, doi = {10.1145/3477495.3531726}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3477495.3531726}, }
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, WordPress sites, the Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset

The way to report considerable mistakes in the raw collected data or in the manual annotations is by creating a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data (i.e., data extracted by the web monitoring module of the Monant platform and stored in exactly the same form as it appears on the original websites). Raw data consist of articles from news sites and blogs (e.g. naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g. snopes.com). In addition, the dataset contains feedback (numbers of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
sources.csv
articles.csv
article_media.csv
article_authors.csv
discussion_posts.csv
discussion_post_authors.csv
fact_checking_articles.csv
fact_checking_article_media.csv
claims.csv
feedback_facebook.csv
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
Annotations
Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., article, source). Relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:
category of annotation (annotation_category). Possible values: label (annotation corresponds to ground truth, determined by human experts) and prediction (annotation was created by means of an AI method).
type of annotation (annotation_type_id). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
method which created the annotation (method_id). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
its value (value). The value is stored in JSON format and its structure differs according to the particular annotation type.
At the same time, annotations are associated with a particular object identified by:
entity type (parameter entity_type in the case of entity annotations, or source_entity_type and target_entity_type in the case of relation annotations). Possible values: sources, articles, fact-checking-articles.
entity id (parameter entity_id in the case of entity annotations, or source_entity_id and target_entity_id in the case of relation annotations).
The dataset specifically provides these entity annotations:
Source reliability (binary). Determines the validity of a source (website) on a binary scale with two options: reliable source and unreliable source.
Article veracity. Aggregated information about veracity from article-claim pairs.
The dataset specifically provides these relation annotations:
Fact-checking article to claim mapping. Determines the mapping between a fact-checking article and a claim.
Claim presence. Determines the presence of a claim in an article.
Claim stance. Determines the stance of an article towards a claim.
Annotations are contained in these CSV files (and corresponding REST API endpoints):
entity_annotations.csv
relation_annotations.csv
Note: The identities of human annotators (the email addresses provided in the annotation app) are anonymised.
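To illustrate how the annotation files above can be consumed, here is a small, hypothetical sketch that loads the entity annotations with pandas and decodes the JSON-encoded value column; column names follow the attribute list above, and the exact layout of the released CSV may differ.

```python
import json
import pandas as pd

entity_ann = pd.read_csv("entity_annotations.csv")

# The "value" attribute is stored as JSON whose structure depends on the annotation type.
entity_ann["value"] = entity_ann["value"].apply(json.loads)

# Example: keep only expert-assigned (ground-truth) annotations attached to sources,
# e.g. the binary source-reliability labels described above.
source_labels = entity_ann[
    (entity_ann["annotation_category"] == "label")
    & (entity_ann["entity_type"] == "sources")
]
print(source_labels[["entity_id", "annotation_type_id", "value"]].head())
```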
"1. Overview This dataset is a collection of low view traffic images in multiple scenes, backgrounds and lighting conditions that are ready to use for optimizing the accuracy of computer vision models. All of the contents is sourced from PIXTA's stock library of 100M+ Asian-featured images and videos. PIXTA is the largest platform of visual materials in the Asia Pacific region offering fully-managed services, high quality contents and data, and powerful tools for businesses & organisations to enable their creative and machine learning projects.
Use case

This dataset is used for training and testing AI solutions in several cases: traffic monitoring, traffic camera systems, vehicle flow estimation, ... Each dataset is supported by both AI and human review processes to ensure labelling consistency and accuracy. Contact us for more custom datasets.
About PIXTA

PIXTASTOCK is the largest Asian-featured stock platform, providing data, content, tools and services since 2005. PIXTA has 15 years of experience integrating advanced AI technology in managing, curating and processing over 100M visual materials and serving global leading brands for their creative and data demands. Visit us at https://www.pixta.ai/ for more details.
This dataset contains news headlines relevant to key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from reputable platforms Forex Live and FXstreet over a period of 86 days, from January to May 2023. The dataset comprises 2,291 unique news headlines. Each headline includes an associated forex pair, timestamp, source, author, URL, and the corresponding article text. Data was collected using web scraping techniques executed via a custom service on a virtual machine. This service periodically retrieves the latest news for a specified forex pair (ticker) from each platform, parsing all available information. The collected data is then processed to extract details such as the article's timestamp, author, and URL. The URL is further used to retrieve the full text of each article. This data acquisition process repeats approximately every 15 minutes.
To ensure the reliability of the dataset, we manually annotated each headline for sentiment. Instead of solely focusing on the textual content, we ascertained sentiment based on the potential short-term impact of the headline on its corresponding forex pair. This method recognizes the currency market's acute sensitivity to economic news, which significantly influences many trading strategies. As such, this dataset could serve as an invaluable resource for fine-tuning sentiment analysis models in the financial realm.
We used three categories for annotation: 'positive', 'negative', and 'neutral', which correspond to bullish, bearish, and hold sentiments, respectively, for the forex pair linked to each headline. The following Table provides examples of annotated headlines along with brief explanations of the assigned sentiment.
Examples of Annotated Headlines

| Forex Pair | Headline | Sentiment | Explanation |
| --- | --- | --- | --- |
| GBPUSD | Diminishing bets for a move to 12400 | Neutral | Lack of strong sentiment in either direction |
| GBPUSD | No reasons to dislike Cable in the very near term as long as the Dollar momentum remains soft | Positive | Positive sentiment towards GBPUSD (Cable) in the near term |
| GBPUSD | When are the UK jobs and how could they affect GBPUSD | Neutral | Poses a question and does not express a clear sentiment |
| JPYUSD | Appropriate to continue monetary easing to achieve 2% inflation target with wage growth | Positive | Monetary easing from the Bank of Japan (BoJ) could lead to a weaker JPY in the short term due to increased money supply |
| USDJPY | Dollar rebounds despite US data. Yen gains amid lower yields | Neutral | Since both the USD and JPY are gaining, the effects on the USDJPY forex pair might offset each other |
| USDJPY | USDJPY to reach 124 by Q4 as the likelihood of a BoJ policy shift should accelerate Yen gains | Negative | USDJPY is expected to reach a lower value, with the USD losing value against the JPY |
| AUDUSD | RBA Governor Lowe’s Testimony High inflation is damaging and corrosive | Positive | The Reserve Bank of Australia (RBA) expresses concerns about inflation. Typically, central banks combat high inflation with higher interest rates, which could strengthen AUD. |

Moreover, the dataset includes two columns with the predicted sentiment class and score as predicted by the FinBERT model. Specifically, the FinBERT model outputs a set of probabilities for each sentiment class (positive, negative, and neutral), representing the model's confidence in associating the input headline with each sentiment category. These probabilities are used to determine the predicted class and a sentiment score for each headline. The sentiment score is computed by subtracting the negative class probability from the positive one.
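For reference, a minimal sketch of reproducing such a score with the Hugging Face transformers library is shown below. The specific checkpoint ProsusAI/finbert is an assumption (the description only says "the FinBERT model"), and the label names are read from that checkpoint's configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "ProsusAI/finbert" is a commonly used public FinBERT checkpoint; it is an
# assumption here, since the dataset description does not name a checkpoint.
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

def finbert_sentiment(headline):
    inputs = tokenizer(headline, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze()
    labels = [model.config.id2label[i] for i in range(probs.shape[0])]
    scores = dict(zip(labels, probs.tolist()))
    predicted = max(scores, key=scores.get)
    # Sentiment score = P(positive) - P(negative), as described above.
    score = scores.get("positive", 0.0) - scores.get("negative", 0.0)
    return predicted, score

print(finbert_sentiment("Dollar rebounds despite US data. Yen gains amid lower yields"))
```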
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Access to Data

The concurrent release of rice genome sequences for two subspecies (Oryza sativa L. ssp. japonica and Oryza sativa L. ssp. indica) facilitates rice studies at the whole genome level. Since the advent of high-throughput analysis, huge amounts of functional genomics data have been delivered rapidly, making an integrated online genome browser indispensable for scientists to visualize and analyze these data. Based on next-generation web technologies and high-throughput experimental data, we have developed Rice-Map, a novel genome browser for researchers to navigate, analyze and annotate rice genome interactively. More than one hundred annotation tracks (81 for japonica and 82 for indica) have been compiled and loaded into Rice-Map. These pre-computed annotations cover gene models, transcript evidences, expression profiling, epigenetic modifications, inter-species and intra-species homologies, genetic markers and other genomic features. In addition to these pre-computed tracks, registered users can interactively add comments and research notes to Rice-Map as User-Defined Annotation entries. By smoothly scrolling, dragging and zooming, users can browse various genomic features simultaneously at multiple scales. On-the-fly analysis for selected entries could be performed through dedicated bioinformatic analysis platforms such as WebLab and Galaxy. Furthermore, a BioMart-powered data warehouse "Rice Mart" is offered for advanced users to fetch bulk datasets based on complex criteria. Rice-Map delivers abundant up-to-date japonica and indica annotations, providing a valuable resource for both computational and bench biologists. Rice-Map is publicly accessible at http://www.ricemap.org/, with all data available for free downloading.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CDD is a protein annotation resource that consists of a collection of annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domain models, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP profiles are manually created by expert curators. They identify proteins that are part of well-conserved protein families or subfamilies. HAMAP is based at the SIB Swiss Institute of Bioinformatics, Geneva, Switzerland.