- Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.
- Quality: Multiple rounds of quality inspection ensure high-quality data output, certified to ISO 9001.
The use of artificial intelligence in public services is often identified as an opportunity to query documentary corpora and build automatic question-answering (Q&A) tools for users. Querying the French labour code in natural language, providing a conversational agent for a given service, developing efficient search engines, improving knowledge management: all of these require a body of quality training data in order to develop Q&A algorithms. The PIAF dataset is a public and open French-language training dataset for training these algorithms.
Inspired by SQuAD, the well-known English question-answering dataset, our ambition was to build a similar dataset that would be open to all. The protocol we followed is very similar to that of the first version of SQuAD (SQuAD v1.1); however, some changes had to be made to adapt it to the characteristics of the French Wikipedia. Another big difference is that we do not employ micro-workers via crowd-sourcing platforms.
After several months of annotation, we have a robust and free annotation platform, a sufficient amount of annotations, and a well-founded, innovative approach to community animation and collaborative participation within the French administration.
In March 2018, France launched its national strategy for artificial intelligence. Piloted within the Interdepartmental Digital Branch, this strategy has three components: research, the economy and public transformation.
Given that data policy is a major focus of the development of artificial intelligence, the Etalab mission is piloting the establishment of an interministerial "Lab IA", whose mission is to accelerate the deployment of AI in administrations via three main activities:
The PIAF project is one of the shared tools of the Lab IA.
The dataset follows the SQuAD v1.1 format. PIAF v1.2 contains 9,225 question-answer pairs, distributed as a JSON file. A text file illustrating the schema is included below. The file can be used to train and evaluate question-answering models, for example by following these instructions.
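For orientation, here is a minimal sketch of the SQuAD v1.1-style layout that PIAF follows, together with a small helper that counts question-answer pairs. The article, question, and answer texts below are placeholders rather than actual PIAF content, and the file name passed to the helper is illustrative.

```python
import json

# Illustrative SQuAD v1.1-style record (placeholder content, not real PIAF data).
example = {
    "version": "1.2",
    "data": [
        {
            "title": "Exemple d'article Wikipédia",
            "paragraphs": [
                {
                    "context": "La réponse se trouve dans ce paragraphe d'exemple.",
                    "qas": [
                        {
                            "id": "exemple-0001",
                            "question": "Où se trouve la réponse ?",
                            "answers": [
                                {"text": "ce paragraphe d'exemple", "answer_start": 26}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

def count_qa_pairs(path):
    """Count question-answer pairs in a PIAF/SQuAD-style JSON file."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    return sum(
        len(paragraph["qas"])
        for article in dataset["data"]
        for paragraph in article["paragraphs"]
    )

# count_qa_pairs("piaf-v1.2.json")  # file name is illustrative
```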
We warmly thank our contributors, who have kept this project alive on a voluntary basis to this day.
Information on the protocol followed, the project news, the annotation platform and the related code are here:
v1.0
2024-10-15
Personnel requirements: professional and ordinary speakers, half male and half female
Collection equipment: professional recording booth
Data format: .wav, .txt, .TextGrid
Data features: ordinary and professional speakers; the text content covers all Chinese phonemes as well as the main Chinese phonetic contexts; the recorded content is completely consistent with the text
Annotation content: consonant-vowel (initial/final) segmentation, prosody annotation, and pinyin annotation
All speakers, except those aged 0-15 and over 50, have some broadcasting training and hold a Class 1 (Grade A or Grade B) Putonghua proficiency certificate.
Speakers without a Putonghua certificate should have natural, friendly voices and relatively standard Putonghua.
Speaking speed should be natural, and volume and speed should be kept as consistent as possible.
Speakers should be in good condition during recording; breath sounds and excessive saliva sounds in silent segments should be avoided.
Each speaker reads 1,500 sentences, delivered as one complete file per person; the order of sentences in the file is exactly the same as in the text, with 700 ms-1 s of silence between every two sentences.
The audio format is .wav, with a 48 kHz sampling rate, 16-bit depth, and a single channel. The background noise of the audio is below -60 dB, and the signal-to-noise ratio reaches 35 dB. The peak level of a single sentence is between -2 dB and 9 dB, with no clipping.
High-frequency information of the audio is complete.
There is no obvious noise in the audio, including but not limited to background noise, electrical hum, key-pressing sounds, breathing sounds, saliva sounds, etc.
The sound preserves the real human voice as faithfully as possible.
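A rough way to check delivered recordings against the stated format (48 kHz, 16-bit, mono, no clipping) is sketched below using only the Python standard library plus NumPy. The noise-floor and SNR checks are omitted since they require segmenting silence, and the thresholds come from the description above rather than any official QA tool.

```python
import wave
import numpy as np

def check_wav_spec(path):
    """Rough check of a recording against the stated delivery spec (illustrative only)."""
    with wave.open(path, "rb") as wf:
        ok_rate = wf.getframerate() == 48000
        ok_depth = wf.getsampwidth() == 2          # 16-bit == 2 bytes per sample
        ok_mono = wf.getnchannels() == 1
        frames = wf.readframes(wf.getnframes())

    samples = np.frombuffer(frames, dtype=np.int16).astype(np.float64) / 32768.0
    peak_dbfs = 20 * np.log10(np.max(np.abs(samples)) + 1e-12)
    clipped = bool(np.any(np.abs(samples) >= 32767 / 32768.0))

    return {
        "sample_rate_48k": ok_rate,
        "bit_depth_16": ok_depth,
        "mono": ok_mono,
        "peak_dbfs": round(peak_dbfs, 2),
        "clipping_detected": clipped,
    }

# check_wav_spec("audio1.wav")  # file name taken from the directory structure below
```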
Gender distribution: 50 men, 50 women, total 100 people
Age & quantity distribution:

| Age | Gender | Number | Quantity |
| --- | --- | --- | --- |
| 0-15 years old | Male | 5 | Effective time 1.5-2 hours/person, total effective time 142 hours, 102,400 sentences in total |
| 0-15 years old | Female | 6 | |
| 16-50 years old | Male | 40 | |
| 16-50 years old | Female | 39 | |
| Over 50 years old | Male | 5 | |
| Over 50 years old | Female | 5 | |
## Directory Structure
```
root_directory/
├── audio/
│   ├── audio1.wav
│   ├── text1.txt
```
The screenwriting and annotation software market is experiencing robust growth, driven by the increasing demand for efficient and collaborative tools in the film, television, and animation industries. The market size in 2025 is estimated at $500 million, exhibiting a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This growth is fueled by several key factors. The rising adoption of cloud-based solutions facilitates seamless collaboration among writers, editors, and directors, regardless of geographical location. Furthermore, the integration of advanced features like real-time annotation, version control, and script breakdown tools enhances productivity and streamlines the entire screenwriting process. The market is segmented by deployment (on-premise and cloud-based) and user type (large enterprises, medium enterprises), with cloud-based solutions gaining significant traction due to their accessibility and cost-effectiveness. While the prevalence of free or open-source alternatives presents a restraint, the increasing need for professional-grade features, robust security, and seamless integration with other production tools is driving adoption of paid software. Major players like Final Draft and Celtx dominate the market, but a growing number of smaller companies are offering niche solutions, fostering innovation and competition. The North American market currently holds the largest share, followed by Europe and Asia Pacific, with significant growth potential in emerging markets. The forecast period from 2025 to 2033 projects continued expansion, primarily driven by technological advancements and increasing content creation across various media platforms. The shift towards digital workflows and the growing preference for collaborative writing environments are strengthening the market outlook. However, challenges such as the high initial investment for professional software and the need for continuous training and support might slightly hinder market growth in certain regions. Nevertheless, the overall trajectory points towards a significant increase in market value, with the potential for niche players to gain a stronger foothold through specialized functionalities and targeted marketing strategies within specific segments of the industry. The market's continued growth underscores its importance in the evolving media landscape, where efficiency and collaboration are increasingly crucial for success.
Pinot Noir, one of the world's most famous grapes with its late-ripening, light ruby-red fruit, was used for T2T genome assembly. A total of 33,349,412,693 bp of HiFi reads with ~65x coverage were generated on the PacBio platform. The preliminary assembly was produced from the HiFi reads using Hifiasm. A total of 83,958,664,800 bp of Hi-C reads with ~160x coverage were used to anchor and remove some short contigs using Juicer, and 3D-DNA was then used to obtain the genome at the scaffold level. This yielded the two gap-free T2T haplotypes of the hybrid Pinot Noir: PN1 (473.43 Mb) and PN2 (467.09 Mb).
To validate the quality of our assembly, k-mer and BUSCO analyses were conducted. The k-mer analysis estimated genomic heterozygosity at 1.43%, and BUSCO was used to evaluate genomic completeness: about 98.3% (PN1) and 98.4% (PN2) of the core conserved plant genes were found complete in the genome assembly. For genome annotation, the numbers of genes identified in the two haplotypes are similar, with more than 37,000 genes found for Pinot Noir: 37,037 in PN1 and 37,350 in PN2.
The PN1 genome assembly: PNhap1.fa
The PN1 gene annotation: PNhap1.gff3
The PN1 TE annotation: PNhap1.TE.gff
The PN2 genome assembly: PNhap2.fa
The PN2 gene annotation: PNhap2.gff3
The PN2 TE annotation: PNhap2.TE.gff
The global PDF reader software market is experiencing robust growth, driven by the increasing adoption of digital documents across various sectors. While precise market size figures for the base year (2025) are unavailable, considering industry trends and the presence of major players like Adobe, Foxit, and Google, a reasonable estimate places the market value at approximately $2.5 billion in 2025. This market is projected to exhibit a Compound Annual Growth Rate (CAGR) of 8% from 2025 to 2033, indicating a substantial expansion over the forecast period. This growth is fueled by several key factors including the rising demand for collaborative document editing, enhanced security features in PDF readers, and the increasing use of PDFs across diverse platforms and devices. The market's segmentation includes various pricing models (free, subscription-based), functionalities (basic viewing, editing, annotation), and deployment methods (cloud, on-premise). The competitive landscape is dominated by established players like Adobe Acrobat Reader, but also features strong competition from smaller, specialized providers offering niche solutions. The market's future trajectory hinges on factors like technological advancements leading to more intuitive and feature-rich software, further integration with cloud services and collaboration tools, and ongoing concerns around PDF security and data privacy. The market's growth is also influenced by regional variations in technology adoption and digitalization. North America and Europe currently hold a significant market share, attributed to high digital literacy and established IT infrastructure. However, rapid growth is anticipated in emerging economies like Asia-Pacific, driven by increasing smartphone penetration and a growing demand for cost-effective PDF solutions. Constraints on market growth include the availability of free, basic PDF readers, which may limit the adoption of premium, feature-rich options. Furthermore, concerns about data security and the potential for vulnerabilities within PDF software remain a crucial consideration influencing both consumer and enterprise adoption. Future growth depends on addressing these security concerns and introducing innovations that enhance user experience and productivity.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
1.1 Library preparation
The total RNA from pre-aestivation (5-day-old), aestivation (30-day-old), and post-aestivation (55-day-old) female beetles was extracted using the ZYMO Quick-RNA Tissue/Insect Kit (ZYMO Research, Irvine, CA, USA) and cleaned using the TURBO DNA-free™ kit (Thermo Fisher Scientific, Langenselbold, Germany) according to the manufacturer’s instructions. We opted to sample only the females to eliminate sex-related variations. RNA quantity was determined using a Nanodrop ND-1000 UV/Vis spectrophotometer (Thermo Fisher Scientific). The integrity of the RNA samples was determined using the Agilent 2100 Bioanalyzer and an RNA 6000 Nano Kit (Agilent Technologies, Santa Clara, CA, USA). RIN values ≥ 7.0 were considered appropriate for mRNA library preparation. In total, 10 libraries (4, 3, and 3 libraries respectively per pre-aestivation, aestivation, and post-aestivation stages) were prepared using the NEBNext® Poly(A) mRNA Magnetic Isolation Module kit (NEB E7490, New England Biolabs) according to the manufacturer’s instructions. The quality of the libraries was checked via RNA fragment analysis conducted on the Agilent 2100 Bioanalyzer using the Agilent DNF-935 Reagent Kit (Agilent Technologies). The libraries were pooled based on their concentration, and an overall concentration of 3.4 ng/µL was obtained. The sequencing service was provided by BGI Genomics Tech Solutions Co. Ltd (Hong Kong) on a DNBSEQ-T7 platform.
1.2 De novo assembly and functional annotation
Erroneous k-mers from paired read ends were removed using Rcorrector (v1.0.5) with default options (Song & Florea, 2015), and the unfixable reads were discarded using the “FilterUncorrectabledPEfastq.py” function in Transcriptome Assembly Tools (Song & Florea, 2015). The adaptor sequences from the reads were removed, and reads having a quality score above 30 were retained using TrimGalore! (v0.6.7). The cleaned reads (n = 3 per three adult phases) were de novo assembled using Trinity with default options. The completeness of the transcriptome was quantified using BUSCO (v5.4.2) via a comparison against the endopterygota dataset (BUSCO.v4 datasets). The transcriptome (including isoforms) was annotated using Trinotate (v3.2.2), which combines the outputs of the NCBI BLAST+ (v2.13.0; nucleotide and predicted protein BLAST), TransDecoder (v5.5.0; coding region prediction), SignalP (v4.0; signal peptide prediction), TMHMM (v2.0; transmembrane domain prediction), and HMMER (v3.3.2; homology search) packages into an SQLite annotation database. The latest uniprot_sprot (04/2022) and Pfam-A (11/2015) databases were downloaded using Trinotate, and the default E-value thresholds were used during the searches with BLAST+ and HMMER, respectively. The obtained annotation database was used to extract gene ontology (GO) terms associated with individual genes using the “extract_GO_assignments_from_Trinotate_xls.pl” script, whereas the SignalP and TMHMM outputs were manually extracted using Excel spreadsheets. The longest protein-coding regions in the super transcript data predicted by TransDecoder were subjected to Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway annotation via GhostKoala v2.2 (https://www.kegg.jp/ghostkoala/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of structural annotation changes for curated organisms since the initial integration of genomes in TriTrypDB.
https://spdx.org/licenses/CC0-1.0.html
The gray mangrove [Avicennia marina (Forsk.) Vierh.] is the most widely distributed mangrove species, ranging throughout the Indo-West Pacific. It presents remarkable levels of geographic variation both in phenotypic traits and habitat, often occupying extreme environments at the edges of its distribution. However, subspecific evolutionary relationships and adaptive mechanisms remain understudied, especially across populations of the West Indian Ocean. High-quality genomic resources accounting for such variability are also sparse. Here we report the first chromosome-level assembly of the genome of A. marina. We used a previously released draft assembly and the proximity ligation libraries Chicago and Dovetail HiC for scaffolding, producing a 456,526,188 bp long genome. The largest 32 scaffolds (22.4 Mb to 10.5 Mb) accounted for 98% of the genome assembly, with the remaining 2% distributed among 3,759 much shorter scaffolds (62.4 Kb to 1 Kb). We annotated 45,032 protein-coding genes using tissue-specific RNA-seq data in combination with de novo gene prediction, of which 34,442 were associated with GO terms. The genome assembly and the annotated gene set yield completeness scores of 96.7% and 95.1%, respectively, when compared with the eudicots BUSCO dataset. Furthermore, an FST survey based on resequencing data successfully identified a set of candidate genes potentially involved in local adaptation, and revealed patterns of adaptive variability correlating with a temperature gradient in Arabian mangrove populations. Our A. marina genomic assembly provides a highly valuable resource for genome evolution analysis, as well as for identifying functional genes involved in adaptive processes and speciation.
Methods

Genome sequencing and assembly

The sequenced sample was leaf tissue obtained from an individual located at Ras Ghurab Island in the Arabian Gulf (Abu Dhabi, United Arab Emirates; 24.601°N, 4.566°E), corresponding to the A. m. marina variety. A high-quality genome was produced using proximity ligation libraries and the software pipeline HiRise at Dovetail Genomics, LLC. Briefly, for Chicago and the Dovetail HiC library preparation, chromatin was fixed with formaldehyde. Fixed chromatin was then digested with DpnII and free blunt ends were ligated. Crosslinks were reversed, and the DNA was purified from protein and then sheared to ~350 bp mean fragment size. Libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters, and sequencing was carried out on an Illumina HiSeq X platform. Chicago and Dovetail HiC library reads were then used as input data for genome assembly with HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies. A previously reported draft genome of Avicennia marina (GenBank accession: GCA_900003535.1) was used in the assembly pipeline, excluding scaffolds shorter than 1 Kb since HiRise does not assemble them.
The mitochondrial genome was assembled using NOVOplasty2.7.2 and resequencing data based on Illumina paired-end 150 bp libraries from a conspecific individual (See below; Supplementary Information). The maturase (matR) mitochondrial gene available in NCBI (GenBank accession no. AY289666.1) was used for the input seed sequence.
Genome annotation

We performed the annotation of the A. marina genome using mRNA data from a set of tissues of conspecific individuals in combination with de novo gene prediction using BRAKER2 v2.1.5 (Hoff et al. 2016). Samples were collected on the coast of the Eastern Central Red Sea north of Jeddah in the Kingdom of Saudi Arabia (22.324 °N, 39.100 °E; Figure 1A). Total RNA was isolated from root, stem, leaf, flower, and seed using TRIzol reagent (Invitrogen, USA). RNA-seq libraries were prepared using the TruSeq RNA sample prep kit (Illumina, Inc.), with inserts that range in size from approximately 100-400 bp. Library quality control and quantification were performed with a Bioanalyzer Chip DNA 1000 series II (Agilent), and sequencing was carried out on a HiSeq 2000 platform (Illumina, Inc.). First, repetitive regions were modelled ab initio using RepeatModeler v2.0.1 (Flynn et al. 2019) in all scaffolds longer than 100 Kb with default options. The resulting repeat library was used to annotate and soft-mask repeats in the genome assembly with RepeatMasker 4.0.9 (Smit et al. 2015). Next, messenger RNA reads were mapped against the soft-masked genome assembly with HISAT2 (Kim et al. 2015). Gene prediction was conducted with BRAKER2 using both the RNA-seq data and the conserved orthologous genes from BUSCO Eudicots_odb10 as proteins from short evolutionary distance to provide hints and train GeneMark-ETP and Augustus (--etpmode; Hoff et al. 2019; Bruna et al. 2020; Lomsadze et al. 2005; Buchfink et al. 2015; Gotoh 2008; Iwata and Gotoh 2012; Li et al. 2009; Barnett et al. 2011; Lomsadze et al. 2014; Stanke et al. 2008; Stanke et al. 2006). The obtained gene annotation gff3 file was validated and used to generate the reported gene annotation statistics with GenomeTools (Gremme et al. 2013) and in-house Perl scripts. Finally, we conducted a similarity-based approach to assist the functional annotation of the predicted proteins. We integrated InterProScan v5.31 (Jones et al. 2014) and BLAST (Tatusova and Madden 1999) searches using the UniProt Swiss-Prot database and the annotated proteins from the Arabidopsis thaliana genome (UniProt Consortium 2019) to generate a final set of annotated functional genes.
Variant calling from resequencing data

Whole genome resequencing was carried out for the 60 individuals from 6 populations around the Arabian Peninsula at Novogene facilities. Illumina paired-end 150 bp libraries with an insert size of 350 bp were prepared and sequenced on a NovaSeq platform. A total of 2.4G reads were produced, resulting in a mean coverage per site and sample of 85X before filtering. Read quality was evaluated using FASTQC after sorting reads by individual with AXE. Trimming and quality filtering were conducted using Trim Galore, resulting in a set of reads ranging between 90 and 138 bp long. Reads were then mapped against the A. marina reference genome using the mem algorithm in the Burrows-Wheeler Aligner. Read groups were assigned and BAM files generated with Picard Tools version 1.126. We used the HaplotypeCaller + GenotypeGVCFs tools from the Genome Analysis Toolkit version 3.6-0 to produce a set of single nucleotide polymorphisms (SNPs) in the variant call format (VCF). Genotype quality and missing data filters for downstream analyses were implemented with vcftools. Samples with less than 25% of the sites genotyped were discarded. Then, a SNP matrix was constructed excluding SNPs outside a coverage range of 4 to 50 or with a genotyping phred quality score below 40. Positions for which one or more samples were not genotyped were removed, along with those presenting a minor allele count (MAC) below 3. Only the SNPs from the 32 major scaffolds were retained. A threshold for SNPs showing highly significant deviations from Hardy-Weinberg equilibrium (HWE) with a p-value of 10⁻⁴ was also implemented to filter out false variants arising from the alignment of paralogous loci. The final dataset consisted of 56 samples and 538,185 SNPs.
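For orientation only, the sketch below applies filters with the thresholds described above using vcftools (invoked from Python to keep the examples in one language). The input/output names and scaffold identifiers are placeholders, and this is a reconstruction of the stated thresholds, not the authors' exact command.

```python
import subprocess

# Thresholds as described above: per-genotype depth 4-50, genotype quality >= 40,
# minor allele count >= 3, no missing genotypes, HWE p-value >= 1e-4,
# restricted to the 32 major scaffolds. All names below are placeholders.
major_scaffolds = [f"Scaffold_{i}" for i in range(1, 33)]  # hypothetical scaffold IDs

cmd = [
    "vcftools", "--gzvcf", "raw_snps.vcf.gz",
    "--minDP", "4", "--maxDP", "50",
    "--minGQ", "40",
    "--mac", "3",
    "--max-missing", "1",
    "--hwe", "0.0001",
    "--recode", "--recode-INFO-all",
    "--out", "avicennia_filtered",
]
for scaffold in major_scaffolds:
    cmd += ["--chr", scaffold]

subprocess.run(cmd, check=True)
```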
The global PDF reader software market is experiencing robust growth, driven by the increasing adoption of digital documents across various sectors. The market's expansion is fueled by several key factors, including the rising demand for efficient document management solutions in businesses of all sizes, the increasing need for cross-platform compatibility, and the growing popularity of cloud-based PDF editing and annotation tools. The market is segmented by functionality (with and without additional editing features) and operating system (Windows, macOS, and others), with the "with additional editing features" segment showing significantly higher growth due to its enhanced productivity capabilities. Major players like Adobe, Foxit Software, and Nitro Software dominate the market, leveraging their established brand recognition and extensive feature sets. However, smaller players are emerging, offering specialized features or competitive pricing, thereby intensifying the competition. Geographic analysis reveals strong market penetration in North America and Europe, owing to high digital literacy rates and advanced technological infrastructure. However, rapid growth is projected in the Asia-Pacific region, driven by increasing internet penetration and smartphone usage. The market faces certain restraints, such as the availability of free or open-source alternatives and the potential for security vulnerabilities associated with PDF files. Despite these challenges, the overall market outlook remains positive, indicating sustained growth in the coming years, driven by continuous technological advancements and increasing demand for streamlined document workflows. The forecast for the PDF reader software market indicates a continued upward trajectory. While precise figures are not provided, based on general market trends and the indicated presence of established players like Adobe, a conservative estimate suggests a market size of approximately $5 billion in 2025, with a Compound Annual Growth Rate (CAGR) between 7-10% over the forecast period (2025-2033). This growth is attributed to expanding business adoption of digital document solutions, increasing government mandates for digitalization, and a continuous rise in remote work and collaboration. The market will likely see further innovation in areas like AI-powered features, enhanced security protocols, and seamless integration with other productivity tools. Competitive pressure is expected to remain high, with companies vying to offer the most comprehensive and user-friendly solutions. This will drive continuous improvement in features, performance, and pricing models, benefiting both enterprise and individual users.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
piRNA data analysis shows the composition of evaluated reads from nine animals, generated by a computational data-analysis pipeline using free software tools.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present visual data of bottom macrofauna filmed in the sublittoral of the European Arctic – Svalbard. Some of the areas (Burgerbukta, Borebukta, Dahlbrebukta, St. Johnsfjorden, Trygghamna) are in the vicinity of melting glaciers, others are in ice-free areas (Adriabukta, Eidembukta, Gipsvika). The dataset consists of three types of data: video samples, 2D mosaics and annotations of objects found. In total, 22 min 51 s of video footage was collected and split into 10-30 s segments, resulting in 47 video samples. The labels of the samples indicate the site, the transect, the part of the raw video from which the video sample was cropped, and the duration of the video sample. For example, in a file named B1_0332_30s.bmp, B stands for Borebukta bay (AD – Adriabukta, B – Borebukta, D – Dahlbrebukta, E – Eidembukta, G – Gipsvika, HB – Burgerbukta, SJ – St. Johnsfjorden, T – Trygghamna), 1 is the transect number, 0332 is the 3 min 32 s start time within the raw video, and 30s is the length of the video sample from which this mosaic file was made. All video samples were converted into still images (video mosaics), which were manually analysed by marine biologists – specialists in Arctic biota – who identified visible biological objects at the lowest possible taxonomic level. Twelve taxa were targeted for annotation: brown alga – kelp Laminaria, spider crab Hyas sp., brittlestars Ophiuroidea, burrowing anemone Halcampa sp., sea squirts Tunicata, tube anemone Ceriantharia sp., sea star Urasterias lincki, tube-dwelling Polychaeta, snailfishes Liparidae, flatfishes Pleuronectiformes, shrimps, and benthic trachymedusa Ptychogastria polaris. The annotation process, in which 4 experts performed manual pixel-wise segmentation (with the Labelbox tool) and a mask refinement survey (on the SurveyJS platform), resulted in 2,242 annotated objects, with Ophiuroidea the most frequent category.
The dataset consists of three directories: video samples (3-5 fps, .AVI format), video mosaics (.JPG format), and annotated categories with/without background (.PNG format). There are 47 video samples and 47 resulting 2D mosaics with corresponding annotations (masks and mask overlays) for 2,242 objects in 12 categories.
Geographic information. The following bays of Svalbard archipelago: Adriabukta (77.000100, 16.192216), Burgerbukta (77.057108, 16.007882), Borebukta (78.38859, 14.28120), Dahlbrebukta (78.566666, 12.368533), Eidembukta (78.360133, 12.779950), Gipsvika (78.42591, 16.52873), St. Johnsfjord (78.506766, 12.931066), Trygghamna (78.254050, 13.761500) .
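As an illustration of the sample-label convention described above, a small parsing sketch follows. The exact field widths are inferred from the single example (B1_0332_30s), so the pattern should be treated as an assumption.

```python
import re

# Site codes as listed in the dataset description above.
SITES = {"AD": "Adriabukta", "B": "Borebukta", "D": "Dahlbrebukta",
         "E": "Eidembukta", "G": "Gipsvika", "HB": "Burgerbukta",
         "SJ": "St. Johnsfjorden", "T": "Trygghamna"}

# Assumed pattern: site code + transect, MMSS start time, duration in seconds.
LABEL_RE = re.compile(
    r"^(?P<site>AD|HB|SJ|[BDEGT])(?P<transect>\d+)_(?P<mm>\d{2})(?P<ss>\d{2})_(?P<dur>\d+)s$"
)

def parse_label(stem):
    m = LABEL_RE.match(stem)
    if not m:
        raise ValueError(f"Unrecognised sample label: {stem}")
    return {
        "site": SITES[m.group("site")],
        "transect": int(m.group("transect")),
        "start_seconds": int(m.group("mm")) * 60 + int(m.group("ss")),
        "duration_seconds": int(m.group("dur")),
    }

print(parse_label("B1_0332_30s"))
# {'site': 'Borebukta', 'transect': 1, 'start_seconds': 212, 'duration_seconds': 30}
```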
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PN40024, a highly homozygous Pinot Noir inbred line, was used for T2T genome assembly. In total, we generated 39.12 Gb (~65x coverage) of HiFi reads on the PacBio platform. The preliminary assembly was produced from the HiFi reads using Hifiasm, and MUMmer was used to order and orient the contig-level assemblies against the PN40024.v3 reference genome, forming 169 contigs representing 19 chromosomes.
The PN_T2T genome (494.87 Mb) is longer than PN40024.v3 (426.18 Mb). Owing to the accuracy of HiFi long reads, the N50 length of PN_T2T (26.89 Mb) is about 260 times that of PN_v3 (~102 kb), and whereas the PN_v3 assembly contains 9,423 gaps, PN_T2T is a gap-free grape genome. To validate the quality of our assembly, k-mer and BUSCO analyses were conducted. K-mer analysis was used to evaluate genomic heterozygosity (estimated at 99.8%), and BUSCO was used to evaluate genomic completeness: about 98.5% of the core conserved plant genes were found complete in the genome assembly.
The PN40024.T2T genome assembly: PN40024.T2T.fa
The PN40024.T2T gene annotation: PN40024.gff
The PN40024.T2T TE annotation: PN40024.TE.gff
The PN40024.T2T centromere annotation: PN40024.trf.gff3
Comparison of gene annotation among PN_T2T and 12X.v0, 12X.v2, PN40024.v4, PN40024.v4.1: Gene connected list .xlsx
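As a quick sanity check on assemblies like those listed above, the sketch below computes scaffold N50 and counts runs of 'N' (gaps) in a FASTA file; a gap-free T2T assembly should report zero gap runs. It assumes Biopython is available, and the file name is taken from the listing above purely as an example.

```python
import re
from Bio import SeqIO

def assembly_stats(fasta_path):
    """Compute total size, sequence count, N50 and number of gap (N) runs."""
    lengths, gap_runs = [], 0
    for record in SeqIO.parse(fasta_path, "fasta"):
        seq = str(record.seq).upper()
        lengths.append(len(seq))
        gap_runs += len(re.findall(r"N+", seq))

    lengths.sort(reverse=True)
    half = sum(lengths) / 2
    running, n50 = 0, 0
    for length in lengths:
        running += length
        if running >= half:
            n50 = length
            break
    return {"total_bp": sum(lengths), "n_sequences": len(lengths),
            "N50": n50, "gap_runs": gap_runs}

# assembly_stats("PN40024.T2T.fa")  # expect gap_runs == 0 for a gap-free T2T assembly
```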
https://creativecommons.org/publicdomain/zero/1.0/
By hate_speech_offensive (From Huggingface) [source]
This dataset, named hate_speech_offensive, is a meticulously curated collection of annotated tweets with the specific purpose of detecting hate speech and offensive language. The dataset primarily consists of English tweets and is designed to train machine learning models or algorithms in the task of hate speech detection. It should be noted that the dataset has not been divided into multiple subsets, and only the train split is currently available for use.
The dataset includes several columns that provide valuable information for understanding each tweet's classification. The column count represents the total number of annotations provided for each tweet, whereas hate_speech_count signifies how many annotations classified a particular tweet as hate speech. On the other hand, offensive_language_count indicates the number of annotations categorizing a tweet as containing offensive language. Additionally, neither_count denotes how many annotations identified a tweet as neither hate speech nor offensive language.
For researchers and developers aiming to create effective models or algorithms capable of detecting hate speech and offensive language on Twitter, this comprehensive dataset offers a rich resource for training and evaluation purposes.
Introduction:
Dataset Overview:
- The dataset is presented in a CSV file format named 'train.csv'.
- It consists of annotated tweets with information about their classification as hate speech, offensive language, or neither.
- Each row represents a tweet along with the corresponding annotations provided by multiple annotators.
- The main columns that will be essential for your analysis are: count (total number of annotations), hate_speech_count (number of annotations classifying a tweet as hate speech), offensive_language_count (number of annotations classifying a tweet as offensive language), neither_count (number of annotations classifying a tweet as neither hate speech nor offensive language).
Data Collection Methodology: The data collection methodology used to create this dataset involved obtaining tweets from Twitter's public API using specific search terms related to hate speech and offensive language. These tweets were then manually labeled by multiple annotators who reviewed them for classification purposes.
Data Quality: Although efforts have been made to ensure the accuracy of the data, it is important to acknowledge that annotations are subjective opinions provided by individual annotators. As such, there may be variations in classifications between annotators.
Preprocessing Techniques: Prior to training machine learning models or algorithms on this dataset, it is recommended to apply standard preprocessing techniques such as removal of URLs, usernames/handles, and special characters/punctuation marks, stop-word removal, tokenization, and stemming/lemmatization, depending on your analysis requirements.
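As one illustration of such preprocessing (not a prescribed pipeline), a minimal cleaning function might look like the sketch below; the regexes are examples only, and stop-word removal and stemming/lemmatization are left out.

```python
import re

def clean_tweet(text):
    """Very basic tweet cleaning along the lines suggested above (illustrative only)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove usernames/handles
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # remove punctuation/special characters
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

print(clean_tweet("@user Check this out!!! https://example.com #great"))
# -> "check this out great"
```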
Exploratory Data Analysis (EDA): Conducting EDA on the dataset will help you gain insights and understand the underlying patterns in hate speech and offensive language. Some potential analysis ideas include:
- Distribution of tweet counts per classification category (hate speech, offensive language, neither).
- Most common words/phrases associated with each class.
- Co-occurrence analysis to identify correlations between hate speech and offensive language.
Building Machine Learning Models: To train models for automatic detection of hate speech and offensive language, you can follow these steps: a) Split the dataset into training and testing sets for model evaluation purposes. b) Choose appropriate features/
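Following steps (a) and (b) above, a minimal baseline sketch is shown below. It assumes the raw tweet text is stored in a column named tweet (the column name is not stated above) and derives a single label per tweet from the annotation counts.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")

# Derive a majority label from the annotation counts described above;
# the "tweet" text column name is an assumption.
label_cols = ["hate_speech_count", "offensive_language_count", "neither_count"]
df["label"] = df[label_cols].idxmax(axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    df["tweet"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Simple TF-IDF + logistic-regression baseline.
vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
```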
- Sentiment Analysis: This dataset can be used to train models for sentiment analysis on Twitter data. By classifying tweets as hate speech, offensive language, or neither, the dataset can help in understanding the sentiment behind different tweets and identifying patterns of negative or offensive language.
- Hate Speech Detection: The dataset can be used to develop models that automatically detect hate speech on Twitter. By training machine learning algorithms on this annotated dataset, it becomes possible to create systems that can identify and flag hate speech in real-time, making social media platforms safer and more inclusive.
- Content Moderation: Social media platforms can use this dataset to improve their content m...
Overview
This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research on multimodal approaches; (3) mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying Github repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
Options to access the dataset
There are two ways to get access to the dataset:
To obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform, author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria}, booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)}, pages = {1--7}, title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior}, year = {2019} }
@inproceedings{SrbaMonantMedicalDataset, author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria}, booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)}, numpages = {11}, title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims}, year = {2022}, doi = {10.1145/3477495.3531726}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3477495.3531726}, }
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, WordPress sites, the Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.
Reporting mistakes in the dataset

The way to report considerable mistakes in the raw collected data or in the manual annotations is by creating a new issue in the accompanying Github repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data (i.e., data extracted by the web monitoring module of the Monant platform and stored in exactly the same form as it appears on the original websites). Raw data consist of articles from news sites and blogs (e.g. naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g. snopes.com). In addition, the dataset contains feedback (numbers of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
sources.csv
articles.csv
article_media.csv
article_authors.csv
discussion_posts.csv
discussion_post_authors.csv
fact_checking_articles.csv
fact_checking_article_media.csv
claims.csv
feedback_facebook.csv
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
Annotations
Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., article, source). Relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:
category of annotation (annotation_category). Possible values: label (annotation corresponds to ground truth, determined by human experts) and prediction (annotation was created by means of an AI method).
type of annotation (annotation_type_id). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
method which created the annotation (method_id). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
its value (value). The value is stored in JSON format and its structure differs according to the particular annotation type.
At the same time, annotations are associated with a particular object identified by:
entity type (parameter entity_type in the case of entity annotations, or source_entity_type and target_entity_type in the case of relation annotations). Possible values: sources, articles, fact-checking-articles.
entity id (parameter entity_id in the case of entity annotations, or source_entity_id and target_entity_id in the case of relation annotations).
The dataset specifically provides these entity annotations:
Source reliability (binary). Determines the validity of a source (website) on a binary scale with two options: reliable source and unreliable source.
Article veracity. Aggregated information about veracity from article-claim pairs.
The dataset specifically provides these relation annotations:
Fact-checking article to claim mapping. Determines the mapping between a fact-checking article and a claim.
Claim presence. Determines the presence of a claim in an article.
Claim stance. Determines the stance of an article towards a claim.
Annotations are contained in these CSV files (and corresponding REST API endpoints):
entity_annotations.csv
relation_annotations.csv
Note: The identities of human annotators (the email addresses provided in the annotation app) are anonymised.
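To illustrate how the annotation files above can be consumed, here is a small, hypothetical sketch that loads the entity annotations with pandas and decodes the JSON-encoded value column; column names follow the attribute list above, and the exact layout of the released CSV may differ.

```python
import json
import pandas as pd

entity_ann = pd.read_csv("entity_annotations.csv")

# The "value" attribute is stored as JSON whose structure depends on the annotation type.
entity_ann["value"] = entity_ann["value"].apply(json.loads)

# Example: keep only expert-assigned (ground-truth) annotations attached to sources,
# e.g. the binary source-reliability labels described above.
source_labels = entity_ann[
    (entity_ann["annotation_category"] == "label")
    & (entity_ann["entity_type"] == "sources")
]
print(source_labels[["entity_id", "annotation_type_id", "value"]].head())
```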
"1. Overview This dataset is a collection of low view traffic images in multiple scenes, backgrounds and lighting conditions that are ready to use for optimizing the accuracy of computer vision models. All of the contents is sourced from PIXTA's stock library of 100M+ Asian-featured images and videos. PIXTA is the largest platform of visual materials in the Asia Pacific region offering fully-managed services, high quality contents and data, and powerful tools for businesses & organisations to enable their creative and machine learning projects.
Use case

This dataset is used for training and testing AI solutions in several cases: traffic monitoring, traffic camera systems, vehicle flow estimation, ... Each dataset is supported by both AI and human review processes to ensure labelling consistency and accuracy. Contact us for more custom datasets.
About PIXTA

PIXTASTOCK is the largest Asian-featured stock platform, providing data, content, tools and services since 2005. PIXTA has 15 years of experience integrating advanced AI technology in managing, curating and processing over 100M visual materials and serving global leading brands for their creative and data demands. Visit us at https://www.pixta.ai/ for more details.
This dataset contains news headlines relevant to key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from reputable platforms Forex Live and FXstreet over a period of 86 days, from January to May 2023. The dataset comprises 2,291 unique news headlines. Each headline includes an associated forex pair, timestamp, source, author, URL, and the corresponding article text. Data was collected using web scraping techniques executed via a custom service on a virtual machine. This service periodically retrieves the latest news for a specified forex pair (ticker) from each platform, parsing all available information. The collected data is then processed to extract details such as the article's timestamp, author, and URL. The URL is further used to retrieve the full text of each article. This data acquisition process repeats approximately every 15 minutes.
To ensure the reliability of the dataset, we manually annotated each headline for sentiment. Instead of solely focusing on the textual content, we ascertained sentiment based on the potential short-term impact of the headline on its corresponding forex pair. This method recognizes the currency market's acute sensitivity to economic news, which significantly influences many trading strategies. As such, this dataset could serve as an invaluable resource for fine-tuning sentiment analysis models in the financial realm.
We used three categories for annotation: 'positive', 'negative', and 'neutral', which correspond to bullish, bearish, and hold sentiments, respectively, for the forex pair linked to each headline. The following Table provides examples of annotated headlines along with brief explanations of the assigned sentiment.
Examples of Annotated Headlines

| Forex Pair | Headline | Sentiment | Explanation |
| --- | --- | --- | --- |
| GBPUSD | Diminishing bets for a move to 12400 | Neutral | Lack of strong sentiment in either direction |
| GBPUSD | No reasons to dislike Cable in the very near term as long as the Dollar momentum remains soft | Positive | Positive sentiment towards GBPUSD (Cable) in the near term |
| GBPUSD | When are the UK jobs and how could they affect GBPUSD | Neutral | Poses a question and does not express a clear sentiment |
| JPYUSD | Appropriate to continue monetary easing to achieve 2% inflation target with wage growth | Positive | Monetary easing from the Bank of Japan (BoJ) could lead to a weaker JPY in the short term due to increased money supply |
| USDJPY | Dollar rebounds despite US data. Yen gains amid lower yields | Neutral | Since both the USD and JPY are gaining, the effects on the USDJPY forex pair might offset each other |
| USDJPY | USDJPY to reach 124 by Q4 as the likelihood of a BoJ policy shift should accelerate Yen gains | Negative | USDJPY is expected to reach a lower value, with the USD losing value against the JPY |
| AUDUSD | RBA Governor Lowe’s Testimony High inflation is damaging and corrosive | Positive | The Reserve Bank of Australia (RBA) expresses concerns about inflation. Typically, central banks combat high inflation with higher interest rates, which could strengthen AUD. |

Moreover, the dataset includes two columns with the predicted sentiment class and score as predicted by the FinBERT model. Specifically, the FinBERT model outputs a set of probabilities for each sentiment class (positive, negative, and neutral), representing the model's confidence in associating the input headline with each sentiment category. These probabilities are used to determine the predicted class and a sentiment score for each headline. The sentiment score is computed by subtracting the negative class probability from the positive one.
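For reference, a minimal sketch of reproducing such a score with the Hugging Face transformers library is shown below. The specific checkpoint ProsusAI/finbert is an assumption (the description only says "the FinBERT model"), and the label names are read from that checkpoint's configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "ProsusAI/finbert" is a commonly used public FinBERT checkpoint; it is an
# assumption here, since the dataset description does not name a checkpoint.
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

def finbert_sentiment(headline):
    inputs = tokenizer(headline, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze()
    labels = [model.config.id2label[i] for i in range(probs.shape[0])]
    scores = dict(zip(labels, probs.tolist()))
    predicted = max(scores, key=scores.get)
    # Sentiment score = P(positive) - P(negative), as described above.
    score = scores.get("positive", 0.0) - scores.get("negative", 0.0)
    return predicted, score

print(finbert_sentiment("Dollar rebounds despite US data. Yen gains amid lower yields"))
```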
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Access to Data

The concurrent release of rice genome sequences for two subspecies (Oryza sativa L. ssp. japonica and Oryza sativa L. ssp. indica) facilitates rice studies at the whole genome level. Since the advent of high-throughput analysis, huge amounts of functional genomics data have been delivered rapidly, making an integrated online genome browser indispensable for scientists to visualize and analyze these data. Based on next-generation web technologies and high-throughput experimental data, we have developed Rice-Map, a novel genome browser for researchers to navigate, analyze and annotate rice genome interactively. More than one hundred annotation tracks (81 for japonica and 82 for indica) have been compiled and loaded into Rice-Map. These pre-computed annotations cover gene models, transcript evidences, expression profiling, epigenetic modifications, inter-species and intra-species homologies, genetic markers and other genomic features. In addition to these pre-computed tracks, registered users can interactively add comments and research notes to Rice-Map as User-Defined Annotation entries. By smoothly scrolling, dragging and zooming, users can browse various genomic features simultaneously at multiple scales. On-the-fly analysis for selected entries could be performed through dedicated bioinformatic analysis platforms such as WebLab and Galaxy. Furthermore, a BioMart-powered data warehouse "Rice Mart" is offered for advanced users to fetch bulk datasets based on complex criteria. Rice-Map delivers abundant up-to-date japonica and indica annotations, providing a valuable resource for both computational and bench biologists. Rice-Map is publicly accessible at http://www.ricemap.org/, with all data available for free downloading.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CDD is a protein annotation resource that consists of a collection of annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domain models, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP profiles are manually created by expert curators. They identify proteins that are part of well-conserved protein families or subfamilies. HAMAP is based at the SIB Swiss Institute of Bioinformatics, Geneva, Switzerland.