Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
About Dataset
The dataset consists of data from a range of YouTube videos, from fastai and FSDL (Full Stack Deep Learning) lessons to miscellaneous instructional videos. In total, it contains 600 YouTube chapter markers and 25,000 lesson transcripts. The dataset can be used for NLP tasks such as summarization and topic segmentation. You can refer to some of the models we have trained with this dataset in the GitHub repository for Full Stack Deep Learning 2022 projects.
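As a rough illustration of how the chapter markers could pair with the transcripts for summarization or topic-segmentation experiments, here is a minimal sketch; the file name and column names ("video_id", "chapter_title", "transcript") are hypothetical, since the description above does not specify a schema.

```python
# Hypothetical sketch: the CSV file name and column names are assumptions,
# not part of the published dataset description.
import pandas as pd

df = pd.read_csv("youtube_chapters.csv")  # hypothetical export of the dataset

# Group transcript lines by video and chapter to form (chapter text, title)
# pairs, which could serve as (document, summary) examples for a summarizer
# or as labeled boundaries for topic segmentation.
pairs = [
    (" ".join(group["transcript"]), title)
    for (video_id, title), group in df.groupby(["video_id", "chapter_title"])
]
print(f"{len(pairs)} chapter-level examples")
```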
Etalab Open License 2.0 (etalab-2.0): https://spdx.org/licenses/etalab-2.0.html
One of the objectives of the Hérelles project is to discover new mechanisms to facilitate the labeling (or semantic annotation) of clusters extracted from time series of satellite images. To achieve this, a proposed solution is to associate textual elements of interest (relevant to the study's theme and to the spatiotemporal scope of the time series) with the satellite data. This dataset is a consolidated version of the "Hérelles Textual Segments" ("Segments Textuels Hérelles") dataset. It includes a thematically collected and manually annotated corpus, as well as the code and results of an automatic extraction method for textual elements of interest. It comprises the following elements:
The file "Corpus_Expert_Links" presents the thematic corpus used, with links to its constituent documents. These documents were chosen for their richness in rules and constraints regarding land use.
The file "Lisez_Moi_Consolidated_Version" is the consolidated version of the initial annotation protocol, providing definitions of the various terms used (segments, rules, etc.).
The file "Read_Me_Consolidated_Version" is the English version of the "Lisez_Moi" file.
The compressed folder "Corpus_Manual_Annotation_Consolidated_Version" contains the manually annotated versions of the documents of interest in txt format.
The compressed folder "Corpus_Extracted_Segments_Consolidated_Version" contains the consolidated results of the automatic segmentation process applied to the documents of interest, along with labels for the four classes (Verifiable, Non-verifiable, Informative, and Not pertinent).
The compressed folder "LUPAN_code" contains the code for corpus construction and the preliminary evaluation of LUPAN: extraction of text from the PDF documents, construction of segments from the text documents, preparation of the data for evaluation, and evaluation experiments using a state-of-the-art method (CamemBERT).
Our corpus is available on the Hugging Face Hub and can be loaded directly in Python: https://huggingface.co/datasets/Herelles/lupan. In addition, we fine-tuned a model on top of CamemBERT using LUPAN, which is also available on Hugging Face: https://huggingface.co/Herelles/camembert-base-lupan. Finally, we developed a demo that demonstrates the capabilities of our corpus and this model: https://huggingface.co/spaces/Herelles/segments-lupan.
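A minimal loading sketch, using the repository ids given above; the example sentence and the exact output format of the classifier are illustrative only, so check the dataset and model cards for the actual splits and label names.

```python
# Minimal sketch for loading the LUPAN corpus and the fine-tuned classifier
# from the Hugging Face Hub. Split names and the pipeline output format are
# not specified here and should be verified against the dataset/model cards.
from datasets import load_dataset
from transformers import pipeline

corpus = load_dataset("Herelles/lupan")  # https://huggingface.co/datasets/Herelles/lupan
print(corpus)                            # inspect available splits and columns

classifier = pipeline("text-classification", model="Herelles/camembert-base-lupan")
# Illustrative French sentence about a land-use constraint:
print(classifier("Les constructions nouvelles sont interdites dans les zones inondables."))
```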
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions
We present YTSeg, a topically and structurally diverse benchmark for the text segmentation task based on YouTube transcriptions. The dataset comprises 19,299 videos from 393 channels, amounting to 6,533 content hours. The topics are wide-ranging, covering domains such as science, lifestyle, politics, health, economy, and technology. The videos are from various types of content formats… See the full description on the dataset page: https://huggingface.co/datasets/retkowski/ytseg.
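A minimal inspection sketch, assuming YTSeg can be loaded with the standard `datasets` library from the page linked above; split and field names are not stated in the description, so the code only prints them rather than assuming a schema.

```python
# Minimal sketch: load YTSeg and inspect its structure before building a
# segmentation model. Split and column names may differ from expectations.
from datasets import load_dataset

ytseg = load_dataset("retkowski/ytseg")
print(ytseg)                      # shows splits, column names, and sizes
first_split = next(iter(ytseg))   # name of the first available split
print(ytseg[first_split][0])      # peek at one example's fields
```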
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the official release of the OAIZIB-CM dataset.
For convenient dataset download in Python, please refer to the Hugging Face release of the same dataset:
https://huggingface.co/datasets/YongchengYAO/OAIZIB-CM
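A minimal download sketch using `huggingface_hub.snapshot_download`, which mirrors the repository files locally; the local file layout depends on how the dataset repository is organized, so inspect the returned directory before use.

```python
# Minimal sketch: download the OAIZIB-CM release from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="YongchengYAO/OAIZIB-CM",
    repo_type="dataset",
)
print("Dataset downloaded to:", local_dir)
```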
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
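A minimal loading sketch for one of the training subsets; the repository id ("speechcolab/gigaspeech") and the config names ("xs", "s", "m", "l", "xl") are assumptions based on the public Hub release and are not stated in the description above.

```python
# Minimal sketch: load the smallest GigaSpeech training subset for a quick test.
# Access typically requires accepting the dataset terms and authenticating with
# a Hugging Face token (huggingface-cli login); depending on your `datasets`
# version, trust_remote_code=True may also be required.
from datasets import load_dataset

gs = load_dataset("speechcolab/gigaspeech", "xs")  # "xs" ~ the 10h subset
print(gs)  # inspect splits (train/dev/test) and audio/transcript columns
```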
License: https://choosealicense.com/licenses/cc/
ChatGPT-Research-Abstracts
This is a dataset created in relation to a bachelor thesis written by Nicolai Thorer Sivesind and Andreas Bentzen Winje. It contains human-produced and machine-generated text samples of scientific research abstracts. A reformatted version for text-classification is available in the dataset collection Human-vs-Machine. In this collection, all samples are split into separate data points for real and generated, and labeled either 0 (human-produced) or 1… See the full description on the dataset page: https://huggingface.co/datasets/NicolaiSivesind/ChatGPT-Research-Abstracts.
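A minimal loading sketch, assuming the dataset loads with the standard `datasets` library; the split names and the exact column holding the abstract text are not given above and should be verified against the dataset card before training a classifier on the 0/1 labels.

```python
# Minimal sketch: load ChatGPT-Research-Abstracts and inspect its features.
from datasets import load_dataset

abstracts = load_dataset("NicolaiSivesind/ChatGPT-Research-Abstracts")
print(abstracts)  # shows splits and columns (human-produced vs. generated text)
```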