6 datasets found
  1. Course_summaries_dataset

    • huggingface.co
    Updated Apr 6, 2023
    Cite
    recapper (2023). Course_summaries_dataset [Dataset]. https://huggingface.co/datasets/recapper/Course_summaries_dataset
    Dataset updated
    Apr 6, 2023
    Dataset authored and provided by
    recapper
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The dataset consists of transcripts from a range of YouTube videos, from fastai and FSDL (Full Stack Deep Learning) lessons to miscellaneous instructional videos. In total it contains 600 YouTube chapter markers and 25,000 lesson transcript entries. The dataset can be used for NLP tasks such as summarization and topic segmentation. Some of the models trained with this dataset are referenced in the GitHub repository for Full Stack Deep Learning 2022 projects.
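
    As a rough illustration of how this dataset could feed a summarization or topic-segmentation experiment, the sketch below loads it with the Hugging Face `datasets` library. Only the repository ID comes from the citation above; the split and column names are assumptions to check against the dataset card.

    ```python
    # Minimal sketch: load the corpus and inspect one record.
    # Repo ID taken from the listing; split and column names are assumptions.
    from datasets import load_dataset

    ds = load_dataset("recapper/Course_summaries_dataset")

    print(ds)                      # shows the available splits and columns
    first_split = next(iter(ds.values()))
    print(first_split[0])          # one record, e.g. transcript plus chapter summary
    ```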

  2. Segments textuels consolidés - Consolidated Textual Segments - Hérelles Project

    • entrepot.recherche.data.gouv.fr
    pdf, tsv, zip
    Updated Mar 18, 2024
    + more versions
    Cite
    Margaux Holveck; Maksim Koptelov; Mathieu Roche; Maguelonne Teisseire (2024). Segments textuels consolidés - Consolidated Textual Segments - Hérelles Project [Dataset]. http://doi.org/10.57745/XIVJ65
    Explore at:
    Available download formats: zip (148129), pdf (655744), tsv (682), zip (8577), pdf (622295), zip (166810)
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Recherche Data Gouv
    Authors
    Margaux Holveck; Maksim Koptelov; Mathieu Roche; Maguelonne Teisseire
    License

    Etalab Open License 2.0 (https://spdx.org/licenses/etalab-2.0.html)

    Description

    One of the objectives of the Hérelles project is to discover new mechanisms to facilitate the labeling (or semantic annotation) of clusters extracted from time series of satellite images. To achieve this, a proposed solution is to associate textual elements of interest (relevant to the study's theme and to the spatiotemporal scope of the time series) with the satellite data. This dataset is a consolidated version of the "Hérelles Textual Segments" dataset. It includes a thematically collected and manually annotated corpus, as well as the code and the results of an automatic extraction method for textual elements of interest. It comprises the following elements:

    - The file "Corpus_Expert_Links" presents the thematic corpus used, with links to its constituent documents. These documents were chosen for their richness in rules and constraints regarding land use.
    - The file "Lisez_Moi_Consolidated_Version" is the consolidated version of the initial annotation protocol, providing definitions of the terms used (segments, rules, etc.).
    - The file "Read_Me_Consolidated_Version" is the English version of the "Lisez_Moi" file.
    - The compressed folder "Corpus_Manual_Annotation_Consolidated_Version" contains the manually annotated versions of the documents of interest in txt format.
    - The compressed folder "Corpus_Extracted_Segments_Consolidated_Version" contains the consolidated results of the automatic segmentation process applied to the documents of interest, along with labels for the four classes (Verifiable, Non-verifiable, Informative, and Not pertinent).
    - The compressed folder "LUPAN_code" contains the code for corpus construction and preliminary evaluation of LUPAN: extraction of text from the PDF documents, construction of segments from the text documents, preparation of the data for evaluation, and evaluation experiments using a state-of-the-art method (CamemBERT).

    Our corpus is available in the Hugging Face library and can be loaded directly in Python: https://huggingface.co/datasets/Herelles/lupan. In addition, we fine-tuned a model on top of CamemBERT using LUPAN, which is also available on Hugging Face: https://huggingface.co/Herelles/camembert-base-lupan. Finally, we developed a demo that demonstrates the capabilities of our corpus and this model: https://huggingface.co/spaces/Herelles/segments-lupan
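
    A minimal sketch of the Python loading path mentioned above, combined with classification using the fine-tuned CamemBERT model. The two repository IDs are taken from the description; the split name, column name, and the exact label strings returned by the model are assumptions to verify on the Hugging Face pages.

    ```python
    # Sketch: load the LUPAN corpus and classify one segment into the four
    # classes (Verifiable, Non-verifiable, Informative, Not pertinent).
    # Repo IDs come from the description; split/column/label names are assumptions.
    from datasets import load_dataset
    from transformers import pipeline

    lupan = load_dataset("Herelles/lupan")
    classifier = pipeline("text-classification", model="Herelles/camembert-base-lupan")

    example = lupan["train"][0]          # "train" split assumed
    print(example)
    print(classifier(example["text"]))   # "text" column assumed
    ```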

  3. ytseg

    • huggingface.co
    Updated Feb 28, 2024
    Cite
    Fabian Retkowski (2024). ytseg [Dataset]. http://doi.org/10.57967/hf/1824
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2024
    Authors
    Fabian Retkowski
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions

    We present YTSeg, a topically and structurally diverse benchmark for the text segmentation task based on YouTube transcriptions. The dataset comprises 19,299 videos from 393 channels, amounting to 6,533 content hours. The topics are wide-ranging, covering domains such as science, lifestyle, politics, health, economy, and technology. The videos are from various types of content formats… See the full description on the dataset page: https://huggingface.co/datasets/retkowski/ytseg.
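
    A brief, hedged example of loading the benchmark for a segmentation experiment: the repository ID comes from the URL above, but the split names and field layout (e.g. transcript text versus segment boundaries) are assumptions to confirm on the dataset card.

    ```python
    # Sketch: load YTSeg and look at one transcript record.
    # Repo ID taken from the listing URL; splits and columns are assumptions.
    from datasets import load_dataset

    ytseg = load_dataset("retkowski/ytseg")

    print(ytseg)               # expected: train/validation/test style splits
    sample = next(iter(ytseg.values()))[0]
    print(sample.keys())       # e.g. transcript text and chapter/segment annotations
    ```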

  4. OAIZIB-CM: Dataset from the CartiMorph Project

    • zenodo.org
    zip
    Updated Feb 27, 2025
    Cite
    Yongcheng Yao (2025). OAIZIB-CM: Dataset from the CartiMorph Project [Dataset]. http://doi.org/10.5281/zenodo.14934086
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yongcheng Yao
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Feb 27, 2025
    Description

    This is the official release of the OAIZIB-CM dataset.

    For convenient dataset download in Python, please refer to the Hugging Face release of the same dataset:
    https://huggingface.co/datasets/YongchengYAO/OAIZIB-CM
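
    Since the release ships as zip archives, one hedged way to fetch the Hugging Face mirror mentioned above is `huggingface_hub.snapshot_download`. The repository ID comes from the URL; the local directory name is an arbitrary choice.

    ```python
    # Sketch: download the Hugging Face release of OAIZIB-CM.
    # Repo ID taken from the URL above; local_dir is an arbitrary choice.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id="YongchengYAO/OAIZIB-CM",
        repo_type="dataset",
        local_dir="OAIZIB-CM",   # any local folder works
    )
    print(path)                  # folder containing the downloaded dataset files
    ```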

  5. gigaspeech

    • huggingface.co
    • opendatalab.com
    Updated Aug 30, 2022
    + more versions
    Cite
    SpeechColab (2022). gigaspeech [Dataset]. http://doi.org/10.57967/hf/6261
    Dataset updated
    Aug 30, 2022
    Dataset authored and provided by
    SpeechColab
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    GigaSpeech is an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality.
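
    The training subsets described above (10h up to 10,000h) correspond to named configurations on the Hugging Face Hub; the sketch below is a hedged example of loading one of them with `datasets`. The repository ID `speechcolab/gigaspeech` and the configuration name `xs` are assumptions (not stated in this listing), and access to the corpus is typically gated, so accepting the dataset terms and logging in with a Hugging Face token may be required.

    ```python
    # Sketch: load a small GigaSpeech subset for ASR experiments.
    # Repo ID, config name, and field names are assumptions; access may require
    # accepting the dataset's terms and authenticating with a Hugging Face token.
    from datasets import load_dataset

    gs = load_dataset("speechcolab/gigaspeech", "xs")  # "xs" assumed to be the ~10h subset

    sample = gs["train"][0]
    print(sample["text"])                    # transcription (field name assumed)
    print(sample["audio"]["sampling_rate"])  # audio field name assumed
    ```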

  6. ChatGPT-Research-Abstracts

    • huggingface.co
    • opendatalab.com
    Cite
    Nicolai Thorer Sivesind. ChatGPT-Research-Abstracts [Dataset]. https://huggingface.co/datasets/NicolaiSivesind/ChatGPT-Research-Abstracts
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Nicolai Thorer Sivesind
    License

    https://choosealicense.com/licenses/cc/

    Description

    ChatGPT-Research-Abstracts

    This dataset was created in relation to a bachelor thesis written by Nicolai Thorer Sivesind and Andreas Bentzen Winje. It contains human-produced and machine-generated text samples of scientific research abstracts. A reformatted version for text classification is available in the dataset collection Human-vs-Machine. In that collection, all samples are split into separate data points for real and generated text, and labeled either 0 (human-produced) or 1… See the full description on the dataset page: https://huggingface.co/datasets/NicolaiSivesind/ChatGPT-Research-Abstracts.
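
    As a hedged sketch of how the human versus machine abstracts could be loaded for a text-classification experiment: the repository ID below comes from the URL in the description, while the split and column names (e.g. separate fields for the real and generated abstracts) are assumptions to check against the dataset card.

    ```python
    # Sketch: load the abstracts and inspect one record.
    # Repo ID taken from the description URL; splits and columns are assumptions.
    from datasets import load_dataset

    ds = load_dataset("NicolaiSivesind/ChatGPT-Research-Abstracts")

    print(ds)                            # available splits
    record = next(iter(ds.values()))[0]
    print(record.keys())                 # e.g. human-written vs. generated abstract fields
    ```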

