7 datasets found

mediflow
huggingface.co
Cite
Microsoft, mediflow [Dataset]. https://huggingface.co/datasets/microsoft/mediflow
Dataset authored and provided by
Microsoft
License
https://choosealicense.com/licenses/cdla-permissive-2.0/
Description
MediFlow

A large-scale synthetic instruction dataset of 2.5M rows (~700k unique instructions) for clinical natural language processing covering 14 task types and 98 fine-grained input clinical documents.

t-SNE 2D Plot of MediFlow Embeddings by Task Types

Dataset Splits

mediflow: 2.5M instruction data for SFT alignment.
mediflow_dpo: ~135k top-quality instructions with GPT-4o generated rejected_output for DPO alignment.

Main Columns

instruction:… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/mediflow.
Texte provenant des pdfs trouvés sur data.gouv.fr
data.europa.eu
tgz
Updated May 20, 2020
Cite
Pavel Soriano (2020). Texte provenant des pdfs trouvés sur data.gouv.fr [Dataset]. https://data.europa.eu/data/datasets/5ec45f516a58eec727e79af7?locale=sv
Available download formats: tgz
Dataset updated
May 20, 2020
Dataset authored and provided by
Pavel Soriano
License
https://www.etalab.gouv.fr/licence-ouverte-open-licence
Area covered
France
Description
Text extracted from the PDFs found on data.gouv.fr

Description

This dataset contains the text extracted from 6602 files with the pdf extension in the data.gouv.fr resource catalogue.

The dataset contains only PDFs of 20 Mb or less that are still available at the indicated URL.

The extraction was performed with PDFBox via its Python wrapper python-pdfbox. PDFs that are images (scans, maps, etc.) are detected with a simple heuristic: if, after conversion to text with pdfbox, the size of the resulting file is less than 20 bytes, the PDF is considered to be an image. In that case, OCR is applied, using Tesseract via its Python wrapper pyocr.
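The heuristic above can be sketched as follows. This is a minimal illustration, not the author's script: the function names, the `run_ocr` callable, and the choice of UTF-8 when measuring the text size are all assumptions made for the example.

```python
# Hypothetical sketch of the image-PDF heuristic; all names are illustrative.
OCR_THRESHOLD_BYTES = 20  # threshold stated in the description


def needs_ocr(extracted_text: str, threshold: int = OCR_THRESHOLD_BYTES) -> bool:
    """Return True when the extracted text is so short that the PDF is
    presumed to be an image (fewer than `threshold` bytes).

    Assumption: size is measured on the UTF-8 encoding of the text.
    """
    return len(extracted_text.encode("utf-8")) < threshold


def choose_text(extracted_text: str, run_ocr) -> str:
    """Keep the converted text, or fall back to OCR for presumed image PDFs.

    `run_ocr` is a caller-supplied callable standing in for the OCR step.
    """
    if needs_ocr(extracted_text):
        return run_ocr()
    return extracted_text
```

Passing the OCR step as a callable keeps the sketch independent of any particular OCR library, so the threshold logic can be checked on its own.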

The result is a set of txt files extracted from the PDFs, sorted by organization (the organization that published the resource). There are 175 organizations in this dataset, hence 175 folders. The name of each file follows the pattern {id-du-dataset}--{id-de-la-ressource}.txt (dataset id, then resource id).
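As an illustration of the naming pattern, the sketch below splits a relative path into organization, dataset id, and resource id. The `ResourceFile` type and `parse_resource_path` function are hypothetical helpers, and the sketch assumes the dataset id itself never contains a double hyphen.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ResourceFile(NamedTuple):
    """Hypothetical container for the parts encoded in a file path."""
    organization: str
    dataset_id: str
    resource_id: str


def parse_resource_path(path: str) -> ResourceFile:
    """Split '<organization>/<dataset-id>--<resource-id>.txt' into its parts."""
    p = PurePosixPath(path)
    if p.suffix != ".txt":
        raise ValueError(f"expected a .txt file, got {path!r}")
    # Split on the first '--' only (assumes the dataset id has no '--').
    dataset_id, sep, resource_id = p.stem.partition("--")
    if not sep or not dataset_id or not resource_id:
        raise ValueError(f"file name does not match the pattern: {path!r}")
    return ResourceFile(p.parent.name, dataset_id, resource_id)
```

For example, `parse_resource_path("ACTION_Nogent-sur-Marne/53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt")` yields the organization folder name and the two identifiers as separate fields.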

Input

The data.gouv.fr resource catalogue.

Output

Text files for each pdf-type resource found in the catalogue that was successfully converted and that satisfied the constraints above. The directory tree is as follows:

.
├── ACTION_Nogent-sur-Marne
│   ├── 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
│   ├── 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
│   ├── 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
│   ├── 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
│   ├── 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
│   ├── ...
├── Aeroport_La_Rochelle-Ile_de_Re
├── Agence_de_services_et_de_paiement_ASP
├── Agence_du_Numerique
├── ...

Distribution of texts [as of May 20, 2020]

The top 10 organizations by number of documents are:

[('Les_Lilas', 1294),
 ('Ville_de_Pirae', 1099),
 ('Region_Hauts-de-France', 592),
 ('Ressourcerie_datalocale', 297),
 ('NA', 268),
 ('CORBION', 244),
 ('Education_Nationale', 189),
 ('Incubateur_de_Services_Numeriques', 157),
 ('Ministere_des_Solidarites_et_de_la_Sante', 148),
 ('Communaute_dAgglomeration_Plaine_Vallee', 142)]

Their 2D overview (HashFeatures+TruncatedSVD+t-SNE) is shown in the figure "Plot t-SNE des textes DGF": https://raw.githubusercontent.com/psorianom/data_gouv_text/master/img/samplefigure.png
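A ranking of this kind could be recomputed from the extracted archive with a short script. The sketch below is an assumption about how one might do it, not the author's code: `top_organizations` is a hypothetical helper, and it assumes one folder per organization directly under the archive root, each containing txt files.

```python
from collections import Counter
from pathlib import Path


def top_organizations(root, n=10):
    """Count txt files per organization folder and return the n largest.

    Assumes the layout <root>/<organization>/<file>.txt.
    """
    counts = Counter(txt.parent.name for txt in Path(root).glob("*/*.txt"))
    # most_common returns (name, count) pairs sorted by decreasing count.
    return counts.most_common(n)
```

The return value has the same shape as the list above: a list of `(organization, document_count)` tuples.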

Code

The Python scripts used for this extraction are available here.

Remarks

Owing to the quality of the original PDFs (low-resolution scans, misaligned PDFs, ...) and to the performance of the pdf-->txt conversion methods, the results can be very noisy.
Table_2_A Novel Computational Framework for Precision Diagnosis and Subtype...
frontiersin.figshare.com
xlsx
Updated Jun 15, 2023
Cite
Fei Xia; Xiaojun Xie; Zongqin Wang; Shichao Jin; Ke Yan; Zhiwei Ji (2023). Table_2_A Novel Computational Framework for Precision Diagnosis and Subtype Discovery of Plant With Lesion.XLSX [Dataset]. http://doi.org/10.3389/fpls.2021.789630.s003
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fpls.2021.789630.s003
Dataset updated
Jun 15, 2023
Dataset provided by
Frontiers
Authors
Fei Xia; Xiaojun Xie; Zongqin Wang; Shichao Jin; Ke Yan; Zhiwei Ji
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Plants are often attacked by various pathogens during their growth, which may cause environmental pollution, food shortages, or economic losses in a certain area. Integration of high-throughput phenomics data and computer vision (CV) provides a great opportunity to diagnose plant disease at an early stage and to uncover the subtype or stage patterns in disease progression. In this study, we proposed a novel computational framework for plant disease identification and subtype discovery through a deep-embedding image-clustering strategy, Weighted Distance Metric and the t-distributed stochastic neighbor embedding algorithm (WDM-tSNE). To verify its effectiveness, we applied our method to four public image datasets. The results demonstrated that the newly developed tool is capable of identifying the plant disease and further uncovering the underlying subtypes associated with pathogenic resistance. In summary, the current framework provides great clustering performance for root or leaf images of diseased plants with pronounced disease spots or symptoms.
MRQy quality measures for TCIA MRI datasets
cancerimagingarchive.net
n/a, xlsx
Cite
The Cancer Imaging Archive (2020). MRQy quality measures for TCIA MRI datasets [Dataset]. http://doi.org/10.7937/K9/TCIA.2020.JHZ2-T694
Available download formats: n/a, xlsx
Unique identifier
https://doi.org/10.7937/K9/TCIA.2020.JHZ2-T694
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
Jul 16, 2020
Dataset funded by
National Cancer Institute (http://www.cancer.gov/)
Description
Magnetic Resonance Imaging (MRI) quality assessment measures were generated for:

T1-weighted post-contrast axial MRI sequences from 133 TCGA-GBM subjects

T1-weighted post-contrast axial MRI sequences from 46 CPTAC-GBM subjects

Both T1- and T2-weighted axial MRI sequences from 54 TCGA-CESC subjects

All MRI scans for each cohort were downloaded as DICOM files from TCIA and then processed via MRQy to compute quality measures for (a) interrogating the presence of site- or equipment-specific variations within a cohort, and (b) quantifying the impact of MRI artifacts to determine what pre-analytical corrections are needed. The MRQy output can be easily interrogated via the associated HTML5-based front-end, allowing for real-time filtering and visualization. MRQy is available for download at: http://github.com/ccipd/MRQy. Manifest files to download the DICOM images these results were derived from are available in the "Collections Used In This Analysis Result" table.

Figure: (a) MRQy front-end interface for interrogating the TCGA-GBM cohort. (b) Outlier dataset identified on the parallel coordinate chart for the CJV quality measure, found to exhibit shading artifacts on (c) representative images, especially when compared to (d) a different dataset without this artifact. (e) t-SNE scatter plot of quality measures revealing the presence of site-specific batch effects (colors correspond to different sites; note the presence of site-specific clusters).
Table_1_Subgroup-Independent Mapping of Renal Cell Carcinoma—Machine...
figshare.com
xlsx
Updated Jun 8, 2023
Cite
André Marquardt; Antonio Giovanni Solimando; Alexander Kerscher; Max Bittrich; Charis Kalogirou; Hubert Kübler; Andreas Rosenwald; Ralf Bargou; Philip Kollmannsberger; Bastian Schilling; Svenja Meierjohann; Markus Krebs (2023). Table_1_Subgroup-Independent Mapping of Renal Cell Carcinoma—Machine Learning Reveals Prognostic Mitochondrial Gene Signature Beyond Histopathologic Boundaries.XLSX [Dataset]. http://doi.org/10.3389/fonc.2021.621278.s009
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fonc.2021.621278.s009
Dataset updated
Jun 8, 2023
Dataset provided by
Frontiers
Authors
André Marquardt; Antonio Giovanni Solimando; Alexander Kerscher; Max Bittrich; Charis Kalogirou; Hubert Kübler; Andreas Rosenwald; Ralf Bargou; Philip Kollmannsberger; Bastian Schilling; Svenja Meierjohann; Markus Krebs
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Renal cell carcinoma (RCC) is divided into three major histopathologic groups: clear cell (ccRCC), papillary (pRCC) and chromophobe RCC (chRCC). We performed a comprehensive re-analysis of publicly available RCC datasets from the TCGA (The Cancer Genome Atlas) database, thereby combining samples from all three subgroups, for an exploratory transcriptome profiling of RCC subgroups.

Materials and Methods: We used FPKM (fragments per kilobase per million) files derived from the ccRCC, pRCC and chRCC cohorts of the TCGA database, representing transcriptomic data of 891 patients. Using principal component analysis, we visualized datasets as a t-SNE plot for cluster detection. Clusters were characterized by machine learning, and the resulting gene signatures were validated by correlation analyses in the TCGA dataset and three external datasets (ICGC RECA-EU, CPTAC-3-Kidney, and GSE157256).

Results: Many RCC samples co-clustered according to histopathology. However, a substantial number of samples clustered independently from histopathologic origin (mixed subgroup), demonstrating divergence between histopathology and transcriptomic data. Further analyses of the mixed subgroup via machine learning revealed a predominant mitochondrial gene signature, a trait previously known for chRCC, across all histopathologic subgroups. Additionally, ccRCC samples from the mixed subgroup presented an inverse correlation of mitochondrial and angiogenesis-related genes in the TCGA and in three external validation cohorts. Moreover, mixed subgroup affiliation was associated with a highly significant shorter overall survival for patients with ccRCC, and a highly significant longer overall survival for chRCC patients.

Conclusions: Pan-RCC clustering according to RNA-sequencing data revealed a distinct histology-independent subgroup characterized by strengthened mitochondrial and weakened angiogenesis-related gene signatures. Moreover, affiliation to the mixed subgroup went along with a significantly shorter overall survival for ccRCC and a longer overall survival for chRCC patients. Further research could offer a therapy stratification by specifically addressing the mitochondrial metabolism of such tumors and its microenvironment.
MCA DGE Data
figshare.com
application/gzip
Updated May 30, 2023
Cite
Guoji Guo (2023). MCA DGE Data [Dataset]. http://doi.org/10.6084/m9.figshare.5435866.v8
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.5435866.v8
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Guoji Guo
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MCA single-cell DGE data (cells with >500 UMI) for the following manuscript: Mapping the Mouse Cell Atlas by Microwell-seq

MCA_500more_dge.rar: The raw digital expression matrix (dge) of more than 400,000 single cells sorted by tissues. All cells have more than 500 transcripts. The batch genes were not removed.

MCA_BatchRemove_dge.zip: The batch-gene-removed dge of more than 200,000 primary single cells sorted by tissues. Some tissues are not included due to relatively strong batch effects. This dataset can be used to make a global tissue tSNE plot and do cross-tissue analysis.

MCA_CellAssignments.csv: The annotation of cells, which includes the cell names, cluster ID, tissue of origin, experimental batches and cell barcodes.

MCA_Figure2-batch-removed.txt.tar.gz: The batch-removed dge of approximately 60,000 high-quality cells. 1500 cells were sampled from each of 43 tissues. This sampled data is used for Figure 2.

MCA_Figure2_Cell.info.xlsx: The annotations of cells used in Figure 2. Sheet1: The annotations of each cell used in Figure 2, including cell names, cluster ID and tissue of origin. Sheet2: The annotations of 98 clusters in Figure 2. Sheet3: The composition of cell numbers in 98 clusters and 43 tissues.

MCA_Batch Information.xlsx: The batch information, which includes the age and gender of the mouse, and experiment batches for MCA data.

MCA_BatchRemoved_Merge_dge.h5ad: The updated dge with batch genes removed. It can be read with the scanpy Python package. About 333778 cells are included.

MCA_BatchRemoved_Merge_dge_cellinfo.csv: The cell information of MCA_BatchRemoved_Merge_dge.h5ad.

Batch effect removal

For cross-tissue comparison, we removed the batch gene background to improve presentation. We assume that for each batch of experiment, the cell barcodes with less than 500 UMI correspond to empty beads exposed to free RNA during the cell lysis, RNA capture and washing steps. The batch gene background value is defined as the average gene detection for all cellular barcodes with less than 500 UMI, multiplied by a coefficient of 2, and then rounded to the nearest integer. Genes detected in less than 25% of all cells are removed from the batch gene background list. We subtract the batch gene background for each cell from the digital expression matrix before making the cross-tissue comparison figures.
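The background-subtraction procedure described here can be sketched in plain Python as below. This is a hedged reading of the description, not the authors' code, and it rests on several assumptions: "average gene detection" is taken to mean the mean count per gene over low-UMI barcodes, "all cells" is taken to mean barcodes at or above the UMI threshold, ties are rounded half up, and counts are clamped at zero after subtraction. The function names and parameters are hypothetical.

```python
def batch_gene_background(counts, umi_threshold=500, coefficient=2,
                          min_cell_fraction=0.25):
    """Estimate a per-gene background for one experimental batch.

    counts: list of per-barcode lists of gene counts (barcodes x genes).
    Returns one integer background value per gene.
    """
    n_genes = len(counts[0])
    totals = [sum(row) for row in counts]
    # Assumption: barcodes below the threshold are empty beads,
    # barcodes at or above it are cells.
    empty = [row for row, t in zip(counts, totals) if t < umi_threshold]
    cells = [row for row, t in zip(counts, totals) if t >= umi_threshold]

    background = []
    for g in range(n_genes):
        mean_empty = sum(row[g] for row in empty) / len(empty) if empty else 0.0
        value = int(mean_empty * coefficient + 0.5)  # round half up
        detected = sum(1 for row in cells if row[g] > 0)
        # Genes detected in too few cells are dropped from the background list.
        if not cells or detected / len(cells) < min_cell_fraction:
            value = 0
        background.append(value)
    return background


def subtract_background(cell_counts, background):
    """Subtract the background from one cell (clamping at zero is an assumption)."""
    return [max(c - b, 0) for c, b in zip(cell_counts, background)]
```

In practice the background would be estimated separately for each batch and then applied to every cell of that batch before cross-tissue comparison.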
Data from: HCL DGE Data
figshare.com
xlsx
Updated Sep 3, 2022
Cite
Guoji Guo (2022). HCL DGE Data [Dataset]. http://doi.org/10.6084/m9.figshare.7235471.v4
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.7235471.v4
Dataset updated
Sep 3, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Guoji Guo
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Single-cell analysis is a valuable tool to dissect cellular heterogeneity in complex systems. Yet, a systematic single-cell atlas has not been achieved for human beings. We used single-cell RNA sequencing to determine the cell type composition of all major human organs and construct a basic scheme for the human cell landscape (HCL). We reveal a single-cell hierarchy for many tissues that has not been well characterized previously. We present a "single-cell HCL analysis" pipeline that accurately defines human cell types, and exemplify its utility in stem cell biology. Finally, we perform single-cell comparative analysis for the human and mouse cell atlases to reveal the conserved genetic networks in the mammalian system.

File Extension Specification

HCL_Fig1_adata.h5ad: Use scanpy.api.read_h5ad to load AnnData. This AnnData stores the data used for HCL Figure 1.

HCL_Fig1_cell_Info: The information, which includes the cell names, samples, clusters, stages, batches, donors and cell types for cells of data used for HCL Figure 1.

cluster_markers_HCL&MCA1.1: The cell type annotation and marker genes for 102 HCL clusters of Fig1 and 104 MCA1.1 clusters of SFig.

dge_raw_data.tar: The raw digital expression matrix (dge) of more than 720,000 single cells sorted by tissues. The batch genes were not removed.

dge_rmbatch_data.tar: The batch-gene-removed dge of more than 700,000 primary single cells sorted by tissues. Some tissues are not included due to relatively strong batch effects. This dataset can be used to make a global tissue tSNE plot and do cross-tissue analysis.

annotation_rmbatch_data.tar: The cell annotations, which include cluster ID, tissue of origin, age (gestational age for fetal tissue), clusters and cell types for each rmbatch dge data.

annotation_cluster_info: Modified cell type annotation of each cluster in accord with ClusterID in annotation_rmbatch_data.zip

MCA1.1_adata.h5ad: Use scanpy.api.read_h5ad to load AnnData. This AnnData stores the MCA1.1 data.

MCA1.1_cell_Info: The information, which includes the cell names, samples, clusters, stages, batches, donors and cell types for cells of MCA1.1 data.
45 scholarly articles cite the microsoft/mediflow dataset (per Google Scholar).
