https://choosealicense.com/licenses/cdla-permissive-2.0/
MediFlow
A large-scale synthetic instruction dataset of 2.5M rows (~700k unique instructions) for clinical natural language processing covering 14 task types and 98 fine-grained input clinical documents.
t-SNE 2D Plot of MediFlow Embeddings by Task Types
Dataset Splits
mediflow: 2.5M instruction data for SFT alignment.
mediflow_dpo: ~135k top-quality instructions with GPT-4o generated rejected_output for DPO alignment.
Main Columns
instruction:… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/mediflow.
https://www.etalab.gouv.fr/licence-ouverte-open-licence
This dataset contains the text extracted from 6602 files with the pdf extension
in the data.gouv.fr resource catalogue.
It only includes PDFs of 20 MB or less that are still available at the listed URL.
The extraction was done with PDFBox through its Python wrapper python-pdfbox. PDFs that are images (scans, maps, etc.)
are detected with a simple heuristic: if, after conversion to text with pdfbox,
the resulting file is smaller than 20 bytes, the PDF is considered to be an image.
In that case the file goes through OCR, which is done with Tesseract through its Python wrapper pyocr.
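The size-based image check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: the names `needs_ocr` and `MIN_TEXT_BYTES` are hypothetical, and the sketch assumes the size is measured on the UTF-8 encoding of the text that pdfbox produced.

```python
MIN_TEXT_BYTES = 20  # threshold quoted in the dataset description


def needs_ocr(extracted_text: str, min_bytes: int = MIN_TEXT_BYTES) -> bool:
    """Return True when the extracted text is so short that the PDF is
    probably an image (scan, map, ...) and should be sent to OCR."""
    # Assumption: the "file size" is the byte length of the UTF-8 encoded text.
    return len(extracted_text.encode("utf-8")) < min_bytes
```

A caller would run the PDF-to-text conversion first, pass the resulting string to `needs_ocr`, and fall back to an OCR step only when it returns True.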
The result is a set of txt files extracted from the PDFs,
sorted by organisation (the organisation that published the resource). There are 175 organisations in this dataset, hence 175 folders.
Each file is named {id-du-dataset}--{id-de-la-ressource}.txt (dataset id, then resource id).
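The naming convention makes it possible to recover both identifiers from a file name. The helper below is a hypothetical sketch (the name `split_resource_filename` is not part of the dataset tooling) and it assumes the dataset id never contains a double hyphen, so the first `--` is the separator.

```python
from pathlib import Path
from typing import Tuple


def split_resource_filename(filename: str) -> Tuple[str, str]:
    """Split '{dataset-id}--{resource-id}.txt' into (dataset_id, resource_id)."""
    stem = Path(filename).stem  # drops the directory part and the .txt suffix
    # Assumption: the first "--" separates the two ids.
    dataset_id, sep, resource_id = stem.partition("--")
    if not sep or not dataset_id or not resource_id:
        raise ValueError(f"unexpected file name: {filename!r}")
    return dataset_id, resource_id
```

For example, `split_resource_filename("53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt")` yields the dataset id `53ba55c4a3a729219b7beae2` and the resource id `0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a`.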
The data.gouv.fr resource catalogue.
Text files for each pdf resource
found in the catalogue that was converted successfully and met the constraints above.
The directory tree is as follows:
.
├── ACTION_Nogent-sur-Marne
│ ├── 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
│ ├── 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
│ ├── 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
│ ├── 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
│ ├── 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
│ └── ...
├── Aeroport_La_Rochelle-Ile_de_Re
├── Agence_de_services_et_de_paiement_ASP
├── Agence_du_Numerique
├── ...
The top 10 organisations by number of documents are:
[('Les_Lilas', 1294),
('Ville_de_Pirae', 1099),
('Region_Hauts-de-France', 592),
('Ressourcerie_datalocale', 297),
('NA', 268),
('CORBION', 244),
('Education_Nationale', 189),
('Incubateur_de_Services_Numeriques', 157),
('Ministere_des_Solidarites_et_de_la_Sante', 148),
('Communaute_dAgglomeration_Plaine_Vallee', 142)]
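A ranking of this kind can be reproduced from the folder layout with the standard library alone. The sketch below is an assumption about how such a count could be done, not the authors' code; `top_organisations` is a hypothetical name, and it assumes one sub-folder per organisation with the txt files directly inside.

```python
from collections import Counter
from pathlib import Path
from typing import List, Tuple


def top_organisations(root: str, n: int = 10) -> List[Tuple[str, int]]:
    """Count txt files per organisation folder and return the n largest."""
    counts: Counter = Counter()
    for org_dir in Path(root).iterdir():
        if org_dir.is_dir():
            # Assumption: files sit directly in the organisation folder.
            counts[org_dir.name] = sum(1 for _ in org_dir.glob("*.txt"))
    return counts.most_common(n)
```

Calling `top_organisations("path/to/dataset")` on an extracted copy of the dataset returns a list of `(organisation, count)` pairs in the same shape as the list above.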
A 2D view of these documents (HashFeatures+TruncatedSVD+t-SNE):
Figure: t-SNE plot of the DGF texts (https://raw.githubusercontent.com/psorianom/data_gouv_text/master/img/samplefigure.png)
The Python scripts used for this extraction are available here.
Because of the quality of the original PDFs (low-resolution scans, misaligned PDFs, ...) and the performance of the pdf-to-txt conversion methods, the results can be very noisy.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Plants are often attacked by various pathogens during their growth, which may cause environmental pollution, food shortages, or economic losses in a given area. Integrating high-throughput phenomics data with computer vision (CV) provides a great opportunity to diagnose plant disease at an early stage and to uncover subtype or stage patterns in disease progression. In this study, we propose a novel computational framework for plant disease identification and subtype discovery through a deep-embedding image-clustering strategy, the Weighted Distance Metric and t-distributed stochastic neighbor embedding algorithm (WDM-tSNE). To verify its effectiveness, we applied our method to four public image datasets. The results demonstrate that the newly developed tool is capable of identifying plant disease and of further uncovering the underlying subtypes associated with pathogenic resistance. In summary, the current framework provides strong clustering performance for root or leaf images of diseased plants with pronounced disease spots or symptoms.
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Magnetic Resonance Imaging (MRI) quality assessment measures were generated for
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Renal cell carcinoma (RCC) is divided into three major histopathologic groups: clear cell (ccRCC), papillary (pRCC) and chromophobe RCC (chRCC). We performed a comprehensive re-analysis of publicly available RCC datasets from the TCGA (The Cancer Genome Atlas) database, combining samples from all three subgroups, for an exploratory transcriptome profiling of RCC subgroups.
Materials and Methods: We used FPKM (fragments per kilobase per million) files derived from the ccRCC, pRCC and chRCC cohorts of the TCGA database, representing transcriptomic data of 891 patients. Using principal component analysis, we visualized the datasets as a t-SNE plot for cluster detection. Clusters were characterized by machine learning, and the resulting gene signatures were validated by correlation analyses in the TCGA dataset and three external datasets (ICGC RECA-EU, CPTAC-3-Kidney, and GSE157256).
Results: Many RCC samples co-clustered according to histopathology. However, a substantial number of samples clustered independently of histopathologic origin (mixed subgroup), demonstrating divergence between histopathology and transcriptomic data. Further analyses of the mixed subgroup via machine learning revealed a predominant mitochondrial gene signature, a trait previously known for chRCC, across all histopathologic subgroups. Additionally, ccRCC samples from the mixed subgroup presented an inverse correlation of mitochondrial and angiogenesis-related genes in the TCGA and in three external validation cohorts. Moreover, mixed subgroup affiliation was associated with a highly significant shorter overall survival for patients with ccRCC and a highly significant longer overall survival for chRCC patients.
Conclusions: Pan-RCC clustering according to RNA-sequencing data revealed a distinct histology-independent subgroup characterized by strengthened mitochondrial and weakened angiogenesis-related gene signatures. Moreover, affiliation with the mixed subgroup went along with a significantly shorter overall survival for ccRCC and a longer overall survival for chRCC patients. Further research could offer a therapy stratification by specifically addressing the mitochondrial metabolism of such tumors and its microenvironment.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MCA single-cell DGE data (cells with >500 UMI) for the following manuscript: Mapping the Mouse Cell Atlas by Microwell-seq
MCA_500more_dge.rar: The raw digital expression matrix (dge) of more than 400,000 single cells sorted by tissues. All cells have more than 500 transcripts. The batch genes were not removed.
MCA_BatchRemove_dge.zip: The batch-gene-removed dge of more than 200,000 primary single cells sorted by tissues. Some tissues are not included due to relatively strong batch effects. This dataset can be used to make a global tissue tSNE plot and do cross-tissue analysis.
MCA_CellAssignments.csv: The annotation of cells, which includes the cell names, cluster ID, tissue of origin, experimental batches and cell barcodes.
MCA_Figure2-batch-removed.txt.tar.gz: The batch-removed dge of approximately 60,000 high-quality cells. 1500 cells were sampled from each of 43 tissues. This sampled data is used for Figure 2.
MCA_Figure2_Cell.info.xlsx: The annotations of cells used in Figure 2. Sheet1: The annotations of each cell used in Figure 2, including cell names, cluster ID and tissue of origin. Sheet2: The annotations of 98 clusters in Figure 2. Sheet3: The composition of cell numbers in 98 clusters and 43 tissues.
MCA_Batch Information.xlsx: The batch information, which includes the age and gender of the mouse, and experiment batches for MCA data.
MCA_BatchRemoved_Merge_dge.h5ad: The updated dge with batch genes removed. It can be read with the scanpy Python package. About 333,778 cells are included.
MCA_BatchRemoved_Merge_dge_cellinfo.csv: The cell information of MCA_BatchRemoved_Merge_dge.h5ad.
Batch effect removal
For cross-tissue comparison, we removed the batch gene background to improve presentation. We assume that, for each experimental batch, the cell barcodes with fewer than 500 UMI correspond to empty beads exposed to free RNA during the cell lysis, RNA capture and washing steps. The batch gene background value is defined as the average gene detection for all cellular barcodes with fewer than 500 UMI, multiplied by a coefficient of 2, and then rounded to the nearest integer. Genes detected in less than 25% of all cells are removed from the batch gene background list. We subtract the batch gene background for each cell from the digital expression matrix before making the cross-tissue comparison figures.
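The background-subtraction rule described above can be sketched in plain Python. This is a simplified reading of the text, not the MCA pipeline itself: the function names are hypothetical, barcodes are modelled as gene-to-count dictionaries for a single batch, "all cells" is taken to mean barcodes with at least 500 UMI, and subtracted counts are clipped at zero. Each of those choices is an assumption.

```python
import math
from typing import Dict, List

UMI_THRESHOLD = 500       # barcodes below this are treated as empty beads
BACKGROUND_COEFF = 2      # coefficient quoted in the description
MIN_CELL_FRACTION = 0.25  # genes seen in fewer cells leave the background list

Counts = Dict[str, int]   # gene name -> UMI count for one barcode


def batch_background(barcodes: List[Counts]) -> Counts:
    """Estimate the per-gene background for one experimental batch."""
    empty = [b for b in barcodes if sum(b.values()) < UMI_THRESHOLD]
    cells = [b for b in barcodes if sum(b.values()) >= UMI_THRESHOLD]
    if not empty or not cells:
        return {}
    background: Counts = {}
    for gene in {g for b in empty for g in b}:
        mean = sum(b.get(gene, 0) for b in empty) / len(empty)
        # Round half up to the nearest integer after applying the coefficient.
        value = math.floor(mean * BACKGROUND_COEFF + 0.5)
        detected = sum(1 for c in cells if c.get(gene, 0) > 0) / len(cells)
        if value > 0 and detected >= MIN_CELL_FRACTION:
            background[gene] = value
    return background


def subtract_background(cell: Counts, background: Counts) -> Counts:
    """Subtract the background from one cell (assumption: clip at zero)."""
    return {g: max(n - background.get(g, 0), 0) for g, n in cell.items()}
```

In practice this would be applied once per batch, with `subtract_background` called on every barcode that passes the 500 UMI threshold before the cross-tissue figures are drawn.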
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single-cell analysis is a valuable tool to dissect cellular heterogeneity in complex systems. Yet, a systematic single-cell atlas has not been achieved for human beings. We used single-cell RNA sequencing to determine the cell type composition of all major human organs and construct a basic scheme for the human cell landscape (HCL). We reveal a single-cell hierarchy for many tissues that has not been well characterized previously. We present a "single-cell HCL analysis" pipeline that accurately defines human cell types, and exemplify its utility in stem cell biology. Finally, we perform single-cell comparative analysis for the human and mouse cell atlases to reveal the conserved genetic networks in the mammalian system.
File Extension Specification
HCL_Fig1_adata.h5ad: Use scanpy.api.read_h5ad to load AnnData. This AnnData stores the data used for HCL Figure 1.
HCL_Fig1_cell_Info: The information for the cells used in HCL Figure 1, which includes the cell names, samples, clusters, stages, batches, donors and cell types.
cluster_markers_HCL&MCA1.1: The cell type annotation and marker genes for 102 HCL clusters of Fig1 and 104 MCA1.1 clusters of SFig.
dge_raw_data.tar: The raw digital expression matrix (dge) of more than 720,000 single cells sorted by tissues. The batch genes were not removed.
dge_rmbatch_data.tar: The batch-gene-removed dge of more than 700,000 primary single cells sorted by tissues. Some tissues are not included due to relatively strong batch effects. This dataset can be used to make a global tissue tSNE plot and do cross-tissue analysis.
annotation_rmbatch_data.tar: The cell annotations, which include cluster ID, tissue of origin, age (gestational age for fetal tissue), clusters and cell types for each rmbatch dge data.
annotation_cluster_info: Modified cell type annotation of each cluster, in accord with ClusterID in annotation_rmbatch_data.zip
MCA1.1_adata.h5ad: Use scanpy.api.read_h5ad to load AnnData. This AnnData stores the MCA1.1 data.
MCA1.1_cell_Info: The information for the cells of the MCA1.1 data, which includes the cell names, samples, clusters, stages, batches, donors and cell types.