7 datasets found
  1. mediflow

    • huggingface.co
    Cite
    Microsoft, mediflow [Dataset]. https://huggingface.co/datasets/microsoft/mediflow
    Dataset authored and provided by
    Microsoft
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    MediFlow

    A large-scale synthetic instruction dataset of 2.5M rows (~700k unique instructions) for clinical natural language processing covering 14 task types and 98 fine-grained input clinical documents.

    [Figure: t-SNE 2D plot of MediFlow embeddings by task type]

    Dataset Splits

    mediflow: 2.5M instruction data for SFT alignment.
    mediflow_dpo: ~135k top-quality instructions with GPT-4o generated rejected_output for DPO alignment.

    Main Columns

    instruction:… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/mediflow.

  2. Texte provenant des pdfs trouvés sur data.gouv.fr (Text from the PDFs found on data.gouv.fr)

    • data.europa.eu
    tgz
    Updated May 20, 2020
    Cite
    Pavel Soriano (2020). Texte provenant des pdfs trouvés sur data.gouv.fr [Dataset]. https://data.europa.eu/data/datasets/5ec45f516a58eec727e79af7?locale=sv
    Available download formats: tgz
    Dataset updated
    May 20, 2020
    Dataset authored and provided by
    Pavel Soriano
    License

    https://www.etalab.gouv.fr/licence-ouverte-open-licence

    Area covered
    France
    Description

    Text extracted from the PDFs found on data.gouv.fr

    Description

    This dataset contains the text extracted from 6602 files with the pdf extension in the data.gouv.fr resource catalogue.

    The dataset contains only PDFs that are 20 Mb or smaller and that are still available at the listed URL.

    The extraction was performed with PDFBox via its Python wrapper python-pdfbox. PDFs that are images (scans, maps, etc.) are detected with a simple heuristic: if, after conversion to text with pdfbox, the resulting file is smaller than 20 bytes, the PDF is considered to be an image. In that case OCR is applied, using Tesseract via its Python wrapper pyocr.
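
The image-detection heuristic described above can be sketched as follows. This is only an illustration under stated assumptions, not the author's script: the function name `needs_ocr`, the constant `MIN_TEXT_BYTES`, and the choice to measure the UTF-8 encoded length of the extracted text are hypothetical, and the actual PDFBox and Tesseract calls are left out.

```python
# Threshold quoted in the dataset description: outputs under 20 bytes
# are presumed to come from image-only PDFs.
MIN_TEXT_BYTES = 20


def needs_ocr(extracted_text: str, min_bytes: int = MIN_TEXT_BYTES) -> bool:
    """Return True when the extracted text is so short that the PDF is
    presumed to be an image (scan, map, ...) and should be sent to OCR."""
    # Assumption: the size is measured on the UTF-8 encoding of the text.
    return len(extracted_text.encode("utf-8")) < min_bytes


if __name__ == "__main__":
    for sample in ["", "x" * 19, "x" * 20]:
        print(len(sample), needs_ocr(sample))
```

A pipeline built this way would run the text conversion first and fall back to OCR only for files where `needs_ocr` returns True.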

    The result is a set of txt files extracted from the PDFs, sorted by organisation (the organisation that published the resource). There are 175 organisations in this dataset, hence 175 folders. Each file name follows the pattern {id-du-dataset}--{id-de-la-ressource}.txt (dataset id, then resource id).

    Input

    The data.gouv.fr resource catalogue.

    Output

    Text files for each pdf-type resource found in the catalogue that was converted successfully and that met the constraints above. The directory tree is as follows:

    .
    ├── ACTION_Nogent-sur-Marne
    │ ├── 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
    │ ├── 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
    │ ├── 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
    │ ├── 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
    │ ├── 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
    │ ├── ...
    ├── Aeroport_La_Rochelle-Ile_de_Re
    ├── Agence_de_services_et_de_paiement_ASP
    ├── Agence_du_Numerique
    ├── ...
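
The layout above (one folder per organisation, file names made of a dataset id and a resource id joined by `--`) can be parsed with a few lines of standard-library Python. This is a minimal sketch: `ResourceFile` and `parse_resource_path` are hypothetical names, and it assumes the dataset id itself never contains `--`.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import NamedTuple


class ResourceFile(NamedTuple):
    organisation: str
    dataset_id: str
    resource_id: str


def parse_resource_path(path: str) -> ResourceFile:
    """Split 'Organisation/datasetid--resourceid.txt' into its parts."""
    p = PurePosixPath(path)
    if p.suffix != ".txt":
        raise ValueError(f"not a .txt file: {path}")
    # Assumption: the first '--' separates the dataset id from the resource id.
    dataset_id, sep, resource_id = p.stem.partition("--")
    if not sep or not dataset_id or not resource_id:
        raise ValueError(f"unexpected file name: {path}")
    return ResourceFile(p.parent.name, dataset_id, resource_id)


# Two paths taken from the tree shown in the description.
paths = [
    "ACTION_Nogent-sur-Marne/53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt",
    "ACTION_Nogent-sur-Marne/53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt",
]
parsed = [parse_resource_path(p) for p in paths]

# Counting documents per organisation, as in a "top organisations" ranking.
docs_per_org = Counter(r.organisation for r in parsed)
print(docs_per_org.most_common(10))
```

Applied to the full archive, the same counter would give a per-organisation document count.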
    
    

    Distribution of texts [as of 20 May 2020]

    The top 10 organisations by number of documents are:

    [('Les_Lilas', 1294), ('Ville_de_Pirae', 1099), ('Region_Hauts-de-France', 592), ('Ressourcerie_datalocale', 297), ('NA', 268), ('CORBION', 244), ('Education_Nationale', 189), ('Incubateur_de_Services_Numeriques', 157), ('Ministere_des_Solidarites_et_de_la_Sante', 148), ('Communaute_dAgglomeration_Plaine_Vallee', 142)]

    Their 2D overview (HashFeatures+TruncatedSVD+t-SNE): https://raw.githubusercontent.com/psorianom/data_gouv_text/master/img/samplefigure.png (t-SNE plot of the DGF texts)

    Code

    The Python scripts used for this extraction are available here.

    Remarks

    Owing to the quality of the source PDFs (low-resolution scans, misaligned PDFs, ...) and to the performance of the pdf-->txt conversion methods, the results can be very noisy.

  3. Table_2_A Novel Computational Framework for Precision Diagnosis and Subtype...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 15, 2023
    Cite
    Fei Xia; Xiaojun Xie; Zongqin Wang; Shichao Jin; Ke Yan; Zhiwei Ji (2023). Table_2_A Novel Computational Framework for Precision Diagnosis and Subtype Discovery of Plant With Lesion.XLSX [Dataset]. http://doi.org/10.3389/fpls.2021.789630.s003
    Available download formats: xlsx
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Frontiers
    Authors
    Fei Xia; Xiaojun Xie; Zongqin Wang; Shichao Jin; Ke Yan; Zhiwei Ji
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Plants are often attacked by various pathogens during their growth, which may cause environmental pollution, food shortages, or economic losses in a certain area. Integration of high-throughput phenomics data and computer vision (CV) provides a great opportunity to realize plant disease diagnosis in the early stage and uncover the subtype or stage patterns in the disease progression. In this study, we proposed a novel computational framework for plant disease identification and subtype discovery through a deep-embedding image-clustering strategy, Weighted Distance Metric and the t-distributed stochastic neighbor embedding algorithm (WDM-tSNE). To verify its effectiveness, we applied our method to four public image datasets. The results demonstrated that the newly developed tool is capable of identifying the plant disease and further uncovering the underlying subtypes associated with pathogenic resistance. In summary, the current framework provides great clustering performance for the root or leaf images of diseased plants with pronounced disease spots or symptoms.

  4. MRQy quality measures for TCIA MRI datasets

    • cancerimagingarchive.net
    n/a, xlsx
    Cite
    The Cancer Imaging Archive (2020). MRQy quality measures for TCIA MRI datasets [Dataset]. http://doi.org/10.7937/K9/TCIA.2020.JHZ2-T694
    Available download formats: n/a, xlsx
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Jul 16, 2020
    Dataset funded by
    National Cancer Institute http://www.cancer.gov/
    Description

    Magnetic Resonance Imaging (MRI) quality assessment measures were generated for

    1. T1-weighted post-contrast axial MRI sequences from 133 TCGA-GBM subjects
    2. T1-weighted post-contrast axial MRI sequences from 46 CPTAC-GBM subjects
    3. Both T1- and T2-weighted axial MRI sequences from 54 TCGA-CESC subjects

    All MRI scans for each cohort were downloaded as DICOM files from TCIA and then processed via MRQy to compute quality measures for (a) interrogating the presence of site- or equipment-specific variations within a cohort, and (b) quantifying the impact of MRI artifacts to determine what pre-analytical corrections are needed. The MRQy output can be easily interrogated via the associated HTML5-based front-end, allowing for real-time filtering and visualization. MRQy is available for download at: http://github.com/ccipd/MRQy. Manifest files to download the DICOM images these results were derived from are available in the "Collections Used In This Analysis Result" table.

    In the figure, see (a) MRQy front-end interface for interrogating TCGA-GBM cohort. (b) Outlier dataset identified on the parallel coordinate chart for the CJV quality measure found to exhibit shading artifacts on (c) representative images, especially when compared to (d) a different dataset without this artifact. (e) t-SNE scatter plot of quality measures revealing presence of site-specific batch effects (colors correspond to different sites, note presence of site-specific clusters).

  5. Table_1_Subgroup-Independent Mapping of Renal Cell Carcinoma—Machine...

    • figshare.com
    xlsx
    Updated Jun 8, 2023
    Cite
    André Marquardt; Antonio Giovanni Solimando; Alexander Kerscher; Max Bittrich; Charis Kalogirou; Hubert Kübler; Andreas Rosenwald; Ralf Bargou; Philip Kollmannsberger; Bastian Schilling; Svenja Meierjohann; Markus Krebs (2023). Table_1_Subgroup-Independent Mapping of Renal Cell Carcinoma—Machine Learning Reveals Prognostic Mitochondrial Gene Signature Beyond Histopathologic Boundaries.XLSX [Dataset]. http://doi.org/10.3389/fonc.2021.621278.s009
    Available download formats: xlsx
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    Frontiers
    Authors
    André Marquardt; Antonio Giovanni Solimando; Alexander Kerscher; Max Bittrich; Charis Kalogirou; Hubert Kübler; Andreas Rosenwald; Ralf Bargou; Philip Kollmannsberger; Bastian Schilling; Svenja Meierjohann; Markus Krebs
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Renal cell carcinoma (RCC) is divided into three major histopathologic groups—clear cell (ccRCC), papillary (pRCC) and chromophobe RCC (chRCC). We performed a comprehensive re-analysis of publicly available RCC datasets from the TCGA (The Cancer Genome Atlas) database, thereby combining samples from all three subgroups, for an exploratory transcriptome profiling of RCC subgroups.

    Materials and Methods: We used FPKM (fragments per kilobase per million) files derived from the ccRCC, pRCC and chRCC cohorts of the TCGA database, representing transcriptomic data of 891 patients. Using principal component analysis, we visualized datasets as t-SNE plot for cluster detection. Clusters were characterized by machine learning, resulting gene signatures were validated by correlation analyses in the TCGA dataset and three external datasets (ICGC RECA-EU, CPTAC-3-Kidney, and GSE157256).

    Results: Many RCC samples co-clustered according to histopathology. However, a substantial number of samples clustered independently from histopathologic origin (mixed subgroup)—demonstrating divergence between histopathology and transcriptomic data. Further analyses of mixed subgroup via machine learning revealed a predominant mitochondrial gene signature—a trait previously known for chRCC—across all histopathologic subgroups. Additionally, ccRCC samples from mixed subgroup presented an inverse correlation of mitochondrial and angiogenesis-related genes in the TCGA and in three external validation cohorts. Moreover, mixed subgroup affiliation was associated with a highly significant shorter overall survival for patients with ccRCC—and a highly significant longer overall survival for chRCC patients.

    Conclusions: Pan-RCC clustering according to RNA-sequencing data revealed a distinct histology-independent subgroup characterized by strengthened mitochondrial and weakened angiogenesis-related gene signatures. Moreover, affiliation to mixed subgroup went along with a significantly shorter overall survival for ccRCC and a longer overall survival for chRCC patients. Further research could offer a therapy stratification by specifically addressing the mitochondrial metabolism of such tumors and its microenvironment.

  6. MCA DGE Data

    • figshare.com
    application/gzip
    Updated May 30, 2023
    Cite
    Guoji Guo (2023). MCA DGE Data [Dataset]. http://doi.org/10.6084/m9.figshare.5435866.v8
    Available download formats: application/gzip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare http://figshare.com/
    Authors
    Guoji Guo
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MCA single cell DGE data (cells with >500 UMI) for the following manuscript: Mapping the Mouse Cell Atlas by Microwell-seq

    MCA_500more_dge.rar: The raw digital expression matrix (dge) of more than 400,000 single cells sorted by tissues. All cells have more than 500 transcripts. The batch genes were not removed.
    MCA_BatchRemove_dge.zip: The batch gene removed dge of more than 200,000 primary single cells sorted by tissues. Some tissues are not included due to relatively strong batch effects. This dataset can be used to make a global tissue tSNE plot and do cross-tissue analysis.
    MCA_CellAssignments.csv: The annotation of cells, which includes the cell names, cluster ID, source tissues, experimental batches and cell barcodes.
    MCA_Figure2-batch-removed.txt.tar.gz: The batch removed dge of approximately 60,000 cells of high quality. 1500 cells were sampled from each of the 43 tissues. This sampled data is used for Figure 2.
    MCA_Figure2_Cell.info.xlsx: The annotations of cells used in Figure 2. Sheet1: The annotations of each cell used in Figure 2, including cell names, cluster ID, source tissues. Sheet2: The annotations of 98 clusters in Figure 2. Sheet3: The composition of cell numbers in 98 clusters and 43 tissues.
    MCA_Batch Information.xlsx: The batch information, which includes the age and gender of the mouse, and experiment batches for MCA data.
    MCA_BatchRemoved_Merge_dge.h5ad: The updated dge with batch gene removed. It can be read with the scanpy Python package. About 333778 cells are included.
    MCA_BatchRemoved_Merge_dge_cellinfo.csv: The cell information of MCA_BatchRemoved_Merge_dge.h5ad.

    Batch effect removal

    For cross-tissue comparison, we removed the batch gene background to improve presentation. We assume that for each batch of experiment, the cell barcodes with less than 500 UMI correspond to empty beads exposed to free RNA during the cell lysis, RNA capture and washing steps. The batch gene background value is defined as the average gene detection for all cellular barcodes with less than 500 UMI, multiplied by a coefficient of 2, and then rounded to the nearest integer. Genes detected in less than 25% of all cells are removed from the batch gene background list. We subtract the batch gene background for each cell from the digital expression matrix before making the cross-tissue comparison figures.
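
The batch background subtraction described above can be sketched in plain Python for a single batch. This is an illustration under assumptions, not the authors' code. Function names are hypothetical. "Average gene detection" is read as the mean UMI count per gene. "All cells" is read as the barcodes at or above the UMI threshold in the same batch. Ties are rounded half up, and negative values after subtraction are clipped to zero.

```python
import math

UMI_THRESHOLD = 500       # barcodes below this are treated as empty beads
COEFFICIENT = 2           # multiplier quoted in the description
MIN_CELL_FRACTION = 0.25  # genes detected in fewer cells get no background


def round_half_up(x: float) -> int:
    # Assumption: "nearest integer" with ties rounded up.
    return math.floor(x + 0.5)


def batch_gene_background(counts, umi_threshold=UMI_THRESHOLD,
                          coefficient=COEFFICIENT,
                          min_cell_fraction=MIN_CELL_FRACTION):
    """counts: list of barcodes, each a list of per-gene UMI counts."""
    empty = [row for row in counts if sum(row) < umi_threshold]
    cells = [row for row in counts if sum(row) >= umi_threshold]
    n_genes = len(counts[0]) if counts else 0
    if not empty or not cells:
        return [0] * n_genes
    background = []
    for g in range(n_genes):
        mean_empty = sum(row[g] for row in empty) / len(empty)
        detected = sum(1 for row in cells if row[g] > 0) / len(cells)
        if detected < min_cell_fraction:
            background.append(0)  # gene dropped from the background list
        else:
            background.append(round_half_up(coefficient * mean_empty))
    return background


def subtract_background(counts, background, umi_threshold=UMI_THRESHOLD):
    """Subtract the background from every real cell, clipping at zero."""
    return [[max(c - b, 0) for c, b in zip(row, background)]
            for row in counts if sum(row) >= umi_threshold]


# Toy batch with a threshold of 10 instead of 500 to keep the numbers small.
toy = [[1, 1, 2], [2, 1, 0],                                  # empty beads
       [8, 5, 0], [6, 0, 9], [7, 0, 4], [9, 0, 3], [5, 0, 6]]  # cells
bg = batch_gene_background(toy, umi_threshold=10)
corrected = subtract_background(toy, bg, umi_threshold=10)
print(bg, corrected[0])
```

In the toy batch the second gene is seen in only one of five cells, so it is left out of the background even though the empty beads contain it.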

  7. Data from: HCL DGE Data

    • figshare.com
    xlsx
    Updated Sep 3, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Guoji Guo (2022). HCL DGE Data [Dataset]. http://doi.org/10.6084/m9.figshare.7235471.v4
    Available download formats: xlsx
    Dataset updated
    Sep 3, 2022
    Dataset provided by
    Figshare http://figshare.com/
    Authors
    Guoji Guo
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Single-cell analysis is a valuable tool to dissect cellular heterogeneity in complex systems. Yet, a systematic single-cell atlas has not been achieved for human beings. We used single-cell RNA sequencing to determine cell type composition of all major human organs and construct a basic scheme for the human cell landscape (HCL). We reveal a single-cell hierarchy for many tissues that has not been well characterized previously. We present a "single-cell HCL analysis" pipeline that accurately defines human cell types, and exemplify its utility in stem cell biology. Finally, we perform single-cell comparative analysis for human and mouse cell atlas to reveal the conserved genetic networks in the mammalian system.

    File Extension Specification

    HCL_Fig1_adata.h5ad: Use scanpy.api.read_h5ad to load AnnData. This AnnData stores the data used for HCL Figure 1.
    HCL_Fig1_cell_Info: The information, which includes the cell names, samples, clusters, stages, batches, donors and cell types for the cells used in HCL Figure 1.
    cluster_markers_HCL&MCA1.1: The cell type annotation and marker genes for 102 HCL clusters of Fig1 and 104 MCA1.1 clusters of SFig.
    dge_raw_data.tar: The raw digital expression matrix (dge) of more than 720,000 single cells sorted by tissues. The batch genes were not removed.
    dge_rmbatch_data.tar: The batch gene removed dge of more than 700,000 primary single cells sorted by tissues. Some tissues are not included due to relatively strong batch effects. This dataset can be used to make a global tissue tSNE plot and do cross-tissue analysis.
    annotation_rmbatch_data.tar: The cell annotations, which include cluster ID, source tissues, age (gestational age for fetal tissue), clusters and cell types for each rmbatch dge data.
    annotation_cluster_info: Modified cell type annotation of each cluster in accord with ClusterID in annotation_rmbatch_data.zip
    MCA1.1_adata.h5ad: Use scanpy.api.read_h5ad to load AnnData. This AnnData stores the MCA1.1 data.
    MCA1.1_cell_Info: The information, which includes the cell names, samples, clusters, stages, batches, donors and cell types for cells of MCA1.1 data.

microsoft/mediflow: 45 scholarly articles cite this dataset.