3 datasets found
  1. Z

    DrCyZ: Techniques for analyzing and extracting useful information from CyZ.

    • data.niaid.nih.gov
    Updated Jan 19, 2022
    Cite
    de Zarzà, I. (2022). DrCyZ: Techniques for analyzing and extracting useful information from CyZ. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5816857
    Dataset updated
    Jan 19, 2022
    Dataset provided by
    de Zarzà, I.
    de Curtò, J.
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    DrCyZ: Techniques for analyzing and extracting useful information from CyZ.

    Samples from NASA Perseverance and a set of GAN-generated synthetic images from Neural Mars.

    Repository: https://github.com/decurtoidiaz/drcyz

    Subset of samples from the following dataset (which includes tools to visualize and analyse it):

    CyZ: MARS Space Exploration Dataset. [https://doi.org/10.5281/zenodo.5655473]

    Images from NASA missions of the celestial body.

    Repository: https://github.com/decurtoidiaz/cyz

    Authors:

    J. de Curtò c@decurto.be

    I. de Zarzà z@dezarza.be

    File Information from DrCyZ-1.1

    • Subset of samples from Perseverance (drcyz/c).
      ∙ png (drcyz/c/png).
        PNG files (5025) selected from NASA Perseverance (CyZ-1.1) after t-SNE and K-means Clustering. 
      ∙ csv (drcyz/c/csv).
        CSV file.
    
    
    • Resized samples from Perseverance (drcyz/c+).
      ∙ png 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/drcyz_64-1024).
        PNG files resized to the corresponding size.
      ∙ TFRecords 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/tfr_drcyz_64-1024).
        TFRecords resized to the corresponding size, for import into TensorFlow.
    
    
    • Synthetic images from Neural Mars generated using StyleGAN2-ADA (drcyz/drcyz+).
      ∙ png 100; 1000; 10000 (drcyz/drcyz+/drcyz_256_100-10000)
        Subsets of 100, 1000 and 10000 PNG files at size 256x256.
    
    
    • Network checkpoint from StyleGAN2-ADA trained at size 256x256 (drcyz/model_drcyz).
      ∙ network-snapshot-000798-drcyz.pkl
    
    
    • Python notebooks to analyse the original dataset and reproduce the experiments: K-means clustering, t-SNE, PCA, synthetic generation using StyleGAN2-ADA and instance segmentation using DeepLab (https://github.com/decurtoidiaz/drcyz/tree/main/dr_cyz+).
      ∙ clustering_curiosity_de_curto_and_de_zarza.ipynb
        K-means Clustering and PCA(2) with images from Curiosity.
      ∙ clustering_perseverance_de_curto_and_de_zarza.ipynb
        K-means Clustering and PCA(2) with images from Perseverance.
      ∙ tsne_curiosity_de_curto_and_de_zarza.ipynb
        t-SNE and PCA (components selected to explain 99% of variance) with images from Curiosity.
      ∙ tsne_perseverance_de_curto_and_de_zarza.ipynb
        t-SNE and PCA (components selected to explain 99% of variance) with images from Perseverance.
      ∙ Stylegan2-ada_de_curto_and_de_zarza.ipynb
        StyleGAN2-ADA trained on a subset of images from NASA Perseverance (DrCyZ).
      ∙ statistics_perseverance_de_curto_and_de_zarza.ipynb
        Compute statistics from synthetic samples generated by StyleGAN2-ADA (DrCyZ) and images from NASA Perseverance (CyZ).
      ∙ DeepLab_TFLite_ADE20k_de_curto_and_de_zarza.ipynb
        Example of instance segmentation using DeepLab with a sample from NASA Perseverance (DrCyZ).
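
The t-SNE notebooks listed above select PCA components "to explain 99% of variance". As a rough illustration of that selection rule only (not the authors' notebook code), the sketch below picks the smallest number of components whose cumulative variance reaches a target fraction. The function name `components_for_variance` and the sample variances are assumptions made for this example; with scikit-learn, passing a float such as `PCA(n_components=0.99)` applies a comparable rule.

```python
from itertools import accumulate


def components_for_variance(variances, target=0.99):
    """Return the smallest k such that the k largest variances
    account for at least `target` of the total variance.

    `variances` stands for per-component variances (for example PCA
    eigenvalues); this helper is hypothetical, for illustration only.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    ordered = sorted(variances, reverse=True)  # largest component first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("variances must have a positive sum")
    for k, running in enumerate(accumulate(ordered), start=1):
        if running / total >= target:
            return k
    # Fallback for rounding effects: keep every component.
    return len(ordered)


if __name__ == "__main__":
    # Made-up variances: the first two components carry 99.5% of the total.
    print(components_for_variance([900, 95, 4, 1]))
```

Reducing the dimensionality this way before t-SNE is a common design choice because t-SNE becomes slow and noisy on very high-dimensional inputs such as raw pixel vectors.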
    
  2. e

    Text from PDFs found on data.gouv.fr (Texte provenant des pdfs trouvés sur data.gouv.fr)

    • data.europa.eu
    tgz
    Updated May 20, 2020
    Cite
    Pavel Soriano (2020). Texte provenant des pdfs trouvés sur data.gouv.fr [Dataset]. https://data.europa.eu/data/datasets/5ec45f516a58eec727e79af7?locale=sv
    Available download formats: tgz
    Dataset updated
    May 20, 2020
    Dataset authored and provided by
    Pavel Soriano
    License

    https://www.etalab.gouv.fr/licence-ouverte-open-licence

    Area covered
    France
    Description

    Text extracted from the PDFs found on data.gouv.fr

    This dataset contains the text extracted from 6602 files that have the pdf extension in the data.gouv.fr resource catalogue.

    The dataset contains only PDFs of 20 Mb or less that are still available at the URL indicated.

    Extraction was performed with PDFBox via its Python wrapper python-pdfbox. PDFs that are images (scans, maps, etc.) are detected with a simple heuristic: if, after conversion to text with pdfbox, the resulting file is smaller than 20 bytes, the PDF is considered to be an image. In that case OCR is applied, using Tesseract via its Python wrapper pyocr.
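
The size heuristic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the dataset author's script: `extract` and `ocr` are hypothetical callables standing in for the PDFBox and Tesseract steps, and the function names are invented for this example. Only the 20-byte threshold comes from the description.

```python
from typing import Callable


def looks_like_scanned_pdf(extracted_text: str, min_bytes: int = 20) -> bool:
    """Apply the heuristic: text output under `min_bytes` bytes means
    the PDF is probably an image (scan, map, etc.)."""
    return len(extracted_text.encode("utf-8")) < min_bytes


def extract_text(
    pdf_path: str,
    extract: Callable[[str], str],  # hypothetical stand-in for the PDFBox step
    ocr: Callable[[str], str],      # hypothetical stand-in for the Tesseract step
    min_bytes: int = 20,
) -> str:
    """Try plain text extraction first and fall back to OCR when the
    result is too small to be a real text layer."""
    text = extract(pdf_path)
    if looks_like_scanned_pdf(text, min_bytes):
        return ocr(pdf_path)
    return text


if __name__ == "__main__":
    # Dummy callables so the sketch runs without PDFBox or Tesseract.
    result = extract_text("scan.pdf", extract=lambda p: "", ocr=lambda p: "text from OCR")
    print(result)
```

Passing the two steps in as callables keeps the heuristic testable without either tool installed.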

    The result is a set of txt files extracted from the PDFs, sorted by organisation (the organisation that published the resource). There are 175 organisations in this dataset, hence 175 folders. Each file name follows the pattern {id-du-dataset}--{id-de-la-ressource}.txt.

    Input

    data.gouv.fr resource catalogue.

    Output

    Text files for each pdf-type resource found in the catalogue that was successfully converted and satisfied the constraints above. The directory tree is as follows:

    .
    ├── ACTION_Nogent-sur-Marne
    │ ├── 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
    │ ├── 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
    │ ├── 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
    │ ├── 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
    │ ├── 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
    │ └── ...
    ├── Aeroport_La_Rochelle-Ile_de_Re
    ├── Agence_de_services_et_de_paiement_ASP
    ├── Agence_du_Numerique
    ├── ...
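
Given the layout above, one folder per organisation and files named {id-du-dataset}--{id-de-la-ressource}.txt, a path can be split back into its three parts. The sketch below is only an illustration: `ResourceFile` and `parse_resource_path` are names invented here, and it assumes the dataset id never contains a double hyphen, which holds for the sample names shown.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ResourceFile(NamedTuple):
    organisation: str
    dataset_id: str
    resource_id: str


def parse_resource_path(path: str) -> ResourceFile:
    """Split '<organisation>/<dataset-id>--<resource-id>.txt' into its parts."""
    p = PurePosixPath(path)
    if p.suffix != ".txt":
        raise ValueError(f"not a .txt file: {path}")
    # The first '--' separates the dataset id from the resource id.
    dataset_id, sep, resource_id = p.stem.partition("--")
    if not sep or not dataset_id or not resource_id:
        raise ValueError(f"unexpected file name: {p.name}")
    return ResourceFile(p.parent.name, dataset_id, resource_id)


if __name__ == "__main__":
    sample = (
        "ACTION_Nogent-sur-Marne/"
        "53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt"
    )
    print(parse_resource_path(sample))
```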
    
    

    Distribution of the texts [as of May 20, 2020]

    The top 10 organisations by number of documents are:

    [('Les_Lilas', 1294),
     ('Ville_de_Pirae', 1099),
     ('Region_Hauts-de-France', 592),
     ('Ressourcerie_datalocale', 297),
     ('NA', 268),
     ('CORBION', 244),
     ('Education_Nationale', 189),
     ('Incubateur_de_Services_Numeriques', 157),
     ('Ministere_des_Solidarites_et_de_la_Sante', 148),
     ('Communaute_dAgglomeration_Plaine_Vallee', 142)]

    Their 2D projection (HashFeatures+TruncatedSVD+t-SNE) is shown in the t-SNE plot of the DGF texts: https://raw.githubusercontent.com/psorianom/data_gouv_text/master/img/samplefigure.png
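
A ranking of this kind could be recomputed from the extracted archive by counting txt files per organisation folder. The following is a hedged sketch, not the script that produced the published figures: `top_organisations` is a name invented here, and the sample paths are made up.

```python
from collections import Counter
from pathlib import PurePosixPath


def top_organisations(paths, n=10):
    """Count .txt files per parent folder (one folder per organisation)
    and return the n largest as (organisation, count) pairs."""
    counts = Counter(
        PurePosixPath(p).parent.name
        for p in paths
        if str(p).endswith(".txt")
    )
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up relative paths following the documented naming pattern.
    sample = [
        "Les_Lilas/aaa--111.txt",
        "Les_Lilas/aaa--222.txt",
        "CORBION/bbb--333.txt",
    ]
    print(top_organisations(sample))
```

On a real extraction, the paths could come from something like `pathlib.Path(root).rglob("*.txt")`, where `root` is wherever the tgz was unpacked.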

    Code

    The Python scripts used to perform this extraction are here.

    Remarks

    Owing to the quality of the original PDFs (low-resolution scans, misaligned PDFs, ...) and to the performance of the pdf-to-txt conversion methods, the results can be very noisy.

  3. e

    Text from PDFs found on data.gouv.fr (Tekst fra pdf'er fundet på data.gouv.fr)

    • data.europa.eu
    tgz
    Cite
    Pavel Soriano, Tekst fra pdf'er fundet på data.gouv.fr [Dataset]. https://data.europa.eu/data/datasets/5ec45f516a58eec727e79af7?locale=da
    Available download formats: tgz (74434932)
    Dataset authored and provided by
    Pavel Soriano
    License

    https://www.etalab.gouv.fr/licence-ouverte-open-licence

    Description

    Text extracted from PDFs found on data.gouv.fr

    This dataset contains text extracted from 6602 files that have the pdf extension in the data.gouv.fr resource catalogue.

    The dataset contains only PDFs of 20 Mb or less that are still available at the URL indicated.

    Extraction was done with PDFBox via its Python wrapper python-pdfbox. PDF files that are images (scans, maps, etc.) are detected with a simple heuristic: if the file size after conversion to text with PDFBox is less than 20 bytes, the PDF is considered to be an image. In that case OCR is performed, using Tesseract via its Python wrapper pyocr.

    The result is a set of txt files from the PDFs, sorted by organisation (the organisation that published the resource). There are 175 organisations in this dataset, hence 175 folders. Each file name follows the pattern {id-du-dataset}--{id-de-la-ressource}.txt.

    Input

    data.gouv.fr resource catalogue.

    Output

    Text files for each pdf resource found in the catalogue that was successfully converted and met the constraints above. The directory tree is as follows:

    .
    ├── ACTION_Nogent-sur-Marne
    │ ├── 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
    │ ├── 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
    │ ├── 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
    │ ├── 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
    │ ├── 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
    │ └── ...
    ├── Aeroport_La_Rochelle-Ile_de_Re
    ├── Agence_de_services_et_de_paiement_ASP
    ├── Agence_du_Numerique
    ├── ...

    Distribution of the texts [as of May 20, 2020]

    The top 10 organisations by number of documents are:

    [('Les_Lilas', 1294),
     ('Ville_de_Pirae', 1099),
     ('Region_Hauts-de-France', 592),
     ('Ressourcerie_datalocale', 297),
     ('NA', 268),
     ('CORBION', 244),
     ('Education_Nationale', 189),
     ('Incubateur_de_Services_Numeriques', 157),
     ('Ministere_des_Solidarites_et_de_la_Sante', 148),
     ('Communaute_dAgglomeration_Plaine_Vallee', 142)]

    Their 2D projection (HashFeatures+TruncatedSVD+t-SNE) is shown in the t-SNE plot of the DGF texts.

    Code

    The Python scripts used to perform this extraction are here.

    Remarks

    Owing to the quality of the original PDFs (low-resolution scans, misaligned PDFs, ...) and to the performance of the pdf-to-txt conversion methods, the results can be very noisy.

