3 datasets found

DrCyZ: Techniques for analyzing and extracting useful information from CyZ.

data.niaid.nih.gov

Updated Jan 19, 2022

Cite

de Zarzà, I. (2022). DrCyZ: Techniques for analyzing and extracting useful information from CyZ. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5816857

Dataset provided by

de Curtò, J.
de Zarzà, I.

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

DrCyZ: Techniques for analyzing and extracting useful information from CyZ.

Samples from NASA Perseverance and a set of GAN-generated synthetic images from Neural Mars.

Repository: https://github.com/decurtoidiaz/drcyz

A subset of samples from the following dataset, together with tools to visualize and analyse it:

CyZ: MARS Space Exploration Dataset. [https://doi.org/10.5281/zenodo.5655473]

Images from NASA missions of the celestial body.

Repository: https://github.com/decurtoidiaz/cyz

Authors:

J. de Curtò c@decurto.be

I. de Zarzà z@dezarza.be

File Information from DrCyZ-1.1

• Subset of samples from Perseverance (drcyz/c).
  ∙ png (drcyz/c/png).
    PNG files (5025) selected from NASA Perseverance (CyZ-1.1) after t-SNE and K-means Clustering. 
  ∙ csv (drcyz/c/csv).
    CSV file.


• Resized samples from Perseverance (drcyz/c+).
  ∙ png 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/drcyz_64-1024).
    PNG files resized to the corresponding size.
  ∙ TFRecords 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/tfr_drcyz_64-1024).
    TFRecord files resized to the corresponding size, for import into TensorFlow.


• Synthetic images from Neural Mars generated using StyleGAN2-ADA (drcyz/drcyz+).
  ∙ png 100; 1000; 10000 (drcyz/drcyz+/drcyz_256_100-10000).
    Subsets of 100, 1000 and 10000 PNG files at size 256x256.


• Network checkpoint from StyleGAN2-ADA trained at size 256x256 (drcyz/model_drcyz).
  ∙ network-snapshot-000798-drcyz.pkl


• Python notebooks to analyse the original dataset and reproduce the experiments: K-means Clustering, t-SNE, PCA, synthetic generation using StyleGAN2-ADA and instance segmentation using DeepLab (https://github.com/decurtoidiaz/drcyz/tree/main/dr_cyz+).
  ∙ clustering_curiosity_de_curto_and_de_zarza.ipynb
    K-means Clustering and PCA(2) with images from Curiosity.
  ∙ clustering_perseverance_de_curto_and_de_zarza.ipynb
    K-means Clustering and PCA(2) with images from Perseverance.
  ∙ tsne_curiosity_de_curto_and_de_zarza.ipynb
    t-SNE and PCA (components selected to explain 99% of variance) with images from Curiosity.
  ∙ tsne_perseverance_de_curto_and_de_zarza.ipynb
    t-SNE and PCA (components selected to explain 99% of variance) with images from Perseverance.
  ∙ Stylegan2-ada_de_curto_and_de_zarza.ipynb
    StyleGAN2-ADA trained on a subset of images from NASA Perseverance (DrCyZ).
  ∙ statistics_perseverance_de_curto_and_de_zarza.ipynb
    Computes statistics from synthetic samples generated by StyleGAN2-ADA (DrCyZ) and from images from NASA Perseverance (CyZ).
  ∙ DeepLab_TFLite_ADE20k_de_curto_and_de_zarza.ipynb
    Example of instance segmentation using DeepLab with a sample from NASA Perseverance (DrCyZ).
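
The file list above says the PNG subset was selected after t-SNE and K-means clustering. As a rough illustration of the K-means step only, here is a dependency-free sketch of Lloyd's algorithm on small feature vectors. The function names and the toy data are hypothetical, and the notebooks themselves may rely on a library implementation applied to reduced image features.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Vector], k: int, iterations: int = 50,
           seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Cluster `points` into `k` groups with Lloyd's algorithm.

    Returns one cluster label per point and the final centroids.
    Initial centroids are k distinct points drawn with a seeded RNG
    (an illustrative choice, not taken from the source).
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(list(points), k)]
    labels: List[int] = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    toy = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
           (10.0, 10.1), (10.2, 9.9), (9.9, 10.0)]
    print(kmeans(toy, k=2))
```

In practice each image would first be turned into a feature vector (for instance the PCA projections mentioned in the notebook descriptions) before being passed to a routine like this.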

Text from the PDFs found on data.gouv.fr
data.europa.eu
tgz
Updated May 20, 2020
Cite
Pavel Soriano (2020). Texte provenant des pdfs trouvés sur data.gouv.fr [Dataset]. https://data.europa.eu/data/datasets/5ec45f516a58eec727e79af7?locale=sv
Available download formats: tgz
Dataset authored and provided by
Pavel Soriano
License
https://www.etalab.gouv.fr/licence-ouverte-open-licence
Area covered
France
Description
Text extracted from the PDFs found on data.gouv.fr

Description

This dataset contains the text extracted from 6602 files with the pdf extension in the data.gouv.fr resource catalogue.

The dataset only includes PDFs of 20 Mb or less that are still available at the listed URL.

Extraction was performed with PDFBox through its Python wrapper python-pdfbox. PDFs that are images (scans, maps, etc.) are detected with a simple heuristic: if, after conversion to text with pdfbox, the resulting file is smaller than 20 bytes, the PDF is considered an image. In that case OCR is applied, using Tesseract through its Python wrapper pyocr.
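
The 20-byte heuristic described above can be sketched as follows. This is a minimal illustration, not the author's script: the two extractor arguments are hypothetical callables standing in for python-pdfbox and pyocr (whose actual APIs are not shown in this description), and measuring the UTF-8 length of the extracted string is an assumption about how "file size" is computed.

```python
from typing import Callable, Tuple

# Threshold stated in the description: output under 20 bytes means "image PDF".
OCR_THRESHOLD_BYTES = 20


def looks_like_image_pdf(extracted_text: str,
                         threshold: int = OCR_THRESHOLD_BYTES) -> bool:
    """Return True when the extracted text is too small to be a real text layer."""
    return len(extracted_text.encode("utf-8")) < threshold


def extract_text(pdf_path: str,
                 text_extractor: Callable[[str], str],
                 ocr_extractor: Callable[[str], str]) -> Tuple[str, str]:
    """Try direct text extraction first and fall back to OCR for image PDFs.

    `text_extractor` and `ocr_extractor` are hypothetical stand-ins for the
    PDFBox and Tesseract wrappers. Returns (text, method_used).
    """
    text = text_extractor(pdf_path)
    if looks_like_image_pdf(text):
        return ocr_extractor(pdf_path), "ocr"
    return text, "pdfbox"


if __name__ == "__main__":
    # Fake extractors so the sketch runs without any PDF tooling installed.
    print(extract_text("scan.pdf", lambda p: "", lambda p: "recognized text"))
```

Passing the extractors in as arguments keeps the decision rule testable without either tool installed.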

The result is a set of txt files extracted from the PDFs, sorted by organization (the organization that published the resource). There are 175 organizations in this dataset, hence 175 folders. Each file name follows the pattern {id-du-dataset}--{id-de-la-ressource}.txt, that is, the dataset id followed by the resource id.
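
Given that naming pattern, a path inside the archive can be split back into its organization, dataset id and resource id. The sketch below assumes the ids themselves never contain a double hyphen; the `ResourceFile` type and `parse_resource_path` function are hypothetical names introduced for illustration.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ResourceFile(NamedTuple):
    organization: str  # name of the folder holding the file
    dataset_id: str    # part of the file name before "--"
    resource_id: str   # part of the file name after "--"


def parse_resource_path(path: str) -> ResourceFile:
    """Split '<organization>/<dataset-id>--<resource-id>.txt' into its parts."""
    p = PurePosixPath(path)
    if p.suffix != ".txt":
        raise ValueError(f"not a .txt file: {path}")
    # Split on the first "--" only; resource ids contain single hyphens.
    dataset_id, separator, resource_id = p.stem.partition("--")
    if not separator or not dataset_id or not resource_id:
        raise ValueError(f"unexpected file name: {p.name}")
    return ResourceFile(p.parent.name, dataset_id, resource_id)


if __name__ == "__main__":
    # File name taken from the directory listing in this description.
    print(parse_resource_path(
        "ACTION_Nogent-sur-Marne/"
        "53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt"
    ))
```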

Input

The data.gouv.fr resource catalogue.

Output

Text files for each pdf resource found in the catalogue that was successfully converted and met the constraints above. The directory tree looks like this:

.
├── ACTION_Nogent-sur-Marne
│   ├── 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
│   ├── 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
│   ├── 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
│   ├── 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
│   ├── 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
│   ├── ...
├── Aeroport_La_Rochelle-Ile_de_Re
├── Agence_de_services_et_de_paiement_ASP
├── Agence_du_Numerique
├── ...

Distribution of texts [as of 20 May 2020]

The 10 organizations with the largest number of documents are:

[('Les_Lilas', 1294), ('Ville_de_Pirae', 1099), ('Region_Hauts-de-France', 592), ('Ressourcerie_datalocale', 297), ('NA', 268), ('CORBION', 244), ('Education_Nationale', 189), ('Incubateur_de_Services_Numeriques', 157), ('Ministere_des_Solidarites_et_de_la_Sante', 148), ('Communaute_dAgglomeration_Plaine_Vallee', 142)]

Their 2D overview (HashFeatures+TruncatedSVD+t-SNE), a t-SNE plot of the DGF texts: https://raw.githubusercontent.com/psorianom/data_gouv_text/master/img/samplefigure.png
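
A per-organization ranking like the top 10 above can be recomputed from the extracted archive by counting txt files per top-level folder. The following is a sketch under that assumption; the helper names are hypothetical and the sample paths are made up for the demonstration.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple


def count_by_organization(relative_paths: Iterable[str]) -> Counter:
    """Count .txt files per organization folder.

    Each path is expected to look like '<organization>/<file>.txt',
    relative to the root of the extracted archive.
    """
    counts: Counter = Counter()
    for path in relative_paths:
        parts = PurePosixPath(path).parts
        # Ignore anything that is not a .txt file inside a folder.
        if len(parts) >= 2 and parts[-1].endswith(".txt"):
            counts[parts[0]] += 1
    return counts


def top_organizations(relative_paths: Iterable[str],
                      n: int = 10) -> List[Tuple[str, int]]:
    """Return the n organizations with the most documents, largest first."""
    return count_by_organization(relative_paths).most_common(n)


if __name__ == "__main__":
    sample = [
        "Les_Lilas/a--1.txt",
        "Les_Lilas/a--2.txt",
        "CORBION/b--1.txt",
    ]
    print(top_organizations(sample, n=2))
```

On a real extraction, the paths could be gathered with `pathlib.Path(root).rglob("*.txt")` and made relative to `root` before being passed in.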

Code

The Python scripts used for this extraction are here.

Remarks

Because of the quality of the original PDFs (low-resolution scans, misaligned PDFs, ...) and the performance of the pdf-->txt conversion methods, the results can be very noisy.
Text from the PDFs found on data.gouv.fr
data.europa.eu
tgz
Cite
Pavel Soriano, Text z pdf nájdený na data.gouv.fr [Dataset]. https://data.europa.eu/data/datasets/5ec45f516a58eec727e79af7?locale=sk
Available download formats: tgz (74434932)
Dataset authored and provided by
Pavel Soriano
License
https://www.etalab.gouv.fr/licence-ouverte-open-licence
Description
Text extracted from the PDFs found on data.gouv.fr

Description

This dataset contains the text extracted from 6602 files with the pdf extension in the data.gouv.fr resource catalogue. The dataset only includes PDFs of 20 Mb or less that are still available at the listed URL.

Extraction was performed with PDFBox through its Python wrapper python-pdfbox. PDFs that are images (scans, maps, etc.) are detected with a simple heuristic: if, after conversion to text with PDFBox, the resulting file is smaller than 20 bytes, the PDF is considered an image. In that case OCR is applied, using Tesseract through its Python wrapper pyocr.

The result is a set of txt files extracted from the PDFs, sorted by organization (the organization that published the resource). There are 175 organizations in this dataset, hence 175 folders. Each file name follows the pattern {id-du-dataset}--{id-de-la-ressource}.txt.

Input

The data.gouv.fr resource catalogue.

Output

Text files for each pdf resource found in the catalogue that was successfully converted and met the constraints above.

The directory tree looks like this:

.
├── ACTION_Nogent-sur-Marne
│   ├── 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
│   ├── 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
│   ├── 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
│   ├── 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
│   ├── 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
│   ├── ...
├── Aeroport_La_Rochelle-Ile_de_Re
├── Agence_de_services_et_de_paiement_ASP
├── Agence_du_Numerique
├── ...

Distribution of texts [as of 20 May 2020]

The 10 organizations with the largest number of documents are:

[('Les_Lilas', 1294), ('Ville_de_Pirae', 1099), ('Region_Hauts-de-France', 592), ('Ressourcerie_datalocale', 297), ('NA', 268), ('CORBION', 244), ('Education_Nationale', 189), ('Incubateur_de_Services_Numeriques', 157), ('Ministere_des_Solidarites_et_de_la_Sante', 148), ('Communaute_dAgglomeration_Plaine_Vallee', 142)]

Their 2D overview (HashFeatures+TruncatedSVD+t-SNE): t-SNE plot of the DGF texts.

Code

The Python scripts used for this extraction are here.

Remarks

Because of the quality of the original PDFs (low-resolution scans, misaligned PDFs, ...) and the performance of the pdf->txt conversion methods, the results can be very noisy.