Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
DrCyZ: Techniques for analyzing and extracting useful information from CyZ.
Samples from NASA Perseverance and set of GAN generated synthetic images from Neural Mars.
Repository: https://github.com/decurtoidiaz/drcyz
Subset of samples from (includes tools to visualize and analyse the dataset):
CyZ: MARS Space Exploration Dataset. [https://doi.org/10.5281/zenodo.5655473]
Images from NASA missions of the celestial body.
Repository: https://github.com/decurtoidiaz/cyz
Authors:
J. de Curtò c@decurto.be
I. de Zarzà z@dezarza.be
• Subset of samples from Perseverance (drcyz/c).
∙ png (drcyz/c/png).
PNG files (5025) selected from NASA Perseverance (CyZ-1.1) after t-SNE and K-means Clustering.
∙ csv (drcyz/c/csv).
CSV file.
• Resized samples from Perseverance (drcyz/c+).
∙ png 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/drcyz_64-1024).
PNG files resized at the corresponding size.
∙ TFRecords 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/tfr_drcyz_64-1024).
TFRecord resized at the corresponding size to import on Tensorflow.
• Synthetic images from Neural Mars generated using Stylegan2-ada (drcyz/drcyz+).
∙ png 100; 1000; 10000 (drcyz/drcyz+/drcyz_256_100-10000)
PNG files subset of 100, 1000 and 10000 at size 256x256.
• Network Checkpoint from Stylegan2-ada trained at size 256x256 (drcyz/model_drcyz).
∙ network-snapshot-000798-drcyz.pkl
• Notebooks in python to analyse the original dataset and reproduce the experiments; K-means Clustering, t-SNE, PCA, synthetic generation using Stylegan2-ada and instance segmentation using Deeplab (https://github.com/decurtoidiaz/drcyz/tree/main/dr_cyz+).
∙ clustering_curiosity_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Curiosity.
∙ clustering_perseverance_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Perseverance.
∙ tsne_curiosity_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Curiosity.
∙ tsne_perseverance_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Perseverance.
∙ Stylegan2-ada_de_curto_and_de_zarza.ipynb
Stylegan2-ada trained on a subset of images from NASA Perseverance (DrCyZ).
∙ statistics_perseverance_de_curto_and_de_zarza.ipynb
Compute statistics from synthetic samples generated by Stylegan2-ada (DrCyZ) and images from NASA Perseverance (CyZ).
∙ DeepLab_TFLite_ADE20k_de_curto_and_de_zarza.ipynb
Example of instance segmentation using Deeplab with a sample from NASA Perseverance (DrCyZ).
https://www.etalab.gouv.fr/licence-ouverte-open-licencehttps://www.etalab.gouv.fr/licence-ouverte-open-licence
Ce dataset contient le texte extrait de 6602 fichiers qui ont l'extension pdf
dans le catalogue de ressources de data.gouv.fr.
Le dataset contient que les pdfs de 20 Mb ou moins et qui sont toujours disponibles sur l'adresse URL indiquée.
L'extraction a été réalisée avec PDFBox via son wrapper Python python-pdfbox. Les PDFs qui sont des images (scans, cartes, etc)
sont détectés avec une heuristique simple : si après la conversion au format texte avec pdfbox
, la taille du fichier produit est inférieure à 20 bytes on considère qu'il s'agit d'une image.
Dans ce cas, on procède à la OCRisation. Celle-ci est réalisé avec Tesseract via son wrapper Python pyocr.
Le résultat sont des fichiers txt
provenant des pdfs
triés par organisation (l'organisation qui a publiée la ressource). Il y a 175 organisations dans ce dataset, donc 175 dossiers.
Le nom de chaque fichier correspond au string {id-du-dataset}--{id-de-la-ressource}.txt
.
Catalogue de ressources data.gouv.fr.
Fichiers texte de chaque ressource type pdf
trouvée dans le catalogue qui a été converti avec succès et qui a satisfait les contraintes ci-dessus.
L'arborescence est la suivante :
.
├── ACTION_Nogent-sur-Marne
│ ├── 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
│ ├── 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
│ ├── 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
│ ├── 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
│ ├── 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
|── ...
├── Aeroport_La_Rochelle-Ile_de_Re
├── Agence_de_services_et_de_paiement_ASP
├── Agence_du_Numerique
├── ...
Le top 10 d'organisations avec le nombre le plus grand des documents est:
python
[('Les_Lilas', 1294),
('Ville_de_Pirae', 1099),
('Region_Hauts-de-France', 592),
('Ressourcerie_datalocale', 297),
('NA', 268),
('CORBION', 244),
('Education_Nationale', 189),
('Incubateur_de_Services_Numeriques', 157),
('Ministere_des_Solidarites_et_de_la_Sante', 148),
('Communaute_dAgglomeration_Plaine_Vallee', 142)]
Et leur aperçu en 2D est (HashFeatures+TruncatedSVD+t-SNE) :
https://raw.githubusercontent.com/psorianom/data_gouv_text/master/img/samplefigure.png" alt="Plot t-SNE des textes DGF">
Les scripts Python utilisés pour faire cette extraction sont ici.
Dû à la qualité des pdfs d'origine (scans de basse résolution, pdfs non alignés, ...) et à la performance des méthodes de transformation pdf-->txt, les résultats peuvent être très bruités.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
DrCyZ: Techniques for analyzing and extracting useful information from CyZ.
Samples from NASA Perseverance and set of GAN generated synthetic images from Neural Mars.
Repository: https://github.com/decurtoidiaz/drcyz
Subset of samples from (includes tools to visualize and analyse the dataset):
CyZ: MARS Space Exploration Dataset. [https://doi.org/10.5281/zenodo.5655473]
Images from NASA missions of the celestial body.
Repository: https://github.com/decurtoidiaz/cyz
Authors:
J. de Curtò c@decurto.be
I. de Zarzà z@dezarza.be
• Subset of samples from Perseverance (drcyz/c).
∙ png (drcyz/c/png).
PNG files (5025) selected from NASA Perseverance (CyZ-1.1) after t-SNE and K-means Clustering.
∙ csv (drcyz/c/csv).
CSV file.
• Resized samples from Perseverance (drcyz/c+).
∙ png 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/drcyz_64-1024).
PNG files resized at the corresponding size.
∙ TFRecords 64x64; 128x128; 256x256; 512x512; 1024x1024 (drcyz/c+/tfr_drcyz_64-1024).
TFRecord resized at the corresponding size to import on Tensorflow.
• Synthetic images from Neural Mars generated using Stylegan2-ada (drcyz/drcyz+).
∙ png 100; 1000; 10000 (drcyz/drcyz+/drcyz_256_100-10000)
PNG files subset of 100, 1000 and 10000 at size 256x256.
• Network Checkpoint from Stylegan2-ada trained at size 256x256 (drcyz/model_drcyz).
∙ network-snapshot-000798-drcyz.pkl
• Notebooks in python to analyse the original dataset and reproduce the experiments; K-means Clustering, t-SNE, PCA, synthetic generation using Stylegan2-ada and instance segmentation using Deeplab (https://github.com/decurtoidiaz/drcyz/tree/main/dr_cyz+).
∙ clustering_curiosity_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Curiosity.
∙ clustering_perseverance_de_curto_and_de_zarza.ipynb
K-means Clustering and PCA(2) with images from Perseverance.
∙ tsne_curiosity_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Curiosity.
∙ tsne_perseverance_de_curto_and_de_zarza.ipynb
t-SNE and PCA (components selected to explain 99% of variance) with images from Perseverance.
∙ Stylegan2-ada_de_curto_and_de_zarza.ipynb
Stylegan2-ada trained on a subset of images from NASA Perseverance (DrCyZ).
∙ statistics_perseverance_de_curto_and_de_zarza.ipynb
Compute statistics from synthetic samples generated by Stylegan2-ada (DrCyZ) and images from NASA Perseverance (CyZ).
∙ DeepLab_TFLite_ADE20k_de_curto_and_de_zarza.ipynb
Example of instance segmentation using Deeplab with a sample from NASA Perseverance (DrCyZ).