3 datasets found

h
roots-tsne-data
huggingface.co
Updated May 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christopher Akiki (2023). roots-tsne-data [Dataset]. https://huggingface.co/datasets/christopher/roots-tsne-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 16, 2023
Authors
Christopher Akiki
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
What follows is research code. It is by no means optimized for speed, efficiency, or readability.

Data loading, tokenizing and sharding

import os import numpy as np import pandas as pd from sklearn.feature_extraction.text import TfidfTransformer from sklearn.decomposition import TruncatedSVD from tqdm.notebook import tqdm from openTSNE import TSNE import datashader as ds import colorcet as cc

fromdask.distributed import Client import dask.dataframe as dd import dask_ml import… See the full description on the dataset page: https://huggingface.co/datasets/christopher/roots-tsne-data.
Diabetes_Dataset_1.1
kaggle.com
Updated Nov 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
KIRANMAYI G 777 (2023). Diabetes_Dataset_1.1 [Dataset]. https://www.kaggle.com/datasets/kiranmayig777/diabetes-dataset-1-1/versions/1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 2, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
KIRANMAYI G 777
Description
import pandas as pd import numpy as np

PERFORMING EDA

data.head() data.info()

attributes_data = data.iloc[:, 1:] attributes_data

attributes_data.describe() attributes_data.corr()

import seaborn as sns import matplotlib.pyplot as plt

Calculate correlation matrix

correlation_matrix = attributes_data.corr() plt.figure(figsize=(18, 10))

Create a heatmap

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm') plt.show()

CHECKING IF DATASET IS LINEAR OR NON-LINEAR

Calculate correlations between target and predictor columns

correlations = data.corr()['Diabetes_binary'].drop('Diabetes_binary')

Create a bar chart

plt.figure(figsize=(10, 6)) correlations.plot(kind='bar') plt.xlabel('Predictor Columns') plt.ylabel('Correlation values') plt.title('Correlation between Diabetes_binary and Predictors') plt.show()

CHECKING FOR NULL AND MISSING VALUES, CLEANING THEM

Count the number of null values in each column

print(data.isnull().sum())

to check for missing values in all columns

print(data.isna().sum())

LASSO import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.linear_model import Lasso from sklearn.model_selection import train_test_split from sklearn.model_selection import GridSearchCV, KFold

X = data.iloc[:, 1:] y = data.iloc[:, 0] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

gridsearchcv is used to find the optimal combination of hyperparameters for a given model

So, in the end, we can select the best parameters from the listed hyperparameters.

parameters = {"alpha": np.arange(0.00001, 10, 500)}
kfold = KFold(n_splits = 10, shuffle=True, random_state = 42) lassoReg = Lasso() lasso_cv = GridSearchCV(lassoReg, param_grid = parameters, cv = kfold) lasso_cv.fit(X, y) print("Best Params {}".format(lasso_cv.best_params_))

column_names = list(data) column_names = column_names[1:] column_names

lassoModel = Lasso(alpha = 0.00001) lassoModel.fit(X_train, y_train) lasso_coeff = np.abs(lassoModel.coef_)#making all coefficients positive plt.bar(column_names, lasso_coeff, color = 'orange') plt.xticks(rotation=90) plt.grid() plt.title("Feature Selection Based on Lasso") plt.xlabel("Features") plt.ylabel("Importance") plt.ylim(0, 0.16) plt.show()

RFE from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

from sklearn.feature_selection import RFECV from sklearn.tree import DecisionTreeClassifier model = DecisionTreeClassifier() rfecv = RFECV(estimator= model, step = 1, cv = 20, scoring="accuracy") rfecv = rfecv.fit(X_train, y_train)

num_features_selected = len(rfecv.rankin_)

Cross-validation scores

cv_scores = rfecv.ranking_

Plotting the number of features vs. cross-validation score

plt.figure(figsize=(10, 6)) plt.xlabel("Number of features selected") plt.ylabel("Score (accuracy)") plt.plot(range(1, num_features_selected + 1), cv_scores, marker='o', color='r') plt.xticks(range(1, num_features_selected + 1)) # Set x-ticks to integers plt.grid() plt.title("RFECV: Number of Features vs. Score(accuracy)") plt.show()

print("The optimal number of features:", rfecv.n_features_) print("Best features:", X_train.columns[rfecv.support_])

PCA import pandas as pd import numpy as np import matplotlib.pyplot as plt %matplotlib inline from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler

X = data.drop(["Diabetes_binary"], axis=1) y = data["Diabetes_binary"]

df1=pd.DataFrame(data = data,columns=data.columns) print(df1)

scaling=StandardScaler() scaling.fit(df1) Scaled_data=scaling.transform(df1) principal=PCA(n_components=3) principal.fit(Scaled_data) x=principal.transform(Scaled_data) print(x.shape)

principal.components_

plt.figure(figsize=(10,10))

plt.scatter(x[:,0],x[:,1],c=data['Diabetes_binary'],cmap='plasma') plt.xlabel('pc1') plt.ylabel('pc2')

print(principal.explained_variance_ratio_)

T-SNE from sklearn.manifold import TSNE from numpy import reshape import seaborn as sns

tsne = TSNE(n_components=3, verbose=1, random_state=42) z = tsne.fit_transform(X)

df = pd.DataFrame() df["y"] = y df["comp-1"] = z[:,0] df["comp-2"] = z[:,1] df["comp-3"] = z[:,2] sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(), palette=sns.color_palette("husl", 2), data=df).set(title="Diabetes data T-SNE projection")
e
Texte provenant des pdfs trouvés sur data.gouv.fr
data.europa.eu
tgz
Updated May 20, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pavel Soriano (2020). Texte provenant des pdfs trouvés sur data.gouv.fr [Dataset]. https://data.europa.eu/data/datasets/5ec45f516a58eec727e79af7?locale=sv
Explore at:
tgzAvailable download formats
Dataset updated
May 20, 2020
Dataset authored and provided by
Pavel Soriano
License
https://www.etalab.gouv.fr/licence-ouverte-open-licencehttps://www.etalab.gouv.fr/licence-ouverte-open-licence
Area covered
France
Description
Texte extrait des pdfs trouvés sur data.gouv.fr

Description

Ce dataset contient le texte extrait de 6602 fichiers qui ont l'extension pdf dans le catalogue de ressources de data.gouv.fr.

Le dataset contient que les pdfs de 20 Mb ou moins et qui sont toujours disponibles sur l'adresse URL indiquée.

L'extraction a été réalisée avec PDFBox via son wrapper Python python-pdfbox. Les PDFs qui sont des images (scans, cartes, etc) sont détectés avec une heuristique simple : si après la conversion au format texte avec pdfbox, la taille du fichier produit est inférieure à 20 bytes on considère qu'il s'agit d'une image. Dans ce cas, on procède à la OCRisation. Celle-ci est réalisé avec Tesseract via son wrapper Python pyocr.

Le résultat sont des fichiers txt provenant des pdfs triés par organisation (l'organisation qui a publiée la ressource). Il y a 175 organisations dans ce dataset, donc 175 dossiers. Le nom de chaque fichier correspond au string {id-du-dataset}--{id-de-la-ressource}.txt.

Input

Catalogue de ressources data.gouv.fr.

Output

Fichiers texte de chaque ressource type pdf trouvée dans le catalogue qui a été converti avec succès et qui a satisfait les contraintes ci-dessus. L'arborescence est la suivante :

. ├── ACTION_Nogent-sur-Marne │ ├── 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt │ ├── 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt │ ├── 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt │ ├── 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt │ ├── 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt |── ... ├── Aeroport_La_Rochelle-Ile_de_Re ├── Agence_de_services_et_de_paiement_ASP ├── Agence_du_Numerique ├── ...

Distribution des textes [au 20 mai 2020]

Le top 10 d'organisations avec le nombre le plus grand des documents est: python [('Les_Lilas', 1294), ('Ville_de_Pirae', 1099), ('Region_Hauts-de-France', 592), ('Ressourcerie_datalocale', 297), ('NA', 268), ('CORBION', 244), ('Education_Nationale', 189), ('Incubateur_de_Services_Numeriques', 157), ('Ministere_des_Solidarites_et_de_la_Sante', 148), ('Communaute_dAgglomeration_Plaine_Vallee', 142)] Et leur aperçu en 2D est (HashFeatures+TruncatedSVD+t-SNE) : https://raw.githubusercontent.com/psorianom/data_gouv_text/master/img/samplefigure.png" alt="Plot t-SNE des textes DGF">

Code

Les scripts Python utilisés pour faire cette extraction sont ici.

Remarques

Dû à la qualité des pdfs d'origine (scans de basse résolution, pdfs non alignés, ...) et à la performance des méthodes de transformation pdf-->txt, les résultats peuvent être très bruités.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Christopher Akiki (2023). roots-tsne-data [Dataset]. https://huggingface.co/datasets/christopher/roots-tsne-data

roots-tsne-data

christopher/roots-tsne-data

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

May 16, 2023

Authors

Christopher Akiki

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

What follows is research code. It is by no means optimized for speed, efficiency, or readability.

  Data loading, tokenizing and sharding

import os import numpy as np import pandas as pd from sklearn.feature_extraction.text import TfidfTransformer from sklearn.decomposition import TruncatedSVD from tqdm.notebook import tqdm from openTSNE import TSNE import datashader as ds import colorcet as cc

fromdask.distributed import Client import dask.dataframe as dd import dask_ml import… See the full description on the dataset page: https://huggingface.co/datasets/christopher/roots-tsne-data.

Clear search

Close search

Google apps

Main menu

roots-tsne-data

Diabetes_Dataset_1.1

Calculate correlation matrix

Create a heatmap

Calculate correlations between target and predictor columns

Create a bar chart

Count the number of null values in each column

to check for missing values in all columns

gridsearchcv is used to find the optimal combination of hyperparameters for a given model

So, in the end, we can select the best parameters from the listed hyperparameters.

Cross-validation scores

Plotting the number of features vs. cross-validation score

plt.figure(figsize=(10,10))

Texte provenant des pdfs trouvés sur data.gouv.fr

Texte extrait des pdfs trouvés sur data.gouv.fr

Description

Input

Output

Distribution des textes [au 20 mai 2020]

Code

Remarques

roots-tsne-data

christopher/roots-tsne-data