24 datasets found

Shopping Mall
kaggle.com
Updated Dec 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 15, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Anshul Pachauri
Description
Libraries Import:

Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings. Data Loading and Exploration:

Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe(). Univariate Analysis:

Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots. Bivariate Analysis:

Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot. Gender-Based Analysis:

Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap. Univariate Clustering:

Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters. Bivariate Clustering:

Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'. Multivariate Clustering:

Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering. Result Saving:

Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
PandasPlotBench
huggingface.co
Updated Nov 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 25, 2024
Dataset provided by
JetBrainshttp://jetbrains.com/
Authors
JetBrains Research
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
PandasPlotBench

PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of the Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found in arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use the our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
merged-dataframe
kaggle.com
Updated Jan 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AnshKGoyal (2024). merged-dataframe [Dataset]. https://www.kaggle.com/datasets/anshkgoyal/merged-dataframe/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 27, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
AnshKGoyal
Description
This dataset is an intermediate output from a book recommendation system project. It contains merged data from Amazon book reviews and book details, with added sentiment scores and labels. The sentiment analysis was performed using a custom model. This dataset is not intended as a standalone resource, but rather as a checkpoint in the development process of the recommendation system.
Pandas Test Data
kaggle.com
zip
Updated Aug 23, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gyan Kumar (2020). Pandas Test Data [Dataset]. https://www.kaggle.com/kgmgyan57/pandas-test-data
Explore at:
zip(63445451 bytes)Available download formats
Dataset updated
Aug 23, 2020
Authors
Gyan Kumar
Description
Dataset

This dataset was created by Gyan Kumar

Contents

It contains the following files:
H
GenBank data submission network yearly data frame files
dataverse.harvard.edu
Updated Jun 7, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jian Qin; Jeff Hemsley; Sarah Bratt (2021). GenBank data submission network yearly data frame files [Dataset]. http://doi.org/10.7910/DVN/4QUAXY
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/4QUAXY
Dataset updated
Jun 7, 2021
Dataset provided by
Harvard Dataverse
Authors
Jian Qin; Jeff Hemsley; Sarah Bratt
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Time period covered
1992 - 2018
Dataset funded by
NIH
NSF
Description
GenBank data submission network R data frames by year from 1992-2018.
w
Pandas 1.x cookbook : practical recipes for scientific computing, time...
workwithdata.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data, Pandas 1.x cookbook : practical recipes for scientific computing, time series and exploratory data analysis using Python [Dataset]. https://www.workwithdata.com/object/pandas-1-x-cookbook-practical-recipes-for-scientific-computing-time-series-and-exploratory-data-analysis-using-python-book-by-matt-harrison-1975
Explore at:
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Explore Pandas 1.x cookbook : practical recipes for scientific computing, time series and exploratory data analysis using Python through data • Key facts: author, publication date, book publisher, book series, book subjects • Real-time news, visualizations and datasets
u
Data from: dblp XML dataset as CSV for Python Data Analysis Library
observatorio-cientifico.ua.es
Updated 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carrasco, Rafael C.; Candela, Gustavo; Carrasco, Rafael C.; Candela, Gustavo (2021). dblp XML dataset as CSV for Python Data Analysis Library [Dataset]. https://observatorio-cientifico.ua.es/documentos/668fc45db9e7c03b01bdb2d0
Explore at:
Dataset updated
2021
Authors
Carrasco, Rafael C.; Candela, Gustavo; Carrasco, Rafael C.; Candela, Gustavo
Description
Based on the dblp XML file, this dataset consists on a CSV file that has been extracted using a python script. The dataset can be easily loaded in a Python Data Analysis Library dataframe.
Figure 5 dataframe
figshare.com
txt
Updated Dec 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aland Chan (2023). Figure 5 dataframe [Dataset]. http://doi.org/10.6084/m9.figshare.23245640.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.23245640.v1
Dataset updated
Dec 18, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Aland Chan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataframe used the generate the area plot in Figure 3.

Columns - time: time after fire g: proportion of pixels being grasslands s: proportion of pixels being shrublands sfg: proportion of pixels being shrublands that developed from burnt grasslands f: proportion of pixels being forests ffg: proportion of pixels being forests that developed from burnt grasslands ffs: proportion of pixels being forests that developed from burnt shrublands
Quebec Ministry of Tourism (2012 to 2017) web archive collection derivatives...
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Mar 3, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nick Ruest; Nick Ruest; Carole Gagné; Dave Mitchell; Carole Gagné; Dave Mitchell (2020). Quebec Ministry of Tourism (2012 to 2017) web archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3693801
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3693801
Dataset updated
Mar 3, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nick Ruest; Nick Ruest; Carole Gagné; Dave Mitchell; Carole Gagné; Dave Mitchell
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Quebec
Description
Web archive derivatives of the Quebec Ministry of Tourism (2012 to 2017) collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!

These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

domain

count

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

crawl_date

url

mime_type_web_server

mime_type_tika

content

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

crawl_date

src

dest

anchor

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

src

image_url

Binary Analysis

Audio

Images

PDFs

Presentation program files

Spreadsheets

Text files

Videos

Word processor files
f
Table_2_Assessing Urinary Metabolomics in Giant Pandas Using...
figshare.com
xls
Updated Jun 2, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maosheng Cao; Chunjin Li; Yuliang Liu; Kailai Cai; Lu Chen; Chenfeng Yuan; Zijiao Zhao; Boqi Zhang; Rong Hou; Xu Zhou (2023). Table_2_Assessing Urinary Metabolomics in Giant Pandas Using Chromatography/Mass Spectrometry: Pregnancy-Related Changes in the Metabolome.xls [Dataset]. http://doi.org/10.3389/fendo.2020.00215.s003
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.3389/fendo.2020.00215.s003
Dataset updated
Jun 2, 2023
Dataset provided by
Frontiers
Authors
Maosheng Cao; Chunjin Li; Yuliang Liu; Kailai Cai; Lu Chen; Chenfeng Yuan; Zijiao Zhao; Boqi Zhang; Rong Hou; Xu Zhou
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Giant pandas represent one of the most endangered species worldwide, and their reproductive capacity is extremely low. They have a relatively long gestational period, mainly because embryo implantation is delayed. Giant panda cubs comprise only a small proportion of the mother's body weight, making it difficult to determine whether a giant panda is pregnant. Timely determination of pregnancy contributes to the efficient breeding and management of giant pandas. Meanwhile, metabolomics studies the metabolic composition of biological samples, which can reflect metabolic functions in cells, tissues, and organisms. This work explored the urinary metabolites of giant pandas during pregnancy. A sample of 8 female pandas was selected. Differences in metabolite levels in giant panda urine samples were analyzed via ultra-high-performance liquid chromatography/mass spectrometry comparing pregnancy to anoestrus. Pattern recognition techniques, including partial least squares-discriminant analysis and orthogonal partial least squares-discriminant analysis, were used to analyze multiple parameters of the data. Compared with the results during anoestrus, multivariate statistical analysis of results obtained from the same pandas being pregnant identified 16 differential metabolites in the positive-ion mode and 43 differential metabolites in the negative-ion mode. The levels of tryptophan, choline, kynurenic acid, uric acid, indole-3-acetaldehyde, taurine, and betaine were higher in samples during pregnancy, whereas those of xanthurenic acid and S-adenosylhomocysteine were lower. Amino acid metabolism, lipid metabolism, and organic acid production differed significantly between anoestrus and pregnancy. Our results provide new insights into metabolic changes in the urine of giant pandas during pregnancy, and the differential levels of metabolites in urine provide a basis for determining pregnancy in giant pandas. Understanding these metabolic changes could be helpful for managing pregnant pandas to provide proper nutrients to their fetuses.
Harvest Quebec Government Websites from December 2006 web archive collection...
zenodo.org
application/gzip
Updated Feb 26, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nick Ruest; Nick Ruest; Carole Gagné; Dave Mitchell; Carole Gagné; Dave Mitchell (2020). Harvest Quebec Government Websites from December 2006 web archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3688354
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3688354
Dataset updated
Feb 26, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nick Ruest; Nick Ruest; Carole Gagné; Dave Mitchell; Carole Gagné; Dave Mitchell
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Quebec
Description
Web archive derivatives of the Sites of the Harvest Quebec Government Websites from December 2006 collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!

These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

domain

count

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

crawl_date

url

mime_type_web_server

mime_type_tika

content

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

crawl_date

src

dest

anchor

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

src

image_url

Binary Analysis

Audio

Images

PDFs

Presentation program files

Spreadsheets

Text files

Videos

Word processor files
o
Extreme Right Movements in Europe web archive collection derivatives
explore.openaire.eu
data.niaid.nih.gov
+1more
Updated Jan 31, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nick Ruest (2020). Extreme Right Movements in Europe web archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3633160
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3633160
Dataset updated
Jan 31, 2020
Authors
Nick Ruest
Area covered
Europe
Description
Web archive derivatives of the Literary Authors from Europe and Eurasia Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud. The ivy-11670-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples. Domains .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc) Produces a DataFrame with the following columns: domain count Web Pages .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content")) Produces a DataFrame with the following columns: crawl_date url mime_type_web_server mime_type_tika content Web Graph .webgraph() Produces a DataFrame with the following columns: crawl_date src dest anchor Image Links .imageLinks() Produces a DataFrame with the following columns: src image_url Binary Analysis Audio Images PDFs Presentation program files Spreadsheets Text files Word processor files The ivy-11670-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud. Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout. Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself. Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content. Domains count file. A text file containing the frequency count of domains captured within your web archive.
Z
Freely Accessible eJournals web archive collection derivatives
data.niaid.nih.gov
Updated Feb 2, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruest, Nick (2020). Freely Accessible eJournals web archive collection derivatives [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3633670
Explore at:
Dataset updated
Feb 2, 2020
Dataset authored and provided by
Ruest, Nick
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Web archive derivatives of the Freely Accessible eJournals collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

The cul-5921-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

domain

count

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

crawl_date

url

mime_type_web_server

mime_type_tika

content

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

crawl_date

src

dest

anchor

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

src

image_url

Binary Analysis

Audio

Images

PDFs

Presentation program files

Spreadsheets

Text files

Word processor files

The cul-12143-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.

Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.

Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.

Domains count file. A text file containing the frequency count of domains captured within your web archive.
Z
Web Archive of Independent News Sites on Turkish Affairs derivatives
data.niaid.nih.gov
zenodo.org
Updated Jan 31, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruest, Nick (2020). Web Archive of Independent News Sites on Turkish Affairs derivatives [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3633233
Explore at:
Dataset updated
Jan 31, 2020
Dataset authored and provided by
Ruest, Nick
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Derivatives of the Web Archive of Independent News Sites on Turkish Affairs collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

The ivy-12911-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

domain

count

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

crawl_date

url

mime_type_web_server

mime_type_tika

content

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

crawl_date

src

dest

anchor

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

src

image_url

Binary Analysis

Audio

Images

PDFs

Presentation program files

Spreadsheets

Text files

Word processor files

The ivy-12911-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.

Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.

Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.

Domains count file. A text file containing the frequency count of domains captured within your web archive.
Z
Queer Japan Web Archive collection derivatives
data.niaid.nih.gov
Updated Feb 1, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yanagihara, Yoshie (2020). Queer Japan Web Archive collection derivatives [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3633283
Explore at:
Dataset updated
Feb 1, 2020
Dataset provided by
Shida, Tetsuyuki
Abrams, Samantha
Yanagihara, Yoshie
Nakamura, Haruko
Ruest, Nick
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Web archive derivatives of the Queer Japan Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

The ivy-12172-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

domain

count

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

crawl_date

url

mime_type_web_server

mime_type_tika

content

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

crawl_date

src

dest

anchor

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

src

image_url

Binary Analysis

Audio

Images

PDFs

Presentation program files

Spreadsheets

Text files

Videos

Word processor files

The ivy-11854-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.

Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.

Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.

Domains count file. A text file containing the frequency count of domains captured within your web archive.
Learn Pandas
kaggle.com
Updated Oct 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vaidik Patel (2023). Learn Pandas [Dataset]. https://www.kaggle.com/datasets/js1js2js3js4js5/learn-pandas/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 5, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Vaidik Patel
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
It is a dataset with notebook kind of learning. Download the whole package and you will find everything to learn basics to advanced pandas which is exactly what you will need in machine learning and in data science. 😄

This will gives you the overview and data analysis tools in pandas that is mostly required in the data manipulation and extraction important data.

Use this notebook as notes for pandas. whenever you forget the code or syntax open it and scroll through it and you will find the solution. 🥳
d
Young and older adult vowel categorization responses
datadryad.org
zip
Updated Mar 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mishaela DiNino (2024). Young and older adult vowel categorization responses [Dataset]. http://doi.org/10.5061/dryad.brv15dvh0
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.brv15dvh0
Dataset updated
Mar 14, 2024
Dataset provided by
Dryad
Authors
Mishaela DiNino
Description
Young and older adult vowel categorization responses

https://doi.org/10.5061/dryad.brv15dvh0

On each trial, participants heard a stimulus and clicked a box on the computer screen to indicate whether they heard "SET" or "SAT." Responses of "SET" are coded as 0 and responses of "SAT" are coded as 1. The continuum steps, from 1-7, for duration and spectral quality cues of the stimulus on each trial are named "DurationStep" and "SpectralStep," respectively. Group (young or older adult) and listening condition (quiet or noise) information are provided for each row of the dataset.
Z
Data associated with "A weakened recurrent circuit in the hippocampus of...
data.niaid.nih.gov
zenodo.org
Updated Feb 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wang, Wei (2022). Data associated with "A weakened recurrent circuit in the hippocampus of Rett syndrome mice disrupts long-term memory representations" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5999291
Explore at:
Dataset updated
Feb 24, 2022
Dataset provided by
Tang, Jianrong
Sun, Yaling
Wang, Wei
He, Lingjie
Caudill, Matthew S.
Jing, Junzhan
Zoghbi, Huda Y.
Jiang, Xiaolong
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets used in A weakened recurrent circuit in the hippocampus of Rett syndrome mice disrupts long-term memory representations.

Datatypes:

Multi-index pandas dataframe (.pkl)

Numpy array (.npy)

Collection of numpy arrays (.npz)

Python dictionary objects (.pkl)

Datasets:

alignments.pkl: A dataframe containing numpy arrays of image displacements for each mouse in each memory context.

This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id. The columns are ['T', 'F1', 'N1', 'F2', 'N2'] for the training, recall 1-hour, neutral, recall 1-day, neutral day 2 memory contexts respectively. Each element of this dataframe is a numpy array of shape images x 2 that hold x and y image displacements respectively. These alignments are computed after the inscopix software motion correction and are used in Supplemental Figure 2 of the paper.

behavior_df.pkl: A dataframe of behavior readouts recorded by a camera positioned above the mice in each context chamber.

This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id.The columns are; sample times (_time), freezing boolean arrays (_freeze), x-positions in context chamber (_x) and y-positions in the context chamber (_y) for each context (*) in ('Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2').

correlated_pairs_df.pkl: A dataframe containing arrays of neuron indices that have a correlation in activity pattern > 0.3.

This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id and treatment ('NA'). The columns contain ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] representing each memory context. Each element of the dataframe is a numpy array with three columns. The first two columns are the neuron indices that are correlated and the last column is the strength of the correlation.

dredd_freezes_df.pkl: A dataframe containing freezing percentages for SOM-Cre and RTT-SOM-Cre mice treated with DREADDS.

This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id and treatment (mcherry, hm3d, hm4d). The columns contain one of ['Neutral', 'Fear', 'Fear_2']. Each element of the dataframe is a freezing percentage for a single mouse. This dataframe is built from reading the dredd_behavior.xlsx excel file. This is used to generate figure 5E of the paper.

high_degree_df.pkl: A dataframe containing list of high degree neuron indices.

This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id and treatment ('NA'=not applicable since no DREADD used). The columns contain ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] representing each memory context. Each element of the dataframe is a list of neuron indices that are high-degree cells.

N006_wt_basis.npz: a dict containing three numpy arrays representing the basis images for mouse N006 of genotype wild-type.

This dict has three arrays stored under the variable names 'U', 'sigma' and 'img_shape'. U is a matrix of column vector basis images. Each column is the vector representation of a basis image (row pixels x column pixels). There are 220 basis images (columns) in U. The sigma variable is the singular value associated with each basis image vector in U. img_shape can be used to reshape each basis column vector into a 2-D image for viewing. This data is used in Supplemental Figure 2 of the paper.

N006_wt_cxtbasis.pkl: A dictionary containing arrays for basis images and singular values for each context.

This dictionary has keys, ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] representing the memory contexts. Each value is a 2 element list containing the U-basis images as column vectors and singular values, one per basis image in U. The shape of the basis images is the same shape stored in N006_wt_basis.pkl. This dataset is used in Supplementary Figure 2 to track cells across contexts of the CFC task (see also N006_wt_cxtsources.pkl)

N006_wt_cxtsources.pkl: A dictionary containing the independent component source images computed from the basis images for automatically identifying regions of interest (ROIs).

The dictionary is keyed on ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] contexts. Each value in the dictionary at a given key is a 3-D numpy array of shape sources x height x width. These data were used to construct the source images and max intensity projection image of the sources in Supplemental Figure 2F-J of the paper.

N006_wt_rois.pkl: A dictionary containing the boundaries and annuli coordinates of all rois for mouse N006 of genotype wild-type.

This dictionary is keyed on ['boundaries', 'annuli'] contexts and each value is a 179 element list of arrays of boundary line coordinates or annulus point coordinates one per ROI detected for this mouse.

N006_wt_sources.npy: A numpy array containing all source images computed from all contexts of the CFC task for mouse N006 of genotype wild-type.

This numpy array has shape n x height x width where n=205 source images, height=517 pixels and width=704 pixels. This data was used to construct Supplemental Figure 3F.

N019_wt_basis.npz: a dict containing three numpy arrays representing the basis images for mouse N019 of genotype wild-type.

This dict has three arrays stored under the variable names 'U', 'sigma' and 'img_shape'. U is a matrix of column vector basis images. Each column is the vector representation of a basis image (row pixels x column pixels). There are 220 basis images (columns) in U. The sigma variable is the singular value associated with each basis image vector in U. img_shape can be used to reshape each basis column vector into a 2-D image for viewing. This data is used in Figure 1C of the paper.

N019_wt_sources.npy: A numpy array containing all source images computed from all contexts of the CFC task for mouse N019 of genotype wild-type.

This numpy array has shape n x height x width where n=204 source images, height=516 pixels and width=698 pixels. This data was used to construct Figure 1C of the paper.

P80_animals.pkl: A pandas multi-index object containing the genotype, mouse_id and treatment of the top 80% behavioral performance animals.

In this study, we drop the lowest 20% performing WT and RTT animals based on freezing percentage during the recall contexts. This multi-index is used to filter the data before each computation or plot in this study. So for example Figure 1B contains only the top 80% performing WT and RTT mice.

pc_sipscs_amps.pkl: A dictionary containing the amplitudes of spontaneous IPSCs recorded in pyramidal cells of WT and RTT mice.

This dictionary is keyed on ['wt', 'mecp2_pos', 'mecp2_neg'] representing whether the pyramidal cell was recorded from a wild-type mouse ('wt') or is an MeCP2 negative or MeCP2 positive RTT cell. This value under each key is an array of IPSC amplitudes, one per recorded cell. This data was used to construct Figure 4C in the paper.

pc_sipscs_freqs.pkl: A dictionary containing the frequencies of spontaneous IPSCs recorded in pyramidal cells of WT and RTT mice.

This dictionary is keyed on ['wt', 'mecp2_pos', 'mecp2_neg'] representing whether the pyramidal cell was recorded from a wild-type mouse ('wt') or is an MeCP2 negative or MeCP2 positive RTT cell. This value under each key is an array of IPSC frequencies, one per recorded cell. This data was used to construct Figure 4C in the paper.

rois_df.pkl: A multi-index dataframe containing all ROI information for each non-DREADD treated cell in this study (Figures 1-3).

This dataframe index contains the genotype ('wt', 'het'), the mouse_id, the treatment ('NA'=not applicable since no DREADD used), and the cell index starting from 0. The columns are ['centroid', 'cell_boundary', 'annulus_boundary']. The centroid for each cell is a 2-tuple of row, column pixel centroid coordinates. The cell_boundary is a two-column array of row, col boundary points for each ROI. The annulus_boundary is a two-column array of row, column interior points in the annulus. The annulus region excludes points of overlap with nearby cell bodies (See STAR methods of the paper).

signals_df.pkl: A multi-index dataframe containing calcium signals, inferred spikes and metadata for all Non-DREADD experiments used in this study (Figs 1-3).

This dataframe index contains the genotype ('wt', 'het'), the mouse_id, the treatment ('NA'=not applicable since no DREADD used), and the cell index starting from 0 and going up to 5771 cells. The columns are ['channels', 'channel', 'num_pages', 'width', 'height', 'bits', 'Train_signals', 'Fear_signals', 'Neutral_signals', 'Cue_signals', 'Fear_2_signals', 'Neutral_2_signals', 'Cue_2_signals', 'Train_spikes', 'Fear_spikes', 'Neutral_spikes', 'Cue_spikes', 'Fear_2_spikes', 'Neutral_2_spikes', 'Cue_2_spikes', 'sample_rate']. The channels are all the recorded channels, the channels is the channel on which ROIs were detected, the width and height are the image dimensions, the bits is the image bit depth of the calcium movie. The _signals' are the df/f signals for each cell in each context. Each signal is a numpy array with the first 800 samples have been set to NAN due to settling time of the miniscope. The '_spikes' are the inferred spikes for each cell stored as an image index. This signal and spike indices can be converted to time using the sample column. This dataframe is used in the construction of Figures 1-3 in the paper.

som_behavior_df.pkl: A dataframe of behavior readouts recorded by a camera positioned above the mice in each context chamber.

This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id. The columns are; sample times (_time), freezing boolean arrays (_freeze), x-positions in context chamber (_x) and y-positions in the context chamber (_y) for each context in *=('Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'). This dataframe was not used in the paper but may still be useful for further
f
Table_4_Metagenomic Analysis of Bacteria, Fungi, Bacteriophages, and...
frontiersin.figshare.com
docx
Updated Jun 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shengzhi Yang; Xin Gao; Jianghong Meng; Anyun Zhang; Yingmin Zhou; Mei Long; Bei Li; Wenwen Deng; Lei Jin; Siyue Zhao; Daifu Wu; Yongguo He; Caiwu Li; Shuliang Liu; Yan Huang; Hemin Zhang; Likou Zou (2023). Table_4_Metagenomic Analysis of Bacteria, Fungi, Bacteriophages, and Helminths in the Gut of Giant Pandas.DOCX [Dataset]. http://doi.org/10.3389/fmicb.2018.01717.s019
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fmicb.2018.01717.s019
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Shengzhi Yang; Xin Gao; Jianghong Meng; Anyun Zhang; Yingmin Zhou; Mei Long; Bei Li; Wenwen Deng; Lei Jin; Siyue Zhao; Daifu Wu; Yongguo He; Caiwu Li; Shuliang Liu; Yan Huang; Hemin Zhang; Likou Zou
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
To obtain full details of gut microbiota, including bacteria, fungi, bacteriophages, and helminths, in giant pandas (GPs), we created a comprehensive microbial genome database and used metagenomic sequences to align against the database. We delineated a detailed and different gut microbiota structures of GPs. A total of 680 species of bacteria, 198 fungi, 185 bacteriophages, and 45 helminths were found. Compared with 16S rRNA sequencing, the dominant bacterium phyla not only included Proteobacteria, Firmicutes, Bacteroidetes, and Actinobacteria but also Cyanobacteria and other eight phyla. Aside from Ascomycota, Basidiomycota, and Glomeromycota, Mucoromycota, and Microsporidia were the dominant fungi phyla. The bacteriophages were predominantly dsDNA Myoviridae, Siphoviridae, Podoviridae, ssDNA Inoviridae, and Microviridae. For helminths, phylum Nematoda was the dominant. In addition to previously described parasites, another 44 species of helminths were found in GPs. Also, differences in abundance of microbiota were found between the captive, semiwild, and wild GPs. A total of 1,739 genes encoding cellulase, β-glucosidase, and cellulose β-1,4-cellobiosidase were responsible for the metabolism of cellulose, and 128,707 putative glycoside hydrolase genes were found in bacteria/fungi. Taken together, the results indicated not only bacteria but also fungi, bacteriophages, and helminths were diverse in gut of giant pandas, which provided basis for the further identification of role of gut microbiota. Besides, metagenomics revealed that the bacteria/fungi in gut of GPs harbor the ability of cellulose and hemicellulose degradation.
Z
Quebec Health Ministry (2013-2018) web archive collection derivatives
data.niaid.nih.gov
zenodo.org
Updated Mar 3, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mitchell, Dave (2020). Quebec Health Ministry (2013-2018) web archive collection derivatives [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3693791
Explore at:
Dataset updated
Mar 3, 2020
Dataset provided by
Ruest, Nick
Mitchell, Dave
Gagné, Carole
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Quebec
Description
Web archive derivatives of the Quebec Health Ministry (2013-2018) collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!

These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains

.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

Produces a DataFrame with the following columns:

domain

count

Web Pages

.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

Produces a DataFrame with the following columns:

crawl_date

url

mime_type_web_server

mime_type_tika

content

Web Graph

.webgraph()

Produces a DataFrame with the following columns:

crawl_date

src

dest

anchor

Image Links

.imageLinks()

Produces a DataFrame with the following columns:

src

image_url

Binary Analysis

Audio

Images

PDFs

Presentation program files

Spreadsheets

Text files

Word processor files

Facebook

Twitter

Click to copy link

Link copied

Cite

Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall

Shopping Mall

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Dec 15, 2023

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Anshul Pachauri

Description

Libraries Import:

Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings. Data Loading and Exploration:

Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe(). Univariate Analysis:

Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots. Bivariate Analysis:

Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot. Gender-Based Analysis:

Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap. Univariate Clustering:

Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters. Bivariate Clustering:

Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'. Multivariate Clustering:

Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering. Result Saving:

Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").

Clear search

Close search

Google apps

Main menu

Shopping Mall

PandasPlotBench

merged-dataframe

Pandas Test Data

Dataset

Contents

GenBank data submission network yearly data frame files

Pandas 1.x cookbook : practical recipes for scientific computing, time...

Data from: dblp XML dataset as CSV for Python Data Analysis Library

Figure 5 dataframe

Quebec Ministry of Tourism (2012 to 2017) web archive collection derivatives...

Table_2_Assessing Urinary Metabolomics in Giant Pandas Using...

Harvest Quebec Government Websites from December 2006 web archive collection...

Extreme Right Movements in Europe web archive collection derivatives

Freely Accessible eJournals web archive collection derivatives

Web Archive of Independent News Sites on Turkish Affairs derivatives

Queer Japan Web Archive collection derivatives

Learn Pandas

Young and older adult vowel categorization responses

Young and older adult vowel categorization responses

Data associated with "A weakened recurrent circuit in the hippocampus of...

Table_4_Metagenomic Analysis of Bacteria, Fungi, Bacteriophages, and...

Quebec Health Ministry (2013-2018) web archive collection derivatives

Shopping Mall