Libraries Import: Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.
Data Loading and Exploration: Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().
Univariate Analysis: Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.
Bivariate Analysis: Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.
Gender-Based Analysis: Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.
Univariate Clustering: Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.
Bivariate Clustering: Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.
Multivariate Clustering: Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.
Result Saving: Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
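A minimal sketch of the clustering steps described above, assuming the column names given in the description; the original notebook code is not part of this listing, so details such as random_state values and the range of k used for the elbow method are illustrative assumptions.

import pandas as pd
from sklearn.cluster import KMeans

# Load the data (file name taken from the description above).
df = pd.read_csv("Mall_Customers.csv")

# Univariate clustering: 3 clusters on annual income.
km_income = KMeans(n_clusters=3, n_init=10, random_state=42)
df["Income Cluster"] = km_income.fit_predict(df[["Annual Income (k$)"]])

# Elbow method: inertia for k = 1..10 on the bivariate feature set.
features = df[["Annual Income (k$)", "Spending Score (1-100)"]]
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(features).inertia_
           for k in range(1, 11)]

# Bivariate clustering: 5 clusters, then a normalized cross-tabulation by gender.
km_biv = KMeans(n_clusters=5, n_init=10, random_state=42)
df["Spending and Income Cluster"] = km_biv.fit_predict(features)
print(pd.crosstab(df["Spending and Income Cluster"], df["Gender"], normalize="index"))

# Result saving, as in the final step of the description.
df.to_csv("Result.csv", index=False)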
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PandasPlotBench
PandasPlotBench is a benchmark for assessing the capability of models to write plotting code given the description of a Pandas DataFrame. 🛠️ Task: given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the Matplotlib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
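A hedged sketch of loading the benchmark with the Hugging Face datasets library; the repository id is taken from the URL above, while split names and column layout are not stated here and would need to be inspected.

from datasets import load_dataset

# Repository id taken from the dataset URL above; splits and columns are
# inspected rather than assumed.
bench = load_dataset("JetBrains-Research/PandasPlotBench")
print(bench)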
This dataset is an intermediate output from a book recommendation system project. It contains merged data from Amazon book reviews and book details, with added sentiment scores and labels. The sentiment analysis was performed using a custom model. This dataset is not intended as a standalone resource, but rather as a checkpoint in the development process of the recommendation system.
This dataset was created by Gyan Kumar
It contains the following files:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
GenBank data submission network R data frames by year from 1992-2018.
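A hedged sketch, assuming the yearly R data frames are distributed as .RData/.rds files: the pyreadr library can convert them into pandas DataFrames. The file name below is hypothetical, since the entry does not list the actual file names.

import pyreadr  # pip install pyreadr

# Hypothetical file name for one year of the submission network data.
result = pyreadr.read_r("genbank_network_1992.RData")
for name, frame in result.items():
    # Each R data frame is returned as a pandas DataFrame.
    print(name, frame.shape)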
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Explore Pandas 1.x Cookbook: Practical recipes for scientific computing, time series and exploratory data analysis using Python through data • Key facts: author, publication date, book publisher, book series, book subjects • Real-time news, visualizations and datasets
Based on the dblp XML file, this dataset consists of a CSV file extracted using a Python script. The dataset can be easily loaded into a Python Data Analysis Library (pandas) DataFrame.
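A minimal sketch of loading such an extracted CSV into a pandas DataFrame; the file name is hypothetical, and dblp-sized exports may warrant a chunked read.

import pandas as pd

# Hypothetical file name; adjust to the CSV distributed with this dataset.
dblp = pd.read_csv("dblp.csv", low_memory=False)
print(dblp.head())
print(dblp.dtypes)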
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataframe used to generate the area plot in Figure 3.
Columns:
time: time after fire
g: proportion of pixels being grasslands
s: proportion of pixels being shrublands
sfg: proportion of pixels being shrublands that developed from burnt grasslands
f: proportion of pixels being forests
ffg: proportion of pixels being forests that developed from burnt grasslands
ffs: proportion of pixels being forests that developed from burnt shrublands
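A hedged sketch of rebuilding such an area plot from this dataframe with pandas and matplotlib, assuming it is available as a CSV with the columns listed above; the file name is hypothetical.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; the entry does not state the storage format.
df = pd.read_csv("figure3_area_data.csv")

# Stacked area plot of land-cover proportions over time after fire.
cover_cols = ["g", "s", "sfg", "f", "ffg", "ffs"]
df.set_index("time")[cover_cols].plot.area(stacked=True)
plt.xlabel("time after fire")
plt.ylabel("proportion of pixels")
plt.show()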
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Quebec Ministry of Tourism (2012 to 2017) collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!
These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
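A minimal sketch of reading one of these Parquet derivatives into a pandas DataFrame (pd.read_parquet requires pyarrow or fastparquet); the path is hypothetical and refers to an extracted derivative.

import pandas as pd

# Hypothetical path to an extracted derivative; pd.read_parquet also accepts a
# directory of Parquet part-files.
domains = pd.read_parquet("derivatives/domains/")
print(domains.sort_values("count", ascending=False).head(10))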
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Giant pandas represent one of the most endangered species worldwide, and their reproductive capacity is extremely low. They have a relatively long gestational period, mainly because embryo implantation is delayed. Giant panda cubs comprise only a small proportion of the mother's body weight, making it difficult to determine whether a giant panda is pregnant. Timely determination of pregnancy contributes to the efficient breeding and management of giant pandas. Meanwhile, metabolomics studies the metabolic composition of biological samples, which can reflect metabolic functions in cells, tissues, and organisms. This work explored the urinary metabolites of giant pandas during pregnancy. A sample of 8 female pandas was selected. Differences in metabolite levels in giant panda urine samples were analyzed via ultra-high-performance liquid chromatography/mass spectrometry comparing pregnancy to anoestrus. Pattern recognition techniques, including partial least squares-discriminant analysis and orthogonal partial least squares-discriminant analysis, were used to analyze multiple parameters of the data. Compared with the results during anoestrus, multivariate statistical analysis of results obtained from the same pandas being pregnant identified 16 differential metabolites in the positive-ion mode and 43 differential metabolites in the negative-ion mode. The levels of tryptophan, choline, kynurenic acid, uric acid, indole-3-acetaldehyde, taurine, and betaine were higher in samples during pregnancy, whereas those of xanthurenic acid and S-adenosylhomocysteine were lower. Amino acid metabolism, lipid metabolism, and organic acid production differed significantly between anoestrus and pregnancy. Our results provide new insights into metabolic changes in the urine of giant pandas during pregnancy, and the differential levels of metabolites in urine provide a basis for determining pregnancy in giant pandas. Understanding these metabolic changes could be helpful for managing pregnant pandas to provide proper nutrients to their fetuses.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Sites of the Harvest Quebec Government Websites from December 2006 collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!
These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Web archive derivatives of the Literary Authors from Europe and Eurasia Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-11670-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Binary Analysis
Audio
Images
PDFs
Presentation program files
Spreadsheets
Text files
Word processor files
The ivy-11670-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
Domains count file. A text file containing the frequency count of domains captured within your web archive.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Freely Accessible eJournals collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The cul-5921-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Binary Analysis
Audio
Images
PDFs
Presentation program files
Spreadsheets
Text files
Word processor files
The cul-12143-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
Domains count file. A text file containing the frequency count of domains captured within your web archive.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Derivatives of the Web Archive of Independent News Sites on Turkish Affairs collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-12911-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Binary Analysis
Audio
Images
PDFs
Presentation program files
Spreadsheets
Text files
Word processor files
The ivy-12911-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
Domains count file. A text file containing the frequency count of domains captured within your web archive.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Queer Japan Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.
The ivy-12172-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Binary Analysis
Audio
Images
PDFs
Presentation program files
Spreadsheets
Text files
Videos
Word processor files
The ivy-11854-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.
Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
Domains count file. A text file containing the frequency count of domains captured within your web archive.
https://creativecommons.org/publicdomain/zero/1.0/
This is a dataset organized as a learning notebook. Download the whole package and you will find everything you need to learn pandas, from the basics to advanced topics, which is exactly what you will need in machine learning and data science. 😄
It gives you an overview of the data analysis tools in pandas that are most often required for data manipulation and for extracting important data.
Use this notebook as notes for pandas: whenever you forget the code or syntax, open it, scroll through, and you will find the solution. 🥳
https://doi.org/10.5061/dryad.brv15dvh0
On each trial, participants heard a stimulus and clicked a box on the computer screen to indicate whether they heard "SET" or "SAT." Responses of "SET" are coded as 0 and responses of "SAT" are coded as 1. The continuum steps, from 1-7, for duration and spectral quality cues of the stimulus on each trial are named "DurationStep" and "SpectralStep," respectively. Group (young or older adult) and listening condition (quiet or noise) information are provided for each row of the dataset.
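A minimal pandas sketch of summarising these trial-level responses, assuming the data are exported to a CSV; the file name and the exact column names for group, condition, and response are assumptions based on the description.

import pandas as pd

# Hypothetical file name and column spellings, following the description above.
trials = pd.read_csv("set_sat_trials.csv")

# Proportion of "SAT" responses (coded 1) per spectral continuum step,
# split by group and listening condition.
prop_sat = (trials.groupby(["Group", "Condition", "SpectralStep"])["Response"]
                  .mean()
                  .rename("prop_SAT")
                  .reset_index())
print(prop_sat.head())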
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets used in A weakened recurrent circuit in the hippocampus of Rett syndrome mice disrupts long-term memory representations.
Datatypes:
Multi-index pandas dataframe (.pkl)
Numpy array (.npy)
Collection of numpy arrays (.npz)
Python dictionary objects (.pkl)
Datasets:
alignments.pkl: A dataframe containing numpy arrays of image displacements for each mouse in each memory context.
This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id. The columns are ['T', 'F1', 'N1', 'F2', 'N2'] for the training, recall 1-hour, neutral, recall 1-day, and neutral day 2 memory contexts, respectively. Each element of this dataframe is a numpy array of shape images x 2 that holds the x and y image displacements, respectively. These alignments are computed after the Inscopix software motion correction and are used in Supplemental Figure 2 of the paper.
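A hedged sketch of loading this multi-index dataframe and selecting one genotype with pandas; the index level values and column names follow the description, but anything beyond that is an assumption.

import pandas as pd

# Load the pickled multi-index dataframe described above.
alignments = pd.read_pickle("alignments.pkl")

# Rows are indexed by genotype and mouse_id; select all wild-type mice.
wt = alignments.loc["wt"]
print(wt.columns.tolist())    # expected: ['T', 'F1', 'N1', 'F2', 'N2']
print(wt.iloc[0]["T"].shape)  # per-context array of shape (n_images, 2)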
behavior_df.pkl: A dataframe of behavior readouts recorded by a camera positioned above the mice in each context chamber.
This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id. The columns are sample times (*_time), freezing boolean arrays (*_freeze), x-positions in the context chamber (*_x), and y-positions in the context chamber (*_y) for each context * in ('Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2').
correlated_pairs_df.pkl: A dataframe containing arrays of neuron indices that have a correlation in activity pattern > 0.3.
This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id and treatment ('NA'). The columns contain ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] representing each memory context. Each element of the dataframe is a numpy array with three columns. The first two columns are the neuron indices that are correlated and the last column is the strength of the correlation.
dredd_freezes_df.pkl: A dataframe containing freezing percentages for SOM-Cre and RTT-SOM-Cre mice treated with DREADDS.
This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id and treatment (mcherry, hm3d, hm4d). The columns contain one of ['Neutral', 'Fear', 'Fear_2']. Each element of the dataframe is a freezing percentage for a single mouse. This dataframe is built from reading the dredd_behavior.xlsx excel file. This is used to generate figure 5E of the paper.
high_degree_df.pkl: A dataframe containing lists of high-degree neuron indices.
This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id and treatment ('NA'=not applicable since no DREADD used). The columns contain ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] representing each memory context. Each element of the dataframe is a list of neuron indices that are high-degree cells.
N006_wt_basis.npz: A dict containing three numpy arrays representing the basis images for mouse N006 of genotype wild-type.
This dict has three arrays stored under the variable names 'U', 'sigma' and 'img_shape'. U is a matrix of column vector basis images. Each column is the vector representation of a basis image (row pixels x column pixels). There are 220 basis images (columns) in U. The sigma variable is the singular value associated with each basis image vector in U. img_shape can be used to reshape each basis column vector into a 2-D image for viewing. This data is used in Supplemental Figure 2 of the paper.
N006_wt_cxtbasis.pkl: A dictionary containing arrays for basis images and singular values for each context.
This dictionary has keys, ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] representing the memory contexts. Each value is a 2 element list containing the U-basis images as column vectors and singular values, one per basis image in U. The shape of the basis images is the same shape stored in N006_wt_basis.pkl. This dataset is used in Supplementary Figure 2 to track cells across contexts of the CFC task (see also N006_wt_cxtsources.pkl)
N006_wt_cxtsources.pkl: A dictionary containing the independent component source images computed from the basis images for automatically identifying regions of interest (ROIs).
The dictionary is keyed on ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] contexts. Each value in the dictionary at a given key is a 3-D numpy array of shape sources x height x width. These data were used to construct the source images and max intensity projection image of the sources in Supplemental Figure 2F-J of the paper.
N006_wt_rois.pkl: A dictionary containing the boundaries and annuli coordinates of all rois for mouse N006 of genotype wild-type.
This dictionary is keyed on ['boundaries', 'annuli'], and each value is a 179-element list of arrays of boundary line coordinates or annulus point coordinates, one per ROI detected for this mouse.
N006_wt_sources.npy: A numpy array containing all source images computed from all contexts of the CFC task for mouse N006 of genotype wild-type.
This numpy array has shape n x height x width where n=205 source images, height=517 pixels and width=704 pixels. This data was used to construct Supplemental Figure 3F.
N019_wt_basis.npz: A dict containing three numpy arrays representing the basis images for mouse N019 of genotype wild-type.
This dict has three arrays stored under the variable names 'U', 'sigma' and 'img_shape'. U is a matrix of column vector basis images. Each column is the vector representation of a basis image (row pixels x column pixels). There are 220 basis images (columns) in U. The sigma variable is the singular value associated with each basis image vector in U. img_shape can be used to reshape each basis column vector into a 2-D image for viewing. This data is used in Figure 1C of the paper.
N019_wt_sources.npy: A numpy array containing all source images computed from all contexts of the CFC task for mouse N019 of genotype wild-type.
This numpy array has shape n x height x width where n=204 source images, height=516 pixels and width=698 pixels. This data was used to construct Figure 1C of the paper.
P80_animals.pkl: A pandas multi-index object containing the genotype, mouse_id and treatment of the top 80% behavioral performance animals.
In this study, we drop the lowest 20% performing WT and RTT animals based on freezing percentage during the recall contexts. This multi-index is used to filter the data before each computation or plot in this study. So for example Figure 1B contains only the top 80% performing WT and RTT mice.
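A hedged sketch of using this multi-index to filter another dataframe, assuming P80_animals.pkl unpickles to a pandas MultiIndex with levels (genotype, mouse_id, treatment) and that behavior_df.pkl is indexed by (genotype, mouse_id) as described above; the level name 'treatment' is an assumption.

import pandas as pd

# The P80 multi-index of top-80% performers and a dataframe to filter.
p80 = pd.read_pickle("P80_animals.pkl")
behavior = pd.read_pickle("behavior_df.pkl")

# Keep only rows whose (genotype, mouse_id) pair appears in the P80 index.
keep = behavior.index.isin(p80.droplevel("treatment"))
behavior_p80 = behavior[keep]
print(behavior_p80.shape)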
pc_sipscs_amps.pkl: A dictionary containing the amplitudes of spontaneous IPSCs recorded in pyramidal cells of WT and RTT mice.
This dictionary is keyed on ['wt', 'mecp2_pos', 'mecp2_neg'] representing whether the pyramidal cell was recorded from a wild-type mouse ('wt') or is an MeCP2 negative or MeCP2 positive RTT cell. This value under each key is an array of IPSC amplitudes, one per recorded cell. This data was used to construct Figure 4C in the paper.
pc_sipscs_freqs.pkl: A dictionary containing the frequencies of spontaneous IPSCs recorded in pyramidal cells of WT and RTT mice.
This dictionary is keyed on ['wt', 'mecp2_pos', 'mecp2_neg'] representing whether the pyramidal cell was recorded from a wild-type mouse ('wt') or is an MeCP2 negative or MeCP2 positive RTT cell. This value under each key is an array of IPSC frequencies, one per recorded cell. This data was used to construct Figure 4C in the paper.
rois_df.pkl: A multi-index dataframe containing all ROI information for each non-DREADD treated cell in this study (Figures 1-3).
This dataframe index contains the genotype ('wt', 'het'), the mouse_id, the treatment ('NA'=not applicable since no DREADD used), and the cell index starting from 0. The columns are ['centroid', 'cell_boundary', 'annulus_boundary']. The centroid for each cell is a 2-tuple of row, column pixel centroid coordinates. The cell_boundary is a two-column array of row, col boundary points for each ROI. The annulus_boundary is a two-column array of row, column interior points in the annulus. The annulus region excludes points of overlap with nearby cell bodies (See STAR methods of the paper).
signals_df.pkl: A multi-index dataframe containing calcium signals, inferred spikes and metadata for all Non-DREADD experiments used in this study (Figs 1-3).
This dataframe index contains the genotype ('wt', 'het'), the mouse_id, the treatment ('NA' = not applicable since no DREADD was used), and the cell index starting from 0 and going up to 5771 cells. The columns are ['channels', 'channel', 'num_pages', 'width', 'height', 'bits', 'Train_signals', 'Fear_signals', 'Neutral_signals', 'Cue_signals', 'Fear_2_signals', 'Neutral_2_signals', 'Cue_2_signals', 'Train_spikes', 'Fear_spikes', 'Neutral_spikes', 'Cue_spikes', 'Fear_2_spikes', 'Neutral_2_spikes', 'Cue_2_spikes', 'sample_rate']. The channels column lists all the recorded channels, channel is the channel on which ROIs were detected, width and height are the image dimensions, and bits is the image bit depth of the calcium movie. The '_signals' columns hold the df/f signals for each cell in each context; each signal is a numpy array whose first 800 samples are set to NaN due to the settling time of the miniscope. The '_spikes' columns hold the inferred spikes for each cell, stored as image indices. These signal and spike indices can be converted to time using the sample_rate column. This dataframe is used in the construction of Figures 1-3 in the paper.
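A hedged sketch of the index-to-time conversion described above; the mouse_id used for .loc is hypothetical, and the exact index ordering is assumed from the description.

import pandas as pd

signals = pd.read_pickle("signals_df.pkl")

# Select one cell: (genotype, mouse_id, treatment, cell index); the mouse_id
# "N019" is hypothetical here.
cell = signals.loc[("wt", "N019", "NA", 0)]

# Spikes are stored as image (frame) indices; divide by the sampling rate to
# obtain spike times in seconds.
spike_times_s = cell["Train_spikes"] / cell["sample_rate"]
print(spike_times_s[:10])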
som_behavior_df.pkl: A dataframe of behavior readouts recorded by a camera positioned above the mice in each context chamber.
This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id. The columns are sample times (*_time), freezing boolean arrays (*_freeze), x-positions in the context chamber (*_x), and y-positions in the context chamber (*_y) for each context * in ('Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'). This dataframe was not used in the paper but may still be useful for further analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To obtain full details of the gut microbiota, including bacteria, fungi, bacteriophages, and helminths, in giant pandas (GPs), we created a comprehensive microbial genome database and aligned metagenomic sequences against it. We delineated a detailed and distinct gut microbiota structure of GPs. A total of 680 species of bacteria, 198 fungi, 185 bacteriophages, and 45 helminths were found. Compared with 16S rRNA sequencing, the dominant bacterial phyla included not only Proteobacteria, Firmicutes, Bacteroidetes, and Actinobacteria but also Cyanobacteria and eight other phyla. Aside from Ascomycota, Basidiomycota, and Glomeromycota, the phyla Mucoromycota and Microsporidia were also dominant among fungi. The bacteriophages were predominantly dsDNA Myoviridae, Siphoviridae, Podoviridae, ssDNA Inoviridae, and Microviridae. For helminths, the phylum Nematoda was dominant. In addition to previously described parasites, another 44 species of helminths were found in GPs. Also, differences in the abundance of microbiota were found between captive, semiwild, and wild GPs. A total of 1,739 genes encoding cellulase, β-glucosidase, and cellulose β-1,4-cellobiosidase were responsible for the metabolism of cellulose, and 128,707 putative glycoside hydrolase genes were found in bacteria/fungi. Taken together, the results indicated that not only bacteria but also fungi, bacteriophages, and helminths are diverse in the gut of giant pandas, providing a basis for further identification of the role of the gut microbiota. In addition, metagenomics revealed that the bacteria and fungi in the gut of GPs harbor the ability to degrade cellulose and hemicellulose.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web archive derivatives of the Quebec Health Ministry (2013-2018) collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Merci beaucoup BAnQ!
These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
Domains
.webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns:
domain
count
Web Pages
.webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns:
crawl_date
url
mime_type_web_server
mime_type_tika
content
Web Graph
.webgraph()
Produces a DataFrame with the following columns:
crawl_date
src
dest
anchor
Image Links
.imageLinks()
Produces a DataFrame with the following columns:
src
image_url
Binary Analysis
Audio
Images
PDFs
Presentation program files
Spreadsheets
Text files
Word processor files