24 datasets found
  1. Shopping Mall

    • kaggle.com
    Updated Dec 15, 2023
    Cite
    Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anshul Pachauri
    Description

    Libraries Import:

    Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings. Data Loading and Exploration:

    Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe(). Univariate Analysis:

    Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots. Bivariate Analysis:

    Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot. Gender-Based Analysis:

    Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap. Univariate Clustering:

    Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters. Bivariate Clustering:

    Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'. Multivariate Clustering:

    Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering. Result Saving:

    Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
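
    For orientation, the workflow above can be sketched in a few lines of Python. This is a hedged reconstruction from the description, not the author's notebook; the file name, column names, and cluster counts come from the text above, everything else is illustrative:

    # Reconstruction of the described pipeline (bivariate clustering shown).
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    df = pd.read_csv("Mall_Customers.csv")
    print(df.head())
    print(df.describe())

    # Elbow plot: inertia for k = 1..10 on income vs. spending score.
    features = df[["Annual Income (k$)", "Spending Score (1-100)"]]
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(features).inertia_
                for k in range(1, 11)]
    plt.plot(range(1, 11), inertias, marker="o")
    plt.xlabel("k"); plt.ylabel("inertia")
    plt.show()

    # Bivariate clustering with 5 clusters, as in the description.
    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(features)
    df["Spending and Income Cluster"] = km.labels_
    sns.scatterplot(data=df, x="Annual Income (k$)", y="Spending Score (1-100)",
                    hue="Spending and Income Cluster", palette="tab10")
    df.to_csv("Result.csv", index=False)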

  2. PandasPlotBench

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    JetBrains (http://jetbrains.com/)
    Authors
    JetBrains Research
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    PandasPlotBench

    PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of a Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
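
    As a quick start, the benchmark can presumably be pulled with the Hugging Face datasets library; a minimal sketch (configuration and split names are assumptions, check the dataset page):

    # Minimal loading sketch; inspect the returned object for the actual splits.
    from datasets import load_dataset

    bench = load_dataset("JetBrains-Research/PandasPlotBench")
    print(bench)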

  3. merged-dataframe

    • kaggle.com
    Updated Jan 27, 2024
    Cite
    AnshKGoyal (2024). merged-dataframe [Dataset]. https://www.kaggle.com/datasets/anshkgoyal/merged-dataframe/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    AnshKGoyal
    Description

    This dataset is an intermediate output from a book recommendation system project. It contains merged data from Amazon book reviews and book details, with added sentiment scores and labels. The sentiment analysis was performed using a custom model. This dataset is not intended as a standalone resource, but rather as a checkpoint in the development process of the recommendation system.

  4. Pandas Test Data

    • kaggle.com
    zip
    Updated Aug 23, 2020
    + more versions
    Cite
    Gyan Kumar (2020). Pandas Test Data [Dataset]. https://www.kaggle.com/kgmgyan57/pandas-test-data
    Explore at:
    Available download formats: zip (63445451 bytes)
    Dataset updated
    Aug 23, 2020
    Authors
    Gyan Kumar
    Description

    Dataset

    This dataset was created by Gyan Kumar

    Contents

    It contains the following files:

  5. GenBank data submission network yearly data frame files

    • dataverse.harvard.edu
    Updated Jun 7, 2021
    Cite
    Jian Qin; Jeff Hemsley; Sarah Bratt (2021). GenBank data submission network yearly data frame files [Dataset]. http://doi.org/10.7910/DVN/4QUAXY
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Jian Qin; Jeff Hemsley; Sarah Bratt
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    1992 - 2018
    Dataset funded by
    NIH
    NSF
    Description

    GenBank data submission network R data frames by year from 1992-2018.
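
    If you prefer Python to R, the pyreadr package can read R data frames into pandas; a hedged sketch (the file name below is hypothetical, since the listing does not name the individual files):

    # pyreadr.read_r returns a dict mapping R object names to DataFrames.
    import pyreadr

    result = pyreadr.read_r("genbank_network_1992.RData")  # hypothetical name
    for name, frame in result.items():
        print(name, frame.shape)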

  6. Pandas 1.x cookbook : practical recipes for scientific computing, time...

    • workwithdata.com
    Cite
    Work With Data, Pandas 1.x cookbook : practical recipes for scientific computing, time series and exploratory data analysis using Python [Dataset]. https://www.workwithdata.com/object/pandas-1-x-cookbook-practical-recipes-for-scientific-computing-time-series-and-exploratory-data-analysis-using-python-book-by-matt-harrison-1975
    Explore at:
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Explore Pandas 1.x cookbook : practical recipes for scientific computing, time series and exploratory data analysis using Python through data.
    • Key facts: author, publication date, book publisher, book series, book subjects
    • Real-time news, visualizations and datasets

  7. Data from: dblp XML dataset as CSV for Python Data Analysis Library

    • observatorio-cientifico.ua.es
    Updated 2021
    Cite
    Carrasco, Rafael C.; Candela, Gustavo; Carrasco, Rafael C.; Candela, Gustavo (2021). dblp XML dataset as CSV for Python Data Analysis Library [Dataset]. https://observatorio-cientifico.ua.es/documentos/668fc45db9e7c03b01bdb2d0
    Explore at:
    Dataset updated
    2021
    Authors
    Carrasco, Rafael C.; Candela, Gustavo; Carrasco, Rafael C.; Candela, Gustavo
    Description

    Based on the dblp XML file, this dataset consists of a CSV file extracted using a Python script. The dataset can be easily loaded into a pandas (Python Data Analysis Library) DataFrame.
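
    Loading it is a one-liner with pandas; a minimal sketch (the file name is hypothetical):

    # Large dblp exports can be memory-hungry; chunked reading is an option.
    import pandas as pd

    dblp = pd.read_csv("dblp.csv", low_memory=False)  # hypothetical file name
    print(dblp.head())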

  8. Figure 5 dataframe

    • figshare.com
    txt
    Updated Dec 18, 2023
    Cite
    Aland Chan (2023). Figure 5 dataframe [Dataset]. http://doi.org/10.6084/m9.figshare.23245640.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Aland Chan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataframe used to generate the area plot in Figure 3.

    Columns:
    • time: time after fire
    • g: proportion of pixels being grasslands
    • s: proportion of pixels being shrublands
    • sfg: proportion of pixels being shrublands that developed from burnt grasslands
    • f: proportion of pixels being forests
    • ffg: proportion of pixels being forests that developed from burnt grasslands
    • ffs: proportion of pixels being forests that developed from burnt shrublands
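
    Given those columns, the area plot can be rebuilt along these lines; a sketch assuming the txt download parses as a comma-separated table (file name and delimiter are unverified):

    # Sketch only: adjust the path and separator to the actual download.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("figure5_dataframe.txt")  # hypothetical name
    df.plot.area(x="time", y=["g", "s", "sfg", "f", "ffg", "ffs"])
    plt.xlabel("time after fire")
    plt.ylabel("proportion of pixels")
    plt.show()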

  9. Quebec Ministry of Tourism (2012 to 2017) web archive collection derivatives...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Mar 3, 2020
    Cite
    Nick Ruest; Nick Ruest; Carole Gagné; Dave Mitchell; Carole Gagné; Dave Mitchell (2020). Quebec Ministry of Tourism (2012 to 2017) web archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3693801
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Mar 3, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nick Ruest; Nick Ruest; Carole Gagné; Dave Mitchell; Carole Gagné; Dave Mitchell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Quebec
    Description

    Web archive derivatives of the Quebec Ministry of Tourism (2012 to 2017) collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Thank you very much, BAnQ!

    These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.
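
    As a concrete example of the Parquet-to-pandas step, a minimal sketch (the path is hypothetical; the derivatives unpack to directories of Parquet part files):

    # pd.read_parquet accepts a directory of part files as well as a single file.
    import pandas as pd

    webpages = pd.read_parquet("webpages/")  # hypothetical extracted path
    print(webpages.columns)  # expect crawl_date, url, mime types, content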

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

    Produces a DataFrame with the following columns:

    • domain
    • count

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns:

    • crawl_date
    • url
    • mime_type_web_server
    • mime_type_tika
    • content

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns:

    • crawl_date
    • src
    • dest
    • anchor

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns:

    • src
    • image_url

    Binary Analysis

    • Audio
    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Videos
    • Word processor files
  10. Table_2_Assessing Urinary Metabolomics in Giant Pandas Using...

    • figshare.com
    xls
    Updated Jun 2, 2023
    + more versions
    Cite
    Maosheng Cao; Chunjin Li; Yuliang Liu; Kailai Cai; Lu Chen; Chenfeng Yuan; Zijiao Zhao; Boqi Zhang; Rong Hou; Xu Zhou (2023). Table_2_Assessing Urinary Metabolomics in Giant Pandas Using Chromatography/Mass Spectrometry: Pregnancy-Related Changes in the Metabolome.xls [Dataset]. http://doi.org/10.3389/fendo.2020.00215.s003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Maosheng Cao; Chunjin Li; Yuliang Liu; Kailai Cai; Lu Chen; Chenfeng Yuan; Zijiao Zhao; Boqi Zhang; Rong Hou; Xu Zhou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Giant pandas represent one of the most endangered species worldwide, and their reproductive capacity is extremely low. They have a relatively long gestational period, mainly because embryo implantation is delayed. Giant panda cubs comprise only a small proportion of the mother's body weight, making it difficult to determine whether a giant panda is pregnant. Timely determination of pregnancy contributes to the efficient breeding and management of giant pandas. Meanwhile, metabolomics studies the metabolic composition of biological samples, which can reflect metabolic functions in cells, tissues, and organisms. This work explored the urinary metabolites of giant pandas during pregnancy. A sample of 8 female pandas was selected. Differences in metabolite levels in giant panda urine samples were analyzed via ultra-high-performance liquid chromatography/mass spectrometry comparing pregnancy to anoestrus. Pattern recognition techniques, including partial least squares-discriminant analysis and orthogonal partial least squares-discriminant analysis, were used to analyze multiple parameters of the data. Compared with the results during anoestrus, multivariate statistical analysis of samples obtained from the same pandas during pregnancy identified 16 differential metabolites in the positive-ion mode and 43 differential metabolites in the negative-ion mode. The levels of tryptophan, choline, kynurenic acid, uric acid, indole-3-acetaldehyde, taurine, and betaine were higher in samples during pregnancy, whereas those of xanthurenic acid and S-adenosylhomocysteine were lower. Amino acid metabolism, lipid metabolism, and organic acid production differed significantly between anoestrus and pregnancy. Our results provide new insights into metabolic changes in the urine of giant pandas during pregnancy, and the differential levels of metabolites in urine provide a basis for determining pregnancy in giant pandas. Understanding these metabolic changes could be helpful for managing pregnant pandas to provide proper nutrients to their fetuses.
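
    For readers who want to experiment with the pattern-recognition step: scikit-learn has no dedicated PLS-DA class, but PLS-DA can be approximated by fitting PLSRegression against a binary class label. A rough sketch on synthetic stand-in data (the real study used metabolomics software; the sizes below are illustrative):

    # Illustrative PLS-DA approximation on synthetic data.
    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(16, 59))    # 16 samples x 59 metabolite features
    y = np.array([0] * 8 + [1] * 8)  # 0 = anoestrus, 1 = pregnancy

    pls = PLSRegression(n_components=2).fit(X, y)
    scores = pls.transform(X)        # latent scores for a 2-D scores plot
    print(scores.shape)              # (16, 2)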

  11. Harvest Quebec Government Websites from December 2006 web archive collection...

    • zenodo.org
    application/gzip
    Updated Feb 26, 2020
    Cite
    Nick Ruest; Nick Ruest; Carole Gagné; Dave Mitchell; Carole Gagné; Dave Mitchell (2020). Harvest Quebec Government Websites from December 2006 web archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3688354
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Feb 26, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nick Ruest; Nick Ruest; Carole Gagné; Dave Mitchell; Carole Gagné; Dave Mitchell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Quebec
    Description

    Web archive derivatives of the Harvest Quebec Government Websites from December 2006 collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Thank you very much, BAnQ!

    These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

    Produces a DataFrame with the following columns:

    • domain
    • count

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns:

    • crawl_date
    • url
    • mime_type_web_server
    • mime_type_tika
    • content

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns:

    • crawl_date
    • src
    • dest
    • anchor

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns:

    • src
    • image_url

    Binary Analysis

    • Audio
    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Videos
    • Word processor files
  12. Extreme Right Movements in Europe web archive collection derivatives

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Jan 31, 2020
    Cite
    Nick Ruest (2020). Extreme Right Movements in Europe web archive collection derivatives [Dataset]. http://doi.org/10.5281/zenodo.3633160
    Explore at:
    Dataset updated
    Jan 31, 2020
    Authors
    Nick Ruest
    Area covered
    Europe
    Description

    Web archive derivatives of the Literary Authors from Europe and Eurasia Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The ivy-11670-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

    Produces a DataFrame with the following columns:

    • domain
    • count

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns:

    • crawl_date
    • url
    • mime_type_web_server
    • mime_type_tika
    • content

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns:

    • crawl_date
    • src
    • dest
    • anchor

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns:

    • src
    • image_url

    Binary Analysis

    • Audio
    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Word processor files

    The ivy-11670-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

    Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.

    Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.

    Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.

    Domains count file. A text file containing the frequency count of domains captured within your web archive.

  13. Freely Accessible eJournals web archive collection derivatives

    • data.niaid.nih.gov
    Updated Feb 2, 2020
    Cite
    Ruest, Nick (2020). Freely Accessible eJournals web archive collection derivatives [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3633670
    Explore at:
    Dataset updated
    Feb 2, 2020
    Dataset authored and provided by
    Ruest, Nick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Web archive derivatives of the Freely Accessible eJournals collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The cul-5921-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

    Produces a DataFrame with the following columns:

    • domain
    • count

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns:

    • crawl_date
    • url
    • mime_type_web_server
    • mime_type_tika
    • content

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns:

    • crawl_date
    • src
    • dest
    • anchor

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns:

    • src
    • image_url

    Binary Analysis

    • Audio
    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Word processor files

    The cul-12143-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

    Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.

    Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.

    Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.

    Domains count file. A text file containing the frequency count of domains captured within your web archive.

  14. Web Archive of Independent News Sites on Turkish Affairs derivatives

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 31, 2020
    Cite
    Ruest, Nick (2020). Web Archive of Independent News Sites on Turkish Affairs derivatives [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3633233
    Explore at:
    Dataset updated
    Jan 31, 2020
    Dataset authored and provided by
    Ruest, Nick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Derivatives of the Web Archive of Independent News Sites on Turkish Affairs collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The ivy-12911-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

    Produces a DataFrame with the following columns:

    • domain
    • count

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns:

    • crawl_date
    • url
    • mime_type_web_server
    • mime_type_tika
    • content

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns:

    • crawl_date
    • src
    • dest
    • anchor

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns:

    • src
    • image_url

    Binary Analysis

    • Audio
    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Word processor files

    The ivy-12911-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

    Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.

    Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.

    Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.

    Domains count file. A text file containing the frequency count of domains captured within your web archive.

  15. Queer Japan Web Archive collection derivatives

    • data.niaid.nih.gov
    Updated Feb 1, 2020
    Cite
    Yanagihara, Yoshie (2020). Queer Japan Web Archive collection derivatives [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3633283
    Explore at:
    Dataset updated
    Feb 1, 2020
    Dataset provided by
    Shida, Tetsuyuki
    Abrams, Samantha
    Yanagihara, Yoshie
    Nakamura, Haruko
    Ruest, Nick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Web archive derivatives of the Queer Japan Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The ivy-12172-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

    Produces a DataFrame with the following columns:

    • domain
    • count

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns:

    • crawl_date
    • url
    • mime_type_web_server
    • mime_type_tika
    • content

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns:

    • crawl_date
    • src
    • dest
    • anchor

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns:

    • src
    • image_url

    Binary Analysis

    • Audio
    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Videos
    • Word processor files

    The ivy-11854-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

    Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.

    Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.

    Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.

    Domains count file. A text file containing the frequency count of domains captured within your web archive.

  16. Learn Pandas

    • kaggle.com
    Updated Oct 5, 2023
    Cite
    Vaidik Patel (2023). Learn Pandas [Dataset]. https://www.kaggle.com/datasets/js1js2js3js4js5/learn-pandas/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vaidik Patel
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is a dataset with notebook-style learning. Download the whole package and you will find everything you need to learn pandas from the basics to advanced topics, which is exactly what you will need in machine learning and in data science. 😄

    It gives you an overview of the data analysis tools in pandas that are most often required for manipulating data and extracting the important parts.

    Use this notebook as notes for pandas: whenever you forget the code or syntax, open it and scroll through it, and you will find the solution. 🥳

  17. Young and older adult vowel categorization responses

    • datadryad.org
    zip
    Updated Mar 14, 2024
    Cite
    Mishaela DiNino (2024). Young and older adult vowel categorization responses [Dataset]. http://doi.org/10.5061/dryad.brv15dvh0
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 14, 2024
    Dataset provided by
    Dryad
    Authors
    Mishaela DiNino
    Description

    Young and older adult vowel categorization responses

    https://doi.org/10.5061/dryad.brv15dvh0

    On each trial, participants heard a stimulus and clicked a box on the computer screen to indicate whether they heard "SET" or "SAT." Responses of "SET" are coded as 0 and responses of "SAT" are coded as 1. The continuum steps, from 1-7, for duration and spectral quality cues of the stimulus on each trial are named "DurationStep" and "SpectralStep," respectively. Group (young or older adult) and listening condition (quiet or noise) information are provided for each row of the dataset.
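
    With that coding, the psychometric function along either continuum falls out of a groupby; a sketch with hypothetical file and response column names (only DurationStep and Group are taken from the description):

    # Mean of the 0/1 coding = proportion of "SAT" responses per step.
    import pandas as pd

    df = pd.read_csv("vowel_responses.csv")  # hypothetical file name
    psychometric = df.groupby(["Group", "DurationStep"])["Response"].mean()
    print(psychometric)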

  18. Data associated with "A weakened recurrent circuit in the hippocampus of...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 24, 2022
    Cite
    Wang, Wei (2022). Data associated with "A weakened recurrent circuit in the hippocampus of Rett syndrome mice disrupts long-term memory representations" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5999291
    Explore at:
    Dataset updated
    Feb 24, 2022
    Dataset provided by
    Tang, Jianrong
    Sun, Yaling
    Wang, Wei
    He, Lingjie
    Caudill, Matthew S.
    Jing, Junzhan
    Zoghbi, Huda Y.
    Jiang, Xiaolong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets used in "A weakened recurrent circuit in the hippocampus of Rett syndrome mice disrupts long-term memory representations."

    Datatypes:

    Multi-index pandas dataframe (.pkl)

    Numpy array (.npy)

    Collection of numpy arrays (.npz)

    Python dictionary objects (.pkl)

    Datasets:

    alignments.pkl: A dataframe containing numpy arrays of image displacements for each mouse in each memory context.

    This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id. The columns are ['T', 'F1', 'N1', 'F2', 'N2'] for the training, recall 1-hour, neutral, recall 1-day, neutral day 2 memory contexts respectively. Each element of this dataframe is a numpy array of shape images x 2 that holds x and y image displacements, respectively. These alignments are computed after the Inscopix software motion correction and are used in Supplemental Figure 2 of the paper.

    behavior_df.pkl: A dataframe of behavior readouts recorded by a camera positioned above the mice in each context chamber.

    This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id. The columns are sample times (*_time), freezing boolean arrays (*_freeze), x-positions in the context chamber (*_x), and y-positions in the context chamber (*_y) for each context (*) in ('Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2').

    correlated_pairs_df.pkl: A dataframe containing arrays of neuron indices that have a correlation in activity pattern > 0.3.

    This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id and treatment ('NA'). The columns contain ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] representing each memory context. Each element of the dataframe is a numpy array with three columns. The first two columns are the neuron indices that are correlated and the last column is the strength of the correlation.

    dredd_freezes_df.pkl: A dataframe containing freezing percentages for SOM-Cre and RTT-SOM-Cre mice treated with DREADDS.

    This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id and treatment (mcherry, hm3d, hm4d). The columns contain one of ['Neutral', 'Fear', 'Fear_2']. Each element of the dataframe is a freezing percentage for a single mouse. This dataframe is built from reading the dredd_behavior.xlsx Excel file. This is used to generate Figure 5E of the paper.

    high_degree_df.pkl: A dataframe containing list of high degree neuron indices.

    This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id and treatment ('NA'=not applicable since no DREADD used). The columns contain ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] representing each memory context. Each element of the dataframe is a list of neuron indices that are high-degree cells.

    N006_wt_basis.npz: a dict containing three numpy arrays representing the basis images for mouse N006 of genotype wild-type.

    This dict has three arrays stored under the variable names 'U', 'sigma' and 'img_shape'. U is a matrix of column vector basis images. Each column is the vector representation of a basis image (row pixels x column pixels). There are 220 basis images (columns) in U. The sigma variable is the singular value associated with each basis image vector in U. img_shape can be used to reshape each basis column vector into a 2-D image for viewing. This data is used in Supplemental Figure 2 of the paper.

    N006_wt_cxtbasis.pkl: A dictionary containing arrays for basis images and singular values for each context.

    This dictionary has keys ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] representing the memory contexts. Each value is a 2-element list containing the U-basis images as column vectors and singular values, one per basis image in U. The shape of the basis images is the same shape stored in N006_wt_basis.npz. This dataset is used in Supplementary Figure 2 to track cells across contexts of the CFC task (see also N006_wt_cxtsources.pkl)

    N006_wt_cxtsources.pkl: A dictionary containing the independent component source images computed from the basis images for automatically identifying regions of interest (ROIs).

    The dictionary is keyed on ['Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'] contexts. Each value in the dictionary at a given key is a 3-D numpy array of shape sources x height x width. These data were used to construct the source images and max intensity projection image of the sources in Supplemental Figure 2F-J of the paper.

    N006_wt_rois.pkl: A dictionary containing the boundaries and annuli coordinates of all rois for mouse N006 of genotype wild-type.

    This dictionary is keyed on ['boundaries', 'annuli'], and each value is a 179-element list of arrays of boundary line coordinates or annulus point coordinates, one per ROI detected for this mouse.

    N006_wt_sources.npy: A numpy array containing all source images computed from all contexts of the CFC task for mouse N006 of genotype wild-type.

    This numpy array has shape n x height x width where n=205 source images, height=517 pixels and width=704 pixels. This data was used to construct Supplemental Figure 3F.

    N019_wt_basis.npz: a dict containing three numpy arrays representing the basis images for mouse N019 of genotype wild-type.

    This dict has three arrays stored under the variable names 'U', 'sigma' and 'img_shape'. U is a matrix of column vector basis images. Each column is the vector representation of a basis image (row pixels x column pixels). There are 220 basis images (columns) in U. The sigma variable is the singular value associated with each basis image vector in U. img_shape can be used to reshape each basis column vector into a 2-D image for viewing. This data is used in Figure 1C of the paper.

    N019_wt_sources.npy: A numpy array containing all source images computed from all contexts of the CFC task for mouse N019 of genotype wild-type.

    This numpy array has shape n x height x width where n=204 source images, height=516 pixels and width=698 pixels. This data was used to construct Figure 1C of the paper.

    P80_animals.pkl: A pandas multi-index object containing the genotype, mouse_id and treatment of the top 80% behavioral performance animals.

    In this study, we drop the lowest 20% performing WT and RTT animals based on freezing percentage during the recall contexts. This multi-index is used to filter the data before each computation or plot in this study. So for example Figure 1B contains only the top 80% performing WT and RTT mice.

    pc_sipscs_amps.pkl: A dictionary containing the amplitudes of spontaneous IPSCs recorded in pyramidal cells of WT and RTT mice.

    This dictionary is keyed on ['wt', 'mecp2_pos', 'mecp2_neg'] representing whether the pyramidal cell was recorded from a wild-type mouse ('wt') or is an MeCP2 negative or MeCP2 positive RTT cell. The value under each key is an array of IPSC amplitudes, one per recorded cell. This data was used to construct Figure 4C in the paper.

    pc_sipscs_freqs.pkl: A dictionary containing the frequencies of spontaneous IPSCs recorded in pyramidal cells of WT and RTT mice.

    This dictionary is keyed on ['wt', 'mecp2_pos', 'mecp2_neg'] representing whether the pyramidal cell was recorded from a wild-type mouse ('wt') or is an MeCP2 negative or MeCP2 positive RTT cell. The value under each key is an array of IPSC frequencies, one per recorded cell. This data was used to construct Figure 4C in the paper.

    rois_df.pkl: A multi-index dataframe containing all ROI information for each non-DREADD treated cell in this study (Figures 1-3).

    This dataframe index contains the genotype ('wt', 'het'), the mouse_id, the treatment ('NA'=not applicable since no DREADD used), and the cell index starting from 0. The columns are ['centroid', 'cell_boundary', 'annulus_boundary']. The centroid for each cell is a 2-tuple of row, column pixel centroid coordinates. The cell_boundary is a two-column array of row, col boundary points for each ROI. The annulus_boundary is a two-column array of row, column interior points in the annulus. The annulus region excludes points of overlap with nearby cell bodies (See STAR methods of the paper).

    signals_df.pkl: A multi-index dataframe containing calcium signals, inferred spikes and metadata for all Non-DREADD experiments used in this study (Figs 1-3).

    This dataframe index contains the genotype ('wt', 'het'), the mouse_id, the treatment ('NA'=not applicable since no DREADD used), and the cell index starting from 0 and going up to 5771 cells. The columns are ['channels', 'channel', 'num_pages', 'width', 'height', 'bits', 'Train_signals', 'Fear_signals', 'Neutral_signals', 'Cue_signals', 'Fear_2_signals', 'Neutral_2_signals', 'Cue_2_signals', 'Train_spikes', 'Fear_spikes', 'Neutral_spikes', 'Cue_spikes', 'Fear_2_spikes', 'Neutral_2_spikes', 'Cue_2_spikes', 'sample_rate']. 'channels' lists all the recorded channels, 'channel' is the channel on which ROIs were detected, 'width' and 'height' are the image dimensions, and 'bits' is the image bit depth of the calcium movie. The '_signals' columns are the df/f signals for each cell in each context; each signal is a numpy array in which the first 800 samples have been set to NaN due to the settling time of the miniscope. The '_spikes' columns are the inferred spikes for each cell, stored as image indices. Signal and spike indices can be converted to time using the 'sample_rate' column. This dataframe is used in the construction of Figures 1-3 in the paper.

    som_behavior_df.pkl: A dataframe of behavior readouts recorded by a camera positioned above the mice in each context chamber.

    This multi-index dataframe has rows indexed by genotype ('wt' or 'het') and mouse_id. The columns are sample times (*_time), freezing boolean arrays (*_freeze), x-positions in the context chamber (*_x), and y-positions in the context chamber (*_y) for each context (*) in ('Train', 'Fear', 'Neutral', 'Fear_2', 'Neutral_2'). This dataframe was not used in the paper but may still be useful for further analysis.
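
    A hedged sketch of how these files load, based purely on the formats and index levels listed above (file names are from the listing; nothing is verified against the archive):

    # Multi-index DataFrames and dicts ship as .pkl, arrays as .npy/.npz.
    import numpy as np
    import pandas as pd

    signals = pd.read_pickle("signals_df.pkl")  # index: genotype, mouse_id, treatment, cell
    wt_cells = signals.loc["wt"]                # all wild-type rows

    basis = np.load("N006_wt_basis.npz")
    U, sigma, img_shape = basis["U"], basis["sigma"], basis["img_shape"]
    first_image = U[:, 0].reshape(tuple(img_shape))  # back to 2-D for viewing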

  19. Table_4_Metagenomic Analysis of Bacteria, Fungi, Bacteriophages, and...

    • frontiersin.figshare.com
    docx
    Updated Jun 1, 2023
    + more versions
    Cite
    Shengzhi Yang; Xin Gao; Jianghong Meng; Anyun Zhang; Yingmin Zhou; Mei Long; Bei Li; Wenwen Deng; Lei Jin; Siyue Zhao; Daifu Wu; Yongguo He; Caiwu Li; Shuliang Liu; Yan Huang; Hemin Zhang; Likou Zou (2023). Table_4_Metagenomic Analysis of Bacteria, Fungi, Bacteriophages, and Helminths in the Gut of Giant Pandas.DOCX [Dataset]. http://doi.org/10.3389/fmicb.2018.01717.s019
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Shengzhi Yang; Xin Gao; Jianghong Meng; Anyun Zhang; Yingmin Zhou; Mei Long; Bei Li; Wenwen Deng; Lei Jin; Siyue Zhao; Daifu Wu; Yongguo He; Caiwu Li; Shuliang Liu; Yan Huang; Hemin Zhang; Likou Zou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    To obtain full details of the gut microbiota of giant pandas (GPs), including bacteria, fungi, bacteriophages, and helminths, we created a comprehensive microbial genome database and aligned metagenomic sequences against it. We delineated a detailed and distinctive gut microbiota structure for GPs. A total of 680 species of bacteria, 198 fungi, 185 bacteriophages, and 45 helminths were found. Compared with 16S rRNA sequencing, the dominant bacterial phyla included not only Proteobacteria, Firmicutes, Bacteroidetes, and Actinobacteria but also Cyanobacteria and eight other phyla. Aside from Ascomycota, Basidiomycota, and Glomeromycota, Mucoromycota and Microsporidia were the dominant fungal phyla. The bacteriophages were predominantly dsDNA Myoviridae, Siphoviridae, and Podoviridae and ssDNA Inoviridae and Microviridae. Among helminths, the phylum Nematoda was dominant. In addition to previously described parasites, another 44 species of helminths were found in GPs. Differences in the abundance of microbiota were also found between captive, semiwild, and wild GPs. A total of 1,739 genes encoding cellulase, β-glucosidase, and cellulose β-1,4-cellobiosidase were responsible for the metabolism of cellulose, and 128,707 putative glycoside hydrolase genes were found in bacteria/fungi. Taken together, the results indicated that not only bacteria but also fungi, bacteriophages, and helminths are diverse in the gut of giant pandas, which provides a basis for further identification of the role of the gut microbiota. Metagenomics also revealed that the bacteria/fungi in the gut of GPs harbor the ability to degrade cellulose and hemicellulose.

  20. Quebec Health Ministry (2013-2018) web archive collection derivatives

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 3, 2020
    Cite
    Mitchell, Dave (2020). Quebec Health Ministry (2013-2018) web archive collection derivatives [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3693791
    Explore at:
    Dataset updated
    Mar 3, 2020
    Dataset provided by
    Ruest, Nick
    Mitchell, Dave
    Gagné, Carole
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Quebec
    Description

    Web archive derivatives of the Quebec Health Ministry (2013-2018) collection from the Bibliothèque et Archives nationales du Québec. The derivatives were created with the Archives Unleashed Toolkit. Thank you very much, BAnQ!

    These derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    Domains

    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)

    Produces a DataFrame with the following columns:

    • domain
    • count

    Web Pages

    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))

    Produces a DataFrame with the following columns:

    • crawl_date
    • url
    • mime_type_web_server
    • mime_type_tika
    • content

    Web Graph

    .webgraph()

    Produces a DataFrame with the following columns:

    • crawl_date
    • src
    • dest
    • anchor

    Image Links

    .imageLinks()

    Produces a DataFrame with the following columns:

    • src
    • image_url

    Binary Analysis

    • Audio
    • Images
    • PDFs
    • Presentation program files
    • Spreadsheets
    • Text files
    • Word processor files
