51 datasets found
  1. Shopping Mall

    • kaggle.com
    zip
    Updated Dec 15, 2023
    Cite
    Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
    Explore at:
    Available download formats: zip (22852 bytes)
    Dataset updated
    Dec 15, 2023
    Authors
    Anshul Pachauri
    Description

    Libraries Import:

    Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

    Data Loading and Exploration:

    Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().

    Univariate Analysis:

    Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

    Bivariate Analysis:

    Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

    Gender-Based Analysis:

    Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.

    Univariate Clustering:

    Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.

    Bivariate Clustering:

    Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

    Multivariate Clustering:

    Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.

    Result Saving:

    Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
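
    The bivariate clustering step described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the data is randomly generated as a stand-in for "Mall_Customers.csv" (the real file is not read), and the range of cluster counts tried (1 to 10) is an assumption, not something the description states.

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    # Synthetic stand-in for Mall_Customers.csv (assumption: the real file is not used here).
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "Annual Income (k$)": rng.uniform(15, 140, size=200),
        "Spending Score (1-100)": rng.uniform(1, 100, size=200),
    })
    features = df[["Annual Income (k$)", "Spending Score (1-100)"]]

    # Elbow method: record the within-cluster sum of squares (inertia) for each k.
    inertias = []
    for k in range(1, 11):
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(features)
        inertias.append(model.inertia_)

    # Fit the 5-cluster model named in the description and store the labels.
    final = KMeans(n_clusters=5, n_init=10, random_state=42).fit(features)
    df["Spending and Income Cluster"] = final.labels_
    ```

    Plotting `inertias` against `range(1, 11)` gives the elbow curve; the point where the curve flattens is the usual heuristic for picking the number of clusters.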

  2. PandasPlotBench

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    JetBrains (http://jetbrains.com/)
    Authors
    JetBrains Research
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    PandasPlotBench

    PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of the Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.

  3. pandas-create-context

    • huggingface.co
    Updated Jan 8, 2024
    Cite
    Or Hiltch (2024). pandas-create-context [Dataset]. https://huggingface.co/datasets/hiltch/pandas-create-context
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2024
    Authors
    Or Hiltch
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Overview

    This dataset is built from sql-create-context, which in itself builds from WikiSQL and Spider. I have used GPT4 to translate the SQL schema into pandas DataFrame schema initialization statements and to translate the SQL queries into pandas queries. There are 862 examples of natural language queries, pandas DataFrame creation statements, and pandas queries answering the question using the DataFrame creation statement as context. This dataset was built with text-to-pandas LLMs… See the full description on the dataset page: https://huggingface.co/datasets/hiltch/pandas-create-context.
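
    As a rough illustration of how one record of this shape might be used, the sketch below runs a DataFrame creation statement as context and then evaluates the pandas query against it. The field names `question`, `context`, and `answer`, and the two sample rows, are assumptions made for the example; `exec` and `eval` should only be applied to strings you trust.

    ```python
    import pandas as pd

    # Hypothetical record in the three-part shape described above.
    record = {
        "question": "What was the lowest # of total votes?",
        "context": "df = pd.DataFrame(columns=['_number_of_total_votes'])",
        "answer": "df['_number_of_total_votes'].min()",
    }

    # Run the creation statement to obtain the (empty) DataFrame schema.
    namespace = {"pd": pd}
    exec(record["context"], namespace)
    schema_columns = list(namespace["df"].columns)

    # Fill the schema with made-up rows so the query has something to compute on.
    namespace["df"] = pd.DataFrame([[120], [95]], columns=schema_columns)

    # Evaluate the pandas query in the same namespace.
    result = eval(record["answer"], namespace)
    ```

    Keeping the context and the answer in one namespace mirrors how the answer is meant to be read: as code that only makes sense given the DataFrame defined by the context.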

  4. Diabetes_Dataset_1.1

    • kaggle.com
    zip
    Updated Nov 2, 2023
    Cite
    KIRANMAYI G 777 (2023). Diabetes_Dataset_1.1 [Dataset]. https://www.kaggle.com/datasets/kiranmayig777/diabetes-dataset-1-1/code
    Explore at:
    Available download formats: zip (779755 bytes)
    Dataset updated
    Nov 2, 2023
    Authors
    KIRANMAYI G 777
    Description

    import pandas as pd
    import numpy as np

    # PERFORMING EDA
    # (`data` is assumed to be the dataset already loaded into a pandas DataFrame)
    data.head()
    data.info()

    attributes_data = data.iloc[:, 1:]
    attributes_data

    attributes_data.describe()
    attributes_data.corr()

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Calculate correlation matrix
    correlation_matrix = attributes_data.corr()
    plt.figure(figsize=(18, 10))

    # Create a heatmap
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
    plt.show()

    # CHECKING IF DATASET IS LINEAR OR NON-LINEAR

    # Calculate correlations between target and predictor columns
    correlations = data.corr()['Diabetes_binary'].drop('Diabetes_binary')

    # Create a bar chart
    plt.figure(figsize=(10, 6))
    correlations.plot(kind='bar')
    plt.xlabel('Predictor Columns')
    plt.ylabel('Correlation values')
    plt.title('Correlation between Diabetes_binary and Predictors')
    plt.show()

    # CHECKING FOR NULL AND MISSING VALUES, CLEANING THEM

    # Count the number of null values in each column
    print(data.isnull().sum())

    # Check for missing values in all columns
    print(data.isna().sum())

    # LASSO
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV, KFold

    X = data.iloc[:, 1:]
    y = data.iloc[:, 0]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

    # GridSearchCV is used to find the optimal combination of hyperparameters for a given model,
    # so in the end we can select the best parameters from the listed hyperparameters.
    parameters = {"alpha": np.arange(0.00001, 10, 500)}
    kfold = KFold(n_splits = 10, shuffle=True, random_state = 42)
    lassoReg = Lasso()
    lasso_cv = GridSearchCV(lassoReg, param_grid = parameters, cv = kfold)
    lasso_cv.fit(X, y)
    print("Best Params {}".format(lasso_cv.best_params_))

    column_names = list(data)
    column_names = column_names[1:]
    column_names

    lassoModel = Lasso(alpha = 0.00001)
    lassoModel.fit(X_train, y_train)
    lasso_coeff = np.abs(lassoModel.coef_)  # making all coefficients positive
    plt.bar(column_names, lasso_coeff, color = 'orange')
    plt.xticks(rotation=90)
    plt.grid()
    plt.title("Feature Selection Based on Lasso")
    plt.xlabel("Features")
    plt.ylabel("Importance")
    plt.ylim(0, 0.16)
    plt.show()

    # RFE
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

    from sklearn.feature_selection import RFECV
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier()
    rfecv = RFECV(estimator= model, step = 1, cv = 20, scoring="accuracy")
    rfecv = rfecv.fit(X_train, y_train)

    # Cross-validation scores: mean test accuracy for each candidate number of features
    # (cv_results_ is available in scikit-learn 1.0 and later; ranking_ holds ranks, not scores)
    cv_scores = rfecv.cv_results_["mean_test_score"]
    num_features_selected = len(cv_scores)

    # Plotting the number of features vs. cross-validation score
    plt.figure(figsize=(10, 6))
    plt.xlabel("Number of features selected")
    plt.ylabel("Score (accuracy)")
    plt.plot(range(1, num_features_selected + 1), cv_scores, marker='o', color='r')
    plt.xticks(range(1, num_features_selected + 1))  # Set x-ticks to integers
    plt.grid()
    plt.title("RFECV: Number of Features vs. Score(accuracy)")
    plt.show()

    print("The optimal number of features:", rfecv.n_features_)
    print("Best features:", X_train.columns[rfecv.support_])

    # PCA
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    # Jupyter notebook magic; omit when running as a plain script
    %matplotlib inline
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = data.drop(["Diabetes_binary"], axis=1)
    y = data["Diabetes_binary"]

    df1 = pd.DataFrame(data = data, columns=data.columns)
    print(df1)

    scaling = StandardScaler()
    scaling.fit(df1)
    Scaled_data = scaling.transform(df1)
    principal = PCA(n_components=3)
    principal.fit(Scaled_data)
    x = principal.transform(Scaled_data)
    print(x.shape)

    principal.components_

    plt.figure(figsize=(10,10))

    plt.scatter(x[:,0], x[:,1], c=data['Diabetes_binary'], cmap='plasma')
    plt.xlabel('pc1')
    plt.ylabel('pc2')

    print(principal.explained_variance_ratio_)

    # T-SNE
    from sklearn.manifold import TSNE
    from numpy import reshape
    import seaborn as sns

    tsne = TSNE(n_components=3, verbose=1, random_state=42)
    z = tsne.fit_transform(X)

    df = pd.DataFrame()
    df["y"] = y
    df["comp-1"] = z[:,0]
    df["comp-2"] = z[:,1]
    df["comp-3"] = z[:,2]
    sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                    palette=sns.color_palette("husl", 2),
                    data=df).set(title="Diabetes data T-SNE projection")
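
    One detail of the LASSO step may be worth checking before relying on its grid search. `np.arange(start, stop, step)` with a step of 500 between 0.00001 and 10 produces a single value, so the search has only one candidate alpha. The sketch below shows this, together with a log-spaced grid as one possible alternative; the alternative grid and its bounds are an assumption for illustration, not part of the original notebook.

    ```python
    import numpy as np

    # The grid as written in the description: step 500 exceeds the whole range,
    # so only the start value is generated.
    grid = np.arange(0.00001, 10, 500)

    # A hypothetical alternative: seven alphas spaced evenly on a log scale
    # from 1e-5 to 10.
    log_grid = np.logspace(-5, 1, num=7)

    print(len(grid), len(log_grid))
    ```

    With a single candidate, `best_params_` will always report that candidate, which matches the hard-coded `alpha = 0.00001` used afterwards.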

  5. Convert Text to Pandas

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Cite
    Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
    Explore at:
    Available download formats: zip (4333134 bytes)
    Dataset updated
    Sep 22, 2024
    Authors
    Zeyad Usf
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    kaggle notebook
    Github Repo

    I found two datasets about converting text with context to pandas code on Hugging Face, but the challenge is in the context. The context in both datasets is different which reduces the results of the model. First let's mention the data I found and then show examples, solution and some other problems.

    • Rahima411/text-to-pandas:

      • The data is divided into Train with 57.5k and Test with 19.2k.

      • The data has two columns as you can see in the example:

        • "Input": Contains the context and the question together, in the context it shows the metadata about the data frame.
        • "Pandas Query": Pandas code

          Input                                                                 | Pandas Query
          ----------------------------------------------------------------------|-------------------------------------------
          Table Name: head (age (object), head_id (object))                     | result = management['head.age'].unique()
          Table Name: management (head_id (object), temporary_acting (object))  |
          What are the distinct ages of the heads who are acting?               |

    • hiltch/pandas-create-context:

      • It contains 17k rows with three columns:
        • question : text .
        • context : Code to create a data frame with column names, unlike the first data set which contains the name of the data frame, column names and data type.
        • answer : Pandas code.
          question           |            context             |       answer 
    ----------------------------------------|--------------------------------------------------------|---------------------------------------
    What was the lowest # of total votes?  | df = pd.DataFrame(columns=['_number_of_total_votes']) | df['_number_of_total_votes'].min()   
    

    As you can see, the problem is that the two datasets differ in their inputs and in the structure of the context. My solution to this problem was:

    - Convert the first dataset so that its context looks like the second's. I chose this direction because it is difficult to get the data types for the columns in the second dataset. It was easy to convert the structure of the context from this shape `Table Name: head (age (object), head_id (object))` to this `head = pd.DataFrame(columns=['age','head_id'])` through this code that I wrote.
    - Then separate the question from the context. This was easy because, if you look at the data, you will find that the context always ends with ")" and then a blank and then the question. You will find all of this in this code.
    - You will also notice that the context can contain more than one table definition, and this has been engineered into the code.

    ```py
    import re

    def extract_table_creation(text: str) -> tuple[str, str]:
        """
        Extracts DataFrame creation statements and questions from the given text.

        Args:
          text (str): The input text containing table definitions and questions.

        Returns:
          tuple: A tuple containing a concatenated DataFrame creation string and a question.
        """
        # Define patterns
        table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
        column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

        # Find all table names and column definitions
        matches = re.findall(table_pattern, text)

        # Initialize a list to hold DataFrame creation statements
        df_creations = []

        for table_name, columns_str in matches:
            # Extract column names
            columns = re.findall(column_pattern, columns_str)
            column_names = [col[0] for col in columns]

            # Format DataFrame creation statement
            df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
            df_creations.append(df_creation)

        # Concatenate all DataFrame creation statements, one per line
        df_creation_concat = '\n'.join(df_creations)

        # Extract and clean the question (everything after the last closing parenthesis)
        question = text[text.rindex(')')+1:].strip()

        return df_creation_concat, question
    ```

    After both datasets were similar in structure, they were merged into one set and divided into _72.8K_ train and _18.6K_ test. We analyzed this dataset and you can see it all through the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we found some problems in the dataset as well, such as
    > - `Answer` : `df['Id'].count()` has been repeated, but this is possible, so we do not need to dispense with these rows.
    > - `Context` : We see that it contains `147` rows that do not contain any text. We will see through the experiment whether this affects the results negatively or positively.
    > - `Question` : It is ...
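
    The two checks mentioned above (repeated answers and rows with an empty context) can be sketched with pandas as follows. This is a toy illustration only: the three rows are made up, and the lowercase column names `question`, `context`, and `answer` are assumed rather than taken from the merged dataset.

    ```python
    import pandas as pd

    # Made-up rows standing in for the merged dataset.
    merged = pd.DataFrame({
        "question": ["How many ids are there?", "Count the ids.", "What is the lowest age?"],
        "context": ["df = pd.DataFrame(columns=['Id'])", "", "df = pd.DataFrame(columns=['age'])"],
        "answer": ["df['Id'].count()", "df['Id'].count()", "df['age'].min()"],
    })

    # Rows whose context holds no text at all (empty or whitespace only).
    empty_context = merged["context"].str.strip().eq("")

    # Answers that occur more than once.
    answer_counts = merged["answer"].value_counts()
    repeated_answers = answer_counts[answer_counts > 1]

    print(int(empty_context.sum()), list(repeated_answers.index))
    ```

    Whether such rows should be dropped is a modelling decision; the sketch only locates them.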
    
  6. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow libraries. The split into train, validation and test set follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]

    Description

    The annotation file consists of the following fields:

    • filename: Name of the corresponding file (video or audio file)
    • dataset: Source dataset associated with the data point
    • captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    • captions_auditory: A list of captions related to the auditory content of the video
    • tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
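
    Because `captions_visual` and `tags` can be NaN, code that consumes the annotations may want to guard against missing values. The sketch below is one way to do that, under stated assumptions: it builds a two-row DataFrame in memory instead of reading a Parquet file, and the second row (dataset name, filename, caption) is invented for illustration.

    ```python
    import pandas as pd

    # In practice df would come from pd.read_parquet(...); here it is built by hand.
    # The first row mirrors the printed example, the second row is made up.
    df = pd.DataFrame({
        "dataset": ["AudioSet", "HypotheticalAudioOnly"],
        "filename": ["train/---2_BBVHAA.mp3", "train/hypothetical.mp3"],
        "captions_visual": [["a man in a black hat and glasses."], float("nan")],
        "captions_auditory": [["a man speaks and dishes clank."], ["rain falls."]],
        "tags": [["Speech"], float("nan")],
    })

    # Keep only the rows that actually have visual captions.
    has_visual = df["captions_visual"].notna()
    visual_rows = df[has_visual]

    # Count tags across files; explode() turns each list into one row per tag,
    # and value_counts() ignores the NaN entries.
    tag_counts = df["tags"].explode().value_counts()
    ```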

  7. Table4_Whole genome bisulfite sequencing reveals DNA methylation roles in...

    • figshare.com
    • frontiersin.figshare.com
    xlsx
    Updated Jun 13, 2023
    Cite
    Xiaodie Jie; Honglin Wu; Miao Yang; Ming He; Guangqing Zhao; Shanshan Ling; Yan Huang; Bisong Yue; Nan Yang; Xiuyue Zhang (2023). Table4_Whole genome bisulfite sequencing reveals DNA methylation roles in the adaptive response of wildness training giant pandas to wild environment.XLSX [Dataset]. http://doi.org/10.3389/fgene.2022.995700.s004
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Xiaodie Jie; Honglin Wu; Miao Yang; Ming He; Guangqing Zhao; Shanshan Ling; Yan Huang; Bisong Yue; Nan Yang; Xiuyue Zhang
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    DNA methylation modification can regulate gene expression without changing the genome sequence, which helps organisms to rapidly adapt to new environments. However, few studies have been reported in non-model mammals. Giant panda (Ailuropoda melanoleuca) is a flagship species for global biodiversity conservation. Wildness and reintroduction of giant pandas are the important content of giant pandas’ protection. However, it is unclear how wildness training affects the epigenetics of giant pandas, and we lack the means to assess the adaptive capacity of wildness training giant pandas. We comparatively analyzed genome-level methylation differences in captive giant pandas with and without wildness training to determine whether methylation modification played a role in the adaptive response of wildness training pandas. The whole genome DNA methylation sequencing results showed that genomic cytosine methylation ratio of all samples was 5.35%–5.49%, and the methylation ratio of the CpG site was the highest. Differential methylation analysis identified 544 differentially methylated genes (DMGs). The results of KEGG pathway enrichment of DMGs showed that VAV3, PLCG2, TEC and PTPRC participated in multiple immune-related pathways, and may participate in the immune response of wildness training giant pandas by regulating adaptive immune cells. A large number of DMGs enriched in GO terms may also be related to the regulation of immune activation during wildness training of giant pandas. Promoter differentially methylation analysis identified 1,199 genes with differential methylation at promoter regions. Genes with low methylation level at promoter regions and high expression such as, CCL5, P2Y13, GZMA, ANP32A, VWF, MYOZ1, NME7, MRPS31 and TPM1 were important in environmental adaptation for wildness training giant pandas. 
The methylation and expression patterns of these genes indicated that wildness training giant pandas have strong immunity, blood coagulation, athletic abilities and disease resistance. The adaptive response of giant pandas undergoing wildness training may be regulated by their negatively related promoter methylation. We are the first to describe the DNA methylation profile of giant panda blood tissue and our results indicated methylation modification is involved in the adaptation of captive giant pandas when undergoing wildness training. Our study also provided potential monitoring indicators for the successful reintroduction of valuable and threatened animals to the wild.

  8. Salaries case study

    • kaggle.com
    zip
    Updated Oct 2, 2024
    Cite
    Shobhit Chauhan (2024). Salaries case study [Dataset]. https://www.kaggle.com/datasets/satyam0123/salaries-case-study
    Explore at:
    Available download formats: zip (13105509 bytes)
    Dataset updated
    Oct 2, 2024
    Authors
    Shobhit Chauhan
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:

    Case Study: Employee Salary Analysis

    In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.

    Step 1: Data Collection and Preparation

    • Data Sources: The dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.).
    • Data Cleaning: We use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries.

    Step 2: Data Exploration and Descriptive Statistics

    • Exploratory Data Analysis (EDA): Using Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries.
    • Data Visualization: Leveraging tools like Matplotlib or Seaborn for visualizing salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) provides a visual representation of salary variations by department.

    Step 3: Analysis Using NumPy

    • Calculating Salary Ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles.
    • Correlation Analysis: Identify the relationship between variables such as experience and salary using NumPy to compute correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals if experience is a significant factor in salary determination.

    Step 4: Grouping and Aggregation

    • Salary by Department and Position: Using Pandas' groupby function, we can summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department.

    Step 5: Salary Forecasting (Optional)

    • Predictive Analysis: Using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings.

    Step 6: Insights and Recommendations

    • Outlier Identification: Detect any employees earning significantly more or less than the average, which could signal inequities or high performers.
    • Salary Discrepancies: Highlight any salary discrepancies between departments or gender that may require further investigation.
    • Compensation Planning: Based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization.

    Tools Used:

    • Pandas: For data manipulation, grouping, and descriptive analysis.
    • NumPy: For numerical operations such as percentiles and correlations.
    • Matplotlib/Seaborn: For data visualization to highlight key patterns and trends.
    • Scikit-learn (Optional): For building predictive models if salary forecasting is included in the analysis.

    This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy.
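
    Steps 1 through 4 above can be strung together in a few lines. The sketch below is a minimal illustration on a hand-made table; the six rows and their values are invented, and the column names simply follow the ones used in the examples above.

    ```python
    import numpy as np
    import pandas as pd

    # Invented data: one row has a missing salary and one row is an exact duplicate.
    df = pd.DataFrame({
        "department": ["Eng", "Eng", "Sales", "Sales", "Sales", "Eng"],
        "years_of_experience": [2, 8, 1, 5, 5, 4],
        "salary": [70.0, 120.0, 40.0, 65.0, 65.0, np.nan],
    })

    # Step 1: drop rows without a salary, then remove duplicate entries.
    clean = df.dropna(subset=["salary"]).drop_duplicates()

    # Step 2: descriptive statistics for the salary column.
    summary = clean["salary"].describe()

    # Step 3: quartiles and the experience/salary correlation coefficient.
    quartiles = np.percentile(clean["salary"], [25, 50, 75])
    corr = np.corrcoef(clean["years_of_experience"], clean["salary"])[0, 1]

    # Step 4: average salary per department.
    mean_by_dept = clean.groupby("department")["salary"].mean()
    ```

    Passing `subset=["salary"]` to `dropna` is a deliberate narrowing of the plain `df.dropna()` mentioned above, so that rows are only discarded when the salary itself is missing.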

  9. oldIT2modIT

    • huggingface.co
    Cite
    Massimo Romano, oldIT2modIT [Dataset]. https://huggingface.co/datasets/cybernetic-m/oldIT2modIT
    Explore at:
    Authors
    Massimo Romano
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Download the dataset

    At the moment, to download the dataset you should load it into a pandas DataFrame:

    import pandas as pd
    df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")

    You can visualize the dataset with:

    df.head()

    To convert it into a Hugging Face dataset:

    from datasets import Dataset
    dataset = Dataset.from_pandas(df)

    Dataset Description

    This is an Italian dataset formed by 200 old (ancient) Italian sentences and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.

  10. Data from: First Steps toward the Giant Panda Metabolome Database:...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Feb 7, 2020
    Cite
    Laghi, Luca; Zou, Likou; Zhang, Hemin; Wu, Daifu; Li, Caiwu; Zhang, Zhizhong; He, Yongguo; Huang, Yan; Zhu, Chenglin (2020). First Steps toward the Giant Panda Metabolome Database: Untargeted Metabolomics of Feces, Urine, Serum, and Saliva by 1H NMR [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000463094
    Explore at:
    Dataset updated
    Feb 7, 2020
    Authors
    Laghi, Luca; Zou, Likou; Zhang, Hemin; Wu, Daifu; Li, Caiwu; Zhang, Zhizhong; He, Yongguo; Huang, Yan; Zhu, Chenglin
    Description

    Differences in the concentration of metabolites in the biofluids of animals closely reflect their physiological diversities. In order to set the basis for a metabolomic atlas for giant panda (Ailuropoda melanoleuca), we characterized the metabolome of healthy giant panda feces (23), urine (16), serum (6), and saliva (4) samples by means of 1H NMR. A total of 107 metabolites and a core metabolome of 12 metabolites was quantified across the four biological matrices. Through univariate analysis followed by robust principal component analysis, we were able to describe how the molecular profile observed in giant panda urine and feces was affected by gender and age. Among the molecules modified by age in feces, fucose plays a peculiar role because it is related to the digestion of bamboo’s hemicellulose, which is considered as the main source of energy for giant panda. A metagenomic investigation directed toward this molecule showed that its concentration was indeed positively related to the two-component system pathway and negatively related to the amino sugar and nucleotide sugar metabolism pathway. Such work is meant to provide a robust framework for further -omics research studies on giant panda to accelerate our understanding of the interaction of giant panda with its natural environment.

  11. Table1_Immunological characterization of an Italian PANDAS cohort.docx

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Jan 4, 2024
    Cite
    Duse, Marzia; Guido, Cristiana Alessia; Carsetti, Rita; Loffredo, Lorenzo; Mortari, Eva Piano; Lorenzetti, Giulia; Förster-Waldl, Elisabeth; Zicari, Anna Maria; Leonardi, Lucia; Spalice, Alberto (2024). Table1_Immunological characterization of an Italian PANDAS cohort.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001272684
    Explore at:
    Dataset updated
    Jan 4, 2024
    Authors
    Duse, Marzia; Guido, Cristiana Alessia; Carsetti, Rita; Loffredo, Lorenzo; Mortari, Eva Piano; Lorenzetti, Giulia; Förster-Waldl, Elisabeth; Zicari, Anna Maria; Leonardi, Lucia; Spalice, Alberto
    Description

    This cross-sectional study aimed to contribute to the definition of Pediatric Autoimmune Neuropsychiatric Disorders Associated with Streptococcal Infections (PANDAS) pathophysiology. An extensive immunological assessment has been conducted to investigate both immune defects, potentially leading to recurrent Group A β-hemolytic Streptococcus (GABHS) infections, and immune dysregulation responsible for a systemic inflammatory state. Twenty-six PANDAS patients with relapsing-remitting course of disease and 11 controls with recurrent pharyngotonsillitis were enrolled. Each subject underwent a detailed phenotypic and immunological assessment including cytokine profile. A possible correlation of immunological parameters with clinical-anamnestic data was analyzed. No inborn errors of immunity were detected in either group, using first level immunological assessments. However, a trend toward higher TNF-alpha and IL-17 levels, and lower C3 levels, was detected in the PANDAS patients compared to the control group. Maternal autoimmune diseases were described in 53.3% of PANDAS patients and neuropsychiatric symptoms other than OCD and tics were detected in 76.9% patients. ASO titer did not differ significantly between the two groups. A possible correlation between enduring inflammation (elevated serum TNF-α and IL-17) and the persistence of neuropsychiatric symptoms in PANDAS patients beyond infectious episodes needs to be addressed. Further studies with larger cohorts would be pivotal to better define the role of TNF-α and IL-17 in PANDAS pathophysiology.

  12. Table5_Whole genome bisulfite sequencing reveals DNA methylation roles in...

    • figshare.com
    xlsx
    Updated Jun 13, 2023
    + more versions
    Cite
    Xiaodie Jie; Honglin Wu; Miao Yang; Ming He; Guangqing Zhao; Shanshan Ling; Yan Huang; Bisong Yue; Nan Yang; Xiuyue Zhang (2023). Table5_Whole genome bisulfite sequencing reveals DNA methylation roles in the adaptive response of wildness training giant pandas to wild environment.XLSX [Dataset]. http://doi.org/10.3389/fgene.2022.995700.s005
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Xiaodie Jie; Honglin Wu; Miao Yang; Ming He; Guangqing Zhao; Shanshan Ling; Yan Huang; Bisong Yue; Nan Yang; Xiuyue Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DNA methylation modification can regulate gene expression without changing the genome sequence, which helps organisms to rapidly adapt to new environments. However, few studies have been reported in non-model mammals. Giant panda (Ailuropoda melanoleuca) is a flagship species for global biodiversity conservation. Wildness and reintroduction of giant pandas are the important content of giant pandas’ protection. However, it is unclear how wildness training affects the epigenetics of giant pandas, and we lack the means to assess the adaptive capacity of wildness training giant pandas. We comparatively analyzed genome-level methylation differences in captive giant pandas with and without wildness training to determine whether methylation modification played a role in the adaptive response of wildness training pandas. The whole genome DNA methylation sequencing results showed that genomic cytosine methylation ratio of all samples was 5.35%–5.49%, and the methylation ratio of the CpG site was the highest. Differential methylation analysis identified 544 differentially methylated genes (DMGs). The results of KEGG pathway enrichment of DMGs showed that VAV3, PLCG2, TEC and PTPRC participated in multiple immune-related pathways, and may participate in the immune response of wildness training giant pandas by regulating adaptive immune cells. A large number of DMGs enriched in GO terms may also be related to the regulation of immune activation during wildness training of giant pandas. Promoter differentially methylation analysis identified 1,199 genes with differential methylation at promoter regions. Genes with low methylation level at promoter regions and high expression such as, CCL5, P2Y13, GZMA, ANP32A, VWF, MYOZ1, NME7, MRPS31 and TPM1 were important in environmental adaptation for wildness training giant pandas. 
The methylation and expression patterns of these genes indicated that wildness training giant pandas have strong immunity, blood coagulation, athletic abilities and disease resistance. The adaptive response of giant pandas undergoing wildness training may be regulated by their negatively related promoter methylation. We are the first to describe the DNA methylation profile of giant panda blood tissue and our results indicated methylation modification is involved in the adaptation of captive giant pandas when undergoing wildness training. Our study also provided potential monitoring indicators for the successful reintroduction of valuable and threatened animals to the wild.

  13. Stack Overflow tags

    • kaggle.com
    zip
    Updated Jan 6, 2021
    Cite
    Abid Ali Awan (2021). Stack Overflow tags [Dataset]. https://www.kaggle.com/datasets/kingabzpro/stack-overflow-tags/code
    Explore at:
    Available download formats: zip (273306 bytes)
    Dataset updated
    Jan 6, 2021
    Authors
    Abid Ali Awan
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    How can we tell what programming languages and technologies are used by the most people? How about what languages are growing and which are shrinking, so that we can tell which are most worth investing time in?

    One excellent source of data is Stack Overflow, a programming question and answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, we can get an approximate sense of how many people are using it. We're going to use open data from the Stack Exchange Data Explorer to examine how the relative popularity of languages like R, Python, Java and JavaScript has changed over time.

    Content

    Each Stack Overflow question has a tag, which marks a question to describe its topic or technology. For instance, there's a tag for languages like R or Python, and for packages like ggplot2 or pandas.

    We'll be working with a dataset with one observation for each tag in each year. The dataset includes both the number of questions asked in that tag in that year, and the total number of questions asked in that year.
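
    The two counts described above can be combined into a per-year share for each tag. The following is a minimal sketch in plain Python, not the dataset author's code: the field names (tag, year, number, year_total) and the sample values are assumptions made for illustration, since the listing does not show the actual column names.

```python
from collections import namedtuple

# Hypothetical record layout: one observation per tag per year.
# The field names and numbers are illustrative assumptions, not taken from the dataset.
TagYear = namedtuple("TagYear", ["tag", "year", "number", "year_total"])

rows = [
    TagYear("python", 2010, 200, 1000),
    TagYear("r", 2010, 50, 1000),
    TagYear("python", 2011, 450, 1500),
    TagYear("r", 2011, 90, 1500),
]


def add_fraction(records):
    """Return (tag, year, fraction) triples, where fraction = number / year_total."""
    return [(r.tag, r.year, r.number / r.year_total) for r in records]


fractions = add_fraction(rows)
for tag, year, fraction in fractions:
    print(f"{tag} {year}: {fraction:.3f}")
```

    Dividing by the yearly total, rather than comparing raw counts, keeps a tag from looking like it is growing only because the site as a whole received more questions that year.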

    Acknowledgements

    DataCamp

  14. PlotQA_V1

    • huggingface.co
    Updated Sep 22, 2025
    Cite
    Aryan Badkul (2025). PlotQA_V1 [Dataset]. https://huggingface.co/datasets/Abd223653/PlotQA_V1
    Explore at:
    Dataset updated
    Sep 22, 2025
    Authors
    Aryan Badkul
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Plotqa V1

    Dataset Description

    This dataset was uploaded from a pandas DataFrame.

    Dataset Structure

    Overview

    Total Examples: 5,733,893
    Total Features: 9
    Dataset Size: ~2805.4 MB
    Format: Parquet files
    Created: 2025-09-22 20:12:01 UTC

    Data Instances

    The dataset contains 5,733,893 rows and 9 columns.

    Data Fields

    image_index (int64): 0 null values (0.0%), Range: [0.00, 157069.00], Mean: 78036.26
    qid (object): 0 null values (0.0%)… See the full description on the dataset page: https://huggingface.co/datasets/Abd223653/PlotQA_V1.

  15. Patterns of genetic differentiation at MHC class I genes and microsatellites...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Oct 7, 2014
    + more versions
    Cite
    Ying Zhu; Qiu-Hong Wan; Bin Yu; Yun-Fa Ge; Shengguo Fang (2014). Patterns of genetic differentiation at MHC class I genes and microsatellites identify conservation units in the giant panda [Dataset]. http://doi.org/10.5061/dryad.2gt86
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 7, 2014
    Dataset provided by
    Zhejiang University
    Authors
    Ying Zhu; Qiu-Hong Wan; Bin Yu; Yun-Fa Ge; Shengguo Fang
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    China
    Description

    Background: Evaluating patterns of genetic variation is important to identify conservation units (i.e., evolutionarily significant units [ESUs], management units [MUs], and adaptive units [AUs]) in endangered species. While neutral markers could be used to infer population history, their application in the estimation of adaptive variation is limited. The capacity to adapt to various environments is vital for the long-term survival of endangered species. Hence, analysis of adaptive loci, such as the major histocompatibility complex (MHC) genes, is critical for conservation genetics studies. Here, we investigated 4 classical MHC class I genes (Aime-C, Aime-F, Aime-I, and Aime-L) and 8 microsatellites to infer patterns of genetic variation in the giant panda (Ailuropoda melanoleuca) and to further define conservation units. Results: Overall, we identified 24 haplotypes (9 for Aime-C, 1 for Aime-F, 7 for Aime-I, and 7 for Aime-L) from 218 individuals obtained from 6 populations of giant panda. We found that the Xiaoxiangling population had the highest genetic variation at microsatellites among the 6 giant panda populations and higher genetic variation at Aime-MHC class I genes than other larger populations (Qinling, Qionglai, and Minshan populations). Differentiation index (FST)-based phylogenetic and Bayesian clustering analyses for Aime-MHC-I and microsatellite loci both supported that most populations were highly differentiated. The Qinling population was the most genetically differentiated. Conclusions: The giant panda showed a relatively higher level of genetic diversity at MHC class I genes compared with endangered felids. Using all of the loci, we found that the 6 giant panda populations fell into 2 ESUs: Qinling and non-Qinling populations. We defined 3 MUs based on microsatellites: Qinling, Minshan-Qionglai, and Daxiangling-Xiaoxiangling-Liangshan. 
We also recommended 3 possible AUs based on MHC loci: Qinling, Minshan-Qionglai, and Daxiangling-Xiaoxiangling-Liangshan. Furthermore, we recommend that a captive breeding program be considered for the Qinling panda population.

  16. rag

    • huggingface.co
    Cite
    VIGNESH M, rag [Dataset]. https://huggingface.co/datasets/vicky3241/rag
    Explore at:
    Authors
    VIGNESH M
    Description

    import pandas as pd

      Example dataset with new columns
    

    data = [ { "title": "Pandas Library", "about": "Pandas is a Python library for data manipulation and analysis.", "procedure": "Install Pandas via pip, load data into DataFrames, clean and analyze data using built-in functions.", "content": """ Pandas provides data structures like Series and DataFrame for handling structured data. It supports indexing, slicing, aggregation, joining, and filtering… See the full description on the dataset page: https://huggingface.co/datasets/vicky3241/rag.

  17. books

    • huggingface.co
    Updated Apr 4, 2025
    + more versions
    Cite
    Washington Cunha (2025). books [Dataset]. https://huggingface.co/datasets/waashk/books
    Explore at:
    Dataset updated
    Apr 4, 2025
    Authors
    Washington Cunha
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset used in the paper "A thorough benchmark of automatic text classification: From traditional approaches to large language models" (https://github.com/waashk/atcBench). To guarantee the reproducibility of the obtained results, the dataset and its respective CV train-test partitions are available here. Each dataset contains the following files:

    data.parquet: pandas DataFrame with texts and associated encoded labels for each document. split_

  18. 🛌 Sleep patterns

    • kaggle.com
    zip
    Updated Jan 4, 2025
    Cite
    Alexander Kapturov (2025). 🛌 Sleep patterns [Dataset]. https://www.kaggle.com/datasets/kapturovalexander/sleep-patterns/data
    Explore at:
    Available download formats: zip (28271 bytes)
    Dataset updated
    Jan 4, 2025
    Authors
    Alexander Kapturov
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🛌🛌 This dataset contains information about various characteristics and indicators for a group of people.

    Each row represents data about an individual, and each column represents a specific characteristic or attribute.

    Description:

    1. age: Age in years.
    2. black: Equal to 1 if the individual is of Black race.
    3. case: Case identifier.
    4. clerical: Equal to 1 if the individual is a clerical worker.
    5. construc: Equal to 1 if the individual is a construction worker.
    6. educ: Years of schooling.
    7. earns74: Total earnings in 1974.
    8. gdhlth: Equal to 1 if the individual is in good or excellent health.
    9. inlf: Equal to 1 if the individual is in the labor force.
    10. leis1: Sleep time minus working time.
    11. leis2: Sleep time (including short naps) minus working time.
    12. leis3: Relaxation and leisure time minus working time.
    13. smsa: Equal to 1 if the individual lives in a Standard Metropolitan Statistical Area (SMSA).
    14. lhrwage: Natural logarithm of hourly wage.
    15. lothinc: Natural logarithm of other income, unless income is less than 0.
    16. male: Equal to 1 if the individual is male.
    17. marr: Equal to 1 if the individual is married.
    18. prot: Equal to 1 if the individual is Protestant.
    19. rlxall: Total relaxation and leisure time, including personal activities.
    20. selfe: Equal to 1 if the individual is self-employed.
    21. sleep: Minutes of sleep at night per week.
    22. slpnaps: Minutes of sleep, including naps, per week.
    23. south: Equal to 1 if the individual lives in the South.
    24. spsepay: Spousal wage income.
    25. spwrk75: Equal to 1 if the spouse works.
    26. totwrk: Minutes worked per week.
    27. union: Equal to 1 if the individual belongs to a labor union.
    28. worknrm: Minutes worked in the main job.
    29. workscnd: Minutes worked in the second job.
    30. exper: Work experience calculated as age - education - 6.
    31. yngkid: Equal to 1 if children under 3 years old are present.
    32. yrsmarr: Years married.
    33. hrwage: Hourly wage.
    34. agesq: Age squared.

    🏅 If you liked this dataset or downloaded it, please upvote it!👱‍♂️

  19. Pandas

    • kaggle.com
    zip
    Updated Feb 27, 2024
    Cite
    Shail_2604 (2024). Pandas [Dataset]. https://www.kaggle.com/datasets/shail2604/pandas/code
    Explore at:
    Available download formats: zip (1050 bytes)
    Dataset updated
    Feb 27, 2024
    Authors
    Shail_2604
    Description

    Dataset

    This dataset was created by Shail_2604

    Released under Other (specified in description)

    Contents

  20. Kretzoiarctos gen. nov., the Oldest Member of the Giant Panda Clade

    • plos.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Juan Abella; David M. Alba; Josep M. Robles; Alberto Valenciano; Cheyenn Rotgers; Raül Carmona; Plinio Montoya; Jorge Morales (2023). Kretzoiarctos gen. nov., the Oldest Member of the Giant Panda Clade [Dataset]. http://doi.org/10.1371/journal.pone.0048985
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Juan Abella; David M. Alba; Josep M. Robles; Alberto Valenciano; Cheyenn Rotgers; Raül Carmona; Plinio Montoya; Jorge Morales
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The phylogenetic position of the giant panda, Ailuropoda melanoleuca (Carnivora: Ursidae: Ailuropodinae), has been one of the most hotly debated topics by mammalian biologists and paleontologists during the last century. Based on molecular data, it is currently recognized as a true ursid, sister-taxon of the remaining extant bears, from which it would have diverged by the Early Miocene. However, from a paleobiogeographic and chronological perspective, the origin of the giant panda lineage has remained elusive due to the scarcity of the available Miocene fossil record. Until recently, the genus Ailurarctos from the Late Miocene of China (ca. 8–7 mya) was recognized as the oldest undoubted member of the Ailuropodinae, suggesting that the panda lineage might have originated from an Ursavus ancestor. The role of the purported ailuropodine Agriarctos, from the Miocene of Europe, in the origins of this clade has been generally dismissed due to the paucity of the available material. Here, we describe a new ailuropodine genus, Kretzoiarctos gen. nov., based on remains from two Middle Miocene (ca. 12–11 Ma) Spanish localities. A cladistic analysis of fossil and extant members of the Ursoidea confirms the inclusion of the new genus into the Ailuropodinae. Moreover, Kretzoiarctos precedes in time the previously-known, Late Miocene members of the giant panda clade from Eurasia (Agriarctos and Ailurarctos). The former can be therefore considered the oldest recorded member of the giant panda lineage, which has significant implications for understanding the origins of this clade from a paleobiogeographic viewpoint.

Cite
Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall

Shopping Mall

Explore at:
Available download formats: zip (22852 bytes)
Dataset updated
Dec 15, 2023
Authors
Anshul Pachauri
Description

Libraries Import:

Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

Data Loading and Exploration:

Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().

Univariate Analysis:

Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

Bivariate Analysis:

Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

Gender-Based Analysis:

Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.

Univariate Clustering:

Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.

Bivariate Clustering:

Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

Multivariate Clustering:

Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.

Result Saving:

Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
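
The elbow method mentioned in the clustering steps can be illustrated with a small sketch. This is not the notebook's code: it swaps scikit-learn's KMeans for a plain-Python, one-dimensional version of Lloyd's algorithm so that it runs without dependencies, and the income values and the helper name kmeans_1d are made up for illustration. The quantity tracked is the inertia (sum of squared distances to the nearest center), which is what an elbow plot shows against the number of clusters.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on 1-D data; returns (centers, inertia).

    Hypothetical helper for illustration, not scikit-learn's KMeans.
    """
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: (v - centers[i]) ** 2)
            clusters[nearest].append(v)
        # Update step: move each center to its cluster mean (keep it if the cluster is empty).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Inertia: sum of squared distances from each value to its nearest center.
    inertia = sum(min((v - c) ** 2 for c in centers) for v in values)
    return centers, inertia


# Made-up annual incomes (k$) with three visible groups.
incomes = [15, 16, 17, 18, 19, 54, 57, 60, 62, 63, 120, 126, 137]

inertias = {k: kmeans_1d(incomes, k)[1] for k in range(1, 6)}
for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

On data like this the inertia falls steeply up to three clusters and then levels off; that bend is the "elbow" used to pick the number of clusters.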
