MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
I found two datasets on Hugging Face for converting text with context into pandas code, but the challenge lies in the context. The two datasets structure their context differently, which hurts the model's results. First I will introduce the data I found, then show examples, the solution, and some remaining problems.
Rahima411/text-to-pandas:
The data is split into a train set with 57.5k rows and a test set with 19.2k rows.
The data has two columns, as you can see in the example:
Input                                                      | Pandas Query
-----------------------------------------------------------|-------------------------------------------
Table Name: head (age (object), head_id (object))          | result = management['head.age'].unique()
Table Name: management (head_id (object),                  |
temporary_acting (object))                                 |
What are the distinct ages of the heads who are acting?    |

hiltch/pandas-create-context:
question | context | answer
----------------------------------------|--------------------------------------------------------|---------------------------------------
What was the lowest # of total votes? | df = pd.DataFrame(columns=['_number_of_total_votes']) | df['_number_of_total_votes'].min()
As you can see, the problem is that the two datasets do not share an input format: the context is structured differently in each. My solution was:
- Convert the context of the first dataset to match the second. I chose this direction because it is difficult to recover the column data types for the second dataset. Converting the context from the form `Table Name: head (age (object), head_id (object))` to `head = pd.DataFrame(columns=['age', 'head_id'])` was easy with the code I wrote.
- Then separate the question from the context. This was easy because, if you look at the data, the context always ends with ")" followed by a blank and then the question.
- More than one DataFrame creation line can be returned for the context (one per table), and the code handles this.
You will find all of this in this code.
```py
import re


def extract_table_creation(text: str) -> tuple[str, str]:
    """
    Extracts DataFrame creation statements and questions from the given text.

    Args:
        text (str): The input text containing table definitions and questions.

    Returns:
        tuple: A tuple containing a concatenated DataFrame creation string and a question.
    """
    # Define patterns
    table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
    column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

    # Find all table names and column definitions
    matches = re.findall(table_pattern, text)

    # Build one DataFrame creation statement per table
    df_creations = []
    for table_name, columns_str in matches:
        # Extract column names (the dtype in parentheses is dropped)
        columns = re.findall(column_pattern, columns_str)
        column_names = [col[0] for col in columns]
        # Format DataFrame creation statement
        df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
        df_creations.append(df_creation)

    # Concatenate all DataFrame creation statements, one per line
    df_creation_concat = '\n'.join(df_creations)

    # The question is everything after the last closing parenthesis
    question = text[text.rindex(')') + 1:].strip()
    return df_creation_concat, question
```
Once both datasets had the same structure, they were merged into one set and split into _72.8K_ train and _18.6K_ test rows. We analyzed this dataset, and you can see the full analysis in the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**. We also found some problems in the dataset, such as:
> - `Answer` : `df['Id'].count()` is repeated, but repeated answers are legitimate, so we do not need to drop these rows.
> - `Context` : It contains `147` rows with no text. We will see through experiment whether this affects the results negatively or positively.
> - `Question` : It is ...
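
A minimal sketch of the first two checks above (repeated answers and empty contexts), written in plain Python on a few made-up rows rather than on the real merged dataset. The row contents and the names `rows`, `repeated_answers` and `empty_context_rows` are assumptions for illustration only:

```python
from collections import Counter

# Hypothetical rows in the merged (question, context, answer) layout.
rows = [
    {"question": "How many ids are there?",
     "context": "df = pd.DataFrame(columns=['Id'])",
     "answer": "df['Id'].count()"},
    {"question": "Count the ids.",
     "context": "df = pd.DataFrame(columns=['Id'])",
     "answer": "df['Id'].count()"},
    {"question": "What is the smallest value?",
     "context": "   ",
     "answer": "df['value'].min()"},
]

# Answers that occur more than once (legitimate, so they are only reported).
answer_counts = Counter(row["answer"] for row in rows)
repeated_answers = {answer: n for answer, n in answer_counts.items() if n > 1}

# Rows whose context is empty or whitespace only.
empty_context_rows = [row for row in rows if not row["context"].strip()]

print(repeated_answers)
print(len(empty_context_rows))
```

Counting rather than dropping keeps the decision about empty contexts open, which matches the plan to judge their effect by experiment.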
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is built from sql-create-context, which itself builds on WikiSQL and Spider. I have used GPT-4 to translate the SQL schema into pandas DataFrame schema initialization statements and to translate the SQL queries into pandas queries. There are 862 examples of natural language queries, pandas DataFrame creation statements, and pandas queries answering the question using the DataFrame creation statement as context. This dataset was built with text-to-pandas LLMs… See the full description on the dataset page: https://huggingface.co/datasets/hiltch/pandas-create-context.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PandasPlotBench
PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of the Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
Libraries Import:
Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.
Data Loading and Exploration:
Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().
Univariate Analysis:
Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.
Bivariate Analysis:
Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.
Gender-Based Analysis:
Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.
Univariate Clustering:
Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.
Bivariate Clustering:
Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.
Multivariate Clustering:
Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.
Result Saving:
Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
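
The elbow step used in the univariate clustering can be sketched as follows. This is only an illustration under stated assumptions: the incomes are synthetic (not read from Mall_Customers.csv), and a small pure-Python Lloyd's iteration stands in for scikit-learn's KMeans so the example runs without extra packages. The names `kmeans_1d`, `incomes` and `inertias` are hypothetical.

```python
import random


def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on 1-D data; returns (centers, inertia)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: pick k points spread evenly over the sorted data.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: (v - centers[i]) ** 2)
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    # Inertia: sum of squared distances to the nearest center.
    inertia = sum(min((v - c) ** 2 for c in centers) for v in values)
    return centers, inertia


# Synthetic annual incomes (k$) drawn around three made-up group means.
rng = random.Random(0)
incomes = [rng.gauss(mu, 5) for mu in (20, 60, 100) for _ in range(30)]

# Elbow method: compute the inertia for a range of k and look for the bend.
inertias = {k: kmeans_1d(incomes, k)[1] for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On data like this the inertia drops sharply up to the true number of groups and flattens afterwards; that bend is what the elbow plot is read for.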
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.
Annotation
The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow libraries. The split into train, validation and test set follows the split of the original datasets.
Installation
    pip install pandas pyarrow
Example
    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]
Description
The annotation file consists of the following fields:
filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided
Data files
The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Download the dataset
At the moment, to download the dataset you should use a pandas DataFrame:
    import pandas as pd
    df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")
You can visualize the dataset with:
    df.head()
To convert it into a Hugging Face dataset:
    from datasets import Dataset
    dataset = Dataset.from_pandas(df)
Dataset Description
This is an Italian dataset formed by 200 old (ancient) Italian sentences and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DNA methylation modification can regulate gene expression without changing the genome sequence, which helps organisms to rapidly adapt to new environments. However, few studies have been reported in non-model mammals. Giant panda (Ailuropoda melanoleuca) is a flagship species for global biodiversity conservation. Wildness and reintroduction of giant pandas are the important content of giant pandas’ protection. However, it is unclear how wildness training affects the epigenetics of giant pandas, and we lack the means to assess the adaptive capacity of wildness training giant pandas. We comparatively analyzed genome-level methylation differences in captive giant pandas with and without wildness training to determine whether methylation modification played a role in the adaptive response of wildness training pandas. The whole genome DNA methylation sequencing results showed that genomic cytosine methylation ratio of all samples was 5.35%–5.49%, and the methylation ratio of the CpG site was the highest. Differential methylation analysis identified 544 differentially methylated genes (DMGs). The results of KEGG pathway enrichment of DMGs showed that VAV3, PLCG2, TEC and PTPRC participated in multiple immune-related pathways, and may participate in the immune response of wildness training giant pandas by regulating adaptive immune cells. A large number of DMGs enriched in GO terms may also be related to the regulation of immune activation during wildness training of giant pandas. Promoter differentially methylation analysis identified 1,199 genes with differential methylation at promoter regions. Genes with low methylation level at promoter regions and high expression such as, CCL5, P2Y13, GZMA, ANP32A, VWF, MYOZ1, NME7, MRPS31 and TPM1 were important in environmental adaptation for wildness training giant pandas. 
The methylation and expression patterns of these genes indicated that wildness training giant pandas have strong immunity, blood coagulation, athletic abilities and disease resistance. The adaptive response of giant pandas undergoing wildness training may be regulated by their negatively related promoter methylation. We are the first to describe the DNA methylation profile of giant panda blood tissue and our results indicated methylation modification is involved in the adaptation of captive giant pandas when undergoing wildness training. Our study also provided potential monitoring indicators for the successful reintroduction of valuable and threatened animals to the wild.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
How can we tell what programming languages and technologies are used by the most people? How about what languages are growing and which are shrinking, so that we can tell which are most worth investing time in?
One excellent source of data is Stack Overflow, a programming question and answer site with more than 16 million questions on programming topics. By measuring the number of questions about each technology, we can get an approximate sense of how many people are using it. We're going to use open data from the Stack Exchange Data Explorer to examine how the relative popularity of languages like R, Python, Java and JavaScript has changed over time.
Each Stack Overflow question has a tag, which marks a question to describe its topic or technology. For instance, there's a tag for languages like R or Python, and for packages like ggplot2 or pandas.
We'll be working with a dataset with one observation for each tag in each year. The dataset includes both the number of questions asked in that tag in that year, and the total number of questions asked in that year.
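
Since the dataset carries both the per-tag count and the yearly total, the natural quantity to compare over time is a tag's share of that year's questions. A small sketch with made-up counts; the numbers and the names `observations` and `share` are hypothetical and not taken from the real data:

```python
# Hypothetical observations: (year, tag, questions with this tag, all questions that year).
observations = [
    (2015, "r", 40, 1000),
    (2015, "pandas", 10, 1000),
    (2016, "r", 60, 1200),
    (2016, "pandas", 30, 1200),
]

# Share of each tag among all questions asked in the same year.
share = {(year, tag): number / year_total
         for year, tag, number, year_total in observations}

for (year, tag), fraction in sorted(share.items()):
    print(year, tag, f"{fraction:.3f}")
```

Raw counts alone can rise simply because the site as a whole grows; dividing by the yearly total removes that effect.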
DataCamp
Differences in the concentration of metabolites in the biofluids of animals closely reflect their physiological diversities. In order to set the basis for a metabolomic atlas for giant panda (Ailuropoda melanoleuca), we characterized the metabolome of healthy giant panda feces (23), urine (16), serum (6), and saliva (4) samples by means of 1H NMR. A total of 107 metabolites and a core metabolome of 12 metabolites was quantified across the four biological matrices. Through univariate analysis followed by robust principal component analysis, we were able to describe how the molecular profile observed in giant panda urine and feces was affected by gender and age. Among the molecules modified by age in feces, fucose plays a peculiar role because it is related to the digestion of bamboo’s hemicellulose, which is considered as the main source of energy for giant panda. A metagenomic investigation directed toward this molecule showed that its concentration was indeed positively related to the two-component system pathway and negatively related to the amino sugar and nucleotide sugar metabolism pathway. Such work is meant to provide a robust framework for further -omics research studies on giant panda to accelerate our understanding of the interaction of giant panda with its natural environment.
This dataset was created by Shail_2604
Released under Other (specified in description)
configs:
- config_name: default
  data_files: "metadata.csv"
  delimiter: "|"
  header: 1
  names: ["Id", "Raw text", "normalised"]
… See the full description on the dataset page: https://huggingface.co/datasets/Chithekitale/eyaaaa.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The auction dataset is a very small dataset (19 items) created for the sole purpose of learning the pandas library.
The auction dataset contains 5 columns:
1. Item: Describes the item being sold.
2. Bidding Price: The price at which bidding for the item starts.
3. Selling Price: The amount for which the item was sold.
4. Calls: The number of times the item's value was raised or decreased by the customer.
5. Bought By: The customer who bought the item.
Note: There are missing values, which we will try to fill. Some values might not make sense once we make those imputations, but this notebook is for the sole purpose of learning.
https://choosealicense.com/licenses/unknown/
Pandas GitHub Issues
This dataset contains 5,000 GitHub issues collected from the pandas-dev/pandas repository. It includes issue metadata, content, labels, user information, timestamps, and comments.
The dataset is suitable for text classification, multi-label classification, and document retrieval tasks.
Dataset Structure
Columns:
id — Internal ID of the issue (int64)
number — GitHub issue number (int64)
title — Title of the issue (string)
state — Issue… See the full description on the dataset page: https://huggingface.co/datasets/cicboy/pandas-issues.
https://creativecommons.org/publicdomain/zero/1.0/
Each row represents data about an individual, and each column represents a specific characteristic or attribute.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ManiSkill Panda PickCube Dataset
This dataset contains robot demonstrations for pick-and-place tasks using a Franka Panda robot in the ManiSkill simulation environment.
Dataset Description
This dataset was collected using the ManiSkill PickCube-v1 environment and converted to LeRobot format for training Vision-Language-Action (VLA) models, specifically optimized for the pi0 architecture.
Task Description
The robot needs to pick up a cube and place it in a designated… See the full description on the dataset page: https://huggingface.co/datasets/dancher00/maniskill-panda-pickcube.
This cross-sectional study aimed to contribute to the definition of Pediatric Autoimmune Neuropsychiatric Disorders Associated with Streptococcal Infections (PANDAS) pathophysiology. An extensive immunological assessment has been conducted to investigate both immune defects, potentially leading to recurrent Group A β-hemolytic Streptococcus (GABHS) infections, and immune dysregulation responsible for a systemic inflammatory state. Twenty-six PANDAS patients with relapsing-remitting course of disease and 11 controls with recurrent pharyngotonsillitis were enrolled. Each subject underwent a detailed phenotypic and immunological assessment including cytokine profile. A possible correlation of immunological parameters with clinical-anamnestic data was analyzed. No inborn errors of immunity were detected in either group, using first level immunological assessments. However, a trend toward higher TNF-α and IL-17 levels, and lower C3 levels, was detected in the PANDAS patients compared to the control group. Maternal autoimmune diseases were described in 53.3% of PANDAS patients and neuropsychiatric symptoms other than OCD and tics were detected in 76.9% of patients. ASO titer did not differ significantly between the two groups. A possible correlation between enduring inflammation (elevated serum TNF-α and IL-17) and the persistence of neuropsychiatric symptoms in PANDAS patients beyond infectious episodes needs to be addressed. Further studies with larger cohorts would be pivotal to better define the role of TNF-α and IL-17 in PANDAS pathophysiology.
This dataset was created by mnijhuis
Released under Other (specified in description)
View details of Board Import Data of Panda Game Manufacturing Asia Limited Supplier to US with product description, price, date, quantity, major US ports, countries and more.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
PandaBench
PandaBench is a comprehensive benchmark for evaluating Large Language Model (LLM) safety, focusing on jailbreak attacks, defense mechanisms, and evaluation methodologies.
The PandaGuard framework architecture illustrating the end-to-end pipeline for LLM safety evaluation. The system connects three key components: Attackers, Defenders, and Judges.
Dataset Description
This repository contains the benchmark results from extensive evaluations of various LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Beijing-AISI/panda-bench.