Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
What Can Pandas Do?
Pandas gives you answers about the data. Like:
Is there a correlation between two or more columns?
What is average value?
Max value?
Min value?
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
It is a dataset with notebook kind of learning. Download the whole package and you will find everything to learn basics to advanced pandas which is exactly what you will need in machine learning and in data science. 😄
This will gives you the overview and data analysis tools in pandas that is mostly required in the data manipulation and extraction important data.
Use this notebook as notes for pandas. whenever you forget the code or syntax open it and scroll through it and you will find the solution. 🥳
Facebook
Twitterhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Chengdu Research Base of Gaint Panda Breeding
Facebook
Twitterhttp://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Nimra hafeez
Released under Database: Open Database, Contents: Database Contents
Facebook
TwitterThis dataset was created by Sahil
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by _anxious
Released under CC0: Public Domain
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
I found two datasets about converting text with context to pandas code on Hugging Face, but the challenge is in the context. The context in both datasets is different which reduces the results of the model. First let's mention the data I found and then show examples, solution and some other problems.
Rahima411/text-to-pandas:
The data is divided into Train with 57.5k and Test with 19.2k.
The data has two columns as you can see in the example:
txt
Input | Pandas Query
-----------------------------------------------------------|-------------------------------------------
Table Name: head (age (object), head_id (object)) | result = management['head.age'].unique()
Table Name: management (head_id (object), |
temporary_acting (object)) |
What are the distinct ages of the heads who are acting? |hiltch/pandas-create-context:
question | context | answer
----------------------------------------|--------------------------------------------------------|---------------------------------------
What was the lowest # of total votes? | df = pd.DataFrame(columns=['_number_of_total_votes']) | df['_number_of_total_votes'].min()
As you can see, the problem with this data is that they are not similar as inputs and the structure of the context is different . My solution to this problem was:
- Convert the first data set to become like the second in the context. I chose this because it is difficult to get the data type for the columns in the second data set. It was easy to convert the structure of the context from this shape Table Name: head (age (object), head_id (object)) to this head = pd.DataFrame(columns=['age','head_id']) through this code that I wrote.
- Then separate the question from the context. This was easy because if you look at the data, you will find that the context always ends with "(" and then a blank and then the question.
You will find all of this in this code.
- You will also notice that more than one code or line can be returned to the context, and this has been engineered into the code.
```py
def extract_table_creation(text:str)->(str,str):
"""
Extracts DataFrame creation statements and questions from the given text.
Args:
text (str): The input text containing table definitions and questions.
Returns:
tuple: A tuple containing a concatenated DataFrame creation string and a question.
"""
# Define patterns
table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
column_pattern = r'(\w+)\s*\((object|int64|float64)\)'
# Find all table names and column definitions
matches = re.findall(table_pattern, text)
# Initialize a list to hold DataFrame creation statements
df_creations = []
for table_name, columns_str in matches:
# Extract column names
columns = re.findall(column_pattern, columns_str)
column_names = [col[0] for col in columns]
# Format DataFrame creation statement
df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
df_creations.append(df_creation)
# Concatenate all DataFrame creation statements
df_creation_concat = '
'.join(df_creations)
# Extract and clean the question
question = text[text.rindex(')')+1:].strip()
return df_creation_concat, question
After both datasets were similar in structure, they were merged into one set and divided into _72.8K_ train and _18.6K_ test. We analyzed this dataset and you can see it all through the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we found some problems in the dataset as well, such as
> - `Answer` : `df['Id'].count()` has been repeated, but this is possible, so we do not need to dispense with these rows.
> - `Context` : We see that it contains `147` rows that do not contain any text. We will see Through the experiment if this will affect the results negatively or positively.
> - `Question` : It is ...
Facebook
TwitterThis dataset was created by NG NM WT
Plotting
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by jedike
Released under MIT
Facebook
TwitterThis dataset was created by Amir Raja
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of three one-dimensional datasets, each containing a series of values. One-dimensional data is characterized by its simplicity, making it an ideal starting point for those new to data analysis and manipulation. With the power of Pandas Series, you can perform a wide range of operations and functions to gain insights and derive valuable information from these datasets.
Facebook
TwitterThis dataset was created by BHARGAV NATH
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
I have taught many students to use Pandas. Often, many lacked context to apply their newly acquired skills. This dataset will help new learners work on their Pandas skills.
This dataset contains 13 columns and 6889 rows. The data is at a unique customer level. Each customers transaction amount and number of transactions information is present in a separate column (or unpivoted). Also, the data contains its first and last transaction date.
To be added.
I was inspired by creating contextual questions that will help students learn Pandas faster.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
"A clean and comprehensive Foodpanda dataset with 6000 records, featuring customer demographics, orders, payments, ratings, and delivery details — ideal for analyzing customer behavior, sales trends, and churn analysis
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
If you need a practice dataset to improve your skills then use this dataset.
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2681031%2F9dfecc01e0d719b732e69389b592de91%2Fpython%20panda2.jpg?generation=1599246568164860&alt=media" alt="">
I created this dataset for my notebook Getting started with Dictionary and Pandas. To help people improve their dictionary and panda skills. https://www.kaggle.com/brendan45774/getting-started-with-dictionary-and-pandas
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset is used to practice Pandas for beginners
This dataset is presented with some errors which is needed to be fixed. You can use this dataset to practice: Cleaning NaN values with basic Pandas techniques.
I have this dataset from w3school
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by AyanStark
Released under MIT
Facebook
TwitterThis dataset includes data that is provided in the Udemy course "Data Analysis with Pandas and Python" by Boris Paskhaver.
Facebook
TwitterThis dataset was created by Aditya Bhushan
Facebook
TwitterThis dataset was created by Bertille Pagès
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
What Can Pandas Do?
Pandas gives you answers about the data. Like:
Is there a correlation between two or more columns?
What is average value?
Max value?
Min value?