MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
I found two datasets on Hugging Face for converting text plus context into pandas code, but the challenge lies in the context: it is structured differently in each dataset, which hurts model performance. Let's first introduce the data I found, then look at examples, the solution, and some other problems.
Rahima411/text-to-pandas:
The data is split into a train set with 57.5k examples and a test set with 19.2k.
It has two columns, as you can see in the example:
```txt
Input                                                      | Pandas Query
-----------------------------------------------------------|-------------------------------------------
Table Name: head (age (object), head_id (object))          | result = management['head.age'].unique()
Table Name: management (head_id (object),                  |
  temporary_acting (object))                               |
What are the distinct ages of the heads who are acting?    |
```

hiltch/pandas-create-context:

```txt
question                                | context                                                | answer
----------------------------------------|--------------------------------------------------------|---------------------------------------
What was the lowest # of total votes?   | df = pd.DataFrame(columns=['_number_of_total_votes'])  | df['_number_of_total_votes'].min()
```
As you can see, the problem is that the two datasets do not share the same input format and the structure of the context differs. My solution to this was:
- Convert the context of the first dataset into the second dataset's format. I chose this direction because recovering the column data types for the second dataset would be difficult, whereas converting the context from `Table Name: head (age (object), head_id (object))` to `head = pd.DataFrame(columns=['age','head_id'])` is straightforward with the code I wrote below.
- Then separate the question from the context. This was easy because, if you look at the data, the context always ends with ")", followed by whitespace and then the question. You will find all of this in the code below.
- You will also notice that a context can contain more than one table, so more than one DataFrame creation line may be returned; this has been engineered into the code.
```py
import re

def extract_table_creation(text: str) -> tuple[str, str]:
    """
    Extracts DataFrame creation statements and the question from the given text.

    Args:
        text (str): The input text containing table definitions and a question.

    Returns:
        tuple: A concatenated DataFrame creation string and the question.
    """
    # Define patterns for table definitions and typed columns
    table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
    column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

    # Find all table names and column definitions
    matches = re.findall(table_pattern, text)

    # Build one DataFrame creation statement per table
    df_creations = []
    for table_name, columns_str in matches:
        # Extract column names (dropping the dtype)
        columns = re.findall(column_pattern, columns_str)
        column_names = [col[0] for col in columns]
        # Format DataFrame creation statement
        df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
        df_creations.append(df_creation)

    # Concatenate all DataFrame creation statements
    df_creation_concat = '\n'.join(df_creations)

    # The question follows the last closing parenthesis of the table definitions
    question = text[text.rindex(')') + 1:].strip()
    return df_creation_concat, question
```
Once both datasets shared the same structure, they were merged into a single set and split into _72.8K_ train and _18.6K_ test examples (a minimal sketch of this merge-and-split step follows the list below). We analyzed the merged dataset, and you can see the full analysis in the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we also found some problems in the dataset, such as:
> - `Answer`: `df['Id'].count()` is repeated many times, but it is a legitimate answer, so we do not need to drop these rows.
> - `Context`: it contains `147` rows with no text at all. We will see through the experiment whether this affects the results negatively or positively.
> - `Question`: it is ...
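For illustration, here is a minimal sketch of the merge-and-split step, assuming both datasets have already been normalized to the same `question` / `context` / `answer` columns. The file names and the 80/20 split ratio are assumptions for the example, not values taken from the notebook.

```py
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file names; both files are assumed to already have
# question / context / answer columns after the conversion step above.
rahima_df = pd.read_csv("text_to_pandas_converted.csv")
hiltch_df = pd.read_csv("pandas_create_context.csv")

merged = pd.concat([rahima_df, hiltch_df], ignore_index=True)

# Split into train / test (the exact ratio is an assumption here)
train_df, test_df = train_test_split(merged, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))
```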
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This project was developed to create a dataframe with the updated list of the best videogames of all time according to the Metacritic website.
To collect the data, I created a Python script that uses Selenium and pandas. You can access it on my GitHub 👇 - Github
The data was collected from the Metacritic website. - Metacritic
quadrat.scale.data: refer to the R script ("Dwyer_&_Laughlin_2017_Trait_covariance_script.r") for information about this dataframe.
species.in.quadrat.scale.data: refer to the R script ("Dwyer_&_Laughlin_2017_Trait_covariance_script.r") for information about this dataframe.
Dwyer_&_Laughlin_2017_Trait_covariance_script: this script reads in the two dataframes of "raw" data, calculates diversity and trait metrics, and runs the major analyses presented in Dwyer & Laughlin 2017.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PandasPlotBench
PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of a Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the Matplotlib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
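As a quick illustration, the dataset can be pulled from the Hub with the `datasets` library. A minimal sketch follows; the split name is an assumption, so check the dataset card for the actual schema.

```py
from datasets import load_dataset

# Load PandasPlotBench from the Hugging Face Hub
# (split name is an assumption; see the dataset card for the actual splits)
ds = load_dataset("JetBrains-Research/PandasPlotBench", split="test")
print(ds[0])  # inspect one task: plotting instruction plus DataFrame description
```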
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is built from sql-create-context, which itself builds on WikiSQL and Spider. I have used GPT-4 to translate the SQL schema into pandas DataFrame schema initialization statements and to translate the SQL queries into pandas queries. There are 862 examples of natural language queries, pandas DataFrame creation statements, and pandas queries answering the question using the DataFrame creation statement as context. This dataset was built with text-to-pandas LLMs… See the full description on the dataset page: https://huggingface.co/datasets/hiltch/pandas-create-context.
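For reference, a minimal sketch of loading the dataset and inspecting its `question` / `context` / `answer` columns (the split name is an assumption):

```py
from datasets import load_dataset

# Load the 862-example dataset from the Hub (split name is an assumption)
ds = load_dataset("hiltch/pandas-create-context", split="train")
example = ds[0]
print(example["question"], example["context"], example["answer"], sep="\n")
```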
This cross-sectional study aimed to contribute to the definition of Pediatric Autoimmune Neuropsychiatric Disorders Associated with Streptococcal Infections (PANDAS) pathophysiology. An extensive immunological assessment has been conducted to investigate both immune defects, potentially leading to recurrent Group A β-hemolytic Streptococcus (GABHS) infections, and immune dysregulation responsible for a systemic inflammatory state. Twenty-six PANDAS patients with relapsing-remitting course of disease and 11 controls with recurrent pharyngotonsillitis were enrolled. Each subject underwent a detailed phenotypic and immunological assessment including cytokine profile. A possible correlation of immunological parameters with clinical-anamnestic data was analyzed. No inborn errors of immunity were detected in either group, using first level immunological assessments. However, a trend toward higher TNF-alpha and IL-17 levels, and lower C3 levels, was detected in the PANDAS patients compared to the control group. Maternal autoimmune diseases were described in 53.3% of PANDAS patients and neuropsychiatric symptoms other than OCD and tics were detected in 76.9% of patients. ASO titer did not differ significantly between the two groups. A possible correlation between enduring inflammation (elevated serum TNF-α and IL-17) and the persistence of neuropsychiatric symptoms in PANDAS patients beyond infectious episodes needs to be addressed. Further studies with larger cohorts would be pivotal to better define the role of TNF-α and IL-17 in PANDAS pathophysiology.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here we present three datasets describing three large European landscapes in France (Bauges Geopark - 89,000 ha), Poland (Milicz forest district - 21,000 ha) and Slovenia (Snežnik forest - 4,700 ha) down to the tree level. Individual trees were generated combining inventory plot data, vegetation maps and Airborne Laser Scanning (ALS) data. Together, these landscapes (hereafter virtual landscapes) cover more than 100,000 ha including about 64,000 ha of forest and consist of more than 42 million trees of 51 different species.

For each virtual landscape we provide a table (in .csv format) with the following columns:
- cellID25: the unique ID of each 25x25 m² cell
- sp: species latin names
- n: number of trees. n is an integer >= 1, meaning that a specific set of species "sp", diameter "dbh" and height "h" can be present multiple times in a cell.
- dbh: tree diameter at breast height (cm)
- h: tree height (m)

We also provide, for each virtual landscape, a raster (in .asc format) with the cell IDs (cellID25) which makes data spatialisation possible. The coordinate reference systems are EPSG: 2154 for the Bauges, EPSG: 2180 for Milicz, and EPSG: 3912 for Sneznik. The v2.0.0 presents the algorithm in its final state. Finally, we provide a proof of how our algorithm makes it possible to reach the total BA and the BA proportion of broadleaf trees provided by the ALS mapping using the alpha correction coefficient and how it maintains the Dg ratios observed on the field plots between the different species (see algorithm presented in the associated Open Research Europe article).

Below is an example of R code that opens the datasets and creates a tree density map.

```r
# load packages
library(terra)
library(dplyr)

setwd() # define path to the I-MAESTRO_data folder

tree <- read.csv2('./sneznik/sneznik_trees.csv', sep = ',')
cellID <- rast('./sneznik/sneznik_cellID25.asc')
cellIDdf <- as.data.frame(cellID)
colnames(cellIDdf) <- 'cellID25'

dens <- tree %>% group_by(cellID25) %>% summarise(n = sum(n))
dens <- left_join(cellIDdf, dens, join_by(cellID25))
cellID$dens <- dens$n
plot(cellID$dens)
```
polyOne Data Set
The data set contains 100 million hypothetical polymers, each with 29 properties predicted using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The files start with polyOne_*.parquet.
I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.
Load the sharded data set with dask:

```python
import dask.dataframe as dd

ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```

For example, compute the description of the data set:

```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
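For illustration, a minimal sketch of reading a few PSMILES strings from these text files (one string per line, as noted above):

```python
# Read the first few PSMILES strings; the full training file has 80 million lines
with open("generated_polymer_smiles_train.txt") as f:
    psmiles = [next(f).strip() for _ in range(5)]
print(psmiles)
```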
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

Annotation

The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test set follows the split of the original datasets.

Installation

pip install pandas pyarrow

Example

```python
import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
```

```txt
dataset                                        AudioSet
filename                          train/---2_BBVHAA.mp3
captions_visual     [a man in a black hat and glasses.]
captions_auditory      [a man speaks and dishes clank.]
tags                                           [Speech]
```

Description

The annotation file consists of the following fields:
- filename: Name of the corresponding file (video or audio file)
- dataset: Source dataset associated with the data point
- captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
- captions_auditory: A list of captions related to the auditory content of the video
- tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

Data files

The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
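As a small usage sketch based on the field names above, rows without visual captions can be filtered out like this (the NaN check is an assumption about how missing captions are stored):

```python
import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')

# Keep only rows that have visual captions (captions_visual may be NaN)
with_visual = df[df['captions_visual'].notna()]
print(len(with_visual), "clips with visual captions")
```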
This dataset contains two types of data for a combinatorial library containing approximately 12,000 compounds that we call ZEL031. This library was designed with a specific target in mind, but we are not currently explicitly disclosing the target. Phenotype imaging data was collected from A549 cells and transcriptome data was collected from HEK293T cells. Transcriptome data was collected from 2 different arrays, and phenotype data was collected from 3 different arrays. Imaging data is provided as 1 "plain" HDF5 file per array, and each HDF5 file is accompanied by 2 CSV files which describe the first 2 dimensions of the "images" dataset in the HDF5 file; the last two dimensions are image height and image width. Transcriptome counts data is provided in an AnnData-format HDF5 file, and the metadata is stored in the .obs DataFrame. The source device for the transcriptome data is available in the device_id column of the .obs DataFrame.
In total, there are approximately 64,000 transcriptome observations and 28,000 phenotype observations.
All observations are described by 5 columns of metadata: control_rx_id, bb1_id, bb2_id, residual_linker, and censored. The censored column indicates that the chemical identity information for that well has been hidden, and all the values for the other columns ending in _id should be -1. About 20% of the data has been censored. For the remaining 80% of the data, a combination of the columns ending in _id can be used to look up the associated chemical perturbation in the smiles.csv file. For the imaging data, these columns can be found in the _dim_0_metadata.csv files, along with a physical_well_id column which identifies the source of the imaging data. Some wells are imaged more than once during data collection. For those wells, both sets of images are included and share a value in the physical_well_id column.
Each device's imaging data includes 2 extra files, {device}.segmentations.h5 and {device}.beadlocations.csv which contain some cell segmentation data and the locations of beads within the images in {device}.h5, respectively.
The segmentations are present in their HDF5 as an "images" dataset, but with only 3 dimensions (n_obs x H x W). A positional value of zero indicates that the pixel was not detected as a cell, and all pixels with each unique non-zero positive integer correspond to one cell; integer values are unique only within a single slice along the first dimension of data. Despite some weak spots, these maps have served us well for QC purposes.
Bead detections in the associated CSV have 4 columns: hdf5_dim_0_index, cx, cy, and radius.
- hdf5_dim_0_index maps to the first dimension of the images/segmentations HDF5s
- cy maps to the second-to-last dimension of the images/segmentations HDF5s
- cx maps to the last dimension of the images/segmentations HDF5s
- radius is the radius of a (rough) circle centered at cy, cx
There are certainly artifacts present which we'd prefer to avoid (some we're aware of include images being out-of-focus or perturbation delivery beads moving from their expected locations) and we look forward to sharing improved datasets in the future. Nonetheless, we've found some interesting patterns in these data and we'd be absolutely delighted to learn of any interesting patterns you can find (whether artifacts or biological patterns)!
| filename | MD5 hash |
|---|---|
| 4020_4021_cens.h5ad | fd4e5d843443813cd86f7aa058052ac9 |
| ZS26_dim_0_metadata.csv | 9d2d46e28e8321b4d5478a95e77d5c7a |
| ZS26_dim_1_metadata.csv | f5b35dcd381a40a6a188e6c2aec5b9be |
| ZS26.h5 | 2e97081c4003960bdf5b8ecf882aa3dc |
| ZS26.beadlocations.csv | ec557c2449b1c4f934670ab135876705 |
| ZS26.segmentations.h5 | efc0862557a3d709d59f1cfad713115b |
| ZS27_dim_0_metadata.csv | f9b8475a47e4f741563b474035b02499 |
| ZS27_dim_1_metadata.csv | f5b35dcd381a40a6a188e6c2aec5b9be |
| ZS27.h5 | 8647a7800893f7cdb3013a8449374860 |
| ZS27.beadlocations.csv | 1965c349977d69eca833f8705d229df9 |
| ZS27.segmentations.h5 | dc25b2e859aafe9816c4c3936057795b |
| ZS30_dim_0_metadata.csv | fb8f9c51b8bc6fe1a5b86e4611dac85d |
| ZS30_dim_1_metadata.csv | f5b35dcd381a40a6a188e6c2aec5b9be |
| ZS30.h5 | 20c2bc56ce149198e9c9b60c6b669861 |
| ZS30.beadlocations.csv | 095b63bb7ffa98558fb09fddc4593bf9 |
| ZS30.segmentations.h5 | e12d5d1cac8bbeb9c925d688544b5b26 |
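Using the file names from the table above, here is a minimal sketch of opening one imaging array plus its row metadata, and the AnnData transcriptome file. The HDF5 layout beyond the "images" dataset and the exact .obs columns follow the description above, so treat them as assumptions to verify.

```python
import h5py
import pandas as pd
import anndata as ad

# Imaging data: the first two dimensions of "images" are described by the two
# metadata CSVs; the last two dimensions are image height and width.
with h5py.File("ZS26.h5", "r") as f:
    images = f["images"]
    print(images.shape)

dim0_meta = pd.read_csv("ZS26_dim_0_metadata.csv")  # per-observation metadata, incl. physical_well_id
dim1_meta = pd.read_csv("ZS26_dim_1_metadata.csv")

# Transcriptome counts: AnnData HDF5, well metadata in .obs (incl. device_id)
adata = ad.read_h5ad("4020_4021_cens.h5ad")
print(adata.obs[["control_rx_id", "bb1_id", "bb2_id", "censored"]].head())
```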
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Included files: Each file includes LFP (local field potential) data for both animals (‘h’, ‘y’) during a particular type of task control (‘bmi’ or ‘manual’), time-locked to 500 ms before or after a particular event in the task (‘go_cue’ or ‘target’), for each rewarded trial in each day of the task (‘h’: [1-13], ‘y’: [1-22]).

File description: Each file includes a Pandas DataFrame, saved as a .feather file. Data can be accessed using Python by calling:

```python
import pandas as pd
pd.read_feather([file name])
```

Each DataFrame has the following columns:
- control_type: ‘bmi’, ‘manual’, or ‘baseline’
- event: go cue (‘go_cue’) or target acquisition (‘target’)
- subj: which animal, ‘h’ or ‘y’
- day: which day of the session, ‘h’: [1-13], ‘y’: [1-22]
- roi: region of interest; ‘direct’, ‘dlpfc’, or ‘cd’, where ‘direct’ includes most channels from m1 each day but is specific to channels which had sufficient spiking to be used as input to the BMI decoder
- ch: electrode channel number, only low-noise channels were included (see Methods for details)
- n_rewarded_trial: which trial number the data segment is from; only successfully completed (rewarded) trials are included
- time_from_window_ms: for go_cue, 0-500 ms from go cue; for target, -500-0 ms from target acquisition
- lfp: local field potential value (see Methods for details)
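As a small usage sketch assuming the columns above (the file name here is hypothetical; substitute one of the provided .feather files):

```python
import pandas as pd

# Hypothetical file name; substitute the actual .feather file
df = pd.read_feather("lfp_bmi_go_cue.feather")

# Average LFP trace for one animal and region of interest, per time bin
trace = (df[(df["subj"] == "h") & (df["roi"] == "direct")]
         .groupby("time_from_window_ms")["lfp"].mean())
print(trace.head())
```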
Description
Dataframe containing 2075 French books in txt format (= the ~2600 French books present in Gutenberg from which all books by authors present in the french_books_summuries dataset have been removed to avoid any leaks). More precisely:
- the texte column contains the texts
- the titre column contains the book title
- the auteur column contains the author's name and dates of birth and death (if you want to filter the texts to keep only those from the given century to the present… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/french_books.
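A minimal sketch of loading the dataset and peeking at the columns described above (the split name is an assumption):

```python
from datasets import load_dataset

# Load the French books dataset from the Hub (split name is an assumption)
ds = load_dataset("CATIE-AQ/french_books", split="train")
print(ds.column_names)              # expected: texte, titre, auteur
print(ds[0]["titre"], ds[0]["auteur"])
```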
The klib library lets us quickly visualize missing data, perform data cleaning, plot data distributions, plot correlations and visualize categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).
Original Github repo
```python
!pip install klib

import klib
import pandas as pd

df = pd.DataFrame(data)

# klib.describe functions for visualizing datasets
klib.cat_plot(df)         # returns a visualization of the number and frequency of categorical features
klib.corr_mat(df)         # returns a color-encoded correlation matrix
klib.corr_plot(df)        # returns a color-encoded heatmap, ideal for correlations
klib.dist_plot(df)        # returns a distribution plot for every numeric feature
klib.missingval_plot(df)  # returns a figure containing information about missing values
```
Take a look at this starter notebook.
Further examples, as well as applications of the functions can be found here.
Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This ANU Data Commons collection contains two master data sets associated with the manuscript accepted for publication in the Functional Ecology journal, detailed below.
Bryant C, Harris RJ, Brothers N, Bone C, Walsh N, Nicotra AB, Ball MC 2024, Cross tolerance: salinity gradients and dehydration increase photosynthetic heat tolerance in mangrove leaves Functional Ecology (Accepted article)
Each xlsx file contains a Metadata sheet describing headers, units of measurement, and data form (numeric, character, factor), followed by a single master dataframe sheet.
For full methods used for experimental data collection, see the methods described in the publication.
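For illustration, a minimal sketch of reading one of these workbooks with pandas; the file name and sheet layout here are hypothetical, so check each file's Metadata sheet for the actual structure:

```python
import pandas as pd

# Hypothetical file name; each workbook has a Metadata sheet plus one master dataframe sheet
meta = pd.read_excel("master_dataset_1.xlsx", sheet_name="Metadata")
data = pd.read_excel("master_dataset_1.xlsx", sheet_name=1)  # second sheet assumed to be the master dataframe
print(meta.head())
print(data.dtypes)
```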
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Fenic 0.4.0 API Documentation Dataset
Dataset Description
This dataset contains comprehensive API documentation for Fenic 0.4.0, a PySpark-inspired DataFrame framework designed for building production AI and agentic applications. The dataset provides structured information about all public and private API elements, including modules, classes, functions, methods, and attributes.
Dataset Summary
Fenic is a DataFrame framework that combines traditional data… See the full description on the dataset page: https://huggingface.co/datasets/typedef-ai/fenic-0.4.0-codebase.
Libraries Import: Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

Data Loading and Exploration: Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().

Univariate Analysis: Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

Bivariate Analysis: Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

Gender-Based Analysis: Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.

Univariate Clustering: Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.

Bivariate Clustering: Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender' (a condensed sketch of this step is shown after this section).

Multivariate Clustering: Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.

Result Saving: Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
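As an illustration of the bivariate clustering step, here is a condensed sketch assuming the column names from Mall_Customers.csv described above; the choice of k=5 follows the description, while the elbow-loop range and random_state are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("Mall_Customers.csv")
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# Elbow method: inertia for k = 1..10 (range is an assumption)
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
           for k in range(1, 11)]
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('number of clusters')
plt.ylabel('inertia')
plt.show()

# Fit the 5-cluster model and attach labels, as described above
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
df['Spending and Income Cluster'] = km.labels_
print(pd.crosstab(df['Spending and Income Cluster'], df['Gender'], normalize='index'))
df.to_csv("Result.csv", index=False)
```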
Pediatric autoimmune neuropsychiatric disorders associated with streptococcal infections (PANDAS) are clinical conditions characterized by the sudden onset of obsessive–compulsive disorder and/or tics, often accompanied by other behavioral symptoms in a group of children with streptococcal infection. PANDAS-related disorders, including pediatric acute-onset neuropsychiatric syndrome (PANS), childhood acute neuropsychiatric symptoms (CANS), and pediatric infection triggered autoimmune neuropsychiatric disorders (PITANDs), have also been described. Since first defined in 1998, PANDAS has been considered a controversial diagnosis. A comprehensive review of the literature was performed on PubMed and Scopus databases, searching for diagnostic criteria and diagnostic procedures of PANDAS and related disorders. We propose a test panel to support clinicians in the work-up of PANDAS/PANS patients establishing an appropriate treatment. However, further studies are needed to improve our knowledge on these acute-onset neuropsychiatric conditions.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PlotQA V1
Dataset Description
This dataset was uploaded from a pandas DataFrame.
Dataset Structure
Overview
Total Examples: 5,733,893
Total Features: 9
Dataset Size: ~2805.4 MB
Format: Parquet files
Created: 2025-09-22 20:12:01 UTC
Data Instances
The dataset contains 5,733,893 rows and 9 columns.
Data Fields
image_index (int64): 0 null values (0.0%), Range: [0.00, 157069.00], Mean: 78036.26
qid (object): 0 null values (0.0%)… See the full description on the dataset page: https://huggingface.co/datasets/Abd223653/PlotQA_V1.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sulfolobus acidocaldarius oxygen experiment data. Includes both GDGT and growth information. The experiment described as 0.22 oxygen concentration is the serial-transfer 0.2% O2 experiment.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is generated synthetically to create tables with the following characteristics:

- Empty cell percentage in the range [0, 30]
- There is a clear separator between rows and columns (Structured).
- 4 <= num rows <= 10, 2 <= num columns <= 6 (Small)
Load the dataset
```python
import io
import pandas as pd
from PIL import Image

def bytes_to_image(self, image_bytes: bytes):
    return Image.open(io.BytesIO(image_bytes))

def parse_annotations(self, annotations: str) -> pd.DataFrame:
    …
```

See the full description on the dataset page: https://huggingface.co/datasets/nanonets/small_dense_structured_table.
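A minimal usage sketch, assuming the dataset stores the table image as bytes in an "image" column (the column name and split are assumptions; check the dataset card for the actual schema):

```python
import io
from PIL import Image
from datasets import load_dataset

# Column name "image" and split "train" are assumptions
ds = load_dataset("nanonets/small_dense_structured_table", split="train")
row = ds[0]
img = Image.open(io.BytesIO(row["image"]))  # mirrors the bytes_to_image helper above
print(img.size)
```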