13 datasets found

Convert Text to Pandas
kaggle.com
zip
Updated Sep 22, 2024
Cite
Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
Explore at:
Available download formats: zip (4333134 bytes)
Dataset updated
Sep 22, 2024
Authors
Zeyad Usf
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
kaggle notebook
Github Repo

I found two datasets on Hugging Face for converting text with context into pandas code, but the challenge is in the context. The two datasets structure the context differently, which hurts the model's results. First let's mention the data I found, then show examples, the solution, and some other problems.

Rahima411/text-to-pandas:

The data is divided into Train with 57.5k and Test with 19.2k.

The data has two columns as you can see in the example:

"Input": Contains the context and the question together, in the context it shows the metadata about the data frame.

"Pandas Query": Pandas code.

Input                                                       | Pandas Query
------------------------------------------------------------|-------------------------------------------
Table Name: head (age (object), head_id (object))           | result = management['head.age'].unique()
Table Name: management (head_id (object),                   |
temporary_acting (object))                                  |
What are the distinct ages of the heads who are acting?     |

hiltch/pandas-create-context:

It contains 17k rows with three columns:

question : text .

context : Code to create a data frame with column names, unlike the first data set which contains the name of the data frame, column names and data type.

answer : Pandas code.

question                                | context                                                | answer
----------------------------------------|--------------------------------------------------------|---------------------------------------
What was the lowest # of total votes?   | df = pd.DataFrame(columns=['_number_of_total_votes'])  | df['_number_of_total_votes'].min()

As you can see, the two datasets do not share an input format, and the structure of the context differs. My solution to this problem was:

- Convert the context of the first dataset to match the second. I chose this direction because it is difficult to recover the column data types for the second dataset. It was easy to convert the context from the form Table Name: head (age (object), head_id (object)) to the form head = pd.DataFrame(columns=['age','head_id']) using the code below.
- Then separate the question from the context. This was easy because, in the data, the context always ends with ")" followed by a space and then the question. The code below handles this as well.
- The context can describe more than one table, so more than one DataFrame creation line may be returned; the code accounts for this.

```py
import re

def extract_table_creation(text: str) -> tuple[str, str]:
    """
    Extracts DataFrame creation statements and questions from the given text.

    Args:
        text (str): The input text containing table definitions and questions.

    Returns:
        tuple: A tuple containing a concatenated DataFrame creation string and a question.
    """
    # Define patterns
    table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
    column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

    # Find all table names and column definitions
    matches = re.findall(table_pattern, text)

    # Initialize a list to hold DataFrame creation statements
    df_creations = []

    for table_name, columns_str in matches:
        # Extract column names
        columns = re.findall(column_pattern, columns_str)
        column_names = [col[0] for col in columns]

        # Format DataFrame creation statement
        df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
        df_creations.append(df_creation)

    # Concatenate all DataFrame creation statements, one per line
    df_creation_concat = '\n'.join(df_creations)

    # Extract and clean the question (everything after the last ")")
    question = text[text.rindex(')')+1:].strip()

    return df_creation_concat, question
```

After both datasets had the same structure, they were merged into one set and split into _72.8K_ train and _18.6K_ test rows. We analyzed this dataset, and you can see it all in the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we also found some problems in the dataset, such as:

> - `Answer` : `df['Id'].count()` is repeated, but this is plausible, so we do not need to drop these rows.
> - `Context` : It contains `147` rows with no text. We will see through the experiment whether this affects the results negatively or positively.
> - `Question` : It is ...
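
The two checks mentioned above (repeated answers and empty contexts) can be sketched without any dependencies. This is only an illustration: the lowercase column names `question`, `context`, and `answer`, the helper names, and the toy rows are assumptions, not values taken from the real dataset.

```python
from collections import Counter

# Toy rows in an assumed question/context/answer layout (illustrative only).
rows = [
    {"question": "How many ids are there?",
     "context": "df = pd.DataFrame(columns=['Id'])",
     "answer": "df['Id'].count()"},
    {"question": "Count the ids.",
     "context": "",
     "answer": "df['Id'].count()"},
    {"question": "What was the lowest # of total votes?",
     "context": "df = pd.DataFrame(columns=['_number_of_total_votes'])",
     "answer": "df['_number_of_total_votes'].min()"},
]

def empty_context_count(rows):
    """Count rows whose context is empty or whitespace only."""
    return sum(1 for row in rows if not row["context"].strip())

def repeated_answers(rows):
    """Return each answer that occurs more than once, with its count."""
    counts = Counter(row["answer"] for row in rows)
    return {answer: n for answer, n in counts.items() if n > 1}

print(empty_context_count(rows))
print(repeated_answers(rows))
```

Reporting the counts rather than dropping rows mirrors the description above, which keeps repeated answers and leaves the effect of empty contexts to be judged by experiment.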
oldIT2modIT
huggingface.co
Cite
Massimo Romano, oldIT2modIT [Dataset]. https://huggingface.co/datasets/cybernetic-m/oldIT2modIT
Explore at:
Authors
Massimo Romano
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Download the dataset

At the moment, to download the dataset you should use a pandas DataFrame:

    import pandas as pd
    df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")

You can visualize the dataset with:

    df.head()

To convert it into a Hugging Face dataset:

    from datasets import Dataset
    dataset = Dataset.from_pandas(df)

Dataset Description

This is an Italian dataset formed by 200 old (ancient) Italian sentences and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.
df-translate-data
huggingface.co
Updated Mar 12, 2025
+ more versions
Cite
dfap (2025). df-translate-data [Dataset]. https://huggingface.co/datasets/dfap/df-translate-data
Explore at:
Dataset updated
Mar 12, 2025
Authors
dfap
Description
dfap/df-translate-data dataset hosted on Hugging Face and contributed by the HF Datasets community
pandas-create-context
huggingface.co
Updated Jan 8, 2024
Cite
Or Hiltch (2024). pandas-create-context [Dataset]. https://huggingface.co/datasets/hiltch/pandas-create-context
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 8, 2024
Authors
Or Hiltch
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview

This dataset is built from sql-create-context, which itself builds on WikiSQL and Spider. I have used GPT4 to translate the SQL schema into pandas DataFrame schema initialization statements and to translate the SQL queries into pandas queries. There are 862 examples of natural language queries, pandas DataFrame creation statements, and pandas queries answering the question using the DataFrame creation statement as context. This dataset was built with text-to-pandas LLMs… See the full description on the dataset page: https://huggingface.co/datasets/hiltch/pandas-create-context.
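
Because each record pairs a DataFrame creation statement with a pandas query, a record can be sanity-checked without running it. The following is a minimal sketch using only the standard library `ast` module; the helper names `declared_frames` and `is_parsable` are hypothetical, and it assumes the context consists of simple `name = pd.DataFrame(columns=[...])` statements with literal column lists.

```python
import ast

def declared_frames(context: str) -> dict[str, list[str]]:
    """Map each DataFrame variable assigned in `context` to its column names."""
    frames = {}
    for node in ast.parse(context).body:
        # Only handle simple statements: name = <call>(columns=[...])
        if not (isinstance(node, ast.Assign) and isinstance(node.value, ast.Call)):
            continue
        target = node.targets[0]
        if not isinstance(target, ast.Name):
            continue
        for kw in node.value.keywords:
            if kw.arg == "columns":
                # literal_eval accepts an AST node holding a literal list
                frames[target.id] = list(ast.literal_eval(kw.value))
    return frames

def is_parsable(answer: str) -> bool:
    """Return True if `answer` is syntactically valid Python."""
    try:
        ast.parse(answer)
        return True
    except SyntaxError:
        return False

context = "df = pd.DataFrame(columns=['_number_of_total_votes'])"
answer = "df['_number_of_total_votes'].min()"
print(declared_frames(context))
print(is_parsable(answer))
```

Parsing instead of executing avoids running arbitrary strings from a dataset; it only checks syntax and declared columns, not whether the query answers the question.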
Data from: dataset-creation
huggingface.co
Updated Jul 23, 2025
Cite
uv scripts for HF Jobs (2025). dataset-creation [Dataset]. https://huggingface.co/datasets/uv-scripts/dataset-creation
Explore at:
Dataset updated
Jul 23, 2025
Dataset authored and provided by
uv scripts for HF Jobs
Description
Dataset Creation Scripts

Ready-to-run scripts for creating Hugging Face datasets from local files.

Available Scripts

📄 pdf-to-dataset.py

Convert directories of PDF files into Hugging Face datasets. Features:

- 📁 Uploads PDFs as dataset objects for flexible processing
- 🏷️ Automatic labeling from folder structure
- 🚀 Zero configuration - just point at your PDFs
- 📤 Direct upload to Hugging Face Hub

Usage:

Basic usage

uv run pdf-to-dataset.py /path/to/pdfs… See the full description on the dataset page: https://huggingface.co/datasets/uv-scripts/dataset-creation.
DataCollection
huggingface.co
Cite
Abuthahir, DataCollection [Dataset]. https://huggingface.co/datasets/Abu1998/DataCollection
Explore at:
Authors
Abuthahir
Description
from datasets import load_dataset

# Load the dataset from your Hugging Face Space
dataset = load_dataset("Abu1998/DataCollection", split="train")

# Convert to pandas DataFrame and save locally
df = dataset.to_pandas()
df.to_csv("appointments.csv", index=False)
google-gemini-3-pro-pre-release-model-card
huggingface.co
Updated Nov 18, 2025
Cite
Apolinário from multimodal AI art (2025). google-gemini-3-pro-pre-release-model-card [Dataset]. https://huggingface.co/datasets/multimodalart/google-gemini-3-pro-pre-release-model-card
Explore at:
Dataset updated
Nov 18, 2025
Authors
Apolinário from multimodal AI art
Description
Gemini 3 Pro Model Card

⚠️ This is a mirror of the pre-release model card that was officially published by Google at https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf and subsequently yanked. It will be superseded by the actual release version in a few hours (date of publication: Nov 18, 13:16 CET).

Model card published: November, 2025

Gemini 3 Pro - Model Card

Model Cards are intended to provide essential information on… See the full description on the dataset page: https://huggingface.co/datasets/multimodalart/google-gemini-3-pro-pre-release-model-card.
DocLayNet-base
huggingface.co
opendatalab.com
Updated Apr 27, 2023
+ more versions
Cite
Pierre Guillou (2023). DocLayNet-base [Dataset]. https://huggingface.co/datasets/pierreguillou/DocLayNet-base
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 27, 2023
Authors
Pierre Guillou
License
https://choosealicense.com/licenses/other/
Description
Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied on more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
TrainColpaliForInvoiceAuditor
huggingface.co
Updated Jan 30, 2025
Cite
Garg (2025). TrainColpaliForInvoiceAuditor [Dataset]. https://huggingface.co/datasets/deepak1pec/TrainColpaliForInvoiceAuditor
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 30, 2025
Authors
Garg
Description
Dataset Card for deepak1pec/TrainColpaliForInvoiceAuditor

Dataset Description

This dataset contains images converted from PDFs using the PDFs to Page Images Converter Space.

- Number of images: 14
- Number of PDFs processed: 3
- Sample size per PDF: 100
- Created on: 2025-01-30 22:09:35

Dataset Creation

Source Data

The images in this dataset were generated from user-uploaded PDF files.

Processing Steps

PDF files were uploaded to the PDFs to… See the full description on the dataset page: https://huggingface.co/datasets/deepak1pec/TrainColpaliForInvoiceAuditor.
govdocs1-pdf-source
huggingface.co
Cite
BEEspoke Data, govdocs1-pdf-source [Dataset]. https://huggingface.co/datasets/BEE-spoke-data/govdocs1-pdf-source
Explore at:
Dataset authored and provided by
BEEspoke Data
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
govdocs1: source PDF files

Note: Converted versions of other document types (word, txt, etc) are available in this repo

This is ~220,000 open-access PDF documents (about 6.6M pages) from the dataset govdocs1. It wants to be OCR'd.

- Uploaded as tar file pieces of ~10 GiB each due to size/file count limits, with an index.csv covering details
- 5,000 randomly sampled PDFs are available unarchived in sample/. Hugging Face supports previewing these in-browser, for example this one… See the full description on the dataset page: https://huggingface.co/datasets/BEE-spoke-data/govdocs1-pdf-source.
54k-resume
huggingface.co
Updated Nov 18, 2024
+ more versions
Cite
Suriya ganesh (2024). 54k-resume [Dataset]. https://huggingface.co/datasets/Suriyaganesh/54k-resume
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 18, 2024
Authors
Suriya ganesh
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
This dataset is aggregated from sources such as

- https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset
- https://github.com/YanyuanSu/Resume-Corpus
- https://github.com/florex/resume_corpus.git
- etc.

Entirely available in the public domain. Resumes are usually in PDF format. OCR was used to convert the PDFs into text, and LLMs were used to convert the data into a structured format.

Dataset Overview

This dataset contains structured information extracted from professional resumes… See the full description on the dataset page: https://huggingface.co/datasets/Suriyaganesh/54k-resume.
shark_attacks_cleaned
huggingface.co
Cite
Omry Nadiv, shark_attacks_cleaned [Dataset]. https://huggingface.co/datasets/Omrynadiv/shark_attacks_cleaned
Explore at:
Authors
Omry Nadiv
Description
# === Shark Attacks EDA – Visual Display Only ===
# Run in Colab or HuggingFace Notebook
# Make sure you have cleaned_shark_attacks.csv in the same folder.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- Load ---
df = pd.read_csv("cleaned_shark_attacks.csv")

# --- Clean basics ---
# Convert Date → Year
df["Year"] = pd.to_datetime(df["Date"], errors="coerce").dt.year

# Normalize Sex
df["Sex"] =… See the full description on the dataset page: https://huggingface.co/datasets/Omrynadiv/shark_attacks_cleaned.
olmOCR-bench
huggingface.co
Updated Nov 29, 2025
Cite
Ai2 (2025). olmOCR-bench [Dataset]. https://huggingface.co/datasets/allenai/olmOCR-bench
Explore at:
Dataset updated
Nov 29, 2025
Dataset provided by
Allen Institute for AI (http://allenai.org/)
Authors
Ai2
License
https://choosealicense.com/licenses/odc-by/
Description
olmOCR-bench

olmOCR-bench is a dataset of 1,403 PDF files, plus 7,010 unit test cases that capture properties of the output that a good OCR system should have. This benchmark evaluates the ability of OCR systems to accurately convert PDF documents to markdown format while preserving critical textual and structural information. Quick links:

📃 Paper 🛠️ Code 🎮 Demo

Table 1. Distribution of Test Classes by Document Source

Document Source Text Present Text… See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmOCR-bench.

