13 datasets found
  1. Convert Text to Pandas

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Cite
    Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
    Explore at:
    Available download formats: zip (4333134 bytes)
    Dataset updated
    Sep 22, 2024
    Authors
    Zeyad Usf
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    kaggle notebook
    Github Repo

    I found two datasets on Hugging Face for converting text with context to pandas code, but the challenge is in the context: the two datasets format it differently, which hurts the model's results. First let's look at the two datasets and their examples (a loading sketch follows them), then the solution and some other problems.

    • Rahima411/text-to-pandas:

      • The data is split into 57.5k train and 19.2k test examples.

      • The data has two columns, as you can see in the example:

        • "Input": contains the context and the question together; the context holds the metadata about the data frame.
        • "Pandas Query": the pandas code.

        Input                                                    | Pandas Query
        ---------------------------------------------------------|------------------------------------------
        Table Name: head (age (object), head_id (object))       | result = management['head.age'].unique()
        Table Name: management (head_id (object),                |
        temporary_acting (object))                               |
        What are the distinct ages of the heads who are acting?  |
    • hiltch/pandas-create-context:

      • It contains 17k rows with three columns:
        • question: the natural-language question text.
        • context: code that creates a data frame with column names only, unlike the first dataset, whose context gives the data frame name, column names, and data types.
        • answer: the pandas code.

        question                               | context                                                | answer
        ---------------------------------------|--------------------------------------------------------|------------------------------------
        What was the lowest # of total votes?  | df = pd.DataFrame(columns=['_number_of_total_votes'])  | df['_number_of_total_votes'].min()
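
    Both datasets are hosted on the Hugging Face Hub, so a minimal loading sketch could look like this (assuming the standard `datasets` API; the exact split layout of the second dataset is an assumption):

    ```py
    from datasets import load_dataset

    # Rahima411/text-to-pandas: "Input" / "Pandas Query" columns, 57.5k train / 19.2k test
    text_to_pandas = load_dataset("Rahima411/text-to-pandas")

    # hiltch/pandas-create-context: question / context / answer columns
    create_context = load_dataset("hiltch/pandas-create-context")

    print(text_to_pandas)
    print(create_context)
    ```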
    

    As you can see, the problem with this data is that the two datasets do not share the same input format: the structure of the context is different. My solution to this problem was:

    - Convert the first dataset so that its context matches the second. I chose this direction because it is difficult to recover the column data types for the second dataset. It was easy to convert the context from this shape, `Table Name: head (age (object), head_id (object))`, to this, `head = pd.DataFrame(columns=['age','head_id'])`, through the code below.
    - Then separate the question from the context. This was easy because, if you look at the data, the context always ends with ")", then a blank, then the question.
    - Note also that the context may define more than one table; the code handles this by emitting one creation statement per table.

    ```py
    import re


    def extract_table_creation(text: str) -> tuple[str, str]:
        """
        Extracts DataFrame creation statements and the question from the given text.

        Args:
          text (str): The input text containing table definitions and a question.

        Returns:
          tuple: A concatenated DataFrame-creation string and the question.
        """
        # Pattern for "Table Name: <name> (<col> (<dtype>), ...)" blocks
        table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
        # Pattern for individual "<col> (<dtype>)" pairs
        column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

        # Find all table names and column definitions
        matches = re.findall(table_pattern, text)

        # Build one DataFrame creation statement per table
        df_creations = []
        for table_name, columns_str in matches:
            # Extract column names (the dtype is matched but not used)
            columns = re.findall(column_pattern, columns_str)
            column_names = [col[0] for col in columns]

            # Format the DataFrame creation statement
            df_creations.append(f"{table_name} = pd.DataFrame(columns={column_names})")

        # Concatenate all DataFrame creation statements, one per line
        df_creation_concat = '\n'.join(df_creations)

        # The question is whatever follows the last ")"
        question = text[text.rindex(')') + 1:].strip()

        return df_creation_concat, question
    ```
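    As a quick sanity check, running the function on an input shaped like the example row above (hand-built here, so the exact whitespace of real rows may differ) gives:

    ```py
    text = (
        "Table Name: head (age (object), head_id (object)) "
        "Table Name: management (head_id (object), temporary_acting (object)) "
        "What are the distinct ages of the heads who are acting?"
    )

    context, question = extract_table_creation(text)
    print(context)
    # head = pd.DataFrame(columns=['age', 'head_id'])
    # management = pd.DataFrame(columns=['head_id', 'temporary_acting'])
    print(question)
    # What are the distinct ages of the heads who are acting?
    ```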
    After both datasets had the same structure, they were merged into one set and split into _72.8K_ train and _18.6K_ test. We analyzed this dataset, and you can see it all in the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we also found some problems in the dataset, such as:
    > - `Answer` : `df['Id'].count()` is repeated many times, but that is plausible, so we do not need to drop these rows.
    > - `Context` : it contains `147` rows with no text at all. We will see through the experiments whether this affects the results negatively or positively.
    > - `Question` : It is ...
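
    A minimal sketch of how these checks might look, assuming the merged set is loaded as a DataFrame with `Context`, `Question`, and `Answer` columns (the file name below is hypothetical):

    ```py
    import pandas as pd

    # Hypothetical local export of the merged train split
    df = pd.read_csv("merged_text_to_pandas_train.csv")

    # How often is each answer repeated? (df['Id'].count() tops the list)
    print(df["Answer"].value_counts().head())

    # How many contexts contain no text at all? (147 in the analysis above)
    empty_context = df["Context"].fillna("").str.strip() == ""
    print(empty_context.sum())
    ```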
    
  2. oldIT2modIT

    • huggingface.co
    Cite
    Massimo Romano, oldIT2modIT [Dataset]. https://huggingface.co/datasets/cybernetic-m/oldIT2modIT
    Explore at:
    Authors
    Massimo Romano
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Download the dataset

    At the moment, to download the dataset you should use a pandas DataFrame:

      import pandas as pd
      df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")

    You can visualize the dataset with:

      df.head()

    To convert it into a Hugging Face dataset:

      from datasets import Dataset
      dataset = Dataset.from_pandas(df)

      Dataset Description
    

    This is an Italian dataset formed by 200 old (ancient) Italian sentences and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.

  3. df-translate-data

    • huggingface.co
    Updated Mar 12, 2025
    Cite
    dfap (2025). df-translate-data [Dataset]. https://huggingface.co/datasets/dfap/df-translate-data
    Explore at:
    Dataset updated
    Mar 12, 2025
    Authors
    dfap
    Description

    dfap/df-translate-data dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. pandas-create-context

    • huggingface.co
    Updated Jan 8, 2024
    Cite
    Or Hiltch (2024). pandas-create-context [Dataset]. https://huggingface.co/datasets/hiltch/pandas-create-context
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2024
    Authors
    Or Hiltch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset is built from sql-create-context, which itself builds from WikiSQL and Spider. I have used GPT-4 to translate the SQL schemas into pandas DataFrame schema initialization statements and to translate the SQL queries into pandas queries. There are 862 examples of natural-language queries, pandas DataFrame creation statements, and pandas queries answering the question using the DataFrame creation statement as context. This dataset was built with text-to-pandas LLMs… See the full description on the dataset page: https://huggingface.co/datasets/hiltch/pandas-create-context.
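
    To make the translation concrete, here is a hand-written illustration in the spirit of that description (not an actual row from the dataset):

    ```py
    import pandas as pd

    # SQL schema (sql-create-context style):
    #   CREATE TABLE head (age INTEGER, head_id INTEGER)
    # ...becomes the pandas DataFrame schema initialization statement:
    head = pd.DataFrame(columns=["age", "head_id"])

    # SQL query:
    #   SELECT MIN(age) FROM head
    # ...becomes the pandas query answering the question:
    answer = head["age"].min()
    ```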

  5. Data from: dataset-creation

    • huggingface.co
    Updated Jul 23, 2025
    Cite
    uv scripts for HF Jobs (2025). dataset-creation [Dataset]. https://huggingface.co/datasets/uv-scripts/dataset-creation
    Explore at:
    Dataset updated
    Jul 23, 2025
    Dataset authored and provided by
    uv scripts for HF Jobs
    Description

    Dataset Creation Scripts

    Ready-to-run scripts for creating Hugging Face datasets from local files.

      Available Scripts

      📄 pdf-to-dataset.py

    Convert directories of PDF files into Hugging Face datasets. Features:

    • 📁 Uploads PDFs as dataset objects for flexible processing
    • 🏷️ Automatic labeling from folder structure
    • 🚀 Zero configuration - just point at your PDFs
    • 📤 Direct upload to Hugging Face Hub

    Usage:

      # Basic usage
      uv run pdf-to-dataset.py /path/to/pdfs…

    See the full description on the dataset page: https://huggingface.co/datasets/uv-scripts/dataset-creation.

  6. DataCollection

    • huggingface.co
    Cite
    Abuthahir, DataCollection [Dataset]. https://huggingface.co/datasets/Abu1998/DataCollection
    Explore at:
    Authors
    Abuthahir
    Description

    from datasets import load_dataset

    # Load the dataset from your Hugging Face Space
    dataset = load_dataset("Abu1998/DataCollection", split="train")

    # Convert to pandas DataFrame and save locally
    df = dataset.to_pandas()
    df.to_csv("appointments.csv", index=False)

  7. google-gemini-3-pro-pre-release-model-card

    • huggingface.co
    Updated Nov 18, 2025
    Cite
    Apolinário from multimodal AI art (2025). google-gemini-3-pro-pre-release-model-card [Dataset]. https://huggingface.co/datasets/multimodalart/google-gemini-3-pro-pre-release-model-card
    Explore at:
    Dataset updated
    Nov 18, 2025
    Authors
    Apolinário from multimodal AI art
    Description

    Gemini 3 Pro Model Card

      ⚠️ This is a mirror of the pre-release model card that was officially published by Google @ https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf and subsequently yanked. It will get outdated by the actual release version in a few hours (date of publication Nov 18, 13:16 CET)
    Model card published: November, 2025
    
    
    
      Gemini 3 Pro - Model Card
    

    Model Cards are intended to provide essential information on… See the full description on the dataset page: https://huggingface.co/datasets/multimodalart/google-gemini-3-pro-pre-release-model-card.

  8. DocLayNet-base

    • huggingface.co
    • opendatalab.com
    Updated Apr 27, 2023
    Cite
    Pierre Guillou (2023). DocLayNet-base [Dataset]. https://huggingface.co/datasets/pierreguillou/DocLayNet-base
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 27, 2023
    Authors
    Pierre Guillou
    License

    https://choosealicense.com/licenses/other/

    Description

    Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied to more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
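
    A minimal loading sketch (assuming the standard `datasets` API; the split name, and whether the repo relies on a bundled loading script, are assumptions):

    ```py
    from datasets import load_dataset

    # DocLayNet-base: pages with COCO-style layout annotations (11 classes)
    ds = load_dataset("pierreguillou/DocLayNet-base", split="train")
    print(ds)  # inspect the features (page image plus bounding-box annotations)
    ```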

  9. TrainColpaliForInvoiceAuditor

    • huggingface.co
    Updated Jan 30, 2025
    Cite
    Garg (2025). TrainColpaliForInvoiceAuditor [Dataset]. https://huggingface.co/datasets/deepak1pec/TrainColpaliForInvoiceAuditor
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 30, 2025
    Authors
    Garg
    Description

    Dataset Card for deepak1pec/TrainColpaliForInvoiceAuditor

      Dataset Description
    

    This dataset contains images converted from PDFs using the PDFs to Page Images Converter Space.

    Number of images: 14
    Number of PDFs processed: 3
    Sample size per PDF: 100
    Created on: 2025-01-30 22:09:35

      Dataset Creation

      Source Data

    The images in this dataset were generated from user-uploaded PDF files.

      Processing Steps
    

    PDF files were uploaded to the PDFs to… See the full description on the dataset page: https://huggingface.co/datasets/deepak1pec/TrainColpaliForInvoiceAuditor.

  10. govdocs1-pdf-source

    • huggingface.co
    Cite
    BEEspoke Data, govdocs1-pdf-source [Dataset]. https://huggingface.co/datasets/BEE-spoke-data/govdocs1-pdf-source
    Explore at:
    Dataset authored and provided by
    BEEspoke Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    govdocs1: source PDF files

    [!NOTE] Converted versions of other document types (word, txt, etc) are available in this repo

    This is ~220,000 open-access PDF documents (about 6.6M pages) from the dataset govdocs1. It wants to be OCR'd.

    Uploaded as tar-file pieces of ~10 GiB each due to size/file-count limits, with an index.csv covering the details. 5,000 randomly sampled PDFs are available unarchived in sample/; Hugging Face supports previewing these in-browser, for example this one… See the full description on the dataset page: https://huggingface.co/datasets/BEE-spoke-data/govdocs1-pdf-source.

  11. 54k-resume

    • huggingface.co
    Updated Nov 18, 2024
    Cite
    Suriya ganesh (2024). 54k-resume [Dataset]. https://huggingface.co/datasets/Suriyaganesh/54k-resume
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 18, 2024
    Authors
    Suriya ganesh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is aggregated from sources such as

    • https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset
    • https://github.com/YanyuanSu/Resume-Corpus
    • https://github.com/florex/resume_corpus.git
    etc.

    All of it is available in the public domain. Resumes are usually in PDF format; OCR was used to convert the PDFs into text, and LLMs were used to convert the text into a structured format.

      Dataset Overview
    

    This dataset contains structured information extracted from professional resumes… See the full description on the dataset page: https://huggingface.co/datasets/Suriyaganesh/54k-resume.

  12. shark_attacks_cleaned

    • huggingface.co
    Cite
    Omry Nadiv, shark_attacks_cleaned [Dataset]. https://huggingface.co/datasets/Omrynadiv/shark_attacks_cleaned
    Explore at:
    Authors
    Omry Nadiv
    Description

    # === Shark Attacks EDA – Visual Display Only ===
    # Run in Colab or a Hugging Face notebook.
    # Make sure you have cleaned_shark_attacks.csv in the same folder.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    # --- Load ---
    df = pd.read_csv("cleaned_shark_attacks.csv")

    # --- Clean basics ---

    # Convert Date → Year
    df["Year"] = pd.to_datetime(df["Date"], errors="coerce").dt.year

    # Normalize Sex
    df["Sex"] =…

    See the full description on the dataset page: https://huggingface.co/datasets/Omrynadiv/shark_attacks_cleaned.

  13. olmOCR-bench

    • huggingface.co
    Updated Nov 29, 2025
    Cite
    Ai2 (2025). olmOCR-bench [Dataset]. https://huggingface.co/datasets/allenai/olmOCR-bench
    Explore at:
    Dataset updated
    Nov 29, 2025
    Dataset provided by
    Allen Institute for AI: http://allenai.org/
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    olmOCR-bench

    olmOCR-bench is a dataset of 1,403 PDF files, plus 7,010 unit test cases that capture properties of the output that a good OCR system should have. This benchmark evaluates the ability of OCR systems to accurately convert PDF documents to markdown format while preserving critical textual and structural information. Quick links:

    📃 Paper 🛠️ Code 🎮 Demo

      Table 1. Distribution of Test Classes by Document Source
    

    Document Source Text Present Text… See the full description on the dataset page: https://huggingface.co/datasets/allenai/olmOCR-bench.
