74 datasets found
  1. Convert Text to Pandas

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Cite
    Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
    Available download formats: zip (4333134 bytes)
    Dataset updated
    Sep 22, 2024
    Authors
    Zeyad Usf
    License

    MIT License: https://opensource.org/licenses/MIT

    Description

    kaggle notebook
    Github Repo

    I found two datasets on Hugging Face for converting text with context to pandas code, but the challenge is in the context: it is structured differently in the two datasets, which reduces the model's results. First, let's describe the data I found, then show examples, the solution, and some other problems.

    • Rahima411/text-to-pandas:

      • The data is divided into Train with 57.5k and Test with 19.2k.

      • The data has two columns as you can see in the example:

        • "Input": Contains the context and the question together, in the context it shows the metadata about the data frame.
        • "Pandas Query": Pandas code txt Input | Pandas Query -----------------------------------------------------------|------------------------------------------- Table Name: head (age (object), head_id (object)) | result = management['head.age'].unique() Table Name: management (head_id (object), | temporary_acting (object)) | What are the distinct ages of the heads who are acting? |
    • hiltch/pandas-create-context:

      • It contains 17k rows with three columns:
        • question: the question text.
        • context: code that creates a data frame with column names only, unlike the first dataset, whose context gives the data frame name, column names, and data types.
        • answer: the pandas code.

          question                               | context                                                | answer
          ---------------------------------------|--------------------------------------------------------|------------------------------------
          What was the lowest # of total votes?  | df = pd.DataFrame(columns=['_number_of_total_votes'])  | df['_number_of_total_votes'].min()
    

    As you can see, the problem with these data is that they are not similar as inputs and the structure of the context is different. My solution to this problem was:

    - Convert the context of the first dataset to match the second. I chose this direction because it is difficult to recover the data types of the columns in the second dataset. It was easy to convert the structure of the context from this shape `Table Name: head (age (object), head_id (object))` to this `head = pd.DataFrame(columns=['age','head_id'])` through the code below, which I wrote.
    - Then separate the question from the context. This was easy because, if you look at the data, the context always ends with ")" followed by a blank and then the question. You will find all of this in the code below.
    - You will also notice that the context can define more than one table, and this has been engineered into the code as well.

    ```py
    import re

    def extract_table_creation(text: str) -> tuple[str, str]:
        """
        Extracts DataFrame creation statements and the question from the given text.

        Args:
            text (str): The input text containing table definitions and a question.

        Returns:
            tuple: A concatenated DataFrame creation string and the question.
        """
        # Patterns for table definitions and for typed columns inside them
        table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
        column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

        # Find all table names and their column definitions
        matches = re.findall(table_pattern, text)

        # Build one DataFrame creation statement per table
        df_creations = []
        for table_name, columns_str in matches:
            # Extract column names, dropping the data types
            columns = re.findall(column_pattern, columns_str)
            column_names = [col[0] for col in columns]

            # Format the DataFrame creation statement
            df_creations.append(f"{table_name} = pd.DataFrame(columns={column_names})")

        # Concatenate all DataFrame creation statements
        df_creation_concat = '\n'.join(df_creations)

        # The question follows the last closing parenthesis
        question = text[text.rindex(')') + 1:].strip()

        return df_creation_concat, question
    ```
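    For illustration, here is how the function behaves on the sample row shown earlier (the input text is paraphrased from the table above):

    ```py
    text = ("Table Name: head (age (object), head_id (object)) "
            "Table Name: management (head_id (object), temporary_acting (object)) "
            "What are the distinct ages of the heads who are acting?")

    context, question = extract_table_creation(text)
    print(context)
    # head = pd.DataFrame(columns=['age', 'head_id'])
    # management = pd.DataFrame(columns=['head_id', 'temporary_acting'])
    print(question)
    # What are the distinct ages of the heads who are acting?
    ```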
    
    After both datasets had the same structure, they were merged into one dataset and split into _72.8K_ train and _18.6K_ test examples. We analyzed this dataset, and you can see it all through the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we found some problems in the dataset as well, such as:
    > - `Answer`: `df['Id'].count()` is repeated many times, but such repetition is plausible, so we do not need to drop these rows.
    > - `Context`: it contains `147` rows that do not contain any text. We will see through the experiments whether this affects the results negatively or positively.
    > - `Question`: It is ...
    
  2. Metacritics Best Video Games of All Time 2022

    • kaggle.com
    zip
    Updated Jan 12, 2022
    Cite
    Caique Rezende (2022). Metacritics Best Video Games of All Time 2022 [Dataset]. https://www.kaggle.com/caiquerezende/metacritics-best-video-games-of-all-time-2021
    Available download formats: zip (339704 bytes)
    Dataset updated
    Jan 12, 2022
    Authors
    Caique Rezende
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Metacritic's Best Video Games of All Time

    This project was developed with the mission of creating a dataframe with the updated list of the best video games of all time from the Metacritic website.

    Reference

    To collect the data, I created a Python script that uses Selenium and pandas. You can access it on my GitHub 👇 - Github

    The data was collected from the Metacritic website. - Metacritic
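    For reference, a minimal sketch of that approach (not the author's script; the listing URL and CSS selectors are hypothetical placeholders, since Metacritic's markup changes over time):

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://www.metacritic.com/browse/game/")  # hypothetical listing URL

    # hypothetical selectors for the title and score elements
    titles = [e.text for e in driver.find_elements(By.CSS_SELECTOR, ".title")]
    scores = [e.text for e in driver.find_elements(By.CSS_SELECTOR, ".metascore")]
    driver.quit()

    df = pd.DataFrame({"title": titles, "metascore": scores})
    df.to_csv("metacritic_best_games.csv", index=False)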


  3. Data from: Constraints on trait combinations explain climatic drivers of...

    • datadryad.org
    • search.dataone.org
    zip
    Updated Apr 27, 2018
    Cite
    John M. Dwyer; Daniel C. Laughlin (2018). Constraints on trait combinations explain climatic drivers of biodiversity: the importance of trait covariance in community assembly [Dataset]. http://doi.org/10.5061/dryad.76kt8
    Available download formats: zip
    Dataset updated
    Apr 27, 2018
    Dataset provided by
    Dryad
    Authors
    John M. Dwyer; Daniel C. Laughlin
    Time period covered
    Apr 27, 2017
    Description

    quadrat.scale.data: refer to the R script ("Dwyer_&_Laughlin_2017_Trait_covariance_script.r") for information about this dataframe.
    species.in.quadrat.scale.data: refer to the R script ("Dwyer_&_Laughlin_2017_Trait_covariance_script.r") for information about this dataframe.
    Dwyer_&_Laughlin_2017_Trait_covariance_script: this script reads in the two dataframes of "raw" data, calculates diversity and trait metrics, and runs the major analyses presented in Dwyer & Laughlin 2017.

  4. PandasPlotBench

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    JetBrains (http://jetbrains.com/)
    Authors
    JetBrains Research
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0

    Description

    PandasPlotBench

    PandasPlotBench is a benchmark to assess the capability of models in writing code for visualizations given the description of a Pandas DataFrame. 🛠️ Task: given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
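    To peek at the benchmark locally, something like the following should work (a sketch using the Hugging Face datasets library; split and column names are left to inspection rather than assumed):

    from datasets import load_dataset

    ds = load_dataset("JetBrains-Research/PandasPlotBench")
    print(ds)  # shows the available splits and their columns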

  5. pandas-create-context

    • huggingface.co
    Updated Jan 8, 2024
    Cite
    Or Hiltch (2024). pandas-create-context [Dataset]. https://huggingface.co/datasets/hiltch/pandas-create-context
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2024
    Authors
    Or Hiltch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/

    Description

    Overview

    This dataset is built from sql-create-context, which in turn builds on WikiSQL and Spider. I have used GPT-4 to translate the SQL schemas into pandas DataFrame schema initialization statements and to translate the SQL queries into pandas queries. There are 862 examples of natural language queries, pandas DataFrame creation statements, and pandas queries answering the question using the DataFrame creation statement as context. This dataset was built with text-to-pandas LLMs… See the full description on the dataset page: https://huggingface.co/datasets/hiltch/pandas-create-context.
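    Since each answer is pandas code over its context, one can sanity-check a row by executing the pair (a sketch assuming the Hugging Face datasets library; the split name and the exec/eval division are assumptions):

    import pandas as pd
    from datasets import load_dataset

    ds = load_dataset("hiltch/pandas-create-context", split="train")
    row = ds[0]

    namespace = {"pd": pd}
    exec(row["context"], namespace)          # e.g. df = pd.DataFrame(columns=[...])
    result = eval(row["answer"], namespace)  # e.g. df['col'].min()
    print(row["question"], "->", result)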

  6. Table1_Immunological characterization of an Italian PANDAS cohort.docx

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Jan 4, 2024
    Cite
    Duse, Marzia; Guido, Cristiana Alessia; Carsetti, Rita; Loffredo, Lorenzo; Mortari, Eva Piano; Lorenzetti, Giulia; Förster-Waldl, Elisabeth; Zicari, Anna Maria; Leonardi, Lucia; Spalice, Alberto (2024). Table1_Immunological characterization of an Italian PANDAS cohort.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001272684
    Dataset updated
    Jan 4, 2024
    Authors
    Duse, Marzia; Guido, Cristiana Alessia; Carsetti, Rita; Loffredo, Lorenzo; Mortari, Eva Piano; Lorenzetti, Giulia; Förster-Waldl, Elisabeth; Zicari, Anna Maria; Leonardi, Lucia; Spalice, Alberto
    Description

    This cross-sectional study aimed to contribute to the definition of Pediatric Autoimmune Neuropsychiatric Disorders Associated with Streptococcal Infections (PANDAS) pathophysiology. An extensive immunological assessment has been conducted to investigate both immune defects, potentially leading to recurrent Group A β-hemolytic Streptococcus (GABHS) infections, and immune dysregulation responsible for a systemic inflammatory state. Twenty-six PANDAS patients with relapsing-remitting course of disease and 11 controls with recurrent pharyngotonsillitis were enrolled. Each subject underwent a detailed phenotypic and immunological assessment including cytokine profile. A possible correlation of immunological parameters with clinical-anamnestic data was analyzed. No inborn errors of immunity were detected in either group, using first level immunological assessments. However, a trend toward higher TNF-alpha and IL-17 levels, and lower C3 levels, was detected in the PANDAS patients compared to the control group. Maternal autoimmune diseases were described in 53.3% of PANDAS patients and neuropsychiatric symptoms other than OCD and tics were detected in 76.9% patients. ASO titer did not differ significantly between the two groups. A possible correlation between enduring inflammation (elevated serum TNF-α and IL-17) and the persistence of neuropsychiatric symptoms in PANDAS patients beyond infectious episodes needs to be addressed. Further studies with larger cohorts would be pivotal to better define the role of TNF-α and IL-17 in PANDAS pathophysiology.

  7. Data from: I-MAESTRO data: 42 million trees from three large European...

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Jul 10, 2024
    Cite
    Raphaël Aussenac; Jean-Matthieu Monnet; Matija Klopčič; Paweł Hawryło; Jarosław Socha; Mats Mahnken; Martin Gutsch; Thomas Cordonnier; Patrick Vallet (2024). I-MAESTRO data: 42 million trees from three large European landscapes in France, Poland and Slovenia [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7462440
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    University of Agriculture in Krakow
    Univ. Grenoble Alpes, INRAE
    University of Ljubljana
    Univ. Grenoble Alpes, INRAE & Office National des Forêts
    Potsdam Institute for Climate Impact Research
    Univ. Grenoble Alpes, INRAE & Univ Montpellier, CIRAD
    Authors
    Raphaël Aussenac; Jean-Matthieu Monnet; Matija Klopčič; Paweł Hawryło; Jarosław Socha; Mats Mahnken; Martin Gutsch; Thomas Cordonnier; Patrick Vallet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/

    Area covered
    Slovenia, Europe, Poland, France
    Description

    Here we present three datasets describing three large European landscapes in France (Bauges Geopark - 89,000 ha), Poland (Milicz forest district - 21,000 ha) and Slovenia (Snežnik forest - 4,700 ha) down to the tree level. Individual trees were generated combining inventory plot data, vegetation maps and Airborne Laser Scanning (ALS) data. Together, these landscapes (hereafter virtual landscapes) cover more than 100,000 ha including about 64,000 ha of forest and consist of more than 42 million trees of 51 different species. For each virtual landscape we provide a table (in .csv format) with the following columns:

    - cellID25: the unique ID of each 25x25 m² cell
    - sp: species latin names
    - n: number of trees. n is an integer >= 1, meaning that a specific set of species "sp", diameter "dbh" and height "h" can be present multiple times in a cell.
    - dbh: tree diameter at breast height (cm)
    - h: tree height (m)

    We also provide, for each virtual landscape, a raster (in .asc format) with the cell IDs (cellID25) which makes data spatialisation possible. The coordinate reference systems are EPSG: 2154 for the Bauges, EPSG: 2180 for Milicz, and EPSG: 3912 for Sneznik. The v2.0.0 presents the algorithm in its final state. Finally, we provide a proof of how our algorithm makes it possible to reach the total BA and the BA proportion of broadleaf trees provided by the ALS mapping using the alpha correction coefficient and how it maintains the Dg ratios observed on the field plots between the different species (see algorithm presented in the associated Open Research Europe article). Below is an example of R code that opens the datasets and creates a tree density map.

    # load packages
    library(terra)
    library(dplyr)

    # set work directory
    setwd()  # define path to the I-MAESTRO_data folder

    # load tree data
    tree <- read.csv2('./sneznik/sneznik_trees.csv', sep = ',')

    # load spatial data
    cellID <- rast('./sneznik/sneznik_cellID25.asc')

    # set coordinate reference system
    # Bauges: crs(cellID) <- "epsg:2154"
    # Milicz: crs(cellID) <- "epsg:2180"
    # Sneznik:
    crs(cellID) <- "epsg:3912"

    # convert raster into dataframe
    cellIDdf <- as.data.frame(cellID)
    colnames(cellIDdf) <- 'cellID25'

    # calculate tree density from tree dataframe
    dens <- tree %>% group_by(cellID25) %>% summarise(n = sum(n))

    # merge the two dataframes
    dens <- left_join(cellIDdf, dens, join_by(cellID25))

    # add density to raster
    cellID$dens <- dens$n

    # plot density map
    plot(cellID$dens)

  8. polyOne Data Set - 100 million hypothetical polymers including 29 properties...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 24, 2023
    Cite
    Christopher Kuenneth; Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7124187
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Georgia Institute of Technology
    Authors
    Christopher Kuenneth; Rampi Ramprasad
    Description

    polyOne Data Set

    The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

    Full data set including the properties

    The data files are in Apache Parquet format. The files start with polyOne_*.parquet.

    I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.

    Load the sharded data set with dask:

    ```python
    import dask.dataframe as dd
    ddf = dd.read_parquet("*.parquet", engine="pyarrow")
    ```

    For example, compute the description of the data set:

    ```python
    df_describe = ddf.describe().compute()
    df_describe
    ```

    PSMILES strings only

    generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
    generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
    
  9. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow libraries. The split into train, validation and test sets follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]

    Description

    The annotation file consists of the following fields:

    - filename: name of the corresponding file (video or audio file)
    - dataset: source dataset associated with the data point
    - captions_visual: a list of captions related to the visual content of the video. Can be NaN in case of no visual content
    - captions_auditory: a list of captions related to the auditory content of the video
    - tags: a list of tags classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  10. ZEL031

    • kaggle.com
    zip
    Updated Jul 15, 2025
    Cite
    Zafrens (2025). ZEL031 [Dataset]. https://www.kaggle.com/datasets/zafrens-data/zel031
    Available download formats: zip (11507107465 bytes)
    Dataset updated
    Jul 15, 2025
    Dataset authored and provided by
    Zafrens
    Description

    Data Version Changelog

    • Version 5: identical to version 3
    • Version 4: placeholder for temporarily unavailable data
    • Version 3: added detected bead locations & cell segmentation maps for imaging data
    • Version 2: added 1 new array of phenotype data (ZS30) containing ~10,000 new observations
    • Version 1: initial release

    Data Description

    This dataset contains two types of data for a combinatorial library of approximately 12,000 compounds that we call ZEL031. This library was designed with a specific target in mind, but we are not currently explicitly disclosing the target. Phenotype imaging data was collected from A549 cells and transcriptome data was collected from HEK293T cells. Transcriptome data was collected from 2 different arrays, and phenotype data was collected from 3 different arrays. Imaging data is provided as 1 "plain" HDF5 file per array, and each HDF5 file is accompanied by 2 CSV files which describe the first 2 dimensions of the "images" dataset in the HDF5 file; the last two dimensions are image height and image width. Transcriptome counts data is provided in an AnnData-format HDF5 file, and the metadata is stored in the .obs DataFrame. The source device for the transcriptome data is available in the device_id column of the .obs DataFrame.

    In total, there are approximately 64,000 transcriptome observations and 28,000 phenotype observations.

    Metadata Description

    All observations are described by 5 columns of metadata: control_rx_id, bb1_id, bb2_id, residual_linker, and censored. The censored column indicates that the chemical identity information for that well has been hidden, and all the values for the other columns ending in _id should be -1. About 20% of the data has been censored. For the remaining 80% of the data, a combination of the columns ending in _id can be used to look up the associated chemical perturbation in the smiles.csv file. For the imaging data, these columns can be found in the _dim_0_metadata.csv files, along with a physical_well_id column which identifies the source of the imaging data. Some wells are imaged more than once during data collection; for those wells, both sets of images are included and share a value in the physical_well_id column.
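    A minimal sketch of that lookup (assuming pandas; the merge keys and the boolean encoding of the censored column are assumptions):

    import pandas as pd

    meta = pd.read_csv("ZS26_dim_0_metadata.csv")
    smiles = pd.read_csv("smiles.csv")

    # keep wells whose chemical identity is not hidden
    uncensored = meta[~meta["censored"].astype(bool)]
    lookup = uncensored.merge(smiles, on=["control_rx_id", "bb1_id", "bb2_id"], how="left")
    print(lookup.head())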

    Interpretive Data

    Each device's imaging data includes 2 extra files, {device}.segmentations.h5 and {device}.beadlocations.csv which contain some cell segmentation data and the locations of beads within the images in {device}.h5, respectively.

    The segmentations are present in their HDF5 as an "images" dataset, but with only 3 dimensions (n_obs x H x W). A positional value of zero indicates that the pixel was not detected as a cell, and all pixels with each unique non-zero positive integer correspond to one cell; integer values are unique only within a single slice along the first dimension of data. Despite some weak spots, these maps have served us well for QC purposes.

    Bead detections in the associated CSV have 4 columns: hdf5_dim_0_index, cx, cy, and radius.

    - hdf5_dim_0_index maps to the first dimension of the images/segmentations HDF5s
    - cy maps to the second-to-last dimension of the images/segmentations HDF5s
    - cx maps to the last dimension of the images/segmentations HDF5s
    - radius is the radius of a (rough) circle centered at cy, cx

    Notes

    There are certainly artifacts present which we'd prefer to avoid (some we're aware of include images being out-of-focus or perturbation delivery beads moving from their expected locations) and we look forward to sharing improved datasets in the future. Nonetheless, we've found some interesting patterns in these data and we'd be absolutely delighted to learn of any interesting patterns you can find (whether artifacts or biological patterns)!

    MD5 Hashes

    filename                  | MD5 hash
    --------------------------|----------------------------------
    4020_4021_cens.h5ad       | fd4e5d843443813cd86f7aa058052ac9
    ZS26_dim_0_metadata.csv   | 9d2d46e28e8321b4d5478a95e77d5c7a
    ZS26_dim_1_metadata.csv   | f5b35dcd381a40a6a188e6c2aec5b9be
    ZS26.h5                   | 2e97081c4003960bdf5b8ecf882aa3dc
    ZS26.beadlocations.csv    | ec557c2449b1c4f934670ab135876705
    ZS26.segmentations.h5     | efc0862557a3d709d59f1cfad713115b
    ZS27_dim_0_metadata.csv   | f9b8475a47e4f741563b474035b02499
    ZS27_dim_1_metadata.csv   | f5b35dcd381a40a6a188e6c2aec5b9be
    ZS27.h5                   | 8647a7800893f7cdb3013a8449374860
    ZS27.beadlocations.csv    | 1965c349977d69eca833f8705d229df9
    ZS27.segmentations.h5     | dc25b2e859aafe9816c4c3936057795b
    ZS30_dim_0_metadata.csv   | fb8f9c51b8bc6fe1a5b86e4611dac85d
    ZS30_dim_1_metadata.csv   | f5b35dcd381a40a6a188e6c2aec5b9be
    ZS30.h5                   | 20c2bc56ce149198e9c9b60c6b669861
    ZS30.beadlocations.csv    | 095b63bb7ffa98558fb09fddc4593bf9
    ZS30.segmentations.h5     | e12d5d1cac8bbeb9c925d688544b5b26
  11. Zippi_Shvartsman_et_al_2023_bmi_manual_files

    • figshare.com
    bin
    Updated Aug 29, 2023
    Cite
    Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena (2023). Zippi_Shvartsman_et_al_2023_bmi_manual_files [Dataset]. http://doi.org/10.6084/m9.figshare.23674200.v1
    Available download formats: bin
    Dataset updated
    Aug 29, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/

    Description

    Included files: Each file includes LFP (local field potential) data for both animals ('h', 'y') during a particular type of task control ('bmi' or 'manual'), time-locked to 500 ms before or after a particular event in the task ('go_cue' or 'target'), for each rewarded trial in each day of the task ('h': [1-13], 'y': [1-22]).

    File description: Each file includes a Pandas DataFrame, saved as a .feather file. Data can be accessed using Python by calling:

    import pandas as pd
    pd.read_feather([file name])

    Each DataFrame has the following columns:

    - control_type: 'bmi', 'manual', or 'baseline'
    - event: go cue ('go_cue') or target acquisition ('target')
    - subj: which animal, 'h' or 'y'
    - day: which day of the session, 'h': [1-13], 'y': [1-22]
    - roi: region of interest; 'direct', 'dlpfc', or 'cd', where 'direct' includes most channels from m1 each day but is specific to channels which had sufficient spiking to be used as input to the BMI decoder
    - ch: electrode channel number; only low-noise channels were included (see Methods for details)
    - n_rewarded_trial: which trial number the data segment is from; only successfully completed (rewarded) trials are included
    - time_from_window_ms: for go_cue: 0-500 ms from go cue; for target: -500-0 ms from target acquisition
    - lfp: local field potential value (see Methods for details)
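    For example, to pull one condition out of a loaded DataFrame (a sketch; the file name is a placeholder, while the column names and values follow the description above):

    import pandas as pd

    df = pd.read_feather("bmi_go_cue.feather")  # placeholder file name
    subset = df[(df["subj"] == "h") & (df["roi"] == "direct") & (df["day"] == 1)]
    print(subset[["ch", "n_rewarded_trial", "time_from_window_ms", "lfp"]].head())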

  12. french_books

    • huggingface.co
    Updated Jul 29, 2025
    Cite
    CATIE (2025). french_books [Dataset]. https://huggingface.co/datasets/CATIE-AQ/french_books
    Dataset updated
    Jul 29, 2025
    Dataset authored and provided by
    CATIE
    Area covered
    French
    Description

    Description

    Dataframe containing 2075 French books in txt format (= the ~2600 French books present in Gutenberg from which all books by authors present in the french_books_summuries dataset have been removed, to avoid any leaks). More precisely:

    - the texte column contains the texts
    - the titre column contains the book title
    - the auteur column contains the author's name and dates of birth and death (if you want to filter the texts to keep only those from the given century to the present… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/french_books.
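    A quick way to load the dataset and inspect the columns named above (a sketch; the split name is an assumption):

    from datasets import load_dataset

    books = load_dataset("CATIE-AQ/french_books", split="train").to_pandas()
    print(books[["titre", "auteur"]].head())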

  13. Klib library python

    • kaggle.com
    zip
    Updated Jan 11, 2021
    Cite
    Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python
    Available download formats: zip (89892446 bytes)
    Dataset updated
    Jan 11, 2021
    Authors
    Sripaad Srinivasan
    Description

    The klib library enables us to quickly visualize missing data, perform data cleaning, visualize data distributions, plot correlations, and visualize categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

    Original Github repo

    (klib header image: https://raw.githubusercontent.com/akanz1/klib/main/examples/images/header.png)

    Usage

    !pip install klib
    
    import klib
    import pandas as pd
    
    df = pd.DataFrame(data)
    
    # klib functions for visualizing datasets
    klib.cat_plot(df)         # visualizes the number and frequency of categorical features
    klib.corr_mat(df)         # returns a color-encoded correlation matrix
    klib.corr_plot(df)        # returns a color-encoded heatmap, ideal for correlations
    klib.dist_plot(df)        # returns a distribution plot for every numeric feature
    klib.missingval_plot(df)  # returns a figure containing information about missing values
    

    Examples

    Take a look at this starter notebook.

    Further examples, as well as applications of the functions can be found here.

    Contributing

    Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

    License

    MIT

  14. Publication dataframes - Cross tolerance: salinity gradients and dehydration...

    • researchdata.edu.au
    Updated Nov 24, 2025
    Cite
    Mr. Callum Bryant (2025). Publication dataframes - Cross tolerance: salinity gradients and dehydration increase photosynthetic heat tolerance in mangrove leaves. [Dataset]. http://doi.org/10.25911/K541-G386
    Dataset updated
    Nov 24, 2025
    Dataset provided by
    The Australian National University
    Authors
    Mr. Callum Bryant
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/

    Time period covered
    2022
    Description

    This ANU Data Commons collection contains two master data sets associated with the manuscript accepted for publication in Functional Ecology, detailed below.

    Bryant C, Harris RJ, Brothers N, Bone C, Walsh N, Nicotra AB, Ball MC 2024, Cross tolerance: salinity gradients and dehydration increase photosynthetic heat tolerance in mangrove leaves Functional Ecology (Accepted article)

    Each xlsx file contains a Metadata sheet describing headers, units of measurement, and data form (numeric, character, factor), followed by a single master dataframe sheet.

    For full methods used for experimental data collection, see the methods described in the publication.

  15. fenic-0.4.0-codebase

    • huggingface.co
    Updated Sep 26, 2025
    Cite
    Typedef, Inc. (2025). fenic-0.4.0-codebase [Dataset]. https://huggingface.co/datasets/typedef-ai/fenic-0.4.0-codebase
    Dataset updated
    Sep 26, 2025
    Dataset authored and provided by
    Typedef, Inc.
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0

    Description

    Fenic 0.4.0 API Documentation Dataset

      Dataset Description
    

    This dataset contains comprehensive API documentation for Fenic 0.4.0, a PySpark-inspired DataFrame framework designed for building production AI and agentic applications. The dataset provides structured information about all public and private API elements, including modules, classes, functions, methods, and attributes.

      Dataset Summary
    

    Fenic is a DataFrame framework that combines traditional data… See the full description on the dataset page: https://huggingface.co/datasets/typedef-ai/fenic-0.4.0-codebase.

  16. Shopping Mall

    • kaggle.com
    zip
    Updated Dec 15, 2023
    Cite
    Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
    Available download formats: zip (22852 bytes)
    Dataset updated
    Dec 15, 2023
    Authors
    Anshul Pachauri
    Description

    Libraries Import: importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

    Data Loading and Exploration: reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df); displaying the first few rows of the dataset using df.head(); conducting univariate analysis by calculating descriptive statistics with df.describe().

    Univariate Analysis: visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot; looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

    Bivariate Analysis: creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot; generating a pair plot for selected columns with gender differentiation using sns.pairplot.

    Gender-Based Analysis: grouping the data by 'Gender' and calculating the mean for selected columns; computing the correlation matrix for the grouped data and visualizing it using a heatmap.

    Univariate Clustering: applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame; plotting the elbow method to determine the optimal number of clusters.

    Bivariate Clustering: applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column; plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot; displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

    Multivariate Clustering: performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering; plotting the elbow method for multivariate clustering.

    Result Saving: saving the modified DataFrame with cluster information to a CSV file named "Result.csv"; saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
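    As a rough illustration of the clustering steps above, here is a minimal sketch (column names follow the description; this is not the author's exact notebook code):

    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.read_csv("Mall_Customers.csv")
    X = df[["Annual Income (k$)", "Spending Score (1-100)"]]

    # elbow method: inertia for k = 1..10
    inertia = [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 11)
    ]

    # bivariate clustering with 5 clusters, as described above
    km = KMeans(n_clusters=5, n_init=10, random_state=0)
    df["Spending and Income Cluster"] = km.fit_predict(X)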

  17. Data_Sheet_1_Diagnostic Approach to Pediatric Autoimmune Neuropsychiatric...

    • datasetcatalog.nlm.nih.gov
    Updated Oct 27, 2021
    Cite
    Gulisano, Mariangela; Barone, Rita; Scerbo, Miriam; Rizzo, Renata; Vicario, Carmelo M.; Prato, Adriana (2021). Data_Sheet_1_Diagnostic Approach to Pediatric Autoimmune Neuropsychiatric Disorders Associated With Streptococcal Infections (PANDAS): A Narrative Review of Literature Data.PDF [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000814425
    Dataset updated
    Oct 27, 2021
    Authors
    Gulisano, Mariangela; Barone, Rita; Scerbo, Miriam; Rizzo, Renata; Vicario, Carmelo M.; Prato, Adriana
    Description

    Pediatric autoimmune neuropsychiatric disorders associated with streptococcal infections (PANDAS) are clinical conditions characterized by the sudden onset of obsessive–compulsive disorder and/or tics, often accompanied by other behavioral symptoms in a group of children with streptococcal infection. PANDAS-related disorders, including pediatric acute-onset neuropsychiatric syndrome (PANS), childhood acute neuropsychiatric symptoms (CANS), and pediatric infection triggered autoimmune neuropsychiatric disorders (PITANDs), have also been described. Since first defined in 1998, PANDAS has been considered a controversial diagnosis. A comprehensive review of the literature was performed on PubMed and Scopus databases, searching for diagnostic criteria and diagnostic procedures of PANDAS and related disorders. We propose a test panel to support clinicians in the workout of PANDAS/PANS patients establishing an appropriate treatment. However, further studies are needed to improve our knowledge on these acute-onset neuropsychiatric conditions.

  18. PlotQA_V1

    • huggingface.co
    Updated Sep 22, 2025
    Cite
    Aryan Badkul (2025). PlotQA_V1 [Dataset]. https://huggingface.co/datasets/Abd223653/PlotQA_V1
    Dataset updated
    Sep 22, 2025
    Authors
    Aryan Badkul
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0

    Description

    Plotqa V1

      Dataset Description
    

    This dataset was uploaded from a pandas DataFrame.

      Dataset Structure

      Overview

    Total Examples: 5,733,893
    Total Features: 9
    Dataset Size: ~2805.4 MB
    Format: Parquet files
    Created: 2025-09-22 20:12:01 UTC

      Data Instances

    The dataset contains 5,733,893 rows and 9 columns.

      Data Fields

    image_index (int64): 0 null values (0.0%), Range: [0.00, 157069.00], Mean: 78036.26
    qid (object): 0 null values (0.0%)… See the full description on the dataset page: https://huggingface.co/datasets/Abd223653/PlotQA_V1.

  19. Oxygen Experiment Dataframe

    • figshare.com
    txt
    Updated Apr 7, 2020
    Cite
    Alec Cobban (2020). Oxygen Experiment Dataframe [Dataset]. http://doi.org/10.6084/m9.figshare.11964885.v1
    Available download formats: txt
    Dataset updated
    Apr 7, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alec Cobban
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/

    Description

    Sulfolobus acidocaldarius oxygen experiment data. Includes both GDGT and growth information. The experiment described as "0.22 oxygen concentration" is the serial-transfer 0.2% O2 experiment.

  20. small_dense_structured_table

    • huggingface.co
    Updated May 9, 2025
    Cite
    Nanonets (2025). small_dense_structured_table [Dataset]. https://huggingface.co/datasets/nanonets/small_dense_structured_table
    Dataset updated
    May 9, 2025
    Dataset authored and provided by
    Nanonets
    License

    MIT License: https://opensource.org/licenses/MIT

    Description

    This dataset is generated synthetically to create tables with the following characteristics:

    - Empty cell percentage in the range [0, 30]
    - A clear separator between rows and columns (Structured)
    - 4 <= num rows <= 10, 2 <= num columns <= 6 (Small)

      Load the dataset
    

    import io
    import pandas as pd
    from PIL import Image

    def bytes_to_image(image_bytes: bytes):
        return Image.open(io.BytesIO(image_bytes))

    def parse_annotations(annotations: str) -> pd.DataFrame:
        …

    See the full description on the dataset page: https://huggingface.co/datasets/nanonets/small_dense_structured_table.
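    Hypothetical usage of these helpers (the field names "image" and "annotations" are assumptions, not confirmed by the dataset card):

    from datasets import load_dataset

    ds = load_dataset("nanonets/small_dense_structured_table", split="train")
    img = bytes_to_image(ds[0]["image"])             # assumed field name
    table = parse_annotations(ds[0]["annotations"])  # assumed field name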
