100+ datasets found
  1. pandas-create-context

    • huggingface.co
    Updated Jan 8, 2024
    Cite
    Or Hiltch (2024). pandas-create-context [Dataset]. https://huggingface.co/datasets/hiltch/pandas-create-context
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2024
    Authors
    Or Hiltch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset is built from sql-create-context, which itself builds on WikiSQL and Spider. I used GPT-4 to translate the SQL schemas into pandas DataFrame schema initialization statements and to translate the SQL queries into pandas queries. There are 862 examples of natural language queries, pandas DataFrame creation statements, and pandas queries answering the question using the DataFrame creation statement as context. This dataset was built with text-to-pandas LLMs… See the full description on the dataset page: https://huggingface.co/datasets/hiltch/pandas-create-context.
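
    A minimal loading sketch using the Hugging Face datasets library; the split name and the question/context/answer column names are assumptions based on this description and the write-up further down this page:

    ```py
    from datasets import load_dataset

    ds = load_dataset("hiltch/pandas-create-context", split="train")  # split name assumed

    example = ds[0]
    print(example["question"])  # natural language query
    print(example["context"])   # DataFrame creation statement(s)
    print(example["answer"])    # pandas query answering the question
    ```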

  2. Convert Text to Pandas

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Cite
    Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
    Explore at:
    Available download formats: zip (4333134 bytes)
    Dataset updated
    Sep 22, 2024
    Authors
    Zeyad Usf
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Kaggle notebook
    GitHub repo

    I found two datasets on Hugging Face for converting text with context to pandas code, but the challenge lies in the context: it is structured differently in the two datasets, which reduces the model's results. First, let's go over the data I found, then look at examples, the solution, and some other problems.

    • Rahima411/text-to-pandas:

      • The data is divided into Train with 57.5k and Test with 19.2k.

      • The data has two columns as you can see in the example:

        • "Input": Contains the context and the question together, in the context it shows the metadata about the data frame.
        • "Pandas Query": Pandas code txt Input | Pandas Query -----------------------------------------------------------|------------------------------------------- Table Name: head (age (object), head_id (object)) | result = management['head.age'].unique() Table Name: management (head_id (object), | temporary_acting (object)) | What are the distinct ages of the heads who are acting? |
    • hiltch/pandas-create-context:

      • It contains 17k rows with three columns:
        • question: text.
        • context: code to create a data frame with column names, unlike the first dataset, which contains the data frame name, column names, and data types.
        • answer: Pandas code.

    question                                | context                                                | answer
    ----------------------------------------|--------------------------------------------------------|---------------------------------------
    What was the lowest # of total votes?   | df = pd.DataFrame(columns=['_number_of_total_votes'])  | df['_number_of_total_votes'].min()


    As you can see, the problem with these data is that the inputs are not alike and the structure of the context differs between the two datasets. My solution to this problem was:

    - Convert the first dataset so that its context matches the second. I chose this direction because it is difficult to recover the column data types in the second dataset. It was easy to convert the structure of the context from this shape `Table Name: head (age (object), head_id (object))` to this `head = pd.DataFrame(columns=['age','head_id'])` through the code I wrote below.
    - Then separate the question from the context. This was easy because, if you look at the data, you will find that the context always ends with ")", then a blank, and then the question. You will find all of this in the code below.
    - You will also notice that the context can yield more than one creation statement (one per table), and this has been engineered into the code as well.

    ```py
    import re

    def extract_table_creation(text: str) -> tuple[str, str]:
        """
        Extracts DataFrame creation statements and the question from the given text.

        Args:
          text (str): The input text containing table definitions and a question.

        Returns:
          tuple: A concatenated DataFrame creation string and the question.
        """
        # Define patterns
        table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
        column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

        # Find all table names and column definitions
        matches = re.findall(table_pattern, text)

        # Build one DataFrame creation statement per table
        df_creations = []
        for table_name, columns_str in matches:
            # Extract column names
            columns = re.findall(column_pattern, columns_str)
            column_names = [col[0] for col in columns]

            # Format the DataFrame creation statement
            df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
            df_creations.append(df_creation)

        # Concatenate all DataFrame creation statements
        df_creation_concat = '\n'.join(df_creations)

        # Extract and clean the question (everything after the last closing parenthesis)
        question = text[text.rindex(')') + 1:].strip()

        return df_creation_concat, question
    ```
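
    For instance, applied to the Rahima411-style input shown above, the function returns the rewritten context and the question; a quick sketch (expected output shown as comments):

    ```py
    text = ("Table Name: head (age (object), head_id (object)) "
            "Table Name: management (head_id (object), temporary_acting (object)) "
            "What are the distinct ages of the heads who are acting?")

    context, question = extract_table_creation(text)
    print(context)
    # head = pd.DataFrame(columns=['age', 'head_id'])
    # management = pd.DataFrame(columns=['head_id', 'temporary_acting'])
    print(question)
    # What are the distinct ages of the heads who are acting?
    ```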
    
    After both datasets had the same structure, they were merged into one set and divided into _72.8K_ train and _18.6K_ test examples. We analyzed this dataset, and you can see it all in the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we also found some problems in the dataset, such as:
    > - `Answer`: `df['Id'].count()` is repeated across rows, but this is plausible, so we do not need to drop these rows.
    > - `Context`: it contains `147` rows with no text at all. We will see through the experiment whether this affects the results negatively or positively.
    > - `Question` : It is ...
    
  3. Shopping Mall

    • kaggle.com
    zip
    Updated Dec 15, 2023
    Cite
    Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
    Explore at:
    Available download formats: zip (22852 bytes)
    Dataset updated
    Dec 15, 2023
    Authors
    Anshul Pachauri
    Description

    Libraries Import:

    Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

    Data Loading and Exploration:

    Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().

    Univariate Analysis:

    Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

    Bivariate Analysis:

    Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

    Gender-Based Analysis:

    Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.

    Univariate Clustering:

    Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.

    Bivariate Clustering:

    Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

    Multivariate Clustering:

    Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.

    Result Saving:

    Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
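
    A rough sketch of the bivariate clustering step described above, assuming the "Mall_Customers.csv" filename and the column names mentioned in the description:

    ```py
    import pandas as pd
    from sklearn.cluster import KMeans

    # Filename and column names taken from the description above
    df = pd.read_csv("Mall_Customers.csv")
    features = df[['Annual Income (k$)', 'Spending Score (1-100)']]

    # Elbow method: inspect inertia for k = 1..10 before settling on 5 clusters
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(features).inertia_
                for k in range(1, 11)]
    print(inertias)

    # Bivariate clustering with 5 clusters, as in the notebook
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
    df['Spending and Income Cluster'] = kmeans.fit_predict(features)
    print(df['Spending and Income Cluster'].value_counts())
    ```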

  4. Pokémon Data Analysis using Pandas

    • kaggle.com
    zip
    Updated Sep 19, 2021
    Cite
    Ertiza Abbas (2021). Pokémon Data Analysis using Pandas [Dataset]. https://www.kaggle.com/ertizaabbas/pokmon-data-analysis-using-pandas
    Explore at:
    Available download formats: zip (126346 bytes)
    Dataset updated
    Sep 19, 2021
    Authors
    Ertiza Abbas
    Description

    Context

    I have used a Pokémon dataset for further analysis with pandas. This dataset and the accompanying Jupyter notebook can be used to understand how pandas works; all steps are described in the markdown sections.

    Importing data

    The Pokémon data is publicly available on Kaggle; you can use any dataset for further analysis or practice.

    Acknowledgements

    I would like to thank Keith Galli for creating such a lovely opportunity for beginners to understand Python and its libraries in a very simple way.

    Inspiration

    Further analysis can be done on various outcomes, as the data is an ever-changing category.

  5. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test sets follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset AudioSet
    filename train/---2_BBVHAA.mp3
    captions_visual [a man in a black hat and glasses.]
    captions_auditory [a man speaks and dishes clank.]
    tags [Speech]

    Description

    The annotation file consists of the following fields:

    filename: Name of the corresponding file (video or audio file)
    dataset: Source dataset associated with the data point
    captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    captions_auditory: A list of captions related to the auditory content of the video
    tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  6. PandasPlotBench

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    JetBrains (http://jetbrains.com/)
    Authors
    JetBrains Research
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    PandasPlotBench

    PandasPlotBench is a benchmark for assessing how well models can write visualization code given the description of a Pandas DataFrame. 🛠️ Task: given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.

  7. Red Pandas Dataset

    • universe.roboflow.com
    zip
    Updated Jan 22, 2024
    + more versions
    Cite
    training (2024). Red Pandas Dataset [Dataset]. https://universe.roboflow.com/training-rduft/red-pandas
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 22, 2024
    Dataset authored and provided by
    training
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Variables measured
    Red Pandas Bounding Boxes
    Description

    Red Pandas

    ## Overview
    
    Red Pandas is a dataset for object detection tasks - it contains Red Pandas annotations for 1,756 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [MIT license](https://creativecommons.org/licenses/MIT).
    
  8. Table1_Immunological characterization of an Italian PANDAS cohort.docx

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Jan 4, 2024
    Cite
    Duse, Marzia; Guido, Cristiana Alessia; Carsetti, Rita; Loffredo, Lorenzo; Mortari, Eva Piano; Lorenzetti, Giulia; Förster-Waldl, Elisabeth; Zicari, Anna Maria; Leonardi, Lucia; Spalice, Alberto (2024). Table1_Immunological characterization of an Italian PANDAS cohort.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001272684
    Explore at:
    Dataset updated
    Jan 4, 2024
    Authors
    Duse, Marzia; Guido, Cristiana Alessia; Carsetti, Rita; Loffredo, Lorenzo; Mortari, Eva Piano; Lorenzetti, Giulia; Förster-Waldl, Elisabeth; Zicari, Anna Maria; Leonardi, Lucia; Spalice, Alberto
    Description

    This cross-sectional study aimed to contribute to the definition of Pediatric Autoimmune Neuropsychiatric Disorders Associated with Streptococcal Infections (PANDAS) pathophysiology. An extensive immunological assessment has been conducted to investigate both immune defects, potentially leading to recurrent Group A β-hemolytic Streptococcus (GABHS) infections, and immune dysregulation responsible for a systemic inflammatory state. Twenty-six PANDAS patients with relapsing-remitting course of disease and 11 controls with recurrent pharyngotonsillitis were enrolled. Each subject underwent a detailed phenotypic and immunological assessment including cytokine profile. A possible correlation of immunological parameters with clinical-anamnestic data was analyzed. No inborn errors of immunity were detected in either group, using first level immunological assessments. However, a trend toward higher TNF-alpha and IL-17 levels, and lower C3 levels, was detected in the PANDAS patients compared to the control group. Maternal autoimmune diseases were described in 53.3% of PANDAS patients and neuropsychiatric symptoms other than OCD and tics were detected in 76.9% patients. ASO titer did not differ significantly between the two groups. A possible correlation between enduring inflammation (elevated serum TNF-α and IL-17) and the persistence of neuropsychiatric symptoms in PANDAS patients beyond infectious episodes needs to be addressed. Further studies with larger cohorts would be pivotal to better define the role of TNF-α and IL-17 in PANDAS pathophysiology.

  9. Pandas Practice Dataset

    • kaggle.com
    zip
    Updated Jan 27, 2023
    Cite
    Mrityunjay Pathak (2023). Pandas Practice Dataset [Dataset]. https://www.kaggle.com/datasets/themrityunjaypathak/pandas-practice-dataset/discussion
    Explore at:
    Available download formats: zip (493 bytes)
    Dataset updated
    Jan 27, 2023
    Authors
    Mrityunjay Pathak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    What is Pandas?

    Pandas is a Python library used for working with data sets.

    It has functions for analyzing, cleaning, exploring, and manipulating data.

    The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.

    Why Use Pandas?

    Pandas allows us to analyze big data and make conclusions based on statistical theories.

    Pandas can clean messy data sets, and make them readable and relevant.

    Relevant data is very important in data science.

    What Can Pandas Do?

    Pandas gives you answers about the data. Like:

    Is there a correlation between two or more columns?

    What is the average value?

    What is the max value?

    What is the min value?
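
    A minimal pandas sketch that answers those questions for any CSV; the filename here is a placeholder, not part of this dataset:

    ```py
    import pandas as pd

    df = pd.read_csv("practice.csv")  # placeholder filename

    print(df.corr(numeric_only=True))   # correlation between numeric columns
    print(df.mean(numeric_only=True))   # average value of each column
    print(df.max(numeric_only=True))    # max value of each column
    print(df.min(numeric_only=True))    # min value of each column
    ```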

  10. Learn Pandas

    • kaggle.com
    zip
    Updated Oct 5, 2023
    Cite
    Vaidik Patel (2023). Learn Pandas [Dataset]. https://www.kaggle.com/datasets/js1js2js3js4js5/learn-pandas
    Explore at:
    Available download formats: zip (1209861 bytes)
    Dataset updated
    Oct 5, 2023
    Authors
    Vaidik Patel
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    It is a dataset paired with a notebook-style course. Download the whole package and you will find everything you need to learn pandas, from the basics to advanced topics, which is exactly what you will need in machine learning and data science. 😄

    It gives you an overview of the data analysis tools in pandas that are most often required for data manipulation and for extracting important data.

    Use this notebook as notes for pandas: whenever you forget the code or syntax, open it, scroll through it, and you will find the solution. 🥳

  11. oldIT2modIT

    • huggingface.co
    Cite
    Massimo Romano, oldIT2modIT [Dataset]. https://huggingface.co/datasets/cybernetic-m/oldIT2modIT
    Explore at:
    Authors
    Massimo Romano
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Download the dataset

    At the moment, to download the dataset you should use a Pandas DataFrame:

    import pandas as pd
    df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")

    You can visualize the dataset with:

    df.head()

    To convert it into a Hugging Face dataset:

    from datasets import Dataset
    dataset = Dataset.from_pandas(df)

      Dataset Description
    

    This is an Italian dataset made up of 200 old (ancient) Italian sentences and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.

  12. Correcting Transiting Lightcurves of Exoplanets From Non-linear...

    • data.niaid.nih.gov
    Updated Mar 19, 2025
    Cite
    Yip, Kai Hou; Simões, Luís; Waldmann, Ingo; Tsiaras, Angelos; Nikolaou, Nikolaos (2025). Correcting Transiting Lightcurves of Exoplanets From Non-linear Astrophysical and Instrumental Noise [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14793721
    Explore at:
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    University College London
    ML Analytics
    Spaceflux Ltd
    Authors
    Yip, Kai Hou; Simões, Luís; Waldmann, Ingo; Tsiaras, Angelos; Nikolaou, Nikolaos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose

    This dataset, restructured and curated from the original competition data of the 2019 and 2021 editions of the Ariel Data Challenge, is kindly provided by Dr Luís F. Simões from ML Analytics. It is designed to advance the development of machine learning techniques to detrend or denoise non-linear noise in astronomical observation data, including, but not limited to, exoplanet observations.

    Introduction

    Exoplanets are planets orbiting other stars, just as the planets of our own solar system orbit our sun. To date, we know of over 5500 exoplanets in more than 3400 solar systems (but this number changes daily, so have a look at the NASA Exoplanet webpage for an accurate count).

    When analysing these distant worlds, disentangling the effects of stellar activity and the non-linear noise of the instrument are the major data analysis challenges in the field and directly impact our scientific measurements. Without correcting for brightness variability of the star and sensitivity variations of the instrument, we are not able to measure the radius of the planet correctly and, perhaps more importantly, the chemistry of its atmosphere.

    The Ariel mission is a European Space Agency (ESA) medium size mission (~500M Euros) to be launched to the second Lagrange point in 2029. The goal of the mission is to study the atmospheres and the chemistry of 1000 extrasolar planets (aka exoplanets) in our local galactic neighbourhood. By understanding their atmospheres, we can infer how these planets formed, what their natures are like and ultimately put our own solar system into context. For more information on Ariel, here is the link to the website.

    The dataset contains simulated observations of exoplanets transiting their respective host stars; these lightcurves are corrupted by astrophysical noise (such as stellar spots) and instrumental noise (such as photon noise, persistent noise and 1/f noise).

    Drifts in the data

    Data drift undermines the performance of a model at test time and in production. Ariel presents a unique challenge: the simulated data is likely not a good representation of the instrument's actual performance in space. To simulate this challenge, the curated dataset combines data from the 2019 and 2021 editions, both targeting the same problem (as detailed in the problem statement) but generated from two different data generation pipelines. The two datasets are separated into folders and are aligned in the sense that the same set of planets (and their respective planetary systems) is used to simulate the observations (but with different simulation pipelines).

    Problem Statement:

    Given noisy lightcurves (corrupted by different physical processes) and auxiliary information about the planets and their respective planetary systems, devise solutions that convert these noisy observations (referred to as lightcurves) into transmission spectra. The task is conventionally advertised as utilising machine learning techniques; however, users are free to use other methods as appropriate. This task is a major step in the processing of astrophysical observations, so that the resultant spectrum can eventually be used to interpret the atmospheres of exoplanets.

    Reading the Data

    Both files are stored in hdf5 format and can be loaded using the following Python code:

    import h5py

    adc19_data = h5py.File('adc19_core.h5','r')

    print(adc19_data.keys())

    adc19_data.close()

    This format is best for reading in the spectroscopic lightcurve input data ('X') and the target transmission spectra ('y').

    To read the auxiliary parameters, i.e. the planetary system information, we recommend using Pandas to extract them, e.g.

    import pandas as pd

    h5_data = pd.HDFStore('adc21_core.h5','r')

    df = h5_data['y_params']

    or

    df = pd.read_hdf('adc19_core.h5','X_params')

    Notebooks:

    We have included notebooks to walk you through how to divide the dataset to train a model without leakage from the test set (Example Train Test Split.ipynb), as well as a dummy model (DummyModel.ipynb) to calculate your score (based on the 2019/2021 competitions).

    Data Structure

    Each hdf5 file contains a nested file structure with the following keys:

    1. obs_to_fname:

      • Description: A pandas readable table containing the mapping from observation filenames (e.g., AAAA_BB_CC.txt) to tuples (A, B, C), where:

        • AAAA: Planet index (0001 to 2097).

        • BB: Stellar spot noise instance (01 to 10).

        • CC: Gaussian photon noise instance (01 to 10).

    2. planet:

      • Description: A pandas readable table containing information about the observed planetary systems. This data is repeated for each simulation instance.

      • Shape: (N_planets, ...).

    3. X_params

      • Description: A pandas readable table containing auxiliary information about the observations, including stellar and planetary parameters.

      • Shape: (N_observations, 9).

      • Columns: planet, stellar_spot, photon, star_temp, star_logg, star_rad, star_mass, star_k_mag, period.

    4. y_params:

      • Description: A pandas readable table containing auxiliary information about the targets, their photon or stellar spot instances, including optional parameters (sma and incl).

      • Shape: (N_observations, 5).

      • Columns: planet, stellar_spot,photon, sma (semimajor axis), incl (inclination).

    5. X:

      • Description: Noisy observations of light curves. This is a nested dictionary where each key corresponds to an observation filename (e.g., 0001_01_01.txt), and the value is a 2D array of relative fluxes.

      • Shape: (55 wavelengths x 300 time steps).

    6. y:

      • Description: Target values for the regression problem. Another nested dictionary structure, where each key contains a 1D array of relative radii (planet-to-star radius ratios), one per wavelength channel (see the sketch after this list).

      • Shape: (55 wavelengths,).
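
    A minimal access sketch based on the structure described above; the exact HDF5 group layout for 'X' and 'y' is an assumption inferred from the "nested dictionary" description, so adjust the keys if your file differs:

    import h5py

    with h5py.File('adc19_core.h5', 'r') as f:
        # keys of 'X' are observation filenames such as '0001_01_01.txt' (per the description above)
        obs_key = list(f['X'].keys())[0]
        x = f['X'][obs_key][...]  # noisy lightcurve, expected shape (55 wavelengths, 300 time steps)
        y = f['y'][obs_key][...]  # target relative radii, expected shape (55,)
        print(obs_key, x.shape, y.shape)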

    Data Format

    Observations (X)

    • Each observation is a 2D array of relative fluxes, organized as follows:

      • Rows: 55 wavelength channels (w1 to w55).

      • Columns: 300 time steps (t1 to t300).

    • Example structure (these numbers are only representative and do not form part of the dataset):

    t1 t2 ... t300

    w1 1.00010151742 1.00010151742 ... 1.00010151742

    w2 0.999857792623 0.999857792623 ... 0.999857792623

    ... ... ... ... ...

    w55 0.999468565171 0.999468565171 ... 0.999468565171

    Targets (y)

    • Each target is a 1D array of relative radii for 55 wavelength channels. These numbers are only representative and do not form part of the dataset.

    Example structure:

    w1 w2 ... w55

    1.00010151742 1.00010151742 ... 1.00010151742

    Auxiliary Parameters (X_params and y_params)

    • X_params contains stellar and planetary parameters for each observation.

    • y_params contains optional parameters (sma and incl) that can be used as intermediate targets or ignored. The other columns contain the photon or stellar spot noise instances.

    Complementary information:

    mcs19.pkl is a pickle file modified from the original 2019 Mission Candidate Sample, kindly provided by Dr. Edwards. We have added two additional columns inside it: stellar log gravity in two different unit systems. If you wish to include this in your research, you are welcome to do so. However, doing this may make your results potentially incomparable to existing results from the competition, as past competitions did not have this information available.

  13. Synthetic datasets for SimiC

    • databank.illinois.edu
    Updated Apr 15, 2022
    Cite
    Peng Jianhao; Ochoa Idoia (2022). Synthetic datasets for SimiC [Dataset]. http://doi.org/10.13012/B2IDB-4996748_V1
    Explore at:
    Dataset updated
    Apr 15, 2022
    Authors
    Peng Jianhao; Ochoa Idoia
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is the 5-state, 5,000-cell synthetic expression file we used for validation of SimiC, a single-cell gene regulatory network inference method with similarity constraints. Ground-truth GRNs are stored in NumPy array format, and the expression profiles of all states combined are stored as Pandas DataFrames in Pickle files.
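
    A loading sketch under stated assumptions: the file names below are placeholders, and the exact formats (.npy for the ground-truth GRNs, a pickled DataFrame for the expression profiles) are inferred from the description above:

    ```py
    import numpy as np
    import pandas as pd

    grn = np.load("ground_truth_grn.npy")             # placeholder name; ground-truth GRN as a NumPy array
    expression = pd.read_pickle("expression.pickle")  # placeholder name; combined expression profiles as a DataFrame

    print(grn.shape)
    print(expression.head())
    ```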

  14. PlotQA_V1

    • huggingface.co
    Updated Sep 22, 2025
    Cite
    Aryan Badkul (2025). PlotQA_V1 [Dataset]. https://huggingface.co/datasets/Abd223653/PlotQA_V1
    Explore at:
    Dataset updated
    Sep 22, 2025
    Authors
    Aryan Badkul
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Plotqa V1

      Dataset Description
    

    This dataset was uploaded from a pandas DataFrame.

      Dataset Structure
    
    
    
    
    
      Overview
    

    Total Examples: 5,733,893
    Total Features: 9
    Dataset Size: ~2805.4 MB
    Format: Parquet files
    Created: 2025-09-22 20:12:01 UTC

      Data Instances
    

    The dataset contains 5,733,893 rows and 9 columns.

      Data Fields
    

    image_index (int64): 0 null values (0.0%), Range: [0.00, 157069.00], Mean: 78036.26
    qid (object): 0 null values (0.0%)… See the full description on the dataset page: https://huggingface.co/datasets/Abd223653/PlotQA_V1.

  15. Table4_Whole genome bisulfite sequencing reveals DNA methylation roles in...

    • figshare.com
    • frontiersin.figshare.com
    xlsx
    Updated Jun 13, 2023
    Cite
    Xiaodie Jie; Honglin Wu; Miao Yang; Ming He; Guangqing Zhao; Shanshan Ling; Yan Huang; Bisong Yue; Nan Yang; Xiuyue Zhang (2023). Table4_Whole genome bisulfite sequencing reveals DNA methylation roles in the adaptive response of wildness training giant pandas to wild environment.XLSX [Dataset]. http://doi.org/10.3389/fgene.2022.995700.s004
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 13, 2023
    Dataset provided by
    Frontiers
    Authors
    Xiaodie Jie; Honglin Wu; Miao Yang; Ming He; Guangqing Zhao; Shanshan Ling; Yan Huang; Bisong Yue; Nan Yang; Xiuyue Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DNA methylation modification can regulate gene expression without changing the genome sequence, which helps organisms to rapidly adapt to new environments. However, few studies have been reported in non-model mammals. Giant panda (Ailuropoda melanoleuca) is a flagship species for global biodiversity conservation. Wildness and reintroduction of giant pandas are the important content of giant pandas’ protection. However, it is unclear how wildness training affects the epigenetics of giant pandas, and we lack the means to assess the adaptive capacity of wildness training giant pandas. We comparatively analyzed genome-level methylation differences in captive giant pandas with and without wildness training to determine whether methylation modification played a role in the adaptive response of wildness training pandas. The whole genome DNA methylation sequencing results showed that genomic cytosine methylation ratio of all samples was 5.35%–5.49%, and the methylation ratio of the CpG site was the highest. Differential methylation analysis identified 544 differentially methylated genes (DMGs). The results of KEGG pathway enrichment of DMGs showed that VAV3, PLCG2, TEC and PTPRC participated in multiple immune-related pathways, and may participate in the immune response of wildness training giant pandas by regulating adaptive immune cells. A large number of DMGs enriched in GO terms may also be related to the regulation of immune activation during wildness training of giant pandas. Promoter differentially methylation analysis identified 1,199 genes with differential methylation at promoter regions. Genes with low methylation level at promoter regions and high expression such as, CCL5, P2Y13, GZMA, ANP32A, VWF, MYOZ1, NME7, MRPS31 and TPM1 were important in environmental adaptation for wildness training giant pandas. The methylation and expression patterns of these genes indicated that wildness training giant pandas have strong immunity, blood coagulation, athletic abilities and disease resistance. The adaptive response of giant pandas undergoing wildness training may be regulated by their negatively related promoter methylation. We are the first to describe the DNA methylation profile of giant panda blood tissue and our results indicated methylation modification is involved in the adaptation of captive giant pandas when undergoing wildness training. Our study also provided potential monitoring indicators for the successful reintroduction of valuable and threatened animals to the wild.

  16. SELTO Dataset

    • data.niaid.nih.gov
    Updated May 23, 2023
    Cite
    Dittmer, Sören; Erzmann, David; Harms, Henrik; Falck, Rielson; Gosch, Marco (2023). SELTO Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7034898
    Explore at:
    Dataset updated
    May 23, 2023
    Dataset provided by
    ArianeGroup GmbH
    University of Bremen, University of Cambridge
    University of Bremen
    Authors
    Dittmer, Sören; Erzmann, David; Harms, Henrik; Falck, Rielson; Gosch, Marco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Benchmark Dataset for Deep Learning for 3D Topology Optimization

    This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

    The following paper provides full documentation and examples:

    Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

    The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets. Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

    For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

    E - Young's modulus [Pa]

    ν - Poisson's ratio [-]

    σ_ys - a yield stress [Pa]

    h - discretization size of the voxel grid [m]

    The columns of i.csv correspond to the following voxel-wise information:

    x, y, z - the indices that state the location of the voxel within the voxel mesh

    Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized

    Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension

    F_x, F_y, F_z - floating point variables that define the three spacial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]

    density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

    How to Import the Dataset

    with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial this can be done via:

    from dl4to.datasets import SELTODataset

    dataset = SELTODataset(root=root, name=name, train=train)

    Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

    without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

    import pandas as pd

    root = ...
    file_path = f'{root}/{i}.csv'
    columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
    df = pd.read_csv(file_path, names=columns)

    Similarly, we can import an i_info.csv file via:

    file_path = f'{root}/{i}_info.csv'
    info_column_names = ['E', 'ν', 'σ_ys', 'h']
    df_info = pd.read_csv(file_path, names=info_column_names)

    We can extract PyTorch tensors from the Pandas dataframe df using the following function:

    import torch

    def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
        shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
        voxels = [df['x'].values, df['y'].values, df['z'].values]

        Ω_design = torch.zeros(1, *shape, dtype=int)
        Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

        Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
        Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
        Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
        Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

        F = torch.zeros(3, *shape, dtype=dtype)
        F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
        F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
        F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

        density = torch.zeros(1, *shape, dtype=dtype)
        density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

        return Ω_design, Ω_Dirichlet, F, density
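
    A quick usage sketch tying the two snippets above together (the sample index i and root path are placeholders, as before):

    # assuming df was loaded from f'{root}/{i}.csv' as shown above
    Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)
    print(Ω_design.shape, Ω_Dirichlet.shape, F.shape, density.shape)
    # expected shapes: (1, X, Y, Z), (3, X, Y, Z), (3, X, Y, Z), (1, X, Y, Z) for an X x Y x Z voxel grid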
    
  17. Description of ecological and anthropogenic covariates and their predicted...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 14, 2017
    Cite
    Panthi, Saroj; Aryal, Achyut; Srivathsa, Arjun; Khanal, Gopal; Acharya, Krishna Prasad (2017). Description of ecological and anthropogenic covariates and their predicted influence (direction) on parameters of interest: Site-level occupancy probability (ψ), and detection probability (p); a priori predictions about their influence on probability of red panda occupancy are also described. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001742443
    Explore at:
    Dataset updated
    Jul 14, 2017
    Authors
    Panthi, Saroj; Aryal, Achyut; Srivathsa, Arjun; Khanal, Gopal; Acharya, Krishna Prasad
    Description

    The relationship between the parameter of interest and the covariate is assumed to be linear (on the logit scale) unless specified otherwise.

  18. my-pandas-dataset-AbstractAndLink

    • huggingface.co
    Updated Aug 25, 2023
    + more versions
    Cite
    Alvian Khairi (2023). my-pandas-dataset-AbstractAndLink [Dataset]. https://huggingface.co/datasets/AlvianKhairi/my-pandas-dataset-AbstractAndLink
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 25, 2023
    Authors
    Alvian Khairi
    Description

    Dataset Card for "my-pandas-dataset-AbstractAndLink"

    More Information needed

  19. Datatset: Machine-Learning Side-Channel Attacks on the GALACTICS...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Jul 15, 2021
    Cite
    Soundes Marzougui; Soundes Marzougui; Nils Wisiol; Nils Wisiol; Patrick Gersch; Patrick Gersch; Juliane Krämer; Juliane Krämer; Jean-Pierre Seifert; Jean-Pierre Seifert (2021). Datatset: Machine-Learning Side-Channel Attacks on the GALACTICS Constant-Time Implementation of BLISS [Dataset]. http://doi.org/10.5281/zenodo.5101343
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Soundes Marzougui; Soundes Marzougui; Nils Wisiol; Nils Wisiol; Patrick Gersch; Patrick Gersch; Juliane Krämer; Juliane Krämer; Jean-Pierre Seifert; Jean-Pierre Seifert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the paper "Machine-Learning Side-Channel Attacks on the GALACTICS Constant-Time Implementation of BLISS". It was used to experimentally prove the presented attack strategies on real hardware. The corresponding source code for all three attacks is also publicly available.

    A detailed description of how the data was obtained can be found in the paper. Section 4 precisely describes the experimental setup.

    Prerequisites:

    sudo apt-get install p7zip

    Extract the data:

    7z x galactics_attack_data.7z

    Running the attacks:

    The source code to run the three presented attacks can be found on GitHub. The instructions on how to use the Python code can be obtained from the corresponding README.

    Re-using the dataset:

    The dataset consists of .pickle and .bin files. The .pickle files can be read using Python's Pandas library. Python access functions for the .bin files are also provided.
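
    For example, a minimal sketch for inspecting one of the extracted files with Pandas; the path below is a placeholder, not a file guaranteed to be in the archive:

    import pandas as pd

    traces = pd.read_pickle('galactics_attack_data/traces.pickle')  # placeholder path
    print(traces.head())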

  20. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating and discussion website, and one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit has different subreddits; here we'll use the r/AskScience subreddit.

    The dataset is extracted from the subreddit /r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 datapoints and 25 columns. The dataset contains information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, and a little bit of cleaning was done using NumPy and pandas as well (see the descriptions of the individual columns below).

    The dataset contains the following columns and descriptions:

    author - Redditor name
    author_fullname - Redditor full name
    contest_mode - Contest mode [implement obscured scores and randomized sorting]
    created_utc - Time the submission was created, represented in Unix Time
    domain - Domain of submission
    edited - If the post is edited or not
    full_link - Link of the post on the subreddit
    id - ID of the submission
    is_self - Whether or not the submission is a self post (text-only)
    link_flair_css_class - CSS class used to identify the flair
    link_flair_text - Flair on the post, or the link flair's text content
    locked - Whether or not the submission has been locked
    num_comments - The number of comments on the submission
    over_18 - Whether or not the submission has been marked as NSFW
    permalink - A permalink for the submission
    retrieved_on - Time ingested
    score - The number of upvotes for the submission
    description - Description of the submission
    spoiler - Whether or not the submission has been marked as a spoiler
    stickied - Whether or not the submission is stickied
    thumbnail - Thumbnail of the submission
    question - Question asked in the submission
    url - The URL the submission links to, or the permalink if a self post
    year - Year of the submission
    banned - Banned by the moderator or not

    This dataset can be used for Flair Prediction, NSFW Classification, and different Text Mining/NLP tasks. Exploratory Data Analysis can also be done to get the insights and see the trend and patterns over the years.
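
    A minimal sketch for getting started with flair prediction; the filename is a placeholder, while the column names come from the list above:

    ```py
    import pandas as pd

    df = pd.read_csv("askscience_submissions.csv")  # placeholder filename

    # Class balance of the flairs (target for flair prediction)
    print(df["link_flair_text"].value_counts().head(10))

    # NSFW vs SFW split, useful for NSFW classification
    print(df["over_18"].value_counts())
    ```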
