40 datasets found
  1. Convert Text to Pandas

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Cite
    Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
    Explore at:
    zip(4333134 bytes)Available download formats
    Dataset updated
    Sep 22, 2024
    Authors
    Zeyad Usf
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    kaggle notebook
    Github Repo

    I found two datasets on Hugging Face for converting text with context to pandas code, but the challenge is in the context: it is structured differently in the two datasets, which reduces the model's results. First, let's describe the data I found, then show examples, the solution, and some other problems.

    • Rahima411/text-to-pandas:

      • The data is divided into Train with 57.5k and Test with 19.2k.

      • The data has two columns as you can see in the example:

        • "Input": Contains the context and the question together; the context holds the metadata about the data frame.
        • "Pandas Query": Contains the Pandas code.

          Input                                                   | Pandas Query
          --------------------------------------------------------|------------------------------------------
          Table Name: head (age (object), head_id (object))       | result = management['head.age'].unique()
          Table Name: management (head_id (object),               |
          temporary_acting (object))                              |
          What are the distinct ages of the heads who are acting? |
    • hiltch/pandas-create-context:

      • It contains 17k rows with three columns:
        • question: the question text.
        • context: code that creates a data frame with its column names, unlike the first dataset, whose context gives the data frame name, column names, and data types.
        • answer: the Pandas code.
          question           |            context             |       answer 
    ----------------------------------------|--------------------------------------------------------|---------------------------------------
    What was the lowest # of total votes?  | df = pd.DataFrame(columns=['_number_of_total_votes']) | df['_number_of_total_votes'].min()   
    

    As you can see, the problem with these data is that their inputs are not similar and the structure of the context is different. My solution to this problem was:

    - Convert the first dataset so that its context matches the second. I chose this direction because it is difficult to recover the data types of the columns in the second dataset. It was easy to convert the structure of the context from `Table Name: head (age (object), head_id (object))` to `head = pd.DataFrame(columns=['age','head_id'])` through the code I wrote below.
    - Then separate the question from the context. This was easy because, if you look at the data, you will find that the context always ends with ")", followed by a blank and then the question. You will find all of this in the code below.
    - You will also notice that more than one table (and therefore more than one DataFrame creation line) can appear in a context; this has also been handled in the code.

    ```py
    import re

    def extract_table_creation(text: str) -> (str, str):
        """
        Extracts DataFrame creation statements and questions from the given text.

        Args:
          text (str): The input text containing table definitions and questions.

        Returns:
          tuple: A tuple containing a concatenated DataFrame creation string and a question.
        """
        # Define patterns
        table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
        column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

        # Find all table names and column definitions
        matches = re.findall(table_pattern, text)

        # Initialize a list to hold DataFrame creation statements
        df_creations = []

        for table_name, columns_str in matches:
            # Extract column names
            columns = re.findall(column_pattern, columns_str)
            column_names = [col[0] for col in columns]

            # Format DataFrame creation statement
            df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
            df_creations.append(df_creation)

        # Concatenate all DataFrame creation statements
        df_creation_concat = '\n'.join(df_creations)

        # Extract and clean the question
        question = text[text.rindex(')') + 1:].strip()

        return df_creation_concat, question
    ```
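
    For illustration, calling this helper on the example Input from the first dataset (a quick sketch; it assumes the function and the re import above):

    ```py
    text = ("Table Name: head (age (object), head_id (object)) "
            "Table Name: management (head_id (object), temporary_acting (object)) "
            "What are the distinct ages of the heads who are acting?")

    context, question = extract_table_creation(text)
    print(context)
    # head = pd.DataFrame(columns=['age', 'head_id'])
    # management = pd.DataFrame(columns=['head_id', 'temporary_acting'])
    print(question)
    # What are the distinct ages of the heads who are acting?
    ```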
    
    After both datasets had the same structure, they were merged into one set and divided into _72.8K_ train and _18.6K_ test examples. We analyzed this dataset, and you can see it all in the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we also found some problems in the dataset, such as:
    > - `Answer` : `df['Id'].count()` is repeated many times, but such repetition is plausible, so we do not need to drop these rows.
    > - `Context` : It contains `147` rows with no text at all. We will see through the experiments whether this affects the results negatively or positively.
    > - `Question` : It is ...
    
  2. PandasPlotBench

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
    Explore at:
    Croissant (a machine-learning dataset format; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    JetBrainshttp://jetbrains.com/
    Authors
    JetBrains Research
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    PandasPlotBench

    PandasPlotBench is a benchmark that assesses the capability of models to write code for visualizations given the description of a Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.

  3. Shopping Mall

    • kaggle.com
    zip
    Updated Dec 15, 2023
    Cite
    Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
    Explore at:
    zip(22852 bytes)Available download formats
    Dataset updated
    Dec 15, 2023
    Authors
    Anshul Pachauri
    Description

    Libraries Import:

    Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

    Data Loading and Exploration:

    Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().

    Univariate Analysis:

    Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

    Bivariate Analysis:

    Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

    Gender-Based Analysis:

    Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.

    Univariate Clustering:

    Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.

    Bivariate Clustering:

    Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

    Multivariate Clustering:

    Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.

    Result Saving:

    Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
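
    As an illustration of the clustering steps above, here is a minimal sketch (not the notebook's own code; it assumes the standard Mall_Customers.csv column names):

    ```python
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    df = pd.read_csv("Mall_Customers.csv")
    X = df[["Annual Income (k$)", "Spending Score (1-100)"]]

    # Elbow method: plot inertia for k = 1..10
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in range(1, 11)]
    plt.plot(range(1, 11), inertias, marker="o")
    plt.xlabel("Number of clusters")
    plt.ylabel("Inertia")
    plt.show()

    # Bivariate clustering with 5 clusters, as in the description
    km = KMeans(n_clusters=5, n_init=10, random_state=42)
    df["Spending and Income Cluster"] = km.fit_predict(X)
    print(pd.crosstab(df["Spending and Income Cluster"], df["Gender"], normalize="index"))
    ```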

  4. pandas-create-context

    • huggingface.co
    Updated Jan 8, 2024
    Cite
    Or Hiltch (2024). pandas-create-context [Dataset]. https://huggingface.co/datasets/hiltch/pandas-create-context
    Explore at:
    Croissant (a machine-learning dataset format; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2024
    Authors
    Or Hiltch
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset is built from sql-create-context, which itself builds on WikiSQL and Spider. I have used GPT-4 to translate the SQL schemas into pandas DataFrame schema initialization statements and to translate the SQL queries into pandas queries. There are 862 examples of natural language queries, pandas DataFrame creation statements, and pandas queries answering the question using the DataFrame creation statement as context. This dataset was built with text-to-pandas LLMs… See the full description on the dataset page: https://huggingface.co/datasets/hiltch/pandas-create-context.

  5. Merge number of excel file,convert into csv file

    • kaggle.com
    zip
    Updated Mar 30, 2024
    Cite
    Aashirvad pandey (2024). Merge number of excel file,convert into csv file [Dataset]. https://www.kaggle.com/datasets/aashirvadpandey/merge-number-of-excel-fileconvert-into-csv-file
    Explore at:
    zip(6731 bytes)Available download formats
    Dataset updated
    Mar 30, 2024
    Authors
    Aashirvad pandey
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Project Description:

    Title: Pandas Data Manipulation and File Conversion

    Overview: This project aims to demonstrate the basic functionalities of Pandas, a powerful data manipulation library in Python. In this project, we will create a DataFrame, perform some data manipulation operations using Pandas, and then convert the DataFrame into both Excel and CSV formats.

    Key Objectives:

    1. DataFrame Creation: Utilize Pandas to create a DataFrame with sample data.
    2. Data Manipulation: Perform basic data manipulation tasks such as adding columns, filtering data, and performing calculations.
    3. File Conversion: Convert the DataFrame into Excel (.xlsx) and CSV (.csv) file formats.

    Tools and Libraries Used:

    • Python
    • Pandas

    Project Implementation:

    1. DataFrame Creation:

      • Import the Pandas library.
      • Create a DataFrame using either a dictionary, a list of dictionaries, or by reading data from an external source like a CSV file.
      • Populate the DataFrame with sample data representing various data types (e.g., integer, float, string, datetime).
    2. Data Manipulation:

      • Add new columns to the DataFrame representing derived data or computations based on existing columns.
      • Filter the DataFrame to include only specific rows based on certain conditions.
      • Perform basic calculations or transformations on the data, such as aggregation functions or arithmetic operations.
    3. File Conversion:

      • Utilize Pandas to convert the DataFrame into an Excel (.xlsx) file using the to_excel() function.
      • Convert the DataFrame into a CSV (.csv) file using the to_csv() function.
      • Save the generated files to the local file system for further analysis or sharing.

    Expected Outcome:

    Upon completion of this project, you will have gained a fundamental understanding of how to work with Pandas DataFrames, perform basic data manipulation tasks, and convert DataFrames into different file formats. This knowledge will be valuable for data analysis, preprocessing, and data export tasks in various data science and analytics projects.

    Conclusion:

    The Pandas library offers powerful tools for data manipulation and file conversion in Python. By completing this project, you will have acquired essential skills that are widely applicable in the field of data science and analytics. You can further extend this project by exploring more advanced Pandas functionality or by integrating it into larger data processing pipelines. In this dataset, a number of data tables are combined, each is turned into a DataFrame, all of them are saved into a single Excel file as separate sheets, and that Excel file is then converted into CSV, as sketched below.
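
    A minimal sketch of that workflow (illustrative only; the sheet names and sample data are made up, and writing .xlsx files requires an Excel engine such as openpyxl):

    ```python
    import pandas as pd

    # Example DataFrames standing in for the merged source data
    sales = pd.DataFrame({"month": ["Jan", "Feb"], "revenue": [100.5, 120.0]})
    users = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 27]})

    # Save both DataFrames into a single Excel file as separate sheets
    with pd.ExcelWriter("merged.xlsx") as writer:
        sales.to_excel(writer, sheet_name="sales", index=False)
        users.to_excel(writer, sheet_name="users", index=False)

    # Convert every sheet of that Excel file into its own CSV file
    sheets = pd.read_excel("merged.xlsx", sheet_name=None)  # dict: sheet name -> DataFrame
    for name, frame in sheets.items():
        frame.to_csv(f"{name}.csv", index=False)
    ```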

  6. Zippi_Shvartsman_et_al_2023_bmi_manual_files

    • figshare.com
    bin
    Updated Aug 29, 2023
    Cite
    Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena (2023). Zippi_Shvartsman_et_al_2023_bmi_manual_files [Dataset]. http://doi.org/10.6084/m9.figshare.23674200.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    Aug 29, 2023
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Included files: Each file includes LFP (local field potential) data for both animals (‘h’, ‘y’) during a particular type of task control (‘bmi’ or ‘manual’), time-locked to 500 ms before or after a particular event in the task (‘go_cue’ or ‘target’), for each rewarded trial in each day of the task (‘h’: [1-13], ‘y’: [1-22]).

    File description: Each file includes a Pandas DataFrame, saved as a .feather file. Data can be accessed using Python by calling:

    import pandas as pd
    pd.read_feather([file name])

    Each DataFrame has the following columns:

    control_type: ‘bmi’, ‘manual’, or ‘baseline’
    event: go cue (‘go_cue’) or target acquisition (‘target’)
    subj: which animal, ‘h’ or ‘y’
    day: which day of the session, ‘h’: [1-13], ‘y’: [1-22]
    roi: region of interest; ‘direct’, ‘dlpfc’, or ‘cd’, where ‘direct’ includes most channels from m1 each day but is specific to channels which had sufficient spiking to be used as input to the BMI decoder
    ch: electrode channel number; only low-noise channels were included (see Methods for details)
    n_rewarded_trial: which trial number the data segment is from; only successfully completed (rewarded) trials are included
    time_from_window_ms: for go_cue, 0-500 ms from the go cue; for target, -500-0 ms from target acquisition
    lfp: local field potential value (see Methods for details)
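
    As a quick illustration of working with these columns (a sketch; the file name below is a placeholder for one of the provided .feather files):

    import pandas as pd

    df = pd.read_feather("bmi_go_cue.feather")  # placeholder file name

    # e.g., keep the 'direct' region-of-interest channels of animal 'h' on day 1
    subset = df[(df["subj"] == "h") & (df["day"] == 1) & (df["roi"] == "direct")]
    print(subset[["n_rewarded_trial", "time_from_window_ms", "lfp"]].head())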

  7. SELTO Dataset

    • data.niaid.nih.gov
    Updated May 23, 2023
    Cite
    Dittmer, Sören; Erzmann, David; Harms, Henrik; Falck, Rielson; Gosch, Marco (2023). SELTO Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7034898
    Explore at:
    Dataset updated
    May 23, 2023
    Dataset provided by
    University of Bremen, University of Cambridge
    ArianeGroup GmbH
    University of Bremen
    Authors
    Dittmer, Sören; Erzmann, David; Harms, Henrik; Falck, Rielson; Gosch, Marco
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Benchmark Dataset for Deep Learning for 3D Topology Optimization

    This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

    The following paper provides full documentation and examples:

    Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

    The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets. Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

    For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

    E - Young's modulus [Pa]

    ν - Poisson's ratio [-]

    σ_ys - a yield stress [Pa]

    h - discretization size of the voxel grid [m]

    The columns of i.csv correspond to the following voxel-wise information:

    x, y, z - the indices that state the location of the voxel within the voxel mesh

    Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized

    Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension

    F_x, F_y, F_z - floating point variables that define the three spacial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]

    density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

    How to Import the Dataset

    with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial this can be done via:

    from dl4to.datasets import SELTODataset

    dataset = SELTODataset(root=root, name=name, train=train)

    Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

    without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

    import pandas as pd

    root = ...
    file_path = f'{root}/{i}.csv'
    columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
    df = pd.read_csv(file_path, names=columns)

    Similarly, we can import an i_info.csv file via:

    file_path = f'{root}/{i}_info.csv'
    info_column_names = ['E', 'ν', 'σ_ys', 'h']
    df_info = pd.read_csv(file_path, names=info_column_names)

    We can extract PyTorch tensors from the Pandas dataframe df using the following function:

    import torch

    def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
        shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
        voxels = [df['x'].values, df['y'].values, df['z'].values]

        Ω_design = torch.zeros(1, *shape, dtype=int)
        Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

        Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
        Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
        Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
        Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

        F = torch.zeros(3, *shape, dtype=dtype)
        F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
        F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
        F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

        density = torch.zeros(1, *shape, dtype=dtype)
        density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

        return Ω_design, Ω_Dirichlet, F, density
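
    For example, a hypothetical call tying the two snippets together (assuming df has been loaded from an i.csv file as shown above):

    Ω_design, Ω_Dirichlet, F, density = get_torch_tensors_from_dataframe(df)

    print(Ω_design.shape, Ω_Dirichlet.shape, F.shape, density.shape)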
    
  8. Zippi_Shvartsman_et_al_2023_baseline_files

    • figshare.com
    bin
    Updated Aug 29, 2023
    Cite
    Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena (2023). Zippi_Shvartsman_et_al_2023_baseline_files [Dataset]. http://doi.org/10.6084/m9.figshare.24052824.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    Aug 29, 2023
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Included files: Each file includes LFP (local field potential) data for both animals (‘h’, ‘y’) during rest periods for each day (‘baseline’) without any time-locking (500 ms segments were randomly selected from baseline in our analyses). Separate baseline files are included for each animal.

    File description: Each file includes a Pandas DataFrame, saved as a .feather file. Data can be accessed using Python by calling:

    import pandas as pd
    pd.read_feather([file name])

    Each DataFrame has the following columns:

    control_type: ‘bmi’, ‘manual’, or ‘baseline’
    subj: which animal, ‘h’ or ‘y’
    day: which day of the session, ‘h’: [1-13], ‘y’: [1-22]
    roi: region of interest; ‘direct’, ‘dlpfc’, or ‘cd’, where ‘direct’ includes most channels from m1 each day but is specific to channels which had sufficient spiking to be used as input to the BMI decoder
    ch: electrode channel number; only low-noise channels were included (see Methods for details)
    time_from_window_ms: represents every ms from start to end of the recorded rest period
    lfp: local field potential value (see Methods for details)

  9. Klib library python

    • kaggle.com
    zip
    Updated Jan 11, 2021
    Cite
    Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python
    Explore at:
    zip(89892446 bytes)Available download formats
    Dataset updated
    Jan 11, 2021
    Authors
    Sripaad Srinivasan
    Description

    The klib library enables us to quickly visualize missing data, perform data cleaning, plot data distributions, plot correlations, and visualize categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience in the examples section or on YouTube (Data Professor).

    Original Github repo


    Usage

    !pip install klib
    
    import klib
    import pandas as pd
    
    df = pd.DataFrame(data)
    
    # klib.describe functions for visualizing datasets
    - klib.cat_plot(df) # returns a visualization of the number and frequency of categorical features
    - klib.corr_mat(df) # returns a color-encoded correlation matrix
    - klib.corr_plot(df) # returns a color-encoded heatmap, ideal for correlations
    - klib.dist_plot(df) # returns a distribution plot for every numeric feature
    - klib.missingval_plot(df) # returns a figure containing information about missing values
    

    Examples

    Take a look at this starter notebook.

    Further examples, as well as applications of the functions can be found here.

    Contributing

    Pull requests and ideas, especially for further functions are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this Github repo.

    License

    MIT

  10. oldIT2modIT

    • huggingface.co
    Cite
    Massimo Romano, oldIT2modIT [Dataset]. https://huggingface.co/datasets/cybernetic-m/oldIT2modIT
    Explore at:
    Authors
    Massimo Romano
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Download the dataset

    At the moment, to download the dataset you should use a Pandas DataFrame:

    import pandas as pd
    df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")

    You can visualize the dataset with: df.head()

    To convert it into a Hugging Face dataset:

    from datasets import Dataset
    dataset = Dataset.from_pandas(df)

      Dataset Description
    

    This is an Italian dataset formed by 200 old (ancient) Italian sentences and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.

  11. PlotQA_V1

    • huggingface.co
    Updated Sep 22, 2025
    Cite
    Aryan Badkul (2025). PlotQA_V1 [Dataset]. https://huggingface.co/datasets/Abd223653/PlotQA_V1
    Explore at:
    Dataset updated
    Sep 22, 2025
    Authors
    Aryan Badkul
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Plotqa V1

      Dataset Description
    

    This dataset was uploaded from a pandas DataFrame.

      Dataset Structure
    
    
    
    
    
      Overview
    

    Total Examples: 5,733,893
    Total Features: 9
    Dataset Size: ~2805.4 MB
    Format: Parquet files
    Created: 2025-09-22 20:12:01 UTC

      Data Instances
    

    The dataset contains 5,733,893 rows and 9 columns.

      Data Fields
    

    image_index (int64): 0 null values (0.0%), Range: [0.00, 157069.00], Mean: 78036.26
    qid (object): 0 null values (0.0%)… See the full description on the dataset page: https://huggingface.co/datasets/Abd223653/PlotQA_V1.
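
    A minimal loading sketch (assuming the standard Hugging Face datasets API; the split name is an assumption and may differ on the dataset page):

    ```python
    from datasets import load_dataset

    ds = load_dataset("Abd223653/PlotQA_V1", split="train")  # split name assumed
    df = ds.to_pandas()  # back to a pandas DataFrame for inspection
    print(df.shape, df.columns.tolist())
    ```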

  12. Arabic(Indian) digits MADBase

    • kaggle.com
    zip
    Updated Jul 26, 2023
    Cite
    HOSSAM_AHMED_SALAH (2023). Arabic(Indian) digits MADBase [Dataset]. https://www.kaggle.com/datasets/hossamahmedsalah/arabicindian-digits-madbase/code
    Explore at:
    zip(15373598 bytes)Available download formats
    Dataset updated
    Jul 26, 2023
    Authors
    HOSSAM_AHMED_SALAH
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    India
    Description

    This dataset consists of flattened images, where each image is represented as a row.

    - Objective: Establish benchmark results for Arabic digit recognition using different classification techniques.
    - Objective: Compare performances of different classification techniques on Arabic and Latin digit recognition problems.
    - A valid comparison requires the Arabic and Latin digit databases to be in the same format.
    - A modified version of the ADBase (MADBase) with the same size and format as MNIST was created.
    - MADBase is derived from ADBase by size-normalizing each digit to a 20x20 box while preserving aspect ratio.
    - The size-normalization procedure results in gray levels due to the anti-aliasing filter.
    - MADBase and MNIST have the same size and format.
    - MNIST is a modified version of the NIST digits database.
    - MNIST is available for download.

    I used the following code to turn the 70k Arabic digit images into tabular data, for ease of use and to spend less time on preprocessing:

    ```python
    import os

    import numpy as np
    import pandas as pd
    from PIL import Image

    # Define the root directory of the dataset
    root_dir = "MAHD"

    # Define the names of the folders containing the images
    folder_names = ['Part{:02d}'.format(i) for i in range(1, 13)]
    # Equivalent alternative:
    # folder_names = ['Part{}'.format(i) if i > 9 else 'Part0{}'.format(i) for i in range(1, 13)]

    # Define the names of the subfolders containing the training and testing images
    train_test_folders = ['MAHDBase_TrainingSet', 'test']

    # Initialize empty lists to store the image data and labels
    data = []
    labels = []

    # Loop over the training and testing subfolders in each Part folder
    for tt in train_test_folders:
        for folder_name in folder_names:
            if tt == train_test_folders[1] and folder_name == 'Part03':
                break
            subfolder_path = os.path.join(root_dir, tt, folder_name)
            print(subfolder_path)
            print(os.listdir(subfolder_path))
            for filename in os.listdir(subfolder_path):
                # Check the file format: only .bmp images are processed
                if os.path.splitext(filename)[1].lower() not in '.bmp':
                    continue
                # Load the image
                img_path = os.path.join(subfolder_path, filename)
                img = Image.open(img_path)

                # Convert the image to grayscale and flatten it into a 1D array
                img_grey = img.convert('L')
                img_data = np.array(img_grey).flatten()

                # Extract the label from the filename and convert it to an integer
                label = int(filename.split('_')[2].replace('digit', '').split('.')[0])

                # Add the image data and label to the lists
                data.append(img_data)
                labels.append(label)

    # Convert the image data and labels to a pandas dataframe
    df = pd.DataFrame(data)
    df['label'] = labels
    ```

    This dataset was made by https://datacenter.aucegypt.edu/shazeem from 2 datasets:
    - ADBase
    - MADBase (✅ the one this dataset is derived from, similar in form to MNIST)

  13. Wrist-mounted IMU data towards the investigation of free-living smoking...

    • data.niaid.nih.gov
    • data.europa.eu
    Updated May 3, 2021
    + more versions
    Cite
    Kirmizis, Athanasios; Kyritsis, Konstantinos; Delopoulos, Anastasios (2021). Wrist-mounted IMU data towards the investigation of free-living smoking behavior - the Smoking Event Detection (SED) and Free-living Smoking Event Detection (SED-FL) datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4507450
    Explore at:
    Dataset updated
    May 3, 2021
    Dataset provided by
    Aristotle University of Thessaloniki
    Authors
    Kirmizis, Athanasios; Kyritsis, Konstantinos; Delopoulos, Anastasios
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The Smoking Event Detection (SED) and the Free-living Smoking Event Detection (SED-FL) datasets were created by the Multimedia Understanding Group towards the investigation of smoking behavior, both while smoking and in-the-wild. Both datasets contain the triaxial acceleration and orientation velocity signals (6 DoF) that originate from a commercial smartwatch (Mobvoi TicWatch E™). The SED dataset consists of 20 smoking sessions provided by 11 unique subjects, while the SED-FL dataset contains 10 all-day recordings provided by 7 unique subjects.

    In addition, the start and end moments of each puff cycle are annotated throughout the SED dataset.

    Description

    SED

    A total of 11 subjects were recorded while smoking a cigarette in interior or exterior areas. The total duration of the 20 sessions sums up to 161 minutes, with a mean duration of 8.08 minutes. Each participant was free to smoke naturally, with the only limitation being not to swap the cigarette between hands during the smoking session. Prior to the recording, the participant was asked to wear the smartwatch on the hand that they typically use in their everyday life to smoke. A camera was already set facing the participant, including at least the whole length of the arms in its field of view. The purpose of the video recording was to obtain ground truth information for each of the puff cycles that occur during the smoking session. Participants were also asked to perform a clapping hand movement both at the start and end of the session, for synchronization purposes (as this movement is distinctive in the accelerometer signal). No other instructions were given to the participants. It should be noted that the SED dataset does not contain instances of electronic cigarettes (also known as vaping devices) or heated tobacco products.

    SED-FL

    SED-FL includes 10 in-the-wild sessions that belong to 7 unique subjects. This is achieved by recording a small part of the subjects' everyday, unscripted activities. Participants were instructed to wear the smartwatch on the hand of their preference well ahead of any smoking session and to continue wearing it throughout the day until the battery was depleted. In addition, we followed a self-report labeling model, meaning that the ground truth is provided by the participant by documenting the start and end moments of their smoking sessions to the best of their abilities, as well as the hand they wear the smartwatch on. The total duration of the recordings sums up to 78.3 hours, with a mean duration of 7.83 hours.

    For both datasets, the accompanying Python script read_dataset.py will visualize the IMU signals and ground truth for each of the recordings. Information on how to execute the Python scripts can be found below.

    The script and the dataset's pickle file must be located in the same directory.

    Tested with Python 3.6.4

    Requirements: Pandas, Pickle and Matplotlib

    Visualize signals and ground truth

    python read_datasets.py

    Annotation

    For all recordings, we annotated the start and end points for each puff cycle (i.e., smoking gesture). The annotation process was performed in such a way that the start and end times of each smoking gesture do not overlap each other.

    Technical details

    SED

    We provide the SED dataset as a pickle. The file can be loaded using Python in the following way:

    import pickle as pkl
    import pandas as pd

    with open('./SED.pkl', 'rb') as fh:
        dataset = pkl.load(fh)

    The dataset variable in the snippet above is a dictionary with 11 keys, each corresponding to a unique subject. It should be mentioned that the subject identifier in SED is in-line with the subject identifier in the SED-FL dataset; i.e., SED's subject with a given id is the same person as SED-FL's subject with that id.

    The content of a dataset's subject is a list with length equal to the corresponding subject's number of recorded smoking sessions. For example, for subject '8' the command:

    sessions = dataset['8']

    would yield a list with one entry per recorded smoking session. Each member of the list is a Pandas DataFrame with dimensions M × 8, where M is the length of the recording.

    The columns of a session’s DataFrame are:

    'T': The timestamps in seconds

    'AccX': The accelerometer measurements for the x axis in m/s^2

    'AccY': The accelerometer measurements for the y axis in m/s^2

    'AccZ': The accelerometer measurements for the z axis in m/s^2

    'GyrX': The gyroscope measurements for the x axis in rad/s

    'GyrY': The gyroscope measurements for the y axis in rad/s

    'GyrZ': The gyroscope measurements for the z axis in rad/s

    'GT': The manually annotated ground truth for puff cycles

    The contents of this DataFrame are essentially the accelerometer and gyroscope sensor streams, resampled at a constant sampling rate of 50 Hz and aligned with each other and with their puff-cycle ground truth. All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is on par with the signals in the SED-FL dataset. The ground truth is a signal with value +1 during puff cycles, and -1 elsewhere.

    No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present in the processed acceleration measurements. The potential researcher can consult the article "Modeling Wrist Micromovements to Measure In-Meal Eating Behavior from Inertial Sensor Data" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth them and remove the gravitational component).
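
    As an illustration of how the 'GT' column can be used (a sketch, assuming dataset has been loaded as above and the +1/-1 ground-truth convention):

    sessions = dataset['8']
    session = sessions[0]

    # Keep only the samples that fall inside annotated puff cycles
    puffs = session[session['GT'] == 1]
    print(len(puffs), 'samples inside puff cycles')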

    SED-FL

    Similar to SED, we provide the SED-FL dataset as a pickle. The file can be loaded using Python in the following way:

    import pickle as pkl
    import pandas as pd

    with open('./SED-FL.pkl', 'rb') as fh:
        dataset = pkl.load(fh)

    The dataset variable in the snippet above is a dictionary with 7 keys, each corresponding to a unique subject. It should be mentioned that the subject identifier in SED-FL is in-line with the subject identifier in the SED dataset; i.e., SED-FL's subject with a given id is the same person as SED's subject with that id.

    The content of a dataset's subject is a list with length equal to the corresponding subject's number of recorded daily sessions. For example, assuming that subject '8' has recorded 2 daily sessions, the command:

    sessions = dataset['8']

    would yield a list of length equal to 2. Each member of the list is a Pandas DataFrame with dimensions M × 8, where M is the length of the recording.

    The columns of a session’s DataFrame are exactly the same with the ones in the SED dataset. However, the 'GT' column contains ground truth that relates with the smoking sessions during the day (instead of puff cycles in SED).

    The contents of this DataFrame are essentially the accelerometer and gyroscope sensor streams, resampled at a constant sampling rate of 50 Hz and aligned with each other and with their smoking-session ground truth. All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is on par with the signals in the SED dataset. The ground truth is a signal with value +1 during smoking sessions, and -1 elsewhere.

    No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present in the processed acceleration measurements. The potential researcher can consult the article "Modeling Wrist Micromovements to Measure In-Meal Eating Behavior from Inertial Sensor Data" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth them and remove the gravitational component).

    Ethics and funding

    Informed consent, including permission for third-party access to anonymized data, was obtained from all subjects prior to their engagement in the study. The work leading to these results has received funding from the EU Commission under Grant Agreement No. 965231, the REBECCA project (H2020).

    Contact

    Any inquiries regarding the SED and SED-FL datasets should be addressed to:

    Mr. Konstantinos KYRITSIS (Electrical & Computer Engineer, PhD candidate)

    Multimedia Understanding Group (MUG) Department of Electrical & Computer Engineering Aristotle University of Thessaloniki University Campus, Building C, 3rd floor Thessaloniki, Greece, GR54124

    Tel: +30 2310 996359, 996365 Fax: +30 2310 996398 E-mail: kokirits [at] mug [dot] ee [dot] auth [dot] gr

  14. Stone Classification

    • kaggle.com
    zip
    Updated Mar 18, 2025
    Cite
    Khadgar (2025). Stone Classification [Dataset]. https://www.kaggle.com/datasets/claydonwang/stone-classification
    Explore at:
    zip(69490 bytes)Available download formats
    Dataset updated
    Mar 18, 2025
    Authors
    Khadgar
    Description

    Outline

    The dataset is used in the final project of STA325 at SUSTech.

    How to Generate submission.csv from test_loader

    1. Define the Prediction Function

    Use the following function to extract predictions from test_loader:

    ```python
    import os

    import torch
    from tqdm import tqdm

    def predict(model, loader, device):
        model.eval()  # Set the model to evaluation mode
        predictions = []  # Store predicted classes
        image_ids = []  # Store image filenames

        with torch.no_grad():  # Disable gradient computation
            for images, img_paths in tqdm(loader, desc="Predicting on test set"):
                images = images.to(device)  # Move images to the specified device
                outputs = model(images)  # Forward pass to get model outputs
                _, predicted = torch.max(outputs, 1)  # Get predicted classes

                # Collect predictions and image IDs
                predictions.extend(predicted.cpu().numpy())
                image_ids.extend([os.path.basename(path) for path in img_paths])

        return image_ids, predictions
    ```

    2. Run Predictions

    Call the prediction function with the trained model, test_loader, and device:

    image_ids, predictions = predict(model, test_loader, device)

    3. Create the Submission File

    import pandas as pd
    import os
    
    # Create DataFrame
    submission_df = pd.DataFrame({
      "id": image_ids,  # Image filenames
      "label": predictions # Predicted classes
    })
    
    # Save to the specified path
    OUTPUT_DIR = "logs"
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    submission_path = os.path.join(OUTPUT_DIR, "submission.csv")
    submission_df.to_csv(submission_path, index=False)
    print(f"Kaggle submission file saved to {submission_path}")
    

    Output Description

    • submission.csv Format:
      The file contains two columns:
    • id: Filenames of test images (without paths, e.g., image1.jpg).
    • label: Predicted class indices (e.g., 0, 1, 2, depending on the number of classes).

    • Example Content:

      id,label
      000001.jpg,0
      000002.jpg,1
      000003.jpg,2

    Then submit the submission.csv to Kaggle.

  15. agnews

    • huggingface.co
    Updated Apr 5, 2025
    + more versions
    Cite
    Washington Cunha (2025). agnews [Dataset]. https://huggingface.co/datasets/waashk/agnews
    Explore at:
    Dataset updated
    Apr 5, 2025
    Authors
    Washington Cunha
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset used in the paper: "A thorough benchmark of automatic text classification: From traditional approaches to large language models" (https://github.com/waashk/atcBench). To guarantee the reproducibility of the obtained results, the dataset and its respective CV train-test partitions are available here. Each dataset contains the following files:

    data.parquet: pandas DataFrame with texts and associated encoded labels for each document. split_

  16. Linux Terminal Commands Dataset

    • kaggle.com
    zip
    Updated May 21, 2025
    + more versions
    Cite
    SUNNY THAKUR (2025). Linux Terminal Commands Dataset [Dataset]. https://www.kaggle.com/datasets/cyberprince/linux-terminal-commands-dataset
    Explore at:
    zip(32599 bytes)Available download formats
    Dataset updated
    May 21, 2025
    Authors
    SUNNY THAKUR
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Linux Terminal Commands Dataset

    Overview

    The Linux Terminal Commands Dataset is a comprehensive collection of 600 unique Linux terminal commands (cmd-001 to cmd-600), curated for cybersecurity professionals, system administrators, data scientists, and machine learning engineers. This dataset is designed to support advanced use cases such as penetration testing, system administration, forensic analysis, and training machine learning models for command-line automation and anomaly detection. The commands span 10 categories: Navigation, File Management, Viewing, System Info, Permissions, Package Management, Networking, User Management, Process, and Editor. Each entry includes a command, its category, a description, an example output, and a reference to the relevant manual page, ensuring usability for both human users and automated systems.

    Key Features

    • Uniqueness: 600 distinct commands with no overlap, covering basic to unconventional tools.
    • Sophistication: Includes advanced commands for SELinux, eBPF tracing, network forensics, and filesystem debugging.
    • Unconventional Tools: Features obscure utilities like bpftrace, tcpflow, zstd, and aa-status for red teaming and system tinkering.
    • ML-Ready: Structured in JSON Lines (.jsonl) format for easy parsing and integration into machine learning pipelines.
    • Professional Focus: Tailored for cybersecurity (e.g., auditing, hardening), system administration (e.g., performance tuning), and data science (e.g., log analysis).

    Dataset Structure

    The dataset is stored in a JSON Lines file (linux_terminal_commands_dataset.jsonl), where each line represents a single command with the following fields:

    • id: Unique identifier (e.g., cmd-001 to cmd-600).
    • command: The Linux terminal command (e.g., setfacl -m u:user:rw file.txt).
    • category: One of 10 categories (e.g., Permissions, Networking).
    • description: A concise explanation of the command's purpose and functionality.
    • example_output: Sample output or expected behavior (e.g., [No output if successful]).
    • man_reference: URL to the official manual page (e.g., https://man7.org/linux/man-pages/...).

    Category Distribution

    • Navigation: 11
    • File Management: 56
    • Viewing: 35
    • System Info: 51
    • Permissions: 28
    • Package Management: 12
    • Networking: 56
    • User Management: 19
    • Process: 42
    • Editor: 10

    Usage

    Prerequisites

    • Python 3.6+: For parsing and analyzing the dataset.
    • Linux Environment: Most commands require a Linux system (e.g., Ubuntu, CentOS, Fedora) for execution.
    • Optional Tools: Install tools like pandas for data analysis or jq for JSON processing.

    Loading the Dataset

    Use Python to load and explore the dataset:

    ```python
    import json
    import pandas as pd

    # Load dataset
    dataset = []
    with open("linux_terminal_commands_dataset.jsonl", "r") as file:
        for line in file:
            dataset.append(json.loads(line))

    # Convert to DataFrame
    df = pd.DataFrame(dataset)

    # Example: View category distribution
    print(df.groupby("category").size())

    # Example: Filter Networking commands
    networking_cmds = df[df["category"] == "Networking"]
    print(networking_cmds[["id", "command", "description"]])
    ```

    Example Applications

    Cybersecurity: Use bpftrace or tcpdump commands for real-time system and network monitoring. Audit permissions with setfacl, chcon, or aa-status for system hardening.

    System Administration: Monitor performance with slabtop, pidstat, or systemd-analyze. Manage filesystems with btrfs, xfs_repair, or cryptsetup.

    Machine Learning: Train NLP models to predict command categories or generate command sequences. Use example outputs for anomaly detection in system logs.

    Pentesting: Leverage nping, tcpflow, or ngrep for network reconnaissance. Explore find / -perm /u+s to identify potential privilege escalation vectors.

    Executing Commands

    Warning: Some commands (e.g., mkfs.btrfs, fuser -k, cryptsetup) can modify or destroy data. Always test in a sandboxed environment. To execute a command:

    # Example: List SELinux file contexts
    semanage fcontext -l

    Installation

    1. Clone the repository:
       git clone https://github.com/sunnythakur25/linux-terminal-commands-dataset.git
       cd linux-terminal-commands-dataset
    2. Ensure the dataset file (linux_terminal_commands_dataset.jsonl) is in the project directory.
    3. Install dependencies for analysis (optional):
       pip install pandas

    Contribution Guidelines

    We welcome contributions to expand the dataset or improve its documentation. To contribute:

    • Fork the Repository: Create a fork on GitHub.
    • Add Commands: Ensure new commands are unique, unconventional, and include all required fields (id, command, category, etc.).
    • Test Commands: Verify commands work on a Linux system and provide accurate example outputs.
    • Submit a Pull Request: Include a clear description of your changes and their purpose.
    • Follow Standards: Use JSON Lines format. Reference man7.org for manual pages. Categorize c...

  17. vader_movie_2L

    • huggingface.co
    Updated Apr 4, 2025
    + more versions
    Cite
    Washington Cunha (2025). vader_movie_2L [Dataset]. https://huggingface.co/datasets/waashk/vader_movie_2L
    Explore at:
    Dataset updated
    Apr 4, 2025
    Authors
    Washington Cunha
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset used in the paper: "A thorough benchmark of automatic text classification: From traditional approaches to large language models" (https://github.com/waashk/atcBench). To guarantee the reproducibility of the obtained results, the dataset and its respective CV train-test partitions are available here. Each dataset contains the following files:

    data.parquet: pandas DataFrame with texts and associated encoded labels for each document. split_

  18. Twitch.tv Chat Log Data

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Aug 1, 2019
    Cite
    Jeongmin Kim (2019). Twitch.tv Chat Log Data [Dataset]. http://doi.org/10.7910/DVN/VE0IVQ
    Explore at:
    Croissant (a machine-learning dataset format; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Jeongmin Kim
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Collection of chat logs of 2,162 Twitch streaming videos by 52 streamers. The time period of the target streaming videos is from 2018-04-24 to 2018-06-24. Description of the columns follows below:

    body: Actual text of the user chat
    channel_id: Channel identifier (integer)
    commenter_id: User identifier (integer)
    commenter_type: User type (character)
    created_at: Time when the chat message was entered (ISO 8601 date and time)
    fragments: Chat text including parsing information for Twitch emotes (JSON list)
    offset: Time offset between the start time of the video stream and the time the chat message was entered (float)
    updated_at: Time when the chat message was edited (ISO 8601 date and time)
    video_id: Video identifier (integer)

    The file name indicates the name of the Twitch stream channel. This dataset is saved as a python3 pandas.DataFrame in Python pickle format.

    import pandas as pd
    pd.read_pickle('ninja.pkl')

  19. glassDef dataset: metallic glass deformation

    • data.niaid.nih.gov
    Updated Dec 24, 2023
    Cite
    Kamran Karimi; Amin Esfandiarpour; Rene Alvarez-Donado; Mikko J. Alava; Stefanos Papanikolaou (2023). glassDef dataset: metallic glass deformation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7736625
    Explore at:
    Dataset updated
    Dec 24, 2023
    Dataset provided by
    NOMATEN Centre of Excellence, National Center for Nuclear Research, ul. A. Sołtana 7, 05-400 Swierk/Otwock, Poland
    Authors
    Kamran Karimi; Amin Esfandiarpour; Rene Alvarez-Donado; Mikko J. Alava; Stefanos Papanikolaou
    License

    Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    The glassDef dataset contains a set of text-based LAMMPS dump files corresponding to shear deformation tests on different bulk metallic glasses. This includes FeNi, CoNiFe, CoNiCrFe, CoCrFeMn, CoNiCrFeMn, and Co5Cr2Fe40Mn27Ni26 amorphous alloys with data files that exist in relevant subdirectories. Each dump file corresponds to multiple realizations and includes the dimensions of the simulation box as well as atom coordinates, the atom ID, and associated type of nearly 50,000 atoms.

    Load glassDef Dataset in Python

    The glassDef dataset may be loaded in Python into a Pandas DataFrame. To go into the relevant subdirectory, run cd glass{glass_name}/Run[0-3]/, where “glass_name” denotes the chemical composition. Each subdirectory contains at least three glass realizations within subfolders that are labeled as “Run[0-3]”.

    cd glassFeNi/Run0; python

    import pandas

    df = pandas.read_csv("FeNi_glass.dump",skiprows=9)

    One may display an assigned DataFrame in the form of a table:

    df.head()

    To learn more about further analyses performed on the loaded data, please refer to the paper cited below.

    glassDef Dataset Structure

    glassDef Data Fields

    Dump files: “id”, “type”, “x”, “y”, “z”.

    glassDef Dataset Description

    Paper: Karimi, Kamran, Amin Esfandiarpour, René Alvarez-Donado, Mikko J. Alava, and Stefanos Papanikolaou. "Shear banding instability in multicomponent metallic glasses: Interplay of composition and short-range order." Physical Review B 105, no. 9 (2022): 094117.

    Contact: kamran.karimi@ncbj.gov.pl

  20. The Device Activity Report with Complete Knowledge (DARCK) for NILM

    • zenodo.org
    bin, xz
    Updated Sep 19, 2025
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2025). The Device Activity Report with Complete Knowledge (DARCK) for NILM [Dataset]. http://doi.org/10.5281/zenodo.17159850
    Explore at:
    bin, xzAvailable download formats
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Abstract

    This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.

    2. Dataset Overview

    • Apartment: Two-person apartment, approx. 58m², located in Aachen, Germany.
    • Aggregate Meter: eBZ DD3
    • Sub-meters: 31 Shelly Plus Plug S, 6 Shelly Plus 1PM, 3 Shelly Plus PM Mini Gen3
    • Sampling Rate: 1 Hz
    • Measured Quantity: Active Power
    • Unit of Measurement: Watt
    • Duration: 6 months
    • Format: Single CSV file (`DARCK.csv`)
    • Structure: Timestamped rows with columns for the aggregate meter and each sub-metered appliance.
    • Completeness: The main power meter has a completeness of 99.3%. Missing values were linearly interpolated.

    3. Download and Usage

    The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850

    Because the data contains long off periods consisting of zeros, the CSV file compresses very well: compression reduces the file size by about 97% (from 4 GB to 90.9 MB).


    To extract it, use: xz -d DARCK.csv.xz.


    To use the dataset in Python, you can, for example, load the CSV file into a pandas DataFrame:

    ```python
    import pandas as pd

    df = pd.read_csv("DARCK.csv", parse_dates=["time"])
    ```
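    As a usage sketch beyond the original instructions: pandas can read the xz-compressed file directly, and with the timestamps as index the 1 Hz power readings can be aggregated, for example into daily energy (the conversion below assumes active power in watts sampled once per second, as documented above):

    ```python
    import pandas as pd

    # read_csv infers the xz compression from the file extension.
    df = pd.read_csv("DARCK.csv.xz", parse_dates=["time"], index_col="time")

    # 1 Hz active power in W -> energy in kWh per day (sum of W*s divided by 3.6e6).
    daily_kwh = df["main"].resample("1D").sum() / 3_600_000
    print(daily_kwh.head())
    ```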

    4. Measurement Setup

    The main meter was monitored using an infrared reading head magnetically attached to the meter's infrared interface. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the watt readings to an MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in Docker on a Dell OptiPlex 3020M.

    5. File Format (DARCK.csv)

    The dataset is provided as a single comma-separated value (CSV) file.

    • The first row is a header containing the column names.
    • All power values are rounded to the first decimal place.
    • There are no missing values in the final dataset.
    • Each row represents one second, from the start of measurement in March until the end in September.

    Column Descriptions

    | Column Name | Data Type | Unit | Description |
    |---|---|---|---|
    | time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS |
    | main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel. |
    | [appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list. |
    | Aggregate Columns | | | |
    | aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger. |
    | aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2. |
    | aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp, and watsonledmap. |
    | Analysis Columns | | | |
    | inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30 W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for. |
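    A minimal sketch of how the inaccuracy figure described above could be recomputed from the published columns; selecting the appliance columns by excluding main, inaccuracy, and the aggr_ columns is an assumption based on the naming in this table:

    ```python
    import pandas as pd

    df = pd.read_csv("DARCK.csv", parse_dates=["time"], index_col="time")

    # Individual appliance columns: everything except the mains reading and the
    # derived aggregate/analysis columns (assumed naming convention).
    appliance_cols = [c for c in df.columns
                      if c not in ("main", "inaccuracy") and not c.startswith("aggr_")]

    # Absolute error between the appliance sum (plus the 30 W offset for the
    # meters' own draw) and the mains reading, as described above.
    recomputed = (df[appliance_cols].sum(axis=1) + 30.0 - df["main"]).abs()
    ```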

    6. Data Postprocessing Pipeline

    The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.

    6.1. Main Meter (main) Postprocessing

    The aggregate power data required several cleaning steps to ensure accuracy.

    1. Outlier Removal: Readings below 10 W or above 10,000 W were removed (only 3 occurrences).
    2. Timestamp Burst Correction: The source data contained bursts of delayed readings. A custom algorithm was used to identify these bursts (a large time gap followed by rapid readings) and back-fill the timestamps to create an evenly spaced time series.
    3. Alignment & Interpolation: The smart meter pushes a new value via infrared every second. To align these readings to whole seconds, the series was resampled to a 1-second frequency by taking the mean of all readings within each second (in 99.5% of cases a single value). Any resulting gaps (0.7% outage ratio) were filled using linear interpolation (see the sketch below).
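    A minimal sketch of the alignment and interpolation step, assuming the raw meter readings are available as a pandas Series indexed by timestamp (main_raw and its values are hypothetical):

    ```python
    import pandas as pd

    # Hypothetical raw readings: watt values at slightly irregular timestamps.
    main_raw = pd.Series(
        [230.1, 231.4, 229.8],
        index=pd.to_datetime(["2025-03-05 00:00:00.2",
                              "2025-03-05 00:00:01.1",
                              "2025-03-05 00:00:03.4"]),
    )

    # Align readings to whole seconds (mean of all readings within each second),
    # then fill any resulting gaps by linear interpolation.
    main_1s = main_raw.resample("1s").mean().interpolate(method="linear")
    ```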

    6.2. Sub-metered Devices (shellies) Postprocessing

    The Shelly devices are not prone to the same burst issue as the ESP8266. They push a new reading at every change in drawn power. If no power change is observed, or the observed change is too small (less than a few watts), a reading is pushed once per minute together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.

    1. Grouping: Data was grouped by the unique device identifier.
    2. Resampling & Filling: The data for each device was resampled to a 1-second frequency using .resample('1s').last().ffill().
      This method was chosen, first, to capture the last known state of the device within each second, handling rapid on/off events, and second, to forward-fill that state across periods with no new data, modeling the assumption that the device's consumption remained constant until a new reading was sent (see the sketch after this list).
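    A minimal sketch of this per-device resampling, assuming the raw sub-meter readings are in a long-format DataFrame with device, time, and power columns (these names and values are assumptions, not the actual schema of shellies.csv):

    ```python
    import pandas as pd

    # Hypothetical raw sub-meter readings (the real raw schema may differ).
    shellies = pd.DataFrame({
        "device": ["fridge", "fridge", "lightkitchen"],
        "time": pd.to_datetime(["2025-03-05 00:00:00.4",
                                "2025-03-05 00:00:02.7",
                                "2025-03-05 00:00:01.0"]),
        "power": [45.2, 0.0, 12.3],
    })

    per_device = {}
    for device, group in shellies.groupby("device"):
        series = group.set_index("time")["power"].sort_index()
        # Last known state within each second, forward-filled across quiet periods.
        per_device[device] = series.resample("1s").last().ffill()
    ```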

    6.3. Merging and Finalization

    1. Merge: The cleaned main-meter series and all sub-metered device series were merged into a single DataFrame on the time index (see the sketch below).
    2. Final Fill: Any remaining NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption.
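    Continuing the sketches above (main_1s and per_device are the hypothetical intermediate results from Sections 6.1 and 6.2), the merge and final fill could look like this:

    ```python
    import pandas as pd

    # Merge the aligned mains series and all device series on the time index,
    # then assume zero consumption wherever a device has no data yet.
    merged = pd.concat({"main": main_1s, **per_device}, axis=1).fillna(0.0)
    ```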

    7. Manual Corrections and Known Data Issues

    During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.

    1. March 10th - Unmetered Bulb: An unmetered 107 W bulb was active. Its consumption was subtracted from the main reading, as if the bulb had never been switched on.
    2. May 31st - Unmetered Air Pump: An unmetered 101 W pump for an air mattress was plugged directly into an outlet with no intermediary metering plug; its consumption was therefore manually added to the respective plug column.

    8. Appliance Details and Multipurpose Plugs

    The following table lists the column names, with an explanation where needed. Because Watson moved at the beginning of June, some metering plugs changed the appliance they monitor.
