7 datasets found

PandasPlotBench
huggingface.co
Updated Nov 25, 2024
Cite
JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
Dataset updated
Nov 25, 2024
Dataset provided by
JetBrains (http://jetbrains.com/)
Authors
JetBrains Research
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
PandasPlotBench

PandasPlotBench is a benchmark that assesses how well models write visualization code given a description of a Pandas DataFrame. 🛠️ Task: given a plotting task and a description of a Pandas DataFrame, write the code to build the plot. The dataset is based on the Matplotlib gallery. The paper is available on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
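To make the task concrete, the sketch below builds a plain-text description of a small DataFrame, the kind of input a model might receive alongside a plotting instruction. The helper name `describe_dataframe`, the sample data, and the description layout are illustrative assumptions, not the benchmark's actual prompt format.

```python
import pandas as pd


def describe_dataframe(df: pd.DataFrame, n_rows: int = 3) -> str:
    """Return a short text summary of a DataFrame (hypothetical layout)."""
    lines = [f"Rows: {len(df)}, columns: {len(df.columns)}"]
    # One line per column with its dtype.
    for name, dtype in df.dtypes.items():
        lines.append(f"- {name} ({dtype})")
    lines.append("First rows:")
    lines.append(df.head(n_rows).to_string(index=False))
    return "\n".join(lines)


# Made-up sample data, for illustration only.
df = pd.DataFrame({"year": [2021, 2022, 2023], "sales": [1.5, 2.0, 2.75]})
description = describe_dataframe(df)
print(description)
```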
Enterprise GenAI Adoption & Workforce Impact Data
kaggle.com
Updated Jun 12, 2025
Cite
Ojas Singh (2025). Enterprise GenAI Adoption & Workforce Impact Data [Dataset]. https://www.kaggle.com/datasets/tfisthis/enterprise-genai-adoption-and-workforce-impact-data/discussion
Dataset updated
Jun 12, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ojas Singh
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Enterprise GenAI Adoption & Workforce Impact Dataset (100K+ Rows)

This dataset originates from a multi-year enterprise survey conducted across industries and countries. It focuses on the organizational effects of adopting Generative AI tools such as ChatGPT, Claude, Gemini, Mixtral, LLaMA, and Groq. The dataset captures detailed metrics on job role creation, workforce transformation, productivity changes, and employee sentiment.

Data Schema

columns = [
    "Company Name",                  # Anonymized name
    "Industry",                      # Sector (e.g., Finance, Healthcare)
    "Country",                       # Country of operation
    "GenAI Tool",                    # GenAI platform used
    "Adoption Year",                 # Year of initial deployment (2022–2024)
    "Number of Employees Impacted",  # Affected staff count
    "New Roles Created",             # Number of AI-driven job roles introduced
    "Training Hours Provided",       # Upskilling time investment
    "Productivity Change (%)",       # % shift in reported productivity
    "Employee Sentiment",            # Textual feedback from employees
]
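Before running any analysis, it can help to confirm that a loaded frame actually matches this schema. The following is a minimal sketch, assuming the column names listed above; the helper `missing_columns` and the two-column sample frame are hypothetical and only illustrate the check.

```python
import pandas as pd

# Column names copied from the schema above.
EXPECTED_COLUMNS = [
    "Company Name",
    "Industry",
    "Country",
    "GenAI Tool",
    "Adoption Year",
    "Number of Employees Impacted",
    "New Roles Created",
    "Training Hours Provided",
    "Productivity Change (%)",
    "Employee Sentiment",
]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from df, in schema order."""
    return [col for col in expected if col not in df.columns]


# Tiny made-up frame that has only two of the ten columns.
sample = pd.DataFrame({"Company Name": ["A"], "Industry": ["Finance"]})
missing = missing_columns(sample)
print(missing)
```

An empty list means every expected column is present.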

Load the Dataset

import pandas as pd

df = pd.read_csv("Large_Enterprise_GenAI_Adoption_Impact.csv")
df.shape

Basic Exploration

df.head(10)
df.describe()
df["GenAI Tool"].value_counts()
df["Industry"].unique()

Filter Examples

Filter by Year and Country

df[(df["Adoption Year"] == 2023) & (df["Country"] == "India")]

Get Top 5 Industries by Productivity Gain

df.groupby("Industry")["Productivity Change (%)"].mean().sort_values(ascending=False).head()

Text Analysis on Employee Sentiment

Word Frequency Analysis

from collections import Counter
import re

text = " ".join(df["Employee Sentiment"].dropna().tolist())
words = re.findall(r'\b\w+\b', text.lower())
common_words = Counter(words).most_common(20)
print(common_words)
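Raw word counts tend to be dominated by function words such as "the" and "and". The sketch below shows one way to filter them out. The stop-word set is a deliberately tiny, hypothetical list and the feedback strings are made up for illustration.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "our", "has", "it"}


def top_words(texts, n=3):
    """Count lowercase word tokens across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\b\w+\b", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Made-up feedback strings, for illustration only.
feedback = [
    "The tool has improved productivity",
    "Training improved our productivity",
    "Productivity is up and training helped",
]
print(top_words(feedback))
```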

Sentiment Length Distribution

# Treat missing feedback as empty text so that len() never receives a NaN.
df["Sentiment Length"] = df["Employee Sentiment"].fillna("").apply(lambda x: len(x.split()))
df["Sentiment Length"].hist(bins=50)

Group-Based Insights

Role Creation by Tool

df.groupby("GenAI Tool")["New Roles Created"].mean().sort_values(ascending=False)

Training Hours by Industry

df.groupby("Industry")["Training Hours Provided"].mean().sort_values(ascending=False)
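The two group-based queries above can be combined into a single table with named aggregation. This is a minimal sketch on made-up rows that reuse the schema's column names; the output column names (`mean_new_roles`, `mean_training_hours`, `rows`) are illustrative choices.

```python
import pandas as pd

# Made-up rows using column names from the schema; values are illustrative.
sample = pd.DataFrame({
    "Industry": ["Finance", "Finance", "Healthcare", "Healthcare"],
    "New Roles Created": [10, 20, 5, 15],
    "Training Hours Provided": [100, 300, 50, 150],
})

summary = (
    sample.groupby("Industry")
    .agg(
        mean_new_roles=("New Roles Created", "mean"),
        mean_training_hours=("Training Hours Provided", "mean"),
        rows=("New Roles Created", "size"),  # number of rows per industry
    )
    .sort_values("mean_training_hours", ascending=False)
)
print(summary)
```

Keeping the row count next to each mean makes it easier to spot groups whose averages rest on very few records.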

Sample Use Cases

Evaluate GenAI adoption patterns by sector or region

Analyze workforce upskilling initiatives and investments

Explore employee reactions to AI integration using NLP

Build models to predict productivity impact based on tool, industry, or country

Study role creation trends to anticipate future AI-based job market shifts
oldIT2modIT
huggingface.co
Updated Jun 3, 2025
Cite
Massimo Romano (2025). oldIT2modIT [Dataset]. https://huggingface.co/datasets/cybernetic-m/oldIT2modIT
Dataset updated
Jun 3, 2025
Authors
Massimo Romano
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Download the dataset

At the moment, the dataset should be downloaded as a Pandas DataFrame:

import pandas as pd

df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")

You can preview the dataset with:

df.head()

To convert it into a Hugging Face dataset:

from datasets import Dataset

dataset = Dataset.from_pandas(df)

Dataset Description

This is an Italian dataset consisting of 200 old (ancient) Italian sentences and… See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.
Klib library python
kaggle.com
Updated Jan 11, 2021
Cite
Sripaad Srinivasan (2021). Klib library python [Dataset]. https://www.kaggle.com/sripaadsrinivasan/klib-library-python/discussion
Dataset updated
Jan 11, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sripaad Srinivasan
Description
The klib library lets you quickly visualize missing data, clean data, and plot distributions, correlations, and categorical column values. klib is a Python library for importing, cleaning, analyzing and preprocessing data. Explanations of key functionalities can be found on Medium / TowardsDataScience, in the examples section, or on YouTube (Data Professor).

Original GitHub repo: https://github.com/akanz1/klib

Usage

!pip install klib

import klib
import pandas as pd

df = pd.DataFrame(data)  # data: your own input data

# klib.describe functions for visualizing datasets
klib.cat_plot(df)         # returns a visualization of the number and frequency of categorical features
klib.corr_mat(df)         # returns a color-encoded correlation matrix
klib.corr_plot(df)        # returns a color-encoded heatmap, ideal for correlations
klib.dist_plot(df)        # returns a distribution plot for every numeric feature
klib.missingval_plot(df)  # returns a figure containing information about missing values

Examples

Take a look at this starter notebook.

Further examples, as well as applications of the functions can be found here.

Contributing

Pull requests and ideas, especially for further functions, are welcome. For major changes or feedback, please open an issue first to discuss what you would like to change. Take a look at this GitHub repo.

License

MIT
SELTO Dataset
zenodo.org
data.niaid.nih.gov
application/gzip
Updated May 23, 2023
Cite
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch (2023). SELTO Dataset [Dataset]. http://doi.org/10.5281/zenodo.7781392
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.7781392
Dataset updated
May 23, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch
License
Attribution 4.0, CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
A Benchmark Dataset for Deep Learning for 3D Topology Optimization

This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

The following paper provides full documentation and examples:

Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

E - Young's modulus [Pa]

ν - Poisson's ratio [-]

σ_ys - yield stress [Pa]

h - discretization size of the voxel grid [m]

The columns of i.csv correspond to the following voxel-wise information:

x, y, z - the indices that state the location of the voxel within the voxel mesh

Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized

Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension

F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]

density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem
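The voxel-wise columns above map naturally onto dense arrays. The sketch below scatters a few hand-written rows (made-up values, not taken from the dataset) into NumPy grids, then counts the freely optimizable voxels and computes the filled volume fraction. It assumes zero-based voxel indices and uses only a subset of the columns for brevity.

```python
import numpy as np

# Made-up rows: (x, y, z, Ω_design, density) for a 2 x 2 x 1 grid.
rows = np.array([
    [0, 0, 0,  1, 1],   # density fixed at 1
    [0, 1, 0, -1, 1],   # free voxel, filled in the solution
    [1, 0, 0, -1, 0],   # free voxel, empty in the solution
    [1, 1, 0,  0, 0],   # density fixed at 0
])

x, y, z = rows[:, 0], rows[:, 1], rows[:, 2]
shape = (x.max() + 1, y.max() + 1, z.max() + 1)  # assumes zero-based indices

# Scatter the per-voxel values into dense grids.
design = np.zeros(shape, dtype=int)
density = np.zeros(shape, dtype=int)
design[x, y, z] = rows[:, 3]
density[x, y, z] = rows[:, 4]

free_mask = design == -1               # voxels without a density constraint
n_free = int(free_mask.sum())
volume_fraction = float(density.mean())
print(design.shape, n_free, volume_fraction)
```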

How to Import the Dataset

with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial this can be done via:

from dl4to.datasets import SELTODataset

dataset = SELTODataset(root=root, name=name, train=train)

Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

import pandas as pd

root = ...
file_path = f'{root}/{i}.csv'  # i is the sample index
columns = ['x', 'y', 'z', 'Ω_design', 'Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
df = pd.read_csv(file_path, names=columns)

Similarly, we can import an i_info.csv file via:

file_path = f'{root}/{i}_info.csv'
info_columns = ['E', 'ν', 'σ_ys', 'h']
df_info = pd.read_csv(file_path, names=info_columns)

We can extract PyTorch tensors from the Pandas dataframe df using the following function:

import torch

def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
    # Grid shape is taken from the indices in the last row, plus one.
    shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
    voxels = [df['x'].values, df['y'].values, df['z'].values]

    # Design space: one integer channel.
    Ω_design = torch.zeros(1, *shape, dtype=int)
    Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))

    # Dirichlet boundary conditions: one channel per spatial dimension.
    Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
    Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
    Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
    Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)

    # External forces: one channel per spatial component.
    F = torch.zeros(3, *shape, dtype=dtype)
    F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
    F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
    F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)

    # Ground truth density: one channel.
    density = torch.zeros(1, *shape, dtype=dtype)
    density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)

    return Ω_design, Ω_Dirichlet, F, density
kaggle-entity-annotated-corpus-ner-dataset
huggingface.co
Updated Jul 10, 2022
Cite
Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
Dataset updated
Jul 10, 2022
Authors
Rafael Arias Calles
License
https://choosealicense.com/licenses/odbl/
Description
Date: 2022-07-10
Files: ner_dataset.csv
Source: Kaggle entity annotated corpus
Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.

About Dataset

from Kaggle Datasets

Context

Annotated corpus for Named Entity Recognition using the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features by Natural Language Processing applied to the data set. Tip: Use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
polyOne Data Set - 100 million hypothetical polymers including 29 properties...
zenodo.org
bin, txt
Updated Mar 24, 2023
Cite
Christopher Kuenneth; Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. http://doi.org/10.5281/zenodo.7766806
Available download formats: bin, txt
Unique identifier
https://doi.org/10.5281/zenodo.7766806
Dataset updated
Mar 24, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Christopher Kuenneth; Rampi Ramprasad
Description
polyOne Data Set

The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

Full data set including the properties

The data files are in Apache Parquet format. The files start with `polyOne_*.parquet`.

I recommend using dask (`pip install dask`) to load and process the data set. Pandas also works but is slower.

Load sharded data set with dask
```python
import dask.dataframe as dd
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```

For example, compute the description of data set
```python
df_describe = ddf.describe().compute()
df_describe

```

PSMILES strings only

generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.

generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
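Because these text files hold tens of millions of lines, reading them line by line avoids loading everything into memory at once. The following is a minimal sketch of that pattern. The helper `count_lines_streaming` is hypothetical, and the small temporary file with three illustrative strings stands in for the real files.

```python
import os
import tempfile


def count_lines_streaming(path):
    """Count non-empty lines without loading the whole file into memory."""
    count = 0
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:  # the file object yields one line at a time
            if line.strip():
                count += 1
    return count


# Stand-in file with three illustrative strings, one per line.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_psmiles.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("[*]CC[*]\n[*]CC(C)[*]\n[*]CCO[*]\n")
    n_strings = count_lines_streaming(path)

print(n_strings)
```

The same loop body can be swapped for any per-string processing step, such as tokenization or filtering.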
3 scholarly articles cite the PandasPlotBench dataset.
