55 datasets found

h
wikipedia-summary-dataset-128k
huggingface.co
Updated Apr 4, 2015
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Bukowski (2015). wikipedia-summary-dataset-128k [Dataset]. https://huggingface.co/datasets/mbukowski/wikipedia-summary-dataset-128k
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 4, 2015
Authors
Martin Bukowski
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Wikipedia Summary Dataset 128k

This is random subsample of 128k entries from the wikipedia summary dataset, processed with the following code: import pandas as pd

df = pd.read_parquet('wikipedia-summary.parquet') df['l'] = df['summary'].str.len() rdf = df[(df['l'] > 300) & (df['l'] < 600)]

Filter out any rows 'topic' that have non-alphanumeric characters

mask = rdf['topic'].str.contains(r'^[a-zA-Z0-9 ]+$') == True rdf = rdf[mask == True].sample(128000)[['topic'… See the full description on the dataset page: https://huggingface.co/datasets/mbukowski/wikipedia-summary-dataset-128k.
Shopping Mall
kaggle.com
zip
Updated Dec 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
Explore at:
zip(22852 bytes)Available download formats
Dataset updated
Dec 15, 2023
Authors
Anshul Pachauri
Description
Libraries Import:

Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings. Data Loading and Exploration:

Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe(). Univariate Analysis:

Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots. Bivariate Analysis:

Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot. Gender-Based Analysis:

Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap. Univariate Clustering:

Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters. Bivariate Clustering:

Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'. Multivariate Clustering:

Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering. Result Saving:

Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
Convert Text to Pandas
kaggle.com
zip
Updated Sep 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
Explore at:
zip(4333134 bytes)Available download formats
Dataset updated
Sep 22, 2024
Authors
Zeyad Usf
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
kaggle notebook
Github Repo

I found two datasets about converting text with context to pandas code on Hugging Face, but the challenge is in the context. The context in both datasets is different which reduces the results of the model. First let's mention the data I found and then show examples, solution and some other problems.

Rahima411/text-to-pandas:

The data is divided into Train with 57.5k and Test with 19.2k.

The data has two columns as you can see in the example:

"Input": Contains the context and the question together, in the context it shows the metadata about the data frame.

"Pandas Query": Pandas code txt Input | Pandas Query -----------------------------------------------------------|------------------------------------------- Table Name: head (age (object), head_id (object)) | result = management['head.age'].unique() Table Name: management (head_id (object), | temporary_acting (object)) | What are the distinct ages of the heads who are acting? |

hiltch/pandas-create-context:

It contains 17k rows with three columns:

question : text .

context : Code to create a data frame with column names, unlike the first data set which contains the name of the data frame, column names and data type.

answer : Pandas code.

question | context | answer ----------------------------------------|--------------------------------------------------------|--------------------------------------- What was the lowest # of total votes? | df = pd.DataFrame(columns=['_number_of_total_votes']) | df['_number_of_total_votes'].min()

As you can see, the problem with this data is that they are not similar as inputs and the structure of the context is different . My solution to this problem was: - Convert the first data set to become like the second in the context. I chose this because it is difficult to get the data type for the columns in the second data set. It was easy to convert the structure of the context from this shape Table Name: head (age (object), head_id (object)) to this head = pd.DataFrame(columns=['age','head_id']) through this code that I wrote. - Then separate the question from the context. This was easy because if you look at the data, you will find that the context always ends with "(" and then a blank and then the question. You will find all of this in this code. - You will also notice that more than one code or line can be returned to the context, and this has been engineered into the code. ```py def extract_table_creation(text:str)->(str,str): """ Extracts DataFrame creation statements and questions from the given text.

Args: text (str): The input text containing table definitions and questions. Returns: tuple: A tuple containing a concatenated DataFrame creation string and a question. """ # Define patterns table_pattern = r'Table Name: (\w+) $([\w\s,()]+)$' column_pattern = r'(\w+)\s*$(object|int64|float64)$' # Find all table names and column definitions matches = re.findall(table_pattern, text) # Initialize a list to hold DataFrame creation statements df_creations = [] for table_name, columns_str in matches: # Extract column names columns = re.findall(column_pattern, columns_str) column_names = [col[0] for col in columns] # Format DataFrame creation statement df_creation = f"{table_name} = pd.DataFrame(columns={column_names})" df_creations.append(df_creation) # Concatenate all DataFrame creation statements df_creation_concat = '

'.join(df_creations)

# Extract and clean the question question = text[text.rindex(')')+1:].strip() return df_creation_concat, question

After both datasets were similar in structure, they were merged into one set and divided into _72.8K_ train and _18.6K_ test. We analyzed this dataset and you can see it all through the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we found some problems in the dataset as well, such as > - `Answer` : `df['Id'].count()` has been repeated, but this is possible, so we do not need to dispense with these rows. > - `Context` : We see that it contains `147` rows that do not contain any text. We will see Through the experiment if this will affect the results negatively or positively. > - `Question` : It is ...
PandasPlotBench
huggingface.co
Updated Nov 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 25, 2024
Dataset provided by
JetBrainshttp://jetbrains.com/
Authors
JetBrains Research
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
PandasPlotBench

PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of the Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found in arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use the our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
BBC NEWS SUMMARY(CSV FORMAT)
kaggle.com
zip
Updated Sep 9, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dhiraj (2024). BBC NEWS SUMMARY(CSV FORMAT) [Dataset]. https://www.kaggle.com/datasets/dignity45/bbc-news-summarycsv-format
Explore at:
zip(2097600 bytes)Available download formats
Dataset updated
Sep 9, 2024
Authors
Dhiraj
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Description: Text Summarization Dataset

This dataset is designed for users aiming to train models for text summarization. It contains 2,225 rows of data with two columns: "Text" and "Summary". Each row features a detailed news article or piece of text paired with its corresponding summary, providing a rich resource for developing and fine-tuning summarization algorithms.

Key Features:

Text: Full-length articles or passages that serve as the input for summarization.

Summary: Concise summaries of the articles, which are ideal for training models to generate brief, coherent summaries from longer texts.

Future Enhancements:

This evolving dataset is planned to include additional features, such as text class labels, in future updates. These enhancements will provide more context and facilitate the development of models that can perform summarization across different categories of news content.

Usage:

Ideal for researchers and developers focused on text summarization tasks, this dataset enables the training of models to effectively compress information while retaining the essence of the original content.

Acknowledgment

We would like to extend our sincere gratitude to the dataset creator for their contribution to this valuable resource. This dataset, sourced from the BBC News Summary dataset on Kaggle, was created by Pariza. Their work has provided an invaluable asset for those working on text summarization tasks, and we appreciate their efforts in curating and sharing this data with the community.

Thank you for supporting research and development in the field of natural language processing!

File Description

This script processes and consolidates text data from various directories containing news articles and their corresponding summaries. It reads the files from specified folders, handles encoding issues, and then creates a DataFrame that is saved as a CSV file for further analysis.

Key Components:

Imports:

numpy (np): Numerical operations library, though it's not used in this script.

pandas (pd): Data manipulation and analysis library.

os: For interacting with the operating system, e.g., building file paths.

glob: For file pattern matching and retrieving file paths.

Function: get_texts

Parameters:

text_folders: List of folders containing news article text files.

text_list: List to store the content of text files.

summ_folder: List of folders containing summary text files.

sum_list: List to store the content of summary files.

encodings: List of encodings to try for reading files.

Purpose:

Reads text files from specified folders, handles different encodings, and appends the content to text_list and sum_list.

Returns the updated lists of texts and summaries.

Data Preparation:

text_folder: List of directories for news articles.

summ_folder: List of directories for summaries.

text_list and summ_list: Initialize empty lists to store the contents.

data_df: Empty DataFrame to store the final data.

Execution:

Calls get_texts function to populate text_list and summ_list.

Creates a DataFrame data_df with columns 'Text' and 'Summary'.

Saves data_df to a CSV file at /kaggle/working/bbc_news_data.csv.

Output:

Prints the first few entries of the DataFrame to verify the content.

Column Descriptions:

Text: Contains the full-length articles or passages of news content. This column is used as the input for summarization models.

Summary: Contains concise summaries of the corresponding articles in the "Text" column. This column is used as the target output for summarization models.

Usage:

This script is designed to be run in a Kaggle environment where paths to text data are predefined.

It is intended for preprocessing and saving text data from news articles and summaries for subsequent analysis or model training.
h
onlystacked-xsum-1024
huggingface.co
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Stacked Summaries (2023). onlystacked-xsum-1024 [Dataset]. https://huggingface.co/datasets/stacked-summaries/onlystacked-xsum-1024
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 1, 2023
Dataset authored and provided by
Stacked Summaries
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
stacked-summaries/onlystacked-xsum-1024

Same thing as stacked-summaries/stacked-xsum-1024 but filtered such that is_stacked=True. Please refer to the original dataset for info and to raise issues if needed. Basic info on train split:

0 document 116994 non-null string 1… See the full description on the dataset page: https://huggingface.co/datasets/stacked-summaries/onlystacked-xsum-1024.
NY Times - Latest News Articles Dataset
kaggle.com
zip
Updated Jan 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anish Chougule (2025). NY Times - Latest News Articles Dataset [Dataset]. https://www.kaggle.com/datasets/anishchougule2002/ny-times-latest-news-articles-dataset
Explore at:
zip(297117 bytes)Available download formats
Dataset updated
Jan 16, 2025
Authors
Anish Chougule
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
# News Articles Dataset

1. Fetching Data:

The data is fetched from the New York Times Home page using the New York Times API.

API returns the following columns for every article: - section - subsection - title - abstract - url - uri - byline - item_type - updated_date - created_date - published_date - material_type_facet - kicker - des_facet - org_facet - per_facet - geo_facet - multimedia - short_url

2. Data Cleaning:

The fetched data is converted into a pandas DataFrame and empty fields are dropped. All the keywords column are combined into one keywords column. Unnecessary fields are dropped. and then saved into a csv file.

3. Data Visualization:

There is an example file for data visualization using this dataset.
h
TUDelft-Electricity-Consumption-1.0
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
OpenSynth-Energy, TUDelft-Electricity-Consumption-1.0 [Dataset]. https://huggingface.co/datasets/OpenSynth/TUDelft-Electricity-Consumption-1.0
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset authored and provided by
OpenSynth-Energy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Timeseries Data Processing

This repository contains a script for loading and processing time series data using the datasets library and converting it to a pandas DataFrame for further analysis.

Dataset

The dataset used contains time series data with the following features:

id: Identifier for the dataset, formatted as Country_Number of Household (e.g., GE_1 for Germany, household 1).
datetime: Timestamp indicating the date and time of the observation.
target: Energy… See the full description on the dataset page: https://huggingface.co/datasets/OpenSynth/TUDelft-Electricity-Consumption-1.0.
h
medical_institutions_reviews
huggingface.co
Updated Jan 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pavel Blinov (2023). medical_institutions_reviews [Dataset]. https://huggingface.co/datasets/blinoff/medical_institutions_reviews
Explore at:
Dataset updated
Jan 29, 2023
Authors
Pavel Blinov
Description
Dataset Summary

The dataset contains user reviews about medical institutions. In total it contains 12,036 reviews. A review tagged with the general sentiment and sentiments on 5 aspects: quality, service, equipment, food, location.

Data Fields

Each sample contains the following fields:

review_id; content: review text; general; quality; service; equipment; food; location.

Python

import pandas as pd df = pd.read_json('medical_institutions_reviews.jsonl'… See the full description on the dataset page: https://huggingface.co/datasets/blinoff/medical_institutions_reviews.
Crimp Force Curve Dataset
zenodo.org
bin
Updated Sep 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bernd Hofmann; Bernd Hofmann; Patrick Bründl; Patrick Bründl; Jörg Franke; Jörg Franke (2025). Crimp Force Curve Dataset [Dataset]. http://doi.org/10.7910/dvn/wbdkn6
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.7910/dvn/wbdkn6
Dataset updated
Sep 11, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Bernd Hofmann; Bernd Hofmann; Patrick Bründl; Patrick Bründl; Jörg Franke; Jörg Franke
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The "Crimp Force Curve Dataset" is a comprehensive collection of univariate time series data representing crimp force curves recorded during the manufacturing process of crimp connections. This dataset has been designed to support a variety of applications, including anomaly detection, fault diagnosis, and research in data-driven quality assurance.

A salient feature of this dataset is the presence of high-quality labels. Each crimp force curve is annotated both by a state-of-the-art crimp force monitoring system - capable of binary anomaly detection - and by domain experts who manually classified the curves into detailed quality classes. The expert annotations provide a valuable ground truth for training and benchmarking machine learning models beyond anomaly detection.

The dataset is particularly well-suited for tasks involving time series analysis, such as training and evaluating of machine learning algorithms for quality control and fault detection. It provides a substantial foundation for the development of generalisable, yet domain-specific (crimping), data-driven quality control systems.

The data is stored in a Python pickle file crimp_force_curves.pkl, which is a binary format used to serialize and deserialize Python objects. It can be conveniently loaded into a pandas DataFrame for exploration and analysis using the following command:

df = pd.read_pickle("crimp_force_curves.pkl")

This dataset is a valuable resource for researchers and practitioners in manufacturing engineering, computer science, and data science who are working at the intersection of quality control in manufacturing and machine learning.
Summary of miRNAs sequencing.
plos.figshare.com
xls
Updated Jun 4, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mingyu Yang; Lianming Du; Wujiao Li; Fujun Shen; Zhenxin Fan; Zuoyi Jian; Rong Hou; Yongmei Shen; Bisong Yue; Xiuyue Zhang (2023). Summary of miRNAs sequencing. [Dataset]. http://doi.org/10.1371/journal.pone.0143242.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0143242.t002
Dataset updated
Jun 4, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Mingyu Yang; Lianming Du; Wujiao Li; Fujun Shen; Zhenxin Fan; Zuoyi Jian; Rong Hou; Yongmei Shen; Bisong Yue; Xiuyue Zhang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary of miRNAs sequencing.
h
Press-and-Plot
huggingface.co
Updated Oct 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Center for Humanities Computing Aarhus (2025). Press-and-Plot [Dataset]. https://huggingface.co/datasets/chcaa/Press-and-Plot
Explore at:
Dataset updated
Oct 27, 2025
Dataset authored and provided by
Center for Humanities Computing Aarhus
Description
Press&Plot: Curated Danish 19th-Century Stories & Serial Fiction (v1.0)

Short description:A curated collection of 29 Danish newspaper stories (1816–1832), including single-part and multi-part fiction, manually inspected, cleaned, and categorized for research use. The dataset is a growing resource.

Dowloading the dataset

using python

from datasets import load_dataset

ds = load_dataset("chcaa/press-and-plot", split="train")

if you want it as a pandas DataFrame:

df =… See the full description on the dataset page: https://huggingface.co/datasets/chcaa/Press-and-Plot.
d
Replication Data for Exploring an extinct society through the lens of...
dataone.org
Updated Dec 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wieczorek, Oliver; Malzahn, Melanie (2023). Replication Data for Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus [Dataset]. http://doi.org/10.7910/DVN/UF8DHK
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/UF8DHK
Dataset updated
Dec 16, 2023
Dataset provided by
Harvard Dataverse
Authors
Wieczorek, Oliver; Malzahn, Melanie
Description
The files and workflow will allow you to replicate the study titled "Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus". This study aimed at utilizing the CEToM-corpus (https://cetom.univie.ac.at/) (Tocharian) to analyze the life-world of the elites of an extinct society situated in modern eastern China. To acquire the raw data needed for steps 1 & 2, please contact Melanie Malzahn melanie.malzahn@univie.ac.at. We conducted a mixed methods study, containing of close reading, content analysis, and multiple correspondence analysis (MCA). The excel file titled "fragments_architecture_combined.xlsx" allows for replication of the MCA and equates to the third step of the workflow outlined below. We used the following programming languages and packages to prepare the dataset and to analyze the data. Data preparation and merging procedures were achieved in python (version 3.9.10) with packages pandas (version 1.5.3), os (version 3.12.0), re (version 3.12.0), numpy (version 1.24.3), gensim (version 4.3.1), BeautifulSoup4 (version 4.12.2), pyasn1 (version 0.4.8), and langdetect (version 1.0.9). Multiple Correspondence Analyses were conducted in R (version 4.3.2) with the packages FactoMineR (version 2.9), factoextra (version 1.0.7), readxl version(1.4.3), tidyverse version(2.0.0), ggplot2 (version 3.4.4) and psych (version 2.3.9). After requesting the necessary files, please open the scripts in the order outlined bellow and execute the code-files to replicate the analysis: Preparatory step: Create a folder for the python and r-scripts downloadable in this repository. Open the file 0_create folders.py and declare a root folder in line 19. This first script will generate you the following folders: "tarim-brahmi_database" = Folder, which contains tocharian dictionaries and tocharian text fragments. "dictionaries" = contains tocharian A and tocharian B vocabularies, including linguistic features such as translations, meanings, part of speech tags etc. A full overview of the words is provided on https://cetom.univie.ac.at/?words. "fragments" = contains tocharian text fragments as xml-files. "word_corpus_data" = folder will contain excel-files of the corpus data after the first step. "Architectural_terms" = This folder contains the data on the architectural terms used in the dataset (e.g. dwelling, house). "regional_data" = This folder contains the data on the findsports (tocharian and modern chinese equivalent, e.g. Duldur-Akhur & Kucha). "mca_ready_data" = This is the folder, in which the excel-file with the merged data will be saved. Note that the prepared file named "fragments_architecture_combined.xlsx" can be saved into this directory. This allows you to skip steps 1 &2 and reproduce the MCA of the content analysis based on the third step of our workflow (R-Script titled 3_conduct_MCA.R). First step - run 1_read_xml-files.py: loops over the xml-files in folder dictionaries and identifies a) word metadata, including language (Tocharian A or B), keywords, part of speech, lemmata, word etymology, and loan sources. Then, it loops over the xml-textfiles and extracts a text id number, langauge (Tocharian A or B), text title, text genre, text subgenre, prose type, verse type, material on which the text is written, medium, findspot, the source text in tocharian, and the translation where available. After successful feature extraction, the resulting pandas dataframe object is exported to the word_corpus_data folder. Second step - run 2_merge_excel_files.py: merges all excel files (corpus, data on findspots, word data) and reproduces the content analysis, which was based upon close reading in the first place. Third step - run 3_conduct_MCA.R: recodes, prepares, and selects the variables necessary to conduct the MCA. Then produces the descriptive values, before conducitng the MCA, identifying typical texts per dimension, and exporting the png-files uploaded to this repository.
m
Data for: Can government transfers make energy subsidy reform socially...
data.mendeley.com
Updated Mar 31, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Filip Schaffitzel (2020). Data for: Can government transfers make energy subsidy reform socially acceptable? A case study on Ecuador [Dataset]. http://doi.org/10.17632/z35m76mf9g.1
Explore at:
Unique identifier
https://doi.org/10.17632/z35m76mf9g.1
Dataset updated
Mar 31, 2020
Authors
Filip Schaffitzel
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Area covered
Ecuador
Description
Estimating the distributional impacts of energy subsidy removal and compensation schemes in Ecuador based on input-output and household data.

Import files: Dictionary Categories.csv, Dictionary ENI-IOT.csv, and Dictionary Subcategories.csv based on [1] Dictionary IOT.csv and IOT_2012.csv (cannot be redistruted) based on [2] Dictionary Taxes.csv and Dictionary Transfers.csv based on [3] ENIGHUR11_GASTOS_V.csv, ENIGHUR11_HOGARES_AGREGADOS.csv, and ENIGHUR11_PERSONAS_INGRESOS.csv based on [4] Price increase scenarios.csv based on [5]

Further basic files and documents: [1] 4_M&D_Mapping ENIGHUR expenditures to IOT_180605.xlsm [2] Input-output table 2012 (https://contenido.bce.fin.ec/documentos/PublicacionesNotas/Catalogo/CuentasNacionales/Anuales/Dolares/MIP2012Ampliada.xls). Save the sheet with the IOT 2012 (Matriz simétrica) as IOT_2012.csv and edit the format: first column and row: IOT labels [3] 4_M&D_ENIGHUR income_180606.xlsx [4] ENIGHUR data can be retrieved from http://www.ecuadorencifras.gob.ec/encuesta-nacional-de-ingresos-y-gastos-de-los-hogares-urbanos-y-rurales/ Household datasets are only available in SPSS file format and the free software PSPP is used to convert .sav- to .csv-files, as this format can be read directly and efficiently into a Python Pandas DataFrame. See PSPP syntax below: save translate /outfile = filename /type = CSV /textoptions decimal = DOT /textoptions delimiter = ';' /fieldnames /cells=values /replace. [5] 3_Ecuador_Energy subsidies and 4_M&D_Price scenarios_180610.xlsx
Summary the gender and age for all samples.
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jun 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mingyu Yang; Lianming Du; Wujiao Li; Fujun Shen; Zhenxin Fan; Zuoyi Jian; Rong Hou; Yongmei Shen; Bisong Yue; Xiuyue Zhang (2023). Summary the gender and age for all samples. [Dataset]. http://doi.org/10.1371/journal.pone.0143242.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0143242.t001
Dataset updated
Jun 7, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Mingyu Yang; Lianming Du; Wujiao Li; Fujun Shen; Zhenxin Fan; Zuoyi Jian; Rong Hou; Yongmei Shen; Bisong Yue; Xiuyue Zhang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summary the gender and age for all samples.
h
n-ensemble
huggingface.co
Updated Jun 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nova AI Labs (2025). n-ensemble [Dataset]. https://huggingface.co/datasets/nova-ai-labs/n-ensemble
Explore at:
Dataset updated
Jun 3, 2025
Dataset authored and provided by
Nova AI Labs
Description
Nessembele

A small coding dataset for practice and learning.

Quick Start

from datasets import load_dataset

dataset = load_dataset("novastudio/nessembele") df = dataset["train"].to_pandas() print(df.head())

What's Inside

File: coding_dataset.csv Format: CSV with headers Size: Small and beginner-friendly Purpose: Coding practice and data analysis

Usage Load with Pandas

import pandas as pd df =… See the full description on the dataset page: https://huggingface.co/datasets/nova-ai-labs/n-ensemble.
m
Data for: Electrical system architectures for building-ntegrated...
data.mendeley.com
Updated Mar 31, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Konstantinos Spiliotis (2020). Data for: Electrical system architectures for building-ntegrated photovoltaics (BIPV): A comparative analysis using a modelling framework in Modelica [Dataset]. http://doi.org/10.17632/g83gxhn77y.1
Explore at:
Unique identifier
https://doi.org/10.17632/g83gxhn77y.1
Dataset updated
Mar 31, 2020
Authors
Konstantinos Spiliotis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The folder contains data related to manuscript: "Electrical system architectures for building-integratedphotovoltaics (BIPV): A comparative analysis using amodelling framework in Modelica". Specifically, it contains:

1) Power electronics efficiency curves 2) Input meteorological data per location (TMY) 3) Results (KPI) in pandas dataframe csv format.

Feel free to use the any data, provided that you respect our authorship and you cite the dataset and/or the associated paper that provides detailed explanations on them.
US Consumer Complaints Against Businesses
kaggle.com
zip
Updated Oct 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jeffery Mandrake (2022). US Consumer Complaints Against Businesses [Dataset]. https://www.kaggle.com/jefferymandrake/us-consumer-complaints-dataset-through-2019
Explore at:
zip(343188956 bytes)Available download formats
Dataset updated
Oct 9, 2022
Authors
Jeffery Mandrake
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
2,121,458 records

I used Google Colab to check out this dataset and pull the column names using Pandas.

Sample code example: Python Pandas read csv file compressed with gzip and load into Pandas dataframe https://pastexy.com/106/python-pandas-read-csv-file-compressed-with-gzip-and-load-into-pandas-dataframe

Columns: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID']

I did not modify the dataset.

Use it to practice with dataframes - Pandas or PySpark on Google Colab:

!unzip complaints.csv.zip

import pandas as pd df = pd.read_csv('complaints.csv') df.columns

df.head() etc.

The Device Activity Report with Complete Knowledge (DARCK) for NILM

zenodo.org

bin, xz

Updated Sep 19, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Anonymous Anonymous; Anonymous Anonymous (2025). The Device Activity Report with Complete Knowledge (DARCK) for NILM [Dataset]. http://doi.org/10.5281/zenodo.17159850

Explore at:

bin, xzAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.17159850

Dataset updated

Sep 19, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Anonymous Anonymous; Anonymous Anonymous

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

1. Abstract

This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.

2. Dataset Overview

Apartment: Two-person apartment, approx. 58m², located in Aachen, Germany.
Aggregate Meter: eBZ DD3
Sub-meters: 31 Shelly Plus Plug S, 6 Shelly Plus 1PM, 3 Shelly Plus PM Mini Gen3
Sampling Rate: 1 Hz
Measured Quantity: Active Power
Unit of Measurement: Watt
Duration: 6 months
Format: Single CSV file (`DARCK.csv`)
Structure: Timestamped rows with columns for the aggregate meter and each sub-metered appliance.
Completeness: The main power meter has a completeness of 99.3%. Missing values were linearly interpolated.

3. Download and Usage

The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850

As it contains longer off periods with zeros, the CSV file is nicely compressible.

To extract it use: xz -d DARCK.csv.xz.
The compression leads to a 97% smaller file size (From 4GB to 90.9MB).

To use the dataset in python, you can, e.g., load the csv file into a pandas dataframe.

python
import pandas as pd

df = pd.read_csv("DARCK.csv", parse_dates=["time"])

4. Measurement Setup

The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.

5. File Format (`DARCK.csv`)

The dataset is provided as a single comma-separated value (CSV) file.

The first row is a header containing the column names.
All power values are rounded to the first decimal place.
There are no missing values in the final dataset.
Each row represents 1 second, from start of measuring in March until the end in September.

Column Descriptions

Column Name	Data Type	Unit	Description
`time`	datetime	-	Timestamp for the reading in `YYYY-MM-DD HH:MM:SS`
`main`	float	Watt	Total aggregate power consumption for the apartment, measured at the main electrical panel.
`[appliance_name]`	float	Watt	Power consumption of an individual appliance (e.g., `lightbathroom`, `fridge`, `sherlockpc`). See Section 8 for a full list.
Aggregate Columns
`aggr_chargers`	float	Watt	The sum of `sherlockcharger`, `sherlocklaptop`, `watsoncharger`, `watsonlaptop`, `watsonipadcharger`, `kitchencharger`.
`aggr_stoveplates`	float	Watt	The sum of `stoveplatel1` and `stoveplatel2`.
`aggr_lights`	float	Watt	The sum of `lightbathroom`, `lighthallway`, `lightsherlock`, `lightkitchen`, `lightlivingroom`, `lightwatson`, `lightstoreroom`, `fcob`, `sherlockalarmclocklight`, `sherlockfloorlamphue`, `sherlockledstrip`, `livingfloorlamphue`, `sherlockglobe`, `watsonfloorlamp`, `watsondesklamp` and `watsonledmap`.
Analysis Columns
`inaccuracy`	float	Watt	As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for.

6. Data Postprocessing Pipeline

The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.

6.1. Main Meter (`main`) Postprocessing

The aggregate power data required several cleaning steps to ensure accuracy.

Outlier Removal: Readings below 10W or above 10,000W were removed (merely 3 occurrences).
Timestamp Burst Correction: The source data contained bursts of delayed readings. A custom algorithm was used to identify these bursts (large time gap followed by rapid readings) and back-fill the timestamps to create an evenly spaced time series.
Alignment & Interpolation: The smart meter pushes a new value via infrared every second. To align those to the whole seconds, it was resampled to a 1-second frequency by taking the mean of all readings within each second (in 99.5% only 1 value). Any resulting gaps (0.7% outage ratio) were filled using linear interpolation.

6.2. Sub-metered Devices (`shellies`) Postprocessing

The Shelly devices are not prone to the same burst issue as the ESP8266 is. They push a new reading at every change in power drawn. If no power change is observed or the one observed is too small (less than a few Watt), the reading is pushed once a minute, together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.

Grouping: Data was grouped by the unique device identifier.
Resampling & Filling: The data for each device was resampled to a 1-second frequency using .resample('1s').last().ffill().
This method was chosen to firstly, capture the last known state of the device within each second, handling rapid on/off events. Secondly, to forward-fill the last state across periods of no new data, modeling that the device's consumption remained constant until a new reading was sent.

6.3. Merging and Finalization

Merge: The cleaned main meter and all sub-metered device dataframes were merged into a single dataframe on the time index.
Final Fill: Any remaining NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption.

7. Manual Corrections and Known Data Issues

During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.

March 10th - Unmetered Bulb: An unmetered 107W bulb was active. It was subtracted from the main reading as if it never happened.
May 31st - Unmetered Air Pump: An unmetered 101W pump for an air mattress was used directly in an outlet with no intermediary plug and hence manually added to the respective plug.

8. Appliance Details and Multipurpose Plugs

The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.

h
rag
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
VIGNESH M, rag [Dataset]. https://huggingface.co/datasets/vicky3241/rag
Explore at:
Authors
VIGNESH M
Description
import pandas as pd

Example dataset with new columns

data = [ { "title": "Pandas Library", "about": "Pandas is a Python library for data manipulation and analysis.", "procedure": "Install Pandas via pip, load data into DataFrames, clean and analyze data using built-in functions.", "content": """ Pandas provides data structures like Series and DataFrame for handling structured data. It supports indexing, slicing, aggregation, joining, and filtering… See the full description on the dataset page: https://huggingface.co/datasets/vicky3241/rag.

Facebook

Twitter

Click to copy link

Link copied

Cite

Martin Bukowski (2015). wikipedia-summary-dataset-128k [Dataset]. https://huggingface.co/datasets/mbukowski/wikipedia-summary-dataset-128k

wikipedia-summary-dataset-128k

mbukowski/wikipedia-summary-dataset-128k

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Apr 4, 2015

Authors

Martin Bukowski

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

Wikipedia Summary Dataset 128k

This is random subsample of 128k entries from the wikipedia summary dataset, processed with the following code: import pandas as pd

df = pd.read_parquet('wikipedia-summary.parquet') df['l'] = df['summary'].str.len() rdf = df[(df['l'] > 300) & (df['l'] < 600)]

Filter out any rows 'topic' that have non-alphanumeric characters

mask = rdf['topic'].str.contains(r'^[a-zA-Z0-9 ]+$') == True rdf = rdf[mask == True].sample(128000)[['topic'… See the full description on the dataset page: https://huggingface.co/datasets/mbukowski/wikipedia-summary-dataset-128k.

Clear search

Close search

Google apps

Main menu

wikipedia-summary-dataset-128k

Filter out any rows 'topic' that have non-alphanumeric characters

Shopping Mall

Convert Text to Pandas

PandasPlotBench

BBC NEWS SUMMARY(CSV FORMAT)

Dataset Description: Text Summarization Dataset

Key Features:

Future Enhancements:

Usage:

Acknowledgment

File Description

Key Components:

Column Descriptions:

Usage:

onlystacked-xsum-1024

NY Times - Latest News Articles Dataset

# News Articles Dataset

1. Fetching Data:

2. Data Cleaning:

3. Data Visualization:

TUDelft-Electricity-Consumption-1.0

medical_institutions_reviews

Crimp Force Curve Dataset

Summary of miRNAs sequencing.

Press-and-Plot

using python

if you want it as a pandas DataFrame:

Replication Data for Exploring an extinct society through the lens of...

Data for: Can government transfers make energy subsidy reform socially...

Summary the gender and age for all samples.

n-ensemble

Data for: Electrical system architectures for building-ntegrated...

US Consumer Complaints Against Businesses

The Device Activity Report with Complete Knowledge (DARCK) for NILM

1. Abstract

2. Dataset Overview

3. Download and Usage

4. Measurement Setup

5. File Format (DARCK.csv)

Column Descriptions

Column Name

Data Type

Unit

Description

6. Data Postprocessing Pipeline

6.1. Main Meter (main) Postprocessing

6.2. Sub-metered Devices (shellies) Postprocessing

6.3. Merging and Finalization

7. Manual Corrections and Known Data Issues

8. Appliance Details and Multipurpose Plugs

rag

wikipedia-summary-dataset-128k

mbukowski/wikipedia-summary-dataset-128k

Filter out any rows 'topic' that have non-alphanumeric characters

5. File Format (`DARCK.csv`)

6.1. Main Meter (`main`) Postprocessing

6.2. Sub-metered Devices (`shellies`) Postprocessing