55 datasets found
  1. wikipedia-summary-dataset-128k

    • huggingface.co
    Updated Apr 4, 2015
    Cite
    Martin Bukowski (2015). wikipedia-summary-dataset-128k [Dataset]. https://huggingface.co/datasets/mbukowski/wikipedia-summary-dataset-128k
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 4, 2015
    Authors
    Martin Bukowski
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Wikipedia Summary Dataset 128k

    This is a random subsample of 128k entries from the Wikipedia summary dataset, processed with the following code:

    import pandas as pd

    df = pd.read_parquet('wikipedia-summary.parquet')
    df['l'] = df['summary'].str.len()
    rdf = df[(df['l'] > 300) & (df['l'] < 600)]

    # Filter out any rows whose 'topic' contains non-alphanumeric characters
    mask = rdf['topic'].str.contains(r'^[a-zA-Z0-9 ]+$') == True
    rdf = rdf[mask == True].sample(128000)[['topic'…
    See the full description on the dataset page: https://huggingface.co/datasets/mbukowski/wikipedia-summary-dataset-128k.
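
    To work with the published subset directly, a load along these lines should work (a minimal sketch; the repository id is taken from the dataset URL and the split name is an assumption):

    ```py
    from datasets import load_dataset

    ds = load_dataset("mbukowski/wikipedia-summary-dataset-128k", split="train")  # split assumed
    df = ds.to_pandas()
    print(df.head())
    ```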

  2. Shopping Mall

    • kaggle.com
    zip
    Updated Dec 15, 2023
    Cite
    Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
    Explore at:
    zip (22852 bytes). Available download formats
    Dataset updated
    Dec 15, 2023
    Authors
    Anshul Pachauri
    Description

    Libraries Import:

    Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

    Data Loading and Exploration:

    Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().

    Univariate Analysis:

    Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

    Bivariate Analysis:

    Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

    Gender-Based Analysis:

    Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.

    Univariate Clustering:

    Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.

    Bivariate Clustering:

    Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

    Multivariate Clustering:

    Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.

    Result Saving:

    Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
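
    For reference, a minimal sketch of the bivariate clustering step described above (the file name and column labels follow the description; the parameter choices are assumptions):

    ```py
    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.read_csv("Mall_Customers.csv")
    X = df[["Annual Income (k$)", "Spending Score (1-100)"]]

    # Elbow method: inertia for k = 1..10
    inertia = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for k in range(1, 11)]

    # Final model with 5 clusters, as in the notebook
    km = KMeans(n_clusters=5, n_init=10, random_state=0)
    df["Spending and Income Cluster"] = km.fit_predict(X)
    df.to_csv("Result.csv", index=False)
    ```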

  3. Convert Text to Pandas

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Cite
    Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
    Explore at:
    zip (4333134 bytes). Available download formats
    Dataset updated
    Sep 22, 2024
    Authors
    Zeyad Usf
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    kaggle notebook
    Github Repo

    I found two datasets on Hugging Face for converting text with context into pandas code, but the challenge lies in the context: it is structured differently in the two datasets, which hurts the model's results. First, let's describe the data I found, then show examples, the solution, and some other problems.

    • Rahima411/text-to-pandas:

      • The data is divided into Train with 57.5k and Test with 19.2k.

      • The data has two columns as you can see in the example:

        • "Input": Contains the context and the question together; the context provides the metadata about the data frame.
        • "Pandas Query": The pandas code.

          Input                                                    | Pandas Query
          ---------------------------------------------------------|------------------------------------------
          Table Name: head (age (object), head_id (object))        | result = management['head.age'].unique()
          Table Name: management (head_id (object),                |
          temporary_acting (object))                               |
          What are the distinct ages of the heads who are acting?  |
    • hiltch/pandas-create-context:

      • It contains 17k rows with three columns:
        • question: text.
        • context: code to create a data frame with column names, unlike the first dataset, whose context contains the data frame name, column names, and data types.
        • answer: pandas code.

          question                                | context                                                 | answer
          ----------------------------------------|---------------------------------------------------------|---------------------------------------
          What was the lowest # of total votes?   | df = pd.DataFrame(columns=['_number_of_total_votes'])   | df['_number_of_total_votes'].min()

    As you can see, the problem with these datasets is that their inputs are not alike and the structure of the context is different. My solution to this problem was:
    - Convert the first dataset so that its context looks like the second's. I chose this direction because it is difficult to recover the column data types in the second dataset. It was easy to convert the context from `Table Name: head (age (object), head_id (object))` to `head = pd.DataFrame(columns=['age','head_id'])` with the code I wrote below.
    - Then separate the question from the context. This is easy because, if you look at the data, the context always ends with ")", followed by a blank and then the question. You will find all of this in the code.
    - You will also notice that a context can produce more than one creation line, and this has been engineered into the code as well.

    ```py
    import re


    def extract_table_creation(text: str) -> (str, str):
        """
        Extracts DataFrame creation statements and questions from the given text.

        Args:
          text (str): The input text containing table definitions and questions.

        Returns:
          tuple: A tuple containing a concatenated DataFrame creation string and a question.
        """
        # Define patterns
        table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
        column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

        # Find all table names and column definitions
        matches = re.findall(table_pattern, text)

        # Initialize a list to hold DataFrame creation statements
        df_creations = []

        for table_name, columns_str in matches:
            # Extract column names
            columns = re.findall(column_pattern, columns_str)
            column_names = [col[0] for col in columns]

            # Format DataFrame creation statement
            df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
            df_creations.append(df_creation)

        # Concatenate all DataFrame creation statements
        df_creation_concat = '\n\n'.join(df_creations)

        # Extract and clean the question
        question = text[text.rindex(')')+1:].strip()

        return df_creation_concat, question
    ```
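    For illustration, a hypothetical call on the sample row shown in the table above (assuming re and the function above are in scope; the input string is reconstructed from the example, not taken from the dataset file):

    ```py
    text = ("Table Name: head (age (object), head_id (object)) "
            "Table Name: management (head_id (object), temporary_acting (object)) "
            "What are the distinct ages of the heads who are acting?")

    context, question = extract_table_creation(text)
    print(context)
    # head = pd.DataFrame(columns=['age', 'head_id'])
    #
    # management = pd.DataFrame(columns=['head_id', 'temporary_acting'])
    print(question)
    # What are the distinct ages of the heads who are acting?
    ```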
    After both datasets were similar in structure, they were merged into one set and divided into _72.8K_ train and _18.6K_ test examples. We analyzed this dataset, and you can see it all in the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we also found some problems in the dataset, such as:
    > - `Answer`: `df['Id'].count()` is repeated, but this is plausible, so we do not need to drop these rows.
    > - `Context`: it contains `147` rows with no text at all. We will see through the experiments whether this affects the results negatively or positively.
    > - `Question`: It is ...
    
  4. PandasPlotBench

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    JetBrains: http://jetbrains.com/
    Authors
    JetBrains Research
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    PandasPlotBench

    PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of a Pandas DataFrame. 🛠️ Task: given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
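
    Loading the benchmark for inspection might look like this (a hedged sketch; the repository id is taken from the dataset URL and the split name is an assumption):

    ```py
    from datasets import load_dataset

    bench = load_dataset("JetBrains-Research/PandasPlotBench", split="test")  # split assumed
    print(bench[0].keys())
    ```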

  5. BBC NEWS SUMMARY(CSV FORMAT)

    • kaggle.com
    zip
    Updated Sep 9, 2024
    Cite
    Dhiraj (2024). BBC NEWS SUMMARY(CSV FORMAT) [Dataset]. https://www.kaggle.com/datasets/dignity45/bbc-news-summarycsv-format
    Explore at:
    zip (2097600 bytes). Available download formats
    Dataset updated
    Sep 9, 2024
    Authors
    Dhiraj
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description: Text Summarization Dataset

    This dataset is designed for users aiming to train models for text summarization. It contains 2,225 rows of data with two columns: "Text" and "Summary". Each row features a detailed news article or piece of text paired with its corresponding summary, providing a rich resource for developing and fine-tuning summarization algorithms.

    Key Features:

    • Text: Full-length articles or passages that serve as the input for summarization.
    • Summary: Concise summaries of the articles, which are ideal for training models to generate brief, coherent summaries from longer texts.

    Future Enhancements:

    This evolving dataset is planned to include additional features, such as text class labels, in future updates. These enhancements will provide more context and facilitate the development of models that can perform summarization across different categories of news content.

    Usage:

    Ideal for researchers and developers focused on text summarization tasks, this dataset enables the training of models to effectively compress information while retaining the essence of the original content.

    Acknowledgment

    We would like to extend our sincere gratitude to the dataset creator for their contribution to this valuable resource. This dataset, sourced from the BBC News Summary dataset on Kaggle, was created by Pariza. Their work has provided an invaluable asset for those working on text summarization tasks, and we appreciate their efforts in curating and sharing this data with the community.

    Thank you for supporting research and development in the field of natural language processing!

    File Description

    This script processes and consolidates text data from various directories containing news articles and their corresponding summaries. It reads the files from specified folders, handles encoding issues, and then creates a DataFrame that is saved as a CSV file for further analysis.

    Key Components:

    1. Imports:

      • numpy (np): Numerical operations library, though it's not used in this script.
      • pandas (pd): Data manipulation and analysis library.
      • os: For interacting with the operating system, e.g., building file paths.
      • glob: For file pattern matching and retrieving file paths.
    2. Function: get_texts

      • Parameters:
        • text_folders: List of folders containing news article text files.
        • text_list: List to store the content of text files.
        • summ_folder: List of folders containing summary text files.
        • sum_list: List to store the content of summary files.
        • encodings: List of encodings to try for reading files.
      • Purpose:
        • Reads text files from specified folders, handles different encodings, and appends the content to text_list and sum_list.
        • Returns the updated lists of texts and summaries.
    3. Data Preparation:

      • text_folder: List of directories for news articles.
      • summ_folder: List of directories for summaries.
      • text_list and summ_list: Initialize empty lists to store the contents.
      • data_df: Empty DataFrame to store the final data.
    4. Execution:

      • Calls get_texts function to populate text_list and summ_list.
      • Creates a DataFrame data_df with columns 'Text' and 'Summary'.
      • Saves data_df to a CSV file at /kaggle/working/bbc_news_data.csv.
    5. Output:

      • Prints the first few entries of the DataFrame to verify the content.

    Column Descriptions:

    • Text: Contains the full-length articles or passages of news content. This column is used as the input for summarization models.
    • Summary: Contains concise summaries of the corresponding articles in the "Text" column. This column is used as the target output for summarization models.

    Usage:

    • This script is designed to be run in a Kaggle environment where paths to text data are predefined.
    • It is intended for preprocessing and saving text data from news articles and summaries for subsequent analysis or model training.
  6. onlystacked-xsum-1024

    • huggingface.co
    Updated Jun 1, 2023
    Cite
    Stacked Summaries (2023). onlystacked-xsum-1024 [Dataset]. https://huggingface.co/datasets/stacked-summaries/onlystacked-xsum-1024
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 1, 2023
    Dataset authored and provided by
    Stacked Summaries
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    stacked-summaries/onlystacked-xsum-1024

    The same as stacked-summaries/stacked-xsum-1024, but filtered such that is_stacked=True. Please refer to the original dataset for info and to raise issues if needed. Basic info on the train split:

    0 document 116994 non-null string 1… See the full description on the dataset page: https://huggingface.co/datasets/stacked-summaries/onlystacked-xsum-1024.

  7. NY Times - Latest News Articles Dataset

    • kaggle.com
    zip
    Updated Jan 16, 2025
    Cite
    Anish Chougule (2025). NY Times - Latest News Articles Dataset [Dataset]. https://www.kaggle.com/datasets/anishchougule2002/ny-times-latest-news-articles-dataset
    Explore at:
    zip (297117 bytes). Available download formats
    Dataset updated
    Jan 16, 2025
    Authors
    Anish Chougule
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    # News Articles Dataset

    1. Fetching Data:

    The data is fetched from the New York Times Home page using the New York Times API.

    The API returns the following columns for every article: section, subsection, title, abstract, url, uri, byline, item_type, updated_date, created_date, published_date, material_type_facet, kicker, des_facet, org_facet, per_facet, geo_facet, multimedia, short_url.

    2. Data Cleaning:

    The fetched data is converted into a pandas DataFrame and empty fields are dropped. All of the keyword columns are combined into a single keywords column, unnecessary fields are dropped, and the result is saved to a CSV file.

    3. Data Visualization:

    There is an example file for data visualization using this dataset.
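
    A hedged sketch of the fetch-and-clean flow described in steps 1 and 2 above (the Top Stories endpoint, the key placeholder, and the exact column handling are assumptions, not the author's script):

    ```py
    import pandas as pd
    import requests

    API_KEY = "YOUR_NYT_API_KEY"  # hypothetical placeholder
    url = f"https://api.nytimes.com/svc/topstories/v2/home.json?api-key={API_KEY}"

    articles = requests.get(url, timeout=30).json()["results"]
    df = pd.DataFrame(articles)

    # Drop rows with empty essential fields and combine keyword-like columns.
    df = df.dropna(subset=["title", "abstract"])
    keyword_cols = ["des_facet", "org_facet", "per_facet", "geo_facet"]
    df["keywords"] = df[keyword_cols].apply(
        lambda row: sorted({k for col in row if isinstance(col, list) for k in col}), axis=1)
    df = df.drop(columns=keyword_cols + ["multimedia", "uri", "short_url"], errors="ignore")

    df.to_csv("ny_times_articles.csv", index=False)
    ```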

  8. TUDelft-Electricity-Consumption-1.0

    • huggingface.co
    Cite
    OpenSynth-Energy, TUDelft-Electricity-Consumption-1.0 [Dataset]. https://huggingface.co/datasets/OpenSynth/TUDelft-Electricity-Consumption-1.0
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    OpenSynth-Energy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Timeseries Data Processing

    This repository contains a script for loading and processing time series data using the datasets library and converting it to a pandas DataFrame for further analysis.

      Dataset
    

    The dataset used contains time series data with the following features:

    id: Identifier for the dataset, formatted as Country_Number of Household (e.g., GE_1 for Germany, household 1).
    datetime: Timestamp indicating the date and time of the observation.
    target: Energy… See the full description on the dataset page: https://huggingface.co/datasets/OpenSynth/TUDelft-Electricity-Consumption-1.0.
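
    Based on the description, loading the dataset and converting it to pandas would presumably look like this (the repository id is taken from the dataset URL; the split name is an assumption):

    ```py
    from datasets import load_dataset

    ds = load_dataset("OpenSynth/TUDelft-Electricity-Consumption-1.0", split="train")  # split assumed
    df = ds.to_pandas()
    print(df[["id", "datetime", "target"]].head())
    ```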

  9. medical_institutions_reviews

    • huggingface.co
    Updated Jan 29, 2023
    Cite
    Pavel Blinov (2023). medical_institutions_reviews [Dataset]. https://huggingface.co/datasets/blinoff/medical_institutions_reviews
    Explore at:
    Dataset updated
    Jan 29, 2023
    Authors
    Pavel Blinov
    Description

    Dataset Summary

    The dataset contains user reviews about medical institutions. In total it contains 12,036 reviews. Each review is tagged with the general sentiment and with sentiments on 5 aspects: quality, service, equipment, food, location.

      Data Fields
    

    Each sample contains the following fields:

    • review_id
    • content: review text
    • general, quality, service, equipment, food, location: sentiment labels

      Python
    

    import pandas as pd
    df = pd.read_json('medical_institutions_reviews.jsonl'… See the full description on the dataset page: https://huggingface.co/datasets/blinoff/medical_institutions_reviews.
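
    A complete load call would presumably look like the following (a hedged sketch; lines=True is an assumption for JSON-Lines input, and the file is assumed to be downloaded locally):

    ```py
    import pandas as pd

    df = pd.read_json("medical_institutions_reviews.jsonl", lines=True)  # lines=True assumed
    print(df.head())
    ```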

  10. Crimp Force Curve Dataset

    • zenodo.org
    bin
    Updated Sep 11, 2025
    Cite
    Bernd Hofmann; Patrick Bründl; Jörg Franke (2025). Crimp Force Curve Dataset [Dataset]. http://doi.org/10.7910/dvn/wbdkn6
    Explore at:
    bin. Available download formats
    Dataset updated
    Sep 11, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Bernd Hofmann; Patrick Bründl; Jörg Franke
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Crimp Force Curve Dataset" is a comprehensive collection of univariate time series data representing crimp force curves recorded during the manufacturing process of crimp connections. This dataset has been designed to support a variety of applications, including anomaly detection, fault diagnosis, and research in data-driven quality assurance.

    A salient feature of this dataset is the presence of high-quality labels. Each crimp force curve is annotated both by a state-of-the-art crimp force monitoring system - capable of binary anomaly detection - and by domain experts who manually classified the curves into detailed quality classes. The expert annotations provide a valuable ground truth for training and benchmarking machine learning models beyond anomaly detection.

    The dataset is particularly well-suited for tasks involving time series analysis, such as training and evaluating of machine learning algorithms for quality control and fault detection. It provides a substantial foundation for the development of generalisable, yet domain-specific (crimping), data-driven quality control systems.

    The data is stored in a Python pickle file crimp_force_curves.pkl, which is a binary format used to serialize and deserialize Python objects. It can be conveniently loaded into a pandas DataFrame for exploration and analysis using the following command:

    import pandas as pd

    df = pd.read_pickle("crimp_force_curves.pkl")

    This dataset is a valuable resource for researchers and practitioners in manufacturing engineering, computer science, and data science who are working at the intersection of quality control in manufacturing and machine learning.

  11. Summary of miRNAs sequencing.

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    + more versions
    Cite
    Mingyu Yang; Lianming Du; Wujiao Li; Fujun Shen; Zhenxin Fan; Zuoyi Jian; Rong Hou; Yongmei Shen; Bisong Yue; Xiuyue Zhang (2023). Summary of miRNAs sequencing. [Dataset]. http://doi.org/10.1371/journal.pone.0143242.t002
    Explore at:
    xls. Available download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Mingyu Yang; Lianming Du; Wujiao Li; Fujun Shen; Zhenxin Fan; Zuoyi Jian; Rong Hou; Yongmei Shen; Bisong Yue; Xiuyue Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of miRNAs sequencing.

  12. Press-and-Plot

    • huggingface.co
    Updated Oct 27, 2025
    Cite
    Center for Humanities Computing Aarhus (2025). Press-and-Plot [Dataset]. https://huggingface.co/datasets/chcaa/Press-and-Plot
    Explore at:
    Dataset updated
    Oct 27, 2025
    Dataset authored and provided by
    Center for Humanities Computing Aarhus
    Description

    Press&Plot: Curated Danish 19th-Century Stories & Serial Fiction (v1.0)

    Short description: A curated collection of 29 Danish newspaper stories (1816–1832), including single-part and multi-part fiction, manually inspected, cleaned, and categorized for research use. The dataset is a growing resource.

      Downloading the dataset

    Using Python:

    from datasets import load_dataset

    ds = load_dataset("chcaa/press-and-plot", split="train")

    If you want it as a pandas DataFrame:

    df =… See the full description on the dataset page: https://huggingface.co/datasets/chcaa/Press-and-Plot.
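
    The conversion presumably continues along these lines (a hedged completion using the standard datasets API, not necessarily the page's exact code):

    ```py
    from datasets import load_dataset

    ds = load_dataset("chcaa/press-and-plot", split="train")
    df = ds.to_pandas()   # Dataset.to_pandas() returns a pandas DataFrame
    ```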

  13. Replication Data for Exploring an extinct society through the lens of...

    • dataone.org
    Updated Dec 16, 2023
    Cite
    Wieczorek, Oliver; Malzahn, Melanie (2023). Replication Data for Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus [Dataset]. http://doi.org/10.7910/DVN/UF8DHK
    Explore at:
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Wieczorek, Oliver; Malzahn, Melanie
    Description

    The files and workflow will allow you to replicate the study titled "Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus". This study aimed at utilizing the CEToM corpus (https://cetom.univie.ac.at/) (Tocharian) to analyze the life-world of the elites of an extinct society situated in modern eastern China. To acquire the raw data needed for steps 1 & 2, please contact Melanie Malzahn (melanie.malzahn@univie.ac.at). We conducted a mixed-methods study consisting of close reading, content analysis, and multiple correspondence analysis (MCA). The Excel file titled "fragments_architecture_combined.xlsx" allows for replication of the MCA and equates to the third step of the workflow outlined below.

    We used the following programming languages and packages to prepare the dataset and to analyze the data. Data preparation and merging procedures were carried out in Python (version 3.9.10) with the packages pandas (1.5.3), os (3.12.0), re (3.12.0), numpy (1.24.3), gensim (4.3.1), BeautifulSoup4 (4.12.2), pyasn1 (0.4.8), and langdetect (1.0.9). Multiple correspondence analyses were conducted in R (version 4.3.2) with the packages FactoMineR (2.9), factoextra (1.0.7), readxl (1.4.3), tidyverse (2.0.0), ggplot2 (3.4.4), and psych (2.3.9).

    After requesting the necessary files, please open the scripts in the order outlined below and execute the code files to replicate the analysis:

    Preparatory step: Create a folder for the Python and R scripts downloadable in this repository. Open the file 0_create folders.py and declare a root folder in line 19. This first script will generate the following folders:

    • "tarim-brahmi_database" = folder which contains Tocharian dictionaries and Tocharian text fragments.
    • "dictionaries" = contains Tocharian A and Tocharian B vocabularies, including linguistic features such as translations, meanings, part-of-speech tags, etc. A full overview of the words is provided at https://cetom.univie.ac.at/?words.
    • "fragments" = contains Tocharian text fragments as XML files.
    • "word_corpus_data" = will contain Excel files of the corpus data after the first step.
    • "Architectural_terms" = contains the data on the architectural terms used in the dataset (e.g. dwelling, house).
    • "regional_data" = contains the data on the findspots (Tocharian and modern Chinese equivalent, e.g. Duldur-Akhur & Kucha).
    • "mca_ready_data" = the folder in which the Excel file with the merged data will be saved. Note that the prepared file named "fragments_architecture_combined.xlsx" can be saved into this directory. This allows you to skip steps 1 & 2 and reproduce the MCA of the content analysis based on the third step of our workflow (R script titled 3_conduct_MCA.R).

    First step - run 1_read_xml-files.py: loops over the XML files in the folder dictionaries and identifies word metadata, including language (Tocharian A or B), keywords, part of speech, lemmata, word etymology, and loan sources. It then loops over the XML text files and extracts a text ID number, language (Tocharian A or B), text title, text genre, text subgenre, prose type, verse type, material on which the text is written, medium, findspot, the source text in Tocharian, and the translation where available. After successful feature extraction, the resulting pandas DataFrame object is exported to the word_corpus_data folder.

    Second step - run 2_merge_excel_files.py: merges all Excel files (corpus, data on findspots, word data) and reproduces the content analysis, which was based upon close reading in the first place.

    Third step - run 3_conduct_MCA.R: recodes, prepares, and selects the variables necessary to conduct the MCA. It then produces the descriptive values before conducting the MCA, identifying typical texts per dimension, and exporting the PNG files uploaded to this repository.
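
    For orientation only, a hypothetical sketch of the kind of XML-to-DataFrame extraction performed in the first step (the element names and folder paths here are invented for illustration and do not correspond to the actual CEToM schema or the authors' script):

    ```py
    import glob
    import pandas as pd
    from bs4 import BeautifulSoup

    records = []
    for path in glob.glob("tarim-brahmi_database/fragments/*.xml"):
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "xml")
        # Element names below are placeholders, not the real schema.
        language = soup.find("language")
        title = soup.find("title")
        records.append({
            "file": path,
            "language": language.get_text(strip=True) if language else None,
            "title": title.get_text(strip=True) if title else None,
        })

    corpus_df = pd.DataFrame(records)
    corpus_df.to_excel("word_corpus_data/corpus.xlsx", index=False)
    ```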

  14. Data for: Can government transfers make energy subsidy reform socially...

    • data.mendeley.com
    Updated Mar 31, 2020
    Cite
    Filip Schaffitzel (2020). Data for: Can government transfers make energy subsidy reform socially acceptable? A case study on Ecuador [Dataset]. http://doi.org/10.17632/z35m76mf9g.1
    Explore at:
    Dataset updated
    Mar 31, 2020
    Authors
    Filip Schaffitzel
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Area covered
    Ecuador
    Description

    Estimating the distributional impacts of energy subsidy removal and compensation schemes in Ecuador based on input-output and household data.

    Import files:
    • Dictionary Categories.csv, Dictionary ENI-IOT.csv, and Dictionary Subcategories.csv, based on [1]
    • Dictionary IOT.csv and IOT_2012.csv (cannot be redistributed), based on [2]
    • Dictionary Taxes.csv and Dictionary Transfers.csv, based on [3]
    • ENIGHUR11_GASTOS_V.csv, ENIGHUR11_HOGARES_AGREGADOS.csv, and ENIGHUR11_PERSONAS_INGRESOS.csv, based on [4]
    • Price increase scenarios.csv, based on [5]

    Further basic files and documents:
    [1] 4_M&D_Mapping ENIGHUR expenditures to IOT_180605.xlsm
    [2] Input-output table 2012 (https://contenido.bce.fin.ec/documentos/PublicacionesNotas/Catalogo/CuentasNacionales/Anuales/Dolares/MIP2012Ampliada.xls). Save the sheet with the IOT 2012 (Matriz simétrica) as IOT_2012.csv and edit the format so that the first column and row contain the IOT labels.
    [3] 4_M&D_ENIGHUR income_180606.xlsx
    [4] ENIGHUR data can be retrieved from http://www.ecuadorencifras.gob.ec/encuesta-nacional-de-ingresos-y-gastos-de-los-hogares-urbanos-y-rurales/. Household datasets are only available in SPSS file format, and the free software PSPP is used to convert the .sav files to .csv files, as this format can be read directly and efficiently into a Python pandas DataFrame. See the PSPP syntax below:

    save translate
      /outfile = filename
      /type = CSV
      /textoptions decimal = DOT
      /textoptions delimiter = ';'
      /fieldnames
      /cells=values
      /replace.

    [5] 3_Ecuador_Energy subsidies and 4_M&D_Price scenarios_180610.xlsx
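
    After the PSPP conversion, loading one of the resulting files follows the delimiter chosen above (a minimal sketch; the specific file is just an example from the list of import files):

    ```py
    import pandas as pd

    # The semicolon separator matches the PSPP export options above.
    hogares = pd.read_csv("ENIGHUR11_HOGARES_AGREGADOS.csv", sep=";")
    print(hogares.shape)
    ```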

  15. Summary the gender and age for all samples.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 7, 2023
    Cite
    Mingyu Yang; Lianming Du; Wujiao Li; Fujun Shen; Zhenxin Fan; Zuoyi Jian; Rong Hou; Yongmei Shen; Bisong Yue; Xiuyue Zhang (2023). Summary the gender and age for all samples. [Dataset]. http://doi.org/10.1371/journal.pone.0143242.t001
    Explore at:
    xls. Available download formats
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    Mingyu Yang; Lianming Du; Wujiao Li; Fujun Shen; Zhenxin Fan; Zuoyi Jian; Rong Hou; Yongmei Shen; Bisong Yue; Xiuyue Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary the gender and age for all samples.

  16. n-ensemble

    • huggingface.co
    Updated Jun 3, 2025
    Cite
    Nova AI Labs (2025). n-ensemble [Dataset]. https://huggingface.co/datasets/nova-ai-labs/n-ensemble
    Explore at:
    Dataset updated
    Jun 3, 2025
    Dataset authored and provided by
    Nova AI Labs
    Description

    Nessembele

    A small coding dataset for practice and learning.

      Quick Start
    

    from datasets import load_dataset

    dataset = load_dataset("novastudio/nessembele")
    df = dataset["train"].to_pandas()
    print(df.head())

      What's Inside
    

    File: coding_dataset.csv
    Format: CSV with headers
    Size: Small and beginner-friendly
    Purpose: Coding practice and data analysis

      Usage

      Load with Pandas

    import pandas as pd
    df =… See the full description on the dataset page: https://huggingface.co/datasets/nova-ai-labs/n-ensemble.

  17. Data for: Electrical system architectures for building-ntegrated...

    • data.mendeley.com
    Updated Mar 31, 2020
    Cite
    Konstantinos Spiliotis (2020). Data for: Electrical system architectures for building-ntegrated photovoltaics (BIPV): A comparative analysis using a modelling framework in Modelica [Dataset]. http://doi.org/10.17632/g83gxhn77y.1
    Explore at:
    Dataset updated
    Mar 31, 2020
    Authors
    Konstantinos Spiliotis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The folder contains data related to the manuscript "Electrical system architectures for building-integrated photovoltaics (BIPV): A comparative analysis using a modelling framework in Modelica". Specifically, it contains:

    1) Power electronics efficiency curves
    2) Input meteorological data per location (TMY)
    3) Results (KPIs) in pandas DataFrame CSV format

    Feel free to use any of the data, provided that you respect our authorship and cite the dataset and/or the associated paper, which provides detailed explanations of the data.

  18. US Consumer Complaints Against Businesses

    • kaggle.com
    zip
    Updated Oct 9, 2022
    Cite
    Jeffery Mandrake (2022). US Consumer Complaints Against Businesses [Dataset]. https://www.kaggle.com/jefferymandrake/us-consumer-complaints-dataset-through-2019
    Explore at:
    zip (343188956 bytes). Available download formats
    Dataset updated
    Oct 9, 2022
    Authors
    Jeffery Mandrake
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    2,121,458 records

    I used Google Colab to check out this dataset and pull the column names using Pandas.

    Sample code example: Python Pandas read csv file compressed with gzip and load into Pandas dataframe https://pastexy.com/106/python-pandas-read-csv-file-compressed-with-gzip-and-load-into-pandas-dataframe

    Columns: ['Date received', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Consumer complaint narrative', 'Company public response', 'Company', 'State', 'ZIP code', 'Tags', 'Consumer consent provided?', 'Submitted via', 'Date sent to company', 'Company response to consumer', 'Timely response?', 'Consumer disputed?', 'Complaint ID']

    I did not modify the dataset.

    Use it to practice with dataframes - Pandas or PySpark on Google Colab:

    !unzip complaints.csv.zip

    import pandas as pd
    df = pd.read_csv('complaints.csv')
    df.columns

    df.head() etc.

  19. The Device Activity Report with Complete Knowledge (DARCK) for NILM

    • zenodo.org
    bin, xz
    Updated Sep 19, 2025
    Cite
    Anonymous Anonymous (2025). The Device Activity Report with Complete Knowledge (DARCK) for NILM [Dataset]. http://doi.org/10.5281/zenodo.17159850
    Explore at:
    bin, xz. Available download formats
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Abstract

    This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.

    2. Dataset Overview

    • Apartment: Two-person apartment, approx. 58m², located in Aachen, Germany.
    • Aggregate Meter: eBZ DD3
    • Sub-meters: 31 Shelly Plus Plug S, 6 Shelly Plus 1PM, 3 Shelly Plus PM Mini Gen3
    • Sampling Rate: 1 Hz
    • Measured Quantity: Active Power
    • Unit of Measurement: Watt
    • Duration: 6 months
    • Format: Single CSV file (`DARCK.csv`)
    • Structure: Timestamped rows with columns for the aggregate meter and each sub-metered appliance.
    • Completeness: The main power meter has a completeness of 99.3%. Missing values were linearly interpolated.

    3. Download and Usage

    The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850

    As it contains longer off periods with zeros, the CSV file compresses very well. To extract it, use: xz -d DARCK.csv.xz. The compression leads to a 97% smaller file size (from 4 GB to 90.9 MB).

    To use the dataset in Python, you can, e.g., load the CSV file into a pandas DataFrame:

    import pandas as pd

    df = pd.read_csv("DARCK.csv", parse_dates=["time"])

    4. Measurement Setup

    The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.

    5. File Format (DARCK.csv)

    The dataset is provided as a single comma-separated value (CSV) file.

    • The first row is a header containing the column names.
    • All power values are rounded to the first decimal place.
    • There are no missing values in the final dataset.
    • Each row represents 1 second, from start of measuring in March until the end in September.

    Column Descriptions

    • time (datetime): Timestamp for the reading in YYYY-MM-DD HH:MM:SS.
    • main (float, Watt): Total aggregate power consumption for the apartment, measured at the main electrical panel.
    • [appliance_name] (float, Watt): Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list.

    Aggregate Columns

    • aggr_chargers (float, Watt): The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger.
    • aggr_stoveplates (float, Watt): The sum of stoveplatel1 and stoveplatel2.
    • aggr_lights (float, Watt): The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap.

    Analysis Columns

    • inaccuracy (float, Watt): As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30 W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for.

    6. Data Postprocessing Pipeline

    The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.

    6.1. Main Meter (main) Postprocessing

    The aggregate power data required several cleaning steps to ensure accuracy.

    1. Outlier Removal: Readings below 10W or above 10,000W were removed (merely 3 occurrences).
    2. Timestamp Burst Correction: The source data contained bursts of delayed readings. A custom algorithm was used to identify these bursts (large time gap followed by rapid readings) and back-fill the timestamps to create an evenly spaced time series.
    3. Alignment & Interpolation: The smart meter pushes a new value via infrared every second. To align those to the whole seconds, it was resampled to a 1-second frequency by taking the mean of all readings within each second (in 99.5% only 1 value). Any resulting gaps (0.7% outage ratio) were filled using linear interpolation.
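
    A hedged sketch of the alignment-and-interpolation step in item 3 (the raw file and column names used here are assumptions):

    ```py
    import pandas as pd

    # meter.csv: raw main-meter readings; the column names used here are assumptions.
    meter = pd.read_csv("meter.csv", parse_dates=["time"])
    main = (meter.set_index("time")["power"]
                 .resample("1s").mean()              # mean of all readings within each second
                 .interpolate(method="linear"))      # fill the remaining gaps linearly
    ```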

    6.2. Sub-metered Devices (shellies) Postprocessing

    The Shelly devices are not prone to the same burst issue as the ESP8266 is. They push a new reading at every change in power drawn. If no power change is observed or the one observed is too small (less than a few Watt), the reading is pushed once a minute, together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.

    1. Grouping: Data was grouped by the unique device identifier.
    2. Resampling & Filling: The data for each device was resampled to a 1-second frequency using .resample('1s').last().ffill(). This method was chosen, firstly, to capture the last known state of the device within each second, handling rapid on/off events; and secondly, to forward-fill the last state across periods with no new data, modelling that the device's consumption remained constant until a new reading was sent.
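
    A corresponding hedged sketch for the sub-metered devices (again, the raw file and column names are assumptions):

    ```py
    import pandas as pd

    # shellies.csv: raw plug/relay readings; the column names used here are assumptions.
    shellies = pd.read_csv("shellies.csv", parse_dates=["time"])

    per_device = {
        device: grp.set_index("time")["power"]
                   .resample("1s").last().ffill()    # last known state per second, then forward-fill
        for device, grp in shellies.groupby("device")
    }
    submeters = pd.DataFrame(per_device)
    ```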

    6.3. Merging and Finalization

    1. Merge: The cleaned main meter and all sub-metered device dataframes were merged into a single dataframe on the time index.
    2. Final Fill: Any remaining NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption.

    7. Manual Corrections and Known Data Issues

    During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.

    1. March 10th - Unmetered Bulb: An unmetered 107W bulb was active. It was subtracted from the main reading as if it never happened.
    2. May 31st - Unmetered Air Pump: An unmetered 101W pump for an air mattress was used directly in an outlet with no intermediary plug and hence manually added to the respective plug.

    8. Appliance Details and Multipurpose Plugs

    The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.

  20. rag

    • huggingface.co
    Cite
    VIGNESH M, rag [Dataset]. https://huggingface.co/datasets/vicky3241/rag
    Explore at:
    Authors
    VIGNESH M
    Description

    import pandas as pd

      Example dataset with new columns
    

    data = [
        {
            "title": "Pandas Library",
            "about": "Pandas is a Python library for data manipulation and analysis.",
            "procedure": "Install Pandas via pip, load data into DataFrames, clean and analyze data using built-in functions.",
            "content": """
    Pandas provides data structures like Series and DataFrame for handling structured data.
    It supports indexing, slicing, aggregation, joining, and filtering… See the full description on the dataset page: https://huggingface.co/datasets/vicky3241/rag.
