81 datasets found
  1. Shopping Mall

    • kaggle.com
    zip
    Updated Dec 15, 2023
    Cite
    Anshul Pachauri (2023). Shopping Mall [Dataset]. https://www.kaggle.com/datasets/anshulpachauri/shopping-mall
    Explore at:
    zip (22852 bytes). Available download formats:
    Dataset updated
    Dec 15, 2023
    Authors
    Anshul Pachauri
    Description

    Libraries Import:

    Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings.

    Data Loading and Exploration:

    Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe().

    Univariate Analysis:

    Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots.

    Bivariate Analysis:

    Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot.

    Gender-Based Analysis:

    Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap.

    Univariate Clustering:

    Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters.

    Bivariate Clustering:

    Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'.

    Multivariate Clustering:

    Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering.

    Result Saving:

    Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
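
    A minimal sketch of the bivariate clustering step described above, assuming the standard Mall_Customers.csv column names (the exact preprocessing in the original notebook may differ):

    ```python
    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.read_csv("Mall_Customers.csv")

    # Bivariate clustering on income and spending score (5 clusters, as described)
    X = df[["Annual Income (k$)", "Spending Score (1-100)"]]
    km = KMeans(n_clusters=5, n_init=10, random_state=42)
    df["Spending and Income Cluster"] = km.fit_predict(X)

    # Elbow method: inertia for k = 1..10
    inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in range(1, 11)]

    df.to_csv("Result.csv", index=False)
    ```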

  2. PandasPlotBench

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    JetBrains Research (2024). PandasPlotBench [Dataset]. https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 25, 2024
    Dataset provided by
    JetBrains (http://jetbrains.com/)
    Authors
    JetBrains Research
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    PandasPlotBench

    PandasPlotBench is a benchmark to assess the capability of models in writing the code for visualizations given the description of a Pandas DataFrame. 🛠️ Task. Given the plotting task and the description of a Pandas DataFrame, write the code to build a plot. The dataset is based on the MatPlotLib gallery. The paper can be found on arXiv: https://arxiv.org/abs/2412.02764v1. To score your model on this dataset, you can use our GitHub repository. 📩 If you have… See the full description on the dataset page: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench.
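
    As a hedged illustration, the benchmark can be pulled with the Hugging Face datasets library and inspected as a pandas DataFrame (split names and columns should be checked against the dataset page):

    ```python
    from datasets import load_dataset

    ds = load_dataset("JetBrains-Research/PandasPlotBench")
    print(ds)  # shows the available splits and their columns

    # Convert one split to pandas for exploration
    df = ds[list(ds.keys())[0]].to_pandas()
    print(df.head())
    ```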

  3. NY Times - Latest News Articles Dataset

    • kaggle.com
    zip
    Updated Jan 16, 2025
    Cite
    Anish Chougule (2025). NY Times - Latest News Articles Dataset [Dataset]. https://www.kaggle.com/datasets/anishchougule2002/ny-times-latest-news-articles-dataset
    Explore at:
    zip (297117 bytes). Available download formats:
    Dataset updated
    Jan 16, 2025
    Authors
    Anish Chougule
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    # News Articles Dataset

    1. Fetching Data:

    The data is fetched from the New York Times Home page using the New York Times API.

    The API returns the following columns for every article: section, subsection, title, abstract, url, uri, byline, item_type, updated_date, created_date, published_date, material_type_facet, kicker, des_facet, org_facet, per_facet, geo_facet, multimedia, short_url.

    2. Data Cleaning:

    The fetched data is converted into a pandas DataFrame and empty fields are dropped. All the keyword columns are combined into a single keywords column, unnecessary fields are dropped, and the result is saved to a CSV file.
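
    A rough sketch of that cleaning step (the keyword columns used here, des_facet, org_facet, per_facet, and geo_facet, follow the column list above; the exact logic of the original script may differ):

    ```python
    import pandas as pd

    # A minimal stand-in for the list of article dicts returned by the NYT API
    articles = [
        {"section": "world", "title": "Example", "abstract": "…", "url": "https://nytimes.com/x",
         "des_facet": ["Economy"], "org_facet": [], "per_facet": ["Jane Doe"], "geo_facet": ["Ecuador"]},
    ]

    df = pd.DataFrame(articles)

    # Combine the keyword-style columns into one 'keywords' column, then drop them
    keyword_cols = ["des_facet", "org_facet", "per_facet", "geo_facet"]
    df["keywords"] = df[keyword_cols].apply(lambda row: [kw for col in row for kw in (col or [])], axis=1)
    df = df.drop(columns=keyword_cols).dropna(subset=["title", "abstract"])

    df.to_csv("ny_times_articles.csv", index=False)
    ```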

    3. Data Visualization:

    There is an example file for data visualization using this dataset.

  4. onlystacked-xsum-1024

    • huggingface.co
    Updated Jun 1, 2023
    Cite
    Stacked Summaries (2023). onlystacked-xsum-1024 [Dataset]. https://huggingface.co/datasets/stacked-summaries/onlystacked-xsum-1024
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 1, 2023
    Dataset authored and provided by
    Stacked Summaries
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    stacked-summaries/onlystacked-xsum-1024

    Same thing as stacked-summaries/stacked-xsum-1024 but filtered such that is_stacked=True. Please refer to the original dataset for info and to raise issues if needed. Basic info on train split:

    0 document 116994 non-null string 1… See the full description on the dataset page: https://huggingface.co/datasets/stacked-summaries/onlystacked-xsum-1024.

  5. Classicmodels

    • kaggle.com
    zip
    Updated Dec 15, 2024
    Cite
    Javier Landaeta (2024). Classicmodels [Dataset]. https://www.kaggle.com/datasets/javierlandaeta/classicmodels
    Explore at:
    zip (65751 bytes). Available download formats:
    Dataset updated
    Dec 15, 2024
    Authors
    Javier Landaeta
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Abstract
    This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.

    The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.

    Methodology

    1. Data Extraction:

    • A connection is established with the PostgreSQL database to extract the relevant data from the orders, orderdetails, customers, products and employees tables.
    • A reusable function is created to read each table and load it into a Pandas DataFrame.
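
    A minimal sketch of such a reusable extraction function, assuming a local PostgreSQL instance and the standard classicmodels table names (connection details are placeholders):

    ```python
    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string; adjust user, password, host, and database name
    engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/classicmodels")

    def read_table(table_name: str) -> pd.DataFrame:
        """Load one table from PostgreSQL into a pandas DataFrame."""
        return pd.read_sql_table(table_name, con=engine)

    tables = ["orders", "orderdetails", "customers", "products", "employees"]
    dfs = {name: read_table(name) for name in tables}
    ```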

    2. Data Cleansing and Transformation:

    • An exploratory analysis of the data is performed to identify missing values, inconsistencies, and outliers.
    • New variables are calculated, such as the total value of each sale, cost, and profit.
    • Different DataFrames are joined using primary and foreign keys to obtain a complete view of sales.

    3. Exploratory Data Analysis (EDA):

    • Key metrics such as total sales, number of unique customers, and average order value are calculated.
    • Data is grouped by different dimensions (products, customers, dates) to identify patterns and trends.
    • Results are visualized using relevant graphics (histograms, bar charts, etc.).

    4. Modeling and Prediction:

    • Although the main focus of the project is descriptive, predictive modeling techniques (e.g., time series) could be explored to forecast future sales.

    5. Report Generation:

    • Detailed reports are created in Pandas DataFrames format that answer specific business questions.
    • These reports are stored in new PostgreSQL tables for further analysis and visualization.

    Results
    - Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
    - Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
    - Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

    Conclusions
    This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

    Technologies Used
    - Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
    - Database: PostgreSQL
    - Tools: Jupyter Notebook
    - Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence

  6. Replication Data for Exploring an extinct society through the lens of...

    • dataone.org
    Updated Dec 16, 2023
    Cite
    Wieczorek, Oliver; Malzahn, Melanie (2023). Replication Data for Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus [Dataset]. http://doi.org/10.7910/DVN/UF8DHK
    Explore at:
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Wieczorek, Oliver; Malzahn, Melanie
    Description

    The files and workflow will allow you to replicate the study titled "Exploring an extinct society through the lens of Habitus-Field theory and the Tocharian text corpus". This study aimed at utilizing the CEToM-corpus (https://cetom.univie.ac.at/) (Tocharian) to analyze the life-world of the elites of an extinct society situated in modern eastern China. To acquire the raw data needed for steps 1 & 2, please contact Melanie Malzahn (melanie.malzahn@univie.ac.at). We conducted a mixed-methods study consisting of close reading, content analysis, and multiple correspondence analysis (MCA). The excel file titled "fragments_architecture_combined.xlsx" allows for replication of the MCA and equates to the third step of the workflow outlined below.

    We used the following programming languages and packages to prepare the dataset and to analyze the data. Data preparation and merging procedures were carried out in Python (version 3.9.10) with the packages pandas (version 1.5.3), os (version 3.12.0), re (version 3.12.0), numpy (version 1.24.3), gensim (version 4.3.1), BeautifulSoup4 (version 4.12.2), pyasn1 (version 0.4.8), and langdetect (version 1.0.9). Multiple correspondence analyses were conducted in R (version 4.3.2) with the packages FactoMineR (version 2.9), factoextra (version 1.0.7), readxl (version 1.4.3), tidyverse (version 2.0.0), ggplot2 (version 3.4.4), and psych (version 2.3.9).

    After requesting the necessary files, please open the scripts in the order outlined below and execute the code files to replicate the analysis.

    Preparatory step: Create a folder for the Python and R scripts downloadable in this repository. Open the file 0_create folders.py and declare a root folder in line 19. This first script will generate the following folders:

    • "tarim-brahmi_database" = folder which contains Tocharian dictionaries and Tocharian text fragments.
    • "dictionaries" = contains Tocharian A and Tocharian B vocabularies, including linguistic features such as translations, meanings, part-of-speech tags, etc. A full overview of the words is provided on https://cetom.univie.ac.at/?words.
    • "fragments" = contains Tocharian text fragments as xml-files.
    • "word_corpus_data" = folder that will contain excel-files of the corpus data after the first step.
    • "Architectural_terms" = contains the data on the architectural terms used in the dataset (e.g. dwelling, house).
    • "regional_data" = contains the data on the findspots (Tocharian and modern Chinese equivalent, e.g. Duldur-Akhur & Kucha).
    • "mca_ready_data" = the folder in which the excel-file with the merged data will be saved. Note that the prepared file named "fragments_architecture_combined.xlsx" can be saved into this directory. This allows you to skip steps 1 & 2 and reproduce the MCA of the content analysis based on the third step of our workflow (R script titled 3_conduct_MCA.R).

    First step - run 1_read_xml-files.py: loops over the xml-files in the dictionaries folder and identifies word metadata, including language (Tocharian A or B), keywords, part of speech, lemmata, word etymology, and loan sources. Then it loops over the xml text files and extracts a text ID number, language (Tocharian A or B), text title, text genre, text subgenre, prose type, verse type, the material on which the text is written, medium, findspot, the source text in Tocharian, and the translation where available. After successful feature extraction, the resulting pandas DataFrame object is exported to the word_corpus_data folder.

    Second step - run 2_merge_excel_files.py: merges all excel files (corpus, data on findspots, word data) and reproduces the content analysis, which was based upon close reading in the first place.

    Third step - run 3_conduct_MCA.R: recodes, prepares, and selects the variables necessary to conduct the MCA. Then produces the descriptive values before conducting the MCA, identifying typical texts per dimension, and exporting the png-files uploaded to this repository.
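
    A hedged illustration of the kind of XML-to-DataFrame extraction performed in the first step (the tag names below are invented for illustration; the real CEToM XML schema will differ):

    ```python
    import pandas as pd
    from bs4 import BeautifulSoup
    from pathlib import Path

    records = []
    for xml_path in Path("tarim-brahmi_database/fragments").glob("*.xml"):
        soup = BeautifulSoup(xml_path.read_text(encoding="utf-8"), "xml")
        # 'idno', 'language', and 'title' are hypothetical tag names used only for illustration
        records.append({
            "text_id": soup.find("idno").get_text(strip=True) if soup.find("idno") else None,
            "language": soup.find("language").get_text(strip=True) if soup.find("language") else None,
            "title": soup.find("title").get_text(strip=True) if soup.find("title") else None,
        })

    corpus_df = pd.DataFrame(records)
    corpus_df.to_excel("word_corpus_data/fragments.xlsx", index=False)
    ```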

  7. Data for: Can government transfers make energy subsidy reform socially...

    • data.mendeley.com
    Updated Mar 31, 2020
    Cite
    Filip Schaffitzel (2020). Data for: Can government transfers make energy subsidy reform socially acceptable? A case study on Ecuador [Dataset]. http://doi.org/10.17632/z35m76mf9g.1
    Explore at:
    Dataset updated
    Mar 31, 2020
    Authors
    Filip Schaffitzel
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Area covered
    Ecuador
    Description

    Estimating the distributional impacts of energy subsidy removal and compensation schemes in Ecuador based on input-output and household data.

    Import files:
    • Dictionary Categories.csv, Dictionary ENI-IOT.csv, and Dictionary Subcategories.csv, based on [1]
    • Dictionary IOT.csv and IOT_2012.csv (cannot be redistributed), based on [2]
    • Dictionary Taxes.csv and Dictionary Transfers.csv, based on [3]
    • ENIGHUR11_GASTOS_V.csv, ENIGHUR11_HOGARES_AGREGADOS.csv, and ENIGHUR11_PERSONAS_INGRESOS.csv, based on [4]
    • Price increase scenarios.csv, based on [5]

    Further basic files and documents:
    [1] 4_M&D_Mapping ENIGHUR expenditures to IOT_180605.xlsm
    [2] Input-output table 2012 (https://contenido.bce.fin.ec/documentos/PublicacionesNotas/Catalogo/CuentasNacionales/Anuales/Dolares/MIP2012Ampliada.xls). Save the sheet with the IOT 2012 (Matriz simétrica) as IOT_2012.csv and edit the format so that the first column and row contain the IOT labels.
    [3] 4_M&D_ENIGHUR income_180606.xlsx
    [4] ENIGHUR data can be retrieved from http://www.ecuadorencifras.gob.ec/encuesta-nacional-de-ingresos-y-gastos-de-los-hogares-urbanos-y-rurales/. Household datasets are only available in SPSS file format, and the free software PSPP is used to convert the .sav files to .csv files, as this format can be read directly and efficiently into a Python Pandas DataFrame. PSPP syntax:

    save translate
      /outfile = filename
      /type = CSV
      /textoptions decimal = DOT
      /textoptions delimiter = ';'
      /fieldnames
      /cells=values
      /replace.

    [5] 3_Ecuador_Energy subsidies and 4_M&D_Price scenarios_180610.xlsx
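
    Given the PSPP export options above (semicolon delimiter, dot decimal), the resulting CSV files can be read into pandas along these lines (the file name is one of the household files listed above):

    ```python
    import pandas as pd

    # Semicolon-delimited, dot-decimal CSV as produced by the PSPP 'save translate' command above
    gastos = pd.read_csv("ENIGHUR11_GASTOS_V.csv", sep=";", decimal=".")
    print(gastos.head())
    ```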

  8. Convert Text to Pandas

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Cite
    Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
    Explore at:
    zip (4333134 bytes). Available download formats:
    Dataset updated
    Sep 22, 2024
    Authors
    Zeyad Usf
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    kaggle notebook
    Github Repo

    I found two datasets about converting text with context to pandas code on Hugging Face, but the challenge is in the context: the context in the two datasets is formatted differently, which reduces the results of the model. First let's mention the data I found, then show examples, the solution, and some other problems.

    • Rahima411/text-to-pandas:

      • The data is divided into Train with 57.5k and Test with 19.2k.

      • The data has two columns as you can see in the example:

        • "Input": Contains the context and the question together, in the context it shows the metadata about the data frame.
        • "Pandas Query": Pandas code txt Input | Pandas Query -----------------------------------------------------------|------------------------------------------- Table Name: head (age (object), head_id (object)) | result = management['head.age'].unique() Table Name: management (head_id (object), | temporary_acting (object)) | What are the distinct ages of the heads who are acting? |
    • hiltch/pandas-create-context:

      • It contains 17k rows with three columns:
        • question : text .
        • context : Code to create a data frame with column names, unlike the first data set which contains the name of the data frame, column names and data type.
        • answer : Pandas code.
        question                              | context                                                | answer
        --------------------------------------|--------------------------------------------------------|------------------------------------
        What was the lowest # of total votes? | df = pd.DataFrame(columns=['_number_of_total_votes'])  | df['_number_of_total_votes'].min()

    As you can see, the problem with this data is that the inputs are not similar and the structure of the context is different. My solution to this problem was:

    - Convert the first dataset so that its context looks like the second one. I chose this direction because it is difficult to get the data type of the columns in the second dataset. It was easy to convert the structure of the context from this shape, Table Name: head (age (object), head_id (object)), into this: head = pd.DataFrame(columns=['age','head_id']), through the code that I wrote below.
    - Then separate the question from the context. This was easy because, if you look at the data, you will find that the context always ends with ")", then a blank, and then the question. You will find all of this in the code below.
    - You will also notice that more than one table (and therefore more than one creation line) can occur in the context, and this has been engineered into the code.

    ```py
    import re

    def extract_table_creation(text: str) -> tuple[str, str]:
        """
        Extracts DataFrame creation statements and questions from the given text.

        Args:
          text (str): The input text containing table definitions and questions.

        Returns:
          tuple: A tuple containing a concatenated DataFrame creation string and a question.
        """
        # Define patterns
        table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
        column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

        # Find all table names and column definitions
        matches = re.findall(table_pattern, text)

        # Initialize a list to hold DataFrame creation statements
        df_creations = []

        for table_name, columns_str in matches:
            # Extract column names
            columns = re.findall(column_pattern, columns_str)
            column_names = [col[0] for col in columns]

            # Format DataFrame creation statement
            df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
            df_creations.append(df_creation)

        # Concatenate all DataFrame creation statements
        df_creation_concat = '\n'.join(df_creations)

        # Extract and clean the question
        question = text[text.rindex(')')+1:].strip()

        return df_creation_concat, question
    ```
    
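    For example, running the function on the sample input shown above returns the converted context and the separated question:

    ```py
    context, question = extract_table_creation(
        "Table Name: head (age (object), head_id (object)) "
        "Table Name: management (head_id (object), temporary_acting (object)) "
        "What are the distinct ages of the heads who are acting?"
    )
    print(context)   # head = pd.DataFrame(columns=['age', 'head_id'])
                     # management = pd.DataFrame(columns=['head_id', 'temporary_acting'])
    print(question)  # What are the distinct ages of the heads who are acting?
    ```
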
    After both datasets had the same structure, they were merged into one set and split into _72.8K_ train and _18.6K_ test examples. We analyzed this dataset, and you can see it all through the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we also found some problems in the dataset, such as:
    > - `Answer` : `df['Id'].count()` has been repeated, but this is possible, so we do not need to dispense with these rows.
    > - `Context` : We see that it contains `147` rows that do not contain any text. We will see Through the experiment if this will affect the results negatively or positively.
    > - `Question` : It is ...
    
  9. Analysis of network performance when confirmed traffic is present in Long...

    • researchdata.up.ac.za
    zip
    Updated Feb 16, 2024
    Cite
    Jaco Marais; Gerhardus Hancke; Adnan Abu Mahfouz (2024). Analysis of network performance when confirmed traffic is present in Long Range Wide Area Networks (LoRaWANs) [Dataset]. http://doi.org/10.25403/UPresearchdata.22113050.v1
    Explore at:
    zip. Available download formats:
    Dataset updated
    Feb 16, 2024
    Dataset provided by
    University of Pretoria
    Authors
    Jaco Marais; Gerhardus Hancke; Adnan Abu Mahfouz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Quantitative data of figures and graphing scripts from the thesis titled 'Developing a congestion management scheme to reduce the impact of congestion in mixed traffic LoRaWANs'. The files contain the processed output of simulations conducted with a modified version of the ns-3 plugin lorawan. The processed simulation output consists of Pandas DataFrames stored in text files. Software used: ns-3 (version 3.30), Jupyter notebooks, and Python with the packages sem, pandas, and seaborn, plus a modified version of the lorawan module from signetlabdei. The Python scripts refer to Std and Ex: Std refers to the standard LoRaWAN module and Ex refers to the extended version of the module with the algorithms presented in the thesis. Each text file contains a legend at the top listing all of the fields present in the dataframe.

  10. Crimp Force Curve Dataset

    • zenodo.org
    bin
    Updated Sep 11, 2025
    Cite
    Bernd Hofmann; Bernd Hofmann; Patrick Bründl; Patrick Bründl; Jörg Franke; Jörg Franke (2025). Crimp Force Curve Dataset [Dataset]. http://doi.org/10.7910/dvn/wbdkn6
    Explore at:
    bin. Available download formats:
    Dataset updated
    Sep 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bernd Hofmann; Bernd Hofmann; Patrick Bründl; Patrick Bründl; Jörg Franke; Jörg Franke
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Crimp Force Curve Dataset" is a comprehensive collection of univariate time series data representing crimp force curves recorded during the manufacturing process of crimp connections. This dataset has been designed to support a variety of applications, including anomaly detection, fault diagnosis, and research in data-driven quality assurance.

    A salient feature of this dataset is the presence of high-quality labels. Each crimp force curve is annotated both by a state-of-the-art crimp force monitoring system - capable of binary anomaly detection - and by domain experts who manually classified the curves into detailed quality classes. The expert annotations provide a valuable ground truth for training and benchmarking machine learning models beyond anomaly detection.

    The dataset is particularly well-suited for tasks involving time series analysis, such as the training and evaluation of machine learning algorithms for quality control and fault detection. It provides a substantial foundation for the development of generalisable, yet domain-specific (crimping), data-driven quality control systems.

    The data is stored in a Python pickle file crimp_force_curves.pkl, which is a binary format used to serialize and deserialize Python objects. It can be conveniently loaded into a pandas DataFrame for exploration and analysis using the following command:

    df = pd.read_pickle("crimp_force_curves.pkl")

    This dataset is a valuable resource for researchers and practitioners in manufacturing engineering, computer science, and data science who are working at the intersection of quality control in manufacturing and machine learning.

  11. BBC NEWS SUMMARY(CSV FORMAT)

    • kaggle.com
    zip
    Updated Sep 9, 2024
    Cite
    Dhiraj (2024). BBC NEWS SUMMARY(CSV FORMAT) [Dataset]. https://www.kaggle.com/datasets/dignity45/bbc-news-summarycsv-format
    Explore at:
    zip (2097600 bytes). Available download formats:
    Dataset updated
    Sep 9, 2024
    Authors
    Dhiraj
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description: Text Summarization Dataset

    This dataset is designed for users aiming to train models for text summarization. It contains 2,225 rows of data with two columns: "Text" and "Summary". Each row features a detailed news article or piece of text paired with its corresponding summary, providing a rich resource for developing and fine-tuning summarization algorithms.

    Key Features:

    • Text: Full-length articles or passages that serve as the input for summarization.
    • Summary: Concise summaries of the articles, which are ideal for training models to generate brief, coherent summaries from longer texts.

    Future Enhancements:

    This evolving dataset is planned to include additional features, such as text class labels, in future updates. These enhancements will provide more context and facilitate the development of models that can perform summarization across different categories of news content.

    Usage:

    Ideal for researchers and developers focused on text summarization tasks, this dataset enables the training of models to effectively compress information while retaining the essence of the original content.

    Acknowledgment

    We would like to extend our sincere gratitude to the dataset creator for their contribution to this valuable resource. This dataset, sourced from the BBC News Summary dataset on Kaggle, was created by Pariza. Their work has provided an invaluable asset for those working on text summarization tasks, and we appreciate their efforts in curating and sharing this data with the community.

    Thank you for supporting research and development in the field of natural language processing!

    File Description

    This script processes and consolidates text data from various directories containing news articles and their corresponding summaries. It reads the files from specified folders, handles encoding issues, and then creates a DataFrame that is saved as a CSV file for further analysis.

    Key Components:

    1. Imports:

      • numpy (np): Numerical operations library, though it's not used in this script.
      • pandas (pd): Data manipulation and analysis library.
      • os: For interacting with the operating system, e.g., building file paths.
      • glob: For file pattern matching and retrieving file paths.
    2. Function: get_texts

      • Parameters:
        • text_folders: List of folders containing news article text files.
        • text_list: List to store the content of text files.
        • summ_folder: List of folders containing summary text files.
        • sum_list: List to store the content of summary files.
        • encodings: List of encodings to try for reading files.
      • Purpose:
        • Reads text files from specified folders, handles different encodings, and appends the content to text_list and sum_list.
        • Returns the updated lists of texts and summaries.
    3. Data Preparation:

      • text_folder: List of directories for news articles.
      • summ_folder: List of directories for summaries.
      • text_list and summ_list: Initialize empty lists to store the contents.
      • data_df: Empty DataFrame to store the final data.
    4. Execution:

      • Calls get_texts function to populate text_list and summ_list.
      • Creates a DataFrame data_df with columns 'Text' and 'Summary'.
      • Saves data_df to a CSV file at /kaggle/working/bbc_news_data.csv.
    5. Output:

      • Prints the first few entries of the DataFrame to verify the content.

    Column Descriptions:

    • Text: Contains the full-length articles or passages of news content. This column is used as the input for summarization models.
    • Summary: Contains concise summaries of the corresponding articles in the "Text" column. This column is used as the target output for summarization models.

    Usage:

    • This script is designed to be run in a Kaggle environment where paths to text data are predefined.
    • It is intended for preprocessing and saving text data from news articles and summaries for subsequent analysis or model training.
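
    A hedged sketch of the consolidation described above (the folder layout follows the usual structure of the BBC News Summary dataset on Kaggle and may need adjusting):

    ```python
    import glob
    import os
    import pandas as pd

    def get_texts(text_folders, text_list, summ_folder, sum_list, encodings=("utf-8", "latin-1")):
        """Read article and summary files, trying several encodings, and append their contents."""
        for articles_dir, summaries_dir in zip(text_folders, summ_folder):
            for path in sorted(glob.glob(os.path.join(articles_dir, "*.txt"))):
                summary_path = os.path.join(summaries_dir, os.path.basename(path))
                for enc in encodings:
                    try:
                        with open(path, encoding=enc) as f, open(summary_path, encoding=enc) as g:
                            text_list.append(f.read())
                            sum_list.append(g.read())
                        break
                    except UnicodeDecodeError:
                        continue
        return text_list, sum_list

    text_list, summ_list = get_texts(["News Articles/business"], [], ["Summaries/business"], [])
    data_df = pd.DataFrame({"Text": text_list, "Summary": summ_list})
    data_df.to_csv("bbc_news_data.csv", index=False)  # the original script writes to /kaggle/working/
    ```
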
  12. wikipedia-summary-dataset-128k

    • huggingface.co
    Updated Apr 4, 2015
    Cite
    Martin Bukowski (2015). wikipedia-summary-dataset-128k [Dataset]. https://huggingface.co/datasets/mbukowski/wikipedia-summary-dataset-128k
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 4, 2015
    Authors
    Martin Bukowski
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Wikipedia Summary Dataset 128k

    This is a random subsample of 128k entries from the Wikipedia summary dataset, processed with the following code:

    import pandas as pd

    df = pd.read_parquet('wikipedia-summary.parquet')
    df['l'] = df['summary'].str.len()
    rdf = df[(df['l'] > 300) & (df['l'] < 600)]

    # Filter out any rows whose 'topic' has non-alphanumeric characters
    mask = rdf['topic'].str.contains(r'^[a-zA-Z0-9 ]+$') == True
    rdf = rdf[mask == True].sample(128000)[['topic'…

    See the full description on the dataset page: https://huggingface.co/datasets/mbukowski/wikipedia-summary-dataset-128k.

  13. Analysis of references in the IPCC AR6 WG2 Report of 2022

    • data.niaid.nih.gov
    Updated Mar 11, 2022
    + more versions
    Cite
    Cameron Neylon; Bianca Kramer (2022). Analysis of references in the IPCC AR6 WG2 Report of 2022 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6327206
    Explore at:
    Dataset updated
    Mar 11, 2022
    Dataset provided by
    Utrecht University
    Centre for Culture and Technology, Curtin University
    Authors
    Cameron Neylon; Bianca Kramer
    License

    Public Domain: https://creativecommons.org/licenses/publicdomain/

    Description

    This repository contains data on 17,419 DOIs cited in the IPCC Working Group 2 contribution to the Sixth Assessment Report, and the code to link them to the dataset built at the Curtin Open Knowledge Initiative (COKI).

    References were extracted from the report's PDFs (downloaded 2022-03-01) via Scholarcy and exported as RIS and BibTeX files. DOI strings were identified from RIS files by pattern matching and saved as CSV file. The list of DOIs for each chapter and cross chapter paper was processed using a custom Python script to generate a pandas DataFrame which was saved as CSV file and uploaded to Google Big Query.
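
    A hedged sketch of that DOI-matching step (the regular expression is a common DOI pattern and the column names are illustrative; the original preprocessing.R and process.py may differ):

    ```python
    import re
    import pandas as pd
    from pathlib import Path

    # Common DOI pattern; the original preprocessing may use a different expression
    DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s;,]+", re.IGNORECASE)

    rows = []
    for ris_file in Path("data/scholarcy/RIS").glob("*.ris"):
        text = ris_file.read_text(encoding="utf-8", errors="ignore")
        for doi in sorted(set(DOI_PATTERN.findall(text))):
            rows.append({"chapter": ris_file.stem, "doi": doi.rstrip(".")})

    pd.DataFrame(rows).to_csv("IPCC_AR6_WGII_dois.csv", index=False)
    ```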

    We used the main object table of the Academic Observatory, which combines information from Crossref, Unpaywall, Microsoft Academic, Open Citations, the Research Organization Registry and Geonames to enrich the DOIs with bibliographic information, affiliations, and open access status. A custom query was used to join and format the data and the resulting table was visualised in a Google DataStudio dashboard.

    This version of the repository also includes the set of DOIs from references in the IPCC Working Group 1 contribution to the Sixth Assessment Report as extracted by Alexis-Michel Mugabushaka and shared on Zenodo: https://doi.org/10.5281/zenodo.5475442 (CC-BY)

    A brief descriptive analysis was provided as a blogpost on the COKI website.

    The repository contains the following content:

    Data:

    data/scholarcy/RIS/ - extracted references as RIS files

    data/scholarcy/BibTeX/ - extracted references as BibTeX files

    IPCC_AR6_WGII_dois.csv - list of DOIs

    data/10.5281_zenodo.5475442/ - references from IPCC AR6 WG1 report

    Processing:

    preprocessing.R - preprocessing steps for identifying and cleaning DOIs

    process.py - Python script for transforming data and linking to COKI data through Google Big Query

    Outcomes:

    Dataset on BigQuery - requires a google account for access and bigquery account for querying

    Data Studio Dashboard - interactive analysis of the generated data

    Zotero library of references extracted via Scholarcy

    PDF version of blogpost

    Note on licenses: Data are made available under CC0 (with the exception of WG1 reference data, which have been shared under CC-BY 4.0) Code is made available under Apache License 2.0

  14. Data from: Bayesian Analysis for Remote Biosignature Identification on...

    • data.niaid.nih.gov
    Updated Jul 29, 2023
    Cite
    Natasha Latouf; Avi Mandell; Geronimo Villanueva; Michael Dane Moore; Nicholas Susemiehl; Vincent Kofman; Michael D. Himes (2023). Bayesian Analysis for Remote Biosignature Identification on exoEarths (BARBIE) I: Using Grid-Based Nested Sampling in Coronagraphy Observation Simulations for H2O [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7897197
    Explore at:
    Dataset updated
    Jul 29, 2023
    Dataset provided by
    NASA Goddard Space Flight Center
    George Mason University, NASA Goddard Space Flight Center
    Authors
    Natasha Latouf; Avi Mandell; Geronimo Villanueva; Michael Dane Moore; Nicholas Susemiehl; Vincent Kofman; Michael D. Himes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present all of the data across our SNR and abundance study for the molecule H2O for an exoEarth twin. The wavelength range is from 0.515-1 micron, with 25 evenly spaced 20% bandpasses in this range. The SNR ranges from 3-16, and the abundance values range from log10(VMR) = -3.5 to -1.5 in steps of 0.5 and 0.25 (all presented in VMR in the associated table). We present the lower and upper wavelength per bandpass, the input H2O value (abundance case), the retrieved H2O value (presented as the log10(VMR)), the lower and upper limits of the 68% credible region (presented as the log10(VMR)), and the log-Bayes factor for H2O. For more information about how these were calculated, please see Bayesian Analysis for Remote Biosignature Identification on exoEarths (BARBIE) I: Using Grid-Based Nested Sampling in Coronagraphy Observation Simulations for H2O, accepted and currently available on arXiv.

    To open this csv as a Pandas dataframe, use the following command:

    your_dataframe_name = pd.read_csv('zenodo_table.csv', dtype={'Input H2O': str})

  15. TUDelft-Electricity-Consumption-1.0

    • huggingface.co
    Cite
    OpenSynth-Energy, TUDelft-Electricity-Consumption-1.0 [Dataset]. https://huggingface.co/datasets/OpenSynth/TUDelft-Electricity-Consumption-1.0
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    OpenSynth-Energy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Timeseries Data Processing

    This repository contains a script for loading and processing time series data using the datasets library and converting it to a pandas DataFrame for further analysis.

      Dataset
    

    The dataset used contains time series data with the following features:

    id: Identifier for the dataset, formatted as Country_Number of Household (e.g., GE_1 for Germany, household 1).
    datetime: Timestamp indicating the date and time of the observation.
    target: Energy… See the full description on the dataset page: https://huggingface.co/datasets/OpenSynth/TUDelft-Electricity-Consumption-1.0.
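
    A minimal, hedged example of loading the dataset with the datasets library and converting it to pandas (the split name is taken from whatever the dataset exposes; check the dataset page for details):

    ```python
    import pandas as pd
    from datasets import load_dataset

    ds = load_dataset("OpenSynth/TUDelft-Electricity-Consumption-1.0")
    df = ds[list(ds.keys())[0]].to_pandas()

    # The description lists id, datetime, and target columns
    df["datetime"] = pd.to_datetime(df["datetime"])
    print(df.groupby("id")["target"].describe().head())
    ```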

  16. University Archives web archive collection derivatives

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    + more versions
    Cite
    Ruest, Nick; Wilk, Jocelyn; Thurman, Alex (2023). University Archives web archive collection derivatives [Dataset]. http://doi.org/10.5683/SP2/FONRZU
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Ruest, Nick; Wilk, Jocelyn; Thurman, Alex
    Description

    Web archive derivatives of the University Archives collection from Columbia University Libraries. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

    The cul-1914-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

    Domains
    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
    Produces a DataFrame with the following columns: domain, count

    Web Pages
    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
    Produces a DataFrame with the following columns: crawl_date, url, mime_type_web_server, mime_type_tika, content

    Web Graph
    .webgraph()
    Produces a DataFrame with the following columns: crawl_date, src, dest, anchor

    Image Links
    .imageLinks()
    Produces a DataFrame with the following columns: src, image_url

    Binary Analysis
    Images, PDFs, Presentation program files, Spreadsheets, Text files, Word processor files

    The cul-1914-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud:

    • Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
    • Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
    • Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
    • Domains count file. A text file containing the frequency count of domains captured within your web archive.

    Due to file size restrictions in Scholars Portal Dataverse, each of the derivative files needed to be split into 1G parts. These parts can be joined back together with cat. For example:

    cat cul-1914-parquet.tar.gz.part* > cul-1914-parquet.tar.gz
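
    For the Parquet derivatives, a minimal pandas sketch (the extracted directory layout below is an assumption based on the archive name):

    ```python
    import pandas as pd

    # After `cat cul-1914-parquet.tar.gz.part* > cul-1914-parquet.tar.gz` and extracting the archive,
    # point read_parquet at one of the derivative directories (path is illustrative).
    webpages = pd.read_parquet("cul-1914-parquet/webpages")
    print(webpages[["crawl_date", "url", "mime_type_web_server"]].head())
    ```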

  17. Data for: Electrical system architectures for building-integrated...

    • data.mendeley.com
    Updated Mar 31, 2020
    Cite
    Konstantinos Spiliotis (2020). Data for: Electrical system architectures for building-ntegrated photovoltaics (BIPV): A comparative analysis using a modelling framework in Modelica [Dataset]. http://doi.org/10.17632/g83gxhn77y.1
    Explore at:
    Dataset updated
    Mar 31, 2020
    Authors
    Konstantinos Spiliotis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The folder contains data related to the manuscript "Electrical system architectures for building-integrated photovoltaics (BIPV): A comparative analysis using a modelling framework in Modelica". Specifically, it contains:

    1) Power electronics efficiency curves
    2) Input meteorological data per location (TMY)
    3) Results (KPIs) in pandas dataframe CSV format

    Feel free to use any of the data, provided that you respect our authorship and cite the dataset and/or the associated paper, which provides detailed explanations of them.

  18. Fracture toughness of mixed-mode anticracks in highly porous materials...

    • zenodo.org
    bin, text/x-python +1
    Updated Sep 2, 2024
    Cite
    Valentin Adam; Valentin Adam; Bastian Bergfeld; Bastian Bergfeld; Philipp Weißgraeber; Philipp Weißgraeber; Alec van Herwijnen; Alec van Herwijnen; Philipp L. Rosendahl; Philipp L. Rosendahl (2024). Fracture toughness of mixed-mode anticracks in highly porous materials dataset and data processing [Dataset]. http://doi.org/10.5281/zenodo.11443644
    Explore at:
    text/x-python, txt, bin. Available download formats:
    Dataset updated
    Sep 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Valentin Adam; Valentin Adam; Bastian Bergfeld; Bastian Bergfeld; Philipp Weißgraeber; Philipp Weißgraeber; Alec van Herwijnen; Alec van Herwijnen; Philipp L. Rosendahl; Philipp L. Rosendahl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This repository contains the code and datasets used in the data analysis for "Fracture toughness of mixed-mode anticracks in highly porous materials". The analysis is implemented in Python, using Jupyter Notebooks.

    Contents

    • main.ipynb: Jupyter notebook with the main data analysis workflow.
    • energy.py: Methods for the calculation of energy release rates.
    • regression.py: Methods for the regression analyses.
    • visualization.py: Methods for generating visualizations.
    • df_mmft.pkl: Pickled DataFrame with experimental data gathered in the present work.
    • df_legacy.pkl: Pickled DataFrame with literature data.

    Prerequisites

    • To run the scripts and notebooks, you need:
    • Python 3.12 or higher
    • Jupyter Notebook or JupyterLab
    • Libraries: pandas, matplotlib, numpy, scipy, tqdm, uncertainties, weac

    Setup

    1. Download the zip file or clone this repository to your local machine.
    2. Ensure that Python and Jupyter are installed.
    3. Install required Python libraries using pip install -r requirements.txt.

    Running the Analysis

    1. Open the main.ipynb notebook in Jupyter Notebook or JupyterLab.
    2. Execute the cells in sequence to reproduce the analysis.

    Data Description

    The data included in this repository is encapsulated in two pickled DataFrame files, df_mmft.pkl and df_legacy.pkl, which contain experimental measurements and corresponding parameters. Below are the descriptions for each column in these DataFrames:

    df_mmft.pkl

    Includes data such as experiment identifiers, datetime, and physical measurements like slope inclination and critical cut lengths.
    • exp_id: Unique identifier for each experiment.
    • datestring: Date of the experiment as a string.
    • datetime: Timestamp of the experiment.
    • bunker: Field site of the experiment. Bunker IDs 1 and 2 correspond to field sites A and B, respectively.
    • slope_incl: Inclination of the slope in degrees.
    • h_sledge_top: Distance from sample top surface to the sled in mm.
    • h_wl_top: Distance from sample top surface to weak layer in mm.
    • h_wl_notch: Distance from the notch root to the weak layer in mm.
    • rc_right: Critical cut length in mm, measured on the front side of the sample.
    • rc_left: Critical cut length in mm, measured on the back side of the sample.
    • rc: Mean of rc_right and rc_left.
    • densities: List of density measurements in kg/m^3 for each distinct slab layer of each sample.
    • densities_mean: Daily mean of densities.
    • layers: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
    • layers_mean: Daily mean of layers.
    • surface_lineload: Surface line load of added surface weights in N/mm.
    • wl_thickness: Weak-layer thickness in mm.
    • notes: Additional notes regarding the experiment or observations.
    • L: Length of the slab–weak-layer assembly in mm.

    df_legacy.pkl

    Contains robustness data such as radii of curvature, slope inclination, and various geometrical measurements.
    • #: Record number.
    • rc: Critical cut length in mm.
    • slope_incl: Inclination of the slope in degrees.
    • h: Slab height in mm.
    • density: Mean slab density in kg/m^3.
    • L: Length of the slab–weak-layer assembly in mm.
    • collapse_height: Weak-layer height reduction through collapse.
    • layers_mean: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
    • wl_thickness: Weak-layer thickness in mm.
    • surface_lineload: Surface line load from added weights in N/mm.

    For more detailed information on the datasets, refer to the paper or the documentation provided within the Jupyter notebook.
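
    The two pickled DataFrames can be loaded directly with pandas, for example:

    ```python
    import pandas as pd

    df_mmft = pd.read_pickle("df_mmft.pkl")      # experimental data gathered in the present work
    df_legacy = pd.read_pickle("df_legacy.pkl")  # literature data
    print(df_mmft[["exp_id", "slope_incl", "rc"]].head())
    ```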

    License

    You are free to:
    • Share — copy and redistribute the material in any medium or format
    • Adapt — remix, transform, and build upon the material for any purpose, even commercially.
    Under the following terms:
    • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

    Citation

    Please cite the following paper if you use this analysis or the accompanying datasets:
    • Adam, V., Bergfeld, B., Weißgraeber, P., van Herwijnen, A., Rosendahl, P.L., Fracture toughness of mixed-mode anticracks in highly porous materials. Nature Communications 15, 7379 (2024). https://doi.org/10.1038/s41467-024-51491-7
  19. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 24, 2022
    Cite
    Hartloper, Alexander R.; Ozden, Selimcan; de Castro e Sousa, Albano; Lignos, Dimitrios G. (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6965146
    Explore at:
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Imperial College London
    EPFL
    Authors
    Hartloper, Alexander R.; Ozden, Selimcan; de Castro e Sousa, Albano; Lignos, Dimitrios G.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

    This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    The data is licensed through the Creative Commons Attribution 4.0 International.

    If you have used our data and are publishing your work, we ask that you please reference both:

    this database through its DOI, and

    any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.

    Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.

    Unreduced_Data-#_v1-0-0.zip: contains the original (not downsampled) data

    Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.

    We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.

    The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

    There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.

    The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

    Clean_Data_v1-0-0.zip: contains all the downsampled data

    The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

    There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.

    The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

    Database_References_v1-0-0.bib

    Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_Specimen_processed_data.csv" files in the "Clean_Data" directory. The is the load protocol designation and the is the specimen number for that load protocol and material source. Each file contains the following columns:

    The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data

    Time[s]: time in seconds since the start of the test

    e_true: true strain

    Sigma_true: true stress in MPa

    (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

    import pandas
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

    File Format: Unreduced Data

    These are the "LP_Specimen_processed_data.csv" files in the "Unreduced_Data" directory. The is the load protocol designation and the is the specimen number for that load protocol and material source. Each file contains the following columns:

    The first column is the index of each data point

    S/No: sample number recorded by the DAQ

    System Date: Date and time of sample

    Time[s]: time in seconds since the start of the test

    C_1_Force[kN]: load cell force

    C_1_Déform1[mm]: extensometer displacement

    C_1_Déplacement[mm]: cross-head displacement

    Eng_Stress[MPa]: engineering stress

    Eng_Strain[]: engineering strain

    e_true: true strain

    Sigma_true: true stress in MPa

    (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database. The columns include:

    hidden_index: internal reference ID

    grade: material grade

    spec: specifications for the material

    source: base material for the test specimen

    id: internal name for the specimen

    lp: load protocol

    size: type of specimen (M8, M12, M20)

    gage_length_mm_: unreduced section length in mm

    avg_reduced_dia_mm_: average measured diameter for the reduced section in mm

    avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm

    avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm

    fy_n_mpa_: nominal yield stress

    fu_n_mpa_: nominal ultimate stress

    t_a_deg_c_: ambient temperature in degC

    date: date of test

    investigator: person(s) who conducted the test

    location: laboratory where test was conducted

    machine: setup used to conduct test

    pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control

    pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control

    pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control

    citekey: reference corresponding to the Database_References.bib file

    yield_stress_mpa_: computed yield stress in MPa

    elastic_modulus_mpa_: computed elastic modulus in MPa

    fracture_strain: computed average true strain across the fracture surface

    c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass

    file: file name of corresponding clean (downsampled) stress-strain data

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')

    citekey: reference in "Campaign_References.bib".

    Grade: material grade.

    Spec.: specifications (e.g., J2+N).

    Yield Stress [MPa]: initial yield stress in MPa

    size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Elastic Modulus [MPa]: initial elastic modulus in MPa

    size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Caveats

    The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:

    A500

    A992_Gr50

    BCP325

    BCR295

    HYP400

    S460NL

    S690QL/25mm

    S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm

  20. Salaries case study

    • kaggle.com
    zip
    Updated Oct 2, 2024
    Cite
    Shobhit Chauhan (2024). Salaries case study [Dataset]. https://www.kaggle.com/datasets/satyam0123/salaries-case-study
    Explore at:
    zip (13105509 bytes). Available download formats:
    Dataset updated
    Oct 2, 2024
    Authors
    Shobhit Chauhan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:

    Case Study: Employee Salary Analysis

    In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.

    Step 1: Data Collection and Preparation
    Data Sources: The dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.).
    Data Cleaning: We use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries.

    Step 2: Data Exploration and Descriptive Statistics
    Exploratory Data Analysis (EDA): Using Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries.
    Data Visualization: Leveraging tools like Matplotlib or Seaborn for visualizing salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) provides a visual representation of salary variations by department.

    Step 3: Analysis Using NumPy
    Calculating Salary Ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles.
    Correlation Analysis: Identify the relationship between variables such as experience and salary using NumPy to compute correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals if experience is a significant factor in salary determination.

    Step 4: Grouping and Aggregation
    Salary by Department and Position: Using Pandas' groupby function, we can summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department.

    Step 5: Salary Forecasting (Optional)
    Predictive Analysis: Using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings.

    Step 6: Insights and Recommendations
    Outlier Identification: Detect any employees earning significantly more or less than the average, which could signal inequities or high performers.
    Salary Discrepancies: Highlight any salary discrepancies between departments or gender that may require further investigation.
    Compensation Planning: Based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization.

    Tools Used:
    Pandas: For data manipulation, grouping, and descriptive analysis.
    NumPy: For numerical operations such as percentiles and correlations.
    Matplotlib/Seaborn: For data visualization to highlight key patterns and trends.
    Scikit-learn (Optional): For building predictive models if salary forecasting is included in the analysis.

    This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy.
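
    A compact, hedged sketch tying several of these steps together (the file name and column names are placeholders matching the fields listed above):

    ```python
    import pandas as pd
    import numpy as np

    df = pd.read_csv("salaries.csv")  # placeholder file name
    df = df.dropna(subset=["salary"]).drop_duplicates()

    # Step 2: descriptive statistics
    print(df["salary"].describe())

    # Step 3: quartiles and correlation with experience
    print(np.percentile(df["salary"], [25, 50, 75]))
    print(np.corrcoef(df["years_of_experience"], df["salary"])[0, 1])

    # Step 4: average salary per department
    print(df.groupby("department")["salary"].mean().sort_values(ascending=False))
    ```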
