42 datasets found
  1. Additional file 1 of Tensor extrapolation: an adaptation to data sets with...

    • springernature.figshare.com
    html
    Updated May 31, 2023
    Cite
    Josef Schosser (2023). Additional file 1 of Tensor extrapolation: an adaptation to data sets with missing entries [Dataset]. http://doi.org/10.6084/m9.figshare.19242463.v1
    Explore at:
    html
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Josef Schosser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. Python code. The notebook highlights core components of the code applied in the study.

  2. Ecommerce Dataset for Data Analysis

    • kaggle.com
    zip
    Updated Sep 19, 2024
    Cite
    Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
    Explore at:
    zip (2028853 bytes)
    Dataset updated
    Sep 19, 2024
    Authors
    Shrishti Manja
    Description

    This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

    About the Dataset: - CID (Customer ID): A unique identifier for each customer. - TID (Transaction ID): A unique identifier for each transaction. - Gender: The gender of the customer, categorized as Male or Female. - Age Group: Age group of the customer, divided into several ranges. - Purchase Date: The timestamp of when the transaction took place. - Product Category: The category of the product purchased, such as Electronics, Apparel, etc. - Discount Availed: Indicates whether the customer availed any discount (Yes/No). - Discount Name: Name of the discount applied (e.g., FESTIVE50). - Discount Amount (INR): The amount of discount availed by the customer. - Gross Amount: The total amount before applying any discount. - Net Amount: The final amount after applying the discount. - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.). - Location: The city where the purchase took place.

    Use Cases: 1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data. 2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis. 3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts. 4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
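    As a quick illustration of the EDA use case, a minimal sketch could look like the following; the file name "ecommerce.csv" is a placeholder, and the column names are taken from the description above.

    import pandas as pd

    # Placeholder file name; use the actual CSV from the download
    df = pd.read_csv("ecommerce.csv", parse_dates=["Purchase Date"])

    # Shape and per-column summary statistics
    print(df.shape)
    print(df.describe(include="all"))

    # Average net amount per product category
    print(df.groupby("Product Category")["Net Amount"].mean().sort_values(ascending=False))

    # Share of transactions that availed a discount, by age group
    print(df.groupby("Age Group")["Discount Availed"].apply(lambda s: (s == "Yes").mean()))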

    This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

    This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.

  3. Pre-Processed Power Grid Frequency Time Series

    • zenodo.org
    bin, zip
    Updated Jul 15, 2021
    + more versions
    Cite
    Johannes Kruse; Benjamin Schäfer; Dirk Witthaut (2021). Pre-Processed Power Grid Frequency Time Series [Dataset]. http://doi.org/10.5281/zenodo.3744121
    Explore at:
    zip, bin
    Dataset updated
    Jul 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johannes Kruse; Benjamin Schäfer; Dirk Witthaut
    Description

    Overview
    This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in python. The data covers three synchronous areas of the European power grid:

    • Continental Europe
    • Great Britain
    • Nordic

    This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data or the code. For detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.

    Data sources
    We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).

    • Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows it to be re-published upon request [3].
    • Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].
    • Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

    Content of the repository

    A) Scripts

    1. In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSO's websites.
    2. In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).
    3. In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).

    The Python scripts run with Python 3.7 and the packages listed in "requirements.txt".

    B) Data_converted and Data_cleansed
    The folder "Data_converted" contains the output of "convert_data_format.py" and "Data_cleansed" contains the output of "clean_corrupted_data.py".

    • File type: The files are zipped csv-files, where each file comprises one year.
    • Data format: The files contain two columns. The first one represents the time stamps in the format Year-Month-Day Hour-Minute-Second, which is given as naive local time. The second column contains the frequency values in Hz.
    • NaN representation: We mark corrupted and missing data as "NaN" in the csv-files.

    Use cases
    We point out that this repository can be used in two different ways:

    • Use pre-processed data: You can directly use the converted or the cleansed data. Note however that both data sets include segments of NaN-values due to missing and corrupted recordings. Only a very small part of the NaN-values were eliminated in the cleansed data to not manipulate the data too much. If your application cannot deal with NaNs, you could build upon the following commands to select the longest interval of valid data from the cleansed data:
    from helper_functions import *
    import pandas as pd
    import numpy as np
    
    # Load the cleansed data (one year per zipped CSV) as a time-indexed series
    cleansed_data = pd.read_csv('/Path_to_cleansed_data/data.zip',
                index_col=0, header=None, squeeze=True,
                parse_dates=[0])
    
    # Select the longest interval of valid (non-NaN) data
    valid_bounds, valid_sizes = true_intervals(~cleansed_data.isnull())
    start, end = valid_bounds[np.argmax(valid_sizes)]
    data_without_nan = cleansed_data.iloc[start:end]
    • Produce your own cleansed data: Depending on your application, you might want to cleanse the data in a custom way. You can easily add your custom cleansing procedure in "clean_corrupted_data.py" and then produce cleansed data from the raw data in "Data_converted".

    License
    We release the code in the folder "Scripts" under the MIT license [8]. In the case of National Grid and Fingrid, we further release the pre-processed data in the folders "Data_converted" and "Data_cleansed" under the CC-BY 4.0 license [7]. TransnetBW originally did not publish their data under an open license. We have explicitly received permission from TransnetBW to publish the pre-processed version. However, we cannot publish our pre-processed version under an open license because the original TransnetBW data carries no open license.

  4. Data from: A comprehensive dataset for the accelerated development and...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Carreira Pedro, Hugo; Larson, David; Coimbra, Carlos (2020). A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2826938
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    University of California San Diego
    Authors
    Carreira Pedro, Hugo; Larson, David; Coimbra, Carlos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborated models.

    Data usage The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494

    Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.

    Sample code As part of the data release, we are also including the sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for Machine Learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.

    Units All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.

    Missing data The string "NAN" indicates missing data.

    File formats All time series data files are in CSV (comma separated values) format. Images are given in tar.bz2 files.

    Files

    Folsom_irradiance.csv Primary One-minute GHI, DNI, and DHI data.

    Folsom_weather.csv Primary One-minute weather data.

    Folsom_sky_images_{YEAR}.tar.bz2 Primary Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2.

    Folsom_NAM_lat{LAT}_lon{LON}.csv Primary NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node’s coordinates listed in Table I in the paper.

    Folsom_sky_image_features.csv Secondary Features derived from the sky images.

    Folsom_satellite.csv Secondary 10 pixel by 10 pixel GOES-15 images centered in the target location.

    Irradiance_features_{horizon}.csv Secondary Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).

    Sky_image_features_intra-hour.csv Secondary Sky image features for the intra-hour forecasting issuing times.

    Sat_image_features_intra-day.csv Secondary Satellite image features for the intra-day forecasting issuing times.

    NAM_nearest_node_day-ahead.csv Secondary NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location prepared for day-ahead forecasting.

    Target_{horizon}.csv Secondary Target data for the different forecasting horizons.

    Forecast_{horizon}.py Code Python script used to create the forecasts for the different horizons.

    Postprocess.py Code Python script used to compute the error metric for all the forecasts.
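    As a loading sketch (assuming only the conventions stated above: a leading UTC timestamp column and "NAN" as the missing-data marker; the remaining column names may differ in the actual files):

    import pandas as pd

    # Read the primary irradiance file; "NAN" marks missing data
    irr = pd.read_csv("Folsom_irradiance.csv",
                      na_values="NAN",   # missing-data marker used in this dataset
                      parse_dates=[0],   # first column assumed to be the UTC timestamp
                      index_col=0)

    print(irr.head())
    print(irr.isna().mean())  # fraction of missing values per column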

  5. Overwatch 2 statistics

    • kaggle.com
    zip
    Updated Jun 27, 2023
    Cite
    Mykhailo Kachan (2023). Overwatch 2 statistics [Dataset]. https://www.kaggle.com/datasets/mykhailokachan/overwatch-2-statistics/code
    Explore at:
    zip (67546 bytes)
    Dataset updated
    Jun 27, 2023
    Authors
    Mykhailo Kachan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is built on data from Overbuff with the help of Python and Selenium. Development environment: Jupyter Notebook.

    The tables contain the data for competitive seasons 1-4 and for quick play for each hero and rank along with the standard statistics (common to each hero as well as information belonging to a specific hero).

    Note: data for some columns are missing on Overbuff site (there is '—' instead of a specific value), so they were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, Minefield Kills for Wrecking Ball. 'Self Healing' column for Bastion was dropped too as Bastion doesn't have this property anymore in OW2. Also, there are no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). Overall, all missing values were cleaned.

    Attention: Overbuff doesn't contain info about OW 1 competitive seasons (when you change a skill tier, the data isn't changed). If you know a site where it's possible to get this data, please, leave a comment. Thank you!

    The code is available on GitHub.

    The whole procedure is done in 5 stages:

    Stage 1:

    Data is retrieved directly from HTML elements on the page using Selenium in Python.

    Stage 2:

    After scraping, the data was cleansed: 1) Deleted the comma thousands separator (e.g. 1,009 => 1009). 2) Translated time representations (e.g. '01:23') to seconds (1*60 + 23 => 83). 3) Normalized hero names: Lúcio became Lucio and Torbjörn became Torbjorn.

    Stage 3:

    Data were arranged into a table and saved to CSV.

    Stage 4:

    Columns that are supposed to contain only numeric values are checked, and all non-numeric values are dropped. This stage helps find missing values that contain '—' instead of a number and delete them.

    Stage 5:

    Additional missing values are searched for and dealt with. Either a column is renamed (when the program cannot infer the correct column name because of missing values) or the column is dropped. This stage ensures all wrong data are truly fixed.

    The procedure to fetch the data takes 7 minutes on average.
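    For illustration, the Stage 2 transformations could be sketched as below; the toy DataFrame and its column names are made up, and only the three cleansing rules come from the description above.

    import pandas as pd

    # Toy frame standing in for the scraped Overbuff table
    df = pd.DataFrame({
        "Hero": ["Lúcio", "Torbjörn", "Ashe"],
        "Eliminations": ["1,009", "23", "15"],
        "Time on Fire": ["01:23", "10:05", "00:47"],
    })

    # 1) Delete the comma thousands separator (e.g. 1,009 => 1009)
    df["Eliminations"] = df["Eliminations"].str.replace(",", "", regex=False).astype(int)

    # 2) Translate time strings to seconds (e.g. '01:23' => 1*60 + 23 = 83)
    def to_seconds(value):
        minutes, seconds = value.split(":")
        return int(minutes) * 60 + int(seconds)

    df["Time on Fire"] = df["Time on Fire"].map(to_seconds)

    # 3) Normalize hero names (Lúcio => Lucio, Torbjörn => Torbjorn)
    df["Hero"] = df["Hero"].replace({"Lúcio": "Lucio", "Torbjörn": "Torbjorn"})

    print(df)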

    This project and its code were inspired by this GitHub code.

  6. Number of hidden nodes of SCNs and the corresponding training error.

    • plos.figshare.com
    xls
    Updated Mar 19, 2024
    Cite
    Xinfu Pang; Wei Sun; Haibo Li; Wei Liu; Changfeng Luan (2024). Number of hidden nodes of SCNs and the corresponding training error. [Dataset]. http://doi.org/10.1371/journal.pone.0300229.t001
    Explore at:
    xls
    Dataset updated
    Mar 19, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Xinfu Pang; Wei Sun; Haibo Li; Wei Liu; Changfeng Luan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of hidden nodes of SCNs and the corresponding training error.

  7. Automobile_Price_prediction

    • kaggle.com
    zip
    Updated Oct 22, 2023
    Cite
    Ayush Triapthi (2023). Automobile_Price_prediction [Dataset]. https://www.kaggle.com/datasets/ayusht18dec/case-study-dataset
    Explore at:
    zip (10947 bytes)
    Dataset updated
    Oct 22, 2023
    Authors
    Ayush Triapthi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    A Case Study

    In this case study we are going to use the automobile dataset, which contains plenty of car manufacturers with their specifications, in order to build a predictive model that finds the approximate car price. This dataset has 26 columns, including categorical and quantitative attributes.

    The given_automobile.csv contains records from the above-mentioned dataset.

    You need to write descriptive answers to the questions under each task and also use a proper program written in Python and execute the code.
    1. The missing values are presented as ‘?’ in the dataset. Apply data wrangling techniques using the Python programming language to resolve missing values in all the attributes.
    2. Check the data types of the columns with missing values, and convert the data type if needed.
    3. Find all the features correlated with ‘Price’.
    4. Build a predictive model to predict the car price using one of the independent correlated variables.
    5. Continue with the same model built in No. 4, but choose different independent variables and discuss the result.
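    A hedged sketch of tasks 1-4: the file name given_automobile.csv and the ‘Price’ column come from the description above, while the imputation strategy and the example feature name are illustrative choices only.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # 1) Missing values appear as '?'; read them as NaN
    df = pd.read_csv("given_automobile.csv", na_values="?")

    # 2) Check dtypes (columns become numeric once '?' is parsed as NaN)
    print(df.dtypes)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # 3) Correlation of the numeric features with 'Price'
    print(df.corr(numeric_only=True)["Price"].sort_values(ascending=False))

    # 4) Simple one-variable model; "engine-size" is a hypothetical column name
    feature = "engine-size"
    model = LinearRegression().fit(df[[feature]], df["Price"])
    print(model.score(df[[feature]], df["Price"]))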

  8. Number of SCNs-based learners and corresponding error.

    • plos.figshare.com
    xls
    Updated Mar 19, 2024
    Cite
    Xinfu Pang; Wei Sun; Haibo Li; Wei Liu; Changfeng Luan (2024). Number of SCNs-based learners and corresponding error. [Dataset]. http://doi.org/10.1371/journal.pone.0300229.t002
    Explore at:
    xls
    Dataset updated
    Mar 19, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Xinfu Pang; Wei Sun; Haibo Li; Wei Liu; Changfeng Luan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of SCNs-based learners and corresponding error.

  9. BI intro to data cleaning eda and machine learning

    • kaggle.com
    zip
    Updated Nov 17, 2025
    Cite
    Walekhwa Tambiti Leo Philip (2025). BI intro to data cleaning eda and machine learning [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/intro-to-data-cleaning-eda-and-machine-learning/suggestions
    Explore at:
    zip (9961 bytes)
    Dataset updated
    Nov 17, 2025
    Authors
    Walekhwa Tambiti Leo Philip
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Real-World Data Science Challenge

    Business Intelligence Program Strategy — Student Success Optimization

    Hosted by: Walsoft Computer Institute 📁 Download dataset 👤 Kaggle profile

    Background

    Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.

    As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve:

    • Admissions decision-making
    • Academic support strategies
    • Overall program impact and ROI

    Your Mission

    Answer this central question:

    “Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”

    Key Strategic Areas

    You are required to analyze and provide actionable insights for the following three areas:

    1. Admissions Optimization

    Should entry exams remain the primary admissions filter?

    Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.

    ✅ Deliverables:

    • Feature importance ranking for predicting Python and DB scores
    • Admission policy recommendation (e.g., retain exams, add screening tools, adjust thresholds)
    • Business rationale and risk analysis

    2. Curriculum Support Strategy

    Are there at-risk student groups who need extra support?

    Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.

    ✅ Deliverables:

    • At-risk segment identification
    • Support program design (e.g., prep course, mentoring)
    • Expected outcomes, costs, and KPIs

    3. Resource Allocation & Program ROI

    How can we allocate resources for maximum student success?

    Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.

    ✅ Deliverables:

    • Performance drivers
    • Student segmentation
    • Resource allocation plan and ROI projection

    🛠️ Dataset Overview

    Column: Description
    • fNAME, lNAME: Student first and last name
    • Age: Student age (21–71 years)
    • gender: Gender (standardized as "Male"/"Female")
    • country: Student’s country of origin
    • residence: Student housing/residence type
    • entryEXAM: Entry test score (28–98)
    • prevEducation: Prior education (High School, Diploma, etc.)
    • studyHOURS: Total study hours logged
    • Python: Final Python exam score
    • DB: Final Database exam score

    📊 Dataset

    You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.

    Raw Dataset (Recommended for Full Project)

    Download: bi.csv

    This dataset includes common data quality challenges:

    • Country name inconsistencies
      e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom

    • Residence type variations
      e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence

    • Education level typos and casing issues
      e.g. Barrrchelors → Bachelor; DIPLOMA, Diplomaaa → Diploma

    • Gender value noise
      e.g. M, F, female → standardize to Male / Female

    • Missing scores in Python subject
      Fill NaN values using column mean or suitable imputation strategy

    Participants using this dataset are expected to apply data cleaning techniques such as: - String standardization - Null value imputation - Type correction (e.g., scores as float) - Validation and visual verification
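    A possible cleaning sketch along those lines is shown below; the mappings mirror the examples given above and would need to be extended to whatever other variants actually appear in bi.csv.

    import pandas as pd

    df = pd.read_csv("bi.csv")

    # String standardization (mappings taken from the examples above)
    df["country"] = df["country"].replace(
        {"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})
    df["residence"] = df["residence"].replace(
        {"BI-Residence": "BI Residence", "BIResidence": "BI Residence", "BI_Residence": "BI Residence"})
    df["prevEducation"] = df["prevEducation"].replace(
        {"Barrrchelors": "Bachelor", "DIPLOMA": "Diploma", "Diplomaaa": "Diploma"})
    df["gender"] = (df["gender"].str.strip().str.lower()
                    .map({"m": "Male", "male": "Male", "f": "Female", "female": "Female"}))

    # Null value imputation and type correction for the Python scores
    df["Python"] = pd.to_numeric(df["Python"], errors="coerce")
    df["Python"] = df["Python"].fillna(df["Python"].mean())

    # Validation and visual verification
    print(df["country"].unique())
    print(df.isna().sum())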

    Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.

    Cleaned Dataset (Optional Shortcut)

    Download: cleaned_bi.csv

    This version has been fully standardized and preprocessed: - All fields cleaned and renamed consistently - Missing Python scores filled with th...

  10. Data from: Calculations of Absolute Solvation Free Energies with...

    • acs.figshare.com
    zip
    Updated Aug 24, 2023
    Cite
    Johannes Karwounopoulos; Åsmund Kaupang; Marcus Wieder; Stefan Boresch (2023). Calculations of Absolute Solvation Free Energies with Transformato: Application to the FreeSolv Database Using the CGenFF Force Field [Dataset]. http://doi.org/10.1021/acs.jctc.3c00691.s003
    Explore at:
    zip
    Dataset updated
    Aug 24, 2023
    Dataset provided by
    ACS Publications
    Authors
    Johannes Karwounopoulos; Åsmund Kaupang; Marcus Wieder; Stefan Boresch
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    We recently introduced transformato, an open-source Python package for the automated setup of large-scale calculations of relative solvation and binding free energy differences. Here, we extend the capabilities of transformato to the calculation of absolute solvation free energy differences. After careful validation against the literature results and reference calculations with the PERT module of CHARMM, we used transformato to compute absolute solvation free energies for most molecules in the FreeSolv database (621 out of 642). The force field parameters were obtained with the program cgenff (v2.5.1), which derives missing parameters from the CHARMM general force field (CGenFF v4.6). A long-range correction for the Lennard-Jones interactions was added to all computed solvation free energies. The mean absolute error compared to the experimental data is 1.12 kcal/mol. Our results allow a detailed comparison between the AMBER and CHARMM general force fields and provide a more in-depth understanding of the capabilities and limitations of the CGenFF small molecule parameters.

  11. Modeled groundwater levels across Central Valley, CA, from March 2015 to...

    • dataverse.harvard.edu
    Updated Sep 29, 2024
    Cite
    Anshuman Pradhan; Kyra H. Adams; Venkat Chandrasekaran; Zhen Liu; John T. Reager; Andrew M. Stuart; Michael J. Turmon (2024). Modeled groundwater levels across Central Valley, CA, from March 2015 to August 2020, using GP-DNN regression [Dataset]. http://doi.org/10.7910/DVN/23TNJO
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 29, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Anshuman Pradhan; Kyra H. Adams; Venkat Chandrasekaran; Zhen Liu; John T. Reager; Andrew M. Stuart; Michael J. Turmon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Central Valley, California
    Description

    This dataset contains groundwater level trends and time series data across a discretized grid of California's Central Valley, modeled with well data using a hierarchical Gaussian process and neural network regression methodology. The spatial grid consists of 400 cells spanning latitudes 34.91 to 40.895 degrees and 220 cells spanning longitudes -122.6 to -118.658 degrees. The temporal axis spans March 2015 to August 2020, discretized at biweekly intervals, with a total of 132 cells. The spatiotemporal grid details are present in the relevant files.

    The first dataset is contained in the Python pickle file 'CV_water_level_trends_Mar2015_Aug2020.pkl'. It holds a nested Python dictionary with the following pairs:
    1.1. 'longitude': Numpy array of shape 400 x 220
    1.2. 'longitude': Numpy array of shape 400 x 220
    1.3. 'mean': Python dictionary with mean long-term and seasonal water level trends
    1.4. 'P10': Python dictionary with P10 long-term and seasonal water level trends
    1.5. 'P90': Python dictionary with P90 long-term and seasonal water level trends
    Each of the dictionaries in 1.3, 1.4 and 1.5 contains the following keys and values:
    • 'initial_water_level_ft': Mean/P10/P90 of March 2015 water levels in feet, stored as a Numpy array of shape 400 x 220
    • 'water_level_decline_rate_ft/biweek': Mean/P10/P90 of March 2015 to August 2020 water level decline rates in ft/biweek, stored as a Numpy array of shape 400 x 220
    • 'water_level_amplitude_ft': Mean/P10/P90 of seasonal water level oscillation amplitude, stored as a Numpy array of shape 400 x 220
    • 'water_level_phase_deg': Mean/P10/P90 of time to peak seasonal signal in degrees, stored as a Numpy array of shape 400 x 220

    The second dataset is contained in the Python pickle file 'CV_water_level_time_series_Mar2015_Aug2020.pkl'. It holds a Python dictionary with the following pairs:
    2.1. 'longitude': Numpy array of shape 400 x 220
    2.2. 'longitude': Numpy array of shape 400 x 220
    2.3. 'time_axis': Python list of length 132 containing strings for the biweekly periods from March 2015 to August 2020
    2.4. 'water_level_well_ft': Processed water level observations in feet from 1744 wells, irregularly sampled across time, stored as a Numpy array of shape 400 x 220 x 132 with missing values as NaNs
    2.5. 'water_level_modeled_mean_ft': Modeled mean water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
    2.6. 'water_level_modeled_P10_ft': Modeled P10 water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
    2.7. 'water_level_modeled_P90_ft': Modeled P90 water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
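    A short loading sketch, assuming only the file and key names documented above:

    import pickle
    import numpy as np

    with open("CV_water_level_trends_Mar2015_Aug2020.pkl", "rb") as f:
        trends = pickle.load(f)

    # Mean long-term decline rate on the 400 x 220 grid (ft/biweek)
    decline = trends["mean"]["water_level_decline_rate_ft/biweek"]
    print(decline.shape)

    with open("CV_water_level_time_series_Mar2015_Aug2020.pkl", "rb") as f:
        series = pickle.load(f)

    # Modeled mean time series (400 x 220 x 132) and irregular well observations
    modeled = series["water_level_modeled_mean_ft"]
    wells = series["water_level_well_ft"]          # NaN where no observation exists
    print(modeled.shape, np.isnan(wells).mean())   # fraction of missing well values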

  12. ACEA competition additional datasets

    • kaggle.com
    zip
    Updated Apr 7, 2021
    Cite
    Vincent Larmet (2021). ACEA competition additional datasets [Dataset]. https://www.kaggle.com/vlarmet/acea-competition-additional-datasets
    Explore at:
    zip (74756921 bytes)
    Dataset updated
    Apr 7, 2021
    Authors
    Vincent Larmet
    Description

    These datasets come from Google Earth Engine and are used in the ACEA challenge.

    The first is daily time series from Copernicus ECMWF ERA5 Daily aggregates, extracted using weather station geolocations. Time series range from 1998 to 2020. 48 different stations are located in Italy.
    The extraction has been done with this script:

    import pandas as pd
    import numpy as np
    from datetime import datetime as dt
    import ee
    def extract_time_series(lat, lon, start, end, product_name, sf):
      # Set up point geometry
      point = ee.Geometry.Point(lon, lat)
    
      # Obtain image collection for all images within query dates
      coll = ee.ImageCollection(product_name)\
        .filterDate(start, end)
    
      def setProperty(image):
        dic = image.reduceRegion(ee.Reducer.first(), point)
        return image.set(dic)
    
      data = coll.map(setProperty)
      data = data.getInfo()
      liste = list(map(lambda x: pd.DataFrame(x['properties']), data['features']))
      df = pd.concat(liste)
      return df
    
    if _name_ == "_main_":
      ee.Initialize()
      for i in locations.keys(): # locations is a dictionary containing latitude and longitude
        print(i)
        latitude = locations[i]['lat']
        longitude = locations[i]['lon'] 
        while True:
          try: 
            output = extract_time_series(latitude,
                     longitude,
                     '1998-01-01',
                     '2020-01-01',
                     'ECMWF/ERA5/DAILY',
                     1)
            break
          except: 
            print(i + " 1 fail")
            continue              
        name = PATH + i + "_1.csv"  # PATH: user-defined output path prefix
        output.to_csv(name, index=True)
    

    The second dataset is Forecasted Weather from Global Forecast System.
    The purpose of this dataset is to provide forecasted rainfall and temperature for the coming 16 days. The creation_time column is the release date, while forecast_hours gives the forecast horizon: the forecast is valid for creation_time + forecast_hours. Time series are daily and range from 2015 to 2020. Unfortunately, there are missing values.
    Python script :

    import pandas as pd
    import numpy as np
    from datetime import datetime as dt
    import ee
    def extract_time_series_gfs(lat, lon, start, end, product_name, sf, h):
    
      # Set up point geometry
      point = ee.Geometry.Point(lon, lat)
    
      # Obtain image collection for all images within query dates
      coll = ee.ImageCollection(product_name)\
        .select(['total_precipitation_surface','temperature_2m_above_ground'])\
        .filterDate(start, end)\
        .filterMetadata('forecast_hours', 'equals', h)
    
      def setProperty(image):
        dic = image.reduceRegion(ee.Reducer.first(), point)
        return image.set(dic)
    
      data = coll.map(setProperty)
      data = data.getInfo()
      
      liste = list(map(lambda x: pd.DataFrame(x['properties']), data['features']))
      df = pd.concat(liste)
      df=df[df["system:footprint"] == "LinearRing"]
    
      return df
    if _name_ == "_main_":
    
      ee.Initialize()
      horizon = [i*24 for i in range(1,17)]
      for i in locations.keys():
        print(i)
        latitude = locations[i]['lat']
        longitude = locations[i]['lon'] 
        
        for j in horizon:
          while True:
            try:
              output = extract_time_series_gfs(latitude,
                     longitude,
                     '2015-07-01',
                     '2020-08-01',
                     'NOAA/GFS0P25',
                     1,
                     j)
              break
            except:
              print(i + " " + str(j) +" 1 fail")
              continue
          name = PATH + i + "_" + str(j) +"_1.csv"
               
          output.to_csv(name, index=True)
    
  13. CpG Signature Profiling and Heatmap Visualization of SARS-CoV Genomes:...

    • figshare.com
    txt
    Updated Apr 5, 2025
    Cite
    Tahir Bhatti (2025). CpG Signature Profiling and Heatmap Visualization of SARS-CoV Genomes: Tracing the Genomic Divergence From SARS-CoV (2003) to SARS-CoV-2 (2019) [Dataset]. http://doi.org/10.6084/m9.figshare.28736501.v1
    Explore at:
    txt
    Dataset updated
    Apr 5, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tahir Bhatti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective
    The primary objective of this study was to analyze CpG dinucleotide dynamics in coronaviruses by comparing Wuhan-Hu-1 with its closest and most distant relatives. Heatmaps were generated to visualize CpG counts and O/E ratios across intergenic regions, providing a clear depiction of conserved and divergent CpG patterns.

    Methods
    1. Data Collection
    Source: The dataset includes CpG counts and O/E ratios for various coronaviruses, extracted from publicly available genomic sequences.
    Format: Data was compiled into a CSV file containing columns for intergenic regions, CpG counts, and O/E ratios for each virus.
    2. Preprocessing
    Data Cleaning: Missing values (NaN), infinite values (inf, -inf), and blank entries were handled using Python's pandas library. Missing values were replaced with column means, and infinite values were capped at a large finite value (1e9).
    Reshaping: The data was reshaped into matrices for CpG counts and O/E ratios using the pandas melt() and pivot() functions.
    3. Distance Calculation
    Euclidean Distance: Pairwise Euclidean distances were calculated between Wuhan-Hu-1 and other viruses using the scipy.spatial.distance.euclidean function. Distances were computed separately for CpG counts and O/E ratios, and the total distance was derived as the sum of both metrics.
    4. Identification of Closest and Distant Relatives
    The virus with the smallest total distance was identified as the closest relative. The virus with the largest total distance was identified as the most distant relative.
    5. Heatmap Generation
    Tools: Heatmaps were generated using Python's seaborn library (sns.heatmap) and matplotlib for visualization.
    Parameters: Heatmaps were annotated with numerical values for clarity. A color gradient (coolwarm) was used to represent varying CpG counts and O/E ratios. Titles and axis labels were added to describe the comparison between Wuhan-Hu-1 and its relatives.

    Results
    Closest Relative: The closest relative to Wuhan-Hu-1 was identified based on the smallest Euclidean distance. Heatmaps for CpG counts and O/E ratios show high similarity in specific intergenic regions.
    Most Distant Relative: The most distant relative was identified based on the largest Euclidean distance. Heatmaps reveal significant differences in CpG dynamics compared to Wuhan-Hu-1.

    Tools and Libraries
    The following tools and libraries were used in this analysis:
    Programming Language: Python 3.13
    Libraries: pandas (data manipulation and cleaning), numpy (numerical operations and handling of missing/infinite values), scipy.spatial.distance (Euclidean distance calculation), seaborn (heatmap generation), matplotlib (additional visualization enhancements).
    File Formats: Input: CSV files containing CpG counts and O/E ratios. Output: PNG images of heatmaps.

    Files Included
    CSV File: Contains the raw data of CpG counts and O/E ratios for all viruses.
    Heatmap Images: Heatmaps for CpG counts and O/E ratios comparing Wuhan-Hu-1 with its closest and most distant relatives.
    Python Script: Full Python code used for data processing, distance calculation, and heatmap generation.

    Usage Notes
    Researchers can use this dataset to further explore the evolutionary dynamics of CpG dinucleotides in coronaviruses. The Python script can be adapted to analyze other viral genomes or datasets. Heatmaps provide a visual summary of CpG dynamics, aiding in hypothesis generation and experimental design.

    Acknowledgments
    Special thanks to the open-source community for developing tools like pandas, numpy, seaborn, and matplotlib. This work was conducted as part of an independent research project in molecular biology and bioinformatics.

    License
    This dataset is shared under the CC BY 4.0 License, allowing others to share and adapt the material as long as proper attribution is given. DOI: 10.6084/m9.figshare.28736501
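    For illustration, the distance and heatmap steps could be sketched as follows; the file name and column names ("cpg_data.csv", "virus", "region", "cpg_count", "oe_ratio") are placeholders rather than the actual names used in this dataset.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from scipy.spatial.distance import euclidean

    # Cleaning: cap infinities at a large finite value, fill NaN with column means
    df = pd.read_csv("cpg_data.csv").replace([np.inf, -np.inf], 1e9)
    df = df.fillna(df.mean(numeric_only=True))

    # Reshape into virus x intergenic-region matrices
    counts = df.pivot(index="virus", columns="region", values="cpg_count")
    ratios = df.pivot(index="virus", columns="region", values="oe_ratio")

    # Total Euclidean distance (CpG counts + O/E ratios) to Wuhan-Hu-1
    ref = "Wuhan-Hu-1"
    dist = {v: euclidean(counts.loc[ref], counts.loc[v]) + euclidean(ratios.loc[ref], ratios.loc[v])
            for v in counts.index if v != ref}
    closest, farthest = min(dist, key=dist.get), max(dist, key=dist.get)
    print(closest, farthest)

    # Annotated heatmap with a coolwarm gradient
    sns.heatmap(counts.loc[[ref, closest, farthest]], annot=True, cmap="coolwarm")
    plt.title("CpG counts: Wuhan-Hu-1 vs. closest and most distant relatives")
    plt.tight_layout()
    plt.savefig("cpg_counts_heatmap.png")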

  14. The Device Activity Report with Complete Knowledge (DARCK) for NILM

    • zenodo.org
    bin, xz
    Updated Sep 19, 2025
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2025). The Device Activity Report with Complete Knowledge (DARCK) for NILM [Dataset]. http://doi.org/10.5281/zenodo.17159850
    Explore at:
    bin, xz
    Dataset updated
    Sep 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. Abstract

    This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.

    2. Dataset Overview

    • Apartment: Two-person apartment, approx. 58m², located in Aachen, Germany.
    • Aggregate Meter: eBZ DD3
    • Sub-meters: 31 Shelly Plus Plug S, 6 Shelly Plus 1PM, 3 Shelly Plus PM Mini Gen3
    • Sampling Rate: 1 Hz
    • Measured Quantity: Active Power
    • Unit of Measurement: Watt
    • Duration: 6 months
    • Format: Single CSV file (`DARCK.csv`)
    • Structure: Timestamped rows with columns for the aggregate meter and each sub-metered appliance.
    • Completeness: The main power meter has a completeness of 99.3%. Missing values were linearly interpolated.

    3. Download and Usage

    The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850

    As it contains long off periods with zeros, the CSV file compresses well.

    To extract it, use: xz -d DARCK.csv.xz.
    The compression leads to a 97% smaller file size (from 4 GB to 90.9 MB).


    To use the dataset in Python, you can, e.g., load the CSV file into a pandas DataFrame:

    import pandas as pd

    df = pd.read_csv("DARCK.csv", parse_dates=["time"])

    4. Measurement Setup

    The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.

    5. File Format (DARCK.csv)

    The dataset is provided as a single comma-separated value (CSV) file.

    • The first row is a header containing the column names.
    • All power values are rounded to the first decimal place.
    • There are no missing values in the final dataset.
    • Each row represents 1 second, from start of measuring in March until the end in September.

    Column Descriptions

    Column Name | Data Type | Unit | Description

    time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS
    main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel.
    [appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list.

    Aggregate Columns
    aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger.
    aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2.
    aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap.

    Analysis Columns
    inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for.

    6. Data Postprocessing Pipeline

    The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.

    6.1. Main Meter (main) Postprocessing

    The aggregate power data required several cleaning steps to ensure accuracy.

    1. Outlier Removal: Readings below 10W or above 10,000W were removed (merely 3 occurrences).
    2. Timestamp Burst Correction: The source data contained bursts of delayed readings. A custom algorithm was used to identify these bursts (large time gap followed by rapid readings) and back-fill the timestamps to create an evenly spaced time series.
    3. Alignment & Interpolation: The smart meter pushes a new value via infrared every second. To align those to the whole seconds, it was resampled to a 1-second frequency by taking the mean of all readings within each second (in 99.5% only 1 value). Any resulting gaps (0.7% outage ratio) were filled using linear interpolation.

    6.2. Sub-metered Devices (shellies) Postprocessing

    The Shelly devices are not prone to the same burst issue as the ESP8266 is. They push a new reading at every change in power drawn. If no power change is observed or the one observed is too small (less than a few Watt), the reading is pushed once a minute, together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.

    1. Grouping: Data was grouped by the unique device identifier.
    2. Resampling & Filling: The data for each device was resampled to a 1-second frequency using .resample('1s').last().ffill().
      This method was chosen to firstly, capture the last known state of the device within each second, handling rapid on/off events. Secondly, to forward-fill the last state across periods of no new data, modeling that the device's consumption remained constant until a new reading was sent.
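    A toy illustration of this resampling step (the timestamps and power values below are made up):

    import pandas as pd

    readings = pd.Series(
        [0.0, 35.2, 0.0],
        index=pd.to_datetime([
            "2025-03-05 12:00:00.400",  # heartbeat
            "2025-03-05 12:00:02.750",  # device turns on (sub-second timestamp)
            "2025-03-05 12:00:05.100",  # device turns off
        ]),
    )

    # Last known state within each second, forward-filled until the next reading
    per_second = readings.resample('1s').last().ffill()
    print(per_second)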

    6.3. Merging and Finalization

    1. Merge: The cleaned main meter and all sub-metered device dataframes were merged into a single dataframe on the time index.
    2. Final Fill: Any remaining NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption.

    7. Manual Corrections and Known Data Issues

    During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.

    1. March 10th - Unmetered Bulb: An unmetered 107W bulb was active. Its consumption was subtracted from the main reading as if the event had never happened.
    2. May 31st - Unmetered Air Pump: An unmetered 101W pump for an air mattress was plugged directly into an outlet with no intermediary metering plug; its consumption was hence manually added to the respective plug column.

    8. Appliance Details and Multipurpose Plugs

    The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.

  15. Python code for the estimation of missing prices in real-estate market with...

    • data.mendeley.com
    Updated Sep 17, 2017
    + more versions
    Cite
    Iván García-Magariño (2017). Python code for the estimation of missing prices in real-estate market with a dataset of house prices from the center of Teruel city [Dataset]. http://doi.org/10.17632/mxpgf54czz.1
    Explore at:
    Dataset updated
    Sep 17, 2017
    Authors
    Iván García-Magariño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Teruel
    Description

    This research data file contains the necessary software and the dataset for estimating the missing prices of house units. This approach combines several machine learning techniques (linear regression, support vector regression, k-nearest neighbors and a multi-layer perceptron neural network) with several dimensionality reduction techniques (non-negative factorization, recursive feature elimination and feature selection with a variance threshold). It includes the input dataset formed from the house prices available in the center of Teruel city (Spain) on December 30, 2016, from the Idealista website.

    This dataset supports the research of the authors in the improvement of the setup of agent-based simulations about real-estate market. The work about this dataset has been submitted for consideration for publication to a scientific journal.

    The open-source Python code is composed of all the files with the “.py” extension. The main program can be executed from the “main.py” file. The “boxplotErrors.eps” file is a chart generated from the execution of the code; it compares the results of the different combinations of machine learning techniques and dimensionality reduction methods.

    The dataset is in the “data” folder. The input raw data of the house prices are in the “dataRaw.csv” file. These were shuffled into the “dataShuffled.csv” file. We used cross-validation to obtain the estimations of house prices. The outputted estimations alongside the real values are stored in different files of the “data” folder, where each filename is composed of the machine learning technique abbreviation and the dimensionality reduction method abbreviation.
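    As an illustration of one such combination (not the authors' exact code), support vector regression can be paired with recursive feature elimination in a scikit-learn pipeline; the target column name "price" is a placeholder.

    import pandas as pd
    from sklearn.svm import SVR
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("data/dataShuffled.csv")
    X, y = data.drop(columns=["price"]), data["price"]  # "price" is a placeholder name

    pipeline = Pipeline([
        ("rfe", RFE(estimator=LinearRegression(), n_features_to_select=5)),
        ("svr", SVR()),
    ])
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(-scores.mean())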

  16. AMAS: a fast tool for large alignment manipulation and computing of summary...

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Jan 19, 2017
    Cite
    Marek L. Borowiec (2017). AMAS: a fast tool for large alignment manipulation and computing of summary statistics [Dataset]. http://doi.org/10.5061/dryad.p2q52
    Explore at:
    zip
    Dataset updated
    Jan 19, 2017
    Dataset provided by
    University of California, Davis
    Authors
    Marek L. Borowiec
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The amount of data used in phylogenetics has grown explosively in the recent years and many phylogenies are inferred with hundreds or even thousands of loci and many taxa. These modern phylogenomic studies often entail separate analyses of each of the loci in addition to multiple analyses of subsets of genes or concatenated sequences. Computationally efficient tools for handling and computing properties of thousands of single-locus or large concatenated alignments are needed. Here I present AMAS (Alignment Manipulation And Summary), a tool that can be used either as a stand-alone command-line utility or as a Python package. AMAS works on amino acid and nucleotide alignments and combines capabilities of sequence manipulation with a function that calculates basic statistics. The manipulation functions include conversions among popular formats, concatenation, extracting sites and splitting according to a pre-defined partitioning scheme, creation of replicate data sets, and removal of taxa. The statistics calculated include the number of taxa, alignment length, total count of matrix cells, overall number of undetermined characters, percent of missing data, AT and GC contents (for DNA alignments), count and proportion of variable sites, count and proportion of parsimony informative sites, and counts of all characters relevant for a nucleotide or amino acid alphabet. AMAS is particularly suitable for very large alignments with hundreds of taxa and thousands of loci. It is computationally efficient, utilizes parallel processing, and performs better at concatenation than other popular tools. AMAS is a Python 3 program that relies solely on Python’s core modules and needs no additional dependencies. AMAS source code and manual can be downloaded from http://github.com/marekborowiec/AMAS/ under GNU General Public License.

  17. Machine learning to predict mesenchymal stem cell efficacy for cartilage...

    • repository.cam.ac.uk
    bin, ods
    Updated Sep 24, 2020
    Cite
    Liu, Yu Yang Fredrik; Lu, Yin; Oh, Steve; Conduit, Gareth (2020). Machine learning to predict mesenchymal stem cell efficacy for cartilage repair [Dataset]. http://doi.org/10.17863/CAM.52036
    Explore at:
    bin (169108 bytes), bin (2855 bytes), ods (132585 bytes), bin (2707 bytes), bin (5176 bytes)
    Dataset updated
    Sep 24, 2020
    Dataset provided by
    University of Cambridge
    Apollo
    Authors
    Liu, Yu Yang Fredrik; Lu, Yin; Oh, Steve; Conduit, Gareth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We collected data from 36 published articles on PubMed [13–48] to train and validate our machine learning models. Some articles comprised more than one type of cartilage injury model or treatment condition. In total, 15 clinical trial conditions and 29 animal model conditions (1 goat, 6 pigs, 2 dogs, 9 rabbits, 9 rats, and 2 mice) on osteochondral injury or osteoarthritis were included, where MSCs were transplanted to repair the cartilage tissue. We documented each case with a specific treatment condition as an entry by considering the cell- and treatment target-related factors as input properties, including species, body weight, tissue source, cell number, cell concentration, defect area, defect depth, and type of cartilage damage. The therapeutic outcomes were considered as output properties, which were evaluated using integrated clinical and histological cartilage repair scores, including the international cartilage repair society (ICRS) scoring system, the O’Driscoll score, the Pineda score, the Mankin score, the osteoarthritis research society international (OARSI) scoring system, the international knee documentation committee (IKDC) score, the visual analog score (VAS) for pain, the knee injury and osteoarthritis outcome score (KOOS), the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and the Lysholm score. In this study, these scores were linearly normalized to a number between 0 and 1, with 0 representing the worst damage or pain, and 1 representing the completely healthy tissue. The list of entries was combined to form a database.

    We have provided the details of the imputation algorithm in the subsection Handling missing data under Methods, and a flowchart in Fig 2 (Data imputation algorithm for the vector x) was added in the manuscript for illustration. The pseudo-code for the uncertainty calculation is shown in S1 Algorithm: An ensemble model to measure the ANN's prediction uncertainty. The original database gathered from the literature and a ‘complete’ database with missing information filled in by our neural network are also included, along with a sample neural network architecture file in Python.
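    A minimal sketch of the ensemble idea (train several networks on bootstrap resamples and use the spread of their predictions as the uncertainty); this mirrors the spirit of S1 Algorithm but is not the authors' implementation and uses scikit-learn instead of their architecture file.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def ensemble_predict(X_train, y_train, X_new, n_models=10, seed=0):
        rng = np.random.default_rng(seed)
        preds = []
        for _ in range(n_models):
            idx = rng.integers(0, len(X_train), len(X_train))  # bootstrap resample
            model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000)
            model.fit(X_train[idx], y_train[idx])
            preds.append(model.predict(X_new))
        preds = np.array(preds)
        return preds.mean(axis=0), preds.std(axis=0)  # prediction and its uncertainty

    # Toy data standing in for the normalized (0-1) cartilage repair scores
    X, y = np.random.rand(40, 8), np.random.rand(40)
    mean, std = ensemble_predict(X, y, X[:5])
    print(mean, std)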

    Here we provide a Python notebook comprising a neural network that delivers the performance and results described in the manuscript. Documentation in the form of comments and installation guide is included in the Python notebook. This Python notebook along with the methods described in the manuscript provides sufficient details for other interested readers to either extend this script or write their own scripts and reproduce the results in the paper.

  18. Customer Sale Dataset for Data Visualization

    • kaggle.com
    Updated Jun 6, 2025
    Cite
    Atul (2025). Customer Sale Dataset for Data Visualization [Dataset]. https://www.kaggle.com/datasets/atulkgoyl/customer-sale-dataset-for-visualization
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Atul
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This synthetic dataset is designed specifically for practicing data visualization and exploratory data analysis (EDA) using popular Python libraries like Seaborn, Matplotlib, and Pandas.

    Unlike most public datasets, this one includes a diverse mix of column types:

    📅 Date columns (for time series and trend plots)
    🔢 Numerical columns (for histograms, boxplots, scatter plots)
    🏷️ Categorical columns (for bar charts, group analysis)

    Whether you are a beginner learning how to visualize data or an intermediate user testing new charting techniques, this dataset offers a versatile playground.

    Feel free to:

    • Create EDA notebooks
    • Practice plotting techniques
    • Experiment with filtering, grouping, and aggregations

    🛠️ No missing values, no data cleaning needed — just download and start exploring!
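    A quick-start plotting sketch; the file name and the column names ("order_date", "amount", "category") are hypothetical, so substitute the actual columns after loading.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("customer_sales.csv", parse_dates=["order_date"])

    sns.histplot(df["amount"])                      # numerical column: distribution
    plt.show()

    sns.boxplot(data=df, x="category", y="amount")  # categorical vs. numerical
    plt.show()

    df.set_index("order_date")["amount"].resample("M").sum().plot()  # monthly trend
    plt.show()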

    Hope you find this helpful. Looking forward to hearing from you all.

  19. Classicmodels

    • kaggle.com
    zip
    Updated Dec 15, 2024
    Cite
    Javier Landaeta (2024). Classicmodels [Dataset]. https://www.kaggle.com/datasets/javierlandaeta/classicmodels
    Explore at:
    zip (65751 bytes)
    Dataset updated
    Dec 15, 2024
    Authors
    Javier Landaeta
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Abstract This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.

    The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.

    Methodology 1. Data Extraction:

    • A connection is established with the PostgreSQL database to extract the relevant data from the orders, orderdetails, customers, products and employees tables.
    • A reusable function is created to read each table and load it into a Pandas DataFrame.
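    A sketch of that extraction step could look like the following; the connection string is a placeholder, and the table list follows the description above.

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string for the classicmodels PostgreSQL database
    engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/classicmodels")

    def read_table(table_name):
        """Reusable helper: load one table into a pandas DataFrame."""
        return pd.read_sql_table(table_name, con=engine)

    tables = ["orders", "orderdetails", "customers", "products", "employees"]
    frames = {name: read_table(name) for name in tables}
    print({name: df.shape for name, df in frames.items()})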

    2. Data Cleansing and Transformation:

    • An exploratory analysis of the data is performed to identify missing values, inconsistencies, and outliers.
    • New variables are calculated, such as the total value of each sale, cost, and profit.
    • Different DataFrames are joined using primary and foreign keys to obtain a complete view of sales.

    3. Exploratory Data Analysis (EDA):

    • Key metrics such as total sales, number of unique customers, and average order value are calculated.
    • Data is grouped by different dimensions (products, customers, dates) to identify patterns and trends.
    • Results are visualized using relevant graphics (histograms, bar charts, etc.).

    4. Modeling and Prediction:

    • Although the main focus of the project is descriptive, predictive modeling techniques (e.g., time series) could be explored to forecast future sales.

    5. Report Generation:

    • Detailed reports are created in Pandas DataFrames format that answer specific business questions.
    • These reports are stored in new PostgreSQL tables for further analysis and visualization.

    Results - Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified. - Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified. - Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

    Conclusions This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

    Technologies Used - Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn - Database: PostgreSQL - Tools: Jupyter Notebook - Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence

  20. Shiny Seshat scrubbing code.

    • plos.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Lux Miranda; Jacob Freeman (2023). Shiny Seshat scrubbing code. [Dataset]. http://doi.org/10.1371/journal.pone.0232609.s004
    Explore at:
    zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lux Miranda; Jacob Freeman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We include our Python 3 program that begins with the original, untampered Seshat database and performs the entire process of turning it into Shiny Seshat (including all error correction, Complexity Characteristic creation, imputation of missing values, etc.). (ZIP)
