7 datasets found
  1. Comprehensive dataset and Python toolkit for housing market analysis in...

    • search.dataone.org
    Updated Oct 29, 2025
    Cite
    Li, Kingston (2025). Comprehensive dataset and Python toolkit for housing market analysis in Mercer County, NJ [Dataset]. http://doi.org/10.7910/DVN/LYRDHG
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Li, Kingston
    Area covered
    Mercer County, New Jersey
    Description

    This project combines data extraction, predictive modeling, and geospatial mapping to analyze housing trends in Mercer County, New Jersey. It consists of three core components:

    • Census Data Extraction: Gathers U.S. Census data (2012–2022) on median house value, household income, and racial demographics for all census tracts in the county. It accounts for changes in census tract boundaries between 2010 and 2020 by approximating values for newly defined tracts.
    • House Value Prediction: Uses an LSTM model with k-fold cross-validation to forecast median house values through 2025. Multiple feature combinations and sequence lengths are tested to optimize prediction accuracy, with the final model selected based on MSE and MAE scores.
    • Data Mapping: Visualizes historical and predicted housing data using GeoJSON files from the TIGERWeb API. It generates interactive maps showing raw values, changes over time, and percent differences, with customization options to handle outliers and improve interpretability.

    This modular workflow can be adapted to other regions by changing the input FIPS codes and feature selections.
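
The evaluation loop described above (fixed-length input sequences, k-fold splits, MSE and MAE per fold) can be sketched as follows. This is a minimal illustration and not the author's code: the house values are made up, the function names are hypothetical, and `naive_forecast` is a simple placeholder standing in for the LSTM so that the example runs with the standard library only.

```python
from statistics import mean

def make_windows(series, seq_len):
    """Turn one time series into (input window, next value) pairs."""
    return [(series[i:i + seq_len], series[i + seq_len])
            for i in range(len(series) - seq_len)]

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def mse(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def mae(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def naive_forecast(window):
    """Placeholder model (NOT an LSTM): repeat the last observed step."""
    return window[-1] + (window[-1] - window[-2])

# Hypothetical median house values for one tract, 2012-2022 (made-up numbers).
values = [200, 204, 209, 215, 222, 230, 239, 249, 260, 272, 285]
samples = make_windows(values, seq_len=3)

fold_scores = []
for train_idx, val_idx in k_fold_indices(len(samples), k=4):
    # A real model would be fitted on the samples in train_idx here.
    y_true = [samples[i][1] for i in val_idx]
    y_pred = [naive_forecast(samples[i][0]) for i in val_idx]
    fold_scores.append((mse(y_true, y_pred), mae(y_true, y_pred)))

avg_mse = mean(score[0] for score in fold_scores)
avg_mae = mean(score[1] for score in fold_scores)
print(f"mean MSE={avg_mse:.2f}, mean MAE={avg_mae:.2f}")
```

Repeating this loop for each candidate sequence length and feature set, then keeping the configuration with the lowest averaged errors, is one plausible reading of the selection step the description mentions.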

  2. Student Performance and Attendance Dataset

    • kaggle.com
    zip
    Updated Mar 10, 2025
    Cite
    Marvy Ayman Halim (2025). Student Performance and Attendance Dataset [Dataset]. https://www.kaggle.com/datasets/marvyaymanhalim/student-performance-and-attendance-dataset
    Available download formats: zip (5849540 bytes)
    Dataset updated
    Mar 10, 2025
    Authors
    Marvy Ayman Halim
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Description: This synthetic dataset is designed to help beginners and intermediate learners practice data cleaning and analysis in a realistic setting. It simulates a student tracking system, covering key areas like:

    • Attendance tracking
    • Homework completion
    • Exam performance
    • Parent-teacher communication

    Why Use This Dataset?

    While many datasets are pre-cleaned, real-world data is often messy. This dataset includes intentional errors to help you develop essential data cleaning skills before diving into analysis. It’s perfect for building confidence in handling raw data!

    Cleaning Challenges You’ll Tackle

    This dataset is packed with real-world issues, including:

    • Messy data: Names in lowercase, typos in attendance status.
    • Inconsistent date formats: Mix of MM/DD/YYYY and YYYY-MM-DD.
    • Incorrect values: Homework completion rates in mixed formats (e.g., 80% and 90).
    • Missing data: Guardian signatures, teacher comments, and emergency contacts.
    • Outliers: Exam scores over 100 and negative homework completion rates.

    Your Task: Clean, structure, and analyze this dataset using Python or SQL to uncover meaningful insights!

    5. Handle Outliers

    • Remove exam scores above 100.
    • Convert homework completion rates to consistent percentages.
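
A few of the cleaning tasks listed above (mixed date formats, mixed percentage formats, out-of-range scores, lowercase names) might be handled along these lines. This is only a sketch: the column names and sample rows are invented for illustration and are not taken from the dataset, and the helper functions are hypothetical.

```python
from datetime import datetime

def parse_date(text):
    """Parse MM/DD/YYYY or YYYY-MM-DD and return an ISO date string."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unrecognized format; leave for manual review

def parse_rate(text):
    """Convert '80%' or '90' to a float; reject values outside 0-100."""
    try:
        value = float(str(text).strip().rstrip("%"))
    except ValueError:
        return None
    return value if 0 <= value <= 100 else None

def clean_score(score):
    """Treat exam scores outside 0-100 as invalid."""
    return score if 0 <= score <= 100 else None

# Hypothetical rows; the column names are illustrative only.
rows = [
    {"name": "jane doe", "date": "03/10/2025", "homework": "80%", "exam": 92},
    {"name": "OMAR ali", "date": "2025-03-11", "homework": "90", "exam": 104},
    {"name": "li wei", "date": "2025-03-12", "homework": "-15", "exam": 77},
]

cleaned = [
    {
        "name": r["name"].title(),
        "date": parse_date(r["date"]),
        "homework": parse_rate(r["homework"]),
        "exam": clean_score(r["exam"]),
    }
    for r in rows
]

# Drop rows whose exam score was flagged as invalid.
valid = [r for r in cleaned if r["exam"] is not None]
```

Flagging bad values as `None` first and filtering afterwards keeps the decision to drop or impute separate from the parsing itself.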

    6. Generate Insights & Visualizations

    • What’s the average attendance rate per grade?
    • Which subjects have the highest performance?
    • What are the most common topics in parent-teacher communication?

  3. Python Script for Cleaning Alum Dataset

    • search.dataone.org
    • hydroshare.org
    Updated Oct 18, 2025
    Cite
    saikumar payyavula; Jeff Sadler (2025). Python Script for Cleaning Alum Dataset [Dataset]. https://search.dataone.org/view/sha256%3A9df1a010044e2d50d741d5671b755351813450f4331dd7b0cc2f0a527750b30e
    Dataset updated
    Oct 18, 2025
    Dataset provided by
    Hydroshare
    Authors
    saikumar payyavula; Jeff Sadler
    Description

    This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
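
As a rough idea of what merging water quality and weather records and filling missing values can look like, here is a small standard-library sketch. It is not the script from this resource: the field names, dates, and values are all hypothetical, and median imputation is just one of several strategies such a script might use.

```python
from statistics import median

# Hypothetical daily records keyed by date; field names are illustrative only.
water_quality = {
    "2024-01-01": {"turbidity": 4.0, "alum_dose": 22.0},
    "2024-01-02": {"turbidity": None, "alum_dose": 25.0},
    "2024-01-03": {"turbidity": 7.5, "alum_dose": 31.0},
    "2024-01-04": {"turbidity": 6.0, "alum_dose": 28.0},
}
weather = {
    "2024-01-01": {"rain_mm": 0.0},
    "2024-01-02": {"rain_mm": 12.5},
    "2024-01-04": {"rain_mm": 1.0},
}

# Inner join on date: keep only days present in both sources.
merged = [
    {"date": d, **water_quality[d], **weather[d]}
    for d in sorted(water_quality.keys() & weather.keys())
]

def impute_median(rows, field):
    """Replace missing values in `field` with the median of the observed ones."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    for r in rows:
        if r[field] is None:
            r[field] = fill
    return rows

merged = impute_median(merged, "turbidity")
```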

  4. Insurance_claims

    • kaggle.com
    • data.mendeley.com
    zip
    Updated Oct 19, 2025
    Cite
    Miannotti (2025). Insurance_claims [Dataset]. https://www.kaggle.com/datasets/mian91218/insurance-claims
    Available download formats: zip (68984 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    Miannotti
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    AQQAD, ABDELRAHIM (2023), "insurance_claims", Mendeley Data, V2, doi: 10.17632/992mh7dk9y.2

    https://data.mendeley.com/datasets/992mh7dk9y/2

    Latest version: Version 2. Published: 22 Aug 2023. DOI: 10.17632/992mh7dk9y.2

    Data Acquisition:
    • Obtain the dataset titled "Insurance_claims" from the following Mendeley repository: https://data.mendeley.com/drafts/992mh7dk9y
    • Download and store the dataset locally for easy access during subsequent steps.

    Data Loading & Initial Exploration:
    • Use Python's Pandas library to load the dataset into a DataFrame. Python code used:

        import pandas as pd

        # Load the dataset file
        insurance_df = pd.read_csv('insurance_claims.csv')

    • Inspect the initial rows, data types, and summary statistics to get an understanding of the dataset's structure.

    Data Cleaning & Pre-processing:
    • Handle missing values, if any. Strategies may include imputation or deletion based on the nature of the missing data.
    • Identify and handle outliers. In this research, particularly, outliers in the 'umbrella_limit' column were addressed.
    • Normalize or standardize features if necessary.

    Exploratory Data Analysis (EDA):
    • Utilize visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration.
    • Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'.
    • Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.

    Feature Engineering & Selection:
    • Create or transform existing features to improve model performance.
    • Use techniques like Recursive Feature Elimination (RFECV) to identify and retain only the most informative features.

    Modeling:
    • Split the dataset into training and test sets to ensure the model's generalizability.
    • Implement machine learning algorithms such as Support Vector Machine, RandomForest, and Voting Classifier using libraries like Scikit-learn.
    • Handle class imbalance issues using methods like Synthetic Minority Over-sampling Technique (SMOTE).

    Model Evaluation:
    • Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and confusion matrix.
    • Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.
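
To make the evaluation metrics concrete, the sketch below derives precision, recall, and F1 from confusion-matrix counts in plain Python. The labels are made up, and encoding fraud as 1 and non-fraud as 0 is an assumption for this example; in practice the equivalent Scikit-learn metric functions would normally be used instead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1; return 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = fraud reported, 0 = no fraud (assumed encoding).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

With imbalanced fraud data, these per-class metrics are usually more informative than plain accuracy, which is presumably why the workflow lists them.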

    Model Interpretation:
    • Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.

    Deployment & Prediction:
    • Utilize the best-performing model to make predictions on unseen data.
    • If the intention is to deploy the model in a real-world scenario, convert the trained model into a format suitable for deployment (e.g., using libraries like joblib or pickle).

    Software & Tools:
    • Programming Language: Python (version: GoogleColab)
    • Libraries: Pandas, Numpy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP.
    • Environment: Jupyter Notebook or any Python IDE.

  5. Metabolomics Data Preprocessing PQN PCA

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). Metabolomics Data Preprocessing PQN PCA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/metabolomics-data-preprocessing-pqn-pca
    Available download formats: zip (22763 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset provides a step-by-step pipeline for preprocessing metabolomics data.

    The pipeline implements Probabilistic Quotient Normalization (PQN) to correct dilution effects in metabolomics measurements.

    Includes guidance on handling raw metabolomics datasets obtained from LC-MS or NMR experiments.

    Demonstrates Principal Component Analysis (PCA) for dimensionality reduction and exploratory data analysis.

    Includes data visualization techniques to interpret PCA results effectively.

    Suitable for metabolomics researchers and data scientists working on omics data.

    Enables better reproducibility of preprocessing workflows for metabolomics studies.

    Can be used to normalize data, detect outliers, and identify major patterns in metabolomics datasets.

    Provides a Python-based notebook that is easy to adapt to new datasets.

    Includes example datasets and code snippets for immediate application.

    Helps users understand the impact of normalization on downstream statistical analyses.

    Supports integration with other metabolomics pipelines or machine learning workflows.
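
For readers unfamiliar with PQN, the core idea can be sketched in a few lines: build a reference profile from per-feature medians, compute each sample's quotients against it, and divide the sample by the median quotient. The code below is an illustrative sketch and not the dataset's notebook. It uses invented intensities, omits the total-intensity pre-normalization that is often applied first, and assumes every sample has a non-zero median quotient.

```python
from statistics import median

def pqn_normalize(samples):
    """Probabilistic Quotient Normalization for equal-length intensity rows."""
    n_features = len(samples[0])
    # Reference profile: per-feature median across all samples.
    reference = [median(row[j] for row in samples) for j in range(n_features)]
    normalized = []
    for row in samples:
        # Quotients against the reference, skipping features with a zero reference.
        quotients = [row[j] / reference[j]
                     for j in range(n_features) if reference[j] != 0]
        dilution = median(quotients)  # most probable dilution factor
        normalized.append([x / dilution for x in row])
    return normalized

# Hypothetical intensities: sample 2 is twice as concentrated as sample 1 and
# has one genuinely elevated metabolite; sample 3 is diluted by half.
samples = [
    [10, 20, 30, 40],
    [20, 40, 60, 200],
    [5, 10, 15, 20],
]
normalized = pqn_normalize(samples)
for row in normalized:
    print(row)
```

Because the dilution factor is a median of quotients, the single elevated metabolite in the second sample does not distort the correction, which is the main advantage of PQN over scaling by total intensity.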

  6. Socio-demographic and economic characteristics of respondents.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 17, 2023
    Cite
    Shimels Derso Kebede; Daniel Niguse Mamo; Jibril Bashir Adem; Birhan Ewunu Semagn; Agmasie Damtew Walle (2023). Socio-demographic and economic characteristics of respondents. [Dataset]. http://doi.org/10.1371/journal.pdig.0000345.t001
    Available download formats: xls
    Dataset updated
    Oct 17, 2023
    Dataset provided by
    PLOS Digital Health
    Authors
    Shimels Derso Kebede; Daniel Niguse Mamo; Jibril Bashir Adem; Birhan Ewunu Semagn; Agmasie Damtew Walle
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Socio-demographic and economic characteristics of respondents.

  7. INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT

    • kaggle.com
    zip
    Updated Nov 23, 2025
    Cite
    Bimal Kumar Saini (2025). INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT [Dataset]. https://www.kaggle.com/datasets/bimalkumarsaini/india-electricity-and-energy-analysis-project
    Available download formats: zip (4986654 bytes)
    Dataset updated
    Nov 23, 2025
    Authors
    Bimal Kumar Saini
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Area covered
    India
    Description

    INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT

    This repository presents an extensive data engineering, cleaning, and analytical study on India’s electricity ecosystem using Python. The project covers coal stock status, thermal power generation, renewable energy trends, energy requirements & availability, and installed capacity across states.

    The goal is to identify operational bottlenecks, resource deficits, energy trends, and support data-driven decisions in the power sector.

    Electricity Data Insights & System Analysis

    The project leverages five government datasets:

    • Daily Coal Stock Data
    • Daily Power Generation
    • Renewable Energy Production
    • State-wise Energy Requirement vs Availability
    • Installed Capacity Across Fuel Types

    The final analysis includes EDA, heatmaps, trend analysis, outlier detection, data-cleaning automation, and visual summaries.

    Key Features

    1. Comprehensive Data Cleaning Pipeline

    • Null value treatment using median/mode strategies
    • Standardizing categorical inconsistencies
    • Filling missing regions, states, and production values
    • Date format standardization
    • Removing duplicates across all datasets
    • Large-scale outlier detection using custom 5×IQR logic (to preserve real-world operational variance)
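
The 5×IQR rule mentioned in the last item widens the usual 1.5×IQR fences so that only extreme readings are flagged. A minimal sketch of that idea follows; the function names, the quartile method, and the sample values are assumptions for illustration, not the project's actual code.

```python
from statistics import quantiles

def iqr_bounds(values, k=5.0):
    """Return (lower, upper) fences at k times the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=5.0):
    """List the values falling outside the k*IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical daily coal stock readings (made-up numbers).
stock_days = [10, 12, 11, 13, 12, 14, 11, 13, 20, 500]

outliers_wide = find_outliers(stock_days, k=5.0)      # permissive 5*IQR rule
outliers_standard = find_outliers(stock_days, k=1.5)  # conventional 1.5*IQR rule
print(outliers_wide, outliers_standard)
```

In this toy series the wider fences keep the reading of 20, which could be a genuine operational spike, while still flagging 500 as an error-sized value.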

    2. Exploratory Data Analysis (EDA)

    Includes:

    • Coal stock trends over years
    • Daily power generation patterns
    • Solar, wind, and renewable growth
    • State-wise energy shortage & surplus
    • Installed capacity distribution across India
    • Correlation maps for all major datasets

    3. Trend Visualizations

    • Coal Stock Time-Series
    • Thermal Power Daily Output
    • Solar & Wind Contribution Over Time
    • State-wise Energy Deficit Bar Chart
    • MOM Energy Requirement Heatmap
    • Installed Capacity Share of Each State

    Dashboard & Analysis Components

    Section                 Description
    Coal Stock Dashboard    Daily stock, consumption, transport mode, critical plants
    Power Generation        Capacity, planned vs actual generation
    Renewable Mix           Solar, wind, hydro & total RE contributions
    Energy Shortfall        Requirement vs availability across states
    Installed Capacity      Coal, Gas, Hydro, Nuclear & RES capacity stacks

    Insights & Findings

    Coal Stock

    • Critical coal stock days observed for multiple stations
    • Seasonal dips in stock days & indigenous supply shocks
    • Import dependency minimal but volatile

    Power Generation

    • Thermal stations show fluctuating PLF (Plant Load Factor)
    • Many states underperform planned generation

    Renewable Energy

    • Solar shows continuous year-over-year growth
    • Wind output peaks around monsoon months

    Energy Requirement vs Availability

    • States like Delhi, Bihar, Jharkhand show intermittent deficits
    • MOM heatmap highlights major seasonal spikes

    Installed Capacity

    • Southern & Western regions dominate national capacity
    • Coal remains the largest but renewable share rising rapidly

    ๐Ÿ“ Files in This Repository File Description coal_stock.csv Cleaned coal stock dataset power_gen.csv Daily power generation data renewable_engy.csv State-wise renewable energy dataset engy_reqmt.csv Monthly requirement & availability dataset install_cpty.csv Installed capacity across fuel types electricity.ipynb Full Python EDA notebook electricity.pdf Export of full Colab notebook (code + visuals) README.md GitHub project summary

    ๐Ÿ› ๏ธ Technologies Used ๐Ÿ“Š Data Analysis

    Python (Pandas, NumPy, Matplotlib, Seaborn)

    ๐Ÿงน Data Cleaning

    Null Imputation

    Outlier Detection (5ร—IQR)

    Standardization & Encoding

    Handling Large Multi-year Datasets

    ๐Ÿ”ง System Concepts

    Modular Python Code

    Data Pipelines & Feature Engineering

    Version Control (Git/GitHub)

    Cloud Concepts (Google Colab + Drive Integration)

    ๐Ÿ“ˆ Core Metrics & KPIs

    Total Stock Days

    PLF% (Plant Load Factor)

    Renewable Energy Contribution

    Energy Deficit (%)

    National Installed Capacity Share
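
Two of the KPIs above have widely used textbook definitions, sketched here for orientation. Whether the project's notebook computes them exactly this way is an assumption, and the function names and example figures are hypothetical.

```python
def plant_load_factor(energy_mwh, capacity_mw, hours):
    """PLF %: actual generation relative to the maximum possible at full capacity."""
    return 100.0 * energy_mwh / (capacity_mw * hours)

def energy_deficit_pct(requirement, availability):
    """Deficit %: unmet requirement as a share of the total requirement."""
    return 100.0 * (requirement - availability) / requirement

# Hypothetical example: a 500 MW unit generating 8400 MWh in 24 hours.
plf = plant_load_factor(energy_mwh=8400, capacity_mw=500, hours=24)

# Hypothetical example: a state requiring 1000 MU with 950 MU available.
deficit = energy_deficit_pct(requirement=1000, availability=950)

print(f"PLF = {plf:.1f}%, deficit = {deficit:.1f}%")
```

A negative deficit under this definition indicates a surplus, which matches the shortage and surplus framing used in the EDA list.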

    Future Enhancements

    • Build a Power BI dashboard for visual storytelling
    • Integrate forecasting models (ARIMA / Prophet)
    • Automate coal shortage alerts
    • Add state-level energy prediction for seasonality
    • Deploy the analysis as a web dashboard (Streamlit)
