License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This comprehensive dataset explores the relationship between housing and weather conditions across North America in 2012. Through a range of climate variables such as temperature, wind speed, humidity, pressure and visibility, it provides unique insights into the weather-influenced environment of numerous regions. The interrelated nature of housing parameters such as longitude, latitude, median income, median house value and ocean proximity further enhances our understanding of how distinct climates play an integral part in area real estate valuations. Analyzing these two data sets offers a wealth of knowledge when it comes to understanding what factors can dictate the value and comfort level offered by residential areas throughout North America.
This dataset offers plenty of insights into the effects of weather and housing on North American regions. To explore these relationships, you can perform data analysis on the variables provided.
First, start by examining descriptive statistics (i.e., mean, median, mode). This can help show you the general trend and distribution of each variable in this dataset. For example, what is the most common temperature in a given region? What is the average wind speed? How does this vary across different regions? By looking at descriptive statistics, you can get an initial idea of how various weather conditions and housing attributes interact with one another.
Next, explore correlations between variables. Are certain weather variables correlated with specific housing attributes? Is there a link between wind speeds and median house value? Or between humidity and ocean proximity? Analyzing correlations allows for deeper insights into how different aspects may influence one another for a given region or area. These correlations may also inform broader patterns that are present across multiple North American regions or countries.
Finally, use visualizations to further investigate the relationship between climate and housing attributes in North America in 2012. Graphs allow you to visualize trends like seasonal variations or long-term changes over time more easily, so they are useful for interpreting large amounts of data quickly while providing context beyond what the numbers alone can tell us about relationships between different aspects of this dataset.
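As a concrete starting point, here is a minimal sketch of these three steps using Python's pandas library; it assumes Weather.csv (documented below) is in the working directory and uses only the columns listed in that documentation.
import pandas as pd
import matplotlib.pyplot as plt
# Load the weather table documented below (assumed to be in the working directory)
weather = pd.read_csv('Weather.csv')
# 1. Descriptive statistics (mean, spread, quartiles) for every numeric column
print(weather.describe())
print(weather['Temp_C'].mode().iloc[0])       # most common temperature
print(weather['Wind Speed_km/h'].mean())      # average wind speed
# 2. Pairwise correlations between the numeric weather variables
print(weather.corr(numeric_only=True))
# 3. A simple visualization: monthly mean temperature across 2012
weather['Date/Time'] = pd.to_datetime(weather['Date/Time'])
weather.set_index('Date/Time')['Temp_C'].resample('M').mean().plot(title='Monthly mean temperature, 2012')
plt.show()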
- Analyzing the effect of climate change on housing markets across North America. By looking at temperature and weather trends in combination with housing values, researchers can better understand how climate change may be impacting certain regions differently than others.
- Investigating the relationship between median income, house values and ocean proximity in coastal areas. Understanding how ocean proximity plays into housing prices may help inform real estate investment decisions and urban planning initiatives related to coastal development.
- Utilizing differences in weather patterns across different climates to determine optimal seasonal rental prices for property owners. By analyzing changes in temperature, wind speed, humidity, pressure and visibility from season to season, an investor could gain valuable insights into seasonal market trends to maximize profits from rentals or Airbnb listings over time.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Weather.csv
| Column name | Description |
|:---|:---|
| Date/Time | Date and time of the observation. (Date/Time) |
| Temp_C | Temperature in Celsius. (Numeric) |
| Dew Point Temp_C | Dew point temperature in Celsius. (Numeric) |
| Rel Hum_% | Relative humidity in percent. (Numeric) |
| Wind Speed_km/h | Wind speed in kilometers per hour. (Numeric) |
| Visibility_km | Visibility in kilometers. (Numeric) |
...
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.
This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.
It can be accessed through the following means: File format: R workspace file; 'Simulated_Dataset.RData'.
Metadata (including data dictionary):
- y: Vector of binary responses (1: adverse outcome, 0: control)
- x: Matrix of covariates; one row for each simulated individual
- z: Matrix of standardized pollution exposures
- n: Number of simulated individuals
- m: Number of exposure time periods (e.g., weeks of pregnancy)
- p: Number of columns in the covariate design matrix
- alpha_true: Vector of 'true' critical window locations/magnitudes (i.e., the ground truth that we want to estimate)
Code Abstract: We provide R statistical software code ('CWVS_LMC.txt') to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ('Results_Summary.txt') to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.
Description:
- 'CWVS_LMC.txt': This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the 'Simulated_Dataset.RData' workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
- 'Results_Summary.txt': This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the 'CWVS_LMC.txt' code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).
Optional Information: Required R packages:
- For running 'CWVS_LMC.txt': msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
- For running 'Results_Summary.txt': plotrix (plotting the posterior means and credible intervals)
Instructions for Use - Reproducibility (Mandatory): What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.
How to use the information:
- Load the 'Simulated_Dataset.RData' workspace.
- Run the code contained in 'CWVS_LMC.txt'.
- Once the 'CWVS_LMC.txt' code is complete, run 'Results_Summary.txt'.
Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set.
Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.
Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.
Description (Permissions): These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.
This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents the mean household income for each of the five quintiles in Lake View, AL, as reported by the U.S. Census Bureau. The dataset highlights the variation in mean household income across quintiles, offering valuable insights into income distribution and inequality.
Key observations
[Chart: Mean household income by quintiles in Lake View, AL (in 2022 inflation-adjusted dollars): https://i.neilsberg.com/ch/lake-view-al-mean-household-income-by-quintiles.jpeg]
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Income Levels:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Lake View median household income. You can refer to the same here.
This dataset provides detailed insights into daily active users (DAU) of a platform or service, captured over a defined period of time. The dataset includes information such as the number of active users per day, allowing data analysts and business intelligence teams to track usage trends, monitor platform engagement, and identify patterns in user activity over time.
The data is ideal for performing time series analysis, statistical analysis, and trend forecasting. You can utilize this dataset to measure the success of platform initiatives, evaluate user behavior, or predict future trends in engagement. It is also suitable for training machine learning models that focus on user activity prediction or anomaly detection.
The dataset is structured in a simple and easy-to-use format, containing the following columns:
Each row in the dataset represents a unique date and its corresponding number of active users. This allows for time-based analysis, such as calculating the moving average of active users, detecting seasonality, or spotting sudden spikes or drops in engagement.
This dataset can be used for a wide range of purposes, including:
Here are some specific analyses you can perform using this dataset:
To get started with this dataset, you can load it into your preferred analysis tool. Here's how to do it using Python's pandas library:
import pandas as pd
# Load the dataset
data = pd.read_csv('path_to_dataset.csv')
# Display the first few rows
print(data.head())
# Basic statistics
print(data.describe())
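Building on the loading step above, the following sketch computes a 7-day moving average and flags sudden day-over-day changes. The column names date and active_users are assumptions, since the column list itself is not reproduced here.
import pandas as pd
# Assumed column names: 'date' and 'active_users' (not confirmed by the description above)
data = pd.read_csv('path_to_dataset.csv', parse_dates=['date'])
data = data.sort_values('date').set_index('date')
# 7-day moving average to smooth out daily noise
data['dau_7d_avg'] = data['active_users'].rolling(window=7).mean()
# Flag days where DAU jumps or drops by more than 20% versus the previous day
data['pct_change'] = data['active_users'].pct_change()
spikes = data[data['pct_change'].abs() > 0.20]
print(spikes[['active_users', 'pct_change']])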
See Materials and Methods section for description of angle B. Group means shown without confidence intervals are those for which sample size is too small to derive 95% confidence intervals (n < 8). See Table 1 for institutional abbreviations.
Dataset Description: The dataset contains information about properties. Each property has a unique property ID and is associated with a location ID based on the subcategory of the city. The dataset includes the following attributes:
Property ID: Unique identifier for each property.
Location ID: Unique identifier for each location within a city.
Page URL: The URL of the webpage where the property was published.
Property Type: Categorization of the property into six types: House, FarmHouse, Upper Portion, Lower Portion, Flat, or Room.
Price: The price of the property, which is the dependent feature in this dataset.
City: The city where the property is located. The dataset includes five cities: Lahore, Karachi, Faisalabad, Rawalpindi, and Islamabad.
Province: The state or province where the city is located.
Location: Different types of locations within each city.
Latitude and Longitude: Geographic coordinates of the cities.
Steps Involved in the Analysis:
Statistical Analysis:
Data Types: Determine the data types of the attributes.
Level of Measurement: Identify the level of measurement for each attribute.
Summary Statistics: Calculate mean, standard deviation, minimum, and maximum values for numerical attributes.
Data Cleaning:
Filling Null Values: Handle missing values in the dataset.
Duplicate Values: Remove duplicate records, if any.
Correcting Data Types: Ensure the correct data types for each attribute.
Outliers Detection: Identify and handle outliers in the data.
Exploratory Data Analysis (EDA):
Visualization: Use libraries such as Seaborn, Matplotlib, and Plotly to visualize the data and gain insights.
Model Building:
Libraries: Utilize libraries like scikit-learn and pickle.
List of Models: Build models using Linear Regression, Decision Tree, Random Forest, K-Nearest Neighbors (KNN), XGBoost, Gradient Boosting, and AdaBoost.
Model Saving: Save the selected model into a pickle file for future use.
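A minimal sketch of the cleaning and modeling steps above, using pandas and scikit-learn. The CSV file name and the exact spelling of the column names are assumptions based on the attribute list; this is an illustrative workflow, not the authors' exact pipeline.
import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv('property_listings.csv')  # assumed file name
categorical = ['Property Type', 'City', 'Province', 'Location']
numeric = ['Latitude', 'Longitude']
# Basic cleaning: drop duplicates and rows with missing target or features
df = df.drop_duplicates().dropna(subset=categorical + numeric + ['Price'])
X, y = df[categorical + numeric], df['Price']
# One-hot encode the categorical attributes; pass the coordinates through unchanged
preprocess = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), categorical)],
    remainder='passthrough')
model = Pipeline([('prep', preprocess),
                  ('rf', RandomForestRegressor(n_estimators=200, random_state=0))])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print('MAE:', mean_absolute_error(y_test, model.predict(X_test)))
# Save the fitted pipeline into a pickle file, as described above
with open('property_price_model.pkl', 'wb') as f:
    pickle.dump(model, f)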
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents median household incomes for various household sizes in Ocean View, DE, as reported by the U.S. Census Bureau. The dataset highlights the variation in median household income with the size of the family unit, offering valuable insights into economic trends and disparities within different household sizes, aiding in data analysis and decision-making.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Household Sizes:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Ocean View median household income. You can refer to the same here.
This dataset was derived by the Bioregional Assessment Programme from 'Mean climate variables for all subregions' and 'fPAR derived from MODIS for BA subregions'. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.
These are charts of climate statistics and MODIS data for each BA subregion. There are six 600dpi PNG files per subregion, with the naming convention BA-[regioncode]-[subregioncode]-[chartname].png. The charts, according to their filename, are: rain (time-series of rainfall; Figure 1), P-PET (average monthly precipitation and potential evapotranspiration; Figure 2), 5line (assorted monthly statistics; Figure 3), trend (monthly long-term trends; Figure 4) and fPAR (fraction of photosynthetically available radiation - an indication of biomass; Figure 5).
This version was created on 18 November 2014, using data that accounted for a modified boundary for the Gippsland Basin bioregion and the combination of two subregions to form the Sydney Basin bioregion.
These charts were generated to be included in the Contextual Report (geography) for each subregion.
These charts were generated using MatPlotLib 1.3.0 in Python 2.7.5 (Anaconda distribution v1.7.0 32-bit).
The script for generating these plots is BA-ClimateCharts.py, and it is packaged with the dataset. This script is a data collection and chart-drawing script; it does not do any analysis. The data are charted as they appear in the parent datasets (see Lineage). A Word document (BA-ClimateGraphs-ReadMe) is also included. This document includes examples of, and approved captions for, each chart.
Bioregional Assessment Programme (2014) Charts of climate statistics and MODIS data for all Bioregional Assessment subregions. Bioregional Assessment Derived Dataset. Viewed 14 June 2018, http://data.bioregionalassessments.gov.au/dataset/8a1c5f43-b150-4357-aa25-5f301b1a02e1.
Derived From Mean climate variables for all subregions
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From fPar derived from MODIS for BA subregions
The mean and variability of the Earth's radiation budget (ERB) at the Top-of-Atmosphere (TOA) and at the surface are fundamental quantities governing climate variability, and for that reason NASA has been making concerted efforts to observe the ERB since 1984 through two projects, ERBE and CERES, that span nearly 30 years to date. The proposed project utilizes knowledge gained in the last 10 years through CERES data analyses and applies that knowledge to existing data to develop a long-term (nearly 30 years), consistent, and calibrated data product (TOA irradiances at the same radiometric scale) from multiple missions (ERBS and CERES). This project proposes to produce level 3 surface irradiance products that are consistent with observed TOA irradiances in a framework of 1D radiative transfer theory. Based on these TOA and surface irradiance products, a data product will be developed which contains the contribution of atmospheric and cloud property variability to TOA and surface irradiance variability. All algorithms used in the process are based on existing CERES algorithms. All data sets produced by this project will be available from the Atmospheric Science Data Center.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset and scripts used for manuscript: High consistency and repeatability in the breeding migrations of a benthic shark.
Project title: High consistency and repeatability in the breeding migrations of a benthic shark. Date: 23/04/2024
Folders:
- 1_Raw_data: Perpendicular_Point_068151, Sanctuary_Point_068088, SST raw data, sst_nc_files, IMOS_animal_measurements, IMOS_detections, PS&Syd&JB tags, rainfall_raw, sample_size, Point_Perpendicular_2013_2019, Sanctuary_Point_2013_2019, EAC_transport
- 2_Processed_data: SST (anomaly, historic_sst, mean_sst_31_years, week_1992_sst:week_2022_sst including week_2019_complete_sst); Rain (weekly_rain, weekly_rainfall_completed); Clean (clean, cleaned_data, cleaned_gam, cleaned_pj_data)
- 3_Script_processing_data: Plots (dual_axis_plot (Fig. 1 & Fig. 4).R, period_plot (Fig. 2).R, sd_plot (Fig. 5).R, sex_plot (Fig. 3).R); cleaned_data.R, cleaned_data_gam.R, weekly_rainfall_completed.R, descriptive_stats.R, sst.R, sst_2019b.R, sst_anomaly.R
- 4_Script_analyses: gam.R, gam_eac.R, glm.R, lme.R, Repeatability.R
- 5_Output_doc: Plots (arrival_dual_plot_with_anomaly (Fig. 1).png, period_plot (Fig. 2).png, sex_arrival_departure (Fig. 3).png, departure_dual_plot_with_anomaly (Fig. 4).png, standard deviation plot (Fig. 5).png); Tables (gam_arrival_eac_selection_table.csv (Table S2), gam_departure_eac_selection_table (Table S5), gam_arrival_selection_table (Table S3), gam_departure_selection_table (Table S6), glm_arrival_selection_table, glm_departure_selection_table, lme_arrival_anova_table, lme_arrival_selection_table (Table S4), lme_departure_anova_table, lme_departure_selection_table (Table S8))
Descriptions of scripts and files used:- cleaned_data.R: script to extract detections of sharks at Jervis Bay. Calculate arrival and departure dates over the seven breeding seasons. Add sex and length for each individual. Extract moon phase (numerical value) and period of the day from arrival and departure times. - IMOS_detections.csv: raw data file with detections of Port Jackson sharks over different sites in Australia. - IMOS_animal_measurements.csv: raw data file with morphological data of Port Jackson sharks - PS&Syd&JB tags: file with measurements and sex identification of sharks (different from IMOS, it was used to complete missing sex and length). - cleaned_data.csv: file with arrival and departure dates of the final sample size of sharks (N=49) with missing sex and length for some individuals. - clean.csv: completed file using PS&Syd&JB tags, note: tag ID 117393679 was wrongly identified as a male in IMOS and correctly identified as a female in PS&Syd&JB tags file as indicated by its large size. - cleaned_pj_data: Final data file with arrival and departure dates, sex, length, moon phase (numerical) and period of the day.
weekly_rainfall_completed.R: script to calculate average weekly rainfall and correlation between the two weather stations used (Point perpendicular and Sanctuary point). - weekly_rain.csv: file with the corresponding week number (1-28) for each date (01-06-2013 to 13-12-2019) - weekly_rainfall_completed.csv: file with week number (1-28), year (2013-2019) and weekly rainfall average completed with Sanctuary Point for week 2 of 2017 - Point_Perpendicular_2013_2019: Rainfall (mm) from 01-01-2013 to 31-12-2020 at the Point Perpendicular weather station - Sanctuary_Point_2013_2019: Rainfall (mm) from 01-01-2013 to 31-12-2020 at the Sanctuary Point weather station - IDCJAC0009_068088_2017_Data.csv: Rainfall (mm) from 01-01-2017 to 31-12-2017 at the Sanctuary Point weather station (to fill in missing value for average rainfall of week 2 of 2017)
cleaned_data_gam.R: script to calculate weekly counts of sharks to run gam models and add weekly averages of rainfall and sst anomaly - cleaned_pj_data.csv - anomaly.csv: weekly (1-28) average sst anomalies for Jervis Bay (2013-2019) - weekly_rainfall_completed.csv: weekly (1-28) average rainfall for Jervis Bay (2013-2019_ - sample_size.csv: file with the number of sharks tagged (13-49) for each year (2013-2019)
sst.R: script to extract daily and weekly sst from IMOS nc files from 01-05 until 31-12 for the following years: 1992:2022 for Jervis Bay - sst_raw_data: folder with all the raw weekly (1:28) csv files for each year (1992:2022) to fill in with sst data using the sst script - sst_nc_files: folder with all the nc files downloaded from IMOS from the last 31 years (1992-2022) at the sensor (IMOS - SRS - SST - L3S-Single Sensor - 1 day - night time โ Australia). - SST: folder with the average weekly (1-28) sst data extracted from the nc files using the sst script for each of the 31 years (to calculate temperature anomaly).
sst_2019b.R: script to extract daily and weekly sst from IMOS nc file for 2019 (missing value for week 19) for Jervis Bay - week_2019_sst: weekly average sst 2019 with a missing value for week 19 - week_2019b_sst: sst data from 2019 with another sensor (IMOS โ SRS โ MODIS - 01 day - Ocean Colour-SST) to fill in the gap of week 19 - week_2019_complete_sst: completed average weekly sst data from the year 2019 for weeks 1-28.
sst_anomaly.R: script to calculate mean weekly sst anomaly for the study period (2013-2019) using mean historic weekly sst (1992-2022) - historic_sst.csv: mean weekly (1-28) and yearly (1992-2022) sst for Jervis Bay - mean_sst_31_years.csv: mean weekly (1-28) sst across all years (1992-2022) for Jervis Bay - anomaly.csv: mean weekly and yearly sst anomalies for the study period (2013-2019)
Descriptive_stats.R: script to calculate minimum and maximum length of sharks, mean Julian arrival and departure dates per individual per year, mean Julian arrival and departure dates per year for all sharks (Table. S10), summary of standard deviation of julian arrival dates (Table. S9) - cleaned_pj_data.csv
gam.R: script used to run the Generalized additive model for rainfall and sea surface temperature - cleaned_gam.csv
glm.R: script used to run the Generalized linear mixed models for the period of the day and moon phase - cleaned_pj_data.csv - sample_size.csv
lme.R: script used to run the Linear mixed model for sex and size - cleaned_pj_data.csv
Repeatability.R: script used to run the Repeatability for Julian arrival and Julian departure dates - cleaned_pj_data.csv
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By Jeffrey Mvutu Mabilama [source]
The Summer Products and Sales Performance dataset is a comprehensive collection of product listings, ratings, and sales data from the Wish platform. The dataset aims to provide insights into the trends and patterns in e-commerce during the summer season. It contains valuable information such as product titles, prices, retail prices, currency used for pricing, units sold, whether ad boosts are used for product listings, average ratings for products, total ratings count for products, counts of five-star to one-star ratings for products.
Additionally, the dataset includes data on various aspects related to product quality and shipping options such as badges count (indicating special qualities), local product status (whether the product is sold locally), product quality rating badges (indicating the quality of the product), fast shipping availability badges (indicating whether fast shipping is available), tags associated with products (making them more discoverable), color variations of products available in inventory along with their count. It also provides information on different shipping options including option names and their corresponding prices.
Moreover, the dataset encompasses details about the merchants selling these products, including the merchant title and name, the merchant rating count (total number of ratings received by the merchant), merchant profile picture availability, and a subtitle that gives additional details about the merchant.
The dataset further includes links to images of individual listed products along with links to the respective online shop pages where they are found. In addition, currency_buyer specifies the currency type used by buyers throughout the various transactions. Items flagged with urgency text have an associated urgency text rate indicating how urgently they are desired or needed.
This dataset also allows users to analyze units sold per listed item as well as mean units sold per listed item across different categories/themes. Further evaluation can be done using the totalunitsold variable, which represents the total sales volume from all listed items across the Wish platform.
To aid further analysis around elasticity theory, users can find markdown rates/percentages describing discounts over the retail price (ranging from 0 to 1) as well as average discount values for individual listed products, plus custom insights such as the number of countries items can be delivered to, their origin country, whether they have an urgency banner or fast shipping, and whether the seller is famous or has a profile picture.
This dataset was used to build a model that helps sellers predict how well an item may sell, equipping businesses to make replenishment decisions guided by that model.
Familiarize Yourself with the Columns:
- Before diving into data analysis, it's important to understand the meaning of each column in the dataset. The columns contain information such as product titles, prices, ratings, inventory details, shipping options, merchant information, and more. Refer to the dataset documentation or use descriptive statistics methods to gain insights into different attributes.
Explore Product Categories:
- The dataset includes a column named theme that represents the category or theme of each product listing. By analyzing this column's values and frequency distribution, you can identify top-selling categories during the summer season. This information can be beneficial for businesses looking to optimize their product offerings.
Analyze Pricing Data:
- The columns like price, retail_price, and currency_buyer provide insights into pricing strategies employed by sellers on Wish platform.
- Calculate various statistical measures, such as the mean price using 'meanproductprices', the highest-priced items using 'price', and the average discount using 'averagediscount'.
- Investigate relationships between pricing factors such as discounted prices compared to original retail prices ('discounted price' = 'retail_price' - 'price').
Examine Ratings Data:
4a) Analyze Product Ratings: To gauge customer satisfaction with products listed on the Wish platform, rating features are provided. Available columns:
-> Number of ratings received per star rating
-> Total number of ratings received (rating_count)
-> Average rating (rating)
Perform analysis to find: - Aver...
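A minimal sketch of the pricing and ratings checks above. The file name and the title column are assumptions; price, retail_price, rating, rating_count, and units_sold follow the description.
import pandas as pd
df = pd.read_csv('summer_products.csv')  # assumed file name
# Discount in absolute terms and as a fraction of the retail price
df['discounted_price'] = df['retail_price'] - df['price']
df['discount_rate'] = (df['discounted_price'] / df['retail_price']).clip(lower=0)
# Basic pricing statistics and the highest-priced listings
print(df['price'].describe())
print(df.nlargest(5, 'price')[['title', 'price', 'retail_price']])
# Do cheaper or more heavily discounted items sell more units?
print(df[['price', 'discount_rate', 'rating', 'rating_count', 'units_sold']].corr())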
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
I am a new developer and I would greatly appreciate your support. If you find this dataset helpful, please consider giving it an upvote!
Complete 1m Data: Raw 1m historical data from multiple exchanges, covering the entire trading history of BNBUSD available through their API endpoints. This dataset is updated daily to ensure up-to-date coverage.
Combined Index Dataset: A unique feature of this dataset is the combined index, which is derived by averaging all of the other datasets into one (please see the attached notebook). This creates the longest continuous, unbroken BNBUSD dataset available on Kaggle, with no gaps and no erroneous values. It also gives a much more comprehensive view of the market, e.g. total volume across multiple exchanges.
Superior Performance: The combined index dataset has demonstrated superior mean absolute error (MAE) performance when training machine learning models, outperforming single-source datasets by a whole order of magnitude.
Unbroken History: The combined dataset's continuous history is a valuable asset for researchers and traders who require accurate and uninterrupted time series data for modeling or back-testing.
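The notebook that builds the combined index is referenced below rather than reproduced here, but a minimal sketch of the averaging idea, assuming per-exchange CSV files (hypothetical names) that share timestamp, close, and volume columns, looks like this:
import pandas as pd
# Hypothetical per-exchange files with 'timestamp', 'close' and 'volume' columns
exchanges = ['binance_1m.csv', 'kucoin_1m.csv', 'gateio_1m.csv']
frames = [pd.read_csv(f, parse_dates=['timestamp']).set_index('timestamp') for f in exchanges]
# Average the close price across exchanges and sum the traded volume per minute
closes = pd.concat([f['close'] for f in frames], axis=1)
volumes = pd.concat([f['volume'] for f in frames], axis=1)
index = pd.DataFrame({
    'close': closes.mean(axis=1, skipna=True),   # mean price across available exchanges
    'volume': volumes.sum(axis=1, min_count=1),  # total volume across exchanges
})
print(index.head())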
[Image: BNBUSD Dataset Summary: https://i.imgur.com/aqtuPay.png]
[Image: Combined Dataset Close Plot: https://i.imgur.com/mnzs2f4.png] This plot illustrates the continuity of the dataset over time, with no gaps in data, making it ideal for time series analysis.
Dataset Usage and Diagnostics: This notebook demonstrates how to use the dataset and includes a powerful data diagnostics function, which is useful for all time series analyses.
Aggregating Multiple Data Sources: This notebook walks you through the process of combining multiple exchange datasets into a single, clean dataset. (Currently unavailable, will be added shortly)
Dataset Overview:
Species: Name of the fish species (e.g., Anabas testudineus)
Length: Length of the fish (in centimeters)
Weight: Weight of the fish (in grams)
W/L Ratio: Weight-to-length ratio of the fish
Steps to Build the Prediction Model:
Data Preprocessing:
1 - Handle Missing Values: Check for and handle any missing values appropriately using methods like:
Imputation (mean/median for numeric data)
Row or column removal (if data is too sparse)
2 - Convert Data Types: Ensure numerical columns (Length, Weight, W/L Ratio) are in the correct numeric format.
3 - Handle Categorical Variables: Convert the Species column into numerical format using:
One-Hot Encoding
Label Encoding
Feature Selection:
1 - Correlation Analysis: Use correlation heatmaps or statistical tests to identify features most related to the target variable (e.g., Weight).
2 - Feature Importance: Use tree-based models (like Random Forest) to determine which features are most predictive.
Model Selection:
1 - Algorithm Choice: Choose suitable machine learning algorithms such as:
Linear Regression
Decision Tree Regressor
Random Forest Regressor
Gradient Boosting Regressor
2 - Model Comparison: Evaluate each model using metrics like:
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
R-squared (R²)
Model Training and Evaluation:
1 - Train the Model: Split the dataset into training and testing sets (e.g., 80/20 split). Train the selected model(s) on the training set.
2 - Evaluate the Model: Use the test set to assess model performance and fine-tune as necessary using grid search or cross-validation.
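A minimal sketch of the workflow above, assuming the data are in a CSV file (file name assumed) with the Species, Length, Weight, and W/L Ratio columns from the overview. W/L Ratio is deliberately left out of the features, since it already encodes the target.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
df = pd.read_csv('fish_data.csv')  # assumed file name
# 1 - Handle missing values: impute numeric columns with the median
for col in ['Length', 'Weight', 'W/L Ratio']:
    df[col] = df[col].fillna(df[col].median())
# 3 - Handle categorical variables: one-hot encode Species
X = pd.get_dummies(df[['Species', 'Length']], columns=['Species'])
y = df['Weight']
# Train/test split (80/20) and model fit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
# Evaluate with MAE and R-squared
pred = model.predict(X_test)
print('MAE:', mean_absolute_error(y_test, pred))
print('R2:', r2_score(y_test, pred))
# Feature importance from the tree-based model
print(dict(zip(X.columns, model.feature_importances_)))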
This dataset and workflow are useful for exploring biometric relationships in fish and building regression models to predict weight based on length or species. Great for marine biology, aquaculture analytics, and educational projects.
Happy modeling! Please upvote if you found this helpful!
https://www.kaggle.com/code/abdelrahman16/fish-clustering-diverse-techniques
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents median household incomes for various household sizes in Forest View, IL, as reported by the U.S. Census Bureau. The dataset highlights the variation in median household income with the size of the family unit, offering valuable insights into economic trends and disparities within different household sizes, aiding in data analysis and decision-making.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Household Sizes:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Forest View median household income. You can refer to the same here.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset provides an in-depth look at YouTube video analytics, capturing key metrics related to video performance, audience engagement, revenue generation, and viewer behavior. Sourced from real video data, it highlights how variables like video duration, upload time, and ad impressions contribute to monetization and audience retention. This dataset is ideal for data analysts, content creators, and marketers aiming to uncover trends in viewer engagement, optimize content strategies, and maximize ad revenue. Inspired by the evolving landscape of digital content, it serves as a resource for understanding the impact of YouTube metrics on channel growth and content reach.
Video Details: Columns like Video Duration, Video Publish Time, Days Since Publish, Day of Week.
Revenue Metrics: Includes Revenue per 1000 Views (USD), Estimated Revenue (USD), Ad Impressions, and various ad revenue sources (e.g., AdSense, DoubleClick).
Engagement Metrics: Metrics such as Views, Likes, Dislikes, Shares, Comments, Average View Duration, Average View Percentage (%), and Video Thumbnail CTR (%).
Audience Data: Data on New Subscribers, Unsubscribes, Unique Viewers, Returning Viewers, and New Viewers.
Monetization & Transaction Metrics: Details on Monetized Playbacks, Playback-Based CPM, YouTube Premium Revenue, and transactions like Orders and Total Sales Volume (USD).
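As a quick illustration of how these columns fit together, the sketch below derives an engagement rate and recomputes revenue per 1,000 views from the raw columns. The file name is an assumption; the column names follow the lists above.
import pandas as pd
df = pd.read_csv('youtube_analytics.csv')  # assumed file name
# Engagement rate: interactions per view
df['Engagement Rate (%)'] = (df['Likes'] + df['Comments'] + df['Shares']) / df['Views'] * 100
# RPM recomputed from estimated revenue and views; should track the provided column
df['RPM (check)'] = df['Estimated Revenue (USD)'] / df['Views'] * 1000
print(df[['Views', 'Engagement Rate (%)',
          'Revenue per 1000 Views (USD)', 'RPM (check)']].describe())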
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.
The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:
| Column Name | Description |
|---|---|
| Student_ID | Unique identifier for each student (e.g., S0001, S0002, โฆ) |
| Age | Age of the student (between 18 and 25 years) |
| Gender | Gender of the student (Male/Female) |
| Study_Hours | Average number of study hours per day (contains missing values and outliers) |
| Attendance(%) | Percentage of class attendance (contains missing values) |
| Test_Score | Final exam score (0โ100 scale) |
| Grade | Letter grade derived from test scores (F, C, B, A, A+) |
Test_Score → Predict test score based on study hours, attendance, age, and gender.
Predict the studentโs test score using their study hours, attendance percentage, and age.
Sample Features: X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)'], y = ['Test_Score']
You can use:
And analyze feature influence using correlation or SHAP/LIME explainability.
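A minimal sketch of that regression task, assuming a file named student_performance.csv and the column names from the table above:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
df = pd.read_csv('student_performance.csv')  # assumed file name
# Clean up the intentionally introduced issues: duplicates and missing values
df = df.drop_duplicates(subset='Student_ID')
for col in ['Study_Hours', 'Attendance(%)']:
    df[col] = df[col].fillna(df[col].median())
# Encode Gender and assemble the feature matrix listed above
X = pd.get_dummies(df[['Age', 'Gender', 'Study_Hours', 'Attendance(%)']],
                   columns=['Gender'], drop_first=True)
y = df['Test_Score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print('MAE:', mean_absolute_error(y_test, model.predict(X_test)))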
This dataset contains 1000000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.
Each row represents one student with features like study hours, attendance, class participation, and final score.
The dataset is small, clean, and structured to be beginner-friendly.
Random noise simulates differences in learning ability, motivation, etc.
Regression Tasks
- Predict total_score from weekly_self_study_hours, attendance_percentage and class_participation.
Classification Tasks
- Predict grade (A-F) using study hours, attendance, and participation.
Model Evaluation Practice
This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).
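A minimal sketch of the grade classification task, assuming a CSV file (name assumed) with the weekly_self_study_hours, attendance_percentage, class_participation, and grade columns mentioned in the task list:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
df = pd.read_csv('student_scores.csv')  # assumed file name
features = ['weekly_self_study_hours', 'attendance_percentage', 'class_participation']
X, y = df[features], df['grade']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, pred))
print(classification_report(y_test, pred))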
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
Kokoro Speech Dataset is a public domain Japanese speech dataset. It contains 34,958 short audio clips of a single speaker reading 9 novel books. The format of the metadata is similar to that of LJ Speech so that the dataset is compatible with modern speech synthesis systems.
The texts are from Aozora Bunko, which is in the public domain. The audio clips are from the LibriVox project, which is also in the public domain. Readings are estimated by MeCab and UniDic Lite from the kanji-kana mixture text, and are romanized in a format similar to that used by Julius.
The audio clips were split and transcripts were aligned automatically by Voice100.
Listen from your browser or download randomly sampled 100 clips.
Metadata is provided in metadata.csv. This file consists of one record per line,
delimited by the pipe character (0x7c). The fields are:
Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.
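Because the field list above did not survive extraction, the sketch below assumes only what the description states, namely a pipe-delimited metadata.csv, and inspects the file without naming the columns.
import csv
import pandas as pd
# metadata.csv is pipe-delimited (0x7c); a header row is assumed absent, as in LJ Speech
meta = pd.read_csv('metadata.csv', sep='|', header=None, quoting=csv.QUOTE_NONE)
print(meta.shape)   # number of records and number of fields
print(meta.head())  # first few records, to inspect the field layout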
The dataset is provided in different sizes: large, small, and tiny. small and tiny don't share the same clips; large contains all available clips, including those in small and tiny.
Large:
Total clips: 34958
Min duration: 3.007 secs
Max duration: 14.745 secs
Mean duration: 4.978 secs
Total duration: 48:20:24
Small:
Total clips: 8812
Min duration: 3.007 secs
Max duration: 14.431 secs
Mean duration: 4.951 secs
Total duration: 12:07:12
Tiny:
Total clips: 285
Min duration: 3.019 secs
Max duration: 9.462 secs
Mean duration: 4.871 secs
Total duration: 00:23:08
Because of the large size of the dataset, audio files are not included in this repository, but the metadata is.
To make .wav files of the dataset, run
$ bash download.sh
to download the metadata from the project page. Then run
$ pip3 install torchaudio
$ python3 extract.py --size tiny
This prints a shell script example to download MP3 audio files from archive.org and extract them if you haven't done it already.
After doing so, run the command again
$ python3 extract.py --size tiny
to get files for tiny under ./output directory.
You can give another size name to the --size option to get a dataset of that size.
A pretrained Tacotron model trained with the Kokoro Speech Dataset and audio samples are available. The model was trained for 21K steps with small. According to the above repo, "Speech started to become intelligible around 20K steps" with the LJ Speech Dataset. The audio samples read the first few sentences from Gon Gitsune, which is not included in small.
The dataset contains recordings from these books read by ekzemplaro
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents the mean household income for each of the five quintiles in Bay View, OH, as reported by the U.S. Census Bureau. The dataset highlights the variation in mean household income across quintiles, offering valuable insights into income distribution and inequality.
Key observations
[Chart: Mean household income by quintiles in Bay View, OH (in 2022 inflation-adjusted dollars): https://i.neilsberg.com/ch/bay-view-oh-mean-household-income-by-quintiles.jpeg]
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Income Levels:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Bay View median household income. You can refer to the same here.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 2,000 rows of data from coffee shops, offering detailed insights into factors that influence daily revenue. It includes key operational and environmental variables that provide a comprehensive view of how business activities and external conditions affect sales performance. Designed for use in predictive analytics and business optimization, this dataset is a valuable resource for anyone looking to understand the relationship between customer behavior, operational decisions, and revenue generation in the food and beverage industry.
The dataset features a variety of columns that capture the operational details of coffee shops, including customer activity, store operations, and external factors such as marketing spend and location foot traffic.
Number of Customers Per Day
Average Order Value ($)
Operating Hours Per Day
Number of Employees
Marketing Spend Per Day ($)
Location Foot Traffic (people/hour)
The dataset spans a wide variety of operational scenarios, from small neighborhood coffee shops with limited traffic to larger, high-traffic locations with extensive marketing budgets. This variety allows for exploring different predictive modeling strategies. Key insights that can be derived from the data include:
The dataset offers a wide range of applications, especially in predictive analytics, business optimization, and forecasting:
For coffee shop owners, managers, and analysts in the food and beverage industry, this dataset provides an essential tool for refining daily operations and boosting profitability. Insights gained from this data can help:
This dataset is also ideal for aspiring data scientists and machine learning practitioners looking to apply their skills to real-world business problems in the food and beverage sector.
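A minimal sketch of a revenue model built from the six operational columns listed above. The file name and the revenue column name are assumptions; swap in the actual names from the CSV header.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
df = pd.read_csv('coffee_shop_revenue.csv')  # assumed file name
features = ['Number of Customers Per Day', 'Average Order Value ($)',
            'Operating Hours Per Day', 'Number of Employees',
            'Marketing Spend Per Day ($)', 'Location Foot Traffic (people/hour)']
target = 'Daily Revenue ($)'  # assumed column name for the revenue target
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print('MAE:', mean_absolute_error(y_test, model.predict(X_test)))
# Per-feature effect on daily revenue, holding the others fixed
print(dict(zip(features, model.coef_)))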
The Coffee Shop Revenue Prediction Dataset is a versatile and comprehensive resource for understanding the dynamics of daily sales performance in coffee shops. With a focus on key operational factors, it is perfect for building predictive models, ...