License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This comprehensive dataset explores the relationship between housing and weather conditions across North America in 2012. Through a range of climate variables such as temperature, wind speed, humidity, pressure and visibility, it provides unique insights into the weather-influenced environment of numerous regions. The interrelated nature of housing parameters such as longitude, latitude, median income, median house value and ocean proximity further enhances our understanding of how distinct climates play an integral part in area real estate valuations. Analyzing these two data sets offers a wealth of knowledge when it comes to understanding what factors can dictate the value and comfort level offered by residential areas throughout North America.
This dataset offers plenty of insights into the effects of weather and housing on North American regions. To explore these relationships, you can perform data analysis on the variables provided.
First, start by examining descriptive statistics (i.e., mean, median, mode). This can help show you the general trend and distribution of each variable in this dataset. For example, what is the most common temperature in a given region? What is the average wind speed? How does this vary across different regions? By looking at descriptive statistics, you can get an initial idea of how various weather conditions and housing attributes interact with one another.
Next, explore correlations between variables. Are certain weather variables correlated with specific housing attributes? Is there a link between wind speeds and median house value? Or between humidity and ocean proximity? Analyzing correlations allows for deeper insights into how different aspects may influence one another for a given region or area. These correlations may also inform broader patterns that are present across multiple North American regions or countries.
Finally, use visualizations to further investigate the relationship between climate and housing attributes in North America in 2012. Graphs allow you to visualize trends like seasonal variations or long-term changes over time more easily, so they are useful for interpreting large amounts of data quickly while providing context beyond what the numbers alone can tell us about relationships between different aspects of this dataset.
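As a concrete starting point, here is a minimal sketch of these three steps using Python's pandas library; it assumes Weather.csv (documented below) is in the working directory and uses only the columns listed in that documentation.
import pandas as pd
import matplotlib.pyplot as plt
# Load the weather table documented below (assumed to be in the working directory)
weather = pd.read_csv('Weather.csv')
# 1. Descriptive statistics (mean, spread, quartiles) for every numeric column
print(weather.describe())
print(weather['Temp_C'].mode().iloc[0])       # most common temperature
print(weather['Wind Speed_km/h'].mean())      # average wind speed
# 2. Pairwise correlations between the numeric weather variables
print(weather.corr(numeric_only=True))
# 3. A simple visualization: monthly mean temperature across 2012
weather['Date/Time'] = pd.to_datetime(weather['Date/Time'])
weather.set_index('Date/Time')['Temp_C'].resample('M').mean().plot(title='Monthly mean temperature, 2012')
plt.show()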
- Analyzing the effect of climate change on housing markets across North America. By looking at temperature and weather trends in combination with housing values, researchers can better understand how climate change may be impacting certain regions differently than others.
- Investigating the relationship between median income, house values and ocean proximity in coastal areas. Understanding how ocean proximity plays into housing prices may help inform real estate investment decisions and urban planning initiatives related to coastal development.
- Utilizing differences in weather patterns across different climates to determine optimal seasonal rental prices for property owners. By analyzing changes in temperature, wind speed, humidity, pressure and visibility from season to season, an investor could gain valuable insights into seasonal market trends to maximize profits from rentals or Airbnb listings over time.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Weather.csv
| Column name | Description |
|:---|:---|
| Date/Time | Date and time of the observation. (Date/Time) |
| Temp_C | Temperature in Celsius. (Numeric) |
| Dew Point Temp_C | Dew point temperature in Celsius. (Numeric) |
| Rel Hum_% | Relative humidity in percent. (Numeric) |
| Wind Speed_km/h | Wind speed in kilometers per hour. (Numeric) |
| Visibility_km | Visibility in kilometers. (Numeric) |
...
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.
This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.
It can be accessed through the following means: File format: R workspace file; 'Simulated_Dataset.RData'.
Metadata (including data dictionary):
- y: Vector of binary responses (1: adverse outcome, 0: control)
- x: Matrix of covariates; one row for each simulated individual
- z: Matrix of standardized pollution exposures
- n: Number of simulated individuals
- m: Number of exposure time periods (e.g., weeks of pregnancy)
- p: Number of columns in the covariate design matrix
- alpha_true: Vector of 'true' critical window locations/magnitudes (i.e., the ground truth that we want to estimate)
Code Abstract: We provide R statistical software code ('CWVS_LMC.txt') to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code ('Results_Summary.txt') to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.
Description:
- 'CWVS_LMC.txt': This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the 'Simulated_Dataset.RData' workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
- 'Results_Summary.txt': This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the 'CWVS_LMC.txt' code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).
Optional Information: Required R packages:
- For running 'CWVS_LMC.txt': msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
- For running 'Results_Summary.txt': plotrix (plotting the posterior means and credible intervals)
Instructions for Use - Reproducibility (Mandatory): What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.
How to use the information:
- Load the 'Simulated_Dataset.RData' workspace.
- Run the code contained in 'CWVS_LMC.txt'.
- Once the 'CWVS_LMC.txt' code is complete, run 'Results_Summary.txt'.
Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set.
Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.
Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.
Description (Permissions): These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.
This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents the mean household income for each of the five quintiles in Lake View, AL, as reported by the U.S. Census Bureau. The dataset highlights the variation in mean household income across quintiles, offering valuable insights into income distribution and inequality.
Key observations
[Chart: Mean household income by quintiles in Lake View, AL (in 2022 inflation-adjusted dollars): https://i.neilsberg.com/ch/lake-view-al-mean-household-income-by-quintiles.jpeg]
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Income Levels:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Lake View median household income. You can refer to the same here.
This dataset provides detailed insights into daily active users (DAU) of a platform or service, captured over a defined period of time. The dataset includes information such as the number of active users per day, allowing data analysts and business intelligence teams to track usage trends, monitor platform engagement, and identify patterns in user activity over time.
The data is ideal for performing time series analysis, statistical analysis, and trend forecasting. You can utilize this dataset to measure the success of platform initiatives, evaluate user behavior, or predict future trends in engagement. It is also suitable for training machine learning models that focus on user activity prediction or anomaly detection.
The dataset is structured in a simple and easy-to-use format, containing the following columns:
Each row in the dataset represents a unique date and its corresponding number of active users. This allows for time-based analysis, such as calculating the moving average of active users, detecting seasonality, or spotting sudden spikes or drops in engagement.
This dataset can be used for a wide range of purposes, including:
Here are some specific analyses you can perform using this dataset:
To get started with this dataset, you can load it into your preferred analysis tool. Here's how to do it using Python's pandas library:
import pandas as pd
# Load the dataset
data = pd.read_csv('path_to_dataset.csv')
# Display the first few rows
print(data.head())
# Basic statistics
print(data.describe())
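Building on the loading step above, the following sketch computes a 7-day moving average and flags sudden day-over-day changes. The column names date and active_users are assumptions, since the column list itself is not reproduced here.
import pandas as pd
# Assumed column names: 'date' and 'active_users' (not confirmed by the description above)
data = pd.read_csv('path_to_dataset.csv', parse_dates=['date'])
data = data.sort_values('date').set_index('date')
# 7-day moving average to smooth out daily noise
data['dau_7d_avg'] = data['active_users'].rolling(window=7).mean()
# Flag days where DAU jumps or drops by more than 20% versus the previous day
data['pct_change'] = data['active_users'].pct_change()
spikes = data[data['pct_change'].abs() > 0.20]
print(spikes[['active_users', 'pct_change']])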
See Materials and Methods section for description of angle B. Group means shown without confidence intervals are those for which sample size is too small to derive 95% confidence intervals (n < 8). See Table 1 for institutional abbreviations.
Dataset Description: The dataset contains information about properties. Each property has a unique property ID and is associated with a location ID based on the subcategory of the city. The dataset includes the following attributes:
Property ID: Unique identifier for each property.
Location ID: Unique identifier for each location within a city.
Page URL: The URL of the webpage where the property was published.
Property Type: Categorization of the property into six types: House, FarmHouse, Upper Portion, Lower Portion, Flat, or Room.
Price: The price of the property, which is the dependent feature in this dataset.
City: The city where the property is located. The dataset includes five cities: Lahore, Karachi, Faisalabad, Rawalpindi, and Islamabad.
Province: The state or province where the city is located.
Location: Different types of locations within each city.
Latitude and Longitude: Geographic coordinates of the cities.
Steps Involved in the Analysis:
Statistical Analysis:
Data Types: Determine the data types of the attributes.
Level of Measurement: Identify the level of measurement for each attribute.
Summary Statistics: Calculate mean, standard deviation, minimum, and maximum values for numerical attributes.
Data Cleaning:
Filling Null Values: Handle missing values in the dataset.
Duplicate Values: Remove duplicate records, if any.
Correcting Data Types: Ensure the correct data types for each attribute.
Outliers Detection: Identify and handle outliers in the data.
Exploratory Data Analysis (EDA):
Visualization: Use libraries such as Seaborn, Matplotlib, and Plotly to visualize the data and gain insights.
Model Building:
Libraries: Utilize libraries like scikit-learn and pickle.
List of Models: Build models using Linear Regression, Decision Tree, Random Forest, K-Nearest Neighbors (KNN), XGBoost, Gradient Boosting, and AdaBoost.
Model Saving: Save the selected model into a pickle file for future use.
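A minimal sketch of the cleaning and modeling steps above, using pandas and scikit-learn. The CSV file name and the exact spelling of the column names are assumptions based on the attribute list; this is an illustrative workflow, not the authors' exact pipeline.
import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv('property_listings.csv')  # assumed file name
categorical = ['Property Type', 'City', 'Province', 'Location']
numeric = ['Latitude', 'Longitude']
# Basic cleaning: drop duplicates and rows with missing target or features
df = df.drop_duplicates().dropna(subset=categorical + numeric + ['Price'])
X, y = df[categorical + numeric], df['Price']
# One-hot encode the categorical attributes; pass the coordinates through unchanged
preprocess = ColumnTransformer(
    [('cat', OneHotEncoder(handle_unknown='ignore'), categorical)],
    remainder='passthrough')
model = Pipeline([('prep', preprocess),
                  ('rf', RandomForestRegressor(n_estimators=200, random_state=0))])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print('MAE:', mean_absolute_error(y_test, model.predict(X_test)))
# Save the fitted pipeline into a pickle file, as described above
with open('property_price_model.pkl', 'wb') as f:
    pickle.dump(model, f)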
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents median household incomes for various household sizes in Ocean View, DE, as reported by the U.S. Census Bureau. The dataset highlights the variation in median household income with the size of the family unit, offering valuable insights into economic trends and disparities within different household sizes, aiding in data analysis and decision-making.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Household Sizes:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Ocean View median household income. You can refer to the same here.
This dataset was derived by the Bioregional Assessment Programme from 'Mean climate variables for all subregions' and 'fPAR derived from MODIS for BA subregions'. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.
These are charts of climate statistics and MODIS data for each BA subregion. There are six 600dpi PNG files per subregion, with the naming convention BA-[regioncode]-[subregioncode]-[chartname].png. The charts, according to their filename, are: rain (time-series of rainfall; Figure 1), P-PET (average monthly precipitation and potential evapotranspiration; Figure 2), 5line (assorted monthly statistics; Figure 3), trend (monthly long-term trends; Figure 4) and fPAR (fraction of photosynthetically available radiation - an indication of biomass; Figure 5).
This version was created on 18 November 2014, using data that accounted for a modified boundary for the Gippsland Basin bioregion and the combination of two subregions to form the Sydney Basin bioregion.
These charts were generated to be included in the Contextual Report (geography) for each subregion.
These charts were generated using MatPlotLib 1.3.0 in Python 2.7.5 (Anaconda distribution v1.7.0 32-bit).
The script for generating these plots is BA-ClimateCharts.py, and it is packaged with the dataset. This script is a data collection and chart-drawing script; it does not do any analysis. The data are charted as they appear in the parent datasets (see Lineage). A Word document (BA-ClimateGraphs-ReadMe) is also included. This document includes examples of, and approved captions for, each chart.
Bioregional Assessment Programme (2014) Charts of climate statistics and MODIS data for all Bioregional Assessment subregions. Bioregional Assessment Derived Dataset. Viewed 14 June 2018, http://data.bioregionalassessments.gov.au/dataset/8a1c5f43-b150-4357-aa25-5f301b1a02e1.
Derived From Mean climate variables for all subregions
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From fPar derived from MODIS for BA subregions
The mean and variability of the Earth's radiation budget (ERB) at the Top-of-Atmosphere (TOA) and at the surface are fundamental quantities governing climate variability, and for that reason NASA has been making concerted efforts to observe the ERB since 1984 through two projects, ERBE and CERES, that span nearly 30 years to date. The proposed project utilizes knowledge gained in the last 10 years through CERES data analyses and applies that knowledge to existing data to develop a long-term (nearly 30 years), consistent, and calibrated data product (TOA irradiances at the same radiometric scale) from multiple missions (ERBS and CERES). This project proposes to produce level 3 surface irradiance products that are consistent with observed TOA irradiances in a framework of 1D radiative transfer theory. Based on these TOA and surface irradiance products, a data product will be developed which contains the contribution of atmospheric and cloud property variability to TOA and surface irradiance variability. All algorithms used in the process are based on existing CERES algorithms. All data sets produced by this project will be available from the Atmospheric Science Data Center.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset and scripts used for manuscript: High consistency and repeatability in the breeding migrations of a benthic shark.
Project title: High consistency and repeatability in the breeding migrations of a benthic shark. Date: 23/04/2024
Folders:
- 1_Raw_data: Perpendicular_Point_068151, Sanctuary_Point_068088, SST raw data, sst_nc_files, IMOS_animal_measurements, IMOS_detections, PS&Syd&JB tags, rainfall_raw, sample_size, Point_Perpendicular_2013_2019, Sanctuary_Point_2013_2019, EAC_transport
- 2_Processed_data: SST (anomaly, historic_sst, mean_sst_31_years, week_1992_sst:week_2022_sst including week_2019_complete_sst); Rain (weekly_rain, weekly_rainfall_completed); Clean (clean, cleaned_data, cleaned_gam, cleaned_pj_data)
- 3_Script_processing_data: Plots (dual_axis_plot (Fig. 1 & Fig. 4).R, period_plot (Fig. 2).R, sd_plot (Fig. 5).R, sex_plot (Fig. 3).R); cleaned_data.R, cleaned_data_gam.R, weekly_rainfall_completed.R, descriptive_stats.R, sst.R, sst_2019b.R, sst_anomaly.R
- 4_Script_analyses: gam.R, gam_eac.R, glm.R, lme.R, Repeatability.R
- 5_Output_doc: Plots (arrival_dual_plot_with_anomaly (Fig. 1).png, period_plot (Fig. 2).png, sex_arrival_departure (Fig. 3).png, departure_dual_plot_with_anomaly (Fig. 4).png, standard deviation plot (Fig. 5).png); Tables (gam_arrival_eac_selection_table.csv (Table S2), gam_departure_eac_selection_table (Table S5), gam_arrival_selection_table (Table S3), gam_departure_selection_table (Table S6), glm_arrival_selection_table, glm_departure_selection_table, lme_arrival_anova_table, lme_arrival_selection_table (Table S4), lme_departure_anova_table, lme_departure_selection_table (Table S8))
Descriptions of scripts and files used:- cleaned_data.R: script to extract detections of sharks at Jervis Bay. Calculate arrival and departure dates over the seven breeding seasons. Add sex and length for each individual. Extract moon phase (numerical value) and period of the day from arrival and departure times. - IMOS_detections.csv: raw data file with detections of Port Jackson sharks over different sites in Australia. - IMOS_animal_measurements.csv: raw data file with morphological data of Port Jackson sharks - PS&Syd&JB tags: file with measurements and sex identification of sharks (different from IMOS, it was used to complete missing sex and length). - cleaned_data.csv: file with arrival and departure dates of the final sample size of sharks (N=49) with missing sex and length for some individuals. - clean.csv: completed file using PS&Syd&JB tags, note: tag ID 117393679 was wrongly identified as a male in IMOS and correctly identified as a female in PS&Syd&JB tags file as indicated by its large size. - cleaned_pj_data: Final data file with arrival and departure dates, sex, length, moon phase (numerical) and period of the day.
weekly_rainfall_completed.R: script to calculate average weekly rainfall and correlation between the two weather stations used (Point perpendicular and Sanctuary point). - weekly_rain.csv: file with the corresponding week number (1-28) for each date (01-06-2013 to 13-12-2019) - weekly_rainfall_completed.csv: file with week number (1-28), year (2013-2019) and weekly rainfall average completed with Sanctuary Point for week 2 of 2017 - Point_Perpendicular_2013_2019: Rainfall (mm) from 01-01-2013 to 31-12-2020 at the Point Perpendicular weather station - Sanctuary_Point_2013_2019: Rainfall (mm) from 01-01-2013 to 31-12-2020 at the Sanctuary Point weather station - IDCJAC0009_068088_2017_Data.csv: Rainfall (mm) from 01-01-2017 to 31-12-2017 at the Sanctuary Point weather station (to fill in missing value for average rainfall of week 2 of 2017)
cleaned_data_gam.R: script to calculate weekly counts of sharks to run gam models and add weekly averages of rainfall and sst anomaly - cleaned_pj_data.csv - anomaly.csv: weekly (1-28) average sst anomalies for Jervis Bay (2013-2019) - weekly_rainfall_completed.csv: weekly (1-28) average rainfall for Jervis Bay (2013-2019_ - sample_size.csv: file with the number of sharks tagged (13-49) for each year (2013-2019)
sst.R: script to extract daily and weekly sst from IMOS nc files from 01-05 until 31-12 for the following years: 1992:2022 for Jervis Bay - sst_raw_data: folder with all the raw weekly (1:28) csv files for each year (1992:2022) to fill in with sst data using the sst script - sst_nc_files: folder with all the nc files downloaded from IMOS from the last 31 years (1992-2022) at the sensor (IMOS - SRS - SST - L3S-Single Sensor - 1 day - night time โ Australia). - SST: folder with the average weekly (1-28) sst data extracted from the nc files using the sst script for each of the 31 years (to calculate temperature anomaly).
sst_2019b.R: script to extract daily and weekly sst from IMOS nc file for 2019 (missing value for week 19) for Jervis Bay - week_2019_sst: weekly average sst 2019 with a missing value for week 19 - week_2019b_sst: sst data from 2019 with another sensor (IMOS โ SRS โ MODIS - 01 day - Ocean Colour-SST) to fill in the gap of week 19 - week_2019_complete_sst: completed average weekly sst data from the year 2019 for weeks 1-28.
sst_anomaly.R: script to calculate mean weekly sst anomaly for the study period (2013-2019) using mean historic weekly sst (1992-2022) - historic_sst.csv: mean weekly (1-28) and yearly (1992-2022) sst for Jervis Bay - mean_sst_31_years.csv: mean weekly (1-28) sst across all years (1992-2022) for Jervis Bay - anomaly.csv: mean weekly and yearly sst anomalies for the study period (2013-2019)
Descriptive_stats.R: script to calculate minimum and maximum length of sharks, mean Julian arrival and departure dates per individual per year, mean Julian arrival and departure dates per year for all sharks (Table. S10), summary of standard deviation of julian arrival dates (Table. S9) - cleaned_pj_data.csv
gam.R: script used to run the Generalized additive model for rainfall and sea surface temperature - cleaned_gam.csv
glm.R: script used to run the Generalized linear mixed models for the period of the day and moon phase - cleaned_pj_data.csv - sample_size.csv
lme.R: script used to run the Linear mixed model for sex and size - cleaned_pj_data.csv
Repeatability.R: script used to run the Repeatability for Julian arrival and Julian departure dates - cleaned_pj_data.csv
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By Jeffrey Mvutu Mabilama [source]
The Summer Products and Sales Performance dataset is a comprehensive collection of product listings, ratings, and sales data from the Wish platform. The dataset aims to provide insights into the trends and patterns in e-commerce during the summer season. It contains valuable information such as product titles, prices, retail prices, currency used for pricing, units sold, whether ad boosts are used for product listings, average ratings for products, total ratings count for products, counts of five-star to one-star ratings for products.
Additionally, the dataset includes data on various aspects related to product quality and shipping options such as badges count (indicating special qualities), local product status (whether the product is sold locally), product quality rating badges (indicating the quality of the product), fast shipping availability badges (indicating whether fast shipping is available), tags associated with products (making them more discoverable), color variations of products available in inventory along with their count. It also provides information on different shipping options including option names and their corresponding prices.
Moreover, the dataset encompasses details about the merchants selling these products, including the merchant title and name, the merchant rating count (total number of ratings received by the merchant), merchant profile picture availability, and a subtitle that gives additional details about the merchant.
The dataset further includes links to images of individual listed products along with links to the respective online shop pages where they are found. In addition, currency_buyer specifies the currency type used by buyers throughout the various transactions. Items flagged with urgency text have an associated urgency text rate indicating how urgently they are desired or needed.
This dataset also allows users to analyze units sold per listed item as well as mean units sold per listed item across different categories/themes. Further evaluation can be done using the totalunitsold variable, which represents the total sales volume from all listed items across the Wish platform.
To aid further analysis around elasticity theory, users can find markdown rates/percentages describing discounts over the retail price (ranging from 0 to 1) as well as average discount values for individual listed products, plus custom insights such as the number of countries items can be delivered to, their origin country, whether they have an urgency banner or fast shipping, and whether the seller is famous or has a profile picture.
This dataset was used to build a model that helps sellers predict how well an item may sell, equipping businesses to make replenishment decisions guided by that model.
Familiarize Yourself with the Columns:
- Before diving into data analysis, it's important to understand the meaning of each column in the dataset. The columns contain information such as product titles, prices, ratings, inventory details, shipping options, merchant information, and more. Refer to the dataset documentation or use descriptive statistics methods to gain insights into different attributes.
Explore Product Categories:
- The dataset includes a column named theme that represents the category or theme of each product listing. By analyzing this column's values and frequency distribution, you can identify top-selling categories during the summer season. This information can be beneficial for businesses looking to optimize their product offerings.
Analyze Pricing Data:
- The columns like price, retail_price, and currency_buyer provide insights into pricing strategies employed by sellers on Wish platform.
- Calculate various statistical measures, such as the mean price using 'meanproductprices', the highest-priced items using 'price', and the average discount using 'averagediscount'.
- Investigate relationships between pricing factors such as discounted prices compared to original retail prices ('discounted price' = 'retail_price' - 'price').
Examine Ratings Data:
4a) Analyze Product Ratings: To gauge customer satisfaction with products listed on the Wish platform, rating features are provided. Available columns:
-> Number of ratings received per star rating
-> Total number of ratings received (rating_count)
-> Average rating (rating)
Perform analysis to find: - Aver...
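A minimal sketch of the pricing and ratings checks above. The file name and the title column are assumptions; price, retail_price, rating, rating_count, and units_sold follow the description.
import pandas as pd
df = pd.read_csv('summer_products.csv')  # assumed file name
# Discount in absolute terms and as a fraction of the retail price
df['discounted_price'] = df['retail_price'] - df['price']
df['discount_rate'] = (df['discounted_price'] / df['retail_price']).clip(lower=0)
# Basic pricing statistics and the highest-priced listings
print(df['price'].describe())
print(df.nlargest(5, 'price')[['title', 'price', 'retail_price']])
# Do cheaper or more heavily discounted items sell more units?
print(df[['price', 'discount_rate', 'rating', 'rating_count', 'units_sold']].corr())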
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
I am a new developer and I would greatly appreciate your support. If you find this dataset helpful, please consider giving it an upvote!
Complete 1m Data: Raw 1m historical data from multiple exchanges, covering the entire trading history of BNBUSD available through their API endpoints. This dataset is updated daily to ensure up-to-date coverage.
Combined Index Dataset: A unique feature of this dataset is the combined index, which is derived by averaging all of the other datasets into one (please see the attached notebook). This creates the longest continuous, unbroken BNBUSD dataset available on Kaggle, with no gaps and no erroneous values. It also gives a much more comprehensive view of the market, e.g. total volume across multiple exchanges.
Superior Performance: The combined index dataset has demonstrated superior mean absolute error (MAE) performance when training machine learning models, outperforming single-source datasets by a whole order of magnitude.
Unbroken History: The combined dataset's continuous history is a valuable asset for researchers and traders who require accurate and uninterrupted time series data for modeling or back-testing.
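The notebook that builds the combined index is referenced below rather than reproduced here, but a minimal sketch of the averaging idea, assuming per-exchange CSV files (hypothetical names) that share timestamp, close, and volume columns, looks like this:
import pandas as pd
# Hypothetical per-exchange files with 'timestamp', 'close' and 'volume' columns
exchanges = ['binance_1m.csv', 'kucoin_1m.csv', 'gateio_1m.csv']
frames = [pd.read_csv(f, parse_dates=['timestamp']).set_index('timestamp') for f in exchanges]
# Average the close price across exchanges and sum the traded volume per minute
closes = pd.concat([f['close'] for f in frames], axis=1)
volumes = pd.concat([f['volume'] for f in frames], axis=1)
index = pd.DataFrame({
    'close': closes.mean(axis=1, skipna=True),   # mean price across available exchanges
    'volume': volumes.sum(axis=1, min_count=1),  # total volume across exchanges
})
print(index.head())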
[Image: BNBUSD Dataset Summary: https://i.imgur.com/aqtuPay.png]
[Image: Combined Dataset Close Plot: https://i.imgur.com/mnzs2f4.png] This plot illustrates the continuity of the dataset over time, with no gaps in data, making it ideal for time series analysis.
Dataset Usage and Diagnostics: This notebook demonstrates how to use the dataset and includes a powerful data diagnostics function, which is useful for all time series analyses.
Aggregating Multiple Data Sources: This notebook walks you through the process of combining multiple exchange datasets into a single, clean dataset. (Currently unavailable, will be added shortly)
Dataset Overview:
Species: Name of the fish species (e.g., Anabas testudineus)
Length: Length of the fish (in centimeters)
Weight: Weight of the fish (in grams)
W/L Ratio: Weight-to-length ratio of the fish
Steps to Build the Prediction Model:
Data Preprocessing:
1 - Handle Missing Values: Check for and handle any missing values appropriately using methods like:
Imputation (mean/median for numeric data)
Row or column removal (if data is too sparse)
2 - Convert Data Types: Ensure numerical columns (Length, Weight, W/L Ratio) are in the correct numeric format.
3 - Handle Categorical Variables: Convert the Species column into numerical format using:
One-Hot Encoding
Label Encoding
Feature Selection:
1 - Correlation Analysis: Use correlation heatmaps or statistical tests to identify features most related to the target variable (e.g., Weight).
2 - Feature Importance: Use tree-based models (like Random Forest) to determine which features are most predictive.
Model Selection:
1 - Algorithm Choice: Choose suitable machine learning algorithms such as:
Linear Regression
Decision Tree Regressor
Random Forest Regressor
Gradient Boosting Regressor
2 - Model Comparison: Evaluate each model using metrics like:
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
R-squared (R²)
Model Training and Evaluation:
1 - Train the Model: Split the dataset into training and testing sets (e.g., 80/20 split). Train the selected model(s) on the training set.
2 - Evaluate the Model: Use the test set to assess model performance and fine-tune as necessary using grid search or cross-validation.
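A minimal sketch of the workflow above, assuming the data are in a CSV file (file name assumed) with the Species, Length, Weight, and W/L Ratio columns from the overview. W/L Ratio is deliberately left out of the features, since it already encodes the target.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
df = pd.read_csv('fish_data.csv')  # assumed file name
# 1 - Handle missing values: impute numeric columns with the median
for col in ['Length', 'Weight', 'W/L Ratio']:
    df[col] = df[col].fillna(df[col].median())
# 3 - Handle categorical variables: one-hot encode Species
X = pd.get_dummies(df[['Species', 'Length']], columns=['Species'])
y = df['Weight']
# Train/test split (80/20) and model fit
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
# Evaluate with MAE and R-squared
pred = model.predict(X_test)
print('MAE:', mean_absolute_error(y_test, pred))
print('R2:', r2_score(y_test, pred))
# Feature importance from the tree-based model
print(dict(zip(X.columns, model.feature_importances_)))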
This dataset and workflow are useful for exploring biometric relationships in fish and building regression models to predict weight based on length or species. Great for marine biology, aquaculture analytics, and educational projects.
Happy modeling! Please upvote if you found this helpful!
https://www.kaggle.com/code/abdelrahman16/fish-clustering-diverse-techniques
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents median household incomes for various household sizes in Forest View, IL, as reported by the U.S. Census Bureau. The dataset highlights the variation in median household income with the size of the family unit, offering valuable insights into economic trends and disparities within different household sizes, aiding in data analysis and decision-making.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Household Sizes:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Forest View median household income. You can refer to the same here.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset provides an in-depth look at YouTube video analytics, capturing key metrics related to video performance, audience engagement, revenue generation, and viewer behavior. Sourced from real video data, it highlights how variables like video duration, upload time, and ad impressions contribute to monetization and audience retention. This dataset is ideal for data analysts, content creators, and marketers aiming to uncover trends in viewer engagement, optimize content strategies, and maximize ad revenue. Inspired by the evolving landscape of digital content, it serves as a resource for understanding the impact of YouTube metrics on channel growth and content reach.
Video Details: Columns like Video Duration, Video Publish Time, Days Since Publish, Day of Week.
Revenue Metrics: Includes Revenue per 1000 Views (USD), Estimated Revenue (USD), Ad Impressions, and various ad revenue sources (e.g., AdSense, DoubleClick).
Engagement Metrics: Metrics such as Views, Likes, Dislikes, Shares, Comments, Average View Duration, Average View Percentage (%), and Video Thumbnail CTR (%).
Audience Data: Data on New Subscribers, Unsubscribes, Unique Viewers, Returning Viewers, and New Viewers.
Monetization & Transaction Metrics: Details on Monetized Playbacks, Playback-Based CPM, YouTube Premium Revenue, and transactions like Orders and Total Sales Volume (USD).
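As a quick illustration of how these columns fit together, the sketch below derives an engagement rate and recomputes revenue per 1,000 views from the raw columns. The file name is an assumption; the column names follow the lists above.
import pandas as pd
df = pd.read_csv('youtube_analytics.csv')  # assumed file name
# Engagement rate: interactions per view
df['Engagement Rate (%)'] = (df['Likes'] + df['Comments'] + df['Shares']) / df['Views'] * 100
# RPM recomputed from estimated revenue and views; should track the provided column
df['RPM (check)'] = df['Estimated Revenue (USD)'] / df['Views'] * 1000
print(df[['Views', 'Engagement Rate (%)',
          'Revenue per 1000 Views (USD)', 'RPM (check)']].describe())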
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset is a synthetic collection of student performance data created for data preprocessing, cleaning, and analysis practice in Data Mining and Machine Learning courses. It contains information about 1,020 students, including their study habits, attendance, and test performance, with intentionally introduced missing values, duplicates, and outliers to simulate real-world data issues.
The dataset is suitable for laboratory exercises, assignments, and demonstration of key preprocessing techniques such as:
| Column Name | Description |
|---|---|
| Student_ID | Unique identifier for each student (e.g., S0001, S0002, โฆ) |
| Age | Age of the student (between 18 and 25 years) |
| Gender | Gender of the student (Male/Female) |
| Study_Hours | Average number of study hours per day (contains missing values and outliers) |
| Attendance(%) | Percentage of class attendance (contains missing values) |
| Test_Score | Final exam score (0โ100 scale) |
| Grade | Letter grade derived from test scores (F, C, B, A, A+) |
Test_Score → Predict test score based on study hours, attendance, age, and gender.
Predict the studentโs test score using their study hours, attendance percentage, and age.
Sample Features: X = ['Age', 'Gender', 'Study_Hours', 'Attendance(%)'], y = ['Test_Score']
You can use:
And analyze feature influence using correlation or SHAP/LIME explainability.
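A minimal sketch of that regression task, assuming a file named student_performance.csv and the column names from the table above:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
df = pd.read_csv('student_performance.csv')  # assumed file name
# Clean up the intentionally introduced issues: duplicates and missing values
df = df.drop_duplicates(subset='Student_ID')
for col in ['Study_Hours', 'Attendance(%)']:
    df[col] = df[col].fillna(df[col].median())
# Encode Gender and assemble the feature matrix listed above
X = pd.get_dummies(df[['Age', 'Gender', 'Study_Hours', 'Attendance(%)']],
                   columns=['Gender'], drop_first=True)
y = df['Test_Score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print('MAE:', mean_absolute_error(y_test, model.predict(X_test)))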
This dataset contains 1000000 rows of realistic student performance data, designed for beginners in Machine Learning to practice Linear Regression, model training, and evaluation techniques.
Each row represents one student with features like study hours, attendance, class participation, and final score.
The dataset is small, clean, and structured to be beginner-friendly.
Random noise simulates differences in learning ability, motivation, etc.
Regression Tasks
- Predict total_score from weekly_self_study_hours, attendance_percentage and class_participation.
Classification Tasks
- Predict grade (A-F) using study hours, attendance, and participation.
Model Evaluation Practice
This dataset is intentionally kept simple, so that new ML learners can clearly see the relationship between input features (study, attendance, participation) and output (score/grade).
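A minimal sketch of the grade classification task, assuming a CSV file (name assumed) with the weekly_self_study_hours, attendance_percentage, class_participation, and grade columns mentioned in the task list:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
df = pd.read_csv('student_scores.csv')  # assumed file name
features = ['weekly_self_study_hours', 'attendance_percentage', 'class_participation']
X, y = df[features], df['grade']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, pred))
print(classification_report(y_test, pred))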
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
Kokoro Speech Dataset is a public domain Japanese speech dataset. It contains 34,958 short audio clips of a single speaker reading 9 novel books. The format of the metadata is similar to that of LJ Speech so that the dataset is compatible with modern speech synthesis systems.
The texts are from Aozora Bunko, which is in the public domain. The audio clips are from the LibriVox project, which is also in the public domain. Readings are estimated by MeCab and UniDic Lite from the kanji-kana mixture text, and are romanized in a format similar to that used by Julius.
The audio clips were split and transcripts were aligned automatically by Voice100.
Listen from your browser or download randomly sampled 100 clips.
Metadata is provided in metadata.csv. This file consists of one record per line,
delimited by the pipe character (0x7c). The fields are:
Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz.
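Because the field list above did not survive extraction, the sketch below assumes only what the description states, namely a pipe-delimited metadata.csv, and inspects the file without naming the columns.
import csv
import pandas as pd
# metadata.csv is pipe-delimited (0x7c); a header row is assumed absent, as in LJ Speech
meta = pd.read_csv('metadata.csv', sep='|', header=None, quoting=csv.QUOTE_NONE)
print(meta.shape)   # number of records and number of fields
print(meta.head())  # first few records, to inspect the field layout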
The dataset is provided in different sizes: large, small, and tiny. small and tiny don't share the same clips; large contains all available clips, including those in small and tiny.
Large:
Total clips: 34958
Min duration: 3.007 secs
Max duration: 14.745 secs
Mean duration: 4.978 secs
Total duration: 48:20:24
Small:
Total clips: 8812
Min duration: 3.007 secs
Max duration: 14.431 secs
Mean duration: 4.951 secs
Total duration: 12:07:12
Tiny:
Total clips: 285
Min duration: 3.019 secs
Max duration: 9.462 secs
Mean duration: 4.871 secs
Total duration: 00:23:08
Because of the large size of the dataset, audio files are not included in this repository, but the metadata is.
To make .wav files of the dataset, run
$ bash download.sh
to download the metadata from the project page. Then run
$ pip3 install torchaudio
$ python3 extract.py --size tiny
This prints a shell script example to download MP3 audio files from archive.org and extract them if you haven't done it already.
After doing so, run the command again
$ python3 extract.py --size tiny
to get files for tiny under ./output directory.
You can give another size name to the --size option to get a dataset of that size.
A pretrained Tacotron model trained with the Kokoro Speech Dataset and audio samples are available. The model was trained for 21K steps with small. According to the above repo, "Speech started to become intelligible around 20K steps" with the LJ Speech Dataset. The audio samples read the first few sentences from Gon Gitsune, which is not included in small.
The dataset contains recordings from these books read by ekzemplaro
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents the mean household income for each of the five quintiles in Bay View, OH, as reported by the U.S. Census Bureau. The dataset highlights the variation in mean household income across quintiles, offering valuable insights into income distribution and inequality.
Key observations
[Chart: Mean household income by quintiles in Bay View, OH (in 2022 inflation-adjusted dollars): https://i.neilsberg.com/ch/bay-view-oh-mean-household-income-by-quintiles.jpeg]
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Income Levels:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Bay View median household income. You can refer to the same here.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 2,000 rows of data from coffee shops, offering detailed insights into factors that influence daily revenue. It includes key operational and environmental variables that provide a comprehensive view of how business activities and external conditions affect sales performance. Designed for use in predictive analytics and business optimization, this dataset is a valuable resource for anyone looking to understand the relationship between customer behavior, operational decisions, and revenue generation in the food and beverage industry.
The dataset features a variety of columns that capture the operational details of coffee shops, including customer activity, store operations, and external factors such as marketing spend and location foot traffic.
Number of Customers Per Day
Average Order Value ($)
Operating Hours Per Day
Number of Employees
Marketing Spend Per Day ($)
Location Foot Traffic (people/hour)
The dataset spans a wide variety of operational scenarios, from small neighborhood coffee shops with limited traffic to larger, high-traffic locations with extensive marketing budgets. This variety allows for exploring different predictive modeling strategies. Key insights that can be derived from the data include:
The dataset offers a wide range of applications, especially in predictive analytics, business optimization, and forecasting:
For coffee shop owners, managers, and analysts in the food and beverage industry, this dataset provides an essential tool for refining daily operations and boosting profitability. Insights gained from this data can help:
This dataset is also ideal for aspiring data scientists and machine learning practitioners looking to apply their skills to real-world business problems in the food and beverage sector.
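A minimal sketch of a revenue model built from the six operational columns listed above. The file name and the revenue column name are assumptions; swap in the actual names from the CSV header.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
df = pd.read_csv('coffee_shop_revenue.csv')  # assumed file name
features = ['Number of Customers Per Day', 'Average Order Value ($)',
            'Operating Hours Per Day', 'Number of Employees',
            'Marketing Spend Per Day ($)', 'Location Foot Traffic (people/hour)']
target = 'Daily Revenue ($)'  # assumed column name for the revenue target
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print('MAE:', mean_absolute_error(y_test, model.predict(X_test)))
# Per-feature effect on daily revenue, holding the others fixed
print(dict(zip(features, model.coef_)))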
The Coffee Shop Revenue Prediction Dataset is a versatile and comprehensive resource for understanding the dynamics of daily sales performance in coffee shops. With a focus on key operational factors, it is perfect for building predictive models, ...