100+ datasets found

f
Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...
frontiersin.figshare.com
docx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.3389/fninf.2021.715421.s002
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
d
Replication Data for: A Practical Method to Reduce Privacy Loss when...
search.dataone.org
dataverse.harvard.edu
Updated Nov 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chetty, Raj; Friedman, John (2023). Replication Data for: A Practical Method to Reduce Privacy Loss when Disclosing Statistics Based on Small Samples [Dataset]. http://doi.org/10.7910/DVN/RCHDXX
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/RCHDXX
Dataset updated
Nov 12, 2023
Dataset provided by
Harvard Dataverse
Authors
Chetty, Raj; Friedman, John
Description
This dataset contains replication files for "A Practical Method to Reduce Privacy Loss when Disclosing Statistics Based on Small Samples" by Raj Chetty and John Friedman. For more information, see https://opportunityinsights.org/paper/differential-privacy/. A summary of the related publication follows. Releasing statistics based on small samples – such as estimates of social mobility by Census tract, as in the Opportunity Atlas – is very valuable for policy but can potentially create privacy risks by unintentionally disclosing information about specific individuals. To mitigate such risks, we worked with researchers at the Harvard Privacy Tools Project and Census Bureau staff to develop practical methods of reducing the risks of privacy loss when releasing such data. This paper describes the methods that we developed, which can be applied to disclose any statistic of interest that is estimated using a sample with a small number of observations. We focus on the case where the dataset can be broken into many groups (“cells”) and one is interested in releasing statistics for one or more of these cells. Building on ideas from the differential privacy literature, we add noise to the statistic of interest in proportion to the statistic’s maximum observed sensitivity, defined as the maximum change in the statistic from adding or removing a single observation across all the cells in the data. Intuitively, our approach permits the release of statistics in arbitrarily small samples by adding sufficient noise to the estimates to protect privacy. Although our method does not offer a formal privacy guarantee, it generally outperforms widely used methods of disclosure limitation such as count-based cell suppression both in terms of privacy loss and statistical bias. We illustrate how the method can be implemented by discussing how it was used to release estimates of social mobility by Census tract in the Opportunity Atlas. We also provide a step-by-step guide and illustrative Stata code to implement our approach.
f
Data from: Data Nuggets: A Method for Reducing Big Data While Preserving...
tandf.figshare.com
tar
Updated Jun 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure [Dataset]. http://doi.org/10.6084/m9.figshare.25594361.v1
Explore at:
tarAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25594361.v1
Dataset updated
Jun 11, 2024
Dataset provided by
Taylor & Francis
Authors
Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Big data, with N × P dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods are not always possible to apply to big data due to memory or time constraints generated by calculations of order P*N(N−1)2. To circumvent this problem, typically the clustering technique is applied to a random sample drawn from the dataset; however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of “data nuggets”, which reduces a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data nuggets based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online.
Orlando Neighborhood
kaggle.com
Updated Oct 7, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sebastian Giovannini (2022). Orlando Neighborhood [Dataset]. https://www.kaggle.com/datasets/sgiov95/orlando-neighborhood
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 7, 2022
Dataset provided by
Kaggle
Authors
Sebastian Giovannini
Area covered
Orlando
Description
This dataset is a snapshot from October 2022 of all 48 homes in a section of a neighborhood nearby a large university in Central Florida. All of the homes are single family homes featuring a garage, a driveway, and a fenced-in backyard. Data was gathered by hand (keyboard) via a collection of sites, including Zillow, Realtor, Redfin, Trulia, and Orange County Property Appraiser. All homes were built in the same year in the early 2000's and feature central air and all other utilities typical of contemporary suburban homes in the United States. The area is close to a university and a large portion of renters are college students and young professionals, as well as families and older adults.

There are 30 columns:

HID: House ID, a unique identifier for each house (int from 1 to 48, not the actual address number) -Sqft: The Square Footage of the Interior of the house (int) -LandSqft: The Total Square Footage of the land (int) -Neighbors: The number of homes directly adjacent to each house (int) -Stories: The number of stories in each house (int) -Pool: Does the house have a pool (int, 0 for 'No', 1 for 'Yes') -Bedrooms: The number of bedrooms in each house (int) -Bathrooms: The number of bathrooms (full or half) in each house (int) -DateLastSold: The date on which the house was last sold (datetime) -PropertyTaxes2022: The annual property taxes for 2022 (float) -OwnedByBank: Is the house owned by a bank (int, 0 for 'No', 1 for 'Yes') -OuterPortion: Is the house on the Outer Portion of the Neighborhood (int, 0 for 'No', 1 for 'Yes') -NextToLoudRoad: Is the house directly adjacent to a loud road (int, 0 for 'No', 1 for 'Yes') -PriceLastSold: Price that the house was last sold for (float) -Zestimate: Zillow's Price Estimate for the house (float) -RentZestimate: Zillow's Estimate for the Monthly Price of rent for the house (float) -RealtorcomEstimate: Realtor dot com's Estimate for the house (float) -RedfinEstimate: Redfin's Estimate for the house (float) -TruliaEstimate: Trulia's Estimate for the house (float) -OCPALandValue2022: The Land Value on the county's 2022 records (float) -OCPABuildingValue2022: The Building Value on the county's 2022 records (float) -OCPAFeaturesValue2022: The Features Value on the county's 2022 records (float) -OCPAMarketValue2022: The Market Value on the county's 2022 records (float) -OCPAAssessedValue2022: The Assessed Value on the county's 2022 records (float), AKA what homeowners are taxed on -OCPALandValue2021: The Land Value on the county's 2021 records (float) -OCPABuildingValue2021: The Building Value on the county's 2021 records (float) -OCPAFeaturesValue2021: The Features Value on the county's 2021 records (float) -OCPAMarketValue2021: The Market Value on the county's 2021 records (float) -OCPAAssessedValue2021: The Assessed Value on the county's 2021 records (float), AKA what homeowners are taxed on -Notes: any notes on any of the homes (str)

Note that while the dataset is exhaustive in that it has all of the houses, some homes are missing some columns, typically because a home did not feature a estimate on a site or the one home not found on the property appraiser's site. This also is therefore not a randomized dataset, so the only population of homes that it can be used to infer on are those within this specific portion of the neighborhood. Personally, I am going to use the dataset to practice a couple of aspects of real-world data: Cleaning, Imputing, and Exploratory Data Analysis. Mainly, I want to compare different approaches to filling in the missing values of the dataset, then do some Model Building with some additional Dimensionality Reduction.
h
Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...
datahub.hku.hk
Updated Aug 13, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wen Ma (2020). Restricted Boltzmann Machine for Missing Data Imputation in Biomedical Datasets [Dataset]. http://doi.org/10.25442/hku.12752549.v1
Explore at:
Unique identifier
https://doi.org/10.25442/hku.12752549.v1
Dataset updated
Aug 13, 2020
Dataset provided by
HKU Data Repository
Authors
Wen Ma
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
NCCTG Lung cancer datasetSurvival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.2.CNV measurements of CNV of GBM This dataset records the information about copy number variation of Glioblastoma (GBM).Abstract:In biology and medicine, conservative patient and data collection malpractice can lead to missing or incorrect values in patient registries, which can affect both diagnosis and prognosis. Insufficient or biased patient information significantly impedes the sensitivity and accuracy of predicting cancer survival. In bioinformatics, making a best guess of the missing values and identifying the incorrect values are collectively called “imputation”. Existing imputation methods work by establishing a model based on the data mechanism of the missing values. Existing imputation methods work well under two assumptions: 1) the data is missing completely at random, and 2) the percentage of missing values is not high. These are not cases found in biomedical datasets, such as the Cancer Genome Atlas Glioblastoma Copy-Number Variant dataset (TCGA: 108 columns), or the North Central Cancer Treatment Group Lung Cancer (NCCTG) dataset (NCCTG: 9 columns). We tested six existing imputation methods, but only two of them worked with these datasets: The Last Observation Carried Forward (LOCF) and K-nearest Algorithm (KNN). Predictive Mean Matching (PMM) and Classification and Regression Trees (CART) worked only with the NCCTG lung cancer dataset with fewer columns, except when the dataset contains 45% missing data. The quality of the imputed values using existing methods is bad because they do not meet the two assumptions.In our study, we propose a Restricted Boltzmann Machine (RBM)-based imputation method to cope with low randomness and the high percentage of the missing values. RBM is an undirected, probabilistic and parameterized two-layer neural network model, which is often used for extracting abstract information from data, especially for high-dimensional data with unknown or non-standard distributions. In our benchmarks, we applied our method to two cancer datasets: 1) NCCTG, and 2) TCGA. The running time, root mean squared error (RMSE) of the different methods were gauged. The benchmarks for the NCCTG dataset show that our method performs better than other methods when there is 5% missing data in the dataset, with 4.64 RMSE lower than the best KNN. For the TCGA dataset, our method achieved 0.78 RMSE lower than the best KNN.In addition to imputation, RBM can achieve simultaneous predictions. We compared the RBM model with four traditional prediction methods. The running time and area under the curve (AUC) were measured to evaluate the performance. Our RBM-based approach outperformed traditional methods. Specifically, the AUC was up to 19.8% higher than the multivariate logistic regression model in the NCCTG lung cancer dataset, and the AUC was higher than the Cox proportional hazard regression model, with 28.1% in the TCGA dataset.Apart from imputation and prediction, RBM models can detect outliers in one pass by allowing the reconstruction of all the inputs in the visible layer with in a single backward pass. Our results show that RBM models have achieved higher precision and recall on detecting outliers than other methods.
Data from: WiBB: An integrated method for quantifying the relative...
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Jun 5, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qin Li; Qin Li; Xiaojun Kou; Xiaojun Kou (2022). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.xsj3tx9g1
Dataset updated
Jun 5, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Qin Li; Qin Li; Xiaojun Kou; Xiaojun Kou
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains simulated datasets, empirical data, and R scripts described in the paper: "Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)".

A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and bootstrap resampling technique (B). We applied the WiBB in simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, relative sum of wight (SWi), and standardized beta (ß*), to evaluate their performance in comparison with the WiBB method on ranking predictor importances under various scenarios. We also applied it to an empirical dataset in a plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results in the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB in the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculation of the new metric over more sophisticated statistical procedures, makes it a handy method in the statistical toolbox.

Zomato Restaurant Dataset

kaggle.com

Updated Jun 19, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Gaurav Kumar (2025). Zomato Restaurant Dataset [Dataset]. https://www.kaggle.com/datasets/gauravkumar2525/zomato-restaurant-dataset

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jun 19, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Gaurav Kumar

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

📘 ABOUT

The Enhanced Zomato Dataset provides comprehensive information on restaurants, including user ratings, cuisine types, prices, and geographic details. This enhanced version of the popular Zomato dataset includes carefully cleaned data and newly engineered features to support advanced analytics, trend analysis, and machine learning applications.

It is especially valuable for data scientists, analysts, and machine learning practitioners seeking to build recommendation systems, price predictors, or restaurant review models.

✅ Key Features of the Dataset:

✅ Cleaned data with no missing values
✅ Contains restaurant, cuisine, city, and pricing information
✅ Includes user ratings for both dining and delivery experiences
✅ Features engineered columns such as popularity scores and price-per-vote ratios
✅ Ready for data visualization, machine learning, and business insights

This dataset is an excellent resource for exploring food industry patterns, building ML models, and performing customer behavior analysis.

📂 FILE INFORMATION

The dataset contains structured records of restaurant details, user ratings, pricing, and engineered features. It was compiled from a public Zomato dataset and enhanced through feature engineering and cleaning techniques.

File Type: CSV
Data Rows: 123,657
Data Fields: Restaurant details, cuisine, city, prices, votes, ratings, and additional engineered features

📊 COLUMNS DESCRIPTION

Column Name	Description
`Restaurant_Name`	Name of the restaurant listed on Zomato.
`Dining_Rating`	User rating for the dine-in experience (0.0 to 5.0).
`Delivery_Rating`	User rating for the delivery experience (0.0 to 5.0).
`Dining_Votes`	Number of votes received for dine-in service.
`Delivery_Votes`	Number of votes received for delivery service.
`Cuisine`	Type of cuisine served (e.g., Fast Food, Chinese).
`Place_Name`	Local area or neighborhood of the restaurant.
`City`	City in which the restaurant is located.
`Item_Name`	Name of the menu item listed.
`Best_Seller`	Bestseller status (e.g., BESTSELLER, MUST TRY, NONE).
`Votes`	Combined total votes received.
`Prices`	Price of the menu item in INR.
`Average_Rating`	Mean rating calculated from available sources.
`Total_Votes`	Sum of all types of votes.
`Price_per_Vote`	Ratio of price to total votes (used to evaluate value for money).
`Log_Price`	Log-transformed price to reduce skewness in analysis.
`Is_Bestseller`	Binary flag indicating if item is marked as a bestseller.
`Restaurant_Popularity`	Number of items listed by the restaurant in the dataset.
`Avg_Rating_Restaurant`	Average rating of all items from the same restaurant.
`Avg_Price_Restaurant`	Average price of all items from the same restaurant.
`Avg_Rating_Cuisine`	Average rating across all restaurants serving the same cuisine.
`Avg_Price_Cuisine`	Average price across all restaurants serving the same cuisine.
`Avg_Rating_City`	Average rating across all restaurants in the same city.
`Avg_Price_City`	Average price of menu items in the same city.
`Is_Highly_Rated`	Binary flag for ratings ≥ 4.0.
`Is_Expensive`	Binary flag for prices above city’s average.

m
Educational Attainment in North Carolina Public Schools: Use of statistical...
data.mendeley.com
Updated Nov 14, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
Explore at:
Unique identifier
https://doi.org/10.17632/6cm9wyd5g5.1
Dataset updated
Nov 14, 2018
Authors
Scott Herford
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The purpose of data mining analysis is always to find patterns of the data using certain kind of techiques such as classification or regression. It is not always feasible to apply classification algorithms directly to dataset. Before doing any work on the data, the data has to be pre-processed and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. Based on our project, after using clustering prior to classification, the performance has not improved much. The reason why it has not improved could be the features we selected to perform clustering are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: It is different from Principle Component Analysis which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique of reducing the data dimension will lose a lot of information since clustering techniques are based a metric of 'distance'. At high dimensions euclidean distance loses pretty much all meaning. Therefore using clustering as a "Reducing" dimensionality by mapping data points to cluster numbers is not always good since you may lose almost all the information. From the creating new features perspective: Clustering analysis creates labels based on the patterns of the data, it brings uncertainties into the data. By using clustering prior to classification, the decision on the number of clusters will highly affect the performance of the clustering, then affect the performance of classification. If the part of features we use clustering techniques on is very suited for it, it might increase the overall performance on classification. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not lock in the clustering outputs using a random_state in the effort to see if they were stable. Our assumption was that if the results vary highly from run to run which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results are not much better than random when applying clustering to the data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model real world effectiveness and also to continue to revise the models from time to time as things change.
d
Dataset from the Upper Mississippi River Restoration Program (1993-2019) to...
catalog.data.gov
datasets.ai
Updated Jul 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Dataset from the Upper Mississippi River Restoration Program (1993-2019) to reconstruct missing data by comparing interpolation techniques [Dataset]. https://catalog.data.gov/dataset/dataset-from-the-upper-mississippi-river-restoration-program-1993-2019-to-reconstruct-miss
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Area covered
Upper Mississippi River, Mississippi River
Description
The dataset accompanies the scientific article,"Reconstructing missing data by comparing interpolation techniques: applications for long-term water quality data." Missingness is typical in large datasets, but intercomparisons of interpolation methods can alleviate data gaps and common problems associated with missing data. We compared seven popular interpolation methods for predicting missing values in a long-term water quality data set from the upper Mississippi River, USA.
National Health and Nutrition Examination Survey
kaggle.com
Updated Jan 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). National Health and Nutrition Examination Survey [Dataset]. https://www.kaggle.com/datasets/thedevastator/national-health-and-nutrition-examination-survey
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 12, 2023
Dataset provided by
Kaggle
Authors
The Devastator
Description
National Health and Nutrition Examination Survey (NHANES) Data

Health Indicators for Different Locations

By Centers for Disease Control and Prevention [source]

About this dataset

This dataset offers an in-depth look into the National Health and Nutrition Examination Survey (NHANES), which provides valuable insights on various health indicators throughout the United States. It includes important information such as the year when data was collected, location of the survey, data source and value, priority areas of focus, category and topic related to the survey, break out categories of data values, geographic location coordinates and other key indicators.Discover patterns in mortality rates from cardiovascular disease or analyze if pregnant women are more likely to report poor health than those who are not expecting with this NHANES dataset — a powerful collection for understanding personal health behaviors

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

Step 1: Understand the Data Format - Before beginning to work with NHANES data, you should become familiar with the different columns in the dataset. Each column contains a specific type of information about the data such as year collected, geographic location abbreviations and descriptions, sources used for collecting data, priority areas assigned by researchers or institutions associated with understanding health trends in a given area or population group as well as indicator values related to nutrition/health.

Step 2: Choose an Indicator - Once you understand what is included in each column and what type of values correspond to each field it is time to select which indicator(s) you would like plots or visualizations against demographic/geographical characteristics represented by NHANES data. Selecting an appropriate indicator helps narrow down your search criteria when conducting analyses of health/nutrition trends over time in different locations or amongst different demographic groups.

Step 3: Utilizing Subsets - When narrowing down your search criteria it may be beneficial to break up large datasets into smaller subsets that focus on a single area or topic for study (i.e., looking at nutrition trends among rural communities). This allows users to zoom into certain datasets if needed within their larger studies so they can further drill down on particular topics that are relevant for their research objectives without losing greater context from more general analysis results when viewing overall datasets containing all available fields for all locations examined by NHANES over many years of records collected at specific geographical areas requested within the parameters set forth by those wanting insights from external research teams utilizing this dataset remotely via Kaggle access granted through user accounts giving them authorized access controls solely limited by base administration permissions set forth where required prior granting needs authorization process has been met prior downloading/extraction activities successful completion finalized allowed beyond initial site signup page make sure rules followed while also ensuring positive experience interactive engagement processes fluid flow signature one-time registration entry after exit page exits once completed neutralize logout button pops finish downloading extract image files transfer end destination requires hard drive storage efficient manner duplicate second backup remain resilient mitigate file corruption concerns start working properly formatted smooth transition between systems be seamless reflective channel dynamic organization approach complement function beneficial effort allow comprehensive review completed quality control standards align desires outcomes desired critical path

Research Ideas

Creating a health calculator to help people measure their health risk. The indicator and data value fields can be used to create an algorithm that will generate a personalized label for each user's health status.

Developing a visual representation of the nutritional habits of different populations based on the DataSource, LocationAbbr, and PriorityArea fields from this dataset.

Employing machine learning to discern patterns in the data or predict potential health risks in different regions or populations by using the GeoLocation field as inputs for geographic analysis.

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

**Unknown License - Please check the dataset description for more information....
Employment Of India CLeaned and Messy Data
kaggle.com
Updated Apr 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SONIA SHINDE (2025). Employment Of India CLeaned and Messy Data [Dataset]. https://www.kaggle.com/datasets/soniaaaaaaaa/employment-of-india-cleaned-and-messy-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 7, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
SONIA SHINDE
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Area covered
India
Description
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.

🔹 Dataset Composition:

It includes two parallel datasets: 1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries. 2. Cleaned Dataset – This version demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.

Each record captures multiple attributes related to individuals in the Indian job market, including: - Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording

Transformations & Cleaning Applied:

The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form: - Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques. - Duplicate Records: Identified using row comparison and removed to prevent analytical skew. - Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing. - Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis. - Outliers: Detected and handled based on domain logic and distribution analysis. - Categorization: Converted numeric ages into grouped age categories for comparative analysis. - Standardization: Uniform labels for employment status, industry names, education, and AI risk levels were applied for visualization clarity.

Purpose & Utility:

This dataset is ideal for learners and professionals who want to understand: - The impact of messy data on visualization and insights - How transformation steps can dramatically improve data interpretation - Practical examples of preprocessing techniques before feeding into ML models or BI tools

It's also useful for: - Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines

By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
f
Predictive modeling of treatment resistant depression using data from STAR*D...
plos.figshare.com
docx
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zhi Nie; Srinivasan Vairavan; Vaibhav A. Narayan; Jieping Ye; Qingqin S. Li (2023). Predictive modeling of treatment resistant depression using data from STAR*D and an independent clinical study [Dataset]. http://doi.org/10.1371/journal.pone.0197268
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0197268
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Zhi Nie; Srinivasan Vairavan; Vaibhav A. Narayan; Jieping Ye; Qingqin S. Li
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Identification of risk factors of treatment resistance may be useful to guide treatment selection, avoid inefficient trial-and-error, and improve major depressive disorder (MDD) care. We extended the work in predictive modeling of treatment resistant depression (TRD) via partition of the data from the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) cohort into a training and a testing dataset. We also included data from a small yet completely independent cohort RIS-INT-93 as an external test dataset. We used features from enrollment and level 1 treatment (up to week 2 response only) of STAR*D to explore the feature space comprehensively and applied machine learning methods to model TRD outcome at level 2. For TRD defined using QIDS-C16 remission criteria, multiple machine learning models were internally cross-validated in the STAR*D training dataset and externally validated in both the STAR*D testing dataset and RIS-INT-93 independent dataset with an area under the receiver operating characteristic curve (AUC) of 0.70–0.78 and 0.72–0.77, respectively. The upper bound for the AUC achievable with the full set of features could be as high as 0.78 in the STAR*D testing dataset. Model developed using top 30 features identified using feature selection technique (k-means clustering followed by χ2 test) achieved an AUC of 0.77 in the STAR*D testing dataset. In addition, the model developed using overlapping features between STAR*D and RIS-INT-93, achieved an AUC of > 0.70 in both the STAR*D testing and RIS-INT-93 datasets. Among all the features explored in STAR*D and RIS-INT-93 datasets, the most important feature was early or initial treatment response or symptom severity at week 2. These results indicate that prediction of TRD prior to undergoing a second round of antidepressant treatment could be feasible even in the absence of biomarker data.
d
3.27 Traffic Delay Reduction (summary)
catalog.data.gov
open.tempe.gov
+10more
Updated Jul 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
City of Tempe (2025). 3.27 Traffic Delay Reduction (summary) [Dataset]. https://catalog.data.gov/dataset/3-27-traffic-delay-reduction-summary-3d3ad
Explore at:
Dataset updated
Jul 5, 2025
Dataset provided by
City of Tempe
Description
The city is using Travel Time Index as a measure to quantify traffic delay in the city. The Travel Time Index is the ratio of the travel time during the peak period to the time required to make the same trip at free-flow speeds. It should be noted that this data is subject to seasonal variations. The 2020 Q2 and Q3 data includes the summer months when traffic volumes are lower, thus the Travel Time Index is improved in these quarters. The performance measure page is available at 3.27 Traffic Delay Reduction. Additional Information Source: Bluetooth ARID sensors Contact (author): Cathy Hollow Contact E-Mail (author): catherine_hollow@tempe.gov Contact (maintainer): Contact E-Mail (maintainer): Data Source Type: Table, CSV Preparation Method: Peak period data is manually extracted. The travel time index calculation is the peak period data divided by the free flow data (constant per segment). Publish Frequency: Quarterly Publish Method: Manual Data Dictionary
Artificial Intelligence (AI) Training Dataset Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Artificial Intelligence (AI) Training Dataset Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/artificial-intelligence-training-dataset-market-global-industry-analysis
Explore at:
csv, pptx, pdfAvailable download formats
Dataset updated
Aug 4, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Artificial Intelligence (AI) Training Dataset Market Outlook

According to our latest research, the global Artificial Intelligence (AI) Training Dataset market size reached USD 3.15 billion in 2024, reflecting robust industry momentum. The market is expanding at a notable CAGR of 20.8% and is forecasted to attain USD 20.92 billion by 2033. This impressive growth is primarily attributed to the surging demand for high-quality, annotated datasets to fuel machine learning and deep learning models across diverse industry verticals. The proliferation of AI-driven applications, coupled with rapid advancements in data labeling technologies, is further accelerating the adoption and expansion of the AI training dataset market globally.

One of the most significant growth factors propelling the AI training dataset market is the exponential rise in data-driven AI applications across industries such as healthcare, automotive, retail, and finance. As organizations increasingly rely on AI-powered solutions for automation, predictive analytics, and personalized customer experiences, the need for large, diverse, and accurately labeled datasets has become critical. Enhanced data annotation techniques, including manual, semi-automated, and fully automated methods, are enabling organizations to generate high-quality datasets at scale, which is essential for training sophisticated AI models. The integration of AI in edge devices, smart sensors, and IoT platforms is further amplifying the demand for specialized datasets tailored for unique use cases, thereby fueling market growth.

Another key driver is the ongoing innovation in machine learning and deep learning algorithms, which require vast and varied training data to achieve optimal performance. The increasing complexity of AI models, especially in areas such as computer vision, natural language processing, and autonomous systems, necessitates the availability of comprehensive datasets that accurately represent real-world scenarios. Companies are investing heavily in data collection, annotation, and curation services to ensure their AI solutions can generalize effectively and deliver reliable outcomes. Additionally, the rise of synthetic data generation and data augmentation techniques is helping address challenges related to data scarcity, privacy, and bias, further supporting the expansion of the AI training dataset market.

The market is also benefiting from the growing emphasis on ethical AI and regulatory compliance, particularly in data-sensitive sectors like healthcare, finance, and government. Organizations are prioritizing the use of high-quality, unbiased, and diverse datasets to mitigate algorithmic bias and ensure transparency in AI decision-making processes. This focus on responsible AI development is driving demand for curated datasets that adhere to strict quality and privacy standards. Moreover, the emergence of data marketplaces and collaborative data-sharing initiatives is making it easier for organizations to access and exchange valuable training data, fostering innovation and accelerating AI adoption across multiple domains.

From a regional perspective, North America currently dominates the AI training dataset market, accounting for the largest revenue share in 2024, driven by significant investments in AI research, a mature technology ecosystem, and the presence of leading AI companies and data annotation service providers. Europe and Asia Pacific are also witnessing rapid growth, with increasing government support for AI initiatives, expanding digital infrastructure, and a rising number of AI startups. While North America sets the pace in terms of technological innovation, Asia Pacific is expected to exhibit the highest CAGR during the forecast period, fueled by the digital transformation of emerging economies and the proliferation of AI applications across various industry sectors.

Data Type Analysis

The AI training dataset market is segmented by data type into Text, Image/Video, Audio, and Others, each playing a crucial role in powering different AI applications. Text da
f
Data from: On the role of data balancing for Machine Learning-based Code...
datasetcatalog.nlm.nih.gov
figshare.com
Updated Jun 11, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Di Nucci, Dario; De Roover, Coen; De Lucia, Andrea; Pecorelli, Fabiano (2019). On the role of data balancing for Machine Learning-based Code Smell Detection [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000175016
Explore at:
Dataset updated
Jun 11, 2019
Authors
Di Nucci, Dario; De Roover, Coen; De Lucia, Andrea; Pecorelli, Fabiano
Description
Code smells can compromise software quality in the long term by inducing technical debt.For this reason, in the last decade many approaches aimed at identifying these design flaws have been proposed.Most of them are based on heuristics in which a set of metrics (e.g., code metrics, process metrics) is used to detect smelly code components.However, these techniques suffer of subjective interpretation, low agreement between detectors, and threshold dependability.To overcome the limitations, previous work applied Machine Learning techniques that can learn from previous datasets without needing any threshold definition.However, more recent work has shown that Machine Learning is not always suitable for code smell detection due to the highly unbalanced nature of the problem.In this study we investigate several approaches able to mitigate data unbalancing issues to understand their impact on ML-based code smells detection algorithms.Our findings highlight a number of limitations and open issues with respect to the usage of data balancing for ML-based code smell detection.
R
Cdd Dataset
universe.roboflow.com
zip
Updated Sep 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/3
Explore at:
zipAvailable download formats
Dataset updated
Sep 5, 2023
Dataset authored and provided by
hakuna matata
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Cumcumber Diease Detection Bounding Boxes
Description
Project Documentation: Cucumber Disease Detection

Title and Introduction Title: Cucumber Disease Detection

Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

Problem Statement Problem Definition: The research uses image analysis methods to address the issue of automating the identification of diseases, including Downy Mildew, in cucumber plants. Effective disease management in agriculture depends on early illness identification.

Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

Data Collection and Preprocessing Data Sources: The dataset comprises of pictures of cucumber plants from various sources, including both healthy and damaged specimens.

Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.

Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

Exploratory Data Analysis (EDA) The dataset was examined using visuals like scatter plots and histograms. The data was examined for patterns, trends, and correlations. Understanding the distribution of photos of healthy and ill plants was made easier by EDA.

Methodology Machine Learning Algorithms:

Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:

The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.

Model Development The CNN model's architecture consists of layers, units, and activation operations. On the basis of experimentation, hyperparameters including learning rate, batch size, and optimizer were chosen. To avoid overfitting, regularization methods like dropout and L2 regularization were used.

Model Training During training, the model was fed the prepared dataset across a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early halting and model checkpoints were used.

Model Evaluation Evaluation Metrics:

Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:

The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

Results and Discussion Key project findings include model performance and disease detection precision. a comparison of the many models employed, showing the benefits and drawbacks of each. challenges that were faced throughout the project and the methods used to solve them.

Conclusion recap of the project's key learnings. the project's importance to early disease detection in agriculture should be highlighted. Future enhancements and potential research directions are suggested.

References Library: Pillow,Roboflow,YELO,Sklearn,matplotlib Datasets:https://data.mendeley.com/datasets/y6d3z6f8z9/1

Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

Rafiur Rahman Rafit EWU 2018-3-60-111
No Background RGB Arabic Alphabets Sign Language
kaggle.com
Updated Apr 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rabie El Kharoua (2024). No Background RGB Arabic Alphabets Sign Language [Dataset]. https://www.kaggle.com/datasets/rabieelkharoua/arsl-no-background-v2
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 17, 2024
Dataset provided by
Kaggle
Authors
Rabie El Kharoua
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Description

This dataset consists of images representing the Arabic Alphabet in sign language, with the background removed to focus solely on the hand gestures. The dataset aims to facilitate the development of machine learning models for recognizing Arabic sign language, which can significantly improve communication for the hearing-impaired within the Arabic-speaking community.

Dataset Information

Number of images: 6985

Original number of images: 7856

Image resolution: Originally 5432 x 3830 pixels, resized to 224 x 224 pixels

Format: RGB images

Classes: 31 (representing the Arabic alphabet)

File Structure: Organized in folders by class

Dataset Distribuion

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F17372370%2Fd10f204b7affdee4f698b96188ffc1db%2FDistribution%20of%20data.png?generation=1716060833437519&alt=media" alt="">

Data Collection

Source of the dataset: You can find the original dataset (RGB Arabic Alphabets Sign Language DatasetAASL) with the backgrounds Here : Link.

Participants: More than 200 participants

Collection method: Images were collected using various digital cameras and smartphones. Participants were asked to perform the sign language gestures for each letter of the Arabic alphabet.

Environment: Images were taken in various lighting conditions and backgrounds, which were later removed.

Data Preprocessing

Background Removal: Backgrounds of images were removed using image processing techniques to reduce noise and improve the training efficiency of the Convolutional Neural Network (CNN) model.

Data Cleaning: Misclassified and unclear images were reclassified or removed. This process reduced the dataset from 7856 to 6985 images.

Data Resizing: All images were resized from their original resolution to 224 x 224 pixels to standardize input dimensions for the CNN model.

Sample Images

This is a sample from your dataset showing hand gestures with the background removed:

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F17372370%2Fa2e5f22e81937151a36db3477d74cb77%2FWith%20and%20without%20backgound.png?generation=1716060691283986&alt=media" alt="">

Usage

This dataset can be used to train deep learning models for recognizing Arabic Alphabet sign language gestures. The provided data augmentation techniques and preprocessing steps can help improve model accuracy and generalization.

Dataset Statistics

Class Distribution: Equal distribution across all 31 classes

Image Format: PNG or JPEG

Average Image Size: 224 x 224 pixels

Color Channels: 3 (RGB)

Citation

If you use this dataset in your research, please cite the following paper: El Kharoua, R., & Jiang, X.M. (2024). Deep Learning Recognition for Arabic Alphabet Sign Language RGB Dataset. Journal of Computer and Communications, 12, 32-51. https://doi.org/10.4236/jcc.2024.123003

Additional Information

Convolutional Neural Network (CNN) Model: The dataset was used to train a CNN model achieving 99.9% accuracy on the training set and 97.4% validation accuracy.

Preprocessing Techniques: Background removal, resizing, data cleaning, and augmentation.

Links

Paper Link

DOI

Creative Commons License

Contact

For any questions or further information, please contact Rabie El Kharoua by Email: rabie.elkharoua@gmail.com.

Abstract

-**Paper Title:** "**Deep Learning Recognition for Arabic Alphabet Sign Language RGB Dataset**"

-**Abstract:** This paper introduces a Convolutional Neural Network (CNN) model for Arabic Sign Language (AASL) recognition, using the AASL dataset. Recognizing the fundamental importance of communication for the hearing-impaired, especially within the Arabic-speaking deaf community, the study emphasizes the critical role of sign language recognition systems. The proposed methodology achieves outstanding accuracy, with the CNN model reaching 99.9% accuracy on the training set and a validation accuracy of 97.4%. This study not only establishes a high-accuracy AASL recognition model but also provides insights into effective dropout strategies. The achieved high accuracy rates position the proposed model as a significant advancement in the field, holding promise for improved communication accessibility for the Arabic-speaking deaf community.

Contact

For any questions or further information, please contact Rabie El Kharoua by Email: rabie.elkharoua@gmail.com.
m
Python code for the estimation of missing prices in real-estate market with...
data.mendeley.com
Updated Dec 12, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Iván García-Magariño (2017). Python code for the estimation of missing prices in real-estate market with a dataset of house prices from Teruel city [Dataset]. http://doi.org/10.17632/mxpgf54czz.2
Explore at:
Unique identifier
https://doi.org/10.17632/mxpgf54czz.2
Dataset updated
Dec 12, 2017
Authors
Iván García-Magariño
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Teruel
Description
This research data file contains the necessary software and the dataset for estimating the missing prices of house units. This approach combines several machine learning techniques (linear regression, support vector regression, the k-nearest neighbors and a multi-layer perceptron neural network) with several dimensionality reduction techniques (non-negative factorization, recursive feature elimination and feature selection with a variance threshold). It includes the input dataset formed with the available house prices in two neighborhoods of Teruel city (Spain) in November 13, 2017 from Idealista website. These two neighborhoods are the center of the city and “Ensanche”.

This dataset supports the research of the authors in the improvement of the setup of agent-based simulations about real-estate market. The work about this dataset has been submitted for consideration for publication to a scientific journal.

The open source python code is composed of all the files with the “.py” extension. The main program can be executed from the “main.py” file. The “boxplotErrors.eps” is a chart generated from the execution of the code, and compares the results of the different combinations of machine learning techniques and dimensionality reduction methods.

The dataset is in the “data” folder. The input raw data of the house prices are in the “dataRaw.csv” file. These were shuffled into the “dataShuffled.csv” file. We used cross-validation to obtain the estimations of house prices. The outputted estimations alongside the real values are stored in different files of the “data” folder, in which each filename is composed by the machine learning technique abbreviation and the dimensionality reduction method abbreviation.
n
AirNow Air Quality Monitoring Data (Current) - Dataset - CKAN
nationaldataplatform.org
Updated Feb 28, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). AirNow Air Quality Monitoring Data (Current) - Dataset - CKAN [Dataset]. https://nationaldataplatform.org/catalog/dataset/airnow-air-quality-monitoring-data-current
Explore at:
Dataset updated
Feb 28, 2024
Description
This United States Environmental Protection Agency (US EPA) feature layer represents monitoring site data, updated hourly concentrations and Air Quality Index (AQI) values for the latest hour received from monitoring sites that report to AirNow.Map and forecast data are collected using federal reference or equivalent monitoring techniques or techniques approved by the state, local or tribal monitoring agencies. To maintain "real-time" maps, the data are displayed after the end of each hour. Although preliminary data quality assessments are performed, the data in AirNow are not fully verified and validated through the quality assurance procedures monitoring organizations used to officially submit and certify data on the EPA Air Quality System (AQS).This data sharing, and centralization creates a one-stop source for real-time and forecast air quality data. The benefits include quality control, national reporting consistency, access to automated mapping methods, and data distribution to the public and other data systems. The U.S. Environmental Protection Agency, National Oceanic and Atmospheric Administration, National Park Service, tribal, state, and local agencies developed the AirNow system to provide the public with easy access to national air quality information. State and local agencies report the Air Quality Index (AQI) for cities across the US and parts of Canada and Mexico. AirNow data are used only to report the AQI, not to formulate or support regulation, guidance or any other EPA decision or position.About the AQIThe Air Quality Index (AQI) is an index for reporting daily air quality. It tells you how clean or polluted your air is, and what associated health effects might be a concern for you. The AQI focuses on health effects you may experience within a few hours or days after breathing polluted air. EPA calculates the AQI for five major air pollutants regulated by the Clean Air Act: ground-level ozone, particle pollution (also known as particulate matter), carbon monoxide, sulfur dioxide, and nitrogen dioxide. For each of these pollutants, EPA has established national air quality standards to protect public health. Ground-level ozone and airborne particles (often referred to as "particulate matter") are the two pollutants that pose the greatest threat to human health in this country.A number of factors influence ozone formation, including emissions from cars, trucks, buses, power plants, and industries, along with weather conditions. Weather is especially favorable for ozone formation when it’s hot, dry and sunny, and winds are calm and light. Federal and state regulations, including regulations for power plants, vehicles and fuels, are helping reduce ozone pollution nationwide.Fine particle pollution (or "particulate matter") can be emitted directly from cars, trucks, buses, power plants and industries, along with wildfires and woodstoves. But it also forms from chemical reactions of other pollutants in the air. Particle pollution can be high at different times of year, depending on where you live. In some areas, for example, colder winters can lead to increased particle pollution emissions from woodstove use, and stagnant weather conditions with calm and light winds can trap PM2.5 pollution near emission sources. Federal and state rules are helping reduce fine particle pollution, including clean diesel rules for vehicles and fuels, and rules to reduce pollution from power plants, industries, locomotives, and marine vessels, among others.How Does the AQI Work?Think of the AQI as a yardstick that runs from 0 to 500. The higher the AQI value, the greater the level of air pollution and the greater the health concern. For example, an AQI value of 50 represents good air quality with little potential to affect public health, while an AQI value over 300 represents hazardous air quality.An AQI value of 100 generally corresponds to the national air quality standard for the pollutant, which is the level EPA has set to protect public health. AQI values below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered to be unhealthy-at first for certain sensitive groups of people, then for everyone as AQI values get higher.Understanding the AQIThe purpose of the AQI is to help you understand what local air quality means to your health. To make it easier to understand, the AQI is divided into six categories:Air Quality Index(AQI) ValuesLevels of Health ConcernColorsWhen the AQI is in this range:..air quality conditions are:...as symbolized by this color:0 to 50GoodGreen51 to 100ModerateYellow101 to 150Unhealthy for Sensitive GroupsOrange151 to 200UnhealthyRed201 to 300Very UnhealthyPurple301 to 500HazardousMaroonNote: Values above 500 are considered Beyond the AQI. Follow recommendations for the Hazardous category. Additional information on reducing exposure to extremely high levels of particle pollution is available here.Each category corresponds to a different level of health concern. The six levels of health concern and what they mean are:"Good" AQI is 0 to 50. Air quality is considered satisfactory, and air pollution poses little or no risk."Moderate" AQI is 51 to 100. Air quality is acceptable; however, for some pollutants there may be a moderate health concern for a very small number of people. For example, people who are unusually sensitive to ozone may experience respiratory symptoms."Unhealthy for Sensitive Groups" AQI is 101 to 150. Although general public is not likely to be affected at this AQI range, people with lung disease, older adults and children are at a greater risk from exposure to ozone, whereas persons with heart and lung disease, older adults and children are at greater risk from the presence of particles in the air."Unhealthy" AQI is 151 to 200. Everyone may begin to experience some adverse health effects, and members of the sensitive groups may experience more serious effects."Very Unhealthy" AQI is 201 to 300. This would trigger a health alert signifying that everyone may experience more serious health effects."Hazardous" AQI greater than 300. This would trigger a health warnings of emergency conditions. The entire population is more likely to be affected.AQI colorsEPA has assigned a specific color to each AQI category to make it easier for people to understand quickly whether air pollution is reaching unhealthy levels in their communities. For example, the color orange means that conditions are "unhealthy for sensitive groups," while red means that conditions may be "unhealthy for everyone," and so on.Air Quality Index Levels of Health ConcernNumericalValueMeaningGood0 to 50Air quality is considered satisfactory, and air pollution poses little or no risk.Moderate51 to 100Air quality is acceptable; however, for some pollutants there may be a moderate health concern for a very small number of people who are unusually sensitive to air pollution.Unhealthy for Sensitive Groups101 to 150Members of sensitive groups may experience health effects. The general public is not likely to be affected.Unhealthy151 to 200Everyone may begin to experience health effects; members of sensitive groups may experience more serious health effects.Very Unhealthy201 to 300Health alert: everyone may experience more serious health effects.Hazardous301 to 500Health warnings of emergency conditions. The entire population is more likely to be affected.Note: Values above 500 are considered Beyond the AQI. Follow recommendations for the "Hazardous category." Additional information on reducing exposure to extremely high levels of particle pollution is available here.
m
USA POI & Foot Traffic Enriched Geospatial Dataset by Predik Data-Driven
app.mobito.io
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
USA POI & Foot Traffic Enriched Geospatial Dataset by Predik Data-Driven [Dataset]. https://app.mobito.io/data-product/usa-enriched-geospatial-framework-dataset
Explore at:
Area covered
United States
Description
Our dataset provides detailed and precise insights into the business, commercial, and industrial aspects of any given area in the USA (Including Point of Interest (POI) Data and Foot Traffic. The dataset is divided into 150x150 sqm areas (geohash 7) and has over 50 variables. - Use it for different applications: Our combined dataset, which includes POI and foot traffic data, can be employed for various purposes. Different data teams use it to guide retailers and FMCG brands in site selection, fuel marketing intelligence, analyze trade areas, and assess company risk. Our dataset has also proven to be useful for real estate investment.- Get reliable data: Our datasets have been processed, enriched, and tested so your data team can use them more quickly and accurately.- Ideal for trainning ML models. The high quality of our geographic information layers results from more than seven years of work dedicated to the deep understanding and modeling of geospatial Big Data. Among the features that distinguished this dataset is the use of anonymized and user-compliant mobile device GPS location, enriched with other alternative and public data.- Easy to use: Our dataset is user-friendly and can be easily integrated to your current models. Also, we can deliver your data in different formats, like .csv, according to your analysis requirements. - Get personalized guidance: In addition to providing reliable datasets, we advise your analysts on their correct implementation.Our data scientists can guide your internal team on the optimal algorithms and models to get the most out of the information we provide (without compromising the security of your internal data).Answer questions like: - What places does my target user visit in a particular area? Which are the best areas to place a new POS?- What is the average yearly income of users in a particular area?- What is the influx of visits that my competition receives?- What is the volume of traffic surrounding my current POS?This dataset is useful for getting insights from industries like:- Retail & FMCG- Banking, Finance, and Investment- Car Dealerships- Real Estate- Convenience Stores- Pharma and medical laboratories- Restaurant chains and franchises- Clothing chains and franchisesOur dataset includes more than 50 variables, such as:- Number of pedestrians seen in the area.- Number of vehicles seen in the area.- Average speed of movement of the vehicles seen in the area.- Point of Interest (POIs) (in number and type) seen in the area (supermarkets, pharmacies, recreational locations, restaurants, offices, hotels, parking lots, wholesalers, financial services, pet services, shopping malls, among others). - Average yearly income range (anonymized and aggregated) of the devices seen in the area.Notes to better understand this dataset:- POI confidence means the average confidence of POIs in the area. In this case, POIs are any kind of location, such as a restaurant, a hotel, or a library. - Category confidences, for example"food_drinks_tobacco_retail_confidence" indicates how confident we are in the existence of food/drink/tobacco retail locations in the area. - We added predictions for The Home Depot and Lowe's Home Improvement stores in the dataset sample. These predictions were the result of a machine-learning model that was trained with the data. Knowing where the current stores are, we can find the most similar areas for new stores to open.How efficient is a Geohash?Geohash is a faster, cost-effective geofencing option that reduces input data load and provides actionable information. Its benefits include faster querying, reduced cost, minimal configuration, and ease of use.Geohash ranges from 1 to 12 characters. The dataset can be split into variable-size geohashes, with the default being geohash7 (150m x 150m).

Facebook

Twitter

Click to copy link

Link copied

Cite

Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica (2023). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. http://doi.org/10.3389/fninf.2021.715421.s002

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX

Explore at:

docxAvailable download formats

Unique identifier

https://doi.org/10.3389/fninf.2021.715421.s002

Dataset updated

Jun 1, 2023

Dataset provided by

Frontiers

Authors

Giulia Varotto; Gianluca Susi; Laura Tassi; Francesca Gozzo; Silvana Franceschetti; Ferruccio Panzica

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery.Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method.Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.

Clear search

Close search

Google apps

Main menu

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

Replication Data for: A Practical Method to Reduce Privacy Loss when...

Data from: Data Nuggets: A Method for Reducing Big Data While Preserving...

Orlando Neighborhood

Restricted Boltzmann Machine for Missing Data Imputation in Biomedical...

Data from: WiBB: An integrated method for quantifying the relative...

Zomato Restaurant Dataset

📘 ABOUT

✅ Key Features of the Dataset:

📂 FILE INFORMATION

📊 COLUMNS DESCRIPTION

Educational Attainment in North Carolina Public Schools: Use of statistical...

Dataset from the Upper Mississippi River Restoration Program (1993-2019) to...

National Health and Nutrition Examination Survey

National Health and Nutrition Examination Survey (NHANES) Data

Health Indicators for Different Locations

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Employment Of India CLeaned and Messy Data

🔹 Dataset Composition:

Transformations & Cleaning Applied:

Purpose & Utility:

Predictive modeling of treatment resistant depression using data from STAR*D...

3.27 Traffic Delay Reduction (summary)

Artificial Intelligence (AI) Training Dataset Market Research Report 2033

Artificial Intelligence (AI) Training Dataset Market Outlook

Data Type Analysis

Data from: On the role of data balancing for Machine Learning-based Code...

Cdd Dataset

No Background RGB Arabic Alphabets Sign Language

Description

Dataset Information

Dataset Distribuion

Data Collection

Data Preprocessing

Sample Images

Usage

Dataset Statistics

Citation

Additional Information

Links

Contact

Abstract

Contact

Python code for the estimation of missing prices in real-estate market with...

AirNow Air Quality Monitoring Data (Current) - Dataset - CKAN

USA POI & Foot Traffic Enriched Geospatial Dataset by Predik Data-Driven

Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX