CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Zip file containing the dataset and R code used to perform the analyses described in “Aridity preferences alter the relative importance of abiotic and biotic drivers on plant species abundance in global drylands”.
A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-share systems allow people to borrow a bike from a computer-controlled "dock": the user enters payment information, and the system unlocks a bike, which can then be returned to another dock belonging to the same system.
BoomBikes, a US bike-sharing provider, has recently suffered a considerable dip in revenue due to the corona pandemic. The company is finding it very difficult to sustain itself in the current market scenario, so it has decided to come up with a mindful business plan to accelerate its revenue.
In this attempt, BoomBikes aspires to understand the demand for shared bikes among people. It plans to prepare itself to cater to people's needs once the situation improves, stand out from other service providers, and make huge profits.
The company has contracted a consulting firm to understand the factors on which the demand for these shared bikes depends. Specifically, it wants to know which factors affect the demand for shared bikes in the American market.
Based on various meteorological surveys and people's lifestyles, the service provider firm has gathered a large dataset on daily bike demands across the American market.
You are required to model the demand for shared bikes with the available independent variables. The model will be used by management to understand exactly how demand varies with different features, so that they can adjust the business strategy to meet demand levels and customers' expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.
In the dataset provided, you will notice that there are three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who have made a rental. The variable 'registered', on the other hand, shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered. The model should be built taking this 'cnt' as the target variable.
When you're done with model building and residual analysis and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set.
```python
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
```
- where y_test is the test set of the target variable, and y_pred is the variable containing the predicted values of the target variable on the test set.
- Please perform this step, as the R-squared score on the test set serves as the benchmark for your model.
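For reference, r2_score computes the coefficient of determination, R² = 1 − SS_res/SS_tot. A minimal pure-Python sketch of the same computation, using hypothetical demand values (not from the dataset):

```python
def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, the coefficient of determination
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical demand values, for illustration only
y_test = [100, 150, 200, 250]
y_pred = [110, 140, 210, 240]
print(round(r_squared(y_test, y_pred), 4))  # 0.968
```

sklearn's r2_score implements this same formula, so the two should agree on the test set.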
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Significant shifts in latitudinal optima of North American birds (PNAS)
Paulo Mateus Martins, Marti J. Anderson, Winston L. Sweatman, and Andrew J. Punnett
Overview: This file contains the raw 2022 release of the North American Breeding Bird Survey dataset (Ziolkowski Jr et al. 2022), the filtered version used in our paper, and the code that generated it. We also include code that uses BirdLife's species distribution shapefiles to classify species as eastern or western based on their occurrence in the BBS dataset and to calculate the percentage of their range covered by the BBS sampling extent. Note that this code requires species distribution shapefiles, which are not provided but can be obtained directly from https://datazone.birdlife.org/species/requestdis.
Reference: D. J. Ziolkowski Jr., M. Lutmerding, V. I. Aponte, M. A. R. Hudson, North American breeding bird survey dataset 1966–2021: U.S. Geological Survey data release (2022), https://doi.org/10.5066/P97WAZE5
Detailed file description:
- info_birds_names_shp: A data frame that links BBS species names (column Species) to shapefiles (column Species_BL). See the code2_sampling coverage script.
- dat_raw_BBS_data_v2022: This R environment contains the raw BBS data from the 2022 release (https://www.sciencebase.gov/catalog/item/625f151ed34e85fa62b7f926). It holds data frames created from the files "Routes.zip" (route information), "SpeciesList.txt" (bird taxonomy), and "50-StopData.zip" (actual counts per route and year). This object is the starting point for creating the dataset used in the paper, which was filtered to remove taxonomic uncertainties, as demonstrated in the code1_build_long_wide_datasets R script.
- code1_build_long_wide_datasets: Filters the original dataset (dat_raw_BBS_data_v2022) to remove taxonomic uncertainties, assigns routes as either eastern or western based on regionalization using the dynamically constrained agglomerative clustering and partitioning method (see the Methods section of the paper), and generates the full long and wide versions of the dataset used in the analyses (dat2_filtered_data_long, dat3_filtered_data_wide).
- dat2_filtered_data_long: The filtered raw dataset in long form. For the analyses, this dataset was further filtered to remove nocturnal and aquatic species, as well as species with fewer than 30 occurrences, but the complete version is available here. To obtain the exact subset used in the analysis, filter this dataset using the column Species from datasets S1 or S3.
- dat3_filtered_data_wide: The filtered raw dataset in wide form, further filtered for the analyses in the same way as dat2_filtered_data_long; the same Species-based subsetting applies.
- code2_sampling coverage: Determines how much of a bird distribution is covered by the BBS sampling extent (refer to Dataset S1). This script requires bird species distribution shapefiles from BirdLife International, which we are not permitted to share; they can be requested directly at https://datazone.birdlife.org/species/requestdis
This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Data (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.
Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported - one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.
Potential methods for addressing these limitations:
- Identifying and removing unrealistic temperature estimates: calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
- Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
- Filter waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles: these waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.
File descriptions:
- "year_byscene=XXXX.zip" - includes temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files within the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"). This happens when no tabular data are extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible causes. An example file path for this dataset: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" - includes the summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land, and no output data were generated. An example file path for this dataset: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" - includes code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" - a crosswalk file that identifies the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" - a crosswalk file that identifies the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) to identify site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" - a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
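The screening rules above can be sketched in plain Python (the column names wb_dswe9_pixels, wb_dswe1_pixels, and dp_dswe come from this release; the example rows and the 20% cloud threshold are hypothetical, and in practice one would query the .parquet files, e.g. with the provided R script):

```python
def percent_cloud_pixels(row):
    # percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)
    total = row["wb_dswe9_pixels"] + row["wb_dswe1_pixels"]
    return row["wb_dswe9_pixels"] / total if total else 1.0

def keep_row(row, max_cloud=0.20):
    # Apply the three screening rules described above:
    # limited cloud cover, at least 10 water pixels, deepest point is water.
    return (percent_cloud_pixels(row) <= max_cloud
            and row["wb_dswe1_pixels"] >= 10
            and row["dp_dswe"] == 1)

# Hypothetical rows, for illustration only
rows = [
    {"wb_dswe9_pixels": 5, "wb_dswe1_pixels": 95, "dp_dswe": 1},   # kept
    {"wb_dswe9_pixels": 60, "wb_dswe1_pixels": 40, "dp_dswe": 1},  # too cloudy
    {"wb_dswe9_pixels": 0, "wb_dswe1_pixels": 5, "dp_dswe": 1},    # too few water pixels
]
filtered = [r for r in rows if keep_row(r)]
print(len(filtered))  # 1
```

The same predicate could be expressed as a filter on an arrow dataset; the threshold should be chosen per application.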
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Microplastics dataset
---
This repository includes the data and code used in Montoyal et al (ms number). More specifically:
- Dataset.xlsx includes data on microplastics (MPs), bacteria and phytoplankton, environmental variables, and ocean productivity
- ‘Microplastics code’ refers to the programming code used for the statistical analyses and plot construction using R software.
## Description of the data and file structure
The data is provided with variables as columns. The abbreviations are explained below:
id = sampling mesocosm and time
day = sampling day
temp = temperature (degrees Celsius)
depth = ocean depth (meters)
mesocosm = mesocosm ID
plast_tot = total concentration of microplastics (g cm-3)
plast_ps = concentration of polystyrene (g cm-3)
plast_pp = concentration of polypropylene (g cm-3)
plast_pet = concentration of polyethylene terephthalate (g cm-3)
plast_pvc = concentration of polyvinyl chloride (g cm-3)
plast_pe = concentration of polyethylene (g cm-3)
ammonium = concentration of ammonium (NH4+) (mg m-3)
hna = high nucleic acid concentration bacteria (% of total bacteria)
lna = low nucleic acid concentration bacteria (% of total bacteria)
chla_fluo = phytoplankton biomass, measured as chlorophyll a concentration (mg m-3)
fvfm = photosynthetic efficiency, measured as the ratio between variable and maximum fluorescence (Fv/Fm)
## Sharing/Access information
Not relevant
## Code/Software
‘Microplastics code’ refers to the programming code used for the statistical analyses and plot construction using R software.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This directory contains the cross-position activity recognition datasets used in the following paper. Please consider citing this article if you use the datasets.
Jindong Wang, Yiqiang Chen, Lisha Hu, Xiaohui Peng, and Philip S. Yu. Stratified Transfer Learning for Cross-domain Activity Recognition. 2018 IEEE International Conference on Pervasive Computing and Communications (PerCom).
These datasets were constructed from three public datasets: OPPORTUNITY (opp) [1], PAMAP2 (pamap2) [2], and UCI DSADS (dsads) [3].
Here is some useful information about this directory. Please feel free to contact jindongwang@outlook.com for more information.
This is NOT the raw data: feature extraction has already been performed and the features normalized into [-1,1]. The code for feature extraction can be found here: https://github.com/jindongwang/activityrecognition/tree/master/code. Currently, there are 27 features for a single sensor and 81 features for a body part. More information can be found in the above PerCom-18 paper.
There are 4 .mat files corresponding to the datasets: dsads.mat for UCI DSADS, opp_hl.mat and opp_loco.mat for OPPORTUNITY, and pamap.mat for PAMAP2. Note that opp_hl and opp_loco denote 'high-level' and 'locomotion' activities, respectively.
(1) dsads.mat: 9120 * 408. Columns 1~405 are features, listed in the order of 'Torso', 'Right Arm', 'Left Arm', 'Right Leg', and 'Left Leg'; each position contains 81 columns of features. Columns 406~408 are labels: column 406 is the activity sequence indicating the order in which activities were executed (usually not used in experiments), column 407 is the activity label (1~19), and column 408 denotes the person (1~8).
(2) opp_hl.mat and opp_loco.mat: Same structure as dsads.mat, but with more body parts: 'Back', 'Right Upper Arm', 'Right Lower Arm', 'Left Upper Arm', 'Left Lower Arm', 'Right Shoe (Foot)', and 'Left Shoe (Foot)'. (We did not use the data of the two shoes in our paper.) Column 460 is the activity label (please refer to the OPPORTUNITY dataset for the meaning of those activities), column 461 is the activity drill (also check the dataset information), and column 462 denotes the person (1~4).
(3) pamap.mat: 7312 * 245. Columns 1~243 are features, listed in the order of 'Wrist', 'Chest', and 'Ankle'. Column 244 is the activity label, and column 245 denotes the person (1~9).
There are another 3 datasets with the prefix 'cross_', containing only the 4 classes common to all datasets: lying, standing, walking, and sitting. These are for experiments on cross-dataset activity recognition (see our PerCom-18 paper).
(1) cross_dsads.mat: 1920 * 406. Columns 1~405 are features; column 406 is labels.
(2) cross_opp.mat: 5022 * 460. Columns 1~459 are features; column 460 is labels.
(3) cross_pamap.mat: 3063 * 244. Columns 1~243 are features; column 244 is labels.
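As a sketch of how the 1-based column layout of dsads.mat maps to 0-based indexing (pure Python on a synthetic row; loading the real .mat file, e.g. with scipy.io.loadmat, is not shown here):

```python
def split_dsads_row(row):
    # dsads.mat rows have 408 columns (1-based in the description):
    # 1-405 features (81 per body position x 5 positions),
    # 406 activity sequence, 407 activity label, 408 person.
    assert len(row) == 408
    features = row[0:405]
    activity_sequence = row[405]
    activity_label = row[406]
    person = row[407]
    return features, activity_sequence, activity_label, person

# Synthetic row: 405 zero features followed by the three label columns
row = [0.0] * 405 + [12, 5, 3]
feats, seq, label, person = split_dsads_row(row)
print(len(feats), seq, label, person)  # 405 12 5 3
```

The opp_* and pamap files follow the same pattern with their own column counts, as described above.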
-------- Original references for the 3 datasets:
[1] R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, G. Tröster, J. d. R. Millán, and D. Roggen, “The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition,” Pattern Recognition Letters, vol. 34, no. 15, pp. 2033–2042, 2013.
[2] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in Wearable Computers (ISWC), 2012 16th International Symposium on. IEEE, 2012, pp. 108–109.
[3] B. Barshan and M. C. Yüksek, “Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units,” The Computer Journal, vol. 57, no. 11, pp. 1649–1667, 2014.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset, metadata, and R Markdown file required to reproduce all analyses in the manuscript "Priority effects alter the colonization success of a host-associated parasite and mutualist" (Burr et al. 2021 Ecology)
Version 5 release notes:
- Removes support for SPSS and Excel data.
- Changes the crimes that are stored in each file. There are more files now, with fewer crimes per file. The files and their included crimes have been updated below.
- Adds in agencies that report 0 months of the year.
- Adds a column that indicates the number of months reported, generated by summing the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime; they may not necessarily report every crime every month. Agencies that did not report a crime will have a value of NA for every arrest column for that crime.
- Removes data on runaways.
Version 4 release notes:
- Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these columns include the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
Version 3 release notes:
- Adds data for 2016.
- Orders rows by year (descending) and ORI.
Version 2 release notes:
- Fixes a bug where Philadelphia Police Department had an incorrect FIPS county code.
The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. The R code used to clean this data is available at https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possibly incorrect) assumption that these values represent zero crimes reported. The original data has no value other than "None/not reported" when an agency reports zero arrests; in other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests, which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.
To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units, such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, if you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrested for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.
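The recoding described above can be sketched as follows (a minimal stand-in; the actual cleaning was done in R in the linked repository, and this per-value helper is hypothetical):

```python
# Sentinel arrest counts the description says are recoded to missing (NA)
BAD_VALUES = {10000, 20000, 30000, 40000, 50000, 60000, 70000,
              80000, 90000, 100000, 99999, 99998}

def clean_arrest_value(value):
    # "None/not reported" is assumed to mean zero arrests (possibly incorrect);
    # known-bad sentinel counts become None (NA).
    if value == "None/not reported":
        return 0
    if value in BAD_VALUES:
        return None
    return value

print([clean_arrest_value(v) for v in ["None/not reported", 3, 99999]])
# [0, 3, None]
```

Note that this rule cannot distinguish true zeros from missing reports, which is exactly the caveat stated above.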
To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.
I created 9 arrest categories myself. The categories are:
- Total Male Juvenile
- Total Female Juvenile
- Total Male Adult
- Total Female Adult
- Total Male
- Total Female
- Total Juvenile
- Total Adult
- Total Arrests
All of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file includes only the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set needs fewer columns, I include all offenses.
As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files: eight contain different crimes, and one is the "simple" file. Each file contains the data for all years. The eight categories each cover a major crime category and do not overlap in crimes other than the index offenses. Please note that the crime names provided below are not the same as the column names in the data; because Stata limits column names to 32 characters, I have abbreviated the crime names in the data. The files and their included crimes are:
Index Crimes
- Murder
- Rape
- Robbery
- Aggravated Assault
- Burglary
- Theft
- Motor Vehicle Theft
- Arson
Alcohol Crimes
- DUI
- Drunkenness
- Liquor
Drug Crimes
- Total Drug
- Total Drug Sales
- Total Drug Possession
- Cannabis Possession
- Cannabis Sales
- Heroin or Cocaine Possession
- Heroin or Cocaine Sales
- Other Drug Possession
- Other Drug Sales
- Synthetic Narcotic Possession
- Synthetic Narcotic Sales
Grey Collar and Property Crimes
- Forgery
- Fraud
- Stolen Property
Financial Crimes
- Embezzlement
- Total Gambling
- Other Gambling
- Bookmaking
- Numbers Lottery
Sex or Family Crimes
- Offenses Against the Family and Children
- Other Sex Offenses
- Prostitution
- Rape
Violent Crimes
- Aggravated Assault
- Murder
- Negligent Manslaughter
- Robbery
- Weapon Offenses
Other Crimes
- Curfew
- Disorderly Conduct
- Other Non-traffic
- Suspicion
- Vandalism
- Vagrancy
Simple
- This data set has every crime and only the arrest categories that I created (see above).
This survey is part of a long-term Bonneville Power Administration-funded effort by academic and federal scientists to understand coastal ecosystems and biological and physical processes that may influence recruitment variability of salmon in Pacific Northwest waters. Prior to a potential switch to the R/V Shimada as a long-term platform, we intend to compare catches between vessels (the R/V Shimada and the F/V Frosti) on the continental shelf of Washington. Sampling will occur during the day for 3 days (23-25 May) with surface trawls along pre-specified transects. One trawl will be performed as we leave the Strait of Juan de Fuca on the afternoon of the 22nd. In addition, a bongo net will be towed several times each night between 2300 and 0500 (nights of 19-22 June).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background and Purpose:
There are very few publicly available datasets on real-world falls in the scientific literature, due to the rarity of natural falls and the inherent difficulties in gathering biomechanical and physiological data from young subjects or older adults residing in their communities in a non-intrusive and user-friendly manner. This data gap has hindered research on fall prevention strategies. Immersive Virtual Reality (VR) environments provide a unique solution.
This dataset supports research in fall prevention by providing an immersive VR setup that simulates diverse ecological environments and randomized visual disturbances, aimed at triggering and analyzing balance-compensatory reactions. The dataset is a unique tool for studying human balance responses to VR-induced perturbations, facilitating research that could inform training programs, wearable assistive technologies, and VR-based rehabilitation methods.
Dataset Content:
The dataset includes:
Kinematic Data: Captured using a full-body Xsens MVN Awinda inertial measurement system, providing detailed movement data at 60 Hz.
Muscle Activity (EMG): Recorded at 1111 Hz using Delsys Trigno for tracking muscle contractions.
Electrodermal Activity (EDA): Captured at 100.21 Hz with a Shimmer GSR device on the dominant forearm to record physiological responses to perturbations.
Metadata: Includes participant demographics (age, height, weight, gender, dominant hand and foot), trial conditions, and perturbation characteristics (timing and type).
The files are named in the format "ParticipantX_labelled", where X represents the participant's number. Each file is provided in a .mat format, with data already synchronized across different sensor sources. The structure of each file is organized into the following columns:
Column 1: Label indicating the visual perturbation applied. 0 means no visual perturbation.
Column 2: Timestamp, providing the precise timing of each recorded data point.
Column 3: Frame identifier, which can be cross-referenced with the MVN file for detailed motion analysis.
Columns 4 to 985: Xsens motion capture features, exported directly from the MVN file.
Columns 986 to 993: EMG data - Tibialis Anterior (R&L), Gastrocnemius Medial Head (R&L), Rectus Femoris (R), Semitendinosus (R), External Oblique (R), Sternocleidomastoid (R).
Columns 994 to 1008: Shimmer data: Accelerometer (x,y,z), Gyroscope (x,y,z), Magnetometer (x,y,z), GSR Range, Skin Conductance, Skin Resistance, PPG, Pressure, Temperature.
In addition, we are also releasing the .MVN and .MVNA files for each participant (1 to 10), which provide comprehensive motion capture data and include the participants' body measurements, respectively. This additional data enables precise body modeling and further in-depth biomechanical analysis.
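The column layout above translates to the following 0-based slices (pure Python on a synthetic row; loading the actual .mat files, e.g. with scipy.io.loadmat, is an assumption not shown here):

```python
def split_row(row):
    # Columns (1-based in the description): 1 label, 2 timestamp, 3 frame id,
    # 4-985 Xsens features, 986-993 EMG channels, 994-1008 Shimmer channels.
    assert len(row) == 1008
    return {
        "label": row[0],
        "timestamp": row[1],
        "frame": row[2],
        "xsens": row[3:985],       # 982 motion-capture features
        "emg": row[985:993],       # 8 EMG channels
        "shimmer": row[993:1008],  # 15 Shimmer channels
    }

# Synthetic row for illustration: label 0 (no perturbation), then zeros
row = [0] * 1008
parts = split_row(row)
print(len(parts["xsens"]), len(parts["emg"]), len(parts["shimmer"]))  # 982 8 15
```

The 8 EMG slices correspond, in order, to the muscles listed above, and the 15 Shimmer slices to the accelerometer, gyroscope, magnetometer, GSR, PPG, pressure, and temperature channels.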
Participants & VR Headset:
Twelve healthy young adults (average age: 25.09 ± 2.81 years; height: 167.82 ± 8.40 cm; weight: 64.83 ± 7.77 kg; 6 males, 6 females) participated in this study (Table 1). Participants met the following criteria: i) healthy locomotion, ii) stable postural balance, iii) age ≥ 18 years, and iv) body weight < 135 kg.
Participants were excluded if they: i) had any condition affecting locomotion, ii) had epilepsy, vestibular disorders, or other neurological conditions impacting stability, iii) had undergone recent surgeries impacting mobility, iv) were involved in other experimental studies, v) were under judicial protection or guardianship, or vi) experienced complications using VR headsets (e.g., motion sickness).
All participants provided written informed consent, adhering to the ethical guidelines set by the University of Minho Ethics Committee (CEICVS 063/2021), in compliance with the Declaration of Helsinki and the Oviedo Convention.
To ensure unbiased reactions, participants were kept unaware of the specific protocol details. Visual disturbances were introduced in a random sequence and at various locations, enhancing the unpredictability of the experiment and simulating a naturalistic response.
The VR setup involved an HTC Vive Pro headset with two wirelessly synchronized base stations that tracked participants’ head movements within a 5m x 2.5m area. The base stations adjusted the VR environment’s perspective according to head movements, while controllers were used solely for setup purposes.
Table 1 - Participants' demographic information
| Participant | Height (cm) | Weight (kg) | Age | Gender | Dom. Hand | Dom. Foot |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 159 | 56.5 | 23 | F | Right | Right |
| 2 | 157 | 55.3 | 28 | F | Right | Right |
| 3 | 174 | 67.1 | 31 | M | Right | Right |
| 4 | 176 | 73.8 | 23 | M | Right | Right |
| 5 | 158 | 57.3 | 23 | F | Right | Right |
| 6 | 181 | 70.9 | 27 | M | Right | Right |
| 7 | 171 | 73.3 | 23 | M | Right | Right |
| 8 | 159 | 69.2 | 28 | F | Right | Right |
| 9 | 177 | 57.3 | 22 | M | Right | Right |
| 10 | 171 | 75.5 | 25 | M | Right | Right |
| 11 | 163 | 58.1 | 23 | F | Right | Right |
| 12 | 168 | 63.7 | 25 | F | Right | Right |
Data Collection Methodology:
The experimental protocol was designed to integrate four essential components: (i) precise control over stimuli, (ii) high reproducibility of the experimental conditions, (iii) preservation of ecological validity, and (iv) promotion of real-world learning transfer.
Participant Instructions and Familiarization Trial: Before starting, participants were given specific instructions to (i) seek assistance if they experienced motion sickness, (ii) adjust the VR headset for comfort by modifying the lens distance and headset fit, (iii) stay within the defined virtual play area demarcated by a blue boundary, and (iv) complete a familiarization trial. During this trial, participants were encouraged to explore various virtual environments while performing a sequence of three key movements—walking forward, turning around, and returning to the initial location—without any visual perturbations. This familiarization phase helped participants acclimate to the virtual space in a controlled setting.
Experimental Protocol and Visual Perturbations: Participants were exposed to 11 different types of visual perturbations as outlined in Table 2, applied across a total of 35 unique perturbation variants (Table 3). Each variant involved the same type of perturbation, such as a clockwise Roll Axis Tilt, but varied in intensity (e.g., rotation speed) and was presented in randomized virtual locations. The selection of perturbation types was grounded in existing literature on visual disturbances. This design ensured that participants experienced a diverse range of visual effects in a manner that maintained ecological validity, supporting the potential for generalization to real-world scenarios where visual perturbations might occur spontaneously.
Protocol Flow and Randomized Presentation: Throughout the experimental protocol, each visual perturbation variant was presented three times, and participants engaged repeatedly in the familiarization activities over a nearly one-hour period. These activities—walking forward, turning around, and returning to the starting point—took place in a 5m x 2.5m physical space mirrored in VR, allowing participants to take 7–10 steps before turning. Participants were not informed of the timing or nature of any perturbations, which could occur unpredictably during their forward walk, adding a realistic element of surprise. After each return to the starting point, participants were relocated to a random position within the virtual environment, with the sequence of positions determined by a randomized, computer-generated order.
Table 2 - Visual perturbations' name and parameters (L - Lateral; B - Backward; F - Forward; S - Slip; T - Trip; CW- Clockwise; CCW - Counter-Clockwise)
Perturbation [Fall Category] | Parameters
Roll Axis Tilt - CW [L] | [10º, 20º, 30º] during 0.5 s
Roll Axis Tilt - CCW [L] | [10º, 20º, 30º] during 0.5 s
Support Surface ML Axis Translation - Bidirectional [L] | Discrete movement (static pauses between movements), 1 m/s
AP Axis Translation - Front [F] | 1 m/s
AP Axis Translation - Backwards [B] | 1 m/s
Pitch Axis Tilt [S] | 0º-25º, 60º/s
Virtual object with lower height than a real object [T] | Variable object height
Roll-Pitch-Yaw Axis Tilt [Syncope] | Sum of sinusoids drives each axis rotation
Scene Object Movement [L] | Objects fly towards the subject's head; variable speeds
Vertigo Sensation [F/L] | Walk at a comfortable speed, with and without avatar, at house height
Axial Axis Translation [F/B/L] | Free fall
Table 3 - Label Encoding
Visual Perturbation = Label (three mappings per row)
Roll Indoor 1 CW10 = 1 | Roll Indoor 1 CW20 = 2 | Roll Indoor 1 CW30 = 3
Roll Indoor 1 CCW10 = 4 | Roll Indoor 1 CCW20 = 5 | Roll Indoor 1 CCW30 = 6
Roll Indoor 2 CW10 = 7 | Roll Indoor 2 CW20 = 8 | Roll Indoor 2 CW30 = 9
Roll Indoor 2 CCW10 = 10 | Roll Indoor 2 CCW20 = 11 | Roll Indoor 2 CCW30 = 12
Roll Outdoor CW10 = 13 | Roll Outdoor CW20 = 14 | Roll Outdoor CW30 = 15
Roll Outdoor CCW10 = 16 | Roll Outdoor CCW20 = 17 | Roll Outdoor CCW30 = 18
ML-Axis Trans. - Kitchen = 19 | AP-Axis Trans. - Corridor Forward = 20 | AP-Axis Trans. - Corridor Backward = 21
Pitch Indoor - Bathroom (wet floor) = 22 | Pitch Indoor - Near Fridge (wet floor) = 23 | Roof Beam Walking - Vertigo = 24
Roof Beam Walking - Vertigo No Avatar = 25 | Simple Roof - Vertigo = 26 | Simple Roof - Vertigo No Avatar = 27
Pitch Outdoor - Near Car Oil = 28 | Trip - Sidewalk / Trip Shock = 29/290 | Bedroom Syncope = 30
Garden - Object Avoidance = 31 | Electricity Pole - Vertigo = 32 | Electricity Pole - No Avatar = 33
Free Fall = 34 | Climbing Virtual Stairs = 35
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
This data was obtained from the Maricopa County Assessor under the search "Fast Food". The query matched approximately 1,342 results, with only 1,000 returned due to MCA data policies.
Because some Subdivision Name values contained unescaped commas that interfered with Pandas' ability to properly align the columns, I performed some manual cleaning in LibreOffice.
Aside from a handful of null values, the data is fairly clean and requires little work in Pandas.
Here are the sums and percentage of NULLS in the dataframe.
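A minimal pandas sketch of how such a summary can be produced; the dataframe and column names below are illustrative, not the actual records:

```python
import pandas as pd

# Illustrative dataframe; the real data has 1,000 records and more columns.
df = pd.DataFrame({
    "Address": ["100 Main St", None, "200 Elm St"],
    "Subdivision Name": [None, None, "SUNSET PLAZA"],
})
summary = pd.DataFrame({
    "nulls": df.isnull().sum(),
    "percent": (df.isnull().mean() * 100).round(1),
})
print(summary)
```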
Interestingly, 17 records have no physical address at all. This amounts to 1.7% of the values for Address, City, and Zip, and the missing values all occur in the same rows.
I have looked into a couple of these on the Maricopa County Assessor's GIS Portal, and they do not appear to have any assigned physical addresses. This is a good avenue of exploration for EDA. Possibly an error that could be corrected, or some obscure legal reason, but interesting nonetheless.
Additionally, there are 391 NULLS in Subdivision Name, accounting for 39.1%. This is a feature I am interested in exploring to determine whether there are any predominant groups. It could also generate a list of entities that can be searched later to see if the dataset can be enriched beyond its initial 1,000-record limit.
There are 348 NULLS in the MCR column. This is the definition according to the MCA Glossary:
MCR (MARICOPA COUNTY RECORDER NUMBER)
Often associated with recorded plat maps.
This seems to be an uninteresting nominal value, so I will drop this column.
While Property Type and Rental have no NULLS, 100% of their values are "Fast Food Restaurant" and "N" (for No), respectively; they offer no useful information and will be dropped.
I will keep the S/T/R column. Although it also appears to hold uninteresting nominal values, I am curious whether there are predominant groups, and since it has no NULLS, it might be useful for further data enrichment.
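The planned column drops can be sketched in pandas as follows; the values are toy data, and only the column names follow the listing above:

```python
import pandas as pd

# Toy records; only the column names follow the listing above.
df = pd.DataFrame({
    "MCR": [None, "12345"],
    "Property Type": ["Fast Food Restaurant", "Fast Food Restaurant"],
    "Rental": ["N", "N"],
    "S/T/R": ["1-2N-3E", "4-2N-3E"],
})
df = df.drop(columns=["MCR", "Property Type", "Rental"])
```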
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
!!!WARNING~~~ This dataset has a large number of flaws and is unable to properly answer many questions that people generally use it to answer, such as whether national hate crimes are changing (or at least they use the data so improperly that they get the wrong answer). A large number of people using this data (academics, advocates, reporters, the US Congress) do so inappropriately and get the wrong answer to their questions as a result. Indeed, many published papers using this data should be retracted. Before using this data I highly recommend that you thoroughly read my book on UCR data, particularly the chapter on hate crimes (https://ucrbook.com/hate-crimes.html), as well as the FBI's own manual on this data. The questions you could potentially answer well are relatively narrow and generally exclude any causal relationships. ~~~WARNING!!!
For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.com
Version 8 release notes:
- Adds 2019 and 2020 data. Please note that the FBI has retired UCR data ending in 2020, so this will be the last UCR hate crime data they release.
- Changes the .rda file to .rds.
Version 7 release notes:
- Changes the release notes description; does not change the data.
Version 6 release notes:
- Adds 2018 data.
Version 5 release notes:
- Adds data in the following formats: SPSS, SAS, and Excel.
- Changes the project name to avoid confusing this data with the versions done by NACJD.
- Adds data for 1991.
- Fixes a bug where the bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013, causing there to be two columns and zero values for years with the wrong label.
- All data is now directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS setup files and is read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R.
Version 4 release notes:
- Adds data for 2017.
- Adds rows that submitted a zero-report (i.e. the agency reported no hate crimes in that year). This is for all years 1992-2017.
- Makes changes to categorical variables (e.g. bias motivation columns) so that categories are consistent over time. Different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian'), which I made consistent.
- Adds the 'population' column, which is the total population in that agency.
Version 3 release notes:
- Adds data for 2016.
- Orders rows by year (descending) and ORI.
Version 2 release notes:
- Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.
The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. It contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open. Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9-character Originating Identifier code), and incident number columns. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group). Finally, it has a number of columns indicating whether the victim of each offense was a certain type of victim or not (e.g. individual victim, business victim, religious victim, etc.). The only changes I made to the data are the following: minor changes to column names to make all column names 32 characters or fewer (so the data can be saved in Stata format), making all character values lower case, and reordering the columns.
I also generated incident month, weekday, and month-day variables from the incident date variable included in the original data.
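The unique ID construction described above can be sketched in pandas; the column names and example values here are hypothetical, not the dataset's exact fields:

```python
import pandas as pd

# Hypothetical columns and values; the real data combines the year, the
# agency ORI9 code, and the incident number.
df = pd.DataFrame({
    "year": [2017, 2017],
    "ori9": ["PA0012345", "NY0303000"],
    "incident_number": ["000123", "000456"],
})
df["unique_id"] = (
    df["year"].astype(str) + "_" + df["ori9"] + "_" + df["incident_number"]
)
```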
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Walking and running are mechanically and energetically different locomotion modes. For selecting one or another, speed is a parameter of paramount importance. Yet, both are likely controlled by similar low-dimensional neuronal networks that reflect in patterned muscle activations called muscle synergies. Here, we investigated how humans synergistically activate muscles during locomotion at different submaximal and maximal speeds. We analysed the duration and complexity (or irregularity) over time of motor primitives, the temporal components of muscle synergies. We found that the challenge imposed by controlling high-speed locomotion forces the central nervous system to produce muscle activation patterns that are wider and less complex relative to the duration of the gait cycle. The motor modules, or time-independent coefficients, were redistributed as locomotion speed changed. These outcomes show that robust locomotion control at challenging speeds is achieved by modulating the relative contribution of muscle activations and producing less complex and wider control signals, whereas slow speeds allow for more irregular control.
In this supplementary data set we made available: a) the metadata with anonymized participant information, b) the raw EMG, c) the touchdown and lift-off timings of the recorded limb, d) the filtered and time-normalized EMG, e) the muscle synergies extracted via NMF and f) the code to process the data, including the scripts to calculate the Higuchi's fractal dimension (HFD) of motor primitives. In total, 180 trials from 30 participants are included in the supplementary data set.
The file “metadata.dat” is available in ASCII and RData format and contains:
Code: the participant’s code
Group: the experimental group in which the participant was involved (G1 = walking and submaximal running; G2 = submaximal and maximal running)
Sex: the participant’s sex (M or F)
Speeds: the type of locomotion (W for walking or R for running) and speed at which the recordings were conducted in 10*[m/s]
Age: the participant’s age in years
Height: the participant’s height in [cm]
Mass: the participant’s body mass in [kg]
PB: 100 m-personal best time (for G2).
The "RAW_DATA.RData" R list consists of elements of S3 class "EMG", each of which is a human locomotion trial containing cycle segmentation timings and raw electromyographic (EMG) data from 13 muscles of the right-side leg. Cycle times are structured as data frames with two columns corresponding to touchdown (first column) and lift-off (second column). Raw EMG data sets are also structured as data frames with one row for each recorded data point and 14 columns. The first column contains the incremental time in seconds. The remaining 13 columns contain the raw EMG data, named with the following muscle abbreviations: ME = gluteus medius, MA = gluteus maximus, FL = tensor fasciæ latæ, RF = rectus femoris, VM = vastus medialis, VL = vastus lateralis, ST = semitendinosus, BF = biceps femoris, TA = tibialis anterior, PL = peroneus longus, GM = gastrocnemius medialis, GL = gastrocnemius lateralis, SO = soleus. Please note that the following trials include fewer than 30 gait cycles (the actual number is shown in parentheses): P16_R_83 (20), P16_R_95 (25), P17_R_28 (28), P17_R_83 (24), P17_R_95 (13), P18_R_95 (23), P19_R_95 (18), P20_R_28 (25), P20_R_42 (27), P20_R_95 (25), P22_R_28 (23), P23_R_28 (29), P24_R_28 (28), P24_R_42 (29), P25_R_28 (29), P25_R_95 (28), P26_R_28 (29), P26_R_95 (28), P27_R_28 (28), P27_R_42 (29), P27_R_95 (24), P28_R_28 (29), P29_R_95 (17). All the other trials consist of 30 gait cycles. Trials are named like "P20_R_20", where the characters "P20" indicate the participant number (in this example the 20th), the character "R" indicates the locomotion type (W = walking, R = running), and the numbers "20" indicate the locomotion speed in 10*m/s (in this case the speed is 2.0 m/s). The filtered and time-normalized EMG data is named, following the same rules, like "FILT_EMG_P03_R_30".
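Trial names following this convention can be decoded programmatically. A small Python sketch: the naming rule is taken from the description above, while the function itself is illustrative:

```python
import re

def parse_trial(name):
    """Decode a trial name like 'P20_R_20' (speed is stored as 10*m/s)."""
    m = re.fullmatch(r"P(\d+)_([WR])_(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized trial name: {name}")
    participant, loco, speed = m.groups()
    return {
        "participant": int(participant),
        "locomotion": "walking" if loco == "W" else "running",
        "speed_m_s": int(speed) / 10,
    }
```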
Old versions (not compatible with the R package musclesyneRgies):
The files containing the gait cycle breakdown are available in RData format, in the file named “CYCLE_TIMES.RData”. The files are structured as data frames with as many rows as the available number of gait cycles and two columns. The first column, named “touchdown”, contains the touchdown incremental times in seconds. The second column, named “stance”, contains the duration of each stance phase of the right foot in seconds. Each trial is saved as an element of a single R list. Trials are named like “CYCLE_TIMES_P20_R_20”, where the characters “CYCLE_TIMES” indicate that the trial contains the gait cycle breakdown times, the characters “P20” indicate the participant number (in this example the 20th), the character “R” indicates the locomotion type (W = walking, R = running), and the numbers “20” indicate the locomotion speed in 10*m/s (in this case the speed is 2.0 m/s). Please note that the following trials include fewer than 30 gait cycles (the actual number is shown in parentheses): P16_R_83 (20), P16_R_95 (25), P17_R_28 (28), P17_R_83 (24), P17_R_95 (13), P18_R_95 (23), P19_R_95 (18), P20_R_28 (25), P20_R_42 (27), P20_R_95 (25), P22_R_28 (23), P23_R_28 (29), P24_R_28 (28), P24_R_42 (29), P25_R_28 (29), P25_R_95 (28), P26_R_28 (29), P26_R_95 (28), P27_R_28 (28), P27_R_42 (29), P27_R_95 (24), P28_R_28 (29), P29_R_95 (17).
The files containing the raw and the filtered, time-normalized EMG data are available in RData format, in the files named “RAW_EMG.RData” and “FILT_EMG.RData”. The raw EMG files are structured as data frames with as many rows as the number of recorded data points and 13 columns. The first column, named “time”, contains the incremental time in seconds. The remaining 12 columns contain the raw EMG data, named with muscle abbreviations that follow those reported above. Each trial is saved as an element of a single R list. Trials are named like “RAW_EMG_P03_R_30”, where the characters “RAW_EMG” indicate that the trial contains raw EMG data, the characters “P03” indicate the participant number (in this example the 3rd), the character “R” indicates the locomotion type (see above), and the numbers “30” indicate the locomotion speed (see above). The filtered and time-normalized EMG data is named, following the same rules, like “FILT_EMG_P03_R_30”.
The files containing the muscle synergies extracted from the filtered and normalized EMG data are available in RData format, in the files named “SYNS_H.RData” and “SYNS_W.RData”. The muscle synergies files are divided into motor primitives and motor modules and are presented as the direct output of the factorisation, not in any functional order. Motor primitives are data frames with 6000 rows and a number of columns equal to the number of synergies (which might differ from trial to trial) plus one. The rows contain the time-dependent coefficients (motor primitives), one column for each synergy plus the time points (columns are named e.g. “time, Syn1, Syn2, Syn3”, where “Syn” is the abbreviation for “synergy”). Each gait cycle contains 200 data points, 100 for the stance and 100 for the swing phase, which, multiplied by the 30 recorded cycles, result in 6000 data points distributed in as many rows. This output is transposed as compared to the one discussed in the methods section to improve user readability. Each set of motor primitives is saved as an element of a single R list. Trials are named like “SYNS_H_P12_W_07”, where the characters “SYNS_H” indicate that the trial contains motor primitive data, the characters “P12” indicate the participant number (in this example the 12th), the character “W” indicates the locomotion type (see above), and the numbers “07” indicate the speed (see above). Motor modules are data frames with 12 rows (the number of recorded muscles) and a number of columns equal to the number of synergies (which might differ from trial to trial). The rows, named with muscle abbreviations that follow those reported above, contain the time-independent coefficients (motor modules), one for each synergy and for each muscle. Each set of motor modules relative to one synergy is saved as an element of a single R list.
Trials are named like “SYNS_W_P22_R_20”, where the characters “SYNS_W” indicate that the trial contains motor module data, the characters “P22” indicate the participant number (in this example the 22nd), the character “W” indicates the locomotion type (see above), and the numbers “20” indicate the speed (see above). Given the nature of the NMF algorithm for the extraction of muscle synergies, the supplementary data set might show non-significant differences as compared to the one used for obtaining the results of this paper.
The files containing the HFD calculated from motor primitives are available in RData format, in the file named “HFD.RData”. HFD results are presented in a list of lists containing, for each trial, 1) the HFD, and 2) the interval time k used for the calculations. HFDs are presented as one number (mean HFD of the primitives for that trial), as are the interval times k. Trials are named like “HFD_P01_R_95”, where the characters “HFD” indicate that the trial contains HFD data, the characters “P01” indicate the participant number (in this example the 1st), the character “R” indicates the locomotion type (see above), and the numbers “95” indicate the speed (see above).
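Higuchi's fractal dimension, used here to quantify the complexity of motor primitives, can be sketched as follows. This is a generic Python implementation of the standard Higuchi algorithm, not the authors' R code; `kmax` is a free parameter of the method:

```python
import numpy as np

def higuchi_fd(x, kmax=10):
    """Estimate Higuchi's fractal dimension of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ks, lengths = [], []
    for k in range(1, kmax + 1):
        lm = []
        for m in range(k):  # one reduced series per offset m
            n_steps = (n - m - 1) // k
            if n_steps < 1:
                continue
            idx = m + np.arange(n_steps + 1) * k
            curve_len = np.abs(np.diff(x[idx])).sum()
            # normalization factor for unequal reduced-series lengths
            lm.append(curve_len * (n - 1) / (n_steps * k) / k)
        ks.append(k)
        lengths.append(np.mean(lm))
    # HFD is the slope of log(L(k)) against log(1/k)
    slope, _ = np.polyfit(np.log(1.0 / np.array(ks)), np.log(lengths), 1)
    return slope
```

A straight line has dimension 1, while more irregular signals approach 2, which matches the paper's use of HFD as a measure of primitive complexity.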
All the code used for the pre-processing of EMG data, the extraction of muscle synergies and the calculation of HFD is available in R format. Explanatory comments are provided throughout the script “muscle_synergies.R”.
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
# https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
#################################
### Variables for downloaded files
data.dir <- ' '
train.file <- paste0(data.dir, 'training.csv')
test.file <- paste0(data.dir, 'test.csv')
#################################
### Load csv -- creates a data.frame where each column can have a different type.
d.train <- read.csv(train.file, stringsAsFactors = F)
d.test <- read.csv(test.file, stringsAsFactors = F)
### In training.csv, we have 7049 rows, each with 31 columns.
### The first 30 columns are keypoint locations, which R correctly identified as numbers.
### The last one is a string representation of the image.
### To look at samples of the data, uncomment this line:
### Let's save the Image column as another variable, and remove it from d.train:
### d.train is our dataframe, and we want the column called Image.
### Assigning NULL to a column removes it from the dataframe.
im.train <- d.train$Image
d.train$Image <- NULL  # removes 'Image' from the dataframe
im.test <- d.test$Image
d.test$Image <- NULL  # removes 'Image' from the dataframe
#################################
# The image is represented as a series of numbers, stored as a string.
# Convert these strings to integers by splitting them and converting the result:
# strsplit splits the string,
# unlist simplifies its output to a vector of strings,
# as.integer converts it to a vector of integers.
as.integer(unlist(strsplit(im.train[1], " ")))
as.integer(unlist(strsplit(im.test[1], " ")))
### Install and activate the appropriate libraries.
### The tutorial is meant for Linux and OS X, where a different parallel backend is used, so:
### replace all instances of %dopar% with %do%.
library("foreach", lib.loc="~/R/win-library/3.3")
### Convert every image string to an integer vector
im.train <- foreach(im = im.train, .combine=rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
im.test <- foreach(im = im.test, .combine=rbind) %do% {
  as.integer(unlist(strsplit(im, " ")))
}
# The foreach loop evaluates the inner command for each row in im.train and combines
# the results with rbind (combine by rows).
# Note: %do% evaluates sequentially; %dopar% (with a registered parallel backend)
# would run the iterations in parallel.
# im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel).
### Save all four variables in a data.Rd file;
### you can reload them at any time with load('data.Rd').
save(d.train, d.test, im.train, im.test, file='data.Rd')
# Each image is a vector of 96*96 pixels (96*96 = 9216).
# Convert these 9216 integers into a 96x96 matrix:
im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)
# im.train[1,] returns the first row of im.train, which corresponds to the first training image.
# rev reverses the resulting vector to match the interpretation of R's image function
# (which expects the origin to be in the lower left corner).
# To visualize the image we use R's image function:
image(1:96, 1:96, im, col=gray((0:255)/255))
# Let's color the coordinates for the eyes and nose:
points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red")
points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue")
points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")
# Another good check is to see how variable our data is.
# For example, where are the centers of each nose in the 7049 images? (this takes a while to run)
for(i in 1:nrow(d.train)) {
  points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red")
}
# There are quite a few outliers -- they could be labeling errors.
# Looking at one extreme example we get this:
idx <- which.max(d.train$nose_tip_x)
im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96)
image(1:96, 1:96, im, col=gray((0:255)/255))
points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")
# In this case there's no labeling error, but it shows that not all faces are centered.
# One of the simplest things to try is to compute the mean of the coordinates of each
# keypoint in the training set and use that as a prediction for all images:
colMeans(d.train, na.rm=T)
# To build a submission file we need to apply these computed coordinates to the test instances:
p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T)
colnames(p) <- names(d.train)
predictions <- data.frame(ImageId = 1:nrow(d.test), p)
head(predictions)
# The expected submission format has one keypoint per row, but we can easily get
# that with the help of the reshape2 library:
library(reshape2)
This data originates from the Crossref API. It has metadata on the articles contained in the Data Citation Corpus where the citation pair dataset is a DOI.
How to recreate this dataset in Jupyter Notebook:
1) Prepare the list of articles to query

```python
import pandas as pd

CITATIONS_PARQUET = "data_citation_corpus_filtered_v4.1.parquet"

citation_pairs = pd.read_parquet(CITATIONS_PARQUET)

# Drop rows whose dataset identifier is an https URL that is not a DOI,
# then drop figshare entries.
citation_pairs = citation_pairs[
    ~((citation_pairs['dataset'].str.contains("https"))
      & (~citation_pairs['dataset'].str.contains("doi.org")))
]
citation_pairs = citation_pairs[~citation_pairs['dataset'].str.contains("figshare")]

# Keep only citation pairs where the dataset is a DOI.
citation_pairs['is_doi'] = citation_pairs['dataset'].str.contains('doi.org', na=False)
citation_pairs_doi = citation_pairs[citation_pairs['is_doi']].copy()

articles = list(set(citation_pairs_doi['publication'].to_list()))
articles = [doi.replace("_", "/") for doi in articles]

# Write one DOI per line; the enrichment script reads the file line by line.
with open("articles.txt", "w") as f:
    for article in articles:
        f.write(f"{article}\n")
```
2) Query articles from CrossRef API
```python
%%writefile enrich.py
# !pip install -q aiolimiter
import sys, pathlib, asyncio, aiohttp, orjson, sqlite3, time
from aiolimiter import AsyncLimiter

# ---------- config ----------
HEADERS = {"User-Agent": "ForDataCiteEnrichment (mailto:your_email)"}  # Put your email here
MAX_RPS = 45          # polite pool limit is 50; leave head-room
BATCH_SIZE = 10_000   # rows per INSERT
DB_PATH = pathlib.Path("crossref.sqlite").resolve()
ARTICLES = pathlib.Path("articles.txt")
# -----------------------------

# ---- platform tweak: prefer selector loop on Windows ----
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

# ---- read the DOI list ----
with ARTICLES.open(encoding="utf-8") as f:
    DOIS = [line.strip() for line in f if line.strip()]

# ---- make sure DB & table exist BEFORE the async part ----
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
with sqlite3.connect(DB_PATH) as db:
    db.execute("""
        CREATE TABLE IF NOT EXISTS works (
            doi  TEXT PRIMARY KEY,
            json TEXT
        )
    """)
    db.execute("PRAGMA journal_mode=WAL;")  # better concurrency

# ---------- async section ----------
limiter = AsyncLimiter(MAX_RPS, 1)   # 45 req / second
sem = asyncio.Semaphore(100)         # cap overall concurrency

async def fetch_one(session, doi: str):
    url = f"https://api.crossref.org/works/{doi}"
    async with limiter, sem:
        try:
            async with session.get(url, headers=HEADERS, timeout=10) as r:
                if r.status == 404:      # common "not found"
                    return doi, None
                r.raise_for_status()     # propagate other 4xx/5xx
                return doi, await r.json()
        except Exception:
            return doi, None             # log later, don't crash

async def main():
    start = time.perf_counter()
    db = sqlite3.connect(DB_PATH)                 # KEEP ONE connection
    db.execute("PRAGMA synchronous = NORMAL;")    # speed tweak
    async with aiohttp.ClientSession(json_serialize=orjson.dumps) as s:
        for chunk_start in range(0, len(DOIS), BATCH_SIZE):
            slice_ = DOIS[chunk_start:chunk_start + BATCH_SIZE]
            tasks = [asyncio.create_task(fetch_one(s, d)) for d in slice_]
            results = await asyncio.gather(*tasks)  # all tuples, no exceptions
            good_rows, bad_dois = [], []
            for doi, payload in results:
                if payload is None:
                    bad_dois.append(doi)
                else:
                    good_rows.append((doi, orjson.dumps(payload).decode()))
            if good_rows:
                db.executemany(
                    "INSERT OR IGNORE INTO works (doi, json) VALUES (?, ?)",
                    good_rows,
                )
                db.commit()
            if bad_dois:  # append for later retry
                with open("failures.log", "a", encoding="utf-8") as fh:
                    fh.writelines(f"{d}\n" for d in bad_dois)
            done = chunk_start + len(slice_)
            rate = done / (time.perf_counter() - start)
            print(f"{done:,}/{len(DOIS):,} ({rate:,.1f} DOI/s)")
    db.close()

if __name__ == "__main__":
    asyncio.run(main())
```
Then run:

```python
!python enrich.py
```
3) Finally extract the necessary fields
```python
import sqlite3
import orjson

i...
```
License: CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Original dataset
The original year-2019 dataset was downloaded from the World Bank Databank by the following approach on July 23, 2022.
Database: "World Development Indicators"
Country: 266 (all available)
Series: "CO2 emissions (kt)", "GDP (current US$)", "GNI, Atlas method (current US$)", and "Population, total"
Time: 1960, 1970, 1980, 1990, 2000, 2010, 2017, 2018, 2019, 2020, 2021
Layout: Custom -> Time: Column, Country: Row, Series: Column
Download options: Excel
Preprocessing
With LibreOffice:
- remove non-country entries (unnecessary rows after the line for Zimbabwe),
- shorten column names for easy processing: Country Name -> Country, Country Code -> Code, "XXXX ... GNI ..." -> GNI_1990, etc. (note the '_', not '-', for R).
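The column-name shortening can equally be done programmatically. A pandas sketch; the original World Bank header shown here is illustrative of the Databank export format:

```python
import pandas as pd

# Illustrative headers resembling a World Bank Databank Excel export.
df = pd.DataFrame(columns=[
    "Country Name",
    "Country Code",
    "1990 [YR1990] GNI, Atlas method (current US$)",
])
df = df.rename(columns={
    "Country Name": "Country",
    "Country Code": "Code",
    "1990 [YR1990] GNI, Atlas method (current US$)": "GNI_1990",  # '_' is R-friendly
})
```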