To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our analysis. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the csv into R. Here is how:
# Load the cleaned Uber data
df <- read.csv('uber.csv')
# Keep only the 'Black' car type
df_black <- subset(df, df$name == 'Black')
# Save the subset; getwd() shows the folder the file is written to
write.csv(df_black, "nameofthefileyouwanttosaveas.csv", row.names = FALSE)
getwd()
Your data will be in front of the world's largest data science community. What questions do you want to see answered?
# ERA-NUTS (1980-2018)
This dataset contains a set of time-series of meteorological variables based on Copernicus Climate Change Service (C3S) ERA5 reanalysis. The data files can be downloaded from here while notebooks and other files can be found on the associated Github repository.
This data has been generated with the aim of providing hourly time-series of the meteorological variables commonly used for power system modelling and, more generally, studies on energy systems.
An example of the analysis that can be performed with ERA-NUTS is shown in this video.
Important: this dataset is still a work in progress; we will add more analyses and variables in the near future. If you spot an error or something strange in the data, please tell us by sending an email or opening an Issue in the associated Github repository.
## Data
The time-series have hourly/daily/monthly frequency and are aggregated following the NUTS 2016 classification. NUTS (Nomenclature of Territorial Units for Statistics) is a European Union standard for referencing the subdivisions of countries (member states, candidate countries and EFTA countries).
This dataset contains NUTS0/1/2 time-series for the following variables obtained from the ERA5 reanalysis data (in brackets the name of the variable on the Copernicus Data Store and its unit measure):
- t2m: 2-meter temperature (`2m_temperature`, Celsius degrees)
- ssrd: Surface solar radiation (`surface_solar_radiation_downwards`, Watt per square meter)
- ssrdc: Surface solar radiation clear-sky (`surface_solar_radiation_downward_clear_sky`, Watt per square meter)
- ro: Runoff (`runoff`, millimeters)
There is also a set of derived variables:
- ws10: Wind speed at 10 meters (derived from `10m_u_component_of_wind` and `10m_v_component_of_wind`, meters per second)
- ws100: Wind speed at 100 meters (derived from `100m_u_component_of_wind` and `100m_v_component_of_wind`, meters per second)
- CS: Clear-Sky index (the ratio between the surface solar radiation and its clear-sky counterpart)
- HDD/CDD: Heating/Cooling Degree Days (derived from 2-meter temperature following the EUROSTAT definition; see the sketch below)
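For reference, the EUROSTAT definition uses fixed temperature thresholds. A minimal Python sketch of that formula, assuming a pandas Series of mean daily 2-meter temperatures in Celsius (an illustration, not the dataset's own code):

import pandas as pd

def eurostat_degree_days(t2m_daily_mean: pd.Series) -> pd.DataFrame:
    # EUROSTAT: HDD = 18 - Tm if Tm <= 15, else 0; CDD = Tm - 21 if Tm >= 24, else 0
    hdd = (18.0 - t2m_daily_mean).where(t2m_daily_mean <= 15.0, 0.0)
    cdd = (t2m_daily_mean - 21.0).where(t2m_daily_mean >= 24.0, 0.0)
    return pd.DataFrame({"HDD": hdd, "CDD": cdd})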
For each variable we have 350 599 hourly samples (from 01-01-1980 00:00:00 to 31-12-2019 23:00:00) for 34/115/309 regions (NUTS 0/1/2).
The data is provided in two formats:
- NetCDF version 4 (all the variables hourly and CDD/HDD daily). NOTE: the variables are stored as `int16` type using a `scale_factor` of 0.01 to minimise the size of the files.
- Comma Separated Value ("single index" format for all the variables and the time frequencies and "stacked" only for daily and monthly)
All the CSV files are stored in a zipped file for each variable.
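A minimal sketch for opening the NetCDF files in Python; the file name here is hypothetical, and `xarray` applies the `scale_factor` automatically when decoding:

import xarray as xr

# Hypothetical file name; see the data repository for the actual ones.
ds = xr.open_dataset("ERA-NUTS-t2m.nc")
# CF decoding applies the 0.01 scale_factor, returning floats in physical units.
print(ds)
# Pass mask_and_scale=False to keep the raw int16 values instead.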
## Methodology
The time-series have been generated using the following workflow:
1. The NetCDF files are downloaded from the Copernicus Data Store from the 'ERA5 hourly data on single levels from 1979 to present' dataset
2. The data is read in R with the climate4r packages and aggregated using the function `get_ts_from_shp` from panas. All the variables are aggregated at the NUTS boundaries using the average, except for the runoff, which consists of the sum of all the grid points within the regional/national borders.
3. The derived variables (wind speed, CDD/HDD, clear-sky) are computed and all the CSV files are generated using R
4. The NetCDF are created using `xarray` in Python 3.7.
NOTE: air temperature, solar radiation, runoff and wind speed hourly data have been rounded with two decimal digits.
## Example notebooks
In the folder `notebooks` on the associated Github repository there are two Jupyter notebooks which show how to deal effectively with the NetCDF data in `xarray` and how to visualise them in several ways using matplotlib or the enlopy package.
There are currently two notebooks:
- exploring-ERA-NUTS: shows how to open the NetCDF files (with Dask) and how to manipulate and visualise them.
- ERA-NUTS-explore-with-widget: explore the datasets interactively with Jupyter and ipywidgets.
The notebook `exploring-ERA-NUTS` is also available rendered as HTML.
## Additional files
In the folder `additional files` on the associated Github repository there is a map showing the spatial resolution of the ERA5 reanalysis and a CSV file specifying the number of grid points for each NUTS0/1/2 region.
## License
This dataset is released under CC-BY-4.0 license.
# ERA-NUTS (1980-2021)
This dataset contains a set of time-series of meteorological variables based on Copernicus Climate Change Service (C3S) ERA5 reanalysis. The data files can be downloaded from here while notebooks and other files can be found on the associated Github repository.
This data has been generated with the aim of providing hourly time-series of the meteorological variables commonly used for power system modelling and, more generally, studies on energy systems.
An example of the analysis that can be performed with ERA-NUTS is shown in this video.
Important: this dataset is still a work in progress; we will add more analyses and variables in the near future. If you spot an error or something strange in the data, please tell us by sending an email or opening an Issue in the associated Github repository.
The time-series have hourly/daily/monthly frequency and are aggregated following the NUTS 2016 classification. NUTS (Nomenclature of Territorial Units for Statistics) is a European Union standard for referencing the subdivisions of countries (member states, candidate countries and EFTA countries).
This dataset contains NUTS0/1/2 time-series for the following variables obtained from the ERA5 reanalysis data (in brackets the name of the variable on the Copernicus Data Store and its unit measure):
- t2m: 2-meter temperature (`2m_temperature`, Celsius degrees)
- ssrd: Surface solar radiation (`surface_solar_radiation_downwards`, Watt per square meter)
- ssrdc: Surface solar radiation clear-sky (`surface_solar_radiation_downward_clear_sky`, Watt per square meter)
- ro: Runoff (`runoff`, millimeters)
- sd: Snow depth (`snow_depth`, meters)
There is also a set of derived variables:
- ws10: Wind speed at 10 meters (derived from `10m_u_component_of_wind` and `10m_v_component_of_wind`, meters per second)
- ws100: Wind speed at 100 meters (derived from `100m_u_component_of_wind` and `100m_v_component_of_wind`, meters per second)
- CS: Clear-Sky index (the ratio between the surface solar radiation and its clear-sky counterpart)
- RH: Relative Humidity (computed following Lawrence, BAMS 2005 and Alduchov & Eskridge, 1996)
- HDD/CDD: Heating/Cooling Degree Days (derived from 2-meter temperature following the EUROSTAT definition)
For each variable we have 367 440 hourly samples (from 01-01-1980 00:00:00 to 31-12-2021 23:00:00) for 34/115/309 regions (NUTS 0/1/2).
The data is provided in two formats:
- NetCDF version 4 (all the variables hourly and CDD/HDD daily). NOTE: the variables are stored as `int16` type using a `scale_factor` to minimise the size of the files.
- Comma Separated Value ("single index" format for all the variables and the time frequencies, and "stacked" only for daily and monthly). All the CSV files are stored in a zipped file for each variable.
The time-series have been generated using the following workflow:
1. The NetCDF files are downloaded from the Copernicus Data Store from the 'ERA5 hourly data on single levels from 1979 to present' dataset
2. The data is read in R with the climate4r packages and aggregated using the function `get_ts_from_shp` from panas. All the variables are aggregated at the NUTS boundaries using the average, except for the runoff, which consists of the sum of all the grid points within the regional/national borders.
3. The derived variables are computed and all the CSV files are generated using R.
4. The NetCDF files are created using `xarray` in Python 3.8.
In the folder `notebooks` on the associated Github repository there are two Jupyter notebooks which show how to deal effectively with the NetCDF data in `xarray` and how to visualise them in several ways using matplotlib or the enlopy package.
There are currently two notebooks:
- exploring-ERA-NUTS: shows how to open the NetCDF files (with Dask) and how to manipulate and visualise them.
- ERA-NUTS-explore-with-widget: explore the datasets interactively with Jupyter and ipywidgets.
The notebook `exploring-ERA-NUTS` is also available rendered as HTML.
In the folder `additional files` on the associated Github repository there is a map showing the spatial resolution of the ERA5 reanalysis and a CSV file specifying the number of grid points for each NUTS0/1/2 region.
This dataset is released under CC-BY-4.0 license.
Changelog:
- 2022-04-08: Added Relative Humidity (RH)
- 2022-03-07: Added the missing month in CDD/HDD
- 2022-02-08: Updated the wind speed and temperature data due to missing months
License: MIT (https://opensource.org/licenses/MIT)
Scripts used for analysis of V1 and V2 Datasets.
- seurat_v1.R - initialize seurat object from 10X Genomics cellranger outputs. Includes filtering, normalization, regression, variable gene identification, PCA analysis, clustering, tSNE visualization. Used for v1 datasets.
- merge_seurat.R - merge two or more seurat objects into one seurat object. Perform linear regression to remove batch effects from separate objects. Used for v1 datasets.
- subcluster_seurat_v1.R - subcluster clusters of interest from Seurat object. Determine variable genes, perform regression and PCA. Used for v1 datasets.
- seurat_v2.R - initialize seurat object from 10X Genomics cellranger outputs. Includes filtering, normalization, regression, variable gene identification, and PCA analysis. Used for v2 datasets.
- clustering_markers_v2.R - clustering and tSNE visualization for v2 datasets.
- subcluster_seurat_v2.R - subcluster clusters of interest from Seurat object. Determine variable genes, perform regression and PCA analysis. Used for v2 datasets.
- seurat_object_analysis_v1_and_v2.R - downstream analysis and plotting functions for seurat object created by seurat_v1.R or seurat_v2.R.
- merge_clusters.R - merge clusters that do not meet gene threshold. Used for both v1 and v2 datasets.
- prepare_for_monocle_v1.R - subcluster cells of interest and perform linear regression, but not scaling, in order to input normalized, regressed values into monocle with monocle_seurat_input_v1.R.
- monocle_seurat_input_v1.R - monocle script using seurat batch corrected values as input for v1 merged timecourse datasets.
- monocle_lineage_trace.R - monocle script using nUMI as input for v2 lineage traced dataset.
- monocle_object_analysis.R - downstream analysis for monocle object - BEAM and plotting.
- CCA_merging_v2.R - script for merging v2 endocrine datasets with canonical correlation analysis and determining the number of CCs to include in downstream analysis.
- CCA_alignment_v2.R - script for downstream alignment, clustering, tSNE visualization, and differential gene expression analysis.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
The objective behind attempting this dataset was to understand the predictors that contribute to life expectancy around the world. I have used Linear Regression, Decision Tree and Random Forest for this purpose.
Steps involved:
1. Read the csv file.
2. Data cleaning:
- Variables Country and Status had character data types and had to be converted to factor.
- 2563 missing values were encountered, with the Population variable having the most missing values (652).
- Rows with missing values were dropped before running the analysis.
3. Run linear regression:
- Before running linear regression, 3 variables were dropped as they did not have much of an effect on the dependent variable (Life Expectancy): Country, Year and Status. This left 19 variables (1 dependent and 18 independent).
- We run the linear regression. Multiple R-squared is 83%, which means the independent variables can explain 83% of the variance in the dependent variable.
- OUTLIER DETECTION: We check for outliers using the IQR and find 54 outliers. These outliers are removed before running the regression again; multiple R-squared increases from 83% to 86%.
- MULTICOLLINEARITY: We check for multicollinearity using the VIF (Variance Inflation Factor), which flags cases where two or more independent variables are highly correlated. The rule of thumb is that variables with absolute VIF values above 5 should be removed. We find 6 such variables: Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, thinness5.9. Infant deaths and Under-five deaths are strongly collinear, so we drop Infant deaths (which has the higher VIF value).
- When we run the linear regression model again, the VIF value of Under.Five.Deaths drops from 211.46 to 2.74, while the other variables' VIF values decrease only slightly. Variable thinness1.19 is then dropped and we run the regression once more.
- Variable thinness5.9, whose absolute VIF value was 7.61, drops to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping these as I consider them important independent variables.
- SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA: We fit on the train data and get a multiple R-squared of 86% and a p-value less than alpha, i.e. the model is statistically significant. We use the trained model to predict the test data and compute RMSE and MAPE, using library(Metrics).
- In linear regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that, on average, the predicted values have an error of 3.2 years compared to the actual life expectancy values.
- MAPE (Mean Absolute Percentage Error) is 0.037. This indicates a prediction accuracy of 96.3% (1 - 0.037).
- MAE (Mean Absolute Error) is 2.55. This indicates that, on average, the predicted values deviate by approximately 2.55 years from the actual values.
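The VIF screening described above was done in R; a minimal Python sketch of the same check, where the predictor matrix `X` is a hypothetical stand-in for the 18 independent variables:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    # VIF per column; values above the rule-of-thumb threshold of 5 flag collinearity
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.Series(vifs, index=X.columns).sort_values(ascending=False)

# Drop the worst offender and re-check, iterating as described above.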
Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
[NOTE - 11/24/2021: this dataset supersedes an earlier version, https://doi.org/10.15482/USDA.ADC/1518654]
Data sources. Time series data on cattle fever tick incidence, 1959-2020, and climate variables January 1950 through December 2020, form the core information in this analysis. All variables are monthly averages or sums over the fiscal year, October 01 (of the prior calendar year, y-1) through September 30 of the current calendar year (y). Annual records on monthly new detections of Rhipicephalus microplus and R. annulatus (cattle fever tick, CFT) on premises within the Permanent Quarantine Zone (PQZ) were obtained from the Cattle Fever Tick Eradication Program (CFTEP) maintained jointly by the United States Department of Agriculture (USDA), Animal Plant Health Inspection Service and the USDA Animal Research Service in Laredo, Texas. Details of tick survey procedures, CFTEP program goals and history, and the geographic extent of the PQZ are in the main text and in the Supporting Information (SI) of the associated paper. Data sources on oceanic indicators, on local meteorology, and their pretreatment are detailed in SI.
Data pretreatment. To address the low signal-to-noise ratio and non-independence of observations common in time series, we transformed all explanatory and response variables using six consecutive steps: (i) first differences (year y minus year y-1) were calculated, (ii) these were then converted to z scores (z = (x - μ) / σ, where x is the raw value, μ is the population mean, and σ is the population standard deviation), (iii) linear regression was applied to remove any directional trends, (iv) moving averages (typically 11-year point-centered moving averages) were calculated for each variable, (v) a lag was applied if/when deemed necessary, and (vi) statistics were calculated (r, n, df, P<, p<).
Principal component analysis (PCA). A matrix of z-score first differences of the 13 climate variables, and CFT (1960-2020), was entered into the XLSTAT principal components analysis routine; we used Pearson correlation of the 14 x 60 matrix, and Varimax rotation of the first two components.
Autoregressive Integrated Moving Average (ARIMA). An ARIMA (2,0,0) model was selected among 7 test models in which the p, d, and q terms were varied; selection was made on the basis of the lowest RMSE and AIC statistics, and reduction of partial autocorrelation outcomes. A best-model linear regression of CFT values on ARIMA-predicted CFT was developed using XLSTAT linear regression software with the objective of examining statistical properties (r, n, df, P<, p<), including the Durbin-Watson index of order-1 autocorrelation, and Cook's Di distance index. Cross-validation of the model was made by withholding the last 30, and then the first 30, observations in a pair of regressions.
Forecast of the next major CFT outbreak. It is generally recognized that the onset year of the first major CFT outbreak was not 1959, but may have occurred earlier in the decade. We postulated the actual underlying pattern is fully 44 years from the start to the end of a CFT cycle linked to external climatic drivers (SI Appendix, Hypothesis on CFT cycles). The hypothetical reconstruction was projected one full CFT cycle into the future. To substantiate the projected trend, we generated a power spectrum analysis based on 1-year values of the 1959-2020 CFT dataset using SYSTAT AutoSignal software. The outcome included a forecast to 2100; this was compared to the hypothetical reconstruction and projection.
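A minimal Python sketch of pretreatment steps (i)-(v), as a pandas/numpy illustration; the original workflow used XLSTAT and SYSTAT:

import numpy as np
import pandas as pd

def pretreat(x: pd.Series, window: int = 11, lag: int = 0) -> pd.Series:
    d = x.diff().dropna()                               # (i) first differences
    z = (d - d.mean()) / d.std(ddof=0)                  # (ii) z-scores with population sd
    t = np.arange(len(z))
    slope, intercept = np.polyfit(t, z.to_numpy(), 1)
    detrended = z - (slope * t + intercept)             # (iii) remove linear trend
    smoothed = detrended.rolling(window, center=True).mean()  # (iv) point-centered moving average
    return smoothed.shift(lag)                          # (v) optional lag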
Any differences were noted, and the start and end dates of the next major CFT outbreak identified.
Resources in this dataset:
- Resource Title: CFT and climate data. File Name: climate-cft-data2.csv. Description: Main dataset; see data dictionary for information on each column.
- Resource Title: Data dictionary (metadata). File Name: climate-cft-metadata2.csv. Description: Information on variables and their origin.
- Resource Title: Fitted models. File Name: climate-cft-models2.xlsx. Recommended software: Microsoft Excel (https://www.microsoft.com/en-us/microsoft-365/excel), XLSTAT (https://www.xlstat.com/en/), SYSTAT AutoSignal (https://www.systat.com/products/AutoSignal/).
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Occupational characteristics, as well as personal and health-system characteristics of health workers, were hypothesized to be associated with an increased risk of COVID-19 disease within the Kenyan tertiary-level hospital. Data were therefore collected using a researcher-administered, literature-based questionnaire via phone interviews on self-reported characteristics of health workers who worked in Kenyatta National Hospital between November 2021 and December 2021. The responses in the dataset were treated as potential explanatory exposure variables for the study, while COVID-19 status was the study outcome. The participants consented to participation and their consent was documented before questionnaire administration. The collection of the data was approved by the Kenyatta National Hospital-University of Nairobi Ethics Review Committee (P462/06/2021), permission to conduct the study was given by the administration of Kenyatta National Hospital, and the study licence was granted by the National Commission for Science, Technology and Innovation for Kenya. The participants' identifier information was removed and de-identified: first, the questionnaire responses were anonymized; second, the contact-information database used during phone interviews was kept strictly confidential, restricted, password-protected, and used for this study purpose only. The dataset was then cleaned in MS Excel to remove obvious errors and exported into R statistical software for analysis. Missingness of data was assessed prior to analysis. Aggregate variables of interest were derived from the primary variables, and multiple imputation was applied to address missing-data bias. The data were analysed by regression methods, and future researchers can apply similar methods to prove or disprove their hypotheses based on the dataset.
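The description does not name the imputation implementation used; as one hedged illustration of multiple imputation, a chained-equations-style imputer in scikit-learn on a toy frame (column names hypothetical):

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy stand-in for the de-identified questionnaire matrix.
df = pd.DataFrame({"age": [34.0, 41.0, np.nan, 29.0],
                   "weekly_patient_contacts": [12.0, np.nan, 30.0, 8.0]})
imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                       columns=df.columns)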
Summary
This dataset includes all data and code required to reproduce the analyses presented in the manuscript "Application variables that affect efficacy of disinfestants sprayed on different substrate materials to control Colletotrichum siamense".
Disinfestants are an important sanitation tool used to eliminate plant pathogens, but it is unknown how effective they are even when applied following recommended practices. The efficacy of a selection of hypochlorite, isopropyl alcohol, quaternary ammonium and peroxygen compounds against Colletotrichum siamense was evaluated relative to substrate porosity, contact time and disinfestant wettability properties in seven studies where disinfestants were sprayed on six substrate materials (concrete, galvanized metal, polypropylene ground fabric, polyethylene plastic sheet, pressure-treated pine, and twin-wall clear polycarbonate) commonly found in ornamental plant production surfaces.
This dataset includes the raw data and R statistical software code required to reproduce the data processing, statistical model fitting, and production of figures and tables in the associated manuscript. For each of the seven studies, we fit a Bayesian generalized linear mixed model to the data. We graphically assessed model fit, then used posterior distributions of the expected values of the posterior predictive to make predictions. From each of these distributions, we used the median to represent the point estimate and 95% equal-tailed credible intervals to represent uncertainty about the means.
Contents
The following files are included:
- data_readme.md: Markdown file with column metadata for all CSV files and an explanation of the treatment codes.
- disinfestant_effectiveness_analysis.Rmd: RMarkdown file with all R code required for reproducing the content of the manuscript.
- disinfestant_effectiveness_analysis.html: Rendered output of the RMarkdown notebook.
Raw data:
- Study1 aDsnfSbstAllData.csv: Study 1 data (comparison of disinfestant by substrate by contact time)
- Study2 aSbstDsnfTimeBGFMdata reduced.csv: Study 2 data (comparison of disinfestant by contact time on fabric substrate)
- Study3 ConcreteEffccy.csv: Study 3 data (comparison of disinfestant by treatment schedule on concrete substrate)
- Study4 FabricEffccy.csv: Study 4 data (comparison of disinfestant by treatment schedule on fabric substrate)
- Study5 WoodEffccy.csv: Study 5 data (comparison of disinfestant by treatment schedule on wood substrate)
- Study6 KZPTdensityData.csv: Study 6 data (comparison of disinfestant by substrate by percent coverage)
- Study7 ZPdensityData.csv: Study 7 data (comparison of disinfestant by spray cover level)
Cached model objects:
Some of the models take several minutes to run and compile, so the cached objects containing the model fits are also archived here to enable rerunning the analysis notebook without refitting all models.
- exp1zoibfitsd5.rds
- exp2zoibfitsd10.rds
- exp3zoibfit_concretesd10.rds
- exp4zoibfit_fabricsd10.rds
- exp5zoibfit_woodsd10.rds
- exp6zipfitsd5.rds
- exp7zipfitsd5_continuous.rds
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.
Methods Overview
This is a compilation of second-generation breeding bird atlas data and corresponding breeding bird survey data. It contains presence-absence breeding bird observations in 5 U.S. states (MA, MI, NY, PA, VT), sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation and mean summer temperature (from PRISM), and NLCD 2006 land-use data.
Each row contains all observations per sampling unit, with additional tables containing information on the impact of sampling effort on richness, a rareness table of species per dataset, and two summary tables for both bird diversity and environmental variables.
The methods for compilation are contained in the supplementary information of the manuscript but also here:
Bird data
For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).
Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.
The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R© 2019, The R Foundation for Statistical Computing Platform) and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R© to yield a final dataset with all species observations and environmental variables for each observation unit.
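The horizontal pivot described above was done in R; as a hedged illustration of the same reshaping, a Python sketch with hypothetical column names:

import pandas as pd

# Toy long-format species table (column names hypothetical).
obs = pd.DataFrame({
    "unit_id": [1, 1, 2],
    "species": ["Northern Flicker", "Myrtle Warbler", "Northern Flicker"],
    "present": [1, 1, 1],
})
# One row per sampling unit, one 0/1 column per species; absences filled with 0.
wide = obs.pivot_table(index="unit_id", columns="species",
                       values="present", fill_value=0)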
Environmental data
Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Note: a new time-series dataset from ERA5 has been published — this one won't be updated/maintained anymore
Country averages of meteorological variables generated using the R routines available in the package panas based on the Copernicus Climate Change ERA5 reanalyses. The time-series are at hourly resolution and the included variables are:
2-meter temperature (t2m),
snow depth (snow_depth),
mean sea-level pressure (mslp),
runoff,
surface solar radiation (ssrd),
surface solar radiation with clear-sky (ssrdc),
temperature at 850hPa (t850),
total precipitation (total_prec),
zonal (west-east direction) wind speed at 10m (u10) and 100m (u100),
meridional (north-south direction) wind speed at 10m (v10) and 100m (v100),
dew point temperature (dew)
The original gridded data has been averaged considering the national borders of the following countries (European 2-letter country codes are used, i.e. ISO 3166 alpha-2 codes, with the exception of GB->UK and GR->EL): AL, AT, BA, BE, BG, BY, CH, CY, CZ, DE, DK, DZ, EE, EL, ES, FI, FR, HR, HU, IE, IS, IT, LT, LU, LV, MD, ME, MK, NL, NO, PL, PT, RO, RS, SE, SI, SK, UA, UK.
The units of measure are listed on the official page: https://cds.climate.copernicus.eu/cdsapp#!/dataset/era5-hourly-data-on-single-levels-from-2000-to-2017?tab=overview
The script used to generate the files is available on github here
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Definition of important variables.
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023 initially, will be continuously updated.
🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.
The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in a structured, column-oriented, compressed binary format Apache Parquet with yearly partitioning scheme, enabling end-users to query only variables of interest at scale.
The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.
Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.
Importing The Data
You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.
Python
🤗 Hugging Face Datasets
It is as easy as:
from datasets import load_dataset
import polars as pl

# Load the entire dataset via the Datasets library
RFSD = load_dataset('irlspbru/RFSD')

# Or read a single yearly partition with polars
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')
Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.
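If a random sample is needed, draw it explicitly after loading; a sketch assuming the `RFSD_2023` frame from the snippet above:

# Draw an explicit random sample instead of streaming the first n rows.
sample_2023 = RFSD_2023.sample(n=100_000, seed=42)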
Local File Import
Importing in Python requires the pyarrow package to be installed.
import pyarrow.dataset as ds
import polars as pl

# Open the partitioned dataset
RFSD = ds.dataset("local/path/to/RFSD")

# Inspect the schema
print(RFSD.schema)

# Load the entire dataset into memory
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load a single year
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load selected columns for a single year
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)
# Swap the technical line codes for descriptive names using the supplied dictionary
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})
R
Local File Import
Importing in R requires the arrow package to be installed.
library(arrow)
library(data.table)

# Open the partitioned dataset
RFSD <- open_dataset("local/path/to/RFSD")

# Inspect the schema
schema(RFSD)

# Load the entire dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load a single year
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load selected columns for a single year
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Swap the technical line codes for descriptive names using the supplied dictionary
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)
Use Cases
🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md
🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md
🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md
FAQ
Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?
To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.
What is the data period?
We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July 2025 (see Version and Update Policy below).
Why are there no data for firm X in year Y?
Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:
We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).
Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.
A firm may have submitted its annual statement even though, according to the Uniform State Register of Legal Entities (EGRUL), it was not active in that year. We remove such filings.
Why is the geolocation of firm X incorrect?
We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.
Why is the data for firm X different from https://bo.nalog.ru/?
Many firms submit correcting statements after the initial filing. While we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.
Why is the data for firm X unrealistic?
We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.
Why is the data for groups of companies different from their IFRS statements?
We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.
Why is the data not in CSV?
The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.
Version and Update Policy
Version (SemVer): 1.0.0.
We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous-year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. There is thus a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.
Licence
Creative Commons License Attribution 4.0 International (CC BY 4.0).
Copyright © the respective contributors.
Citation
Please cite as:
@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}
Acknowledgments and Contacts
Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru
Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
This dataset includes historical erosion data (1990–2019), future soil erosion projections under SSP126, SSP245, and SSP585 scenarios (2021–2100), and predicted R and C factors for each period.
Future R factors
We incorporated 25 Global Climate Models (GCMs) from CMIP6 for calculating the future R factors, selected via the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6) project (Table S3). The selection was based on the completeness of their time series and their alignment with the selected scenarios. Rainfall projections were corrected using quantile delta mapping (QDM) (Cannon et al., 2015) to address systematic biases in intensity distributions while preserving the projected trends in mean rainfall and extremes, which is critical for soil erosion analysis (Eekhout and de Vente, 2019). Bias correction was conducted using a 25-year baseline (1990–2014), with adjustments made monthly to correct for seasonal biases. The corrected bias functions were then applied to adjust the daily rainfall data for 2020–2100 using the "ibicus" package, an open-source Python tool for bias adjustment and climate model evaluation. A minimum daily rainfall threshold of 0.1 mm was used to define rainy days, following established studies (Bulovic, 2024; Eekhout and de Vente, 2019; Switanek et al., 2017). Additionally, the study employed QDM to correct biases in historical GCM simulations, ensuring the applicability of the QDM method for rainfall bias correction in the YTRB. A baseline period of 1990–2010 was selected to establish the bias correction function, which was subsequently applied to adjust GCM simulations for 2011–2014. To evaluate the effectiveness of this calibration, we compared the annual mean precipitation from bias-corrected GCMs during 2011–2014 with observed precipitation data at the pixel level (Figs. S2, S3), using R² as the evaluation metric. The results showed a significant increase in R² after bias correction, confirming the effectiveness of the QDM approach.
Future C factors
To ensure the accuracy of the C factor predictions, we selected five CMIP6 climate models (Table S4) with high spatial resolution compared to other CMIP6 climate models. Of the five selected climate models, CanESM5, IPSL-CM6-LR, and MIROC-ES2L have high equilibrium climate sensitivity (ECS) values. The ECS is the expected long-term warming after a doubling of atmospheric CO2 concentrations and is one of the most important indicators for understanding the impact of future warming (Rao et al., 2023). Therefore, we selected these five climate models with ECS values >3.0 to capture the full range of potential climate-induced changes affecting soil erosion. After selecting the climate models, we constructed an XGBoost model using historical C factor data and bioclimatic variables from the WorldClim data portal. WorldClim provides global gridded datasets with a 1 km² spatial resolution, including 19 bioclimatic variables derived from monthly temperature and precipitation data, reflecting annual trends, seasonality, and extreme environmental conditions (Hijmans et al., 2005). However, strong collinearity among the 19 bioclimatic variables and an excessive number of input features may increase model complexity and reduce XGBoost's predictive accuracy. To optimize performance, we employed Recursive Feature Elimination (RFE), an iterative method for selecting the most relevant features while preserving prediction accuracy (Kornyo et al., 2023; Xiong et al., 2024).
In each iteration, the current subset of features was used to train an XGBoost model, and feature importance was evaluated to remove the least significant variable, gradually refining the feature set. Using 80% of the data for training and 20% for testing, we employed 5-fold cross-validation to determine the feature subset that maximized the average R², ensuring optimal model performance. Additionally, a Genetic Algorithm (GA) was applied in each iteration to optimize the hyperparameters of the XGBoost model, which is crucial for enhancing both the efficiency and robustness of the model (Zhong and Liu, 2024; Zou et al., 2024). Finally, based on the variable selection results from RFE, the bioclimatic variables of future climate models were input into the trained XGBoost model to obtain the average C factor for the five selected climate models across four future periods (2020–2040, 2040–2060, 2060–2080, and 2080–2100).
RUSLE model
In this study, the mean annual soil loss was initially estimated using the RUSLE model, which enables us to estimate the spatial pattern of soil erosion (Renard et al., 1991). In areas where data are scarce, we consider RUSLE to be an effective runoff-dependent soil erosion model because it requires only limited data for the study area (Haile et al., 2012).
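As a hedged illustration of the RFE loop described above (synthetic stand-in data; the Genetic Algorithm hyperparameter search is omitted), a sketch using scikit-learn's RFECV around an XGBoost regressor:

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

# Synthetic stand-in for the 19 bioclimatic predictors and the historical C factor.
X, y = make_regression(n_samples=500, n_features=19, noise=0.1, random_state=0)

# Drop one feature per iteration; keep the subset maximizing mean 5-fold CV R².
selector = RFECV(XGBRegressor(n_estimators=200, random_state=0), step=1,
                 cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2")
selector.fit(X, y)
print(selector.support_)  # boolean mask of the retained bioclimatic features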
Welcome to my Kickstarter case study! In this project I'm trying to understand what the success factors of a Kickstarter campaign are, analyzing a publicly available dataset from Web Robots. The analysis will follow the data analysis roadmap: ASK, PREPARE, PROCESS, ANALYZE, SHARE and ACT.
ASK
Different questions will guide my analysis:
1. Does the campaign duration influence the success of the project?
2. Does the chosen funding goal matter?
3. Which campaign category is the most likely to be successful?
PREPARE
I'm using the Kickstarter Datasets publicly available on Web Robots. Data are scraped by a bot which collects them in CSV format once a month. Each table contains:
- backers_count: number of people that contributed to the campaign
- blurb: a captivating text description of the project
- category: the label categorizing the campaign (technology, art, etc.)
- country
- created_at: day and time of campaign creation
- deadline: day and time of the campaign's latest possible end
- goal: amount to be collected
- launched_at: date and time of campaign launch
- name: name of campaign
- pledged: amount of money collected
- state: success or failure of the campaign
Each monthly scrape produces a huge number of CSVs, so for an initial analysis I decided to focus on three months: November and December 2023, and January 2024. I downloaded zipped files which, once unzipped, contained respectively 7 CSVs (November 2023), 8 CSVs (December 2023), and 8 CSVs (January 2024). Each month went into its own folder.
Having a first look at the spreadsheets, it's clear that there is some need for cleaning and modification: for example, dates and times are stored as Unix timestamps, there are multiple columns that are not helpful for the scope of my analysis, and currencies need to be standardized (some are US$, some GB£, etc.). In general, I have all the data I need to answer my initial questions, identify trends, and make predictions.
PROCESS
I decided to use R to clean and process the data. For each month I set up a new working environment in its own folder and loaded the necessary libraries:
R
library(tidyverse)
library(lubridate)
library(ggplot2)
library(dplyr)
library(tidyr)
I scripted general R code that searches for CSV files in the folder, opens each as a separate variable, and collects them into a single list of data frames:
# Find all CSVs in the working directory
csv_files <- list.files(pattern = "\\.csv$")
data_frames <- list()
# Read each file into its own variable and collect them in a list
for (file in csv_files) {
  variable_name <- sub("\\.csv$", "", file)
  assign(variable_name, read.csv(file))
  data_frames[[variable_name]] <- get(variable_name)
}
Next, I converted some columns to numeric values because I was running into type errors when trying to merge all the CSVs into a single comprehensive file.
data_frames <- lapply(data_frames, function(df) {
  # Coerce columns that sometimes load as character to numeric
  df$converted_pledged_amount <- as.numeric(df$converted_pledged_amount)
  df$usd_exchange_rate <- as.numeric(df$usd_exchange_rate)
  df$usd_pledged <- as.numeric(df$usd_pledged)
  return(df)
})
In each folder I then ran a command to merge the CSVs in a single file (one for November 2023, one for December 2023 and one for January 2024):
all_nov_2023 = bind_rows(data_frames)
all_dec_2023 = bind_rows(data_frames)
all_jan_2024 = bind_rows(data_frames)
After merging, I converted the Unix timestamps into readable datetimes for the columns "created", "launched" and "deadline", and dropped all the rows that had these fields set to 0. I also parsed the "category" column to keep only the campaign's category slug, discarding information unnecessary for the scope of my analysis. The final table was then saved.
filtered_dec_2023 <- all_dec_2023 %>% #this was modified according to the considered month
select(blurb, backers_count, category, country, created_at, launched_at, deadline, currency, usd_exchange_rate, goal, pledged, state) %>%
filter(created_at != 0 & deadline != 0 & launched_at != 0) %>%
mutate(category_slug = sub('.*?"slug":"(.*?)".*', '\\1', category)) %>%
mutate(created = as.POSIXct(created_at, origin = "1970-01-01")) %>%
mutate(launched = as.POSIXct(launched_at, origin = "1970-01-01")) %>%
mutate(setted_deadline = as.POSIXct(deadline, origin = "1970-01-01")) %>%
select(-category, -deadline, -launched_at, -created_at) %>%
relocate(created, launched, setted_deadline, .before = goal)
write.csv(filtered_dec_2023, "filtered_dec_2023.csv", row.names = FALSE)
The three generated files were then merged into one comprehensive CSV called "kickstarter_cleaned" which was further modified, converting a...
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Supplementary Material for
Human Wellbeing and Machine Learning
by Ekaterina Oparina (r) Caspar Kaiser (r) Niccolò Gentile; Alexandre Tkatchenko, Andrew E. Clark, Jan-Emmanuel De Neve and Conchita D'Ambrosio
This repository contains the list of variables that are used in the Extended Set analysis for the German Socio-Economic Panel (SOEP.csv), the UK Household Longitudinal Study (UKHLS.csv), and the American Gallup Daily Poll (Gallup.csv). We use the 2013 Wave of Gallup and SOEP, and Wave 3 of the UKHLS (which covers 2011-2012). Our dataset includes all of the available variables, apart from direct measures of subjective wellbeing (such as domain satisfaction, happiness, or subjective health) or mental health. We also exclude variables with more than 50% missing values.
The presented lists include the variables before processing. For the analysis, we convert categorical variables into a set of dummies, one for each category. We then drop all perfectly collinear variables.
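A hedged sketch of that preprocessing (hypothetical column names, not the actual survey variables): one dummy per category, then dropping perfectly collinear columns via a greedy rank check:

import numpy as np
import pandas as pd

df = pd.DataFrame({"employment": ["ft", "pt", "ft", "pt"],
                   "region": ["north", "south", "north", "east"]})

# One dummy per category (no reference category dropped), as described above.
X = pd.get_dummies(df, columns=["employment", "region"], dtype=float)

# Keep a column only if it increases the rank of the design matrix.
keep = []
for col in X.columns:
    if np.linalg.matrix_rank(X[keep + [col]].to_numpy()) == len(keep) + 1:
        keep.append(col)
X = X[keep]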
Dataset contains continuous underway measurements of ship's position, sea surface temperature, salinity, turbidity, fluorescence, air temperature, barometric pressure, relative humidity, chlorophyll, plus wind speed and direction. These data are part of a long-term ecosystem project centered on the South Shetland Islands of the Antarctic Peninsula, Antarctica. The datasets incorporated into this archive have been collected by researchers affiliated with the U.S. Antarctic Marine Living Resources (AMLR) Program under the National Marine Fisheries Service (NMFS); funding for this project was from the NMFS AMLR Program. The actual data collected in any given year vary, depending on funding availability for resources, staging and sea days.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The file set is a freely downloadable aggregation of information about Australian schools. The individual files represent a series of tables which, when considered together, form a relational database. The records cover the years 2008-2014 and include information on approximately 9500 primary and secondary school main-campuses and around 500 subcampuses. The records all relate to school-level data; no data about individuals is included. All the information has previously been published and is publicly available, but it has not previously been released as a documented, useful aggregation. The information includes: (a) the names of schools, (b) staffing levels, including full-time and part-time teaching and non-teaching staff, (c) student enrolments, including the number of boys and girls, (d) school financial information, including Commonwealth government, state government, and private funding, (e) test data, potentially for school years 3, 5, 7 and 9, relating to an Australian national testing programme known by the trademark 'NAPLAN'.
Documentation of this Edition 2016.1 is incomplete but the organization of the data should be readily understandable to most people. If you are a researcher, the simplest way to study the data is to make use of the SQLite3 database called 'school-data-2016-1.db'. If you are unsure how to use an SQLite database, ask a guru.
The database was constructed directly from the other included files by running the following command at a command-line prompt: sqlite3 school-data-2016-1.db < school-data-2016-1.sql Note that a few, non-consequential, errors will be reported if you run this command yourself. The reason for the errors is that the SQLite database is created by importing a series of '.csv' files. Each of the .csv files contains a header line with the names of the variable relevant to each column. The information is useful for many statistical packages but it is not what SQLite expects, so it complains about the header. Despite the complaint, the database will be created correctly.
Briefly, the data are organized as follows.
(1) The .csv files ('comma separated values') do not actually use a comma as the field delimiter. Instead, the vertical bar character '|' (ASCII octal 174, decimal 124, hex 7C) is used. If you read the .csv files using Microsoft Excel, Open Office, or Libre Office, you will need to set the field separator to '|'. Check your software documentation to understand how to do this.
(2) Each school-related record is indexed by an identifier called 'ageid'. The ageid uniquely identifies each school and consequently serves as the appropriate variable for JOIN-ing records in different data files. For example, the first school-related record after the header line in file 'students-headed-bar.csv' shows the ageid of the school as 40000. The relevant school name can be found by looking in the file 'ageidtoname-headed-bar.csv' to discover that the ageid of 40000 corresponds to a school called 'Corpus Christi Catholic School'.
(3) In addition to the variable 'ageid', each record is also identified by one or two 'year' variables. The most important purpose of a year identifier is to indicate the year that is relevant to the record. For example, if one turns again to file 'students-headed-bar.csv', one sees that the first seven school-related records after the header line all relate to the school Corpus Christi Catholic School with ageid of 40000. The variable that identifies the important differences between these seven records is 'studentyear', which shows the year to which the student data refer. One can see, for example, that in 2008 there were a total of 410 students enrolled, of whom 185 were girls and 225 were boys (look at the variable names in the header line).
(4) The variables relating to years are given different names in each of the different files ('studentsyear' in the file 'students-headed-bar.csv', 'financesummaryyear' in the file 'financesummary-headed-bar.csv'). Despite the different names, the year variables provide the second-level means for joining information across files. For example, if you wanted to relate the enrolments at a school in each year to its financial state, you might wish to JOIN records using 'ageid' in the two files and, secondarily, matching 'studentsyear' with 'financialsummaryyear'.
(5) The manipulation of the data is most readily done using the SQL language with the SQLite database, but it can also be done in a variety of statistical packages.
(6) It is our intention for Edition 2016-2 to create large 'flat' files suitable for use by non-researchers who want to view the data with spreadsheet software. The disadvantage of such 'flat' files is that they contain vast amounts of redundant information and might not display the data in the form that the user most wants.
(7) Geocoding of the schools is not available in this edition.
(8) Some files, such as 'sector-headed-bar.csv', are not used in the creation of the database but are provided as a convenience for researchers who might wish to recode some of the data to remove redundancy.
(9) A detailed example of a suitable SQLite query can be found in the file 'school-data-sqlite-example.sql'. The same query, used in the context of analyses done with the excellent, freely available R statistical package (http://www.r-project.org), can be seen in the file 'school-data-with-sqlite.R'.
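As a sketch of reading the pipe-delimited files outside SQLite, here is a Python example with pandas; the file names and the 'ageid' join key come from the description above, while the column layout of the name file is assumed:

import pandas as pd

# The '.csv' files use '|' as the field delimiter (see point 1 above).
students = pd.read_csv("students-headed-bar.csv", sep="|")
names = pd.read_csv("ageidtoname-headed-bar.csv", sep="|")

# JOIN school-level records to school names on the 'ageid' identifier.
merged = students.merge(names, on="ageid", how="left")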
A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis for a price or free. Many bike-share systems allow people to borrow a bike from a "dock", which is usually computer-controlled: the user enters the payment information, and the system unlocks the bike. The bike can then be returned to another dock belonging to the same system.
A US bike-sharing provider, BoomBikes, has recently suffered a considerable dip in revenue due to the Corona pandemic. The company is finding it very difficult to sustain itself in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue.
In such an attempt, BoomBikes aspires to understand the demand for shared bikes among the people. They have planned this to prepare themselves to cater to the people's needs once the situation gets better all around and stand out from other service providers and make huge profits.
They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for shared bikes in the American market: which variables are significant in predicting demand, and how well those variables explain it.
Based on various meteorological surveys and people's lifestyles, the service provider firm has gathered a large dataset of daily bike demand across the American market.
You are required to model the demand for shared bikes with the available independent variables. It will be used by the management to understand how exactly the demands vary with different features. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. Further, the model will be a good way for management to understand the demand dynamics of a new market.
In the dataset provided, you will notice that there are three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who have made a rental. The variable 'registered', on the other hand, shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered. The model should be built taking 'cnt' as the target variable.
When you're done with model building and residual analysis and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set.
Python
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
- where y_test is the test data set for the target variable, and y_pred is the variable containing the predicted values of the target variable on the test set.
- Please perform this step as the R-squared score on the test set serves as a benchmark for your model.