13 datasets found
  1. Data Mining Project - Boston

    • kaggle.com
    zip
    Updated Nov 25, 2019
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston
    Explore at:
    Available download formats: zip (59313797 bytes)
    Dataset updated
    Nov 25, 2019
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

    To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.

    Use of Data Files

    You can easily subset the data into the car types that you will be modeling by first loading the csv into R. Here is the code for doing this:

    This loads the file into R

    df<-read.csv('uber.csv')

    The next piece of code subsets the data into specific car types. The example below keeps only the Uber 'Black' car type.

    df_black<-subset(df, df$name == 'Black')

    This next portion of code saves the subset for later use. We write this data frame to a csv file on our computer so that it can be loaded back into R whenever it is needed.

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

    The file will appear in your working directory. If you are not sure where your working directory is, run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  2. Data from: Data and code from: Environmental influences on drying rate of...

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    • +2more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data and code from: Environmental influences on drying rate of spray applied disinfestants from horticultural production services [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-environmental-influences-on-drying-rate-of-spray-applied-disinfestants-
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

    Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

    Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. These files are included:

    - Drying2022.csv: drying rate data for the 2022 experimental run
    - Weather2022.csv: weather data for the 2022 experimental run
    - Drying2023.csv: drying rate data for the 2023 experimental run
    - Weather2023.csv: weather data for the 2023 experimental run
    - disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
    - disinfestant_drying_analysis.html: rendered output of the notebook
    - MS_figures.R: additional R code to create figures formatted for journal requirements
    - fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This allows users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
    - fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
    - data_dictionary.xlsx: descriptions of each column in the CSV data files
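
    As a quick orientation, here is a minimal R sketch for working with these files without refitting anything; it assumes the files above sit in the working directory and that the brms package is installed.

    library(brms)

    # Raw data for the 2022 experimental run
    drying2022 <- read.csv("Drying2022.csv")
    weather2022 <- read.csv("Weather2022.csv")

    # Fitted model object shipped with the dataset (originally fit on an HPC cluster)
    fit2022 <- readRDS("fit2022_discretetime_weather_solar.rds")
    summary(fit2022)  # posterior summaries of the 2022 drying-rate model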

  3. Data for the Farewell and Herberg example of a two-phase experiment using a...

    • datasetcatalog.nlm.nih.gov
    • researchdata.edu.au
    • +1more
    Updated Jun 12, 2021
    Cite
    Brien, Chris (2021). Data for the Farewell and Herberg example of a two-phase experiment using a plaid design [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000884242
    Explore at:
    Dataset updated
    Jun 12, 2021
    Authors
    Brien, Chris
    Description

    The experiment that Farewell and Herzberg (2003) describe is a pain-rating experiment that is a subset of the experiment reported by Solomon et al. (1997). It is a two-phase experiment. The first phase is a self-assessment phase in which patients self-assess for pain while moving a painful shoulder joint. The second phase is an evaluation phase in which occupational and physical therapy students (the raters) are evaluated on their ratings of patients' pain in a set of videos. The measured response is the difference between a student's rating and the patient's rating.

    The R data file plaid.dat.rda contains the data.frame plaid.dat, which holds a revised version of the data for the Farewell and Herzberg example downloaded from https://doi.org/10.17863/CAM.54494. The comma-delimited text file plaid.dat.csv has the same information in this more commonly accepted format, but without the metadata associated with the data.frame. The data.frame contains the factors Raters, Viewings, Trainings, Expressiveness, Patients, Occasions, and Motions and a column for the response variable Y. The two factors Viewings and Occasions are additional to those in the downloaded file; the remaining factors have been converted from integers or characters to factors and renamed to the names given above. The column Y is unchanged from the column in the original file.

    To load the data in R use: load("plaid.dat.rda") or plaid.dat <- read.csv(file = "plaid.dat.csv").

    References
    Farewell, V. T., & Herzberg, A. M. (2003). Plaid designs for the evaluation of training for medical practitioners. Journal of Applied Statistics, 30(9), 957-965. https://doi.org/10.1080/0266476032000076092
    Solomon, P. E., Prkachin, K. M., & Farewell, V. (1997). Enhancing sensitivity to facial expression of pain. Pain, 71(3), 279-284. https://doi.org/10.1016/S0304-3959(97)03377-0
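
    For concreteness, a minimal sketch of the two loading routes just mentioned, assuming the files are in the working directory:

    load("plaid.dat.rda")  # gives the data.frame plaid.dat with its factor metadata intact
    # or, without the factor metadata:
    plaid.dat <- read.csv(file = "plaid.dat.csv")
    str(plaid.dat)  # factors Raters, Viewings, Trainings, ..., and the response Y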

  4. PARAMOUNT: parallel modal analysis of large datasets

    • data.4tu.nl
    zip
    Updated Nov 28, 2022
    Cite
    Alireza Ghasemi; Jim Kok (2022). PARAMOUNT: parallel modal analysis of large datasets [Dataset]. http://doi.org/10.4121/20089760.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 28, 2022
    Dataset provided by
    4TU.ResearchData
    Authors
    Alireza Ghasemi; Jim Kok
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    PARAMOUNT: parallel modal analysis of large datasets

    PARAMOUNT is a Python package developed at the University of Twente to perform modal analysis of large numerical and experimental datasets. A brief video introduction to the theory and methodology is presented here.

    Features

    - Distributed processing of data on local machines or clusters using Dask Distributed
    - Reading CSV files in glob format from specified folders
    - Extracting relevant columns from CSV files and writing Parquet database for each specified variable
    - Distributed computation of Proper Orthogonal Decomposition (POD)
    - Writing U, S and V matrices into Parquet database for further analysis
    - Visualizing POD modes and coefficients using pyplot


    Using PARAMOUNT

    Make sure to install the dependencies by running `pip install -r requirements.txt`

    Refer to csv_example to see how to use PARAMOUNT to read CSV files, write the variables of interest into Parquet datasets and inspect the final datasets.

    Refer to svd_example to see how to read Parquet datasets, compute the Singular Value Decomposition, and store the results in Parquet format.

    To visualize the results you can simply read the U, S and V Parquet files with your plotting tool of choice. Examples are provided in viz_example.
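
    PARAMOUNT itself is Python, but because the results are plain Parquet databases they can also be inspected from other tools. As a hedged illustration in R (the file names below are placeholders for wherever the U, S and V databases were written):

    library(arrow)

    # Read the SVD factors written by PARAMOUNT (paths are hypothetical)
    U <- read_parquet("U.parquet")
    S <- read_parquet("S.parquet")

    # Singular values indicate how much energy each POD mode carries
    plot(seq_along(S[[1]]), S[[1]], log = "y", xlab = "mode", ylab = "singular value")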

    Author and Acknowledgements

    This package is developed by Alireza Ghasemi (alireza.ghasemi@utwente.nl) at University of Twente under the MAGISTER (https://www.magister-itn.eu/) project. This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 766264.

  5. Annotated 12 lead ECG dataset

    • zenodo.org
    zip
    Updated Jun 7, 2021
    + more versions
    Cite
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro (2021). Annotated 12 lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3625007
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # Annotated 12 lead ECG dataset
    
    Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students.
    It is used as the test set in the paper:
    "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network".

    It contains annotations for 6 different ECG abnormalities:
    - 1st degree AV block (1dAVb);
    - right bundle branch block (RBBB);
    - left bundle branch block (LBBB);
    - sinus bradycardia (SB);
    - atrial fibrillation (AF); and,
    - sinus tachycardia (ST).
    
    ## Folder content:
    
    - `ecg_tracings.hdf5`: HDF5 file containing a single dataset named `tracings`. This dataset is a
    `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different
    patients; the second dimension corresponds to the 4096 signal samples; the third dimension to the 12
    different leads of the ECG exam.
    
    The signals are sampled at 400 Hz. Some signals originally have a duration of
    10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples).
    In order to make them all have the same size (4096 samples) we pad them with zeros
    on both sides. For instance, for a 7-second ECG signal with 2800 samples we include 648
    samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved
    in the hdf5 dataset. All signals are represented as floating point numbers at the scale 1e-4V: so they should
    be multiplied by 1000 in order to obtain the signals in V.
    
    In Python, one can read this file using the following sequence:
    ```python
    import h5py
    import numpy as np

    # path to the HDF5 file described above
    with h5py.File("ecg_tracings.hdf5", "r") as f:
        x = np.array(f['tracings'])
    ```
    
    - The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It
    contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
    - `annotations/`: folder containing annotations in csv format. Each csv file contains 827 lines (plus the header).
    The i-th line of each csv file corresponds to the i-th tracing in `ecg_tracings.hdf5`.
    The csv files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST`
    corresponding to whether the annotator detected the abnormality in the ECG (`=1`) or not (`=0`).
     1. `cardiologist[1,2].csv` contain annotations from two different cardiologists.
     2. `gold_standard.csv` gold standard annotation for this test dataset. When cardiologist 1 and cardiologist 2
     agree, the common diagnosis was considered as the gold standard. In cases where there was any disagreement, a
     third senior specialist, aware of the annotations from the other two, decided the diagnosis.
     3. `dnn.csv` predictions from the deep neural network described in
     "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network". The threshold is set in such a way
     that it maximizes the F1 score.
     4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
     5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
     6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).
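
    A minimal R sketch, assuming the files above sit in the working directory, for placing the patient attributes next to the gold-standard labels (file and column names are taken from the description):

    ```r
    attributes <- read.csv("attributes.csv")                    # 827 rows: sex, age
    gold_standard <- read.csv("annotations/gold_standard.csv")  # 827 rows: 1dAVb, RBBB, LBBB, SB, AF, ST
    ecg_labels <- cbind(attributes, gold_standard)              # row i matches tracing i in ecg_tracings.hdf5
    head(ecg_labels)
    ```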
    
  6. FacialRecognition

    • kaggle.com
    zip
    Updated Dec 1, 2016
    Cite
    TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition
    Explore at:
    Available download formats: zip (121674455 bytes)
    Dataset updated
    Dec 1, 2016
    Authors
    TheNicelander
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    # https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
    #################################

    ### Variables for downloaded files
    data.dir <- ' '
    train.file <- paste0(data.dir, 'training.csv')
    test.file <- paste0(data.dir, 'test.csv')
    #################################

    ### Load csv -- creates a data.frame where each column can have a different type.
    d.train <- read.csv(train.file, stringsAsFactors = F)
    d.test <- read.csv(test.file, stringsAsFactors = F)

    ### In training.csv we have 7049 rows, each one with 31 columns.
    ### The first 30 columns are keypoint locations, which R correctly identified as numbers.
    ### The last one is a string representation of the image.

    ### To look at samples of the data, uncomment this line:
    # head(d.train)

    ### Save the Image column as another variable and remove it from d.train and d.test.
    ### Assigning NULL to a column removes it from the dataframe.
    im.train <- d.train$Image
    d.train$Image <- NULL  # removes 'Image' from the dataframe
    im.test <- d.test$Image
    d.test$Image <- NULL   # removes 'Image' from the dataframe

    #################################
    ### The image is represented as a series of numbers, stored as a string.
    ### Convert these strings to integers by splitting them and converting the result:
    ### strsplit splits the string, unlist simplifies its output to a vector of strings,
    ### as.integer converts it to a vector of integers.
    as.integer(unlist(strsplit(im.train[1], " ")))
    as.integer(unlist(strsplit(im.test[1], " ")))

    ### Install and activate the appropriate libraries.
    ### The tutorial is meant for Linux and OS X, which use a different parallel backend, so
    ### replace all instances of %dopar% with %do%.
    install.packages('foreach')
    library("foreach", lib.loc = "~/R/win-library/3.3")

    ### Convert every image
    im.train <- foreach(im = im.train, .combine = rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    im.test <- foreach(im = im.test, .combine = rbind) %do% {
      as.integer(unlist(strsplit(im, " ")))
    }
    # The foreach loop evaluates the inner command for each image in im.train and combines
    # the results with rbind (combine by rows). %dopar% would run the evaluations in
    # parallel; with %do% they run sequentially.
    # im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel).

    ### Save all four variables in a data.Rd file; they can be reloaded at any time with load('data.Rd').
    save(d.train, im.train, d.test, im.test, file = 'data.Rd')
    load('data.Rd')

    ### Each image is a vector of 96*96 pixels (96*96 = 9216).
    ### Convert these 9216 integers into a 96x96 matrix:
    im <- matrix(data = rev(im.train[1, ]), nrow = 96, ncol = 96)
    # im.train[1, ] returns the first row of im.train, which corresponds to the first training image.
    # rev reverses the resulting vector to match the interpretation of R's image function
    # (which expects the origin to be in the lower left corner).

    ### To visualize the image we use R's image function:
    image(1:96, 1:96, im, col = gray((0:255)/255))

    ### Let's color the coordinates of the eyes and nose:
    points(96 - d.train$nose_tip_x[1], 96 - d.train$nose_tip_y[1], col = "red")
    points(96 - d.train$left_eye_center_x[1], 96 - d.train$left_eye_center_y[1], col = "blue")
    points(96 - d.train$right_eye_center_x[1], 96 - d.train$right_eye_center_y[1], col = "green")

    ### Another good check is to see how variable the data is.
    ### For example, where are the centers of the noses in the 7049 images? (this takes a while to run):
    for (i in 1:nrow(d.train)) {
      points(96 - d.train$nose_tip_x[i], 96 - d.train$nose_tip_y[i], col = "red")
    }

    ### There are quite a few outliers -- they could be labeling errors. Looking at one extreme example:
    ### in this case there is no labeling error, but it shows that not all faces are centered.
    idx <- which.max(d.train$nose_tip_x)
    im <- matrix(data = rev(im.train[idx, ]), nrow = 96, ncol = 96)
    image(1:96, 1:96, im, col = gray((0:255)/255))
    points(96 - d.train$nose_tip_x[idx], 96 - d.train$nose_tip_y[idx], col = "red")

    ### One of the simplest things to try is to compute the mean of the coordinates of each keypoint
    ### in the training set and use that as a prediction for all images:
    colMeans(d.train, na.rm = T)

    ### To build a submission file we need to apply these computed coordinates to the test instances:
    p <- matrix(data = colMeans(d.train, na.rm = T), nrow = nrow(d.test), ncol = ncol(d.train), byrow = T)
    colnames(p) <- names(d.train)
    predictions <- data.frame(ImageId = 1:nrow(d.test), p)
    head(predictions)

    ### The expected submission format has one keypoint per row, which the reshape2 library makes easy:
    install.packages('reshape2')

    library(...

  7. Kickastarter Campaigns

    • kaggle.com
    zip
    Updated Jan 25, 2024
    Cite
    Alessio Cantara (2024). Kickastarter Campaigns [Dataset]. https://www.kaggle.com/datasets/alessiocantara/kickastarter-project/discussion
    Explore at:
    Available download formats: zip (2233314 bytes)
    Dataset updated
    Jan 25, 2024
    Authors
    Alessio Cantara
    Description

    Welcome to my Kickstarter case study! In this project I’m trying to understand what the success factors for a Kickstarter campaign are, analyzing a publicly available dataset from Web Robots. The process of analysis will follow the data analysis roadmap: ASK, PREPARE, PROCESS, ANALYZE, SHARE and ACT.

    ASK

    Three questions will guide my analysis:
    1. Does the campaign duration influence the success of the project?
    2. Does the chosen funding goal?
    3. Which category of campaign is the most likely to be successful?

    PREPARE

    I’m using the Kickstarter datasets publicly available on Web Robots. Data are scraped by a bot that collects them in CSV format once a month. Each table contains:
    - backers_count: number of people that contributed to the campaign
    - blurb: a captivating text description of the project
    - category: the label categorizing the campaign (technology, art, etc.)
    - country
    - created_at: day and time of campaign creation
    - deadline: day and time of the campaign's latest possible end
    - goal: amount to be collected
    - launched_at: date and time of campaign launch
    - name: name of campaign
    - pledged: amount of money collected
    - state: success or failure of the campaign

    Each monthly scrape produces a huge number of CSVs, so for an initial analysis I decided to focus on three months: November 2023, December 2023, and January 2024. I downloaded zipped files which, once unzipped, contained respectively 7 CSVs (November 2023), 8 CSVs (December 2023), and 8 CSVs (January 2024). Each month went into its own folder.

    Having a first look at the spreadsheets, it’s clear that there is some need for cleaning and modification: for example, dates and times are stored as Unix timestamps, there are multiple columns that are not helpful for the scope of my analysis, and currencies need to be unified (some are US$, some GB£, etc.). In general, I have all the data that I need to answer my initial questions, identify trends, and make predictions.

    PROCESS

    I decided to use R to clean and process the data. For each month I set up a new working environment in its own folder and loaded the necessary libraries:

    library(tidyverse)
    library(lubridate)
    library(ggplot2)
    library(dplyr)
    library(tidyr)

    I then scripted a general R chunk that searches for CSV files in the folder, opens each one as a separate variable, and collects them into a single list of data frames:

    csv_files <- list.files(pattern = "\\.csv$")
    data_frames <- list()
    
    for (file in csv_files) {
     variable_name <- sub("\\.csv$", "", file)
     assign(variable_name, read.csv(file))
     data_frames[[variable_name]] <- get(variable_name)
    }
    

    Next, I converted some columns to numeric values because I was running into type errors when trying to merge all the CSVs into a single comprehensive file.

    data_frames <- lapply(data_frames, function(df) {
     df$converted_pledged_amount <- as.numeric(df$converted_pledged_amount)
     return(df)
    })
    data_frames <- lapply(data_frames, function(df) {
     df$usd_exchange_rate <- as.numeric(df$usd_exchange_rate)
     return(df)
    })
    data_frames <- lapply(data_frames, function(df) {
     df$usd_pledged <- as.numeric(df$usd_pledged)
     return(df)
    })
    

    In each folder I then ran a command to merge the CSVs in a single file (one for November 2023, one for December 2023 and one for January 2024):

    all_nov_2023 = bind_rows(data_frames)
    all_dec_2023 = bind_rows(data_frames)
     all_jan_2024 = bind_rows(data_frames)
    

    After merging, I converted the Unix timestamps into readable datetimes for the “created”, “launched” and “deadline” columns and dropped all the rows that had these fields set to 0. I also extracted the category slug from the “category” column so that it shows only the category of the campaign, without information that is unnecessary for the scope of my analysis. The final table was then saved.

    filtered_dec_2023 <- all_dec_2023 %>% #this was modified according to the considered month
     select(blurb, backers_count, category, country, created_at, launched_at, deadline,currency, usd_exchange_rate, goal, pledged, state) %>%
     filter(created_at != 0 & deadline != 0 & launched_at != 0) %>% 
     mutate(category_slug = sub('.*?"slug":"(.*?)".*', '\\1', category)) %>% 
     mutate(created = as.POSIXct(created_at, origin = "1970-01-01")) %>% 
     mutate(launched = as.POSIXct(launched_at, origin = "1970-01-01")) %>% 
     mutate(setted_deadline = as.POSIXct(deadline, origin = "1970-01-01")) %>% 
     select(-category, -deadline, -launched_at, -created_at) %>% 
     relocate(created, launched, setted_deadline, .before = goal)
    
    write.csv(filtered_dec_2023, "filtered_dec_2023.csv", row.names = FALSE)
    
    

    The three generated files were then merged into one comprehensive CSV called "kickstarter_cleaned" which was further modified, converting a...

  8. Cyclistic_Divvy_data

    • kaggle.com
    zip
    Updated Jun 11, 2023
    Cite
    Rami Ghaith (2023). Cyclistic_Divvy_data [Dataset]. https://www.kaggle.com/datasets/ramighaith/cyclistic-divvy-data
    Explore at:
    Available download formats: zip (21440758 bytes)
    Dataset updated
    Jun 11, 2023
    Authors
    Rami Ghaith
    Description

    The following data shows riding information for members vs. casual riders at the company Cyclistic (a made-up name). This dataset is used as a case study for the Google Data Analytics certificate.

    The changes made to the data in Excel:
    - Removed all duplicates (none were found).
    - Added a ride_length column by subtracting started_at from ended_at with the formula "=C2-B2", then formatted that column as a Time type, 37:30:55.
    - Added a day_of_week column using the formula "=WEEKDAY(B2,1)" to display the day the ride took place on, 1 = Sunday through 7 = Saturday.
    - Data displayed as ######## was left unchanged; it simply represents negative values and should be treated as 0.

    Processing the data in RStudio:
    - Installed the required packages: tidyverse for data import and wrangling, lubridate for date functions, and ggplot2 for visualization.
    - Step 1: Read the csv files into R to collect the data.
    - Step 2: Made sure the files all contained the same column names because I wanted to merge them into one.
    - Step 3: Renamed columns so they align, then merged the files into one combined data frame (see the sketch after this list).
    - Step 4: More data cleaning and analysis.
    - Step 5: Once my data was cleaned and clearly telling a story, I began to visualize it. The visualizations can be seen below.
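
    A minimal sketch of steps 1 through 3 (plus recomputing the Excel columns in R), with hypothetical file and column names since the description does not list them:

    library(tidyverse)
    library(lubridate)

    # Step 1: read the monthly csv exports (file names are hypothetical)
    q1 <- read_csv("divvy_tripdata_2023_01.csv")
    q2 <- read_csv("divvy_tripdata_2023_02.csv")

    # Step 2: check that the column names line up before merging
    identical(names(q1), names(q2))

    # Step 3: rename any columns that differ (the mismatches here are hypothetical), then merge
    q1 <- rename(q1, started_at = start_time, ended_at = end_time)
    all_trips <- bind_rows(q1, q2)

    # Step 4: recompute ride length and day of week in R rather than Excel
    all_trips <- all_trips %>%
      mutate(ride_length = difftime(ended_at, started_at, units = "mins"),
             day_of_week = wday(started_at))  # 1 = Sunday ... 7 = Saturday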

  9. Data and scripts for "The importance of within-log sampling replication in...

    • zenodo.org
    bin, csv
    Updated May 6, 2025
    Cite
    Domenica Naranjo Orrico; Jenna Purhonen; Brendan Furneaux; Katri Ketola; Otso Ovaskainen; Nerea Abrego (2025). Data and scripts for "The importance of within-log sampling replication in bark- and wood-inhabiting fungal metabarcoding studies" [Dataset]. http://doi.org/10.5281/zenodo.15323471
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    May 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Domenica Naranjo Orrico; Jenna Purhonen; Brendan Furneaux; Katri Ketola; Otso Ovaskainen; Nerea Abrego
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and scripts for reproducing the analyses of Naranjo-Orrico et al, "The importance of within-log sampling replication in bark- and wood-inhabiting fungal metabarcoding studies".

    The input data consist of the following four files: "Alldata.Rdata", "data_SbVenn_meta&morpho.Rdata", "Xmorpho.csv" and "Ymorpho_1.csv". The first two files are in R format and the latter two in CSV format. The R files need to be loaded using the function load, and the CSV files with the function read.csv2 in R (see the sketch below).
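
    A minimal sketch of that loading step, assuming the four files sit in the working directory:

    # R-format inputs: load() attaches the stored matrices (meta22, otu.table.plausible.2022, ...)
    load("Alldata.Rdata")
    load("data_SbVenn_meta&morpho.Rdata")

    # CSV inputs: the description says to read these with read.csv2()
    Xmorpho <- read.csv2("Xmorpho.csv")    # metadata of the morphologically identified lichens
    Ymorpho <- read.csv2("Ymorpho_1.csv")  # presence-absence data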

    "Alldata.Rdata" includes in total 15 input data matrices:

    - Metadata for dataset A (meta22)

    - Metadata for dataset B (meta23)

    - Two sample x OTU tables for dataset A including the number of reads for each OTU (otu.table.plausible.2022 for the plausible OTU taxonomic identifications and otu.table.reliable.2022 for the reliable taxonomic identifications).

    - Two sample x OTU tables for dataset B including the number of reads for each OTU (otu.table.plausible.2023 for the plausible OTU taxonomic identifications and otu.table.reliable.2023 for the reliable OTU taxonomic identifications).

    - Two sample x OTU tables for dataset A including the relative read counts per OTU (otu.table.plausible.w.2022 and otu.table.reliable.w.2022).

    - Two sample x OTU tables for dataset B including the relative read counts per OTU (otu.table.plausible.w.2023 and otu.table.reliable.w.2023).

    - Read counts per sample during the different phases of the bioinformatics pipeline for dataset A (read.counts.plausible.2022) and for dataset B (read.counts.plausible.2023).

    - Taxonomic information at all taxonomic levels (i.e., from species to phylum) of the identified OTUs (taxonomy.plausible)

    - Guild assignment matrices for dataset A (Guilds_plausible_tax_2022) and for dataset B (Guilds_plausible_tax_2023).

    "data_SbVenn_meta&morpho.Rdata" contains four matrices:

    - Occurrence of the lichenized OTUs identified through metabarcoding including identifications at any taxonomic level (i.e., genus or family levels when species level identifications were not achieved) (SbVenn_Lmeta).

    - Occurrence of the lichenized OTUs identified through metabarcoding including identifications at the species-only level (SB_Venn_clean_meta).

    - Occurrence of the morphologically identified lichenized fungi, including identifications at the genus level and morphospecies (SbVenn_Lmorpho)

    - Occurrences of the morphologically identified lichenized fungi, including identifications at the species-only level (SbVenn_clean_morpho).

    "Xmorpho.csv" and "Ymorpho_1.csv" contain respectively the metadata and the presence-absence data of the morphologically identified lichens.

    “Alldata.Rdata” is used in all the scripts, "data_SbVenn_meta&morpho.Rdata" is only needed for the script "S8_Venn Diagrams.R", and the files "Xmorpho.csv" and "Ymorpho_1.csv" are used in "S11_Meta vs Morpho species richnes between tree sp and tree part.R".

    The statistical analyses consist of joint species distribution modelling with the package Hmsc, generalized linear mixed models (GLMM) with the package glmer, and non-metric multidimensional scaling analysis (NMDS) with the package vegan. To perform the HMSC analyses, the first four scripts need to be run consecutively, from S1 (A and B) to S3. S1A defines the first model using dataset A, and S1B defines the second model using dataset B. S2 fits the models used in the study (which include presence-absence models with different sets of explanatory variables). S3 shows the parameter estimates from the fitted models, in particular the beta parameters and the variance partitioning across environmental covariates. For fitting and showing the outputs of the GLMM models, only S4 is needed. S5 runs the NMDS analyses. The rest of the scripts, S6-S11, are used to produce the different plots shown in the study of Naranjo-Orrico et al., including pie plots, boxplots, barplots, and Venn plots.

  10. Dollar-Rial-Toman Live Price Dataset

    • kaggle.com
    zip
    Updated Nov 7, 2025
    + more versions
    Cite
    Koorosh Komeilizadeh (2025). Dollar-Rial-Toman Live Price Dataset [Dataset]. https://www.kaggle.com/datasets/kooroshkz/dollar-rial-toman-live-price-dataset
    Explore at:
    Available download formats: zip (66708 bytes)
    Dataset updated
    Nov 7, 2025
    Authors
    Koorosh Komeilizadeh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dollar-Rial-Toman Live Price Dataset

    A comprehensive, daily-updated dataset of US Dollar to Iranian Rial exchange rates (USD/IRR) with historical data from November 2011 to present. This dataset is ideal for financial analysis, economic research, forecasting, and machine learning projects.

    Dataset Overview

    • Time Period: November 26, 2011 - Present (continuously updated)
    • Total Records: 3,648+ daily price points
    • Data Source: TGJU.org (Tehran Gold & Jewelry Union)
    • Update Frequency: Daily (automated via GitHub Actions)
    • Format: CSV with proper date formatting and integer price structure

    Data Structure

    The CSV file contains the following columns:

    Column          | Description                  | Format     | Example
    Open Price      | Opening price of the day     | Integer    | 1012100
    Low Price       | Lowest price of the day      | Integer    | 1011700
    High Price      | Highest price of the day     | Integer    | 1034100
    Close Price     | Closing price of the day     | Integer    | 1029800
    Change Amount   | Price change amount          | String     | 15400
    Change Percent  | Price change percentage      | String     | 1.52%
    Gregorian Date  | Gregorian date               | YYYY/MM/DD | 2025/09/06
    Persian Date    | Persian/Shamsi date          | YYYY/MM/DD | 1404/06/15

    Download the Data

    View Scraper and workflow source on GitHub

    This live dataset, the scraper source code, and the workflow are available on GitHub, where you can explore, download, and use them directly.

    Documentation & Charts

    "https://kooroshkz.github.io/Dollar-Rial-Toman-Live-Price-Dataset/" target="_blank"> imagehttps://raw.githubusercontent.com/kooroshkz/Dollar-Rial-Toman-Live-Price-Dataset/main/assets/img/IntractiveChart.png">

    Interactive charts and dataset overview are available at:
    kooroshkz.github.io/Dollar-Rial-Toman-Live-Price-Dataset

    Loading in Python

    import pandas as pd
    
    # Load dataset
    df = pd.read_csv('data/Dollar_Rial_Price_Dataset.csv')
    
    # Convert date column to datetime
    df['Gregorian Date'] = pd.to_datetime(df['Gregorian Date'], format='%Y/%m/%d')
    
    # Price columns are already integers
    price_columns = ['Open Price', 'Low Price', 'High Price', 'Close Price']
    print(df[price_columns].dtypes) # All should be int64
    

    Direct Load in Python

    # pip install kagglehub[hf-datasets]
    import kagglehub
    
    df = kagglehub.load_dataset(
      "kooroshkz/dollar-rial-toman-live-price-dataset",
      adapter="huggingface",
      file_path="Dollar_Rial_Price_Dataset.csv",
      pandas_kwargs={"parse_dates": ["Gregorian Date"]}
    )
    
    print(df.head())
    

    Loading in R

    # Load dataset
    data <- read.csv("data/Dollar_Rial_Price_Dataset.csv", stringsAsFactors = FALSE)
    
    # Convert date column
    data$Gregorian.Date <- as.Date(data$Gregorian.Date, format = "%Y/%m/%d")
    
    # View structure
    str(data)
    

    Data Quality & Updates

    • Validation: All price data undergoes validation checks for accuracy
    • Automated Updates: Dataset is automatically updated daily at 8:00 AM UTC
    • Data Integrity: Built-in duplicate prevention and format validation
    • Historical Consistency: Maintains consistent formatting across all time periods
    • Integer Prices: All price values stored as integers for precise calculations

    Technical Implementation

    This dataset is maintained using an automated web scraping system that:

    • Monitors TGJU.org for new exchange rate data
    • Validates and processes new records
    • Maintains data consistency and prevents duplicates
    • Automatically commits updates to the repository

    Contributing

    If you find data inconsistencies or have suggestions for improvements, please open an issue in the GitHub repository.

    License

    This project is licensed under the MIT License - see the LICENSE file for details.

    Citation

    If you use this dataset in your research or projects, please cite:

    Dollar-Rial-Toman Live Price Dataset
    Author: Koorosh Komeili Zadeh
    Source: https://github.com/kooroshkz/Dollar-Rial-Toman-Live-Price-Dataset
    Data Source: TGJU.org (Tehran Gold & Jewelry Union)
    Date Range: November 2011 - Present
    

    Keywords

    USD to Rial dataset, Dollar to Toman dataset, Iran exchange rate CSV, USD/IRR daily price, foreign exchange Iran dataset, TGJU data, time series currency dataset

    Disclaimer

    This ...

  11. ESG rating of general stock indices

    • data.mendeley.com
    • narcis.nl
    Updated Oct 22, 2021
    Cite
    Szilárd Erhart (2021). ESG rating of general stock indices [Dataset]. http://doi.org/10.17632/58mwkj5pf8.1
    Explore at:
    Dataset updated
    Oct 22, 2021
    Authors
    Szilárd Erhart
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    THE FILES HAVE BEEN CREATED BY SZILÁRD ERHART FOR THE RESEARCH PAPER: ERHART (2021): ESG RATINGS OF GENERAL STOCK EXCHANGE INDICES, INTERNATIONAL REVIEW OF FINANCIAL ANALYSIS.

    USERS OF THE FILES AGREE TO QUOTE THE ABOVE PAPER

    THE PYTHON SCRIPT (PYTHONESG_ERHART.TXT) HELPS USERS TO GET TICKERS BY STOCK EXCHANGES AND EXTRACT ESG SCORES FOR THE UNDERLYING STOCKS FROM YAHOO FINANCE.

    THE R SCRIPT (ESG_UA.TXT) HELPS TO REPLICATE THE MONTE CARLO EXPERIMENT DETAILED IN THE STUDY.

    THE EXPORT_ALL CSV CONTAINS THE DOWNLOADED ESG DATA (SCORES, CONTROVERSIES, ETC) ORGANIZED BY STOCKS AND EXCHANGES.

    DISCLAIMER

    The author takes no responsibility for the timeliness, accuracy, completeness or quality of the information provided. The author is in no event liable for damages of any kind incurred or suffered as a result of the use or non-use of the information presented or the use of defective or incomplete information. The contents are subject to confirmation and not binding. The author expressly reserves the right to alter or amend the content, in whole or in part, without prior notice, or to discontinue publication for a period of time or even completely.

    ##############################READ ME

    BEFORE USING THE MONTE CARLO SIMULATIONS SCRIPT:

    (1) COPY THE goascores.csv and goalscores_alt.csv FILES ONTO YOUR OWN COMPUTER DRIVE. THE TWO FILES ARE IDENTICAL.

    (2) SET THE EXACT FILE LOCATION INFORMATION IN THE 'Read in data' SECTION OF THE MONTE CARLO SCRIPT AND FOR THE OUTPUT FILES AT THE END OF THE SCRIPT

    (3) LOAD MISC TOOLS AND MATRIXSTATS IN YOUR R APPLICATION

    (4) RUN THE CODE.

    ##############################READ ME
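
    A minimal sketch of steps (1)-(3), assuming the CRAN packages miscTools and matrixStats are the ones the read-me refers to; the file path is a placeholder:

    library(miscTools)    # step (3): "MISC TOOLS"
    library(matrixStats)  # step (3): "MATRIXSTATS"

    # steps (1)-(2): point the 'Read in data' section of the Monte Carlo script to your local copy
    goalscores <- read.csv("C:/your/drive/goascores.csv")  # goalscores_alt.csv is identical
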
  12. Data from: A dataset to model Levantine landcover and land-use change...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Dec 16, 2023
    Cite
    Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Kempf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 16, 2023
    Area covered
    Levant
    Description

    Overview

    This dataset is the repository for the following paper submitted to Data in Brief:

    Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

    The Data in Brief article contains the supplement information and is the related data paper to:

    Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

    Description/abstract

    The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which has strained neighbouring countries like Jordan due to the influx of Syrian refugees and increased population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

    Folder structure

    The main folder after download contains all data; the following subfolders are stored as zipped files:

    “code” stores the 9 code chunks described below to read, extract, process, analyse, and visualize the data.

    “MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

    “mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

    “yield_productivity” contains .csv files of yield information for all countries listed above.

    “population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

    “GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

    “built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.

    Code structure

    1_MODIS_NDVI_hdf_file_extraction.R


    This is the first code chunk that refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data, after registration, from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif-file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.
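
    A hedged sketch of what such an extraction loop could look like with terra (reading HDF4 subdatasets depends on the local GDAL build, and selecting the NDVI layer by name is an assumption):

    library(terra)

    hdf_files <- list.files("your_directory_MODIS", pattern = "\\.hdf$", full.names = TRUE)
    for (f in hdf_files) {
      r <- rast(f)                          # all subdatasets of the MOD13Q1 granule
      ndvi <- r[[grep("NDVI", names(r))]]   # keep only the NDVI layer
      writeRaster(ndvi, sub("\\.hdf$", "_NDVI.tif", f), overwrite = TRUE)
    }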


    2_MERGE_MODIS_tiles.R


    In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").


    3_CROP_MODIS_merged_tiles.R


    Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS.
    The repository provides the already clipped and merged NDVI datasets.
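
    A minimal sketch of that cropping step with terra, assuming the merged files and the mask shapefile are in the working directory:

    library(terra)

    mask_vec <- vect("MERGED_LEVANT.shp")    # study-area mask provided in the repository
    merged <- list.files("merged", pattern = "^NDVI_final_.*\\.tif$", full.names = TRUE)
    for (i in seq_along(merged)) {
      clipped <- mask(crop(rast(merged[i]), mask_vec), mask_vec)   # crop, then mask to the outline
      writeRaster(clipped, paste0("NDVI_merged_clip_", i, ".tif"), overwrite = TRUE)
    }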


    4_TREND_analysis_NDVI.R


    Now we want to perform trend analysis on the derived data. The data we load are a little tricky, as they come as a 16-day return period across each year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and flag all values with a high confidence level (0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
    To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.


    5_BUILT_UP_change_raster.R


    Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03. March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. I summed up the different rasters to characterize the built-up change in continuous values between 1975 and 2022.


    6_POPULATION_numbers_plot.R


    For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.


    7_YIELD_plot.R


    In this section, we are using the country productivity data from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.


    8_GLDAS_read_extract_trend


    The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data come in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names() on the spatraster collection.
    Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
    From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
    From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.

  13. forest cover data

    • kaggle.com
    zip
    Updated Apr 22, 2017
    Cite
    ShalvaRai16MCB0025 (2017). forest cover data [Dataset]. https://www.kaggle.com/shalv16mcb0025/forest-cover-data
    Explore at:
    Available download formats: zip (3155 bytes)
    Dataset updated
    Apr 22, 2017
    Authors
    ShalvaRai16MCB0025
    Description

    # Load required packages (install rgdal if it is not available)
    require(rgdal)
    require(sp)
    x <- "rgdal"
    if (!require(x, character.only = TRUE)) {
      install.packages(pkgs = x, dependencies = TRUE)
      require(x, character.only = TRUE)
    }

    # Location of the India map shapefile
    location <- "C:/Users/acer/Documents/R/data/India Map"
    india1 <- readOGR(dsn = location, "IND_adm1")

    # Plot India and inspect the object
    plot(india1)
    slotNames(india1)
    names(india1)
    head(india1@data)

    # State names available in the dataset
    head(india1$NAME_1, 10)

    # Sample plot of one state
    plot(india1[india1$NAME_1 == "Delhi", ], col = "red")
    title("Delhi")

    # Read the file that contains the forest information
    forestdata <- read.csv(file = "C:/Users/acer/Documents/R/data/Recorded_Forest_Area.csv",
                           stringsAsFactors = FALSE)
    head(forestdata)
    names(forestdata)

    # Column names are too long, let's change them
    colnames(forestdata) <- c("state", "statearea", "forestarea2005", "reserved", "protected",
                              "unclassed", "totalforestarea", "forestareapercent")
    names(forestdata)
    head(forestdata)

    # Now change factor to character
    india1$NAME_1 <- as.character(india1$NAME_1)
    forestdata$state <- as.character(forestdata$state)

    # Check whether the state names in the map and the csv file match
    india1$NAME_1 %in% forestdata$state

    # Return the entries that have a mismatch
    india1$NAME_1[which(!india1$NAME_1 %in% forestdata$state)]

    # The issue is with "and" in place of "&", and the state Uttaranchal,
    # which was later renamed Uttarakhand. Let us make the relevant changes:
    india1$NAME_1[grepl("Andaman and Nicobar", india1$NAME_1)] <- "Andaman & Nicobar"
    india1$NAME_1[grepl("Dadra and Nagar Haveli", india1$NAME_1)] <- "Dadra & Nagar Haveli"
    india1$NAME_1[grepl("Jammu and Kashmir", india1$NAME_1)] <- "Jammu & Kashmir"
    india1$NAME_1[grepl("Daman and Diu", india1$NAME_1)] <- "Daman & Diu"
    india1$NAME_1[grepl("Uttaranchal", india1$NAME_1)] <- "Uttarakhand"

    # Now check the matching again
    india1$NAME_1 %in% forestdata$state

