24 datasets found
  1. Housing Price Prediction using DT and RF in R

    • kaggle.com
    zip
    Updated Aug 31, 2023
    Cite
    vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r
    Explore at:
    Available download formats: zip (629100 bytes)
    Dataset updated
    Aug 31, 2023
    Authors
    vikram amin
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Objective: To predict the prices of houses in the City of Melbourne
    • Approach: Decision Tree and Random Forest in R (a minimal R sketch follows this list)
    • Data Cleaning:
    • The Date column is read in as a character vector and is converted to a date vector using the 'lubridate' library
    • We create a new column called 'age', since the age of a house can be a factor in its price: we extract the year from the 'Date' column and subtract the 'Year Built' column from it
    • We remove 11,566 records that have missing values
    • We drop columns that are not significant, such as 'X', 'suburb', 'address' (we keep zipcode, as it serves in place of suburb and address), 'type', 'method', 'SellerG', 'date', 'Car', 'year built', 'Council Area', and 'Region Name'
    • We split the data into 'train' and 'test' sets in an 80/20 ratio using the sample function
    • Load the libraries 'rpart', 'rpart.plot', 'rattle', and 'RColorBrewer'
    • Run the decision tree using the rpart function, with 'Price' as the dependent variable
    • The average price across 5,464 houses is $1,084,349
    • Where building area is less than 200.5, the average price for 4,582 houses is $931,445; where building area is less than 200.5 and building age is less than 67.5 years, the average price for 3,385 houses is $799,299.6
    • The highest average price, $4,801,538 across 13 houses, occurs where distance is less than 5.35 and building area is greater than 280.5
    • We use the caret package for parameter tuning; the optimal complexity parameter is 0.01, with an RMSE of 445,197.9
    • We use the 'Metrics' library to compute RMSE ($392,107), MAPE (0.297, i.e. roughly 70.3% accuracy), and MAE ($272,015.4)
    • The variables 'postcode', longitude, and building area are the most important
    • test$Price holds the actual price and test$predicted the predicted price for six sample houses
    • We fit a random forest with default parameters on the training data
    • Variable importance indicates that 'Building Area', 'Age of the house', and 'Distance' are the variables that most affect the price of a house
    • Based on the default parameters, RMSE is $250,426.2, MAPE is 0.147 (roughly 85.3% accuracy), and MAE is $151,657.7
    • The error flattens out between 100 and 200 trees, with almost no reduction thereafter, so we can choose ntree = 200
    • We tune the model and find that mtry = 3 has the lowest out-of-bag error
    • We use the caret package with 5-fold cross-validation
    • RMSE is $252,216.10, MAPE is 0.146 (roughly 85.4% accuracy), and MAE is $151,669.4
    • We can conclude that Random Forest gives more accurate results than the Decision Tree
    • In the Random Forest, the default parameters (ntree = 500) give lower RMSE and MAPE than ntree = 200, so we can proceed with those parameters
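
    A minimal R sketch of the pipeline described above follows. The file and column names (melbourne_housing.csv, Date, YearBuilt, Price) are assumptions and may differ from the actual Kaggle file; treat this as an illustration of the listed steps, not the author's exact code.

    ```r
    # Assumed file/column names; adjust to the actual dataset.
    library(lubridate)
    library(rpart)
    library(randomForest)
    library(Metrics)

    housing <- read.csv("melbourne_housing.csv")        # hypothetical file name
    housing$Date <- dmy(housing$Date)                   # character -> date vector
    housing$age  <- year(housing$Date) - housing$YearBuilt  # age of the house
    housing <- na.omit(housing)                         # drop records with missing values

    set.seed(42)
    idx   <- sample(nrow(housing), floor(0.8 * nrow(housing)))  # 80/20 split
    train <- housing[idx, ]
    test  <- housing[-idx, ]

    # Decision tree with 'Price' as the dependent variable
    dt <- rpart(Price ~ ., data = train, method = "anova")
    test$predicted <- predict(dt, test)
    rmse(test$Price, test$predicted)
    mape(test$Price, test$predicted)

    # Random forest with default parameters (ntree = 500), then variable importance
    rf <- randomForest(Price ~ ., data = train, importance = TRUE)
    varImpPlot(rf)
    ```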
  2. SCOAPE Pandora Column Observations - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). SCOAPE Pandora Column Observations - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/scoape-pandora-column-observations-8c90a
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    SCOAPE_Pandora_Data is the column NO2 and ozone data collected by Pandora spectrometers during the Satellite Coastal and Oceanic Atmospheric Pollution Experiment (SCOAPE). Pandora instruments were located on the University of Southern Mississippi's Research Vessel (R/V) Point Sur and at the Louisiana Universities Marine Consortium (LUMCON; Cocodrie, LA). Data collection for this product is complete.

    The Outer Continental Shelf Lands Act (OCSLA) requires the US Department of Interior Bureau of Ocean Energy Management (BOEM) to ensure compliance with the US National Ambient Air Quality Standard (NAAQS) so that Outer Continental Shelf (OCS) oil and natural gas (ONG) exploration, development, and production do not significantly impact the air quality of any US state. In 2017, BOEM and NASA entered into an interagency agreement to begin a study to scope out the feasibility of BOEM personnel using a suite of NASA and non-NASA resources to assess how pollutants from ONG exploration, development, and production activities affect air quality. An important activity of this interagency agreement was SCOAPE, a field deployment in May 2019 that aimed to assess the capability of satellite observations for monitoring offshore air quality. The outcomes of the study are documented in two BOEM reports (Duncan, 2020; Thompson, 2020).

    To address BOEM's goals, the SCOAPE science team conducted surface-based remote sensing and in-situ measurements, which enabled a systematic assessment of the application of satellite observations, primarily NO2, for monitoring air quality. The SCOAPE field measurements consisted of onshore ground sites, including in the vicinity of LUMCON, as well as those from the University of Southern Mississippi's R/V Point Sur, which cruised in the Gulf of America from 10-18 May 2019. Based on the 2014 and 2017 BOEM emissions inventories as well as daily air quality and meteorological forecasts, the cruise track was designed to sample both areas with large oil drilling platforms and areas with dense small natural gas facilities. The R/V Point Sur was instrumented to carry out both remote sensing and in-situ measurements of NO2 and O3, along with in-situ CH4, CO2, CO, and VOC tracers, which allowed detailed characterization of airmass type and emissions. In addition, there were also measurements of multi-wavelength AOD and black carbon, as well as planetary boundary layer structure and meteorological variables, including surface temperature, humidity, and winds. A ship-based spectrometer instrument provided remotely-sensed total column amounts of NO2 and O3 for direct comparison with satellite measurements. Ozonesondes and radiosondes were also launched 1-3 times daily from the R/V Point Sur to provide O3 and meteorological vertical profile observations. The ground-based observations, primarily at LUMCON, included spectrometer-measured column NO2 and O3, in-situ NO2, VOCs, and planetary boundary layer structure. A NO2sonde was also mounted on a vehicle with the goal of detecting pollution onshore from offshore ONG activities during onshore flow; data were collected along coastal Louisiana from Burns Point Park to Grand Isle to the tip of the Mississippi River delta. The in-situ measurements were reported in ICARTT files or Excel files. The remote sensing data are in either HDF or netCDF files.

  3. Water column nitrate+nitrite d15N measurements from R/V L'Atalante in the...

    • bco-dmo.org
    • search.dataone.org
    csv
    Updated Apr 11, 2018
    Cite
    Angela N. Knapp; Kelly M. McCabe; Oliver Grosso; Nathalie Leblond; Thierry Moutin; Sophie Bonnet (2018). Water column nitrate+nitrite d15N measurements from R/V L'Atalante in the southwest Pacific Ocean between New Caledonia and Tahiti from February to March 2015 [Dataset]. http://doi.org/10.1575/1912/bco-dmo.733303
    Explore at:
    Available download formats: csv (5.24 KB)
    Dataset updated
    Apr 11, 2018
    Dataset provided by
    Biological and Chemical Data Management Office
    Authors
    Angela N. Knapp; Kelly M. McCabe; Oliver Grosso; Nathalie Leblond; Thierry Moutin; Sophie Bonnet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 23, 2015 - Mar 30, 2015
    Variables measured
    Cast, Date, Depth, NO3_NO2, Station, Latitude, Longitude, Sigma_theta, NO3_NO2_d15N, NO3_NO2_d15N_1_SD
    Measurement technique
    Isotope-ratio Mass Spectrometer
    Description

    This data set includes water column nitrate+nitrite d15N measurements. These measurements were used together with measurements of the d15N of particulate nitrogen collected in floating sediment traps that were deployed for several days to calculate the relative contribution of subsurface nitrate and nitrogen from N2 fixation for supporting export production (“d15N budgets”). The results suggest that N2 fixation supported a majority of export at Long Duration (LD) stations A and B, and a minor fraction of export at LD C. The results at LD stations A and B are unique compared to other d15N budgets from the oligotrophic regions, whereas the results from LD C are similar to prior reports from the eastern tropical South Pacific as well as the North Pacific near Hawaii. Additionally, these data are compared with other metrics of N2 fixation made on the same cruise.

  4. EM2040 Water Column Sonar Data Collected During H13177

    • catalog.data.gov
    • datasets.ai
    • +2 more
    Updated Sep 17, 2021
    Cite
    NOAA National Centers for Environmental Information (Point of Contact) (2021). EM2040 Water Column Sonar Data Collected During H13177 [Dataset]. https://catalog.data.gov/dataset/em2040-water-column-sonar-data-collected-during-h13177
    Explore at:
    Dataset updated
    Sep 17, 2021
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    National Centers for Environmental Information (https://www.ncei.noaa.gov/)
    Description

    Sea Scout Hydrographic Survey, H13177 (EM2040). Mainline coverage within the survey area consisted of Complete Coverage (100% side scan sonar with concurrent multibeam data) acquisition. The assigned Fish Haven area and associated debris area were surveyed with Object Detection MBES coverage. Bathymetric and water column data were acquired with a Kongsberg EM2040C multibeam echo sounder aboard the R/V Sea Scout, and bathymetry data were acquired with a Kongsberg EM3002 multibeam echo sounder aboard the R/V C-Wolf. Side scan sonar acoustic imagery was collected with a Klein 5000 V2 system aboard the R/V Sea Scout and an EdgeTech 4200 aboard the R/V C-Wolf.

  5. Data from: Separation and determination of D-malic acid enantiomer by...

    • scielo.figshare.com
    • datasetcatalog.nlm.nih.gov
    tiff
    Updated Jun 2, 2023
    Cite
    Xuejiao Mei; Dingqiang Lu; Xiangping Yan (2023). Separation and determination of D-malic acid enantiomer by reversed-phase liquid chromatography after derivatization with (R)-1-(1-naphthyl) ethylamine [Dataset]. http://doi.org/10.6084/m9.figshare.21907645.v1
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    SciELO journals
    Authors
    Xuejiao Mei; Dingqiang Lu; Xiangping Yan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: L-Malic acid is the Active Pharmaceutical Ingredient of the latest generation of compound electrolyte injection (STEROFUNDIN ISO, Germany) and plays a very important role in the rescue of critically ill patients. The optical purity of L-malic acid is a Critical Quality Attribute. A new reversed-phase high performance liquid chromatography (RP-HPLC) method for pre-column derivatization of the D-malic acid enantiomer impurity in L-malic acid bulk drug was established. The derivatization reaction was carried out using (R)-1-(1-naphthyl)ethylamine ((R)-NEA) as a chiral derivatization reagent. A Kromasil C18 column was used with a detection wavelength of 225 nm, a flow rate of 1.0 mL·min-1, and a column temperature of 30 °C. The mobile phase was acetonitrile-0.01 mol·L-1 potassium dihydrogen phosphate solution (containing 20 mmol·L-1 sodium heptanesulfonate, adjusted to pH 2.80 with phosphoric acid) at a ratio of 45:55, and the resolution of the D-malic acid and L-malic acid derivatization products reached 1.7. The proposed method has the advantages of simple operation, mild conditions, stable derivatization products, and low cost. It also gave better separation and was more accurate than previous methods.

  6. Google Data Analytics Case Study Cyclistic

    • kaggle.com
    zip
    Updated Sep 27, 2022
    + more versions
    Cite
    Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions
    Explore at:
    Available download formats: zip (1299 bytes)
    Dataset updated
    Sep 27, 2022
    Authors
    Udayakumar19
    Description

    Introduction

    Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

    Scenario

    You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Ask

    How do annual members and casual riders use Cyclistic bikes differently?

    Guiding Question:

    What is the problem you are trying to solve?
      How do annual members and casual riders use Cyclistic bikes differently?
    How can your insights drive business decisions?
      The insights will help the marketing team design a strategy to convert casual riders.
    

    Prepare

    Guiding Question:

    Where is your data located?
      The data is located in Cyclistic's own organizational data.
    
    How is data organized?
      The datasets are in CSV format, one file per month, from fiscal year 2022.
    
    Are there issues with bias or credibility in this data? Does your data ROCCC? 
      It is good: the data ROCCCs (Reliable, Original, Comprehensive, Current, Cited) because it was collected by the Cyclistic organization itself.
    
    How are you addressing licensing, privacy, security, and accessibility?
      The company has its own license over the dataset. The dataset does not contain any personal information about the riders.
    
    How did you verify the data’s integrity?
      All the files have consistent columns and each column has the correct type of data.
    
    How does it help you answer your questions?
      Insights are always hidden in the data; we have to interpret the data to find them.
    
    Are there any problems with the data?
      Yes: the starting station name and ending station name columns have null values.
    

    Process

    Guiding Question:

    What tools are you choosing and why?
      I used RStudio to clean and transform the data for the analysis phase, both because of the large dataset and to gain experience with the language.
    
    Have you ensured the data’s integrity?
     Yes, the data is consistent throughout the columns.
    
    What steps have you taken to ensure that your data is clean?
      First, duplicates and null values were removed; then new columns were added for analysis.
    
    How can you verify that your data is clean and ready to analyze? 
     Make sure the column names are consistent throughout all datasets before stacking them with the 'bind_rows' function.

    Make sure column data types are consistent throughout all the datasets by using 'compare_df_cols' from the 'janitor' package.
    Combine all the datasets into a single data frame for consistency throughout the analysis.
    Remove the columns start_lat, start_lng, end_lat, and end_lng from the data frame, because they are not required for the analysis.
    Create new columns day, date, month, and year from the started_at column; this provides additional opportunities to aggregate the data.
    Create a 'ride_length' column from the started_at and ended_at columns to find riders' average ride duration.
    Remove the null rows from the dataset using the 'na.omit' function. (A condensed R sketch of these cleaning steps follows below.)
    Have you documented your cleaning process so you can review and share those results? 
      Yes, the cleaning process is documented clearly.
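
    A minimal R sketch of the cleaning steps above, assuming hypothetical monthly CSV files in a data/ folder and the standard Cyclistic column names (started_at, ended_at, start_lat, ...); the actual file and column names may differ.

    ```r
    library(dplyr)
    library(janitor)
    library(lubridate)

    files  <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # assumed folder
    months <- lapply(files, read.csv)

    compare_df_cols(months)               # check column names/types agree across files

    trips <- bind_rows(months) %>%        # stack the monthly files
      distinct() %>%                      # drop duplicate rows
      select(-start_lat, -start_lng, -end_lat, -end_lng) %>%
      mutate(
        started_at  = ymd_hms(started_at),
        ended_at    = ymd_hms(ended_at),
        date        = as.Date(started_at),
        day         = wday(started_at, label = TRUE),
        month       = month(started_at, label = TRUE),
        year        = year(started_at),
        ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))
      ) %>%
      na.omit()                           # remove rows with null values
    ```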
    

    Analyze Phase:

    Guiding Questions:

    How should you organize your data to perform analysis on it?
      The data has been organized into one single data frame using R's CSV-reading function.
    Has your data been properly formatted?
      Yes, all the columns have their correct data types.

    What surprises did you discover in the data?
      Casual riders' ride durations are longer than annual members'.
      Casual riders use docked bikes far more than annual members do.
    What trends or relationships did you find in the data?
      Annual members ride mainly for commuting.
      Casual riders prefer docked bikes.
      Annual members prefer electric or classic bikes.
    How will these insights help answer your business questions?
      These insights help build a profile of each rider type.
    

    Share

    Guiding Questions:

    Were you able to answer the question of how ...
    
  7. Council; Council Files April 17, 1847, Case of Leander Thompson, GC3/series...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA (2023). Council; Council Files April 17, 1847, Case of Leander Thompson, GC3/series 378, Petition of Luther Rist [Dataset]. http://doi.org/10.7910/DVN/NCAOR
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA
    Description

    Petition subject: Execution case
    Original: http://nrs.harvard.edu/urn-3:FHCL:12233039
    Date of creation: (unknown)
    Petition location: Uxbridge
    Selected signatures: Luther Rist; Susan R. Usher; Harriett N. Moury
    Total signatures: 175
    Legal voter signatures (males not identified as non-legal): 64
    Female signatures: 84
    Unidentified signatures: 27
    Female only signatures: No
    Identifications of signatories: inhabitants, [females]
    Prayer format was printed vs. manuscript: Manuscript
    Signatory column format: not column separated
    Additional non-petition or unrelated documents available at archive: additional documents available
    Additional archivist notes: Leander Thompson
    Location of the petition at the Massachusetts Archives of the Commonwealth: Governor Council Files, April 17, 1847, Case of Leander Thompson
    Acknowledgements: Supported by the National Endowment for the Humanities (PW-5105612), Massachusetts Archives of the Commonwealth, Radcliffe Institute for Advanced Study at Harvard University, Center for American Political Studies at Harvard University, Institutional Development Initiative at Harvard University, and Harvard University Library.

  8. Global methane column-averaged dry air mole fraction (XCH4) from TROPOMI...

    • apgc.awi.de
    html, pdf
    Updated Apr 9, 2024
    Cite
    European Geosciences Union (EGU) (2024). Global methane column-averaged dry air mole fraction (XCH4) from TROPOMI WFM-DOAS, since 2017 [Dataset]. http://doi.org/10.5194/amt-12-6771-2019
    Explore at:
    Available download formats: pdf, html
    Dataset updated
    Apr 9, 2024
    Dataset authored and provided by
    European Geosciences Union (EGU)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Carbon monoxide (CO) is an important atmospheric constituent affecting air quality, and methane (CH4) is the second most important greenhouse gas contributing to human-induced climate change. Detailed and continuous observations of these gases are necessary to better assess their impact on climate and atmospheric pollution. While surface and airborne measurements are able to accurately determine atmospheric abundances on local scales, global coverage can only be achieved using satellite instruments.

    The TROPOspheric Monitoring Instrument (TROPOMI) onboard the Sentinel-5 Precursor satellite, successfully launched in October 2017, is a spaceborne nadir-viewing imaging spectrometer measuring solar radiation reflected by the Earth in a push-broom configuration. It has a wide swath on the terrestrial surface and covers wavelength bands between the ultraviolet (UV) and the shortwave infrared (SWIR), combining a high spatial resolution with daily global coverage. These characteristics enable the determination of both gases with an unprecedented level of detail on a global scale, introducing new areas of application.

    Abundances of the atmospheric column-averaged dry air mole fractions XCO and XCH4 are simultaneously retrieved from TROPOMI's radiance measurements in the 2.3 µm spectral range of the SWIR part of the solar spectrum using the scientific retrieval algorithm Weighting Function Modified Differential Optical Absorption Spectroscopy (WFM-DOAS). This algorithm is intended to be used with the operational algorithms for mutual verification and to provide new geophysical insights. We introduce the algorithm in detail, including expected error characteristics based on synthetic data, a machine-learning-based quality filter, and a shallow learning calibration procedure applied in the post-processing of the XCH4 data. The quality of the results based on real TROPOMI data is assessed by validation with ground-based Fourier transform spectrometer (FTS) measurements, providing realistic error estimates of the satellite data: the XCO data set is characterised by a random error of 5.1 ppb (5.8%) and a systematic error of 1.9 ppb (2.1%); the XCH4 data set exhibits a random error of 14.0 ppb (0.8%) and a systematic error of 4.3 ppb (0.2%). The natural XCO and XCH4 variations are well captured by the satellite retrievals, which is demonstrated by a high correlation with the validation data (R = 0.97 for XCO and R = 0.91 for XCH4, based on daily averages).

    Citation

    Schneising, O., Buchwitz, M., Reuter, M., Bovensmann, H., Burrows, J. P., Borsdorff, T., Deutscher, N. M., Feist, D. G., Griffith, D. W. T., Hase, F., Hermans, C., Iraci, L. T., Kivi, R., Landgraf, J., Morino, I., Notholt, J., Petri, C., Pollard, D. F., Roche, S., Shiomi, K., Strong, K., Sussmann, R., Velazco, V. A., Warneke, T., and Wunch, D.: A scientific algorithm to simultaneously retrieve carbon monoxide and methane from TROPOMI onboard Sentinel-5 Precursor, Atmos. Meas. Tech., 12, 6771–6802, https://doi.org/10.5194/amt-12-6771-2019, 2019.

  9. Data for: A Functional Near-infrared Spectroscopy Investigation of Directed...

    • data.mendeley.com
    • search.datacite.org
    Updated Oct 15, 2020
    + more versions
    Cite
    Heming Gao (2020). Data for: A Functional Near-infrared Spectroscopy Investigation of Directed Forgetting [Dataset]. http://doi.org/10.17632/h7mb2vzx54.1
    Explore at:
    Dataset updated
    Oct 15, 2020
    Authors
    Heming Gao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Behavioural and fNIRS data.

    Data 1 shows the memory recognition accuracies for TBR and TBF words, and the false alarm (FA) rate for new words, from 16 participants. The participant column gives each participant's serial number; the age column gives their age; the TBF and TBR columns give each participant's recognition memory performance for TBF and TBR words; the FA column gives each participant's FA rate for new words.

    Data 2 shows the oxy-Hb levels time-locked to cue onset (four cue types: TBF-f, TBF-r, TBR-f, and TBR-r) during the 0-4 s time window in 12 channels (Ch53, Ch54, Ch9, Ch32, Ch2, Ch38, Ch20, Ch44, Ch22, Ch46, Ch25, and Ch48). The participant column gives each participant's serial number; the TBF-F-Ch53 column gives the oxy-Hb levels triggered by TBF-f cues in Ch53; the remaining columns give, in turn, the oxy-Hb levels triggered by TBF-f cues in the other channels, and then by TBF-r, TBR-f, and TBR-r cues in all 12 channels.

    Data 3 shows the same oxy-Hb measures during the 5-9 s time window, and Data 4 during the 9-11 s time window, in the same 12 channels; the column settings of both are identical to Data 2.

  10. Water column DOC, Chl, PN, PC, PON, and POC data from CTD casts from R/V...

    • bco-dmo.org
    csv
    Updated Apr 1, 2010
    + more versions
    Cite
    Scott Nodder (2010). Water column DOC, Chl, PN, PC, PON, and POC data from CTD casts from R/V Tangaroa cruise VDT0410 in the South East of New Zealand, S.W. Bounty Trough in 2004 (SAGE project) [Dataset]. https://www.bco-dmo.org/dataset/3329
    Explore at:
    Available download formats: csv (29.86 KB)
    Dataset updated
    Apr 1, 2010
    Dataset provided by
    Biological and Chemical Data Management Office
    Authors
    Scott Nodder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    DOC, lat, lon, Chla, PC_1, PC_2, PN_1, PN_2, AV_PN, POC_1, and 25 more
    Measurement technique
    CTD Sea-Bird 911
    Description

    Water column DOC, Chl, PN, PC, PON, POC data from CTD casts

    A summary of methods used and detection limits is below under 'Data Processing Description'.
    
  11. R/V Ron Brown Ozone Column Data

    • ckanprod.data-commons.k8s.ucar.edu
    • data.ucar.edu
    ascii
    Updated Oct 7, 2025
    Cite
    Anne M. Thompson; James E. Johnson (2025). R/V Ron Brown Ozone Column Data [Dataset]. http://doi.org/10.26023/QE6B-M46P-9B0V
    Explore at:
    Available download formats: ascii
    Dataset updated
    Oct 7, 2025
    Authors
    Anne M. Thompson; James E. Johnson
    Time period covered
    Jan 17, 1999 - Feb 19, 1999
    Description

    This dataset contains the Ron Brown ozonesonde profile data.

  12. Raw multibeam echosounder data collected on-board R/V Helmer Hanssen during...

    • dataverse.azure.uit.no
    • dataverse.no
    txt, zip
    Updated Sep 28, 2023
    Cite
    Giuliana Panieri; Giuliana Panieri (2023). Raw multibeam echosounder data collected on-board R/V Helmer Hanssen during the CAGE 17-2 AMGG cruise to the Olga Basin, northern Barents Sea [Dataset]. http://doi.org/10.18710/Q2ZVBC
    Explore at:
    Available download formats: zip (3932206966), zip (9086433794), zip (20022462609), zip (15636771385), zip (11781536588), zip (22222760100), zip (26415), zip (4159173634), zip (19637921216), zip (6908699280), txt (102268), zip (19178372522), zip (541459487), zip (21147233533)
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    DataverseNO
    Authors
    Giuliana Panieri; Giuliana Panieri
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jun 21, 2017 - Jul 3, 2017
    Area covered
    Barents Sea, Svalbard, Norway
    Dataset funded by
    Norwegian Petroleum Directorate
    The Research Council of Norway
    Description

    This dataset includes raw multibeam echosounder data collected on board R/V Helmer Hanssen during the CAGE 17-2 AMGG cruise to the Olga Basin, northern Barents Sea, from June 21 to July 3, 2017. Data were collected using a Kongsberg Simrad EM302 multibeam echosounder system. The nominal sonar frequency of the sound waves is 30 kHz, with an angular coverage sector of up to 150° and 432 beams per ping. The system was typically used with a 60°/60° opening angle. The dataset comprises Kongsberg bathymetry files, as well as water column data files and sound velocity profiles gathered from 8 CTD casts. Water column data collection was not continuous during the cruise, for example during transits between study areas. Data coverage includes the transit from Longyearbyen, Outer Storfjordrenna, the Olga Basin (Storbanken), and the transit to Tromsø over West Sentralbanken, with files split according to their day of collection. A rough itinerary is as follows:
    June 21: Isfjorden
    June 22: West Svalbard – Storfjordrenna
    June 23: Storfjordrenna
    June 24: West Svalbard
    June 25: Spitsbergenbanken – Hopendjupet
    June 26-29: Olga Basin (Storbanken)
    June 30: Storbankrenna
    July 1: Sentralbanken

  13. Council; Council Files September 22, 1843, Case of Isaac Leavitt, GC3/series...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA (2023). Council; Council Files September 22, 1843, Case of Isaac Leavitt, GC3/series 378, Petition of Charles W. Lillie [Dataset]. http://doi.org/10.7910/DVN/2RMA9
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA
    Time period covered
    Sep 11, 1843
    Description

    Petition subject: Execution case
    Original: http://nrs.harvard.edu/urn-3:FHCL:12232985
    Date of creation: 1843-09-11
    Petition location: Roxbury
    Selected signatures: Charles W. Lillie; Stephen R. Doggett; Caroline Williams
    Total signatures: 13
    Legal voter signatures (males not identified as non-legal): 9
    Female signatures: 4
    Female only signatures: No
    Identifications of signatories: inhabitants, [females]
    Prayer format was printed vs. manuscript: Manuscript
    Signatory column format: not column separated
    Additional non-petition or unrelated documents available at archive: additional documents available
    Additional archivist notes: Isaac Leavitt
    Location of the petition at the Massachusetts Archives of the Commonwealth: Governor Council Files, September 22, 1843, Case of Isaac Leavitt
    Acknowledgements: Supported by the National Endowment for the Humanities (PW-5105612), Massachusetts Archives of the Commonwealth, Radcliffe Institute for Advanced Study at Harvard University, Center for American Political Studies at Harvard University, Institutional Development Initiative at Harvard University, and Harvard University Library.

  14. House Unpassed Legislation 1842, Docket 1153, SC1/series 230, Petition of...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA (2023). House Unpassed Legislation 1842, Docket 1153, SC1/series 230, Petition of J.H. Brown [Dataset]. http://doi.org/10.7910/DVN/98KUO
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Digital Archive of Massachusetts Anti-Slavery and Anti-Segregation Petitions, Massachusetts Archives, Boston MA
    Description

    Petition subject: Against railroad discrimination with focus on white passengers
    Original: http://nrs.harvard.edu/urn-3:FHCL:10956457
    Date of creation: (unknown)
    Petition location: Sudbury
    Legislator, committee, or address that the petition was sent to: Francis R. Gourgas, Concord
    Selected signatures: J.H. Brown; Sally Brown; Loring Eaton
    Total signatures: 76
    Legal voter signatures (males not identified as non-legal): 31
    Female signatures: 37
    Unidentified signatures: 8
    Female only signatures: No
    Identifications of signatories: inhabitants, [females]
    Prayer format was printed vs. manuscript: Printed
    Signatory column format: not column separated
    Additional non-petition or unrelated documents available at archive: no additional documents
    Additional archivist notes: 11057/4 written on back
    Location of the petition at the Massachusetts Archives of the Commonwealth: House Unpassed 1842, Docket 1153
    Acknowledgements: Supported by the National Endowment for the Humanities (PW-5105612), Massachusetts Archives of the Commonwealth, Radcliffe Institute for Advanced Study at Harvard University, Center for American Political Studies at Harvard University, Institutional Development Initiative at Harvard University, and Harvard University Library.

  15. Students Performance EDA in R

    • kaggle.com
    zip
    Updated Sep 6, 2023
    Cite
    vikram amin (2023). Students Performance EDA in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/students-performance
    Explore at:
    Available download formats: zip (7847 bytes)
    Dataset updated
    Sep 6, 2023
    Authors
    vikram amin
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    We will be doing Exploratory Data Analysis on the Dataset.

    • Set the working directory and read the data.
    • Check the summary of the data.
    • Data Cleaning: no missing or duplicated values were found. Data types for 5 columns needed to be changed from character vectors to factor vectors.
    • EDA: renamed the columns 'race.ethnicity' to 'race', 'parental.level.of.education' to 'parents_edu', and 'test.preparation.course' to 'test_prep'. Created a new column 'avg_score' by averaging the 'math.score', 'reading.score', and 'writing.score' columns.
    • Load the data visualisation libraries 'dplyr', 'ggplot2', 'corrplot', and 'tidyr'. (A condensed R sketch of these steps follows at the end of this entry.)


    • Conclusion:
    • Female students (518) outnumber male students (482), out of 1,000 students in total.
    • 58% of students belong to the Group C race (180 females, 139 males) and the Group D race (129 females, 133 males); the fewest students belong to the Group A race (53 females, 36 males; 89 in total). For 22.6% of students, the parents' education level is some college, followed closely by an associate's degree (22.2%); 5.9% of students' parents have a master's degree.
    • 35.5% of students have free or reduced lunch versus 64.5% who get standard lunch. Within this, 18.9% of female students and 16.6% of male students get free or reduced lunch, versus 32.9% female and 31.6% male who get standard lunch.
    • Female students' total average score is higher than male students'. This could also be due to the higher proportion of female students.
    • 35.8% of students had completed the test preparation versus 64.2% who had not. Within this, 18.4% of female students and 17.4% of male students had completed the test preparation, versus 33.4% female and 30.8% male who had not.
    • The highest correlation is between writing score and reading score, i.e. 0.95.
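
    A condensed R sketch of the EDA steps above; the CSV file name is an assumption, while the column names match those quoted in the description.

    ```r
    library(dplyr)
    library(ggplot2)

    sp <- read.csv("StudentsPerformance.csv") %>%            # hypothetical file name
      mutate(across(where(is.character), as.factor)) %>%     # 5 character cols -> factors
      rename(race        = race.ethnicity,
             parents_edu = parental.level.of.education,
             test_prep   = test.preparation.course) %>%
      mutate(avg_score = (math.score + reading.score + writing.score) / 3)

    # Reading vs. writing scores, the most strongly correlated pair (r = 0.95)
    cor(sp$reading.score, sp$writing.score)
    ggplot(sp, aes(reading.score, writing.score, colour = gender)) +
      geom_point(alpha = 0.5)
    ```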
  16. Epifluorescence Microscopy Water Column Samples from R/V Tangaroa TAN1810 in...

    • bco-dmo.org
    Updated Jul 24, 2023
    Cite
    Natalia Yingling; Karen E. Selph; Michael R. Stukel (2023). Epifluorescence Microscopy Water Column Samples from R/V Tangaroa TAN1810 in the Chatham Rise (Subtropical and Sub-Antarctic waters off of New Zealand) from October to November 2018 (Salp Food Web Ecology project) [Dataset]. https://www.bco-dmo.org/dataset/905060
    Explore at:
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Biological and Chemical Data Management Office
    Authors
    Natalia Yingling; Karen E. Selph; Michael R. Stukel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 25, 2018 - Nov 18, 2018
    Variables measured
    ID, ESD, Lat, Area, Cast, Date, Long, Cycle, Depth, Width, and 10 more
    Measurement technique
    Camera, Fluorescence Microscope
    Description

    Epifluorescence Microscopy Water Column Samples from Subtropical and Subantarctic Waters East of New Zealand

  17. Spatial Demography Column 1 Data and code

    • figshare.com
    txt
    Updated Jan 18, 2016
    Cite
    Corey Sparks (2016). Spatial Demography Column 1 Data and code [Dataset]. http://doi.org/10.6084/m9.figshare.809582.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 18, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Corey Sparks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These are the data and code for the first column in the Spatial Demography journal's Software and Code series.

  18. Case study: Cyclistic bike-share analysis

    • kaggle.com
    zip
    Updated Mar 25, 2022
    + more versions
    Cite
    Jorge4141 (2022). Case study: Cyclistic bike-share analysis [Dataset]. https://www.kaggle.com/datasets/jorge4141/case-study-cyclistic-bikeshare-analysis
    Explore at:
    Available download formats: zip (131490806 bytes)
    Dataset updated
    Mar 25, 2022
    Authors
    Jorge4141
    Description

    Introduction

    This is a case study called Capstone Project from the Google Data Analytics Certificate.

    In this case study, I am working as a junior data analyst at a fictitious bike-share company in Chicago called Cyclistic.

    Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike.

    Scenario

    The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, our team will design a new marketing strategy to convert casual riders into annual members.

    Primary Stakeholders:

    1: Cyclistic Executive Team

    2: Lily Moreno, Director of Marketing and Manager

    ASK

    1. How do annual members and casual riders use Cyclistic bikes differently?
    2. Why would casual riders buy Cyclistic annual memberships?
    3. How can Cyclistic use digital media to influence casual riders to become members?

    Prepare

    The last four quarters were selected for analysis which cover April 01, 2019 - March 31, 2020. These are the datasets used:

    Divvy_Trips_2019_Q2
    Divvy_Trips_2019_Q3
    Divvy_Trips_2019_Q4
    Divvy_Trips_2020_Q1
    

    The data is stored in CSV files which together cover twelve months of data.

    Data appears to be reliable with no bias. It also appears to be original, current and cited.

    I used Cyclistic’s historical trip data found here: https://divvy-tripdata.s3.amazonaws.com/index.html

    The data has been made available by Motivate International Inc. under this license: https://ride.divvybikes.com/data-license-agreement

    Limitations

    Financial information is not available.

    Process

    Used R to analyze and clean data

    • After installing the R packages, data was collected, wrangled and combined into a single file.
    • Columns were renamed.
    • Looked for incongruencies in the dataframes and converted some columns to character type, so they can stack correctly.
    • Combined all quarters into one big data frame.
    • Removed unnecessary columns

    Analyze

    • Inspected the new data table to ensure column names were correctly assigned.
    • Formatted columns to ensure proper data types were assigned (numeric, character, etc.).
    • Consolidated the member_casual column.
    • Added day, month, and year columns to aggregate the data.
    • Added a ride_length column to the entire data frame for consistency.
    • Deleted rides with negative trip durations and rides on bikes taken out of circulation, for quality control.
    • Replaced the word "member" with "Subscriber" and the word "casual" with "Customer".
    • Aggregated the data and compared average rides between members and casual users (see the R sketch below).
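
    A minimal sketch of this aggregation step in R, assuming the combined data frame all_trips with the member_casual, day, and ride_length columns created above (the names are assumptions, not the author's exact code):

    ```r
    library(dplyr)

    # Rides and average duration per rider type and day of week
    all_trips %>%
      group_by(member_casual, day) %>%
      summarise(number_of_rides  = n(),
                average_duration = mean(ride_length),
                .groups = "drop") %>%
      arrange(member_casual, day)
    ```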

    Share

    After the analysis, visuals were created with R.

    Act

    Conclusion:

    • The data appears to show that casual riders and members use bike share differently.
    • Casual riders' average ride length is more than twice that of members.
    • Members use bike share for commuting; casual riders use it for leisure, mostly on weekends.
    • Unfortunately, there is no financial data available to determine which of the two groups (casual or member) spends more money.

    Recommendations

    • Offer casual riders a membership package with promotions and discounts.
  19. Spatial Demography Column 3 - Code

    • figshare.com
    txt
    Updated Jan 18, 2016
    Cite
    Corey Sparks (2016). Spatial Demography Column 3 - Code [Dataset]. http://doi.org/10.6084/m9.figshare.963579.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 18, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Corey Sparks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here is the code from my third column in the Spatial Demography journal. In this code, I use R to calculate commonly used measures of residential segregation.
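
    The dataset's txt file contains the actual code; purely as an illustration of one commonly used measure, a minimal R function for the index of dissimilarity is sketched below (not the column's own code). Here D = 0.5 * sum_i |a_i/A - b_i/B|, where a_i and b_i are group counts in tract i and A, B are the group totals.

    ```r
    # Index of dissimilarity for two groups across tracts (illustrative sketch)
    dissimilarity <- function(a, b) {
      0.5 * sum(abs(a / sum(a) - b / sum(b)))
    }

    # Toy example with three hypothetical tracts
    dissimilarity(a = c(100, 50, 10), b = c(20, 60, 80))
    ```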

  20. Divvy Bikeshare Data | April 2020 - May 2021

    • kaggle.com
    Updated Aug 21, 2021
    Cite
    Antoni K Pestka (2021). Divvy Bikeshare Data | April 2020 - May 2021 [Dataset]. https://www.kaggle.com/antonikpestka/divvy-bikeshare-data-april-2020-may-2021/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 21, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Antoni K Pestka
    Description

    Original Divvy Bikeshare Data obtained from here

    City of Chicago Zip Code Boundary Data obtained from here

    Tableau Dashboard Viz can be seen here

    R code can be found here

    Context

    This is my first-ever project after recently completing the Google Data Analytics Certificate on Coursera.

    The goals of the project are to answer the following questions:
    1. How do annual riders and casual riders use Divvy bikeshare differently?
    2. Why would casual riders buy annual memberships?
    3. How can Divvy use digital media to influence casual riders to become members?

    Casual riders are defined as those who do not have an annual membership, and instead use the service on a pay-per-ride basis.

    Content

    Original Divvy Bikeshare Data obtained from here

    The original datasets included the following columns: Ride ID, Rideable Type (electric, docked bike, classic), Started At Date/Time, Ended At Date/Time, Start Station Address, Start Station ID, End Station Address, End Station ID, Start Longitude, Start Latitude, End Longitude, End Latitude, Member Type (member, casual).

    City of Chicago Zip Code Boundary Data obtained from here

    The zip code boundary geospatial files were used to calculate the zip code of trip origin for each trip based on start longitude and start latitude.

    Caveats and Assumptions

    1. Divvy utilizes two types of bicycles: electric bicycles and classic bicycles. For the column labeled "rideable_type", three values existed: docked_bike, electric_bike, and classic. Docked_bike and classic were aggregated into the same category. Therefore, they are labeled as "other" on the visualization.

    2. Negative ride lengths and ride lengths under 90 seconds were not included in the calculation of average ride length.
       - Negative ride lengths exist because, for certain entries, the end time and date were recorded as occurring before the start time and date.
       - Ride lengths of 90 seconds and less were ruled out due to the possibility of bikes failing to dock properly or being checked out briefly for maintenance checks.
       - This removed 90,842 records from the calculations for average ride length.

    The process

    The R programming language was used for the following (a sketch follows the list):

    1. Create a new column for the zip code of each trip origin based on the start longitude and start latitude
    2. Calculate the ride length in seconds for each trip
    3. Remove unnecessary columns
    4. Rename "electric_bike" to EL and "docked_bike" to DB
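
    A hedged R sketch of steps 1 and 2 is below; the file names, the coordinate reference system, and the zip-code column name in the boundary file are assumptions, and the author's linked code may differ.

    ```r
    library(sf)
    library(dplyr)

    zips  <- st_read("chicago_zip_boundaries.shp")   # hypothetical zip polygon file
    trips <- read.csv("divvy_trips.csv")             # hypothetical merged trips file

    # 1. Zip code of trip origin via a point-in-polygon spatial join
    pts <- st_as_sf(trips, coords = c("start_lng", "start_lat"),
                    crs = 4326, remove = FALSE)      # assume WGS84 lon/lat
    trips_zip <- st_join(pts, st_transform(zips["zip"], 4326))

    # 2. Ride length in seconds for each trip
    trips_zip <- trips_zip %>%
      mutate(started_at  = as.POSIXct(started_at),
             ended_at    = as.POSIXct(ended_at),
             ride_length = as.numeric(difftime(ended_at, started_at,
                                               units = "secs")))
    ```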

    The R code I utilized is found here

    Excel was used for the following:

    1. Deletion of header rows for all dataset files except for the first file (April 2020)
    2. Deletion of the geometry information to save file space

    A .bat file using the DOS command line was used to merge all the cleaned CSV files into a single file.

    Finally, the cleaned and merged dataset was connected to Tableau for analysis and visualization. A link to the dashboard can be found here

    Data Analysis Overview

    Zip Code with highest quantity of trips: 60614 (615,010)
    Total Quantity of Zip Codes: 56
    Trip Quantity of Top 9 Zip Codes: 60.35% (2,630,330)
    Trip Quantity of the Remaining 47 Zip Codes: 39.65% (1,728,281)

    Total Quantity of Trips: 4,358,611
    Quantity of Trips by Annual Members: 58.15% (2,534,718)
    Quantity of Trips by Casual Members: 41.85% (1,823,893)

    Average Ride Length with Electric Bicycle: Annual Members 13.8 minutes; Casual Members 22.3 minutes

    Average Ride Length with Classic Bicycle: Annual Members 16.8 minutes; Casual Members 49.7 minutes

    Average Ride Length Overall: Annual Members 16.2 minutes; Casual Members 44.2 minutes

    Peak Day of the Week for Overall Trip Quantity: Annual Members Saturday; Casual Members Saturday

    Slowest Day of the Week for Overall Trip Quantity: Tuesday (Annual Members: Sunday; Casual Members: Tuesday)

    Peak Day of the Week for Electric Bikes: Saturday (Annual Members: Saturday; Casual Members: Saturday)

    Slowest Day of the Week for Electric Bikes: Tuesday (Annual Members: Sunday; Casual Members: Tuesday)

    Peak Day of the Week for Classic Bikes: Saturday Ann...
