4 datasets found
  1. KC_House Dataset -Linear Regression of Home Prices

    • kaggle.com
    zip
    Updated May 15, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    vikram amin (2023). KC_House Dataset -Linear Regression of Home Prices [Dataset]. https://www.kaggle.com/datasets/vikramamin/kc-house-dataset-home-prices
    Explore at:
    zip(776807 bytes)Available download formats
    Dataset updated
    May 15, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description
    1. Dataset: House pricing dataset containing 21 columns and 21613 rows.
    2. Programming Language : R
    3. Objective : To predict house prices by creating a model
    4. Steps : A) Import the dataset B) Install and run libraries C) Data Cleaning - Remove Null Values , Change Data Types , Dropping of Columns which are not important D) Data Analysis - (i)Linear Regression Model was used to establish the relationship between the dependent variable (price) and other independent variable (ii) Outliers were identified and removed (iii) Regression model was run once again after removing the outliers (iv) Multiple R- squared was calculated which indicated the independent variables can explain 73% change/ variation in the dependent variable (v) P value was less than that of alpha 0.05 which shows it is statistically significant. (vi) Interpreting the meaning of the results of the coefficients (vii) Checked the assumption of multicollinearity (viii) VIF(Variance inflation factor) was calculated for all the independent variables and their absolute value was found to be less than 5. Hence, there is not threat of multicollinearity and that we can proceed with the independent variables specified.
  2. Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race,...

    • search.datacite.org
    • doi.org
    • +1more
    Updated 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1980-2016 [Dataset]. http://doi.org/10.3886/e102263v5-10021
    Explore at:
    Dataset updated
    2018
    Dataset provided by
    DataCitehttps://www.datacite.org/
    Inter-university Consortium for Political and Social Researchhttps://www.icpsr.umich.edu/web/pages/
    Authors
    Jacob Kaplan
    Description

    Version 5 release notes:
    Removes support for SPSS and Excel data.Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.
    Adds in agencies that report 0 months of the year.Adds a column that indicates the number of months reported. This is generated summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime. They may not necessarily report every crime every month. Agencies that did not report a crime with have a value of NA for every arrest column for that crime.Removes data on runaways.
    Version 4 release notes:
    Changes column names from "poss_coke" and "sale_coke" to "poss_heroin_coke" and "sale_heroin_coke" to clearly indicate that these column includes the sale of heroin as well as similar opiates such as morphine, codeine, and opium. Also changes column names for the narcotic columns to indicate that they are only for synthetic narcotics.
    Version 3 release notes:
    Add data for 2016.Order rows by year (descending) and ORI.Version 2 release notes:
    Fix bug where Philadelphia Police Department had incorrect FIPS county code.
    The Arrests by Age, Sex, and Race data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains highly granular data on the number of people arrested for a variety of crimes (see below for a full list of included crimes). The data sets here combine data from the years 1980-2015 into a single file. These files are quite large and may take some time to load.
    All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here. https://github.com/jacobkap/crime_data. If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

    I did not make any changes to the data other than the following. When an arrest column has a value of "None/not reported", I change that value to zero. This makes the (possible incorrect) assumption that these values represent zero crimes reported. The original data does not have a value when the agency reports zero arrests other than "None/not reported." In other words, this data does not differentiate between real zeros and missing values. Some agencies also incorrectly report the following numbers of arrests which I change to NA: 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99999, 99998.

    To reduce file size and make the data more manageable, all of the data is aggregated yearly. All of the data is in agency-year units such that every row indicates an agency in a given year. Columns are crime-arrest category units. For example, If you choose the data set that includes murder, you would have rows for each agency-year and columns with the number of people arrests for murder. The ASR data breaks down arrests by age and gender (e.g. Male aged 15, Male aged 18). They also provide the number of adults or juveniles arrested by race. Because most agencies and years do not report the arrestee's ethnicity (Hispanic or not Hispanic) or juvenile outcomes (e.g. referred to adult court, referred to welfare agency), I do not include these columns.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. Please note that some of the FIPS codes have leading zeros and if you open it in Excel it will automatically delete those leading zeros.

    I created 9 arrest categories myself. The categories are:
    Total Male JuvenileTotal Female JuvenileTotal Male AdultTotal Female AdultTotal MaleTotal FemaleTotal JuvenileTotal AdultTotal ArrestsAll of these categories are based on the sums of the sex-age categories (e.g. Male under 10, Female aged 22) rather than using the provided age-race categories (e.g. adult Black, juvenile Asian). As not all agencies report the race data, my method is more accurate. These categories also make up the data in the "simple" version of the data. The "simple" file only includes the above 9 columns as the arrest data (all other columns in the data are just agency identifier columns). Because this "simple" data set need fewer columns, I include all offenses.

    As the arrest data is very granular, and each category of arrest is its own column, there are dozens of columns per crime. To keep the data somewhat manageable, there are nine different files, eight which contain different crimes and the "simple" file. Each file contains the data for all years. The eight categories each have crimes belonging to a major crime category and do not overlap in crimes other than with the index offenses. Please note that the crime names provided below are not the same as the column names in the data. Due to Stata limiting column names to 32 characters maximum, I have abbreviated the crime names in the data. The files and their included crimes are:

    Index Crimes
    MurderRapeRobberyAggravated AssaultBurglaryTheftMotor Vehicle TheftArsonAlcohol CrimesDUIDrunkenness
    LiquorDrug CrimesTotal DrugTotal Drug SalesTotal Drug PossessionCannabis PossessionCannabis SalesHeroin or Cocaine PossessionHeroin or Cocaine SalesOther Drug PossessionOther Drug SalesSynthetic Narcotic PossessionSynthetic Narcotic SalesGrey Collar and Property CrimesForgeryFraudStolen PropertyFinancial CrimesEmbezzlementTotal GamblingOther GamblingBookmakingNumbers LotterySex or Family CrimesOffenses Against the Family and Children
    Other Sex Offenses
    ProstitutionRapeViolent CrimesAggravated AssaultMurderNegligent ManslaughterRobberyWeapon Offenses
    Other CrimesCurfewDisorderly ConductOther Non-trafficSuspicion
    VandalismVagrancy
    Simple
    This data set has every crime and only the arrest categories that I created (see above).
    If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.

  3. o

    Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • openicpsr.org
    Updated May 18, 2018
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1991-2024 [Dataset]. http://doi.org/10.3886/E103500V12
    Explore at:
    Dataset updated
    May 18, 2018
    Dataset provided by
    Princeton University
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1991 - 2024
    Area covered
    United States
    Description

    !!!WARNING~~~This dataset has a large number of flaws and is unable to properly answer many questions that people generally use it to answer, such as whether national hate crimes are changing (or at least they use the data so improperly that they get the wrong answer). A large number of people using this data (academics, advocates, reporting, US Congress) do so inappropriately and get the wrong answer to their questions as a result. Indeed, many published papers using this data should be retracted. Before using this data I highly recommend that you thoroughly read my book on UCR data, particularly the chapter on hate crimes (https://ucrbook.com/hate-crimes.html) as well as the FBI's own manual on this data. The questions you could potentially answer well are relatively narrow and generally exclude any causal relationships. ~~~WARNING!!!For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.comVersion 12 release notes:Adds .parquet file formatVersion 11 release notes:Adds 2023-2024 dataVersion 10 release notes:Adds 2022 dataVersion 9 release notes:Adds 2021 data.Version 8 release notes:Adds 2019 and 2020 data. Please note that the FBI has retired UCR data ending in 2020 data so this will be the last UCR hate crime data they release. Changes .rda file to .rds.Version 7 release notes:Changes release notes description, does not change data.Version 6 release notes:Adds 2018 dataVersion 5 release notes:Adds data in the following formats: SPSS, SAS, and Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Adds data for 1991.Fixes bug where bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013 causing there to be two columns and zero values for years with the wrong label.All data is now directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. Version 4 release notes: Adds data for 2017.Adds rows that submitted a zero-report (i.e. that agency reported no hate crimes in the year). This is for all years 1992-2017. Made changes to categorical variables (e.g. bias motivation columns) to make categories consistent over time. Different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian') which I made consistent. Made the 'population' column which is the total population in that agency. Version 3 release notes: Adds data for 2016.Order rows by year (descending) and ORI.Version 2 release notes: Fix bug where Philadelphia Police Department had incorrect FIPS county code. The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open.Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9 character Originating Identifier code), and incident number columns together. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group). Finally, it has a number of columns indicating if the victim for each offense was a certain type of victim or not (e.g. individual victim, business victim religious victim, etc.). The only changes I made to the data are the following. Minor changes to column names to make all column names 32 characters or fewer (so it can be saved in a Stata format), made all character values lower case, reordered columns. I also generated incident month, weekday, and month-day variables from the incident date variable included in the original data.

  4. o

    Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • openicpsr.org
    Updated Mar 29, 2018
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Arrests by Age, Sex, and Race, 1974-2020 [Dataset]. http://doi.org/10.3886/E102263V14
    Explore at:
    Dataset updated
    Mar 29, 2018
    Dataset provided by
    Princeton University
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1974 - 2020
    Area covered
    United States
    Description

    For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.comVersion 14 release notes:Adds 2020 data. Please note that the FBI has retired UCR data ending in 2020 data so this will be the last Arrests by Age, Sex, and Race data they release. Version 13 release notes:Changes R files from .rda to .rds.Fixes bug where the number_of_months_reported variable incorrectly was the largest of the number of months reported for a specific crime variable. For example, if theft was reported Jan-June and robbery was reported July-December in an agency, in total there were 12 months reported. But since each crime (and let's assume no other crime was reported more than 6 months of the year) only was reported 6 months, the number_of_months_reported variable was incorrectly set at 6 months. Now it is the total number of months reported of any crime. So it would be set to 12 months in this example. Thank you to Nick Eubank for alerting me to this issue.Adds rows even when a agency reported zero arrests that month; all arrest values are set to zero for these rows.Version 12 release notes:Adds 2019 data.Version 11 release notes:Changes release notes description, does not change data.Version 10 release notes:The data now has the following age categories (which were previously aggregated into larger groups to reduce file size): under 10, 10-12, 13-14, 40-44, 45-49, 50-54, 55-59, 60-64, over 64. These categories are available for female, male, and total (female+male) arrests. The previous aggregated categories (under 15, 40-49, and over 49 have been removed from the data). Version 9 release notes:For each offense, adds a variable indicating the number of months that offense was reported - these variables are labeled as "num_months_[crime]" where [crime] is the offense name. These variables are generated by the number of times one or more arrests were reported per month for that crime. For example, if there was at least one arrest for assault in January, February, March, and August (and no other months), there would be four months reported for assault. Please note that this does not differentiate between an agency not reporting that month and actually having zero arrests. The variable "number_of_months_reported" is still in the data and is the number of months that any offense was reported. So if any agency reports murder arrests every month but no other crimes, the murder number of months variable and the "number_of_months_reported" variable will both be 12 while every other offense number of month variable will be 0. Adds data for 2017 and 2018.Version 8 release notes:Adds annual data in R format.Changes project name to avoid confusing this data for the ones done by NACJD.Fixes bug where bookmaking was excluded as an arrest category. Changed the number of categories to include more offenses per category to have fewer total files. Added a "total_race" file for each category - this file has total arrests by race for each crime and a breakdown of juvenile/adult by race. Version 7 release notes: Adds 1974-1979 dataAdds monthly data (only totals by sex and race, not by age-categories). All data now from FBI, not NACJD. Changes some column names so all columns are <=32 characters to be usable in Stata.Changes how number of months reported is calculated. Now it is the number of unique months with arrest data reported - months of data from the monthly header file (i.e. juvenile disposition data) are not considered in this calculation. Version 6 release notes: Fix bug where juvenile female columns had the same value as juvenile male columns.Version 5 release notes: Removes support for SPSS and Excel data.Changes the crimes that are stored in each file. There are more files now with fewer crimes per file. The files and their included crimes have been updated below.Adds in agencies that report 0 months of the year.Adds a column that indicates the number of months reported. This is generated summing up the number of unique months an agency reports data for. Note that this indicates the number of months an agency reported arrests for ANY crime. They may not necessarily report every crime every month. Agencies that did not report a crime with have a value of NA for every arrest column for that crime.Removes data on runaways.Version 4 release notes: Changes column names from "poss_coke" and "sale_coke" to "poss_heroi

  5. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
vikram amin (2023). KC_House Dataset -Linear Regression of Home Prices [Dataset]. https://www.kaggle.com/datasets/vikramamin/kc-house-dataset-home-prices
Organization logo

KC_House Dataset -Linear Regression of Home Prices

Linear Regression of Home Prices

Explore at:
zip(776807 bytes)Available download formats
Dataset updated
May 15, 2023
Authors
vikram amin
License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description
  1. Dataset: House pricing dataset containing 21 columns and 21613 rows.
  2. Programming Language : R
  3. Objective : To predict house prices by creating a model
  4. Steps : A) Import the dataset B) Install and run libraries C) Data Cleaning - Remove Null Values , Change Data Types , Dropping of Columns which are not important D) Data Analysis - (i)Linear Regression Model was used to establish the relationship between the dependent variable (price) and other independent variable (ii) Outliers were identified and removed (iii) Regression model was run once again after removing the outliers (iv) Multiple R- squared was calculated which indicated the independent variables can explain 73% change/ variation in the dependent variable (v) P value was less than that of alpha 0.05 which shows it is statistically significant. (vi) Interpreting the meaning of the results of the coefficients (vii) Checked the assumption of multicollinearity (viii) VIF(Variance inflation factor) was calculated for all the independent variables and their absolute value was found to be less than 5. Hence, there is not threat of multicollinearity and that we can proceed with the independent variables specified.
Search
Clear search
Close search
Google apps
Main menu