Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘California Housing Prices Data (5 new features!)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/fedesoriano/california-housing-prices-data-extra-features on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Boston House Prices: LINK
This dataset is a modified version of the California Housing Data used in the paper Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297. It serves as an excellent introduction to implementing machine learning algorithms because it requires rudimentary data cleaning, has an easily understandable list of variables, and sits at an optimal size between being too toyish and too cumbersome.
The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.
This dataset includes 5 extra features defined by me: "Distance to coast", "Distance to Los Angeles", "Distance to San Diego", "Distance to San Jose", and "Distance to San Francisco". These extra features try to account for the distance to the nearest coast and the distance to the centre of the largest cities in California.
The distances were calculated using the Haversine formula with the Longitude and Latitude:
$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos\phi_1 \cos\phi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$
where:
$\phi_1$ and $\phi_2$ are the Latitudes of point 1 and point 2, respectively
$\lambda_1$ and $\lambda_2$ are the Longitudes of point 1 and point 2, respectively
$r$ is the radius of the Earth (6371 km)
The data pertains to the houses found in a given California district and some summary stats about them based on the 1990 census data. The columns are as follows; their names are pretty self-explanatory:
1) Median House Value: Median house value for households within a block (measured in US Dollars) [$]
2) Median Income: Median income for households within a block of houses (measured in tens of thousands of US Dollars) [10k$]
3) Median Age: Median age of a house within a block; a lower number is a newer building [years]
4) Total Rooms: Total number of rooms within a block
5) Total Bedrooms: Total number of bedrooms within a block
6) Population: Total number of people residing within a block
7) Households: Total number of households, a group of people residing within a home unit, for a block
8) Latitude: A measure of how far north a house is; a higher value is farther north [°]
9) Longitude: A measure of how far west a house is; California longitudes are negative, so a more negative value is farther west [°]
10) Distance to coast: Distance to the nearest coast point [m]
11) Distance to Los Angeles: Distance to the centre of Los Angeles [m]
12) Distance to San Diego: Distance to the centre of San Diego [m]
13) Distance to San Jose: Distance to the centre of San Jose [m]
14) Distance to San Francisco: Distance to the centre of San Francisco [m]
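The Haversine calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset author's actual code: the function name `haversine_m` and the city-centre coordinates are assumptions chosen for the example, and the reference points used to build the distance columns may differ.

```python
import math

# Earth radius in metres (6371 km, as stated in the description).
EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_M):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Term under the square root in the Haversine formula.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


# Approximate city-centre coordinates (latitude, longitude), assumed for
# illustration only; they are not taken from the dataset.
LOS_ANGELES = (34.0522, -118.2437)
SAN_FRANCISCO = (37.7749, -122.4194)

distance_m = haversine_m(*LOS_ANGELES, *SAN_FRANCISCO)
print(f"Los Angeles to San Francisco: {distance_m / 1000:.0f} km")
```

Applied row by row to the Latitude and Longitude columns, a function like this would yield distances in metres, matching the [m] unit of the distance columns.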
This data was entirely modified and cleaned by me. The original data (without the distance features) was initially featured in the following paper: Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.
The original dataset can be found under the following link: https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
--- Original source retains full ownership of the source dataset ---
List of the data tables as part of the Immigration System Statistics Home Office release. Summary and detailed data tables covering the immigration system, including out-of-country and in-country visas, asylum, detention, and returns.
If you have any feedback, please email MigrationStatsEnquiries@homeoffice.gov.uk.
The Microsoft Excel .xlsx files may not be suitable for users of assistive technology.
If you use assistive technology (such as a screen reader) and need a version of these documents in a more accessible format, please email MigrationStatsEnquiries@homeoffice.gov.uk
Please tell us what format you need. It will help us if you say what assistive technology you use.
Immigration system statistics, year ending March 2025
Immigration system statistics quarterly release
Immigration system statistics user guide
Publishing detailed data tables in migration statistics
Policy and legislative changes affecting migration to the UK: timeline
Immigration statistics data archives
Passenger arrivals summary tables, year ending March 2025 (MS Excel Spreadsheet, 66.5 KB): https://assets.publishing.service.gov.uk/media/68258d71aa3556876875ec80/passenger-arrivals-summary-mar-2025-tables.xlsx
‘Passengers refused entry at the border summary tables’ and ‘Passengers refused entry at the border detailed datasets’ have been discontinued. The latest published versions of these tables are from February 2025 and are available in the ‘Passenger refusals – release discontinued’ section. A similar data series, ‘Refused entry at port and subsequently departed’, is available within the Returns detailed and summary tables.
Electronic travel authorisation detailed datasets, year ending March 2025 (MS Excel Spreadsheet, 56.7 KB): https://assets.publishing.service.gov.uk/media/681e406753add7d476d8187f/electronic-travel-authorisation-datasets-mar-2025.xlsx
ETA_D01: Applications for electronic travel authorisations, by nationality
ETA_D02: Outcomes of applications for electronic travel authorisations, by nationality
Entry clearance visas summary tables, year ending March 2025 (MS Excel Spreadsheet, 113 KB): https://assets.publishing.service.gov.uk/media/68247953b296b83ad5262ed7/visas-summary-mar-2025-tables.xlsx
Entry clearance visa applications and outcomes detailed datasets, year ending March 2025 (MS Excel Spreadsheet, 29.1 MB): https://assets.publishing.service.gov.uk/media/682c4241010c5c28d1c7e820/entry-clearance-visa-outcomes-datasets-mar-2025.xlsx
Vis_D01: Entry clearance visa applications, by nationality and visa type
Vis_D02: Outcomes of entry clearance visa applications, by nationality, visa type, and outcome
Additional dat
Officer Involved Shooting (OIS) Database and Statistical Analysis. Data is updated after there is an officer involved shooting.
PIU# Incident # - the number associated with either the incident or used as reference to store the items in our evidence rooms
Date of Occurrence Month - month the incident occurred (note: the year is labeled on the tab of the spreadsheet)
Date of Occurrence Day - day of the month the incident occurred (note: the year is labeled on the tab of the spreadsheet)
Time of Occurrence - time the incident occurred
Address of incident - the location the incident occurred
Division - the LMPD division in which the incident actually occurred
Beat - the LMPD beat in which the incident actually occurred
Investigation Type - the type of investigation (shooting or death)
Case Status - status of the case (open or closed)
Suspect Name - the name of the suspect involved in the incident
Suspect Race - the race of the suspect involved in the incident (W-White, B-Black)
Suspect Sex - the gender of the suspect involved in the incident
Suspect Age - the age of the suspect involved in the incident
Suspect Ethnicity - the ethnicity of the suspect involved in the incident (H-Hispanic, N-Not Hispanic)
Suspect Weapon - the type of weapon the suspect used in the incident
Officer Name - the name of the officer involved in the incident
Officer Race - the race of the officer involved in the incident (W-White, B-Black, A-Asian)
Officer Sex - the gender of the officer involved in the incident
Officer Age - the age of the officer involved in the incident
Officer Ethnicity - the ethnicity of the officer involved in the incident (H-Hispanic, N-Not Hispanic)
Officer Years of Service - the number of years the officer has been serving at the time of the incident
Lethal Y/N - whether or not the incident involved a death (Y-Yes, N-No, continued-pending)
Narrative - a description of what was determined from the investigation
Contact: Carol Boyle, carol.boyle@louisvilleky.gov
The CMS Program Statistics - Medicare Part A & Part B - All Types of Service tables provide use and payment data by type of coverage and type of service. For additional information on enrollment, providers, and Medicare use and payment, visit the CMS Program Statistics page. These data do not exist in a machine-readable format, so the view data and API options are not available. Please use the download function to access the data. Below is the list of tables:
MDCR SUMMARY AB 1. Medicare Part A and Part B Summary: Utilization, Program Payments, and Cost Sharing for All Original Medicare Beneficiaries, by Type of Coverage and Type of Service, Yearly Trend
MDCR SUMMARY AB 2. Medicare Part A and Part B Summary: Utilization, Program Payments, and Cost Sharing for Aged Original Medicare Beneficiaries, by Type of Coverage and Type of Service, Yearly Trend
MDCR SUMMARY AB 3. Medicare Part A and Part B Summary: Utilization, Program Payments, and Cost Sharing for Disabled Original Medicare Beneficiaries, by Type of Coverage and Type of Service, Yearly Trend
MDCR SUMMARY AB 4. Medicare Part A and Part B Summary: Utilization, Program Payments, and Cost Sharing for Original Medicare Beneficiaries, by Type of Coverage, Demographic Characteristics, and Medicare-Medicaid Enrollment Status
MDCR SUMMARY AB 5. Medicare Part A and Part B Summary: Utilization, Program Payments, and Cost Sharing for Original Medicare Beneficiaries, by Type of Coverage and by Area of Residence
MDCR SUMMARY AB 6. Medicare Part A and Part B Summary: Utilization and Program Payments for Original Medicare Beneficiaries, by Type of Entitlement, Amount of Program Payments, Type of Coverage, and Type of Service
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains tabular files with information about the usage preferences of speakers of Maltese English with regard to 63 pairs of lexical expressions. These pairs (e.g. truck-lorry or realization-realisation) are known to differ in usage between BrE and AmE (cf. Algeo 2006). The data were elicited with a questionnaire that asks informants to indicate whether they always use one of the two variants, prefer one over the other, have no preference, or do not use either expression (see Krug and Sell 2013 for methodological details). Usage preferences were therefore measured on a symmetric 5-point ordinal scale. Data were collected between 2008 and 2018, as part of a larger research project on lexical and grammatical variation in settings where English is spoken as a native, second, or foreign language. The current dataset, which we use for our methodological study on ordinal data modeling strategies, consists of a subset of 500 speakers that is roughly balanced on year of birth.
Abstract of the related publication: In empirical work, ordinal variables are typically analyzed using means based on numeric scores assigned to categories. While this strategy has met with justified criticism in the methodological literature, it also generates simple and informative data summaries, a standard often not met by statistically more adequate procedures. Motivated by a survey of how ordered variables are dealt with in language research, we draw attention to an un(der)used latent-variable approach to ordinal data modeling, which constitutes an alternative perspective on the most widely used form of ordered regression, the cumulative model. Since the latent-variable approach does not feature in any of the studies in our survey, we believe it is worthwhile to promote its benefits. To this end, we draw on questionnaire-based preference ratings by speakers of Maltese English, who indicated on a 5-point scale which of two synonymous expressions (e.g. package-parcel) they (tend to) use. We demonstrate that a latent-variable formulation of the cumulative model affords nuanced and interpretable data summaries that can be visualized effectively, while at the same time avoiding limitations inherent in mean response models (e.g. distortions induced by floor and ceiling effects). The online supplementary materials include a tutorial for its implementation in R.
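To make the latent-variable reading of the cumulative model a little more concrete, the sketch below shows how a position on a latent scale, together with four cut points, translates into probabilities for the five response categories. It is only an illustration under stated assumptions: it uses a probit link (the related publication may use a different one), and the threshold values and the function name `category_probs` are hypothetical, not taken from the dataset or the tutorial.

```python
from statistics import NormalDist


def category_probs(latent_mean, thresholds):
    """Category probabilities of a cumulative probit model.

    P(Y <= k) = Phi(threshold_k - latent_mean); category probabilities are
    the differences between successive cumulative probabilities.
    """
    std_normal = NormalDist()
    cumulative = [std_normal.cdf(t - latent_mean) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


# Hypothetical, increasing cut points dividing the latent scale into
# five ordered categories (a 5-point preference scale).
THRESHOLDS = [-1.5, -0.5, 0.5, 1.5]

for mu in (0.0, 1.0):
    rounded = [round(p, 3) for p in category_probs(mu, THRESHOLDS)]
    print(f"latent mean {mu}: {rounded}")
```

Shifting the latent mean moves probability mass toward one end of the scale without ever pushing a summary value beyond the scale's endpoints, which is one way to see why this formulation sidesteps floor and ceiling distortions.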
The subject matter in the five individual files which comprise the total data package is similar. SA1 presents detailed kind-of-business statistics (two-, three-, and four-digit industry levels) on number of establishments and receipts (total and with payroll), number of proprietorships and partnerships, annual and first quarter payroll, and number of paid employees. SA2 contains the same data items as above for selected services total, in addition to the number of establishments and receipts for five major kind-of-business groups. SA3 contains number of establishments and receipts for selected services total and for 130 kind-of-business classifications. SA4 presents receipts and rank by volume of receipts. SA5 statistics are given by city size for number of incorporated cities, total population, number of establishments, receipts, yearly payroll, and the percent of total by population and sales.
Each of the files has slightly different geography for which summaries are presented. SA1 has summaries for the United States, divisions, States, SCA's and SMSA's, and counties and cities with over 300 service establishments. SA2 presents summary counts for each city of 2,500 inhabitants or more and for remainder of county. SA3 has summaries for the United States, regions, divisions, and States. SA4 presents summaries for the 250 largest counties and cities. SA5 presents United States total.
Data pertain to the date of the census, 1972. The first major enumeration of Selected Service establishments covered 1933. Censuses were also taken in 1939, 1948, and in 5 year intervals since
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on this website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
This version of the dataset was compiled from the Statistical Computing Statistical Graphics 2009 Data Expo and is also available here.