51 datasets found
  1. Vital Signs: Life Expectancy – Bay Area

    • data.bayareametro.gov
    application/rdfxml +5
    Updated Mar 22, 2017
    Cite
    State of California, Department of Health: Death Records (2017). Vital Signs: Life Expectancy – Bay Area [Dataset]. https://data.bayareametro.gov/dataset/Vital-Signs-Life-Expectancy-Bay-Area/emjt-svg9
    Explore at:
    Available download formats: xml, csv, tsv, application/rssxml, json, application/rdfxml
    Dataset updated
    Mar 22, 2017
    Dataset authored and provided by
    State of California, Department of Health: Death Records
    Area covered
    San Francisco Bay Area
    Description

    VITAL SIGNS INDICATOR Life Expectancy (EQ6)

    FULL MEASURE NAME Life Expectancy

    LAST UPDATED April 2017

    DESCRIPTION Life expectancy refers to the average number of years a newborn is expected to live if mortality patterns remain the same. The measure reflects the mortality rate across a population for a point in time.

    DATA SOURCE State of California, Department of Health: Death Records (1990-2013) No link

    California Department of Finance: Population Estimates Annual Intercensal Population Estimates (1990-2010) Table P-2: County Population by Age (2010-2013) http://www.dof.ca.gov/Forecasting/Demographics/Estimates/

    CONTACT INFORMATION vitalsigns.info@mtc.ca.gov

    METHODOLOGY NOTES (across all datasets for this indicator) Life expectancy is commonly used as a measure of the health of a population. Life expectancy does not reflect how long any given individual is expected to live; rather, it is an artificial measure that captures an aspect of the mortality rates across a population. Vital Signs measures life expectancy at birth (as opposed to cohort life expectancy). A statistical model was used to estimate life expectancy for Bay Area counties and ZIP Codes based on current life tables, which require both age and mortality data. A life table shows, for each age, the survivorship of people from a certain population.

    Current life tables were created using death records and population estimates by age. The California Department of Public Health provided death records based on California death certificate information. Records include age at death and residential ZIP Code. Single-year age population estimates at the regional and county level come from the California Department of Finance population estimates and projections for ages 0-100+. Population estimates for ages 100 and over are aggregated to a single age interval. Using these data, death rates within age groups for a given year are computed to form unabridged life tables (as opposed to abridged life tables). To calculate life expectancy, the probability of dying between the jth and (j+1)st birthday is assumed uniform after age 1. Special consideration is taken to account for infant mortality.

    For the ZIP Code-level life expectancy calculation, it is assumed that postal ZIP Codes share the same boundaries as ZIP Code Tabulation Areas (ZCTAs). More information on the relationship between ZIP Codes and ZCTAs can be found at https://www.census.gov/geo/reference/zctas.html. ZIP Code-level data use three years of mortality data to make robust estimates despite small sample sizes; year 2013 ZIP Code life expectancy estimates reflect death records from 2011 through 2013. 2013 is the last year with available mortality data. Death records for ZIP Codes with zero population (like those associated with P.O. Boxes) were assigned to the nearest ZIP Code with population. ZIP Code population for the 2000 estimates comes from the Decennial Census; ZIP Code population for the 2013 estimates comes from the American Community Survey (5-year average). The ACS provides ZIP Code population by age in five-year age intervals. Single-year age population estimates were calculated by distributing population within an age interval to single-year ages using the county distribution. Counties were assigned to ZIP Codes based on majority land area.
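
    To make the life-table arithmetic above concrete, the following Python sketch builds an unabridged (single-year) period life table and returns life expectancy at birth. It assumes deaths are uniform within each age interval after age 1 and uses a common infant separation factor; it is an illustration, not the exact Vital Signs model.

    import numpy as np

    def life_expectancy_at_birth(deaths, population):
        # deaths[i], population[i]: counts at single-year age i; last age open-ended.
        m = deaths / population              # age-specific mortality rates
        a = np.full(len(m), 0.5)             # fraction of interval lived by decedents
        a[0] = 0.1                           # infants die early in the interval (assumed)
        q = m / (1 + (1 - a) * m)            # probability of dying within the interval
        q[-1] = 1.0                          # everyone dies in the open-ended interval
        l = np.empty(len(m) + 1)             # survivors at each exact age
        l[0] = 100_000                       # conventional radix
        for i in range(len(m)):
            l[i + 1] = l[i] * (1 - q[i])
        d = l[:-1] - l[1:]                   # deaths within each interval
        L = l[1:] + a * d                    # person-years lived in each interval
        L[-1] = l[-2] / m[-1]                # open-ended interval: survivors / rate
        return L.sum() / l[0]                # e0 = T0 / l0

    # Toy inputs: flat population with mortality rising linearly by age.
    ages = np.arange(0, 101)
    pop = np.full(101, 1000.0)
    deaths = pop * (0.001 + 0.0008 * ages)
    print(round(life_expectancy_at_birth(deaths, pop), 1))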

    ZIP Codes in the Bay Area vary in population from over 10,000 residents to fewer than 20 residents. Traditional life expectancy estimation methods (like those used for the regional- and county-level Vital Signs estimates) cannot be used here because they are highly inaccurate for small populations and may over- or underestimate life expectancy. To avoid inaccurate estimates, ZIP Codes with populations of less than 5,000 were aggregated with neighboring ZIP Codes until the merged areas had a population of more than 5,000. In this way, the original 305 Bay Area ZIP Codes were reduced to 218 ZIP Code areas for the 2013 estimates. Next, a form of Bayesian random-effects analysis was used, which established a prior distribution of the probability of death at each age using the regional distribution. This prior is used to shore up the life expectancy calculations where data were sparse.
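
    A hedged sketch of that aggregation rule follows; the ZIP codes, populations, and adjacency map are illustrative placeholders, since the actual Vital Signs merge procedure is not published here.

    # Greedily merge ZIP codes with population < 5,000 into a neighboring
    # area until every merged area clears the threshold. All data are toys.
    THRESHOLD = 5_000
    population = {"94001": 12_000, "94002": 800, "94003": 3_100, "94004": 9_500}
    neighbors = {"94001": {"94002"}, "94002": {"94001", "94003"},
                 "94003": {"94002", "94004"}, "94004": {"94003"}}

    areas = {z: {z} for z in population}     # each area starts as one ZIP code

    def area_of(zipcode):
        return next(a for a in areas if zipcode in areas[a])

    changed = True
    while changed:
        changed = False
        for a in list(areas):
            if a not in areas or sum(population[z] for z in areas[a]) >= THRESHOLD:
                continue
            adjacent = {area_of(n) for z in areas[a] for n in neighbors[z]} - {a}
            if not adjacent:
                continue                     # isolated area: dropped in practice
            # merge into the smallest-population neighboring area
            target = min(adjacent, key=lambda b: sum(population[z] for z in areas[b]))
            areas[target] |= areas.pop(a)
            changed = True

    for a, members in areas.items():
        print(sorted(members), sum(population[z] for z in members))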

  2. Geonames - All Cities with a population > 1000

    • public.opendatasoft.com
    • data.smartidf.services
    • +2more
    csv, excel, geojson +1
    Updated Mar 10, 2024
    + more versions
    Cite
    (2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
    Explore at:
    Available download formats: csv, json, geojson, excel
    Dataset updated
    Mar 10, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All cities with a population > 1000 or seats of administrative divisions (ca. 80,000 records).

    Sources and contributions: GeoNames aggregates over a hundred different data sources. GeoNames ambassadors help in many countries. A wiki allows users to view the data and quickly fix errors and add missing places. The costs of running GeoNames are covered by donations and sponsoring.

    Enrichment: country name added.

  3. US Census Demographic Data

    • kaggle.com
    zip
    Updated Mar 3, 2019
    Cite
    MuonNeutrino (2019). US Census Demographic Data [Dataset]. https://www.kaggle.com/muonneutrino/us-census-demographic-data
    Explore at:
    Available download formats: zip (11110116 bytes)
    Dataset updated
    Mar 3, 2019
    Authors
    MuonNeutrino
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset expands on my earlier New York City Census Data dataset. It includes data from the entire country instead of just New York City. The expanded data will allow for much more interesting analyses and will also be much more useful for supporting other datasets.

    Content

    The data here are taken from the DP03 and DP05 tables of the 2015 American Community Survey 5-year estimates. The full datasets and much more can be found at the American FactFinder website. Currently, I include two data files:

    1. acs2015_census_tract_data.csv: Data for each census tract in the US, including DC and Puerto Rico.
    2. acs2015_county_data.csv: Data for each county or county equivalent in the US, including DC and Puerto Rico.

    The two files have the same structure, with just a small difference in the name of the id column. Counties are political subdivisions, and the boundaries of some have been set for centuries. Census tracts, however, are defined by the Census Bureau and have a much more consistent size. A typical census tract has around 5,000 residents.

    The Census Bureau updates the estimates approximately every year. At least some of the 2016 data is already available, so I will likely update this in the near future.

    Acknowledgements

    The data here were collected by the US Census Bureau. As a product of the US federal government, this is not subject to copyright within the US.

    Inspiration

    There are many questions that we could try to answer with the data here. Can we predict things such as the state (classification) or household income (regression)? What kinds of clusters can we find in the data? What other datasets can be improved by the addition of census data?
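
    As a starting point for the state-classification idea above, here is a short scikit-learn sketch. The "State" label column and the "CensusId" identifier are assumptions about the schema and should be checked against the actual CSV headers.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("acs2015_county_data.csv").dropna()

    # Use all numeric columns as features; drop the assumed ID column if present.
    X = df.select_dtypes("number").drop(columns=["CensusId"], errors="ignore")
    y = df["State"]                          # assumed label column

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))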

  4. Vital Signs: Life Expectancy – by ZIP Code

    • data.bayareametro.gov
    application/rdfxml +5
    Updated Mar 22, 2017
    + more versions
    Cite
    State of California, Department of Health: Death Records (2017). Vital Signs: Life Expectancy – by ZIP Code [Dataset]. https://data.bayareametro.gov/dataset/Vital-Signs-Life-Expectancy-by-ZIP-Code/xym8-u3kc
    Explore at:
    Available download formats: tsv, json, application/rdfxml, xml, csv, application/rssxml
    Dataset updated
    Mar 22, 2017
    Dataset authored and provided by
    State of California, Department of Health: Death Records
    Description

    VITAL SIGNS INDICATOR Life Expectancy (EQ6)

    FULL MEASURE NAME Life Expectancy

    LAST UPDATED April 2017

    DESCRIPTION Life expectancy refers to the average number of years a newborn is expected to live if mortality patterns remain the same. The measure reflects the mortality rate across a population for a point in time.

    DATA SOURCE State of California, Department of Health: Death Records (1990-2013) No link

    California Department of Finance: Population Estimates Annual Intercensal Population Estimates (1990-2010) Table P-2: County Population by Age (2010-2013) http://www.dof.ca.gov/Forecasting/Demographics/Estimates/

    U.S. Census Bureau: Decennial Census ZCTA Population (2000-2010) http://factfinder.census.gov

    U.S. Census Bureau: American Community Survey 5-Year Population Estimates (2013) http://factfinder.census.gov

    CONTACT INFORMATION vitalsigns.info@mtc.ca.gov

    METHODOLOGY NOTES (across all datasets for this indicator) Life expectancy is commonly used as a measure of the health of a population. Life expectancy does not reflect how long any given individual is expected to live; rather, it is an artificial measure that captures an aspect of the mortality rates across a population that can be compared across time and populations. More information about the determinants of life expectancy that may lead to differences in life expectancy between neighborhoods can be found in the Bay Area Regional Health Inequities Initiative (BARHII) Health Inequities in the Bay Area report at http://www.barhii.org/wp-content/uploads/2015/09/barhii_hiba.pdf. Vital Signs measures life expectancy at birth (as opposed to cohort life expectancy). A statistical model was used to estimate life expectancy for Bay Area counties and ZIP Codes based on current life tables, which require both age and mortality data. A life table shows, for each age, the survivorship of people from a certain population.

    Current life tables were created using death records and population estimates by age. The California Department of Public Health provided death records based on California death certificate information. Records include age at death and residential ZIP Code. Single-year age population estimates at the regional and county level come from the California Department of Finance population estimates and projections for ages 0-100+. Population estimates for ages 100 and over are aggregated to a single age interval. Using these data, death rates within age groups for a given year are computed to form unabridged life tables (as opposed to abridged life tables). To calculate life expectancy, the probability of dying between the jth and (j+1)st birthday is assumed uniform after age 1. Special consideration is taken to account for infant mortality.

    For the ZIP Code-level life expectancy calculation, it is assumed that postal ZIP Codes share the same boundaries as ZIP Code Tabulation Areas (ZCTAs). More information on the relationship between ZIP Codes and ZCTAs can be found at http://www.census.gov/geo/reference/zctas.html. ZIP Code-level data use three years of mortality data to make robust estimates despite small sample sizes; year 2013 ZIP Code life expectancy estimates reflect death records from 2011 through 2013. 2013 is the last year with available mortality data. Death records for ZIP Codes with zero population (like those associated with P.O. Boxes) were assigned to the nearest ZIP Code with population. ZIP Code population for the 2000 estimates comes from the Decennial Census; ZIP Code population for the 2013 estimates comes from the American Community Survey (5-year average). ACS estimates are adjusted using Decennial Census data for more accurate population estimates: an adjustment factor was calculated as the ratio between the 2010 Decennial Census population estimates and the 2012 ACS 5-year (middle year 2010) population estimates. This adjustment factor is particularly important for ZCTAs with large homeless populations (not living in group quarters), where the ACS may underestimate the ZCTA population and therefore underestimate life expectancy. The ACS provides ZIP Code population by age in five-year age intervals. Single-year age population estimates were calculated by distributing population within an age interval to single-year ages using the county distribution. Counties were assigned to ZIP Codes based on majority land area.
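
    The adjustment itself is simple ratio arithmetic; the Python sketch below uses placeholder numbers rather than real published estimates.

    # Sketch of the Decennial/ACS adjustment described above (toy values).
    decennial_2010 = 28_400    # 2010 Decennial Census count for a ZCTA
    acs_2012_5yr = 26_900      # 2012 ACS 5-year estimate (middle year 2010)
    acs_2013_5yr = 27_300      # 2013 ACS 5-year estimate to be adjusted

    adjustment_factor = decennial_2010 / acs_2012_5yr
    adjusted_2013 = acs_2013_5yr * adjustment_factor
    print(round(adjustment_factor, 3), round(adjusted_2013))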

    ZIP Codes in the Bay Area vary in population from over 10,000 residents to fewer than 20 residents. Traditional life expectancy estimation methods (like those used for the regional- and county-level Vital Signs estimates) cannot be used here because they are highly inaccurate for small populations and may over- or underestimate life expectancy. To avoid inaccurate estimates, ZIP Codes with populations of less than 5,000 were aggregated with neighboring ZIP Codes until the merged areas had a population of more than 5,000. ZIP Code 94103, representing Treasure Island, was dropped from the dataset due to its small population and having no bordering ZIP Codes. In this way, the original 305 Bay Area ZIP Codes were reduced to 217 ZIP Code areas for the 2013 estimates. Next, a form of Bayesian random-effects analysis was used, which established a prior distribution of the probability of death at each age using the regional distribution. This prior is used to shore up the life expectancy calculations where data were sparse.

  5. Turkish Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Turkish Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/turkish-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Turkish Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Turkish language.

    Dataset Content & Diversity:

    Containing a total of 5,000 images, this Turkish OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image a minimum of 80% of the space contains visible Turkish text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in both portrait and landscape modes.

    All these images were captured by native Turkish speakers to ensure text quality and to avoid toxic content and PII. We used recent iOS and Android mobile devices with cameras above 5 MP to capture the images and maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, this includes fields such as device information, source type (newspaper, magazine, or book), and image orientation (portrait or landscape). Each image is named to correspond with its metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Turkish text recognition models.
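
    For example, such a metadata file could be sliced with a few lines of pandas; the file name and column names below are assumptions, since the exact schema is not documented here.

    import pandas as pd

    meta = pd.read_csv("ocr_image_metadata.csv")   # hypothetical file name

    # Select newspaper images shot in portrait orientation (assumed columns).
    newspaper_portrait = meta[(meta["source_type"] == "newspaper")
                              & (meta["image_type"] == "portrait")]
    print(len(newspaper_portrait), "newspaper portrait images")

    # Device distribution across the whole collection.
    print(meta["device"].value_counts())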

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Turkish crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding boxes, or transcribe the text in the images, to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Turkish language. Your journey to enhanced language understanding and processing starts here.

  6. French Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). French Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/french-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    French
    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the French Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the French language.

    Dataset Content & Diversity:

    Containing a total of 5,000 images, this French OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image a minimum of 80% of the space contains visible French text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in both portrait and landscape modes.

    All these images were captured by native French speakers to ensure text quality and to avoid toxic content and PII. We used recent iOS and Android mobile devices with cameras above 5 MP to capture the images and maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, this includes fields such as device information, source type (newspaper, magazine, or book), and image orientation (portrait or landscape). Each image is named to correspond with its metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of French text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native French crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding boxes, or transcribe the text in the images, to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the French language. Your journey to enhanced language understanding and processing starts here.

  7. Data from: Specimens from India at the Natural History Museum at the University of Oslo (NHM-UiO)

    • gbif.org
    Updated Mar 20, 2023
    Cite
    Fridtjof Mehlum; Fridtjof Mehlum (2023). Specimens from India at the Natural History Museum at the University of Oslo (NHM-UiO) [Dataset]. http://doi.org/10.15468/m3fqck
    Explore at:
    Dataset updated
    Mar 20, 2023
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    University of Oslo
    Authors
    Fridtjof Mehlum; Fridtjof Mehlum
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes specimens originating from India in the collections at the Natural History Museum at the University of Oslo (NHM-UiO).

    Animals: The mammal collection includes 108 specimens (mounted animals, skulls or skins, sometimes from the same individuals) from ‘India’. Insofar as these are dated at all, they originate from the 19th century. No further collecting information is preserved. These data are already available from the GBIF portal (and not included in this dataset). Oslo has 1027 bird specimens from India, both skins and mounted and demounted specimens. These are either not dated or originate from the 19th century or the first half of the 20th century. Locality is sometimes recorded at the region or district level, with relatively many specimens from Darjeeling. More detailed collecting data are missing. Most have been collected by Englishmen, some of whom have had an important role in Indian ornithology. These skins may therefore be of particular historical value. Notable are 295 skins labeled as being collected by ‘Blyth’. This name most likely refers to the English zoologist Edward Blyth (1810–1873), who was one of the founders of zoology in India (cf. Wikipedia lemma Edward Blyth). Another known name is Henry Seebohm (1832–1895), to whom twelve skins are attributed (misspelled in one case as Subohm). The bird data are not yet published in GBIF. The fish collection contains 34 databased specimens. The Staphylinidae beetle collection includes 508 specimens from India that are not yet identified to species level. The Hymenoptera collection includes 130 pinned specimens originating from the collection of Charles Thomas Bingham (1848–1908). These were collected in Sikkim. In addition, there are 7 Hymenoptera and 1 Orthoptera originating from the Deinboll collection, all labelled Trankebar. Some of these may represent types of taxa described by J.C. Fabricius (1745–1808). These collections are not yet digitised. There are virtually no Lepidoptera or Diptera from India in Oslo. Finally, the museum holds circa 10 crustacean specimens and 3 molluscs.

    Plants: There is a small digitised collection of 89 vascular plants from the states of Himachal Pradesh and Maharashtra. These were deposited by the Indian student B. Natarajan, who studied in Oslo in the 1990s. In addition, the older vascular plant type collection in Oslo has been digitised. This includes 12 older type specimens from India. Most of the herbarium has not been digitised, however. It may contain between 5,000 and 10,000 specimens from India. These are currently difficult to locate, as the herbarium is organised in taxonomic rather than geographic units. The museum intends to digitise the herbarium at a level that would enable the retrieval of taxa per continent or even per country. This enterprise is still in the planning phase, however. Likewise, the bryophyte and algae collections might contain material from India, but this can only be determined after digitisation. Some of these records are published to GBIF as a separate dataset. Oslo probably holds no Indian fungi. The digitisation of the Oslo lichen herbarium is ongoing. Currently 34 specimens from India are visible in the GBIF portal (and not included in this dataset). This number may increase to circa 100 once the entire lichen herbarium is digitised. Most of these have been collected after 1950 and have rather complete collecting data. The botanical garden in Oslo has 6 living plants originating from India.

  8. Filipino Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Filipino Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/filipino-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Filipino Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Filipino language.

    Dataset Content & Diversity:

    Containing a total of 5,000 images, this Filipino OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image a minimum of 80% of the space contains visible Filipino text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in both portrait and landscape modes.

    All these images were captured by native Filipino speakers to ensure text quality and to avoid toxic content and PII. We used recent iOS and Android mobile devices with cameras above 5 MP to capture the images and maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, this includes fields such as device information, source type (newspaper, magazine, or book), and image orientation (portrait or landscape). Each image is named to correspond with its metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Filipino text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Filipino crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding boxes, or transcribe the text in the images, to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Filipino language. Your journey to enhanced language understanding and processing starts here.

  9. Data from: Evaluating methods for estimating local effective population size with and without migration

    • borealisdata.ca
    • dataverse.scholarsportal.info
    • +2more
    Updated May 19, 2021
    + more versions
    Cite
    Kimberly Julie Gilbert; Michael C. Whitlock (2021). Data from: Evaluating methods for estimating local effective population size with and without migration [Dataset]. http://doi.org/10.5683/SP2/FY5KY3
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 19, 2021
    Dataset provided by
    Borealis
    Authors
    Kimberly Julie Gilbert; Michael C. Whitlock
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Abstract: Effective population size is a fundamental parameter in population genetics, evolutionary biology and conservation biology, yet its estimation can be fraught with difficulties. Several methods to estimate Ne from genetic data have been developed which take advantage of various approaches for inferring Ne. The ability of these methods to accurately estimate Ne, however, has not been comprehensively examined. In this study, we employ seven of the most cited methods for estimating Ne from genetic data (Colony2, CoNe, Estim, MLNe, ONeSAMP, TMVP, and NeEstimator including LDNe) across simulated datasets with populations experiencing migration or no migration. The simulated population demographies are an isolated population with no immigration, an island-model metapopulation with a sink population receiving immigrants, and an isolation-by-distance stepping-stone model of populations. We find considerable variance in the performance of these methods, both within and across demographic scenarios, with some methods performing very poorly. The most accurate estimates of Ne can be obtained by using LDNe, MLNe, or TMVP; however, each of these approaches is outperformed by another in a differing demographic scenario. Knowledge of the approximate demography of a population, as well as the availability of temporal data, largely improves Ne estimates.

    Usage notes:

    Ne500_IdealRawPopulationFiles (Ideal500_Raw.zip): The "true" population files for ideal (isolation) populations of size 500 simulated from Nemo. They contain all individuals and many extra loci, from which samples were drawn to obtain the inputs used in analyses (see Program_InputFiles). Temporal samplers used two time points, so the files are identified as belonging to generation 0 or generation 1.

    Ne5000_IdealRawPopulationFiles (Ideal5000_Raw.zip): As above, for ideal (isolation) populations of size 5000.

    Ne50_Generation0_IdealRawPopulationFiles (Ideal50_Gen0.zip) and Ne50_Generation1_IdealRawPopulationFiles (Ideal50_Gen1.zip): As above, for ideal (isolation) populations of size 50, generations 0 and 1.

    EstimationPrograms_FormattedInputFiles (Program_InputFiles.zip): The input files formatted for each analysis program; these are the population samples under analysis. See the ReadMe for further details.

    ProgramOutputFilesFor_Colony_Estim_MLNe_NeEstimator_ONeSamp_TMVP (Colony_Estim_MLNe_NeEstimator_ONeSamp_TMVP_OutputFiles.zip): The Ne estimates output by the various programs. See the readme for file naming conventions. Because of their large size, CoNe output files are stored separately.

    CoNe_Ideal_OutputFiles: Outputs for CoNe ideal (isolation) population cases. See the readme for file naming conventions.

    CoNe_Mig50_OutputFiles and CoNe_Mig500_OutputFiles: CoNe Ne estimation output files for migration scenarios with true Ne = 50 and true Ne = 500, respectively. See the same readme as the other input/output files for naming conventions.

    CoNe_IBD50_OutputFiles and CoNe_IBD500_OutputFiles: CoNe Ne estimation output files for IBD scenarios with true Ne = 50 and true Ne = 500, respectively.

    ParamFiles_ConversionAndAnalysisScripts: See the Readme files contained within each subfolder. These are the input files used for Nemo simulations (migration and IBD raw simulation files were >80GB in size when compressed, and may be requested from KJ Gilbert). These input files contain the parameters used in Nemo v2.2.0 to create the raw population files from which individuals were sampled. R scripts for file conversion to the various program inputs, as well as for analyzing the various outputs, are also included, and are also made public on GitHub at:...

  10. Dutch House Prices Dataset

    • kaggle.com
    Updated Aug 6, 2022
    Cite
    Bryan Lusse (2022). Dutch House Prices Dataset [Dataset]. https://www.kaggle.com/datasets/bryan2k19/dutch-house-prices-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 6, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bryan Lusse
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset has been created for a personal project on house price predictions. As no dataset on house prices was available in the Netherlands, I decided to create one myself. The data consists of information retrieved from the largest real estate website in the Netherlands: Funda.

    The data consists of basic descriptors such as the address and ask price, but also contains the number of rooms and bathrooms, the type of building and the year the building was constructed.

    The source code for the creation of the dataset can be found here

    Thumbnail image credit: @rarchitecture_melbourne - Unsplash

  11. Statistics on Income and Living Conditions (SILC)

    • researchdata.se
    • datacatalogue.cessda.eu
    Updated Sep 23, 2024
    + more versions
    Cite
    Statistics Sweden (2024). Statistics on Income and Living Conditions (SILC) [Dataset]. https://researchdata.se/en/catalogue/dataset/ext0001-5
    Explore at:
    Dataset updated
    Sep 23, 2024
    Dataset authored and provided by
    Statistics Sweden (http://www.scb.se/)
    Time period covered
    1975
    Area covered
    Sweden
    Description

    Statistics on Income and Living Conditions (SILC) measures and tracks the development of living conditions in Sweden. The surveys have been conducted on behalf of the Swedish Parliament since 1975, primarily through face-to-face interviews with a random sample of approximately 5,000–12,000 individuals annually from the Swedish population aged 16 and older. Through this, indicators for various welfare areas can be presented in time series that today extend about 50 years back in time. The areas include health, economy, housing, employment, well-being and trust, leisure, social relationships, civic activities, safety, and work environment. The statistics are used, among other things, for comparisons between groups, comparisons over time, and for international comparisons.

  12. Startup Failure Prediction Dataset

    • kaggle.com
    Updated Mar 31, 2025
    Cite
    Sakhare Bharat (2025). Startup Failure Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/sakharebharat/startup-failure-prediction-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sakhare Bharat
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Startup Failure Prediction Dataset

    This dataset helps explain why some startups succeed while others fail. It contains 5,000 startups from different industries and includes important details like funding, revenue, team size, and market conditions.

    What’s Inside?

    This dataset has key information about startups, including:

    Industry – Type of business (Tech, Healthcare, E-commerce, etc.)

    Startup Age – How many years the startup has been running

    Funding Amount – Total investment received

    Number of Founders – How many people started the company

    Founder Experience – Work experience of the founders

    Employees Count – Number of employees in the startup

    Revenue – How much money the startup makes

    Burn Rate – How much money the startup spends per month

    Market Size – Size of the industry (Small, Medium, Large)

    Business Model – Does the startup sell to businesses (B2B) or customers (B2C)?

    Product Uniqueness Score – How unique the startup’s product is (Scale: 1-10)

    Customer Retention Rate – Percentage of customers who return

    Marketing Expense – How much money is spent on marketing

    Startup Status – 1 = Successful, 0 = Failed (Did the startup succeed or fail?)
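
    Given the Startup Status label, a minimal baseline classifier is straightforward; in the Python sketch below, the CSV file name and exact column headers are assumptions to be matched to the actual download.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("startup_failure_prediction.csv")  # hypothetical file name

    # One-hot encode the categorical fields listed above; any other
    # non-numeric columns would need similar treatment.
    X = pd.get_dummies(df.drop(columns=["Startup Status"]),
                       columns=["Industry", "Market Size", "Business Model"])
    y = df["Startup Status"]                 # 1 = successful, 0 = failed

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))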

  13. Data from: Land-use change is associated with multi-century loss of elephant ecosystems in Asia

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jun 7, 2023
    Cite
    Shermin de Silva; Tiffany Wu; Philip Nyhus; Ashley Weaver; Alison Thieme; Josiah Johnson; Jamie Wadey; Alexander Mossbrucker; Thinh Vu; Thy Neang; Becky Shu Chen; Melissa Songer; Peter Leimgruber (2023). Land-use change is associated with multi-century loss of elephant ecosystems in Asia [Dataset]. http://doi.org/10.6076/D1P305
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    Zoological Society of London
    Vietnam National University of Forestry
    Frankfurt Zoological Society
    University of Nottingham Malaysia Campus
    Wild Earth Allies
    Smithsonian Conservation Biology Institute
    University of California, San Diego
    Colby College
    Authors
    Shermin de Silva; Tiffany Wu; Philip Nyhus; Ashley Weaver; Alison Thieme; Josiah Johnson; Jamie Wadey; Alexander Mossbrucker; Thinh Vu; Thy Neang; Becky Shu Chen; Melissa Songer; Peter Leimgruber
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Asia
    Description

    Understanding historic patterns of land use and land cover change across large temporal and spatial scales is critical for developing effective biodiversity conservation management and policy. We quantify the extent and fragmentation of suitable habitat across the continental range of Asian elephants (Elephas maximus) based on present-day occurrence data and land-use variables between 850 and 2015 A.D. We found that following centuries of relative stability, over 64% (3.36 million km2) of suitable elephant habitat across Asia was lost since the year 1700, coincident with colonial-era land-use practices in South Asia and subsequent agricultural intensification in Southeast Asia. Average patch size dropped 83%, from approximately 99,000 km2 to 16,000 km2, and the area occupied by the largest patch decreased from ~4 million km2 (45% of area) to 54,000 km2 (~7.5% of area). Whereas 100% of the area within 100 km of the current elephant range could have been considered suitable habitat in the year 1700, over half was unsuitable by 2015, driving potential conflict with people. These losses reflect long-term decline of non-forested ecosystems, exceeding estimates of deforestation within this century. Societies must consider ecological histories in addition to proximate threats to develop more just and sustainable land-use and conservation strategies.

    Methods

    Elephant occurrence data: Elephant occurrence locations were initially compiled from the Global Biodiversity Information Facility (https://www.gbif.org/), Movebank (https://www.movebank.org/) and published literature, as well as data contributed by the authors based on direct sightings, data logged via tracking devices, and camera traps (n > 5000 locations). Records were first checked visually for irrelevant points (e.g., occurrences outside the natural continental range, from GBIF) and then refined to include locations representing ecosystems where the species could conceivably flourish, including but not exclusively limited to protected areas. To minimize sampling bias that could result in model overfitting, we further subsampled the data to cover the full distribution as widely as possible while eliminating redundant points located within any particular landscape. For instance, thousands of potential redundancies from collar-based tracking datasets were removed by using only one randomly selected data point per individual, per population or landscape. Outliers from the remaining points were removed using Cook's distance to eliminate locations that could represent potential errors. The final dataset consisted of 91 occurrence points spanning the years 1996-2015, which served as training data; all data other than those from GBIF and cited literature were contributed by the authors or individuals listed in the acknowledgments. QGIS and Google Earth Pro were used to initially visualize and process the data.

    Predictor variables: We used the Land-Use Harmonization 2 (LUH2) data products [25] as our environmental variables. The LUH2 datasets provide historical reconstructions of land use and land management from 850 to 2015 CE, at annual increments. The LUH2 data products were downloaded from the University of Maryland at http://luh.umd.edu/data.shtml (LUHv2h "baseline" scenario released October 14th, 2016). They contain three types of variables gridded at 0.25° x 0.25° (approximately 30 km2 at the equator): state variables describing the land use of a grid cell for a given year, transition variables describing the changes in a grid cell from one year to the next, and management variables that describe agricultural applications such as irrigation and fertilizer use, totaling 46 variables. Of these, we selected 20 variables corresponding to all three types that were expected to be relevant to elephant habitat use based on knowledge of the species' ecology [21,22,32,81]. Using ArcGIS 10 (ESRI 2017) we extracted each variable between 850-1700 CE at 25-year increments, and annually between 1700-2015. We separately obtained elevation from the SRTM Digital Elevation Model.

    Data analysis: We limited the geographic extent of all analyses to the 13 range countries in which elephants are currently found. We used MAXENT, a maximum entropy algorithm [82], to model habitat suitability using the 'dismo' package in R (R Core Team 2017). Resulting raster files were binarized in ArcGIS into suitable and unsuitable habitat with a pixel size of approximately 20 km2. As there is no commonly accepted threshold type [84], to ensure that the specific choice of threshold did not affect the observed trends, we initially used three possible thresholds: 0.237, representing 'maximum test sensitivity plus specificity'; 0.284, corresponding to 'maximum training sensitivity plus specificity'; and 0.331, representing '10th percentile training presence'. Unless otherwise stated, for subsequent analyses we show only results using the threshold of 0.284, where everything below this threshold was classified as 'unsuitable' and everything above it was classified as 'suitable'. The resulting binary maps were re-projected using the WGS84 datum and an Albers Equal Area Conic projection. Polygons representing the known elephant range were digitized from Hedges et al. 2008 from the category labelled as "active confirmed". We refer to the areas within these polygons as "current range", and to areas outside them as "potential range". We compared the total extent of suitable habitat within and outside the current elephant range, quantifying changes over time. Country-level analyses were conducted for all countries except Indonesia and Malaysia, where the Bornean and Sumatran ranges were treated separately in recognition of the distinct subspecies in these two regions. We ranked each region based on the percentage of the current range within that region as well as the proportion of the estimated elephant population found within it, and calculated the ratio of these ranks. We calculated the total change in extent of suitable habitat by subtracting the area of suitable habitat available in 2015 from the area available in 1700, as major changes were observed within this period. We also specifically quantified the percentage of suitable habitat found within a 100 km buffer of the current range polygons in both years. We then calculated fragmentation statistics using the program FRAGSTATS v.4.2 [88]. These metrics characterize changes to the spatial configuration of habitat in addition to their absolute extent. We used a 'no sampling' strategy with the search radius and threshold distance set to 61 km (approximately three pixel lengths) based on the movement and dispersal capacity of elephants. See the associated paper for references, tables and figures.
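
    The binarization step is easy to illustrate. The Python sketch below uses rasterio and NumPy as stand-ins for the ArcGIS workflow described above, with a hypothetical file name and the 0.284 threshold from the methods.

    import numpy as np
    import rasterio

    THRESHOLD = 0.284   # "maximum training sensitivity plus specificity"

    with rasterio.open("suitability_1700.tif") as src:  # hypothetical file
        suitability = src.read(1)
        profile = src.profile

    # Below the threshold -> unsuitable (0); at or above -> suitable (1).
    binary = (suitability >= THRESHOLD).astype(np.uint8)

    # Rough suitable area, assuming ~20 km2 pixels as in the methods text.
    print("suitable km2:", int(binary.sum()) * 20)

    profile.update(dtype=rasterio.uint8, count=1)
    with rasterio.open("suitability_1700_binary.tif", "w", **profile) as dst:
        dst.write(binary, 1)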

  14. Chinese Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Chinese Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/chinese-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the Chinese Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Chinese language.

    Dataset Content & Diversity:

    Containing a total of 5,000 images, this Chinese OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image a minimum of 80% of the space contains visible Chinese text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in both portrait and landscape modes.

    All these images were captured by native Chinese speakers to ensure text quality and to avoid toxic content and PII. We used recent iOS and Android mobile devices with cameras above 5 MP to capture the images and maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, this includes fields such as device information, source type (newspaper, magazine, or book), and image orientation (portrait or landscape). Each image is named to correspond with its metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of Chinese text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native Chinese crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding boxes, or transcribe the text in the images, to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the Chinese language. Your journey to enhanced language understanding and processing starts here.

  15. English Newspaper, Magazine, and Books OCR Image Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English Newspaper, Magazine, and Books OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/english-newspaper-book-magazine-ocr-image-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Introducing the English Newspaper, Books, and Magazine Image Dataset - a diverse and comprehensive collection of images meticulously curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the English language.

    Dataset Content & Diversity:

    Containing a total of 5,000 images, this English OCR dataset offers an equal distribution across newspapers, books, and magazines. Within, you'll find a diverse collection of content, including articles, advertisements, cover pages, headlines, callouts, and author sections from a variety of newspapers, books, and magazines. Images in this dataset showcase distinct fonts, writing formats, colors, designs, and layouts.

    To ensure the diversity of the dataset and to build robust text recognition models, we allow only a limited number (fewer than five) of unique images from a single source. Stringent measures have been taken to exclude any personally identifiable information (PII), and in each image a minimum of 80% of the space contains visible English text.

    Images have been captured under varying lighting conditions – both day and night – along with different capture angles and backgrounds, further enhancing dataset diversity. The collection features images in both portrait and landscape modes.

    All these images were captured by native English speakers to ensure text quality and to avoid toxic content and PII. We used recent iOS and Android mobile devices with cameras above 5 MP to capture the images and maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.

    Metadata:

    Along with the image data, you will also receive detailed structured metadata in CSV format. For each image, this includes fields such as device information, source type (newspaper, magazine, or book), and image orientation (portrait or landscape). Each image is named to correspond with its metadata.

    The metadata serves as a valuable tool for understanding and characterizing the data, facilitating informed decision-making in the development of English text recognition models.

    Update & Custom Collection:

    We're committed to expanding this dataset by continuously adding more images with the assistance of our native English language crowd community.

    If you require a custom dataset tailored to your guidelines or specific device distribution, feel free to contact us. We're equipped to curate specialized data to meet your unique needs.

    Furthermore, we can annotate or label the images with bounding boxes, or transcribe the text in the images, to align with your specific requirements using our crowd community.

    License:

    This Image dataset, created by FutureBeeAI, is now available for commercial use.

    Conclusion:

    Leverage the power of this image dataset to elevate the training and performance of text recognition, text detection, and optical character recognition models within the realm of the English language. Your journey to enhanced language understanding and processing starts here.

  16. m

    Malayalam Facebook bAbI tasks

    • data.mendeley.com
    Updated Oct 24, 2024
    + more versions
    Cite
    Bibin Puthusseril (2024). Malayalam Facebook bAbI tasks [Dataset]. http://doi.org/10.17632/55gfy4j3pc.1
    Explore at:
    Dataset updated
    Oct 24, 2024
    Authors
    Bibin Puthusseril
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Malayalam question answering dataset of 5,000 training samples and 5,000 testing samples was generated by translating Facebook bAbI tasks. Facebook's bAbI tasks were originally created in English and have since been translated into several languages, including French, German, Hindi, Chinese, and Russian. The original dataset comprises twenty fictitious tasks that test a system's capacity to respond across a range of themes, including text comprehension and reasoning. Five task-oriented usability questions with comparable sentence patterns are also included in the collection, and the questions range in difficulty. Each task has 1,000 training samples and 1,000 test samples. We created the dataset for the proposed work by using the bAbI dataset to translate the English data into Malayalam for five tasks (tasks 1, 4, 11, 12, and 13), represented here as tasks 1 through 5. The tasks carry titles such as "Single Supporting Fact," "Two Argument Relations," "Basic Coreference," "Conjunction," and "Compound Coreference." Every sample in the dataset comprises a series of statements (sometimes called stories) about people's movements around objects, a question, and a suitable answer.

    Tasks:

    Task 1: Single supporting fact: Tests whether a model can identify a single important fact from a story to answer a question. The story usually contains several sentences, but only one sentence is directly useful in answering the question.

    Task 2: Two argument relations: Involves understanding the relationship between two entities. The model must infer relationships between pairs of objects, people, or places.

    Task 3: Basic coreference: Coreference resolution is the task of linking pronouns or phrases to the correct entities. Here, the model must resolve simple pronominal references.

    Task 4: Conjunctions: Tests the model's ability to understand sentences in which several actions or facts are joined by conjunctions such as "and" or "or". The model must process these linked statements to answer the questions correctly.

    Task 5: Compound coreference: More complex, because it requires the model to resolve references in sentences involving composite entities or more complex structures.
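    The original bAbI releases use a simple plain-text layout: each line begins with a 1-based index that resets at the start of a new story, and question lines carry a tab-separated question, answer, and supporting-fact IDs. Assuming this translated dataset keeps that format (an assumption; verify against the downloaded files), a minimal parser might look like:

    def parse_babi(path):
        """Parse a bAbI-format file into (story, question, answer) samples.

        Assumes the original bAbI layout: a line index that resets to 1 at
        each new story, and question lines holding tab-separated question,
        answer, and supporting-fact IDs.
        """
        samples, story = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                idx, _, text = line.strip().partition(" ")
                if idx == "1":
                    story = []  # index reset marks the start of a new story
                if "\t" in text:
                    question, answer, _ = text.split("\t")
                    samples.append((list(story), question, answer))
                else:
                    story.append(text)
        return samples

    # samples = parse_babi("ml_task1_train.txt")  # hypothetical filename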

  17. P

    CIFAR-10 Dataset

    • paperswithcode.com
    • opendatalab.com
    • +4more
    Updated Jun 14, 2009
    + more versions
    Cite
    Krizhevsky (2009). CIFAR-10 Dataset [Dataset]. https://paperswithcode.com/dataset/cifar-10
    Explore at:
    Dataset updated
    Jun 14, 2009
    Authors
    Krizhevsky
    Description

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
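    In the widely used Python (pickle) distribution of CIFAR-10, each batch file is a pickled dict whose b'data' entry holds 10000 rows of 3072 bytes (the three colour planes of a 32x32 image) and whose b'labels' entry holds the class indices. A minimal loader, assuming that distribution:

    import pickle
    import numpy as np

    def load_cifar10_batch(path):
        """Load one CIFAR-10 batch file from the Python/pickle distribution."""
        with open(path, "rb") as f:
            batch = pickle.load(f, encoding="bytes")
        data = batch[b"data"]                # shape (10000, 3072), dtype uint8
        labels = np.array(batch[b"labels"])  # shape (10000,)
        # Each row stores the R, G, B planes of a 32x32 image; reshape to HWC.
        images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
        return images, labels

    # images, labels = load_cifar10_batch("cifar-10-batches-py/data_batch_1")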

  18. UAE Real Estate 2024 Dataset

    • kaggle.com
    Updated Aug 20, 2024
    Cite
    Kanchana1990 (2024). UAE Real Estate 2024 Dataset [Dataset]. http://doi.org/10.34740/kaggle/ds/5567442
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 20, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kanchana1990
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Area covered
    United Arab Emirates
    Description

    Dataset Overview

    This dataset provides a detailed snapshot of real estate properties listed in Dubai, UAE, as of August 2024. The dataset includes over 5,000 listings scraped using the Apify API from Propertyfinder and various other real estate websites in the UAE. The data includes key details such as the number of bedrooms and bathrooms, price, location, size, and whether the listing is verified. All personal identifiers, such as agent names and contact details, have been ethically removed.

    Data Science Applications

    Given the size and structure of this dataset, it is ideal for the following data science applications:

    • Price Prediction Models: Predicting the price of properties based on features like location, size, and furnishing status.
    • Market Analysis: Understanding trends in the Dubai real estate market by analyzing price distributions, property types, and locations.
    • Recommendation Systems: Developing systems to recommend properties based on user preferences (e.g., number of bedrooms, budget).
    • Sentiment Analysis: Extracting and analyzing sentiments from the property descriptions to gauge the market's tone.

    This dataset provides a practical foundation for both beginners and experts in data science, allowing for the exploration of real estate trends, development of predictive models, and implementation of machine learning algorithms.

    Column Descriptors

    • title: The listing's title, summarizing the key selling points of the property.
    • displayAddress: The public address of the property, including the community and city.
    • bathrooms: The number of bathrooms available in the property.
    • bedrooms: The number of bedrooms available in the property.
    • addedOn: The timestamp indicating when the property was added to the listing platform.
    • type: Specifies whether the property is residential, commercial, etc.
    • price: The listed price of the property in AED.
    • verified: A boolean value indicating whether the listing has been verified by the platform.
    • priceDuration: Indicates if the property is listed for sale or rent.
    • sizeMin: The minimum size of the property in square feet.
    • furnishing: Describes whether the property is furnished, unfurnished, or partially furnished.
    • description: A more detailed narrative about the property, including additional features and selling points.
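    As a sketch of the price-prediction application mentioned above, the following uses pandas and scikit-learn on the columns just described. The filename is hypothetical, and some columns (for example sizeMin or price) may need additional cleaning if they are stored as text with units; treat this as a starting point, not a definitive pipeline.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Hypothetical filename; column names follow the descriptors above.
    df = pd.read_csv("uae_real_estate_2024.csv")

    # Coerce numeric columns; listings sometimes store these as text.
    for col in ["bedrooms", "bathrooms", "sizeMin", "price"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df = df.dropna(subset=["bedrooms", "bathrooms", "sizeMin", "price"])

    # One-hot encode furnishing status and assemble the feature matrix.
    X = pd.get_dummies(df[["bedrooms", "bathrooms", "sizeMin", "furnishing"]],
                       columns=["furnishing"])
    y = df["price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print(f"R^2 on held-out listings: {model.score(X_test, y_test):.3f}")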

    Ethically Mined Data

    This dataset was ethically scraped using the Apify API, ensuring compliance with data privacy standards. All personal data such as agent names, phone numbers, and any other sensitive information have been omitted from this dataset to ensure privacy and ethical use. The data is intended solely for educational purposes and should not be used for commercial activities.

    Acknowledgements

    This dataset was made possible thanks to the following:

    • Apify: For providing the API to ethically scrape the data.
    • Propertyfinder and various other real estate websites in the UAE for the original listings.
    • Kaggle: For providing the platform to share and analyze this dataset.

    • Photo by: Francesca Tosolini on Unsplash

    Use the Data Responsibly

    Please ensure that this dataset is used responsibly, with respect to privacy and data ethics. This data is provided for educational purposes.

  19. Data from: National Health and Nutrition Examination Survey (NHANES),...

    • icpsr.umich.edu
    ascii, delimited, sas +2
    Updated Feb 22, 2012
    + more versions
    Cite
    United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics (2012). National Health and Nutrition Examination Survey (NHANES), 1999-2000 [Dataset]. http://doi.org/10.3886/ICPSR25501.v4
    Explore at:
    delimited, spss, ascii, sas, stataAvailable download formats
    Dataset updated
    Feb 22, 2012
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/25501/terms

    Time period covered
    1999 - 2000
    Area covered
    United States
    Description

    The National Health and Nutrition Examination Surveys (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The NHANES combines personal interviews and physical examinations, which focus on different population groups or health topics. These surveys were conducted by the National Center for Health Statistics (NCHS) on a periodic basis from 1971 to 1994. In 1999 the NHANES became a continuous program with a changing focus on a variety of health and nutrition measurements designed to meet current and emerging concerns. The surveys examine a nationally representative sample of approximately 5,000 persons each year, located in counties across the United States, 15 of which are visited each year. The 1999-2000 NHANES contains data for 9,965 individuals of all ages (with an MEC-examined sample size of 9,282). Many questions that were asked in NHANES II (1976-1980), Hispanic HANES (1982-1984), and NHANES III (1988-1994) were combined with new questions in the NHANES 1999-2000.

    The 1999-2000 NHANES collected data on the prevalence of selected chronic conditions and diseases in the population, with estimates for previously undiagnosed conditions as well as those known to and reported by respondents. Risk factors, those aspects of a person's lifestyle, constitution, heredity, or environment that may increase the chances of developing a certain disease or condition, were examined. Data on smoking, alcohol consumption, sexual practices, drug use, physical fitness and activity, weight, and dietary intake were collected, along with information on certain aspects of reproductive health, such as use of oral contraceptives and breastfeeding practices. The interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests.

    Demographic data file variables are grouped into three broad categories:

    (1) Status Variables: Provide core information on the survey participant. Examples include interview status, examination status, and sequence number. (The sequence number is a unique ID assigned to each sample person and is required to match the information on this demographic file to the rest of the NHANES 1999-2000 data.)

    (2) Recoded Demographic Variables: These include age (age in months for persons through age 19 years, 11 months; age in years for 1-84 year olds; and a top-coded age group of 85+ years), gender, a race/ethnicity variable, an education variable (high school, and more than high school education), country of birth (United States, Mexico, or other foreign born), and a pregnancy status variable. Some of the groupings were made due to limited sample sizes for the two-year dataset.

    (3) Interview and Examination Sample Weight Variables: Sample weights are available for analyzing NHANES 1999-2000 data.

    For a complete listing of survey contents for all years of the NHANES, see the document Survey Content -- NHANES 1999-2010.
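    Since the sequence number is the join key across the 1999-2000 files, a typical first step is merging the demographic file with another component. A minimal sketch, assuming delimited exports with the conventional NHANES column name SEQN (filenames here are hypothetical; verify the key column in your download):

    import pandas as pd

    # Hypothetical filenames for delimited exports of two NHANES 1999-2000 files.
    demo = pd.read_csv("nhanes_1999_2000_demographics.csv")
    exam = pd.read_csv("nhanes_1999_2000_examination.csv")

    # SEQN is the conventional NHANES name for the unique sequence number
    # assigned to each sample person; confirm it against your export.
    merged = demo.merge(exam, on="SEQN", how="inner")
    print(merged.shape)

    Note that any substantive analysis should also apply the interview or examination sample weights described above rather than treating the merged records as a simple random sample.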

  20. National Vegetation Information System (NVIS) Version 5.1 - Australia -...

    • data.gov.au
    • data.wu.ac.at
    html
    Updated Aug 9, 2023
    + more versions
    Cite
    Australian Government Department of Climate Change, Energy, the Environment and Water (2023). National Vegetation Information System (NVIS) Version 5.1 - Australia - Uncertainty Layer (Scale) [Dataset]. https://www.data.gov.au/data/dataset/activity/national-vegetation-information-system-nvis-version-5-1-australia-uncertainty-layer-scale
    Explore at:
    htmlAvailable download formats
    Dataset updated
    Aug 9, 2023
    Dataset provided by
    Australian Government (http://www.australia.gov.au/)
    Department of Climate Change, Energy, the Environment and Water of Australia (https://www.dcceew.gov.au/)
    Authors
    Australian Government Department of Climate Change, Energy, the Environment and Water
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Australia
    Description

    Raster layer delineating the level of uncertainty depending on the data scale.

    General Background: The National Vegetation Information System (NVIS) has been developed over the last 15 years from a variety of data sources. The Commonwealth relies on states and territories to supply vegetation data, which is then collated into the NVIS. Because of the varying levels of resources available to different regions, the data they are able to supply varies in scale, age, origin, and spatial mix. NVIS is widely used by various organisations, including researchers, NGOs, and governments, for a wide range of uses such as modelling, fire fuel loads, and policy. To improve the interpretation of the NVIS dataset, the Dept of the Environment & Energy has developed a suite of uncertainty layers and tools to facilitate the robust use of the NVIS.

    Aim: To develop a raster uncertainty layer that better describes the input datasets and assists in dynamic analyses of data quality and gaps in the NVIS extant theme dataset.

    Output: Raster files scaled 0-1 that can be used for uncertainty, covering data age, origin, scale, and spatial mix. The layers can be added together to form one uncertainty layer. Python scripts/tools will also be provided so individuals can develop their own uncertainty layers, keeping the process as transparent as possible.

    The uncertainty product comprises 5 rasters, 1 composite and 4 individual layers:

    • NVIS Uncertainty layer (composite of the 4 layers below)
    • Age: age of the NVIS data
    • Origin: where the data came from
    • Spatial mix: whether the data is a mosaic
    • Scale: scale at which the data was mapped

    Age, scale, origin, and spatial mix information is supplied by the states and territories as part of the NVIS resupply process. This dataset represents the scale of uncertainty in relation to data scale. Scale categories are given below.

    Table 1. Scale Categories

    Scale range              Value   Uncertainty category
    Other                    0       Other
    10,000,000 - 1,000,001   0.8     VHigh
    1,000,000 - 100,001      0.85    High
    100,000 - 25,001         0.9     Med
    25,000 - 5,001           0.95    Low
    5,000 or finer           1       VLow

    CC - Attribution (CC BY): This data has been licensed under the Creative Commons Attribution 3.0 Australia Licence. More information can be found at http://www.ausgoal.gov.au/creative-commons. © Commonwealth of Australia (Department of the Environment and Energy) 2018
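    The description notes that the 0-1 rasters can be added together to form one uncertainty layer. A minimal sketch of that combination with rasterio follows; the layer filenames are hypothetical, the layers are assumed to share a common grid and extent, and the department's own Python tools remain the authoritative route.

    import numpy as np
    import rasterio

    # Hypothetical filenames for the four individual 0-1 uncertainty rasters.
    layer_paths = ["nvis_age.tif", "nvis_origin.tif",
                   "nvis_spatial_mix.tif", "nvis_scale.tif"]

    arrays = []
    for path in layer_paths:
        with rasterio.open(path) as src:
            arrays.append(src.read(1).astype("float32"))
            profile = src.profile  # assumes all layers share grid and extent

    # "Added together" per the description; divide by 4 if a 0-1 composite is wanted.
    combined = np.sum(arrays, axis=0)

    profile.update(dtype="float32", count=1)
    with rasterio.open("nvis_uncertainty_composite.tif", "w", **profile) as dst:
        dst.write(combined, 1)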

Vital Signs: Life Expectancy – Bay Area


Zip codes in the Bay Area vary in population from over 10,000 residents to fewer than 20 residents. Traditional life expectancy estimation (like that used for the regional- and county-level Vital Signs estimates) cannot be used for such small populations because it is highly inaccurate and may over- or underestimate life expectancy. To avoid inaccurate estimates, Zip codes with populations of less than 5,000 were aggregated with neighboring Zip codes until the merged areas had a population of more than 5,000. In this way, the original 305 Bay Area Zip codes were reduced to 218 Zip code areas for 2013 estimates. Next, a form of Bayesian random-effects analysis was used which established a prior distribution of the probability of death at each age using the regional distribution. This prior is used to shore up the life expectancy calculations where data were sparse.
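The aggregation step is essentially a greedy merge: any area below the 5,000-resident threshold is combined with a neighbor until every merged area clears it. The toy sketch below illustrates that logic only; the data structures are hypothetical and the actual Vital Signs processing is not published here.

def merge_small_areas(populations, neighbors, threshold=5000):
    """Greedily merge areas below `threshold` into a neighboring area.

    populations: dict mapping area id -> population
    neighbors:   dict mapping area id -> set of adjacent area ids
    Returns a dict mapping each original area id to its merged-group id.
    """
    group = {zc: zc for zc in populations}  # each area starts as its own group
    group_pop = dict(populations)

    def root(zc):
        while group[zc] != zc:
            zc = group[zc]
        return zc

    changed = True
    while changed:
        changed = False
        for zc in populations:
            r = root(zc)
            if group_pop[r] >= threshold:
                continue
            # Merge the undersized group into its most populous neighboring group.
            candidates = {root(n) for n in neighbors[zc]} - {r}
            if not candidates:
                continue
            target = max(candidates, key=group_pop.get)
            group[r] = target
            group_pop[target] += group_pop.pop(r)
            changed = True
    return {zc: root(zc) for zc in populations}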
