30 datasets found
  1. Name Census top 100 surnames

    • kaggle.com
    Updated Mar 25, 2023
    Cite
    Name Census (2023). Name Census top 100 surnames [Dataset]. https://www.kaggle.com/datasets/namecensus/name-census-top-100-surnames
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset provided by
    Kaggle
    Authors
    Name Census
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Name Census top 100 surnames

    In 2018 we were doing a research project, and we needed to know if a name was male or female. After Googling for hours for 'baby name lists', 'name databases' and 'name datasets', we discovered that there wasn't a complete name database for all countries with first names and gender. Most name databases we found had a different layout per country, were incomplete, or contained non-existent names. That is why we created Name Census, the most comprehensive name database in the world! The Name Census top 100 database is a free database containing the top 100 first names and top 100 surnames for each country.

    Collection methodology

    Our name database is created using first names and surnames obtained from governments, cross-referenced with millions of names from publicly available social media profiles. We took all those names and used the social media profiles to cross-reference and count each name per country. This way we were sure that the names in our name database are actually used, and we could create our popularity metric. We now offer the complete name database and the name parsing service as separate services.

    Content

    The Name Census top 100 is a name database that consists of two files: the top 100 first names per country and the top 100 surnames per country. Each file is a CSV file encoded in UTF-8.

  2. Geonames - All Cities with a population > 1000

    • public.opendatasoft.com
    • data.smartidf.services
    • +2more
    csv, excel, geojson +1
    Updated Mar 10, 2024
    Cite
    (2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
    Explore at:
    Available download formats: csv, json, geojson, excel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All cities with a population > 1000 or seats of adm div (ca 80.000).

    Sources and Contributions

    • Sources: GeoNames aggregates over a hundred different data sources.
    • Ambassadors: GeoNames Ambassadors help in many countries.
    • Wiki: A wiki lets users view the data, quickly fix errors, and add missing places.
    • Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.

    Enrichment: add country name

  3. Global Country Information 2023

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jun 15, 2024
    Cite
    Nidula Elgiriyewithana (2024). Global Country Information 2023 [Dataset]. http://doi.org/10.5281/zenodo.8165229
    Explore at:
    Available download formats: csv
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nidula Elgiriyewithana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.

    Key Features

    • Country: Name of the country.
    • Density (P/Km2): Population density measured in persons per square kilometer.
    • Abbreviation: Abbreviation or code representing the country.
    • Agricultural Land (%): Percentage of land area used for agricultural purposes.
    • Land Area (Km2): Total land area of the country in square kilometers.
    • Armed Forces Size: Size of the armed forces in the country.
    • Birth Rate: Number of births per 1,000 population per year.
    • Calling Code: International calling code for the country.
    • Capital/Major City: Name of the capital or major city.
    • CO2 Emissions: Carbon dioxide emissions in tons.
    • CPI: Consumer Price Index, a measure of inflation and purchasing power.
    • CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.
    • Currency_Code: Currency code used in the country.
    • Fertility Rate: Average number of children born to a woman during her lifetime.
    • Forested Area (%): Percentage of land area covered by forests.
    • Gasoline_Price: Price of gasoline per liter in local currency.
    • GDP: Gross Domestic Product, the total value of goods and services produced in the country.
    • Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.
    • Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.
    • Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.
    • Largest City: Name of the country's largest city.
    • Life Expectancy: Average number of years a newborn is expected to live.
    • Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.
    • Minimum Wage: Minimum wage level in local currency.
    • Official Language: Official language(s) spoken in the country.
    • Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.
    • Physicians per Thousand: Number of physicians per thousand people.
    • Population: Total population of the country.
    • Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.
    • Tax Revenue (%): Tax revenue as a percentage of GDP.
    • Total Tax Rate: Overall tax burden as a percentage of commercial profits.
    • Unemployment Rate: Percentage of the labor force that is unemployed.
    • Urban Population: Percentage of the population living in urban areas.
    • Latitude: Latitude coordinate of the country's location.
    • Longitude: Longitude coordinate of the country's location.

    Potential Use Cases

    • Analyze population density and land area to study spatial distribution patterns.
    • Investigate the relationship between agricultural land and food security.
    • Examine carbon dioxide emissions and their impact on climate change.
    • Explore correlations between economic indicators such as GDP and various socio-economic factors.
    • Investigate educational enrollment rates and their implications for human capital development.
    • Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.
    • Study labor market dynamics through indicators such as labor force participation and unemployment rates.
    • Investigate the role of taxation and its impact on economic development.
    • Explore urbanization trends and their social and environmental consequences.
  4. COVID Impact Survey - Public Data

    • data.world
    csv, zip
    Updated Oct 16, 2024
    Cite
    The Associated Press (2024). COVID Impact Survey - Public Data [Dataset]. https://data.world/associatedpress/covid-impact-survey-public-data
    Explore at:
    Available download formats: csv, zip
    Authors
    The Associated Press
    Description

    Overview

    The Associated Press is sharing data from the COVID Impact Survey, which provides statistics about physical health, mental health, economic security and social dynamics related to the coronavirus pandemic in the United States.

    Conducted by NORC at the University of Chicago for the Data Foundation, the probability-based survey provides estimates for the United States as a whole, as well as in 10 states (California, Colorado, Florida, Louisiana, Minnesota, Missouri, Montana, New York, Oregon and Texas) and eight metropolitan areas (Atlanta, Baltimore, Birmingham, Chicago, Cleveland, Columbus, Phoenix and Pittsburgh).

    The survey is designed to allow for an ongoing gauge of public perception, health and economic status to see what is shifting during the pandemic. When multiple sets of data are available, it will allow for the tracking of how issues ranging from COVID-19 symptoms to economic status change over time.

    The survey is focused on three core areas of research:

    • Physical Health: Symptoms related to COVID-19, relevant existing conditions and health insurance coverage.
    • Economic and Financial Health: Employment, food security, and government cash assistance.
    • Social and Mental Health: Communication with friends and family, anxiety and volunteerism. (Questions based on those used on the U.S. Census Bureau’s Current Population Survey.)

    Using this Data - IMPORTANT

    This is survey data and must be properly weighted during analysis: DO NOT REPORT THIS DATA AS RAW OR AGGREGATE NUMBERS!!

    Instead, use our queries linked below or statistical software such as R or SPSS to weight the data.
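To illustrate why weighting matters, here is a minimal Python sketch of a weighted share calculation. It is not the AP's or NORC's procedure: the column name `weight`, the answer labels, and the sample rows are all hypothetical, and only the variable name `soc5c` comes from the text.

```python
from collections import defaultdict


def weighted_shares(rows, answer_key, weight_key):
    """Return each answer's share of the total survey weight."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[answer_key]] += float(row[weight_key])
    grand_total = sum(totals.values())
    return {answer: weight / grand_total for answer, weight in totals.items()}


# Made-up respondents; "weight" is a hypothetical column name.
sample = [
    {"soc5c": "None", "weight": 2.0},
    {"soc5c": "None", "weight": 1.0},
    {"soc5c": "1-2 days", "weight": 0.5},
    {"soc5c": "3-4 days", "weight": 0.5},
]

shares = weighted_shares(sample, "soc5c", "weight")
for answer, share in shares.items():
    print(f"{answer}: {share:.1%}")
```

In this toy sample, half of the raw respondents answered "None", but they carry three quarters of the total weight, which is the kind of gap the warning above is about.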

    Queries

    If you'd like to create a table to see how people nationally or in your state or city feel about a topic in the survey, use the survey questionnaire and codebook to match a question (the variable label) to a variable name. For instance, "How often have you felt lonely in the past 7 days?" is variable "soc5c".

    Nationally: Go to this query and enter soc5c as the variable. Hit the blue Run Query button in the upper right hand corner.

    Local or State: To find figures for that response in a specific state, go to this query and type in a state name and soc5c as the variable, and then hit the blue Run Query button in the upper right hand corner.

    The resulting sentence you could write out of these queries is: "People in some states are less likely to report loneliness than others. For example, 66% of Louisianans report feeling lonely on none of the last seven days, compared with 52% of Californians. Nationally, 60% of people said they hadn't felt lonely."

    Margin of Error

    The margin of error for the national and regional surveys is found in the attached methods statement. You will need the margin of error to determine if the comparisons are statistically significant. If the difference is:

    • At least twice the margin of error, you can report there is a clear difference.
    • At least as large as the margin of error, you can report there is a slight or apparent difference.
    • Less than or equal to the margin of error, you can report that the respondents are divided or there is no difference.

    A Note on Timing

    Survey results will generally be posted under embargo on Tuesday evenings. The data is available for release at 1 p.m. ET Thursdays.
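The margin-of-error rules listed above can be sketched as a small Python helper. The function name is hypothetical, and the margin used in the example call is a placeholder: the real values are in the methods statement. The rules as written overlap when the difference exactly equals the margin, so this sketch assumes the more cautious reading for that case.

```python
def describe_difference(pct_a, pct_b, margin_of_error):
    """Classify the gap between two estimates using the rules in the text."""
    if margin_of_error <= 0:
        raise ValueError("margin_of_error must be positive")
    diff = abs(pct_a - pct_b)
    if diff >= 2 * margin_of_error:
        return "clear difference"
    # Assumption: a gap exactly equal to the margin is treated as
    # "no difference", because the source rules overlap at that point.
    if diff > margin_of_error:
        return "slight or apparent difference"
    return "divided or no difference"


# 66% and 52% are the figures quoted in the text; 5.0 is a placeholder margin.
print(describe_difference(66, 52, 5.0))
```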

    About the Data

    The survey data will be provided under embargo in both comma-delimited and statistical formats.

    Each set of survey data will be numbered and have the date the embargo lifts in front of it in the format of: 01_April_30_covid_impact_survey. The survey has been organized by the Data Foundation, a non-profit non-partisan think tank, and is sponsored by the Federal Reserve Bank of Minneapolis and the Packard Foundation. It is conducted by NORC at the University of Chicago, a non-partisan research organization. (NORC is not an abbreviation, it is part of the organization's formal name.)

    Data for the national estimates are collected using the AmeriSpeak Panel, NORC’s probability-based panel designed to be representative of the U.S. household population. Interviews are conducted with adults age 18 and over representing the 50 states and the District of Columbia. Panel members are randomly drawn from AmeriSpeak with a target of achieving 2,000 interviews in each survey. Invited panel members may complete the survey online or by telephone with an NORC telephone interviewer.

    Once all the study data have been made final, an iterative raking process is used to adjust for any survey nonresponse as well as any noncoverage or under and oversampling resulting from the study specific sample design. Raking variables include age, gender, census division, race/ethnicity, education, and county groupings based on county level counts of the number of COVID-19 deaths. Demographic weighting variables were obtained from the 2020 Current Population Survey. The count of COVID-19 deaths by county was obtained from USA Facts. The weighted data reflect the U.S. population of adults age 18 and over.
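For readers unfamiliar with raking, the following Python sketch shows the general idea (iterative proportional fitting on one variable at a time). It is an illustration only, not NORC's implementation: the function names, the two raking variables, the sample rows, and the target shares are all made up.

```python
def rake(rows, targets, max_iter=100, tol=1e-9):
    """Adjust unit weights until weighted shares match the target shares.

    rows:    list of dicts, one per respondent
    targets: {variable: {category: population share}}, shares summing to 1
    """
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        largest_change = 0.0
        for variable, shares in targets.items():
            total = sum(weights)
            current = {}
            for row, weight in zip(rows, weights):
                current[row[variable]] = current.get(row[variable], 0.0) + weight
            # Scale every respondent so this variable's margins hit the targets.
            for i, row in enumerate(rows):
                category = row[variable]
                factor = shares[category] * total / current[category]
                largest_change = max(largest_change, abs(factor - 1.0))
                weights[i] *= factor
        if largest_change < tol:
            break
    return weights


def weighted_share(rows, weights, variable, category):
    """Share of total weight held by respondents in one category."""
    matching = sum(w for row, w in zip(rows, weights) if row[variable] == category)
    return matching / sum(weights)


# Hypothetical sample that over-represents younger women.
sample_rows = [
    {"gender": "F", "age": "18-44"},
    {"gender": "F", "age": "18-44"},
    {"gender": "F", "age": "45+"},
    {"gender": "M", "age": "18-44"},
    {"gender": "M", "age": "45+"},
]
# Hypothetical population shares for each raking variable.
sample_targets = {
    "gender": {"F": 0.5, "M": 0.5},
    "age": {"18-44": 0.5, "45+": 0.5},
}

raked = rake(sample_rows, sample_targets)
print([round(w, 3) for w in raked])
```

Each pass fixes one variable's margins and slightly disturbs the others, which is why the adjustment is repeated until the weights stop changing.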

    Data for the regional estimates are collected using a multi-mode address-based (ABS) approach that allows residents of each area to complete the interview via web or with an NORC telephone interviewer. All sampled households are mailed a postcard inviting them to complete the survey either online using a unique PIN or via telephone by calling a toll-free number. Interviews are conducted with adults age 18 and over with a target of achieving 400 interviews in each region in each survey.

    Additional details on the survey methodology and the survey questionnaire are attached below or can be found at https://www.covid-impact.org.

    Attribution

    Results should be credited to the COVID Impact Survey, conducted by NORC at the University of Chicago for the Data Foundation.

    AP Data Distributions

    ​To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.

  5. HitCompanies Dataset

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Yuri Burger (2023). HitCompanies Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.842633.v1
    Explore at:
    Available download formats: zip
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yuri Burger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Worldwide Companies Dataset contains information on 10,000 randomly selected companies worldwide, including name, registration number, website URL, addresses, phone numbers, industry codes, aliases, associated domain names and key changes such as people changes, contact changes, etc. Original data available at http://endb-consolidated.aihit.com/datasets.htm

  6. 100-richest-people-in-world

    • huggingface.co
    Updated Aug 2, 2023
    Cite
    Nate Raw (2023). 100-richest-people-in-world [Dataset]. https://huggingface.co/datasets/nateraw/100-richest-people-in-world
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Nate Raw
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Area covered
    World
    Description

    Dataset Card for 100 Richest People In World

      Dataset Summary
    

    This dataset contains the list of the Top 100 Richest People in the World.

    Column Information:

    • Name: Person Name
    • NetWorth: His/Her Networth
    • Age: Person Age
    • Country: The country person belongs to
    • Source: Information Source
    • Industry: Expertise Domain

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    [More Information Needed]… See the full description on the dataset page: https://huggingface.co/datasets/nateraw/100-richest-people-in-world.

  7. World Population Data

    • kaggle.com
    Updated Jan 1, 2024
    Cite
    Sazidul Islam (2024). World Population Data [Dataset]. https://www.kaggle.com/datasets/sazidthe1/world-population-data
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sazidul Islam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    World
    Description

    Context

    The world's population has undergone remarkable growth, exceeding 7.5 billion by mid-2019 and continuing to surge beyond previous estimates. Notably, China and India stand as the two most populous countries, with China's population potentially facing a decline while India's trajectory hints at surpassing it by 2030. This significant demographic shift is just one facet of a global landscape where countries like the United States, Indonesia, Brazil, Nigeria, and others, each with populations surpassing 100 million, play pivotal roles.

    The steady decrease in growth rates, though, is reshaping projections. While the world's population is expected to exceed 8 billion by 2030, growth will notably decelerate compared to previous decades. Specific countries like India, Nigeria, and several African nations will notably contribute to this growth, potentially doubling their populations before rates plateau.

    Content

    This dataset provides comprehensive historical population data for countries and territories globally, offering insights into various parameters such as area size, continent, population growth rates, rankings, and world population percentages. Spanning from 1970 to 2023, it includes population figures for different years, enabling a detailed examination of demographic trends and changes over time.

    Dataset

    Structured with meticulous detail, this dataset offers a wide array of information in a format conducive to analysis and exploration. Featuring parameters like population by year, country rankings, geographical details, and growth rates, it serves as a valuable resource for researchers, policymakers, and analysts. Additionally, the inclusion of growth rates and world population percentages provides a nuanced understanding of how countries contribute to global demographic shifts.

    This dataset is invaluable for those interested in understanding historical population trends, predicting future demographic patterns, and conducting in-depth analyses to inform policies across various sectors such as economics, urban planning, public health, and more.

    Structure

    This dataset (world_population_data.csv) covering from 1970 up to 2023 includes the following columns:

    Column Name: Description
    • Rank: Rank by Population
    • CCA3: 3 Digit Country/Territories Code
    • Country: Name of the Country
    • Continent: Name of the Continent
    • 2023 Population: Population of the Country in the year 2023
    • 2022 Population: Population of the Country in the year 2022
    • 2020 Population: Population of the Country in the year 2020
    • 2015 Population: Population of the Country in the year 2015
    • 2010 Population: Population of the Country in the year 2010
    • 2000 Population: Population of the Country in the year 2000
    • 1990 Population: Population of the Country in the year 1990
    • 1980 Population: Population of the Country in the year 1980
    • 1970 Population: Population of the Country in the year 1970
    • Area (km²): Area size of the Country/Territories in square kilometer
    • Density (km²): Population Density per square kilometer
    • Growth Rate: Population Growth Rate by Country
    • World Population Percentage: The population percentage by each Country
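As a rough illustration of how the year columns above could be used, the Python sketch below computes the percentage change in population between two of the listed years. It assumes the CSV headers match the column names exactly as listed, which has not been checked against the file; the function name, the two countries, and their figures are invented placeholders, not real data.

```python
import csv
import io


def population_change(rows, start_year, end_year):
    """Percentage change in population between two year columns, per country."""
    changes = {}
    for row in rows:
        start = int(row[f"{start_year} Population"])
        end = int(row[f"{end_year} Population"])
        changes[row["Country"]] = (end - start) / start * 100
    return changes


# Placeholder rows standing in for world_population_data.csv.
sample_csv = io.StringIO(
    "Rank,CCA3,Country,Continent,2023 Population,1970 Population\n"
    "1,EXA,Examplia,Nowhere,1500,1000\n"
    "2,PLH,Placeholderland,Nowhere,900,1200\n"
)

change = population_change(list(csv.DictReader(sample_csv)), 1970, 2023)
for country, pct in change.items():
    print(f"{country}: {pct:+.1f}%")
```

With the real file, the `io.StringIO` object would be replaced by an opened file handle.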

    Acknowledgment

    The primary dataset was retrieved from the World Population Review. I sincerely thank the team for providing the core data used in this dataset.

    © Image credit: Freepik

  8. Worldwide Soundscapes project meta-data

    • zenodo.org
    Updated Dec 9, 2022
    Cite
    Kevin F.A. Darras; Rodney Rountree; Steven Van Wilgenburg; Amandine Gasc; 松海 李; 黎君 董; Yuhang Song; Youfang Chen; Thomas Cherico Wanger (2022). Worldwide Soundscapes project meta-data [Dataset]. http://doi.org/10.5281/zenodo.7415473
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kevin F.A. Darras; Rodney Rountree; Steven Van Wilgenburg; Amandine Gasc; 松海 李; 黎君 董; Yuhang Song; Youfang Chen; Thomas Cherico Wanger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Worldwide Soundscapes project is a global, open inventory of spatio-temporally replicated soundscape datasets. This Zenodo entry comprises the data tables that constitute its (meta-)database, as well as their description.

    The overview of all sampling sites can be found on the corresponding project on ecoSound-web, as well as a demonstration collection containing selected recordings. More information on the project can be found here and on ResearchGate.

    The audio recording criteria justifying inclusion into the meta-database are:

    • Stationary (no transects, towed sensors or microphones mounted on cars)
    • Passive (unattended, no human disturbance by the recordist)
    • Ambient (no spatial or temporal focus on a particular species or direction)
    • Spatially and/or temporally replicated (multiple sites sampled at least at one common daytime or multiple days sampled at least in one common site)

    The individual columns of the provided data tables are described in the following. Data tables are linked through primary keys; joining them will result in a database.

    datasets

    • dataset_id: incremental integer, primary key
    • name: name of the dataset. if it is repeated, incremental integers should be used in the "subset" column to differentiate them.
    • subset: incremental integer that can be used to distinguish datasets with identical names
    • collaborators: full names of people deemed responsible for the dataset, separated by commas
    • contributors: full names of people who are not the main collaborators but who have significantly contributed to the dataset, and who could be contacted for in-depth analyses, separated by commas.
    • date_added: when the dataset was added (DD/MM/YYYY)
    • URL_open_recordings: if recordings (even only some) from this dataset are openly available, indicate the internet link where they can be found.
    • URL_project: internet link for further information about the corresponding project
    • DOI_publication: DOI of corresponding publications, separated by comma
    • core_realm_IUCN: The core realm of the dataset. Datasets may have multiple realms, but the main one should be listed. Datasets may contain sampling sites from different realms in the "sites" sheet. IUCN Global Ecosystem Typology (v2.0): https://global-ecosystems.org/
    • medium: the physical medium the microphone is situated in
    • protected_area: Whether the sampling sites were situated in protected areas or not, or only some.
    • GADM0: For datasets on land or in territorial waters, Global Administrative Database level0
      https://gadm.org/
    • GADM1: For datasets on land or in territorial waters, Global Administrative Database level1
      https://gadm.org/
    • GADM2: For datasets on land or in territorial waters, Global Administrative Database level2
      https://gadm.org/
    • IHO: For marine locations, the sea area that encompasses all the sampling locations according to the International Hydrographic Organisation. Map here: https://www.arcgis.com/home/item.html?id=44e04407fbaf4d93afcb63018fbca9e2
    • locality: optional free text about the locality
    • latitude_numeric_region: study region approximate centroid latitude in WGS84 decimal degrees
    • longitude_numeric_region: study region approximate centroid longitude in WGS84 decimal degrees
    • sites_number: number of sites sampled
    • year_start: starting year of the sampling
    • year_end: ending year of the sampling
    • deployment_schedule: description of the sampling schedule, provisional
    • temporal_recording_selection: list environmental exclusion criteria that were used to determine which recording days or times to discard
    • high_pass_filter_Hz: frequency of the high-pass filter of the recorder, in Hz
    • variable_sampling_frequency: Does the sampling frequency vary? If it does, write "NA" in the sampling_frequency_kHz column and indicate it in the sampling_frequency_kHz column inside the deployments sheet
    • sampling_frequency_kHz: frequency the microphone was sampled at (sounds of half that frequency will be recorded)
    • variable_recorder:
    • recorder: recorder model used
    • microphone: microphone used
    • freshwater_recordist_position: position of the recordist relative to the microphone during sampling (only for freshwater)
    • collaborator_comments: free-text field for comments by the collaborators
    • validated: This cell is checked if the contents of all sheets are complete and have been found to be coherent and consistent with our requirements.
    • validator_name: name of person doing the validation
    • validation_comments: validators: please insert the date when someone was contacted
    • cross-check: this cell is checked if the collaborators confirm the spatial and temporal data after checking the corresponding site maps, deployment and operation time graphs found at https://drive.google.com/drive/folders/1qfwXH_7dpFCqyls-c6b8RZ_fbcn9kXbp?usp=share_link

    datasets-sites

    • dataset_ID: primary key of datasets table
    • dataset_name: lookup field
    • site_ID: primary key of sites table
    • site_name: lookup field

    sites

    • site_ID: unique site IDs, larger than 1000 for compatibility with ecoSound-web
    • site_name: name or code of sampling site as used in respective projects
    • latitude_numeric: exact numeric degrees coordinates of latitude
    • longitude_numeric: exact numeric degrees coordinates of longitude
    • topography_m: for sites on land: elevation. For marine sites: depth (negative). in meters
    • freshwater_depth_m
    • realm: Ecosystem type according to IUCN GET https://global-ecosystems.org/
    • biome: Ecosystem type according to IUCN GET https://global-ecosystems.org/
    • functional_group: Ecosystem type according to IUCN GET https://global-ecosystems.org/
    • comments
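The statement that the tables are linked through primary keys can be illustrated with a small Python sketch using an in-memory SQLite database. It keeps only a few of the columns described above, renames the `datasets-sites` table to `datasets_sites` (an assumption, to avoid quoting the hyphen), and uses invented rows; it does not load the actual Zenodo files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE datasets (dataset_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sites (
        site_ID INTEGER PRIMARY KEY,
        site_name TEXT,
        latitude_numeric REAL,
        longitude_numeric REAL
    );
    -- Link table: one row per (dataset, site) pair.
    CREATE TABLE datasets_sites (
        dataset_ID INTEGER REFERENCES datasets (dataset_id),
        site_ID INTEGER REFERENCES sites (site_ID)
    );
    """
)

# Invented example rows; site IDs are above 1000 as the description requires.
conn.executemany("INSERT INTO datasets VALUES (?, ?)",
                 [(1, "example forest"), (2, "example reef")])
conn.executemany("INSERT INTO sites VALUES (?, ?, ?, ?)",
                 [(1001, "A1", 10.0, 20.0),
                  (1002, "A2", 10.1, 20.1),
                  (1003, "B1", -5.0, 120.0)])
conn.executemany("INSERT INTO datasets_sites VALUES (?, ?)",
                 [(1, 1001), (1, 1002), (2, 1003)])

joined = conn.execute(
    """
    SELECT d.name, s.site_name
    FROM datasets AS d
    JOIN datasets_sites AS ds ON ds.dataset_ID = d.dataset_id
    JOIN sites AS s ON s.site_ID = ds.site_ID
    ORDER BY d.dataset_id, s.site_ID
    """
).fetchall()

for dataset_name, site_name in joined:
    print(dataset_name, site_name)
```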

    deployments

    • dataset_ID: primary key of datasets table
    • dataset_name: lookup field
    • deployment: use identical subscript letters to denote rows that belong to the same deployment. For instance, you may use different operation times and schedules for different target taxa within one deployment.
    • start_date_min: earliest date of deployment start, double-click cell to get date-picker
    • start_date_max: latest date of deployment start, if applicable (only used when recorders were deployed over several days), double-click cell to get date-picker
    • start_time_mixed: deployment start local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). Corresponds to the recording start time for continuous recording deployments. If multiple start times were used, you should mention the latest start time (corresponds to the earliest daytime from which all recorders are active). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • permanent: is the deployment permanent (in which case it would be ongoing and the end date or duration would be unknown)?
    • variable_duration_days: is the duration of the deployment variable? in days
    • duration_days: deployment duration per recorder (use the minimum if variable)
    • end_date_min: earliest date of deployment end, only needed if duration is variable, double-click cell to get date-picker
    • end_date_max: latest date of deployment end, only needed if duration is variable, double-click cell to get date-picker
    • end_time_mixed: deployment end local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). Corresponds to the recording end time for continuous recording deployments.
    • recording_time: does the recording last from the deployment start time to the end time (continuous) or at scheduled daily intervals (scheduled)? Note: we consider recordings with duty cycles to be continuous.
    • operation_start_time_mixed: scheduled recording start local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • operation_duration_minutes: duration of operation in minutes, if constant
    • operation_end_time_mixed: scheduled recording end local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • duty_cycle_minutes: duty cycle of the recording (i.e. the fraction of minutes when it is recording), written as "recording(minutes)/period(minutes)". For example: "1/6" if the recorder is active for 1 minute and standing by for 5 minutes.
    • sampling_frequency_kHz: only indicate the sampling frequency if it is variable within a particular dataset so that we need to code different frequencies for different deployments
    • recorder
    • subset_sites: If the deployment was not done in all the sites of the
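The mixed time fields and the duty-cycle notation described in the deployments list could be parsed along the following lines. This is a sketch under assumptions: the function names and return shapes are hypothetical, and offsets are assumed to be whole minutes written directly after the solar event, as in the "sunrise-60" example.

```python
import re


def parse_mixed_time(value):
    """Parse 'HH:MM' or a solar event with an optional minute offset.

    Returns ("clock", hour, minute) or ("solar", event, offset_minutes).
    """
    text = value.strip().lower()
    clock = re.fullmatch(r"(\d{1,2}):(\d{2})", text)
    if clock:
        hour, minute = int(clock.group(1)), int(clock.group(2))
        if hour > 23 or minute > 59:
            raise ValueError(f"invalid clock time: {value!r}")
        return ("clock", hour, minute)
    solar = re.fullmatch(r"(sunrise|sunset|noon|midnight)([+-]\d+)?", text)
    if solar:
        return ("solar", solar.group(1), int(solar.group(2) or 0))
    raise ValueError(f"unrecognised time value: {value!r}")


def parse_duty_cycle(value):
    """Turn 'recording/period' (both in minutes) into the recorded fraction."""
    recording, period = (int(part) for part in value.split("/"))
    if period <= 0 or recording > period:
        raise ValueError(f"invalid duty cycle: {value!r}")
    return recording / period


print(parse_mixed_time("sunrise-60"))
print(parse_mixed_time("06:30"))
print(parse_duty_cycle("1/6"))
```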

  9. The ORBIT (Object Recognition for Blind Image Training)-India Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 24, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gesu India; Martin Grayson; Daniela Massiceti; Cecily Morrison; Simon Robinson; Jennifer Pearson; Matt Jones (2025). The ORBIT (Object Recognition for Blind Image Training)-India Dataset [Dataset]. http://doi.org/10.5281/zenodo.12608444
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gesu India; Martin Grayson; Daniela Massiceti; Cecily Morrison; Simon Robinson; Jennifer Pearson; Matt Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    The ORBIT (Object Recognition for Blind Image Training)-India Dataset is a collection of 105,243 images of 76 commonly used objects, collected by 12 individuals in India who are blind or have low vision. This dataset is an "Indian subset" of the original ORBIT dataset [1, 2], which was collected in the UK and Canada. In contrast to the ORBIT dataset, which was created in a Global North, Western, and English-speaking context, the ORBIT-India dataset features images taken in a low-resource, non-English-speaking, Global South context, home to 90% of the world’s population of people with blindness. Since it is easier for blind or low-vision individuals to gather high-quality data by recording videos, this dataset, like the ORBIT dataset, contains images (each sized 224x224) derived from 587 videos. These videos were taken by our data collectors from various parts of India using the Find My Things [3] Android app. Each data collector was asked to record eight videos of at least 10 objects of their choice.

    Collected between July and November 2023, this dataset represents a set of objects commonly used by people who are blind or have low vision in India, including earphones, talking watches, toothbrushes, and typical Indian household items like a belan (rolling pin), and a steel glass. These videos were taken in various settings of the data collectors' homes and workspaces using the Find My Things Android app.

    The image dataset is stored in the ‘Dataset’ folder, organized into one folder per data collector (P1, P2, ...P12). Each collector's folder includes sub-folders named with the object labels as provided by our data collectors. Within each object folder, there are two subfolders: ‘clean’ for images taken on clean surfaces and ‘clutter’ for images taken in cluttered environments where the objects are typically found. The annotations are saved inside an ‘Annotations’ folder containing one JSON file per video (e.g., P1--coffee mug--clean--231220_084852_coffee mug_224.json) whose keys correspond to all frames/images in that video (e.g., "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, "P1--coffee mug--clean--231220_084852_coffee mug_224--000002.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, ...). The ‘object_not_present_issue’ key is true if the object is not present in the image, and the ‘pii_present_issue’ key is true if personally identifiable information (PII) is present in the image. Note that all PII present in the images has been blurred to protect the identity and privacy of our data collectors. This dataset version was created by cropping images originally sized at 1080 × 1920; an unscaled version of the dataset will follow soon.
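    The annotation layout described above lends itself to a small filtering step. The following is a minimal sketch, assuming frame names always consist of five parts separated by "--" (as in the example shown) and that no object label itself contains "--"; the function names are hypothetical.

```python
import json


def parse_frame_name(name):
    """Split a frame file name into its '--'-separated components."""
    stem = name.rsplit(".", 1)[0]  # drop the file extension
    collector, label, condition, video, frame = stem.split("--")
    return {
        "collector": collector,
        "label": label,
        "condition": condition,  # 'clean' or 'clutter'
        "video": video,
        "frame": int(frame),
    }


def frames_with_object(annotation_json):
    """Return the frame names whose annotation says the object is present."""
    annotations = json.loads(annotation_json)
    return [
        name
        for name, flags in annotations.items()
        if not flags["object_not_present_issue"]
    ]
```

    Whether to also drop frames flagged with pii_present_issue is left to the user, since the description states that such regions have already been blurred.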

    This project was funded by the Engineering and Physical Sciences Research Council (EPSRC) Industrial ICASE Award with Microsoft Research UK Ltd. as the Industrial Project Partner. We would like to acknowledge and express our gratitude to our data collectors for their efforts and time invested in carefully collecting videos to build this dataset for their community. The dataset is designed for developing few-shot learning algorithms, aiming to support researchers and developers in advancing object-recognition systems. We are excited to share this dataset and would love to hear from you if and how you use this dataset. Please feel free to reach out if you have any questions, comments or suggestions.

    REFERENCES:

    1. Daniela Massiceti, Lida Theodorou, Luisa Zintgraf, Matthew Tobias Harris, Simone Stumpf, Cecily Morrison, Edward Cutrell, and Katja Hofmann. 2021. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision. DOI: https://doi.org/10.25383/city.14294597

    2. microsoft/ORBIT-Dataset. https://github.com/microsoft/ORBIT-Dataset

    3. Linda Yilin Wen, Cecily Morrison, Martin Grayson, Rita Faia Marques, Daniela Massiceti, Camilla Longden, and Edward Cutrell. 2024. Find My Things: Personalized Accessibility through Teachable AI for People who are Blind or Low Vision. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24). Association for Computing Machinery, New York, NY, USA, Article 403, 1–6. https://doi.org/10.1145/3613905.3648641

  10. GBIF Backbone Taxonomy

    • gbif.org
    • smng.net
    • +1 more
    Updated Nov 17, 2023
    Cite
    GBIF Secretariat (2023). GBIF Backbone Taxonomy [Dataset]. http://doi.org/10.15468/39omei
    Dataset updated
    Nov 17, 2023
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The GBIF Backbone Taxonomy is a single, synthetic management classification with the goal of covering all names GBIF is dealing with. It's the taxonomic backbone that allows GBIF to integrate name based information from different resources, no matter if these are occurrence datasets, species pages, names from nomenclators or external sources like EOL, Genbank or IUCN. This backbone allows taxonomic search, browse and reporting operations across all those resources in a consistent way and to provide means to crosswalk names from one source to another.

    It is updated regularly through an automated process in which the Catalogue of Life acts as a starting point, also providing the complete higher classification above families. Additional scientific names only found in other authoritative nomenclatural and taxonomic datasets are then merged into the tree, thus extending the original catalogue and broadening the backbone's name coverage. The GBIF Backbone Taxonomy also includes identifiers for Operational Taxonomic Units (OTUs) drawn from the barcoding resources iBOL and UNITE.

    International Barcode of Life project (iBOL), Barcode Index Numbers (BINs). BINs are connected to a taxon name and its classification by taking into account all names applied to the BIN and picking names with at least 80% consensus. If there is no consensus of name at the species level, the selection process is repeated moving up the major Linnaean ranks until consensus is achieved.
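    The consensus rule for BINs described above can be illustrated as follows. This is a minimal sketch rather than GBIF's actual implementation: the rank list, the dictionary-based record layout, and the choice of counting every record applied to the BIN in the denominator are all assumptions made for illustration.

```python
from collections import Counter

# Major Linnaean ranks, from most to least specific (assumed ordering).
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom"]


def consensus_name(records, threshold_percent=80):
    """Return (rank, name) for the lowest rank reaching the consensus threshold.

    `records` is a list of dicts mapping rank -> name, one per name applied
    to the BIN. Returns None if no rank reaches the threshold.
    """
    total = len(records)
    for rank in RANKS:
        names = [r[rank] for r in records if r.get(rank)]
        if not names:
            continue
        name, count = Counter(names).most_common(1)[0]
        # Integer comparison avoids floating-point rounding at exactly 80%.
        if count * 100 >= threshold_percent * total:
            return rank, name
    return None
```

    With five records of which only three agree at species level, the sketch falls back to the genus if all five share it.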

    UNITE - Unified system for the DNA based fungal species, Species Hypotheses (SHs). SHs are connected to a taxon name and its classification based on the determination of the RefS (reference sequence) if present or the RepS (representative sequence). In the latter case, if there is no match in the UNITE taxonomy, the lowest rank with 100% consensus within the SH will be used.

    The GBIF Backbone Taxonomy is available for download at https://hosted-datasets.gbif.org/datasets/backbone/ in different formats together with an archive of all previous versions.

    The following 105 sources have been used to assemble the GBIF backbone, with the number of names contributed by each source given after its title:

    • Catalogue of Life Checklist - 4766428 names
    • International Barcode of Life project (iBOL) Barcode Index Numbers (BINs) - 635951 names
    • UNITE - Unified system for the DNA based fungal species linked to the classification - 611208 names
    • The Paleobiology Database - 212054 names
    • World Register of Marine Species - 188857 names
    • The Interim Register of Marine and Nonmarine Genera - 183894 names
    • The World Checklist of Vascular Plants (WCVP) - 131891 names
    • GBIF Backbone Taxonomy - 114350 names
    • TAXREF - 109374 names
    • The Leipzig catalogue of vascular plants - 75380 names
    • ZooBank - 73549 names
    • Integrated Taxonomic Information System (ITIS) - 68377 names
    • Plazi.org taxonomic treatments database - 61346 names
    • Genome Taxonomy Database r207 - 60545 names
    • International Plant Names Index - 52329 names
    • Fauna Europaea - 45077 names
    • The National Checklist of Taiwan (Catalogue of Life in Taiwan, TaiCoL) - 36193 names
    • Dyntaxa. Svensk taxonomisk databas - 35892 names
    • The Plant List with literature - 32692 names
    • United Kingdom Species Inventory (UKSI) - 29643 names
    • Artsnavnebasen - 29208 names
    • The IUCN Red List of Threatened Species - 21221 names
    • Afromoths, online database of Afrotropical moth species (Lepidoptera) - 13961 names
    • Brazilian Flora 2020 project - Projeto Flora do Brasil 2020 - 13829 names
    • Prokaryotic Nomenclature Up-to-Date (PNU) - 10079 names
    • Checklist Dutch Species Register - Nederlands Soortenregister - 8814 names
    • ICTV Master Species List (MSL) - 7852 names
    • Cockroach Species File - 6020 names
    • GRIN Taxonomy - 5882 names
    • Taxon list of fungi and fungal-like organisms from Germany compiled by the DGfM - 4570 names
    • Catalogue of Afrotropical Bees - 3623 names
    • Catalogue of Tenebrionidae (Coleoptera) of North America - 3327 names
    • Checklist of Beetles (Coleoptera) of Canada and Alaska. Second Edition. - 3312 names
    • Systema Dipterorum - 2850 names
    • Catalogue of the Pterophoroidea of the World - 2807 names
    • The Clements Checklist - 2675 names
    • Taxon list of Hymenoptera from Germany compiled in the context of the GBOL project - 2496 names
    • IOC World Bird List, v13.2 - 2366 names
    • Official Lists and Indexes of Names in Zoology - 2310 names
    • National checklist of all species occurring in Denmark - 1922 names
    • Myriatrix - 1876 names
    • Database of Vascular Plants of Canada (VASCAN) - 1822 names
    • Taxon list of vascular plants from Bavaria, Germany compiled in the context of the BFL project - 1771 names
    • Orthoptera Species File - 1742 names
    • A list of the terrestrial fungi, flora and fauna of Madeira and Selvagens archipelagos - 1602 names
    • Aphid Species File - 1565 names
    • World Spider Catalog - 1561 names
    • Taxon list of Jurassic Pisces of the Tethys Palaeo-Environment compiled at the SNSB-JME - 1270 names
    • Backbone Family Classification Patch - 1143 names
    • GBIF Algae Classification - 1100 names
    • International Cichorieae Network (ICN): Cichorieae Portal - 975 names
    • Psocodea Species File - 803 names
    • New Zealand Marine Macroalgae Species Checklist - 787 names
    • Annotated checklist of endemic species from the Western Balkans - 754 names
    • Taxon list of animals with German names (worldwide) compiled at the SMNS - 503 names
    • Catalogue of the Alucitoidea of the World - 472 names
    • Lygaeoidea Species File - 462 names
    • Catálogo de Plantas y Líquenes de Colombia - 422 names
    • GBIF Backbone Patch - 317 names
    • Phasmida Species File - 259 names
    • Cortinariaceae fetched from the Index Fungorum API - 234 names
    • Coreoidea Species File - 233 names
    • GTDB supplement - 139 names
    • Mantodea Species File - 119 names
    • Endemic species in Taiwan - 93 names
    • Taxon list of Araneae from Germany compiled in the context of the GBOL project - 88 names
    • Species of Hominidae - 78 names
    • Taxon list of Sternorrhyncha from Germany compiled in the context of the GBOL project - 77 names
    • Taxon list of mosses from Germany compiled in the context of the GBOL project - 75 names
    • Mammal Species of the World - 73 names
    • Plecoptera Species File - 71 names
    • Species Fungorum Plus - 64 names
    • Catalogue of the type specimens of Cosmopterigidae (Lepidoptera: Gelechioidea) from research collections of the Zoological Institute, Russian Academy of Sciences - 47 names
    • Species named after famous people - 41 names
    • Dermaptera Species File - 36 names
    • Taxon list of Trichoptera from Germany compiled in the context of the GBOL project - 34 names
    • True Fruit Flies (Diptera, Tephritidae) of the Afrotropical Region - 33 names
    • Range and Regularities in the Distribution of Earthworms of the USSR Fauna. Perel, 1979 - 32 names
    • Taxon list of Diplura from Germany compiled in the context of the GBOL project - 30 names
    • Lista de referencia de especies de aves de Colombia - 2022 - 24 names
    • Taxon list of Auchenorrhyncha from Germany compiled in the context of the GBOL project - 20 names
    • Catalogue of the type specimens of Polycestinae (Coleoptera: Buprestidae) from research collections of the Zoological Institute, Russian Academy of Sciences - 19 names
    • Taxon list of Thysanoptera from Germany compiled in the context of the GBOL project - 19 names
    • Lista de especies de vertebrados registrados en jurisdicción del Departamento del Huila - 18 names
    • Taxon list of Microcoryphia (Archaeognatha) from Germany compiled in the context of the GBOL project - 15 names
    • Catalogue of the type specimens of Bufonidae and Megophryidae (Amphibia: Anura) from research collections of the Zoological Institute, Russian Academy of Sciences - 12 names
    • Grylloblattodea Species File - 11 names
    • Coleorrhyncha Species File - 9 names
    • Taxon list of liverworts from Germany compiled in the context of the GBOL project - 9 names
    • Embioptera Species File - 7 names
    • Taxon list of Pisces and Cyclostoma from Germany compiled in the context of the GBOL project - 6 names
    • Taxon list of Pteridophyta from Germany compiled in the context of the GBOL project - 6 names
    • Taxon list of Siphonaptera from Germany compiled in the context of the GBOL project - 5 names
    • The Earthworms of the Fauna of Russia. Perel, 1997 - 5 names
    • Taxon list of Zygentoma from Germany compiled in the context of the GBOL project - 4 names
    • Asiloid Flies: new taxa of Diptera: Apioceridae, Asilidae, and Mydidae - 3 names
    • Taxon list of Protura from Germany compiled in the context of the GBOL project - 3 names
    • Taxon list of hornworts from Germany compiled in the context of the GBOL project - 2 names
    • Chrysididae Species File - 1 name
    • Taxon list of Dermaptera from Germany compiled in the context of the GBOL project - 1 name
    • Taxon list of Diplopoda from Germany in the context of the GBOL project - 1 name
    • Taxon list of Orthoptera (Grashoppers) from Germany compiled at the SNSB - 1 name
    • Taxon list of Pscoptera from Germany compiled in the context of the GBOL project - 1 name
    • Taxon list of Pseudoscorpiones from Germany compiled in the context of the GBOL project - 1 name
    • Taxon list of Raphidioptera from Germany compiled in the context of the GBOL project - 1 name

  11. Data from: Global Impacts Dataset of Invasive Alien Species (GIDIAS)

    • springernature.figshare.com
    xlsx
    Updated May 21, 2025
    Cite
    Sven Bacher; Ellen Ryan-Colton; Mario Coiro; Phillip Cassey; Bella S. Galil; Martín A. Nuñez; Michael Ansong; Katharina Dehnen-Schmutz; Georgi Fayvush; Romina Daiana Fernandez; Ankila Hiremath; Makihiko Ikegami; Angeliki F. Martinou; Shana M. McDermott; Cristina Preda; Montserrat Vilà; Olaf L. F. Weyl; Neelavara Ananthram Aravind; Katerina Athanasiou; Vidyadhar Atkore; Jacob N. Barney; Tim M. Blackburn; Eckehard G. Brockerhoff; Clinton Carbutt; Luca Carisio; Vanessa Céspedes; Diego F. Cisneros-Heredia; Meghan Cooling; Maarten de Groot; Jakovos Demetriou; James W. E. Dickey; Regan Early; Thomas E. Evans; Belinda Gallardo; Monica Gruber; Cang Hui; Jonathan Jeschke; Natalia Z. Joelson; Mohd Asgar Khan; Sabrina Kumschick; Lori Lach; Katharina Lapin; Simone Lioy; Chunlong Liu; Zoe J. MacMullen; Manuela A. Mazzitelli; G. John Measey; Agata A. Mrugała-Koese; Camille L. Musseau; Helen F. Nahrung; Alessia lucia Pepori; Luis R. Pertierra; Elizabeth F. Pienaar; Petr Pyšek; Gonzalo Rivas-Torres; Henry A. Rojas Martinez; JULISSA ROJAS-SANDOVAL; Ned Ryan-Schofield; Rocío M. Sánchez; Alberto Santini; Davide Santoro; Riccardo Scalera; Lisanna Schmidt; Tinyiko Cavin Shivambu; Sima Sohrabi; Elena Tricarico; Alejandro Trillo; Pieter G. van't Hof; Lara Volery; Tsungai A. Zengeya; Aikaterini Christopoulou; Virginia G. Duboscq-Carra; Ioanna A. Angelidou; Pilar Castro-Díez; Paola Tatiana Flores Males (2025). Global Impacts Dataset of Invasive Alien Species (GIDIAS) [Dataset]. http://doi.org/10.6084/m9.figshare.27908838.v1
    Available download formats: xlsx
    Dataset updated
    May 21, 2025
    Dataset provided by
    figshare
    Authors
    Sven Bacher; Ellen Ryan-Colton; Mario Coiro; Phillip Cassey; Bella S. Galil; Martín A. Nuñez; Michael Ansong; Katharina Dehnen-Schmutz; Georgi Fayvush; Romina Daiana Fernandez; Ankila Hiremath; Makihiko Ikegami; Angeliki F. Martinou; Shana M. McDermott; Cristina Preda; Montserrat Vilà; Olaf L. F. Weyl; Neelavara Ananthram Aravind; Katerina Athanasiou; Vidyadhar Atkore; Jacob N. Barney; Tim M. Blackburn; Eckehard G. Brockerhoff; Clinton Carbutt; Luca Carisio; Vanessa Céspedes; Diego F. Cisneros-Heredia; Meghan Cooling; Maarten de Groot; Jakovos Demetriou; James W. E. Dickey; Regan Early; Thomas E. Evans; Belinda Gallardo; Monica Gruber; Cang Hui; Jonathan Jeschke; Natalia Z. Joelson; Mohd Asgar Khan; Sabrina Kumschick; Lori Lach; Katharina Lapin; Simone Lioy; Chunlong Liu; Zoe J. MacMullen; Manuela A. Mazzitelli; G. John Measey; Agata A. Mrugała-Koese; Camille L. Musseau; Helen F. Nahrung; Alessia lucia Pepori; Luis R. Pertierra; Elizabeth F. Pienaar; Petr Pyšek; Gonzalo Rivas-Torres; Henry A. Rojas Martinez; JULISSA ROJAS-SANDOVAL; Ned Ryan-Schofield; Rocío M. Sánchez; Alberto Santini; Davide Santoro; Riccardo Scalera; Lisanna Schmidt; Tinyiko Cavin Shivambu; Sima Sohrabi; Elena Tricarico; Alejandro Trillo; Pieter G. van't Hof; Lara Volery; Tsungai A. Zengeya; Aikaterini Christopoulou; Virginia G. Duboscq-Carra; Ioanna A. Angelidou; Pilar Castro-Díez; Paola Tatiana Flores Males
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present the Global Impacts Dataset of Invasive Alien Species (GIDIAS), a global dataset of 22865 records of impacts of invasive alien species on nature, nature’s contributions to people, and good quality of life. Records include positive and negative impacts, neutral impacts (studies were carried out, but no impacts were documented), non-directional impacts (i.e., change without detriments or benefits for native species or people), and finally, some records of alien species where no studies were found that assessed their impacts (indicating data gaps). Records cover 3353 invasive alien species from all major taxa (plants, vertebrates, invertebrates, microorganisms) and all continents and realms (terrestrial, freshwater, marine). The data were compiled to serve as robust evidence for chapter 4 “Impacts of invasive alien species on nature, nature's contributions to people, and good quality of life” of the global assessment report on invasive alien species by the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES; available on Zenodo at https://doi.org/10.5281/zenodo.7430731). The dataset is provided as a machine-readable CSV file (file name GIDIAS_20250417_machine_read.csv), with special language characters retained where used (UTF-8 format). The dataset is also provided in Excel format (file name GIDIAS_20250417_Excel.xlsx). Metadata are provided in Excel format, including descriptors for each variable (file name GIDIAS_metadata_20250417.xlsx). Additional explanations for GIDIAS are stored in Microsoft Word format (docx) and contain (1) a short description of the principles of the Environmental and Socio-Economic Impact Classification for Alien Taxa (EICAT, SEICAT), (2) a description of the variables included in GIDIAS, and (3) a compilation of the search strategies and datasets included in GIDIAS.

  12. Data from: The global distribution of plants used by humans datasets: list...

    • data.niaid.nih.gov
    Updated Jan 19, 2024
    + more versions
    Cite
    Willis, Kathy J. (2024). The global distribution of plants used by humans datasets: list of utilised species, occurrence data and model outputs at 10 arc-minutes spatial resolution [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8176317
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    Govaerts, Rafaël
    Antonelli, Alexander
    Dennehy-Carr, Zoe
    Turner, Rob M.
    Ondo, Ian
    Cámara-Leret, Rodrigo
    Willis, Kathy J.
    Baquero, Andrea C.
    Milliken, William
    Patmore, Kristina
    Hargreaves, Serene
    Pironon, Samuel
    Canteiro, Cátia
    van Andel, Tinde R.
    Schmelzer, Gaby
    Ulian, Tiziana
    Allkin, Robert
    Nesbitt, Mark
    Hudson, Alex J.
    Lemmens, Roel
    Diazgranados, Mauricio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets and model outputs used to map the global distribution of plants utilised by humans. The folder is composed of two subfolders, raw_data and processed_data, containing respectively the list of utilised plant species modelled (utilised_plants_species_list.csv), and their occurrence data (occurrence_data.zip) and predicted distribution (species_proba_per_cell.rds).

    The file utilised_plants_species_list.csv in the raw_data folder contains a list of 35687 plant species (and hybrids) used by humans and 10 plant use categories with the following 15 fields:

    plant_ID: plant identifier number ranging from 1 to 35687

    binomial_acc_name: binomial accepted name of the plant species

    author_acc_name: name of the author(s)

    is_hybrid: logical TRUE or FALSE indicating whether the species is a hybrid or not.

    AnimalFood: forage and fodder for vertebrate animals only.

    EnvironmentalUses: examples include intercrops and nurse crops, ornamentals, barrier hedges, shade plants, windbreaks, soil improvers, plants for revegetation and erosion control, wastewater purifiers, indicators of the presence of metals, pollution, or underground water.

    Fuels: charcoal, petroleum substitutes, fuel alcohols, etc. Given the importance of energy plants for people, those were distinguished from Materials.

    GeneSources: wild relatives of major crops which may possess traits associated with biotic or abiotic resistance and may be valuable for breeding programs.

    HumanFood: food for humans only, including beverages and food additives.

    InvertebrateFood: plants consumed by invertebrates used by humans, such as bees, silkworms, lac insects and edible grubs.

    Materials: woods, fibers, cork, cane, tannins, latex, resins, gums, waxes, oils, lipids, etc. and their derived products.

    Medicines: both human and veterinary.

    Poisons: plants which are poisonous to both vertebrates and invertebrates, both accidentally and intentionally, e.g., for hunting and fishing, molluscicides, herbicides, insecticides.

    SocialsUses: plants used for social purposes, which cannot be defined as food or medicine, for instance, masticatories, smoking materials, narcotics, hallucinogens and psychoactive drugs, and plants with ritual or religious significance.

    Totals: total number of uses recorded for a species

    The zipfile occurrence_data.zip in the processed_data folder contains 35687 Comma Separated Values (CSV) files, one for each species, containing curated geographic occurrence records used to build species distribution models with the following 14 fields:

    Species: the binomial accepted name of the species

    Fullname: same as species

    decimalLongitude: the geographic longitude of the occurrence records of the species in decimal degrees

    decimalLatitude: the geographic latitude of the occurrence records of the species in decimal degrees

    countryCode: a three-letter standard abbreviation for the country of the occurrence locality

    coordinateUncertaintyinMeters: indicator for the accuracy of the coordinate location, described as the radius of a circle around the stated point location

    year: year of the observation of the occurrence record of the species

    individualCount: the number of individuals present at the time of the observation

    gbifID: unique identifier number for the occurrence from the original database

    basisOfRecords: the type of the individual record, e.g. observation, physical specimen, fossil, living ex-situ, culture collection specimen

    institutionCode: the name of the institution or organization listed as the data publisher on GBIF

    establishmentMeans: statement about whether an organism has been introduced to a given place and time through the direct or indirect activity of modern humans

    is_cultivated_observation: whether or not an organism is cultivated

    sourceID: name of the source database

    The file species_proba_per_cell.rds in the processed_data folder is an R Data Serialization (RDS) file containing a data.table object with the following 3 fields:

    plant_ID: plant identifier number ranging from 1 to 35687

    proba: species occurrence probability

    cell: raster grid cell number between 1 and 2251762

    This object can be used in combination with a raster layer to reconstruct the modelled distribution of each species or retrieve species richness and endemism.
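    As an illustration of the richness calculation mentioned above, the following minimal sketch aggregates (plant_ID, proba, cell) rows per grid cell. It assumes the rows have already been exported from the RDS file into Python tuples; summing probabilities as "expected richness" and the optional presence threshold are common conventions adopted here as assumptions, not the authors' stated method.

```python
from collections import defaultdict


def richness_per_cell(rows, threshold=None):
    """Aggregate species occurrence probabilities into per-cell richness.

    `rows` is an iterable of (plant_id, proba, cell) tuples.
    With threshold=None, richness is the sum of probabilities in each cell;
    otherwise it is the count of species with proba >= threshold.
    """
    richness = defaultdict(float)
    for plant_id, proba, cell in rows:
        if threshold is None:
            richness[cell] += proba
        elif proba >= threshold:
            richness[cell] += 1
    return dict(richness)
```

    Mapping the resulting cell numbers back to coordinates still requires the raster layer that defines the grid, which is not reproduced here.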

  13. ORBITAAL: cOmpRehensive BItcoin daTaset for temorAl grAph anaLysis - Dataset...

    • cryptodata.center
    Updated Dec 4, 2024
    + more versions
    Cite
    cryptodata.center (2024). ORBITAAL: cOmpRehensive BItcoin daTaset for temorAl grAph anaLysis - Dataset - CryptoData Hub [Dataset]. https://cryptodata.center/dataset/orbitaal-comprehensive-bitcoin-dataset-for-temoral-graph-analysis
    Dataset updated
    Dec 4, 2024
    Dataset provided by
    CryptoDATA
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Construction

    This dataset captures the temporal network of Bitcoin (BTC) flows exchanged between entities at the finest time resolution, expressed as UNIX timestamps. It is built from the blockchain covering the period from 3 January 2009 to 25 January 2021. The blockchain was extracted using the bitcoin-etl (https://github.com/blockchain-etl/bitcoin-etl) Python package. The entity-entity network is built by aggregating Bitcoin addresses using the common-input heuristic [1] as well as popular Bitcoin users' addresses provided by https://www.walletexplorer.com/

    [1] M. Harrigan and C. Fretter, "The Unreasonable Effectiveness of Address Clustering," 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), Toulouse, France, 2016, pp. 368-373, doi: 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071.

    Dataset Description

    Bitcoin Activity Temporal Coverage: From 03 January 2009 to 25 January 2021

    Overview: This dataset provides a comprehensive representation of Bitcoin exchanges between entities over a long time span, from the inception of Bitcoin to recent years. It encompasses various temporal resolutions and representations to facilitate Bitcoin transaction network analysis in the context of temporal graphs. All dates were derived from block UNIX timestamps in the GMT timezone.

    Contents: The dataset is distributed across the compressed archives listed below. All data are stored in the Apache Parquet file format, a columnar storage format optimized for analytical queries. It can be used with the pyspark Python package.

    • orbitaal-stream_graph.tar.gz: The root directory is STREAM_GRAPH/. Contains a stream graph representation of Bitcoin exchanges at the finest temporal scale, corresponding to the validation time of each block (averaging approximately 10 minutes). The stream graph is divided into 13 files, one for each year. File format is parquet. Name format is orbitaal-stream_graph-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (the number of files here) such that sorting by increasing [ID] is equivalent to sorting by increasing year. These files are in the subdirectory STREAM_GRAPH/EDGES/
    • orbitaal-snapshot-all.tar.gz: The root directory is SNAPSHOT/. Contains the snapshot network representing all transactions aggregated over the whole dataset period (from Jan. 2009 to Jan. 2021). File format is parquet. Name format is orbitaal-snapshot-all.snappy.parquet. These files are in the subdirectory SNAPSHOT/EDGES/ALL/
    • orbitaal-snapshot-year.tar.gz: The root directory is SNAPSHOT/. Contains the snapshot networks at yearly resolution. File format is parquet. Name format is orbitaal-snapshot-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (the number of files here) such that sorting by increasing [ID] is equivalent to sorting by increasing year. These files are in the subdirectory SNAPSHOT/EDGES/year/
    • orbitaal-snapshot-month.tar.gz: The root directory is SNAPSHOT/. Contains the snapshot networks at monthly resolution. File format is parquet. Name format is orbitaal-snapshot-date-[YYYY]-[MM]-file-id-[ID].snappy.parquet, where [YYYY] and [MM] stand for the corresponding year and month, and [ID] is an integer from 1 to N (the number of files here) such that sorting by increasing [ID] is equivalent to sorting by increasing year and month. These files are in the subdirectory SNAPSHOT/EDGES/month/
    • orbitaal-snapshot-day.tar.gz: The root directory is SNAPSHOT/. Contains the snapshot networks at daily resolution. File format is parquet. Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-file-id-[ID].snappy.parquet, where [YYYY], [MM], and [DD] stand for the corresponding year, month, and day, and [ID] is an integer from 1 to N (the number of files here) such that sorting by increasing [ID] is equivalent to sorting by increasing year, month, and day. These files are in the subdirectory SNAPSHOT/EDGES/day/
    • orbitaal-snapshot-hour.tar.gz: The root directory is SNAPSHOT/. Contains the snapshot networks at hourly resolution. File format is parquet. Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-[hh]-file-id-[ID].snappy.parquet, where [YYYY], [MM], [DD], and [hh] stand for the corresponding year, month, day, and hour, and [ID] is an integer from 1 to N (the number of files here) such that sorting by increasing [ID] is equivalent to sorting by increasing year, month, day, and hour. These files are in the subdirectory SNAPSHOT/EDGES/hour/
    • orbitaal-nodetable.tar.gz: The root directory is NODE_TABLE/. Contains two files in parquet format: the first gives information about the nodes present in the stream graphs and snapshots, such as period of activity and associated global Bitcoin balance, and the second contains the list of all associated Bitcoin addresses.

    Small samples in CSV format: orbitaal-stream_graph-2016_07_08.csv and orbitaal-stream_graph-2016_07_09.csv. These two CSV files are related to stream graph representations of a halvening that happened in 2016.
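    The snapshot file naming scheme described above can be parsed with a single regular expression. The following is a minimal sketch; the function name parse_snapshot_name and the example file names in the usage note are hypothetical, and only the name pattern itself comes from the description.

```python
import re

# Matches the year, month, day and hour variants of the snapshot name format:
# orbitaal-snapshot-date-[YYYY](-[MM](-[DD](-[hh])))-file-id-[ID].snappy.parquet
SNAPSHOT_NAME = re.compile(
    r"orbitaal-snapshot-date-(?P<date>\d{4}(?:-\d{2}){0,3})"
    r"-file-id-(?P<id>\d+)\.snappy\.parquet"
)


def parse_snapshot_name(filename):
    """Return ((year[, month[, day[, hour]]]), file_id), or None if no match."""
    match = SNAPSHOT_NAME.fullmatch(filename)
    if match is None:
        return None
    date_parts = tuple(int(part) for part in match.group("date").split("-"))
    return date_parts, int(match.group("id"))
```

    Sorting a directory listing by the returned file id (for instance with sorted(names, key=lambda n: parse_snapshot_name(n)[1])) should then follow chronological order, as the description states.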

  14. Global Zomato Dataset

    • opendatabay.com
    .csv
    Updated Jun 21, 2025
    Cite
    Datasimple (2025). Global Zomato Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/2d4d09a3-1be3-4e57-b435-471c7faf8365
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    Overview of the Foody Dataset

    Everyone has trouble picking a favourite dish when out of town. Here we have come up with a food delivery app dataset that can help you find mouth-watering dishes within your budget. The Zomato dataset provides restaurant information, including location, cuisines, price, ratings, and more. It enables analysis of factors affecting popularity, such as cuisine type, booking, and delivery, facilitating personalised restaurant recommendations and insights into the food delivery industry.

    Problem Statement: Develop a restaurant recommendation system that uses the dataset to suggest personalised dining options based on user preferences, location, and restaurant attributes, enhancing the dining experience.

    Columns

    Here's a description of the columns in the dataset:

    Restaurant ID: A unique identifier for each restaurant in the dataset.
    Restaurant Name: The name of the restaurant.
    City: The city where the restaurant is located.
    Address: The specific address of the restaurant.
    Locality: The locality or neighbourhood where the restaurant is situated.
    Longitude: The longitude coordinate of the restaurant's location.
    Latitude: The latitude coordinate of the restaurant's location.
    Cuisines: The types of cuisine offered by the restaurant, e.g. Japanese, Thai, Chinese, Mughlai.
    Average Cost for two: The average cost of a meal for two people at the restaurant.
    Currency: The currency in which the average cost is denoted.
    Has Table booking: Indicates whether the restaurant accepts table bookings (Yes/No).
    Has Online delivery: Indicates whether the restaurant provides online food delivery services (Yes/No).
    Is delivering now: Indicates whether the restaurant is currently delivering food (Yes/No).
    Price range: The price range category of the restaurant, from 1 (lowest price) to 4 (highest price).
    Aggregate rating: The overall rating of the restaurant based on user reviews.
    Rating colour: The colour representation of the rating (e.g., Dark green, Green, Yellow, Orange, Red, and White).
    Rating text: The text representation of the rating (e.g., Excellent, Very good, Good, Average, Poor, and Not rated).
    Votes: The total number of user votes or reviews received by the restaurant.

    Questions for solving:

    Can the location (city or locality) of a restaurant influence its average cost for two people?
    Is there a relationship between the type of cuisine offered by a restaurant and its aggregate rating?
    How does the average cost for two people at a restaurant correlate with its aggregate rating?
    Does the presence of table booking and online delivery options impact a restaurant's aggregate rating?
    How does the number of votes/reviews received by a restaurant relate to its aggregate rating and popularity?

    Hoping that you would find insightful predictions for your text-long trip.
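One of the questions above asks how the average cost for two correlates with the aggregate rating. A minimal sketch of that calculation follows, assuming the CSV header uses the column names exactly as listed in the description ("Average Cost for two", "Aggregate rating"); the real file's header may differ, and the three sample rows are made up purely for illustration.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cost_rating_correlation(csv_file):
    """Read cost and rating columns from an open CSV file and correlate them."""
    costs, ratings = [], []
    for row in csv.DictReader(csv_file):
        costs.append(float(row["Average Cost for two"]))
        ratings.append(float(row["Aggregate rating"]))
    return pearson(costs, ratings)


# Made-up rows, not taken from the dataset.
SAMPLE = """Restaurant ID,Average Cost for two,Aggregate rating
1,100,1.0
2,200,2.0
3,300,3.0
"""

r = cost_rating_correlation(io.StringIO(SAMPLE))
```

Because the dataset records a Currency column, costs from different countries are not directly comparable; filtering to a single currency before computing the correlation is probably the safer approach.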

    Happy Learning!!!!

    Don't forget to Upvote my food lovers… Kindly, upvote if you find the dataset interesting. Thank you.

    License

    CC0

    Original Data Source: Global Zomato Dataset

  15. Global soil organisms

    • gbif.org
    • smng.net
    • +1 more
    Updated Feb 27, 2023
    Cite
    PlutoF (2023). Global soil organisms [Dataset]. http://doi.org/10.15468/fdpeaw
    Dataset provided by
    Global Biodiversity Information Facility: https://www.gbif.org/
    PlutoF
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Global distribution of soil organisms. Data deposited in this project represent unique (non-clustered) sequences. These sequences are members of the curated OTU list (tag-jump filtered and chimera-free, clustered at 98% similarity threshold) from the GSMc dataset (Tedersoo et al., Fungal Diversity, 2021, https://doi.org/10.1007/s13225-021-00493-7). For each OTU in each sample, within-OTU sequences were dereplicated ignoring terminal gaps; in the presence of sequence variants differing only in the length of homopolymeric regions, only the most abundant variant was preserved. Taxonomic annotation was transferred from the representative sequence of each OTU to all unique sequences clustered in it. The current dataset includes additional soil samples not covered by the published article (Tedersoo et al., Fungal Diversity, 2021). Additional samples were collected following a slightly different sampling protocol. Taxon occurrences originating from these samples can be filtered out by Dataset name ('Global soil samples subproject (sequences from additional samples)') and Dataset ID (108273). The number of distinct sampling sites: 3 736, sampling events: 4 514. The number of unique taxa based on UNITE species hypotheses on 1.5% distance threshold: 292 413.

  16. COVID19 Additional Data

    • kaggle.com
    Updated Apr 9, 2020
    Cite
    Orzhiang (2020). COVID19 Additional Data [Dataset]. https://www.kaggle.com/datasets/orzhiang/covid19-additional-data/versions/11
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Orzhiang
    Description

    This is a collection of datasets that I personally think are useful for analysing COVID19 data. Since all of the data comes from the internet and the majority originated from the World Bank, I am sure some Kaggle users have already uploaded similar data. However, I think compiling all of these data together makes my life (and perhaps yours) easier.

    The following are some remarks for the dataset:

    Dataset Title | Description
    Other source of COVID19 Cases | https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset#time_series_covid_19_confirmed.csv
    Mortality Table | https://www.kaggle.com/robikscube/world-health-organization-who-mortality-database
    Economic Freedom Index | https://www.kaggle.com/lewisduncan93/the-economic-freedom-index
    World Bank Development Indicators | https://www.kaggle.com/theworldbank/world-development-indicators
    Weather Data | https://www.kaggle.com/hbfree/covid19formattedweatherjan22march24
    Government Response | https://www.bsg.ox.ac.uk/research/research-projects/oxford-covid-19-government-response-tracker
    Containment and Mitigation Measures | https://www.kaggle.com/paultimothymooney/covid-19-containment-and-mitigation-measures/
    World Happiness Report | https://www.kaggle.com/londeen/world-happiness-report-2020
    Weather Data 2 | https://www.kaggle.com/noaa/gsod
    US Data Prior to 2020-03-09 | https://www.kaggle.com/johnjdavisiv/jhu-covid19-data-with-us-state-data-prior-to-mar-9
    OCED Hospital Bed per 1000 inhabitants | https://www.kaggle.com/cpmpml/oecd-hospital-beds-per-1000-inhabitant
    Covid 19 data by the US States | https://www.kaggle.com/scirpus/covid-by-state
    COVID 19 Demographic predictors | https://www.kaggle.com/nightranger77/covid19-demographic-predictors
    Country Info | https://www.kaggle.com/koryto/countryinfo
    Population by location | https://www.kaggle.com/dgrechka/covid19-global-forecasting-locations-population
    00 COVID19 Country Mapping Table | A mapping table that links the World Bank country name and country code to the country name used in the COVID19 competition. It makes linking the COVID19 data and World Bank data much easier.
    01 Population_API_SP.POP.TOTL | https://data.worldbank.org/indicator/sp.pop.totl
    01_1 China Demographic Data | Source:
    http://www.chamiji.com/2019chinaprovincepopulation
    http://www.stats.gov.cn/tjsj/ndsj/2017/indexeh.htm
    http://data.stats.gov.cn/english/easyquery.htm?cn=C01
    http://www.gov.cn/test/2007-08/07/content_708271.htm

  17. lmsys-chat-1m

    • huggingface.co
    • opendatalab.com
    Updated Sep 17, 2023
    Cite
    Large Model Systems Organization (2023). lmsys-chat-1m [Dataset]. https://huggingface.co/datasets/lmsys/lmsys-chat-1m
    Dataset authored and provided by
    Large Model Systems Organization
    Description

    LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

    This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.

  18. The Global Soundscapes Project: overview of datasets and meta-data

    • explore.openaire.eu
    Updated May 11, 2022
    Cite
    Kevin F. A. Darras; Steven Van Wilgenburg; Rodney Rountree; Yuhang Song; Youfang Chen; Thomas Cherico Wanger (2022). The Global Soundscapes Project: overview of datasets and meta-data [Dataset]. http://doi.org/10.5281/zenodo.6537739
    Authors
    Kevin F. A. Darras; Steven Van Wilgenburg; Rodney Rountree; Yuhang Song; Youfang Chen; Thomas Cherico Wanger
    Description

    This is an overview of the soundscape recording datasets that have been contributed to the Global Soundscapes Project, as well as associated meta-data. The audio recording criteria justifying inclusion into the current meta-dataset are:

    Stationary (no towed sensors or microphones mounted on cars)
    Passive (no human disturbance by the recordist)
    Ambient (no focus on a particular species or direction)
    Recorded over multiple sites of a region and/or days

    The individual columns are described as follows.

    General:
    ID: primary key
    name: name of the dataset
    subset: incremental integer that can be used to distinguish sub-datasets
    collaborators: full names of people deemed responsible for the dataset, separated by commas
    date_added: when the dataset was added

    Space:
    realm_IUCN: realm from IUCN Global Ecosystem Typology (v2.0) (https://global-ecosystems.org/)
    medium: the physical medium the microphone is situated in
    GADM0: for terrestrial locations, Database of Global Administrative Areas level 0 unit as per https://gadm.org/
    GADM1: for terrestrial locations, Database of Global Administrative Areas level 1 unit as per https://gadm.org/
    GADM2: for terrestrial locations, Database of Global Administrative Areas level 2 unit as per https://gadm.org/
    IHO: International Hydrographic Organisation sea area as per https://iho.int/
    latitude_numeric_region: study region approximate centroid latitude in WGS84 decimal degrees
    longitude_numeric_region: study region approximate centroid longitude in WGS84 decimal degrees
    topography_min_m: minimum elevation of sites from sea level
    topography_max_m: maximum elevation of sites from sea level
    ground_distance_m: vertical distance of microphone from land ground or ocean floor
    freshwater_depth_m: vertical distance from water surface for freshwater datasets
    sites_number: number of sites sampled

    Time:
    days_number_per_site: typical number of days sampled per site (or minimum if too variable)
    day: whether the sites were sampled during daytime
    night: whether the sites were sampled during nighttime
    twilight: whether the sites were sampled during twilight
    warm_season: whether the warm season was sampled. Only outside tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification)
    cold_season: whether the cold season was sampled. Only outside tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification)
    dry_season: whether the dry season was sampled. Only for tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification)
    wet_season: whether the wet season was sampled. Only for tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification)
    year_start: starting year of the sampling
    year_end: ending year of the sampling
    schedule: description of the sampling schedule, free text
    recording_selection: criteria used to temporally select recordings (e.g., discarded rainy days)

    Audio:
    high_pass_filter_Hz: lower frequency of the high-pass filter
    sampling_frequency_kHz: frequency the microphone was sampled at
    audio_bit_depth: bit depth used for encoding audio
    recorder_model: recorder model used
    microphone: microphone used
    recordist_position: position of the recordist relative to the microphone during sampling

    Others:
    comments: free-text field
    URL_project: internet link for further information
    URL_publication: internet link of the corresponding publication

    More information on the project can be found here: https://ecosound-web.uni-goettingen.de/ecosound_web/project/gsp

    adding IHO data

  19. Synthetic population for JOR

    • explore.openaire.eu
    • zenodo.org
    Updated Apr 30, 2022
    Cite
    Abhijin Adiga; Hannah Baek; Stephen Eubank; Przemyslaw Porebski; Madhav Marathe; Henning Mortveit; Samarth Swarup; Mandy Wilson; Dawen Xie (2022). Synthetic population for JOR [Dataset]. http://doi.org/10.5281/zenodo.6503397
    Authors
    Abhijin Adiga; Hannah Baek; Stephen Eubank; Przemyslaw Porebski; Madhav Marathe; Henning Mortveit; Samarth Swarup; Mandy Wilson; Dawen Xie
    Description

    Synthetic populations for regions of the World (SPW) | Jordan

    Dataset information
    A synthetic population of a region as provided here, captures the people of the region with selected demographic attributes, their organization into households, their assigned activities for a day, the locations where the activities take place and thus where interactions among population members happen (e.g., spread of epidemics).

    License
    CC-BY-4.0

    Acknowledgment
    This project was supported by the National Science Foundation under the NSF RAPID: COVID-19 Response Support: Building Synthetic Multi-scale Networks (PI: Madhav Marathe, Co-PIs: Henning Mortveit, Srinivasan Venkatramanan; Fund Number: OAC-2027541).

    Contact information
    Henning.Mortveit@virginia.edu

    Identifiers
    Region name | Jordan
    Region ID | jor
    Model | coarse
    Version | 0_9_0

    Statistics
    Name | Value
    Population | 5723567.0
    Average age | 23.5
    Households | 1235755.0
    Average household size | 4.6
    Residence locations | 1235755.0
    Activity locations | 131978.0
    Average number of activities | 6.4
    Average travel distance | 44.5

    Sources
    Description | Name | Version | Url
    Activity template data | World Bank | 2021 | https://data.worldbank.org
    Administrative boundaries | ADCW | 7.6 | https://www.adci.com/adc-worldmap
    Curated POIs based on OSM | SLIPO/OSM POIs | | http://slipo.eu/?p=1551 https://www.openstreetmap.org/
    Household data | DHS | | https://dhsprogram.com
    Population count with demographic attributes | GPW | v4.11 | https://sedac.ciesin.columbia.edu/data/set/gpw-v4-admin-unit-center-points-population-estimates-rev11

    Files description

    Base data files (jor_data_v_0_9.zip)
    Filename | Description
    jor_person_v_0_9.csv | Data for each person including attributes such as age, gender, and household ID.
    jor_household_v_0_9.csv | Data at household level.
    jor_residence_locations_v_0_9.csv | Data about residence locations
    jor_activity_locations_v_0_9.csv | Data about activity locations, including what activity types are supported at these locations
    jor_activity_location_assignment_v_0_9.csv | For each person and for each of their activities, this file specifies the location where the activity takes place

    Derived data files
    Filename | Description
    jor_contact_matrix_v_0_9.csv | A POLYMOD-type contact matrix constructed from a network representation of the location assignment data and a within-location contact model.

    Validation and measures files
    Filename | Description
    jor_household_grouping_validation_v_0_9.pdf | Validation plots for household construction
    jor_activity_durations_{adult,child}_v_0_9.pdf | Comparison of time spent on generated activities with survey data
    jor_activity_patterns_{adult,child}_v_0_9.pdf | Comparison of generated activity patterns by the time of day with survey data
    jor_location_construction_0_9.pdf | Validation plots for location construction
    jor_location_assignement_0_9.pdf | Validation plots for location assignment, including travel distribution plots
    jor_jor_ver_0_9_0_avg_travel_distance.pdf | Choropleth map visualizing average travel distance
    jor_jor_ver_0_9_0_travel_distr_combined.pdf | Travel distance distribution
    jor_jor_ver_0_9_0_num_activity_loc.pdf | Choropleth map visualizing number of activity locations
    jor_jor_ver_0_9_0_avg_age.pdf | Choropleth map visualizing average age
    jor_jor_ver_0_9_0_pop_density_per_sqkm.pdf | Choropleth map visualizing population density
    jor_jor_ver_0_9_0_pop_size.pdf | Choropleth map visualizing population size

  20. Popular White Last Names in the US

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). Popular White Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-white-last-names-in-the-us/
    Dataset authored and provided by
    John Snow Labs
    Area covered
    United States
    Description

    This dataset lists popular last names among the White population of the United States.

Cite
Name Census (2023). Name Census top 100 surnames [Dataset]. https://www.kaggle.com/datasets/namecensus/name-census-top-100-surnames

Name Census top 100 surnames

Surname database with the top 100 surnames per country

Dataset updated
Mar 25, 2023
Dataset provided by
Kaggle
Authors
Name Census
License

http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

Description

Name Census top 100 surnames

In 2018 we were doing a research project, and we needed to know if a name was male or female. After Googling for hours for 'baby name lists', 'name databases' and 'name datasets', we discovered that there wasn't a complete name database for all countries with first names and gender. Most name database layouts we found differed per country, were incomplete, or contained non-existent names. That is why we created Name Census, the most comprehensive name database in the world! The Name Census top 100 database is a free database containing the top 100 first names and top 100 surnames for each country.

Collection methodology

Our name database is created using first names and surnames obtained from governments, cross-referenced with millions of names from publicly available social media profiles. We took all those names and used millions of publicly available social media profiles to cross-reference and count each name per country. This way we were sure that the names in our name database are actually used, and we could create our popularity metric. We now offer the complete name database and the name parsing service as separate services.

Content

The Name Census top 100 is a name database that consists of two files: the top 100 first names per country and the top 100 surnames per country. Each file is a CSV file encoded in UTF-8.
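Since each file is a UTF-8 CSV organised per country, grouping surnames by country is a natural first step. The sketch below assumes hypothetical column names ("country", "surname"), which the description does not specify, and uses made-up sample rows; check the actual header before relying on it.

```python
import csv
import io
from collections import defaultdict


def surnames_by_country(csv_file):
    """Group surnames by country, preserving the row order within each country."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # "country" and "surname" are assumed column names, not confirmed by the source.
        grouped[row["country"]].append(row["surname"])
    return dict(grouped)


# Made-up rows with non-ASCII characters, to show why the encoding matters.
SAMPLE = "country,surname\nES,García\nES,Rodríguez\nDE,Müller\n"

result = surnames_by_country(io.StringIO(SAMPLE))
```

When reading the real file from disk, passing encoding="utf-8" and newline="" to open() keeps accented names intact and lets the csv module handle line endings itself.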
