30 datasets found

Name Census top 100 surnames
kaggle.com
Updated Mar 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Name Census (2023). Name Census top 100 surnames [Dataset]. https://www.kaggle.com/datasets/namecensus/name-census-top-100-surnames
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 25, 2023
Dataset provided by
Kaggle
Authors
Name Census
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Name Census top 100 surnames

In 2018 we were doing a research project, and we needed to know if a name was male or female. After Googling for hours for 'baby name lists', 'name databases' and 'name datasets' we discovered that there wasn't a complete name database for all countries with first names and gender. Most name database layouts we found different per country, were incomplete or contained non-existing names. That is why we created Name Census, the most comprehensive name database in the world! The Name Census top 100 databases is a free database containing the top 100 first names and top 100 surnames for each country.

Collection methodology

Our name database is created using first names and surnames obtained from governments and cross-referencing with millions of names from publicly available social media profiles. We took all those names and used millions of social media profiles that where publicly available to cross-reference and count each name per country. This way we were sure that the names in our name database are actually used and we could create our popularity metric. We now offer the complete name database and the name parsing service as separate services.

Content

The Name Census top 100 is a name database that consists out of two files; the first names top 100 per country and the surnames top 100 per country. Each file is a CSV file formatted in UTF-8.
o
Geonames - All Cities with a population > 1000
public.opendatasoft.com
data.smartidf.services
+2more
csv, excel, geojson +1
Updated Mar 10, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
Explore at:
csv, json, geojson, excelAvailable download formats
Dataset updated
Mar 10, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
All cities with a population > 1000 or seats of adm div (ca 80.000)Sources and ContributionsSources : GeoNames is aggregating over hundred different data sources. Ambassadors : GeoNames Ambassadors help in many countries. Wiki : A wiki allows to view the data and quickly fix error and add missing places. Donations and Sponsoring : Costs for running GeoNames are covered by donations and sponsoring.Enrichment:add country name
Global Country Information 2023
zenodo.org
data.niaid.nih.gov
csv
Updated Jun 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nidula Elgiriyewithana; Nidula Elgiriyewithana (2024). Global Country Information 2023 [Dataset]. http://doi.org/10.5281/zenodo.8165229
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8165229
Dataset updated
Jun 15, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Nidula Elgiriyewithana; Nidula Elgiriyewithana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description

This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.

Key Features

Country: Name of the country.

Density (P/Km2): Population density measured in persons per square kilometer.

Abbreviation: Abbreviation or code representing the country.

Agricultural Land (%): Percentage of land area used for agricultural purposes.

Land Area (Km2): Total land area of the country in square kilometers.

Armed Forces Size: Size of the armed forces in the country.

Birth Rate: Number of births per 1,000 population per year.

Calling Code: International calling code for the country.

Capital/Major City: Name of the capital or major city.

CO2 Emissions: Carbon dioxide emissions in tons.

CPI: Consumer Price Index, a measure of inflation and purchasing power.

CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.

Currency_Code: Currency code used in the country.

Fertility Rate: Average number of children born to a woman during her lifetime.

Forested Area (%): Percentage of land area covered by forests.

Gasoline_Price: Price of gasoline per liter in local currency.

GDP: Gross Domestic Product, the total value of goods and services produced in the country.

Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.

Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.

Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.

Largest City: Name of the country's largest city.

Life Expectancy: Average number of years a newborn is expected to live.

Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.

Minimum Wage: Minimum wage level in local currency.

Official Language: Official language(s) spoken in the country.

Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.

Physicians per Thousand: Number of physicians per thousand people.

Population: Total population of the country.

Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.

Tax Revenue (%): Tax revenue as a percentage of GDP.

Total Tax Rate: Overall tax burden as a percentage of commercial profits.

Unemployment Rate: Percentage of the labor force that is unemployed.

Urban Population: Percentage of the population living in urban areas.

Latitude: Latitude coordinate of the country's location.

Longitude: Longitude coordinate of the country's location.

Potential Use Cases

Analyze population density and land area to study spatial distribution patterns.

Investigate the relationship between agricultural land and food security.

Examine carbon dioxide emissions and their impact on climate change.

Explore correlations between economic indicators such as GDP and various socio-economic factors.

Investigate educational enrollment rates and their implications for human capital development.

Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.

Study labor market dynamics through indicators such as labor force participation and unemployment rates.

Investigate the role of taxation and its impact on economic development.

Explore urbanization trends and their social and environmental consequences.
d
COVID Impact Survey - Public Data
data.world
csv, zip
Updated Oct 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Associated Press (2024). COVID Impact Survey - Public Data [Dataset]. https://data.world/associatedpress/covid-impact-survey-public-data
Explore at:
csv, zipAvailable download formats
Dataset updated
Oct 16, 2024
Authors
The Associated Press
Description
Overview

The Associated Press is sharing data from the COVID Impact Survey, which provides statistics about physical health, mental health, economic security and social dynamics related to the coronavirus pandemic in the United States.

Conducted by NORC at the University of Chicago for the Data Foundation, the probability-based survey provides estimates for the United States as a whole, as well as in 10 states (California, Colorado, Florida, Louisiana, Minnesota, Missouri, Montana, New York, Oregon and Texas) and eight metropolitan areas (Atlanta, Baltimore, Birmingham, Chicago, Cleveland, Columbus, Phoenix and Pittsburgh).

The survey is designed to allow for an ongoing gauge of public perception, health and economic status to see what is shifting during the pandemic. When multiple sets of data are available, it will allow for the tracking of how issues ranging from COVID-19 symptoms to economic status change over time.

The survey is focused on three core areas of research:

Physical Health: Symptoms related to COVID-19, relevant existing conditions and health insurance coverage.

Economic and Financial Health: Employment, food security, and government cash assistance.

Social and Mental Health: Communication with friends and family, anxiety and volunteerism. (Questions based on those used on the U.S. Census Bureau’s Current Population Survey.) ## Using this Data - IMPORTANT This is survey data and must be properly weighted during analysis: DO NOT REPORT THIS DATA AS RAW OR AGGREGATE NUMBERS!!

Instead, use our queries linked below or statistical software such as R or SPSS to weight the data.

Queries

If you'd like to create a table to see how people nationally or in your state or city feel about a topic in the survey, use the survey questionnaire and codebook to match a question (the variable label) to a variable name. For instance, "How often have you felt lonely in the past 7 days?" is variable "soc5c".

Nationally: Go to this query and enter soc5c as the variable. Hit the blue Run Query button in the upper right hand corner.

Local or State: To find figures for that response in a specific state, go to this query and type in a state name and soc5c as the variable, and then hit the blue Run Query button in the upper right hand corner.

The resulting sentence you could write out of these queries is: "People in some states are less likely to report loneliness than others. For example, 66% of Louisianans report feeling lonely on none of the last seven days, compared with 52% of Californians. Nationally, 60% of people said they hadn't felt lonely."

Margin of Error

The margin of error for the national and regional surveys is found in the attached methods statement. You will need the margin of error to determine if the comparisons are statistically significant. If the difference is:

At least twice the margin of error, you can report there is a clear difference.

At least as large as the margin of error, you can report there is a slight or apparent difference.

Less than or equal to the margin of error, you can report that the respondents are divided or there is no difference. ## A Note on Timing Survey results will generally be posted under embargo on Tuesday evenings. The data is available for release at 1 p.m. ET Thursdays.

About the Data

The survey data will be provided under embargo in both comma-delimited and statistical formats.

Each set of survey data will be numbered and have the date the embargo lifts in front of it in the format of: 01_April_30_covid_impact_survey. The survey has been organized by the Data Foundation, a non-profit non-partisan think tank, and is sponsored by the Federal Reserve Bank of Minneapolis and the Packard Foundation. It is conducted by NORC at the University of Chicago, a non-partisan research organization. (NORC is not an abbreviation, it part of the organization's formal name.)

Data for the national estimates are collected using the AmeriSpeak Panel, NORC’s probability-based panel designed to be representative of the U.S. household population. Interviews are conducted with adults age 18 and over representing the 50 states and the District of Columbia. Panel members are randomly drawn from AmeriSpeak with a target of achieving 2,000 interviews in each survey. Invited panel members may complete the survey online or by telephone with an NORC telephone interviewer.

Once all the study data have been made final, an iterative raking process is used to adjust for any survey nonresponse as well as any noncoverage or under and oversampling resulting from the study specific sample design. Raking variables include age, gender, census division, race/ethnicity, education, and county groupings based on county level counts of the number of COVID-19 deaths. Demographic weighting variables were obtained from the 2020 Current Population Survey. The count of COVID-19 deaths by county was obtained from USA Facts. The weighted data reflect the U.S. population of adults age 18 and over.

Data for the regional estimates are collected using a multi-mode address-based (ABS) approach that allows residents of each area to complete the interview via web or with an NORC telephone interviewer. All sampled households are mailed a postcard inviting them to complete the survey either online using a unique PIN or via telephone by calling a toll-free number. Interviews are conducted with adults age 18 and over with a target of achieving 400 interviews in each region in each survey.Additional details on the survey methodology and the survey questionnaire are attached below or can be found at https://www.covid-impact.org.

Attribution

Results should be credited to the COVID Impact Survey, conducted by NORC at the University of Chicago for the Data Foundation.

AP Data Distributions

To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.
HitCompanies Dataset
figshare.com
zip
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yuri Burger (2023). HitCompanies Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.842633.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.842633.v1
Dataset updated
May 31, 2023
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Yuri Burger
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Worldwide Companies Dataset contains information on random 10,000 worldwide companies, including name, registration number, website url, addresses, phone numbers, industry codes, aliases, associated domain names and key changes such as people changes, contact changes, etc.Original data available at http://endb-consolidated.aihit.com/datasets.htm
h
100-richest-people-in-world
huggingface.co
Updated Aug 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nate Raw (2023). 100-richest-people-in-world [Dataset]. https://huggingface.co/datasets/nateraw/100-richest-people-in-world
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 2, 2023
Authors
Nate Raw
License
https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/
Area covered
World
Description
Dataset Card for 100 Richest People In World

Dataset Summary

This dataset contains the list of Top 100 Richest People in the World Column Information:-

Name - Person Name NetWorth - His/Her Networth Age - Person Age Country - The country person belongs to Source - Information Source Industry - Expertise Domain

Join our Community Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]… See the full description on the dataset page: https://huggingface.co/datasets/nateraw/100-richest-people-in-world.

World Population Data

kaggle.com

Updated Jan 1, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Sazidul Islam (2024). World Population Data [Dataset]. https://www.kaggle.com/datasets/sazidthe1/world-population-data

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jan 1, 2024

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Sazidul Islam

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Area covered

World

Description

Context

The world's population has undergone remarkable growth, exceeding 7.5 billion by mid-2019 and continuing to surge beyond previous estimates. Notably, China and India stand as the two most populous countries, with China's population potentially facing a decline while India's trajectory hints at surpassing it by 2030. This significant demographic shift is just one facet of a global landscape where countries like the United States, Indonesia, Brazil, Nigeria, and others, each with populations surpassing 100 million, play pivotal roles.

The steady decrease in growth rates, though, is reshaping projections. While the world's population is expected to exceed 8 billion by 2030, growth will notably decelerate compared to previous decades. Specific countries like India, Nigeria, and several African nations will notably contribute to this growth, potentially doubling their populations before rates plateau.

Content

This dataset provides comprehensive historical population data for countries and territories globally, offering insights into various parameters such as area size, continent, population growth rates, rankings, and world population percentages. Spanning from 1970 to 2023, it includes population figures for different years, enabling a detailed examination of demographic trends and changes over time.

Dataset

Structured with meticulous detail, this dataset offers a wide array of information in a format conducive to analysis and exploration. Featuring parameters like population by year, country rankings, geographical details, and growth rates, it serves as a valuable resource for researchers, policymakers, and analysts. Additionally, the inclusion of growth rates and world population percentages provides a nuanced understanding of how countries contribute to global demographic shifts.

This dataset is invaluable for those interested in understanding historical population trends, predicting future demographic patterns, and conducting in-depth analyses to inform policies across various sectors such as economics, urban planning, public health, and more.

Structure

This dataset (world_population_data.csv) covering from 1970 up to 2023 includes the following columns:

Column Name	Description
`Rank`	Rank by Population
`CCA3`	3 Digit Country/Territories Code
`Country`	Name of the Country
`Continent`	Name of the Continent
`2023 Population`	Population of the Country in the year 2023
`2022 Population`	Population of the Country in the year 2022
`2020 Population`	Population of the Country in the year 2020
`2015 Population`	Population of the Country in the year 2015
`2010 Population`	Population of the Country in the year 2010
`2000 Population`	Population of the Country in the year 2000
`1990 Population`	Population of the Country in the year 1990
`1980 Population`	Population of the Country in the year 1980
`1970 Population`	Population of the Country in the year 1970
`Area (km²)`	Area size of the Country/Territories in square kilometer
`Density (km²)`	Population Density per square kilometer
`Growth Rate`	Population Growth Rate by Country
`World Population Percentage`	The population percentage by each Country

Acknowledgment

The primary dataset was retrieved from the World Population Review. I sincerely thank the team for providing the core data used in this dataset.

Worldwide Soundscapes project meta-data
zenodo.org
Updated Dec 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kevin F.A. Darras; Kevin F.A. Darras; Rodney Rountree; Rodney Rountree; Steven Van Wilgenburg; Steven Van Wilgenburg; Amandine Gasc; Amandine Gasc; 松海李; 松海李; 黎君董; 黎君董; Yuhang Song; Youfang Chen; Youfang Chen; Thomas Cherico Wanger; Thomas Cherico Wanger; Yuhang Song (2022). Worldwide Soundscapes project meta-data [Dataset]. http://doi.org/10.5281/zenodo.7415473
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.7415473
Dataset updated
Dec 9, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Kevin F.A. Darras; Kevin F.A. Darras; Rodney Rountree; Rodney Rountree; Steven Van Wilgenburg; Steven Van Wilgenburg; Amandine Gasc; Amandine Gasc; 松海李; 松海李; 黎君董; 黎君董; Yuhang Song; Youfang Chen; Youfang Chen; Thomas Cherico Wanger; Thomas Cherico Wanger; Yuhang Song
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Worldwide Soundscapes project is a global, open inventory of spatio-temporally replicated soundscape datasets. This Zenodo entry comprises the data tables that constitute its (meta-)database, as well as their description.

The overview of all sampling sites can be found on the corresponding project on ecoSound-web, as well as a demonstration collection containing selected recordings. More information on the project can be found here and on ResearchGate.

The audio recording criteria justifying inclusion into the meta-database are:

Stationary (no transects, towed sensors or microphones mounted on cars)

Passive (unattended, no human disturbance by the recordist)

Ambient (no spatial or temporal focus on a particular species or direction)

Spatially and/or temporally replicated (multiple sites sampled at least at one common daytime or multiple days sampled at least in one common site)

The individual columns of the provided data tables are described in the following. Data tables are linked through primary keys; joining them will result in a database.

datasets

dataset_id: incremental integer, primary key

name: name of the dataset. if it is repeated, incremental integers should be used in the "subset" column to differentiate them.

subset: incremental integer that can be used to distinguish datasets with identical names

collaborators: full names of people deemed responsible for the dataset, separated by commas

contributors: full names of people who are not the main collaborators but who have significantly contributed to the dataset, and who could be contacted for in-depth analyses, separated by commas.

date_added: when the datased was added (DD/MM/YYYY)

URL_open_recordings: if recordings (even only some) from this dataset are openly available, indicate the internet link where they can be found.

URL_project: internet link for further information about the corresponding project

DOI_publication: DOI of corresponding publications, separated by comma

core_realm_IUCN: The core realm of the dataset. Datasets may have multiple realms, but the main one should be listed. Datasets may contain sampling sites from different realms in the "sites" sheet. IUCN Global Ecosystem Typology (v2.0): https://global-ecosystems.org/

medium: the physical medium the microphone is situated in

protected_area: Whether the sampling sites were situated in protected areas or not, or only some.

GADM0: For datasets on land or in territorial waters, Global Administrative Database level0
https://gadm.org/

GADM1: For datasets on land or in territorial waters, Global Administrative Database level1
https://gadm.org/

GADM2: For datasets on land or in territorial waters, Global Administrative Database level2
https://gadm.org/

IHO: For marine locations, the sea area that encompassess all the sampling locations according to the International Hydrographic Organisation. Map here: https://www.arcgis.com/home/item.html?id=44e04407fbaf4d93afcb63018fbca9e2

locality: optional free text about the locality

latitude_numeric_region: study region approximate centroid latitude in WGS84 decimal degrees

longitude_numeric_region: study region approximate centroid longitude in WGS84 decimal degrees

sites_number: number of sites sampled

year_start: starting year of the sampling

year_end: ending year of the sampling

deployment_schedule: description of the sampling schedule, provisional

temporal_recording_selection: list environmental exclusion criteria that were used to determine which recording days or times to discard

high_pass_filter_Hz: frequency of the high-pass filter of the recorder, in Hz

variable_sampling_frequency: Does the sampling frequency vary? If it does, write "NA" in the sampling_frequency_kHz column and indicate it in the sampling_frequency_kHz column inside the deployments sheet

sampling_frequency_kHz: frequency the microphone was sampled at (sounds of half that frequency will be recorded)

variable_recorder:

recorder: recorder model used

microphone: microphone used

freshwater_recordist_position: position of the recordist relative to the microphone during sampling (only for freshwater)

collaborator_comments: free-text field for comments by the collaborators

validated: This cell is checked if the contents of all sheets are complete and have been found to be coherent and consistent with our requirements.

validator_name: name of person doing the validation

validation_comments: validators: please insert the date when someone was contacted

cross-check: this cell is checked if the collaborators confirm the spatial and temporal data after checking the corresponding site maps, deployment and operation time graphs found at https://drive.google.com/drive/folders/1qfwXH_7dpFCqyls-c6b8RZ_fbcn9kXbp?usp=share_link

datasets-sites

dataset_ID: primary key of datasets table

dataset_name: lookup field

site_ID: primary key of sites table

site_name: lookup field

sites

site_ID: unique site IDs, larger than 1000 for compatibility with ecoSound-web

site_name: name or code of sampling site as used in respective projects

latitude_numeric: exact numeric degrees coordinates of latitude

longitude_numeric: exact numeric degrees coordinates of longitude

topography_m: for sites on land: elevation. For marine sites: depth (negative). in meters

freshwater_depth_m

realm: Ecosystem type according to IUCN GET https://global-ecosystems.org/

biome: Ecosystem type according to IUCN GET https://global-ecosystems.org/

functional_group: Ecosystem type according to IUCN GET https://global-ecosystems.org/

comments

deployments

dataset_ID: primary key of datasets table

dataset_name: lookup field

deployment: use identical subscript letters to denote rows that belong to the same deployment. For instance, you may use different operation times and schedules for different target taxa within one deployment.

start_date_min: earliest date of deployment start, double-click cell to get date-picker

start_date_max: latest date of deployment start, if applicable (only used when recorders were deployed over several days), double-click cell to get date-picker

start_time_mixed: deployment start local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). Corresponds to the recording start time for continuous recording deployments. If multiple start times were used, you should mention the latest start time (corresponds to the earliest daytime from which all recorders are active). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")

permanent: is the deployment permanent (in which case it would be ongoing and the end date or duration would be unknown)?

variable_duration_days: is the duration of the deployment variable? in days

duration_days: deployment duration per recorder (use the minimum if variable)

end_date_min: earliest date of deployment end, only needed if duration is variable, double-click cell to get date-picker

end_date_max: latest date of deployment end, only needed if duration is variable, double-click cell to get date-picker

end_time_mixed: deployment end local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). Corresponds to the recording end time for continuous recording deployments.

recording_time: does the recording last from the deployment start time to the end time (continuous) or at scheduled daily intervals (scheduled)? Note: we consider recordings with duty cycles to be continuous.

operation_start_time_mixed: scheduled recording start local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")

operation_duration_minutes: duration of operation in minutes, if constant

operation_end_time_mixed: scheduled recording end local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")

duty_cycle_minutes: duty cycle of the recording (i.e. the fraction of minutes when it is recording), written as "recording(minutes)/period(minutes)". For example: "1/6" if the recorder is active for 1 minute and standing by for 5 minutes.

sampling_frequency_kHz: only indicate the sampling frequency if it is variable within a particular dataset so that we need to code different frequencies for different deployments

recorder

subset_sites: If the deployment was not done in all the sites of the
The ORBIT (Object Recognition for Blind Image Training)-India Dataset
zenodo.org
data.niaid.nih.gov
Updated Apr 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gesu India; Gesu India; Martin Grayson; Martin Grayson; Daniela Massiceti; Daniela Massiceti; Cecily Morrison; Cecily Morrison; Simon Robinson; Simon Robinson; Jennifer Pearson; Jennifer Pearson; Matt Jones; Matt Jones (2025). The ORBIT (Object Recognition for Blind Image Training)-India Dataset [Dataset]. http://doi.org/10.5281/zenodo.12608444
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.12608444
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gesu India; Gesu India; Martin Grayson; Martin Grayson; Daniela Massiceti; Daniela Massiceti; Cecily Morrison; Cecily Morrison; Simon Robinson; Simon Robinson; Jennifer Pearson; Jennifer Pearson; Matt Jones; Matt Jones
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
The ORBIT (Object Recognition for Blind Image Training) -India Dataset is a collection of 105,243 images of 76 commonly used objects, collected by 12 individuals in India who are blind or have low vision. This dataset is an "Indian subset" of the original ORBIT dataset [1, 2], which was collected in the UK and Canada. In contrast to the ORBIT dataset, which was created in a Global North, Western, and English-speaking context, the ORBIT-India dataset features images taken in a low-resource, non-English-speaking, Global South context, a home to 90% of the world’s population of people with blindness. Since it is easier for blind or low-vision individuals to gather high-quality data by recording videos, this dataset, like the ORBIT dataset, contains images (each sized 224x224) derived from 587 videos. These videos were taken by our data collectors from various parts of India using the Find My Things [3] Android app. Each data collector was asked to record eight videos of at least 10 objects of their choice.

Collected between July and November 2023, this dataset represents a set of objects commonly used by people who are blind or have low vision in India, including earphones, talking watches, toothbrushes, and typical Indian household items like a belan (rolling pin), and a steel glass. These videos were taken in various settings of the data collectors' homes and workspaces using the Find My Things Android app.

The image dataset is stored in the ‘Dataset’ folder, organized by folders assigned to each data collector (P1, P2, ...P12) who collected them. Each collector's folder includes sub-folders named with the object labels as provided by our data collectors. Within each object folder, there are two subfolders: ‘clean’ for images taken on clean surfaces and ‘clutter’ for images taken in cluttered environments where the objects are typically found. The annotations are saved inside a ‘Annotations’ folder containing a JSON file per video (e.g., P1--coffee mug--clean--231220_084852_coffee mug_224.json) that contains keys corresponding to all frames/images in that video (e.g., "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, "P1--coffee mug--clean--231220_084852_coffee mug_224--000002.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, ...). The ‘object_not_present_issue’ key is True if the object is not present in the image, and the ‘pii_present_issue’ key is True, if there is a personally identifiable information (PII) present in the image. Note, all PII present in the images has been blurred to protect the identity and privacy of our data collectors. This dataset version was created by cropping images originally sized at 1080 × 1920; therefore, an unscaled version of the dataset will follow soon.

This project was funded by the Engineering and Physical Sciences Research Council (EPSRC) Industrial ICASE Award with Microsoft Research UK Ltd. as the Industrial Project Partner. We would like to acknowledge and express our gratitude to our data collectors for their efforts and time invested in carefully collecting videos to build this dataset for their community. The dataset is designed for developing few-shot learning algorithms, aiming to support researchers and developers in advancing object-recognition systems. We are excited to share this dataset and would love to hear from you if and how you use this dataset. Please feel free to reach out if you have any questions, comments or suggestions.

REFERENCES:

Daniela Massiceti, Lida Theodorou, Luisa Zintgraf, Matthew Tobias Harris, Simone Stumpf, Cecily Morrison, Edward Cutrell, and Katja Hofmann. 2021. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision. DOI: https://doi.org/10.25383/city.14294597

microsoft/ORBIT-Dataset. https://github.com/microsoft/ORBIT-Dataset

Linda Yilin Wen, Cecily Morrison, Martin Grayson, Rita Faia Marques, Daniela Massiceti, Camilla Longden, and Edward Cutrell. 2024. Find My Things: Personalized Accessibility through Teachable AI for People who are Blind or Low Vision. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24). Association for Computing Machinery, New York, NY, USA, Article 403, 1–6. https://doi.org/10.1145/3613905.3648641
GBIF Backbone Taxonomy
gbif.org
smng.net
+1more
Updated Nov 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
GBIF Secretariat (2023). GBIF Backbone Taxonomy [Dataset]. http://doi.org/10.15468/39omei
Explore at:
Unique identifier
https://doi.org/10.15468/39omei
Dataset updated
Nov 17, 2023
Dataset provided by
Global Biodiversity Information Facilityhttps://www.gbif.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The GBIF Backbone Taxonomy is a single, synthetic management classification with the goal of covering all names GBIF is dealing with. It's the taxonomic backbone that allows GBIF to integrate name based information from different resources, no matter if these are occurrence datasets, species pages, names from nomenclators or external sources like EOL, Genbank or IUCN. This backbone allows taxonomic search, browse and reporting operations across all those resources in a consistent way and to provide means to crosswalk names from one source to another.

It is updated regulary through an automated process in which the Catalogue of Life acts as a starting point also providing the complete higher classification above families. Additional scientific names only found in other authoritative nomenclatural and taxonomic datasets are then merged into the tree, thus extending the original catalogue and broadening the backbones name coverage. The GBIF Backbone taxonomy also includes identifiers for Operational Taxonomic Units (OTUs) drawn from the barcoding resources iBOL and UNITE.

International Barcode of Life project (iBOL), Barcode Index Numbers (BINs). BINs are connected to a taxon name and its classification by taking into account all names applied to the BIN and picking names with at least 80% consensus. If there is no consensus of name at the species level, the selection process is repeated moving up the major Linnaean ranks until consensus is achieved.

UNITE - Unified system for the DNA based fungal species, Species Hypotheses (SHs). SHs are connected to a taxon name and its classification based on the determination of the RefS (reference sequence) if present or the RepS (representative sequence). In the latter case, if there is no match in the UNITE taxonomy, the lowest rank with 100% consensus within the SH will be used.

The GBIF Backbone Taxonomy is available for download at https://hosted-datasets.gbif.org/datasets/backbone/ in different formats together with an archive of all previous versions.

The following 105 sources have been used to assemble the GBIF backbone with number of names given in brackets:
Catalogue of Life Checklist - 4766428 names
International Barcode of Life project (iBOL) Barcode Index Numbers (BINs) - 635951 names
UNITE - Unified system for the DNA based fungal species linked to the classification - 611208 names
The Paleobiology Database - 212054 names
World Register of Marine Species - 188857 names
The Interim Register of Marine and Nonmarine Genera - 183894 names
The World Checklist of Vascular Plants (WCVP) - 131891 names
GBIF Backbone Taxonomy - 114350 names
TAXREF - 109374 names
The Leipzig catalogue of vascular plants - 75380 names
ZooBank - 73549 names
Integrated Taxonomic Information System (ITIS) - 68377 names
Plazi.org taxonomic treatments database - 61346 names
Genome Taxonomy Database r207 - 60545 names
International Plant Names Index - 52329 names
Fauna Europaea - 45077 names
The National Checklist of Taiwan (Catalogue of Life in Taiwan, TaiCoL) - 36193 names
Dyntaxa. Svensk taxonomisk databas - 35892 names
The Plant List with literature - 32692 names
United Kingdom Species Inventory (UKSI) - 29643 names
Artsnavnebasen - 29208 names
The IUCN Red List of Threatened Species - 21221 names
Afromoths, online database of Afrotropical moth species (Lepidoptera) - 13961 names
Brazilian Flora 2020 project - Projeto Flora do Brasil 2020 - 13829 names
Prokaryotic Nomenclature Up-to-Date (PNU) - 10079 names
Checklist Dutch Species Register - Nederlands Soortenregister - 8814 names
ICTV Master Species List (MSL) - 7852 names
Cockroach Species File - 6020 names
GRIN Taxonomy - 5882 names
Taxon list of fungi and fungal-like organisms from Germany compiled by the DGfM - 4570 names
Catalogue of Afrotropical Bees - 3623 names
Catalogue of Tenebrionidae (Coleoptera) of North America - 3327 names
Checklist of Beetles (Coleoptera) of Canada and Alaska. Second Edition. - 3312 names
Systema Dipterorum - 2850 names
Catalogue of the Pterophoroidea of the World - 2807 names
The Clements Checklist - 2675 names
Taxon list of Hymenoptera from Germany compiled in the context of the GBOL project - 2496 names
IOC World Bird List, v13.2 - 2366 names
Official Lists and Indexes of Names in Zoology - 2310 names
National checklist of all species occurring in Denmark - 1922 names
Myriatrix - 1876 names
Database of Vascular Plants of Canada (VASCAN) - 1822 names
Taxon list of vascular plants from Bavaria, Germany compiled in the context of the BFL project - 1771 names
Orthoptera Species File - 1742 names
A list of the terrestrial fungi, flora and fauna of Madeira and Selvagens archipelagos - 1602 names
Aphid Species File - 1565 names
World Spider Catalog - 1561 names
Taxon list of Jurassic Pisces of the Tethys Palaeo-Environment compiled at the SNSB-JME - 1270 names
Backbone Family Classification Patch - 1143 names
GBIF Algae Classification - 1100 names
International Cichorieae Network (ICN): Cichorieae Portal - 975 names
Psocodea Species File - 803 names
New Zealand Marine Macroalgae Species Checklist - 787 names
Annotated checklist of endemic species from the Western Balkans - 754 names
Taxon list of animals with German names (worldwide) compiled at the SMNS - 503 names
Catalogue of the Alucitoidea of the World - 472 names
Lygaeoidea Species File - 462 names
Catálogo de Plantas y Líquenes de Colombia - 422 names
GBIF Backbone Patch - 317 names
Phasmida Species File - 259 names
Cortinariaceae fetched from the Index Fungorum API - 234 names
Coreoidea Species File - 233 names
GTDB supplement - 139 names
Mantodea Species File - 119 names
Endemic species in Taiwan - 93 names
Taxon list of Araneae from Germany compiled in the context of the GBOL project - 88 names
Species of Hominidae - 78 names
Taxon list of Sternorrhyncha from Germany compiled in the context of the GBOL project - 77 names
Taxon list of mosses from Germany compiled in the context of the GBOL project - 75 names
Mammal Species of the World - 73 names
Plecoptera Species File - 71 names
Species Fungorum Plus - 64 names
Catalogue of the type specimens of Cosmopterigidae (Lepidoptera: Gelechioidea) from research collections of the Zoological Institute, Russian Academy of Sciences - 47 names
Species named after famous people - 41 names
Dermaptera Species File - 36 names
Taxon list of Trichoptera from Germany compiled in the context of the GBOL project - 34 names
True Fruit Flies (Diptera, Tephritidae) of the Afrotropical Region - 33 names
Range and Regularities in the Distribution of Earthworms of the Earthworms of the USSR Fauna. Perel, 1979 - 32 names
Taxon list of Diplura from Germany compiled in the context of the GBOL project - 30 names
Lista de referencia de especies de aves de Colombia - 2022 - 24 names
Taxon list of Auchenorrhyncha from Germany compiled in the context of the GBOL project - 20 names
Catalogue of the type specimens of Polycestinae (Coleoptera: Buprestidae) from research collections of the Zoological Institute, Russian Academy of Sciences - 19 names
Taxon list of Thysanoptera from Germany compiled in the context of the GBOL project - 19 names
Lista de especies de vertebrados registrados en jurisdicción del Departamento del Huila - 18 names
Taxon list of Microcoryphia (Archaeognatha) from Germany compiled in the context of the GBOL project - 15 names
Catalogue of the type specimens of Bufonidae and Megophryidae (Amphibia: Anura) from research collections of the Zoological Institute, Russian Academy of Sciences - 12 names
Grylloblattodea Species File - 11 names
Coleorrhyncha Species File - 9 names
Taxon list of liverworts from Germany compiled in the context of the GBOL project - 9 names
Embioptera Species File - 7 names
Taxon list of Pisces and Cyclostoma from Germany compiled in the context of the GBOL project - 6 names
Taxon list of Pteridophyta from Germany compiled in the context of the GBOL project - 6 names
Taxon list of Siphonaptera from Germany compiled in the context of the GBOL project - 5 names
The Earthworms of the Fauna of Russia. Perel, 1997 - 5 names
Taxon list of Zygentoma from Germany compiled in the context of the GBOL project - 4 names
Asiloid Flies: new taxa of Diptera: Apioceridae, Asilidae, and Mydidae - 3 names
Taxon list of Protura from Germany compiled in the context of the GBOL project - 3 names
Taxon list of hornworts from Germany compiled in the context of the GBOL project - 2 names
Chrysididae Species File - 1 names
Taxon list of Dermaptera from Germany compiled in the context of the GBOL project - 1 names
Taxon list of Diplopoda from Germany in the context of the GBOL project - 1 names
Taxon list of Orthoptera (Grashoppers) from Germany compiled at the SNSB - 1 names
Taxon list of Pscoptera from Germany compiled in the context of the GBOL project - 1 names
Taxon list of Pseudoscorpiones from Germany compiled in the context of the GBOL project - 1 names
Taxon list of Raphidioptera from Germany compiled in the context of the GBOL project - 1 names
f
Data from: Global Impacts Dataset of Invasive Alien Species (GIDIAS)
springernature.figshare.com
xlsx
Updated May 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sven Bacher; Ellen Ryan-Colton; Mario Coiro; Phillip Cassey; Bella S. Galil; Martín A. Nuñez; Michael Ansong; Katharina Dehnen-Schmutz; Georgi Fayvush; Romina Daiana Fernandez; Ankila Hiremath; Makihiko Ikegami; Angeliki F. Martinou; Shana M. McDermott; Cristina Preda; Montserrat Vilà; Olaf L. F. Weyl; Neelavara Ananthram Aravind; Katerina Athanasiou; Vidyadhar Atkore; Jacob N. Barney; Tim M. Blackburn; Eckehard G. Brockerhoff; Clinton Carbutt; Luca Carisio; Vanessa Céspedes; Diego F. Cisneros-Heredia; Meghan Cooling; Maarten de Groot; Jakovos Demetriou; James W. E. Dickey; Regan Early; Thomas E. Evans; Belinda Gallardo; Monica Gruber; Cang Hui; Jonathan Jeschke; Natalia Z. Joelson; Mohd Asgar Khan; Sabrina Kumschick; Lori Lach; Katharina Lapin; Simone Lioy; Chunlong Liu; Zoe J. MacMullen; Manuela A. Mazzitelli; G. John Measey; Agata A. Mrugała-Koese; Camille L. Musseau; Helen F. Nahrung; Alessia lucia Pepori; Luis R. Pertierra; Elizabeth F. Pienaar; Petr Pyšek; Gonzalo Rivas-Torres; Henry A. Rojas Martinez; JULISSA ROJAS-SANDOVAL; Ned Ryan-Schofield; Rocío M. Sánchez; Alberto Santini; Davide Santoro; Riccardo Scalera; Lisanna Schmidt; Tinyiko Cavin Shivambu; Sima Sohrabi; Elena Tricarico; Alejandro Trillo; Pieter G. van't Hof; Lara Volery; Tsungai A. Zengeya; Aikaterini Christopoulou; Virginia G. Duboscq-Carra; Ioanna A. Angelidou; Pilar Castro-Díez; Paola Tatiana Flores Males (2025). Global Impacts Dataset of Invasive Alien Species (GIDIAS) [Dataset]. http://doi.org/10.6084/m9.figshare.27908838.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.27908838.v1
Dataset updated
May 21, 2025
Dataset provided by
figshare
Authors
Sven Bacher; Ellen Ryan-Colton; Mario Coiro; Phillip Cassey; Bella S. Galil; Martín A. Nuñez; Michael Ansong; Katharina Dehnen-Schmutz; Georgi Fayvush; Romina Daiana Fernandez; Ankila Hiremath; Makihiko Ikegami; Angeliki F. Martinou; Shana M. McDermott; Cristina Preda; Montserrat Vilà; Olaf L. F. Weyl; Neelavara Ananthram Aravind; Katerina Athanasiou; Vidyadhar Atkore; Jacob N. Barney; Tim M. Blackburn; Eckehard G. Brockerhoff; Clinton Carbutt; Luca Carisio; Vanessa Céspedes; Diego F. Cisneros-Heredia; Meghan Cooling; Maarten de Groot; Jakovos Demetriou; James W. E. Dickey; Regan Early; Thomas E. Evans; Belinda Gallardo; Monica Gruber; Cang Hui; Jonathan Jeschke; Natalia Z. Joelson; Mohd Asgar Khan; Sabrina Kumschick; Lori Lach; Katharina Lapin; Simone Lioy; Chunlong Liu; Zoe J. MacMullen; Manuela A. Mazzitelli; G. John Measey; Agata A. Mrugała-Koese; Camille L. Musseau; Helen F. Nahrung; Alessia lucia Pepori; Luis R. Pertierra; Elizabeth F. Pienaar; Petr Pyšek; Gonzalo Rivas-Torres; Henry A. Rojas Martinez; JULISSA ROJAS-SANDOVAL; Ned Ryan-Schofield; Rocío M. Sánchez; Alberto Santini; Davide Santoro; Riccardo Scalera; Lisanna Schmidt; Tinyiko Cavin Shivambu; Sima Sohrabi; Elena Tricarico; Alejandro Trillo; Pieter G. van't Hof; Lara Volery; Tsungai A. Zengeya; Aikaterini Christopoulou; Virginia G. Duboscq-Carra; Ioanna A. Angelidou; Pilar Castro-Díez; Paola Tatiana Flores Males
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present the Global Impacts Dataset of Invasive Alien Species (GIDIAS), a global dataset of 22865 records including impacts of invasive alien species on nature, nature’s contributions to people, and good quality of life. Records include positive and negative impacts, neutral impacts (studies were carried out, but no impacts were documented), non-directional impacts (i.e., change without detriments or benefits for native species or people), and finally, some records of alien species where no studies were found that assessed their impacts (indicating data gaps). Records cover 3353 invasive alien species from all major taxa (plants, vertebrates, invertebrates, microorganisms) and all continents and realms (terrestrial, freshwater, marine). The data were compiled to serve as robust evidence for chapter 4 “Impacts of invasive alien species on nature, nature's contributions to people, and good quality of life” of the global assessment report on invasive alien species by the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services (IPBES; available on Zenodo at https://doi.org/10.5281/zenodo.7430731). The dataset is provided in a machine-readable CSV file (file name GIDIAS_20250417_machine_read.csv), with special language characters retained where used (UTF-8 format). The dataset is also provided in Excel format (file name GIDIAS_20250417_Excel.xlsx). Metadata is provided in Excel format, including descriptors for each variable (file name GIDIAS_metadata_20250417.xlsx). Additional explanations for GIDIAS is stored in Microsoft Word format (docx) and contains (1) a short description of the principles of Environmental and Socio-Economic Impact Classification for Alien Taxa (EICAT, SEICAT), (2) a description of the variables included in the Global Impacts Dataset of Invasive Alien Species GIDIAS, and (3) a compilation of the search strategies and datasets included in the Global Impact Dataset of Invasive Alien Species (GIDIAS).
Z
Data from: The global distribution of plants used by humans datasets: list...
data.niaid.nih.gov
Updated Jan 19, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Willis, Kathy J. (2024). The global distribution of plants used by humans datasets: list of utilised species, occurrence data and model outputs at 10 arc-minutes spatial resolution [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8176317
Explore at:
Dataset updated
Jan 19, 2024
Dataset provided by
Govaerts, Rafaël
Antonelli, Alexander
Dennehy-Carr, Zoe
Turner, Rob M.
Ondo, Ian
Cámara-Leret, Rodrigo
Willis, Kathy J.
Baquero, Andrea C.
Milliken, William
Patmore, Kristina
Hargreaves, Serene
Pironon, Samuel
Canteiro, Cátia
van Andel, Tinde R.
Schmelzer, Gaby
Ulian, Tiziana
Allkin, Robert
Nesbitt, Mark
Hudson, Alex J.
Lemmens, Roel
Diazgranados, Mauricio
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets and model outputs used to map the global distribution of utilised plants by humans. The folder is composed of two subfolders raw_data and processed_data containing respectively the list of utilised plant species modelled -utilised_plants_species_list.csv-, and their occurrence data -occurrence_data.zip- and predicted distribution -species_proba_per_cell.rds-.

The file utilised_plants_species_list.csv in the raw_data folder contains a list of 35687 plant species (and hybrids) used by humans and 10 plant use categories with the following 14 fields:

plant_ID: plant identifier number ranging from between 1-35687

binomial_acc_name: binomial accepted name of the plant species

author_acc_name: name of the author(s)

is_hybrid: logical TRUE or FALSE indicating whether the species is an hybrid or not.

AnimalFood: forage and fodder for vertebrate animals only.

EnvironmentalUses: examples include intercrops and nurse crops, ornamentals, barrier hedges, shade plants, windbreaks, soil improvers, plants for revegetation and erosion control, wastewater purifiers, indicators of the presence of metals, pollution, or underground water.

Fuels: charcoal, petroleum substitutes, fuel alcohols, etc. Given the importance of energy plants for people, those were distinguished from Materials.

GeneSources: wild relatives of major crops which may possess traits associated with biotic or abiotic resistance and may be valuable for breeding programs.

HumanFood: food for humans only, including beverages and food additives.

InvertebrateFood: plants consumed by invertebrates used by humans, such as bees, silkworms, lac insects and edible grubs.

Materials: woods, fibers, cork, cane, tannins, latex, resins, gums, waxes, oils, lipids, etc. and their derived products.

Medicines: both human and veterinary.

Poisons: plants which are poisonous to both vertebrates and invertebrates, both accidentally and intentionally, e.g., for hunting and fishing, molluscicides, herbicides, insecticides.

SocialsUses: plants used for social purposes, which cannot be defined as food or medicine, for instance, masticatories, smoking materials, narcotics, hallucinogens and psychoactive drugs, and plants with ritual or religious significance.

Totals: total number of uses recorded for a species

The zipfile occurrence_data.zip in the processed_data folder contains 35687 Comma Separated Values (CSV) files, one for each species, containing curated geographic occurrence records used to build species distribution models with the following 14 fields:

Species: the binomial accepted name of the species

Fullname: same as species

decimalLongitude: the geographic longitude of the occurrence records of the species in decimal degrees

decimalLatitude: the geographic latitude of the occurrence records of the species in decimal degrees

countryCode: a three-letter standard abbreviation for the country of the occurrence locality

coordinateUncertaintyinMeters: indicator for the accuracy of the coordinate location, described as the radius of a circle around the stated point location

year: year of the observation of the occurrence record of the species

individualCount: the number of individuals present at the time of the observation

gbifID: unique identifier number for the occurrence from the original database

basisOfRecords: the type of the individual record, e.g. observation, physical specimen, fossil, living ex-situ, culture collection specimen

institutionCode: the name of the institution or organization listed as the data publisher on GBIF

establishmentMeans: statement about whether an organism has been introduced to a given place and time through the direct or indirect activity of modern humans

is_cultivated_observation: whether or not an organism is cultivated

sourceID: name of the source database

The file species_proba_per_cell.rds in the processed_data folder is a R Data Serialization (RDS) file containing a data.table object with the following 3 fields:

plant_ID: plant identifier number ranging from between 1-35687

proba: species occurrence probability

cell: raster grid cell number between 1-2251762

This object can be used in combination with a raster layer to reconstruct the modelled distribution of each species or retrieve species richness and endemism.
ORBITAAL: cOmpRehensive BItcoin daTaset for temorAl grAph anaLysis - Dataset...
cryptodata.center
Updated Dec 4, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
cryptodata.center (2024). ORBITAAL: cOmpRehensive BItcoin daTaset for temorAl grAph anaLysis - Dataset - CryptoData Hub [Dataset]. https://cryptodata.center/dataset/orbitaal-comprehensive-bitcoin-dataset-for-temoral-graph-analysis
Explore at:
Dataset updated
Dec 4, 2024
Dataset provided by
CryptoDATA
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Construction This dataset captures the temporal network of Bitcoin (BTC) flow exchanged between entities at the finest time resolution in UNIX timestamp. Its construction is based on the blockchain covering the period from January, 3rd of 2009 to January the 25th of 2021. The blockchain extraction has been made using bitcoin-etl (https://github.com/blockchain-etl/bitcoin-etl) Python package. The entity-entity network is built by aggregating Bitcoin addresses using the common-input heuristic [1] as well as popular Bitcoin users' addresses provided by https://www.walletexplorer.com/ [1] M. Harrigan and C. Fretter, "The Unreasonable Effectiveness of Address Clustering," 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), Toulouse, France, 2016, pp. 368-373, doi: 10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071.keywords: {Online banking;Merging;Protocols;Upper bound;Bipartite graph;Electronic mail;Size measurement;bitcoin;cryptocurrency;blockchain}, Dataset Description Bitcoin Activity Temporal Coverage: From 03 January 2009 to 25 January 2021 Overview: This dataset provides a comprehensive representation of Bitcoin exchanges between entities over a significant temporal span, spanning from the inception of Bitcoin to recent years. It encompasses various temporal resolutions and representations to facilitate Bitcoin transaction network analysis in the context of temporal graphs. Every dates have been retrieved from bloc UNIX timestamp and GMT timezone. Contents: The dataset is distributed across three compressed archives: All data are stored in the Apache Parquet file format, a columnar storage format optimized for analytical queries. It can be used with pyspark Python package. orbitaal-stream_graph.tar.gz: The root directory is STREAM_GRAPH/ Contains a stream graph representation of Bitcoin exchanges at the finest temporal scale, corresponding to the validation time of each block (averaging approximately 10 minutes). The stream graph is divided into 13 files, one for each year Files format is parquet Name format is orbitaal-stream_graph-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (number of files here) such as sorting in increasing [ID] ordering is similar to sort by increasing year ordering These files are in the subdirectory STREAM_GRAPH/EDGES/ orbitaal-snapshot-all.tar.gz: The root directory is SNAPSHOT/ Contains the snapshot network representing all transactions aggregated over the whole dataset period (from Jan. 2009 to Jan. 2021). Files format is parquet Name format is orbitaal-snapshot-all.snappy.parquet. These files are in the subdirectory SNAPSHOT/EDGES/ALL/ orbitaal-snapshot-year.tar.gz: The root directory is SNAPSHOT/ Contains the yearly resolution of snapshot networks Files format is parquet Name format is orbitaal-snapshot-date-[YYYY]-file-id-[ID].snappy.parquet, where [YYYY] stands for the corresponding year and [ID] is an integer from 1 to N (number of files here) such as sorting in increasing [ID] ordering is similar to sort by increasing year ordering These files are in the subdirectory SNAPSHOT/EDGES/year/ orbitaal-snapshot-month.tar.gz: The root directory is SNAPSHOT/ Contains the monthly resoluted snapshot networks Files format is parquet Name format is orbitaal-snapshot-date-[YYYY]-[MM]-file-id-[ID].snappy.parquet, where [YYYY] and [MM] stands for the corresponding year and month, and [ID] is an integer from 1 to N (number of files here) such as sorting in increasing [ID] ordering is similar to sort by increasing year and month ordering These files are in the subdirectory SNAPSHOT/EDGES/month/ orbitaal-snapshot-day.tar.gz: The root directory is SNAPSHOT/ Contains the daily resoluted snapshot networks Files format is parquet Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-file-id-[ID].snappy.parquet, where [YYYY], [MM], and [DD] stand for the corresponding year, month, and day, and [ID] is an integer from 1 to N (number of files here) such as sorting in increasing [ID] ordering is similar to sort by increasing year, month, and day ordering These files are in the subdirectory SNAPSHOT/EDGES/day/ orbitaal-snapshot-hour.tar.gz: The root directory is SNAPSHOT/ Contains the hourly resoluted snapshot networks Files format is parquet Name format is orbitaal-snapshot-date-[YYYY]-[MM]-[DD]-[hh]-file-id-[ID].snappy.parquet, where [YYYY], [MM], [DD], and [hh] stand for the corresponding year, month, day, and hour, and [ID] is an integer from 1 to N (number of files here) such as sorting in increasing [ID] ordering is similar to sort by increasing year, month, day and hour ordering These files are in the subdirectory SNAPSHOT/EDGES/hour/ orbitaal-nodetable.tar.gz: The root directory is NODE_TABLE/ Contains two files in parquet format, the first one gives information related to nodes present in stream graphs and snapshots such as period of activity and associated global Bitcoin balance, and the other one contains the list of all associated Bitcoin addresses. Small samples in CSV format orbitaal-stream_graph-2016_07_08.csv and orbitaal-stream_graph-2016_07_09.csv These two CSV files are related to stream graph representations of an halvening happening in 2016.
o
Global Zomato Dataset
opendatabay.com
.csv
Updated Jun 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasimple (2025). Global Zomato Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/2d4d09a3-1be3-4e57-b435-471c7faf8365
Explore at:
.csvAvailable download formats
Dataset updated
Jun 21, 2025
Dataset authored and provided by
Datasimple
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Data Science and Analytics
Description
Overview of the Foody Dataset Everyone has trouble picking their favourite dish when you're out of your town here we have come up with a food delivery app dataset that can help you find your mouth-watering dishes within your pocket. The Zomato dataset provides restaurant information, including location, cuisines, price, ratings, and more. It enables analysis of factors affecting popularity, such as cuisine type, booking, and delivery, facilitating personalised restaurant recommendations and insights into the food delivery industry.

Problem Statements: To develop a restaurant recommendation system using the dataset to suggest personalized dining options based on user preferences, location, and restaurant attributes, enhancing the dining experience.

Columns

Here's a description of the columns in your dataset:

Restaurant ID: A unique identifier for each restaurant in the dataset. Restaurant Name: The name of the restaurant. City: The city where the restaurant is located. Address: The specific address of the restaurant. Locality: The locality or neighbourhood where the restaurant is situated. Longitude: The longitude coordinate of the restaurant's location. Latitude: The latitude coordinate of the restaurant's location. Cuisines: The type of cuisine offered by the restaurant. eg: Japanese, Thai, Chinese, Mughlai, etc. Average Cost for two: The average cost for a meal for two people at the restaurant. Currency: The currency in which the average cost is denoted. Has Table booking: Indicates whether the restaurant accepts table bookings (Yes/No). Has Online delivery: Indicates whether the restaurant provides online food delivery services (Yes/No). Is delivering now: Indicates whether the restaurant is currently delivering food (Yes/No). Price range: The price range category of the restaurant from 1 to 4. One being the less price and 4 being the high price. Aggregate rating: The overall rating of the restaurant based on user reviews. Rating colour: The colour representation of the rating (e.g., Dark green, Green, Yellow, orange, red, and white). Rating text: The text representation of the rating (e.g., Excellent, Very good, Good, Average, poor, and Not rated ). Votes: The total number of user votes or reviews received by the restaurant. Questions for solving:

Can the location (city or locality) of a restaurant influence its average cost for two people? Is there a relationship between the type of cuisine offered by a restaurant and its aggregate rating? How does the average cost for two people at a restaurant correlate with its aggregate rating? Does the presence of table booking and online delivery options impact a restaurant's aggregate rating? How does the number of votes/reviews received by a restaurant relate to its aggregate rating and popularity? Hoping that you would find insightful predictions for your text-long trip.

Happy Learning!!!!

Don't forget to Upvote my food lovers… Kindly, upvote if you find the dataset interesting. Thank you.

License

CC0 Original Data Source: Global Zomato Dataset
Global soil organisms
gbif.org
smng.net
+1more
Updated Feb 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PlutoF (2023). Global soil organisms [Dataset]. http://doi.org/10.15468/fdpeaw
Explore at:
Unique identifier
https://doi.org/10.15468/fdpeaw
Dataset updated
Feb 27, 2023
Dataset provided by
Global Biodiversity Information Facilityhttps://www.gbif.org/
PlutoF
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Global distribution of soil organisms. Data deposited in this project represent unique (non-clustered) sequences. These sequences are members of the curated OTU list (tag-jump filtered and chimera-free, clustered at 98% similarity threshold) from the GSMc dataset (Tedersoo et al., Fungal Diversity, 2021, https://doi.org/10.1007/s13225-021-00493-7). For each OTU in each sample, within-OTU sequences were dereplicated ignoring terminal gaps; in the presence of sequence variants differing only in the length of homopolymeric regions, only the most abundant variant was preserved. Taxonomic annotation was transferred from the representative sequence of each OTU to all unique sequences clustered in it. The current dataset includes additional soil samples not covered by the published article (Tedersoo et al., Fungal Diversity, 2021). Additional samples were collected following a slightly different sampling protocol. Taxon occurrences originating from these samples can be filtered out by Dataset name ('Global soil samples subproject (sequences from additional samples)') and Dataset ID (108273). The number of distinct sampling sites: 3 736, sampling events: 4 514. The number of unique taxa based on UNITE species hypotheses on 1.5% distance threshold: 292 413.

COVID19 Additional Data

kaggle.com

Updated Apr 9, 2020

Facebook

Twitter

Click to copy link

Link copied

Cite

Orzhiang (2020). COVID19 Additional Data [Dataset]. https://www.kaggle.com/datasets/orzhiang/covid19-additional-data/versions/11

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Apr 9, 2020

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Orzhiang

Description

This is a collection of dataset that I personally think it is useful in analysing COVID19 data. Since all of the data comes from the internet and majority of them originated from World Bank, I am use some Kaggle users has already uploaded similar data. However, I think it makes my life (and perhaps yours) easier by compiling all of these data together.

The following are some remarks for the dataset-

Dataset Title	Descriptions
Other source of COVID19 Cases	https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset#time_series_covid_19_confirmed.csv
Mortality Table	https://www.kaggle.com/robikscube/world-health-organization-who-mortality-database
Economic Freedom Index	https://www.kaggle.com/lewisduncan93/the-economic-freedom-index
World Bank Development Indicators	https://www.kaggle.com/theworldbank/world-development-indicators
Weather Data	https://www.kaggle.com/hbfree/covid19formattedweatherjan22march24
Government Response	https://www.bsg.ox.ac.uk/research/research-projects/oxford-covid-19-government-response-tracker
Containment and Mitigation Measures	https://www.kaggle.com/paultimothymooney/covid-19-containment-and-mitigation-measures/
World Happiness Report	https://www.kaggle.com/londeen/world-happiness-report-2020
Weather Data 2	https://www.kaggle.com/noaa/gsod
US Data Prior to 2020-03-09	https://www.kaggle.com/johnjdavisiv/jhu-covid19-data-with-us-state-data-prior-to-mar-9
OCED Hospital Bed per 1000 inhabitants	https://www.kaggle.com/cpmpml/oecd-hospital-beds-per-1000-inhabitant
Covid 19 data by the US States	https://www.kaggle.com/scirpus/covid-by-state
COVID 19 Demographic predictors	https://www.kaggle.com/nightranger77/covid19-demographic-predictors
Country Info	https://www.kaggle.com/koryto/countryinfo
Population by location	https://www.kaggle.com/dgrechka/covid19-global-forecasting-locations-population
00 COVID19 Country Mapping Table	A mapping table serve as a link between world bank country name & country code with the country name used in COVID19 Competition. It makes linking the COVID19 data and World Bank data much easier.
01 Population_API_SP.POP.TOTL	https://data.worldbank.org/indicator/sp.pop.totl
01_1 China Demographic Data	Source: http://www.chamiji.com/2019chinaprovincepopulation http://www.stats.gov.cn/tjsj/ndsj/2017/indexeh.htm http://data.stats.gov.cn/english/easyquery.htm?cn=C01 http://www.gov.cn/test/2007-08/07/content_708271.htm

h
lmsys-chat-1m
huggingface.co
opendatalab.com
Updated Sep 17, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Large Model Systems Organization (2023). lmsys-chat-1m [Dataset]. https://huggingface.co/datasets/lmsys/lmsys-chat-1m
Explore at:
Dataset updated
Sep 17, 2023
Dataset authored and provided by
Large Model Systems Organization
Description
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

This dataset contains one million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210K unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website from April to August 2023. Each sample includes a conversation ID, model name, conversation text in OpenAI API JSON format, detected language tag, and OpenAI moderation API tag. User consent is obtained through the "Terms of use"… See the full description on the dataset page: https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
o
The Global Soundscapes Project: overview of datasets and meta-data
explore.openaire.eu
Updated May 11, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kevin F. A. Kevin F.A. Darras; Steven Van Wilgenburg; Rodney Rountree; Yuhang Song; Youfang Chen; Thomas Cherico Wanger (2022). The Global Soundscapes Project: overview of datasets and meta-data [Dataset]. http://doi.org/10.5281/zenodo.6537739
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6537739
Dataset updated
May 11, 2022
Authors
Kevin F. A. Kevin F.A. Darras; Steven Van Wilgenburg; Rodney Rountree; Yuhang Song; Youfang Chen; Thomas Cherico Wanger
Description
This is an overview of the soundscape recording datasets that have been contributed to the Global Soundscapes Project, as well as associated meta-data. The audio recording criteria justifying inclusion into the current meta-dataset are: Stationary (no towed sensors or microphones mounted on cars) Passive (no human disturbance by the recordist) Ambient (no focus on a particular species or direction) Recorded over multiple sites of a region and/or days The individual columns are described as follows. General: ID: primary key name: name of the dataset subset: incremental integer that can be used to distinguish sub-datasets collaborators: full names of people deemed responsible for the dataset, separated by commas date_added: when the dataset was added Space: realm_IUCN: realm from IUCN Global Ecosystem Typology (v2.0) (https://global-ecosystems.org/) medium: the physical medium the microphone is situated in GADM0: for terrestrial locations, Database of Global Administrative Areas level 0 unit as per https://gadm.org/ GADM1: for terrestrial locations, Database of Global Administrative Areas level 1 unit as per https://gadm.org/ GADM2: for terrestrial locations, Database of Global Administrative Areas level 2 unit as per https://gadm.org/ IHO: International Hydrographic Organisation sea area as per https://iho.int/ latitude_numeric_region: study region approximate centroid latitude in WGS84 decimal degrees longitude_numeric_region: study region approximate centroid longitude in WGS84 decimal degrees topography_min_m: minimum elevation of sites from sea level topography_max_m: maximum elevation of sites from sea level ground_distance_m: vertical distance of microphone from land ground or ocean floor freshwater_depth_m: vertical distance from water surface for freshwater datasets sites_number: number of sites sampled Time: days_number_per_site: typical number of days sampled per site (or minimum if too variable) day: whether the sites were sampled during daytime night: whether the sites were sampled during nighttime twilight: whether the sites were sampled during twilight warm_season: whether the warm season was sampled. Only outside tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification) cold_season: whether the cold season was sampled. Only outside tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification) dry_season: whether the dry season was sampled. Only for tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification) wet_season: whether the wet season was sampled. Only for tropics (https://en.wikipedia.org/wiki/K%C3%B6ppen_climate_classification) year_start: starting year of the sampling year_end: ending year of the sampling schedule: description of the sampling schedule, free text recording_selection: criteria used to temporally select recordings (e.g., discarded rainy days) Audio: high_pass_filter_Hz: lower frequency of the high-pass filter sampling_frequency_kHz: frequency the microphone was sampled at audio_bit_depth: bit depth used for encoding audio recorder_model: recorder model used microphone: microphone used recordist_position: position of the recordist relative to the microphone during sampling Others: comments: free-text field URL_project: internet link for further information URL_publication: internet link of the corresponding publication More information on the project can be found here: https://ecosound-web.uni-goettingen.de/ecosound_web/project/gsp adding IHO data
o
Synthetic population for JOR
explore.openaire.eu
zenodo.org
Updated Apr 30, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abhijin Adiga; Hannah Baek; Stephen Eubank; Przemyslaw Porebski; Madhav Marathe; Henning Mortveit; Samarth Swarup; Mandy Wilson; Dawen Xie (2022). Synthetic population for JOR [Dataset]. http://doi.org/10.5281/zenodo.6503397
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6503397
Dataset updated
Apr 30, 2022
Authors
Abhijin Adiga; Hannah Baek; Stephen Eubank; Przemyslaw Porebski; Madhav Marathe; Henning Mortveit; Samarth Swarup; Mandy Wilson; Dawen Xie
Description
Synthetic populations for regions of the World (SPW) | JordanDataset informationA synthetic population of a region as provided here, captures the people of the region with selected demographic attributes, their organization into households, their assigned activities for a day, the locations where the activities take place and thus where interactions among population members happen (e.g., spread of epidemics). LicenseCC-BY-4.0 AcknowledgmentThis project was supported by the National Science Foundation under the NSF RAPID: COVID-19 Response Support: Building Synthetic Multi-scale Networks (PI: Madhav Marathe, Co-PIs: Henning Mortveit, Srinivasan Venkatramanan; Fund Number: OAC-2027541). Contact informationHenning.Mortveit@virginia.edu Identifiers Region name Jordan Region ID jor Model coarse Version 0_9_0 Statistics Name Value Population 5723567.0 Average age 23.5 Households 1235755.0 Average household size 4.6 Residence locations 1235755.0 Activity locations 131978.0 Average number of activities 6.4 Average travel distance 44.5 Sources Description Name Version Url Activity template data World Bank 2021 https://data.worldbank.org Administrative boundaries ADCW 7.6 https://www.adci.com/adc-worldmap Curated POIs based on OSM SLIPO/OSM POIs http://slipo.eu/?p=1551 https://www.openstreetmap.org/ Household data DHS https://dhsprogram.com Population count with demographic attributes GPW v4.11 https://sedac.ciesin.columbia.edu/data/set/gpw-v4-admin-unit-center-points-population-estimates-rev11 Files descriptionBase data files (jor_data_v_0_9.zip) Filename Description jor_person_v_0_9.csv Data for each person including attributes such as age, gender, and household ID. jor_household_v_0_9.csv Data at household level. jor_residence_locations_v_0_9.csv Data about residence locations jor_activity_locations_v_0_9.csv Data about activity locations, including what activity types are supported at these locations jor_activity_location_assignment_v_0_9.csv For each person and for each of their activities, this file specifies the location where the activity takes place Derived data files Filename Description jor_contact_matrix_v_0_9.csv A POLYMOD-type contact matrix constructed from a network representation of the location assignment data and a within-location contact model. Validation and measures files Filename Description jor_household_grouping_validation_v_0_9.pdf Validation plots for household construction jor_activity_durations_{adult,child}_v_0_9.pdf Comparison of time spent on generated activities with survey data jor_activity_patterns_{adult,child}_v_0_9.pdf Comparison of generated activity patterns by the time of day with survey data jor_location_construction_0_9.pdf Validation plots for location construction jor_location_assignement_0_9.pdf Validation plots for location assignment, including travel distribution plots jor_jor_ver_0_9_0_avg_travel_distance.pdf Choropleth map visualizing average travel distance jor_jor_ver_0_9_0_travel_distr_combined.pdf Travel distance distribution jor_jor_ver_0_9_0_num_activity_loc.pdf Choropleth map visualizing number of activity locations jor_jor_ver_0_9_0_avg_age.pdf Choropleth map visualizing average age jor_jor_ver_0_9_0_pop_density_per_sqkm.pdf Choropleth map visualizing population density jor_jor_ver_0_9_0_pop_size.pdf Choropleth map visualizing population size
Popular White Last Names in the US
johnsnowlabs.com
csv
Updated Jan 20, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
John Snow Labs (2021). Popular White Last Names in the US [Dataset]. https://www.johnsnowlabs.com/marketplace/popular-white-last-names-in-the-us/
Explore at:
csvAvailable download formats
Dataset updated
Jan 20, 2021
Dataset authored and provided by
John Snow Labs
Area covered
United States
Description
This dataset represents the popular last names in the United States for White.

Facebook

Twitter

Click to copy link

Link copied

Cite

Name Census (2023). Name Census top 100 surnames [Dataset]. https://www.kaggle.com/datasets/namecensus/name-census-top-100-surnames

Name Census top 100 surnames

Surname database with the top 100 surnames per country

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Mar 25, 2023

Dataset provided by

Kaggle

Authors

Name Census

License

http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

Description

Name Census top 100 surnames

In 2018 we were doing a research project, and we needed to know if a name was male or female. After Googling for hours for 'baby name lists', 'name databases' and 'name datasets' we discovered that there wasn't a complete name database for all countries with first names and gender. Most name database layouts we found different per country, were incomplete or contained non-existing names. That is why we created Name Census, the most comprehensive name database in the world! The Name Census top 100 databases is a free database containing the top 100 first names and top 100 surnames for each country.

Collection methodology

Our name database is created using first names and surnames obtained from governments and cross-referencing with millions of names from publicly available social media profiles. We took all those names and used millions of social media profiles that where publicly available to cross-reference and count each name per country. This way we were sure that the names in our name database are actually used and we could create our popularity metric. We now offer the complete name database and the name parsing service as separate services.

Content

The Name Census top 100 is a name database that consists out of two files; the first names top 100 per country and the surnames top 100 per country. Each file is a CSV file formatted in UTF-8.

Clear search

Close search

Google apps

Main menu

Name Census top 100 surnames

Name Census top 100 surnames

Collection methodology

Content

Geonames - All Cities with a population > 1000

Global Country Information 2023

COVID Impact Survey - Public Data

Overview

Queries

Margin of Error

About the Data

Attribution

AP Data Distributions

HitCompanies Dataset

100-richest-people-in-world

World Population Data

Context

Content

Dataset

Structure

Acknowledgment

Worldwide Soundscapes project meta-data

The ORBIT (Object Recognition for Blind Image Training)-India Dataset

GBIF Backbone Taxonomy

Data from: Global Impacts Dataset of Invasive Alien Species (GIDIAS)

Data from: The global distribution of plants used by humans datasets: list...

ORBITAAL: cOmpRehensive BItcoin daTaset for temorAl grAph anaLysis - Dataset...

Global Zomato Dataset

License

Global soil organisms

COVID19 Additional Data

lmsys-chat-1m

The Global Soundscapes Project: overview of datasets and meta-data

Synthetic population for JOR

Popular White Last Names in the US

Name Census top 100 surnames

Surname database with the top 100 surnames per country

Name Census top 100 surnames

Collection methodology

Content