70 datasets found
  1. Dataset of books called People and education in the Third World

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called People and education in the Third World [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=People+and+education+in+the+Third+World
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    This dataset is about books. It has 1 row and is filtered where the book is People and education in the Third World. It features 7 columns including author, publication date, language, and book publisher.
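    The filter described above appears to be encoded in the query string of the cited URL. The following is a minimal sketch of how such a URL could be rebuilt; the parameter names (`f`, `fcol0`, `fop0`, `fval0`) are copied from the citation, their meaning is inferred rather than documented, and `book_filter_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.workwithdata.com/datasets/books"


def book_filter_url(title: str) -> str:
    """Build a filtered-dataset URL for an exact match on the book title."""
    # Parameter names are taken from the cited URL; their semantics
    # (column, operator, value) are an assumption.
    params = {"f": 1, "fcol0": "book", "fop0": "=", "fval0": title}
    return f"{BASE_URL}?{urlencode(params)}"


print(book_filter_url("People and education in the Third World"))
```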

  2. Famous Celebrity Name Misspellings

    • kaggle.com
    Updated Jan 22, 2023
    Cite
    The Devastator (2023). Famous Celebrity Name Misspellings [Dataset]. https://www.kaggle.com/datasets/thedevastator/famous-celebrity-name-misspellings
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    Famous Celebrity Name Misspellings

    Aggregated data from The Gyllenhaal Experiment

    By data.world's Admin [source]

    About this dataset

    This dataset contains aggregated spellings and misspellings of the names of 15 famous celebrities. Ever wonder if people can actually spell someone's name correctly? Now you can see for yourself with this compiled data from The Pudding's interactive spelling experiment called The Gyllenhaal Experiment! It is interesting to see which names get misspelled more than others: some are easy to guess, some are surprising. With the data provided here, you can start uncovering trends in name-spelling habits. Visualize the data and start analyzing how unique or common each celebrity is with respect to spelling: who stands out, and who blends in? Check it out today and explore a side of celebrity life that hasn't been seen before!

    How to use the dataset

    This dataset contains misspelled names of 15 famous celebrities. It can be used for a variety of research and analysis purposes, including exploring human language, understanding how names are misspelled, or generating data visualizations.

    To get the most out of this dataset, you will need to familiarize yourself with its columns. The dataset consists of two columns: “data” and “updated”. The “data” column contains the misspellings associated with each celebrity name. The “updated” column holds the date on which the data was last changed or modified.

    To use this dataset for your own research, you may find it useful to filter out certain types of responses or patterns to focus on particular trends. For example, if you are interested in how spellings vary by region, you might group similar responses regardless of which celebrity name they match (e.g., categorizing all spellings that follow a certain phonetic pattern). You can also separate responses into groups to explore other aspects such as popularity (e.g., looking at which misspellings occurred most frequently).
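    As a rough illustration of the popularity idea above, the sketch below counts how often each spelling occurs. It assumes one spelling per row in the “data” column; the sample rows, the date format, and the `count_spellings` helper are made up for the example and are not taken from data-all.csv.

```python
import csv
import io
from collections import Counter

# Made-up sample in the two-column layout ("data", "updated").
# The real file may structure the "data" column differently.
SAMPLE_CSV = """data,updated
Jake Gyllenhal,2019-02-01
Jake Gyllenhall,2019-02-01
jake gyllenhal,2019-02-02
"""


def count_spellings(csv_text: str) -> Counter:
    """Count case-insensitive occurrences of each spelling in the "data" column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["data"].strip().lower() for row in reader)


counts = count_spellings(SAMPLE_CSV)
for spelling, n in counts.most_common():
    print(f"{spelling}: {n}")
```

    For the real file, the same function could be fed the text of data-all.csv once its actual layout has been checked.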

    Research Ideas

    • Creating an interactive quiz for users to test their spelling ability by challenging them to spell names correctly from the celebrity dataset.
    • Building a dictionary database of the misspellings, fans’ nicknames and phonetic spellings of each celebrity so that people can find more information about them more easily and accurately.
    • Measuring the popularity of individual celebrities by tracking the frequency with which their names are misspelled.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    See the dataset description for more information.

    Columns

    File: data-all.csv

    | Column name | Description                                         |
    |:------------|:----------------------------------------------------|
    | data        | Misspellings of celebrity names. (String)           |
    | updated     | Date when the misspelling was last updated. (Date)  |

    Acknowledgements

    If you use this dataset in your research, please credit data.world's Admin.

  3. Geonames - All Cities with a population > 1000

    • public.opendatasoft.com
    • data.smartidf.services
    • +2 more
    csv, excel, geojson +1
    Updated Mar 10, 2024
    Cite
    (2024). Geonames - All Cities with a population > 1000 [Dataset]. https://public.opendatasoft.com/explore/dataset/geonames-all-cities-with-a-population-1000/
    Explore at:
    csv, json, geojson, excel (available download formats)
    Dataset updated
    Mar 10, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All cities with a population > 1000 or seats of adm div (ca. 80,000)

    Sources and Contributions

    • Sources: GeoNames aggregates over a hundred different data sources.
    • Ambassadors: GeoNames Ambassadors help in many countries.
    • Wiki: A wiki allows users to view the data, quickly fix errors, and add missing places.
    • Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.

    Enrichment: add country name

  4. Dataset of books called Between heaven and earth : the religious worlds...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Between heaven and earth : the religious worlds people make and the scholars who study them [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Between+heaven+and+earth+%3A+the+religious+worlds+people+make+and+the+scholars+who+study+them
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Earth
    Description

    This dataset is about books. It has 2 rows and is filtered where the book is Between heaven and earth : the religious worlds people make and the scholars who study them. It features 7 columns including author, publication date, language, and book publisher.

  5. COVID Impact Survey - Public Data

    • data.world
    csv, zip
    Updated Oct 16, 2024
    Cite
    The Associated Press (2024). COVID Impact Survey - Public Data [Dataset]. https://data.world/associatedpress/covid-impact-survey-public-data
    Explore at:
    csv, zip (available download formats)
    Dataset updated
    Oct 16, 2024
    Authors
    The Associated Press
    Description

    Overview

    The Associated Press is sharing data from the COVID Impact Survey, which provides statistics about physical health, mental health, economic security and social dynamics related to the coronavirus pandemic in the United States.

    Conducted by NORC at the University of Chicago for the Data Foundation, the probability-based survey provides estimates for the United States as a whole, as well as in 10 states (California, Colorado, Florida, Louisiana, Minnesota, Missouri, Montana, New York, Oregon and Texas) and eight metropolitan areas (Atlanta, Baltimore, Birmingham, Chicago, Cleveland, Columbus, Phoenix and Pittsburgh).

    The survey is designed to allow for an ongoing gauge of public perception, health and economic status to see what is shifting during the pandemic. When multiple sets of data are available, it will allow for the tracking of how issues ranging from COVID-19 symptoms to economic status change over time.

    The survey is focused on three core areas of research:

    • Physical Health: Symptoms related to COVID-19, relevant existing conditions and health insurance coverage.
    • Economic and Financial Health: Employment, food security, and government cash assistance.
    • Social and Mental Health: Communication with friends and family, anxiety and volunteerism. (Questions based on those used on the U.S. Census Bureau’s Current Population Survey.)

    Using this Data - IMPORTANT

    This is survey data and must be properly weighted during analysis: DO NOT REPORT THIS DATA AS RAW OR AGGREGATE NUMBERS!!

    Instead, use our queries linked below or statistical software such as R or SPSS to weight the data.
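    To illustrate why weighting matters, here is a minimal sketch of a weighted share computed in plain Python. The responses, the weights, and the `weighted_share` helper are invented for the example; the real variable and weight column names should be taken from the survey codebook.

```python
def weighted_share(responses, weights, answer):
    """Return the weighted fraction of respondents who gave `answer`."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    matching = sum(w for r, w in zip(responses, weights) if r == answer)
    return matching / total


# Invented example: four respondents with unequal survey weights.
responses = ["never", "never", "often", "never"]
weights = [0.5, 1.5, 2.0, 1.0]

raw_share = responses.count("never") / len(responses)
share = weighted_share(responses, weights, "never")
print(f"unweighted: {raw_share:.2f}, weighted: {share:.2f}")
```

    In this toy case the unweighted share is 0.75 while the weighted share is 0.60, which is the kind of gap the warning above is about.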

    Queries

    If you'd like to create a table to see how people nationally or in your state or city feel about a topic in the survey, use the survey questionnaire and codebook to match a question (the variable label) to a variable name. For instance, "How often have you felt lonely in the past 7 days?" is variable "soc5c".

    Nationally: Go to this query and enter soc5c as the variable. Hit the blue Run Query button in the upper right hand corner.

    Local or State: To find figures for that response in a specific state, go to this query and type in a state name and soc5c as the variable, and then hit the blue Run Query button in the upper right hand corner.

    The resulting sentence you could write out of these queries is: "People in some states are less likely to report loneliness than others. For example, 66% of Louisianans report feeling lonely on none of the last seven days, compared with 52% of Californians. Nationally, 60% of people said they hadn't felt lonely."

    Margin of Error

    The margin of error for the national and regional surveys is found in the attached methods statement. You will need the margin of error to determine if the comparisons are statistically significant. If the difference is:

    • At least twice the margin of error, you can report there is a clear difference.
    • At least as large as the margin of error, you can report there is a slight or apparent difference.
    • Less than or equal to the margin of error, you can report that the respondents are divided or there is no difference.

    A Note on Timing

    Survey results will generally be posted under embargo on Tuesday evenings. The data is available for release at 1 p.m. ET Thursdays.
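    The thresholds listed under "Margin of Error" can be sketched as a small helper. This is only an illustration: `describe_difference` is a hypothetical name, the margin of 5 points is made up, and a difference exactly equal to the margin (which the list places in two categories) is treated here as "slight or apparent" by assumption.

```python
def describe_difference(pct_a, pct_b, margin_of_error):
    """Classify the gap between two estimates relative to the margin of error."""
    if margin_of_error <= 0:
        raise ValueError("margin_of_error must be positive")
    diff = abs(pct_a - pct_b)
    if diff >= 2 * margin_of_error:
        return "clear difference"
    # Assumption: a gap exactly equal to the margin counts as "slight".
    if diff >= margin_of_error:
        return "slight or apparent difference"
    return "divided or no difference"


# 66% vs. 52% are the example figures quoted earlier; the margin is hypothetical.
print(describe_difference(66, 52, margin_of_error=5))
```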

    About the Data

    The survey data will be provided under embargo in both comma-delimited and statistical formats.

    Each set of survey data will be numbered and have the date the embargo lifts in front of it in the format of: 01_April_30_covid_impact_survey. The survey has been organized by the Data Foundation, a non-profit non-partisan think tank, and is sponsored by the Federal Reserve Bank of Minneapolis and the Packard Foundation. It is conducted by NORC at the University of Chicago, a non-partisan research organization. (NORC is not an abbreviation; it is part of the organization's formal name.)

    Data for the national estimates are collected using the AmeriSpeak Panel, NORC’s probability-based panel designed to be representative of the U.S. household population. Interviews are conducted with adults age 18 and over representing the 50 states and the District of Columbia. Panel members are randomly drawn from AmeriSpeak with a target of achieving 2,000 interviews in each survey. Invited panel members may complete the survey online or by telephone with an NORC telephone interviewer.

    Once all the study data have been made final, an iterative raking process is used to adjust for any survey nonresponse as well as any noncoverage or under- and oversampling resulting from the study-specific sample design. Raking variables include age, gender, census division, race/ethnicity, education, and county groupings based on county-level counts of the number of COVID-19 deaths. Demographic weighting variables were obtained from the 2020 Current Population Survey. The count of COVID-19 deaths by county was obtained from USA Facts. The weighted data reflect the U.S. population of adults age 18 and over.
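    For readers unfamiliar with raking, the sketch below shows the general idea (iterative proportional fitting) on a toy sample with two variables. It is not NORC's procedure: the respondents, the target shares, and the fixed iteration count are all invented for illustration.

```python
def rake(rows, targets, iterations=50):
    """Return one weight per row so weighted shares approach the target shares."""
    weights = [1.0] * len(rows)
    for _ in range(iterations):
        for variable, target_shares in targets.items():
            # Current weighted total of each category of this variable.
            totals = {}
            for row, weight in zip(rows, weights):
                totals[row[variable]] = totals.get(row[variable], 0.0) + weight
            grand_total = sum(totals.values())
            # Rescale every row so its category matches the target share.
            weights = [
                weight * target_shares[row[variable]] * grand_total / totals[row[variable]]
                for row, weight in zip(rows, weights)
            ]
    return weights


def weighted_shares(rows, weights, variable):
    """Weighted share of each category of `variable`."""
    totals = {}
    for row, weight in zip(rows, weights):
        totals[row[variable]] = totals.get(row[variable], 0.0) + weight
    grand_total = sum(totals.values())
    return {category: total / grand_total for category, total in totals.items()}


# Invented respondents and population targets (shares sum to 1 per variable).
respondents = [
    {"age": "18-44", "gender": "F"},
    {"age": "18-44", "gender": "F"},
    {"age": "18-44", "gender": "M"},
    {"age": "45+", "gender": "F"},
    {"age": "45+", "gender": "M"},
]
targets = {
    "age": {"18-44": 0.45, "45+": 0.55},
    "gender": {"F": 0.52, "M": 0.48},
}

weights = rake(respondents, targets)
print(weighted_shares(respondents, weights, "age"))
print(weighted_shares(respondents, weights, "gender"))
```

    A production implementation would normally stop on a convergence criterion and guard against empty categories rather than run a fixed number of passes.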

    Data for the regional estimates are collected using a multi-mode address-based (ABS) approach that allows residents of each area to complete the interview via web or with an NORC telephone interviewer. All sampled households are mailed a postcard inviting them to complete the survey either online using a unique PIN or via telephone by calling a toll-free number. Interviews are conducted with adults age 18 and over with a target of achieving 400 interviews in each region in each survey. Additional details on the survey methodology and the survey questionnaire are attached below or can be found at https://www.covid-impact.org.

    Attribution

    Results should be credited to the COVID Impact Survey, conducted by NORC at the University of Chicago for the Data Foundation.

    AP Data Distributions

    To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go here or email kromano@ap.org.

  6. Dataset of books called Denying democracy : how the IMF and World Bank take...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Denying democracy : how the IMF and World Bank take power from people [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Denying+democracy+%3A+how+the+IMF+and+World+Bank+take+power+from+people
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Denying democracy : how the IMF and World Bank take power from people. It features 7 columns including author, publication date, language, and book publisher.

  7. Dataset of Burkhardt 2022 Encyclopaedia of Eponymic Plant Names

    • zenodo.org
    Updated Apr 29, 2025
    Cite
    Heather Lynn Lindon; Sabine von Mering; Siobhan Leachman; Carmen Ulloa Ulloa (2025). Dataset of Burkhardt 2022 Encyclopaedia of Eponymic Plant Names [Dataset]. http://doi.org/10.5281/zenodo.14551490
    Explore at:
    Dataset updated
    Apr 29, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Heather Lynn Lindon; Sabine von Mering; Siobhan Leachman; Carmen Ulloa Ulloa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In 2022, Lotte Burkhardt published a free PDF entitled Encyclopedia of Eponymic Plant names. It consists of two volumes: one lists all plant, algae, lichen, fossil plant, and fungal genera with the person they were named after; the other takes the list of people honored and lists the genera named after them. It can be found online here.

    This dataset was created by Carmen Ulloa Ulloa by scraping the PDF of the A-Z names of people honored and converting it into a Google Sheet. The data were normalized, with each row representing a person and the eponymic genera and associated families split into multiple columns to make analysis easier. The data were then cleaned, as the conversion from PDF was not 100% accurate: some names were split onto multiple lines, some characters were misread, etc. The gender of the authors was annotated by the Women Plant Genera working group as part of our follow-up work to a previous paper.

    We have split the resulting table into three files. The first contains the entire list of people honoured and the genera named for them. The other two are subsets of that table: one contains only the flowering plant genera, and the other excludes plant genera.

    Most of the women in the plants-only tab have been marked up from this project. More information could be added to the women for whom non-plant genera were named. We highly encourage anyone who is interested in an analysis of their own based on this data to do so, and to get in touch with us with any questions. We anticipate that work on additional groups will deepen our understanding of the impact of the contributions women have made to botany. Our hope is that by making this dataset publicly available, others will explore the world of genera and eponymy, looking at interesting stories of people for whom genera were named.

    The team would be grateful for any updates or corrections to this data, and we plan to publish updated versions of this dataset accordingly.

  8. ERA5 hourly data on pressure levels from 1940 to present

    • cds.climate.copernicus.eu
    • cds-test-cci2.copernicus-climate.eu
    grib
    Updated Jul 14, 2025
    Cite
    ECMWF (2025). ERA5 hourly data on pressure levels from 1940 to present [Dataset]. http://doi.org/10.24381/cds.bd0915c6
    Explore at:
    grib (available download formats)
    Dataset updated
    Jul 14, 2025
    Dataset provided by
    European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
    Authors
    ECMWF
    License

    https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/cc-by/cc-by_f24dc630aa52ab8c52a0ac85c03bc35e0abc850b4d7453bdc083535b41d5a5c3.pdf

    Time period covered
    Jan 1, 1940 - Jul 8, 2025
    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis.

    Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations and, when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product.

    ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

    ERA5 is updated daily with a latency of about 5 days. If serious flaws are detected in this early release (called ERA5T), this data could differ from the final release 2 to 3 months later. If this occurs, users are notified.

    The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper air fields) and single levels (atmospheric, ocean-wave and land surface quantities). The present entry is "ERA5 hourly data on pressure levels from 1940 to present".
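    To give a sense of scale for the grid spacings mentioned above, the sketch below counts the points on a regular global lat-lon grid. It assumes the grid includes both poles and does not repeat the 0/360 degree meridian; that layout is an assumption of this example, not something stated in the description, and `regular_grid_shape` is a hypothetical helper.

```python
def regular_grid_shape(step_deg):
    """Return (n_lat, n_lon) for a regular global grid with the given spacing."""
    if step_deg <= 0:
        raise ValueError("step_deg must be positive")
    n_lat = round(180 / step_deg) + 1  # assumed: both poles are grid rows
    n_lon = round(360 / step_deg)      # assumed: 0 up to, not including, 360
    return n_lat, n_lon


for step in (0.25, 0.5, 1.0):
    n_lat, n_lon = regular_grid_shape(step)
    print(f"{step} deg: {n_lat} x {n_lon} = {n_lat * n_lon} points per field")
```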

  9. Data from: DOO-RE: A dataset of ambient sensors in a meeting room for...

    • figshare.com
    zip
    Updated Feb 23, 2024
    Cite
    Hyunju Kim (2024). DOO-RE: A dataset of ambient sensors in a meeting room for activity recognition [Dataset]. http://doi.org/10.6084/m9.figshare.24558619.v3
    Explore at:
    zip (available download formats)
    Dataset updated
    Feb 23, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Hyunju Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We release the DOO-RE dataset, which consists of data streams from 11 types of ambient sensors, collected 24/7 in a real-world meeting room. 4 types of ambient sensors, called environment-driven sensors, measure continuous state changes in the environment (e.g. sound), and 4 types of sensors, called user-driven sensors, capture user state changes (e.g. motion). The remaining 3 types of sensors, called actuator-driven sensors, check whether the attached actuators are active (e.g. projector on/off). The values of each sensor are automatically collected by IoT agents, each of which is responsible for one sensor in our IoT system. A part of the collected sensor data stream representing a user activity is extracted as an activity episode in the DOO-RE dataset. Each episode's activity labels are annotated and validated through cross-checking and agreement among multiple annotators. A total of 9 activity types appear in the space: 3 based on single users and 6 based on groups (i.e. 2 or more people). As a result, DOO-RE contains 696 labeled episodes of single and group activities from the meeting room. DOO-RE is a novel dataset created in a public space that captures the properties of a real-world environment and has the potential to be useful for developing powerful activity recognition approaches.

  10. Worldwide Soundscapes project meta-data

    • zenodo.org
    Updated Dec 9, 2022
    Cite
    Kevin F.A. Darras; Rodney Rountree; Steven Van Wilgenburg; Amandine Gasc; 松海 李; 黎君 董; Yuhang Song; Youfang Chen; Thomas Cherico Wanger (2022). Worldwide Soundscapes project meta-data [Dataset]. http://doi.org/10.5281/zenodo.7415473
    Explore at:
    Dataset updated
    Dec 9, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kevin F.A. Darras; Rodney Rountree; Steven Van Wilgenburg; Amandine Gasc; 松海 李; 黎君 董; Yuhang Song; Youfang Chen; Thomas Cherico Wanger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Worldwide Soundscapes project is a global, open inventory of spatio-temporally replicated soundscape datasets. This Zenodo entry comprises the data tables that constitute its (meta-)database, as well as their description.

    The overview of all sampling sites can be found on the corresponding project on ecoSound-web, as well as a demonstration collection containing selected recordings. More information on the project can be found here and on ResearchGate.

    The audio recording criteria justifying inclusion into the meta-database are:

    • Stationary (no transects, towed sensors or microphones mounted on cars)
    • Passive (unattended, no human disturbance by the recordist)
    • Ambient (no spatial or temporal focus on a particular species or direction)
    • Spatially and/or temporally replicated (multiple sites sampled at least at one common daytime or multiple days sampled at least in one common site)

    The individual columns of the provided data tables are described in the following. Data tables are linked through primary keys; joining them will result in a database.

    datasets

    • dataset_id: incremental integer, primary key
    • name: name of the dataset. If it is repeated, incremental integers should be used in the "subset" column to differentiate them.
    • subset: incremental integer that can be used to distinguish datasets with identical names
    • collaborators: full names of people deemed responsible for the dataset, separated by commas
    • contributors: full names of people who are not the main collaborators but who have significantly contributed to the dataset, and who could be contacted for in-depth analyses, separated by commas.
    • date_added: when the dataset was added (DD/MM/YYYY)
    • URL_open_recordings: if recordings (even only some) from this dataset are openly available, indicate the internet link where they can be found.
    • URL_project: internet link for further information about the corresponding project
    • DOI_publication: DOI of corresponding publications, separated by comma
    • core_realm_IUCN: The core realm of the dataset. Datasets may have multiple realms, but the main one should be listed. Datasets may contain sampling sites from different realms in the "sites" sheet. IUCN Global Ecosystem Typology (v2.0): https://global-ecosystems.org/
    • medium: the physical medium the microphone is situated in
    • protected_area: Whether the sampling sites were situated in protected areas or not, or only some.
    • GADM0: For datasets on land or in territorial waters, Global Administrative Database level0
      https://gadm.org/
    • GADM1: For datasets on land or in territorial waters, Global Administrative Database level1
      https://gadm.org/
    • GADM2: For datasets on land or in territorial waters, Global Administrative Database level2
      https://gadm.org/
    • IHO: For marine locations, the sea area that encompasses all the sampling locations according to the International Hydrographic Organisation. Map here: https://www.arcgis.com/home/item.html?id=44e04407fbaf4d93afcb63018fbca9e2
    • locality: optional free text about the locality
    • latitude_numeric_region: study region approximate centroid latitude in WGS84 decimal degrees
    • longitude_numeric_region: study region approximate centroid longitude in WGS84 decimal degrees
    • sites_number: number of sites sampled
    • year_start: starting year of the sampling
    • year_end: ending year of the sampling
    • deployment_schedule: description of the sampling schedule, provisional
    • temporal_recording_selection: list environmental exclusion criteria that were used to determine which recording days or times to discard
    • high_pass_filter_Hz: frequency of the high-pass filter of the recorder, in Hz
    • variable_sampling_frequency: Does the sampling frequency vary? If it does, write "NA" in the sampling_frequency_kHz column and indicate it in the sampling_frequency_kHz column inside the deployments sheet
    • sampling_frequency_kHz: frequency the microphone was sampled at (sounds up to half that frequency will be recorded)
    • variable_recorder:
    • recorder: recorder model used
    • microphone: microphone used
    • freshwater_recordist_position: position of the recordist relative to the microphone during sampling (only for freshwater)
    • collaborator_comments: free-text field for comments by the collaborators
    • validated: This cell is checked if the contents of all sheets are complete and have been found to be coherent and consistent with our requirements.
    • validator_name: name of person doing the validation
    • validation_comments: validators: please insert the date when someone was contacted
    • cross-check: this cell is checked if the collaborators confirm the spatial and temporal data after checking the corresponding site maps, deployment and operation time graphs found at https://drive.google.com/drive/folders/1qfwXH_7dpFCqyls-c6b8RZ_fbcn9kXbp?usp=share_link
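    As a small illustration of the sampling_frequency_kHz note above, the sketch below derives the highest recordable sound frequency (half the sampling frequency) from a cell value, treating "NA" as "varies, see the deployments sheet". The helper name and the choice to return None for "NA" are assumptions of this example.

```python
from typing import Optional


def max_recordable_khz(sampling_frequency_khz: str) -> Optional[float]:
    """Return half the sampling frequency, or None when the cell holds "NA"."""
    value = sampling_frequency_khz.strip()
    if not value or value.upper() == "NA":
        # Variable sampling frequency: per-deployment values apply instead.
        return None
    return float(value) / 2.0


print(max_recordable_khz("48"))  # a recorder sampled at 48 kHz
print(max_recordable_khz("NA"))
```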

    datasets-sites

    • dataset_ID: primary key of datasets table
    • dataset_name: lookup field
    • site_ID: primary key of sites table
    • site_name: lookup field

    sites

    • site_ID: unique site IDs, larger than 1000 for compatibility with ecoSound-web
    • site_name: name or code of sampling site as used in respective projects
    • latitude_numeric: exact numeric degrees coordinates of latitude
    • longitude_numeric: exact numeric degrees coordinates of longitude
    • topography_m: elevation for sites on land, depth (negative) for marine sites, in meters
    • freshwater_depth_m
    • realm: Ecosystem type according to IUCN GET https://global-ecosystems.org/
    • biome: Ecosystem type according to IUCN GET https://global-ecosystems.org/
    • functional_group: Ecosystem type according to IUCN GET https://global-ecosystems.org/
    • comments
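    The statement that the tables are linked through primary keys can be illustrated with a small join across datasets, datasets-sites and sites. The rows below are invented; only the key column names (dataset_id, dataset_ID, site_ID, site_name) follow the descriptions above, and `sites_per_dataset` is a hypothetical helper.

```python
# Invented rows; only the key column names come from the table descriptions.
datasets = [
    {"dataset_id": 1, "name": "Example forest survey"},
    {"dataset_id": 2, "name": "Example reef survey"},
]
datasets_sites = [
    {"dataset_ID": 1, "site_ID": 1001},
    {"dataset_ID": 1, "site_ID": 1002},
    {"dataset_ID": 2, "site_ID": 1003},
]
sites = [
    {"site_ID": 1001, "site_name": "A1"},
    {"site_ID": 1002, "site_name": "A2"},
    {"site_ID": 1003, "site_name": "R1"},
]


def sites_per_dataset(datasets, datasets_sites, sites):
    """Map each dataset id to the names of its sites via the link table."""
    site_names = {site["site_ID"]: site["site_name"] for site in sites}
    joined = {dataset["dataset_id"]: [] for dataset in datasets}
    for link in datasets_sites:
        joined[link["dataset_ID"]].append(site_names[link["site_ID"]])
    return joined


print(sites_per_dataset(datasets, datasets_sites, sites))
```

    Keying on the dataset id rather than the name avoids collisions, since the description notes that dataset names may repeat and are told apart by the "subset" column.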

    deployments

    • dataset_ID: primary key of datasets table
    • dataset_name: lookup field
    • deployment: use identical subscript letters to denote rows that belong to the same deployment. For instance, you may use different operation times and schedules for different target taxa within one deployment.
    • start_date_min: earliest date of deployment start, double-click cell to get date-picker
    • start_date_max: latest date of deployment start, if applicable (only used when recorders were deployed over several days), double-click cell to get date-picker
    • start_time_mixed: deployment start local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). Corresponds to the recording start time for continuous recording deployments. If multiple start times were used, you should mention the latest start time (corresponds to the earliest daytime from which all recorders are active). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • permanent: is the deployment permanent (in which case it would be ongoing and the end date or duration would be unknown)?
    • variable_duration_days: is the duration of the deployment variable? in days
    • duration_days: deployment duration per recorder (use the minimum if variable)
    • end_date_min: earliest date of deployment end, only needed if duration is variable, double-click cell to get date-picker
    • end_date_max: latest date of deployment end, only needed if duration is variable, double-click cell to get date-picker
    • end_time_mixed: deployment end local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). Corresponds to the recording end time for continuous recording deployments.
    • recording_time: does the recording last from the deployment start time to the end time (continuous) or at scheduled daily intervals (scheduled)? Note: we consider recordings with duty cycles to be continuous.
    • operation_start_time_mixed: scheduled recording start local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • operation_duration_minutes: duration of operation in minutes, if constant
    • operation_end_time_mixed: scheduled recording end local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • duty_cycle_minutes: duty cycle of the recording (i.e. the fraction of minutes when it is recording), written as "recording(minutes)/period(minutes)". For example: "1/6" if the recorder is active for 1 minute and standing by for 5 minutes.
    • sampling_frequency_kHz: only indicate the sampling frequency if it is variable within a particular dataset so that we need to code different frequencies for different deployments
    • recorder
    • subset_sites: If the deployment was not done in all the sites of the
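
The mixed time fields (HH:MM or a solar event with an optional minute offset, such as "sunrise-60") and the duty-cycle notation ("1/6") described in the deployments list can be parsed mechanically. The following is a minimal sketch; the function names and the tuple return shapes are assumptions made for illustration and are not part of the template.

```python
import re
from fractions import Fraction

# Solar daytimes allowed by the field description.
SOLAR_EVENTS = {"sunrise", "sunset", "noon", "midnight"}


def parse_mixed_time(value):
    """Parse 'HH:MM' or a solar event with an optional +/- minute offset.

    Returns ("clock", minutes_after_midnight) for clock times and
    ("solar", event, offset_minutes) for solar times.
    """
    value = value.strip().lower()
    clock = re.fullmatch(r"(\d{1,2}):(\d{2})", value)
    if clock:
        hours, minutes = int(clock.group(1)), int(clock.group(2))
        if hours > 23 or minutes > 59:
            raise ValueError(f"invalid clock time: {value!r}")
        return ("clock", hours * 60 + minutes)
    solar = re.fullmatch(r"([a-z]+)([+-]\d+)?", value)
    if solar and solar.group(1) in SOLAR_EVENTS:
        offset = int(solar.group(2)) if solar.group(2) else 0
        return ("solar", solar.group(1), offset)
    raise ValueError(f"unrecognized time value: {value!r}")


def parse_duty_cycle(value):
    """Parse 'recording/period' (both in minutes) into an exact fraction."""
    recording, period = value.split("/")
    return Fraction(recording.strip()) / Fraction(period.strip())
```

With this sketch, "sunrise-60" parses to the sunrise event with an offset of -60 minutes, and "1/6" parses to the fraction 1/6 (recording 1 minute out of every 6).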

  11. ‘Austin's data portal activity metrics’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Austin's data portal activity metrics’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-austin-s-data-portal-activity-metrics-1ce3/latest
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Austin's data portal activity metrics’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/data-portal-activity-metricse on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    Background

    Austin's open data portal provides lots of public data about the City of Austin. It also provides portal administrators with behind-the-scenes information about how the portal is used... but that data is mysterious, hard to handle in a spreadsheet, and not located all in one place.

    Until now! Authorized city staff used admin credentials to grab this usage data and share it with the public. The City of Austin wants to use this data to inform the development of its open data initiative and manage the open data portal more effectively.

    This project contains related datasets for anyone to explore. These include site-level metrics, dataset-level metrics, and department information for context. A detailed description of how the files were prepared (along with code) can be found on GitHub here.

    Example questions to answer about the data portal

    1. What parts of the open data portal do people seem to value most?
    2. What can we tell about who our users are?
    3. How are our data publishers doing?
    4. How much data is published programmatically vs manually?
    5. How much data is super fresh? Super stale?
    6. Whatever you think we should know...

    About the files

    all_views_20161003.csv

    There is a resource available to portal administrators called "Dataset of datasets". This is the export of that resource, and it was captured on Oct 3, 2016. It contains a summary of the assets available on the data portal. While this file contains over 1400 resources (such as views, charts, and binary files), only 363 are actual tabular datasets.

    table_metrics_ytd.csv

    This file contains information about the 363 tabular datasets on the portal. Activity metrics for an individual dataset can be accessed by calling Socrata's views/metrics API and passing along the dataset's unique ID, a time frame, and admin credentials. The process of obtaining the 363 identifiers, calling the API, and staging the information can be reviewed in the python notebook here.

    site_metrics.csv

    This file is the export of site-level stats that Socrata generates using a given time frame and grouping preference. This file contains records about site usage each month from Nov 2011 through Sept 2016. By the way, it contains 285 columns... and we don't know what many of them mean. But we are determined to find out!! For a preliminary exploration of the columns and the portal-related business processes to which they might relate, check out the notes in this Python notebook here.

    city_departments_in_current_budget.csv

    This file contains a list of all City of Austin departments according to how they're identified in the most recently approved budget documents. Could be helpful for getting to know more about who the publishers are.

    crosswalk_to_budget_dept.csv

    The City is in the process of standardizing how departments identify themselves on the data portal. In the meantime, here's a crosswalk from the department values observed in all_views_20161003.csv to the department names that appear in the City's budget.
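
A crosswalk like this is typically applied as a lookup join. The sketch below shows the idea with made-up rows: the column names (`department`, `portal_department`, `budget_department`) and all values are assumptions for illustration, since the actual headers of the two CSV files are not listed here.

```python
import csv
import io

# Made-up sample rows; the real files may use different column names.
VIEWS_CSV = """name,department
Traffic Counts,Transportation Dept
Pool Schedule,PARD
Mystery Table,Unknown Office
"""

CROSSWALK_CSV = """portal_department,budget_department
Transportation Dept,Austin Transportation
PARD,Parks and Recreation
"""


def load_crosswalk(text):
    """Build a dict from portal department value to budget department name."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["portal_department"]: row["budget_department"] for row in reader}


def attach_budget_department(views_text, crosswalk):
    """Add a budget_department field to each view row (None if unmapped)."""
    rows = []
    for row in csv.DictReader(io.StringIO(views_text)):
        row["budget_department"] = crosswalk.get(row["department"])
        rows.append(row)
    return rows


rows = attach_budget_department(VIEWS_CSV, load_crosswalk(CROSSWALK_CSV))
```

Rows whose department value has no crosswalk entry keep `None`, which makes the not-yet-standardized departments easy to list.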

    This dataset was created by Hailey Pate and contains around 100 samples along with Di Sync Success, Browser Firefox 19, technical information and other features such as: - Browser Firefox 33 - Di Sync Failed - and more.

    How to use this dataset

    • Analyze Sf Query Error User in relation to Js Page View Admin
    • Study the influence of Browser Firefox 37 on Datasets Created
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit Hailey Pate


    --- Original source retains full ownership of the source dataset ---

  12. Global Employer Dataset (Wikidata)

    • opendatabay.com
    Updated Jul 5, 2025
    + more versions
    Cite
    Datasimple (2025). Global Employer Dataset (Wikidata) [Dataset]. https://www.opendatabay.com/data/ai-ml/e31ecab8-d78b-4108-89df-7ea2d5d3e09e
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    E-commerce & Online Transactions
    Description

    This dataset provides a curated and labeled subset of employer entries derived from Wikidata, with the goal of improving the quality and usability of employer data. While Wikidata is an invaluable open resource, direct use often necessitates cleaning. This dataset addresses that need by offering metadata, statistics, and labels to help users identify and utilise valid employer information. An employer is generally defined here as a company or entity that provides employment paying wages or a salary. The dataset specifically screens out entries that do not represent true employers, such as individuals or plurals. It is particularly useful for tasks involving data cleaning, entity recognition, and understanding employment nomenclature.

    Columns

    • item_id: The unique Wikidata item identifier (QCode without the 'Q' prefix).
    • employer_count: The number of Wikidata entries associated with this specific employer reference.
    • employer: The text label of the employer's name, sourced from Kensho's English labels.
    • description: The accompanying description of the Wikidata employer entry, also from Kensho.
    • in_google_news: A binary indicator (0 for no, 1 for yes) showing if the occupation exists within the GoogleNews embedding.
    • language_detected: A three-digit language code, identified using FastText language detection.
    • source: Indicates the origin of the information, such as Wikidata or Wikipedia.
    • label: A binary label (0 for invalid employer, 1 for valid employer) indicating the data's quality.
    • labeled_by: Specifies the method used for labeling, including human, classifier_gnew, classifier_bert, or cleanlab.
    • label_error_reason: Provides the specific reason if a label is deemed an error, such as 'domain' or 'plural'.
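
As a minimal sketch of how these columns might be used, the snippet below keeps only rows labeled as valid employers and rebuilds the full Wikidata identifier by restoring the 'Q' prefix that `item_id` omits. The sample rows are made up, and the helper names are assumptions for illustration.

```python
import csv
import io

# Made-up rows using the column names listed above.
SAMPLE_CSV = """item_id,employer,label,labeled_by,label_error_reason
1000001,Example Widgets Ltd,1,human,
1000002,teachers,0,cleanlab,plural
"""


def wikidata_qid(item_id):
    """Restore the 'Q' prefix that the item_id column leaves out."""
    return f"Q{int(item_id)}"


def valid_employers(csv_text):
    """Return rows whose binary label marks them as valid employers."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["label"] == "1":
            kept.append({"qid": wikidata_qid(row["item_id"]),
                         "employer": row["employer"]})
    return kept


employers = valid_employers(SAMPLE_CSV)
```

The same loop could read employers.wikidata.all.labeled.csv from disk instead of the inline sample, assuming the file's header matches the column list.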

    Distribution

    This dataset is provided as a single CSV file, named employers.wikidata.all.labeled.csv. Its current version is 1.0, with a file size of approximately 5.98 MB. The dataset contains a substantial number of entries, with item_id having 60656 values, employer having 60456 values, and description having 60640 values.

    Usage

    This dataset is ideal for various applications, including:

    • Detecting new trends in employers, occupations, and employment terminology.
    • Automatic error correction of employer entries.
    • Converting plural forms of entities to singular forms.
    • Training Named Entity Recognition (NER) models to identify employer names.
    • Building Question/Answer models that can understand and respond to queries about employers.
    • Improving the accuracy of FastText language detection models.
    • Assessing FastText accuracy with limited data.

    Coverage

    The dataset's coverage is global, drawing data from a Wikidata dump dated 2 February 2020. It includes employer entries from various linguistic contexts, as indicated by the language_detected column, showcasing multilingual employer names and descriptions. The content primarily focuses on entities and organisations that meet the definition of an employer, rather than specific demographic groups.

    License

    CC BY-SA

    Who Can Use It

    This dataset is suitable for:

    • Data scientists and machine learning engineers working on natural language processing tasks.
    • Researchers interested in data quality, entity resolution, and knowledge graph analysis.
    • Developers building applications that require accurate employer information.
    • Anyone needing to clean and validate employer data for various analytical or operational purposes.

    Dataset Name Suggestions

    • Wikidata Labeled Employers
    • ML-Ready Wikidata Employer Data
    • Cleaned Wikidata Employer References
    • Global Employer Dataset (Wikidata)
    • Validated Employer Entities

    Attributes

    Original Data Source: ML-You-Can-Use Wikidata Employers labeled

  13. Dataset for the Article "A Predictive Method to Improve the Effectiveness of...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 24, 2021
    Cite
    Riccardo Martoglia (2021). Dataset for the Article "A Predictive Method to Improve the Effectiveness of Twitter Communication in a Cultural Heritage Scenario" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4782983
    Dataset updated
    May 24, 2021
    Dataset provided by
    Marco Furini
    Riccardo Martoglia
    Manuela Montangero
    Federica Mandreoli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset for the article "A Predictive Method to Improve the Effectiveness of Twitter Communication in a Cultural Heritage Scenario".

    Abstract:

    Museums are embracing social technologies in the attempt to broaden their audience and to engage people. Although social communication seems an easy task, media managers know how hard it is to reach millions of people with a simple message. Indeed, millions of posts are competing every day to get visibility in terms of likes and shares and very little research focused on museums communication to identify best practices. In this paper, we focus on Twitter and we propose a novel method that exploits interpretable machine learning techniques to: (a) predict whether a tweet will likely be appreciated by Twitter users or not; (b) present simple suggestions that will help enhancing the message and increasing the probability of its success. Using a real-world dataset of around 40,000 tweets written by 23 world famous museums, we show that our proposed method allows identifying tweet features that are more likely to influence the tweet success.

    Code to run a selection of experiments is available at https://github.com/rmartoglia/predict-twitter-ch

    Dataset structure

    The dataset contains the dataset used in the experiments of the above research paper. Only the extracted features for the museum tweet threads (and not the message full text) are provided and needed for the analyses.

    We selected 23 well-known art museums from around the world and grouped them into five groups: G1 (museums with at least three million followers); G2 (museums with more than one million followers); G3 (museums with more than 400,000 followers); G4 (museums with more than 200,000 followers); G5 (Italian museums). From these museums, we analyzed ca. 40,000 tweets, ranging from ca. 5k to ca. 11k per museum group, depending on the number of museums in each group.

    Content features: these are the features that can be drawn from the content of the tweet itself. We further divide such features into the following two categories:

    – Countable: these features have a value ranging into different intervals. We take into consideration: the number of hashtags (i.e., words preceded by #) in the tweet, the number of URLs (i.e., links to external resources), the number of images (e.g., photos and graphical emoticons), the number of mentions (i.e., twitter accounts preceded by @), the length of the tweet;

    – On-Off: these features have binary values in {0, 1}. We observe whether the tweet has exclamation marks, question marks, person names, place names, organization names, other names. Moreover, we also take into consideration the tweet topic density: assuming that the involved topics correspond to the hashtags mentioned in the text, we define a tweet as topic-dense if the number of hashtags it contains is greater than a given threshold, set to 5. Finally, we observe the tweet sentiment that might be present (positive or negative) or not (neutral).
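
A few of the content features described above can be approximated with regular expressions. This is only a sketch under simplifying assumptions, not the authors' extraction code: the patterns are rough, and image counts, named entities, and sentiment are left out because they need external tools. The feature names are hypothetical.

```python
import re

TOPIC_DENSITY_THRESHOLD = 5  # a tweet is topic-dense if it has more than 5 hashtags

URL_PATTERN = re.compile(r"https?://\S+")


def content_features(text):
    """Compute a subset of the countable and on-off content features."""
    urls = URL_PATTERN.findall(text)
    # Remove URLs first so '#', '@', '?' and '!' inside links are not counted.
    stripped = URL_PATTERN.sub(" ", text)
    hashtags = re.findall(r"#\w+", stripped)
    mentions = re.findall(r"@\w+", stripped)
    return {
        "n_hashtags": len(hashtags),
        "n_urls": len(urls),
        "n_mentions": len(mentions),
        "length": len(text),
        "has_exclamation": int("!" in stripped),
        "has_question": int("?" in stripped),
        "topic_dense": int(len(hashtags) > TOPIC_DENSITY_THRESHOLD),
    }


features = content_features(
    "Open late tonight! #art #museum @visitor https://example.org/x?y=1"
)
```

Stripping URLs before counting is a design choice of this sketch so that a query string such as `?y=1` does not register as a question mark.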

    Context features: these features are not drawn from the content of the tweet itself and might give a larger picture of the context in which the tweet was sent. Namely, we take into consideration the part of the day in which the tweet was sent (morning, afternoon, evening and night, respectively from 5:00am to 11:59am, from 12:00pm to 5:59pm, from 6:00pm to 10:59pm and from 11:00pm to 4:59am), and a boolean feature indicating whether the tweet is a retweet or not.
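
The part-of-day boundaries given above map directly onto hour ranges. A minimal sketch follows; the function name is hypothetical, and the time zone in which the posting time should be interpreted is not stated in the description, so the caller is assumed to supply an appropriate local time.

```python
from datetime import time


def part_of_day(t):
    """Map a time of day to morning, afternoon, evening, or night."""
    hour = t.hour
    if 5 <= hour < 12:      # 5:00am to 11:59am
        return "morning"
    if 12 <= hour < 18:     # 12:00pm to 5:59pm
        return "afternoon"
    if 18 <= hour < 23:     # 6:00pm to 10:59pm
        return "evening"
    return "night"          # 11:00pm to 4:59am
```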

    User features: these features are specific to the user that sent the tweet, and are the same for all the tweets of this user. Namely, we consider the name of the museum and the number of followers of the user.

  14. GBIF Backbone Taxonomy

    • gbif.org
    • smng.net
    • +3more
    Updated Nov 17, 2023
    Cite
    GBIF Secretariat (2023). GBIF Backbone Taxonomy [Dataset]. http://doi.org/10.15468/39omei
    Dataset updated
    Nov 17, 2023
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The GBIF Backbone Taxonomy is a single, synthetic management classification with the goal of covering all names GBIF is dealing with. It's the taxonomic backbone that allows GBIF to integrate name based information from different resources, no matter if these are occurrence datasets, species pages, names from nomenclators or external sources like EOL, Genbank or IUCN. This backbone allows taxonomic search, browse and reporting operations across all those resources in a consistent way and to provide means to crosswalk names from one source to another.

    It is updated regularly through an automated process in which the Catalogue of Life acts as a starting point, also providing the complete higher classification above families. Additional scientific names only found in other authoritative nomenclatural and taxonomic datasets are then merged into the tree, thus extending the original catalogue and broadening the backbone's name coverage. The GBIF Backbone Taxonomy also includes identifiers for Operational Taxonomic Units (OTUs) drawn from the barcoding resources iBOL and UNITE.

    International Barcode of Life project (iBOL), Barcode Index Numbers (BINs). BINs are connected to a taxon name and its classification by taking into account all names applied to the BIN and picking names with at least 80% consensus. If there is no consensus of name at the species level, the selection process is repeated moving up the major Linnaean ranks until consensus is achieved.
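
The consensus rule for BINs can be illustrated with a short sketch. This is not GBIF's implementation: the rank list, the record layout, and the choice to compute the share among records that carry a name at the given rank are all assumptions made for illustration. The taxon names are placeholders.

```python
from collections import Counter

# Major Linnaean ranks, from most to least specific (assumed ordering).
RANKS = ["species", "genus", "family", "order", "class", "phylum", "kingdom"]
CONSENSUS = 0.8  # at least 80% of the applied names must agree


def consensus_name(records, ranks=RANKS, threshold=CONSENSUS):
    """Return (rank, name) at the lowest rank that reaches consensus, else None.

    Each record is a dict mapping rank -> name applied to the BIN.
    """
    for rank in ranks:
        names = [r[rank] for r in records if r.get(rank)]
        if not names:
            continue
        name, count = Counter(names).most_common(1)[0]
        if count / len(names) >= threshold:
            return rank, name
    return None


# Placeholder names: 3 of 5 records agree on the species (60%), all on the genus.
records = (
    [{"species": "Aus bus", "genus": "Aus"}] * 3
    + [{"species": "Aus cus", "genus": "Aus"}] * 2
)
result = consensus_name(records)
```

Here the species level fails the 80% threshold, so the selection moves up one rank and settles on the genus.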

    UNITE - Unified system for the DNA based fungal species, Species Hypotheses (SHs). SHs are connected to a taxon name and its classification based on the determination of the RefS (reference sequence) if present or the RepS (representative sequence). In the latter case, if there is no match in the UNITE taxonomy, the lowest rank with 100% consensus within the SH will be used.

    The GBIF Backbone Taxonomy is available for download at https://hosted-datasets.gbif.org/datasets/backbone/ in different formats together with an archive of all previous versions.

    The following 105 sources have been used to assemble the GBIF backbone, with the number of names given after each source:

    • Catalogue of Life Checklist - 4766428 names
    • International Barcode of Life project (iBOL) Barcode Index Numbers (BINs) - 635951 names
    • UNITE - Unified system for the DNA based fungal species linked to the classification - 611208 names
    • The Paleobiology Database - 212054 names
    • World Register of Marine Species - 188857 names
    • The Interim Register of Marine and Nonmarine Genera - 183894 names
    • The World Checklist of Vascular Plants (WCVP) - 131891 names
    • GBIF Backbone Taxonomy - 114350 names
    • TAXREF - 109374 names
    • The Leipzig catalogue of vascular plants - 75380 names
    • ZooBank - 73549 names
    • Integrated Taxonomic Information System (ITIS) - 68377 names
    • Plazi.org taxonomic treatments database - 61346 names
    • Genome Taxonomy Database r207 - 60545 names
    • International Plant Names Index - 52329 names
    • Fauna Europaea - 45077 names
    • The National Checklist of Taiwan (Catalogue of Life in Taiwan, TaiCoL) - 36193 names
    • Dyntaxa. Svensk taxonomisk databas - 35892 names
    • The Plant List with literature - 32692 names
    • United Kingdom Species Inventory (UKSI) - 29643 names
    • Artsnavnebasen - 29208 names
    • The IUCN Red List of Threatened Species - 21221 names
    • Afromoths, online database of Afrotropical moth species (Lepidoptera) - 13961 names
    • Brazilian Flora 2020 project - Projeto Flora do Brasil 2020 - 13829 names
    • Prokaryotic Nomenclature Up-to-Date (PNU) - 10079 names
    • Checklist Dutch Species Register - Nederlands Soortenregister - 8814 names
    • ICTV Master Species List (MSL) - 7852 names
    • Cockroach Species File - 6020 names
    • GRIN Taxonomy - 5882 names
    • Taxon list of fungi and fungal-like organisms from Germany compiled by the DGfM - 4570 names
    • Catalogue of Afrotropical Bees - 3623 names
    • Catalogue of Tenebrionidae (Coleoptera) of North America - 3327 names
    • Checklist of Beetles (Coleoptera) of Canada and Alaska. Second Edition. - 3312 names
    • Systema Dipterorum - 2850 names
    • Catalogue of the Pterophoroidea of the World - 2807 names
    • The Clements Checklist - 2675 names
    • Taxon list of Hymenoptera from Germany compiled in the context of the GBOL project - 2496 names
    • IOC World Bird List, v13.2 - 2366 names
    • Official Lists and Indexes of Names in Zoology - 2310 names
    • National checklist of all species occurring in Denmark - 1922 names
    • Myriatrix - 1876 names
    • Database of Vascular Plants of Canada (VASCAN) - 1822 names
    • Taxon list of vascular plants from Bavaria, Germany compiled in the context of the BFL project - 1771 names
    • Orthoptera Species File - 1742 names
    • A list of the terrestrial fungi, flora and fauna of Madeira and Selvagens archipelagos - 1602 names
    • Aphid Species File - 1565 names
    • World Spider Catalog - 1561 names
    • Taxon list of Jurassic Pisces of the Tethys Palaeo-Environment compiled at the SNSB-JME - 1270 names
    • Backbone Family Classification Patch - 1143 names
    • GBIF Algae Classification - 1100 names
    • International Cichorieae Network (ICN): Cichorieae Portal - 975 names
    • Psocodea Species File - 803 names
    • New Zealand Marine Macroalgae Species Checklist - 787 names
    • Annotated checklist of endemic species from the Western Balkans - 754 names
    • Taxon list of animals with German names (worldwide) compiled at the SMNS - 503 names
    • Catalogue of the Alucitoidea of the World - 472 names
    • Lygaeoidea Species File - 462 names
    • Catálogo de Plantas y Líquenes de Colombia - 422 names
    • GBIF Backbone Patch - 317 names
    • Phasmida Species File - 259 names
    • Cortinariaceae fetched from the Index Fungorum API - 234 names
    • Coreoidea Species File - 233 names
    • GTDB supplement - 139 names
    • Mantodea Species File - 119 names
    • Endemic species in Taiwan - 93 names
    • Taxon list of Araneae from Germany compiled in the context of the GBOL project - 88 names
    • Species of Hominidae - 78 names
    • Taxon list of Sternorrhyncha from Germany compiled in the context of the GBOL project - 77 names
    • Taxon list of mosses from Germany compiled in the context of the GBOL project - 75 names
    • Mammal Species of the World - 73 names
    • Plecoptera Species File - 71 names
    • Species Fungorum Plus - 64 names
    • Catalogue of the type specimens of Cosmopterigidae (Lepidoptera: Gelechioidea) from research collections of the Zoological Institute, Russian Academy of Sciences - 47 names
    • Species named after famous people - 41 names
    • Dermaptera Species File - 36 names
    • Taxon list of Trichoptera from Germany compiled in the context of the GBOL project - 34 names
    • True Fruit Flies (Diptera, Tephritidae) of the Afrotropical Region - 33 names
    • Range and Regularities in the Distribution of Earthworms of the Earthworms of the USSR Fauna. Perel, 1979 - 32 names
    • Taxon list of Diplura from Germany compiled in the context of the GBOL project - 30 names
    • Lista de referencia de especies de aves de Colombia - 2022 - 24 names
    • Taxon list of Auchenorrhyncha from Germany compiled in the context of the GBOL project - 20 names
    • Catalogue of the type specimens of Polycestinae (Coleoptera: Buprestidae) from research collections of the Zoological Institute, Russian Academy of Sciences - 19 names
    • Taxon list of Thysanoptera from Germany compiled in the context of the GBOL project - 19 names
    • Lista de especies de vertebrados registrados en jurisdicción del Departamento del Huila - 18 names
    • Taxon list of Microcoryphia (Archaeognatha) from Germany compiled in the context of the GBOL project - 15 names
    • Catalogue of the type specimens of Bufonidae and Megophryidae (Amphibia: Anura) from research collections of the Zoological Institute, Russian Academy of Sciences - 12 names
    • Grylloblattodea Species File - 11 names
    • Coleorrhyncha Species File - 9 names
    • Taxon list of liverworts from Germany compiled in the context of the GBOL project - 9 names
    • Embioptera Species File - 7 names
    • Taxon list of Pisces and Cyclostoma from Germany compiled in the context of the GBOL project - 6 names
    • Taxon list of Pteridophyta from Germany compiled in the context of the GBOL project - 6 names
    • Taxon list of Siphonaptera from Germany compiled in the context of the GBOL project - 5 names
    • The Earthworms of the Fauna of Russia. Perel, 1997 - 5 names
    • Taxon list of Zygentoma from Germany compiled in the context of the GBOL project - 4 names
    • Asiloid Flies: new taxa of Diptera: Apioceridae, Asilidae, and Mydidae - 3 names
    • Taxon list of Protura from Germany compiled in the context of the GBOL project - 3 names
    • Taxon list of hornworts from Germany compiled in the context of the GBOL project - 2 names
    • Chrysididae Species File - 1 name
    • Taxon list of Dermaptera from Germany compiled in the context of the GBOL project - 1 name
    • Taxon list of Diplopoda from Germany in the context of the GBOL project - 1 name
    • Taxon list of Orthoptera (Grashoppers) from Germany compiled at the SNSB - 1 name
    • Taxon list of Pscoptera from Germany compiled in the context of the GBOL project - 1 name
    • Taxon list of Pseudoscorpiones from Germany compiled in the context of the GBOL project - 1 name
    • Taxon list of Raphidioptera from Germany compiled in the context of the GBOL project - 1 name

  15. World Population Statistics - 2023

    • kaggle.com
    Updated Jan 9, 2024
    Cite
    Bhavik Jikadara (2024). World Population Statistics - 2023 [Dataset]. https://www.kaggle.com/datasets/bhavikjikadara/world-population-statistics-2023
    Dataset updated
    Jan 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Bhavik Jikadara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description
    • The current US Census Bureau world population estimate in June 2019 shows that the current global population is 7,577,130,400 people on Earth, which far exceeds the world population of 7.2 billion in 2015. Our estimate based on UN data shows the world's population surpassing 7.7 billion.
    • China is the most populous country in the world with a population exceeding 1.4 billion. It is one of just two countries with a population of more than 1 billion, with India being the second. As of 2018, India has a population of over 1.355 billion people, and its population growth is expected to continue through at least 2050. By the year 2030, India is expected to become the most populous country in the world. This is because India’s population will grow, while China is projected to see a loss in population.
    • The following 11 countries that are the most populous in the world each have populations exceeding 100 million. These include the United States, Indonesia, Brazil, Pakistan, Nigeria, Bangladesh, Russia, Mexico, Japan, Ethiopia, and the Philippines. Of these nations, all are expected to continue to grow except Russia and Japan, which will see their populations drop by 2030 before falling again significantly by 2050.
    • Many other nations have populations of at least one million, while there are also countries that have just thousands. The smallest population in the world can be found in Vatican City, where only 801 people reside.
    • In 2018, the world’s population growth rate was 1.12%. Every five years since the 1970s, the population growth rate has continued to fall. The world’s population is expected to continue to grow larger but at a much slower pace. By 2030, the population will exceed 8 billion. In 2040, this number will grow to more than 9 billion. In 2055, the number will rise to over 10 billion, and another billion people won’t be added until near the end of the century. The current annual population growth estimates from the United Nations are in the millions - estimating that over 80 million new lives are added yearly.
    • This population growth will be significantly impacted by nine specific countries which are situated to contribute to the population growth more quickly than other nations. These nations include the Democratic Republic of the Congo, Ethiopia, India, Indonesia, Nigeria, Pakistan, Uganda, the United Republic of Tanzania, and the United States of America. Particularly of interest, India is on track to overtake China's position as the most populous country by 2030. Additionally, multiple nations within Africa are expected to double their populations before fertility rates begin to slow entirely.
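
As a rough consistency check on the figures quoted above, multiplying the population estimate by the growth rate gives the yearly increase. The sketch mixes the June 2019 population estimate with the 2018 growth rate, so the result is only an approximation, and the function name is hypothetical.

```python
def annual_increase(population, growth_rate):
    """People added in one year at a given fractional growth rate."""
    return population * growth_rate


population_2019 = 7_577_130_400  # US Census Bureau estimate quoted above
growth_rate_2018 = 0.0112        # 1.12% growth rate quoted above

added = annual_increase(population_2019, growth_rate_2018)
```

The result is roughly 85 million people per year, in line with the statement that over 80 million new lives are added yearly.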

    Content

    • In this Dataset, we have Historical Population data for every Country/Territory in the world by different parameters like Area Size of the Country/Territory, Name of the Continent, Name of the Capital, Density, Population Growth Rate, Ranking based on Population, World Population Percentage, etc.

    Dataset Glossary (Column-Wise):
    • Rank: Rank by Population.
    • CCA3: 3-letter Country/Territories Code.
    • Country/Territories: Name of the Country/Territories.
    • Capital: Name of the Capital.
    • Continent: Name of the Continent.
    • 2022 Population: Population of the Country/Territories in the year 2022.
    • 2020 Population: Population of the Country/Territories in the year 2020.
    • 2015 Population: Population of the Country/Territories in the year 2015.
    • 2010 Population: Population of the Country/Territories in the year 2010.
    • 2000 Population: Population of the Country/Territories in the year 2000.
    • 1990 Population: Population of the Country/Territories in the year 1990.
    • 1980 Population: Population of the Country/Territories in the year 1980.
    • 1970 Population: Population of the Country/Territories in the year 1970.
    • Area (km²): Area size of the Country/Territories in square kilometers.
    • Density (per km²): Population Density per square kilometer.
    • Growth Rate: Population Growth Rate by Country/Territories.
    • World Population Percentage: The population percentage by each Country/Territories.
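
Two of the glossary columns can be derived from the others. The sketch below recomputes density and population share from made-up rows; it assumes that density is based on the 2022 population, and it computes the share relative to the rows supplied rather than the true world total. The territory names are fictional.

```python
def add_derived_columns(rows):
    """Return copies of the rows with density and population share added."""
    total = sum(r["2022 Population"] for r in rows)
    derived = []
    for r in rows:
        derived.append({
            **r,
            "Density (per km²)": r["2022 Population"] / r["Area (km²)"],
            "World Population Percentage": 100 * r["2022 Population"] / total,
        })
    return derived


# Fictional territories with made-up values.
sample = [
    {"Country/Territories": "Aland", "2022 Population": 1000, "Area (km²)": 10},
    {"Country/Territories": "Bland", "2022 Population": 3000, "Area (km²)": 100},
]
table = add_derived_columns(sample)
```

Comparing such recomputed values with the published columns is a quick way to check which population year the dataset's density figures actually use.
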
  16. People Data Labs Company Dataset

    • datarade.ai
    .json, .csv
    Updated Oct 18, 2021
    + more versions
    Cite
    People Data Labs (2021). People Data Labs Company Dataset [Dataset]. https://datarade.ai/data-products/people-data-labs-company-dataset-people-data-labs
    Available download formats: .json, .csv
    Dataset updated
    Oct 18, 2021
    Dataset provided by
    People Data Labs Inc.
    Authors
    People Data Labs
    Area covered
    Tokelau, South Sudan, Martinique, Dominican Republic, Antarctica, Paraguay, Christmas Island, Romania, Slovenia, Barbados
    Description

    People Data Labs is an aggregator of B2B person and company data. We source our globally compliant person dataset via our "Data Union".

    The "Data Union" is our proprietary data-sharing co-op. Customers opt in to sharing their data and warrant that their data is fully compliant with global data privacy regulations. Some data sources are provided as a one-time dump; others are refreshed every time we do a new data build. Our data sources come from a variety of verticals including HR Tech, Real Estate Tech, Identity/Anti-Fraud, Martech, and others. People Data Labs works with customers on compliance-based topics. If a customer wishes to ensure anonymity, we work with them to anonymize the data.

    Our company data has identifying information (name, website, social profiles), company attributes (industry, size, founded date), and tags + free text that is useful for segmentation.

  17. Global Sanctions Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Jan 15, 2025
    Cite
    Bright Data (2025). Global Sanctions Dataset [Dataset]. https://brightdata.com/products/datasets/global-sanctions
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Jan 15, 2025
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    This dataset provides in-depth information on individuals who have been included in the international sanctions list and are currently facing economic sanctions from various countries and international organizations. Key data attributes include first name, last name, citizenship, passport details, address, date of proscription, and reason for listing. This information helps organizations comply with sanctions regulations and avoid the risks associated with doing business with sanctioned entities.

    Popular attributes:

    ✔ Financial Intelligence

    ✔ Credit Risk Analysis

    ✔ Compliance

    ✔ Bank Data Enrichment

    ✔ Account Profiling

  18. Dataset of books called Baptists through the centuries : a history of a...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Baptists through the centuries : a history of a global people [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Baptists+through+the+centuries+%3A+a+history+of+a+global+people
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 2 rows and is filtered where the book is Baptists through the centuries : a history of a global people. It features 7 columns including author, publication date, language, and book publisher.

  19. Worldwide Soundscapes project metadata and analysis scripts

    • zenodo.org
    • data.niaid.nih.gov
    csv, zip
    Updated May 7, 2025
    Cite
    Kevin F.A. Darras; Rodney Rountree; Steven Van Wilgenburg; Amandine Gasc; Songhai Li; Lijun Dong; Youfang Chen; Thomas Cherico Wanger (2025). Worldwide Soundscapes project metadata and analysis scripts [Dataset]. http://doi.org/10.5281/zenodo.14216871
    Available download formats: csv, zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kevin F.A. Darras; Rodney Rountree; Steven Van Wilgenburg; Amandine Gasc; Songhai Li; Lijun Dong; Youfang Chen; Thomas Cherico Wanger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Worldwide Soundscapes project is a global, open inventory of spatio-temporally replicated passive acoustic monitoring meta-datasets (i.e. meta-data collections). This Zenodo entry comprises the data tables that constitute its (meta-)database, as well as their description. Additionally, R scripts are provided to replicate the analysis published in [placeholder].

    The overview of all sampling sites and timelines can be found on the corresponding project on ecoSound-web, as well as a demonstration collection containing selected recordings. The recordings of this collection were annotated and analysed to explore macro-ecological trends.

    The audio recording criteria justifying inclusion into the meta-database are:

    • Stationary (no transects, towed sensors or microphones mounted on cars)
    • Passive (unattended, no human disturbance by the recordist)
    • Ambient (no directional microphone or triggered recordings, non-experimental conditions)
    • Spatially and/or temporally replicated (i.e. multiple sites sampled at the same time and/or multiple days - covering the same daytime - sampled at the same site)

    The individual columns of the provided data tables are described in the following. Data tables are linked through primary keys; joining them will result in a database. The data shared here only includes validated collections.

    Changes from version 4.0.0

    Added link to the published synthesis.

    Meta-database CSV files

    collections

    • collection_id: unique integer, primary key
    • name: name of the dataset. If it is repeated, incremental integers should be used in the "subset" column to differentiate them.
    • ecoSound-web_link: link of validated meta-collection on ecoSound-web
    • primary_contributors: full names of people deemed corresponding contributors who are responsible for the dataset
    • secondary_contributors: full names of people who are not primary contributors but who have significantly contributed to the dataset, and who could be contacted for in-depth analyses
    • date_added: when the dataset was added (YYYY-MM-DD)
    • URL_open_recordings: internet link for openly-available recordings from this collection
    • URL_project: internet link for further information about the corresponding project
    • DOI_publication: Digital Object Identifiers of corresponding publications
    • core_realm_IUCN: The main, core realm of the dataset according to IUCN Global Ecosystem Typology (v2.0): https://global-ecosystems.org/
    • medium: the physical medium the microphone is situated in
    • locality: optional free text about the locality
    • contributor_comments: free-text field for comments by the primary contributors

    collections-sites

    • dataset_ID: primary key of collections table
    • site_ID: primary key of sites table

    sites

    • site_ID: unique integer, primary key
    • site_name: internal name or code of sampling site as used in respective projects
    • latitude_numeric: site's numeric degrees of latitude
    • longitude_numeric: site's numeric degrees of longitude
    • blurred_coordinates: whether latitude and longitude coordinates are inaccurate, boolean. Coordinates may be blurred with random offsets, rounding, snapping, etc. Indicate the blurring method inside the comments field
    • topography_m: vertical position of the microphone relative to sea level, in meters. For sites on land: elevation. For marine sites: depth (negative). Only indicate if the values were measured by the collaborator.
    • freshwater_depth_m: microphone depth, only used for sites inside freshwater bodies that also have an elevation value above the sea level
    • realm: Ecosystem type: main realm according to IUCN GET https://global-ecosystems.org/
    • biome: Ecosystem type: main biome according to IUCN GET https://global-ecosystems.org/
    • functional_group: Ecosystem type: main functional group according to IUCN GET https://global-ecosystems.org/
    • contributor_comments: free text field for contributor comments
    • GADM_0: Global ADMinistrative Database level 0 classification of terrestrial site or marine site that is within territorial waters. Source: https://gadm.org/download_world.html
    • IHO: International Hydrographic Organization classification of marine site. Source: https://marineregions.org/downloads.php
    • WDPA: World Database on Protected Areas classification of the site. Source: https://www.protectedplanet.net/en/thematic-areas/wdpa?tab=WDPA
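The description states that the tables are linked through primary keys. A minimal sketch of such a join, using only the Python standard library, could look as follows. It assumes that dataset_ID in collections-sites refers to collection_id in collections (as the column description suggests); the miniature CSV contents and the function names are hypothetical, and the real files contain many more columns.

```python
import csv
import io

# Hypothetical miniature tables standing in for the real CSV files.
COLLECTIONS_CSV = """collection_id,name
1,Forest A
2,Reef B
"""
COLLECTIONS_SITES_CSV = """dataset_ID,site_ID
1,10
1,11
2,12
"""
SITES_CSV = """site_ID,site_name,latitude_numeric,longitude_numeric
10,FA-01,1.5,103.2
11,FA-02,1.6,103.3
12,RB-01,-18.2,147.7
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def join_collections_to_sites(collections, links, sites):
    """Resolve each link row to its collection and site via the primary keys."""
    collections_by_id = {row["collection_id"]: row for row in collections}
    sites_by_id = {row["site_ID"]: row for row in sites}
    joined = []
    for link in links:
        collection = collections_by_id.get(link["dataset_ID"])
        site = sites_by_id.get(link["site_ID"])
        if collection is None or site is None:
            continue  # skip links that point at a missing row
        joined.append({
            "collection": collection["name"],
            "site": site["site_name"],
            "latitude": float(site["latitude_numeric"]),
            "longitude": float(site["longitude_numeric"]),
        })
    return joined


joined_rows = join_collections_to_sites(
    read_rows(COLLECTIONS_CSV),
    read_rows(COLLECTIONS_SITES_CSV),
    read_rows(SITES_CSV),
)
for row in joined_rows:
    print(row)
```

With the real files, the same lookups would be built from the downloaded CSVs instead of the inline strings.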

    deployments

    • dataset_ID: primary key of datasets table
    • deployment: identical subscript letters to denote rows that belong to the same deployment. For instance, you may use different operation times and schedules for different target taxa within one deployment.
    • subset_site_ID: If the deployment was not done in all the sites of the corresponding collection, site IDs where the deployment was conducted
    • start_date: date of deployment start
    • start_time_mixed: deployment start local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset). Corresponds to the recording start time for continuous recording deployments. If multiple start times were used, you should mention the latest start time (corresponds to the earliest daytime from which all recorders are active). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • permanent: whether the deployment is permanent, boolean
    • end_date: date of deployment end (date when last scheduled operation starts)
    • end_time_mixed: deployment end local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). Corresponds to the recording end time for continuous recording deployments.
    • operation_mode: one of the following:
      continuous: recording takes place from the deployment start date-time to deployment end date-time.
      periodical: recording takes place periodically (i.e., with duty cycle) from the deployment start date-time to deployment end date-time.
      scheduled: recording takes place during scheduled daily time intervals (optionally with duty cycle)
    • duty_cycle_minutes: duty cycle of the recording (i.e. the fraction of minutes when it is recording), written as "recording(minutes)/period(minutes)". Empty if no duty cycle is used. For example: "1/6" if the recorder is active for 1 minute and standing by for 5 minutes
    • operation_start_time_mixed: only for scheduled recordings: start local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • operation_duration_minutes: only for scheduled recordings: duration of operation in minutes, if constant
    • operation_end_time_mixed: only for scheduled recordings: end local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). Only required if durations are variable. Do not use when end times are ambiguous (for instance, if a recording could be 1 hour or 25 hours long because the end is on the next day). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • high_pass_filter_Hz: frequency of the high-pass filter of the recorder if applied, in Hz. Otherwise, write "none". This may be called a "low-cut" filter too.
    • bit_depth: sampling bit depth of the recordings. Often constant for a particular recorder
    • channels: number of recorded audio channels
    • sampling_frequency_kHz: frequency at which the microphone signal was sampled by the recorder (sounds up to half that frequency will be recorded)
    • recorder: recorder used for deployment
    • microphone: microphone used for deployment
    • target_taxa: main IUCN animal taxa that were studied with this deployment, using the exact IUCN Red list names (http://www.iucnredlist.org/), separated by commas. Only genera, families, orders, and classes are accepted. Empty if there was no taxonomic focus (i.e., general soundscapes were the study focus).
    • contributor_comments: free text field for contributor comments
    • exact_recordings: whether the deployment data here have been superseded by inserting more exact recording date-time ranges into the meta-collection on ecoSound-web
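Several deployment columns use small text encodings: the duty cycle as "recording/period" (for example "1/6") and the mixed time fields as either HH:MM or a solar daytime with an optional offset (for example "sunrise-60"). The sketch below shows one way these could be parsed. The function names and return shapes are assumptions made for illustration, the offset is assumed to be in minutes (as the "one hour before sunrise" example implies), and the project's own R scripts may handle these fields differently.

```python
import re
from fractions import Fraction
from typing import Optional, Tuple

# Either a clock time (HH:MM) or a solar daytime with an optional signed offset.
_MIXED_TIME = re.compile(
    r"^(?:(?P<clock>\d{1,2}:\d{2})"
    r"|(?P<event>sunrise|sunset|noon|midnight)(?P<offset>[+-]\d+)?)$"
)


def parse_duty_cycle(value: str) -> Optional[Fraction]:
    """Turn "recording/period" into the recorded fraction; empty means no duty cycle."""
    value = value.strip()
    if not value:
        return None
    recording, period = value.split("/")
    return Fraction(recording.strip()) / Fraction(period.strip())


def parse_mixed_time(value: str) -> Tuple[str, int]:
    """Return ("clock", minutes after midnight) or (solar event, offset in minutes)."""
    match = _MIXED_TIME.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    if match.group("clock"):
        hours, minutes = map(int, match.group("clock").split(":"))
        return ("clock", hours * 60 + minutes)
    return (match.group("event"), int(match.group("offset") or 0))


def nyquist_khz(sampling_frequency_khz: float) -> float:
    """Highest recordable sound frequency: half the sampling frequency."""
    return sampling_frequency_khz / 2


print(parse_duty_cycle("1/6"))
print(parse_mixed_time("sunrise-60"))
print(nyquist_khz(48))
```

Using Fraction keeps the duty cycle exact, which avoids rounding surprises when the fraction is later multiplied by a deployment duration.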

    recordings (partial download from ecoSound-web)

    • recording_id: primary key of the recordings table
    • collection_id: ID of the collection the recording belongs to
    • name: name of the recording
    • site_id: site ID the recording belongs to
    • recorder_id: ID of the recorder used for the recording (internal ecoSound-web code)
    • microphone_id: ID of the microphone used for the recording (internal ecoSound-web code)
    • recording_gain: recording gain applied for amplifying the audio signal, in decibels
    • duty_cycle_recording: fraction of the recording period when the recorder is actively recording audio
    • duty_cycle_period: period of the duty cycle, i.e., time between the starts of two subsequent recordings
    • note: comments (contains the target taxon)
    • file_date: date of the recording

  20. TrajNet Dataset

    • paperswithcode.com
    Updated Aug 23, 2021
    Cite
    Stefan Becker; Ronny Hug; Wolfgang Hübner; Michael Arens (2021). TrajNet Dataset [Dataset]. https://paperswithcode.com/dataset/trajnet-1
    Dataset updated
    Aug 23, 2021
    Authors
    Stefan Becker; Ronny Hug; Wolfgang Hübner; Michael Arens
    Description

    The TrajNet Challenge is a large multi-scenario forecasting benchmark. The challenge consists of predicting 3161 human trajectories: for each trajectory, 8 consecutive ground-truth values (3.2 seconds), i.e., t−7, t−6, …, t, are observed in world plane coordinates (the so-called world plane Human-Human protocol), and the following 12 values (4.8 seconds), i.e., t+1, …, t+12, are forecast. The 8-12-value protocol is consistent with most trajectory forecasting approaches, which usually focus on the 5-dataset scenario ETH-univ + ETH-hotel + UCY-zara01 + UCY-zara02 + UCY-univ. TrajNet substantially extends the 5-dataset scenario by diversifying the training data, thus stressing the flexibility and generalization an approach has to exhibit when it comes to unseen scenery and situations. In fact, TrajNet is a superset of diverse datasets that requires training on four families of trajectories, namely 1) BIWI Hotel (orthogonal bird’s eye flight view, moving people), 2) Crowds UCY (3 datasets, tilted bird’s eye view, camera mounted on building or utility poles, moving people), 3) MOT PETS (multisensor, different human activities) and 4) Stanford Drone Dataset (8 scenes, high orthogonal bird’s eye flight view, different agents such as people, cars, etc.), for a total of 11448 trajectories. Testing is requested on diverse partitions of BIWI Hotel, Crowds UCY, and Stanford Drone Dataset, and is evaluated by a dedicated server (ground-truth testing data is unavailable to applicants).
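The 8-12-value protocol described above can be sketched as a simple split of a 20-step track into an observed part and a part to forecast. This is an illustration only: the names below are hypothetical, the 0.4-second step is inferred from "8 values (3.2 seconds)", and the sample track is synthetic rather than taken from the benchmark.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in world plane coordinates

OBS_LEN = 8    # observed steps t-7 ... t
PRED_LEN = 12  # forecast steps t+1 ... t+12
STEP_SECONDS = 3.2 / OBS_LEN  # inferred from the stated 3.2 s for 8 values


def split_trajectory(track: Sequence[Point]) -> Tuple[List[Point], List[Point]]:
    """Split one 20-step track into the observed and the to-be-forecast parts."""
    if len(track) != OBS_LEN + PRED_LEN:
        raise ValueError(f"expected {OBS_LEN + PRED_LEN} steps, got {len(track)}")
    return list(track[:OBS_LEN]), list(track[OBS_LEN:])


def relative_times(length: int, first_index: int) -> List[float]:
    """Seconds relative to the last observed step t, for consecutive step indices."""
    return [(first_index + i) * STEP_SECONDS for i in range(length)]


# Synthetic straight-line track, for illustration only.
sample_track = [(0.5 * i, 0.0) for i in range(OBS_LEN + PRED_LEN)]
observed, future = split_trajectory(sample_track)
observed_times = relative_times(OBS_LEN, -(OBS_LEN - 1))  # t-7 ... t
future_times = relative_times(PRED_LEN, 1)                # t+1 ... t+12
print(len(observed), len(future))
```

A forecasting model would receive only the observed part and be scored on how closely its output matches the held-out future part.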
