100+ datasets found
  1. Data Sources

    • pacific-data.sprep.org
    • tonga-data.sprep.org
    xlsx
    Updated Feb 14, 2025
    Cite
    Department of Environment (2025). Data Sources [Dataset]. https://pacific-data.sprep.org/dataset/data-sources
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    Department of Environment
    Tonga
    License

    https://pacific-data.sprep.org/resource/private-data-license-agreement-0

    Area covered
    Tonga
    Description

    Data sources. Not complete. Will get it done this weekend.

  2. MassDOT Developers' Data Sources

    • mass.gov
    Updated Nov 13, 2009
    Cite
    Massachusetts Department of Transportation (2009). MassDOT Developers' Data Sources [Dataset]. https://www.mass.gov/massdot-developers-data-sources
    Explore at:
    Dataset updated
    Nov 13, 2009
    Dataset authored and provided by
    Massachusetts Department of Transportation (http://www.massdot.state.ma.us/)
    Area covered
    Massachusetts
    Description

    Information and links for developers to work with real-time and static transportation data.

  3. Data-source

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    Michael Paulino (2025). Data-source [Dataset]. https://huggingface.co/datasets/Mi6paulino/Data-source
    Explore at:
    Dataset updated
    Jul 27, 2025
    Authors
    Michael Paulino
    License

    https://choosealicense.com/licenses/artistic-2.0/

    Description

    from bs4 import BeautifulSoup
    import requests

    # Fetch the webpage
    url = 'https://example.com'
    response = requests.get(url)
    html_content = response.text

    # Parse the HTML content
    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract all the links
    for link in soup.find_all('a'):
        print(link.get('href'))

  4. Multiple data sources analysis of Trauma Patients

    • kaggle.com
    zip
    Updated Mar 16, 2022
    Cite
    IT SPOT (2022). Multiple data sources analysis of Trauma Patients [Dataset]. https://www.kaggle.com/datasets/itspot/multiple-data-sources-analysis-of-trauma-patients
    Explore at:
    Available download formats: zip (79173 bytes)
    Dataset updated
    Mar 16, 2022
    Authors
    IT SPOT
    Description

    Trauma is an emotional response to a deeply distressing or disturbing event, such as the loss of a loved one or an accident. A trauma patient generates data from multiple sources, including neurocognitive and physiologic data from various medical tests. The neurocognitive data comprises EEG signal values: Amplitude, Delta, Theta, Alpha, and Beta. The physiologic data comprises heart_rate, skin_conductance, skin_temperature, cortisol_level, Systolic_BP, and Diastolic_BP. In the existing system, all of these data are analyzed by medical experts in order to determine the severity of the trauma patient's condition. However, it is difficult for experts to correlate data from multiple sources and arrive at a decision on severity. This dataset is useful for classifying the severity of trauma patients.
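
    The per-patient joining of the two data sources described above can be sketched as follows. This is an illustration only, not the dataset's actual schema: field names follow the description, the values and `patient_id` key are invented.

```python
# Sketch: merge neurocognitive and physiologic records per patient
# before classification. All values below are invented; the join key
# "patient_id" is an assumption for illustration.
neurocognitive = [
    {"patient_id": 1, "amplitude": 42.0, "delta": 1.2, "theta": 0.8,
     "alpha": 0.5, "beta": 0.3},
]
physiologic = [
    {"patient_id": 1, "heart_rate": 110, "skin_conductance": 7.1,
     "skin_temperature": 35.2, "cortisol_level": 21.0,
     "systolic_bp": 150, "diastolic_bp": 95},
]

def merge_by_patient(neuro, physio):
    """Join the two sources on patient_id into one feature record."""
    physio_by_id = {row["patient_id"]: row for row in physio}
    merged = []
    for row in neuro:
        combined = dict(row)
        combined.update(physio_by_id.get(row["patient_id"], {}))
        merged.append(combined)
    return merged

records = merge_by_patient(neurocognitive, physiologic)
print(len(records[0]))  # → 12 keys: patient_id plus 11 features
```

    The merged record then carries all 11 features from both sources, which is the input a severity classifier would consume.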

  5. Department of Community Resources & Services Online Data Sources

    • data.wu.ac.at
    • opendata.howardcountymd.gov
    csv, json, xml
    Updated Apr 22, 2015
    Cite
    Department of Community Resources & Services (2015). Department of Community Resources & Services Online Data Sources [Dataset]. https://data.wu.ac.at/schema/data_maryland_gov/a2RlcS1yN3Fj
    Explore at:
    Available download formats: json, xml, csv
    Dataset updated
    Apr 22, 2015
    Dataset provided by
    Department of Community Resources & Services
    Description

    This dataset lists the various data sources used within the Department of Community Resources & Services for internal and external reports. It allows individuals and organizations to identify the type of data they are looking for and the geographical level at which the data is available (i.e., national, state, county, etc.). This dataset will be updated every quarter and should be utilized for research purposes.

  6. Data from: Inventory of online public databases and repositories holding...

    • catalog.data.gov
    • s.cnmilf.com
    • +2 more
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. https://catalog.data.gov/dataset/inventory-of-online-public-databases-and-repositories-holding-agricultural-data-in-2017-d4c81
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and as a baseline for future studies of ag research data.

    Purpose: As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to:

    • establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals
    • compare how much data is in institutional vs. domain-specific vs. federal platforms
    • determine which repositories are recommended by top journals that require or recommend the publication of supporting data
    • ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data

    Approach: The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and hold some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

    Search methods: We first compiled a list of known domain-specific USDA / ARS datasets and databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of "agricultural data" / "ag data" / "scientific data" + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories, using variations of "agriculture", "ag data" and "university" to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if the institution had a repository for its unique, independent research data, if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next, we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether they could filter for data results after search terms were applied. General-subject data repositories include Figshare, Open Science Framework, PANGAEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals in which USDA published in 2012 and 2016 were compiled, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites of the Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The author instructions of the top 50 journals were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for journals based on a 2012 and 2016 study of where USDA employees publish their research, ranked by number of articles, including 2015/2016 Impact Factor, author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required?, and recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.

    Evaluation: We ran a series of searches on all resulting general-subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, the type of resource searched (datasets, data, images, components, etc.), the percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

    Results: A summary of the major findings from our data review:

    • Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
    • There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
    • Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.

    See the included README file for descriptions of each individual data file in this dataset.

    Resources in this dataset:

    • Resource Title: Journals. File Name: Journals.csv
    • Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
    • Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
    • Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
    • Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
    • Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
    • Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt

  7. GIS Data Sources

    • hub.arcgis.com
    Updated Apr 2, 2024
    Cite
    King County (2024). GIS Data Sources [Dataset]. https://hub.arcgis.com/documents/kingcounty::gis-data-sources?uiVersion=content-views
    Explore at:
    Dataset updated
    Apr 2, 2024
    Dataset authored and provided by
    King County
    Description

    This page is an index of all the data sources that the GIS Center has to offer. If you're looking for anything, you'll find it here!

  8. Coresignal | Job Postings Data | Largest Professional Network + Indeed Jobs...

    • datarade.ai
    .json, .csv
    + more versions
    Cite
    Coresignal, Coresignal | Job Postings Data | Largest Professional Network + Indeed Jobs + 3 Other Sources | Global / 437M+ Records / Updated Monthly [Dataset]. https://datarade.ai/data-products/job-postings-data-coresignal
    Explore at:
    Available download formats: .json, .csv
    Dataset authored and provided by
    Coresignal
    Area covered
    Barbados, Austria, Qatar, Iraq, Cook Islands, Germany, Malawi, Iceland, Angola, Vietnam
    Description

    ➡️ You can choose from multiple data formats, delivery frequency options, and delivery methods;

    ➡️ Extensive datasets with job postings data from 5 leading B2B data sources;

    ➡️ Jobs API designed for effortless search and enrichment (accessible using a user-friendly self-service tool);

    ➡️ Fresh data: daily updates, easy change tracking with dedicated data fields, and a constant flow of new data;

    ➡️ You get all necessary resources for evaluating our data: a free consultation, a data sample, or free credits for testing the API.

    ✅ For HR tech

    Job posting data can provide insights into the demand for different types of jobs and skills, as well as trends in job postings over time. With access to historical data, companies can develop predictive models.

    ✅ For Investors

    Explore expansion trends, analyze hiring practices, and predict company or industry growth rates, enabling the extraction of actionable strategic and operational insights. At a larger scale of analysis, Job Postings Data can be leveraged to forecast market trends and predict the growth of specific industries.

    ✅ For Lead generation

    Coresignal’s Job Postings Data is ideal for lead generation and determining purchasing intent. In B2B sales, job postings can help identify the best time to approach a prospective client.

    ➡️ Why 400+ data-powered businesses choose Coresignal:

    1. Experienced data provider (in the market since 2016);
    2. Exceptional client service;
    3. Responsible and secure data collection.
  9. united-states-covid-19-county-level-data-sources

    • huggingface.co
    Updated Feb 23, 2025
    + more versions
    Cite
    Department of Health and Human Services (2025). united-states-covid-19-county-level-data-sources [Dataset]. https://huggingface.co/datasets/HHS-Official/united-states-covid-19-county-level-data-sources
    Explore at:
    Dataset updated
    Feb 23, 2025
    Dataset provided by
    United States Department of Health and Human Services (http://www.hhs.gov/)
    Authors
    Department of Health and Human Services
    License

    https://choosealicense.com/licenses/odbl/

    Area covered
    United States
    Description

    United States COVID-19 County Level Data Sources - ARCHIVED

      Description
    

    The Public Health Emergency (PHE) declaration for COVID-19 expired on May 11, 2023. As a result, the Aggregate Case and Death Surveillance System will be discontinued. Although these data will continue to be publicly available, this dataset will no longer be updated. On October 20, 2022, CDC began retrieving aggregate case and death data from jurisdictional and state partners weekly instead of daily.… See the full description on the dataset page: https://huggingface.co/datasets/HHS-Official/united-states-covid-19-county-level-data-sources.

  10. The 500MB Tv-Show Dataset

    • kaggle.com
    zip
    Updated Sep 5, 2023
    Cite
    Iyad Elwy (2023). The 500MB Tv-Show Dataset [Dataset]. https://www.kaggle.com/datasets/iyadelwy/the-500mb-tv-show-dataset/code
    Explore at:
    Available download formats: zip (95221606 bytes)
    Dataset updated
    Sep 5, 2023
    Authors
    Iyad Elwy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Televisions

    This dataset was extracted, transformed and loaded using various sources. The entire ETL Process looks as follows:

    Data pipeline diagram: https://github.com/IyadElwy/Televisions/assets/83036619/7088d477-2559-4af2-94e9-924274521d36

    Links

    Github ETL Process Code

    Kaggle Dataset

    Explanation

    • First I needed to find some appropriate data-sources. For this I used Wikiquote to extract important and unique script-text from the various shows. Wikipedia was used for the extraction of generic info like summaries, etc. Metacritic was used for the extraction of user reviews (scores + opinions) and the Opensource TvMaze API was used for getting all sorts of data, ranging from titles to cast, episodes, summaries and more.
    • Now the first step was to gather titles from those sources. It was important to divide the scrapers into two categories, scrapers that get the titles and scrapers that do the heavy-scraping which gets you the actual data.
    • For ETL job orchestration, Apache Airflow was used, hosted on an Azure VM instance with 4 virtual CPUs and about 14 GB of RAM. This was needed because of the heavy Spark data transformations.
    • RDS, running PostgreSQL was used to save the titles and their corresponding urls
    • S3 buckets were mostly used as Data Lakes to hold either raw data or temp data that needed to be further processed
    • AWS Glue was used to do Transformations on the dataset and then output to Redshift which was our Data Warehouse in this case
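
    The two-stage scraper split described above (cheap title scrapers first, heavy data scrapers second) can be sketched as follows. The fetch functions are stand-ins for the real Wikiquote / Wikipedia / Metacritic / TvMaze scrapers, and the source name is invented.

```python
# Sketch of the two-stage scraping pattern: stage 1 collects only
# titles and URLs (cheap); stage 2 does the heavy per-title scraping.
# Both functions here return canned data in place of real HTTP calls.
def scrape_titles(source):
    # Stage 1: gather titles and their URLs only.
    return [{"title": t, "url": f"https://{source}/{t}"}
            for t in ("show-a", "show-b")]

def scrape_details(title_record):
    # Stage 2: fetch summaries, cast, reviews, etc. for one title.
    return {**title_record, "summary": f"Summary of {title_record['title']}"}

titles = scrape_titles("example.org")          # stored in PostgreSQL (RDS)
details = [scrape_details(t) for t in titles]  # raw output lands in S3
print(len(details))  # → 2
```

    Splitting the stages this way lets the title list be persisted once and the expensive scrapers be re-run or parallelized independently.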

    CosmosDB NoSQL Schema

    Note the additionalProperties field, which allows more data to be added to each object; i.e., the following fields can hold a lot more nested data.

    {
      "type": "object",
      "additionalProperties": true,
      "properties": {
        "id": { "type": "integer" },
        "title": { "type": "string" },
        "normalized_title": { "type": "string" },
        "wikipedia_url": { "type": "string" },
        "wikiquotes_url": { "type": "string" },
        "eztv_url": { "type": "string" },
        "metacritic_url": { "type": "string" },
        "wikipedia": { "type": "object", "additionalProperties": true },
        "wikiquotes": { "type": "object", "additionalProperties": true },
        "metacritic": { "type": "object", "additionalProperties": true },
        "tvMaze": { "type": "object", "additionalProperties": true }
      }
    }
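
    The effect of "additionalProperties": true in the schema above can be exercised with any JSON Schema validator; the sketch below assumes the jsonschema Python package and uses a trimmed-down copy of the schema.

```python
# A minimal sketch: extra nested keys are accepted because of
# "additionalProperties": true, while a wrong type for a declared
# property is rejected. Assumes the jsonschema package is installed.
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "additionalProperties": True,
    "properties": {
        "id": {"type": "integer"},
        "title": {"type": "string"},
        "wikipedia": {"type": "object", "additionalProperties": True},
    },
}

# Extra nested data under "wikipedia" is allowed.
show = {
    "id": 1,
    "title": "Example Show",
    "wikipedia": {"summary": "...", "infobox": {"genre": "Drama"}},
}
validate(instance=show, schema=schema)  # passes silently

# A declared property with the wrong type is rejected.
try:
    validate(instance={"id": "not-an-int"}, schema=schema)
    ok = True
except ValidationError:
    ok = False
print(ok)  # False
```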

  11. Donuka: USA Nationwide Commercial Property Data

    • datarade.ai
    Updated Dec 13, 2006
    + more versions
    Cite
    Donuka (2006). Donuka: USA Nationwide Commercial Property Data [Dataset]. https://datarade.ai/data-products/donuka-usa-nationwide-commercial-property-data-donuka
    Explore at:
    Available download formats: .json, .xml, .csv, .xls, .txt
    Dataset updated
    Dec 13, 2006
    Dataset authored and provided by
    Donuka
    Area covered
    United States
    Description

    Donuka offers a simple, reliable property data solution to power innovation and create seamless business solutions for companies of all sizes. Our data covers more than 37 million properties spread out across the U.S. that can be accessed in bulk-file format or through our APIs.

    We offer access to data ONLY in selected states and counties

    DATA SOURCES:

    1. ONLY state sources (city/county/state administration, federal agencies, ministries, etc.). We DO NOT use unverified databases
    2. Over 2300 sources. We use even the smallest sources, because they contain valuable data. This allows us to provide our users with the most complete data

    DATA RELEVANCE:

    1. Our data is updated daily, weekly, or monthly, depending on the source
    2. We collect, process and store all data, regardless of their relevance. Historical data is also valuable

    DATA TYPES:

    1. Specifications
    2. Owners
    3. Permits
    4. Sales
    5. Inspections
    6. Violations
    7. Assessed values
    8. Taxes
    9. Risks
    10. Foreclosures
    11. Property Tax Liens
    12. Deed Restrictions

    NUMBERS:

    1. 2300+ data sources in total
    2. 4 billion records (listed in the "data types" block above) in total
    3. 2 million new records every day

    DATA USAGE:

    1. Property check, investigation (even the smallest events are stored in our database)
    2. Prospecting (more than 100 parameters to find the required records)
    3. Tracking (our data allows us to track any changes)
  12. Integrating Multiple Data Sources for Meta-analysis to Improve...

    • icpsr.umich.edu
    Updated Sep 8, 2025
    Cite
    Dickersin, Kay (2025). Integrating Multiple Data Sources for Meta-analysis to Improve Patient-Centered Outcomes Research [Methods Study], United States, 2013-2017 [Dataset]. http://doi.org/10.3886/ICPSR39490.v1
    Explore at:
    Dataset updated
    Sep 8, 2025
    Dataset provided by
    Inter-university Consortium for Political and Social Research (https://www.icpsr.umich.edu/web/pages/)
    Authors
    Dickersin, Kay
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/39490/terms

    Time period covered
    2013 - 2017
    Area covered
    United States
    Description

    Meta-analyses combine the results of many studies to find out how well a treatment or other healthcare intervention works. Most meta-analyses use public sources of data, such as published journal articles, as the main sources of information for study results. But journal articles are not the only sources of study results. Some results appear in other places, such as clinical study reports. Clinical study reports are documents that describe what researchers did and found in much more detail than journal articles. However, these reports may not be available to the public. As a result, meta-analyses may not include all available information about a treatment. The research team wanted to learn whether adding or replacing public and nonpublic data sources changed the results of meta-analyses. To find out, the research team added and replaced data as they conducted two meta-analyses. The first looked at adult use of a nerve-pain medicine. The second meta-analysis looked at adult use of a medicine to treat bipolar depression.
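
    The pooling step that a meta-analysis performs can be illustrated with inverse-variance weighting, the standard fixed-effect approach. This is a generic sketch, not necessarily the method used in this study, and the effect sizes and standard errors below are invented; adding a nonpublic source (e.g. a clinical study report) is simply another (effect, SE) pair, which can shift the pooled estimate.

```python
# Illustrative fixed-effect meta-analysis via inverse-variance
# weighting. All study values below are made up for the sketch.
import math

def pool_fixed_effect(effects, std_errors):
    """Return the pooled effect estimate and its standard error."""
    weights = [1.0 / se**2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Three hypothetical studies: (effect size, standard error).
effects = [0.30, 0.10, 0.25]
std_errors = [0.10, 0.15, 0.08]
pooled, pooled_se = pool_fixed_effect(effects, std_errors)
print(round(pooled, 3), round(pooled_se, 3))  # → 0.244 0.058
```

    Because studies with smaller standard errors get larger weights, a single precise unpublished result can noticeably move the pooled estimate, which is why the data-source question matters.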

  13. Data sources used by companies for training AI models South Korea 2024

    • statista.com
    Cite
    Statista, Data sources used by companies for training AI models South Korea 2024 [Dataset]. https://www.statista.com/statistics/1452822/south-korea-data-sources-for-training-artificial-intelligence-models/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Sep 2024 - Nov 2024
    Area covered
    South Korea
    Description

    As of 2024, customer data was the leading source of information used to train artificial intelligence (AI) models in South Korea, with nearly ** percent of surveyed companies answering that way. About ** percent reported using public sector support initiatives.

  14. Linked Open Data Management Services: A Comparison

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    Updated Sep 18, 2023
    Cite
    Robert Nasarek; Lozana Rossenova (2023). Linked Open Data Management Services: A Comparison [Dataset]. http://doi.org/10.5281/zenodo.7738424
    Explore at:
    Dataset updated
    Sep 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Robert Nasarek; Lozana Rossenova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Thanks to a variety of software services, it has never been easier to produce, manage and publish Linked Open Data. But until now, there has been a lack of an accessible overview to help researchers make the right choice for their use case. This dataset release will be regularly updated to reflect the latest data published in a comparison table developed in Google Sheets [1]. The comparison table includes the most commonly used LOD management software tools from NFDI4Culture to illustrate what functionalities and features a service should offer for the long-term management of FAIR research data, including:

    • ConedaKOR
    • LinkedDataHub
    • Metaphacts
    • Omeka S
    • ResearchSpace
    • Vitro
    • Wikibase
    • WissKI

    The table presents two views based on a comparison system of categories developed iteratively during workshops with expert users and developers from the respective tool communities. First, a short overview with field values coming from controlled vocabularies and multiple-choice options; and a second sheet allowing for more descriptive free text additions. The table and corresponding dataset releases for each view mode are designed to provide a well-founded basis for evaluation when deciding on a LOD management service. The Google Sheet table will remain open to collaboration and community contribution, as well as updates with new data and potentially new tools, whereas the datasets released here are meant to provide stable reference points with version control.

    The research for the comparison table was first presented as a paper at DHd2023, Open Humanities – Open Culture, 13-17.03.2023, Trier and Luxembourg [2].

    [1] Non-editing access is available here: docs.google.com/spreadsheets/d/1FNU8857JwUNFXmXAW16lgpjLq5TkgBUuafqZF-yo8_I/edit?usp=share_link To get editing access contact the authors.

    [2] Full paper will be made available open access in the conference proceedings.

  15. Open Data Inception

    • data.smartidf.services
    • public.aws-ec2-eu-1.opendatasoft.com
    • +1 more
    csv, excel, geojson +1
    Updated Aug 22, 2025
    Cite
    (2025). Open Data Inception [Dataset]. https://data.smartidf.services/explore/dataset/open-data-sources/
    Explore at:
    Available download formats: json, excel, geojson, csv
    Dataset updated
    Aug 22, 2025
    License

    https://en.wikipedia.org/wiki/Public_domain

    Description

    Open Data Inception is a project that compiles a comprehensive list of open data portals worldwide. It provides a geotagged, searchable map and list of these portals, making it easier for users to find clean, usable open data by country or topic. The initiative aims to address the challenge of locating reliable data sources, offering a user-friendly resource with an API for data enthusiasts and researchers. The project also explores standardizing metadata to improve data discoverability. Open Data Inception relies on crowdsourcing, and anyone can suggest the addition of a portal via this form.
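
    The dataset is hosted on an Opendatasoft portal, so one plausible way to query it programmatically is the standard Opendatasoft v1 records search API. The sketch below only builds the request URL (no network call is made); the endpoint path and parameters follow that API's conventions, and the dataset id "open-data-sources" is taken from the dataset URL above, so treat the exact call as an assumption to verify against the portal.

```python
# Build a search request URL for the Opendatasoft records API hosting
# this dataset. No network call is made here; the endpoint path is the
# standard Opendatasoft v1 pattern and should be verified against the
# portal before use.
from urllib.parse import urlencode

base = "https://data.smartidf.services/api/records/1.0/search/"
params = {"dataset": "open-data-sources", "q": "France", "rows": 10}
url = base + "?" + urlencode(params)
print(url)
```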

  16. Northern Ireland tourism latest alternative data sources: October 2022

    • gov.uk
    • s3.amazonaws.com
    Updated Oct 6, 2022
    + more versions
    Cite
    Northern Ireland Statistics and Research Agency (2022). Northern Ireland tourism latest alternative data sources: October 2022 [Dataset]. https://www.gov.uk/government/statistics/northern-ireland-tourism-latest-alternative-data-sources-october-2022
    Explore at:
    Dataset updated
    Oct 6, 2022
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    Northern Ireland Statistics and Research Agency
    Area covered
    Ireland, Northern Ireland
    Description

    NISRA Tourism Statistics have published an alternative document containing information from a range of sources which users may find useful in the absence of our regular published data.

  17. Label distribution of different data sources.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Mar 24, 2023
    Cite
    Rosa, Nicholas; Newman, Janet; Watkins, Christopher J. (2023). Label distribution of different data sources. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000990238
    Explore at:
    Dataset updated
    Mar 24, 2023
    Authors
    Rosa, Nicholas; Newman, Janet; Watkins, Christopher J.
    Description

    The use of imaging systems in protein crystallisation means that experimental setups no longer require manual inspection to determine the outcome of the trials. However, it leads to the problem of how best to find images which contain useful information about the crystallisation experiments. The adoption of a deep-learning approach in 2018 enabled a four-class machine classification system of the images to exceed human accuracy for the first time. Underpinning this was the creation of a labelled training set which came from a consortium of several different laboratories. The MARCO classification model does not have the same accuracy on local data as it does on images from the original test set; this can be somewhat mitigated by retraining the ML model with local images included. We have characterized the image data used in the original MARCO model, and performed extensive experiments to identify the training settings most likely to enhance the local performance of a MARCO-dataset-based ML classification model.

  18. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    bin, application/gzip, zip, text/x-python (available download formats)
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semi-structured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2Tb of disk space (see Step 2 detail levels)
    - at least 16Gb of RAM (64Gb preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/__init__.py`, comment out everything except GitHub support
     in `PROVIDERS`.
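The settings edit described above can be sketched as follows. `DATASET_PATH` and `SCRAPER_GITHUB_API_TOKENS` are the variable names the guide says to set; the folder path and token value here are placeholders, not part of the replication pack.

```python
# settings.py -- minimal sketch of the settings template described above.
# The variable names come from the replication guide; values are placeholders.
import os

# Newly created folder that will hold the generated dataset files
DATASET_PATH = os.path.expanduser("~/ghd-dataset")

# At least one GitHub API token (e.g. a personal access token),
# read from the environment here to avoid hard-coding secrets
SCRAPER_GITHUB_API_TOKENS = [
    os.environ.get("GITHUB_TOKEN", "ghp_your_token_here"),
]
```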
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get the output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speed up
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15-30 minutes.
    
    - create a folder `
  19. Developer Community and Code Datasets

    • datarade.ai
    Cite
    Oxylabs, Developer Community and Code Datasets [Dataset]. https://datarade.ai/data-products/developer-community-and-code-datasets-oxylabs
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset authored and provided by
    Oxylabs
    Area covered
    El Salvador, Tuvalu, Bahamas, Guyana, Saint Pierre and Miquelon, Philippines, Marshall Islands, South Sudan, United Kingdom, Djibouti
    Description

    Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.

    Data Sources:

    1. GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.

    2. StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.

    3. DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.

    Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.

    With our datasets, you'll receive:

    • Usernames;
    • Companies;
    • Locations;
    • Job Titles;
    • Follower Counts;
    • Contact Details;
    • Employability Statuses;
    • And More.

    Choose from various output formats, storage options, and delivery frequencies:

    • Get datasets in CSV, JSON, or other preferred formats.
    • Opt for data delivery via SFTP or directly to your cloud storage, such as AWS S3.
    • Receive datasets either once or as per your agreed-upon schedule.
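    As an illustration of working with a delivered CSV snapshot, the sketch below parses a small sample and ranks profiles by follower count. The file contents and column names are hypothetical examples, not the provider's actual schema.

```python
import csv
import io

# Hypothetical sample of a delivered CSV snapshot; real column names
# depend on the dataset schema agreed with the provider.
sample = io.StringIO(
    "username,company,location,follower_count\n"
    "octocat,GitHub,San Francisco,4000\n"
    "jdoe,Acme,Berlin,120\n"
)

rows = list(csv.DictReader(sample))
# Sort profiles by follower count, highest first
by_followers = sorted(rows, key=lambda r: int(r["follower_count"]), reverse=True)
print(by_followers[0]["username"])  # prints "octocat"
```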

    Why choose our Datasets?

    1. Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.

    2. Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.

    3. Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.

    Pricing Options:

    Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!

  20. Datasets exploring metadata commonalities across restricted health data sources in Canada

    • nrc-digital-repository.canada.ca
    • depot-numerique-cnrc.canada.ca
    Updated Nov 21, 2025
    Cite
    Read, Kevin B.; Gibson, Grant; Leahey, Ambery; Peterson, Lynn; Rutley, Sarah; Shi, Julie; Smith, Victoria; Stathis, Kelly (2025). Datasets exploring metadata commonalities across restricted health data sources in Canada [Dataset]. http://doi.org/10.17605/OSF.IO/TXRVE
    Explore at:
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    OSF
    Authors
    Read, Kevin B.; Gibson, Grant; Leahey, Ambery; Peterson, Lynn; Rutley, Sarah; Shi, Julie; Smith, Victoria; Stathis, Kelly
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Canada
    Description

    This project includes three datasets: the first dataset compiles dataset metadata commonalities that were identified from 48 Canadian restricted health data sources. The second dataset compiles access process metadata commonalities extracted from the same 48 data sources. The third dataset maps metadata commonalities of the first dataset to existing metadata standards including DataCite, DDI, DCAT, and DATS. This mapping exercise was completed to determine whether metadata used by restricted data sources aligned with existing standards for research data.
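    A crosswalk of the kind described in the third dataset could be represented as a simple mapping. The element names below are illustrative examples drawn from each standard, not the project's actual mapping.

```python
# Illustrative crosswalk: metadata commonalities mapped to roughly
# equivalent elements in DataCite, DDI, DCAT, and DATS. Element names
# are examples only; the project's real mapping is in its third dataset.
crosswalk = {
    "dataset title": {
        "DataCite": "title",
        "DDI": "stdyDscr/citation/titlStmt/titl",
        "DCAT": "dcterms:title",
        "DATS": "title",
    },
    "access conditions": {
        "DataCite": "rights",
        "DDI": "stdyDscr/dataAccs/useStmt/restrctn",
        "DCAT": "dcterms:accessRights",
        "DATS": "availability",
    },
}

# Standards that cover a given commonality (non-empty mapping)
covered = [std for std, el in crosswalk["dataset title"].items() if el]
```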
