80 datasets found
  1. Getting Real about Fake News

    • kaggle.com
    zip
    Updated Nov 25, 2016
    Cite
    Meg Risdal (2016). Getting Real about Fake News [Dataset]. https://www.kaggle.com/dsv/911
    Explore at:
    zip (20363882 bytes)
    Dataset updated
    Nov 25, 2016
    Authors
    Meg Risdal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The latest hot topic in the news is fake news and many are wondering what data scientists can do to detect it and stymie its viral spread. This dataset is only a first step in understanding and tackling this problem. It contains text and metadata scraped from 244 websites tagged as "bullshit" by the BS Detector Chrome Extension by Daniel Sieradski.

    Warning: I did not modify the list of news sources from the BS Detector so as not to introduce my (useless) layer of bias; I'm not an authority on fake news. There may be sources whose inclusion you disagree with. It's up to you to decide how to work with the data and how you might contribute to "improving it". The labels of "bs" and "junksci", etc. do not constitute capital "t" Truth. If there are other sources you would like to include, start a discussion. If there are sources you believe should not be included, start a discussion or write a kernel analyzing the data. Or take the data and do something else productive with it. Kaggle's choice to host this dataset is not meant to express any particular political affiliation or intent.

    Contents

    The dataset contains text and metadata from 244 websites and represents 12,999 posts in total from the past 30 days. The data was pulled using the webhose.io API; because it's coming from their crawler, not all websites identified by the BS Detector are present in this dataset. Each website was labeled according to the BS Detector as documented here. Data sources that were missing a label were simply assigned a label of "bs". There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read.
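    The "missing label → bs" rule described above can be sketched in a couple of lines. This is a minimal illustration, not the original collection code; the column name "type" is an assumption about the CSV layout.

```python
# Sketch of the labelling rule described above: rows whose label is
# missing or empty default to "bs". The "type" column name is an
# assumption, not documented in the dataset description.
def fill_missing_labels(rows, label_key="type", default="bs"):
    for row in rows:
        if not row.get(label_key):  # missing or empty label
            row[label_key] = default
    return rows

rows = [{"site": "example.com", "type": "junksci"},
        {"site": "example.org", "type": ""}]
fill_missing_labels(rows)
```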

    Fake news in the news

    For inspiration, I've included some (presumably non-fake) recent stories covering fake news in the news. This is a sensitive, nuanced topic and if there are other resources you'd like to see included here, please leave a suggestion. From defining fake, biased, and misleading news in the first place to deciding how to take action (a blacklist is not a good answer), there's a lot of information to consider beyond what can be neatly arranged in a CSV file.

    Improvements

    If you have suggestions for improvements or would like to contribute, please let me know. The most obvious extensions are to include data from "real" news sites and to address the bias in the current list. I'd be happy to include any contributions in future versions of the dataset.

    Acknowledgements

    Thanks to Anthony for pointing me to Daniel Sieradski's BS Detector. Thank you to Daniel Nouri for encouraging me to add a disclaimer to the dataset's page.

  2. CT-FAN-21 corpus: A dataset for Fake News Detection

    • zenodo.org
    Updated Oct 23, 2022
    + more versions
    Cite
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl (2022). CT-FAN-21 corpus: A dataset for Fake News Detection [Dataset]. http://doi.org/10.5281/zenodo.4714517
    Explore at:
    Dataset updated
    Oct 23, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gautam Kishore Shahi; Julia Maria Struß; Thomas Mandl
    Description

    Data Access: The data in this research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value, so care must be taken to use them for research purposes only. Due to these restrictions, the collection is not open data. Please download the Agreement at Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.

    Citation

    Please cite our work as

    @article{shahi2021overview,
     title={Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
     author={Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
     journal={Working Notes of CLEF},
     year={2021}
    }

    Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.

    Subtask 3A: Multi-class fake news detection of news articles (English). Subtask 3A frames fake news detection as a four-class classification problem. The training data will be released in batches of roughly 900 articles, each with its respective label. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:

    • False - The main claim made in an article is untrue.

    • Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.

    • True - This rating indicates that the primary elements of the main claim are demonstrably true.

    • Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
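    As an illustration of the category definitions above, a normalisation step from fine-grained fact-checker ratings to the four task classes might look as follows. This is a hypothetical sketch, not official task code, and the alias list is only an example, not exhaustive.

```python
# Hypothetical sketch: map fine-grained fact-checker ratings to the four
# task classes, following the category definitions above. The alias set
# lists examples from the "Partially False" definition, not an official list.
PARTIALLY_FALSE_ALIASES = {
    "partially false", "partially true", "mostly true",
    "miscaptioned", "misleading",
}

def normalise_rating(rating: str) -> str:
    r = rating.strip().lower()
    if r in {"true", "false", "other"}:
        return r
    if r in PARTIALLY_FALSE_ALIASES:
        return "partially false"
    return "other"  # in-dispute / unproven ratings fall back to "other"

print(normalise_rating("Mostly True"))  # partially false
```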

    Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article, and categorisation helps to automate the sampling process from a stream of data. Given the text of a news article, determine its topical domain (English). This is a classification problem: the task is to categorise fake news articles into six topical categories (e.g., health, election, crime, climate, education). This task will be offered for a subset of the data of Subtask 3A.

    Input Data

    The data will be provided in the format of id, title, text, rating, and domain; the columns are described as follows:

    Task 3a

    • ID - unique identifier of the news article
    • title - title of the news article
    • text - text of the news article
    • our rating - class of the news article: false, partially false, true, or other

    Task 3b

    • public_id - unique identifier of the news article
    • title - title of the news article
    • text - text of the news article
    • domain - domain of the given news article (applicable only for Task 3b)

    Output data format

    Task 3a

    • public_id - unique identifier of the news article
    • predicted_rating - predicted class

    Sample File

    public_id, predicted_rating
    1, false
    2, true

    Task 3b

    • public_id - unique identifier of the news article
    • predicted_domain - predicted domain

    Sample file

    public_id, predicted_domain
    1, health
    2, crime
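    Putting the sample formats above together, a submission file can be produced with the standard csv module. The predictions dict below is made-up example data, not real model output.

```python
import csv

# Sketch: write a Task 3a submission file in the sample format shown above.
# Column names follow the "Sample File" section; predictions are invented.
predictions = {1: "false", 2: "true"}

with open("submission_3a.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["public_id", "predicted_rating"])
    for public_id, rating in predictions.items():
        writer.writerow([public_id, rating])
```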

    Additional data for Training

    To train your model, participants may use additional data in a similar format; some datasets are available on the web. We do not provide ground truth for those datasets, and no articles from other datasets will be used for testing. Some possible sources:

    IMPORTANT!

    1. The fake news articles used for Task 3b are a subset of Task 3a.
    2. We have used data from 2010 to 2021; the fake news content spans several topics, such as elections and COVID-19.

    Evaluation Metrics

    This task is evaluated as a classification task. We will use the macro-averaged F1 measure (F1-macro) to rank teams. There is a limit of 5 runs in total (not per day), and only one person per team is allowed to submit runs.
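    The F1-macro ranking measure is the unweighted mean of the per-class F1 scores. A pure-Python sketch for illustration (in practice one would typically reach for a library implementation such as scikit-learn's f1_score with average="macro"):

```python
# Macro-averaged F1: compute F1 per class, then take the unweighted mean.
def f1_macro(y_true, y_pred):
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)

print(f1_macro(["true", "false", "false", "other"],
               ["true", "false", "true", "other"]))  # ≈ 0.778
```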

    Submission Link: https://competitions.codalab.org/competitions/31238

    Related Work

    • Shahi GK. AMUSED: An Annotation Framework of Multi-modal Social Media Data. arXiv preprint arXiv:2010.00502. 2020 Oct 1. https://arxiv.org/pdf/2010.00502.pdf
    • Shahi, G. K., & Nandini, D. (2020). FakeCovid – a multilingual cross-domain fact check news dataset for COVID-19. In Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media. http://workshop-proceedings.icwsm.org/abstract?id=2020_14
    • Shahi, G. K., Dirkson, A., & Majchrzak, T. A. (2021). An exploratory study of covid-19 misinformation on twitter. Online Social Networks and Media, 22, 100104. doi: 10.1016/j.osnem.2020.100104
  3. Requirements data sets (user stories)

    • zenodo.org
    • data.mendeley.com
    txt
    Updated Jan 13, 2025
    Cite
    Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
    Explore at:
    txt
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Mendeley Data
    Authors
    Fabiano Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 22 data sets of 50+ requirements each, expressed as user stories.

    The dataset has been created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements/user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removing that dataset [see Zenodo's policies].

    The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

    This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1

    Overview of the datasets [data and links added in December 2024]

    The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.

    Public administration and transparency

    g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS (DATA Act Information Model Schema), also known as Data Broker. The sample that was gathered refers to a sub-project that allows the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal-spending-related websites, including many more projects than the one described in the shared collection, can be found here.

    g03-loudoun.txt (2018) is a set of requirements extracted from a document by Loudoun County, Virginia, that describes the to-be user stories and use cases for a land management readiness assessment system called Loudoun County LandMARC. The source document can be found here; it is part of the Electronic Land Management System and EPlan Review Project RFP/RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

    g04-recycling.txt (2017) concerns a web application for searching and locating recycling and waste disposal facilities. The application operates through an interactive map. The dataset was obtained from a GitHub website and underpins a students' project on website design; the code is available (no license).

    g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge Foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing, and editing datasets, and on how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown licenses.

    g11-nsf.txt (2018) is a collection of user stories from the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.

    (Research) data and meta-data management

    g08-frictionless.txt (2016) regards the Frictionless Data project, which offers open source tooling for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within Frictionless Data are on GitHub (with a mix of Unlicense and MIT licenses) and the web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.

    g14-datahub.txt (2013) concerns the open source project DataHub, a data discovery platform which has been developed over multiple years via a GitHub repository (the code has an Apache License 2.0). The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.

    g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories have no explicit links to projects, it can be inferred that they originate from a project related to the library of Duke University.

    g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.

    g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis, and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

    g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.

    g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration, such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its

  4. Website Statistics - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Mar 12, 2024
    + more versions
    Cite
    ckan.publishing.service.gov.uk (2024). Website Statistics - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/website-statistics1
    Explore at:
    Dataset updated
    Mar 12, 2024
    Dataset provided by
    CKAN (https://ckan.org/)
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    This Website Statistics dataset has three resources showing usage of the Lincolnshire Open Data website. Web analytics terms used in each resource are defined in its accompanying Metadata file.

    Please note: due to a change in analytics platform and accompanying metrics, the current files do not contain a full year's data. The files will be updated again in January 2025 with 2024-2025 data. The previous dataset containing web analytics has been archived and can be found at the following link: https://lincolnshire.ckan.io/dataset/website-statistics-archived

    • Website Usage Statistics: a statistical summary of usage of the Lincolnshire Open Data site for the latest calendar year.
    • Website Statistics Summary: a website statistics summary for the Lincolnshire Open Data site for the latest calendar year.
    • Webpage Statistics: statistics for individual webpages on the Lincolnshire Open Data site by calendar year.

    Note: the resources above exclude API calls (automated requests for datasets). These Website Statistics resources are updated annually in February by the Lincolnshire County Council Open Data team.

  5. Data from: Higher Education Institutions in Poland Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 11, 2023
    Cite
    Jackson Junior; Paulina Rutecka; Pedro Pinto (2023). Higher Education Institutions in Poland Dataset [Dataset]. http://doi.org/10.5281/zenodo.8333574
    Explore at:
    zip
    Dataset updated
    Sep 11, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jackson Junior; Paulina Rutecka; Pedro Pinto
    License

    Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
    License information was derived automatically

    Area covered
    Poland
    Description

    Higher Education Institutions in Poland Dataset

    This repository contains a dataset of higher education institutions in Poland. The dataset comprises 131 public higher education institutions and 216 private higher education institutions in Poland. The data was collected on 24/11/2022.
    This dataset was compiled in response to a cybersecurity investigation of Poland's higher education institutions' websites [1]. The data is being made publicly available to promote open science principles [2].

    Data

    The data includes the following fields for each institution:

    • Id: A unique identifier assigned to each institution.
    • Region: The administrative region (voivodeship) in which the institution is located.
    • Name: The original name of the institution in Polish.
    • Name_EN: The international name of the institution in English.
    • Category: Indicates whether the institution is public or private.
    • Url: The website of the institution.
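    A minimal sketch of working with records in this layout, counting institutions per category. The two sample rows are invented; the real data ships as an .xls file (hei_poland_en.xls), which would need a spreadsheet reader rather than the csv module.

```python
import csv
import io

# Sketch: parse records with the fields listed above (Id, Region, Name,
# Name_EN, Category, Url) and count institutions per category.
# The sample rows below are invented for illustration.
sample = """Id,Region,Name,Name_EN,Category,Url
1,Mazowieckie,Uniwersytet Warszawski,University of Warsaw,Public,https://www.uw.edu.pl
2,Pomorskie,Przykladowa Uczelnia,Example University,Private,https://example.edu.pl
"""

counts = {}
for row in csv.DictReader(io.StringIO(sample)):
    counts[row["Category"]] = counts.get(row["Category"], 0) + 1

print(counts)  # {'Public': 1, 'Private': 1}
```

    On the full dataset, the counts should match the figures stated above (131 public, 216 private).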

    Methodology

    The dataset was compiled using data from two primary sources:

    • Public Higher Education Institutions: Data was sourced from the official website of the Ministry of Education and Science of Poland [3].
    • Private Higher Education Institutions: Data was obtained from the RAD-on system, which is part of the Integrated Information Network on Science and Higher Education [4].

    For the international names in English, the following methodology was employed:

    Both Polish and English names were retained for each institution, because some universities do not have English versions of their names available in official sources.

    English names were primarily sourced from:

    • The Polish National Agency for Academic Exchange's official document [5].
    • The website Studies in English [6].
    • Official websites of the respective Higher Education Institutions.

    In instances where English names were not readily available from the aforementioned sources, the GPT-3.5 model was employed to propose suitable names. These proposed names are distinctly marked in blue within the dataset file (hei_poland_en.xls).

    Usage

    This data is available under the Creative Commons Zero (CC0) license and can be used for academic research purposes. We encourage the sharing of knowledge and the advancement of research in this field by adhering to open science principles [2].

    If you use this data in your research, please cite the source and include a link to this repository. To properly attribute this data, please use the following DOI:
    10.5281/zenodo.8333573

    Contribution

    If you have any updates or corrections to the data, please feel free to open a pull request or contact us directly. Let's work together to keep this data accurate and up-to-date.

    Acknowledgment

    We would like to express our gratitude to the Ministry of Education and Science of Poland and the RAD-on system for providing the information used in this dataset.

    We would like to acknowledge the support of the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within the project "Cybers SeC IP" (NORTE-01-0145-FEDER-000044). This study was also developed as part of the Master in Cybersecurity Program at the Polytechnic University of Viana do Castelo, Portugal.

    References

    1. Pending.
    2. S. Bezjak, A. Clyburne-Sherin, P. Conzett, P. Fernandes, E. Görögh, K. Helbig, B. Kramer, I. Labastida, K. Niemeyer, F. Psomopoulos, T. Ross-Hellauer, R. Schneider, J. Tennant, E. Verbakel, H. Brinken, and L. Heller, Open Science Training Handbook. Zenodo, Apr. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1212496
    3. Ministry of Education and Science of Poland. "Wykaz uczelni publicznych nadzorowanych przez Ministra właściwego ds. szkolnictwa wyższego - publiczne uczelnie akademickie." Nov 2022. [Online]. Available: https://www.gov.pl/web/edukacja-i-nauka/wykaz-uczelni-publicznych-nadzorowanych-przez-ministra-wlasciwego-ds-szkolnictwa-wyzszego-publiczne-uczelnie-akademickie
    4. RAD-on System. "Dane instytucji systemu szkolnictwa wyższego i nauki." Nov 2022. [Online]. Available: https://radon.nauka.gov.pl/dane/instytucje-systemu-szkolnictwa-wyzszego-i-nauki
    5. Polish National Agency for Academic Exchange. "List of the university-type HEIs." 2023. [Online]. Available: https://nawa.gov.pl/images/Aktualnosci/2023/Att.-2.-List-of-the-university-type-HEIs.pdf
    6. Studies in English. [Online]. Available: www.studies-in-english.pl
  6. Self-citation analysis data based on PubMed Central subset (2002-2005)

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Cite
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik, Self-citation analysis data based on PubMed Central subset (2002-2005) [Dataset]. http://doi.org/10.13012/B2IDB-9665377_V1
    Explore at:
    Authors
    Shubhanshu Mishra; Brent D Fegley; Jana Diesner; Vetle I. Torvik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    U.S. National Institutes of Health (NIH)
    U.S. National Science Foundation (NSF)
    Description

    Self-citation analysis data based on PubMed Central subset (2002-2005)

    Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018.

    Introduction

    This is a dataset created as part of the publication: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed as the following tab-separated text files:

    • Training_data_2002_2005_pmc_pair_First.txt (1.2G) - data for first authors
    • Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - data for last authors
    • Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - data for middle (2nd) authors
    • Training_data_2002_2005_pmc_pair_txt.header.txt - header for the data
    • COLUMNS_DESC.txt - descriptions of all columns
    • model_text_files.tar.gz - text files containing model coefficients and scores for model selection
    • results_all_model.tar.gz - model coefficient and result files in NumPy format, used for plotting; v4.reviewer contains models for the analysis done after reviewer comments
    • README.txt

    Dataset creation

    Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should personally request this data from Clarivate Analytics. However, we do make available a similar but open dataset based on citations from PubMed Central, which can be utilized to obtain results similar to those reported in our analysis. Furthermore, we have freely shared the datasets below, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. If you wish to use any of these datasets, please make sure you cite both the dataset and the paper introducing it.

    • MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
    • Citation data from PubMed Central (the original paper includes additional citations from Web of Science)
    • Author-ity 2009 dataset:
      - Dataset: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
      - Paper: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1-29. https://doi.org/10.1145/1552303.1552304
      - Paper: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140-158. https://doi.org/10.1002/asi.20105
    • Genni 2.0 + Ethnea for identifying author gender and ethnicity:
      - Dataset: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
      - Paper: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
      - Paper: Torvik VI, Agarwal S. Ethnea - an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
    • MapAffil for identifying article country of affiliation:
      - Dataset: Torvik, Vetle I. (2018): MapAffil 2016 dataset - PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
      - Paper: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
    • IMPLICIT journal similarity:
      - Dataset: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
    • Novelty dataset for identifying article-level novelty:
      - Dataset: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
      - Paper: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
      - Code: https://github.com/napsternxg/Novelty
    • Expertise dataset for identifying author expertise on articles
    • Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

    Note: the dataset is based on a snapshot of PubMed (which includes MEDLINE and PubMed-not-MEDLINE records) taken in the first week of October 2016. Check here for information on obtaining PubMed/MEDLINE and NLM's data Terms and Conditions. Additional data-related updates can be found at the Torvik Research Group.

    Acknowledgments

    This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    License

    Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
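    Because the data files are tab-separated with the column names shipped in a separate header file (Training_data_2002_2005_pmc_pair_txt.header.txt), loading them amounts to pairing the two. A minimal sketch, with invented column names and inline strings standing in for the real (large) files:

```python
import csv
import io

# Sketch: pair a headerless tab-separated data file with its separate
# header file. The column names and rows below are invented stand-ins;
# the real header lives in Training_data_2002_2005_pmc_pair_txt.header.txt.
header_txt = "source_id\ttarget_id\tis_self_citation\n"
data_txt = "pmc1\tpmc2\t0\npmc3\tpmc3\t1\n"

columns = header_txt.rstrip("\n").split("\t")
rows = [dict(zip(columns, rec))
        for rec in csv.reader(io.StringIO(data_txt), delimiter="\t")]

print(rows[1]["is_self_citation"])  # prints "1"
```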

  7. Worldwide Soundscapes project meta-data

    • zenodo.org
    csv
    Updated Apr 5, 2024
    Cite
    Kevin F.A. Darras; Rodney Rountree; Steven Van Wilgenburg; Amandine Gasc; Songhai Li; Lijun Dong; Yuhang Song; Youfang Chen; Thomas Cherico Wanger (2024). Worldwide Soundscapes project meta-data [Dataset]. http://doi.org/10.5281/zenodo.10598949
    Explore at:
    csv. Available download formats
    Dataset updated
    Apr 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kevin F.A. Darras; Rodney Rountree; Steven Van Wilgenburg; Amandine Gasc; Songhai Li; Lijun Dong; Yuhang Song; Youfang Chen; Thomas Cherico Wanger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Worldwide Soundscapes project is a global, open inventory of spatio-temporally replicated passive acoustic monitoring meta-datasets (i.e. meta-data collections). This Zenodo entry comprises the data tables that constitute its (meta-)database, as well as their description.

    The overview of all sampling sites can be found on the corresponding project on ecoSound-web, as well as a demonstration collection containing selected recordings.

    The audio recording criteria justifying inclusion into the meta-database are:

    • Stationary (no transects, towed sensors or microphones mounted on cars)
    • Passive (unattended, no human disturbance by the recordist)
    • Ambient (no directional microphone or triggered recordings)
    • Spatially and/or temporally replicated (i.e. multiple sites sampled at the same time and/or multiple days - covering the same daytime - sampled at the same site)

    The individual columns of the provided data tables are described in the following. Data tables are linked through primary keys; joining them will result in a database. The data shared here only includes validated collections.

    Changes from version 1.0.0

    Audio data were moved to the deployments table; datasets are now called collections. Lookup fields and some collection-table fields were removed.

    collections

    • collection_id: unique integer, primary key
    • name: name of the dataset. If the name is repeated, incremental integers should be used in the "subset" column to differentiate them.
    • ecoSound-web_link: link of validated collections that were uploaded to ecoSound-web
    • primary_contributors: full names of people deemed corresponding contributors who are responsible for the dataset
    • secondary_contributors: full names of people who are not primary contributors but who have significantly contributed to the dataset, and who could be contacted for in-depth analyses
    • date_added: when the dataset was added (YYYY-MM-DD)
    • URL_open_recordings: internet link of openly-available recordings from this collection
    • URL_project: internet link for further information about the corresponding project
    • DOI_publication: DOIs of corresponding publications
    • core_realm_IUCN: The main, core realm of the dataset
    • medium: the physical medium the microphone is situated in
    • protected_area: whether the sampling sites were situated in protected areas or not, or only some. boolean
    • locality: optional free text about the locality
    • spatial_selection: spatial selection criteria that were used to determine in which locations to record sound (ecotone, elevated spot, etc.) - any deviations from randomness
    • temporal_exclusion: environmental exclusion criteria that were used to determine which recording days or times to discard
    • freshwater_recordist_position: position of the recordist relative to the microphone during sampling (only for freshwater)
    • contributor_comments: free-text field for comments by the primary contributors

    collections-sites

    • dataset_ID: primary key of collections table
    • site_ID: primary key of sites table

    sites

    • site_ID: unique integer, primary key
    • site_name: name or code of sampling site as used in respective projects
    • latitude_numeric: site's numeric degrees of latitude
    • longitude_numeric: site's numeric degrees of longitude
    • blurred_coordinates: whether latitude and longitude coordinates are inaccurate, boolean. Coordinates may be blurred with random offsets, rounding, snapping, etc. Indicate the blurring method inside the comments field
    • topography_m: vertical position of the microphone relative to sea level, in meters. For sites on land: elevation; for marine sites: depth (negative). Only indicate if the values were measured by the collaborator.
    • freshwater_depth_m: microphone depth, only used for sites inside freshwater bodies that also have an elevation value above the sea level
    • realm: Ecosystem type according to IUCN GET https://global-ecosystems.org/
    • biome: Ecosystem type according to IUCN GET https://global-ecosystems.org/
    • functional_group: Ecosystem type according to IUCN GET https://global-ecosystems.org/
    • contributor_comments: free text field for contributor comments
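As noted above, the data tables are linked through primary keys and joining them yields the full meta-database. A minimal sketch of resolving the many-to-many collections-sites link in Python follows; the sample rows are hypothetical illustrations (not actual Worldwide Soundscapes records), and it assumes the junction table's dataset_ID references collections.collection_id:

```python
# Hypothetical example rows; column names follow the table descriptions above.
collections = [{"collection_id": 1, "name": "example forest survey"}]
collections_sites = [{"dataset_ID": 1, "site_ID": 10},
                     {"dataset_ID": 1, "site_ID": 11}]
sites = [{"site_ID": 10, "site_name": "plot-A"},
         {"site_ID": 11, "site_name": "plot-B"}]

# Index the sites table by its primary key for O(1) lookups.
site_by_id = {s["site_ID"]: s for s in sites}

def sites_of_collection(collection_id):
    """Resolve the collections-sites junction table for one collection."""
    return [site_by_id[row["site_ID"]]
            for row in collections_sites
            if row["dataset_ID"] == collection_id]

names = [s["site_name"] for s in sites_of_collection(1)]
```

In practice the same join can be done with a SQL JOIN or a pandas merge once the CSV tables are loaded; the logic is identical.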

    deployments

    • dataset_ID: primary key of datasets table
    • dataset_name: lookup field
    • deployment: use identical subscript letters to denote rows that belong to the same deployment. For instance, you may use different operation times and schedules for different target taxa within one deployment.
    • start_date_min: earliest date of deployment start, double-click cell to get date-picker
    • start_date_max: latest date of deployment start, if applicable (only used when recorders were deployed over several days), double-click cell to get date-picker
    • start_time_mixed: deployment start local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). Corresponds to the recording start time for continuous recording deployments. If multiple start times were used, you should mention the latest start time (corresponds to the earliest daytime from which all recorders are active). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • permanent: is the deployment permanent (in which case it would be ongoing and the end date or duration would be unknown)?
    • variable_duration_days: is the duration of the deployment variable? in days
    • duration_days: deployment duration per recorder (use the minimum if variable)
    • end_date_min: earliest date of deployment end, only needed if duration is variable, double-click cell to get date-picker
    • end_date_max: latest date of deployment end, only needed if duration is variable, double-click cell to get date-picker
    • end_time_mixed: deployment end local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). Corresponds to the recording end time for continuous recording deployments.
    • recording_time: does the recording last from the deployment start time to the end time (continuous) or at scheduled daily intervals (scheduled)? Note: we consider recordings with duty cycles to be continuous.
    • operation_start_time_mixed: scheduled recording start local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • operation_duration_minutes: duration of operation in minutes, if constant
    • operation_end_time_mixed: scheduled recording end local time, either in HH:MM format or a choice of solar daytimes (sunrise, sunset, noon, midnight). If applicable, positive or negative offsets from solar times can be mentioned (For example: if data are collected one hour before sunrise, this will be "sunrise-60")
    • duty_cycle_minutes: duty cycle of the recording (i.e. the fraction of minutes when it is recording), written as "recording(minutes)/period(minutes)". For example: "1/6" if the recorder is active for 1 minute and standing by for 5 minutes.
    • sampling_frequency_kHz: only indicate the sampling frequency if it is variable within a particular dataset so that we need to code different frequencies for different deployments
    • recorder
    • subset_sites: If the deployment was not done in all the sites of the corresponding dataset, site IDs can be indicated here, separated by commas
    • comments
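The duty_cycle_minutes notation described above ("recording(minutes)/period(minutes)", e.g. "1/6" for 1 minute of recording in every 6-minute period) can be parsed mechanically. A small sketch, with function names of my own choosing:

```python
def duty_cycle_fraction(duty_cycle):
    """Parse "recording/period" (both in minutes) into the fraction of
    time the recorder is actually recording, e.g. "1/6" -> 1/6."""
    recording, period = (float(x) for x in duty_cycle.split("/"))
    return recording / period

def recorded_minutes_per_day(duty_cycle):
    """Minutes of audio captured per 24 h of 'continuous' operation
    under the given duty cycle (1440 minutes scaled by the fraction)."""
    return 24 * 60 * duty_cycle_fraction(duty_cycle)
```

For example, a "1/6" duty cycle yields 240 minutes of audio per day of deployment.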
  8. The NIST Extensible Resource Data Model (NERDm): JSON schemas for rich...

    • nist.gov
    • gimi9.com
    • +2more
    Updated Sep 2, 2017
    Cite
    National Institute of Standards and Technology (2017). The NIST Extensible Resource Data Model (NERDm): JSON schemas for rich description of data resources [Dataset]. http://doi.org/10.18434/mds2-1870
    Explore at:
    Dataset updated
    Sep 2, 2017
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    License

    https://www.nist.gov/open/license

    Description

    The NIST Extensible Resource Data Model (NERDm) is a set of schemas for encoding, in JSON format, metadata that describe digital resources. The variety of digital resources it can describe includes not only digital data sets and collections, but also software, digital services, web sites and portals, and digital twins. It was created to serve as the internal metadata format used by the NIST Public Data Repository and Science Portal to drive rich presentations on the web and to enable discovery; however, it was also designed to enable programmatic access to resources and their metadata by external users. Interoperability was also a key design aim: the schemas are defined using the JSON Schema standard, metadata are encoded as JSON-LD, and their semantics are tied to community ontologies, with an emphasis on DCAT and the US federal Project Open Data (POD) models. Finally, extensibility is also central to its design: the schemas are composed of a central core schema and various extension schemas. New extensions to support richer metadata concepts can be added over time without breaking existing applications. Validation is central to NERDm's extensibility model. Consuming applications should be able to choose which metadata extensions they care to support and ignore terms and extensions they don't support. Furthermore, they should not fail when a NERDm document leverages extensions they don't recognize, even when on-the-fly validation is required. To support this flexibility, the NERDm framework allows documents to declare what extensions are being used and where. We have developed an optional extension to the standard JSON Schema validation (see ejsonschema below) to support flexible validation: while a standard JSON Schema validator can validate a NERDm document against the NERDm core schema, our extension will validate a NERDm document against any recognized extensions and ignore those that are not recognized.
    The NERDm data model is based around the concept of resource, semantically equivalent to a schema.org Resource, and as in schema.org, there can be different types of resources, such as data sets and software. A NERDm document indicates what types the resource qualifies as via the JSON-LD "@type" property. All NERDm Resources are described by metadata terms from the core NERDm schema; however, different resource types can be described by additional metadata properties (often drawing on particular NERDm extension schemas). A Resource contains Components of various types (including DCAT-defined Distributions) that are considered part of the Resource; specifically, these can include downloadable data files, hierarchical data collections, links to web sites (like software repositories), software tools, or other NERDm Resources. Through the NERDm extension system, domain-specific metadata can be included at either the resource or component level. The direct semantic and syntactic connections to the DCAT, POD, and schema.org schemas are intended to ensure unambiguous conversion of NERDm documents into those schemas. As of this writing, the Core NERDm schema and its framework stand at version 0.7 and are compatible with the "draft-04" version of JSON Schema. Version 1.0 is projected to be released in 2025. In that release, the NERDm schemas will be updated to the "draft2020" version of JSON Schema. Other improvements will include stronger support for RDF and the Linked Data Platform through its support of JSON-LD.
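The flexible-validation policy described above (validate the core terms you know; ignore extension terms you don't recognize rather than failing) can be illustrated with a toy checker. This is NOT the ejsonschema API, and the term names are hypothetical; it only sketches the policy:

```python
# Toy illustration of "validate known core terms, ignore unrecognized
# extensions". A real validator would apply full JSON Schema documents;
# the required-term list here is a stand-in for the core schema.

def validate_core(doc, required=("@type", "title")):
    """Return a list of problems for missing core terms. Any term the
    consumer does not recognize (e.g. an extension-namespaced field)
    is simply ignored rather than treated as an error."""
    return ["missing required core term: " + t for t in required if t not in doc]

ok_doc = {"@type": ["nrdp:DataPublication"],      # hypothetical type value
          "title": "Example resource",
          "hypothetical-ext:detail": 42}           # unknown extension: ignored
bad_doc = {"title": "No @type here"}
```

The key design point is that `ok_doc` passes despite carrying a field the validator has never seen, mirroring how NERDm consumers stay forward-compatible with new extension schemas.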

  9. Virginia Springs/Groundwater Layers - 2023

    • data.virginia.gov
    • hub.arcgis.com
    • +3more
    Updated Jul 29, 2025
    Cite
    Virginia Department of Environmental Quality (2025). Virginia Springs/Groundwater Layers - 2023 [Dataset]. https://data.virginia.gov/dataset/virginia-springs-groundwater-layers-2023
    Explore at:
    html, arcgis geoservices rest api. Available download formats
    Dataset updated
    Jul 29, 2025
    Dataset authored and provided by
    Virginia Department of Environmental Quality (https://deq.virginia.gov/)
    Area covered
    Hot Springs
    Description
    The VDEQ Spring SITES database contains data describing the geographic locations and site attributes of natural springs throughout the commonwealth. This data coverage continues to evolve and contains only spring locations known to exist with a reasonable degree of certainty on the date of publication. The dataset does not replace site specific inventorying or receptor surveys but can be used as a starting point. VDEQ's initial geospatial dataset of approximately 325 springs was formed in 2008 by digitizing historical spring information sheets created by State Water Control Board geologists in the 1970s through early 1990s. Additional data has been consolidated from the EPA STORET database, the U.S. Geological Survey's Ground Water Site Inventory (GWSI) and Geographic Names Inventory System (GNIS), the Virginia Department of Health SDWIS database, the Virginia DEQ Virginia Water Use Data Set (VWUDS), the Commonwealth of Virginia Division of Water Resources and Power Bulletin No. 1: "Springs of Virginia" by Collins et al., 1930 as well as several VDWR&P Surface Water Supply bulletins from the 1940's - 1950's. A 1992 Virginia Department of Game and Inland Fisheries / Virginia Tech sponsored study by Helfrich et al. titled "Evaluation of the Natural Springs of Virginia: Fisheries Management Implications", a 2004 Rockbridge County groundwater resources report written by Frits van der Leeden, and several smaller datasets from consultants and citizens were evaluated and added to the database when confidence in locational accuracy was high or could be verified with aerial or LIDAR imagery. 
Significant contributions have been made throughout the years by VDEQ Groundwater Characterization staff site visits as well as other geologists working in the region including: Matt Heller at Virginia Division of Geology and Mineral Resources (VDMME), Wil Orndorff at the Virginia Department of Conservation and Recreation Karst Program (VDCR), and David Nelms and Dan Doctor of the U.S. Geological Survey (USGS). Substantial effort has been made to improve locational accuracy and remove duplication present between data sources. Hundreds of spring locations that were originally obtained using topographic maps or unknown methods were updated to sub-meter locational accuracy using post-processed differential GPS (PPGPS) and through the use of several generations of aerial imagery (2002-2017) obtained from Virginia's Geographic Information Network (VGIN) and 1-meter LIDAR, where available. Scores of new spring locations were also obtained by systematic quadrangle-by-quadrangle analysis in areas of the Shenandoah Valley where 1-meter LIDAR datasets were obtained from the U.S. Geological Survey. Future improvements to the dataset will result when statewide 1-meter LIDAR datasets become available and through continued field work by DEQ staff and other contributors working in the region. Please do not hesitate to contact the author to correct mistakes or to contribute to the database.

    The VDEQ Spring FIELD MEASUREMENTS database contains data describing field derived physio-chemical properties of spring discharges measured throughout the Commonwealth of Virginia. Field visits compiled in this dataset were performed from 1928 to 2019 by geologists with the State Water Control Board, the Virginia Division of Water and Power, the Virginia Department of Environmental Quality, and the U.S. Geological Survey with contributions from other sources as noted. Values of -9999 indicate that measurements were not performed for the referenced parameter. Please do not hesitate to contact the author to add data to the database or correct errors.


    The VDEQ_Spring_WQ database is a geodatabase containing groundwater sample information collected from springs throughout Virginia. Sample-specific information includes: location and site information, measured field parameters, and lab-verified quantifications of major ionic concentrations, trace element concentrations, nutrient concentrations, and radiological data. The VDEQ_Spring_WQ database is a subset of the VDEQ GWCHEM database, which is a flat-file geodatabase containing groundwater sample information from groundwater wells and springs throughout Virginia. Sample information has been correlated via DEQ Well # and projected using coordinates in the VDEQ_Spring_SITES database. The GWCHEM database comprises historic groundwater sample data originally archived in the United States Geological Survey (USGS) National Water Information System (NWIS) and the Environmental Protection Agency (EPA) Storage and Retrieval (STORET) data warehouse. Archived STORET data originated as groundwater sample data collected and uploaded by Virginia State Water Control Board personnel. While groundwater sample data in the STORET data warehouse are static, new groundwater sample data are periodically uploaded to NWIS, and spring laboratory WQ data reflect NWIS data downloaded on 9/30/2019. Recent groundwater sample data collected by Virginia Department of Environmental Quality (DEQ) personnel as part of the Ambient Groundwater Sampling Program are entered into the database as lab results are made available by the Division of Consolidated Laboratory Services (DCLS). When possible, charge balances were calculated for samples with reported values for major ions including (at a minimum) calcium, magnesium, potassium, sodium, bicarbonate, chloride, and sulfate. Reported values for Nitrate as N, carbonate, and fluoride were included in the charge balance calculation when available.
Field-determined values for bicarbonate and carbonate were used in the charge balance calculation when available. For much of the legacy DEQ groundwater sample data, bicarbonate values were derived from lab-reported values of alkalinity (as mg/CaCO3) under the assumption that there was no contribution by carbonate to the reported alkalinity value. Charge balance values are reported in the "Charge Balance" column of the GWCHEM geodatabase. The closer the charge balance value is to unity (1), the lower the assumed charge balance error. In order to preserve the numerical capabilities of the database, non-numeric lab qualifiers were given the following numeric identifiers:

    • - (minus sign) = less than the concentration specified to the right of the sign
    • -11110 = estimated
    • -22220 = presence verified but not quantified
    • -33330 = radchem non-detect, below sslc
    • -4440 = analyzed for but not detected
    • -55550 = greater than the concentration to the right of the zero
    • -66660 = sample held beyond normal holding time
    • -77770 = quality control failure; data not valid
    • -88880 = sample held beyond normal holding time; sample analyzed for but not detected; value stored is the limit of detection for the process in use
    • -11120 = value reported is less than the criteria of detection
    • -9999 = no data (parameter not quantified)
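One common formulation of the charge-balance idea described above is the ratio of total cation to total anion milliequivalents; values near unity indicate low charge-balance error. The sketch below uses that convention, with ion names and molar masses I supply for illustration; the exact formula used for the GWCHEM database is not specified in this description:

```python
# Assumed convention: charge balance = sum(cation meq/L) / sum(anion meq/L).
# mg/L -> meq/L conversion divides by (molar mass / |ionic charge|).
MG_PER_MEQ = {
    "Ca": 40.08 / 2, "Mg": 24.31 / 2, "Na": 22.99, "K": 39.10,   # cations
    "HCO3": 61.02, "Cl": 35.45, "SO4": 96.06 / 2,                # anions
}
CATIONS = {"Ca", "Mg", "Na", "K"}

def charge_balance(sample):
    """sample maps ion name -> concentration in mg/L.
    Returns the cation/anion equivalence ratio (1.0 = perfectly balanced)."""
    meq = {ion: conc / MG_PER_MEQ[ion] for ion, conc in sample.items()}
    cations = sum(v for ion, v in meq.items() if ion in CATIONS)
    anions = sum(v for ion, v in meq.items() if ion not in CATIONS)
    return cations / anions
```

For instance, a sample with 40.08 mg/L Ca (2 meq/L) against 70.90 mg/L Cl (2 meq/L) balances to exactly 1.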

    A more in-depth description and hydrogeologic analysis of the database can be found here.
    An in-depth data fact sheet can be found here.
  10. SF Web Analytics for SFGov Sites

    • kaggle.com
    zip
    Updated Sep 5, 2018
    Cite
    City of San Francisco (2018). SF Web Analytics for SFGov Sites [Dataset]. https://www.kaggle.com/san-francisco/sf-web-analytics-for-sfgov-sites
    Explore at:
    zip(159410 bytes). Available download formats
    Dataset updated
    Sep 5, 2018
    Dataset authored and provided by
    City of San Francisco
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    San Francisco
    Description

    Content

    Web analytics data for SFGov sites

    Context

    This is a dataset hosted by the City of San Francisco. The organization has an open data platform, found here, and updates its information according to the amount of data that is brought in. Explore San Francisco's data using Kaggle and all of the data sources available through the San Francisco organization page!

    • Update Frequency: This dataset is not updated.

    Acknowledgements

    This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.

    Cover photo by Chris Liverani on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.

  11. Ensembl TSS dataset for GRCh38

    • data.niaid.nih.gov
    • investigacion.ubu.es
    • +2more
    Updated Aug 26, 2024
    Cite
    José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio (2024). Ensembl TSS dataset for GRCh38 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7147596
    Explore at:
    Dataset updated
    Aug 26, 2024
    Dataset provided by
    Universidad de Burgos
    Authors
    José A. Barbero-Aparicio; Alicia Olivares-Gil; José F. Díez-Pastor; César García-Osorio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We used the human genome reference sequence in its GRCh38.p13 version in order to have a reliable source of data in which to carry out our experiments. We chose this version because it is the most recent one available in Ensembl at the moment. However, the DNA sequence by itself is not enough; the specific TSS position of each transcript is needed. In this section, we explain the steps followed to generate the final dataset. These steps are: raw data gathering, positive instances processing, negative instances generation and data splitting by chromosomes.

    First, we need an interface in order to download the raw data, which is composed of every transcript sequence in the human genome. We used Ensembl release 104 (Howe et al., 2020) and its utility BioMart (Smedley et al., 2009), which allows us to get large amounts of data easily. It also enables us to select a wide variety of interesting fields, including the transcription start and end sites. After filtering instances that present null values in any relevant field, this combination of the sequence and its flanks will form our raw dataset. Once the sequences are available, we find the TSS position (given by Ensembl) and the 2 following bases to treat it as a codon. After that, 700 bases before this codon and 300 bases after it are concatenated, yielding the final sequence of 1003 nucleotides that is going to be used in our models. These specific window values have been used in (Bhandari et al., 2021) and we have kept them as we find it interesting for comparison purposes. One of the most sensitive parts of this dataset is the generation of negative instances. We cannot get this kind of data in a straightforward manner, so we need to generate it synthetically. In order to get examples of negative instances, i.e. sequences that do not represent a transcript start site, we select random DNA positions inside the transcripts that do not correspond to a TSS. Once we have selected the specific position, we get 700 bases ahead and 300 bases after it as we did with the positive instances.

    Regarding the positive to negative ratio, in a similar problem, but studying TIS instead of TSS (Zhang et al., 2017), a ratio of 10 negative instances to each positive one was found optimal. Following this idea, we select 10 random positions from the transcript sequence of each positive codon and label them as negative instances. After this process, we end up with 1,122,113 instances: 102,488 positive and 1,019,625 negative sequences. In order to validate and test our models, we need to split this dataset into three parts: train, validation and test. We have decided to make this differentiation by chromosomes, as it is done in (Perez-Rodriguez et al., 2020). Thus, we use chromosome 16 as validation because it is a good example of a chromosome with average characteristics. Then we selected samples from chromosomes 1, 3, 13, 19 and 21 to be part of the test set and used the rest of them to train our models. Every step of this process can be replicated using the scripts available in https://github.com/JoseBarbero/EnsemblTSSPrediction.
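The windowing and negative-sampling scheme described above (700 bases upstream + the 3-base TSS codon + 300 bases downstream = 1003 nucleotides per instance, with 10 negatives per positive) can be sketched as follows. The toy sequence and positions are synthetic stand-ins for the Ensembl BioMart data; function names are mine, not the authors':

```python
import random

UPSTREAM, DOWNSTREAM = 700, 300   # window sizes from the description above

def window(sequence, pos):
    """Return the 1003-nt window around the 3-base codon starting at pos."""
    start, end = pos - UPSTREAM, pos + 3 + DOWNSTREAM
    if start < 0 or end > len(sequence):
        raise ValueError("window exceeds sequence bounds")
    return sequence[start:end]

def negative_windows(sequence, tss, n=10, seed=0):
    """Sample n random non-TSS positions from the same transcript and
    window them exactly like the positive instance."""
    rng = random.Random(seed)
    candidates = [p for p in range(UPSTREAM, len(sequence) - 3 - DOWNSTREAM)
                  if p != tss]
    return [window(sequence, p) for p in rng.sample(candidates, n)]

seq = "ACGT" * 600                      # 2400-nt toy transcript
pos_example = window(seq, 1000)         # one positive instance
neg_examples = negative_windows(seq, 1000)   # ten negative instances
```

Each call produces fixed-length 1003-nt strings, which is what keeps the downstream model inputs uniform.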

  12. Site Plan Cases

    • catalog.data.gov
    • datahub.austintexas.gov
    • +2more
    Updated Nov 25, 2025
    + more versions
    Cite
    data.austintexas.gov (2025). Site Plan Cases [Dataset]. https://catalog.data.gov/dataset/site-plan-cases
    Explore at:
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    data.austintexas.gov
    Description

    City of Austin Open Data Terms of Use: https://data.austintexas.gov/stories/s/ranj-cccq

    This data set contains information about the site plan case applications submitted for review to the City of Austin. The data set includes information about case status in the permit review system, case number, proposed use, applicant, owner, and location.

    Austin Development Services Data Disclaimer: The data provided are for informational use only and may differ from official department data. Austin Development Services’ database is continuously updated, so reports run at different times may produce different results. Care should be taken when comparing against other reports, as different data collection methods and different data sources may have been used. Austin Development Services does not assume any liability for any decision made or action taken or not taken by the recipient in reliance upon any information or data provided.

  13. GiGL Open Space Friends Group subset

    • gimi9.com
    • data.europa.eu
    Cite
    GiGL Open Space Friends Group subset [Dataset]. https://gimi9.com/dataset/uk_gigl-open-space-friends-group-subset/
    Explore at:
    Description

    Introduction

    The GiGL Open Space Friends Group subset provides locations and boundaries for selected open space sites in Greater London. The chosen sites represent sites that have established Friends Groups in Greater London and are therefore important to local communities, even if they may not be accessible open spaces, or don’t typically function as destinations for leisure, activities and community engagement*. Friends Groups are groups of interested local people who come together to protect, enhance and improve their local open space or spaces. The dataset has been created by Greenspace Information for Greater London CIC (GiGL). As London’s Environmental Records Centre, GiGL mobilises, curates and shares data that underpin our knowledge of London’s natural environment. We provide impartial evidence to support informed discussion and decision making in policy and practice. GiGL maps under licence from the Greater London Authority.

    *Publicly accessible sites for leisure, activities and community engagement can be found in GiGL's Spaces to Visit dataset.

    Description

    This dataset is a sub-set of the GiGL Open Space dataset, the most comprehensive dataset available of open spaces in London. Sites are selected for inclusion in the Friends Group subset based on whether there is a friends group recorded for the site in the Open Space dataset. The dataset is a mapped Geographic Information System (GIS) polygon dataset where one polygon (or multi-polygon) represents one space. As well as site boundaries, the dataset includes information about a site’s name, size, access and type (e.g. park, playing field etc.) and the name and/or web address of the site’s friends group. GiGL developed the dataset to support anyone who is interested in identifying sites in London with friends groups - including friends groups and other community groups, web and app developers, policy makers and researchers - with an open licence data source.
    More detailed and extensive data are available under GiGL data use licences for GiGL partners, researchers and students. Information services are also available for ecological consultants, biological recorders, community groups and members of the public – please see www.gigl.org.uk for more information. The dataset is updated on a quarterly basis. If you have questions about this dataset please contact GiGL’s GIS and Data Officer.

    Data sources

    The boundaries and information in this dataset are a combination of data collected during the London Survey Method habitat and open space survey programme (1986 – 2008) and information provided to GiGL from other sources since. These sources include London borough surveys, land use datasets, volunteer surveys, feedback from the public, park friends’ groups, and updates made as part of GiGL’s on-going data validation and verification process. This is a preliminary version of the dataset as there is currently low coverage of friends groups in GiGL’s Open Space database. We are continually working on updating and improving this dataset. If you have any additional information or corrections for sites included in GiGL’s Friends Group subset please contact GiGL’s GIS and Data Officer.

    NOTE: The dataset contains OS data © Crown copyright and database rights 2025. The site boundaries are based on Ordnance Survey mapping, and the data are published under Ordnance Survey's 'presumption to publish'. When using these data please acknowledge GiGL and Ordnance Survey as the source of the information using the following citation: ‘Dataset created by Greenspace Information for Greater London CIC (GiGL), 2025 – Contains Ordnance Survey and public sector information licensed under the Open Government Licence v3.0’

  14. Data from: Our Block

    • data.cityofchicago.org
    Updated Nov 29, 2025
    + more versions
    Cite
    Chicago Police Department (2025). Our Block [Dataset]. https://data.cityofchicago.org/Public-Safety/Our-Block/285v-myf3
    Explore at:
    xml, csv, kmz, kml, application/geo+json, xlsx. Available download formats
    Dataset updated
    Nov 29, 2025
    Authors
    Chicago Police Department
    Description

    This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at 312.745.6071 or RandD@chicagopolice.org.

    Disclaimer: These crimes may be based upon preliminary information supplied to the Police Department by the reporting parties that have not been verified. The preliminary crime classifications may be changed at a later date based upon additional investigation and there is always the possibility of mechanical or human error. Therefore, the Chicago Police Department does not guarantee (either expressed or implied) the accuracy, completeness, timeliness, or correct sequencing of the information and the information should not be used for comparison purposes over time. The Chicago Police Department will not be responsible for any error or omission, or for the use of, or the results obtained from the use of this information. All data visualizations on maps should be considered approximate and attempts to derive specific addresses are strictly prohibited. The Chicago Police Department is not responsible for the content of any off-site pages that are referenced by or that reference this web page other than an official City of Chicago or Chicago Police Department web page. The user specifically acknowledges that the Chicago Police Department is not responsible for any defamatory, offensive, misleading, or illegal conduct of other users, links, or third parties and that the risk of injury from the foregoing rests entirely with the user.

    The unauthorized use of the words "Chicago Police Department," "Chicago Police," or any colorable imitation of these words or the unauthorized use of the Chicago Police Department logo is unlawful. This web page does not, in any way, authorize such use.

    Data is updated daily Tuesday through Sunday. The dataset contains more than 65,000 records/rows of data and cannot be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as WordPad, to view and search. To access a list of Chicago Police Department - Illinois Uniform Crime Reporting (IUCR) codes, go to http://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e
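    Because the export is too large to open in full in a spreadsheet, streaming it row by row is the practical approach. A minimal Python sketch using the standard csv module; the column names shown here are illustrative stand-ins, not the exact CLEAR export schema:

```python
import csv
import io

# Tiny synthetic stand-in for the exported crime CSV; the real file has
# far more rows and columns (this schema is illustrative only).
raw = io.StringIO(
    "ID,Primary Type,Block\n"
    "1,THEFT,0000X N STATE ST\n"
    "2,BATTERY,0100X W LAKE ST\n"
    "3,THEFT,0000X N STATE ST\n"
)

counts = {}
reader = csv.DictReader(raw)
for row in reader:  # streams one row at a time; never loads the whole file
    key = row["Primary Type"]
    counts[key] = counts.get(key, 0) + 1

print(counts)  # {'THEFT': 2, 'BATTERY': 1}
```

    The same loop works unchanged on the real download by replacing the StringIO object with `open("crimes.csv", newline="")`.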

  15. Open access article processing charges longitudinal study 2015 preliminary...

    • borealisdata.ca
    • search.dataone.org
    Updated Nov 25, 2015
    Cite
    Heather Morrison; Jihane Salhab; Guinsly Mondésir; Alexis Calvé-Genest; César Villamizar; Lisa Desautels (2015). Open access article processing charges longitudinal study 2015 preliminary dataset [Dataset]. http://doi.org/10.5683/SP3/DX005L
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 25, 2015
    Dataset provided by
    Borealis
    Authors
    Heather Morrison; Jihane Salhab; Guinsly Mondésir; Alexis Calvé-Genest; César Villamizar; Lisa Desautels
    License

    https://borealisdata.ca/api/datasets/:persistentId/versions/3.0/customlicense?persistentId=doi:10.5683/SP3/DX005L

    Description

    One of the lines of research of Sustaining the Knowledge Commons (SKC) is a longitudinal study of the minority (about a third) of fully open access journals that use the article processing charges business model. The original idea was to gather data during an annual two-week census period; the volume of data and growth in this area make that an impractical goal. For this reason, we are posting this preliminary dataset in case it might be helpful to others working in this area. Future data gathering and analysis will be conducted on an ongoing basis. Major sources of data for this dataset include:

    • the Directory of Open Access Journals (DOAJ) downloadable metadata; the base set is from May 2014, with some additional data from the 2015 dataset
    • data on publisher article processing charges and related information gathered from publisher websites by the SKC team in 2015, 2014 (Morrison, Salhab, Calvé-Genest & Horava, 2015) and a 2013 pilot
    • DOAJ article content data screen scraped from DOAJ (caution: this data can be quite misleading due to limitations with article-level metadata)
    • subject analysis based on DOAJ subject metadata in 2014 for selected journals
    • data on APCs gathered in 2010 by Solomon and Björk (supplied by the authors); note that Solomon and Björk use a different method of calculating APCs, so the numbers are not directly comparable
    • note that this full dataset includes some working columns which are meaningful only as explanations of very specific calculations that are not necessarily evident in the dataset per se; details below

    Significant limitations:

    • This dataset does not include new journals added to DOAJ in 2015. A recent publisher size analysis indicates some significant changes: for example, DeGruyter, not listed in the 2014 survey, is now the third largest DOAJ publisher with over 200 titles, and Elsevier is now the 7th largest DOAJ publisher. In both cases, gathering data from the publisher websites will be time-consuming, as it requires individual title look-up.
    • Some OA APC data for newly added journals was gathered in May 2015 but has not yet been added to this dataset. One of the reasons for gathering this data is a comparison of the DOAJ "one price listed" approach with potentially richer data on the publisher's own website.

    For full details see the documentation.

  16. Bibliography of Egyptological databases and datasets

    • nde-dev.biothings.io
    Updated Feb 23, 2024
    Cite
    Ilin-Tomich, Alexander (2024). Bibliography of Egyptological databases and datasets [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10691756
    Explore at:
    Dataset updated
    Feb 23, 2024
    Dataset provided by
    Konrad, Tobias
    Ilin-Tomich, Alexander
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Bibliography of Egyptological databases and datasets aims to provide an annotated list of digital publications, both online and offline, of the types that are not covered by conventional Egyptological bibliographies, namely databases, text, image and 3D datasets, and other digital assets. It aims to cover resources that are or were publicly available (even if on a paid basis) rather than private and project internal databases. Until ten years ago, there existed an annotated online database of Egyptological resources then available on the internet, including databases and datasets, called SISYPHOS – Internetquellen zur Ägyptologie, Altorientalistik und zur Klassischen Archäologie. In June 2014 it was taken offline without (to our knowledge) archiving its data in any publicly accessible repository. An incomplete copy, preserved in the Internet Archive, gives some idea of what the database looked like. Our aim is to provide a similar service reflecting the proliferation of online datasets in Egyptology in recent years. Obviously, the idea of cataloguing all Egyptological resources on the internet is no longer viable, at least not as a leisure-time project such as our bibliography. Hence, we had to set clear limits on what is included in the bibliography.

    Scope

    Only digital publications are included. Not included are digitized versions of printed Egyptological publications and digital publications in conventional formats (books, journals, papers, encyclopaedias, and theses). Also excluded are blogs, social media accounts, excavation, exhibition, and project websites without formally structured datasets, general public websites and media, and collections of Egyptological weblinks. Databases and datasets that are supplements to conventional books and papers are also not included. Online databases of museum collections are not included; one can refer to existing overviews of these resources provided by CIPEG, the AKU project, and Alexander Ilin-Tomich. The Bibliography of Egyptological databases and datasets excludes resources devoted to Greek, Latin, and Arabic texts from Egypt. The current version of the Bibliography of Egyptological Databases and Datasets does not aim to cover Coptological datasets, although we would welcome efforts to fill this gap. The current version also does not include Egyptological applications, fonts, and online tools.

    Other lists of digital Egyptological resources

    In compiling this database, we have benefited from a number of other efforts to catalogue digital Egyptological resources. As a token of appreciation, we enumerate here the lists that we have used (all links accessed on 20 February 2024):

    Archaeologicallinks. Data and resources (Egypt & Sudan).

    Beek, Nicky van den. Databases.

    Bodleian Libraries, Oxford University. Egyptology. Online resources.

    Chappaz, Jean-Luc and Sandra Poggia (Guarnori). Ressources égyptologiques informatisées. Bulletin de la Société d'Égyptologie de Genève 18 (1994): 97–102; 19 (1995), 115–132; 20 (1996), 95–115; 21 (1997), 103–124; 22 (1998), 107–136; 24 (2001), 123.

    Claes, Wouter, and Ellen van Keer. Les ressources numériques pour l'égyptologie. Bibliotheca Orientalis 71 (2014): 297–306.

    Egyptologists’ Electronic Forum. On-line Egyptological bibliographies, databases, search engines, and other resources.

    El-Enany, Khaled. Sélection de sites pour la recherche en égyptologie.

    Hamilton Lugar School of Global and International Studies, Indiana University Bloomington. Egyptology: resources.

    Institut français d'archéologie orientale. Ressources / bibliographies.

    Institut für Ägyptologie, Universität Heidelberg. Nützliches.

    Institut für Ägyptologie und Koptologie, Ludwig-Maximilians-Universität München. Ägyptologische Material-Datenbanken online.

    Jones, Charles E., and Tom Elliott. The AWOL Index. Index of resources by keywords.

    Library Services, University College London. Egyptology. Specialist databases and resources.

    Seminar für Ägyptologie und Koptologie, Georg-August-Universität Göttingen. Datenbanken & Online-Lexika.

    Société française d’égyptologie. Bases de données en ligne.

    Strudwick, Nigel. Essential resources.

    UCLA Library. Ancient Near East and Egypt: image resources and archaeological data.

    Usage

    The bibliography is currently offered as a Zotero library (https://www.zotero.org/groups/4851156/). You have to open the group library on Zotero to access the bibliography itself. Entries are assigned to different types: database, spreadsheet, text dataset, image dataset, 3D dataset, digitized archival materials, digitized print materials, controlled vocabulary, and GIS dataset. Users can browse records of specific types by using collections defined within the library. Entries are tagged to characterize their subject matter. Zotero allows users to filter records by one or more tags. The Abstract field is used to briefly describe each entry, and the Extra field is used to describe access modalities and provide additional links, including links to pages with full credits and citation rules. The Date field is used for the date of the last modification, if found on the website, or the publication date for offline media. Authorship of datasets is not always easy to establish, and the rigid structure of the bibliographic software we use (Zotero allows only authors and contributors to be entered) does not make it easier to reflect the different roles sometimes indicated on project websites.

    The dataset is also available for download in the Zotero RDF, CSL JSON, and CSV formats in this repository. Using these exports, you can import the complete library into your local instance of Zotero or use the data in other reference management software. If you plan to import data into Zotero, it is best to use the Zotero RDF file. The data from the Bibliography of Egyptological databases and datasets can be freely reused under the CC0 1.0 Universal license.

    The database is a work in progress, and we would appreciate any corrections or additions. Please do not hesitate to contact us at the email addresses given below. We plan to update the bibliography constantly.
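    Tag-based filtering of the CSV export, as described above, can be sketched in Python. The "Title" and "Manual Tags" column names and the semicolon separator are assumptions about Zotero's CSV export format, and the rows are invented examples:

```python
import csv
import io

# Toy rows shaped like a Zotero CSV export; column names and separator
# are assumptions about that format, and the entries are made up.
raw = io.StringIO(
    "Title,Item Type,Manual Tags\n"
    "Example DB,webpage,prosopography; database\n"
    "Example 3D set,webpage,3D dataset\n"
)

wanted = "database"
hits = [
    row["Title"]
    for row in csv.DictReader(raw)
    if wanted in [t.strip() for t in row["Manual Tags"].split(";")]
]
print(hits)  # ['Example DB']
```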

    Alexander Ilin-Tomich, ailintom@uni-mainz.de; Tobias Konrad, tokonrad@uni-mainz.de. February 2024

  17. Dataset: Publication cultures and Dutch research output: a quantitative...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Kramer, Bianca; Bosman, Jeroen (2020). Dataset: Publication cultures and Dutch research output: a quantitative assessment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2643366
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Utrecht University Library
    Authors
    Kramer, Bianca; Bosman, Jeroen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset belonging to the report: Publication cultures and Dutch research output: a quantitative assessment

    On the report:

    Research into publication cultures commissioned by VSNU and carried out by Utrecht University Library has detailed university output beyond just journal articles, as well as the possibilities to assess open access levels of these other output types. For all four main fields reported on, the use of publication types other than journal articles is indeed substantial. For Social Sciences and Arts & Humanities in particular (with over 40% and over 60% of output respectively not being regular journal articles) looking at journal articles only ignores a significant share of their contribution to research and society. This is not only about books and book chapters, either: book reviews, conference papers, reports, case notes (in law) and all kinds of web publications are also significant parts of university output.

    Analyzing all these publication forms and especially determining to what extent they are open access is currently not easy. Even combining some of the largest citation databases (Web of Science, Scopus and Dimensions) leaves out a lot of non-article content, and in some fields even journal articles are only partly covered. Lacking metadata like affiliations and DOIs (either in the original documents or in the scholarly search engines) makes it even harder to analyze open access levels by institution and field. Using repository-harvesting databases like BASE and NARCIS in addition to the main citation databases improves understanding of open access of non-article output, but these routes also have limitations. The report has recommendations for stakeholders, mostly to improve metadata and coverage and apply persistent identifiers.

  18. HMA Subapplications Project Site Inventories

    • catalog.data.gov
    • s.cnmilf.com
    Updated Sep 8, 2025
    + more versions
    Cite
    FEMA/RESILIENCE/FIMA (2025). HMA Subapplications Project Site Inventories [Dataset]. https://catalog.data.gov/dataset/hma-subapplications-project-site-inventories
    Explore at:
    Dataset updated
    Sep 8, 2025
    Dataset provided by
    Federal Emergency Management Agency (http://www.fema.gov/)
    Description

    This dataset contains the Project Site Inventories from the Hazard Mitigation Assistance (HMA) subapplications/subgrants from the FEMA Grants Outcomes (FEMA GO) system (FEMA's new grants management system). FEMA GO started accepting Flood Mitigation Assistance (FMA) and Building Resilient Infrastructure and Communities (BRIC) subapplications in Fiscal Year 2020. FEMA GO is projected to support the Hazard Mitigation Grant Program (HMGP) in Calendar Year 2023. For details on HMA Project Site Inventories not captured in FEMA GO, visit https://www.fema.gov/openfema-data-page/hazard-mitigation-assistance-mitigated-properties-v3.

    This dataset contains information on the Project Site Inventories identified in the HMA subapplications/subgrants that have been submitted to or awarded in FEMA GO, as well as amendments made to the awarded subgrants. The Project Site Inventory contains information regarding the Building, Infrastructure/Utility/other, and/or Vacant Land proposed to be mitigated by the subapplication/subgrant. Sensitive information, such as Personally Identifiable Information (PII), has been removed to protect privacy. The information in this dataset has been deemed appropriate for publication to empower public knowledge of mitigation activities and the nature of HMA grant programs. For more information on the HMA grant programs, visit: https://www.fema.gov/grants/mitigation. For more information on FEMA GO, visit: https://www.fema.gov/grants/guidance-tools/fema-go.

    This dataset comes from the source system mentioned above and is subject to a small percentage of human error. In some cases, data was not provided by the subapplicant, applicant, and/or entered into FEMA GO. Due to the voluntary nature of the Hazard Mitigation Assistance Programs, not all Project Site Inventory in this dataset will be mitigated. As FEMA GO continues development, additional fields may be added to this dataset to indicate the final status of individual inventory. This dataset is not intended to be used for any official federal financial reporting.

    FEMA's terms and conditions and citation requirements for datasets (API usage or file downloads) can be found on the OpenFEMA Terms and Conditions page: https://www.fema.gov/about/openfema/terms-conditions.

    For answers to Frequently Asked Questions (FAQs) about the OpenFEMA program, API, and publicly available datasets, please visit: https://www.fema.gov/about/openfema/faq. If you have media inquiries about this dataset, please email the FEMA News Desk at FEMA-News-Desk@fema.dhs.gov or call (202) 646-3272. For inquiries about FEMA's data and Open Government program, please email the OpenFEMA team at OpenFEMA@fema.dhs.gov.

  19. Data from: Inventory of online public databases and repositories holding...

    • agdatacommons.nal.usda.gov
    • s.cnmilf.com
    • +2more
    txt
    Updated Feb 8, 2024
    + more versions
    Cite
    Erin Antognoli; Jonathan Sears; Cynthia Parr (2024). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. http://doi.org/10.15482/USDA.ADC/1389839
    Explore at:
    txtAvailable download formats
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    Ag Data Commons
    Authors
    Erin Antognoli; Jonathan Sears; Cynthia Parr
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data. Purpose As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to

    • establish where agricultural researchers in the United States -- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals
    • compare how much data is in institutional vs. domain-specific vs. federal platforms
    • determine which repositories are recommended by top journals that require or recommend the publication of supporting data
    • ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data

    Approach

    The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

    Search methods

    We first compiled a list of known domain-specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag-specific university repositories and general university repositories that housed a portion of agricultural data. Ag-specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories.

    Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals in which USDA published in 2012 and 2016 were compiled, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for Journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required? and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.

    Evaluation

    We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

    Results

    A summary of the major findings from our data review:

    Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors. There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
    Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.
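    The evaluation thresholds described above (share of a repository's collection per search term, and absolute hit counts) can be sketched as follows; the function name and the numbers are made up for illustration:

```python
# Sketch of the evaluation rule described above: for each search term,
# flag when its hits comprise at least 1% / 5% of a repository's total
# collection, or exceed 100 / 500 absolute results.
def evaluate(term_hits, collection_size):
    report = {}
    for term, hits in term_hits.items():
        share = hits / collection_size
        report[term] = {
            "ge_1pct": share >= 0.01,
            "ge_5pct": share >= 0.05,
            "gt_100": hits > 100,
            "gt_500": hits > 500,
        }
    return report

report = evaluate({"agriculture": 650, "soil": 90}, collection_size=10_000)
print(report["agriculture"])
# {'ge_1pct': True, 'ge_5pct': True, 'gt_100': True, 'gt_500': True}
```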

    See the included README file for descriptions of each individual data file in this dataset. Resources in this dataset:

    • Resource Title: Journals. File Name: Journals.csv
    • Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
    • Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
    • Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
    • Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
    • Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
    • Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt

  20. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Loist, Skadi; Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Film University Babelsberg KONRAD WOLF
    Authors
    Loist, Skadi; Samoilova, Evgenia (Zhenya)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review at NECSUS_European Journal of Media Studies, an open access journal that aims to enhance data transparency and reusability; once published, it will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
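    The long-to-wide reduction described above (keeping only the first sample festival per film) can be sketched in a few lines of Python; the field names and values are illustrative, not the dataset's actual column names:

```python
# Collapse long-format rows (one per film x festival appearance) to wide
# format, keeping the first sample festival per film ID, as described above.
# Field names are illustrative stand-ins for the dataset's columns.
long_rows = [
    {"film_id": "F001", "fest": "Berlinale", "year": 2017},
    {"film_id": "F001", "fest": "Frameline", "year": 2017},
    {"film_id": "F002", "fest": "Frameline", "year": 2017},
]

wide = {}
for row in long_rows:                     # rows assumed ordered by appearance
    wide.setdefault(row["film_id"], row)  # first occurrence per film ID wins

print(wide["F001"]["fest"])  # Berlinale
```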

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts used for web scraping. They were written in R version 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, falling back to an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches each film in the core dataset with the suggested films on the IMDb search page and computes matching scores. Matching uses data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two string-distance methods, “cosine” and “osa”: cosine similarity matches titles with a high degree of overall similarity, while the OSA (optimal string alignment) algorithm catches titles with typos or minor variations.
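The original scripts use R's stringdist methods; as a pure-Python sketch of what the two measures do (the exact implementation and any thresholds in the R scripts may differ):

```python
from collections import Counter

def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + cost)   # substitution
            # adjacent transposition ("ti" vs "it")
            if i > 1 and j > 1 and a[i-1] == b[j-2] and a[i-2] == b[j-1]:
                d[i][j] = min(d[i][j], d[i-2][j-2] + 1)
    return d[len(a)][len(b)]

def cosine_similarity(a: str, b: str, q: int = 2) -> float:
    """Cosine similarity over character q-gram counts (default: bigrams)."""
    ga = Counter(a[i:i+q] for i in range(len(a) - q + 1))
    gb = Counter(b[i:i+q] for i in range(len(b) - q + 1))
    dot = sum(ga[g] * gb[g] for g in ga)
    na = sum(v * v for v in ga.values()) ** 0.5
    nb = sum(v * v for v in gb.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

print(osa_distance("parasite", "parastie"))  # 1: one adjacent transposition
```

OSA flags the swapped letters as a single edit, which is exactly the kind of typo that plain Levenshtein would count as two substitutions.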

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was assigned to one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.
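The five-way classification can be sketched as a simple rule cascade; the thresholds and input features here are illustrative assumptions, not the actual cut-offs used in the R script:

```python
def categorize_match(title_sim: float, year_diff: int, director_match: bool) -> str:
    """Assign one of the five manual-check categories.
    Thresholds are illustrative placeholders, not the script's real values."""
    if title_sim == 1.0 and year_diff == 0 and director_match:
        return "a) 100% match"
    if title_sim >= 0.9 and abs(year_diff) <= 1 and director_match:
        return "b) likely good match"
    if title_sim >= 0.7 and abs(year_diff) <= 1:
        return "c) maybe match"
    if title_sim >= 0.5:
        return "d) unlikely match"
    return "e) no match"

print(categorize_match(1.0, 0, True))   # a) 100% match
print(categorize_match(0.75, 0, False)) # c) maybe match
```

The point of the cascade is that every candidate pair lands in exactly one bucket, so the manual check can be prioritized from a) downwards.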

    The script “r_4_scraping_functions” defines functions for scraping the data from the identified matches (based on the scripts described above and a manual check). These functions are used for scraping the data in the following scripts.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, to check that everything works; scraping the entire dataset took a few hours, so a test run on a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure the errors were not caused by disruptions in the internet connection or other technical issues.
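The retry pass can be sketched as follows; `scrape_fn`, the retry count, and the delay are hypothetical placeholders standing in for the actual R implementation:

```python
import time

def scrape_with_retry(film_ids, scrape_fn, retries=2, delay=1.0):
    """Re-attempt scraping for films whose earlier pass failed, so that
    transient network errors are not mistaken for genuinely missing data.
    scrape_fn(film_id) is a placeholder for the real scraping function."""
    results, skipped = {}, []
    for fid in film_ids:
        for attempt in range(retries + 1):
            try:
                results[fid] = scrape_fn(fid)
                break
            except Exception:
                if attempt < retries:
                    time.sleep(delay)  # back off before the next attempt
        else:
            skipped.append(fid)  # still failing after all retries
    return results, skipped
```

Films that remain in `skipped` after the retry pass can then be treated as true missing data rather than connection hiccups.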

    The script “r_check_logs” is used for troubleshooting and for tracking the progress of all the R scripts described above. It reports the number of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists variable definitions (such as location, festival name, and festival categories), units of measurement, data sources, coding, and missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both the IMDb and survey sources. This dataset is in wide format, i.e. all information for each festival is listed in one row. This
