100+ datasets found
  1. Dataset inventory

    • data.sfgov.org
    application/rdfxml +5
    Updated Mar 12, 2025
    Cite
    DataSF (2025). Dataset inventory [Dataset]. https://data.sfgov.org/w/y8fp-fbf5/ikek-yizv?cur=kRuUwDH4vsx
    Explore at:
    csv, tsv, application/rssxml, json, application/rdfxml, xml (available download formats)
    Dataset updated
    Mar 12, 2025
    Dataset authored and provided by
    DataSF
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
    License information was derived automatically

    Description

    A. SUMMARY The dataset inventory provides a list of data maintained by departments that are candidates for open data publishing or have already been published and is collected in accordance with Chapter 22D of the Administrative Code. The inventory will be used in conjunction with department publishing plans to track progress toward meeting plan goals for each department.

    B. HOW THE DATASET IS CREATED This dataset is collated in two ways:
    1. Ongoing updates are made throughout the year to reflect new datasets; this process involves DataSF staff reconciling publishing records after datasets are published.
    2. Annual bulk updates: departments review their inventories, identify changes and updates, and submit those to DataSF for a once-a-year bulk update. Not all departments will have changes, or their changes will already have been captured as ongoing updates over the course of the prior year.

    C. UPDATE PROCESS The dataset is synced automatically daily, but the underlying data changes manually throughout the year as needed

    D. HOW TO USE THIS DATASET Interpreting dates in this dataset: this dataset has two dates:
    1. Date Added - when the dataset was added to the inventory itself.
    2. First Published - the open data portal automatically captures the date the dataset was first created; this is that system-generated date.

    Note that in certain cases we may have published a dataset prior to it being added to the inventory. We do our best to have an accurate accounting of when something was added to this inventory and when it was published. In most cases the inventory addition will happen prior to publishing, but in certain cases it will be published and we will have missed updating the inventory as this is a manual process.

    First Published gives an accounting of when the dataset actually became available on the open data catalog, and Date Added of when it was added to this list.

    E. RELATED DATASETS

  2. Inventory of citywide enterprise systems of record
  3. Dataset Inventory: Column-Level Details

  • Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    Cite
    Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14330132
    Explore at:
    csv (available download formats)
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the datasets published here contains actual data; they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • the common identifier dataset_30 indicates that both files belong to the same graph.

    CSV nodes

    Each dataset contains the following columns:

    Name of the Column    Type      Description
    UniProt ID            string    protein identification
    label                 string    protein label (type of node)
    properties            string    a dictionary containing properties related to the protein

    CSV edges

    Each dataset contains the following columns:

    Name of the Column    Type      Description
    Relationship ID       string    relationship identification
    Source ID             string    identification of the source protein in the relationship
    Target ID             string    identification of the target protein in the relationship
    label                 string    relationship label (type of relationship)
    properties            string    a dictionary containing properties related to the relationship

    Metadata

    Graph             Number of Nodes    Number of Edges    Sparse graph
    dataset_30*       30                 47                 Y
    dataset_60*       60                 181                Y
    dataset_120*      120                689                Y
    dataset_240*      240                2819               Y
    dataset_300*      300                4658               Y
    dataset_600*      600                18004              Y
    dataset_1200*     1200               71785              Y
    dataset_2400*     2400               288600             Y
    dataset_3000*     3000               449727             Y
    dataset_6000*     6000               1799413            Y
    dataset_12000*    12000              7199863            Y
    dataset_24000*    24000              28792361           Y

    This repository includes two (2) additional tiny graph datasets for experimenting before dealing with the larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column    Type      Description
    ID                    string    node identification
    label                 string    node label (type of node)
    properties            string    a dictionary containing properties related to the node

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column    Type      Description
    ID                    string    relationship identification
    source                string    identification of the source node in the relationship
    target                string    identification of the target node in the relationship
    label                 string    relationship label (type of relationship)
    properties            string    a dictionary containing properties related to the relationship

    Metadata (tiny graphs)

    Graph              Number of Nodes    Number of Edges    Sparse graph
    dataset_dummy*     3                  6                  N
    dataset_dummy2*    3                  6                  N
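
    A minimal loading sketch (assuming pandas and networkx are installed and using the column names listed above) that builds one of the graphs from its paired node and edge files:

        import pandas as pd
        import networkx as nx

        # Load the paired CSV files for the dataset_30 graph (30 nodes, 47 edges).
        nodes = pd.read_csv("dataset_30_nodes_interactions.csv")
        edges = pd.read_csv("dataset_30_edges_interactions.csv")

        # Build the graph from the edge list, keeping label and properties as edge attributes.
        G = nx.from_pandas_edgelist(
            edges, source="Source ID", target="Target ID", edge_attr=["label", "properties"]
        )

        # Attach node attributes from the node table.
        for _, row in nodes.iterrows():
            G.add_node(row["UniProt ID"], label=row["label"], properties=row["properties"])

        print(G.number_of_nodes(), G.number_of_edges())  # expected: 30 and 47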
  • HIRENASD Experimental Data - matlab format

    • catalog.data.gov
    • data.nasa.gov
    Updated Dec 6, 2023
    Cite
    Dashlink (2023). HIRENASD Experimental Data - matlab format [Dataset]. https://catalog.data.gov/dataset/hirenasd-experimental-data-matlab-format
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Description

    This resource contains the experimental data that was included in the Tecplot input files, but in MATLAB files. dba1_cp has all the results and is dimensioned (7,2): the first dimension is 1-7 for each span station; the second dimension is 1 for the upper surface and 2 for the lower surface.
    dba1_cp(ispan,isurf).x are the x/c locations at span station (ispan) and upper (isurf=1) or lower (isurf=2) surface.
    dba1_cp(ispan,isurf).y are the eta locations at span station (ispan) and upper (isurf=1) or lower (isurf=2) surface.
    dba1_cp(ispan,isurf).cp are the pressures at span station (ispan) and upper (isurf=1) or lower (isurf=2) surface.
    Unsteady CP is dimensioned with 4 columns: 1st column, real; 2nd column, imaginary; 3rd column, magnitude; 4th column, phase (deg).
    M, Re and other pertinent variables are included as variables and are also included in casedata.M, etc.
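
    A minimal reading sketch in Python (assuming SciPy is available; the file name is hypothetical, and the struct layout follows the description above):

        from scipy.io import loadmat

        # Load the MATLAB file (file name is hypothetical); struct_as_record=False gives
        # attribute-style access to the dba1_cp struct fields described above.
        mat = loadmat("hirenasd_experimental_data.mat", squeeze_me=True, struct_as_record=False)
        dba1_cp = mat["dba1_cp"]

        # First span station, upper surface (0-based indices here, 1-based in the description).
        station = dba1_cp[0, 0]
        print(station.x[:5])   # x/c locations
        print(station.cp[:5])  # pressures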

  • Wave runup FieldData

    • auckland.figshare.com
    • catalogue.data.govt.nz
    zip
    Updated Nov 8, 2022
    Cite
    Giovanni Coco; Paula Gomes (2022). Wave runup FieldData [Dataset]. http://doi.org/10.17608/k6.auckland.7732967.v4
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 8, 2022
    Dataset provided by
    The University of Auckland
    Authors
    Giovanni Coco; Paula Gomes
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    INFORMATION ABOUT THE CONTENT OF THIS DATABASE

    This database comprises wave, beach and runup parameters measured on different beaches around the world. It is a compilation of data published in previous works, with the aim of making all data available in one single repository. More information about methods of data acquisition and data processing can be found in the original papers that describe each experiment. To know how to cite each of the datasets provided here, please check section 3. Please make sure to cite the appropriate publication when using the data. Collecting the data is hard work and needs to be acknowledged.

    1. Files content
    All data files share the same structure:
    • Column 1 – R2%: 2-percent exceedance value for runup [m]
    • Column 2 – Set: setup [m]
    • Column 3 – Stt: total swash excursion [m]
    • Column 4 – Sinc: incident swash [m]
    • Column 5 – Sig: infragravity swash [m]
    • Column 6 – Hs*: significant deep-water wave height [m]
    • Column 7 – Tp: peak wave period [s]
    • Column 8 – tanβ: foreshore beach slope
    • Column 9 – D50**: median sediment size [mm]
    NaN values may be found when the data were not available in the original dataset.
    *Hs values from field measurements were deshoaled from the depth of measurement to a depth equal to 80 m, assuming normal approach and linear theory (we followed the approach presented in Stockdon et al., where great care is paid to make the data comparable).
    **D50 values were obtained from reports and papers describing the beaches.

    2. List of datasets
    • Stockdon et al. 2006: Data recompiled from 10 experiments carried out on 6 beaches (US and NL coasts). Files' names correspond to the beach and year of the experiments. Original data: available at https://pubs.usgs.gov/ds/602/
    • Senechal et al. 2011: This dataset comprises the measurements carried out at Truc Vert beach, France. The file's name includes the name of the beach and the year of the experiment. Original data: a table with the full content of the parameters measured during the experiment can be found in Senechal et al. (2011).
    • Guedes et al. 2011: This dataset comprises data measured at Tairua beach (New Zealand coast). The file's name indicates the name of the beach and the year of the experiment. Original data: this repository.
    • Guedes et al. 2013: This dataset comprises data measured at Ngarunui beach (Raglan - New Zealand coast). The file's name represents the name of the beach and the year of the experiment. Original data: this repository.
    • Gomes da Silva et al. 2018: Dataset measured during two field campaigns at Somo beach, Spain, in 2016 and 2017. The files' names represent the name of the beach and the year of the experiment. Original data: https://data.mendeley.com/datasets/6yh2b327gd/4

    • Power et al. 2019: Dataset compiled from previous works, comprising field and laboratory measurements: Poate et al. (2016): field; Nicolae-Lerma et al. (2016): field; Atkinson et al. (2017): field; Mase (1989): laboratory; Baldock and Huntley (2002): laboratory; Howe (2016): laboratory. Original data: www.sciencedirect.com/science/article/pii/S0378383918302552

    Due to the character limit of this description, please refer to the https://coastalhub.science/wave-runup-read-me for the references list.
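
    A minimal reading sketch (assuming one of the data files has been extracted from the zip as a headerless CSV; the file name is hypothetical, and the column order follows section 1 above):

        import pandas as pd

        # Column order as described in section 1; NaN marks values missing in the original dataset.
        cols = ["R2", "Set", "Stt", "Sinc", "Sig", "Hs", "Tp", "tanB", "D50"]
        runup = pd.read_csv("TrucVert_2011.csv", header=None, names=cols)

        print(runup[["R2", "Hs", "tanB"]].describe())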

  • Data from: BEING A TREE CROP INCREASES THE ODDS OF EXPERIENCING YIELD...

    • zenodo.org
    bin, zip
    Updated Aug 8, 2023
    Cite
    Marcelo Adrián Aizen; Gabriela Gleiser; Thomas Kitzberger; Rubén Milla (2023). BEING A TREE CROP INCREASES THE ODDS OF EXPERIENCING YIELD DECLINES IRRESPECTIVE OF POLLINATOR DEPENDENCE [Dataset]. http://doi.org/10.5281/zenodo.7863825
    Explore at:
    zip, bin (available download formats)
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marcelo Adrián Aizen; Gabriela Gleiser; Thomas Kitzberger; Rubén Milla
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Marcelo A. Aizen, Gabriela R. Gleiser, Thomas Kitzberger, Ruben Milla. Being a tree crop increases the odds of experiencing yield declines irrespective of pollinator dependence (to be submitted to PCI)

    Data and R scripts to reproduce the analyses and the figures shown in the paper. All analyses were performed using R 4.0.2.

    Data

    1. FAOdata_21-12-2021.csv

    This file includes yearly data (1961-2020, column 8) on yield and cultivated area (columns 6 and 10) at the country, sub-regional, and regional levels (column 2) for each crop (column 4) drawn from the United Nations Food and Agriculture Organization database (data available at http://www.fao.org/faostat/en; accessed July 21-12-2021). [Used in Script 1 to generate the synthesis dataset]

    2. countries.csv

    This file provides information on the region (column 2) to which each country (column 1) belongs. [Used in Script 1 to generate the synthesis dataset]

    3. dependence.csv

    This file provides information on the pollinator dependence category (column 2) of each crop (column 1).

    4. traits.csv

    This file provides information on the traits of each crop other than pollinator dependence, including, besides the crop name (column 1), the variables type of harvested organ (column 5) and growth form (column 6). [Used in Script 1 to generate the synthesis dataset]

    5. dataset.csv

    The synthesis dataset generated by Script 1.

    6. growth.csv

    The yield growth dataset generated by Script 1 and used as input by Scripts 2 and 3.

    7. phylonames.csv

    This file lists all the crops (column 1) and their equivalent tip names in the crop phylogeny (column 2). [Used in Script 2 for the phylogenetically-controlled analyses]

    8. phylo137.tre

    File containing the phylogenetic tree.
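
    A minimal reading sketch for the FAO file described in item 1 above (assuming it loads with pandas; the column positions come from that description, while the short names assigned here are made up):

        import pandas as pd

        # Select the columns described above by position: region/country level (column 2),
        # crop (column 4), yield and cultivated area (columns 6 and 10), and year (column 8).
        fao = pd.read_csv("FAOdata_21-12-2021.csv")
        subset = fao.iloc[:, [1, 3, 5, 7, 9]]
        subset.columns = ["area_level", "crop", "crop_yield", "year", "cultivated_area"]
        print(subset.head())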

    Scripts

    1. dataset

    This R script curates and merges all the individual datasets mentioned above into a single dataset, estimating and adding to this single dataset the growth rate for each crop and country, and the (log) cumulative harvested area per crop and country over the period 1961-2020.

    2. analyses

    This R script includes all the analyses described in the article’s main text.

    3. figures

    This R script creates all the main and supplementary figures of this article.

    4. lme4_phylo_setup

    R function written by Li and Bolker (2019) to carry out phylogenetically-controlled generalized linear mixed-effects models as described in the main text of the article.

    References

    Li, M., and B. Bolker. 2019. wzmli/phyloglmm: First release of phylogenetic comparative analysis in lme4- verse. Zenodo. https://doi.org/10.5281/zenodo.2639887.

  • Dataset metadata of known Dataverse installations

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Gautier, Julian (2023). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/DVN/DCDKZQ
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gautier, Julian
    Description

    This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation).csv
    │   ├── basic.csv
    │   ├── contributor(citation).csv
    │   ├── ...
    │   └── topic_classification(citation).csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2022.10.02_17.11.19.zip
    │   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
    │   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
    │   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
    │   │   │   └── ...
    │   │   └── metadatablocks_v5.6
    │   │       ├── astrophysics_v5.6.json
    │   │       ├── biomedical_v5.6.json
    │   │       ├── citation_v5.6.json
    │   │       ├── ...
    │   │       └── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
    │   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
    │   ├── Arca_Dados_2022.10.02_17.44.35.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
    ├── dataset_pids_from_most_known_dataverse_installations.csv
    ├── licenses_used_by_dataverse_installations.csv
    └── metadatablocks_from_most_known_dataverse_installations.csv

    This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation, as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) each dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions; the JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

    The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ...

    Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
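
    A minimal sketch of the two-column API-token CSV that the download script expects (the "hostname" and "apikey" column names are stated above; the file name and values here are placeholders):

        import csv

        # One row per installation that requires an API token; values are placeholders.
        rows = [
            {"hostname": "https://dataverse.example.edu", "apikey": "xxxxxxxx-xxxx-xxxx-xxxx"},
        ]
        with open("installation_api_tokens.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["hostname", "apikey"])
            writer.writeheader()
            writer.writerows(rows)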

  • Credibility Corpus with several datasets (Twitter, Web database) in French...

    • data.gouv.fr
    • data.europa.eu
    application/rar
    Updated Dec 1, 2016
    Cite
    nicolas turenne (2016). Credibility Corpus with several datasets (Twitter, Web database) in French and English [Dataset]. https://www.data.gouv.fr/en/datasets/credibility-corpus-with-several-datasets-twitter-web-database-in-french-and-english/
    Explore at:
    application/rar(33261), application/rar(680351), application/rar(102374), application/rar(40693), application/rar(77120), application/rar(212274) (available download formats)
    Dataset updated
    Dec 1, 2016
    Authors
    nicolas turenne
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    French
    Description

    Description of the corpora

    These datasets were made to analyze information credibility in general (rumor and disinformation for English and French documents) as it occurs on the social web. Target databases about rumors, hoaxes and disinformation helped to collect obvious misinformation. Some topics (with keywords) helped us make corpora from the microblogging platform Twitter, a great provider of rumors and disinformation.
    • 1 corpus of texts from the web database about rumors and disinformation.
    • 4 corpora from the social media platform Twitter about specific rumors (2 in English, 2 in French).
    • 4 corpora from the social media platform Twitter randomly built (2 in English, 2 in French).
    • 4 corpora from the social media platform Twitter about specific events (2 in English, 2 in French).

    Size of the different corpora:
    • Social Web Rumorous corpus: 1,612
    • French Hollande Rumorous corpus (Twitter): 371
    • French Lemon Rumorous corpus (Twitter): 270
    • English Pin Rumorous corpus (Twitter): 679
    • English Swine Rumorous corpus (Twitter): 1024
    • French 1st Random corpus (Twitter): 1000
    • French 2nd Random corpus (Twitter): 1000
    • English 3rd Random corpus (Twitter): 1000
    • English 4th Random corpus (Twitter): 1000
    • French Rihanna Event corpus (Twitter): 543
    • English Rihanna Event corpus (Twitter): 1000
    • French Euro2016 Event corpus (Twitter): 1000
    • English Euro2016 Event corpus (Twitter): 1000

    A matrix links tweets with the 50 most frequent words.
    Text data:
    • _id: message id
    • body text: string text data
    Matrix data:
    • 52 columns (the first column is the id; the second column is the rumor indicator, 1 or -1; the remaining columns are words, with value 1 if the message contains the word and 0 if it does not)
    • 11,102 lines (each line is a message)
    Line ranges per corpus:
    • Hidalgo corpus: lines 1:75
    • Lemon corpus: lines 76:467
    • Pin rumor: lines 468:656
    • Swine: lines 657:1311
    • Random messages: lines 1312:11103

    Sample contains:
    • French Pin Rumorous corpus (Twitter): 679
    • Matrix data: 52 columns (the first column is the id; the second column is the rumor indicator, 1 or -1; the remaining columns are words, with value 1 if the message contains the word and 0 if it does not), 189 lines (each line is a message)
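
    A minimal slicing sketch (assuming the matrix has been exported as a headerless CSV; the file name is hypothetical, while the column layout and 1-based line ranges follow the description above):

        import pandas as pd

        # Column 1: message id; column 2: rumor indicator (1 or -1); columns 3-52: word indicators (0/1).
        matrix = pd.read_csv("credibility_matrix.csv", header=None)
        ids, rumor_flag, words = matrix.iloc[:, 0], matrix.iloc[:, 1], matrix.iloc[:, 2:]

        # The Pin rumor corpus occupies lines 468-656 (1-based), i.e. rows 467-655 (0-based).
        pin_rumor = matrix.iloc[467:656]
        print(rumor_flag.value_counts())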

  • Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 27, 2022
    Cite
    Nagappan, Meiyappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
    Explore at:
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    Nagappan, Meiyappan
    Keshavarz, Hossein
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

    The datasets are available under directory dataset. There are 4 datasets in this directory.

    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
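
    A minimal loading sketch (assuming the CSV layout described above; only the 'buggy' column name and the 'dataset' directory are stated explicitly):

        import pandas as pd

        # The 'dataset' directory ships with the archive, as described above.
        train = pd.read_csv("dataset/apachejit_train.csv")
        test = pd.read_csv("dataset/apachejit_test_large.csv")

        # The training split is balanced; the large test split keeps the real class imbalance.
        print(train["buggy"].value_counts(normalize=True))
        print(test["buggy"].value_counts(normalize=True))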

    In addition to the dataset, we also provide the scripts using which we built the dataset. These scripts are written in Python 3.8. Therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

    The scripts are comprised of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via GitHub search API and collecting commits through PyDriller Package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates then are filtered again using gumtree.py script that utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token.py handles GitHub API token that is necessary for requests to GitHub API. Script collector.py performs GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

    References:

    1. GumTree
    Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE '14, Vasteras, Sweden - September 15-19, 2014. 313–324.

    2. PyDriller
    • https://pydriller.readthedocs.io/en/latest/
    • Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911.

  • A study on real graphs of fake news spreading on Twitter

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 20, 2021
    Cite
    A study on real graphs of fake news spreading on Twitter [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3711599
    Explore at:
    Dataset updated
    Aug 20, 2021
    Dataset authored and provided by
    Amirhosein Bodaghi
    Description

    *** Fake News on Twitter ***

    These 5 datasets are the results of an empirical study on the spreading process of newly emerged fake news on Twitter. In particular, we have focused on those fake news stories which gave rise to a truth spreading simultaneously against them. The story of each fake news item is as follows:

    1- FN1: A Muslim waitress refused to seat a church group at a restaurant, claiming "religious freedom" allowed her to do so.

    2- FN2: Actor Denzel Washington said electing President Trump saved the U.S. from becoming an "Orwellian police state."

    3- FN3: Joy Behar of "The View" sent a crass tweet about a fatal fire in Trump Tower.

    4- FN4: The animated children's program 'VeggieTales' introduced a cannabis character in August 2018.

    5- FN5: In September 2018, the University of Alabama football program ended its uniform contract with Nike, in response to Nike's endorsement deal with Colin Kaepernick.

    The data collection was done in two stages, each of which provided a new dataset: 1- attaining the Dataset of Diffusion (DD), which includes information on fake news/truth tweets and retweets; 2- querying the neighbors of the spreaders of tweets, which provides us with the Dataset of Graph (DG).

    DD

    DD for each fake news story is an Excel file, named FNx_DD, where x is the number of the fake news story, and has the following structure:

    Each row belongs to one captured tweet/retweet related to the rumor, and each column presents a specific piece of information about that tweet/retweet. From left to right, the columns contain the following information:

    User ID (user who has posted the current tweet/retweet)

    The description sentence in the profile of the user who has published the tweet/retweet

    The number of tweets/retweets published by the user at the time of posting the current tweet/retweet

    Date and time of creation of the account by which the current tweet/retweet has been posted

    Language of the tweet/retweet

    Number of followers

    Number of followings (friends)

    Date and time of posting the current tweet/retweet

    Number of likes (favorites) the current tweet had acquired before crawling it

    Number of times the current tweet had been retweeted before crawling it

    Whether there is any other tweet inside the current tweet/retweet (for example, this happens when the current tweet is a quote, reply or retweet)

    The source (OS) of device by which the current tweet/retweet was posted

    Tweet/Retweet ID

    Retweet ID (if the post is a retweet then this feature gives the ID of the tweet that is retweeted by the current post)

    Quote ID (if the post is a quote then this feature gives the ID of the tweet that is quoted by the current post)

    Reply ID (if the post is a reply then this feature gives the ID of the tweet that is replied by the current post)

    Frequency of tweet occurrence, i.e., the number of times the current tweet is repeated in the dataset (for example, the number of times a tweet exists in the dataset in the form of a retweet posted by others)

    State of the tweet which can be one of the following forms (achieved by an agreement between the annotators):

    r : The tweet/retweet is a fake news post

    a : The tweet/retweet is a truth post

    q : The tweet/retweet is a question about the fake news; it neither confirms nor denies it

    n : The tweet/retweet is not related to the fake news (it contains the queries related to the rumor but does not refer to the given fake news)

    DG

    DG for each fake news contains two files:

    A file in graph format (.graph), which includes the information of the graph, such as who is linked to whom. (This file is named FNx_DG.graph, where x is the number of the fake news story.)

    A file in JSONL format (.jsonl), which includes the real user IDs of the nodes in the graph file. (This file is named FNx_Labels.jsonl, where x is the number of the fake news story.)

    In the graph file, the label of each node is the order of its entrance into the graph. For example, if the node with user ID 12345637 is the first node entered into the graph file, then its label in the graph is 0 and its real ID (12345637) is at row number 1 of the jsonl file (because row number 0 belongs to the column labels), and so on; the other node IDs are at the following rows of the file (each row corresponds to 1 user id). Therefore, if we want to know, for example, what the user id of node 200 (labeled 200 in the graph) is, then in the jsonl file we should look at row number 202.
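
    A minimal lookup sketch (assuming each line of the .jsonl file holds one user ID; the exact row offset should be checked against the convention described above):

        import json

        # Read the label file; the first row holds the column labels per the description above.
        with open("FN1_Labels.jsonl") as f:
            rows = [json.loads(line) for line in f if line.strip()]

        offset = 1  # node with graph label n is assumed to sit 'offset' rows after the header
        node_label = 200
        print(rows[node_label + offset])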

    The user IDs of spreaders in DG (those who have had a post in DD) would be available in DD to get extra information about them and their tweet/retweet. The other user IDs in DG are the neighbors of these spreaders and might not exist in DD.

  • OMI/Aura Level 2 Sulphur Dioxide (SO2) Trace Gas Column Data 1-Orbit Subset...

    • catalog.data.gov
    • datasets.ai
    Updated Dec 6, 2023
    Cite
    NASA/GSFC/SED/ESD/GCDC/GESDISC (2023). OMI/Aura Level 2 Sulphur Dioxide (SO2) Trace Gas Column Data 1-Orbit Subset and Collocated Swath along CloudSat V003 (OMSO2_CPR) at GES DISC [Dataset]. https://catalog.data.gov/dataset/omi-aura-level-2-sulphur-dioxide-so2-trace-gas-column-data-1-orbit-subset-and-collocated-s
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    This is a CloudSat-collocated subset of the original product OMSO2, for the purposes of the A-Train mission. The goal of the subset is to select and return OMI data that are within +/-100 km across the CloudSat track. The resultant OMI subset swath is sought to be about 200 km cross-track of CloudSat. Even though collocated with CloudSat, this subset can serve many other A-Train applications. (The shortname for this CloudSat-collocated subset of the original OMSO2 product is OMSO2_CPR_V003.)

    This document describes the original OMI SO2 product (OMSO2) produced from global mode UV measurements of the Ozone Monitoring Instrument (OMI). OMI was launched on July 15, 2004 on the EOS Aura satellite, which is in a sun-synchronous ascending polar orbit with a 1:45pm local equator crossing time. The data collection started on August 17, 2004 (orbit 482) and continues to this day with only minor data gaps. The minimum SO2 mass detectable by OMI is about two orders of magnitude smaller than the detection threshold of the legacy Total Ozone Mapping Spectrometer (TOMS) SO2 data (1978-2005) [Krueger et al 1995]. This is due to the smaller OMI footprint and the use of wavelengths better optimized for separating O3 from SO2.

    The product file, called a data granule, covers the sunlit portion of the orbit with an approximately 2600 km wide swath containing 60 pixels per viewing line. During normal operations, 14 or 15 granules are produced daily, providing fully contiguous coverage of the globe. Currently, OMSO2 products are not produced when OMI goes into the "zoom mode" for one day every 452 orbits (~32 days).

    For each OMI pixel we provide 4 different estimates of the column density of SO2 in Dobson Units (1 DU = 2.69x10^16 molecules/cm2), obtained by making different assumptions about the vertical distribution of the SO2. However, it is important to note that in most cases the precise vertical distribution of SO2 is unimportant. Users can use either the SO2 plume height, or the center of mass altitude (CMA) derived from the SO2 vertical distribution, to interpolate between the 4 values:
    1) Planetary Boundary Layer (PBL) SO2 column (ColumnAmountSO2_PBL), corresponding to a CMA of 0.9 km.
    2) Lower tropospheric SO2 column (ColumnAmountSO2_TRL), corresponding to a CMA of 2.5 km.
    3) Middle tropospheric SO2 column (ColumnAmountSO2_TRM), usually produced by volcanic degassing, corresponding to a CMA of 7.5 km.
    4) Upper tropospheric and stratospheric SO2 column (ColumnAmountSO2_STL), usually produced by explosive volcanic eruptions, corresponding to a CMA of 17 km.

    The accuracy and precision of the derived SO2 columns vary significantly with the SO2 CMA and column amount, observational geometry, and slant column ozone. OMI becomes more sensitive to SO2 above clouds and snow/ice, and less sensitive to SO2 below clouds. Preliminary error estimates are discussed below (see Data Quality Assessment). OMSO2 files are stored in EOS Hierarchical Data Format (HDF-EOS5). Each file contains data from the daylit portion of an orbit (53 minutes). There are approximately 14 orbits per day. The maximum file size for the OMSO2 data product is about 9 Mbytes.
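
    A minimal interpolation sketch (the four CMA values and field names come from the list above; the pixel column values and plume height used here are made up for illustration):

        import numpy as np

        # CMA assumed for each estimate: PBL, TRL, TRM, STL.
        cma_km = np.array([0.9, 2.5, 7.5, 17.0])
        # Illustrative column values (DU) for one pixel, in the same order.
        so2_du = np.array([1.8, 1.2, 0.6, 0.4])

        plume_height_km = 5.0  # e.g. from an independent plume-height or CMA estimate
        so2_column = np.interp(plume_height_km, cma_km, so2_du)
        print(f"Interpolated SO2 column at {plume_height_km} km: {so2_column:.2f} DU")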

  • University SET data, with faculty and courses characteristics

    • openicpsr.org
    Updated Sep 12, 2021
    Cite
    Under blind review in refereed journal (2021). University SET data, with faculty and courses characteristics [Dataset]. http://doi.org/10.3886/E149801V1
    Explore at:
    Dataset updated
    Sep 12, 2021
    Authors
    Under blind review in refereed journal
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university was entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper.

    The average SET scores were matched with the characteristics of the teacher (degree, seniority, gender, and SET scores in the past six semesters); the course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); the attributes of the SET survey responses (the percentage of students providing SET feedback); and the grades of the course (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section.

    The unit of observation, or the single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k) and the question number in the SET questionnaire (n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}). It means that for each pair (j,k) we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j=John Smith, k=Calculus, n=2) is calculated as the average of all Likert-scale answers to question nr 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows.

    The full list of variables or columns in the data set included in the analysis is presented in the attached file section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}.

    Two attachments:
    • Word file with the variables description
    • Rdata file with the data set (for the R language)

    Appendix 1. The SET questionnaire used for this paper.

    Evaluation survey of the teaching staff of [university name]
    Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 - I strongly agree; 4 - I agree; 3 - Neutral; 2 - I don't agree; 1 - I strongly don't agree.

    Questions (each rated on the 1-5 scale above):
    1. I learnt a lot during the course.
    2. I think that the knowledge acquired during the course is very useful.
    3. The professor used activities to make the class more engaging.
    4. If it was possible, I would enroll for the course conducted by this lecturer again.
    5. The classes started on time.
    6. The lecturer always used time efficiently.
    7. The lecturer delivered the class content in an understandable and efficient way.
    8. The lecturer was available when we had doubts.
    9. The lecturer treated all students equally regardless of their race, background and ethnicity.
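
    A minimal reshaping sketch (assuming the Rdata content has been exported to a CSV with columns for the teacher id, course id, question number, and SET_score_avg; all names except SET_score_avg are hypothetical):

        import pandas as pd

        set_scores = pd.read_csv("set_scores.csv")

        # One row per (teacher, course, question); pivot the nine questions into columns.
        wide = set_scores.pivot_table(
            index=["teacher_id", "course_id"], columns="question_number", values="SET_score_avg"
        )
        print(wide.head())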

  • Data from: Redocking the PDB

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 6, 2023
    Cite
    Gutermuth, Torben (2023). Redocking the PDB [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7579501
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Rarey, Matthias
    Gutermuth, Torben
    Ehrt, Christiane
    Flachsenberg, Florian
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This repository contains supplementary data to the journal article 'Redocking the PDB' by Flachsenberg et al. (https://doi.org/10.1021/acs.jcim.3c01573) [1]. In this paper, we described two datasets: the PDBScan22 dataset, with a large set of 322,051 macromolecule–ligand binding sites generally suitable for redocking, and the PDBScan22-HQ dataset, with 21,355 binding sites passing different structure quality filters. These datasets were further characterized by calculating properties of the ligand (e.g., molecular weight), properties of the binding site (e.g., volume), and structure quality descriptors (e.g., crystal structure resolution). Additionally, we performed redocking experiments with our novel JAMDA structure preparation and docking workflow [1] and with AutoDock Vina [2,3]. Details for all these experiments and the dataset composition can be found in the journal article [1]. Here, we provide all the datasets, i.e., the PDBScan22 and PDBScan22-HQ datasets as well as the docking results and the additionally calculated properties (for the ligand, the binding sites, and structure quality descriptors). Furthermore, we give a detailed description of their content (i.e., the data types and a description of the column values). All datasets consist of CSV files with the actual data and associated metadata JSON files describing their content. The CSV/JSON files are compliant with the CSV on the Web standard (https://csvw.org/).

    General hints

    All docking experiment results consist of two CSV files, one with general information about the docking run (e.g., was it successful?) and one with individual pose results (i.e., score and RMSD to the crystal structure). All files (except for the docking pose tables) can be indexed uniquely by the column tuple '(pdb, name)' containing the PDB code of the complex (e.g., 1gm8) and the ligand name (in the format '

        import pandas as pd

        # Read the PDBScan22-HQ binding sites, the JAMDA pose results, and the ligand properties.
        df = pd.read_csv('PDBScan22-HQ.csv')
        df_poses = pd.read_csv('PDBScan22-HQ_JAMDA_NL_NR_poses.csv')
        df_properties = pd.read_csv('PDBScan22_ligand_properties.csv')

        # Attach the ligand properties to each binding site.
        merged = df.merge(df_properties, how='left', on=['pdb', 'name'])

        # Keep ligands with a molecular weight between 100 and 200 and join the top-ranked pose.
        merged = merged[(merged['MW'] >= 100) & (merged['MW'] <= 200)].merge(
            df_poses[df_poses['rank'] == 1], how='left', on=['pdb', 'name'])

        # Count top-ranked poses within 2.0 A RMSD of the crystal structure,
        # and binding sites without any top-ranked pose.
        nof_successful_top_ranked = (merged['rmsd_ai'] <= 2.0).sum()
        nof_no_top_ranked = merged['rmsd_ai'].isna().sum()

    Datasets

    PDBScan22.csv: This is the PDBScan22 dataset [1]. This dataset was derived from the PDB [4]. It contains macromolecule–ligand binding sites (defined by PDB code and ligand identifier) that can be read by the NAOMI library [5,6] and pass basic consistency filters.
    PDBScan22-HQ.csv: This is the PDBScan22-HQ dataset [1]. It contains macromolecule–ligand binding sites from the PDBScan22 dataset that pass certain structure quality filters described in our publication [1].
    PDBScan22-HQ-ADV-Success.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina [2,3] fails.
    PDBScan22-HQ-Macrocycles.csv: This is a subset of the PDBScan22-HQ dataset without 336 binding sites where AutoDock Vina [2,3] fails, and it only contains molecules with macrocycles with at least ten atoms.

    Properties for PDBScan22

    PDBScan22_ligand_properties.csv: Conformation-independent properties of all ligand molecules in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library [5,6].
    PDBScan22_StructureProfiler_quality_descriptors.csv: Structure quality descriptors for the binding sites in the PDBScan22 dataset, calculated using the StructureProfiler tool [7].
    PDBScan22_basic_complex_properties.csv: Simple properties of the binding sites in the PDBScan22 dataset. Properties were calculated using an in-house tool developed with the NAOMI library [5,6].

    Properties for PDBScan22-HQ

    PDBScan22-HQ_DoGSite3_pocket_descriptors.csv: Binding site descriptors calculated for the binding sites in the PDBScan22-HQ dataset using the DoGSite3 tool [8].
    PDBScan22-HQ_molecule_types.csv: Assignment of ligands in the PDBScan22-HQ dataset (without 336 binding sites where AutoDock Vina fails) to different molecular classes (i.e., drug-like, fragment-like, oligosaccharide, oligopeptide, cofactor, macrocyclic). A detailed description of the assignment can be found in our publication [1].

    Docking results on PDBScan22

    PDBScan22_JAMDA_NL_NR.csv: Docking results of JAMDA [1] on the PDBScan22 dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    PDBScan22_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA [1] on the PDBScan22 dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.

    Docking results on PDBScan22-HQ

    PDBScan22-HQ_JAMDA_NL_NR.csv: Docking results of JAMDA [1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    PDBScan22-HQ_JAMDA_NL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA [1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    PDBScan22-HQ_JAMDA_NL_WR.csv: Docking results of JAMDA [1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NL_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
    PDBScan22-HQ_JAMDA_NL_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA [1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
    PDBScan22-HQ_JAMDA_NW_NR.csv: Docking results of JAMDA [1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_NR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    PDBScan22-HQ_JAMDA_NW_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA [1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    PDBScan22-HQ_JAMDA_NW_WR.csv: Docking results of JAMDA [1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_NW_WR_poses.csv'. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
    PDBScan22-HQ_JAMDA_NW_WR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA [1] on the PDBScan22-HQ dataset. For this experiment, the ligand was not considered during preprocessing of the binding site, all water molecules were removed from the binding site during preprocessing, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was enabled.
    PDBScan22-HQ_JAMDA_WL_NR.csv: Docking results of JAMDA [1] on the PDBScan22-HQ dataset. This is the general overview for the docking runs; the pose results are given in 'PDBScan22-HQ_JAMDA_WL_NR_poses.csv'. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking towards the crystal ligand position) was disabled.
    PDBScan22-HQ_JAMDA_WL_NR_poses.csv: Pose scores and RMSDs for the docking results of JAMDA [1] on the PDBScan22-HQ dataset. For this experiment, the ligand was considered during preprocessing of the binding site, and the binding site restriction mode (i.e., biasing the docking

  • ERS-2 GOME Total Column Amount of Trace Gases Product

    • earth.esa.int
    • eocat.esa.int
    Updated Jun 22, 2013
    Cite
    European Space Agency (2013). ERS-2 GOME Total Column Amount of Trace Gases Product [Dataset]. https://earth.esa.int/eogateway/catalog/ers-2-gome-total-column-amount-of-trace-gases-product
    Explore at:
    Dataset updated
    Jun 22, 2013
    Dataset authored and provided by
    European Space Agency (http://www.esa.int/)
    License

    Terms and Conditions for the use of ESA Data: https://earth.esa.int/eogateway/documents/20142/1564626/Terms-and-Conditions-for-the-use-of-ESA-Data.pdf

    Time period covered
    Jun 28, 1995 - Jul 2, 2011
    Description

    GOME Level 2 products were generated by DLR on behalf of the European Space Agency, and are the end result of the Level 1 to 2 reprocessing campaign of GOME Level 1 version 4 data with Level 2 GOME Data Processor (GDP) version 5.0 (HDF-5 format). The GOME Level 2 data product comprises the product header, total column densities of ozone and nitrogen dioxide and their associated errors, cloud properties and selected geo-location information, diagnostics from the Level 1 to 2 algorithms and a small amount of statistical information.

  • ‘Waiter's Tips Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Waiter's Tips Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-waiter-s-tips-dataset-b284/latest
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Analysis of ‘Waiter's Tips Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/aminizahra/tips-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    One waiter recorded information about each tip he received over a period of a few months working in one restaurant. In all he recorded 244 tips.

    Acknowledgements

    The data was reported in a collection of case studies for business statistics.

    Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing

    The dataset is also available through the Python package Seaborn.
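
    A minimal loading sketch via seaborn (the seaborn copy carries only the base columns, not the additional Kaggle columns listed below):

        import seaborn as sns

        # Downloads the bundled example dataset on first use.
        tips = sns.load_dataset("tips")
        tips["tip_pct"] = tips["tip"] / tips["total_bill"]
        print(tips.head())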

    Hint

    Of course, this database has additional columns compared to other tips datasets.

    Dataset info

    RangeIndex: 244 entries, 0 to 243

    Data columns (total 11 columns):

    # Column Non-Null Count Dtype

    0 total_bill 244 non-null float64

    1 tip 244 non-null float64

    2 sex 244 non-null object

    3 smoker 244 non-null object

    4 day 244 non-null object

    5 time 244 non-null object

    6 size 244 non-null int64

    7 price_per_person 244 non-null float64

    8 Payer Name 244 non-null object

    9 CC Number 244 non-null int64

    10 Payment ID 244 non-null object

    dtypes: float64(3), int64(2), object(6)

    Some details

    total_bill a numeric vector, the bill amount (dollars)

    tip a numeric vector, the tip amount (dollars)

    sex a factor with levels Female Male, gender of the payer of the bill

    Smoker a factor with levels No Yes, whether the party included smokers

    day a factor with levels Friday Saturday Sunday Thursday, day of the week

    time a factor with levels Day Night, rough time of day

    size a numeric vector, number of people in the party

    --- Original source retains full ownership of the source dataset ---

  • Dataset 1: Bilateral Travel Restriction Database v1.0

    • borealisdata.ca
    • dataone.org
    Updated Mar 16, 2023
    Cite
    The Global Strategy Lab (2023). Dataset 1: Bilateral Travel Restriction Database v1.0 [Dataset]. http://doi.org/10.5683/SP2/5E4OA8
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 16, 2023
    Dataset provided by
    Borealis
    Authors
    The Global Strategy Lab
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Earlier this year, Dr. Hoffman and Dr. Fafard published a book chapter on the efficacy and legality of border closures enacted by governments in response to changing COVID-19 conditions. The authors concluded that border closures are, at best, regarded as powerful symbolic acts taken by governments to show they are acting forcefully, even if the actions lack an epidemiological impact and breach international law. This COVID-19 travel restriction project was developed out of a necessity and desire to further examine the empirical implications of border closures.

    The current dataset contains bilateral travel restriction information on the status of 179 countries between 1 January 2020 and 8 June 2020. The data was extracted from the ‘international controls’ column of the Oxford COVID-19 Government Response Tracker (OxCGRT). The data in the ‘international controls’ column outlined a country’s change in border control status as a response to COVID-19 conditions. Accompanying source links were further verified through random selection and comparison with external news sources. Greater weight is given to official national government sources, then to provincial and municipal news-affiliated agencies.

    The database is presented in matrix form for each country-pair and date. Each cell is represented by a datum Xdmn and indicates the border closure status on date d imposed by country m on country n. The coding is as follows: no border closure (code = 0), targeted border closure (= 1), and total border closure (= 99). The dataset provides further details in the ‘notes’ column if the type of closure is a modified form of a targeted closure, either as a land or port closure, a flight or visa suspension, or a re-opening of borders to select countries. Visa suspensions and closures of land borders were coded separately as de facto border closures and analyzed as targeted border closures in quantitative analyses.

    The file titled ‘BTR Supplementary Information’ covers a multitude of supplemental details to the database. The various tabs cover the following:
    1) Codebook: variable name, format, source links, and description;
    2) Sources, Access dates: dates of access for the individual source links with additional notes;
    3) Country groups: breakdown of EEA, EU, SADC, Schengen groups with source links;
    4) Newly added sources: for missing countries with a population greater than 1 million (meeting the inclusion criteria), relevant news sources were added for analysis;
    5) Corrections: external news sources correcting for errors in the coding of international controls retrieved from the OxCGRT dataset.

    At the time of our study's inception, there was no existing dataset which recorded the bilateral decisions of travel restrictions between countries. We hope this dataset will be useful in the study of the impact of border closures in the COVID-19 pandemic and will widen the capabilities of studying border closures on a global scale, due to their interconnected nature and impact, rather than being limited to the analysis of a single country or region only.

    Statement of contributions: Data entry and verification was performed mainly by GL, with assistance from MJP and RN. MP and IW provided further data verification on the nine countries purposively selected for the exploratory analysis of political decision-making.
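
    A minimal decoding sketch (assuming the matrix has been reshaped to a long format with one row per date and country pair; the file name and column names here are hypothetical, and only the 0/1/99 codes come from the description above):

        import pandas as pd

        codes = {0: "no border closure", 1: "targeted border closure", 99: "total border closure"}

        btr = pd.read_csv("bilateral_travel_restrictions_long.csv")
        btr["closure_status"] = btr["closure_code"].map(codes)
        print(btr.groupby("date")["closure_status"].value_counts().head())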

  • Z

    Data from: Qbias – A Dataset on Media Bias in Search Queries and Query...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 1, 2023
    Cite
    Haak, Fabian (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7682914
    Explore at:
    Dataset updated
    Mar 1, 2023
    Dataset provided by
    Haak, Fabian
    Schaer, Philipp
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

    Fabian Haak and Philipp Schaer. 2023. Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

    Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

    The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022, as presented in our publication. The AllSides balanced news roundups feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label (left, right, or neutral) by four expert annotators based on the expressed political partisanship. The AllSides balanced news feature aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further include headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
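
    A short sketch of working with this file is given below; the column names ("bias_rating", "tags") are assumptions for illustration, since the description lists the kinds of fields but not their exact headers.

        import pandas as pd

        news = pd.read_csv("allsides_balanced_news_headlines-texts.csv")

        # Reproduce the reported label distribution (10,273 left / 7,222 right / 4,252 center);
        # "bias_rating" is an assumed column name for the bias label.
        print(news["bias_rating"].value_counts())

        # Select articles carrying a given topic tag; "tags" is likewise an assumed column name.
        corona = news[news["tags"].str.contains("coronavirus", case=False, na=False)]
        print(len(corona), "articles tagged with coronavirus")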

    To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated, more recent versions of the dataset with additional tags (such as the URL of each article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.

    Dataset 2: Search Query Suggestions (suggestions.csv)

    The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions were retrieved from Google and 353,484 from Bing.

    The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their positions as returned by the search engines at the given time of search ("datetime"). The data were scraped from a US server, whose location is saved in "location".

    We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
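
    The documented columns (root_term, query_input, query_suggestion, rank, search_engine, datetime, location) make the file straightforward to analyse; the sketch below is a minimal example and assumes only that those columns are present as described.

        import pandas as pd

        sugg = pd.read_csv("suggestions.csv", parse_dates=["datetime"])

        # Rows per engine (the description reports 318,185 from Google and 353,484 from Bing).
        print(sugg["search_engine"].value_counts())

        # Top-ranked suggestion for each plain root query (i.e. without the a-z extension).
        top = (sugg[sugg["query_input"] == sugg["root_term"]]
               .sort_values("rank")
               .groupby(["root_term", "search_engine"])
               .first()[["query_suggestion"]])
        print(top.head())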

    AllSides Scraper

    At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all articles available from the AllSides balanced news headline roundups.

    We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. We therefore provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.

  • Sound Pressure Spectral Density at 1 Hertz and 1 Hour Resolution Recorded at...

    • datasets.ai
    • s.cnmilf.com
    • +1more
    0, 21
    Updated Oct 29, 2020
    Cite
    National Oceanic and Atmospheric Administration, Department of Commerce (2020). Sound Pressure Spectral Density at 1 Hertz and 1 Hour Resolution Recorded at SanctSound Site PM08_13 [Dataset]. https://datasets.ai/datasets/sound-pressure-spectral-density-at-1-hertz-and-1-hour-resolution-recorded-at-sanctsound-site-1329
    Explore at:
    0, 21Available download formats
    Dataset updated
    Oct 29, 2020
    Dataset provided by
    National Oceanic and Atmospheric Administrationhttp://www.noaa.gov/
    Authors
    National Oceanic and Atmospheric Administration, Department of Commerce
    Description

    This record represents the sound pressure spectral density (PSD) levels derived from raw passive acoustic data. The hourly PSD levels were calculated as the median of mean-square pressure amplitude (µPa^2) with a frequency resolution of 1 Hz from 20 Hz to 24,000 Hz over no less than 1,800 seconds in each hour and converted to decibels (dB re 1 µPa^2/Hz). These data were recorded at SanctSound Site PM08_13 between August 20, 2019 and October 29, 2020.
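
    The final conversion described here, the median of the mean-square pressure per 1 Hz band expressed in dB re 1 µPa²/Hz, can be sketched as follows; the input array is random placeholder data standing in for one hour of per-segment spectra, not values from this record.

        import numpy as np

        # Placeholder for one hour of per-segment mean-square pressure spectra
        # (segments x 1 Hz frequency bins from 20 Hz to 24,000 Hz), in µPa^2/Hz.
        rng = np.random.default_rng(0)
        segment_psd = rng.lognormal(mean=10.0, sigma=1.0, size=(60, 23981))

        hourly_psd = np.median(segment_psd, axis=0)      # hourly level = median over the hour
        hourly_db = 10.0 * np.log10(hourly_psd / 1.0)    # dB re 1 µPa^2/Hz (reference = 1 µPa^2/Hz)

        print(hourly_db.shape, round(float(hourly_db.mean()), 1))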

  • Sound Pressure Spectral Density at 1 Hertz and 1 Hour Resolution Recorded at...

    • datasets.ai
    • catalog.data.gov
    0
    Cite
    National Oceanic and Atmospheric Administration, Department of Commerce, Sound Pressure Spectral Density at 1 Hertz and 1 Hour Resolution Recorded at SanctSound Site SB03_21 [Dataset]. https://datasets.ai/datasets/sound-pressure-spectral-density-at-1-hertz-and-1-hour-resolution-recorded-at-sanctsound-site-211
    Explore at:
    0Available download formats
    Dataset provided by
    National Oceanic and Atmospheric Administrationhttp://www.noaa.gov/
    Authors
    National Oceanic and Atmospheric Administration, Department of Commerce
    Description

    This record represents the sound pressure spectral density (PSD) levels derived from raw passive acoustic data. The hourly PSD levels were calculated as the median of mean-square pressure amplitude (µPa^2) with a frequency resolution of 1 Hz from 10 Hz to 24,000 Hz over no less than 1,800 seconds in each hour and converted to decibels (dB re 1 µPa^2/Hz). These data were recorded at SanctSound Site SB03_21 between April 22, 2022 and June 13, 2022.

  • VOYAGER 2 SATURN POSITION RESAMPLED DATA 48.0 SECONDS

    • catalog.data.gov
    • data.nasa.gov
    • +2more
    Updated Dec 7, 2023
    + more versions
    Cite
    National Aeronautics and Space Administration (2023). VOYAGER 2 SATURN POSITION RESAMPLED DATA 48.0 SECONDS [Dataset]. https://catalog.data.gov/dataset/voyager-2-saturn-position-resampled-data-48-0-seconds-7eee6
    Explore at:
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    NASAhttp://nasa.gov/
    Description

    This data set includes Voyager 2 Saturn encounter position data that have been generated at a 48.0 second sample rate using the NAIF SPICE kernels. The data set is composed of 4 columns: 1) ctime - this column contains the data acquisition time. The time is always output in the ISO standard spacecraft event time format (yyyy-mm-ddThh:mm:ss.sss) but is stored internally in Cline time, which is measured in seconds after 00:00:00.000 Jan 01, 1966; 2) r - this column contains the radial distance from Saturn in Rs = 60330 km; 3) longitude - this column contains the east longitude of the spacecraft in degrees; 4) latitude - this column contains the latitude of the spacecraft in degrees. Position data is given in Minus Saturn Longitude System (kronographic) coordinates.
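
    The two unit conventions noted in the description, Cline time as seconds after 1966-01-01 00:00:00 and radial distance in Saturn radii (Rs = 60330 km), can be illustrated with the small Python helpers below; these are illustrative only, not utilities shipped with the data set, and the example values are made up.

        from datetime import datetime, timedelta, timezone

        RS_KM = 60330.0                                   # 1 Saturn radius as used in this data set
        CLINE_EPOCH = datetime(1966, 1, 1, tzinfo=timezone.utc)

        def cline_to_datetime(cline_seconds: float) -> datetime:
            """Convert Cline time (seconds after 00:00:00.000 Jan 01, 1966) to a UTC datetime."""
            return CLINE_EPOCH + timedelta(seconds=cline_seconds)

        def radial_distance_km(r_in_rs: float) -> float:
            """Convert radial distance from Saturn radii (Rs) to kilometres."""
            return r_in_rs * RS_KM

        # Made-up example values: a Cline time falling in mid-1981 and a radius of 2.67 Rs.
        print(cline_to_datetime(490_000_000))
        print(radial_distance_km(2.67), "km")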

  • R

    Dataset for IPOS TNA of ATMO-ACCESS: Lidar-based aerosol and cloud...

    • repod.icm.edu.pl
    txt, zip
    Updated Mar 13, 2025
    Cite
    Stachlewska, Iwona; Karasewicz, Maciej; Abramowicz, Anna; Wiśniewska, Kinga; Rykowska, Zuzanna; Hafiz, Afwan; Apituley, Arnoud; Drzeniecka-Osiadacz, Anetta; Kryza, Maciej; Jabłońska, Mariola; Nicolae, Doina (2025). Dataset for IPOS TNA of ATMO-ACCESS: Lidar-based aerosol and cloud classification, photometer optical properties and surface particulate matter measurements, Cabauw, Netherlands. [Dataset]. http://doi.org/10.18150/YE1LXN
    Explore at:
    txt(1509), zip(3207819593), zip(3689640), zip(79746), zip(15867285)Available download formats
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    RepOD
    Authors
    Stachlewska, Iwona; Karasewicz, Maciej; Abramowicz, Anna; Wiśniewska, Kinga; Rykowska, Zuzanna; Hafiz, Afwan; Apituley, Arnoud; Drzeniecka-Osiadacz, Anetta; Kryza, Maciej; Jabłońska, Mariola; Nicolae, Doina
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Netherlands, Cabauw
    Dataset funded by
    European Space Agencyhttp://www.esa.int/
    European Commission
    Description

    Overview

    The dataset includes data collected during the ATMO-ACCESS Trans-National Access project "Industrial Pollution Sensing with synergic techniques (IPOS TNA)", which was conducted from June 8 to June 24, 2024 at the Cabauw Experimental Site for Atmospheric Research (CESAR, 51°58'03''N, 4°55'47"E, 3 m a.s.l.) of the Royal Netherlands Meteorological Institute (KNMI). The IPOS TNA supported the 3rd Intercomparison Campaign of UV-VIS DOAS Instruments (CINDI-3). The observations were taken with the use of three instruments:

    ESA Mobile Raman Lidar (EMORAL). The lidar emits pulses at fixed wavelengths (355, 532 and 1064 nm) simultaneously, with a pulse repetition rate of 10 Hz and a pulse duration of 5-7 ns. The backward-scattered laser pulses are detected at 5 Mie narrow-band channels (355p,s, 532p,s and 1064 nm) and 3 Raman narrow-band channels (N2 at 387 and 607 nm, H2O at 408 nm) as well as a broad-band fluorescence channel (470 nm). The temporal resolution was set to 1 min and the spatial resolution to 3.75 m. The overlap between the laser beam and the full field of view of the telescope was at ~250 m a.g.l. EMORAL is a state-of-the-art lidar system developed through a collaborative effort involving the University of Warsaw (UW, Poland; leader and operator), Ludwig Maximilian University of Munich (LMU, Germany), National Observatory of Athens (NOA, Greece), Poznan University of Life Sciences (PULS, Poland), and the companies Raymetrics (Greece; core manufacturer), Licel (Germany), and InnoLas Laser (Germany). This complex instrument, part of ESA’s Opto-Electronics section (TEC-MME) at the European Space Research and Technology Centre (ESA-ESTEC, The Netherlands), is designed to perform precise atmospheric measurements. The EMORAL lidar was validated by the ACTRIS Centre for Aerosol Remote Sensing (CARS) at the Măgurele Center for Atmosphere and Radiation Studies (MARS) of the National Institute of R&D for Optoelectronics (INOE, Romania).

    PM counter GrayWolf PC-3500, Graywolf Sensing Solutions (USA), https://graywolfsensing.com/wp-content/pdf/GrayWolfPC-3500Brochure-818.pdf (last access 25/2/2025).

    Model 540 Microtops II® Sunphotometer, Solar Light Company, LLC (USA), https://www.solarlight.com/product/microtops-ii-sunphotometer (last access 25/2/2025).

    The dataset contains the following items:

    1) EMORAL lidar data files

    The data consist of two files, LiLi_IPOS.zip and LiLi_IPOS_quicklooks.zip, both described in detail below.

    The LiLi_IPOS.zip file is a folder that contains the high-resolution data obtained using the Lidar, Radar, Microwave radiometer algorithm (LiRaMi; more in Wang et al., 2020). The results were obtained only from the lidar data (referred to as Limited LiRaMi, i.e. the LiLi algorithm version). The folder contains files in netCDF4 format for each day of observations. The data products are calculated from the analog channels only. Each .nc file has a structure that contains the following variables:
    Location (string)
    Latitude (size: 1x1 [deg])
    Longitude (size: 1x1 [deg])
    Altitude (size: 1x1 [m a.g.l.])
    time vector (size: 1 x time, [UTC])
    range vector (size: range x 1, [m])
    RCS532p matrix (size: range x time, [V m2]): range-corrected signal at 532 nm, parallel polarization
    RCS532s matrix (size: range x time, [V m2]): range-corrected signal at 532 nm, perpendicular polarization
    RCS1064 matrix (size: range x time, [V m2]): range-corrected signal at 1064 nm
    SR532 matrix (size: range x time, [unitless]): scattering ratio at 532 nm
    ATT_BETA532 matrix (size: range x time, [m2/sr]): attenuated backscatter coefficient at 532 nm, parallel polarization
    C532 constant (size: 1x1, [V sr]): instrumental factor for 532 nm
    SR1064 matrix (size: range x time, [au]): scattering ratio at 1064 nm
    ATT_BETA1064 matrix (size: range x time, [m2/sr]): attenuated backscatter coefficient at 1064 nm
    C1064 constant (size: 1x1, [V sr]): instrumental factor for 1064 nm
    COLOR_RATIO matrix (size: range x time, [au]): color ratio of 532 nm and 1064 nm
    PARTICLE_DEPOLARIZATIO_RATIO matrix (size: range x time, [au]): particle depolarization ratio at 532 nm
    C constant (size: 1x1, [au]): depolarization constant for 532 nm

    The LiLi_IPOS_quicklooks.zip file contains high-resolution figures representing the data in the form of quicklooks of the following parameters:
    Range-corrected signal at 1064 nm
    Scattering ratio at 532 nm
    Color ratio of 532 and 1064 nm
    Particle depolarization ratio at 532 nm
    Aerosol target classification from the LiLi algorithm

    Reference: Wang, D., Stachlewska, I. S., Delanoë, J., Ene, D., Song, X., and Schüttemeyer, D. (2020). Spatio-temporal discrimination of molecular, aerosol and cloud scattering and polarization using a combination of a Raman lidar, Doppler cloud radar and microwave radiometer, Opt. Express 28, 20117-20134 (2020).

    2) PM counter

    The PM_counter.zip file contains a folder with data from measurements of atmospheric particulate matter collected using the GrayWolf PC-3500 particle counter from June 15 (16:16:21 CEST) to June 20 (07:06:21 CEST), 2024, at the CESAR station (51°58'04.0"N, 4°55'46.4"E). The data were processed using WolfSense PC software for validation and analysis. The final dataset, provided in XLSX format, includes the temporal evolution of particle concentration from 0.3 to 10.0 µm (6 size ranges). The data are divided into three levels:

    [1] Level 0: Raw data in XLSX format with measurement data in 4 units (µg/m3, cnts/m3, cnts dif, cnts cum). File structure:
    Line 1: headers describing columns
    Lines 2-6646: concentration of PM
    Column 1: date and time in format DD-MMM-YY HH:MM:SS AM/PM
    Columns 2-7: concentration of specific PM values: 0.3, 0.5, 1.0, 2.5, 5.0, 10.0 µm, respectively
    Column 8: Temperature
    Column 9: Carbon Dioxide (CO2)
    Column 10: Total Volatile Organic Compounds (TVOC)
    Column 11: pressure in measuring chamber
    Missing data (Columns 8-10) represented as zero value (0).

    [2] Level 1: Tables with validated data in 4 units (µg/m3, cnts/m3, cnts dif, cnts cum) in XLSX format. File structure:
    Line 1: headers describing columns
    Lines 2-6646: concentration of PM
    Column 1: date and time in format DD-MMM-YY HH:MM:SS AM/PM
    Columns 2-7: concentration of specific PM values: 0.3, 0.5, 1.0, 2.5, 5.0, 10.0 µm, respectively
    Column 8: pressure in measuring chamber
    Column 9: assembly method, where: [1] measurement at a height of 60 cm during rain (instrument protected by the table), [2] measurement at a height of 160 cm when there is no rain.

    [3] Level 2: Tables with post-processed data in XLSX format, and graphs in PNG format visualizing the received data. XLSX file structure:
    "PM counter - level 2 (daily average concentrations)" and "PM counter - level 2 (hourly average concentrations)" sheets: column structure same as in level 1.
    "PM counter - level 2 (data comparison)" sheet: Column 1 - Date in format DD.MM.YYYY; Column 2 - PM2.5 concentration measured within IPOS; Column 3 - PM10.0 concentration measured within IPOS; Column 4 - PM2.5 concentration measured at Cabauw-Wielsekade (RIVM); Column 5 - PM10.0 concentration measured at Cabauw-Wielsekade (RIVM).

    General information for all level files: Decimal separator: comma (,).

    3) Sunphotometer

    The MICROTOPS_IPOS.zip file is a folder that contains data from measurements of aerosol optical thickness at wavelengths 380, 500, 675, 870, and 1020 nm done with the Microtops II hand-held sunphotometer. The final, quality-assured dataset, provided in XLSX format, consists of measurement data for: temperature, pressure, solar zenith angle, signal strength at different wavelengths (340, 380, 500, 936, 1020 nm), standard deviation at specific wavelengths, ratio between signals at two different wavelengths (340/380, 380/500, 500/936, 936/1020), and atmospheric optical thickness at different wavelengths. During the IPOS TNA campaign, in total 29 measurements were taken. Each measurement is composed of 6 scans, of which the first one is a dark scan. The days when a measurement took place were: 13, 23, 24, and 25 of June 2024. Level 0 data are raw data converted from dbf to xlsx file format. Level 1 data are raw data converted from dbf to xlsx file format, without the dark scans. File structure:
    Line 1: headers describing columns
    Column 1: serial number of the instrument
    Columns 2-3: Date and Time in format YYYY-MM-DD; HH:MM:SS
    Columns 4-8: Data description of the campaign; Location (decimal); Latitude; Longitude (decimal); Altitude
    Columns 9-14: Atmospheric Pressure; Solar Zenith Angle; Air Mass; Standard Deviation Correction; Temperature; ID of the measurement
    Columns 15-24: Signal strength at specific wavelength and Standard Deviation
    Columns 25-28: Ratio between signals at two different wavelengths
    Columns 29-33: Atmospheric Optical Thickness
    Columns 34-39: Columnar Water Vapour and Natural Logarithm of Voltage
    Columns 40-47: Calibration coefficients
    Columns 48-49: Pressure offset and Pressure scale factor
    READ ME sheet: describing the file content and measurement location.

    4) readme file

    ATTENTION: We offer free access to this dataset. The user is, however, encouraged to share information on the data use by sending an e-mail to rslab@fuw.edu.pl. In case this dataset is used for a scientific communication (publication, conference contribution, thesis), we kindly ask to consider acknowledging the data provision by citing this dataset.

    PI of IPOS TNA Iwona Stachlewska and IPOS team members Maciej Karasewicz, Anna Abramowicz, Kinga Wiśniewska, Zuzanna Rykowska, and Afwan Hafiz acknowledge that the published dataset was prepared within the Trans-National Access grant (IPOS TNA no. ATMO-TNA-7-0000000056) within the ATMO-ACCESS grant financed by European Commission Horizon 2020 program (G.A.
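
    For a quick look at one of the daily LiLi netCDF files, a short Python sketch is given below; the file name is a hypothetical example, while the variable names (RCS1064, SR532, ...) follow the structure listed in the description.

        from netCDF4 import Dataset
        import numpy as np

        # Hypothetical daily file name; variable names as documented above.
        nc = Dataset("LiLi_IPOS_20240615.nc")
        print(list(nc.variables))                          # Location, Latitude, ..., RCS1064, SR532, ...

        rcs_1064 = np.asarray(nc.variables["RCS1064"][:])  # range x time, range-corrected signal, 1064 nm
        sr_532 = np.asarray(nc.variables["SR532"][:])      # range x time, scattering ratio, 532 nm
        print(rcs_1064.shape, sr_532.shape)

        nc.close()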
