44 datasets found
  1. Dataset metadata of known Dataverse installations, August 2024

    • dataverse.harvard.edu
    Updated Jan 1, 2025
    + more versions
    Cite
    Julian Gautier (2025). Dataset metadata of known Dataverse installations, August 2024 [Dataset]. http://doi.org/10.7910/DVN/2SA6SN
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 1, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 101 Dataverse installations, information about the metadata blocks of 106 installations, and the lists of pre-defined licenses or dataset terms that depositors can apply to datasets in the 88 installations that were running versions of the Dataverse software that include the "multiple-license" feature. The data is useful for improving understanding of how certain Dataverse features and metadata fields are used and for learning about the quality of dataset- and file-level metadata within and across Dataverse installations.

    How the metadata was downloaded
    The dataset metadata and metadata block JSON files were downloaded from each installation between August 25 and August 30, 2024 using the "get_dataverse_installations_metadata" function in a collection of Python functions at https://github.com/jggautier/dataverse-scripts/blob/main/dataverse_repository_curation_assistant/dataverse_repository_curation_assistant_functions.py. To get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL for which I was able to create an account, and another column named "apikey" listing my accounts' API tokens. The Python script uses the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens for certain API endpoints.

    How the files are organized
    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author_2024.08.25-2024.08.30.csv
    │   ├── contributor_2024.08.25-2024.08.30.csv
    │   ├── data_source_2024.08.25-2024.08.30.csv
    │   ├── ...
    │   └── topic_classification_2024.08.25-2024.08.30.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2024.08.26_15.52.42.zip
    │   ├── dataset_pids_Abacus_2024.08.26_15.52.42.csv
    │   ├── Dataverse_JSON_metadata_2024.08.26_15.52.42
    │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │   ├── ...
    │   ├── metadatablocks_v5.9
    │   ├── astrophysics_v5.9.json
    │   ├── biomedical_v5.9.json
    │   ├── citation_v5.9.json
    │   ├── ...
    │   ├── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2024.08.26_00.02.51.zip
    │   ├── ...
    │   └── Yale_Dataverse_2024.08.25_03.52.57.zip
    ├── dataverse_installations_summary_2024.08.30.csv
    ├── dataset_pids_from_most_known_dataverse_installations_2024.08.csv
    ├── license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv
    └── metadatablocks_from_most_known_dataverse_installations_2024.08.30.csv

    This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the "Citation" and "Geospatial" metadata blocks of datasets in the 101 Dataverse installations. For example, author_2024.08.25-2024.08.30.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 101 installations, with a column for each of the four child fields: author name, affiliation, identifier type, and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 106 zip files, one for each of the 106 Dataverse installations whose sites were functioning when I attempted to collect their metadata. Each zip file contains a directory with JSON files that describe the installation's metadata fields, such as the field names and how they're organized. For installations that had published datasets and whose dataset metadata I was able to download with Dataverse APIs, the zip file also contains: a CSV file listing information about the datasets published in the installation, including a column indicating whether the Python script was able to download the Dataverse JSON metadata for each dataset; and a directory of JSON files that contain the metadata of the installation's published, non-deaccessioned dataset versions in the Dataverse JSON metadata schema.

    The dataverse_installations_summary_2024.08.30.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata included and not included in this dataset. The dataset_pids_from_most_known_dataverse_installations_2024.08.csv file contains the dataset PIDs of published datasets in the 101 Dataverse installations, with a column indicating whether the Python script was able to download each dataset's metadata; it is a union of all "dataset_pids_....csv" files in the 101 zip files in the dataverse_json_metadata_from_each_known_dataverse_installation directory. The license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv file contains information about the licenses and...
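
    The metadata export itself relies on Dataverse's native API. Below is a minimal Python sketch of that retrieval step, assuming placeholder values for the CSV file name, the installation URL, the API token, and the dataset DOI; it is an illustration, not the author's script linked above.

    import csv
    import requests

    # Read the "hostname"/"apikey" CSV described above (file name is a placeholder).
    with open("installations.csv", newline="") as f:
        installations = list(csv.DictReader(f))

    for inst in installations:
        base = inst["hostname"].rstrip("/")
        headers = {"X-Dataverse-key": inst["apikey"]} if inst.get("apikey") else {}

        # List the installation's metadata blocks (field names and how they're organized).
        blocks = requests.get(f"{base}/api/metadatablocks", headers=headers, timeout=30).json()

        # Export one dataset's metadata in the Dataverse JSON schema (the DOI is a placeholder).
        resp = requests.get(
            f"{base}/api/datasets/export",
            params={"exporter": "dataverse_json", "persistentId": "doi:10.7910/DVN/EXAMPLE"},
            headers=headers,
            timeout=30,
        )
        print(base, resp.status_code, len(blocks.get("data", [])))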

  2. Finhubb Stock API - Datasets

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    M, K (2023). Finhubb Stock API - Datasets [Dataset]. http://doi.org/10.7910/DVN/PVEM40
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    M, K
    Description

    Finnhub is the ultimate stock API on the market, providing real-time and historical prices for global stocks via REST API and WebSocket. We also support tons of other financial data such as stock fundamentals, analyst estimates, and more. Download the file to access the balance sheet of Amazon.
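
    As an illustration of the REST side, here is a minimal Python sketch of a Finnhub quote request, assuming a placeholder API token and the quote endpoint from Finnhub's public documentation.

    import requests

    # Fetch a real-time quote for one symbol; the token is a placeholder.
    resp = requests.get(
        "https://finnhub.io/api/v1/quote",
        params={"symbol": "AMZN", "token": "YOUR_FINNHUB_TOKEN"},
        timeout=10,
    )
    print(resp.json())  # e.g. current price under "c", previous close under "pc"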

  3. Course Planner API

    • dataverse.harvard.edu
    Updated Apr 26, 2016
    Cite
    Harvard University IT (2016). Course Planner API [Dataset]. http://doi.org/10.7910/DVN/NCSJZW
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 26, 2016
    Dataset provided by
    Harvard Dataverse
    Authors
    Harvard University IT
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The Course Planner API allows developers to create applications that will interact with Course Planner data. Using this API, you can build applications that will allow your users (that are enrolled Harvard College/GSAS students) to add courses to their Course Planner, view the courses that are in the Course Planner, and remove courses.

  4. Harvard Art Museums API

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    + more versions
    Cite
    Harvard Art Museums (2023). Harvard Art Museums API [Dataset]. http://doi.org/10.7910/DVN/BQAQ7G
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Harvard Art Museums
    Description

    The Harvard Art Museums API is a REST-style service designed for developers who wish to explore and integrate the museums’ collections in their projects. The API provides direct access to the data that powers the museums' website and many other aspects of the museums.
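
    A minimal Python sketch of one request to the museums' REST API, assuming a placeholder API key and the object resource described in the museums' public documentation:

    import requests

    # List a few object records; the API key is a placeholder.
    resp = requests.get(
        "https://api.harvardartmuseums.org/object",
        params={"apikey": "YOUR_API_KEY", "size": 5},
        timeout=10,
    )
    for record in resp.json().get("records", []):
        print(record.get("objectnumber"), record.get("title"))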

  5. Harvard Faculty Finder

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Waldo, Jim (2023). Harvard Faculty Finder [Dataset]. http://doi.org/10.7910/DVN/PLMNRW
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Waldo, Jim
    Description

    The Harvard Faculty Finder creates an institution-wide view of the breadth and depth of Harvard faculty and scholarship, and it helps students, faculty, administrators, and the general public locate Harvard faculty according to research and teaching expertise. More information about the HFF website and the data it contains can be found on the Harvard University Faculty Development & Diversity website. HFF is a Semantic Web application, which means its content can be read and understood by other computer programs. This enables the data associated with a person, such as titles, contact information, and publications, to be shared with other institutions and to appear on other websites. Below are the technical details for building a computer program that can export data from HFF. The data is available through an API. No authentication is required. Documentation can be found at http://api.facultyfinder.harvard.edu, or you can see a snapshot of the documentation as the data for this entry. The API entry points are described in the documentation.

  6. SicpaOpenData for .Net - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Nov 4, 2023
    + more versions
    Cite
    (2023). SicpaOpenData for .Net - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/e940abb1-db2f-5d34-b6c2-4ccdd0935cdb
    Explore at:
    Dataset updated
    Nov 4, 2023
    Description

    The Sicpa_OpenData libraries facilitate the publication of data to the INRAE dataverse in a transparent way: 1) by simplifying the creation of the metadata document from data already present in the information systems, and 2) by simplifying the use of dataverse.org APIs.

  7. Passed Resolves; Resolves 1881, c.49, SC1/series 228, Petition of Priscilla...

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Dataverse API creator (2023). Passed Resolves; Resolves 1881, c.49, SC1/series 228, Petition of Priscilla Freeman [Dataset]. http://doi.org/10.7910/DVN/RX9POV
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Dataverse API creator
    Description

    Petition subject: Indian resources
    Original: http://nrs.harvard.edu/urn-3:FHCL:25500494
    Date of creation: 1880-11-07
    Petition location: Cottage City [Oak Bluffs]
    Selected signatures: Priscilla Freeman
    Total signatures: 1
    Females of color signatures: 1
    Female only signatures: Yes
    Identifications of signatories: an Indian and one of the riparian proprietors of a certain pond containing more than twenty acres situated in said county and known as "Tisbury Great Pond", [females of color]
    Prayer format was printed vs. manuscript: Manuscript
    Additional non-petition or unrelated documents available at archive: additional documents available
    Additional archivist notes: right to fish in Tisbury Great Pond, lease, waters, commissioners on inland fisheries, Allen Look and others, natural rights as an Indian
    Location of the petition at the Massachusetts Archives of the Commonwealth: Resolves 1881, c.49, passed April 23, 1881
    Acknowledgements: Supported by the National Endowment for the Humanities (PW-5105612), Massachusetts Archives of the Commonwealth, Radcliffe Institute for Advanced Study at Harvard University, Center for American Political Studies at Harvard University, Institutional Development Initiative at Harvard University, and Harvard University Library.

  8. CELL5M: A Multidisciplinary Geospatial Database for Africa South of the...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Dec 5, 2017
    Cite
    Harvard Dataverse (2017). CELL5M: A Multidisciplinary Geospatial Database for Africa South of the Sahara [Dataset]. http://doi.org/10.7910/DVN/G4TBLF
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 5, 2017
    Dataset provided by
    Harvard Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Sub-Saharan Africa, Africa
    Dataset funded by
    CGIAR Research Program on Policies, Institutions, and Markets (PIM)
    The Bill and Melinda Gates Foundation
    Description

    Spatially explicit data are increasingly becoming available across disciplines, yet they are often limited to a specific domain. In order to use such datasets in a coherent analysis, such as to decide where to target specific types of agricultural investment, there should be an effort to make them harmonized and interoperable. For the Africa South of the Sahara (SSA) region, the HarvestChoice CELL5M Database was developed in this spirit of moving multidisciplinary data into one harmonized, geospatial database. The database includes over 750 biophysical and socio-economic indicators, many of which can easily be expanded to global scale. The CELL5M database provides a platform for cross-cutting spatial analyses and fine-grain visualization of the mix of farming systems and populations across SSA. It was created as the central core of a decision-making platform that enables development practitioners and researchers to explore multi-faceted spatial relationships at the nexus of poverty, health and nutrition, farming systems, innovation, and environment. The database is a matrix populated by over 350,000 grid cells covering SSA at five arc-minute spatial resolution. Users of the database, including those conducting research on agricultural policy, research, and development issues, can also easily overlay their own indicators. Numerical aggregation of the gridded data by specific geographical domains, either at subnational level or across country borders for more regional analysis, is also readily possible without any specific GIS software. See the HCID database (http://dx.doi.org/10.7910/DVN/MZLXVQ) for the geometry of each grid cell. The database also provides a standards-compliant data API that currently powers several web-based data visualization and analytics tools.

  9. Replication Data for: Climate Nags: Affect and the Convergence of Global...

    • dataverse.azure.uit.no
    • dataverse.no
    • +2more
    Updated Sep 28, 2023
    + more versions
    Cite
    Jessica Yarin Robinson; Jessica Yarin Robinson (2023). Replication Data for: Climate Nags: Affect and the Convergence of Global Risk in Online Networks [Dataset]. http://doi.org/10.18710/G1CIXA
    Explore at:
    Available download formats: txt (2845), text/comma-separated-values (47452886)
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    DataverseNO
    Authors
    Jessica Yarin Robinson; Jessica Yarin Robinson
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This data set contains the IDs of the 1,186,322 tweets used in "Climate Nags: Affect and the Convergence of Global Risk in Online Networks" (published in Continuum, 2023). The data was collected from Twitter's Streaming API using the DMI-TCAT during the first four months of the Coronavirus pandemic; the 2020 U.S. presidential race; and the early stages of the 2022 Russia–Ukraine War. These collections were then filtered based on key words related to climate change (see README file for more details).

  10. LSDO Event Records

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Kucuk, Ahmet; Banda, Juan; Angryk, Rafal (2023). LSDO Event Records [Dataset]. https://search.dataone.org/view/sha256%3Abdc7066e5dc3d61a4b3643332796a21c7c939594abca7808bccf883365a5a220
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Kucuk, Ahmet; Banda, Juan; Angryk, Rafal
    Description

    This file contains event records extracted and cleaned from the Heliophysics Event Knowledgebase (HEK) API. The data contains records of four event types (AR, CH, FL, SG) from Jan. 1, 2012 to Dec. 31, 2014. Corresponding image files can be found in the LSDO dataverse (https://dataverse.harvard.edu/dataverse/lsdo).
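
    A minimal Python sketch of querying HEK for one of these event types, using SunPy's HEK client rather than the raw API; the dates and event type are examples drawn from the range above.

    from sunpy.net import attrs as a
    from sunpy.net import hek

    client = hek.HEKClient()
    # FL = flare; AR, CH, and SG can be queried the same way.
    flares = client.search(a.Time("2012-01-01", "2012-01-02"), a.hek.EventType("FL"))
    print(len(flares))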

  11. Harvard Catalyst Profiles

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Weber, Griffin (2023). Harvard Catalyst Profiles [Dataset]. https://search.dataone.org/view/sha256%3A6d78760c5ead404be5cbb66b6e5600ea8d0e2c3e91a09b787cf5e4eb7db910bf
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Weber, Griffin
    Description

    Harvard Catalyst Profiles is a Semantic Web application, which means its content can be read and understood by other computer programs. This enables the data in profiles, such as addresses and publications, to be shared with other institutions and appear on other websites. If you click the "Export RDF" link on the left sidebar of a profile page, you can see what computer programs see when visiting a profile. The section below describes the technical details for building a computer program that can export data from Harvard Catalyst Profiles. There are four types of application programming interfaces (APIs) in Harvard Catalyst Profiles.
    1. RDF crawl. Because Harvard Catalyst Profiles is a Semantic Web application, every profile has both an HTML page and a corresponding RDF document, which contains the data for that page in RDF/XML format. Web crawlers can follow the links embedded within the RDF/XML to access additional content.
    2. SPARQL endpoint. SPARQL is a programming language that enables arbitrary queries against RDF data. This provides the most flexibility in accessing data; however, the downsides are the complexity in coding SPARQL queries and performance. In general, the XML Search API (see below) is better to use than SPARQL. However, if you require access to the SPARQL endpoint, please contact Griffin Weber.
    3. XML Search API. This is a web service that provides support for the most common types of queries. It is designed to be easier to use and to offer better performance than SPARQL, but at the expense of fewer options. It enables full-text search across all entity types, faceting, pagination, and sorting options. The request message to the web service is in XML format, but the output is in RDF/XML format. The URL of the XML Search API is https://connects.catalyst.harvard.edu/API/Profiles/Public/Search.
    4. Old XML based web services. This provides backwards compatibility for institutions that built applications using the older version of Harvard Catalyst Profiles. These web services do not take advantage of many of the new features of Harvard Catalyst Profiles. Users are encouraged to switch to one of the new APIs. The URL of the old XML web service is https://connects.catalyst.harvard.edu/ProfilesAPI.
    For more information about the APIs, please see the documentation and example files.
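
    A minimal Python sketch of calling the XML Search API, assuming a placeholder request document; the actual XML schema is defined in the documentation and example files mentioned above.

    import requests

    SEARCH_URL = "https://connects.catalyst.harvard.edu/API/Profiles/Public/Search"
    # Placeholder request body; replace with a query document built per the API docs.
    xml_request = '<?xml version="1.0" encoding="utf-8"?><!-- query document goes here -->'

    resp = requests.post(
        SEARCH_URL,
        data=xml_request.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        timeout=30,
    )
    print(resp.status_code)
    print(resp.text[:500])  # response is RDF/XML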

  12. Replication Data for: Open Journal Systems and Dataverse Integration–...

    • dataverse.harvard.edu
    Updated Oct 15, 2015
    Cite
    Micah Altman; Eleni Castro; Merce Crosas; Philip Durbin; Jen Whitney (2015). Replication Data for: Open Journal Systems and Dataverse Integration– Helping Journals to Upgrade Data Publication for Reusable Research [Dataset]. http://doi.org/10.7910/DVN/Y3WOOE
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 15, 2015
    Dataset provided by
    Harvard Dataverse
    Authors
    Micah Altman; Eleni Castro; Merce Crosas; Philip Durbin; Jen Whitney
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This article describes the novel open source tools for open data publication in open access journal workflows. This comprises a plugin for Open Journal Systems that supports a data submission, citation, review, and publication workflow; and an extension to the Dataverse system that provides a standard deposit API. We describe the function and design of these tools, provide examples of their use, and summarize their initial reception. We conclude by discussing future plans and potential impact.
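
    A deposit API of this kind (Dataverse's SWORDv2-based Data Deposit API) can be exercised with a minimal Python sketch like the one below, assuming the standard endpoint path and placeholder server and token values.

    import requests

    SERVER = "https://demo.dataverse.org"      # placeholder installation
    API_TOKEN = "YOUR_API_TOKEN"               # placeholder token

    # SWORD uses HTTP Basic auth with the API token as the username.
    resp = requests.get(
        f"{SERVER}/dvn/api/data-deposit/v1.1/swordv2/service-document",
        auth=(API_TOKEN, ""),
        timeout=30,
    )
    print(resp.status_code)
    print(resp.text[:300])  # Atom service document listing deposit collections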

  13. CML-COVID: a COVID-19 Twitter dataset

    • dataverse.tdl.org
    txt
    Updated Jan 28, 2021
    Cite
    Dhiraj Murthy; Dhiraj Murthy (2021). CML-COVID: a COVID-19 Twitter dataset [Dataset]. http://doi.org/10.18738/T8/W1CHVU
    Explore at:
    Available download formats: txt (2000000), txt (1998140), txt (1997900), txt (1996780), txt (1985880)
    Dataset updated
    Jan 28, 2021
    Dataset provided by
    Texas Data Repository
    Authors
    Dhiraj Murthy; Dhiraj Murthy
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    In this dataset, we present CML-COVID, a COVID-19 Twitter data set of 19,298,967 tweets from 5,977,653 unique individuals collected between March 2020 and July 2020. The prefix to each filename is the search query used. The CML-COVID dataset is released in compliance with Twitter's Terms & Conditions (T&C), which prohibit the verbatim release of full tweet text and API-derived data. Rather, we provide a list of tweet IDs that others can directly 'hydrate' using calls to the Twitter API. If you use the CML-COVID dataset, please cite this dataset to acknowledge your use of our data.
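
    A minimal Python sketch of the hydration step, assuming a placeholder bearer token and placeholder tweet IDs, using the Twitter API v2 tweet lookup endpoint (up to 100 IDs per request):

    import requests

    BEARER_TOKEN = "YOUR_BEARER_TOKEN"                           # placeholder
    tweet_ids = ["1240000000000000000", "1240000000000000001"]   # placeholder IDs

    resp = requests.get(
        "https://api.twitter.com/2/tweets",
        params={"ids": ",".join(tweet_ids), "tweet.fields": "created_at,author_id"},
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        timeout=30,
    )
    for tweet in resp.json().get("data", []):
        print(tweet["id"], tweet["created_at"])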

  14. Open Source at Harvard

    • dataverse.harvard.edu
    Updated Jan 3, 2023
    Cite
    Philip Durbin (2023). Open Source at Harvard [Dataset]. http://doi.org/10.7910/DVN/TJCLKP
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 3, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Philip Durbin
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The tabular file contains information on known Harvard repositories on GitHub, such as the number of stars, programming language, day last updated, number of open issues, size, number of forks, repository URL, create date, and description. Each repository has a corresponding JSON file (see primary-data.zip) that was retrieved using the GitHub API with code and a list of repositories available from https://github.com/IQSS/open-source-at-harvard.
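
    A minimal Python sketch of retrieving the same repository fields from the GitHub REST API; the repository name is an example, and unauthenticated requests are rate-limited.

    import requests

    resp = requests.get("https://api.github.com/repos/IQSS/dataverse", timeout=10)
    repo = resp.json()
    # Same kinds of fields as the tabular file: stars, language, open issues, forks, size, dates.
    print(repo["stargazers_count"], repo["language"], repo["open_issues_count"],
          repo["forks_count"], repo["size"], repo["created_at"], repo["html_url"])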

  15. PRIVEE-NJIT dataset

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Bhattacharjee, Kaustav; Islam, Akm; Vaidya, Jaideep; Dasgupta, Aritra (2023). PRIVEE-NJIT dataset [Dataset]. https://search.dataone.org/view/sha256%3A628777e9a93c604aebdea040ca66d97e8eda1d0382598881de0ae6392ccd7dee
    Explore at:
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Bhattacharjee, Kaustav; Islam, Akm; Vaidya, Jaideep; Dasgupta, Aritra
    Description

    This dataset contains a list of all the open datasets we have collected through the Socrata API and is used in developing the PRIVEE interface. We have enriched the dataset with metadata information of these datasets, including their columns, tags, and the number of rows. We have also identified some of the quasi-identifiers present in these datasets.
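
    A minimal Python sketch of listing open dataset metadata through the Socrata Discovery API, assuming an example domain; this is an illustration, not the exact collection script used for PRIVEE.

    import requests

    resp = requests.get(
        "https://api.us.socrata.com/api/catalog/v1",
        params={"domains": "data.cityofchicago.org", "limit": 5},
        timeout=30,
    )
    for item in resp.json().get("results", []):
        resource = item.get("resource", {})
        print(resource.get("id"), resource.get("name"))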

  16. Reddit May 2019 Submissions

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Baumgartner, Jason (2023). Reddit May 2019 Submissions [Dataset]. https://search.dataone.org/view/sha256%3A4f48e12343b6eecd355089d8f410e129e7edb3300dbdbe66a460517d54a27ca0
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Baumgartner, Jason
    Description

    Dataset Metrics
    Total size of data uncompressed: 59,515,177,346 bytes
    Number of objects (submissions): 19,456,493
    Reddit API Documentation: https://www.reddit.com/dev/api/
    Overview
    This dataset contains all available submissions from Reddit during the month of May 2019 (using UTC time boundaries). The data has been split to accommodate the file upload limitations for Dataverse. Each file is a collection of JSON objects (ndjson). Each file was then compressed using zstandard compression (https://facebook.github.io/zstd). The files should be ordered by the id of the submission (represented by the id field). The time that each object was ingested is recorded in the retrieved_on field (in epoch seconds).
    Methodology
    Monthly Reddit ingests are usually started around a week into a new month for the previous month (but could be delayed). This gives submission scores, gildings, and num_comments time to "settle" close to their eventual score before Reddit archives the posts (usually done six months after the post's creation). All submissions are ingested via Reddit's API (using the /api/info endpoint). This is a "best effort" attempt to get all available data at the time of ingest. Due to the nature of Reddit, subreddits can go from private to public at any time, so it's possible more submissions could be found by rescanning missing ids. The author of this dataset highly encourages researchers to do a sanity check on the data and to rescan for missing ids to ensure all available data has been gathered. If you need assistance, you can contact me directly. All efforts were made to capture as much data as possible. Generally, > 95% of all ids are captured. Missing data could be the result of Reddit API errors, submissions that were private during the ingest but then became public, and subreddits that were quarantined and were not added to the whitelist before ingesting the data. When collecting the data, two scans are done. The first scan of ids using the /api/info endpoint collects all available data. After the first scan, a second scan is done requesting only missing ids from the first scan. This helps to keep the data as complete and comprehensive as possible.
    Contact
    If you have any questions about the data or require more details on the methodology, you are welcome to contact the author.
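
    A minimal Python sketch of the two-pass id scan described above, using Reddit's /api/info endpoint with placeholder submission fullnames and a placeholder User-Agent string:

    import requests

    def fetch_by_fullname(fullnames):
        resp = requests.get(
            "https://www.reddit.com/api/info.json",
            params={"id": ",".join(fullnames)},
            headers={"User-Agent": "research-ingest-example/0.1"},
            timeout=30,
        )
        children = resp.json()["data"]["children"]
        return {c["data"]["name"]: c["data"] for c in children}

    wanted = ["t3_bs0001", "t3_bs0002", "t3_bs0003"]          # placeholder ids
    first_pass = fetch_by_fullname(wanted)
    missing = [fn for fn in wanted if fn not in first_pass]    # rescan anything not returned
    second_pass = fetch_by_fullname(missing) if missing else {}
    print(len(first_pass), "found on first scan,", len(missing), "rescanned")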

  17. Topical diversity of user interests and content

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    + more versions
    Cite
    Weng, Lilian; Menczer, Filippo (2023). Topical diversity of user interests and content [Dataset]. https://search.dataone.org/view/sha256%3A515c7041f9529193f3b05cd3b7708cb0add1b61352433e5868de770814406a10
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Weng, Lilian; Menczer, Filippo
    Description
    • Source: Sampled public tweets from Twitter streaming API.
    • Date range: January 1, 2013 to March 31, 2013.
    • Data size: 6.4 GB; about 490 million tweets.
    • Contains: 1) Sampled tweets during 3 months. 2) Each tweet is associated with a timestamp, anonymized user ID, and a list of hashtags.

  18. Replication Data for: Training Deep Convolutional Object Detectors for...

    • dataverse.harvard.edu
    Updated Apr 16, 2022
    Cite
    Tomasz Gandor (2022). Replication Data for: Training Deep Convolutional Object Detectors for Images Affected by Lossy Compression [Dataset]. http://doi.org/10.7910/DVN/UHEP3C
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 16, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Tomasz Gandor
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This collection contains the trained models and object detection results of 2 architectures found in the Detectron2 library, on the MS COCO val2017 dataset, under different JPEG compression levels Q = {5, 12, 19, 26, 33, 40, 47, 54, 61, 68, 75, 82, 89, 96} (14 levels per trained model).
    Architectures:
    F50 – Faster R-CNN on ResNet-50 with FPN
    R50 – RetinaNet on ResNet-50 with FPN
    Training type:
    D2 – Detectron2 Model ZOO pre-trained 1x model (90,000 iterations, batch 16)
    STD – standard 1x training (90,000 iterations) on the original train2017 dataset
    Q20 – 1x training (90,000 iterations) on the train2017 dataset degraded to Q=20
    Q40 – 1x training (90,000 iterations) on the train2017 dataset degraded to Q=40
    T20 – extra 1x training on top of D2 on the train2017 dataset degraded to Q=20
    T40 – extra 1x training on top of D2 on the train2017 dataset degraded to Q=40
    Model and metrics files:
    models_FasterRCNN.tar.gz (F50-STD, F50-Q20, …)
    models_RetinaNet.tar.gz (R50-STD, R50-Q20, …)
    For every model there are 3 files:
    config.yaml – the Detectron2 config of the model.
    model_final.pth – the weights (training snapshot) in PyTorch format.
    metrics.json – training metrics (like time, total loss, etc.) every 20 iterations.
    The D2 models were not included, because they are available from the Detectron2 Model ZOO, as faster_rcnn_R_50_FPN_1x (F50-D2) and retinanet_R_50_FPN_1x (R50-D2).
    Result files:
    F50-results.tar.gz – results for Faster R-CNN models (including D2).
    R50-results.tar.gz – results for RetinaNet models (including D2).
    For every model there are 14 subdirectories, e.g. evaluator_dump_R50x1_005 through evaluator_dump_R50x1_096, for each of the JPEG Q values. Each such folder contains:
    coco_instances_results.json – all detected objects (image id, bounding box, class index and confidence).
    results.json – AP metrics as computed by the COCO API.
    Source code for processing the data:
    The data can be processed using our code, published at https://github.com/tgandor/urban_oculus.
    Additional dependencies for the source code: COCO API, Detectron2.
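
    A minimal Python sketch of loading one of the released models with Detectron2 and running it on an image; the file paths are placeholders for files extracted from the archives above.

    import cv2
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file("F50-STD/config.yaml")         # placeholder path to config.yaml
    cfg.MODEL.WEIGHTS = "F50-STD/model_final.pth"      # placeholder path to the weights
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5        # Faster R-CNN score threshold

    predictor = DefaultPredictor(cfg)
    image = cv2.imread("val2017/000000000139.jpg")     # placeholder MS COCO val2017 image
    outputs = predictor(image)
    print(outputs["instances"].pred_classes, outputs["instances"].scores)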

  19. Data for: Identifying Metadata Quality Issues Across Cultures

    • dataverse.harvard.edu
    Updated Jul 26, 2023
    Cite
    Julie Shi; Mike Nason; Marco Tullney; Juan Pablo Alperin (2023). Data for: Identifying Metadata Quality Issues Across Cultures [Dataset]. http://doi.org/10.7910/DVN/GZI7IA
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 26, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Julie Shi; Mike Nason; Marco Tullney; Juan Pablo Alperin
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This sample was drawn from the Crossref API on March 8, 2022. The sample was constructed purposefully on the hypothesis that records with at least one known issue would be more likely to yield issues related to cultural meanings and identity. Records known or suspected to have at least one quality issue were selected by the authors and Crossref staff. The Crossref API was then used to randomly select additional records from the same prefix. Records in the sample represent 51 DOI prefixes that were chosen without regard for the manuscript management or publishing platform used, as well as 17 prefixes for journals known to use the Open Journal Systems manuscript management and publishing platform. OJS was specifically identified due to the authors' familiarity with the platform, its international and multilingual reach, and previous work on its metadata quality.
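
    A minimal Python sketch of drawing random records from one DOI prefix with the Crossref REST API's "sample" parameter; the prefix and mailto address are placeholders, and this is an illustration rather than the authors' exact sampling code.

    import requests

    resp = requests.get(
        "https://api.crossref.org/prefixes/10.1234/works",
        params={"sample": 5, "mailto": "name@example.org"},
        timeout=30,
    )
    for item in resp.json()["message"]["items"]:
        print(item.get("DOI"), (item.get("title") or [""])[0])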

  20. Replication Data for Hashtag Co-occurrence Community Detection

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    Cite
    Zhu, Xi; Alan M. MacEachren (2023). Replication Data for Hashtag Co-occurrence Community Detection [Dataset]. https://dataone.org/datasets/sha256%3Ae05ed893fdd0f93eb0847738fc1d2d4f8e95fb6ed5d293a95bd5f3fe04cfe1ea
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Zhu, Xi; Alan M. MacEachren
    Time period covered
    Jan 1, 2016 - Dec 31, 2017
    Description

    Geotagged public tweets from Twitter streaming API. Date range: January 1, 2016 to December 31, 2017. Data size: 4 GB; about 170 million tweets with hashtags. Attributes: Each tweet is associated with a tweet id, timestamp, anonymized user ID, and a list of hashtags.
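
    A minimal Python sketch of the co-occurrence step that precedes community detection, counting hashtag pairs that appear in the same tweet; the tweets below are placeholders standing in for records hydrated from the released data.

    from collections import Counter
    from itertools import combinations

    tweets = [
        {"id": "1", "hashtags": ["flood", "rescue", "houston"]},
        {"id": "2", "hashtags": ["flood", "houston"]},
    ]

    pair_counts = Counter()
    for tweet in tweets:
        tags = sorted(set(tag.lower() for tag in tweet["hashtags"]))
        pair_counts.update(combinations(tags, 2))   # undirected co-occurrence edges

    for (a, b), weight in pair_counts.most_common():
        print(a, b, weight)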
