42 datasets found
  1. datatrove-tests

    • huggingface.co
    Updated May 5, 2024
    Cite
    Hugging Face (2024). datatrove-tests [Dataset]. https://huggingface.co/datasets/huggingface/datatrove-tests
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 5, 2024
    Dataset authored and provided by
    Hugging Face (https://huggingface.co/)
    Description

    Datasets used for datatrove testing. Each split contains the same data: dst = [ {"text": "hello"}, {"text": "world"}, {"text": "how"}, {"text": "are"}, {"text": "you"}, ]

    Depending on the split name, however, the data are sharded into n bins.
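
The description does not say how records are assigned to bins, so the following is only a minimal sketch that assumes a round-robin scheme. The function name `shard_round_robin` and the choice of two bins are illustrative, not part of the dataset or of datatrove.

```python
from typing import Dict, List

# The five records listed in the dataset description.
dst: List[Dict[str, str]] = [
    {"text": "hello"},
    {"text": "world"},
    {"text": "how"},
    {"text": "are"},
    {"text": "you"},
]


def shard_round_robin(records: List[Dict[str, str]], n_bins: int) -> List[List[Dict[str, str]]]:
    """Distribute records over n_bins shards in round-robin order.

    Round-robin is an assumption made for illustration; the real
    sharding scheme is not documented in the description.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    shards: List[List[Dict[str, str]]] = [[] for _ in range(n_bins)]
    for index, record in enumerate(records):
        shards[index % n_bins].append(record)
    return shards


# Hypothetical example: split the same data into two bins.
shards = shard_round_robin(dst, 2)
```

Whatever the real scheme is, the records across all bins should still add up to the same five items.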

  2. Trove People and Organisations data

    • data.wu.ac.at
    • researchdata.edu.au
    • +1 more
    xml
    Updated Mar 7, 2015
    + more versions
    Cite
    National Library of Australia (2015). Trove People and Organisations data [Dataset]. https://data.wu.ac.at/odso/data_gov_au/YjI0N2UxYzAtNjA4ZC00OTVkLTg1ODMtOGJjOWRlNjNjNGVl
    Explore at:
    Available download formats: xml (10979.0), xml (1262.0)
    Dataset updated
    Mar 7, 2015
    Dataset provided by
    National Library of Australia
    Description

    The National Library of Australia operates the Trove People and Organisations zone (http://trove.nla.gov.au/people), which allows users to access information about significant people and organisations (parties) as well as related biographical and contextual information.

    The Trove People and Organisations dataset is based on the Australian Name Authority File, a unique resource maintained since 1981 by Australian libraries which contribute their holdings to Libraries Australia (http://librariesaustralia.nla.gov.au). The Trove People and Organisations zone plays an important role in exposing records about parties and linking to them in libraries and other collecting institutions. The data also provides links to resources by and about a party and relationships between parties.

    To further enrich the service the Library is collaborating with organisations that already make available information about people and organisations in their specific domains and linking to them.

    The API to this dataset provides access to 885,000 identities.

  3. Data from: GLAM-Workbench/trove-newspapers-data-post-54

    • data.niaid.nih.gov
    Updated Sep 14, 2024
    + more versions
    Cite
    Sherratt, Tim (2024). GLAM-Workbench/trove-newspapers-data-post-54 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6812811
    Dataset updated
    Sep 14, 2024
    Dataset authored and provided by
    Sherratt, Tim
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Newspapers with articles published after 1954

    Current version: v1.7

    Due to copyright restrictions, most of the digitised newspaper articles on Trove were published before 1955. However, some articles published after 1954 have been made available. This repository provides data about digitised newspapers in Trove that have articles available from after 1954 (the 'copyright cliff of death'). The data was extracted from the Trove API using this notebook from the Trove newspapers section of the GLAM Workbench. The data is available as a CSV file entitled newspapers_post_54.csv and contains the following fields:

    • title – the full title of the newspaper
    • state – the state in which the newspaper was published
    • id – Trove's unique identifier for this newspaper
    • startDate – the earliest date of articles from this newspaper available in Trove
    • endDate – the latest date of articles from this newspaper available in Trove
    • issn – ISSN
    • number_of_articles – the number of articles from this newspaper published after 1954 available in Trove
    • troveUrl – link to more information about this newspaper
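
As a rough illustration of working with these fields, the sketch below parses CSV text that has the same header and ranks newspapers by number_of_articles. Only the column names come from the description. The two sample rows, their URLs, and the helper name `load_newspapers` are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; only the header names are taken from the description.
SAMPLE_CSV = """title,state,id,startDate,endDate,issn,number_of_articles,troveUrl
Example Gazette,Victoria,1,1950-01-01,1960-12-31,,120,https://example.org/1
Sample Times,Tasmania,2,1940-01-01,1956-06-30,,15,https://example.org/2
"""


def load_newspapers(handle):
    """Read rows from a CSV file object and convert the article count to int."""
    rows = []
    for row in csv.DictReader(handle):
        row["number_of_articles"] = int(row["number_of_articles"])
        rows.append(row)
    return rows


newspapers = load_newspapers(io.StringIO(SAMPLE_CSV))

# Rank newspapers by how many post-1954 articles they have, largest first.
by_count = sorted(newspapers, key=lambda r: r["number_of_articles"], reverse=True)
```

To use the real data, pass `open("newspapers_post_54.csv", newline="", encoding="utf-8")` in place of the in-memory sample.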

    This repository is part of the GLAM Workbench. If you think this project is worthwhile, you might like to sponsor me on GitHub.

  4. trove-examples-data

    • huggingface.co
    Updated Apr 26, 2025
    Cite
    Bats Research (2025). trove-examples-data [Dataset]. https://huggingface.co/datasets/BatsResearch/trove-examples-data
    Dataset updated
    Apr 26, 2025
    Dataset authored and provided by
    Bats Research
    Description

    This repo holds files that are used in Trove examples.

    tevatron_msmarco_passage_aug_qrel.jsonl contains the qrels extracted from train.jsonl.gz file in Tevatron/msmarco-passage-aug repo.

  5. Data from: GLAM-Workbench/trove-lists-metadata

    • data.niaid.nih.gov
    Updated Jun 6, 2024
    + more versions
    Cite
    Sherratt, Tim (2024). GLAM-Workbench/trove-lists-metadata [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6827077
    Dataset updated
    Jun 6, 2024
    Dataset authored and provided by
    Sherratt, Tim
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Current version: v1.4

    Trove users can create collections of resources using Trove's 'lists'. Metadata describing public lists is available via the Trove API. This dataset was created by harvesting this metadata. To reduce file size, the details of the resources collected by each list are not included, just the total number of resources. The data was extracted from the Trove API using this notebook from the Trove lists and tags section of the GLAM Workbench. The data is available as a CSV file entitled trove-lists.csv and contains the following fields:

    • created – date the list was created
    • id – Trove's unique list identifier
    • number_items – number of resources in list
    • title – the title of this list
    • updated – date the list was last updated

    This repository is part of the GLAM Workbench. If you think this project is worthwhile, you might like to sponsor me on GitHub.

  6. Public tags added to resources in Trove, 2008 to 2024

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 6, 2024
    Cite
    Sherratt, Tim (2024). Public tags added to resources in Trove, 2008 to 2024 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5094313
    Dataset updated
    Jun 6, 2024
    Dataset authored and provided by
    Sherratt, Tim
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    This dataset contains details of 2,495,958 unique public tags added to 10,403,650 resources in Trove between August 2008 and June 2024. I harvested the data using the Trove API and saved it as a CSV file with the following columns:

    tag – lower-cased text tag

    date – date the tag was added

    zone – API zone containing the tagged resource

    record_id – the identifier of the tagged resource

    I've documented the method used to harvest the tags in this notebook.

    Using the zone and record_id you can find more information about a tagged item. To create urls to the resources in Trove:

    for resources in the 'book', 'article', 'picture', 'music', 'map', and 'collection' zones add the record_id to https://trove.nla.gov.au/work/

    for resources in the 'newspaper' and 'gazette' zones add the record_id to https://trove.nla.gov.au/article/

    for resources in the 'list' zone add the record_id to https://trove.nla.gov.au/list/
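
The three URL rules above can be sketched as a small helper. The base URLs and zone names are taken from the description; the function name `trove_url` and the record ids used in the usage example are hypothetical.

```python
# Zones grouped by the base URL given in the description.
WORK_ZONES = {"book", "article", "picture", "music", "map", "collection"}
ARTICLE_ZONES = {"newspaper", "gazette"}


def trove_url(zone: str, record_id: str) -> str:
    """Build a Trove URL for a tagged resource from its zone and record_id."""
    if zone in WORK_ZONES:
        base = "https://trove.nla.gov.au/work/"
    elif zone in ARTICLE_ZONES:
        base = "https://trove.nla.gov.au/article/"
    elif zone == "list":
        base = "https://trove.nla.gov.au/list/"
    else:
        # Zones not mentioned in the description are rejected rather than guessed.
        raise ValueError(f"Unrecognised zone: {zone!r}")
    return f"{base}{record_id}"


# Hypothetical record ids, for illustration only.
example_work = trove_url("book", "123")
example_article = trove_url("newspaper", "456")
```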

    Notes:

    Works (such as books) in Trove can have tags attached at either work or version level. This dataset aggregates all tags at the work level, removing any duplicates.

    A single resource in Trove can appear in multiple zones – for example, a book that includes maps and illustrations might appear in the 'book', 'picture', and 'map' zones. This means that some of the tags will essentially be duplicates – harvested from different zones, but relating to the same resource. Depending on your needs, you might want to remove these duplicates.

    While most of the tags were added by Trove users, more than 500,000 tags were added by Trove itself in November 2009. I think these tags were automatically generated from related Wikipedia pages. Depending on your needs, you might want to exclude these by limiting the date range or zones.

    User content added to Trove, including tags, is available for reuse under a CC-BY-NC licence.
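
The notes above suggest two optional clean-up steps: removing cross-zone duplicates and excluding the tags added in November 2009. The sketch below is one possible way to do both. It assumes that the date column begins with a YYYY-MM prefix and that the same resource keeps the same record_id in every zone; neither is confirmed by the description. The sample rows and the function name `clean_tags` are hypothetical.

```python
from typing import Dict, Iterable, List


def clean_tags(rows: Iterable[Dict[str, str]], excluded_month: str = "2009-11") -> List[Dict[str, str]]:
    """Drop tags dated in excluded_month and keep one row per (tag, record_id).

    Assumes dates start with 'YYYY-MM' and that record_id is shared across zones.
    """
    seen = set()
    kept = []
    for row in rows:
        if row["date"].startswith(excluded_month):
            continue  # skip everything added in the excluded month
        key = (row["tag"], row["record_id"])
        if key in seen:
            continue  # same tag on the same resource, harvested from another zone
        seen.add(key)
        kept.append(row)
    return kept


# Hypothetical rows using the four documented columns.
sample_rows = [
    {"tag": "gold rush", "date": "2012-03-04", "zone": "book", "record_id": "111"},
    {"tag": "gold rush", "date": "2012-03-04", "zone": "map", "record_id": "111"},
    {"tag": "wikipedia", "date": "2009-11-15", "zone": "book", "record_id": "222"},
    {"tag": "shipwreck", "date": "2015-07-01", "zone": "newspaper", "record_id": "333"},
]
cleaned = clean_tags(sample_rows)
```

Filtering by month also drops any user-added tags from November 2009, so a narrower filter may be preferable depending on the analysis.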

    See this notebook for some examples of how you can manipulate, analyse, and visualise the tag data.

  7. trove-newspaper-issues

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 14, 2024
    Cite
    Sherratt, Tim (2024). trove-newspaper-issues [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12547036
    Dataset updated
    Sep 14, 2024
    Dataset authored and provided by
    Sherratt, Tim
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains information about the published issues of newspapers digitised and made available through Trove. The data was harvested from the Trove API, using this notebook in the GLAM Workbench.

    There are two data files:

    newspaper_issues_totals_by_year.csv – the total number of newspaper issues per year for each digitised newspaper

    newspaper_issues.csv – a complete list of newspaper issues available from Trove

    newspaper_issues_totals_by_year.csv

    The dataset contains the following columns:

    Column      Contents
    title       newspaper title
    title_id    newspaper id
    state       place of publication
    year        year published
    issues      number of issues

    newspaper_issues.csv

    The dataset contains the following columns:

    Column      Contents
    title       newspaper title
    title_id    newspaper id
    state       place of publication
    issue_id    issue identifier
    issue_date  date of publication (YYYY-MM-DD)

    To keep the file size down, I haven't included an issue_url in this dataset, but these are easily generated from the issue_id. Just add the issue_id to the end of http://nla.gov.au/nla.news-issue. For example: http://nla.gov.au/nla.news-issue495426. Note that when you follow an issue url, you actually get redirected to the url of the first page in the issue.
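
Generating the issue URL is a simple string concatenation, as this short sketch shows. The prefix and the example id 495426 come from the description; the function name `issue_url` is illustrative.

```python
# Prefix given in the dataset description.
ISSUE_URL_PREFIX = "http://nla.gov.au/nla.news-issue"


def issue_url(issue_id) -> str:
    """Append an issue_id to the documented prefix to form the issue URL."""
    return f"{ISSUE_URL_PREFIX}{issue_id}"


# The example id quoted in the description.
example_issue_url = issue_url(495426)
```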

  8. Data from: Trove tag counts

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 6, 2024
    Cite
    Sherratt, Tim (2024). Trove tag counts [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7563922
    Dataset updated
    Jun 6, 2024
    Dataset authored and provided by
    Sherratt, Tim
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    This dataset was derived from the full harvest of Trove public tags. It contains a list of unique tags and the total number of resources in Trove each tag is attached to. It is formatted as a CSV file with the following columns:

    tag – the tag string

    count – number of resources the tag has been applied to

    User content added to Trove, including tags, is available for reuse under a CC-BY-NC-SA licence.

  9. Yelp Dataset

    • kaggle.com
    zip
    Updated Mar 17, 2022
    Cite
    Yelp, Inc. (2022). Yelp Dataset [Dataset]. https://www.kaggle.com/yelp-dataset/yelp-dataset
    Explore at:
    Available download formats: zip (4374983563 bytes)
    Dataset updated
    Mar 17, 2022
    Dataset provided by
    Yelp (http://yelp.com/)
    Authors
    Yelp, Inc.
    Description

    Context

    This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.

    Content

    This dataset contains five JSON files and the user agreement. More information about those files can be found here.

    Code snippet to read the files

    In Python, you can read the JSON files like this (using the json and pandas libraries):

    import json
    import pandas as pd

    # Each line of the file holds one JSON object, so parse it line by line.
    with open("yelp_academic_dataset_checkin.json", encoding="utf-8") as data_file:
        data = [json.loads(line) for line in data_file]

    # Build a DataFrame from the list of parsed records.
    checkin_df = pd.DataFrame(data)

  10. wragge/trove-newspaper-totals-historical: v1.0.0

    • zenodo.org
    zip
    Updated Jul 9, 2022
    + more versions
    Cite
    Tim Sherratt (2022). wragge/trove-newspaper-totals-historical: v1.0.0 [Dataset]. http://doi.org/10.5281/zenodo.6470479
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 9, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tim Sherratt
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This repository contains past harvests of the number of digitised newspaper articles available through Trove. These harvests were created between 2011 and 2022:

    • 12 April 2011
    • 4 August 2011
    • 12 September 2014
    • 29 November 2015
    • 14 December 2016
    • 28 July 2019
    • 10 July 2020
    • 27 April 2021
    • 21 January 2022

    It's possible I might find additional harvests and add them to the repository in the future.

    Since April 2022, datasets have been automatically created every week and saved in this repository.

    Dataset details

    The datapackage.json file contains a description of all the datasets using the Frictionless Data standard.

    Datasets are saved in the data directory as CSV files. There are two types of harvest – one captures the total number of articles per year, while the other breaks the totals down by year and state. The harvest date is embedded in the file title (in YYYYMMDD format).
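
Because the harvest date is embedded in each file title, it can be recovered with a regular expression. The sketch below assumes file names follow exactly the two patterns named in this description; the function name `parse_harvest_filename` and the specific file names in the usage lines are hypothetical, constructed from those patterns.

```python
import re
from datetime import date, datetime

# Matches total_articles_by_year_YYYYMMDD.csv and
# total_articles_by_year_and_state_YYYYMMDD.csv.
FILENAME_PATTERN = re.compile(
    r"^total_articles_by_year(?P<by_state>_and_state)?_(?P<stamp>\d{8})\.csv$"
)


def parse_harvest_filename(filename: str):
    """Return (harvest kind, harvest date) parsed from a harvest file name."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Not a recognised harvest file name: {filename!r}")
    harvested = datetime.strptime(match.group("stamp"), "%Y%m%d").date()
    kind = "by_year_and_state" if match.group("by_state") else "by_year"
    return kind, harvested


# Hypothetical file names built from the documented patterns.
kind_a, date_a = parse_harvest_filename("total_articles_by_year_20110412.csv")
kind_b, date_b = parse_harvest_filename("total_articles_by_year_and_state_20220121.csv")
```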

    total_articles_by_year_YYYYMMDD.csv

    These datasets are saved as CSV files containing the following columns:

    • year: year of original publication of newspaper article
    • total: total number of articles from that year available in Trove

    total_articles_by_year_and_state_YYYYMMDD.csv

    These datasets are saved as CSV files containing the following columns:

    • state: state in which newspaper article was originally published
    • year: year of original publication of newspaper article
    • total: total number of articles from that year and state available in Trove

    Trove uses the following values for state:

    • ACT
    • International
    • National
    • New South Wales
    • Northern Territory
    • Queensland
    • South Australia
    • Tasmania
    • Victoria
    • Western Australia

    Method

    The method for harvesting this data has changed over time. Harvests from 2011 were screen scraped from the Trove website. Harvests after 2012 make use of the year and state facets from the Trove API. The data was stored in a variety of locations, such as this archived page, my Plotly account, and the Trove Newspapers GLAM Workbench repository. To create this repository, I've retrieved the harvested data from these locations and converted the datasets to CSV files. Column headings have been normalised, but none of the values have been changed.

    For current examples of harvesting this sort of data see Visualise the total number of newspaper articles in Trove by year and state in the GLAM Workbench.

  11. Treasure Trove Lane Cross Street Data in Miami, FL

    • ownerly.com
    Updated Apr 4, 2022
    Cite
    Ownerly (2022). Treasure Trove Lane Cross Street Data in Miami, FL [Dataset]. https://www.ownerly.com/fl/miami/treasure-trove-ln-home-details
    Dataset updated
    Apr 4, 2022
    Dataset authored and provided by
    Ownerly
    Area covered
    Miami, Florida, Treasure Trove Lane
    Description

    This dataset provides information about the number of properties, residents, and average property values for Treasure Trove Lane cross streets in Miami, FL.

  12. Eximpedia Export Import Trade

    • eximpedia.app
    Updated Jan 9, 2025
    Cite
    Seair Exim (2025). Eximpedia Export Import Trade [Dataset]. https://www.eximpedia.app/
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Jan 9, 2025
    Dataset provided by
    Eximpedia Export Import Trade Data
    Eximpedia PTE LTD
    Authors
    Seair Exim
    Area covered
    Cabo Verde, Switzerland, Malawi, Cuba, Sao Tome and Principe, Singapore, Bangladesh, Australia, Oman, Thailand
    Description

    Opal Trove Llc Company Export Import Records. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.

  13. Treasure Trove Domains LLC Whois Database | Whois Data Center

    • whoisdatacenter.com
    csv
    Updated Jul 11, 2025
    Cite
    AllHeart Web Inc (2025). Treasure Trove Domains LLC Whois Database | Whois Data Center [Dataset]. https://whoisdatacenter.com/registrar/2897/
    Explore at:
    Available download formats: csv
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    AllHeart Web
    Authors
    AllHeart Web Inc
    License

    https://whoisdatacenter.com/terms-of-use/

    Time period covered
    Jul 13, 2025 - Dec 31, 2025
    Description

    Treasure Trove Domains LLC Whois Database, discover comprehensive ownership details, registration dates, and more for Treasure Trove Domains LLC with Whois Data Center.

  14. GLAM-Workbench/trove-newspaper-titles-web-archives

    • zenodo.org
    zip
    Updated Sep 13, 2024
    Cite
    Tim Sherratt (2024). GLAM-Workbench/trove-newspaper-titles-web-archives [Dataset]. http://doi.org/10.5281/zenodo.13756837
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 13, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tim Sherratt
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Current version: v1.1

    The number of digitised newspapers available through Trove has increased dramatically since 2009. Understanding when newspapers were added is important for historiographical purposes, but there's no data about this available directly from Trove. These datasets were created by harvesting information about newspaper titles in Trove from web archives. The harvesting method is documented by Gathering historical data about the addition of newspaper titles to Trove in the GLAM Workbench.

    This dataset contains two files:

    • trove_newspaper_titles_2009_2021.csv
    • trove_newspaper_titles_first_appearance_2009_2021.csv

    trove_newspaper_titles_2009_2021.csv

    CSV formatted data file containing details of newspaper titles extracted from web archive captures.

    This file contains the following columns:

    Column             Contents
    title_id           title identifier
    full_title         full title (including location and dates)
    title              newspaper title
    place              place of publication
    dates              date range in Trove
    capture_date       date of web archive capture
    capture_timestamp  timestamp of web archive capture

    trove_newspaper_titles_first_appearance_2009_2021.csv

    CSV formatted data file containing details of the first appearance of newspaper titles in web archive captures, indicating when the titles were (approximately) added to Trove. The complete list of captures has been filtered to include only the first appearance of each title / place / date range combination.

    The file contains the following columns:

    Column             Contents
    title_id           title identifier
    full_title         full title (including location and dates)
    title              newspaper title
    place              place of publication
    dates              date range in Trove
    capture_date       date of web archive capture
    capture_timestamp  timestamp of web archive capture
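
The "first appearance" filtering described for this file can be sketched as follows. This is not the author's harvesting code, just one way to keep the earliest capture for each title / place / date range combination. It assumes capture_timestamp values sort chronologically as strings; the sample rows and the function name `first_appearances` are hypothetical.

```python
from typing import Dict, Iterable, List


def first_appearances(captures: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the earliest capture for each (title, place, dates) combination.

    Assumes capture_timestamp strings sort in chronological order.
    """
    earliest: Dict[tuple, Dict[str, str]] = {}
    for row in captures:
        key = (row["title"], row["place"], row["dates"])
        current = earliest.get(key)
        if current is None or row["capture_timestamp"] < current["capture_timestamp"]:
            earliest[key] = row
    # Return the surviving rows ordered by when they first appeared.
    return sorted(earliest.values(), key=lambda r: r["capture_timestamp"])


# Hypothetical capture rows using the documented column names.
captures = [
    {"title_id": "1", "title": "Example Herald", "place": "Sydney",
     "dates": "1900 - 1910", "capture_timestamp": "20100301000000"},
    {"title_id": "1", "title": "Example Herald", "place": "Sydney",
     "dates": "1900 - 1910", "capture_timestamp": "20090601000000"},
    {"title_id": "1", "title": "Example Herald", "place": "Sydney",
     "dates": "1900 - 1920", "capture_timestamp": "20120101000000"},
]
first = first_appearances(captures)
```

A changed date range counts as a new combination, so the same title can appear more than once in the result.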

  15. A Treasure Trove of Trials (CSV Data)

    • hub.arcgis.com
    Updated May 22, 2018
    Cite
    Library of Congress Online GIS Portal (2018). A Treasure Trove of Trials (CSV Data) [Dataset]. https://hub.arcgis.com/datasets/29265cac2d2844acbda5df0a6ee0df46
    Dataset updated
    May 22, 2018
    Dataset authored and provided by
    Library of Congress Online GIS Portal
    Description

    This data set represents information concerning the Law Library of Congress's collection of Piracy Trials. Included in the data set are the title of the trial, the location where the trial took place, the date of publication and the URL for the primary source. From this data set, a web map ("Piracy Map with Trial Locations") was created and added to a story map titled "A Treasure Trove of Trials." The aforementioned story map provides an overview of a Law Library digitized collection known as "Piracy Trials" and highlights some of its content. Included in this collection is an item where a woman pirate was the person on trial. The map points show readers where some of these cases were tried and provide links to individual primary sources. The bibliography includes other sources of interest from throughout the Library. It also compiles a series of sources from the Library's collection where women pirates are the subject. Produced by Francisco Macias, Law Library of Congress.

  16. Video Game Ratings Dataset

    • kaggle.com
    Updated May 23, 2024
    Cite
    AP\ (2024). Video Game Ratings Dataset [Dataset]. https://www.kaggle.com/datasets/dem0nking/video-game-ratings-dataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 23, 2024
    Dataset provided by
    Kaggle
    Authors
    AP\
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset provides detailed information about 120 different video games. Each entry in the dataset represents a video game with the following attributes:

    • Title: The name of the video game.
    • Genre: The category or type of gameplay, such as Action-Adventure, First-Person Shooter, RPG, etc.
    • Platform: The gaming system(s) on which the game can be played, such as PC, PlayStation, Xbox, Switch, or Multi-platform.
    • ReleaseYear: The year in which the game was released.
    • NumPlayers: The maximum number of players that can play the game simultaneously.
    • AvgRating: The average rating of the game, typically on a scale from 0 to 10.

  17. Treasure Trove Of Australian History Returns Home

    • data.nsw.gov.au
    pdf
    Updated Sep 8, 2021
    Cite
    NSW Government (2021). Treasure Trove Of Australian History Returns Home [Dataset]. https://data.nsw.gov.au/data/dataset/3-13861-treasure-trove-of-australian-history-returns-home
    Explore at:
    Available download formats: pdf
    Dataset updated
    Sep 8, 2021
    Dataset provided by
    Government of New South Wales (http://nsw.gov.au/)
    Area covered
    Australia
    Description

    No notes provided

  18. Eximpedia Export Import Trade

    • eximpedia.app
    Updated Mar 23, 2025
    Cite
    Seair Exim (2025). Eximpedia Export Import Trade [Dataset]. https://www.eximpedia.app/
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Mar 23, 2025
    Dataset provided by
    Eximpedia Export Import Trade Data
    Eximpedia PTE LTD
    Authors
    Seair Exim
    Area covered
    Saint Martin (French part), Haiti, Guernsey, Cambodia, Mauritania, Uzbekistan, Myanmar, Iran (Islamic Republic of), Gambia, United Republic of
    Description

    Flavour Trove Company Export Import Records. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.

  19. Links to Trove shared on Twitter, 2009 to 2020

    • zenodo.org
    bin, csv, json
    Updated Jun 10, 2025
    Cite
    Tim Sherratt (2025). Links to Trove shared on Twitter, 2009 to 2020 [Dataset]. http://doi.org/10.5281/zenodo.15627800
    Explore at:
    Available download formats: csv, json, bin
    Dataset updated
    Jun 10, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tim Sherratt
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains information about links to resources in Trove that were shared on Twitter between 2009 and 2020.

    The tweet data was compiled using Twarc in May 2021, under Twitter's academic access program. The search queries used were:

    • url:nla.gov.au/nla.news
    • url:trove.nla.gov.au
    • url:newspapers.nla.gov.au

    From the raw tweet data I extracted the Trove urls from either the entities -> urls field, or by running a regular expression over the tweet text. Where necessary, I attempted to unshorten any shortened links.
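
The regular expression actually used is not given in the description. As a rough sketch only, the pattern below is an assumption built from the three search queries listed above, and the article and work ids in the sample text are hypothetical. It does not attempt to unshorten links.

```python
import re
from typing import List

# Assumed pattern covering the three domains/paths named in the search queries.
TROVE_URL_PATTERN = re.compile(
    r"https?://(?:trove\.nla\.gov\.au|newspapers\.nla\.gov\.au|nla\.gov\.au/nla\.news)[^\s<>]*",
    re.IGNORECASE,
)


def extract_trove_urls(text: str) -> List[str]:
    """Return Trove-like URLs found in text, with trailing punctuation trimmed."""
    return [url.rstrip(".,;:!?)") for url in TROVE_URL_PATTERN.findall(text)]


# Hypothetical tweet text for illustration.
sample_text = (
    "Found this: http://nla.gov.au/nla.news-article12345. "
    "Also https://trove.nla.gov.au/work/678 and https://example.com/x"
)
found_urls = extract_trove_urls(sample_text)
```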

    Many of the tweets were produced by bots. Using my Trove bots Twitter list, I separated the tweets into two files, one for bots and one for ordinary users.

    To respect user intentions and comply with the Twitter API terms of use, I've removed all the tweet information except for tweet_id and tweet_date from the files. If it hasn't been deleted, the full data for each tweet can be obtained from the X API using the tweet_id, though this would probably require a paid subscription.

    Links in tweets data

    The main data files are:

    • trove_url_tweets.csv – links shared by human users (although it may include some unidentified bots)
    • trove_url_tweets_bots.csv – links shared by bots

    Both files contain the following fields:

    • tweet_id
    • tweet_date
    • trove_url – the shared url
    • trove_type – type of Trove resource, possible values include:
      • article – an individual newspaper article
      • page – a page of a newspaper
      • title – a newspaper title
      • work – an individual Trove resource from outside the newspapers category
      • other – anything else, including search queries and links to the home page
    • trove_id – the identifier of the Trove resource (extracted from the url)

    In addition, the trove_url_tweets.csv file contains the following field:

    • nla_official – this is set to True or False and indicates whether the tweet originated from one of the NLA's official Twitter accounts.

    Some tweets contain multiple links. The datasets include one row for each link. This means that a single tweet_id can appear multiple times.

    Other data files

    In addition, I created a few derivative data files:

    • trove_url_totals.csv
    • active_users_per_year.csv
    • active_bots_per_year.csv

    trove_url_totals.csv

    This file contains information about the number of times each link was shared by users (not including bots). The file includes the following fields:

    • trove_id – Trove identifier, using this and trove_type you can query the Trove API for further information
    • trove_type – type of Trove resource, possible values include:
      • article – an individual newspaper article
      • page – a page of a newspaper
      • title – a newspaper title
      • work – an individual Trove resource from outside the newspapers category
      • other – anything else, including search queries and links to the home page
    • tweets – number of tweets containing a link to this resource
    • retweets – number of retweets containing a link to this resource
    • quotes – number of quote tweets containing a link to this resource
    • total – the total number of times this link was shared (sum of tweets, retweets, quotes)

    active_users_per_year.csv

    This file contains information about the number of unique users each year who shared a link to Trove. The file includes the following fields:

    • year
    • users – number of unique users who shared a Trove link in this year

    active_bots_per_year.csv

    This file contains information about the number of active bots each year that shared links to Trove. The file includes the following fields:

    • year
    • bots – number of bots that shared a Trove link in this year

    Some summary information

    Number of unique users sharing Trove links: 9,294
    Number of bots sharing Trove links: 43
    Number of tweets by humans containing Trove links: 48,293
    Number of tweets by bots containing Trove links: 318,797
    Number of unique links shared by humans: 36,886
    Number of unique links shared by bots: 270,501

    See this blog post for more information.

  20. GLAM-Workbench/trove-newspapers-non-english

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 14, 2024
    Cite
    Tim Sherratt (2024). GLAM-Workbench/trove-newspapers-non-english [Dataset]. http://doi.org/10.5281/zenodo.13761509
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tim Sherratt
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Current version: v1.1

    This dataset contains information about newspapers published in languages other than English that have been digitised and made available through Trove. Data about the languages present in newspapers was generated by harvesting a sample of articles from each newspaper using the Trove API, and then using language detection software on the OCRd text of each article. The method is documented in this notebook in the GLAM Workbench.

    There are two files:

    • newspapers_non_english.csv – list of the main languages detected for each newspaper with non-English language content
    • non-english-newspapers.md – a markdown formatted list of all the newspapers with non-English language content

    newspapers_non_english.csv

    The dataset contains the following columns:

    Column         Contents
    id             newspaper id
    title          newspaper title
    language       language code
    proportion     proportion of articles in this language
    number         number of articles sampled
    language_full  full language name

    non-english-newspapers.md

    This is a markdown-formatted list created by grouping the dataset by newspaper title. It includes details of the main languages in each newspaper.
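
A grouping of this kind might look like the sketch below. It is not the code that produced non-english-newspapers.md, and the output layout is a guess. It assumes the proportion column holds a fraction between 0 and 1; the sample rows and the function name `languages_markdown` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def languages_markdown(rows: Iterable[Dict[str, str]]) -> str:
    """Group rows by newspaper title and render a nested markdown list.

    Assumes 'proportion' is a fraction between 0 and 1.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["title"]].append(row)
    lines = []
    for title in sorted(grouped):
        lines.append(f"- {title}")
        # List each newspaper's languages from most to least common.
        ranked = sorted(grouped[title], key=lambda r: float(r["proportion"]), reverse=True)
        for row in ranked:
            lines.append(f"  - {row['language_full']} ({float(row['proportion']):.0%})")
    return "\n".join(lines)


# Hypothetical rows using the documented column names.
sample_languages = [
    {"id": "1", "title": "Example Zeitung", "language": "en",
     "proportion": "0.25", "number": "100", "language_full": "English"},
    {"id": "1", "title": "Example Zeitung", "language": "de",
     "proportion": "0.75", "number": "100", "language_full": "German"},
]
markdown_list = languages_markdown(sample_languages)
```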
