Datasets used for datatrove testing. Each split contains the same data: dst = [ {"text": "hello"}, {"text": "world"}, {"text": "how"}, {"text": "are"}, {"text": "you"} ]
Based on the split name, the data are sharded into n bins.
The National Library of Australia operates the Trove People and Organisations zone (http://trove.nla.gov.au/people), which allows users to access information about significant people and organisations (parties), as well as related biographical and contextual information.
The Trove People and Organisations dataset is based on the Australian Name Authority File, a unique resource maintained since 1981 by Australian libraries which contribute their holdings to Libraries Australia (http://librariesaustralia.nla.gov.au). The Trove People and Organisations zone plays an important role in exposing records about parties and linking to them in libraries and other collecting institutions. The data also provides links to resources by and about a party and relationships between parties.
To further enrich the service the Library is collaborating with organisations that already make available information about people and organisations in their specific domains and linking to them.
The API to this dataset provides access to 885,000 identities.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
newspapers with articles published after 1954
Current version: v1.7
Due to copyright restrictions, most of the digitised newspaper articles on Trove were published before 1955. However, some articles published after 1954 have been made available. This repository provides data about digitised newspapers in Trove that have articles available from after 1954 (the 'copyright cliff of death').
The data was extracted from the Trove API using this notebook from the Trove newspapers section of the GLAM Workbench.
The data is available as a CSV file entitled newspapers_post_54.csv and contains the following fields:
- title – the full title of the newspaper
- state – the state in which the newspaper was published
- id – Trove's unique identifier for this newspaper
- startDate – the earliest date of articles from this newspaper available in Trove
- endDate – the latest date of articles from this newspaper available in Trove
- issn – ISSN
- number_of_articles – the number of articles from this newspaper published after 1954 available in Trove
- troveUrl – link to more information about this newspaper
This repository is part of the GLAM Workbench. If you think this project is worthwhile, you might like to sponsor me on GitHub.
This repo holds files that are used in Trove examples.
tevatron_msmarco_passage_aug_qrel.jsonl contains the qrels extracted from train.jsonl.gz file in Tevatron/msmarco-passage-aug repo.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Current version: v1.4

Trove users can create collections of resources using Trove's 'lists'. Metadata describing public lists is available via the Trove API. This dataset was created by harvesting this metadata. To reduce file size, the details of the resources collected by each list are not included, just the total number of resources. The data was extracted from the Trove API using this notebook from the Trove lists and tags section of the GLAM Workbench.

The data is available as a CSV file entitled trove-lists.csv and contains the following fields:
- created – date the list was created
- id – Trove's unique list identifier
- number_items – number of resources in list
- title – the title of this list
- updated – date the list was last updated
This repository is part of the GLAM Workbench. If you think this project is worthwhile, you might like to sponsor me on GitHub.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains details of 2,495,958 unique public tags added to 10,403,650 resources in Trove between August 2008 and June 2024. I harvested the data using the Trove API and saved it as a CSV file with the following columns:
- tag – lower-cased text tag
- date – date the tag was added
- zone – API zone containing the tagged resource
- record_id – the identifier of the tagged resource
I've documented the method used to harvest the tags in this notebook.
Using the zone and record_id you can find more information about a tagged item. To create urls to the resources in Trove:

- for resources in the 'book', 'article', 'picture', 'music', 'map', and 'collection' zones, add the record_id to https://trove.nla.gov.au/work/
- for resources in the 'newspaper' and 'gazette' zones, add the record_id to https://trove.nla.gov.au/article/
- for resources in the 'list' zone, add the record_id to https://trove.nla.gov.au/list/
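The zone-to-url rules above can be sketched as a small Python helper (the names ZONE_BASE_URLS and trove_url are my own, not part of any Trove library):

```python
# Map each Trove API zone to the base url for its resources
ZONE_BASE_URLS = {
    **dict.fromkeys(
        ["book", "article", "picture", "music", "map", "collection"],
        "https://trove.nla.gov.au/work/",
    ),
    **dict.fromkeys(["newspaper", "gazette"], "https://trove.nla.gov.au/article/"),
    "list": "https://trove.nla.gov.au/list/",
}

def trove_url(zone, record_id):
    """Build the public Trove url for a tagged resource."""
    return ZONE_BASE_URLS[zone] + str(record_id)
```

For example, trove_url("newspaper", 12345) produces https://trove.nla.gov.au/article/12345.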
Notes:
Works (such as books) in Trove can have tags attached at either work or version level. This dataset aggregates all tags at the work level, removing any duplicates.
A single resource in Trove can appear in multiple zones – for example, a book that includes maps and illustrations might appear in the 'book', 'picture', and 'map' zones. This means that some of the tags will essentially be duplicates – harvested from different zones, but relating to the same resource. Depending on your needs, you might want to remove these duplicates.
While most of the tags were added by Trove users, more than 500,000 tags were added by Trove itself in November 2009. I think these tags were automatically generated from related Wikipedia pages. Depending on your needs, you might want to exclude these by limiting the date range or zones.
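For the cross-zone duplicates mentioned in the notes, a pandas sketch (this assumes the harvest has been loaded into a DataFrame with the tag, zone, and record_id columns described above, and that tag + record_id is an appropriate dedup key for your purposes):

```python
import pandas as pd

# Toy stand-in for the harvested tag data
df = pd.DataFrame([
    {"tag": "maps", "zone": "book", "record_id": "123"},
    {"tag": "maps", "zone": "map", "record_id": "123"},   # same resource, different zone
    {"tag": "maps", "zone": "book", "record_id": "456"},
])

# Keep one row per tag/resource pair, ignoring which zone it was harvested from
deduped = df.drop_duplicates(subset=["tag", "record_id"])
```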
User content added to Trove, including tags, is available for reuse under a CC-BY-NC licence.
See this notebook for some examples of how you can manipulate, analyse, and visualise the tag data.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains information about the published issues of newspapers digitised and made available through Trove. The data was harvested from the Trove API, using this notebook in the GLAM Workbench.
There are two data files:
- newspaper_issues_totals_by_year.csv – the total number of newspaper issues per year for each digitised newspaper
- newspaper_issues.csv – a complete list of newspaper issues available from Trove
newspaper_issues_totals_by_year.csv
The dataset contains the following columns:
Column | Contents |
---|---|
title | newspaper title |
title_id | newspaper id |
state | place of publication |
year | year published |
issues | number of issues |
newspaper_issues.csv
The dataset contains the following columns:
Column | Contents |
---|---|
title | newspaper title |
title_id | newspaper id |
state | place of publication |
issue_id | issue identifier |
issue_date | date of publication (YYYY-MM-DD) |
To keep the file size down, I haven't included an issue_url in this dataset, but these are easily generated from the issue_id. Just add the issue_id to the end of http://nla.gov.au/nla.news-issue. For example: http://nla.gov.au/nla.news-issue495426. Note that when you follow an issue url, you actually get redirected to the url of the first page in the issue.
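Generating an issue url as described above is a simple string join, for example (the function name issue_url is my own):

```python
def issue_url(issue_id):
    # Append the issue_id to the base url; following the link
    # redirects to the first page of the issue
    return f"http://nla.gov.au/nla.news-issue{issue_id}"
```

issue_url(495426) reproduces the example url given above.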
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was derived from the full harvest of Trove public tags. It contains a list of unique tags and the total number of resources in Trove each tag is attached to. It is formatted as a CSV file with the following columns:
- tag – the tag string
- count – number of resources the tag has been applied to
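A derivation of this kind can be sketched with pandas, assuming the full tag harvest has been loaded into a DataFrame with the tag and record_id columns described in the full harvest dataset:

```python
import pandas as pd

# Toy stand-in for the full harvest of public tags
tags = pd.DataFrame([
    {"tag": "ww1", "record_id": "1"},
    {"tag": "ww1", "record_id": "2"},
    {"tag": "ww1", "record_id": "2"},  # same tag on the same resource twice
    {"tag": "poetry", "record_id": "3"},
])

# Count the unique resources each tag is attached to
counts = (
    tags.groupby("tag")["record_id"]
    .nunique()
    .reset_index(name="count")
)
```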
User content added to Trove, including tags, is available for reuse under a CC-BY-NC-SA licence.
This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.
This dataset contains five JSON files and the user agreement. More information about those files can be found here.
In Python, you can read the JSON files like this (using the json and pandas libraries):

```python
import json
import pandas as pd

# Each line of the file is a separate JSON object
data = []
with open("yelp_academic_dataset_checkin.json") as data_file:
    for line in data_file:
        data.append(json.loads(line))

checkin_df = pd.DataFrame(data)
```
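As a shorter alternative, pandas can parse newline-delimited JSON directly; a sketch using a small sample file in place of the real Yelp data:

```python
import json
import pandas as pd

# Write a tiny sample in the same one-JSON-object-per-line format as the Yelp files
sample = [
    {"business_id": "abc", "count": 3},
    {"business_id": "xyz", "count": 5},
]
with open("sample_checkin.json", "w") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

# lines=True tells pandas that each line is a separate JSON object
checkin_df = pd.read_json("sample_checkin.json", lines=True)
```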
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains past harvests of the number of digitised newspaper articles available through Trove. These harvests were created between 2011 and 2022:
It's possible I might find additional harvests and add them to the repository in the future.
Since April 2022, datasets have been automatically created every week and saved in this repository.
Dataset details
The datapackage.json file contains a description of all the datasets using the Frictionless Data standard.

Datasets are saved in the data directory as CSV files. There are two types of harvest – one captures the total number of articles per year, while the other breaks the totals down by year and state. The harvest date is embedded in the file title (in YYYYMMDD format).
total_articles_by_year_YYYYMMDD.csv

These datasets are saved as CSV files containing the following columns:

- year: year of original publication of newspaper article
- total: total number of articles from that year available in Trove

total_articles_by_year_and_state_YYYYMMDD.csv

These datasets are saved as CSV files containing the following columns:

- state: state in which newspaper article was originally published
- year: year of original publication of newspaper article
- total: total number of articles from that year and state available in Trove

Trove uses the following values for state:
Method
The method for harvesting this data has changed over time. Harvests from 2011 were screen scraped from the Trove website. Harvests after 2012 make use of the year and state facets from the Trove API. The data was stored in a variety of locations, such as this archived page, my Plotly account, and the Trove Newspapers GLAM Workbench repository. To create this repository, I've retrieved the harvested data from these locations and converted the datasets to CSV files. Column headings have been normalised, but none of the values have been changed.
For current examples of harvesting this sort of data see Visualise the total number of newspaper articles in Trove by year and state in the GLAM Workbench.
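The facet-based harvesting mentioned in the Method section boils down to a request like the one sketched below. The endpoint and parameter names follow the Trove v2 API as I understand it; check the current Trove API documentation (and obtain an API key) before relying on them.

```python
from urllib.parse import urlencode

API_BASE = "https://api.trove.nla.gov.au/v2/result"

def facet_query_url(api_key, facet="year"):
    """Build a Trove API query url that returns per-facet article totals."""
    params = {
        "q": " ",             # blank query matches everything
        "zone": "newspaper",
        "facet": facet,       # e.g. "year" or "state"
        "encoding": "json",
        "key": api_key,       # your Trove API key
    }
    return f"{API_BASE}?{urlencode(params)}"
```

The totals can then be read from the facet counts in the JSON response rather than by paging through every article.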
This dataset provides information about the number of properties, residents, and average property values for Treasure Trove Lane cross streets in Miami, FL.
Opal Trove Llc Company Export Import Records. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
https://whoisdatacenter.com/terms-of-use/
Treasure Trove Domains LLC Whois Database, discover comprehensive ownership details, registration dates, and more for Treasure Trove Domains LLC with Whois Data Center.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Current version: v1.1
The number of digitised newspapers available through Trove has increased dramatically since 2009. Understanding when newspapers were added is important for historiographical purposes, but there's no data about this available directly from Trove. These datasets were created by harvesting information about newspaper titles in Trove from web archives. The harvesting method is documented by Gathering historical data about the addition of newspaper titles to Trove in the GLAM Workbench.
This dataset contains two files:
- trove_newspaper_titles_2009_2021.csv
- trove_newspaper_titles_first_appearance_2009_2021.csv
trove_newspaper_titles_2009_2021.csv
CSV formatted data file containing details of newspaper titles extracted from web archive captures.
This file contains the following columns:
Column | Contents |
---|---|
title_id | title identifier |
full_title | full title (including location and dates) |
title | newspaper title |
place | place of publication |
dates | date range in Trove |
capture_date | date of web archive capture |
capture_timestamp | timestamp of web archive capture |
trove_newspaper_titles_first_appearance_2009_2021.csv
CSV formatted data file containing details of the first appearance of newspaper titles in web archive captures, indicating when the titles were (approximately) added to Trove. The complete list of captures has been filtered to include only the first appearance of each title / place / date range combination.
The file contains the following columns:
Column | Contents |
---|---|
title_id | title identifier |
full_title | full title (including location and dates) |
title | newspaper title |
place | place of publication |
dates | date range in Trove |
capture_date | date of web archive capture |
capture_timestamp | timestamp of web archive capture |
This data set represents information concerning the Law Library of Congress's collection of Piracy Trials. Included in the data set are the title of the trial, the location where the trial took place, the date of publication, and the URL for the primary source. From this data set, a web map ("Piracy Map with Trial Locations") was created and added to a story map titled "A Treasure Trove of Trials." The story map provides an overview of a Law Library digitized collection known as "Piracy Trials" and highlights some of its content, including an item where a woman pirate was the person on trial. The map points show readers where some of these cases were tried and provide links to individual primary sources. The bibliography includes other sources of interest from throughout the Library. It also compiles a series of sources from the Library's collection where women pirates are the subject.

Produced by Francisco Macias, Law Library of Congress.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides detailed information about 120 different video games. Each entry in the dataset represents a video game with the following attributes:
No notes provided
Flavour Trove Company Export Import Records. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains information about links to resources in Trove that were shared on Twitter between 2009 and 2020.
The tweet data was compiled using Twarc in May 2021, under Twitter's academic access program. The search queries used were:
url:nla.gov.au/nla.news
url:trove.nla.gov.au
url:newspapers.nla.gov.au
From the raw tweet data I extracted the Trove urls from either the entities -> urls field, or by running a regular expression over the tweet text. Where necessary, I attempted to unshorten any shortened links.
Many of the tweets were produced by bots. Using my Trove bots Twitter list, I separated the tweets into two files, one for bots and one for ordinary users.
To respect user intentions and comply with the Twitter API terms of use, I've removed all the tweet information except for tweet_id and tweet_date from the files. If it hasn't been deleted, the full data for each tweet can be obtained from the X API using the tweet_id, though this would probably require a paid subscription.
The main data files are:

- trove_url_tweets.csv – links shared by human users (although it may include some unidentified bots)
- trove_url_tweets_bots.csv – links shared by bots

Both files contain the following fields:

- tweet_id
- tweet_date
- trove_url – the shared url
- trove_type – type of Trove resource, possible values include:
  - article – an individual newspaper article
  - page – a page of a newspaper
  - title – a newspaper title
  - work – an individual Trove resource from outside the newspapers category
  - other – anything else, including search queries and links to the home page
- trove_id – the identifier of the Trove resource (extracted from the url)

In addition, the trove_url_tweets.csv file contains the following field:

- nla_official – set to True or False, indicating whether the tweet originated from one of the NLA's official Twitter accounts

Some tweets contain multiple links. The datasets include one row for each link. This means that a single tweet_id can appear multiple times.
In addition, I created a few derivative data files:

- trove_url_totals.csv
- active_users_per_year.csv
- active_bots_per_year.csv

trove_url_totals.csv contains information about the number of times each link was shared by users (not including bots). The file includes the following fields:

- trove_id – Trove identifier; using this and trove_type you can query the Trove API for further information
- trove_type – type of Trove resource, possible values include:
  - article – an individual newspaper article
  - page – a page of a newspaper
  - title – a newspaper title
  - work – an individual Trove resource from outside the newspapers category
  - other – anything else, including search queries and links to the home page
- tweets – number of tweets containing a link to this resource
- retweets – number of retweets containing a link to this resource
- quotes – number of quote tweets containing a link to this resource
- total – the total number of times this link was shared (sum of tweets, retweets, quotes)

active_users_per_year.csv contains information about the number of unique users each year who shared a link to Trove. The file includes the following fields:

- year
- users – number of unique users who shared a Trove link in this year

active_bots_per_year.csv contains information about the number of active bots each year that shared links to Trove. The file includes the following fields:

- year
- bots – number of bots that shared a Trove link in this year

Some overall statistics:

Number of unique users sharing Trove links | 9,294 |
Number of bots sharing Trove links | 43 |
Number of tweets by humans containing Trove links | 48,293 |
Number of tweets by bots containing Trove links | 318,797 |
Number of unique links shared by humans | 36,886 |
Number of unique links shared by bots | 270,501 |
See this blog post for more information.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Current version: v1.1
This dataset contains information about newspapers published in languages other than English that have been digitised and made available through Trove. Data about the languages present in newspapers was generated by harvesting a sample of articles from each newspaper using the Trove API, and then using language detection software on the OCRd text of each article. The method is documented in this notebook in the GLAM Workbench.
There are two files:
- newspapers_non_english.csv – list of the main languages detected for each newspaper with non-English language content
- non-english-newspapers.md – a markdown formatted list of all the newspapers with non-English language content

newspapers_non_english.csv
The dataset contains the following columns:
Column | Contents |
---|---|
id | newspaper id |
title | newspaper title |
language | language code |
proportion | proportion of articles in this language |
number | number of articles sampled |
language_full | full language name |
non-english-newspapers.md
This is a markdown-formatted list created by grouping the dataset by newspaper title. It includes details of the main languages in each newspaper.