Datasets used for datatrove testing. Each split contains the same data: dst = [ {"text": "hello"}, {"text": "world"}, {"text": "how"}, {"text": "are"}, {"text": "you"} ]
Based on the split name, the data are sharded into n bins.
The National Library of Australia operates the Trove People and Organisations zone (http://trove.nla.gov.au/people), which allows users to access information about significant people and organisations (parties), as well as related biographical and contextual information.
The Trove People and Organisations dataset is based on the Australian Name Authority File, a unique resource maintained since 1981 by Australian libraries which contribute their holdings to Libraries Australia (http://librariesaustralia.nla.gov.au). The Trove People and Organisations zone plays an important role in exposing records about parties and linking to them in libraries and other collecting institutions. The data also provides links to resources by and about a party and relationships between parties.
To further enrich the service the Library is collaborating with organisations that already make available information about people and organisations in their specific domains and linking to them.
The API to this dataset provides access to 885,000 identities.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
newspapers with articles published after 1954
Current version: v1.7
Due to copyright restrictions, most of the digitised newspaper articles on Trove were published before 1955. However, some articles published after 1954 have been made available. This repository provides data about digitised newspapers in Trove that have articles available from after 1954 (the 'copyright cliff of death').
The data was extracted from the Trove API using this notebook from the Trove newspapers section of the GLAM Workbench.
The data is available as a CSV file entitled newspapers_post_54.csv and contains the following fields:
- title – the full title of the newspaper
- state – the state in which the newspaper was published
- id – Trove's unique identifier for this newspaper
- startDate – the earliest date of articles from this newspaper available in Trove
- endDate – the latest date of articles from this newspaper available in Trove
- issn – ISSN
- number_of_articles – the number of articles from this newspaper published after 1954 available in Trove
- troveUrl – link to more information about this newspaper
This repository is part of the GLAM Workbench. If you think this project is worthwhile, you might like to sponsor me on GitHub.
This repo holds files that are used in Trove examples.
tevatron_msmarco_passage_aug_qrel.jsonl contains the qrels extracted from train.jsonl.gz file in Tevatron/msmarco-passage-aug repo.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Current version: v1.4

Trove users can create collections of resources using Trove's 'lists'. Metadata describing public lists is available via the Trove API. This dataset was created by harvesting this metadata. To reduce file size, the details of the resources collected by each list are not included, just the total number of resources. The data was extracted from the Trove API using this notebook from the Trove lists and tags section of the GLAM Workbench.

The data is available as a CSV file entitled trove-lists.csv and contains the following fields:
- created – date the list was created
- id – Trove's unique list identifier
- number_items – number of resources in list
- title – the title of this list
- updated – date the list was last updated
This repository is part of the GLAM Workbench. If you think this project is worthwhile, you might like to sponsor me on GitHub.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains details of 2,495,958 unique public tags added to 10,403,650 resources in Trove between August 2008 and June 2024. I harvested the data using the Trove API and saved it as a CSV file with the following columns:
- tag – lower-cased text tag
- date – date the tag was added
- zone – API zone containing the tagged resource
- record_id – the identifier of the tagged resource
I've documented the method used to harvest the tags in this notebook.
Using the zone and record_id you can find more information about a tagged item. To create urls to the resources in Trove:

- for resources in the 'book', 'article', 'picture', 'music', 'map', and 'collection' zones, add the record_id to https://trove.nla.gov.au/work/
- for resources in the 'newspaper' and 'gazette' zones, add the record_id to https://trove.nla.gov.au/article/
- for resources in the 'list' zone, add the record_id to https://trove.nla.gov.au/list/
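The zone-to-url rules above can be sketched as a small Python helper (the names ZONE_BASE_URLS and trove_url are my own, not part of any Trove library):

```python
# Map each Trove API zone to the base url for its resources
ZONE_BASE_URLS = {
    **dict.fromkeys(
        ["book", "article", "picture", "music", "map", "collection"],
        "https://trove.nla.gov.au/work/",
    ),
    **dict.fromkeys(["newspaper", "gazette"], "https://trove.nla.gov.au/article/"),
    "list": "https://trove.nla.gov.au/list/",
}

def trove_url(zone, record_id):
    """Build the public Trove url for a tagged resource."""
    return ZONE_BASE_URLS[zone] + str(record_id)
```

For example, trove_url("newspaper", 12345) produces https://trove.nla.gov.au/article/12345.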
Notes:
Works (such as books) in Trove can have tags attached at either work or version level. This dataset aggregates all tags at the work level, removing any duplicates.
A single resource in Trove can appear in multiple zones – for example, a book that includes maps and illustrations might appear in the 'book', 'picture', and 'map' zones. This means that some of the tags will essentially be duplicates – harvested from different zones, but relating to the same resource. Depending on your needs, you might want to remove these duplicates.
While most of the tags were added by Trove users, more than 500,000 tags were added by Trove itself in November 2009. I think these tags were automatically generated from related Wikipedia pages. Depending on your needs, you might want to exclude these by limiting the date range or zones.
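For the cross-zone duplicates mentioned in the notes, a pandas sketch (this assumes the harvest has been loaded into a DataFrame with the tag, zone, and record_id columns described above, and that tag + record_id is an appropriate dedup key for your purposes):

```python
import pandas as pd

# Toy stand-in for the harvested tag data
df = pd.DataFrame([
    {"tag": "maps", "zone": "book", "record_id": "123"},
    {"tag": "maps", "zone": "map", "record_id": "123"},   # same resource, different zone
    {"tag": "maps", "zone": "book", "record_id": "456"},
])

# Keep one row per tag/resource pair, ignoring which zone it was harvested from
deduped = df.drop_duplicates(subset=["tag", "record_id"])
```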
User content added to Trove, including tags, is available for reuse under a CC-BY-NC licence.
See this notebook for some examples of how you can manipulate, analyse, and visualise the tag data.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains information about the published issues of newspapers digitised and made available through Trove. The data was harvested from the Trove API, using this notebook in the GLAM Workbench.
There are two data files:
- newspaper_issues_totals_by_year.csv – the total number of newspaper issues per year for each digitised newspaper
- newspaper_issues.csv – a complete list of newspaper issues available from Trove
newspaper_issues_totals_by_year.csv
The dataset contains the following columns:
Column | Contents |
---|---|
title | newspaper title |
title_id | newspaper id |
state | place of publication |
year | year published |
issues | number of issues |
newspaper_issues.csv
The dataset contains the following columns:
Column | Contents |
---|---|
title | newspaper title |
title_id | newspaper id |
state | place of publication |
issue_id | issue identifier |
issue_date | date of publication (YYYY-MM-DD) |
To keep the file size down, I haven't included an issue_url in this dataset, but these are easily generated from the issue_id. Just add the issue_id to the end of http://nla.gov.au/nla.news-issue. For example: http://nla.gov.au/nla.news-issue495426. Note that when you follow an issue url, you actually get redirected to the url of the first page in the issue.
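Generating an issue url as described above is a simple string join, for example (the function name issue_url is my own):

```python
def issue_url(issue_id):
    # Append the issue_id to the base url; following the link
    # redirects to the first page of the issue
    return f"http://nla.gov.au/nla.news-issue{issue_id}"
```

issue_url(495426) reproduces the example url given above.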
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was derived from the full harvest of Trove public tags. It contains a list of unique tags and the total number of resources in Trove each tag is attached to. It is formatted as a CSV file with the following columns:
- tag – the tag string
- count – number of resources the tag has been applied to
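A derivation of this kind can be sketched with pandas, assuming the full tag harvest has been loaded into a DataFrame with the tag and record_id columns described in the full harvest dataset:

```python
import pandas as pd

# Toy stand-in for the full harvest of public tags
tags = pd.DataFrame([
    {"tag": "ww1", "record_id": "1"},
    {"tag": "ww1", "record_id": "2"},
    {"tag": "ww1", "record_id": "2"},  # same tag on the same resource twice
    {"tag": "poetry", "record_id": "3"},
])

# Count the unique resources each tag is attached to
counts = (
    tags.groupby("tag")["record_id"]
    .nunique()
    .reset_index(name="count")
)
```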
User content added to Trove, including tags, is available for reuse under a CC-BY-NC-SA licence.
This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.
This dataset contains five JSON files and the user agreement. More information about those files can be found here.
In Python, you can read the JSON files like this (using the json and pandas libraries):

```python
import json
import pandas as pd

# Each line of the file is a separate JSON object
data = []
with open("yelp_academic_dataset_checkin.json") as data_file:
    for line in data_file:
        data.append(json.loads(line))

checkin_df = pd.DataFrame(data)
```
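As a shorter alternative, pandas can parse newline-delimited JSON directly; a sketch using a small sample file in place of the real Yelp data:

```python
import json
import pandas as pd

# Write a tiny sample in the same one-JSON-object-per-line format as the Yelp files
sample = [
    {"business_id": "abc", "count": 3},
    {"business_id": "xyz", "count": 5},
]
with open("sample_checkin.json", "w") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

# lines=True tells pandas that each line is a separate JSON object
checkin_df = pd.read_json("sample_checkin.json", lines=True)
```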
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains past harvests of the number of digitised newspaper articles available through Trove. These harvests were created between 2011 and 2022:
It's possible I might find additional harvests and add them to the repository in the future.
Since April 2022, datasets have been automatically created every week and saved in this repository.
Dataset details
The datapackage.json file contains a description of all the datasets using the Frictionless Data standard.

Datasets are saved in the data directory as CSV files. There are two types of harvest – one captures the total number of articles per year, while the other breaks the totals down by year and state. The harvest date is embedded in the file title (in YYYYMMDD format).
total_articles_by_year_YYYYMMDD.csv

These datasets are saved as CSV files containing the following columns:

- year: year of original publication of newspaper article
- total: total number of articles from that year available in Trove

total_articles_by_year_and_state_YYYYMMDD.csv

These datasets are saved as CSV files containing the following columns:

- state: state in which newspaper article was originally published
- year: year of original publication of newspaper article
- total: total number of articles from that year and state available in Trove

Trove uses the following values for state:
Method
The method for harvesting this data has changed over time. Harvests from 2011 were screen scraped from the Trove website. Harvests after 2012 make use of the year and state facets from the Trove API. The data was stored in a variety of locations, such as this archived page, my Plotly account, and the Trove Newspapers GLAM Workbench repository. To create this repository, I've retrieved the harvested data from these locations and converted the datasets to CSV files. Column headings have been normalised, but none of the values have been changed.
For current examples of harvesting this sort of data see Visualise the total number of newspaper articles in Trove by year and state in the GLAM Workbench.
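The facet-based harvesting mentioned in the Method section boils down to a request like the one sketched below. The endpoint and parameter names follow the Trove v2 API as I understand it; check the current Trove API documentation (and obtain an API key) before relying on them.

```python
from urllib.parse import urlencode

API_BASE = "https://api.trove.nla.gov.au/v2/result"

def facet_query_url(api_key, facet="year"):
    """Build a Trove API query url that returns per-facet article totals."""
    params = {
        "q": " ",             # blank query matches everything
        "zone": "newspaper",
        "facet": facet,       # e.g. "year" or "state"
        "encoding": "json",
        "key": api_key,       # your Trove API key
    }
    return f"{API_BASE}?{urlencode(params)}"
```

The totals can then be read from the facet counts in the JSON response rather than by paging through every article.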
This dataset provides information about the number of properties, residents, and average property values for Treasure Trove Lane cross streets in Miami, FL.
Opal Trove Llc Company Export Import Records. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
https://whoisdatacenter.com/terms-of-use/
Treasure Trove Domains LLC Whois Database, discover comprehensive ownership details, registration dates, and more for Treasure Trove Domains LLC with Whois Data Center.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Current version: v1.1
The number of digitised newspapers available through Trove has increased dramatically since 2009. Understanding when newspapers were added is important for historiographical purposes, but there's no data about this available directly from Trove. These datasets were created by harvesting information about newspaper titles in Trove from web archives. The harvesting method is documented by Gathering historical data about the addition of newspaper titles to Trove in the GLAM Workbench.
This dataset contains two files:
- trove_newspaper_titles_2009_2021.csv
- trove_newspaper_titles_first_appearance_2009_2021.csv
trove_newspaper_titles_2009_2021.csv
CSV formatted data file containing details of newspaper titles extracted from web archive captures.
This file contains the following columns:
Column | Contents |
---|---|
title_id | title identifier |
full_title | full title (including location and dates) |
title | newspaper title |
place | place of publication |
dates | date range in Trove |
capture_date | date of web archive capture |
capture_timestamp | timestamp of web archive capture |
trove_newspaper_titles_first_appearance_2009_2021.csv
CSV formatted data file containing details of the first appearance of newspaper titles in web archive captures, indicating when the titles were (approximately) added to Trove. The complete list of captures has been filtered to include only the first appearance of each title / place / date range combination.
The file contains the following columns:
Column | Contents |
---|---|
title_id | title identifier |
full_title | full title (including location and dates) |
title | newspaper title |
place | place of publication |
dates | date range in Trove |
capture_date | date of web archive capture |
capture_timestamp | timestamp of web archive capture |
This data set represents information concerning the Law Library of Congress's collection of Piracy Trials. Included in the data set are the title of the trial, the location where the trial took place, the date of publication, and the URL for the primary source. From this data set, a web map ("Piracy Map with Trial Locations") was created and added to a story map titled "A Treasure Trove of Trials." The story map provides an overview of a Law Library digitized collection known as "Piracy Trials" and highlights some of its content, including an item where a woman pirate was the person on trial. The map points show readers where some of these cases were tried and provide links to individual primary sources. The bibliography includes other sources of interest from throughout the Library. It also compiles a series of sources from the Library's collection where women pirates are the subject.

Produced by Francisco Macias, Law Library of Congress.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides detailed information about 120 different video games. Each entry in the dataset represents a video game with the following attributes:
No notes provided
Flavour Trove Company Export Import Records. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains information about links to resources in Trove that were shared on Twitter between 2009 and 2020.
The tweet data was compiled using Twarc in May 2021, under Twitter's academic access program. The search queries used were:
url:nla.gov.au/nla.news
url:trove.nla.gov.au
url:newspapers.nla.gov.au
From the raw tweet data I extracted the Trove urls from either the entities -> urls field, or by running a regular expression over the tweet text. Where necessary, I attempted to unshorten any shortened links.
Many of the tweets were produced by bots. Using my Trove bots Twitter list, I separated the tweets into two files, one for bots and one for ordinary users.
To respect user intentions and comply with the Twitter API terms of use, I've removed all the tweet information except for tweet_id and tweet_date from the files. If it hasn't been deleted, the full data for each tweet can be obtained from the X API using the tweet_id, though this would probably require a paid subscription.
The main data files are:

- trove_url_tweets.csv – links shared by human users (although it may include some unidentified bots)
- trove_url_tweets_bots.csv – links shared by bots

Both files contain the following fields:

- tweet_id
- tweet_date
- trove_url – the shared url
- trove_type – type of Trove resource, possible values include:
  - article – an individual newspaper article
  - page – a page of a newspaper
  - title – a newspaper title
  - work – an individual Trove resource from outside the newspapers category
  - other – anything else, including search queries and links to the home page
- trove_id – the identifier of the Trove resource (extracted from the url)

In addition, the trove_url_tweets.csv file contains the following field:

- nla_official – set to True or False, indicating whether the tweet originated from one of the NLA's official Twitter accounts

Some tweets contain multiple links. The datasets include one row for each link. This means that a single tweet_id can appear multiple times.
In addition, I created a few derivative data files:

- trove_url_totals.csv
- active_users_per_year.csv
- active_bots_per_year.csv

trove_url_totals.csv contains information about the number of times each link was shared by users (not including bots). The file includes the following fields:

- trove_id – Trove identifier; using this and trove_type you can query the Trove API for further information
- trove_type – type of Trove resource, possible values include:
  - article – an individual newspaper article
  - page – a page of a newspaper
  - title – a newspaper title
  - work – an individual Trove resource from outside the newspapers category
  - other – anything else, including search queries and links to the home page
- tweets – number of tweets containing a link to this resource
- retweets – number of retweets containing a link to this resource
- quotes – number of quote tweets containing a link to this resource
- total – the total number of times this link was shared (sum of tweets, retweets, quotes)

active_users_per_year.csv contains information about the number of unique users each year who shared a link to Trove. The file includes the following fields:

- year
- users – number of unique users who shared a Trove link in this year

active_bots_per_year.csv contains information about the number of active bots each year that shared links to Trove. The file includes the following fields:

- year
- bots – number of bots that shared a Trove link in this year

Some overall statistics:

Number of unique users sharing Trove links | 9,294 |
Number of bots sharing Trove links | 43 |
Number of tweets by humans containing Trove links | 48,293 |
Number of tweets by bots containing Trove links | 318,797 |
Number of unique links shared by humans | 36,886 |
Number of unique links shared by bots | 270,501 |
See this blog post for more information.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Current version: v1.1
This dataset contains information about newspapers published in languages other than English that have been digitised and made available through Trove. Data about the languages present in newspapers was generated by harvesting a sample of articles from each newspaper using the Trove API, and then using language detection software on the OCRd text of each article. The method is documented in this notebook in the GLAM Workbench.
There are two files:
- newspapers_non_english.csv – list of the main languages detected for each newspaper with non-English language content
- non-english-newspapers.md – a markdown formatted list of all the newspapers with non-English language content

newspapers_non_english.csv
The dataset contains the following columns:
Column | Contents |
---|---|
id | newspaper id |
title | newspaper title |
language | language code |
proportion | proportion of articles in this language |
number | number of articles sampled |
language_full | full language name |
non-english-newspapers.md
This is a markdown-formatted list created by grouping the dataset by newspaper title. It includes details of the main languages in each newspaper.