3 datasets found

COKI Language Dataset
zenodo.org
application/gzip, csv
Updated Jun 16, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
James P. Diprose; James P. Diprose; Cameron Neylon; Cameron Neylon (2022). COKI Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.6636625
Explore at:
application/gzip, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6636625
Dataset updated
Jun 16, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
James P. Diprose; James P. Diprose; Cameron Neylon; Cameron Neylon
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The COKI Language Dataset contains predictions for 122 million academic publications. The dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.

Methodology
A subset of the COKI Academic Observatory Dataset, which is produced by the Academic Observatory Workflows codebase [1], was extracted and converted to CSV with Bigquery and downloaded to a virtual machine. The subset consists of all publications with DOIs in our dataset, including each publication’s title and abstract from both Crossref Metadata and Microsoft Academic Graph. The CSV files were then processed with a Python script. The titles and abstracts for each record were pre-processed, concatenated together and analysed with fastText. The titles and abstracts from Crossref Metadata were used first, with the MAG titles and abstracts serving as a fallback when the Crossref Metadata information was empty. Language was predicted for each publication using the fastText lid.176.bin language identification model [2]. fastText was chosen because of its high accuracy and fast runtime speed [3]. The final output dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.

Query or Download
The data is publicly accessible in BigQuery in the following two tables:

coki-data-share.language.doi_language

coki-data-share.language.iso_language

When you make queries on these tables, make sure that you are in your own Google Cloud project, otherwise the queries will fail.

See the COKI Language Detection README for instructions on how to download the data from Zenodo and load it into BigQuery.

Code
The code that generated this dataset, the BigQuery schemas and instructions for loading the data into BigQuery can be found here: https://github.com/The-Academic-Observatory/coki-language

License
COKI Language Dataset © 2022 by Curtin University is licenced under CC BY 4.0.

Attributions
This work contains information from:

Microsoft Academic Graph which is made available under the ODC Attribution Licence.

Crossref Metadata via the Metadata Plus program. Bibliographic metadata is made available without copyright restriction and Crossref generated data under a CC0 licence. See metadata licence information for more details.

References
[1] https://doi.org/10.5281/zenodo.6366695
[2] https://fasttext.cc/docs/en/language-identification.html
[3] https://modelpredict.com/language-identification-survey
Intellectual Property Investigations by the USITC
kaggle.com
zip
Updated Feb 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). Intellectual Property Investigations by the USITC [Dataset]. https://www.kaggle.com/bigquery/usitc-investigations
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 12, 2019
Dataset provided by
Googlehttp://google.com/
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

Section 337, Tariff Act of 1930, Investigations of Unfair Practices in Import Trade. Under section 337, the USITC determines whether there is unfair competition in the importation of products into, or their subsequent sale in, the United States. Section 337 prohibits the importation into the US , or the sale of such articles by owners, importers or consignees, of articles which infringe a patent, copyright, trademark, or semiconductor mask work, or where unfair competition or unfair acts exist that can destroy or substantially injure a US industry or prevent one from developing, or restrain or monopolize trade in US commerce. These latter categories are very broad: unfair competition can involve counterfeit, mismarked or misbranded goods, where the sale of the goods are at unfairly low prices, where other antitrust violations take place such as price fixing, market division or the goods violate a standard applicable to such goods.

Content

US International Trade Commission 337Info Unfair Import Investigations Information System contains data on investigations done under Section 337. Section 337 declares the infringement of certain statutory intellectual property rights and other forms of unfair competition in import trade to be unlawful practices. Most Section 337 investigations involve allegations of patent or registered trademark infringement.

Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.

Acknowledgements

Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:usitc_investigations

"US International Trade Commission 337Info Unfair Import Investigations Information System" by the USITC, for public use.

Banner photo by João Silas on Unsplash
gnomAD
console.cloud.google.com
Updated Jun 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Broad%20Institute%20of%20MIT%20and%20Harvard&hl=pl&inv=1&invt=Ab292A (2022). gnomAD [Dataset]. https://console.cloud.google.com/marketplace/product/broad-institute/gnomad?hl=pl
Explore at:
Dataset updated
Jun 13, 2022
Dataset provided by
Googlehttp://google.com/
Description
The Genome Aggregation Database (gnomAD) is maintained by an international coalition of investigators to aggregate and harmonize data from large-scale sequencing projects. These public datasets are available in VCF format in Google Cloud Storage and in Google BigQuery as integer range partitioned tables . Each dataset is sharded by chromosome meaning variants are distributed across 24 tables (indicated with “_chr*” suffix). Utilizing the sharded tables reduces query costs significantly. Variant Transforms was used to process these VCF files and import them to BigQuery. VEP annotations were parsed into separate columns for easier analysis using Variant Transforms’ annotation support . These public datasets are included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage. Find out more in our blog post, Providing open access to gnomAD on Google Cloud . Questions? Contact gcp-life-sciences-discuss@googlegroups.com.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

James P. Diprose; James P. Diprose; Cameron Neylon; Cameron Neylon (2022). COKI Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.6636625

COKI Language Dataset

Explore at:

5 scholarly articles cite this dataset (View in Google Scholar)

application/gzip, csvAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.6636625

Dataset updated

Jun 16, 2022

Dataset provided by

Zenodohttp://zenodo.org/

Authors

James P. Diprose; James P. Diprose; Cameron Neylon; Cameron Neylon

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The COKI Language Dataset contains predictions for 122 million academic publications. The dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.

Methodology
A subset of the COKI Academic Observatory Dataset, which is produced by the Academic Observatory Workflows codebase [1], was extracted and converted to CSV with Bigquery and downloaded to a virtual machine. The subset consists of all publications with DOIs in our dataset, including each publication’s title and abstract from both Crossref Metadata and Microsoft Academic Graph. The CSV files were then processed with a Python script. The titles and abstracts for each record were pre-processed, concatenated together and analysed with fastText. The titles and abstracts from Crossref Metadata were used first, with the MAG titles and abstracts serving as a fallback when the Crossref Metadata information was empty. Language was predicted for each publication using the fastText lid.176.bin language identification model [2]. fastText was chosen because of its high accuracy and fast runtime speed [3]. The final output dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.

Query or Download
The data is publicly accessible in BigQuery in the following two tables:

When you make queries on these tables, make sure that you are in your own Google Cloud project, otherwise the queries will fail.

See the COKI Language Detection README for instructions on how to download the data from Zenodo and load it into BigQuery.

Code
The code that generated this dataset, the BigQuery schemas and instructions for loading the data into BigQuery can be found here: https://github.com/The-Academic-Observatory/coki-language

Attributions
This work contains information from:

Microsoft Academic Graph which is made available under the ODC Attribution Licence.
Crossref Metadata via the Metadata Plus program. Bibliographic metadata is made available without copyright restriction and Crossref generated data under a CC0 licence. See metadata licence information for more details.

References
[1] https://doi.org/10.5281/zenodo.6366695
[2] https://fasttext.cc/docs/en/language-identification.html
[3] https://modelpredict.com/language-identification-survey

Clear search

Close search

Google apps

Main menu

COKI Language Dataset

Intellectual Property Investigations by the USITC

Context

Content

Acknowledgements

gnomAD

COKI Language Dataset