19 datasets found

COKI Language Dataset
zenodo.org
application/gzip, csv
Updated Jun 16, 2022
Cite
James P. Diprose; Cameron Neylon (2022). COKI Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.6636625
Explore at:
Available download formats: application/gzip, csv
Unique identifier
https://doi.org/10.5281/zenodo.6636625
Dataset updated
Jun 16, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
James P. Diprose; Cameron Neylon
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The COKI Language Dataset contains language predictions for 122 million academic publications. The dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.

Methodology
A subset of the COKI Academic Observatory Dataset, which is produced by the Academic Observatory Workflows codebase [1], was extracted and converted to CSV with BigQuery and downloaded to a virtual machine. The subset consists of all publications with DOIs in our dataset, including each publication’s title and abstract from both Crossref Metadata and Microsoft Academic Graph (MAG). The CSV files were then processed with a Python script. The titles and abstracts for each record were pre-processed, concatenated together and analysed with fastText. The titles and abstracts from Crossref Metadata were used first, with the MAG titles and abstracts serving as a fallback when the Crossref Metadata information was empty. Language was predicted for each publication using the fastText lid.176.bin language identification model [2]. fastText was chosen because of its high accuracy and fast runtime [3]. The final output dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.
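The text selection and prediction steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names are hypothetical, the fallback is assumed to apply per field (the description does not say whether it is per field or per record), and the pre-processing is reduced to whitespace cleanup. The `model` argument is assumed to behave like a loaded fastText model, whose `predict` method returns labels of the form `__label__xx` together with probabilities.

```python
from typing import Optional, Tuple


def _clean(text: Optional[str]) -> str:
    """Collapse runs of whitespace (including newlines); treat None as empty."""
    return " ".join((text or "").split())


def select_text(
    crossref_title: Optional[str],
    crossref_abstract: Optional[str],
    mag_title: Optional[str],
    mag_abstract: Optional[str],
) -> str:
    """Prefer Crossref fields and fall back to MAG when a Crossref field is empty.

    Assumption: the fallback is applied per field, not per record.
    """
    title = _clean(crossref_title) or _clean(mag_title)
    abstract = _clean(crossref_abstract) or _clean(mag_abstract)
    # Concatenate title and abstract, skipping whichever is missing.
    return " ".join(part for part in (title, abstract) if part)


def predict_language(model, text: str) -> Tuple[str, float]:
    """Return (ISO language code, probability) from a fastText-style model."""
    labels, probs = model.predict(text)
    return labels[0].replace("__label__", ""), float(probs[0])
```

With the real library, `model` would be obtained with something like `fasttext.load_model("lid.176.bin")`. Collapsing newlines first matters because fastText's `predict` handles one line of text at a time.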

Query or Download
The data is publicly accessible in BigQuery in the following two tables:

coki-data-share.language.doi_language

coki-data-share.language.iso_language

When you make queries on these tables, make sure that you are in your own Google Cloud project, otherwise the queries will fail.
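As a rough sketch of what querying from your own project can look like in Python, the snippet below builds a small preview query and runs it with the `google-cloud-bigquery` client. The helper names and the `project_id` argument are hypothetical, the column names are not assumed (hence `SELECT *`), and running the query requires that package plus credentials for your own Google Cloud project.

```python
TABLE = "coki-data-share.language.doi_language"


def build_preview_sql(table: str = TABLE, limit: int = 10) -> str:
    """Build a small preview query against one of the shared tables."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return f"SELECT * FROM `{table}` LIMIT {int(limit)}"


def run_in_own_project(project_id: str, sql: str) -> list:
    """Run the query with jobs created in (and billed to) your own project."""
    # Imported lazily so the SQL helper works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    return list(client.query(sql).result())
```

For example, `run_in_own_project("<your-project-id>", build_preview_sql(limit=5))` would return five rows as `Row` objects.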

See the COKI Language Detection README for instructions on how to download the data from Zenodo and load it into BigQuery.

Code
The code that generated this dataset, the BigQuery schemas and instructions for loading the data into BigQuery can be found here: https://github.com/The-Academic-Observatory/coki-language

License
COKI Language Dataset © 2022 by Curtin University is licenced under CC BY 4.0.

Attributions
This work contains information from:

Microsoft Academic Graph which is made available under the ODC Attribution Licence.

Crossref Metadata via the Metadata Plus program. Bibliographic metadata is made available without copyright restriction and Crossref generated data under a CC0 licence. See metadata licence information for more details.

References
[1] https://doi.org/10.5281/zenodo.6366695
[2] https://fasttext.cc/docs/en/language-identification.html
[3] https://modelpredict.com/language-identification-survey
BigQuery GIS Utility Datasets (U.S.)
kaggle.com
zip
Updated Mar 20, 2019
Cite
Google BigQuery (2019). BigQuery GIS Utility Datasets (U.S.) [Dataset]. https://www.kaggle.com/datasets/bigquery/utility-us
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Mar 20, 2019
Dataset provided by
BigQuery (https://cloud.google.com/bigquery)
Authors
Google BigQuery
License
CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
Description
Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.utility_us.[TABLENAME].

Project: "bigquery-public-data"

Dataset: "utility_us"

Fork this kernel to get started and learn how to safely analyze large BigQuery datasets.

If you're using Python, you can start with this code:

    import pandas as pd
    from bq_helper import BigQueryHelper

    # Helper bound to the public project and the utility_us dataset
    bq_assistant = BigQueryHelper("bigquery-public-data", "utility_us")
gnomAD
console.cloud.google.com
Updated Jun 23, 2020
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Broad%20Institute%20of%20MIT%20and%20Harvard&inv=1&invt=Ab5hPQ (2020). gnomAD [Dataset]. https://console.cloud.google.com/marketplace/product/broad-institute/gnomad
Explore at:
Dataset updated
Jun 23, 2020
Dataset provided by
Google (http://google.com/)
Description
The Genome Aggregation Database (gnomAD) is maintained by an international coalition of investigators to aggregate and harmonize data from large-scale sequencing projects. These public datasets are available in VCF format in Google Cloud Storage and in Google BigQuery as integer range partitioned tables. Each dataset is sharded by chromosome, meaning variants are distributed across 24 tables (indicated with a “_chr*” suffix). Using the sharded tables reduces query costs significantly. Variant Transforms was used to process these VCF files and import them to BigQuery. VEP annotations were parsed into separate columns for easier analysis using Variant Transforms’ annotation support.

These public datasets are included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage. Find out more in our blog post, Providing open access to gnomAD on Google Cloud. Questions? Contact gcp-life-sciences-discuss@googlegroups.com.
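To illustrate the per-chromosome sharding, here is a small sketch that lists the shard names for one dataset. The base table name is purely hypothetical, and the snippet assumes the 24 shards correspond to chromosomes 1 to 22 plus X and Y, which the description implies but does not spell out.

```python
from typing import List

# Assumption: 22 autosomes plus X and Y account for the 24 shards.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]


def shard_tables(base_table: str) -> List[str]:
    """Return one table name per chromosome using a "_chr*" suffix."""
    return [f"{base_table}_chr{chrom}" for chrom in CHROMOSOMES]


# Hypothetical base name, for illustration only.
tables = shard_tables("my-project.my_dataset.variants")
```

Querying only the shard for the chromosome of interest, rather than every shard, is what keeps the amount of data scanned (and so the cost) down.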
Intellectual Property Investigations by the USITC
kaggle.com
zip
Updated Feb 12, 2019
Cite
Google BigQuery (2019). Intellectual Property Investigations by the USITC [Dataset]. https://www.kaggle.com/bigquery/usitc-investigations
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Feb 12, 2019
Dataset provided by
BigQuery (https://cloud.google.com/bigquery)
Google (http://google.com/)
Authors
Google BigQuery
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

Section 337, Tariff Act of 1930, Investigations of Unfair Practices in Import Trade. Under Section 337, the USITC determines whether there is unfair competition in the importation of products into, or their subsequent sale in, the United States. Section 337 prohibits the importation into the US, or the sale by owners, importers or consignees, of articles which infringe a patent, copyright, trademark, or semiconductor mask work, or where unfair competition or unfair acts exist that can destroy or substantially injure a US industry, prevent one from developing, or restrain or monopolize trade in US commerce. These latter categories are very broad: unfair competition can involve counterfeit, mismarked or misbranded goods, goods sold at unfairly low prices, other antitrust violations such as price fixing or market division, or goods that violate a standard applicable to them.

Content

US International Trade Commission 337Info Unfair Import Investigations Information System contains data on investigations done under Section 337. Section 337 declares the infringement of certain statutory intellectual property rights and other forms of unfair competition in import trade to be unlawful practices. Most Section 337 investigations involve allegations of patent or registered trademark infringement.

Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.

Acknowledgements

Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:usitc_investigations

"US International Trade Commission 337Info Unfair Import Investigations Information System" by the USITC, for public use.

Banner photo by João Silas on Unsplash
Project Sunroof
console.cloud.google.com
Updated Aug 15, 2017
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Google%20Project%20Sunroof (2017). Project Sunroof [Dataset]. https://console.cloud.google.com/marketplace/product/project-sunroof/project-sunroof
Explore at:
Dataset updated
Aug 15, 2017
Dataset provided by
Google (http://google.com/)
Description
As installing solar has gotten less expensive, more homeowners are turning to it as a possible option for decreasing their energy bill. We want to make installing solar panels easy and understandable for anyone. Project Sunroof puts Google's expansive data in mapping and computing resources to use, helping calculate the best solar plan for you.

How does it work? When you enter your address, Project Sunroof looks up your home in Google Maps and combines that information with other databases to create your personalized roof analysis. Don’t worry, Project Sunroof doesn't give the address to anybody else. Learn more about Project Sunroof and see the tool at Project Sunroof’s site. Project Sunroof computes how much sunlight hits roofs in a year, based on shading calculations, typical meteorological data, and estimates of the size and shape of the roofs. More details about how solar viability is determined are given in the Project Sunroof methodology.

This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
Google Ads Transparency Center
console.cloud.google.com
Updated Oct 5, 2020
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Data&inv=1&invt=Ab6ARQ (2020). Google Ads Transparency Center [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-data/google-ads-transparency-center
Explore at:
Dataset updated
Oct 5, 2020
Dataset provided by
BigQuery (https://cloud.google.com/bigquery)
Google (http://google.com/)
Description
This dataset contains two tables: creative_stats and removed_creative_stats. The creative_stats table contains information about advertisers that served ads in the European Economic Area or Turkey: their legal name, verification status, disclosed name, and location. It also includes ad-specific information: impression ranges per region (including aggregate impressions for the European Economic Area), first shown and last shown dates, which criteria were used in audience selection, the format of the ad, the ad topic, and whether the ad is funded by the Google Ad Grants program. A link to the ad in the Google Ads Transparency Center is also provided. The removed_creative_stats table contains information about ads that served in the European Economic Area that Google removed: where and why they were removed and per-region information on when they served. The removed_creative_stats table also contains a link to the Google Ads Transparency Center for the removed ad. Data for both tables updates periodically and may be delayed from what appears on the Google Ads Transparency Center website.

About BigQuery

This data is hosted in Google BigQuery for users to easily query using SQL. Note that to use BigQuery, users must have a Google account and create a GCP project. This public dataset is included in BigQuery's 1TB/mo of free tier processing. Each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.

Download Dataset

This public dataset is also hosted in Google Cloud Storage and is free to use. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage. We provide the raw data in JSON format, sharded across multiple files to support easier download of the large dataset. A README file which describes the data structure and our Terms of Service is included with the dataset. You can also download the results from a custom query.

Signed-out users can download the full dataset by using the gcloud CLI. After installing the gcloud CLI, remove the login requirement by running:

    gcloud config set auth/disable_credentials True

Then download the dataset by running:

    gcloud storage cp gs://ads-transparency-center/* . -R
Kimia Farma: Performance Analysis 2020-2023
kaggle.com
Updated Feb 27, 2025
Cite
Anggun Dwi Lestari (2025). Kimia Farma: Performance Analysis 2020-2023 [Dataset]. https://www.kaggle.com/datasets/anggundwilestari/kimia-farma-performance-analysis-2020-2023
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 27, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Anggun Dwi Lestari
Description
This project analyzes Kimia Farma's performance from 2020 to 2023 using Google Looker Studio. The analysis is based on a pre-processed dataset stored in BigQuery, which serves as the data source for the dashboard.

Project Scope

The dashboard is designed to provide insights into branch performance, sales trends, customer ratings, and profitability. The development is ongoing, with multiple pages planned for a more in-depth analysis.

Current Progress

✅ The first page of the dashboard is completed
✅ A sample dashboard file is available on Kaggle
🔄 Development will continue with additional pages

Dataset Overview

The dataset consists of transaction records from Kimia Farma branches across different cities and provinces. Below are the key columns used in the analysis:
- transaction_id: Transaction ID code
- date: Transaction date
- branch_id: Kimia Farma branch ID code
- branch_name: Kimia Farma branch name
- kota: City of the Kimia Farma branch
- provinsi: Province of the Kimia Farma branch
- rating_cabang: Customer rating of the Kimia Farma branch
- customer_name: Name of the customer who made the transaction
- product_id: Product ID code
- product_name: Name of the medicine
- actual_price: Price of the medicine
- discount_percentage: Discount percentage applied to the medicine
- persentase_gross_laba: Gross profit percentage based on the following conditions:
  Price ≤ Rp 50,000 → 10% profit
  Price > Rp 50,000 - 100,000 → 15% profit
  Price > Rp 100,000 - 300,000 → 20% profit
  Price > Rp 300,000 - 500,000 → 25% profit
  Price > Rp 500,000 → 30% profit
- nett_sales: Price after discount
- nett_profit: Profit earned by Kimia Farma
- rating_transaksi: Customer rating of the transaction
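The gross profit tiers and the price-after-discount column can be expressed as a short sketch. The function names are illustrative, and the snippet assumes that `discount_percentage` is stored as a fraction (0.25 meaning 25%), which the column description does not confirm.

```python
def gross_profit_rate(price: float) -> float:
    """Map a medicine price in rupiah to its gross profit rate, per the tiers above."""
    if price <= 50_000:
        return 0.10
    if price <= 100_000:
        return 0.15
    if price <= 300_000:
        return 0.20
    if price <= 500_000:
        return 0.25
    return 0.30


def nett_sales(actual_price: float, discount_percentage: float) -> float:
    """Price after discount.

    Assumption: discount_percentage is a fraction, e.g. 0.25 for 25%.
    """
    return actual_price * (1 - discount_percentage)
```

Each tier's upper bound is inclusive, so a price of exactly Rp 100,000 falls in the 15% tier and Rp 100,001 in the 20% tier.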

Files Provided

📌 kimia farma_query.txt – Contains SQL queries used for data analysis in Looker Studio
📌 kimia farma_analysis_table.csv – Preprocessed dataset ready for import and analysis

📢 Published on : My LinkedIn
Clean Water, Clear Data
streamwaterdata.co.uk
portal-streamwaterdata.hub.arcgis.com
Updated Dec 12, 2024
Cite
sbeka_streamwaterdata (2024). Clean Water, Clear Data [Dataset]. https://www.streamwaterdata.co.uk/items/9ee2bf11097f4465ac299e2641a7fcc5
Explore at:
Dataset updated
Dec 12, 2024
Dataset authored and provided by
sbeka_streamwaterdata
Description
The quality of our water is vital, and understanding the factors that impact it is crucial for both the environment and public health. A new collaboration between Google Cloud and Stream, a consortium of UK water companies with a collective vision to unlock water data, is putting the power of data and AI into the hands of communities, driving transparency and informed decision-making around water quality. This initiative leverages Stream's ever-growing catalogue of water sector data and combines it with Google Cloud's BigQuery and advanced Generative AI. The result? A revolutionary way to access, analyse, and understand complex water quality information.
OpenStreetMap Public Dataset
console.cloud.google.com
Updated Jan 16, 2020
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:OpenStreetMap (2020). OpenStreetMap Public Dataset [Dataset]. https://console.cloud.google.com/marketplace/product/openstreetmap/geo-openstreetmap
Explore at:
Dataset updated
Jan 16, 2020
Dataset provided by
OpenStreetMap (https://www.openstreetmap.org/)
Google (http://google.com/)
License
Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Created in 2004, it was inspired by the success of Wikipedia and has more than two million registered users, who can add data by manual survey, GPS devices, aerial photography, and other free sources.

We've made available a number of tables:
- history_* tables: full history of OSM objects
- planet_* tables: snapshot of current OSM objects as of Nov 2019

The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types and an additional changeset table corresponding to OSM edits, for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated upon with the built-in geography functions for geometry and feature selection and additional processing.

This dataset is part of a larger effort to make data available in BigQuery through the Google Cloud Public Datasets program. OSM itself is produced as a public good by volunteers, and there are no guarantees about data quality.

This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
NOAA GSOD
kaggle.com
zip
Updated Aug 30, 2019
Cite
NOAA (2019). NOAA GSOD [Dataset]. https://www.kaggle.com/datasets/noaa/gsod
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Aug 30, 2019
Dataset provided by
National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
Authors
NOAA
License
CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

Global Surface Summary of the Day is derived from The Integrated Surface Hourly (ISH) dataset. The ISH dataset includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries.

Content

Over 9000 stations' data are typically available.

The daily elements included in the dataset (as available from each station) are:
- Mean temperature (.1 Fahrenheit)
- Mean dew point (.1 Fahrenheit)
- Mean sea level pressure (.1 mb)
- Mean station pressure (.1 mb)
- Mean visibility (.1 miles)
- Mean wind speed (.1 knots)
- Maximum sustained wind speed (.1 knots)
- Maximum wind gust (.1 knots)
- Maximum temperature (.1 Fahrenheit)
- Minimum temperature (.1 Fahrenheit)
- Precipitation amount (.01 inches)
- Snow depth (.1 inches)

Indicator for occurrence of: Fog, Rain or Drizzle, Snow or Ice Pellets, Hail, Thunder, Tornado/Funnel

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.noaa_gsod.[TABLENAME]. Fork this kernel to get started and learn how to safely analyze large BigQuery datasets.

Acknowledgements

This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and present, collected from over 9000 stations. Dataset Source: NOAA

Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Photo by Allan Nygren on Unsplash
NYC Citi Bike Trips
console.cloud.google.com
Updated Jul 1, 2022
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:City%20of%20New%20York&inv=1&invt=Ab4vBw (2022). NYC Citi Bike Trips [Dataset]. https://console.cloud.google.com/marketplace/product/city-of-new-york/nyc-citi-bike
Explore at:
Dataset updated
Jul 1, 2022
Dataset provided by
Google (http://google.com/)
Area covered
New York
Description
Citi Bike is the nation's largest bike share program, with 10,000 bikes and 600 stations across Manhattan, Brooklyn, Queens, and Jersey City. This dataset includes Citi Bike trips since Citi Bike launched in September 2013 and is updated daily. The data has been processed by Citi Bike to remove trips that are taken by staff to service and inspect the system, as well as any trips below 60 seconds in length, which are considered false starts.

This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
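The false-start rule can be illustrated with a small filter. This is only a sketch of the idea, not Citi Bike's actual processing: the `(start, stop)` tuple layout and the function names are hypothetical, and it covers only the 60-second rule, not the removal of staff trips.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Trip = Tuple[datetime, datetime]  # hypothetical (start, stop) pair

MIN_TRIP_LENGTH = timedelta(seconds=60)


def is_false_start(start: datetime, stop: datetime) -> bool:
    """A trip shorter than 60 seconds counts as a false start."""
    return (stop - start) < MIN_TRIP_LENGTH


def drop_false_starts(trips: Iterable[Trip]) -> List[Trip]:
    """Keep only trips of at least 60 seconds."""
    return [(start, stop) for start, stop in trips if not is_false_start(start, stop)]
```

Because this cleaning is already applied upstream, trips shorter than 60 seconds should not appear in the published table.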
CFPB Consumer Complaint Database
console.cloud.google.com
Updated Jan 25, 2020
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Consumer%20Financial%20Protection%20Bureau (2020). CFPB Consumer Complaint Database [Dataset]. https://console.cloud.google.com/marketplace/product/cfpb/complaint-database
Explore at:
Dataset updated
Jan 25, 2020
Dataset provided by
Google (http://google.com/)
Description
The Consumer Complaint Database is a collection of complaints about consumer financial products and services that we sent to companies for response. Complaints are published after the company responds, confirming a commercial relationship with the consumer, or after 15 days, whichever comes first. Complaints referred to other regulators, such as complaints about depository institutions with less than $10 billion in assets, are not published in the Consumer Complaint Database.

This database is not a statistical sample of consumers’ experiences in the marketplace. Complaints are not necessarily representative of all consumers’ experiences and complaints do not constitute “information” for purposes of the Information Quality Act. Complaint volume should be considered in the context of company size and/or market share. For example, companies with more customers may have more complaints than companies with fewer customers. We encourage you to pair complaint data with public and private datasets for additional context.

The Bureau publishes the consumer’s narrative description of his or her experience if the consumer opts to share it publicly and after the Bureau removes personal information. We don’t verify all the allegations in complaint narratives. Unproven allegations in consumer narratives should be regarded as opinion, not fact. We do not adopt the views expressed and make no representation that consumers’ allegations are accurate, clear, complete, or unbiased in substance or presentation. Users should consider what conclusions may be fairly drawn from complaints alone.

This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. Each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.
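The "whichever comes first" publication rule can be written as a small date calculation. This is an illustrative sketch only: the function and argument names are hypothetical, and it assumes the 15 days are counted from the date the complaint was sent to the company, which the description does not state explicitly.

```python
from datetime import date, timedelta
from typing import Optional

PUBLICATION_WINDOW = timedelta(days=15)


def earliest_publication_date(
    sent_to_company: date, company_responded: Optional[date] = None
) -> date:
    """Return the company's response date or the 15-day mark, whichever is first.

    Assumption: the 15-day window starts on the day the complaint was sent.
    """
    deadline = sent_to_company + PUBLICATION_WINDOW
    if company_responded is None:
        return deadline
    return min(company_responded, deadline)
```

For a complaint sent on 2020-01-02 with a response on 2020-01-10, the result is 2020-01-10. With no response it is 2020-01-17.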

Meio Ambiente: Taxa de Precipitação (Alerta Rio)

data.rio
datario-pcrj.hub.arcgis.com

Updated Jun 3, 2022

Cite

Prefeitura da Cidade do Rio de Janeiro (2022). Meio Ambiente: Taxa de Precipitação (Alerta Rio) [Dataset]. https://www.data.rio/documents/3ec0f995f6614d5a886c0bdd79beb0f8

Explore at:

Dataset updated

Jun 3, 2022

Dataset authored and provided by

Prefeitura da Cidade do Rio de Janeiro

License

Attribution-NoDerivs 3.0 (CC BY-ND 3.0), https://creativecommons.org/licenses/by-nd/3.0/
License information was derived automatically

Description

Measured precipitation rate from the rain gauge stations of the city of Rio de Janeiro. Estimates are made every 15 minutes, with each record containing the data for that measurement.

  How to access


  On this page

  Here you will find a button to download the data in CSV format, compressed with gzip. Or, for the same result, you can click here.


  BigQuery

      SELECT
        *
      FROM
        `datario.meio_ambiente_clima.taxa_precipitacao_alertario`
      LIMIT
        1000

  Click here to go directly to this table in BigQuery. If you have no experience with BigQuery, see our documentation to understand how to access the data.


  Python

    import basedosdados as bd

    # Load the data directly into pandas
    df = bd.read_sql(
        "SELECT * FROM `datario.meio_ambiente_clima.taxa_precipitacao_alertario` LIMIT 1000",
        billing_project_id="<id_do_seu_projeto_gcp>",
    )




  R

    install.packages("basedosdados")
    library("basedosdados")

    # Set your Google Cloud project
    set_billing_id("<id_do_seu_projeto_gcp>")

    # Load the data directly into R
    tb <- read_sql(
      "SELECT * FROM `datario.meio_ambiente_clima.taxa_precipitacao_alertario` LIMIT 1000"
    )






  Temporal coverage

  From 1997 to the present


  Update frequency

  Daily


  Managing body

  COR




  Columns

    Name                      Description
    data_particao             Date on which the measurement was taken.
    id_estacao                ID of the rain gauge station where the measurement was taken.
    acumulado_chuva_15_min    Accumulated rainfall over 15 minutes.
    acumulado_chuva_1_h       Accumulated rainfall over 1 hour.
    acumulado_chuva_4_h       Accumulated rainfall over 4 hours.
    acumulado_chuva_24_h      Accumulated rainfall over 24 hours.
    acumulado_chuva_96_h      Accumulated rainfall over 96 hours.
    primary_key               Primary key created from the id_estacao column and data_medicao. Used to prevent duplicate records.
    horario                   Time at which the measurement was taken.


  Publisher details

  Name: Patrícia Catandi
  E-mail: patriciabcatandi@gmail.com

Meio Ambiente: Estações meteorológicas (INMET/BDMET)

data.rio
datario-pcrj.hub.arcgis.com

Updated Jun 3, 2022

Cite

Prefeitura da Cidade do Rio de Janeiro (2022). Meio Ambiente: Estações meteorológicas (INMET/BDMET) [Dataset]. https://www.data.rio/documents/f14b1ed52be447379383acbb96353e1c

Explore at:

Dataset updated

Jun 3, 2022

Dataset authored and provided by

Prefeitura da Cidade do Rio de Janeiro

License

Attribution-NoDerivs 3.0 (CC BY-ND 3.0), https://creativecommons.org/licenses/by-nd/3.0/
License information was derived automatically

Description

Data on the weather stations operated by INMET (Instituto Nacional de Meteorologia, Brazil's National Institute of Meteorology) in the city of Rio de Janeiro.

  How to access


  On this page

  Here you will find a button to download the data in CSV format, compressed with gzip. Or, for the same result, you can click here.


  BigQuery

      SELECT
        *
      FROM
        `datario.meio_ambiente_clima.estacoes_inmet`
      LIMIT
        1000

  Click here to go directly to this table in BigQuery. If you have no experience with BigQuery, see our documentation to understand how to access the data.


  Python

    import basedosdados as bd

    # Load the data directly into pandas
    df = bd.read_sql(
        "SELECT * FROM `datario.meio_ambiente_clima.estacoes_inmet` LIMIT 1000",
        billing_project_id="<id_do_seu_projeto_gcp>",
    )




  R

    install.packages("basedosdados")
    library("basedosdados")

    # Set your Google Cloud project
    set_billing_id("<id_do_seu_projeto_gcp>")

    # Load the data directly into R
    tb <- read_sql(
      "SELECT * FROM `datario.meio_ambiente_clima.estacoes_inmet` LIMIT 1000"
    )






  Temporal coverage

  N/A


  Update frequency

  Never


  Managing body

  INMET




  Columns

    Name                    Description
    id_municipio            7-digit IBGE municipality code.
    latitude                Latitude where the station is located.
    data_inicio_operacao    Date on which the station began operating.
    data_fim_operacao       Date on which the station stopped operating.
    situacao                Indicates whether the station is operational or faulty.
    tipo_estacao            Indicates whether the station is automatic or manual. May contain nulls.
    entidade_responsavel    Entity responsible for the station.
    data_atualizacao        Last date on which the data about the operation date were updated.
    longitude               Longitude where the station is located.
    sigla_uf                State abbreviation.
    id_estacao              Station ID defined by INMET.
    nome_estacao            Station name.


  Publisher details

  Name: Patricia Catandi
  E-mail: patriciabcatandi@gmail.com

Meio Ambiente: Taxa de Precipitação (GOES-16)

hub.arcgis.com
data.rio

Updated Jun 3, 2022

Cite

Prefeitura da Cidade do Rio de Janeiro (2022). Meio Ambiente: Taxa de Precipitação (GOES-16) [Dataset]. https://hub.arcgis.com/documents/48c0210e96074b48b401ec2fa4ad99b3

Dataset updated
Jun 3, 2022

Dataset authored and provided by
Prefeitura da Cidade do Rio de Janeiro

License
Attribution-NoDerivs 3.0 (CC BY-ND 3.0) https://creativecommons.org/licenses/by-nd/3.0/
License information was derived automatically

Description

Estimated precipitation rate for areas of southeastern Brazil. Estimates are produced hourly, and each record holds the data for one estimate. Each area is a square with 4 km sides. Data collected by the GOES-16 satellite.

  How to access

  On this page

  Here you will find a button to download the data in CSV format, compressed with gzip. Alternatively, click here for the same result.

  BigQuery

      SELECT *
      FROM `datario.meio_ambiente_clima.taxa_precipitacao_satelite`
      LIMIT 1000

  Click here to go directly to this table in BigQuery. If you have no experience with BigQuery, see our documentation to learn how to access the data.


  Python



    import basedosdados as bd

    # To load the data directly into pandas
    df = bd.read_sql(
        "SELECT * FROM `datario.meio_ambiente_clima.taxa_precipitacao_satelite` LIMIT 1000",
        billing_project_id="<id_do_seu_projeto_gcp>",  # your GCP project ID
    )




  R



    install.packages("basedosdados")
    library("basedosdados")

    # Set your Google Cloud project
    set_billing_id("<id_do_seu_projeto_gcp>")

    # To load the data directly into R
    tb <- read_sql("SELECT * FROM `datario.meio_ambiente_clima.taxa_precipitacao_satelite` LIMIT 1000")






  Temporal coverage
  From 2020 to the present

  Update frequency
  Daily

  Managing agency
  Centro de Operações da Prefeitura do Rio (COR)




  Columns (name: description)

    latitude: Latitude of the center of the area.
    longitude: Longitude of the center of the area.
    rrqpe: Estimated precipitation rate, in millimeters per hour.
    primary_key: Primary key built by concatenating the date, time, latitude and longitude columns. Used to prevent duplicate records.
    horario: Time at which the measurement was taken.
    data_particao: Date on which the measurement was taken.
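The role of primary_key can be pictured with a minimal sketch. It assumes the key is the four values joined with an underscore; the separator, the helper names and the sample rows are hypothetical, since the description only says the columns are concatenated.

```python
# Sketch only: the "_" separator and both helper names are assumptions,
# not the publisher's actual implementation.


def build_primary_key(row: dict) -> str:
    """Concatenate date, time, latitude and longitude into one key."""
    parts = (row["data_particao"], row["horario"], row["latitude"], row["longitude"])
    return "_".join(str(part) for part in parts)


def drop_duplicates(rows: list[dict]) -> list[dict]:
    """Keep the first row seen for each primary key."""
    seen: set[str] = set()
    unique = []
    for row in rows:
        key = build_primary_key(row)
        if key not in seen:
            seen.add(key)
            unique.append({**row, "primary_key": key})
    return unique


# Hypothetical sample rows: the first two describe the same area and hour.
rows = [
    {"data_particao": "2022-06-03", "horario": "12:00:00", "latitude": -22.9, "longitude": -43.2, "rrqpe": 1.5},
    {"data_particao": "2022-06-03", "horario": "12:00:00", "latitude": -22.9, "longitude": -43.2, "rrqpe": 1.5},
    {"data_particao": "2022-06-03", "horario": "13:00:00", "latitude": -22.9, "longitude": -43.2, "rrqpe": 0.0},
]
unique_rows = drop_duplicates(rows)
print(len(unique_rows))  # 2
```

Because the key combines time and position, two estimates for the same area in different hours remain distinct rows.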







  Publisher details

  Name: Patrícia Catandi
  E-mail: patriciabcatandi@gmail.com

Meio Ambiente: Estações pluviométricas (AlertaRio)

hub.arcgis.com
data.rio

Updated Jun 2, 2022


Cite

Prefeitura da Cidade do Rio de Janeiro (2022). Meio Ambiente: Estações pluviométricas (AlertaRio) [Dataset]. https://hub.arcgis.com/documents/cc4863712d65418abd8b2063a50bf453

Dataset updated
Jun 2, 2022

Dataset authored and provided by
Prefeitura da Cidade do Rio de Janeiro

License
Attribution-NoDerivs 3.0 (CC BY-ND 3.0) https://creativecommons.org/licenses/by-nd/3.0/
License information was derived automatically

Description

Data on the rain gauge stations of AlertaRio (the Sistema Alerta Rio of the Prefeitura do Rio de Janeiro) in the city of Rio de Janeiro.

  How to access

  On this page

  Here you will find a button to download the data in CSV format, compressed with gzip. Alternatively, click here for the same result.

  BigQuery

      SELECT *
      FROM `datario.meio_ambiente_clima.estacoes_alertario`
      LIMIT 1000

  Click here to go directly to this table in BigQuery. If you have no experience with BigQuery, see our documentation to learn how to access the data.


  Python



    import basedosdados as bd

    # To load the data directly into pandas
    df = bd.read_sql(
        "SELECT * FROM `datario.meio_ambiente_clima.estacoes_alertario` LIMIT 1000",
        billing_project_id="<id_do_seu_projeto_gcp>",  # your GCP project ID
    )




  R



    install.packages("basedosdados")
    library("basedosdados")

    # Set your Google Cloud project
    set_billing_id("<id_do_seu_projeto_gcp>")

    # To load the data directly into R
    tb <- read_sql("SELECT * FROM `datario.meio_ambiente_clima.estacoes_alertario` LIMIT 1000")






  Temporal coverage
  N/A

  Update frequency
  Annual

  Managing agency
  COR




  Columns (name: description)

    x: UTM X coordinate (SAD69, Zone 23).
    longitude: Longitude of the station's location.
    id_estacao: Station ID assigned by AlertaRIO.
    estacao: Station name.
    latitude: Latitude of the station's location.
    cota: Height, in meters, at which the station is located.
    endereco: Full address of the station.
    situacao: Indicates whether the station is operating or faulty.
    data_inicio_operacao: Date the station began operating.
    data_fim_operacao: Date the station stopped operating.
    data_atualizacao: Date on which the operation-date data was last updated.
    y: UTM Y coordinate (SAD69, Zone 23).







  Publisher details

  Name: Patricia Catandi
  E-mail: patriciabcatandi@gmail.com

Dados do sistema Comando (COR): ocorrencias

data.rio
datario-pcrj.hub.arcgis.com

Updated Oct 5, 2022


Cite

Prefeitura da Cidade do Rio de Janeiro (2022). Dados do sistema Comando (COR): ocorrencias [Dataset]. https://www.data.rio/documents/4b21231ce02243e99b5f080b8a0ff821

Dataset updated
Oct 5, 2022

Dataset authored and provided by
Prefeitura da Cidade do Rio de Janeiro

License
Attribution-NoDerivs 3.0 (CC BY-ND 3.0) https://creativecommons.org/licenses/by-nd/3.0/
License information was derived automatically

Description

Occurrences raised by COR since 2015. An occurrence in the city of Rio de Janeiro is an event that requires monitoring and, in most cases, action by the PCRJ. Examples: a pothole in the road, a pool of standing water, a vehicle breakdown. An open occurrence is one that has not yet been resolved. The data is also available through the Escritório de Dados API: https://api.dados.rio/v1/

  How to access

  On this page

  Here you will find a button to download the data in CSV format, compressed with gzip. Alternatively, click here for the same result.

  BigQuery

      SELECT *
      FROM `datario.adm_cor_comando.ocorrencias`
      LIMIT 1000

  Click here to go directly to this table in BigQuery. If you have no experience with BigQuery, see our documentation to learn how to access the data.


  Python



    import basedosdados as bd

    # To load the data directly into pandas
    df = bd.read_sql(
        "SELECT * FROM `datario.adm_cor_comando.ocorrencias` LIMIT 1000",
        billing_project_id="<id_do_seu_projeto_gcp>",  # your GCP project ID
    )




  R



    install.packages("basedosdados")
    library("basedosdados")

    # Set your Google Cloud project
    set_billing_id("<id_do_seu_projeto_gcp>")

    # To load the data directly into R
    tb <- read_sql("SELECT * FROM `datario.adm_cor_comando.ocorrencias` LIMIT 1000")






  Temporal coverage
  Not reported.

  Update frequency
  Daily

  Managing agency
  COR




  Columns (name: description)

    data_inicio: Date and time the event was registered with the PCRJ.
    data_fim: Date and time the event was closed by the PCRJ. An event is closed once it is resolved. This attribute is empty while the event is open.
    bairro: Neighborhood where the event occurred.
    id_pop: POP identifier.
    status: Event status: ABERTO (open) or FECHADO (closed).
    gravidade: Event severity: BAIXO (low), MEDIO (medium), ALTO (high) or CRITICO (critical).
    prazo: Expected time to resolve the event: CURTO (short), MEDIO (medium, more than 3 days) or LONGO (long, more than 5 days).
    latitude: Latitude (WGS-84) at which the event occurred.
    longitude: Longitude (WGS-84) at which the event occurred.
    id_evento: Event identifier.
    descricao: Event description.
    tipo: Event type: PRIMARIO (primary) or SECUNDARIO (secondary).
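The rule that data_fim stays empty while an event is open lends itself to a short sketch. The timestamp format, the helper names and the sample rows below are assumptions made for illustration; the actual column types in the table may differ.

```python
from datetime import datetime
from typing import Optional

# Sketch only: the timestamp format and helper names are assumptions.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def is_open(row: dict) -> bool:
    """An occurrence is open while data_fim is empty."""
    return not row.get("data_fim")


def duration_hours(row: dict) -> Optional[float]:
    """Hours between data_inicio and data_fim, or None if still open."""
    if is_open(row):
        return None
    start = datetime.strptime(row["data_inicio"], TIMESTAMP_FORMAT)
    end = datetime.strptime(row["data_fim"], TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 3600


# Hypothetical sample rows.
occurrences = [
    {"id_evento": "1", "status": "FECHADO", "data_inicio": "2022-10-01 08:00:00", "data_fim": "2022-10-01 10:30:00"},
    {"id_evento": "2", "status": "ABERTO", "data_inicio": "2022-10-05 09:00:00", "data_fim": None},
]
open_ids = [row["id_evento"] for row in occurrences if is_open(row)]
print(open_ids)                        # ['2']
print(duration_hours(occurrences[0]))  # 2.5
```

Checking data_fim rather than status keeps the sketch tied to the one rule the column descriptions state explicitly.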







  Publisher details

  Name: Patrícia Catandi
  E-mail: patriciabcatandi@gmail.com

Transporte Rodoviário: Histórico de GPS dos ônibus (SPPO)

data.rio
hub.arcgis.com

Updated Jun 8, 2022


Cite

Prefeitura da Cidade do Rio de Janeiro (2022). Transporte Rodoviário: Histórico de GPS dos ônibus (SPPO) [Dataset]. https://www.data.rio/documents/6409ea499d474bfeb4063cfc31203403

Dataset updated
Jun 8, 2022

Dataset authored and provided by
Prefeitura da Cidade do Rio de Janeiro

License
Attribution-NoDerivs 3.0 (CC BY-ND 3.0) https://creativecommons.org/licenses/by-nd/3.0/
License information was derived automatically

Description

Complete data is available for query and download in the data.rio data lake. Data is captured every minute and processed every hour. The data is subject to change, for example corrections to capture gaps and/or adjustments to processing.

  How to access

  On this page

  Here you will find a button to download the data in CSV format, compressed with gzip. Alternatively, click here for the same result.

  BigQuery

      SELECT *
      FROM `datario.transporte_rodoviario_municipal.gps_onibus`
      LIMIT 1000

  Click here to go directly to this table in BigQuery. If you have no experience with BigQuery, see our documentation to learn how to access the data.


  Python



    import basedosdados as bd

    # To load the data directly into pandas
    df = bd.read_sql(
        "SELECT * FROM `datario.transporte_rodoviario_municipal.gps_onibus` LIMIT 1000",
        billing_project_id="<id_do_seu_projeto_gcp>",  # your GCP project ID
    )




  R



    install.packages("basedosdados")
    library("basedosdados")

    # Set your Google Cloud project
    set_billing_id("<id_do_seu_projeto_gcp>")

    # To load the data directly into R
    tb <- read_sql("SELECT * FROM `datario.transporte_rodoviario_municipal.gps_onibus` LIMIT 1000")






  Temporal coverage
  March 1, 2021 to the present

  Update frequency
  Hourly

  Managing agency
  Secretaria Municipal de Transportes




  Columns (name: description)

    modo: SPPO; this table contains only this mode.
    timestamp_gps: Timestamp at which the GPS signal was emitted.
    data: Date of the GPS signal timestamp.
    hora: Time of the GPS signal timestamp.
    id_veiculo: Vehicle identifier code (número de ordem, the vehicle's order number).
    servico: Service operated by the vehicle.
    latitude: Geographic coordinate component (y axis) in decimal degrees (EPSG:4326 - WGS84).
    longitude: Geographic coordinate component (x axis) in decimal degrees (EPSG:4326 - WGS84).
    flag_em_movimento: Vehicles with 'velocidade' below 'velocidade_limiar_parado' are considered stopped (false); otherwise they are considered moving (true).
    tipo_parada: Identifies vehicles stopped at terminals or garages.
    flag_linha_existe_sigmob: Flag indicating whether the reported line exists in SIGMOB.
    velocidade_instantanea: Instantaneous vehicle speed as reported by the GPS (km/h).
    velocidade_estimada_10_min: Average speed over the last 10 minutes of operation (km/h).
    distancia: Distance from the previous GPS position to the current position (m).
    fonte_gps: GPS data provider (zirix or conecta).
    versao: Data version control code (GitHub SHA).
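The flag_em_movimento rule and the distancia column can be illustrated with a rough sketch. The threshold value is hypothetical (the description names 'velocidade_limiar_parado' without giving its value), and the haversine formula is simply one common way to measure distance between two WGS84 points; the pipeline's actual method is not stated.

```python
import math

# Sketch only: the threshold value is hypothetical; the description names the
# parameter 'velocidade_limiar_parado' but does not give its value.
STOPPED_SPEED_THRESHOLD_KMH = 3.0
EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation


def flag_em_movimento(velocidade_kmh: float, limiar: float = STOPPED_SPEED_THRESHOLD_KMH) -> bool:
    """False (stopped) when speed is below the threshold, True (moving) otherwise."""
    return velocidade_kmh >= limiar


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


print(flag_em_movimento(0.5))   # False
print(flag_em_movimento(25.0))  # True
# Distance between two consecutive (hypothetical) GPS positions, in meters.
print(round(haversine_m(-22.90, -43.20, -22.91, -43.20)))
```

Passing the threshold as a parameter keeps the sketch usable once the real value of 'velocidade_limiar_parado' is known.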







  Publisher details

  Name: Subsecretaria de Tecnologia em Transportes (SUBTT)
  E-mail: dados.smtr@prefeitura.rio

Administração de Serviços Públicos: Chamados feitos ao 1746

datario-pcrj.hub.arcgis.com
data.rio

Updated Jun 2, 2022


Cite

Prefeitura da Cidade do Rio de Janeiro (2022). Administração de Serviços Públicos: Chamados feitos ao 1746 [Dataset]. https://datario-pcrj.hub.arcgis.com/documents/52b6bd003abf4b8995ec9860e65a82c5

Dataset updated
Jun 2, 2022

Dataset authored and provided by
Prefeitura da Cidade do Rio de Janeiro

License
Attribution-NoDerivs 3.0 (CC BY-ND 3.0) https://creativecommons.org/licenses/by-nd/3.0/
License information was derived automatically

Description

Service requests (chamados) made to 1746. The data covers requests since March 2011, when the 1746 project began.

  How to access

  On this page

  Here you will find a button to download the data in CSV format, compressed with gzip. Alternatively, click here for the same result.

  BigQuery

      SELECT *
      FROM `datario.administracao_servicos_publicos.chamado_1746`
      LIMIT 1000

  Click here to go directly to this table in BigQuery. If you have no experience with BigQuery, see our documentation to learn how to access the data.


  Python



    import basedosdados as bd

    # To load the data directly into pandas
    df = bd.read_sql(
        "SELECT * FROM `datario.administracao_servicos_publicos.chamado_1746` LIMIT 1000",
        billing_project_id="<id_do_seu_projeto_gcp>",  # your GCP project ID
    )




  R



    install.packages("basedosdados")
    library("basedosdados")

    # Set your Google Cloud project
    set_billing_id("<id_do_seu_projeto_gcp>")

    # To load the data directly into R
    tb <- read_sql("SELECT * FROM `datario.administracao_servicos_publicos.chamado_1746` LIMIT 1000")






  Temporal coverage
  March 2011

  Update frequency
  Daily

  Managing agency
  SEGOVI




  Columns (name: description)

    id_chamado: Unique identifier of the request in the database.
    data_inicio: Date the request was opened, which is when the operator registers it.
    data_fim: Date the request was closed. A request is closed when it is fulfilled or when it is determined that it cannot be fulfilled.
    id_bairro: Unique identifier, in the database, of the neighborhood where the fact that gave rise to the request occurred.
    id_territorialidade: Unique identifier, in the database, of the territorialidade where the fact that gave rise to the request occurred. A territorialidade is a region of the city of Rio de Janeiro for which a specific agency is responsible. Example: CDURP, which is responsible for the Rio de Janeiro port region.
    id_logradouro: Unique identifier, in the database, of the street (logradouro) where the fact that gave rise to the request occurred.
    numero_logradouro: Street number where the fact that gave rise to the request occurred.
    id_unidade_organizacional: Unique identifier, in the database, of the agency that handles the request. For example: the identifier of COMLURB when the request concerns urban cleaning.
    nome_unidade_organizacional: Name of the agency that handles the request. For example: COMLURB when the request concerns urban cleaning.
    unidade_organizadional_ouvidoria: Boolean indicating whether the citizen's request was made through the Ouvidoria (ombudsman's office): 1 if yes, 0 if no.
    categoria: Request category. Examples: Serviço, informação, sugestão, elogio, reclamação, crítica (service, information, suggestion, compliment, complaint, criticism).
    id_tipo: Unique identifier, in the database, of the request type. E.g.: Iluminação pública (street lighting).
    tipo: Name of the request type. E.g.: Iluminação pública (street lighting).
    id_subtipo: Unique identifier, in the database, of the request subtype. E.g.: Reparo de lâmpada apagada (repair of a lamp that is out).
    subtipo: Name of the request subtype. E.g.: Reparo de lâmpada apagada (repair of a lamp that is out).
    status: Request status. E.g.: Fechado com solução (closed with solution), aberto em andamento (open, in progress), pendente (pending), etc.
    longitude: Longitude of the location of the event that prompted the request.
    latitude: Latitude of the location of the event that prompted the request.
    data_alvo_finalizacao: Target date for fulfilling the request. If prazo_tipo is D, it remains blank until the diagnosis is made.
    data_alvo_diagnostico: Target date for diagnosing the service. If prazo_tipo is F, this date remains blank.
    data_real_diagnostico: Date on which the service was diagnosed. If prazo_tipo is F, this date remains blank.
    tempo_prazo: Deadline for the service to be performed, in days or hours after the request is opened. If there is a diagnosis, the deadline counts from when the diagnosis is made.
    prazo_unidade: Unit of time used for the deadline: days (D) or hours (H).
    prazo_tipo: Diagnosis (D) or completion (F). Indicates whether the request requires a diagnosis. Some services need an assessment before they can be carried out, in which case a diagnosis is made. For example, tree pruning: an environmental engineer must verify whether the pruning is needed.
    id_unidade_organizacional_mae: ID of the parent organizational unit of the agency that handles the request. For example: "CVA - Coordenação de Vigilância de Alimentos" handles the request and reports to the parent organizational unit "IVISA-RIO - Instituto Municipal de Vigilância Sanitária, de Zoonoses e de Inspeção Agropecuária". The column refers to the ID of the latter.
    situacao: Indicates whether the request has been closed.
    tipo_situacao: Indicates the current status of the request among the categories Atendido, Atendido parcialmente, Não atendido, Não constatado and Andamento (fulfilled, partially fulfilled, not fulfilled, not verified, in progress).
    dentro_prazo: Indicates whether the request's target completion date is still within the stipulated deadline.
    justificativa_status: Justification that agencies give when setting the status. Example: SEM POSSIBILIDADE DE ATENDIMENTO - justificativa: Fora de área de atuação do municipio (cannot be fulfilled; justification: outside the municipality's area of responsibility).
    reclamacoes: Number of complaints.
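The interplay between tempo_prazo, prazo_unidade, prazo_tipo and the diagnosis date can be sketched as follows. This is one reading of the column descriptions, not the 1746 system's own logic; the function name and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Sketch only: argument names mirror the columns, but the logic is an
# interpretation of their descriptions, not the 1746 system's code.


def target_completion(
    data_inicio: datetime,
    tempo_prazo: int,
    prazo_unidade: str,  # "D" = days, "H" = hours
    prazo_tipo: str,     # "D" = needs diagnosis, "F" = completion only
    data_real_diagnostico: Optional[datetime] = None,
) -> Optional[datetime]:
    """Return the target completion date, or None while a diagnosis is pending."""
    if prazo_unidade == "D":
        delta = timedelta(days=tempo_prazo)
    elif prazo_unidade == "H":
        delta = timedelta(hours=tempo_prazo)
    else:
        raise ValueError(f"unknown prazo_unidade: {prazo_unidade!r}")

    if prazo_tipo == "D":
        if data_real_diagnostico is None:
            return None  # stays blank until the diagnosis is made
        return data_real_diagnostico + delta  # deadline counts from the diagnosis
    return data_inicio + delta  # deadline counts from the opening of the request


opened = datetime(2022, 6, 1, 9, 0)
print(target_completion(opened, 5, "D", "F"))   # 2022-06-06 09:00:00
print(target_completion(opened, 48, "H", "D"))  # None
print(target_completion(opened, 48, "H", "D", datetime(2022, 6, 3, 9, 0)))  # 2022-06-05 09:00:00
```

Note that the letter D means days in prazo_unidade but diagnosis in prazo_tipo, which is why the two arguments are checked separately.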







  Publisher details

  Name: Patricia Catandi
  E-mail: patriciabcatandi@gmail.com


Cite

James P. Diprose; Cameron Neylon (2022). COKI Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.6636625

COKI Language Dataset

4 scholarly articles cite this dataset (View in Google Scholar)

Available download formats
application/gzip, csv

Unique identifier
https://doi.org/10.5281/zenodo.6636625

Dataset updated
Jun 16, 2022

Dataset provided by
Zenodo http://zenodo.org/

Authors
James P. Diprose; Cameron Neylon

License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The COKI Language Dataset contains predictions for 122 million academic publications. The dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.

Methodology
A subset of the COKI Academic Observatory Dataset, which is produced by the Academic Observatory Workflows codebase [1], was extracted and converted to CSV with BigQuery and downloaded to a virtual machine. The subset consists of all publications with DOIs in our dataset, including each publication’s title and abstract from both Crossref Metadata and Microsoft Academic Graph (MAG). The CSV files were then processed with a Python script. The titles and abstracts for each record were pre-processed, concatenated and analysed with fastText. The titles and abstracts from Crossref Metadata were used first, with the MAG titles and abstracts serving as a fallback when the Crossref Metadata information was empty. Language was predicted for each publication using the fastText lid.176.bin language identification model [2]. fastText was chosen because of its high accuracy and fast runtime [3]. The final output dataset consists of DOI, title, ISO language code and the fastText language prediction probability score.
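The text-selection and prediction steps described in the methodology could look roughly like the sketch below. The record field names, the whitespace-collapsing pre-processing and the helper names are assumptions; the authors' actual script lives in their repository. The fallback is applied here to the whole Crossref text, which is only one possible reading of the description.

```python
from typing import Optional

# Sketch only: field names, the pre-processing step and the helper names are
# assumptions, not the authors' implementation.


def preprocess(text: Optional[str]) -> str:
    """Collapse whitespace; fastText's predict() rejects text containing newlines."""
    return " ".join((text or "").split())


def select_text(record: dict) -> str:
    """Use the Crossref title and abstract first; fall back to MAG when empty."""
    crossref = preprocess(f"{record.get('crossref_title') or ''} {record.get('crossref_abstract') or ''}")
    if crossref:
        return crossref
    return preprocess(f"{record.get('mag_title') or ''} {record.get('mag_abstract') or ''}")


def predict_language(model, text: str) -> tuple[str, float]:
    """Call a fastText-style model: predict(text, k=1) returns (labels, probabilities)."""
    labels, probabilities = model.predict(text, k=1)
    return labels[0].replace("__label__", ""), float(probabilities[0])


# With the fasttext package installed and lid.176.bin downloaded, a model
# could be loaded with: model = fasttext.load_model("lid.176.bin")
record = {"crossref_title": "", "crossref_abstract": None, "mag_title": "Um título", "mag_abstract": "resumo\ncom quebra"}
print(select_text(record))  # Um título resumo com quebra
```

Passing the model in as an argument keeps the selection logic testable without the model file.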

Query or Download
The data is publicly accessible in BigQuery in the following two tables:

When you make queries on these tables, make sure that you are in your own Google Cloud project, otherwise the queries will fail.

See the COKI Language Detection README for instructions on how to download the data from Zenodo and load it into BigQuery.

Code
The code that generated this dataset, the BigQuery schemas and instructions for loading the data into BigQuery can be found here: https://github.com/The-Academic-Observatory/coki-language

Attributions
This work contains information from:

Microsoft Academic Graph which is made available under the ODC Attribution Licence.
Crossref Metadata via the Metadata Plus program. Bibliographic metadata is made available without copyright restriction and Crossref generated data under a CC0 licence. See metadata licence information for more details.

References
[1] https://doi.org/10.5281/zenodo.6366695
[2] https://fasttext.cc/docs/en/language-identification.html
[3] https://modelpredict.com/language-identification-survey

