License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
Querying BigQuery tables
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME].
If you're using Python, you can start with this code:
import pandas as pd
from bq_helper import BigQueryHelper

# Point the helper at the dataset referenced above (bigquery-public-data.github_repos)
bq_assistant = BigQueryHelper("bigquery-public-data", "github_repos")
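From there, a minimal sketch of how the helper might be used, assuming the bq_helper API available in Kernels (list_tables and query_to_pandas_safe); the licenses table and its columns are as published in the github_repos listing, but are worth confirming with table_schema():

# List the tables in the dataset, then run a small, cost-capped query
print(bq_assistant.list_tables())

query = """
SELECT license, COUNT(*) AS n_repos
FROM `bigquery-public-data.github_repos.licenses`
GROUP BY license
ORDER BY n_repos DESC
"""
# query_to_pandas_safe refuses to run if the query would scan more than max_gb_scanned
df = bq_assistant.query_to_pandas_safe(query, max_gb_scanned=1)
print(df.head())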
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
This dataset was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Individual reports can be parsed by export session ID (column A) or timestamp (column B). Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences.
This is the list of manipulations performed on the original dataset published by Möbius.
All cleaning and rearrangement steps were performed in BigQuery using SQL.
1) After I took a closer look at the source dataset, I realized that for my case study, I did not need some of the tables contained in the original archive. Therefore, I decided not to import
- dailyCalories_merged.csv,
- dailyIntensities_merged.csv,
- dailySteps_merged.csv
as they proved redundant: their content can also be found in the dailyActivity_merged.csv file.
In addition, the files
- minutesCaloriesWide_merged.csv,
- minutesIntensitiesWide_merged.csv,
- minuteStepsWide_merged.csv
were not imported, as they present the same data as other files, only in a wide format. Hence, only the long-format files containing the same data were imported into the BigQuery database.
2) To compare and measure correlations among different variables based on hourly records, I created a new table with a LEFT JOIN on the Id and ActivityHour columns. I repeated the same JOIN on the tables with minute-level records. This produced 2 new tables: hourly_activity.csv and minute_activity.csv. (A sketch of the hourly join is shown below.)
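A minimal sketch of what that hourly join could look like, written as BigQuery SQL submitted through the Python client; the project, dataset, and table names are placeholders, and the joined columns (StepTotal, Calories, TotalIntensity) are assumed from the original hourly CSVs:

from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # placeholder project ID

# Combine the hourly steps, calories, and intensities tables on Id and ActivityHour
hourly_join_sql = """
CREATE OR REPLACE TABLE `your-gcp-project.fitbit.hourly_activity` AS
SELECT
  s.Id,
  s.ActivityHour,
  s.StepTotal,
  c.Calories,
  i.TotalIntensity
FROM `your-gcp-project.fitbit.hourly_steps` AS s
LEFT JOIN `your-gcp-project.fitbit.hourly_calories` AS c
  ON s.Id = c.Id AND s.ActivityHour = c.ActivityHour
LEFT JOIN `your-gcp-project.fitbit.hourly_intensities` AS i
  ON s.Id = i.Id AND s.ActivityHour = i.ActivityHour
"""
client.query(hourly_join_sql).result()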
3) To validate most of the columns containing DATE and DATETIME values that had been imported as the STRING data type, I used the PARSE_DATE() and PARSE_DATETIME() functions. While importing the
- heartrate_seconds_merged.csv,
- hourlyCalories_merged.csv,
- hourlyIntensities_merged.csv,
- hourlySteps_merged.csv,
- minuteCaloriesNarrow_merged.csv,
- minuteIntensitiesNarrow_merged.csv,
- minuteMETsNarrow_merged.csv,
- minuteSleep_merged.csv,
- minuteStepsNarrow_merged.csv,
- sleepDay_merged.csv,
- weightLogInfo_merged.csv
files into BigQuery, the DATETIME and DATE columns had to be imported as STRING, because the original syntax used in the CSV files could not be recognized as a valid DATETIME value due to the “AM”/“PM” text at the end of the expression. (A sketch of the conversion is shown below.)
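A minimal sketch of that conversion, again issued from Python; the table name is a placeholder, and the exact format string depends on how the timestamps appear in the CSV (e.g. "4/12/2016 12:00:00 AM"):

from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # placeholder project ID

# Turn an AM/PM-style STRING timestamp into a proper DATETIME column
parse_sql = """
SELECT
  Id,
  PARSE_DATETIME('%m/%d/%Y %I:%M:%S %p', ActivityHour) AS activity_hour,
  Calories
FROM `your-gcp-project.fitbit.hourly_calories`
"""
df = client.query(parse_sql).to_dataframe()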
The Genome Aggregation Database (gnomAD) is maintained by an international coalition of investigators to aggregate and harmonize data from large-scale sequencing projects. These public datasets are available in VCF format in Google Cloud Storage and in Google BigQuery as integer range partitioned tables. Each dataset is sharded by chromosome, meaning that variants are distributed across 24 tables (indicated with a “_chr*” suffix); using the sharded tables reduces query costs significantly. Variant Transforms was used to process these VCF files and import them to BigQuery, and VEP annotations were parsed into separate columns for easier analysis using Variant Transforms’ annotation support. These public datasets are included in BigQuery's free tier: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets, and use this quick start guide to learn how to access public datasets on Google Cloud Storage. Find out more in our blog post, Providing open access to gnomAD on Google Cloud. Questions? Contact gcp-life-sciences-discuss@googlegroups.com.
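A minimal sketch of a per-chromosome query against one of the sharded tables, using the BigQuery Python client. The shard name below (`bigquery-public-data.gnomAD.v3_genomes__chr21`) and the selected columns (from the Variant Transforms VCF schema) are assumptions; check the dataset listing and table schema for the exact names:

from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # placeholder project ID

# Querying a single chromosome shard keeps the bytes scanned (and the cost) low
query = """
SELECT reference_name, start_position, reference_bases
FROM `bigquery-public-data.gnomAD.v3_genomes__chr21`
LIMIT 10
"""
for row in client.query(query).result():
    print(dict(row))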
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Section 337, Tariff Act of 1930, Investigations of Unfair Practices in Import Trade. Under Section 337, the USITC determines whether there is unfair competition in the importation of products into, or their subsequent sale in, the United States. Section 337 prohibits the importation into the US, or the sale by owners, importers, or consignees, of articles that infringe a patent, copyright, trademark, or semiconductor mask work, or where unfair competition or unfair acts exist that can destroy or substantially injure a US industry, prevent one from developing, or restrain or monopolize trade in US commerce. These latter categories are very broad: unfair competition can involve counterfeit, mismarked, or misbranded goods; sales of goods at unfairly low prices; other antitrust violations such as price fixing or market division; or goods that violate a standard applicable to them.
US International Trade Commission 337Info Unfair Import Investigations Information System contains data on investigations done under Section 337. Section 337 declares the infringement of certain statutory intellectual property rights and other forms of unfair competition in import trade to be unlawful practices. Most Section 337 investigations involve allegations of patent or registered trademark infringement.
Fork this notebook to get started accessing data in the BigQuery dataset, using the bq_helper package to write SQL queries (a minimal sketch follows).
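A minimal sketch, assuming the same bq_helper pattern used elsewhere on this page; rather than assuming table names, it discovers them at runtime:

from bq_helper import BigQueryHelper

# Point the helper at the usitc_investigations dataset referenced under Data Origin below
bq_assistant = BigQueryHelper("patents-public-data", "usitc_investigations")

# Discover the available tables, inspect the first one's schema, then preview a few rows
tables = bq_assistant.list_tables()
print(tables)
print(bq_assistant.table_schema(tables[0]))
print(bq_assistant.head(tables[0], num_rows=5))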
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:usitc_investigations
"US International Trade Commission 337Info Unfair Import Investigations Information System" by the USITC, for public use.
Banner photo by João Silas on Unsplash
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project analyzes Kimia Farma's performance from 2020 to 2023 using Google Looker Studio. The analysis is based on a pre-processed dataset stored in BigQuery, which serves as the data source for the dashboard.
The dashboard is designed to provide insights into branch performance, sales trends, customer ratings, and profitability. The development is ongoing, with multiple pages planned for a more in-depth analysis.
✅ The first page of the dashboard is completed
✅ A sample dashboard file is available on Kaggle
🔄 Development will continue with additional pages
The dataset consists of transaction records from Kimia Farma branches across different cities and provinces. Below are the key columns used in the analysis:
- transaction_id: Transaction ID code
- date: Transaction date
- branch_id: Kimia Farma branch ID code
- branch_name: Kimia Farma branch name
- kota: City of the Kimia Farma branch
- provinsi: Province of the Kimia Farma branch
- rating_cabang: Customer rating of the Kimia Farma branch
- customer_name: Name of the customer who made the transaction
- product_id: Product ID code
- product_name: Name of the medicine
- actual_price: Price of the medicine
- discount_percentage: Discount percentage applied to the medicine
- persentase_gross_laba: Gross profit percentage, based on the following price tiers (see the SQL sketch after this column list):
  - Price ≤ Rp 50,000 → 10% profit
  - Price > Rp 50,000 and ≤ Rp 100,000 → 15% profit
  - Price > Rp 100,000 and ≤ Rp 300,000 → 20% profit
  - Price > Rp 300,000 and ≤ Rp 500,000 → 25% profit
  - Price > Rp 500,000 → 30% profit
- nett_sales: Price after discount
- nett_profit: Profit earned by Kimia Farma
- rating_transaksi: Customer rating of the transaction
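A minimal sketch of how these derived columns could be computed in BigQuery from Python. The table name is a placeholder, and the sketch assumes discount_percentage is stored as a fraction (e.g. 0.1 for 10%) and that nett_profit equals nett_sales multiplied by the gross profit percentage:

from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # placeholder project ID

# Derive the gross profit tier, net sales, and net profit from the raw columns
query = """
SELECT
  transaction_id,
  actual_price,
  CASE
    WHEN actual_price <= 50000  THEN 0.10
    WHEN actual_price <= 100000 THEN 0.15
    WHEN actual_price <= 300000 THEN 0.20
    WHEN actual_price <= 500000 THEN 0.25
    ELSE 0.30
  END AS persentase_gross_laba,
  actual_price * (1 - discount_percentage) AS nett_sales,
  actual_price * (1 - discount_percentage) *
    CASE
      WHEN actual_price <= 50000  THEN 0.10
      WHEN actual_price <= 100000 THEN 0.15
      WHEN actual_price <= 300000 THEN 0.20
      WHEN actual_price <= 500000 THEN 0.25
      ELSE 0.30
    END AS nett_profit
FROM `your-gcp-project.kimia_farma.analysis_table`
"""
df = client.query(query).to_dataframe()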
📌 kimia farma_query.txt – Contains SQL queries used for data analysis in Looker Studio
📌 kimia farma_analysis_table.csv – Preprocessed dataset ready for import and analysis
License: CC0 1.0 Universal (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset provides 69,000 instances of natural language processing (NLP) editing tasks to help researchers develop more effective AI text-editing models. Compiled into a convenient JSON format, this collection offers easy access so that researchers have the tools they need to create groundbreaking AI models that efficiently and effectively redefine natural language processing. This is your chance to be at the forefront of NLP technology and make history through innovative AI capabilities. So join in and unlock a world of possibilities with CoEdIT's Text Editing Dataset!
For more datasets, click here.
- Familiarize yourself with the format of the dataset by taking a look at the columns: task, src, tgt. You’ll see that each row in this dataset contains a specific NLP editing task as well as source text (src) and target text (tgt) which displays what should result from that editing task.
- Import the JSON file of this dataset into your machine learning environment or analysis software of choice. Popular options include Python's pandas library, BigQuery on Google Cloud Platform for larger datasets like this one, or a spreadsheet tool such as Excel.
- Once you've imported the data into your chosen program, start exploring. Look through various rows to get a sense of how different types of edits turn the source text into a target text that meets the criteria of the given task; reading any documentation associated with each column will help you understand the context before you begin your analysis or coding.
- Test out code that processes different types and scales of edits. For example, if understanding how punctuation affects sentence-similarity measures gives key insight into the meaning being conveyed, develop code accordingly, experimenting with common ML/NLP algorithms and libraries such as NLTK.
- Finally, once you have tested your conceptual ideas, build efficient and effective AI-powered models using training data tailored to the tasks at hand, and evaluate performance on the validation and test splits before moving to production. (A minimal loading sketch follows.)
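A minimal sketch of loading the train split with pandas; the file and column names (train.csv with task, src, tgt) are taken from the table descriptions below, and the path assumes the files sit in the working directory:

import pandas as pd

# Load the training split; each row has a task label, a source text, and a target text
train = pd.read_csv("train.csv")
print(train.columns.tolist())            # expected: ['task', 'src', 'tgt']
print(train["task"].value_counts().head())

# Inspect one example edit
example = train.iloc[0]
print("TASK:", example["task"])
print("SRC: ", example["src"])
print("TGT: ", example["tgt"])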
- Automated Grammar Checking Solutions: This dataset can be used to train machine learning models to detect grammatical errors and suggest proper corrections.
- Text Summarization: Using this dataset, researchers can create AI-powered summarization algorithms that summarize long-form passages into shorter summaries while preserving accuracy and readability
- Natural Language Generation: This dataset could be used to develop AI solutions that generate accurately formatted natural language sentences when given a prompt or some other form of input
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv
- task (String): The editing task the row is intended for.
- src (String): The source text input.
- tgt (String): The target text output.
File: train.csv
- task (String): The editing task the row is intended for.
- src (String): The source text input.
- tgt (String): The target text output.
License: Attribution-NoDerivs 3.0 (CC BY-ND 3.0), https://creativecommons.org/licenses/by-nd/3.0/
License information was derived automatically
Estimated precipitation rate for areas of southeastern Brazil. Estimates are produced every hour, with each record containing the data for one estimate. Each area is a square 4 km on a side. Data collected by the GOES-16 satellite.
How to access
On this page
Here you will find a button to download the data as a gzip-compressed CSV file. For the same result, you can also click here.
BigQuery
SELECT *
FROM `datario.meio_ambiente_clima.taxa_precipitacao_satelite`
LIMIT 1000
Click here to go directly to this table in BigQuery. If you are not familiar with BigQuery, see our documentation to learn how to access the data.
Python
import basedosdados as bd

# To load the data directly into pandas
df = bd.read_sql(
    "SELECT * FROM `datario.meio_ambiente_clima.taxa_precipitacao_satelite` LIMIT 1000",
    billing_project_id="<your_gcp_project_id>"
)
R
install.packages("basedosdados")
library("basedosdados")

# Set your Google Cloud project
set_billing_id("<your_gcp_project_id>")

# To load the data directly into R
tb <- read_sql(
    "SELECT * FROM `datario.meio_ambiente_clima.taxa_precipitacao_satelite` LIMIT 1000"
)
Temporal coverage
From 2020 to the current date
Update frequency
Daily
Managing agency
Centro de Operações da Prefeitura do Rio (COR)
Columns
- latitude: Latitude of the center of the area.
- longitude: Longitude of the center of the area.
- rrqpe: Estimated precipitation rate, measured in millimeters per hour.
- primary_key: Primary key created by concatenating the date, time, latitude, and longitude columns; it serves to avoid duplicate records (see the sketch after this list).
- horario: Time at which the measurement was taken.
- data_particao: Date on which the measurement was taken.
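A minimal sketch of how such a deduplication key could be rebuilt in BigQuery through the basedosdados client shown above; the CONCAT expression illustrates the described construction and is not necessarily the publisher's exact formula:

import basedosdados as bd

# Rebuild a date + time + latitude + longitude key next to the published primary_key
query = """
SELECT
  primary_key,
  CONCAT(
    CAST(data_particao AS STRING), ' ',
    CAST(horario AS STRING), ' ',
    CAST(latitude AS STRING), ' ',
    CAST(longitude AS STRING)
  ) AS rebuilt_key
FROM `datario.meio_ambiente_clima.taxa_precipitacao_satelite`
LIMIT 100
"""
df = bd.read_sql(query, billing_project_id="<your_gcp_project_id>")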
Publisher information
Name: Patrícia Catandi
E-mail: patriciabcatandi@gmail.com
License: Attribution-NoDerivs 3.0 (CC BY-ND 3.0), https://creativecommons.org/licenses/by-nd/3.0/
License information was derived automatically
Data on the AlertaRio rain gauge stations (Sistema Alerta Rio da Prefeitura do Rio de Janeiro) in the city of Rio de Janeiro.
How to access
On this page
Here you will find a button to download the data as a gzip-compressed CSV file. For the same result, you can also click here.
BigQuery
SELECT *
FROM `datario.meio_ambiente_clima.estacoes_alertario`
LIMIT 1000
Click here to go directly to this table in BigQuery. If you are not familiar with BigQuery, see our documentation to learn how to access the data.
Python
import basedosdados as bd

# To load the data directly into pandas
df = bd.read_sql(
    "SELECT * FROM `datario.meio_ambiente_clima.estacoes_alertario` LIMIT 1000",
    billing_project_id="<your_gcp_project_id>"
)
R
install.packages("basedosdados")
library("basedosdados")

# Set your Google Cloud project
set_billing_id("<your_gcp_project_id>")

# To load the data directly into R
tb <- read_sql(
    "SELECT * FROM `datario.meio_ambiente_clima.estacoes_alertario` LIMIT 1000"
)
Temporal coverage
N/A
Update frequency
Annual
Managing agency
COR
Columns
- x: X UTM coordinate (SAD69 Zone 23).
- longitude: Longitude where the station is located.
- id_estacao: Station ID defined by AlertaRIO.
- estacao: Station name.
- latitude: Latitude where the station is located.
- cota: Elevation in meters at which the station is located.
- endereco: Full address of the station.
- situacao: Indicates whether the station is operational or failing.
- data_inicio_operacao: Date on which the station started operating.
- data_fim_operacao: Date on which the station stopped operating.
- data_atualizacao: Last date on which the operating-date information was updated.
- y: Y UTM coordinate (SAD69 Zone 23).
Publisher information
Name: Patricia Catandi
E-mail: patriciabcatandi@gmail.com
License: Attribution-NoDerivs 3.0 (CC BY-ND 3.0), https://creativecommons.org/licenses/by-nd/3.0/
License information was derived automatically
Standard operating procedures (POPs) in use at the PCRJ (Rio de Janeiro City Hall). A POP is a procedure used to resolve an event and is made up of several activities. An event is an occurrence in the city of Rio de Janeiro that requires monitoring and, in most cases, an action by the PCRJ, for example a pothole in the street. Also available through the Data Office API: https://api.dados.rio/v1/
How to access
On this page
Here you will find a button to download the data as a gzip-compressed CSV file. For the same result, you can also click here.
BigQuery
SELECT *
FROM `datario.adm_cor_comando.procedimento_operacional_padrao`
LIMIT 1000
Click here to go directly to this table in BigQuery. If you are not familiar with BigQuery, see our documentation to learn how to access the data.
Python
import basedosdados as bd

# To load the data directly into pandas
df = bd.read_sql(
    "SELECT * FROM `datario.adm_cor_comando.procedimento_operacional_padrao` LIMIT 1000",
    billing_project_id="<your_gcp_project_id>"
)
R
install.packages("basedosdados")
library("basedosdados")

# Set your Google Cloud project
set_billing_id("<your_gcp_project_id>")

# To load the data directly into R
tb <- read_sql(
    "SELECT * FROM `datario.adm_cor_comando.procedimento_operacional_padrao` LIMIT 1000"
)
Temporal coverage
Not provided.
Update frequency
Monthly
Managing agency
COR
Columns
- id_pop: Identifier of the POP (standard operating procedure).
- pop_titulo: Name of the standard operating procedure.
Publisher information
Name: Patrícia Catandi
E-mail: patriciabcatandi@gmail.com