12 datasets found

Google Analytics Sample
kaggle.com
zip
Updated Sep 19, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). Google Analytics Sample [Dataset]. https://www.kaggle.com/datasets/bigquery/google-analytics-sample
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Sep 19, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.

Content

The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:

Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc. Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions that occur on the Google Merchandise Store website.

Fork this kernel to get started.

Acknowledgements

Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:google_analytics_sample.ga_sessions_20170801

Banner Photo by Edho Pratama from Unsplash.

Inspiration

What is the total number of transactions generated per device browser in July 2017?

The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

What was the average number of product pageviews for users who made a purchase in July 2017?

What was the average number of product pageviews for users who did not make a purchase in July 2017?

What was the average total transactions per user that made a purchase in July 2017?

What is the average amount of money spent per session in July 2017?

What is the sequence of pages viewed?
1000 Cannabis Genomes Project
kaggle.com
zip
Updated Feb 26, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). 1000 Cannabis Genomes Project [Dataset]. https://www.kaggle.com/bigquery/genomics-cannabis
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 26, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Cannabis is a genus of flowering plants in the family Cannabaceae.

Source: https://en.wikipedia.org/wiki/Cannabis

Content

In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.

https://medium.com/google-cloud/dna-sequencing-of-1000-cannabis-strains-publicly-available-in-google-bigquery-a33430d63998

These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis. Data are available directly from Google Cloud Storage at gs://gcs-public-data--genomics/cannabis, as well as via the Google Genomics API as dataset ID 918853309083001239, and an additional duplicated subset of only transcriptome data as dataset ID 94241232795910911, as well as in the BigQuery dataset bigquery-public-data:genomics_cannabis.

All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.

The following tables are included in the Cannabis Genomes Project dataset:

Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other data that give indications about the type of sample. Sample types include: strain, library prep methods, and sequencing technology. See SRP008673 for an example of upstream sample data. SRP008673 is the University of Toronto sequencing of Cannabis Sativa subspecies Purple Kush.

MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths for the draft assembly of Cannabis Sativa subspecies Cannatonic produced by Phylos Bioscience. This table contains contig identifiers and their lengths.

MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.

MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.

Fork this kernel to get started with this dataset.

Acknowledgements

Dataset Source: http://opencannabisproject.org/ Category: Genomics Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://www.ncbi.nlm.nih.gov/home/about/policies.shtml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. Update frequency: As additional data are released to GenBank View in BigQuery: https://bigquery.cloud.google.com/dataset/bigquery-public-data:genomics_cannabis View in Google Cloud Storage: gs://gcs-public-data--genomics/cannabis

Banner Photo by Rick Proctor from Unplash.

Inspiration

Which Cannabis samples are included in the variants table?

Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?

How many variants does each sample have at the THC Synthase gene (THCA1) locus?
Ethereum Classic Cryptocurrency Dataset - Dataset - CryptoData Hub
cryptodata.center
Updated Dec 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
cryptodata.center (2024). Ethereum Classic Cryptocurrency Dataset - Dataset - CryptoData Hub [Dataset]. https://cryptodata.center/dataset/ethereum-classic-cryptocurrency-dataset
Explore at:
Dataset updated
Dec 4, 2024
Dataset provided by
CryptoDATA
Description
Ethereum Classic is a cryptocurrency with shared history with the Ethereum cryptocurrency. On technical merits, the two cryptocurrencies are nearly identical, differing only in programming language features supported by the Ethereum Virtual machine which is used to write smart contracts. This dataset contains the blockchain data in their entirety, pre-processed to be human-friendly and to support common use cases such as auditing, investigating, and researching the economic and financial properties of the system. Interested in learning more about how Cloud Public Data is working to make data from blockchains and cryptocurrencies more accessible? Check out our blog post on the Google Cloud Big Data Blog and try the sample query below to get started. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
OnPoint Weather - Temperature History & Climatology Sample
console.cloud.google.com
Updated May 14, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Weather%20Source&hl=zh-tw (2023). OnPoint Weather - Temperature History & Climatology Sample [Dataset]. https://console.cloud.google.com/marketplace/product/weathersource-com/data-studio?hl=zh-tw
Explore at:
Dataset updated
May 14, 2023
Dataset provided by
Googlehttp://google.com/
Description
OnPoint Weather is a global weather dataset for business available for any lat/lon point and geographic area such as ZIP codes. OnPoint Weather provides a continuum of hourly and daily weather from the year 2000 to current time and a forward forecast of 45 days. OnPoint Climatology provides hourly and daily weather statistics which can be used to determine ‘departures from normal’ and to provide climatological guidance of expected weather for any location at any point in time. The OnPoint Climatology provides weather statistics such as means, standard deviations and frequency of occurrence. Weather has a significant impact on businesses and accounts for hundreds of billions in lost revenue annually. OnPoint Weather allows businesses to quantify weather impacts and develop strategies to optimize for weather to improve business performance. Examples of Usage Quantify the impact of weather on sales across diverse locations and times of the year Understand how supply chains are impacted by weather Understand how employee’s attendance and performance are impacted by weather Understand how weather influences foot traffic at malls, stores and restaurants OnPoint Weather is available through Google Cloud Platform’s Commercial Dataset Program and can be easily integrated with other Google Cloud Platform Services to quickly reveal and quantify weather impacts on business. Weather Source provides a full range of support services from answering quick questions to consulting and building custom solutions. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery 瞭解詳情
Z
Dataset Reuse Indicators Datasets
data.niaid.nih.gov
zenodo.org
Updated Sep 6, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Vougiouklis, Pavlos (2020). Dataset Reuse Indicators Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4015954
Explore at:
Dataset updated
Sep 6, 2020
Dataset provided by
Groth, Paul
Simperl, Elena
Koesten, Laura
Vougiouklis, Pavlos
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains two files.

1) A python pickle file (github_dataset.zip) that contains Github repositories with datasets. Specifically, using Google’s public dataset copy of Github and the BigQuery service to build a list of repositories that have a CSV or XLSX or XLS file. We then used the GitHub API to collect nformation about each repository in this list. The resulting dataset consists of 87936 repositories that contain at least a CSV, XLSX or XLS file, alongside with information about their features (e.g. number of open and closed issues and license) from GitHub. This corpus had more than two million data files. We then excluded those files withless then ten rows, which was the case for 65537 repositories with a total of 1,467,240 data files.

2) A python pickle file (processed_dataset.zip) containing the feature information necessary to train a machine learning model to predict reuse on these Github datasets

Source code can be found at: https://github.com/laurakoesten/Dataset-Reuse-Indicators

For a full description of the content see:

Koesten, Laura and Vougiouklis, Pavlos and Simperl, Elena and Groth, Paul, Dataset Reuse: Translating Principles to Practice. Available at SSRN: https://ssrn.com/abstract=3589836 or http://dx.doi.org/10.2139/ssrn.3589836
Intellectual Property Investigations by the USITC
kaggle.com
zip
Updated Feb 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). Intellectual Property Investigations by the USITC [Dataset]. https://www.kaggle.com/bigquery/usitc-investigations
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 12, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/
Authors
Google BigQuery
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

Section 337, Tariff Act of 1930, Investigations of Unfair Practices in Import Trade. Under section 337, the USITC determines whether there is unfair competition in the importation of products into, or their subsequent sale in, the United States. Section 337 prohibits the importation into the US , or the sale of such articles by owners, importers or consignees, of articles which infringe a patent, copyright, trademark, or semiconductor mask work, or where unfair competition or unfair acts exist that can destroy or substantially injure a US industry or prevent one from developing, or restrain or monopolize trade in US commerce. These latter categories are very broad: unfair competition can involve counterfeit, mismarked or misbranded goods, where the sale of the goods are at unfairly low prices, where other antitrust violations take place such as price fixing, market division or the goods violate a standard applicable to such goods.

Content

US International Trade Commission 337Info Unfair Import Investigations Information System contains data on investigations done under Section 337. Section 337 declares the infringement of certain statutory intellectual property rights and other forms of unfair competition in import trade to be unlawful practices. Most Section 337 investigations involve allegations of patent or registered trademark infringement.

Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.

Acknowledgements

Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:usitc_investigations

"US International Trade Commission 337Info Unfair Import Investigations Information System" by the USITC, for public use.

Banner photo by João Silas on Unsplash
Influence of Continuous Integration on the Development Activity in GitHub...
zenodo.org
csv
Updated Jan 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack (2020). Influence of Continuous Integration on the Development Activity in GitHub Projects [Dataset]. http://doi.org/10.5281/zenodo.1140261
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.1140261
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sebastian Baltes; Sebastian Baltes; Jascha Knack; Jascha Knack
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 - 2017-07-17.

We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:

were active for one year before the first build with Travis CI (before_ci),

used Travis CI at least for one year (during_ci),

had commit or merge activity on the default branch in both of these phases, and

used the default branch to trigger builds.

To derive the time frames, we employed the GHTorrent Big Query data set. The resulting sample contains 321 projects. Of these projects, 214 are Ruby projects and 107 are Java projects. The mean time span before_ci was 2.9 years (SD=1.9, Mdn=2.3), the mean time span during_ci was 3.2 years (SD=1.1, Mdn=3.3). For our analysis, we only consider the activity one year before and after the first build.

We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).

The dataset contains the following files:

tr_projects_sample_filtered.csv
A CSV file with information about the 321 selected projects.

tr_sample_commits_default_branch_before_ci.csv
tr_sample_commits_default_branch_during_ci.csv
One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The branch to which the commit was made.
hash_value: The SHA1 hash value of the commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.

tr_sample_merges_default_branch_before_ci.csv
tr_sample_merges_default_branch_during_ci.csv
One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:

project: GitHub project name ("/" replaced by "_").
branch: The destination branch of the merge.
hash_value: The SHA1 hash value of the merge commit.
merged_commits: Unique hash value prefixes of the commits merged with this commit.
author_name: The author name.
author_email: The author email address.
author_date: The authoring timestamp.
commit_name: The committer name.
commit_email: The committer email address.
commit_date: The commit timestamp.
log_message_length: The length of the git commit messages (in characters).
file_count: Files changed with this commit.
lines_added: Lines added to all files changed with this commit.
lines_deleted: Lines deleted in all files changed with this commit.
file_extensions: Distinct file extensions of files changed with this commit.
pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message).
source_user: GitHub login name of the user who initiated the pull request (extracted from log message).
source_branch : Source branch of the pull request (extracted from log message).
COVID19 - The New York Times
kaggle.com
zip
Updated May 18, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2020). COVID19 - The New York Times [Dataset]. https://www.kaggle.com/bigquery/covid19-nyt
Explore at:
zip(0 bytes)Available download formats
Dataset updated
May 18, 2020
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/
Authors
Google BigQuery
Description
Context

This is the US Coronavirus data repository from The New York Times . This data includes COVID-19 cases and deaths reported by state and county. The New York Times compiled this data based on reports from state and local health agencies. More information on the data repository is available here . For additional reporting and data visualizations, see The New York Times’ U.S. coronavirus interactive site

Sample Queries

Query 1

Which US counties have the most confirmed cases per capita? This query determines which counties have the most cases per 100,000 residents. Note that this may differ from similar queries of other datasets because of differences in reporting lag, methodologies, or other dataset differences.

SELECT covid19.county, covid19.state_name, total_pop AS county_population, confirmed_cases, ROUND(confirmed_cases/total_pop *100000,2) AS confirmed_cases_per_100000, deaths, ROUND(deaths/total_pop *100000,2) AS deaths_per_100000 FROM bigquery-public-data.covid19_nyt.us_counties covid19 JOIN bigquery-public-data.census_bureau_acs.county_2017_5yr acs ON covid19.county_fips_code = acs.geo_id WHERE date = DATE_SUB(CURRENT_DATE(),INTERVAL 1 day) AND covid19.county_fips_code != "00000" ORDER BY confirmed_cases_per_100000 desc

Query 2

How do I calculate the number of new COVID-19 cases per day? This query determines the total number of new cases in each state for each day available in the dataset SELECT b.state_name, b.date, MAX(b.confirmed_cases - a.confirmed_cases) AS daily_confirmed_cases FROM (SELECT state_name AS state, state_fips_code , confirmed_cases, DATE_ADD(date, INTERVAL 1 day) AS date_shift FROM bigquery-public-data.covid19_nyt.us_states WHERE confirmed_cases + deaths > 0) a JOIN bigquery-public-data.covid19_nyt.us_states b ON a.state_fips_code = b.state_fips_code AND a.date_shift = b.date GROUP BY b.state_name, date ORDER BY date desc
USFS Forest Inventory and Analysis (FIA) Program
kaggle.com
zip
Updated Mar 20, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Forest Service (2019). USFS Forest Inventory and Analysis (FIA) Program [Dataset]. https://www.kaggle.com/usforestservice/usfs-fia
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Mar 20, 2019
Dataset provided by
U.S. Department of Agriculture Forest Servicehttp://fs.fed.us/
Authors
U.S. Forest Service
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

US Forest Service Forest Inventory and Analysis National Program.

The Forest Inventory and Analysis (FIA) Program of the U.S. Forest Service provides the information needed to assess America's forests.

https://www.fia.fs.fed.us/

Content

As the Nation's continuous forest census, our program projects how forests are likely to appear 10 to 50 years from now. This enables us to evaluate whether current forest management practices are sustainable in the long run and to assess whether current policies will allow the next generation to enjoy America's forests as we do today.

FIA reports on status and trends in forest area and location; in the species, size, and health of trees; in total tree growth, mortality, and removals by harvest; in wood production and utilization rates by various products; and in forest land ownership.

The Forest Service has significantly enhanced the FIA program by changing from a periodic survey to an annual survey, by increasing our capacity to analyze and publish data, and by expanding the scope of our data collection to include soil, under story vegetation, tree crown conditions, coarse woody debris, and lichen community composition on a subsample of our plots. The FIA program has also expanded to include the sampling of urban trees on all land use types in select cities.

For more details, see: https://www.fia.fs.fed.us/library/database-documentation/current/ver70/FIADB%20User%20Guide%20P2_7-0_ntc.final.pdf

Fork this kernel to get started with this dataset.

Acknowledgements

https://www.fia.fs.fed.us/

https://cloud.google.com/blog/big-data/2017/10/get-to-know-your-trees-us-forest-service-fia-dataset-now-available-in-bigquery

FIA is managed by the Research and Development organization within the USDA Forest Service in cooperation with State and Private Forestry and National Forest Systems. FIA traces it's origin back to the McSweeney - McNary Forest Research Act of 1928 (P.L. 70-466). This law initiated the first inventories starting in 1930.

Banner Photo by @rmorton3 from Unplash.

Inspiration

Estimating timberland and forest land acres by state.

https://cloud.google.com/blog/big-data/2017/10/images/4728824346443776/forest-data-4.png" alt="enter image description here"> https://cloud.google.com/blog/big-data/2017/10/images/4728824346443776/forest-data-4.png
COVID-19 Cases in Italy
kaggle.com
zip
Updated May 18, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2020). COVID-19 Cases in Italy [Dataset]. https://www.kaggle.com/datasets/bigquery/covid19-italy
Explore at:
zip(0 bytes)Available download formats
Dataset updated
May 18, 2020
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
Area covered
Italy
Description
Context

This is the Italian Coronavirus data repository from the Dipartimento della Protezione Civile . This dataset was created in response to the Coronavirus public health emergency in Italy and includes COVID-19 cases reported by region

Sample Queries

Dati Italia COVID-19: Which provinces in Italy have the most confirmed cases? Find which Italian provinces have the highest number of confirmed COVID-19 cases as of yesterday. SELECT covid19.province_name AS province, covid19.region_name AS region, confirmed_cases FROM bigquery-public-data.covid19_italy.data_by_province covid19 WHERE EXTRACT(date from DATE) = DATE_SUB(CURRENT_DATE(),INTERVAL 1 day) ORDER BY confirmed_cases desc

Sample Query 2

What percentage of tests performed have resulted in confirmed cases by region? This query determines what percent of tests performed are made up by confirmed cases. SELECT covid19.region_name AS region, total_confirmed_cases, tests_performed, ROUND(total_confirmed_cases/tests_performed*100,2) AS percent_tests_confirmed_cases FROM bigquery-public-data.covid19_italy.data_by_region covid19 WHERE EXTRACT(date from DATE) = DATE_SUB(CURRENT_DATE(),INTERVAL 1 day) ORDER BY percent_tests_confirmed_cases desc
PatCit: A Comprehensive Dataset of Patent Citations
zenodo.org
explore.openaire.eu
application/gzip, bin
Updated Dec 23, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gaétan de Rassenfosse; Gaétan de Rassenfosse; Cyril Verluise; Cyril Verluise (2020). PatCit: A Comprehensive Dataset of Patent Citations [Dataset]. http://doi.org/10.5281/zenodo.3710994
Explore at:
application/gzip, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3710994
Dataset updated
Dec 23, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gaétan de Rassenfosse; Gaétan de Rassenfosse; Cyril Verluise; Cyril Verluise
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PATCIT: A Comprehensive Dataset of Patent Citations [Website, Newsletter, GitHub]

Patents are at the crossroads of many innovation nodes: science, industry, products, competition, etc. Such interactions can be identified through citations in a broad sense.

It is now common to use front-page patent citations to study some aspects of the innovation system. However, there is much more buried in the Non Patent Literature (NPL) citations and in the patent text itself.

Good news: Natural Language Processing (NLP) tools now enable social scientists to excavate and structure this long hidden information. That's the purpose of this project

IN PRACTICE

A detailed presentation of the current state of the project is available in our March 2020 presentation.

So far, we have:

classified the 40 million NPL citations reported in the DOCDB database in 9 distinct research oriented classes with a 90% accuracy rate.

parsed and consolidated the 27 million NPL citations classified as bibliographical references.

extracted, parsed and consolidated in-text bibliographical references and patent citations from the body of all time USPTO patents.

The latest version of the dataset is the v0.15. It is made of the v0.1 of the US contextual citations dataset and v0.2 of the front-page NPL citations dataset.

Give it a try! The dataset is publicly available on Google Cloud BigQuery, just click here.

FEATURES

Open

The code is licensed under MIT-2 and the dataset is licensed under CC4. Two highly permissive licenses.

The project is thought to be dynamically improved by and for the community. Anyone should feel free to open discussions, raise issues, request features and contribute to the project.

Comprehensive

We address worldwide patents, as long as the data is available.

We address all classes of citations, not only bibliographical references.

We address front-page and in-text citations.

Highest standards

We use and implement state-of-the art machine learning solutions.

We take great care to implement only the most efficient solutions. We believe that computational resources should be used sparsely, for both environmental sustainability and long term financial sustainability of the project.
Historical Air Quality
kaggle.com
zip
Updated Feb 12, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
US Environmental Protection Agency (2019). Historical Air Quality [Dataset]. https://www.kaggle.com/datasets/epa/epa-historical-air-quality
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 12, 2019
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Authors
US Environmental Protection Agency
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The AQS Data Mart is a database containing all of the information from AQS. It has every measured value the EPA has collected via the national ambient air monitoring program. It also includes the associated aggregate values calculated by EPA (8-hour, daily, annual, etc.). The AQS Data Mart is a copy of AQS made once per week and made accessible to the public through web-based applications. The intended users of the Data Mart are air quality data analysts in the regulatory, academic, and health research communities. It is intended for those who need to download large volumes of detailed technical data stored at EPA and does not provide any interactive analytical tools. It serves as the back-end database for several Agency interactive tools that could not fully function without it: AirData, AirCompare, The Remote Sensing Information Gateway, the Map Monitoring Sites KML page, etc.

AQS must maintain constant readiness to accept data and meet high data integrity requirements, thus is limited in the number of users and queries to which it can respond. The Data Mart, as a read only copy, can allow wider access.

The most commonly requested aggregation levels of data (and key metrics in each) are:

Sample Values (2.4 billion values back as far as 1957, national consistency begins in 1980, data for 500 substances routinely collected) The sample value converted to standard units of measure (generally 1-hour averages as reported to EPA, sometimes 24-hour averages) Local Standard Time (LST) and GMT timestamps Measurement method Measurement uncertainty, where known Any exceptional events affecting the data NAAQS Averages NAAQS average values (8-hour averages for ozone and CO, 24-hour averages for PM2.5) Daily Summary Values (each monitor has the following calculated each day) Observation count Observation per cent (of expected observations) Arithmetic mean of observations Max observation and time of max AQI (air quality index) where applicable Number of observations > Standard where applicable Annual Summary Values (each monitor has the following calculated each year) Observation count and per cent Valid days Required observation count Null observation count Exceptional values count Arithmetic Mean and Standard Deviation 1st - 4th maximum (highest) observations Percentiles (99, 98, 95, 90, 75, 50) Number of observations > Standard Site and Monitor Information FIPS State Code (the first 5 items on this list make up the AQS Monitor Identifier) FIPS County Code Site Number (unique within the county) Parameter Code (what is measured) POC (Parameter Occurrence Code) to distinguish from different samplers at the same site Latitude Longitude Measurement method information Owner / operator / data-submitter information Monitoring Network to which the monitor belongs Exemptions from regulatory requirements Operational dates City and CBSA where the monitor is located Quality Assurance Information Various data fields related to the 19 different QA assessments possible

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.epa_historical_air_quality.[TABLENAME]. Fork this kernel to get started.

Acknowledgements

Data provided by the US Environmental Protection Agency Air Quality System Data Mart.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Google BigQuery (2019). Google Analytics Sample [Dataset]. https://www.kaggle.com/datasets/bigquery/google-analytics-sample

Google Analytics Sample

Google Analytics Sample (BigQuery)

Explore at:

17 scholarly articles cite this dataset (View in Google Scholar)

zip(0 bytes)Available download formats

Dataset updated

Sep 19, 2019

Dataset provided by

BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/

Authors

Google BigQuery

License

https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

Description

Context

The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.

Content

The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:

Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc. Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions that occur on the Google Merchandise Store website.

Fork this kernel to get started.

Acknowledgements

Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:google_analytics_sample.ga_sessions_20170801

Banner Photo by Edho Pratama from Unsplash.

Inspiration

What is the total number of transactions generated per device browser in July 2017?

The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

What was the average number of product pageviews for users who made a purchase in July 2017?

What was the average number of product pageviews for users who did not make a purchase in July 2017?

What was the average total transactions per user that made a purchase in July 2017?

What is the average amount of money spent per session in July 2017?

What is the sequence of pages viewed?

Clear search

Close search

Google apps

Main menu

Google Analytics Sample

Context

Content

Acknowledgements

Inspiration

1000 Cannabis Genomes Project

Context

Content

Acknowledgements

Inspiration

Ethereum Classic Cryptocurrency Dataset - Dataset - CryptoData Hub

OnPoint Weather - Temperature History & Climatology Sample

Dataset Reuse Indicators Datasets

Intellectual Property Investigations by the USITC

Context

Content

Acknowledgements

Influence of Continuous Integration on the Development Activity in GitHub...

COVID19 - The New York Times

Context

Sample Queries

Query 1

Query 2

USFS Forest Inventory and Analysis (FIA) Program

Context

Content

Acknowledgements

Inspiration

COVID-19 Cases in Italy

Context

Sample Queries

Sample Query 2

PatCit: A Comprehensive Dataset of Patent Citations

Historical Air Quality

Querying BigQuery tables

Acknowledgements

Google Analytics Sample

Google Analytics Sample (BigQuery)

Context

Content

Acknowledgements

Inspiration