https://creativecommons.org/publicdomain/zero/1.0/
Cannabis is a genus of flowering plants in the family Cannabaceae.
Source: https://en.wikipedia.org/wiki/Cannabis
In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.
These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis. Data are available directly from Google Cloud Storage at gs://gcs-public-data--genomics/cannabis, via the Google Genomics API as dataset ID 918853309083001239 (with an additional, duplicated subset of transcriptome-only data as dataset ID 94241232795910911), and in the BigQuery dataset bigquery-public-data:genomics_cannabis.
All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.
The following tables are included in the Cannabis Genomes Project dataset:
Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other metadata that indicate the type of sample, such as strain, library prep method, and sequencing technology. See SRP008673 for an example of upstream sample data; SRP008673 is the University of Toronto sequencing of the Cannabis sativa strain Purple Kush.
MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths for the draft assembly of the Cannabis sativa strain Cannatonic produced by Phylos Bioscience. This table contains contig identifiers and their lengths.
MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.
MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset of transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.
Fork this kernel to get started with this dataset.
Dataset Source: http://opencannabisproject.org/
Category: Genomics
Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://www.ncbi.nlm.nih.gov/home/about/policies.shtml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Update frequency: As additional data are released to GenBank
View in BigQuery: https://bigquery.cloud.google.com/dataset/bigquery-public-data:genomics_cannabis
View in Google Cloud Storage: gs://gcs-public-data--genomics/cannabis
Banner Photo by Rick Proctor from Unsplash.
Which Cannabis samples are included in the variants table?
Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?
How many variants does each sample have at the THC Synthase gene (THCA1) locus?
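As a starting point for the last question, a minimal sketch is shown below. It assumes the Google Genomics variants schema used by the MNPR01_[BUILD_DATE] table (reference_name, start, and a repeated call record with call_set_name); the contig name and coordinates of the THCA1 locus are placeholders that must be looked up in the reference annotation for the release in use.

-- Hedged sketch: count variant calls per sample within a placeholder locus.
SELECT
  call.call_set_name AS sample,
  COUNT(1) AS num_variants
FROM `bigquery-public-data.genomics_cannabis.MNPR01_201703` AS v,
  UNNEST(v.call) AS call
WHERE v.reference_name = 'PLACEHOLDER_CONTIG'   -- contig containing THCA1 (placeholder)
  AND v.start BETWEEN 0 AND 100000              -- placeholder locus boundaries
GROUP BY sample
ORDER BY num_variants DESC;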
This dataset contains two tables: creative_stats and removed_creative_stats.

The creative_stats table contains information about advertisers that served ads in the European Economic Area or Turkey: their legal name, verification status, disclosed name, and location. It also includes ad-specific information: impression ranges per region (including aggregate impressions for the European Economic Area), first-shown and last-shown dates, which criteria were used in audience selection, the format of the ad, the ad topic, and whether the ad is funded by the Google Ad Grants program. A link to the ad in the Google Ads Transparency Center is also provided.

The removed_creative_stats table contains information about ads that served in the European Economic Area that Google removed: where and why they were removed, and per-region information on when they served. The removed_creative_stats table also contains a link to the Google Ads Transparency Center for the removed ad.

Data for both tables updates periodically and may be delayed from what appears on the Google Ads Transparency Center website.

About BigQuery: This data is hosted in Google BigQuery for users to easily query using SQL. Note that to use BigQuery, users must have a Google account and create a GCP project. This public dataset is included in BigQuery's 1TB/mo of free tier processing. Each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .

Download Dataset: This public dataset is also hosted in Google Cloud Storage here and available free to use. Use this quick start guide to quickly learn how to access public datasets on Google Cloud Storage. We provide the raw data in JSON format, sharded across multiple files to support easier download of the large dataset. A README file which describes the data structure and our Terms of Service (also listed below) is included with the dataset. You can also download the results from a custom query. See here for options and instructions. Signed-out users can download the full dataset by using the gcloud CLI. Follow the instructions here to download and install the gcloud CLI. To remove the login requirement, run "$ gcloud config set auth/disable_credentials True". To download the dataset, run "$ gcloud storage cp gs://ads-transparency-center/* . -R".
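As a hedged starting point for querying the creative_stats table described above, the sketch below counts ads per advertiser and format. The dataset path bigquery-public-data.google_ads_transparency_center and the column names advertiser_disclosed_name and ad_format_type are assumptions and should be verified against the table schema.

-- Hedged sketch: ad counts by advertiser and format (dataset path and column names assumed).
SELECT
  advertiser_disclosed_name,
  ad_format_type,
  COUNT(*) AS num_ads
FROM `bigquery-public-data.google_ads_transparency_center.creative_stats`
GROUP BY advertiser_disclosed_name, ad_format_type
ORDER BY num_ads DESC
LIMIT 20;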
Libraries.io gathers data on open source software from 33 package managers and 3 source code repositories. We track over 2.4m unique open source projects, 25m repositories and 121m interdependencies between them. This gives Libraries.io a unique understanding of open source software. In this release you will find data about software distributed and/or crafted publicly on the Internet. You will find information about its development, its distribution and its relationship with other software included as a dependency. You will not find any information about the individuals who create and maintain these projects. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
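As an example, a minimal, hedged sketch of a query counting tracked projects per package manager is shown below; the dataset path bigquery-public-data.libraries_io, the projects table, and its platform column are assumptions to verify against the published schema.

-- Hedged sketch: number of tracked projects per package manager (schema assumed).
SELECT
  platform,
  COUNT(*) AS project_count
FROM `bigquery-public-data.libraries_io.projects`
GROUP BY platform
ORDER BY project_count DESC;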
http://opendatacommons.org/licenses/dbcl/1.0/
Adapted from Wikipedia: OpenStreetMap (OSM) is a collaborative project to create a free, editable map of the world. Created in 2004, it was inspired by the success of Wikipedia; more than two million registered users can add data by manual survey, GPS devices, aerial photography, and other free sources.
To aid researchers, data scientists, and analysts in the effort to combat COVID-19, Google is making a hosted repository of public datasets, including OpenStreetMap data, free to access. To give the Kaggle community access to the BigQuery dataset, it has been onboarded to the Kaggle platform, which allows querying it without a linked GCP account. Please note that due to the large size of the dataset, Kaggle applies a quota of 5 TB of data scanned per user per 30 days.
This is the OpenStreetMap (OSM) planet-wide dataset loaded to BigQuery.
Tables:
- history_* tables: full history of OSM objects.
- planet_* tables: snapshot of current OSM objects as of Nov 2019.
The history_* and planet_* table groups are composed of node, way, relation, and changeset tables. These contain the primary OSM data types, plus an additional changeset table corresponding to OSM edits for convenient access. These objects are encoded using the BigQuery GEOGRAPHY data type so that they can be operated on with the built-in geography functions to perform geometry and feature selection and additional processing.
You can read more about OSM elements on the OSM Wiki. This dataset uses BigQuery GEOGRAPHY datatype which supports a set of functions that can be used to analyze geographical data, determine spatial relationships between geographical features, and construct or manipulate GEOGRAPHYs.
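As an illustration of those GEOGRAPHY functions, the hedged sketch below counts pharmacy nodes within 10 km of a point. The table path bigquery-public-data.geo_openstreetmap.planet_nodes and the all_tags and geometry column names are assumptions based on the table groups described above and should be checked against the actual schema.

-- Hedged sketch: OSM nodes tagged amenity=pharmacy within 10 km of central London.
SELECT COUNT(*) AS pharmacies_within_10km
FROM `bigquery-public-data.geo_openstreetmap.planet_nodes` AS n,
  UNNEST(n.all_tags) AS tag                  -- assumed key/value tag array
WHERE tag.key = 'amenity'
  AND tag.value = 'pharmacy'
  AND ST_DWITHIN(n.geometry, ST_GEOGPOINT(-0.1276, 51.5072), 10000);  -- distance in meters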
The COVID-19 Search Trends symptoms dataset shows aggregated, anonymized trends in Google searches for a broad set of health symptoms, signs, and conditions. The dataset provides a daily or weekly time series for each region showing the relative volume of searches for each symptom.

This dataset is intended to help researchers to better understand the impact of COVID-19. It shouldn't be used for medical diagnostic, prognostic, or treatment purposes. It also isn't intended to be used for guidance on personal travel plans. To learn more about the dataset, how we generate it and preserve privacy, read the data documentation. To visualize the data, try exploring these interactive charts and map of symptom search trends.

As of Dec. 15, 2020, the dataset was expanded to include trends for Australia, Ireland, New Zealand, Singapore, and the United Kingdom. This expanded data is available in new tables that provide data at country and two subregional levels. We will not be updating existing state/county tables going forward.

All bytes processed in queries against this dataset will be zeroed out, making this part of the query free. Data joined with the dataset will be billed at the normal rate to prevent abuse. After September 15, queries over these datasets will revert to the normal billing rate.

This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
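A minimal, hedged sketch of a country-level trends query is shown below. The dataset path bigquery-public-data.covid19_symptom_search, the table name symptom_search_country_weekly, and the per-symptom columns such as symptom_Fever are assumptions about the expanded country-level tables and should be checked against the data documentation.

-- Hedged sketch: weekly relative search volume for fever in one country (schema assumed).
SELECT
  country_region,
  date,
  symptom_Fever AS fever_search_trend
FROM `bigquery-public-data.covid19_symptom_search.symptom_search_country_weekly`
WHERE country_region = 'United Kingdom'
ORDER BY date;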
https://creativecommons.org/publicdomain/zero/1.0/
RxNorm is the name of a US-specific terminology in medicine that contains all medications available on the US market. Source: https://en.wikipedia.org/wiki/RxNorm
RxNorm provides normalized names for clinical drugs and links its names to many of the drug vocabularies commonly used in pharmacy management and drug interaction software, including those of First Databank, Micromedex, Gold Standard Drug Database, and Multum. By providing links between these vocabularies, RxNorm can mediate messages between systems not using the same software and vocabulary. Source: https://www.nlm.nih.gov/research/umls/rxnorm/
RxNorm was created by the U.S. National Library of Medicine (NLM) to provide a normalized naming system for clinical drugs, defined as the combination of {ingredient + strength + dose form}. In addition to the naming system, the RxNorm dataset also provides structured information such as brand names, ingredients, drug classes, and so on, for each clinical drug. Typical uses of RxNorm include navigating between names and codes among different drug vocabularies and using information in RxNorm to assist with health information exchange/medication reconciliation, e-prescribing, drug analytics, formulary development, and other functions.
This public dataset includes multiple data files originally released in RxNorm Rich Release Format (RXNRRF) that are loaded into BigQuery tables. The data is updated and archived on a monthly basis.
The following tables are included in the RxNorm dataset:
RXNCONSO contains concept and source information
RXNREL contains information regarding relationships between entities
RXNSAT contains attribute information
RXNSTY contains semantic information
RXNSAB contains source information
RXNCUI contains retired RXCUI codes
RXNATOMARCHIVE contains archived data
RXNCUICHANGES contains concept changes
Update Frequency: Monthly
Fork this kernel to get started with this dataset.
https://www.nlm.nih.gov/research/umls/rxnorm/
https://bigquery.cloud.google.com/dataset/bigquery-public-data:nlm_rxnorm
https://cloud.google.com/bigquery/public-data/rxnorm
Dataset Source: Unified Medical Language System RxNorm. The dataset is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. This dataset uses publicly available data from the U.S. National Library of Medicine (NLM), National Institutes of Health, Department of Health and Human Services; NLM is not responsible for the dataset, does not endorse or recommend this or any other dataset.
Banner Photo by @freestocks from Unsplash.
What are the RXCUI codes for the ingredients of a list of drugs?
Which ingredients have the most variety of dose forms?
In what dose forms is the drug phenylephrine found?
What are the ingredients of the drug labeled with the generic code number 072718?
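A hedged sketch for the first question is shown below: it looks up ingredient concepts (term type IN) in the RXNCONSO table. The table name rxnconso_current is a placeholder, since the BigQuery tables carry a release suffix, and the ingredient names in the filter are illustrative.

-- Hedged sketch: RXCUI codes for a list of ingredients (table name is a placeholder).
SELECT DISTINCT
  rxcui,                 -- RxNorm concept unique identifier
  str AS ingredient_name
FROM `bigquery-public-data.nlm_rxnorm.rxnconso_current`
WHERE sab = 'RXNORM'
  AND tty = 'IN'         -- 'IN' marks ingredient concepts
  AND LOWER(str) IN ('phenylephrine', 'acetaminophen');   -- illustrative list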
Google Suite is an umbrella Information System by which USAID receives multiple Google services per USAID's subscription contract.
Business services include but are not limited to: business email through Gmail; video and voice conferencing; secure team messaging; shared calendars; documents, spreadsheets, and presentations; unlimited cloud storage; and smart search across G Suite with Cloud Search.
Security and administration controls include: control over how long email messages and on-the-record chats are retained; policies specified for the entire domain or based on organizational units, date ranges, and specific terms; archiving and retention policies for emails and chats; security center for G Suite; eDiscovery for emails, chats, and files; audit reports to track user activity; data loss prevention for Gmail; data loss prevention for Drive; hosted S/MIME for Gmail; integration of Gmail with compliant third-party archiving tools; enterprise-grade access control with security key enforcement; and Gmail log analysis in BigQuery.
This public dataset was created by the Centers for Medicare & Medicaid Services. The data summarize counts of enrollees who are dually eligible for both the Medicare and Medicaid programs, including those in Medicare Savings Programs. “Duals” represent 20 percent of all Medicare beneficiaries, yet they account for 34 percent of all spending by the program, according to the Commonwealth Fund. As a representation of this high-needs, high-cost population, these data offer a view of regions ripe for more intensive care coordination that can address complex social and clinical needs. In addition to the high cost-savings opportunity to deliver upstream clinical interventions, this population represents the county-by-county volume of patients who are eligible for both state-level (Medicaid) and federal-level (Medicare) reimbursements and potential funding streams to address unmet social needs across various programs, waivers, and other projects. The dataset includes eligibility type and enrollment by quarter, at both the state and county level. These data represent monthly snapshots submitted by states to the CMS, which are inherently lower than ever-enrolled counts (which include persons enrolled at any time during a calendar year). For more information on dually eligible beneficiaries
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.sdoh_cms_dual_eligible_enrollment.
In what counties in Michigan has the number of dual-eligible individuals increased the most from 2015 to 2018? The following query finds the counties in Michigan which have experienced the largest increase in dual enrollment.
WITH duals_Jan_2015 AS (
SELECT Public_Total AS duals_2015, County_Name, FIPS
FROM `bigquery-public-data.sdoh_cms_dual_eligible_enrollment.dual_eligible_enrollment_by_county_and_program`
WHERE State_Abbr = "MI" AND Date = '2015-12-01'
),
duals_Jan_2018 AS (
SELECT Public_Total AS duals_2018, County_Name, FIPS
FROM `bigquery-public-data.sdoh_cms_dual_eligible_enrollment.dual_eligible_enrollment_by_county_and_program`
WHERE State_Abbr = "MI" AND Date = '2018-12-01'  -- assumed snapshot date, mirroring the 2015 CTE
),
duals_increase AS ( SELECT d18.FIPS, d18.County_Name, d15.duals_2015, d18.duals_2018, (d18.duals_2018 - d15.duals_2015) AS total_duals_diff FROM duals_Jan_2018 d18 JOIN duals_Jan_2015 d15 ON d18.FIPS = d15.FIPS )
SELECT * FROM duals_increase WHERE total_duals_diff IS NOT NULL ORDER BY total_duals_diff DESC
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is based on the TravisTorrent dataset released 2017-01-11 (https://travistorrent.testroots.org), the Google BigQuery GHTorrent dataset accessed 2017-07-03, and the Git log history of all projects in the dataset, retrieved 2017-07-16 and 2017-07-17.
We selected projects hosted on GitHub that employ the Continuous Integration (CI) system Travis CI. We identified the projects using the TravisTorrent data set and considered projects that:
used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent),
were active for at least one year (365 days) before the first build with Travis CI (before_ci),
used Travis CI at least for one year (during_ci),
had commit or merge activity on the default branch in both of these phases, and
used the default branch to trigger builds.
To derive the time frames, we employed the GHTorrent BigQuery data set. The resulting sample contains 113 projects. Of these projects, 89 are Ruby projects and 24 are Java projects. For our analysis, we only consider the activity one year before and after the first build.
We cloned the selected project repositories and extracted the version history for all branches (see https://github.com/sbaltes/git-log-parser). For each repo and branch, we created one log file with all regular commits and one log file with all merges. We only considered commits changing non-binary files and applied a file extension filter to only consider changes to Java or Ruby source code files. From the log files, we then extracted metadata about the commits and stored this data in CSV files (see https://github.com/sbaltes/git-log-parser).
We also retrieved a random sample of GitHub projects to validate the effects we observed in the CI project sample. We only considered projects that:
have Java or Ruby as their project language
used GitHub from the beginning (first commit not more than seven days before project creation date according to GHTorrent)
have commit activity for at least two years (730 days)
are engineered software projects (at least 10 watchers)
were not in the TravisTorrent dataset
In total, 8,046 projects satisfied those constraints. We drew a random sample of 800 projects from this sampling frame and retrieved the commit and merge data in the same way as for the CI sample. We then split the development activity at the median development date, removed projects without commits or merges in either of the two resulting time spans, and then manually checked the remaining projects to remove the ones with CI configuration files. The final comparison sample contained 60 non-CI projects.
This dataset contains the following files:
tr_projects_sample_filtered_2.csv A CSV file with information about the 113 selected projects.
tr_sample_commits_default_branch_before_ci.csv tr_sample_commits_default_branch_during_ci.csv One CSV file with information about all commits to the default branch before and after the first CI build. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:
project: GitHub project name ("/" replaced by "_"). branch: The branch to which the commit was made. hash_value: The SHA1 hash value of the commit. author_name: The author name. author_email: The author email address. author_date: The authoring timestamp. commit_name: The committer name. commit_email: The committer email address. commit_date: The commit timestamp. log_message_length: The length of the git commit messages (in characters). file_count: Files changed with this commit. lines_added: Lines added to all files changed with this commit. lines_deleted: Lines deleted in all files changed with this commit. file_extensions: Distinct file extensions of files changed with this commit.
tr_sample_merges_default_branch_before_ci.csv tr_sample_merges_default_branch_during_ci.csv One CSV file with information about all merges into the default branch before and after the first CI build. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the following columns:
project: GitHub project name ("/" replaced by "_"). branch: The destination branch of the merge. hash_value: The SHA1 hash value of the merge commit. merged_commits: Unique hash value prefixes of the commits merged with this commit. author_name: The author name. author_email: The author email address. author_date: The authoring timestamp. commit_name: The committer name. commit_email: The committer email address. commit_date: The commit timestamp. log_message_length: The length of the git commit messages (in characters). file_count: Files changed with this commit. lines_added: Lines added to all files changed with this commit. lines_deleted: Lines deleted in all files changed with this commit. file_extensions: Distinct file extensions of files changed with this commit. pull_request_id: ID of the GitHub pull request that has been merged with this commit (extracted from log message). source_user: GitHub login name of the user who initiated the pull request (extracted from log message). source_branch : Source branch of the pull request (extracted from log message).
comparison_project_sample_800.csv A CSV file with information about the 800 projects in the comparison sample.
commits_default_branch_before_mid.csv commits_default_branch_after_mid.csv One CSV file with information about all commits to the default branch before and after the median date of the commit history. Only commits modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the commits tables described above.
merges_default_branch_before_mid.csv merges_default_branch_after_mid.csv One CSV file with information about all merges into the default branch before and after the median date of the commit history. Only merges modifying, adding, or deleting Java or Ruby source code files were considered. Those CSV files have the same columns as the merge tables described above.
https://creativecommons.org/publicdomain/zero/1.0/
The Healthcare Common Procedure Coding System (HCPCS, often pronounced by its acronym as "hick picks") is a set of health care procedure codes based on the American Medical Association's Current Procedural Terminology (CPT).
HCPCS includes three levels of codes: Level I consists of the American Medical Association's Current Procedural Terminology (CPT) and is numeric. Level II codes are alphanumeric and primarily include non-physician services such as ambulance services and prosthetic devices; they represent items, supplies, and non-physician services not covered by CPT-4 codes (Level I). Level III codes, also called local codes, were developed by state Medicaid agencies, Medicare contractors, and private insurers for use in specific programs and jurisdictions. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) instructed CMS to adopt a standard coding system for reporting medical transactions. The use of Level III codes was discontinued on December 31, 2003, in order to adhere to consistent coding standards.
Classification of procedures performed for patients is important for billing and reimbursement in healthcare. The primary classification system used in the United States is Healthcare Common Procedure Coding System (HCPCS), maintained by Centers for Medicare and Medicaid Services (CMS). This system is divided into two levels: level I and level II.
Level I HCPCS codes classify services rendered by physicians. This system is based on Common Procedure Terminology (CPT), a coding system maintained by the American Medical Association (AMA). Level II codes, which are the focus of this public dataset, are used to identify products, supplies, and services not included in level I codes. The level II codes include items such as ambulance services, durable medical goods, prosthetics, orthotics and supplies used outside a physician’s office.
Given the ubiquity of administrative data in healthcare, HCPCS coding systems are also commonly used in areas of clinical research such as outcomes based research.
Update Frequency: Yearly
Fork this kernel to get started.
https://bigquery.cloud.google.com/table/bigquery-public-data:cms_codes.hcpcs
https://cloud.google.com/bigquery/public-data/hcpcs-level2
Dataset Source: Center for Medicare and Medicaid Services. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @rawpixel from Unsplash.
What are the descriptions for a set of HCPCS level II codes?
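A minimal, hedged sketch for that question is shown below, using the table path given above; the column names hcpc and long_description are assumptions and should be verified against the table schema, and the codes in the filter are illustrative.

-- Hedged sketch: descriptions for a set of HCPCS level II codes (column names assumed).
SELECT
  hcpc,
  long_description
FROM `bigquery-public-data.cms_codes.hcpcs`
WHERE hcpc IN ('A0428', 'E0114');   -- illustrative level II codes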
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We at Digital Science have been looking at the Data Citation Corpus to dig deeper into data citation counts. The first release is based on a seed file that includes data citations from the following sources: data citations from DataCite and Crossref DOI metadata, via Event Data; and data citations from the CZI Science Knowledge Graph, identified via a Named Entity Recognition model that searches for mentions of datasets in the full text of journal articles and preprints in Europe PMC. So we are basically looking at papers that have a link to a DataCite DOI or accession number. By combining this dataset with Dimensions.ai data in Google BigQuery, we were able to add more dimensions to the dataset (pardon the pun), such as funder or institution. The Data Citation Corpus only gave us about 70% of the paper links that were resolvable DOIs. This should improve over time. This allows us to track how well policies like the NIH open data policy are encouraging linking to datasets from papers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). The data include the location and number of confirmed COVID-19 cases, deaths, and recoveries for all affected countries, aggregated at the appropriate province/state. It was developed to enable researchers, public health authorities and the general public to track the outbreak. Additional information is available in the blog post, Mapping 2019-nCoV, and included data sources are listed here. For publications that use the data, please cite the following publication: Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 20(5):533-534. doi: 10.1016/S1473-3099(20)30120-1

This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .

This dataset has significant public interest in light of the COVID-19 crisis. All bytes processed in queries against this dataset will be zeroed out, making this part of the query free. Data joined with the dataset will be billed at the normal rate to prevent abuse. After September 15, queries over these datasets will revert to the normal billing rate.
https://www.datainsightsmarket.com/privacy-policy
The Analytical Data Store Tools market is experiencing robust growth, driven by the increasing need for real-time insights and advanced analytics across diverse industries. The market, estimated at $50 billion in 2025, is projected to maintain a healthy Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $150 billion by 2033. This expansion is fueled by several key factors: the proliferation of big data, the rising adoption of cloud-based solutions offering scalability and cost-effectiveness, and the growing demand for improved decision-making capabilities across organizations. Key trends include the increasing integration of AI and machine learning into analytical data store tools, the emergence of serverless architectures, and a focus on enhanced data security and governance. While the market faces challenges like data integration complexities and the need for skilled professionals, the overall outlook remains positive, driven by continued innovation and expanding enterprise adoption.

The competitive landscape is highly dynamic, with major players like Google, Snowflake, Microsoft, Amazon, and Oracle leading the charge. These established players are constantly innovating and expanding their offerings to meet evolving customer needs, while smaller, specialized companies are emerging to cater to niche requirements. The market's segmentation reflects this diversity, with solutions catering to various data volumes, industry verticals, and deployment models (cloud, on-premise, hybrid). Geographical expansion, particularly in rapidly developing economies, presents a significant opportunity for growth. The historical period (2019-2024) likely saw a slower growth rate than the projected future growth, reflecting the time taken for market maturity and broader adoption of cloud technologies. The continued focus on data-driven decision-making across industries ensures the sustained growth trajectory of the Analytical Data Store Tools market.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ChEMBL is maintained by the European Bioinformatics Institute (EBI), of the European Molecular Biology Laboratory (EMBL), based at the Wellcome Trust Genome Campus, Hinxton, UK.
ChEMBL is a manually curated database of bioactive molecules with drug-like properties used in drug discovery, including information about existing patented drugs.
Schema: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/chembl_23_schema.png
Documentation: http://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_23/schema_documentation.html
Fork this notebook to get started on accessing data in the BigQuery dataset using the BQhelper package to write SQL queries.
“ChEMBL” by the European Bioinformatics Institute (EMBL-EBI), used under CC BY-SA 3.0. Modifications have been made to add normalized publication numbers.
Data Origin: https://bigquery.cloud.google.com/dataset/patents-public-data:ebi_chembl
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This Dataset contains 3 datasets behind graphs generated in the "State of Open Data 2024 Special Report: Bridging policy and practice in data sharing". The datasets include counts and percentages for papers that link to datasets, filtered by Country, Funder and Affiliation. The datasets were generated by combining the DataCite Data Citation Corpus (https://corpus.datacite.org/dashboard) with Dimensions (https://www.dimensions.ai/) in Google BigQuery.
This dataset is collected by the NYC Taxi and Limousine Commission (TLC) and includes trip records from all trips completed in Yellow and Green taxis in NYC, and all trips in for-hire vehicles (FHV) in the last 5 years. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. For detailed information about this dataset, go to TLC Trip Record Data. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery .
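As a hedged example, the sketch below computes average fare and trip distance by month for Yellow taxi trips. The table name tlc_yellow_trips_2018 is one of several per-year tables assumed to live under bigquery-public-data.new_york_taxi_trips; the table and column names (pickup_datetime, fare_amount, trip_distance) should be checked against the dataset before use.

-- Hedged sketch: monthly average fare and distance for Yellow taxi trips (table name assumed).
SELECT
  EXTRACT(MONTH FROM pickup_datetime) AS month,
  ROUND(AVG(fare_amount), 2) AS avg_fare,
  ROUND(AVG(trip_distance), 2) AS avg_distance_miles
FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
GROUP BY month
ORDER BY month;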
StackExchange Dataset
Working doc: https://docs.google.com/document/d/1h585bH5sYcQW4pkHzqWyQqA4ape2Bq6o1Cya0TkMOQc/edit?usp=sharing
BigQuery query (see so_bigquery.ipynb): CREATE TEMP TABLE answers AS SELECT * FROM `bigquery-public-data.stackoverflow.posts_answers` WHERE LOWER(Body) LIKE '%arxiv%';
CREATE TEMPORARY TABLE questions AS SELECT * FROM `bigquery-public-data.stackoverflow.posts_questions`;
SELECT * FROM answers JOIN questions ON questions.id = answers.parent_id;
NOTE:… See the full description on the dataset page: https://huggingface.co/datasets/ag2435/stackexchange.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Content of this repository
This is the repository that contains the scripts and dataset for the MSR 2019 mining challenge.
DATASET
The dataset was retrieved using Google BigQuery and dumped to a CSV file for further processing. This original file with no treatment is called jsanswers.csv; in it we can find the following information:
1. The Id of the question (PostId)
2. The content (in this case the code block)
3. The length of the code block
4. The line count of the code block
5. The score of the post
6. The title
A quick look at this file shows that a PostId can have multiple rows related to it; that is how multiple code blocks are saved in the database.
Filtered Dataset:
Extracting code from CSV
We used a Python script called "ExtractCodeFromCSV.py" to extract the code from the original CSV and merge all the code blocks into their respective JavaScript file, with the PostId as the file name. This resulted in 336 thousand files.
Running ESLint
Due to the single-threaded nature of ESLint, running it on 336 thousand files took a huge toll on the machine, so we created a script to run it. This script, named "ESlintRunnerScript.py", splits the files into 20 evenly distributed parts and runs 20 ESLint processes to generate the reports; as such, it generates 20 JSON files.
Number of Violations per Rule
This information was extracted using the script named "parser.py", which generated the file "NumberofViolationsPerRule.csv" containing the number of violations per rule used in the linter configuration in the dataset.
Number of Violations per Category
To provide relevant statistics on the dataset, we generated the number of violations per rule category as defined on the ESLint website; this information was extracted using the same "parser.py" script.
Individual Reports
This information was extracted from the JSON reports; it is a CSV file with the PostId and violations per rule.
Rules
The file "Rules with categories" contains all the rules used and their categories.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# TDMentions: A Dataset of Technical Debt Mentions in Online Posts (version 1.0)
TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.
## Data collection and processing
The dataset is mainly collected from existing datasets. We used data from:
- the archive of Reddit posts by Jason Baumgartner (available at [https://pushshift.io](https://pushshift.io)),
- the archive of Hacker News available at Google's BigQuery (available at [https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news](https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news)), and the Stack Exchange data dump (available at [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)).
- the [GHTorrent](http://ghtorrent.org) project
- the [GH Archive](https://www.gharchive.org)
The data set currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01.
We use the regular expression `tech(nical)?[\s\-_]*?debt` to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag `technical-debt`.
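For the Hacker News data, which is also available as the public BigQuery dataset linked above, an equivalent match can be expressed in SQL. The following is a hedged sketch of such a query (illustrative, not the pipeline used to build TDMentions), assuming the `bigquery-public-data.hacker_news.full` table with `title` and `text` columns:

```
-- Hedged sketch: Hacker News items matching the technical-debt regular expression.
-- Table and column names are assumptions about the public BigQuery copy of Hacker News.
SELECT id, title, text
FROM `bigquery-public-data.hacker_news.full`
WHERE REGEXP_CONTAINS(COALESCE(text, title), r'(?i)tech(nical)?[\s\-_]*?debt');
```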
## Data Format
The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.
- `id`: the id used in the original source. We use the URL path to identify Medium posts.
- `body`: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
- `created_utc`: the time the item was posted in seconds since epoch in UTC.
- `author`: the author of the item. We use the username or userid from the source.
- `source`: where the item was posted. Valid sources are:
- HackerNews Comment
- HackerNews Job
- HackerNews Submission
- Reddit Comment
- Reddit Submission
- StackExchange Answer
- StackExchange Comment
- StackExchange Question
- Medium Post
- `meta`: Additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., `score` and `num_comments` for keys that have the same meaning/information across multiple sources.
This is a sample item from Reddit:
```JSON
{
"id": "ab8auf",
"body": "Technical Debt Explained (x-post r/Eve)",
"created_utc": 1546271789,
"author": "totally_100_human",
"source": "Reddit Submission",
"meta": {
"title": "Technical Debt Explained (x-post r/Eve)",
"score": 1,
"num_comments": 0,
"url": "http://jestertrek.com/eve/technical-debt-2.png",
"subreddit": "RCBRedditBot"
}
}
```
## Sample Analyses
We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use [`jq`](https://stedolan.github.io/jq/) to process the JSON.
### How many items are there for each source?
```
lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
```
### How many submissions that mentioned technical debt were posted each month?
```
lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | strftime("%Y-%m")' | sort | uniq -c
```
### What are the titles of items that link (`meta.url`) to PDF documents?
```
lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
```
### Please, I want CSV!
```
lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
```
Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.
Please see [https://github.com/sse-lnu/tdmentions](https://github.com/sse-lnu/tdmentions) for more analyses.
# Limitations and Future updates
The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.
https://creativecommons.org/publicdomain/zero/1.0/
US Forest Service Forest Inventory and Analysis National Program.
The Forest Inventory and Analysis (FIA) Program of the U.S. Forest Service provides the information needed to assess America's forests.
As the Nation's continuous forest census, our program projects how forests are likely to appear 10 to 50 years from now. This enables us to evaluate whether current forest management practices are sustainable in the long run and to assess whether current policies will allow the next generation to enjoy America's forests as we do today.
FIA reports on status and trends in forest area and location; in the species, size, and health of trees; in total tree growth, mortality, and removals by harvest; in wood production and utilization rates by various products; and in forest land ownership.
The Forest Service has significantly enhanced the FIA program by changing from a periodic survey to an annual survey, by increasing our capacity to analyze and publish data, and by expanding the scope of our data collection to include soil, understory vegetation, tree crown conditions, coarse woody debris, and lichen community composition on a subsample of our plots. The FIA program has also expanded to include the sampling of urban trees on all land use types in select cities.
For more details, see: https://www.fia.fs.fed.us/library/database-documentation/current/ver70/FIADB%20User%20Guide%20P2_7-0_ntc.final.pdf
Fork this kernel to get started with this dataset.
FIA is managed by the Research and Development organization within the USDA Forest Service in cooperation with State and Private Forestry and National Forest Systems. FIA traces its origin back to the McSweeney-McNary Forest Research Act of 1928 (P.L. 70-466). This law initiated the first inventories, starting in 1930.
Banner Photo by @rmorton3 from Unsplash.
Estimating timberland and forest land acres by state.
https://cloud.google.com/blog/big-data/2017/10/images/4728824346443776/forest-data-4.png