Traffic analytics, rankings, and competitive metrics for github.com as of August 2025
2019 Novel Coronavirus COVID-19 (2019-nCoV) Visual Dashboard and Map:
https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6
Downloadable data:
https://github.com/CSSEGISandData/COVID-19
Additional Information about the Visual Dashboard:
https://systems.jhu.edu/research/public-health/ncov
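A loading sketch for the downloadable time series CSVs (the file path below is an assumption based on the repository's layout and may change):

```python
import pandas as pd

# Hypothetical example: load the global confirmed-cases time series from the
# CSSEGISandData/COVID-19 repository (path assumed from the repo layout).
URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_confirmed_global.csv")

confirmed = pd.read_csv(URL)
# One row per country/province; date columns hold cumulative case counts.
print(confirmed[["Country/Region", confirmed.columns[-1]]].head())
```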
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Codebase [GitHub] | Dataset [Zenodo]
Abstract
The advent of powerful neural classifiers has increased interest in problems that require both learning and reasoning. These problems are critical for understanding important properties of models, such as trustworthiness, generalization, interpretability, and compliance with safety and structural constraints. However, recent research observed that tasks requiring both learning and reasoning on background knowledge often suffer from reasoning shortcuts (RSs): predictors can solve the downstream reasoning task without associating the correct concepts to the high-dimensional data. To address this issue, we introduce rsbench, a comprehensive benchmark suite designed to systematically evaluate the impact of RSs on models by providing easy access to highly customizable tasks affected by RSs. Furthermore, rsbench implements common metrics for evaluating concept quality and introduces novel formal verification procedures for assessing the presence of RSs in learning tasks. Using rsbench, we highlight that obtaining high-quality concepts in both purely neural and neuro-symbolic models is a far-from-solved problem. rsbench is available on GitHub.
Usage
We recommend visiting the official code website for instructions on how to use the dataset and accompanying software.
License
All ready-made and generated datasets are distributed under the CC BY-SA 4.0 license, with the exception of Kand-Logic, which is derived from Kandinsky-patterns and as such is distributed under the GPL-3.0 license.
Datasets Overview
The original BDD datasets can be downloaded from the following Google Drive link: [Download BDD Dataset].
References
[1] Xu et al., *Explainable Object-Induced Action Decision for Autonomous Vehicles*, CVPR 2020.
[2] Sawada and Nakamura, *Concept Bottleneck Model With Additional Unsupervised Concepts*, IEEE 2022.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0.
Contents:
This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.
The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.
repositories.csv:
programs.csv:
testing-files.csv:
scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup
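As a sketch of the marker-file convention described above (not the actual search-repositories.py), a local checkout could be classified as follows:

```python
from pathlib import Path

# Marker files in the repository root identify the PL-IaC technology,
# as described above (file names are case-sensitive).
MARKERS = {
    "Pulumi": ["Pulumi.yaml", "Pulumi.yml"],
    "AWS CDK": ["cdk.json"],
    "CDKTF": ["cdktf.json"],
}

def classify_repo(root: str) -> list[str]:
    """Return the PL-IaC technologies whose marker file exists in `root`."""
    root_path = Path(root)
    return [tech for tech, files in MARKERS.items()
            if any((root_path / f).is_file() for f in files)]

print(classify_repo("."))  # e.g. ['Pulumi']
```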
The script uses the GitHub code search API and inherits its limitations:
More details: https://docs.github.com/en/search-github/searching-on-github/searching-code
The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api
download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
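A minimal sketch of that download step (not the actual download-repositories.py), assuming git is installed:

```python
import subprocess
from pathlib import Path

def shallow_clone(clone_url: str, repo_id: int, dest_root: str = "repos") -> None:
    """Clone only the most recent state of the default branch, including
    submodules, without the rest of the git history."""
    Path(dest_root).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["git", "clone",
         "--depth", "1",              # only the HEAD commit
         "--recurse-submodules",      # include submodules
         "--shallow-submodules",      # submodules also at depth 1
         clone_url, f"{dest_root}/{repo_id}"],
        check=True,
    )

# shallow_clone("https://github.com/pulumi/examples.git", 12345)
```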
https://github.com/nytimes/covid-19-data/blob/master/LICENSE
The New York Times is releasing a series of data files with cumulative counts of coronavirus cases in the United States, at the state and county level, over time. We are compiling this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak.
Since the first reported coronavirus case in Washington State on Jan. 21, 2020, The Times has tracked cases of coronavirus in real time as they were identified after testing. Because of the widespread shortage of testing, however, the data is necessarily limited in the picture it presents of the outbreak.
We have used this data to power our maps and reporting tracking the outbreak, and it is now being made available to the public in response to requests from researchers, scientists and government officials who would like access to the data to better understand the outbreak.
The data begins with the first reported coronavirus case in Washington State on Jan. 21, 2020. We will publish regular updates to the data in this repository.
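A loading sketch with pandas, assuming the us-states.csv file at the repository root:

```python
import pandas as pd

# Assumed file layout: cumulative state-level counts in us-states.csv.
URL = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"

states = pd.read_csv(URL, parse_dates=["date"])
# Cumulative cases and deaths per state per day; max = latest total per state.
print(states.groupby("state")[["cases", "deaths"]].max().head())
```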
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game "Quick, Draw!". The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located.
Example drawings: https://raw.githubusercontent.com/googlecreativelab/quickdraw-dataset/master/preview.jpg
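The raw drawings are distributed as newline-delimited JSON; a minimal parsing sketch, treating the record fields (word, countrycode, drawing with per-stroke x, y, t arrays) as assumptions about the release format:

```python
import json

# Sketch of parsing one raw Quick, Draw! record (newline-delimited JSON).
# Field names follow the dataset's documented format; treat them as assumptions.
line = '{"word": "cat", "countrycode": "US", "recognized": true, ' \
       '"drawing": [[[10, 40, 90], [5, 20, 15], [0, 120, 260]]]}'

record = json.loads(line)
for stroke_no, (xs, ys, ts) in enumerate(record["drawing"]):
    print(f"stroke {stroke_no}: {len(xs)} points, "
          f"duration {ts[-1] - ts[0]} ms")
```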
This Dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:
More reviews:
New reviews:
Metadata:
- We have added transaction metadata for each review shown on the review page.
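Reviews are typically distributed as gzipped JSON lines; a loading sketch, with the file name and field names (e.g. overall, reviewText) treated as assumptions:

```python
import gzip
import json

# Sketch: stream reviews from a gzipped JSON-lines file (file name and
# field names such as "overall" and "reviewText" are assumptions).
def iter_reviews(path="reviews_Books.json.gz"):
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

ratings = [r["overall"] for r in iter_reviews()]
print(f"{len(ratings)} reviews, mean rating {sum(ratings) / len(ratings):.2f}")
```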
If you publish articles based on this dataset, please cite the following paper:
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides insights into the Indian developer community on GitHub, one of the world’s largest platforms for developers to collaborate, share, and contribute to open-source projects. Whether you're interested in analyzing trends, understanding community growth, or identifying popular programming languages, this dataset offers a comprehensive look at the profiles of GitHub users from India.
The dataset includes anonymized profile information for a diverse range of GitHub users based in India. Key features include:
- Username: Unique identifier for each user (anonymized)
- Location: City or region within India
- Programming Languages: Most commonly used languages per user
- Repositories: Public repositories owned and contributed to
- Followers and Following: Social network connections within the platform
- GitHub Join Date: Date the user joined GitHub
- Organizations: Affiliated organizations (if publicly available)
This dataset is curated from publicly available GitHub profiles with a specific focus on Indian users. It is inspired by the need to understand the growth of the tech ecosystem in India, including the languages, tools, and topics that are currently popular among Indian developers. This dataset aims to provide valuable insights for recruiters, data scientists, and anyone interested in the open-source contributions of Indian developers.
This dataset is perfect for:
- Data scientists looking to explore and visualize developer trends
- Recruiters interested in talent scouting within the Indian tech ecosystem
- Tech enthusiasts who want to explore the dynamics of India's open-source community
- Students and educators looking for real-world data to practice analysis and modeling
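A quick exploration sketch using the fields listed above (the CSV file name is hypothetical, and column labels are assumed to match the field names):

```python
import pandas as pd

# Hypothetical file name; column names assumed to follow the fields listed above.
profiles = pd.read_csv("india_github_users.csv")

# Most commonly used programming languages across users.
print(profiles["Programming Languages"]
      .str.split(",").explode().str.strip()
      .value_counts().head(10))

# Follower counts aggregated by city/region.
print(profiles.groupby("Location")["Followers"]
      .sum().sort_values(ascending=False).head())
```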
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scientific and related management challenges in the water domain require synthesis of data from multiple domains. Many data analysis tasks are difficult because datasets are large and complex; standard formats for data types are not always agreed upon nor mapped to an efficient structure for analysis; water scientists may lack training in methods needed to efficiently tackle large and complex datasets; and available tools can make it difficult to share, collaborate around, and reproduce scientific work. Overcoming these barriers to accessing, organizing, and preparing datasets for analyses will be an enabler for transforming scientific inquiries.
Building on the HydroShare repository's established cyberinfrastructure, we have advanced two packages for the Python language that make data loading, organization, and curation for analysis easier, reducing time spent in choosing appropriate data structures and writing code to ingest data. These packages enable automated retrieval of data from HydroShare and the USGS's National Water Information System (NWIS), loading of data into performant structures keyed to specific scientific data types that integrate with existing visualization, analysis, and data science capabilities available in Python, and writing of analysis results back to HydroShare for sharing and eventual publication. These capabilities reduce the technical burden on scientists of creating a computational environment for executing analyses, because the packages are installed and maintained within CUAHSI's HydroShare-linked JupyterHub server. HydroShare users can leverage these tools to build, share, and publish more reproducible scientific workflows.
The HydroShare Python Client and USGS NWIS Data Retrieval packages can be installed within a Python environment on any computer running Microsoft Windows, Apple macOS, or Linux from the Python Package Index using pip. They can also be used online via the CUAHSI JupyterHub server (https://jupyterhub.cuahsi.org/) or other Python notebook environments like Google Colaboratory (https://colab.research.google.com/). Source code, documentation, and examples for the software are freely available on GitHub at https://github.com/hydroshare/hsclient/ and https://github.com/USGS-python/dataretrieval.
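A minimal retrieval sketch with the USGS dataretrieval package described above (site number and dates are placeholders; the hsclient calls are sketched from memory and should be checked against its documentation):

```python
# pip install dataretrieval hsclient
import dataretrieval.nwis as nwis

# Daily-value records for a USGS gauge (site number and dates are placeholders).
df = nwis.get_record(sites="03339000", service="dv",
                     start="2018-01-01", end="2018-12-31")
print(df.head())

# Writing results back to HydroShare would use hsclient, roughly:
#   from hsclient import HydroShare
#   hs = HydroShare(username="...", password="...")   # sketched; check hsclient docs
#   resource = hs.resource("<existing resource id>")
#   resource.file_upload("results.csv")
```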
This presentation was delivered as part of the Hawai'i Data Science Institute's regular seminar series: https://datascience.hawaii.edu/event/data-science-and-analytics-for-water/
https://qdr.syr.edu/policies/qdr-standard-access-conditions
This is an Annotation for Transparent Inquiry (ATI) data project. The annotated article can be viewed on the Publisher's Website.
Data Generation
The research project engages a story about perceptions of fairness in criminal justice decisions. The specific focus involves a debate between ProPublica, a news organization, and Northpointe, the owner of a popular risk tool called COMPAS. ProPublica wrote that COMPAS was racist against blacks, while Northpointe posted online a reply rejecting such a finding. These two documents were the obvious foci of the qualitative analysis because of the further media attention they attracted, the confusion their competing conclusions caused readers, and the power both companies wield in public circles. There were no barriers to retrieval, as both documents have been publicly available on their corporate websites. This public access was one of the motivators for choosing them, as it meant that they were also easily attainable by the general public, thus extending the documents' reach and impact. Additional materials from ProPublica relating to the main debate were also freely downloadable from its website and a third-party, open source platform. Access to secondary source materials comprising additional writings from Northpointe representatives that could assist in understanding Northpointe's main document, though, was more limited. Because of a claim of trade secrets on its tool and the underlying algorithm, it was more difficult to reach Northpointe's other reports. Nonetheless, largely because its clients are governmental bodies with transparency and accountability obligations, some Northpointe-associated reports were retrievable from third parties who had obtained them, largely through Freedom of Information Act queries. Together, the primary and (retrievable) secondary sources allowed for a triangulation of themes, arguments, and conclusions. The quantitative component uses a dataset of over 7,000 individuals with information that was collected and compiled by ProPublica and made available to the public on GitHub. Because ProPublica gathered the data directly from criminal justice officials via Freedom of Information Act requests, the dataset is in the public domain, and thus no confidentiality issues are present. The dataset was loaded into SPSS v. 25 for data analysis.
Data Analysis
The qualitative enquiry used critical discourse analysis, which investigates ways in which parties in their communications attempt to create, legitimate, rationalize, and control mutual understandings of important issues. Each of the two main discourse documents was parsed on its own merit. Yet the project was also intertextual in studying how the discourses correspond with each other and to other relevant writings by the same authors.
Several more specific types of discursive strategies attracted further critical examination:
- Testing claims and rationalizations that appear to serve the speaker's self-interest
- Examining conclusions and determining whether sufficient evidence supported them
- Revealing contradictions and/or inconsistencies within the same text and intertextually
- Assessing strategies underlying justifications and rationalizations used to promote a party's assertions and arguments
- Noticing strategic deployment of lexical phrasings, syntax, and rhetoric
- Judging sincerity of voice and the objective consideration of alternative perspectives
Of equal importance in a critical discourse analysis is consideration of what is not addressed, that is, uncovering facts and/or topics missing from the communication. For this project, this included parsing issues that were either briefly mentioned and then neglected, asserted yet with their significance left unstated, or not suggested at all. This task required understanding common practices in the algorithmic data science literature. The paper could have been completed with just the critical discourse analysis. However, because one of its salient findings highlighted that the discourses overlooked numerous definitions of algorithmic fairness, the call to fill this gap seemed obvious. The availability of the same dataset used by the parties in conflict made this opportunity more appealing: calculating additional algorithmic equity equations would not be troubled by irregularities arising from diverse sample sets. New variables were created as relevant to calculate algorithmic fairness equations. In addition to using various SPSS Analyze functions (e.g., regression, crosstabs, means), online statistical calculators were useful to compute z-test comparisons of proportions and t-test comparisons of means.
Logic of Annotation
Annotations were employed to fulfil a variety of functions, including supplementing the main text with context, observations, counter-points, analysis, and source attributions. These fall under a few categories. Space considerations. Critical discourse analysis offers a rich method...
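The z-test comparisons of proportions and t-test comparisons of means mentioned above can equally be computed in Python rather than with online calculators; a sketch with illustrative numbers only (not the study's actual counts):

```python
from statsmodels.stats.proportion import proportions_ztest
from scipy import stats

# Illustrative counts only (not the study's figures): compare the proportion
# of high-risk classifications between two groups.
high_risk = [805, 349]    # "positive" classifications per group
group_n = [1918, 2013]    # group sizes

z, p = proportions_ztest(count=high_risk, nobs=group_n)
print(f"two-proportion z-test: z = {z:.2f}, p = {p:.4f}")

# Compare mean risk scores between the same two groups (illustrative samples).
scores_a = [4.1, 5.3, 6.0, 3.8, 5.5]
scores_b = [3.2, 4.0, 2.9, 3.7, 4.4]
t, p = stats.ttest_ind(scores_a, scores_b, equal_var=False)
print(f"Welch t-test: t = {t:.2f}, p = {p:.4f}")
```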
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Tables TitleVersion and Votes are not yet visible in the Data preview page, but they are accessible in Kernels.
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump.
SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub. If you use this dataset in your work, please cite our MSR 2018 paper or our MSR 2019 mining challenge proposal.
This version is based on the official Stack Overflow data dump released 2018-12-02 and the Google BigQuery GitHub data set queried 2018-12-09.
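Because this release is mirrored on Google BigQuery, post-block version histories can be queried from Python; the project, dataset, and table names below are assumptions and should be checked against the SOTorrent documentation:

```python
from google.cloud import bigquery

client = bigquery.Client()  # requires Google Cloud credentials

# Dataset/table names are assumptions; consult the SOTorrent docs for the
# exact BigQuery location of this release.
query = """
    SELECT PostId, COUNT(*) AS versions
    FROM `sotorrent-org.2018_12_09.PostBlockVersion`
    GROUP BY PostId
    ORDER BY versions DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.PostId, row.versions)
```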
The goal of the MSR 2019 mining challenge is to study the origin, evolution, and usage of Stack Overflow code snippets. Questions that are, to the best of our knowledge, not sufficiently answered yet include:
These are just some of the questions that could be answered using SOTorrent. We encourage challenge participants to adapt the above questions or formulate their own research questions about the origin, evolution, and usage of content on Stack Overflow.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of 22 data sets of 50+ requirements each, expressed as user stories.
The dataset has been created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements/user stories. The curator took the utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].
The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light
This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1
The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.
g02-federalspending.txt
(2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share the spending data of the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS, or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.
g03-loudoun.txt
(2018) is a set of requirements extracted from a document by Loudoun County, Virginia, that describes the to-be user stories and use cases for a land management readiness assessment system called Loudoun County LandMARC. The source document can be found here and is part of the Electronic Land Management System and EPlan Review Project RFP/RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.
g04-recycling.txt
(2017) concerns a web application for searching and locating recycling and waste disposal facilities. The application operates through an interactive map that the user can explore. The dataset was obtained from a GitHub repository and is the basis of a students' project on website design; the code is available (no license).
g05-openspending.txt
(2018) is about the OpenSpending project (www), a project of the Open Knowledge Foundation that aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing, and editing datasets, and on how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown licenses.
g11-nsf.txt
(2018) is a collection of user stories for the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed issues.
g08-frictionless.txt
(2016) regards the Frictionless Data project, which offers open source tools and standards for building data infrastructure, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data initiative are on GitHub (with a mix of Unlicense and MIT licenses) and the web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.
g14-datahub.txt
(2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.
g16-mis.txt
(2015) is a collection of user stories pertaining to a repository for researchers and archivists. The source of the dataset is a public Trello board. Although the user stories do not have explicit links to projects, it can be inferred that they originate from some project related to the library of Duke University.
g17-cask.txt
(2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) for developing applications within the Apache Hadoop ecosystem, an open-source framework for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, comprising the scenarios, user stories, and a design for the implementation of these user stories. The raw data is available in the following environment.
g18-neurohub.txt
(2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis, and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.
g22-rdadmp.txt
(2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue in the GitHub repository.
g23-archivesspace.txt
(2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration, such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its
Traffic analytics, rankings, and competitive metrics for jasmine.github.io as of August 2025
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used it as our validation dataset.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available in a requester-pays Amazon S3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the GitHub repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a Tesseract script to run text extraction from detected text rows; it is included in our code archive (code.tar) as text_recognition_multipro.py.
We used a Java tool provided by Falk Böschen, adapted to our file structure; it is included as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
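For context, text extraction from a detected row can be reproduced with Tesseract from Python; this is a generic sketch, not the text_recognition_multipro.py script itself:

```python
from PIL import Image
import pytesseract

# Generic sketch (not text_recognition_multipro.py): run Tesseract on a
# cropped text row produced by the detection network.
row = Image.open("detected_row.png")          # hypothetical crop from EAST output
text = pytesseract.image_to_string(row, config="--psm 7")  # psm 7: single text line
print(text.strip())
```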
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Full code and dataset for the NBP 2202 map website. Data were collected during January-February 2022 in the Amundsen Sea from the Nathaniel B. Palmer. This is a Python Flask app that displays data on a JavaScript Leaflet map. The contents of this dataset should be all you need to host the website yourself, whether for local viewing or to make it publicly available.
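For orientation, the Flask-plus-Leaflet pattern looks roughly like the sketch below; this is a generic illustration, not code from the repository:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Generic sketch of a Flask app serving a Leaflet map (not the repo's code).
PAGE = """
<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" href="https://unpkg.com/leaflet@1.9.4/dist/leaflet.css"/>
  <script src="https://unpkg.com/leaflet@1.9.4/dist/leaflet.js"></script>
</head>
<body>
  <div id="map" style="height: 100vh"></div>
  <script>
    const map = L.map("map").setView([-74.5, -105.0], 5);  // Amundsen Sea region
    L.tileLayer("https://tile.openstreetmap.org/{z}/{x}/{y}.png").addTo(map);
    fetch("/stations").then(r => r.json()).then(pts =>
      pts.forEach(p => L.marker([p.lat, p.lon]).addTo(map).bindPopup(p.name)));
  </script>
</body>
</html>
"""

@app.route("/")
def index():
    return PAGE

@app.route("/stations")
def stations():
    # Placeholder data; the real site serves the cruise's collected datasets.
    return jsonify([{"name": "example station", "lat": -74.5, "lon": -104.0}])

if __name__ == "__main__":
    app.run(debug=True)
```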
This upload is a copy of the GitHub repo taken on 24/03/22 with additional satellite data that was too large for git.
The GitHub repo can be found at https://github.com/callumrollo/itgc-2022-map/
The website is currently maintained at https://nbp2202map.com/
All data are publicly available. Locations and information displayed in the map are for convenience purposes only and are not authoritative. Contact the PIs of the International Thwaites Glacier Collaboration (ITGC) for full datasets. This website is the author's personal work and does not reflect the views of the ITGC group. The author has no official affiliation with ITGC.
HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems built based on Wikipedia.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The TerraDS dataset provides a comprehensive collection of Terraform programs written in the HashiCorp Configuration Language (HCL). As Infrastructure as Code (IaC) gains popularity for managing cloud infrastructure, Terraform has become one of the leading tools due to its declarative nature and widespread adoption. However, a lack of publicly available, large-scale datasets has hindered systematic research on Terraform practices. TerraDS addresses this gap by compiling metadata and source code from 62,406 open-source repositories with valid licenses. This dataset aims to foster research on best practices, vulnerabilities, and improvements in IaC methodologies.
The TerraDS dataset is organized into two main components: a SQLite database containing metadata and an archive of source code (~335 MB). The metadata, captured in a structured format, includes information about repositories, modules, and resources:
1. Repository Data:
2. Module Data:
3. Resource Data:
The provided archive contains the source code of the 62,406 repositories to allow further analysis based on the actual source instead of the metadata alone. As such, researchers can access the permissively licensed repositories and conduct studies on the executable HCL code.
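Because the metadata ships as a SQLite database, it can be explored directly from Python; the file, table, and column names below are assumptions to be checked against the actual schema:

```python
import sqlite3

# File, table, and column names are assumptions; inspect the schema first.
con = sqlite3.connect("terrads.sqlite")  # hypothetical file name

# List the tables actually present in the metadata database.
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)

# Example query once the schema is known, e.g. repositories per license:
# print(con.execute(
#     "SELECT license, COUNT(*) FROM repository GROUP BY license").fetchall())
con.close()
```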
The "HCL Dataset Tools" file contains a snapshot of the https://github.com/prg-grp/hcl-dataset-tools repository - for long term archival reasons. The tools in this repository can be used to reproduce this dataset.
One of the tools - "RepositorySearcher" - can be used to fetch metadata for various other GitHub API queries, not only Terraform code. While the RepositorySearcher allows usage for other types of repository search, the other tools provided are focused on Terraform repositories.
Traffic analytics, rankings, and competitive metrics for vega.github.io as of August 2025
OpenWeb Ninja’s Website Contacts Scraper API provides real-time access to B2B contact data directly from company websites and related public sources. The API delivers clean, structured results including B2B email data, phone number data, and social profile links, making it simple to enrich leads and build accurate company contact lists at scale.
What's included:
- Emails & Phone Numbers: extract business emails and phone contacts from a website domain.
- Social Profile Links: capture company accounts on LinkedIn, Facebook, Instagram, TikTok, Twitter/X, YouTube, GitHub, and Pinterest.
- Domain Search: input a company website domain and get all available contact details.
- Company Name Lookup: find a company's website domain by name, then retrieve its contact data.
- Comprehensive Coverage: scrape across all accessible website pages for maximum data capture.
Coverage & Scale:
- 1,000+ emails and phone numbers per company website supported.
- 8+ major social networks covered.
- Real-time REST API for fast, reliable delivery.
Use cases:
- B2B contact enrichment and CRM updates.
- Targeted email marketing campaigns.
- Sales prospecting and lead generation.
- Digital ads audience targeting.
- Marketing and sales intelligence.
With OpenWeb Ninja’s Website Contacts Scraper API, you get structured B2B email data, phone numbers, and social profiles straight from company websites - always delivered in real time via a fast and reliable API.
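A generic REST call from Python would look like the sketch below; the endpoint, parameter names, and authentication header are purely hypothetical placeholders, not the provider's documented interface:

```python
import requests

# Purely hypothetical endpoint, parameters, and auth header -- consult the
# provider's API documentation for the real interface.
API_URL = "https://api.example.com/website-contacts"
resp = requests.get(
    API_URL,
    params={"domain": "example.com"},
    headers={"Authorization": "Bearer <YOUR_API_KEY>"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(data.get("emails"), data.get("phone_numbers"), data.get("social_links"))
```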
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The city of Austin has administered a community survey for the 2015, 2016, 2017, 2018 and 2019 years (https://data.austintexas.gov/City-Government/Community-Survey/s2py-ceb7) to "assess satisfaction with the delivery of the major City Services and to help determine priorities for the community as part of the City's ongoing planning process." To directly access this dataset from the city of Austin's website, you can follow this link: https://cutt.ly/VNqq5Kd. Although we downloaded the dataset analyzed in this study from the former link, given that the city of Austin is interested in continuing to administer this survey, there is a chance that the data we used for this analysis and the data hosted on the city of Austin's website may differ in the following years. Accordingly, to ensure the replication of our findings, we recommend that researchers download and analyze the dataset we employed in our analyses, which can be accessed at the following link: https://github.com/democratizing-data-science/MDCOR/blob/main/Community_Survey.csv
Replication Features or Variables
The community survey data has 10,684 rows and 251 columns. Of these columns, our analyses will rely on the following three indicators, which are taken verbatim from the survey: "ID", "Q25 - If there was one thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?", and "Do you own or rent your home?"
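A minimal sketch for loading the replication file and selecting the three indicators named above (column headers are matched loosely since the exact labels should be verified in the file):

```python
import pandas as pd

# Raw-file form of the GitHub link given above.
URL = ("https://raw.githubusercontent.com/democratizing-data-science/MDCOR/"
       "main/Community_Survey.csv")

survey = pd.read_csv(URL)
print(survey.shape)  # description above reports 10,684 rows and 251 columns

# The three indicators used in the analysis: the ID, the open-ended Q25
# comment to the Mayor, and home ownership. Matching by prefix/substring
# avoids depending on the exact (very long) column headers.
cols = [c for c in survey.columns
        if c == "ID" or c.startswith("Q25") or "own or rent" in c]
print(survey[cols].head())
```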