2 datasets found

Enterprise-Driven Open Source Software
data.europa.eu
unknown
Updated Feb 7, 2020
Zenodo (2020). Enterprise-Driven Open Source Software [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3653878?locale=en
Download format: unknown (8339687)
Dataset authored and provided by: Zenodo (http://zenodo.org/)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,252 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.

The main dataset is provided as a 17,252-record tab-separated file named enterprise_projects.txt with the following 27 fields.

url: the project's GitHub URL
project_id: the project's GHTorrent identifier
sdtc: true if selected using the same domain top committers heuristic (9,006 records)
mcpc: true if selected using the multiple committers from a valid enterprise heuristic (8,289 records)
mcve: true if selected using the multiple committers from a probable company heuristic (7,990 records)
star_number: number of GitHub watchers
commit_count: number of commits
files: number of files in current main branch
lines: corresponding number of lines in text files
pull_requests: number of pull requests
most_recent_commit: date of the most recent commit
committer_count: number of different committers
author_count: number of different authors
dominant_domain: the project's dominant email domain
dominant_domain_committer_commits: number of commits made by committers whose email matches the project's dominant domain
dominant_domain_author_commits: corresponding number for commit authors
dominant_domain_committers: number of committers whose email matches the project's dominant domain
dominant_domain_authors: corresponding number of commit authors
cik: SEC's EDGAR "central index key"
fg500: true if this is a Fortune Global 500 company (2,232 records)
sec10k: true if the company files SEC 10-K forms (4,178 records)
sec20f: true if the company files SEC 20-F forms (429 records)
project_name: GitHub project name
owner_login: GitHub project's owner login
company_name: company name as derived from the SEC and Fortune 500 data
owner_company: GitHub project's owner company name
license: SPDX license identifier

The file cohost_project_details.txt provides the full set of 309,531 cohort projects that are not part of the enterprise data set, but have comparable quality attributes.

url: the project's GitHub URL
project_id: the project's GHTorrent identifier
stars: number of GitHub watchers
commit_count: number of commits
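As an illustration of how the tab-separated layout of enterprise_projects.txt might be consumed, the following is a minimal Python sketch. It rests on several assumptions not confirmed by the description: that the file begins with a header row using the field names listed above, and that the heuristic flags are stored as the literal strings "true" and "false". The sample rows, URLs, and helper names (read_projects, count_flag) are hypothetical and only stand in for the real data.

```python
import csv
import io

# Hypothetical sample with a subset of the 27 fields; the real file is assumed
# (not confirmed) to start with a header row naming each field.
SAMPLE = (
    "url\tproject_id\tsdtc\tmcpc\tmcve\tcommit_count\n"
    "https://example.invalid/a\t1\ttrue\tfalse\tfalse\t120\n"
    "https://example.invalid/b\t2\tfalse\ttrue\ttrue\t45\n"
    "https://example.invalid/c\t3\ttrue\ttrue\tfalse\t980\n"
)


def read_projects(stream):
    """Parse tab-separated records into a list of dicts keyed by field name."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def count_flag(rows, field):
    """Count records whose flag column holds the (assumed) string 'true'."""
    return sum(1 for row in rows if row[field].strip().lower() == "true")


rows = read_projects(io.StringIO(SAMPLE))

# Tally how many sample records each of the three heuristics selected.
heuristic_counts = {name: count_flag(rows, name) for name in ("sdtc", "mcpc", "mcve")}
print(heuristic_counts)
```

To try this against the downloaded file, the in-memory sample could be swapped for `open("enterprise_projects.txt", newline="", encoding="utf-8")`. If the assumptions hold, the tallies should match the record counts quoted next to each heuristic in the field list.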
Enterprise-Driven Open Source Software
opendatalab.com
data.niaid.nih.gov
zip
Updated Apr 21, 2020
Athens University of Economics and Business (2020). Enterprise-Driven Open Source Software [Dataset]. https://opendatalab.com/OpenDataLab/Enterprise-Driven_Open_Source_etc
Download format: zip (7896769 bytes)
Dataset provided by: Athens University of Economics and Business
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.
40 scholarly articles cite the Zenodo record of this dataset (per Google Scholar).
