100+ datasets found
  1. Data from: A Large-scale Dataset of (Open Source) License Text Variants

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated Mar 30, 2022
    Cite
    Stefano Zacchiroli (2022). A Large-scale Dataset of (Open Source) License Text Variants [Dataset]. http://doi.org/10.5281/zenodo.6379164
    Explore at:
    Available download formats: bin, application/gzip, html
    Dataset updated
    Mar 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stefano Zacchiroli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers.
    The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing.
    Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared.
    The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.
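The checksum-based link between metadata rows and license blobs described above can be sketched as follows. This is a minimal illustration, not the dataset's documented layout: the column names `sha1` and `spdx`, the use of SHA-1 hex digests as the key, and the sample row are all assumptions; the included README is the authority on the real file and column names.

```python
import csv
import hashlib
import io


def blob_checksum(data: bytes) -> str:
    """Return the hex SHA-1 digest of a blob's raw bytes."""
    return hashlib.sha1(data).hexdigest()


def load_metadata(csv_text: str, key: str = "sha1") -> dict:
    """Index CSV metadata rows by their checksum column.

    `key` is a hypothetical column name chosen for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def lookup(data: bytes, index: dict):
    """Find the metadata row for a blob, or None if it is not indexed."""
    return index.get(blob_checksum(data))


# Synthetic example: one license blob and one matching metadata row.
license_text = b"Permission is hereby granted, free of charge...\n"
digest = blob_checksum(license_text)
sample_csv = f"sha1,spdx\n{digest},MIT\n"
index = load_metadata(sample_csv)

row = lookup(license_text, index)
print(row["spdx"] if row else "no metadata for this blob")
```

Keying on a content checksum means identical license texts collapse to a single entry, which fits the deduplicated blob archive the description mentions.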

    For more details see the included README file and companion paper:

    If you use this dataset for research purposes, please acknowledge its use by citing the above paper.

  2. Leading open source licenses worldwide 2021

    • statista.com
    Updated Feb 20, 2024
    Cite
    Statista (2024). Leading open source licenses worldwide 2021 [Dataset]. https://www.statista.com/statistics/1245643/worldwide-leading-open-source-licenses/
    Explore at:
    Dataset updated
    Feb 20, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2021
    Area covered
    Worldwide
    Description

    The most popular open source license in the WhiteSource global database in 2021 was Apache 2.0. This license allows users to distribute, modify, or use software for their own purposes, as long as the user complies with the specified license terms. WhiteSource is a platform that automates open source security, compliance, and reporting processes.

  3. Business Licenses (Open Data)

    • gisdata.tucsonaz.gov
    • hub.arcgis.com
    Updated May 26, 2020
    Cite
    City of Tucson (2020). Business Licenses (Open Data) [Dataset]. https://gisdata.tucsonaz.gov/datasets/business-licenses-open-data
    Explore at:
    Dataset updated
    May 26, 2020
    Dataset authored and provided by
    City of Tucson
    Area covered
    Description

    All current, active business licenses.
    Process: Data is queried from TRMS, processed to clean up common addressing issues and geocoded.
    Purpose: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Dataset Classification: Level 0 - Open
    Known Uses: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Known Errors: All locations are approximate. Most data is mapped to existing address points (which are not precise locations), those records that don't have an equivalent address match are approximated along the address range of street. See MT field for more details. Fields for NAICS codes, business types and owner types may be missing data or incorrect. Data layer should not be considered a complete listing of all active businesses in Tucson.
    Data Contact: City Clerk's Office, Randy Hammel, Tax-License@tucsonaz.go
    Update Frequency: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

  4. NYS NY License Center Business Wizard

    • kaggle.com
    Updated Dec 3, 2019
    + more versions
    Cite
    State of New York (2019). NYS NY License Center Business Wizard [Dataset]. https://www.kaggle.com/datasets/new-york-state/nys-ny-license-center-business-wizard/versions/4
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 3, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    State of New York
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    New York, New York
    Description

    Content

    This dataset contains a list of the URLs (web addresses) that host information about the business-related licenses and permits identified in the NY Licensing Center’s Business Wizard. The Business Wizard helps users learn about certain licenses or permits a business may need to get up and running in New York State.

    Context

    This is a dataset hosted by the State of New York. The state has an open data platform and updates its information according to the amount of data that is brought in. Explore New York State using Kaggle and all of the data sources available through the State of New York organization page!

    • Update Frequency: This dataset is updated monthly.

    Acknowledgements

    This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.

    Cover photo by Charles Deluvio on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.

  5. Share of permissive & copyleft open source licenses worldwide 2012-2021

    • statista.com
    Updated Feb 20, 2024
    Cite
    Statista (2024). Share of permissive & copyleft open source licenses worldwide 2012-2021 [Dataset]. https://www.statista.com/statistics/1245665/worldwide-permissive-copyleft-open-source-licenses/
    Explore at:
    Dataset updated
    Feb 20, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    From 2012 to 2021, there appears to be a trend towards open source creators choosing the permissive route when it comes to open source licenses throughout the world. In 2012, only 41 percent of open source licenses were permissive, while in 2021 that figure reached 78 percent.

    When creators attach permissive licenses to their open source projects, corporations gain broad freedom to use the code without having to give much back to the creators.

  6. DMV Driver Licenses

    • catalog.data.gov
    • opendata.dc.gov
    • +3 more
    Updated Feb 5, 2025
    + more versions
    Cite
    D.C. Office of the Chief Technology Officer (2025). DMV Driver Licenses [Dataset]. https://catalog.data.gov/dataset/dmv-driver-licenses-fe6a6
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    D.C. Office of the Chief Technology Officer
    Description

    This dataset contains active driver license information. It includes only customer age, type of license (noncommercial or commercial), permit type (learner's, provisional, temporary, or regular), whether the license is a REAL ID or a not-validated license, license expiration date, and license status.

  7. driver license

    • data.cityofnewyork.us
    application/rdfxml +5
    Updated Feb 21, 2025
    + more versions
    Cite
    Taxi and Limousine Commission (TLC) (2025). driver license [Dataset]. https://data.cityofnewyork.us/Transportation/driver-license/vutq-szgp
    Explore at:
    Available download formats: csv, xml, tsv, application/rdfxml, json, application/rssxml
    Dataset updated
    Feb 21, 2025
    Authors
    Taxi and Limousine Commission (TLC)
    Description

    NYC TLC Licensed FHV drivers that are currently active and in good standing. This list is accurate to the date and time represented in the Last Date Updated and Last Time Updated fields.

  8. Business Licenses - All

    • opendata.atlantaregional.com
    • open-alpharetta.opendata.arcgis.com
    • +1 more
    Updated Sep 15, 2022
    + more versions
    Cite
    The City of Alpharetta (2022). Business Licenses - All [Dataset]. https://opendata.atlantaregional.com/datasets/alpharetta::business-licenses-all
    Explore at:
    Dataset updated
    Sep 15, 2022
    Dataset authored and provided by
    The City of Alpharetta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    A listing of current business licenses in the City of Alpharetta. Most items in this dataset are associated with a spatial location and can be plotted in GIS software; however, some features may not be tied to a location and may therefore appear to plot outside the Alpharetta city limits.

  9. Dataset from "What do developers talk about open source software licensing?...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 1, 2020
    Cite
    Lefteris Angelis (2020). Dataset from "What do developers talk about open source software licensing?" - SEAA2020 [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_3871564
    Explore at:
    Dataset updated
    Jun 1, 2020
    Dataset provided by
    Maria Papoutsoglou
    Georgia M. Kapitsaki
    Daniel German
    Lefteris Angelis
    Description

    This is the dataset used in the respective research work. The abstract is available below.

    If you want to cite this work, please use:

    Georgia M. Kapitsaki, Maria Papoutsoglou, Daniel German and Lefteris Angelis, What do developers talk about open source software licensing?, to appear in the Proceedings of the Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2020.

    Free and open source software has gained a lot of momentum in the industry and the research community. Open source licenses determine the rules under which the open source software can be further used and distributed. Previous works have examined the usage of open source licenses in the framework of specific projects or online social coding platforms, examining developers' specific licensing views for specific software. However, the questions practitioners ask about licenses and licensing as captured in Question and Answer websites also constitute an important aspect toward understanding practitioners' general licenses and licensing concerns. In this paper, we investigate open source license discussions using data from the Software Engineering, Open Source and Law Stack Exchange sites that contain relevant data. We describe the process used for the data collection and analysis, and discuss the main results. Our results indicate that clarifications about specific licenses and specific license terms are required. The results can be useful for developers, educators and license authors.

  10. NYC Dog Licensing Dataset

    • data.cityofnewyork.us
    • catalog.data.gov
    • +1 more
    application/rdfxml +5
    Updated Feb 6, 2024
    + more versions
    Cite
    Department of Health and Mental Hygiene (2024). NYC Dog Licensing Dataset [Dataset]. https://data.cityofnewyork.us/Health/NYC-Dog-Licensing-Dataset/nu7n-tubp
    Explore at:
    Available download formats: csv, json, tsv, application/rssxml, application/rdfxml, xml
    Dataset updated
    Feb 6, 2024
    Dataset authored and provided by
    Department of Health and Mental Hygiene
    Area covered
    New York
    Description

    Active Dog Licenses.

    All dog owners residing in NYC are required by law to license their dogs. The data is sourced from the DOHMH Dog Licensing System (https://a816-healthpsi.nyc.gov/DogLicense), where owners can apply for and renew dog licenses. Each record represents a unique dog license that was active during the year, but not necessarily a unique record per dog, since a license that is renewed during the year results in a separate record of an active license period. Each record stands as a unique license period for the dog over the course of the yearlong time frame.

  11. Department of Licensing Professional License Counts

    • catalog.data.gov
    • data.wa.gov
    • +1 more
    Updated Mar 22, 2025
    + more versions
    Cite
    data.wa.gov (2025). Department of Licensing Professional License Counts [Dataset]. https://catalog.data.gov/dataset/active-professional-licenses-with-the-department-of-licensing
    Explore at:
    Dataset updated
    Mar 22, 2025
    Dataset provided by
    data.wa.gov
    Description

    This is a point-in-time count of active professional licenses, by County and State, issued by the Department of Licensing. These licenses are issued to people or businesses.

  12. Issued Licenses

    • data.ny.gov
    • data.cityofnewyork.us
    • +2 more
    application/rdfxml +5
    Updated Mar 21, 2025
    Cite
    Department of Consumer and Worker Protection (DCWP) (2025). Issued Licenses [Dataset]. https://data.ny.gov/Business/Legally-Operating-Businesses/w7w3-xahh/about
    Explore at:
    Available download formats: application/rssxml, csv, application/rdfxml, tsv, json, xml
    Dataset updated
    Mar 21, 2025
    Dataset authored and provided by
    Department of Consumer and Worker Protection (DCWP)
    Description

    This dataset features licenses issued by the NYC Department of Consumer and Worker Protection (DCWP)—formerly the Department of Consumer Affairs (DCA).

  13. DataSet-01-All Licenses

    • data.texas.gov
    • catalog.data.gov
    application/rdfxml +5
    Updated Mar 17, 2025
    Cite
    Texas Medical Board (2025). DataSet-01-All Licenses [Dataset]. https://data.texas.gov/dataset/DataSet-01-All-Licenses/tm3v-pfq9
    Explore at:
    Available download formats: csv, application/rssxml, application/rdfxml, xml, tsv, json
    Dataset updated
    Mar 17, 2025
    Dataset authored and provided by
    Texas Medical Board
    Description

    A listing of all TMB licenses. Additional licensee information can be found at https://www.tmb.state.tx.us/page/look-up-a-license.

  14. Popularity distribution of database management systems worldwide 2023, by...

    • statista.com
    Updated Nov 9, 2023
    Cite
    Statista (2023). Popularity distribution of database management systems worldwide 2023, by license [Dataset]. https://www.statista.com/statistics/1131575/worldwide-popularity-database-management-systems-license/
    Explore at:
    Dataset updated
    Nov 9, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Nov 2023
    Area covered
    Worldwide
    Description

    As of November 2023, commercial database management systems (DBMSs) were slightly less popular than open source DBMSs; however, both had accumulated similar ranking scores. The most popular DBMS in the world was Oracle, a commercial system; the open source system MySQL and Microsoft SQL Server, another commercial system, rounded out the top three.

  15. Mouth Open Dataset

    • universe.roboflow.com
    zip
    Updated Mar 7, 2025
    + more versions
    Cite
    114514 (2025). Mouth Open Dataset [Dataset]. https://universe.roboflow.com/114514-xeau7/mouth-open-ven4d
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 7, 2025
    Dataset authored and provided by
    114514
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Close Bounding Boxes
    Description

    Mouth Open

    ## Overview
    
    Mouth Open is a dataset for object detection tasks - it contains Close annotations for 236 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  16. Business Licenses

    • data.wu.ac.at
    • data.cityofchicago.org
    • +1 more
    csv, json, rdf, xml
    Updated May 8, 2018
    + more versions
    Cite
    City of Chicago (2018). Business Licenses [Dataset]. https://data.wu.ac.at/schema/data_gov/ODEwYzA1OGYtNjc1Mi00ODdkLThjZTYtYjVmNjIzODEzNGEy
    Explore at:
    Available download formats: rdf, csv, xml, json
    Dataset updated
    May 8, 2018
    Dataset provided by
    City of Chicago
    Description

    Business licenses issued by the Department of Business Affairs and Consumer Protection in the City of Chicago from 2002 to the present. This dataset contains a large number of records/rows of data and may not be viewed in full in Microsoft Excel. Therefore, when downloading the file, select CSV from the Export menu. Open the file in an ASCII text editor, such as Notepad or Wordpad, to view and search.

    Data fields requiring description are detailed below.

    APPLICATION TYPE: ‘ISSUE’ is the record associated with the initial license application. ‘RENEW’ is a subsequent renewal record. All renewal records are created with a term start date and term expiration date. ‘C_LOC’ is a change of location record. It means the business moved. ‘C_CAPA’ is a change of capacity record. Only a few license types may file this type of application. ‘C_EXPA’ only applies to businesses that have liquor licenses. It means the business location expanded. 'C_SBA' is a change of business activity record. It means that a new business activity was added or an existing business activity was marked as expired.

    LICENSE STATUS: ‘AAI’ means the license was issued. ‘AAC’ means the license was cancelled during its term. ‘REV’ means the license was revoked. 'REA' means the license revocation has been appealed.

    LICENSE STATUS CHANGE DATE: This date corresponds to the date a license was cancelled (AAC), revoked (REV) or appealed (REA).
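The code values described above lend themselves to simple lookup tables. The sketch below decodes one record at a time; the code meanings are taken from the field descriptions, while the record keys (`APPLICATION TYPE`, `LICENSE STATUS`) are assumed to match the exported column headers, and `describe_record` is a hypothetical helper name.

```python
# Meanings paraphrased from the field descriptions in this listing.
APPLICATION_TYPES = {
    "ISSUE": "initial license application",
    "RENEW": "renewal",
    "C_LOC": "change of location",
    "C_CAPA": "change of capacity",
    "C_EXPA": "expansion of business location (liquor licenses only)",
    "C_SBA": "change of business activity",
}

LICENSE_STATUSES = {
    "AAI": "issued",
    "AAC": "cancelled during term",
    "REV": "revoked",
    "REA": "revocation appealed",
}


def decode(code: str, table: dict) -> str:
    """Translate a code, flagging values the table does not cover."""
    return table.get(code, f"unknown code: {code}")


def describe_record(record: dict) -> tuple:
    """Return (application type, license status) in plain words.

    The dictionary keys are assumed column headers, not verified ones.
    """
    return (
        decode(record.get("APPLICATION TYPE", ""), APPLICATION_TYPES),
        decode(record.get("LICENSE STATUS", ""), LICENSE_STATUSES),
    )


sample = {"APPLICATION TYPE": "RENEW", "LICENSE STATUS": "AAC"}
print(describe_record(sample))
```

Unknown codes are reported rather than raising an error, since the description does not claim the listed values are exhaustive.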

    Business License Owner information may be accessed at: https://data.cityofchicago.org/dataset/Business-Owners/ezma-pppn. To identify the owner of a business, you will need the account number or legal name, which may be obtained from this Business Licenses dataset.

    Data Owner: Business Affairs and Consumer Protection. Time Period: January 1, 2002 to present. Frequency: Data is updated daily.

  17. Chicago Business Licenses and Owners

    • kaggle.com
    Updated Dec 7, 2019
    Cite
    City of Chicago (2019). Chicago Business Licenses and Owners [Dataset]. https://www.kaggle.com/datasets/chicago/chicago-business-licenses-and-owners/versions/49
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 7, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    City of Chicago
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    Content

    More details about each file are in the individual file descriptions.

    Context

    This is a dataset hosted by the City of Chicago. The city has an open data platform and updates its information according to the amount of data that is brought in. Explore the City of Chicago using Kaggle and all of the data sources available through the City of Chicago organization page!

    • Update Frequency: This dataset is updated daily.

    Acknowledgements

    This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.

    Cover photo by rawpixel on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.

    This dataset is distributed under the following licenses: Public Domain

  18. Road Opening Licences 2017 2023 FCC - Dataset - data.gov.ie

    • data.gov.ie
    Updated Jan 31, 2024
    + more versions
    Cite
    data.gov.ie (2024). Road Opening Licences 2017 2023 FCC - Dataset - data.gov.ie [Dataset]. https://data.gov.ie/dataset/road-opening-licences-2017-2023-fcc
    Explore at:
    Dataset updated
    Jan 31, 2024
    Dataset provided by
    data.gov.ie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains Road Opening Licenses for Fingal County Council for 2017-2023, month by month from January to December. New years will be added under existing and updated quarterly.

    The new efficiency functionality with respect to automated granting of T3 licenses is operating well. Since 1st June 2022, the majority of T3 ROL applications have been granted by way of automation.

    Three Clerks of Works have been appointed to the Swords/Balbriggan, Castleknock/Mulhuddart and Howth/Malahide areas to carry out improved post inspection of all road openings by utilities. The Licensing unit continues to process and manage the licensing system for the County; applications are allocated by area for examination and conditioning, which includes reviews of the existing carriageway, footpaths, cycleways, and grass verges. Applicants submit temporary traffic management plans for review to ensure all works are carried out safely within the public domain. Road Opening Licenses are a cross functional process for the department and are essential for the asset management of the Fingal County Council Road Network.

    T1 Application: The T1 is not a license. It is notification of intent to perform works of high impact due to extent or complexity.
    T2 Applications: An application to carry out works of moderate impact due to the location, extent, amount, or duration of the work.
    T3 Applications: An application to carry out works of low impact due to the location, extent, amount, or duration of the work. A T3 license requires a short application period and does not require a works programme notification.
    T4 Applications: A notification of emergency works (as defined under legislation). Notification must occur at the time or as soon as possible after commencement and works must be carried out during a limited time period.

    Ongoing Projects. See the new data set Road Opening Licenses 2024-2027_FCC.

  19. TDLR - All Licenses

    • data.texas.gov
    • gimi9.com
    • +2 more
    Updated Oct 21, 2024
    Cite
    Texas Department of Licensing and Regulation (2024). TDLR - All Licenses [Dataset]. https://data.texas.gov/dataset/TDLR-All-Licenses/7358-krk7
    Explore at:
    Available download formats: xml, csv, tsv, application/rssxml, application/rdfxml, application/geo+json, kml, kmz
    Dataset updated
    Oct 21, 2024
    Dataset authored and provided by
    Texas Department of Licensing and Regulation
    Description

    A listing of all TDLR license holders from https://www.tdlr.texas.gov/LicenseSearch/.

  20. Occupational Licensing Directory

    • data.ok.gov
    • datasets.ai
    • +1 more
    csv, xlsx
    Updated Oct 31, 2019
    Cite
    Office of Management and Enterprise Services (2019). Occupational Licensing Directory [Dataset]. https://data.ok.gov/dataset/occupational-licensing-directory
    Explore at:
    Available download formats: xlsx, csv
    Dataset updated
    Oct 31, 2019
    Dataset provided by
    Oklahoma Office of Management and Enterprise Services (http://www.omes.ok.gov/)
    Authors
    Office of Management and Enterprise Services
    Description

    Directory of common occupations that require state-regulated licensure, along with information about the regulating agency. This directory is subject to change as new information becomes available.
