100+ datasets found
  1. Frequently leveraged external data sources for global enterprises 2020

    • statista.com
    Updated Jul 1, 2025
    Cite
    Statista (2025). Frequently leveraged external data sources for global enterprises 2020 [Dataset]. https://www.statista.com/statistics/1235514/worldwide-popular-external-data-sources-companies/
    Dataset updated
    Jul 1, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Aug 2020
    Area covered
    Worldwide
    Description

    In 2020, according to the respondents surveyed, data masters typically leverage a variety of external data sources to enhance their insights. The most popular external data sources among data masters were publicly available competitor data, open data, and proprietary datasets from data aggregators, cited by **, **, and ** percent of respondents, respectively.

  2. The Human Know-How Dataset

    • dtechtive.com
    • find.data.gov.scot
    pdf, zip
    Updated Apr 29, 2016
    Cite
    (2016). The Human Know-How Dataset [Dataset]. http://doi.org/10.7488/ds/1394
    Explore at:
    pdf(0.0582 MB), zip(19.67 MB), zip(0.0298 MB), zip(9.433 MB), zip(13.06 MB), zip(0.2837 MB), zip(5.372 MB), zip(69.8 MB), zip(20.43 MB), zip(5.769 MB), zip(14.86 MB), zip(19.78 MB), zip(43.28 MB), zip(62.92 MB), zip(92.88 MB), zip(90.08 MB)
    Dataset updated
    Apr 29, 2016
    Description

    The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and their decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links).

    For more information see:
    - the project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm
    - the data on datahub: https://datahub.io/dataset/human-activities-and-instructions

    * Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally the files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'.
    * Data representation: based on the PROHOW vocabulary (http://w3id.org/prohow#). Data extracted from existing web resources is linked to the original resources using the Open Annotation specification.
    * Data model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions can have multiple methods, steps can have further sub-steps, and complex requirements can be decomposed into sub-requirements.

    Statistics:
    * 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
    * 2,609,236: number of RDF nodes within the instructions. From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide).
    * 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs).
    * 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs).
    * 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links).
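    A minimal rdflib sketch for a first look at one of the extracted files; the file name and the RDF serialization are assumptions, so check the downloaded archives for the actual names and formats.

    import rdflib

    g = rdflib.Graph()
    # Hypothetical file name and format; each archive ships several RDF files.
    g.parse("9of11_knowhow_wikihow.ttl", format="turtle")

    print(len(g), "triples loaded")
    print(len(set(g.subjects())), "distinct subjects (a rough proxy for entities)")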

  3. Data from: A Large-scale Dataset of (Open Source) License Text Variants

    • data.niaid.nih.gov
    Updated Mar 31, 2022
    + more versions
    Cite
    Stefano Zacchiroli (2022). A Large-scale Dataset of (Open Source) License Text Variants [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6379163
    Dataset updated
    Mar 31, 2022
    Dataset authored and provided by
    Stefano Zacchiroli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata, referencing blobs via cryptographic checksums.

    For more details see the included README file and companion paper:

    Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.

    If you use this dataset for research purposes, please acknowledge its use by citing the above paper.

  4. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    + more versions
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    csv
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is provided as a .csv file.

    Each competition has a text description and metadata reflecting the characteristics of the competition and the dataset used, as well as the evaluation metric (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.

    The code blocks themselves and their metadata are collected into data frames according to the publishing year of the original kernels. The current version of the corpus includes two code-block files: snippets from kernels published up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.

    Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12,000 labeled snippets (markup_data_20220415.csv).

    As the marked-up code block data contains the numeric id of each code block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
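    To illustrate how the tables fit together, here is a minimal pandas sketch joining the labeled snippets to the semantic-type mapping; the file names come from the description above, but the join-key column name is a placeholder, not a documented column.

    import pandas as pd

    snippets = pd.read_csv("markup_data_20220415.csv")      # labeled code blocks
    type_map = pd.read_csv("actual_graph_2022-06-01.csv")   # numeric id -> semantic type/subclass

    # 'semantic_type_id' is a hypothetical key; use the actual id column from the csv headers.
    labeled = snippets.merge(type_map, on="semantic_type_id", how="left")
    print(labeled.head())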

  5. Data Management Plan Examples Database

    • borealisdata.ca
    • search.dataone.org
    Updated Aug 27, 2024
    Cite
    Rebeca Gaston Jothyraj; Shrey Acharya; Isaac Pratt; Danica Evering; Sarthak Behal (2024). Data Management Plan Examples Database [Dataset]. http://doi.org/10.5683/SP3/SDITUG
    Explore at:
    Croissant
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    Borealis
    Authors
    Rebeca Gaston Jothyraj; Shrey Acharya; Isaac Pratt; Danica Evering; Sarthak Behal
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Time period covered
    2011 - 2024
    Description

    This dataset comprises a collection of example DMPs from a wide array of fields, obtained from a number of different sources outlined in the README. Data extracted from the examples includes the discipline and field of study, author, institutional affiliation and funding information, location, date modified, title, research and data type, description of the project, a link to the DMP, and, where possible, external links to related publications, grant pages, or French-language versions. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.

  6. Jurisdictional Unit (Public) - Dataset - CKAN

    • nationaldataplatform.org
    Updated Feb 28, 2024
    Cite
    (2024). Jurisdictional Unit (Public) - Dataset - CKAN [Dataset]. https://nationaldataplatform.org/catalog/dataset/jurisdictional-unit-public
    Dataset updated
    Feb 28, 2024
    Description

    Jurisdictional Unit, 2022-05-21. For use with WFDSS, IFTDSS, IRWIN, and InFORM. This is a feature service which provides Identify and Copy Feature capabilities. If fast drawing at coarse zoom levels is a requirement, consider using the tile (map) service layer located at https://nifc.maps.arcgis.com/home/item.html?id=3b2c5daad00742cd9f9b676c09d03d13.

    Overview
    The Jurisdictional Agencies dataset is developed as a national land management geospatial layer, focused on representing wildland fire jurisdictional responsibility, for interagency wildland fire applications, including WFDSS (Wildland Fire Decision Support System), IFTDSS (Interagency Fuels Treatment Decision Support System), IRWIN (Interagency Reporting of Wildland Fire Information), and InFORM (Interagency Fire Occurrence Reporting Modules). It is intended to provide federal wildland fire jurisdictional boundaries on a national scale. The agency and unit names are an indication of the primary manager name and unit name, respectively, recognizing that:
    - there may be multiple owner names;
    - jurisdiction may be held jointly by agencies at different levels of government (i.e. State and Local), especially on private lands;
    - some owner names may be blocked for security reasons;
    - some jurisdictions may not allow the distribution of owner names.
    Private ownerships are shown in this layer with JurisdictionalUnitIdentifier=null, JurisdictionalUnitAgency=null, JurisdictionalUnitKind=null, and LandownerKind="Private", LandownerCategory="Private". All land inside the US country boundary is covered by a polygon. Jurisdiction for privately owned land varies widely depending on state, county, or local laws and ordinances, fire workload, and other factors, and is not available in a national dataset in most cases. For publicly held lands the agency name is the surface managing agency, such as Bureau of Land Management, United States Forest Service, etc. The unit name refers to the descriptive name of the polygon (i.e. Northern California District, Boise National Forest, etc.). These data are used to automatically populate fields on the WFDSS Incident Information page. This data layer implements the NWCG Jurisdictional Unit Polygon Geospatial Data Layer Standard.

    Relevant NWCG Definitions and Standards
    - Unit: a generic term that represents an organizational entity that only has meaning when it is contextualized by a descriptor, e.g. jurisdictional. Definition extension: when referring to an organizational entity, a unit refers to the smallest area or lowest level. Higher levels of an organization (region, agency, department, etc.) can be derived from a unit based on organization hierarchy.
    - Unit, Jurisdictional: the governmental entity having overall land and resource management responsibility for a specific geographical area as provided by law. Definition extension: 1) ultimately responsible for the fire report to account for statistical fire occurrence; 2) responsible for setting fire management objectives; 3) jurisdiction cannot be re-assigned by agreement; 4) the nature and extent of the incident determines jurisdiction (for example, Wildfire vs. All Hazard); 5) responsible for signing a Delegation of Authority to the Incident Commander. See also: Unit, Protecting; Landowner.
    - Unit Identifier: this data standard specifies the standard format and rules for Unit Identifier, a code used within the wildland fire community to uniquely identify a particular government organizational unit.
    - Landowner Kind & Category: this data standard provides a two-tier classification (kind and category) of landownership.

    Attribute Fields
    - JurisdictionalAgencyKind: describes the type of unit jurisdiction using the NWCG Landowner Kind data standard. There are two valid values: Federal, and Other. A value may not be populated for all polygons.
    - JurisdictionalAgencyCategory: describes the type of unit jurisdiction using the NWCG Landowner Category data standard. Valid values include: ANCSA, BIA, BLM, BOR, DOD, DOE, NPS, USFS, USFWS, Foreign, Tribal, City, County, OtherLoc (other local, not in the standard), State. A value may not be populated for all polygons.
    - JurisdictionalUnitName: the name of the Jurisdictional Unit. Where an NWCG Unit ID exists for a polygon, this is the name used in the Name field from the NWCG Unit ID database. Where no NWCG Unit ID exists, this is the "Unit Name" or other specific, descriptive unit name field from the source dataset. A value is populated for all polygons.
    - JurisdictionalUnitID: where it could be determined, this is the NWCG Standard Unit Identifier (Unit ID). Where it is unknown, the value is 'Null'. Null Unit IDs can occur because a unit may not have a Unit ID, or because one could not be reliably determined from the source data. Not every land ownership has an NWCG Unit ID. Unit ID assignment rules are available from the Unit ID standard, linked above.
    - LandownerKind: the landowner kind value associated with the polygon. May be inferred from the jurisdictional agency, or by the lack of a jurisdictional agency. A value is populated for all polygons. There are three valid values: Federal, Private, or Other.
    - LandownerCategory: the landowner category value associated with the polygon. May be inferred from the jurisdictional agency, or by the lack of a jurisdictional agency. A value is populated for all polygons. Valid values include: ANCSA, BIA, BLM, BOR, DOD, DOE, NPS, USFS, USFWS, Foreign, Tribal, City, County, OtherLoc (other local, not in the standard), State, Private.
    - DataSource: the database from which the polygon originated. Be as specific as possible; identify the geodatabase name and feature class in which the polygon originated.
    - SecondaryDataSource: if the Data Source is an aggregation from other sources, use this field to specify the source that supplied data to the aggregation. For example, if the Data Source is "PAD-US 2.1", then for a USDA Forest Service polygon the Secondary Data Source would be "USDA FS Automated Lands Program (ALP)". For a BLM polygon in the same dataset, the Secondary Data Source would be "Surface Management Agency (SMA)".
    - SourceUniqueID: identifier (GUID or ObjectID) in the data source. Used to trace the polygon back to its authoritative source.
    - MapMethod: controlled vocabulary to define how the geospatial feature was derived. Map method may help define data quality. MapMethod will be Mixed Methods by default for this layer, as the data are from mixed sources. Valid values include: GPS-Driven; GPS-Flight; GPS-Walked; GPS-Walked/Driven; GPS-Unknown Travel Method; Hand Sketch; Digitized-Image; Digitized-Topo; Digitized-Other; Image Interpretation; Infrared Image; Modeled; Mixed Methods; Remote Sensing Derived; Survey/GCDB/Cadastral; Vector; Phone/Tablet; Other.
    - DateCurrent: the last edit or update of this GIS record. Dates should follow the assigned NWCG Date Time data standard, using a 24-hour clock, YYYY-MM-DDhh.mm.ssZ, ISO 8601 standard.
    - Comments: additional information describing the feature.
    - GeometryID: primary key for linking geospatial objects with other database systems. Required for every feature. This field may be renamed for each standard to fit the feature.
    - JurisdictionalUnitID_sansUS: NWCG Unit ID with the "US" characters removed from the beginning. Provided for backwards compatibility.
    - JoinMethod: additional information on how the polygon was matched to information in the NWCG Unit ID database.
    - LocalName: local name for the polygon, provided from PAD-US or another source.
    - LegendJurisdictionalAgency: jurisdictional agency, but with smaller landholding agencies, or agencies of indeterminate status, grouped for more intuitive use in a map legend or summary table.
    - LegendLandownerAgency: landowner agency, but with smaller landholding agencies, or agencies of indeterminate status, grouped for more intuitive use in a map legend or summary table.
    - DataSourceYear: year that the source data for the polygon were acquired.

    Data Input
    This dataset is based on an aggregation of four spatial data sources: the Protected Areas Database of the United States (PAD-US 2.1), data from Bureau of Indian Affairs regional offices, the BLM Alaska Fire Service/State of Alaska, and Census block-group geometry. NWCG Unit ID and Agency Kind/Category data are tabular and sourced from UnitIDActive.txt in the WFMI Unit ID application (https://wfmi.nifc.gov/unit_id/Publish.html). Areas with unknown Landowner Kind/Category and Jurisdictional Agency Kind/Category are assigned LandownerKind and LandownerCategory values of "Private" by use of the non-water polygons from the Census block-group geometry.

    PAD-US 2.1: this dataset is based in large part on the USGS Protected Areas Database of the United States (PAD-US 2.1). PAD-US is a compilation of authoritative protected areas data between agencies and organizations that ultimately results in a comprehensive and accurate inventory of protected areas for the United States to meet a variety of needs (e.g. conservation, recreation, public health, transportation, energy siting, ecological, or watershed assessments and planning). Extensive documentation on PAD-US processes and data sources is available. How these data were aggregated: boundaries, and their descriptors, available in spatial databases (i.e. shapefiles or geodatabase feature classes) from land management agencies are the desired and primary data sources in PAD-US. If these authoritative sources are unavailable, or the agency recommends another source, data may be incorporated by other aggregators such as non-governmental organizations. Data sources are tracked for each record in the PAD-US geodatabase (see below).

    BIA and Tribal data: BIA and Tribal land management data are not available in PAD-US. As such, data were aggregated from BIA regional offices. These data date from 2012 and were substantially updated in 2022. Indian Trust Land affiliated with Tribes, Reservations, or BIA Agencies: these data are not considered the system of record and are not intended to be used as such. The Bureau of Indian Affairs (BIA), Branch of Wildland Fire Management (BWFM) is not the originator of these data. The

  7. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jul 24, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    zip(150021566586 bytes)
    Dataset updated
    Jul 24, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
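    A small sketch of the id-to-folder arithmetic described above (the file extension, and whether the file name carries one, is not specified here, so treat any full path as an assumption):

    def kernel_version_folder(kernel_version_id: int) -> str:
        """Map a KernelVersions id to its folder, e.g. 123456789 -> '123/456'."""
        top = kernel_version_id // 1_000_000         # top-level folder: up to 1 million files
        sub = (kernel_version_id // 1_000) % 1_000   # sub-folder: up to 1 thousand files
        return f"{top}/{sub}"

    print(kernel_version_folder(123_456_789))  # -> 123/456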

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
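    Because the bucket is requester-pays, a billing project must be supplied when downloading. A sketch with the google-cloud-storage Python client; the object name below is only a guess based on the folder layout above, so verify the actual paths first.

    from google.cloud import storage

    billing_project = "your-gcp-project"  # any GCP project with billing enabled
    client = storage.Client(project=billing_project)
    bucket = client.bucket("kaggle-meta-kaggle-code-downloads", user_project=billing_project)

    # Hypothetical object name following the 123/456/<id> layout described above.
    blob = bucket.blob("123/456/123456789.ipynb")
    blob.download_to_filename("123456789.ipynb")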

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  8. Data from: The Software Heritage License Dataset (2022 Edition)

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jan 10, 2024
    Cite
    Sergio Montes-Leon (2024). The Software Heritage License Dataset (2022 Edition) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8200351
    Dataset updated
    Jan 10, 2024
    Dataset provided by
    Sergio Montes-Leon
    Jesus M. Gonzalez-Barahona
    Gregorio Robles
    Stefano Zacchiroli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains all “license files” extracted from a snapshot of the Software Heritage archive taken on 2022-04-25. (Other, possibly more recent, versions of the datasets can be found at https://annex.softwareheritage.org/public/dataset/license-blobs/).

    In this context, a license file is a unique file content (or “blob”) that appeared in a software origin archived by Software Heritage as a file whose name is often used to ship licenses in software projects. Some name examples are: COPYING, LICENSE, NOTICE, COPYRIGHT, etc. The exact file name pattern used to select the blobs contained in the dataset can be found in the SQL query file 01-select-blobs.sql. Note that the file name was not expected to be at the project root, because project subdirectories can contain different licenses than the top-level one, and we wanted to include those too.

    Format

    The dataset is organized as follows:

    blobs.tar.zst: a Zst-compressed tarball containing deduplicated license blobs, one per file. The tarball contains 6’859’189 blobs, for a total uncompressed size on disk of 66 GiB.

    The blobs are organized in a sharded directory structure that contains files named like blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02, where:

    blobs/ is the root directory containing all license blobs

    8624bcdae55baeef00cd11d5dfcfa60f68710a02 is the SHA1 checksum of a specific license blob, a copy of the GPL3 license in this case. Each license blob is ultimately named with its SHA1:

    $ head -n 3 blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
    GNU GENERAL PUBLIC LICENSE
    Version 3, 29 June 2007

    $ sha1sum blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02
    8624bcdae55baeef00cd11d5dfcfa60f68710a02  blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02

    86 and 24 are, respectively, the first and second group of two hex digits in the blob SHA1
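    A minimal Python sketch of resolving a blob's on-disk path from its SHA1 and checking the naming convention:

    import hashlib

    def blob_path(sha1: str) -> str:
        """blobs/<first two hex digits>/<next two>/<full sha1>, as in the example above."""
        return f"blobs/{sha1[:2]}/{sha1[2:4]}/{sha1}"

    sha1 = "8624bcdae55baeef00cd11d5dfcfa60f68710a02"
    path = blob_path(sha1)  # blobs/86/24/8624bcdae55baeef00cd11d5dfcfa60f68710a02

    # The file content hashes back to its own name.
    with open(path, "rb") as f:
        assert hashlib.sha1(f.read()).hexdigest() == sha1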

    One blob is missing because its size (313 MB) prevented its inclusion (it was originally a tarball containing source code):

    swh:1:cnt:61bf63793c2ee178733b39f8456a796b72dc8bde,1340d4e2da173c92d432026ecdc54b4859fe9911,"AUTHORS"

    blobs-sample20k.tar.zst: analogous to blobs.tar.zst, but containing “only” 20’000 randomly selected license blobs

    license-blobs.csv.zst: a Zst-compressed CSV index of all the blobs in the dataset. Each line in the index (except the first one, which contains column headers) describes a license blob and is in the format SWHID,SHA1,NAME, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING"
    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GPL3"
    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2,8624bcdae55baeef00cd11d5dfcfa60f68710a02,"COPYING.GLP-3"

    where:

    SWHID: the Software Heritage persistent identifier of the blob. It can be used to retrieve and cross-reference the license blob via the Software Heritage archive, e.g., at: https://archive.softwareheritage.org/swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2

    SHA1: the blob SHA1, that can be used to cross-reference blobs in the blobs/ directory

    NAME: a file name given to the license blob in a given software origin. As the same license blob can have different names in different contexts, the index contains multiple entries for the same blob with different names, as is the case in the example above (yes, one of those has a typo in it, but it's an original typo from some repository!).
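    A sketch of loading this index with pandas, assuming pandas >= 1.4 with the zstandard package installed (so the .zst compression is inferred from the extension) and a header row matching the SWHID,SHA1,NAME format:

    import pandas as pd

    index = pd.read_csv("license-blobs.csv.zst")   # compression inferred from the .zst suffix
    index.columns = ["swhid", "sha1", "name"]      # assumed column order: SWHID,SHA1,NAME

    # Example: count distinct blobs that are named COPYING in at least one origin.
    print(index.loc[index["name"] == "COPYING", "sha1"].nunique())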

    blobs-fileinfo.csv.zst: a Zst-compressed CSV mapping from blobs to basic file information in the format: SHA1,MIME_TYPE,ENCODING,LINE_COUNT,WORD_COUNT,SIZE, where:

    SHA1: blob SHA1

    MIME_TYPE: blob MIME type, as detected by libmagic

    ENCODING: blob character encoding, as detected by libmagic

    LINE_COUNT: number of lines in the blob (only for textual blobs with UTF8 encoding)

    WORD_COUNT: number of words in the blob (only for textual blobs with UTF8 encoding)

    SIZE: blob size in bytes

    blobs-scancode.csv.zst: a Zst-compressed CSV mapping from blobs to the software licenses detected in them by ScanCode, in the format: SHA1,LICENSE,SCORE, where:

    SHA1: blob SHA1

    LICENSE: license detected in the blob, as an SPDX identifier (or ScanCode identifier for non-SPDX-indexed licenses)

    SCORE: confidence score in the result, as a decimal number between 0 and 100

    There may be zero or arbitrarily many lines for each blob.

    blobs-scancode.ndjson.zst: a Zst-compressed line-delimited JSON, containing a superset of the information in blobs-scancode.csv.zst. Each line is a JSON dictionary with three keys:

    sha1: blob SHA1

    licenses: output of scancode.api.get_licenses(..., min_score=0)

    copyrights: output of scancode.api.get_copyrights(...)

    There is exactly one line for each blob. licenses and copyrights keys are omitted for files not detected as plain text.
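    A sketch of streaming this NDJSON file with the zstandard package, without decompressing it to disk; the key names follow the description above.

    import io
    import json
    import zstandard

    with open("blobs-scancode.ndjson.zst", "rb") as fh:
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            record = json.loads(line)
            # 'licenses' and 'copyrights' are absent for blobs not detected as plain text.
            print(record["sha1"], sorted(record.keys()))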

    blobs-origins.csv.zst: a Zst-compressed CSV mapping of where license blobs come from. Each line in the index associates a license blob to one of its origins, in the format SWHID URL, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 https://github.com/pombreda/Artemis

    Note that a license blob can come from many different places; only an arbitrary (and somewhat random) one is listed in this mapping.

    If no origin URL is found in the Software Heritage archive, then a blank is used instead. This happens when the corresponding origins were either still being loaded when the dataset was generated, or when the loader process crashed before completing the blob's origin's ingestion.

    blobs-nb-origins.csv.zst: a Zst-compressed CSV mapping of how many origins of each blob are known to Software Heritage. Each line in the index associates a license blob to this count in the format SWHID NUMBER, for example:

    swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 2822260

    Two blobs are missing because the computation crashed:

    swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 swh:1:cnt:8b137891791fe96927ad78e64b0aad7bded08bdc

    This issue will be fixed in a future version of the dataset

    blobs-earliest.csv.zst: a Zst-compressed CSV mapping from blobs to information about their (earliest) known occurrence(s) in the archive. Format: SWHID EARLIEST_SWHID EARLIEST_TS OCCURRENCES, where:

    SWHID: blob SWHID

    EARLIEST_SWHID: SWHID of the earliest known commit containing the blob

    EARLIEST_TS: timestamp of the earliest known commit containing the blob, as a Unix time integer

    OCCURRENCES: number of known commits containing the blob

    replication-package.tar.gz: code and scripts used to produce the dataset

    licenses-annotated-sample.tar.gz: ground truth, i.e., manually annotated random sample of license blobs, with details about the kind of information they contain.

    Changes since the 2021-03-23 dataset

    More input data, due to the SWH archive growing: more origins in supported forges and package managers; and support for more forges and package managers. See the SWH Archive Changelog for details.

    Values in the NAME column of license-blobs.csv.zst are quoted, as some file names now contain commas.

    Replication package now contains all the steps needed to reproduce all artefacts including the licenseblobs/fetch.py script.

    blobs-nb-origins.csv.zst is added.

    blobs-origins.csv.zst is now generated using the first origin returned by swh-graph’s leaves endpoint, instead of its randomwalk endpoint. This should have no impact on the result, other than a different distribution of “random” origins being picked.

    blobs-origins.csv.zst was missing ~10% of its results in previous versions of the dataset, due to errors and/or timeouts in its generation; this is now down to 0.02% (1254 of the 6859445 unique blobs). Blobs with no known origins are now present, with a blank instead of a URL.

    blobs-earliest.csv.zst was missing ~10% of its results in previous versions of the dataset. It is complete now.

    blobs-scancode.csv.zst is generated with a newer scancode-toolkit version (31.2.1)

    blobs-scancode.ndjson.zst is added.

    Errata

    A file named .tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 was present in the initial version of the dataset (published on 2022-11-07). It was removed on 2022-11-09 using these two commands:

    pv blobs-fileinfo.csv.zst | zstdcat | grep -v ".tmp" | zstd -19
    pv blobs.tar.zst | zstdcat | tar --delete blobs/13/40/.tmp_1340d4e2da173c92d432026ecdc54b4859fe9911 | zstd -19 -T12

    The total uncompressed size was announced as 84 GiB based on the physical size on ext4, but it is actually 66 GiB.

    Citation

    If you use this dataset for research purposes, please acknowledge its use by citing one or both of the following papers:

    [pdf, bib] Jesús M. González-Barahona, Sergio Raúl Montes León, Gregorio Robles, Stefano Zacchiroli. The software heritage license dataset (2022 edition). Empirical Software Engineering, Volume 28, Number 6, Article number 147 (2023).

    [pdf, bib] Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of the 2022 Mining Software Repositories Conference (MSR 2022). 23-24 May 2022 Pittsburgh, Pennsylvania, United States. ACM 2022.

    References

    The dataset has been built using primarily the data sources described in the following papers:

    [pdf, bib] Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017.

    [pdf, bib] Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019.

    Errata (v2, 2024-01-09)

    licenses-annotated-sample.tar.gz: some comments not intended for publication were removed, and 4

  9. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
    csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
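    As a small illustration of the long vs. wide layouts, here is a pandas sketch that collapses the long table to one row per film; the '.csv' extension and the 'film_id' column name are assumptions (check the codebook), while 'fest' is the variable named above.

    import pandas as pd

    long_df = pd.read_csv("1_film-dataset_festival-program_long.csv")

    # One row per unique film. The published wide file additionally keeps, in 'fest',
    # only the first sample festival at which the film appeared.
    wide_like = long_df.drop_duplicates(subset="film_id", keep="first")
    print(wide_like.shape)  # should be close to the n=9,348 unique films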


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts used for web scraping. They were written using R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates functions for scraping the data from the identified matches (based on the scripts described above and the manual check). These functions are used for scraping the data in the next scripts.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  10. Open Data Portal Catalogue

    • open.canada.ca
    • datasets.ai
    • +1more
    csv, json, jsonl, png +2
    Updated Jul 13, 2025
    Cite
    Treasury Board of Canada Secretariat (2025). Open Data Portal Catalogue [Dataset]. https://open.canada.ca/data/en/dataset/c4c5c7f1-bfa6-4ff6-b4a0-c164cb2060f7
    Explore at:
    csv, sqlite, json, png, jsonl, xlsx
    Dataset updated
    Jul 13, 2025
    Dataset provided by
    Treasury Board of Canada Secretariat (http://www.tbs-sct.gc.ca/)
    Treasury Board of Canada (https://www.canada.ca/en/treasury-board-secretariat/corporate/about-treasury-board.html)
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Description

    The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link). Resources 2-8 are generated using the Flatterer (external link) utility.

    ### Description of resources:
    1. Dataset: a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON.
    2. Catalogue: an XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata.
    3. datasets metadata: contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output.
    4. resources metadata: contains the metadata for the resources contained within each dataset.
    5. resource views metadata: contains the metadata for the views applied to each resource, if a resource has a view configured.
    6. datastore fields metadata: contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore-enabled CSVs.
    7. data package fields: contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains.
    8. data package entity relation diagram: displays the title and format for each column, in each table in the Data Package, in the form of an ERD diagram. The Data Package resource offers a text-based version.
    9. SQLite database: a .db database, similar in structure to the Catalogue. This can be queried with database or analytical software tools for doing analysis.
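    A sketch of reading the JSON Lines resource (resource 1) after downloading it; the local file name is a placeholder, so fetch the actual resource from the catalogue page.

    import gzip
    import json

    titles = []
    # Placeholder file name for the downloaded, GZip-compressed JSON Lines resource.
    with gzip.open("open-data-catalogue.jsonl.gz", "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)  # one Dataset/Open Information Record per line
            titles.append(record.get("title"))

    print(len(titles), "records")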

  11. Pinterest Fashion Compatibility

    • cseweb.ucsd.edu
    • beta.data.urbandatacentre.ca
    json
    + more versions
    Cite
    UCSD CSE Research Project, Pinterest Fashion Compatibility [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Explore at:
    json
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.

    Metadata includes

    • product IDs

    • bounding boxes

    Basic Statistics:

    • Scenes: 47,739

    • Products: 38,111

    • Scene-Product Pairs: 93,274

  12. Data supporting the Master thesis "Monitoring von Open Data Praktiken -...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 21, 2024
    Cite
    Katharina Zinke; Katharina Zinke (2024). Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" [Dataset]. http://doi.org/10.5281/zenodo.14196539
    Explore at:
    zip
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Katharina Zinke; Katharina Zinke
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023

    This ZIP file contains the data the thesis is based on, interim exports of the results, and the R script with all pre-processing, data merging and analyses carried out. The documentation of the additional, explorative analysis is also available. The actual PDFs and text files of the scientific papers used are not included, as they are published open access.

    The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analyses approach, please refer to the master's thesis (publication following soon).

    ## Data sources

    Folder 01_SourceData/

    - PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)

    - ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)

    - ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)

    - Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)

    ## Automatic classification

    Folder 02_AutomaticClassification/

    - (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)

    - (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)

    - PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)

    - oddpub_results_wDOIs.csv (results file of the ODDPub classification)

    - PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)

    ## Manual coding

    Folder 03_ManualCheck/

    - CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)

    - ManualCheck_2023-06-08.csv (Manual coding results file)

    - PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)

    ## Explorative analysis for the discoverability of open data

    Folder 04_FurtherAnalyses/

    Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher) - in German

    ## R-Script

    Analyses_MA_OpenDataMonitoring.R (R-Script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)

  13. 30x30 Conserved Areas, Terrestrial (2024)

    • californianature.ca.gov
    • data.cnra.ca.gov
    • +2more
    Updated Aug 30, 2024
    + more versions
    Cite
    CA Nature Organization (2024). 30x30 Conserved Areas, Terrestrial (2024) [Dataset]. https://www.californianature.ca.gov/datasets/30x30-conserved-areas-terrestrial-2024
    Dataset updated
    Aug 30, 2024
    Dataset authored and provided by
    CA Nature Organization
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    The Terrestrial 30x30 Conserved Areas map layer was developed by the CA Nature working group, providing a statewide perspective on areas managed for the protection or enhancement of biodiversity. Understanding the spatial distribution and extent of these durably protected and managed areas is a vital aspect of tracking and achieving the “30x30” goal of conserving 30% of California's lands and waters by 2030.

    Terrestrial and Freshwater Data

    • The California Protected Areas Database (CPAD), developed and managed by GreenInfo Network, is the most comprehensive collection of data on open space in California. CPAD data consists of Holdings, a single parcel or small group of parcels, such that the spatial features of CPAD correspond to ownership boundaries.
    • The California Conservation Easement Database (CCED), managed by GreenInfo Network, aggregates data on lands with easements. Conservation easements are legally recorded interests in land in which a landholder sells or relinquishes certain development rights to their land in perpetuity. Easements are often used to ensure that lands remain as open space, either as working farm or ranch lands, or areas for biodiversity protection. Easement restrictions typically remain with the land through changes in ownership.
    • The Protected Areas Database of the United States (PAD-US), hosted by the United States Geological Survey (USGS), is developed in coordination with multiple federal, state, and non-governmental organization (NGO) partners. PAD-US, through the Gap Analysis Project (GAP), uses a numerical coding system in which GAP codes 1 and 2 correspond to management strategies with explicit emphasis on protection and enhancement of biodiversity. PAD-US is not specifically aligned to parcel boundaries, so boundaries represented within it may not align with other data sources.
    • Numerous datasets representing designated boundaries for entities such as National Parks and Monuments, Wild and Scenic Rivers, Wilderness Areas, and others were downloaded from publicly available sources, typically hosted by the managing agency.

    Methodology

    1. CPAD and CCED represent the most accurate location and ownership information for parcels in California which contribute to the preservation of open space and cultural and biological resources.
    2. Superunits are collections of parcels (Holdings) within CPAD which share a name, manager, and access policy. Most Superunits are also managed with a generally consistent strategy for biodiversity conservation. Examples of Superunits include Yosemite National Park, Giant Sequoia National Monument, and Anza-Borrego Desert State Park.
    3. Some Superunits, such as those owned and managed by the Bureau of Land Management, U.S. Forest Service, or National Park Service, are intersected by one or more designations, each of which may have a distinct management emphasis with regards to biodiversity. Examples of such designations are Wilderness Areas, Wild and Scenic Rivers, or National Monuments.
    4. CPAD Superunits and CCED easements were intersected with all designation boundary files to create the operative spatial units for conservation analysis, henceforth 'Conservation Units,' which make up the Terrestrial 30x30 Conserved Areas map layer. Each easement was functionally considered to be a Superunit.
    5. Each Conservation Unit was intersected with the PAD-US dataset in order to determine the management emphasis with respect to biodiversity, i.e., the GAP code. Because PAD-US is national in scope and not specifically parcel-aligned with California assessors' surveys, a direct spatial extraction of GAP codes from PAD-US would leave tens of thousands of GAP code data slivers within the 30x30 Conserved Areas map. Consequently, a generalizing approach was adopted, such that any Conservation Unit with greater than 80% areal overlap with a single GAP code was uniformly assigned that code. Additionally, the total area of GAP codes 1 and 2 was summed for the remaining uncoded Conservation Units; if this sum was greater than 80% of the unit area, the Conservation Unit was coded as GAP 2 (see the code sketch at the end of this description).
    6. Subsequent to this stage of analysis, certain Conservation Units remained uncoded, either due to the lack of a single GAP code (or combined GAP codes 1 and 2) overlapping 80% of the area, or because the area was not sufficiently represented in the PAD-US dataset.
    7. These uncoded Conservation Units were then broken down into their constituent, finer-resolution Holdings, which were then analyzed according to the above workflow.
    8. Areas remaining uncoded following the two-step process of coding at the Superunit and then Holding levels were assigned a GAP code of 4. This is consistent with the definition of GAP code 4: areas unknown to have a biodiversity management focus.
    9. Greater than 90% of all areas in the Terrestrial 30x30 Conserved Areas map layer were GAP coded at the level of CPAD Superunits intersected by designation boundaries, the coarsest land units of analysis. By adopting these coarser analytical units, the Terrestrial 30x30 Conserved Areas map layer avoids hundreds of thousands of spatial slivers that result from intersecting designations with smaller, more numerous parcel records. In most cases, individual parcels reflect the management scenario and GAP status of the umbrella Superunit and other spatially coincident designations.

    Tracking Conserved Areas

    The total acreage of conserved areas will increase as California works towards its 30x30 goal. Some changes will be due to shifts in legal protection designations or management status of specific lands and waters. However, shifts may also result from new data representing improvements in our understanding of existing biodiversity conservation efforts. The California Nature Project is expected to generate a great deal of excitement regarding the state's trajectory towards achieving the 30x30 goal. We also expect it to spark discussion about how to shape that trajectory, and how to strategize and optimize outcomes. We encourage landowners, managers, and stakeholders to investigate how their lands are represented in the Terrestrial 30x30 Conserved Areas map layer. This can be accomplished by using the Conserved Areas Explorer web application, developed by the CA Nature working group. Users can zoom into the locations they understand best and share their expertise with us to improve the data representing the status of conservation efforts at these sites. The Conserved Areas Explorer presents a tremendous opportunity to strengthen our existing data infrastructure and the channels of communication between land stewards and data curators, encouraging the transfer of knowledge and improving the quality of data. CPAD, CCED, and PAD-US are built from the ground up: data is derived from available parcel information and submissions from those who own and manage the land, so better data starts with you. Do boundary lines require updating? Is the GAP code inconsistent with a Holding's conservation status? If land under your care can be better represented in the Terrestrial 30x30 Conserved Areas map layer, please use this link to initiate a review. The results of these reviews will inform updates to the California Protected Areas Database, California Conservation Easement Database, and PAD-US as appropriate, for incorporation into future updates to CA Nature and tracking progress to 30x30.
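    A minimal sketch of the 80% areal-overlap rule from the methodology above is given below, using geopandas. The layer structure, column names (unit_id, GAP_Sts), and the equal-area projection are assumptions made for illustration only; this is not the CA Nature working group's production code.

    # Hypothetical sketch of the >80% areal-overlap rule used to assign GAP codes
    # to Conservation Units. Column names and CRS are assumptions for illustration.
    import geopandas as gpd

    def assign_gap_codes(units: gpd.GeoDataFrame, padus: gpd.GeoDataFrame,
                         threshold: float = 0.80) -> gpd.GeoDataFrame:
        """Assign a GAP code to each Conservation Unit when a single code
        (or the combined GAP 1+2 area) covers more than `threshold` of the unit."""
        units = units.to_crs(epsg=3310)          # California Albers, an equal-area CRS (assumption)
        padus = padus.to_crs(epsg=3310)
        units["unit_area"] = units.geometry.area

        # Intersect units with PAD-US polygons and measure overlap area per GAP code
        overlay = gpd.overlay(units[["unit_id", "geometry"]],
                              padus[["GAP_Sts", "geometry"]],
                              how="intersection")
        overlay["overlap_area"] = overlay.geometry.area
        per_code = (overlay.groupby(["unit_id", "GAP_Sts"])["overlap_area"]
                           .sum().reset_index())
        per_code = per_code.merge(units[["unit_id", "unit_area"]], on="unit_id")
        per_code["frac"] = per_code["overlap_area"] / per_code["unit_area"]

        gap = {}
        for uid, grp in per_code.groupby("unit_id"):
            best = grp.loc[grp["frac"].idxmax()]
            if best["frac"] > threshold:                     # single dominant GAP code
                gap[uid] = best["GAP_Sts"]
            elif grp.loc[grp["GAP_Sts"].isin(["1", "2"]), "frac"].sum() > threshold:
                gap[uid] = "2"                               # combined GAP 1+2 rule (codes assumed to be strings)
            # otherwise leave uncoded; later passes use finer Holdings, then GAP 4

        units["gap_code"] = units["unit_id"].map(gap)
        return units

    Under those assumptions, the same function could simply be re-run on the finer-resolution Holdings for any units left uncoded, mirroring steps 6 to 8 of the methodology.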

  14. Algeria DZ: SPI: Pillar 4 Data Sources Score: Scale 0-100

    • ceicdata.com
    + more versions
    Cite
    CEICdata.com (2021). Algeria DZ: SPI: Pillar 4 Data Sources Score: Scale 0-100 [Dataset]. https://www.ceicdata.com/en/algeria/governance-policy-and-institutions/dz-spi-pillar-4-data-sources-score-scale-0100
    Explore at:
    Dataset provided by
    CEIC Data
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2016 - Dec 1, 2022
    Area covered
    Algeria
    Variables measured
    Money Market Rate
    Description

    Algeria DZ: SPI: Pillar 4 Data Sources Score: Scale 0-100 data was reported at 45.958 NA in 2022. This records a decrease from the previous number of 49.075 NA for 2021. Algeria DZ: SPI: Pillar 4 Data Sources Score: Scale 0-100 data is updated yearly, averaging 49.892 NA from Dec 2016 (Median) to 2022, with 7 observations. The data reached an all-time high of 52.417 NA in 2018 and a record low of 45.958 NA in 2022. Algeria DZ: SPI: Pillar 4 Data Sources Score: Scale 0-100 data remains active status in CEIC and is reported by World Bank. The data is categorized under Global Database’s Algeria – Table DZ.World Bank.WDI: Governance: Policy and Institutions. The data sources overall score is a composite measure of whether countries have data available from the following sources: censuses and surveys, administrative data, geospatial data, and private sector/citizen generated data. The data sources (input) pillar is segmented by four types of sources generated by (i) the statistical office (censuses and surveys), and sources accessed from elsewhere such as (ii) administrative data, (iii) geospatial data, and (iv) private sector data and citizen generated data. The appropriate balance between these source types will vary depending on a country’s institutional setting and the maturity of its statistical system. High scores should reflect the extent to which the sources being utilized enable the necessary statistical indicators to be generated. For example, a low score on environment statistics (in the data production pillar) may reflect a lack of use of (and low score for) geospatial data (in the data sources pillar). This type of linkage is inherent in the data cycle approach and can help highlight areas for investment required if country needs are to be met.;Statistical Performance Indicators, The World Bank (https://datacatalog.worldbank.org/dataset/statistical-performance-indicators);Weighted average;
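    For intuition only, the toy calculation below shows how a composite pillar score can be formed as a weighted average of the four source-type sub-scores; the sub-scores and equal weights are invented and do not reproduce the World Bank's actual SPI weighting.

    # Illustrative only: made-up sub-scores and assumed equal weights for the
    # four data-source dimensions described above.
    subscores = {
        "censuses_and_surveys": 60.0,
        "administrative_data": 40.0,
        "geospatial_data": 35.0,
        "private_and_citizen_data": 50.0,
    }
    weights = {k: 0.25 for k in subscores}      # assumed equal weights
    pillar4_score = sum(weights[k] * subscores[k] for k in subscores)
    print(f"Pillar 4 data sources score (0-100): {pillar4_score:.3f}")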

  15. Z

    Dataset for paper "Mitigating the effect of errors in source parameters on...

    • data.niaid.nih.gov
    Updated Sep 28, 2022
    Cite
    Nicholas Rawlinson (2022). Dataset for paper "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6969601
    Explore at:
    Dataset updated
    Sep 28, 2022
    Dataset provided by
    Nienke Blom
    Phil-Simon Hardalupas
    Nicholas Rawlinson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect of errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study the effect both on forward modelling (i.e., comparing waveforms and measurements resulting from a perturbed vs. an unperturbed source) and on seismic inversion (i.e., using a source which contains an (erroneous) perturbation to invert for Earth structure). These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2019).

    This dataset contains:

    The entire Salvus project. This project was prepared using Salvus versions 0.11.x and 0.12.2 and should be fully compatible with the latter.

    A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.

    A number of Python scripts that are used in above notebooks.

    Two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.

    An overview of the inversion configurations used for each inversion experiment and the names of the corresponding figures: inversion_runs_overview.ods / .csv.

    Datasets corresponding to the different figures.

    One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020

    One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).

    A number of datasets (stored as pickled Pandas dataframes) derived from the Salvus project. We have computed:

    travel-time arrival predictions from every source to all stations (df_stations...pkl)

    misfits for different metrics for both P-wave centered and S-wave centered windows for all components on all stations, in each case comparing waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)

    addition of synthetic waveforms for different (perturbed) moment tensors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type (a minimal loading sketch follows this list)
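    For orientation, a minimal, hypothetical snippet for opening these derived products might look as follows; the exact file names inside the archive should be taken from the dataset itself, and the .h5 file name used here is only a placeholder.

    # Minimal sketch for inspecting the derived data products described above:
    # the pickled pandas DataFrames and the ASDF waveform files.
    import pandas as pd
    import pyasdf   # needed only for the .h5 ASDF files

    # Misfit table comparing reference vs. perturbed-source waveforms
    misfits = pd.read_pickle("df_misfits_cc.28s.pkl")
    print(misfits.head())

    # Waveforms stored in an ASDF (adaptable seismic data format) container;
    # the file name below is a placeholder, not one taken from the archive.
    ds = pyasdf.ASDFDataSet("synthetics_perturbed.h5", mode="r")
    print(ds.waveforms.list())      # station codes available in the file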

    How to use this dataset:

    To set up the conda environment:

    make sure you have anaconda/miniconda

    make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on Salvus. You can do the analyses and create the figures without it, but you'll have to hack around in the scripts to build workarounds.

    Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes; in that case, download an older Salvus version.

    Additionally in your conda env, install basemap and cartopy:

    conda-env create -n salvus_0_12 -f environment.yml
    conda install -c conda-forge basemap
    conda install -c conda-forge cartopy

    Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some LASIF functionality.

    To recreate the figures: This is extremely straightforward. Every figure has a corresponding Jupyter notebook; it suffices to run the notebook in its entirety.

    Figure 1: separate notebook, Fig1_event_98.py

    Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py

    Figures 3-7: Figures_perturbation_study.py

    Figures 8-10: Figures_toy_inversions.py

    To recreate the dataframes in DATA: This can be done using the example notebooks Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py. The same can easily be extended to the position shift and other perturbations you might want to investigate.

    To recreate the complete Salvus project: This can be done using:

    the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)

    the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py

    For the inversions: use the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) for naming conventions.

    References:

    Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469

    Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020

    Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902

  16. d

    Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

    • catalog.data.gov
    • data.nasa.gov
    • +1more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed to find outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
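    As a rough, hypothetical illustration of the setting (not the paper's actual algorithm, which is specified in the publication itself), the sketch below mimics vertically partitioned data held at two locations, centralizes only a small random sample of rows, and fits a 1-class SVM on that sample to flag outliers.

    # Toy sketch of the general idea only: each site holds a subset of features,
    # and only a small sample of rows is centralized to fit a 1-class SVM.
    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    n_rows = 10_000
    site_a = rng.normal(size=(n_rows, 3))        # features held at location A
    site_b = rng.normal(size=(n_rows, 2))        # features held at location B

    # Centralize only a small random sample of the (vertically joined) rows
    sample_idx = rng.choice(n_rows, size=500, replace=False)
    central_sample = np.hstack([site_a[sample_idx], site_b[sample_idx]])

    model = OneClassSVM(nu=0.01, kernel="rbf", gamma="scale").fit(central_sample)

    # Scoring is done centrally here purely for brevity of the toy example
    scores = model.decision_function(np.hstack([site_a, site_b]))
    outliers = np.where(scores < 0)[0]
    print(f"{len(outliers)} candidate outliers flagged")

    In this toy example the final scoring is done centrally for simplicity; the point of the published method is precisely to bound the accuracy loss of such a sample-based model while avoiding full centralization.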

  17. d

    Data from: tableone: An open source Python package for producing summary...

    • datadryad.org
    • dataone.org
    • +1more
    zip
    Updated Apr 23, 2019
    Cite
    Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark (2019). tableone: An open source Python package for producing summary statistics for research papers [Dataset]. http://doi.org/10.5061/dryad.26c4s35
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 23, 2019
    Dataset provided by
    Dryad
    Authors
    Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark
    Time period covered
    Apr 19, 2018
    Description

    Objectives: In quantitative research, understanding basic parameters of the study population is key for interpretation of the results. As a result, it is typical for the first table (“Table 1”) of a research paper to include summary statistics for the study data. Our objectives are 2-fold. First, we seek to provide a simple, reproducible method for providing summary statistics for research papers in the Python programming language. Second, we seek to use the package to improve the quality of summary statistics reported in research papers.

    Materials and Methods: The tableone package is developed following good practice guidelines for scientific computing and all code is made available under a permissive MIT License. A testing framework runs on a continuous integration server, helping to maintain code stability. Issues are tracked openly and public contributions are encouraged.

    Results: The tableone software package automatically compiles summary statistics into publishable formats such...
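    As a hedged illustration of the workflow the paper describes, the snippet below builds a small "Table 1" with the tableone package. The demo DataFrame, column names, and grouping variable are invented for this example and are not part of the published dataset.

    # Minimal usage sketch of the tableone package; the data are made up.
    import pandas as pd
    from tableone import TableOne

    df = pd.DataFrame({
        "age": [34, 61, 47, 55, 29, 72],
        "sex": ["F", "M", "F", "M", "F", "M"],
        "sbp": [118, 141, 132, 150, 121, 138],
        "group": ["control", "treated", "treated", "control", "control", "treated"],
    })

    # Summarize continuous and categorical variables, stratified by group
    table1 = TableOne(df, columns=["age", "sex", "sbp"],
                      categorical=["sex"], groupby="group", pval=False)
    print(table1)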

  18. m

    Dataset of Pairs of an Image and Tags for Cataloging Image-based Records

    • data.mendeley.com
    • narcis.nl
    Updated Feb 24, 2022
    Cite
    Tokinori Suzuki (2022). Dataset of Pairs of an Image and Tags for Cataloging Image-based Records [Dataset]. http://doi.org/10.17632/msyc6mzvhg.1
    Explore at:
    Dataset updated
    Feb 24, 2022
    Authors
    Tokinori Suzuki
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Brief Explanation

    This dataset was created to develop and evaluate a cataloging system that assigns appropriate metadata to image records for database management in digital libraries. It is intended for evaluating a task in which, given an image and its assigned tags, an appropriate Wikipedia page is selected for each of the given tags.

    A main characteristic of the dataset is that it includes ambiguous tags, so the visual content of an image is not uniquely determined by its tags. For example, it includes the tag 'mouse', which can refer either to the mammal or to a computer input device. The annotations are the Wikipedia articles judged by humans to be the correct entity for each tag.

    The dataset offers both data and programs that reproduce experiments for the above-mentioned task. Its data consist of image sources and annotations. The image sources are URLs of 420 images uploaded to Flickr. The annotations are a total of 2,464 relevant Wikipedia pages manually judged for the tags of the images. The dataset also provides programs in a Jupyter notebook (scripts.ipynb) to conduct a series of experiments running some baseline methods for the designated task and evaluating the results.

    Structure of the Dataset

    1. data directory 1.1. image_URL.txt This file lists URLs of image files.

      1.2. rels.txt This file lists the correct Wikipedia pages for each topic in topics.txt

      1.3. topics.txt This file lists the target pairs of an image and a tag to be disambiguated; each pair is called a topic in this dataset.

      1.4. enwiki_20171001.xml This file contains texts extracted from the title and body parts of English Wikipedia articles as of 1st October 2017. It is a modified version of the Wikipedia dump data (https://archive.org/download/enwiki-20171001).

    2. img directory This is a placeholder directory into which the image files are downloaded (a small download sketch follows this list).

    3. results directory This is a placeholder directory to store result files for evaluation. It holds three results of baseline methods in sub-directories; each sub-directory contains JSON files, one per topic, which are ready to be evaluated using the evaluation script in scripts.ipynb for reference of both usage and performance.

    4. scripts.ipynb The scripts for running baseline methods and evaluation are ready in this Jupyter notebook file.
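    Since the img directory ships empty, a small, hypothetical helper like the one below can populate it from data/image_URL.txt. It assumes one Flickr URL per line; the actual file layout and any download restrictions should be checked against the dataset itself.

    # Hypothetical helper (not shipped with the dataset) to fetch the images
    # listed in data/image_URL.txt into the img/ placeholder directory.
    import os
    import urllib.request

    os.makedirs("img", exist_ok=True)
    with open("data/image_URL.txt", encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]   # assumes one URL per line

    for i, url in enumerate(urls):
        dest = os.path.join("img", f"{i:04d}_" + os.path.basename(url.split("?")[0]))
        if not os.path.exists(dest):
            try:
                urllib.request.urlretrieve(url, dest)   # images may have been removed from Flickr
            except OSError as err:
                print(f"Skipping {url}: {err}")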

  19. f

    Generated output for our example data sets.

    • plos.figshare.com
    xls
    Updated May 30, 2023
    Cite
    Abdullah-Al Mamun; Robert Aseltine; Sanguthevar Rajasekaran (2023). Generated output for our example data sets. [Dataset]. http://doi.org/10.1371/journal.pone.0124449.t004
    Explore at:
    xlsAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Abdullah-Al Mamun; Robert Aseltine; Sanguthevar Rajasekaran
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generated output for our example data sets.

  20. Synthetic datasets of the UK Biobank cohort

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, pdf, zip
    Updated Feb 6, 2025
    Cite
    Antonio Gasparrini; Antonio Gasparrini; Jacopo Vanoli; Jacopo Vanoli (2025). Synthetic datasets of the UK Biobank cohort [Dataset]. http://doi.org/10.5281/zenodo.13983170
    Explore at:
    bin, csv, zip, pdfAvailable download formats
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Antonio Gasparrini; Antonio Gasparrini; Jacopo Vanoli; Jacopo Vanoli
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

    The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

    Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

    The original datasets are described in the article by Vanoli et al in Epidemiology (2024) (DOI: 10.1097/EDE.0000000000001796) [freely available here], which also provides information about the data sources.

    The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

    Content

    The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:

    • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
    • synthbdbasevar: baseline variables, mostly collected at recruitment.
    • synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
    • synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.

    In addition, this repository provides these additional files:

    • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
    • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
    • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].

    Generation of the synthetic data

    The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

    The first part merges all the data including the annual PM2.5 levels in a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.

    This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables as well as the mortality risks resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
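    For a quick first look at the synthetic release, a short pandas sketch such as the one below can load and join the csv versions of the datasets; the exact file names and the participant ID column (assumed here to be eid) should be verified against the codebook.

    # Quick-look sketch (not part of the release) for the csv versions of the
    # synthetic datasets; file names and the "eid" ID column are assumptions.
    import pandas as pd

    cohort = pd.read_csv("synthbdcohortinfo.csv")
    basevar = pd.read_csv("synthbdbasevar.csv")
    pm = pd.read_csv("synthpmdata.csv")
    deaths = pd.read_csv("synthoutdeath.csv")

    # Join baseline variables to cohort info and attach death records where present
    df = (cohort.merge(basevar, on="eid", how="left")
                .merge(deaths, on="eid", how="left"))
    print(df.shape)
    print(pm.head())   # annual PM2.5 exposure reconstructed from residential history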
