100+ datasets found

d
Government Open Data Platform Dataset List
data.gov.tw
csv
Updated Jun 17, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ministry of Digital Affairs (2025). Government Open Data Platform Dataset List [Dataset]. https://data.gov.tw/en/datasets/6564
Explore at:
csvAvailable download formats
Dataset updated
Jun 17, 2025
Dataset authored and provided by
Ministry of Digital Affairs
License
https://data.gov.tw/licensehttps://data.gov.tw/license
Description
The interpretation data listed in the government's data open platform dataset includes the dataset name, file format, download link, dataset type, dataset description, main field description, dataset provider, update frequency, authorization, authorization explanation URL, billing method, encoding format, dataset provider contact person, dataset provider contact person phone, and remarks.

Sample Graph Datasets in CSV Format

zenodo.org

csv

Updated Dec 9, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14330132

Explore at:

csvAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.14330132

Dataset updated

Dec 9, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Edwin Carreño; Edwin Carreño

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Sample Graph Datasets in CSV Format

Note: none of the data sets published here contain actual data, they are for testing purposes only.

Description

This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

dataset_30_nodes_interactions.csv:contains 30 rows (nodes).
dataset_30_edges_interactions.csv: contains 47 rows (edges).
the common identifier dataset_30 refers to the same graph.

CSV nodes

Each dataset contains the following columns:

Name of the Column	Type	Description
UniProt ID	string	protein identification
label	string	protein label (type of node)
properties	string	a dictionary containing properties related to the protein.

CSV edges

Each dataset contains the following columns:

Name of the Column	Type	Description
Relationship ID	string	relationship identification
Source ID	string	identification of the source protein in the relationship
Target ID	string	identification of the target protein in the relationship
label	string	relationship label (type of relationship)
properties	string	a dictionary containing properties related to the relationship.

Metadata

Graph	Number of Nodes	Number of Edges	Sparse graph
dataset_30*	30	47	Y
dataset_60*	60	181	Y
dataset_120*	120	689	Y
dataset_240*	240	2819	Y
dataset_300*	300	4658	Y
dataset_600*	600	18004	Y
dataset_1200*	1200	71785	Y
dataset_2400*	2400	288600	Y
dataset_3000*	3000	449727	Y
dataset_6000*	6000	1799413	Y
dataset_12000*	12000	7199863	Y
dataset_24000*	24000	28792361	Y

This repository include two (2) additional tiny graph datasets to experiment before dealing with larger datasets.

CSV nodes (tiny graphs)

Each dataset contains the following columns:

Name of the Column	Type	Description
ID	string	node identification
label	string	node label (type of node)
properties	string	a dictionary containing properties related to the node.

CSV edges (tiny graphs)

Each dataset contains the following columns:

Name of the Column	Type	Description
ID	string	relationship identification
source	string	identification of the source node in the relationship
target	string	identification of the target node in the relationship
label	string	relationship label (type of relationship)
properties	string	a dictionary containing properties related to the relationship.

Metadata (tiny graphs)

Graph	Number of Nodes	Number of Edges	Sparse graph
dataset_dummy*	3	6	N
dataset_dummy2*	3	6	N

c
Open Data Portal Of The City Of Mendoza
catalog.civicdataecosystem.org
Updated May 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Open Data Portal Of The City Of Mendoza [Dataset]. https://catalog.civicdataecosystem.org/dataset/open-data-portal-of-the-city-of-mendoza
Explore at:
Dataset updated
May 5, 2025
Area covered
Mendoza
Description
Learn the step-by-step process to start downloading the open data of the City of Mendoza. To access and download the open data of the City of Mendoza, you do not need to register or create a user account. Access to the repository is free, and all datasets can be downloaded free of charge and without restrictions. The homepage has access buttons to 14 data categories and a search engine where you can directly enter the topic you want to access. Each data category refers to a section of the platform where you will find the various datasets available, grouped by theme. As an example, if we enter the Security section, we find different datasets within. Once you enter the dataset, you will find a list of resources. Each of these resources is a file that contains the data. For example, the dataset Security Dependencies includes specific information about each of the dependencies and allows you to access the information published in different formats and download it. In this case, if you want to open the file with the Excel program, you must click on the download button of the second resource that specifies that the format is CSV. Likewise, in other sections, there are datasets with information in various formats, such as XLS and KMZ. Each of the datasets also contains a file with additional information where you can see the last update date, the update frequency, and which government area is generating this information, among other things. To access and download the open data of the City of Mendoza, you do not need to register or create a user account. Access to the repository is free, and all datasets can be downloaded free of charge and without restrictions. The homepage has access buttons to 14 data categories and a search engine where you can directly enter the topic you want to access. Each data category refers to a section of the platform where you will find the various datasets available, grouped by theme. As an example, if we enter the Security section, we find different datasets within. Once you enter the dataset, you will find a list of resources. Each of these resources is a file that contains the data. For example, the dataset Security Dependencies includes specific information about each of the dependencies and allows you to access the information published in different formats and download it. In this case, if you want to open the file with the Excel program, you must click on the download button of the second resource that specifies that the format is CSV. Likewise, in other sections, there are datasets with information in various formats, such as XLS and KMZ. Each of the datasets also contains a file with additional information where you can see the last update date, the update frequency, and which government area is generating this information, among other things. Translated from Spanish Original Text: Conocé el paso a paso para empezar a descargar los datos abiertos de la Ciudad de Mendoza. Para acceder y descargar los datos abiertos de la Ciudad de Mendoza, no necesitás realizar ningún tipo de registro ni crear un usuario. El acceso al repositorio es libre y todos los datasets se pueden descargar de manera gratuita y sin restricciones. La página de inicio cuenta con botones de acceso a 14 categorías de datos y un buscador en donde podés ingresar directamente al tema al que quieras acceder. Cada categoría de datos, refiere a una sección de la plataforma en donde vas a encontrar los distintos datasets disponibles agrupados por temática. A modo de ejemplo, si ingresamos en la sección Seguridad, dentro encontramos diferentes datasets. Una vez que ingresas al dataset, encontrarás una lista de recursos. Cada uno de estos recursos es un archivo que contiene los datos. Por ejemplo, el dataset Dependencias de Seguridad incluye información específica sobre cada una de las dependencias y te permite acceder a la información publicada en distintos formatos y descargarla. En este caso, si quieres abrir el archivo con el programa Excel deberás hacer clic sobre el botón descargar del segundo recurso que especifica que el formato es CSV. Así como también, en otras secciones hay datasets con la información en diversos formatos, como XLS y KMZ Cada uno de los datasets, contiene además una ficha con información adicional en donde podés ver la última fecha de actualización, la frecuencia de actualización y qué área de gobierno es la generadora de esta información, entre otros. Para acceder y descargar los datos abiertos de la Ciudad de Mendoza, no necesitás realizar ningún tipo de registro ni crear un usuario. El acceso al repositorio es libre y todos los datasets se pueden descargar de manera gratuita y sin restricciones. La página de inicio cuenta con botones de acceso a 14 categorías de datos y un buscador en donde podés ingresar directamente al tema al que quieras acceder. Cada categoría de datos, refiere a una sección de la plataforma en donde vas a encontrar los distintos datasets disponibles agrupados por temática. A modo de ejemplo, si ingresamos en la sección Seguridad, dentro encontramos diferentes datasets. Una vez que ingresas al dataset, encontrarás una lista de recursos. Cada uno de estos recursos es un archivo que contiene los datos. Por ejemplo, el dataset Dependencias de Seguridad incluye información específica sobre cada una de las dependencias y te permite acceder a la información publicada en distintos formatos y descargarla. En este caso, si quieres abrir el archivo con el programa Excel deberás hacer clic sobre el botón descargar del segundo recurso que especifica que el formato es CSV. Así como también, en otras secciones hay datasets con la información en diversos formatos, como XLS y KMZ Cada uno de los datasets, contiene además una ficha con información adicional en donde podés ver la última fecha de actualización, la frecuencia de actualización y qué área de gobierno es la generadora de esta información, entre otros.
D
History of work (all graph datasets)
druid.datalegend.net
api.druid.datalegend.net
+1more
application/n-quads +5
Updated Jul 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
History of Work (2025). History of work (all graph datasets) [Dataset]. https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest
Explore at:
application/n-quads, application/n-triples, application/trig, ttl, jsonld, application/sparql-results+jsonAvailable download formats
Dataset updated
Jul 17, 2025
Dataset authored and provided by
History of Work
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
History of Work

Here you find the History of Work resources as Linked Open Data. It enables you to look ups for HISCO and HISCAM scores for an incredible amount of occupational titles in numerous languages.

Data can be queried (obtained) via the SPARQL endpoint or via the example queries. If the Linked Open Data format is new to you, you might enjoy these data stories on History of Work as Linked Open Data and this user question on Is there a list of female occupations?.

NEW version - CHANGE notes

This version is dated Apr 2025 and is not backwards compatible with the previous version (Feb 2021). The major changes are: - incredible simplification of graph representation (from 81 to 12); - use of sdo (https://schema.org/) rather than schema (http://schema.org); - replacement of prov:wasDerivedFrom with sdo:isPartOf to link occupational titles to originating datasets; - etl files (used for conversion to Linked Data) now publicly available via https://github.com/rlzijdeman/rdf-hisco; - update of issues with language tags; - specfication of language tags for english (eg. @en-gb, instead of @en); - new preferred API: https://api.druid.datalegend.net/datasets/HistoryOfWork/historyOfWork-all-latest/sparql (old API will be deprecated at some point: https://api.druid.datalegend.net/datasets/HistoryOfWork/historyOfWork-all-latest/services/historyOfWork-all-latest/sparql ) .

There are bound to be some issues. Please leave report them here.

Figure 1. Part of model illustrating the basic relation between occupations, schema.org and HISCO. https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca5521" alt="hisco-basic">

Figure 2. Part of model illustrating the relation between occupation, provenance and HISCO auxiliary variables. https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca551e" alt="hisco-aux">
p
DCAT-AP API endpoints for data.public.lu
data.public.lu
html, rdf, xlsx
Updated May 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Open Data Lëtzebuerg (2024). DCAT-AP API endpoints for data.public.lu [Dataset]. https://data.public.lu/en/datasets/dcat-ap-api-endpoints-for-data-public-lu/
Explore at:
rdf, html, xlsx(16280)Available download formats
Dataset updated
May 27, 2024
Dataset authored and provided by
Open Data Lëtzebuerg
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Data.public.lu provides all its metadata in the DCAT and DCAT-AP formats, i.e. all data about the data stored or referenced on data.public.lu. DCAT (Data Catalog Vocabulary) is a specification designed to facilitate interoperability between data catalogs published on the Web. This specification has been extended via the DCAT-AP (DCAT Application Profile for data portals in Europe) standard, specifically for data portals in Europe. The serialisation of those vocabularies is mainly done in RDF (Resource Description Framework). The implementation of data.public.lu is based on the one of the open source udata platform. This API enables the federation of multiple Data portals together, for example, all the datasets published on data.public.lu are also published on data.europa.eu. The DCAT API from data.public.lu is used by the european data portal to federate its metadata. The DCAT standard is thus very important to guarantee the interoperability between all data portals in Europe. Usage Full catalog You can find here a few examples using the curl command line tool: To get all the metadata from the whole catalog hosted on data.public.lu curl https://data.public.lu/catalog.rdf Metadata for an organization To get the metadata of a specific organization, you need first to find its ID. The ID of an organization is the last part of its URL. For the organization "Open data Lëtzebuerg" its URL is https://data.public.lu/fr/organizations/open-data-letzebuerg/ and its ID is open-data-letzebuerg. To get all the metadata for a given organization, we need to call the following URL, where {id} has been replaced by the correct ID: https://data.public.lu/api/1/organizations/{id}/catalog.rdf Example: curl https://data.public.lu/api/1/organizations/open-data-letzebuerg/catalog.rdf Metadata for a dataset To get the metadata of a specific dataset, you need first to find its ID. The ID of dataset is the last part of its URL. For the dataset "Digital accessibility monitoring report - 2020-2021" its URL is https://data.public.lu/fr/datasets/digital-accessibility-monitoring-report-2020-2021/ and its ID is digital-accessibility-monitoring-report-2020-2021. To get all the metadata for a given dataset, we need to call the following URL, where {id} has been replaced by the correct ID: https://data.public.lu/api/1/datasets/{id}/rdf Example: curl https://data.public.lu/api/1/datasets/digital-accessibility-monitoring-report-2020-2021/rdf Compatibility with DCAT-AP 2.1.1 The DCAT-AP standard is in constant evolution, so the compatibility of the implementation should be regularly compared with the standard and adapted accordingly. In May 2023, we have done this comparison, and the result is available in the resources below (see document named 'udata 6 dcat-ap implementation status"). In the DCAT-AP model, classes and properties have a priority level which should be respected in every implementation: mandatory, recommended and optional. Our goal is to implement all mandatory classes and properties, and if possible implement all recommended classes and properties which make sense in the context of our open data portal.
Open Data Portal Catalogue
open.canada.ca
datasets.ai
+1more
csv, json, jsonl, png +2
Updated Jul 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Treasury Board of Canada Secretariat (2025). Open Data Portal Catalogue [Dataset]. https://open.canada.ca/data/en/dataset/c4c5c7f1-bfa6-4ff6-b4a0-c164cb2060f7
Explore at:
csv, sqlite, json, png, jsonl, xlsxAvailable download formats
Dataset updated
Jul 13, 2025
Dataset provided by
Treasury Board of Canada Secretariathttp://www.tbs-sct.gc.ca/
Treasury Board of Canadahttps://www.canada.ca/en/treasury-board-secretariat/corporate/about-treasury-board.html
License
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Description
The open data portal catalogue is a downloadable dataset containing some key metadata for the general datasets available on the Government of Canada's Open Data portal. Resource 1 is generated using the ckanapi tool (external link) Resources 2 - 8 are generated using the Flatterer (external link) utility. ###Description of resources: 1. Dataset is a JSON Lines (external link) file where the metadata of each Dataset/Open Information Record is one line of JSON. The file is compressed with GZip. The file is heavily nested and recommended for users familiar with working with nested JSON. 2. Catalogue is a XLSX workbook where the nested metadata of each Dataset/Open Information Record is flattened into worksheets for each type of metadata. 3. datasets metadata contains metadata at the dataset level. This is also referred to as the package in some CKAN documentation. This is the main table/worksheet in the SQLite database and XLSX output. 4. Resources Metadata contains the metadata for the resources contained within each dataset. 5. resource views metadata contains the metadata for the views applied to each resource, if a resource has a view configured. 6. datastore fields metadata contains the DataStore information for CSV datasets that have been loaded into the DataStore. This information is displayed in the Data Dictionary for DataStore enabled CSVs. 7. Data Package Fields contains a description of the fields available in each of the tables within the Catalogue, as well as the count of the number of records each table contains. 8. data package entity relation diagram Displays the title and format for column, in each table in the Data Package in the form of a ERD Diagram. The Data Package resource offers a text based version. 9. SQLite Database is a .db database, similar in structure to Catalogue. This can be queried with database or analytical software tools for doing analysis.
h
Llama-2-SQL-and-Code-Dataset
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chris Hayduk, Llama-2-SQL-and-Code-Dataset [Dataset]. https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-and-Code-Dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Chris Hayduk
Description
Dataset Card for "Llama-2-SQL-and-Code-Dataset"

This dataset is intended to provide LLaMA 2 improved coding and instruction following capabilities, with a specific focus on SQL generation. The dataset is in Alpaca Instruct format. Please be sure to provide the instruction and input in the prompt to the model, along with any prompt text you would like to place around those inputs. In the train split, please ignore the table column. The eval split provides example tables so that the… See the full description on the dataset page: https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-and-Code-Dataset.
Z
Dataset for paper "Mitigating the effect of errors in source parameters on...
data.niaid.nih.gov
Updated Sep 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nicholas Rawlinson (2022). Dataset for paper "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6969601
Explore at:
Dataset updated
Sep 28, 2022
Dataset provided by
Phil-Simon Hardalupas
Nienke Blom
Nicholas Rawlinson
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect or errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure. These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2018).

This dataset contains:

The entire Salvus project. This project was prepared using Salvus version 0.11.x and 0.12.2 and should be fully compatible with the latter.

A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.

A number of Python scripts that are used in above notebooks.

two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.

An overview of the inversion configurations used for each inversion experiment and the name of hte corresponding figures: inversion_runs_overview.ods / .csv .

Datasets corresponding to the different figures.

One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020

One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).

A number of datasets (stored as pickled Pandas dataframes) derived from the Salvus project. We have computed:

travel-time arrival predictions from every source to all stations (df_stations...pkl)

misfits for different metrics for both P-wave centered and S-wave centered windows for all components on all stations, comparing every time waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)

addition of synthetic waveforms for different (perturbed) moment tenors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type

How to use this dataset:

To set up the conda environment:

make sure you have anaconda/miniconda

make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on salvus. You can do the analyses and create the figures without, but you'll have to hack around in the scripts to build workarounds.

Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes, in that case download an older salvus version.

Additionally in your conda env, install basemap and cartopy:

conda-env create -n salvus_0_12 -f environment.yml conda install -c conda-forge basemap conda install -c conda-forge cartopy

Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.

To recreate the figures: This is extremely straightforward. Every figure has a corresponding Jupyter Notebook. Suffices to run the notebook in its entirety.

Figure 1: separate notebook, Fig1_event_98.py

Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py

Figures 3-7: Figures_perturbation_study.py

Figures 8-10: Figures_toy_inversions.py

To recreate the dataframes in DATA: This can be done using the example notebook Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py . The same can easily be extended to the position shift and other perturbations you might want to investigate.

To recreate the complete Salvus project: This can be done using:

the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)

the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py

For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.

References:

Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469

Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020

Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902
h
ShareGPT-Unfiltered-RedPajama-Chat-format
huggingface.co
Updated Jun 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fredi (2023). ShareGPT-Unfiltered-RedPajama-Chat-format [Dataset]. https://huggingface.co/datasets/Fredithefish/ShareGPT-Unfiltered-RedPajama-Chat-format
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 6, 2023
Authors
Fredi
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
ShareGPT unfiltered dataset in RedPajama-Chat format

This dataset was created by converting The alpaca-lora formatted ShareGPT dataset to the format required by RedPajama-Chat. This script was used for the conversion: https://github.com/fredi-python/Alpaca2INCITE-Dataset-Converter/blob/main/convert.py WARNING: Only the first human and gpt text of each conversation from the original dataset is included in the dataset.

The format

{"text": "
Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...
zenodo.org
csv
Updated Sep 15, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6607065
Dataset updated
Sep 15, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anonymous authors; Anonymous authors
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

The data is organized in a table structure. Code4ML includes several main objects: competitions information, raw code blocks collected form Kaggle and manually marked up snippets. Each table has a .csv format.

Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

The code blocks themselves and their metadata are collected to the data frames concerning the publishing year of the initial kernels. The current version of the corpus includes two code blocks files: snippets from kernels up to the 2020 year (сode_blocks_upto_20.csv) and those from the 2021 year (сode_blocks_21.csv) with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
D
History of work (all graph datasets)-Deprecated Version Feb 2021
druid.datalegend.net
application/n-quads +5
Updated Apr 13, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
History of Work (2025). History of work (all graph datasets)-Deprecated Version Feb 2021 [Dataset]. https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest-Feb2021
Explore at:
ttl, application/trig, application/sparql-results+json, jsonld, application/n-triples, application/n-quadsAvailable download formats
Dataset updated
Apr 13, 2025
Dataset authored and provided by
History of Work
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
Introduction

This is your workbench for historical occupations as all graphs from the historyOfWork are combined here. This version is dated Feb 2021. Use this dataset to retrieve HISCO and HISCAM scores for an incredible amount of occupations in numerous languages.

Data can be queried (obtained) via the SPARQL endpoint or via the example queries. If the Linked Open Data format is new to you, you might enjoy these data stories on History of Work as Linked Open Data and this user question on Is there a list of female occupations?.

Figure 1. Part of model illustrating the basic relation between occupations, schema.org and HISCO. https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca5521" alt="hisco-basic">

Figure 2. Part of model illustrating the relation between occupation, provenance and HISCO auxiliary variables. https://druid.datalegend.net/HistoryOfWork/historyOfWork-all-latest/assets/601beed0f7d371035bca551e" alt="hisco-aux">
Genomics examples
redivis.com
Updated Aug 16, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Redivis Demo Organization (2022). Genomics examples [Dataset]. https://redivis.com/datasets/yz1s-d09009dbb
Explore at:
Dataset updated
Aug 16, 2022
Dataset provided by
Redivis Inc.
Authors
Redivis Demo Organization
Time period covered
Jan 30, 2025
Description
This is an auto-generated index table corresponding to a folder of files in this dataset with the same name. This table can be used to extract a subset of files based on their metadata, which can then be used for further analysis. You can view the contents of specific files by navigating to the "cells" tab and clicking on an individual file_id.
g
Data on separate collection 2018 in open format (Open data) | gimi9.com
gimi9.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data on separate collection 2018 in open format (Open data) | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_c_l378-1146285/
Explore at:
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
For the different types of waste, the collection manager shall provide the data in tonnes on a monthly basis. Since January 2018 the structure of the open data file has changed compared to previous years because the method of separate collection has changed, for example the "multimate" item no longer exists.
B
Residential Schools Locations Dataset (Geodatabase)
borealisdata.ca
search.dataone.org
Updated May 31, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rosa Orlandini (2019). Residential Schools Locations Dataset (Geodatabase) [Dataset]. http://doi.org/10.5683/SP2/JFQ1SZ
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP2/JFQ1SZ
Dataset updated
May 31, 2019
Dataset provided by
Borealis
Authors
Rosa Orlandini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1863 - Jun 30, 1998
Area covered
Canada
Description
The Residential Schools Locations Dataset in Geodatabase format (IRS_Locations.gbd) contains a feature layer "IRS_Locations" that contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Residential Schools Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRRSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconcilation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative, and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini. Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its’ original location to another property, then the school is considered to have two unique locations in this dataset,the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School.When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites. Access Instructions: there are 47 files in this data package. Please download the entire data package by selecting all the 47 files and click on download. Two files will be downloaded, IRS_Locations.gbd.zip and IRS_LocFields.csv. Uncompress the IRS_Locations.gbd.zip. Use QGIS, ArcGIS Pro, and ArcMap to open the feature layer IRS_Locations that is contained within the IRS_Locations.gbd data package. The feature layer is in WGS 1984 coordinate system. There is also detailed file level metadata included in this feature layer file. The IRS_locations.csv provides the full description of the fields and codes used in this dataset.
Film Circulation dataset
zenodo.org
data.niaid.nih.gov
bin, csv, png
Updated Jul 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
Explore at:
csv, png, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7887672
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in the “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, a fuzzy matching approach with two methods: “cosine” and “osa.” where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (original film from the core dataset and the suggested match from the IMDb website was categorized in the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match). The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions”, in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check, if everything works. Scraping for the entire dataset took a few hours. Therefore, a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tried to extract data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
Z
Sentence/Table Pair Data from Wikipedia for Pre-training with...
data.niaid.nih.gov
zenodo.org
Updated Oct 29, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
Explore at:
Dataset updated
Oct 29, 2021
Dataset provided by
Huan Sun
Alyssa Lees
Yu Su
Cong Yu
Xiang Deng
You Wu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data

import webdataset as wds

path to the uncompressed files, should be a directory with a set of tar files

url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar' dataset = ( wds.Dataset(url) .shuffle(1000) # cache 1000 samples and shuffle .decode() .to_tuple("json") .batched(20) # group every 20 examples into a batch )

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

Below we show how the data is organized with two examples.

Text-only

{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence 's1_all_links': { 'Sils,_Girona': [[0, 4]], 'municipality': [[10, 22]], 'Comarques_of_Catalonia': [[30, 37]], 'Selva': [[41, 46]], 'Catalonia': [[51, 60]] }, # list of entities and their mentions in the sentence (start, end location) 'pairs': [ # other sentences that share common entity pair with the query, group by shared entity pairs { 'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair 's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query 's2s': [ # list of other sentences that contain the common entity pair, or evidence { 'md5': '2777e32bddd6ec414f0bc7a0b7fea331', 'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.', 's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence 'pair_locs': [ # mentions of the entity pair in the evidence [[19, 27]], # mentions of entity 1 [[0, 5], [288, 293]] # mentions of entity 2 ], 'all_links': { 'Selva': [[0, 5], [288, 293]], 'Comarques_of_Catalonia': [[19, 27]], 'Catalonia': [[40, 49]] } } ,...] # there are multiple evidence sentences }, ,...] # there are multiple entity pairs in the query }

Hybrid

{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.', 's1_all_links': {...}, # same as text-only 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only 'table_pairs': [ 'tid': 'Major_League_Baseball-1', 'text':[ ['World Series Records', 'World Series Records', ...], ['Team', 'Number of Series won', ...], ['St. Louis Cardinals (NL)', '11', ...], ...] # table content, list of rows 'index':[ [[0, 0], [0, 1], ...], [[1, 0], [1, 1], ...], ...] # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table. 'value_ranks':[ [0, 0, ...], [0, 0, ...], [0, 10, ...], ...] # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS 'value_inv_ranks': [], # inverse rank 'all_links':{ 'St._Louis_Cardinals': { '2': [ [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]] ] # list of mentions in the second row, the key is row_id }, 'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]}, } 'name': '', # table name, if exists 'pairs': { 'pair': ['American_League', 'National_League'], 's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query 'table_pair_locs': { '17': [ # mention of entity pair in row 17 [ [[17, 0], [3, 18]], [[17, 1], [3, 18]], [[17, 2], [3, 18]], [[17, 3], [3, 18]] ], # mention of the first entity [ [[17, 0], [21, 36]], [[17, 1], [21, 36]], ] # mention of the second entity ] } } ] }
e
Cloud Optimized Raster Encoding (CORE) format
envidat.ch
opendata.swiss
+1more
.sh, not available
Updated Jun 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ionut Iosifescu Enescu; Dominik Haas-Artho; Lucia de Espona; Marius Rüetschi (2025). Cloud Optimized Raster Encoding (CORE) format [Dataset]. http://doi.org/10.16904/envidat.230
Explore at:
not available, .shAvailable download formats
Unique identifier
https://doi.org/10.16904/envidat.230
Dataset updated
Jun 4, 2025
Dataset provided by
Swiss Federal Institute for Forest, Snow and Landscape Research WSL
Authors
Ionut Iosifescu Enescu; Dominik Haas-Artho; Lucia de Espona; Marius Rüetschi
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Switzerland
Dataset funded by
WSL
Description
DISCLAIMER: CORE is still in development. Interested parties are warmly invited to join common development, to comment, discuss, find bugs, etc. Acknowledgement: The CORE format was proudly inspired by the Cloud Optimized GeoTIFF (COG) format, by considering how to leverage the ability of clients issuing HTTP GET range requests for a time-series of remote sensing and aerial imagery (instead of just one image).

License: The Cloud Optimized Raster Encoding (CORE) specifications are released to the public domain under a Creative Commons 1.0 CC0 "No Rights Reserved" international license. You can reuse the information contained herein in any way you want, for any purposes and without restrictions.

Summary: The Cloud Optimized Raster Encoding (CORE) format is being developed for the efficient storage and management of gridded data by applying video encoding algorithms. It is mainly designed for the exchange and preservation of large time series data in environmental data repositories, while in the same time enabling more efficient workflows on the cloud. It can be applied to any large number of similar (in pixel size and image dimensions) raster data layers. CORE is not designed to replace COG but to work together with COG for a collection of many layers (e.g. by offering a fast preview of layers when switching between layers of a time series). WARNING: Currently only applicable to RGB/Byte imagery. The final CORE specifications may probably be very different from what is written herein or CORE may not ever become productive due to a myriad of reasons (see also 'Major issues to be solved'). With this early public sharing of the format we explicitly support the Open Science agenda, which implies "shifting from the standard practices of publishing research results in scientific publications towards sharing and using all available knowledge at an earlier stage in the research process" (quote from: European Commission, Directorate General for Research and Innovation, 2016. Open innovation, open science, open to the world). CORE Specifications: 1) a MP4 or WebM video digital multimedia container format (or any future video container playable as HTML video in major browsers) 2) a free to use or open video compression codec such as H.264, VP9, or AV1 (or any future video codec that is open sourced or free to use for end users) Note: H.264 is currently recommended because of the wide usage with support in all major browsers, fast encoding due to acceleration in hardware (which is currently not the case for AV1 or VP9) and the fact that MPEG LA has allowed the free use for streaming video that is free to the end users. However, please note that H.264 is restricted by patents and its use in proprietary or commercial software requires the payment of royalties to MPEG LA. However, when AV1 matures and accelerated hardware encoding becomes available, AV1 is expected to offer 30% to 50% smaller file size in comparison with H.264, while retaining the same quality. 3) the encoding frame rate should be of one frame per second (fps) with each layer segmented in internal tiles, similar to COG, ordered by the main use case when accessing the data: either layer contiguous or tile contiguous; Note: The internal tile arrangement should support an easy navigation inside the CORE video format, depending on the use case. 4) a CORE file is optimised for streaming with the moov atom at the beginning of the file (e.g. with -movflags faststart) and optional additional optimisations depending on the codec used (e.g. -tune fastdecode -tune zerolatency for H.264) 5) metadata tags inside the moov atom for describing and using geographic image data (that are preferably compatible with the OGC GeoTIFF standard or any future standard accepted by the geospatial community) as well as list of original file names corresponding to each CORE layer 6) it needs to encode similar source rasters (such as time series of rasters with the same extent and resolution, or different tiles of the same product; each input raster should be having the same image and pixel size) 7) it provides a mechanism for addressing and requesting overviews (lower resolution data) for a fast display in web browser depending on the map scale (currently external overviews) Major issues to be solved: - Internal overviews (similar to COG), by chaining lower resolution videos in the same MP4 container for fast access to overviews first); Currently, overviews are kept as separate files, as external overviews. - Metadata encoding (how to best encode spatial extent, layer names, and so on, for each of the layer inside the series, which may have a different geographical extent, etc...; Known issues: adding too many tags with FFmpeg which are not part of the standard MP4 moov atom; metadata tags have a limited string length. - Applicability beyond RGB/Byte datasets; defining a standard way of converting cell values from Int16/UInt16/UInt32/Int32/Float32/Float64/ data types into multi-band Byte values (and reconstructing them back to the original data type within acceptable thresholds) Example Notice: The provided CORE (.mp4) examples contain modified Copernicus Sentinel data [2018-2021]. For generating the CORE examples provided, 50 original Sentinel 2 (S-2) TCI data images from an area located inside Switzerland were downloaded from www.copernicus.eu, and then transformed into CORE format using ffmpeg with H.264 encoding using the x264 library. For full reproducibility, we provide the original data set and results, as well scripts for data encoding and extraction (see resources).
c
Comprehensive Grainger Products Dataset - 200K Records in CSV Format
crawlfeeds.com
csv, zip
Updated Jul 15, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Crawl Feeds (2025). Comprehensive Grainger Products Dataset - 200K Records in CSV Format [Dataset]. https://crawlfeeds.com/datasets/comprehensive-grainger-products-dataset-200k-records-in-csv-format
Explore at:
csv, zipAvailable download formats
Dataset updated
Jul 15, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Description
Unlock the power of data with our comprehensive Grainger Products Dataset, featuring over 220,000 meticulously curated records in CSV format. This dataset is an invaluable resource for businesses, researchers, and data scientists looking to optimize their operations, conduct market analysis, or enhance their machine learning models.

Each record in the dataset includes critical fields such as URL, title, brand, SKU, price, pricing unit, product model, product ID, product UNSPSC, breadcrumbs, images, specifications, compliance and restrictions, description, unique ID, and the scraped date. Whether you're analyzing product trends, comparing prices, or developing e-commerce solutions, this dataset provides the depth and breadth of information you need.

Submit you custom requests at grainger products page

Example Use Cases:

Market Research: Analyze product offerings, prices, and brands to stay competitive.

Machine Learning: Train models on rich product data to improve recommendations or search algorithms.

E-commerce Solutions: Integrate with your platform to enhance product listings, optimize pricing strategies, or automate inventory management.

Start leveraging this data to make informed decisions and gain a competitive edge in your industry.
Purchase Order Data
data.ca.gov
catalog.data.gov
csv, docx, pdf
Updated Oct 23, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California Department of General Services (2019). Purchase Order Data [Dataset]. https://data.ca.gov/dataset/purchase-order-data
Explore at:
docx, pdf, csvAvailable download formats
Dataset updated
Oct 23, 2019
Dataset authored and provided by
California Department of General Services
Description
The State Contract and Procurement Registration System (SCPRS) was established in 2003, as a centralized database of information on State contracts and purchases over $5000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015

Data Limitations:
Some purchase orders have multiple UNSPSC numbers, however only first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event however this affects the formatting of the file. The source system Bidsync is being deprecated and these issues will be resolved in the future as state systems transition to Fi$cal.

Data Collection Methodology:

The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete the data is then uploaded into the SQL Server database.

Secondary/Related Resources:

State Contract Manual (SCM) vol. 2 http://www.dgs.ca.gov/pd/Resources/publications/SCM2.aspx

State Contract Manual (SCM) vol. 3 http://www.dgs.ca.gov/pd/Resources/publications/SCM3.aspx

Buying Green http://www.dgs.ca.gov/buyinggreen/Home.aspx

United Nations Standard Products and Services Code, http://www.unspsc.org/
B
Residential School Locations Dataset (CSV Format)
borealisdata.ca
search.dataone.org
Updated Jun 5, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rosa Orlandini (2019). Residential School Locations Dataset (CSV Format) [Dataset]. http://doi.org/10.5683/SP2/RIYEMU
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP2/RIYEMU
Dataset updated
Jun 5, 2019
Dataset provided by
Borealis
Authors
Rosa Orlandini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1863 - Jun 30, 1998
Area covered
Canada
Description
The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRRSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconcilation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative, and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini. Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its’ original location to another property, then the school is considered to have two unique locations in this dataset,the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School.When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites.

Facebook

Twitter

Click to copy link

Link copied

Cite

Ministry of Digital Affairs (2025). Government Open Data Platform Dataset List [Dataset]. https://data.gov.tw/en/datasets/6564

Government Open Data Platform Dataset List

Explore at:

csvAvailable download formats

Dataset updated

Jun 17, 2025

Dataset authored and provided by

Ministry of Digital Affairs

License

https://data.gov.tw/licensehttps://data.gov.tw/license

Description

The interpretation data listed in the government's data open platform dataset includes the dataset name, file format, download link, dataset type, dataset description, main field description, dataset provider, update frequency, authorization, authorization explanation URL, billing method, encoding format, dataset provider contact person, dataset provider contact person phone, and remarks.

Clear search

Close search

Google apps

Main menu

Government Open Data Platform Dataset List

Sample Graph Datasets in CSV Format

Sample Graph Datasets in CSV Format

Description

CSV nodes

CSV edges

Metadata

CSV nodes (tiny graphs)

CSV edges (tiny graphs)

Metadata (tiny graphs)

Open Data Portal Of The City Of Mendoza

History of work (all graph datasets)

History of Work

NEW version - CHANGE notes

DCAT-AP API endpoints for data.public.lu

Open Data Portal Catalogue

Llama-2-SQL-and-Code-Dataset

Dataset for paper "Mitigating the effect of errors in source parameters on...

ShareGPT-Unfiltered-RedPajama-Chat-format

Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

History of work (all graph datasets)-Deprecated Version Feb 2021

Introduction

Genomics examples

Data on separate collection 2018 in open format (Open data) | gimi9.com

Residential Schools Locations Dataset (Geodatabase)

Film Circulation dataset

Sentence/Table Pair Data from Wikipedia for Pre-training with...

path to the uncompressed files, should be a directory with a set of tar files

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

Cloud Optimized Raster Encoding (CORE) format

License: The Cloud Optimized Raster Encoding (CORE) specifications are released to the public domain under a Creative Commons 1.0 CC0 "No Rights Reserved" international license. You can reuse the information contained herein in any way you want, for any purposes and without restrictions.

Comprehensive Grainger Products Dataset - 200K Records in CSV Format

Purchase Order Data

Residential School Locations Dataset (CSV Format)

Government Open Data Platform Dataset ListSee More Versions

Government Open Data Platform Dataset List