https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/SHYWLB
Rich metadata is required to find and understand the recorded measurements from modern experiments with their immense and complex data stores. Systems to store and manage these metadata have improved over time, but in most cases they are ad hoc collections of data relationships, often represented in domain- or site-specific application code. We are developing a general set of tools to store, manage, and retrieve data-relationship metadata. These tools will be agnostic to the underlying data storage mechanisms and to the data stored in them, making the system applicable across a wide range of science domains. Data management tools typically represent at least one relationship paradigm through implicit or explicit metadata. The addition of these metadata allows the data to be searched and understood by larger groups of users over longer periods of time. Using these systems, researchers are less dependent on one-on-one communication with the scientists involved in running the experiments, or on those scientists' ability to remember the details of their data.

In the magnetic fusion research community, the MDSplus system is widely used to record raw and processed data from experiments. Users create a hierarchical relationship tree for each instance of their experiment, allowing them to record the meanings of what is recorded. Most users of this system add a set of ad hoc tools to help locate specific experiment runs, which they can then access via this hierarchical organization. However, the MDSplus tree is only one possible organization of the records, and the additional applications that relate the experiment 'shots' to run days, experimental proposals, logbook entries, run summaries, analysis workflows, publications, etc. have, until now, been implemented on an experiment-by-experiment basis. The Metadata Provenance Ontology project, MPO, is a system built to record data provenance information about computed results. It allows users to record the inputs and outputs from each step of their computational workflows: in particular, what raw and processed data were used as inputs, what codes were run, and what results were produced. The resulting collections of provenance graphs can be annotated, grouped, searched, filtered, and browsed. This provides a powerful tool to record, understand, and locate computed results. However, provenance can be understood as one more specific data relationship, which can be construed as an instance of something more general.

Building on concepts developed in these projects, we are developing a general system that could represent all of these kinds of data relationships as mathematical graphs. Just as MDSplus and MPO were generalizations of data management needs for a collection of users, this new system will generalize the storage, location, and retrieval of the relationships between data. The system will store data relationships as data, not encoded in a set of application-specific programs or ad hoc data structures. Stored data would be referred to by URIs, allowing the system to be agnostic to the underlying data representations. Users can then traverse these graphs. The system will allow users to construct a collection of graphs describing any or all of the relationships between data items, locate interesting data, see what other graphs these data are members of, and navigate into and through them.
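As a rough illustration of the graph model described above (a sketch, not the project's implementation), URI-identified data items can be treated as nodes and typed relationships as edges, for example with the networkx Python library; all identifiers below are hypothetical:

```
import networkx as nx

# Hypothetical URIs standing in for an experiment shot, an analysis run,
# a logbook entry, and a publication.
G = nx.DiGraph()
G.add_edge("mdsplus://expt/shot/12345", "doi:10.9999/analysis.run.7", relation="input_to")
G.add_edge("doi:10.9999/analysis.run.7", "doi:10.9999/paper.2024.01", relation="reported_in")
G.add_edge("mdsplus://expt/shot/12345", "logbook://entry/9876", relation="described_by")

# Traverse outward from a shot to find everything derived from or describing it.
for node in nx.descendants(G, "mdsplus://expt/shot/12345"):
    print(node)
```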
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations and for improving understanding of how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

How the metadata was downloaded: The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens.

How the files are organized:
├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation)_2023.08.22-2023.08.28.csv
│   ├── contributor(citation)_2023.08.22-2023.08.28.csv
│   ├── data_source(citation)_2023.08.22-2023.08.28.csv
│   ├── ...
│   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2023.08.27_12.59.59.zip
│   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
│   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
│   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
│   ├── ...
│   ├── metadatablocks_v5.6
│   ├── astrophysics_v5.6.json
│   ├── biomedical_v5.6.json
│   ├── citation_v5.6.json
│   ├── ...
│   ├── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
│   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
│   ├── Arca_Dados_2023.08.27_13.34.09.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
├── dataverse_installations_summary_2023.08.28.csv
├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
├── license_options_for_each_dataverse_installation_2023.09.05.csv
└── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, where there's a row for author names, affiliations, identifier types and identifiers.
The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name. This should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the dataset's Dataverse JSON exports. The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
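As a hedged illustration of the harvesting approach described above (this is not the linked script, and the CSV file name is hypothetical), a minimal Python sketch that reads the hostname/apikey CSV and queries each installation's Search API:

```
import csv
import requests

# The CSV layout ("hostname", "apikey") mirrors the one the linked script expects.
with open("installations.csv", newline="") as f:
    for row in csv.DictReader(f):
        headers = {}
        if row.get("apikey"):
            # Dataverse APIs accept an account API token via the X-Dataverse-key header.
            headers["X-Dataverse-key"] = row["apikey"]
        resp = requests.get(
            row["hostname"].rstrip("/") + "/api/search",
            params={"q": "*", "type": "dataset", "per_page": 10},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        for item in resp.json()["data"]["items"]:
            print(item.get("global_id"), item.get("name"))
```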
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset outlines a proposed set of core, minimal metadata elements that can be used to describe biomedical datasets, such as those resulting from research funded by the National Institutes of Health. It can inform efforts to better catalog or index such data to improve discoverability. The proposed metadata elements are based on an analysis of the metadata schemas used in a set of NIH-supported data sharing repositories. Common elements from these data repositories were identified, mapped to existing data-specific metadata standards from two existing multidisciplinary data repositories, DataCite and Dryad, and compared with metadata used in MEDLINE records to establish a sustainable and integrated metadata schema. From the mappings, we developed a preliminary set of minimal metadata elements that can be used to describe NIH-funded datasets. Please see the readme file for more details about the individual sheets within the spreadsheet.
For more information about this tool see Batch Metadata Modifier Tool Toolbar Help.
- Modifying multiple files simultaneously that don't have identical structures is possible but not advised. Be especially careful modifying repeatable elements in multiple files that do not have an identical structure.
- The tool can be run as an ArcGIS Add-In or as a stand-alone Windows executable.
- The executable runs on PC only. (Not supported on Mac.)
- The ArcGIS Add-In requires ArcGIS Desktop version 10.2 or 10.3.
- Metadata formats accepted: FGDC CSDGM, ArcGIS 1.0, ArcGIS ISO, and ISO 19115.
- Contact Bruce Godfrey (bgodfrey@uidaho.edu, Ph. 208-292-1407) if you have questions or wish to collaborate on further developing this tool.

Modifying and maintaining metadata for large batches of ArcGIS items can be a daunting task. Out-of-the-box graphical user interface metadata tools within ArcCatalog 10.x are designed primarily to allow users to interact with metadata for one item at a time. There are, however, a limited number of tools for performing metadata operations on multiple items. Therefore, the need exists to develop tools to modify metadata for numerous items more effectively and efficiently. The Batch Metadata Modifier Tools toolbar is a step in that direction. The Toolbar, which is available as an ArcGIS Add-In, currently contains two tools. The first tool, which is additionally available as a standalone Windows executable application, allows users to update metadata on multiple items iteratively. The tool enables users to modify existing elements, find and replace element content, delete metadata elements, and import metadata elements from external templates. The second tool of the Toolbar, a batch thumbnail creator, enables the batch creation of the graphic that appears in an item's metadata, illustrating the data an item contains. Both of these tools make updating metadata in ArcCatalog more efficient, since they operate on numerous items iteratively through an easy-to-use graphic interface.

This tool, developed by INSIDE Idaho at the University of Idaho Library, was created to assist researchers with modifying FGDC CSDGM, ArcGIS 1.0 Format, and ISO 19115 metadata for numerous data products generated under EPSCoR award EPS-0814387. It is primarily designed for those familiar with metadata, metadata standards, and metadata schemas: metadata librarians, metadata managers, and others experienced in modifying standardized metadata. The tool is designed to expedite batch metadata maintenance. Users of this tool must fully understand the files they are modifying. No responsibility is assumed by the Idaho Geospatial Data Clearinghouse or the University of Idaho in the use of this tool. A portion of the development of this tool was made possible by an Idaho EPSCoR Office award.
Description: "What is research data management?" "How can RDM help me in my daily work?" "What would I need to do?" In this series, we break answers to these and similar questions into digestible chunks to go with your midday coffee. With this, we hope to help you maintain good research data management (RDM) in your daily work. Using data management best practices can be easy if you have a good starting point: a practice you can embrace and maintain. Researchers of all stages and fields, as well as interested parties, are welcome. No prior knowledge is required.

In the Coffee Lectures (25 minutes + time for questions), we take a quick dive into a selected topic from the RDM realm. Here, you will get to know models, concepts, and motivations as a basis for finding and adopting RDM practices that work for you. Espresso Shots are our super short lectures (5-10 minutes + time for questions) that aim to provide you with one selected actionable practice at a time that you can easily adopt today.

Learning goals: Know basic concepts of research data management; identify applicable best practices for your own work. Prerequisites: none.

Upload Content: An overview file (.pdf format) detailing the contents and order of the lectures in the intro series; 12 slide decks in .pdf format; a .zip archive containing the overview as well as the 12 corresponding presentation files in .key format.

Itemized Upload Content:
00_RDMIntroSeries_Overview.pdf: A4 display of the 12 lectures (3 coffee lectures, 9 espresso shots, thematically arranged)
1-0_Coffee_RDM.pdf: First coffee lecture: "What is Research Data Management?"
1-1_Espresso_5Things.pdf: First espresso shot lecture: "5 Things to Remember"
1-2_Espresso_FileNaming.pdf: Espresso shot lecture: "File Naming"
1-3_Espresso_FolderStructure.pdf: Espresso shot lecture: "Folder Structure"
2-0_Coffee_FAIR.pdf: Second coffee lecture: "What is FAIR data?"
2-1_Espresso_Formats.pdf: Espresso shot lecture: "File Formats"
2-2_Espresso_Tables.pdf: Espresso shot lecture: "Tables"
2-3_Espresso_Metadata.pdf: Espresso shot lecture: "Metadata?!"
3-0_Coffee_DMP.pdf: Third coffee lecture: "Planning a Research Project?"
3-1_Espresso_DMPQuestions.pdf: Espresso shot lecture: "Questions to ask for a DMP"
3-2_Espresso_READMEs.pdf: Espresso shot lecture: "READMEs"
3-3_Espresso_ArchivingBackups.pdf: Espresso shot lecture: "Backups & Archiving"
keys.zip: All presentation files in .key format of the aforementioned .pdfs in one .zip archive

Copyright Note: The slide decks include xkcd comics by Randall Munroe. The license for these can be found here: https://xkcd.com/license.html. The permanent links are as follows (in order of appearance in the slides/decks): https://xkcd.com/1806/ https://xkcd.com/2143/ https://xkcd.com/1459/ https://xkcd.com/1179/ https://xkcd.com/1781/ https://xkcd.com/2582/. Furthermore, one deck contains a comic from the series "Piled Higher and Deeper" by Jorge Cham (www.phdcomics.com): "A story told in file names" (https://phdcomics.com/comics/archive.php?comicid=1323).

Funding Note: The series was devised and executed during an employment of Jeanne Wilbrandt by the Leibniz Institute on Aging – Fritz Lipmann Institute (leibniz-fli.de). Additional partial funding was provided by The German Network for Bioinformatics Infrastructure – de.NBI (denbi.de). The slide decks include xkcd comics by Randall Munroe and a PHD comic by Jorge Cham (see Copyright Note above).
The Ontario government generates and maintains thousands of datasets. Since 2012, we have shared data with Ontarians via a data catalogue. Open data is data that is shared with the public. You can learn more about open data and why Ontario releases it. Ontario's Open Data Directive states that all data must be open, unless there is good reason for it to remain confidential. Ontario's Chief Digital and Data Officer also has the authority to make certain datasets available publicly. Datasets listed in the catalogue that are not open will have one of the following labels:

If you want to use data you find in the catalogue, that data must have a licence, a set of rules that describes how you can use it. Most of the data available in the catalogue is released under Ontario's Open Government Licence. However, each dataset may be shared with the public under other kinds of licences or no licence at all. If a dataset doesn't have a licence, you don't have the right to use the data. If you have questions about how you can use a specific dataset, please contact us.

The Ontario Data Catalogue endeavours to publish open data in a machine-readable format. For machine-readable datasets, you can simply retrieve the file you need using the file URL. The Ontario Data Catalogue is built on CKAN, which means the catalogue has the following features you can use when building applications. APIs (application programming interfaces) let software applications communicate directly with each other. If you are using the catalogue in a software application, you might want to extract data from the catalogue through the catalogue API. Note: all Datastore API requests to the Ontario Data Catalogue must be made server-side. The catalogue's collection of dataset metadata (and dataset files) is searchable through the CKAN API. The Ontario Data Catalogue has more than just CKAN's documented search fields; you can also search its custom fields. You can also use the CKAN API to retrieve metadata about a particular dataset and check for updated files. Read the complete documentation for CKAN's API. Some of the open data in the Ontario Data Catalogue is available through the Datastore API, which lets you search and access the machine-readable open data in the catalogue. To use the API feature, read the complete documentation for CKAN's Datastore API.

The Ontario Data Catalogue contains a record for each dataset that the Government of Ontario possesses. Some of these datasets will be available to you as open data. Others will not be available to you, because the Government of Ontario is unable to share data that would break the law or put someone's safety at risk. You can search for a dataset with a word that might describe a dataset or topic. Use words like "taxes" or "hospital locations" to discover what datasets the catalogue contains. You can search for a dataset from three spots on the catalogue: the homepage, the dataset search page, or the menu bar available across the catalogue. On the dataset search page, you can also filter your search results. You can select filters on the left-hand side of the page to limit your search to datasets with your favourite file format, datasets that are updated weekly, datasets released by a particular organization, or datasets that are released under a specific licence. Go to the dataset search page to see the filters that are available to make your search easier.
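As a hedged illustration of the catalogue's CKAN search API mentioned above (the base URL below is an assumption; package_search is CKAN's standard search action), a server-side Python sketch:

```
import requests

# Assumed base URL for the Ontario Data Catalogue; adjust if it differs.
BASE = "https://data.ontario.ca"

resp = requests.get(
    BASE + "/api/3/action/package_search",
    params={"q": "hospital locations", "rows": 5},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()["result"]
print(result["count"], "matching datasets")
for pkg in result["results"]:
    print(pkg["title"], "|", pkg.get("license_title"))
```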
You can also do a quick search by selecting one of the catalogue's categories on the homepage. These categories can help you see the types of data we have on key topic areas. When you find the dataset you are looking for, click on it to go to the dataset record. Each dataset record will tell you whether the data is available and, if so, tell you about the data available. An open dataset might contain several data files. These files might represent different periods of time, different subsets of the dataset, different regions, language translations, or other breakdowns. You can select a file and either download it or preview it. Be sure to read the licence agreement to confirm you have permission to use it the way you want. Read more about previewing data. A non-open dataset may not be available for many reasons. Read more about non-open data. Read more about restricted data. Data that is non-open may still be subject to freedom of information requests.

The catalogue has tools that enable all users to visualize the data in the catalogue without leaving the catalogue, with no additional software needed. Have a look at our walk-through of how to make a chart in the catalogue. Get automatic notifications when datasets are updated. You can choose to get notifications for individual datasets, an organization's datasets or the full catalogue. You don't have to provide any personal information; just subscribe to our feeds using any feed reader you like, using the corresponding notification web addresses. Copy those addresses and paste them into your reader. Your feed reader will let you know when the catalogue has been updated.

The catalogue provides open data in several file formats (e.g., spreadsheets, geospatial data, etc.). Learn about each format and how you can access and use the data each file contains.

A file that has a list of items and values separated by commas without formatting (e.g. colours, italics, etc.) or extra visual features. This format provides just the data that you would display in a table. XLSX (Excel) files may be converted to CSV so they can be opened in a text editor. How to access the data: Open with any spreadsheet software application (e.g., Open Office Calc, Microsoft Excel) or text editor. Note: This format is considered machine-readable; it can be easily processed and used by a computer. Files that have visual formatting (e.g. bolded headers and colour-coded rows) can be hard for machines to understand; these elements make a file more human-readable and less machine-readable.

A file that provides information without formatted text or extra visual features and that may not follow a pattern of separated values like a CSV. How to access the data: Open with any word processor or text editor available on your device (e.g., Microsoft Word, Notepad).

A spreadsheet file that may also include charts, graphs, and formatting. How to access the data: Open with a spreadsheet software application that supports this format (e.g., Open Office Calc, Microsoft Excel). Data can be converted to CSV for a non-proprietary format of the same data without formatted text or extra visual features.

A shapefile provides geographic information that can be used to create a map or perform geospatial analysis based on location, points/lines and other data about the shape and features of the area. It includes required files (.shp, .shx, .dbf) and might include corresponding files (e.g., .prj). How to access the data: Open with a geographic information system (GIS) software program (e.g., QGIS).
A package of files and folders. The package can contain any number of different file types. How to access the data: Open with an unzipping software application (e.g., WinZIP, 7Zip). Note: If a ZIP file contains .shp, .shx, and .dbf file types, it is an ArcGIS ZIP: a package of shapefiles which provide information to create maps or perform geospatial analysis that can be opened with ArcGIS (a geographic information system software program).

A file that provides information related to a geographic area (e.g., phone number, address, average rainfall, number of owl sightings in 2011, etc.) and its geospatial location (i.e., points/lines). How to access the data: Open using a GIS software application to create a map or do geospatial analysis. It can also be opened with a text editor to view raw information. Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

A text-based format for sharing data in a machine-readable way that can store data with more unconventional structures such as complex lists. How to access the data: Open with any text editor (e.g., Notepad) or access through a browser. Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

A text-based format to store and organize data in a machine-readable way that can store data with more unconventional structures (not just data organized in tables). How to access the data: Open with any text editor (e.g., Notepad). Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

A file that provides information related to an area (e.g., phone number, address, average rainfall, number of owl sightings in 2011, etc.) and its geospatial location (i.e., points/lines). How to access the data: Open with a geospatial software application that supports the KML format (e.g., Google Earth). Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.

This format contains files with data from tables used for statistical analysis and data visualization of Statistics Canada census data. How to access the data: Open with the Beyond 20/20 application.

A database which links and combines data from different files or applications (including HTML, XML, Excel, etc.). The database file can be converted to a CSV/TXT to make the data machine-readable, but human-readable formatting will be lost. How to access the data: Open with Microsoft Office Access (a database management system used to develop application software).

A file that keeps the original layout and
By Vishwanath Seshagiri
This dataset contains YouTube video and channel metadata for analyzing the statistical relation between videos and forming a topic tree. With 9 direct features and 13 more indirect features, it has everything you need to build a deep understanding of how videos are related, including information such as total views per unit time, channel views, likes/subscribers ratio, comments/views ratio, and dislikes/subscribers ratio. This data provides a unique opportunity to gain insights into topics such as subscriber count trends over time or the impact of trends on subscriber engagement. We can develop models that show how different types of content drive viewership and identify the most popular styles or topics within YouTube's vast catalogue. Additionally, this data offers an intriguing look into consumer behaviour: we can explore what drives people to watch specific videos at certain times, or to appreciate certain channels more than others, by analyzing measures like likes per subscriber and dislikes per view. Finally, this dataset is completely open source with an easy-to-understand GitHub repo, making it an invaluable resource for anyone looking to gain better insights into how their audience interacts with their content and how they might improve it in the future.
How to Use This Dataset
In general, it is important to understand each parameter in the data set before proceeding with analysis. The parameters included are totalviews/channelelapsedtime, channelViewCount, likes/subscriber, views/subscribers, subscriberCount, dislikes/views, comments/subscriber, channelCommentCount, likes/dislikes, comments/views, dislikes/subscribers, totviews/totsubs, and views/elapsedtime.
To use this dataset for your own analysis:
1) Review each parameter's meaning and purpose in our dataset.
2) Get familiar with basic descriptive statistics such as mean, median, mode, and range.
3) Create visualizations or tables based on subsets of our data.
4) Understand correlations between different sets of variables or parameters.
5) Generate meaningful conclusions about specific channels or topics based on organized graph hierarchies or tables.
6) Analyze trends over time for individual parameters as well as the aggregate reaction from all users when videos are released.
Predicting the Relative Popularity of Videos: This dataset can be used to build a statistical model that can predict the relative popularity of videos based on various factors such as total views, channel viewers, likes/dislikes ratio, and comments/views ratio. This model could then be used to make recommendations and predict which videos are likely to become popular or go viral.
Creating Topic Trees: The dataset can also be used to create topic trees or taxonomies by analyzing the content of videos and looking at what topics they cover. For example, one could analyze the most popular YouTube channels in a specific subject area, group together those that discuss similar topics, and then build an organized tree structure around those topics in order to better understand viewer interests in that area.
Viewer Engagement Analysis: This dataset could also be used for viewer engagement analysis by examining factors such as subscriber count, average time spent watching a video per user (elapsed time), and comments made per view, so as to gain insights into how engaged viewers are with specific content or channels on YouTube. From this information it would be possible to optimize content strategy in order to improve overall engagement rates across various types of video content and channel types.
If you use this dataset in your research, please credit the original authors.
License
Unknown License - Please check the dataset description for more information.
File: YouTubeDataset_withChannelElapsed.csv

| Column name | Description |
|:--------------------------------|:-------------------------------------------------------|
| totalviews/channelelapsedtime | Ratio of total views to channel elapsed time. (Ratio) |
| channelViewCount | Total number of views for the channel. (Integer) |
| likes/subscriber | ... |
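To make the columns above concrete, a short, hedged pandas sketch that loads the CSV and inspects a few of the listed columns (only column names shown in the table above are used):

```
import pandas as pd

# Load the dataset and summarize a few of the documented columns.
df = pd.read_csv("YouTubeDataset_withChannelElapsed.csv")

cols = ["totalviews/channelelapsedtime", "channelViewCount", "likes/subscriber"]
print(df[cols].describe())

# Simple correlation between channel size and the likes-per-subscriber ratio.
print(df["channelViewCount"].corr(df["likes/subscriber"]))
```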
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data.public.lu provides all its metadata in the DCAT and DCAT-AP formats, i.e. all data about the data stored or referenced on data.public.lu. DCAT (Data Catalog Vocabulary) is a specification designed to facilitate interoperability between data catalogs published on the Web. This specification has been extended via the DCAT-AP (DCAT Application Profile for data portals in Europe) standard, specifically for data portals in Europe. The serialisation of those vocabularies is mainly done in RDF (Resource Description Framework). The implementation of data.public.lu is based on that of the open source udata platform. This API enables the federation of multiple data portals; for example, all the datasets published on data.public.lu are also published on data.europa.eu. The DCAT API from data.public.lu is used by the European data portal to federate its metadata. The DCAT standard is thus very important to guarantee the interoperability between all data portals in Europe.

Usage

Full catalog: You can find here a few examples using the curl command line tool. To get all the metadata from the whole catalog hosted on data.public.lu: curl https://data.public.lu/catalog.rdf

Metadata for an organization: To get the metadata of a specific organization, you first need to find its ID. The ID of an organization is the last part of its URL. For the organization "Open data Lëtzebuerg", the URL is https://data.public.lu/fr/organizations/open-data-letzebuerg/ and its ID is open-data-letzebuerg. To get all the metadata for a given organization, call the following URL, where {id} has been replaced by the correct ID: https://data.public.lu/api/1/organizations/{id}/catalog.rdf Example: curl https://data.public.lu/api/1/organizations/open-data-letzebuerg/catalog.rdf

Metadata for a dataset: To get the metadata of a specific dataset, you first need to find its ID. The ID of a dataset is the last part of its URL. For the dataset "Digital accessibility monitoring report - 2020-2021", the URL is https://data.public.lu/fr/datasets/digital-accessibility-monitoring-report-2020-2021/ and its ID is digital-accessibility-monitoring-report-2020-2021. To get all the metadata for a given dataset, call the following URL, where {id} has been replaced by the correct ID: https://data.public.lu/api/1/datasets/{id}/rdf Example: curl https://data.public.lu/api/1/datasets/digital-accessibility-monitoring-report-2020-2021/rdf

Compatibility with DCAT-AP 2.1.1: The DCAT-AP standard is in constant evolution, so the compatibility of the implementation should be regularly compared with the standard and adapted accordingly. In May 2023, we did this comparison, and the result is available in the resources below (see the document named "udata 6 dcat-ap implementation status"). In the DCAT-AP model, classes and properties have a priority level which should be respected in every implementation: mandatory, recommended and optional. Our goal is to implement all mandatory classes and properties, and if possible all recommended classes and properties which make sense in the context of our open data portal.
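Beyond curl, the same catalog can be read with any RDF library; a hedged Python sketch using rdflib (the library choice is ours, not part of the portal's documentation):

```
from rdflib import Graph, Namespace, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")

# Fetch and parse the DCAT catalog published by data.public.lu.
g = Graph()
g.parse("https://data.public.lu/catalog.rdf", format="xml")

# Count the datasets returned and print a few titles.
datasets = list(g.subjects(RDF.type, DCAT.Dataset))
print(len(datasets), "datasets in this response")
for ds in datasets[:5]:
    for title in g.objects(ds, DCT.title):
        print("-", title)
```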
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A manually curated registry of standards, split into three types - Terminology Artifacts (ontologies, e.g. Gene Ontology), Models and Formats (conceptual schema, formats, data models, e.g. FASTA), and Reporting Guidelines (e.g. the ARRIVE guidelines for in vivo animal testing). These are linked to the databases that implement them and the funder and journal publisher data policies that recommend or endorse their use.
The NIST Extensible Resource Data Model (NERDm) is a set of schemas for encoding, in JSON format, metadata that describe digital resources. The variety of digital resources it can describe includes not only digital data sets and collections, but also software, digital services, web sites and portals, and digital twins. It was created to serve as the internal metadata format used by the NIST Public Data Repository and Science Portal to drive rich presentations on the web and to enable discovery; however, it was also designed to enable programmatic access to resources and their metadata by external users. Interoperability was also a key design aim: the schemas are defined using the JSON Schema standard, metadata are encoded as JSON-LD, and their semantics are tied to community ontologies, with an emphasis on DCAT and the US federal Project Open Data (POD) models. Finally, extensibility is also central to its design: the schemas are composed of a central core schema and various extension schemas. New extensions to support richer metadata concepts can be added over time without breaking existing applications.

Validation is central to NERDm's extensibility model. Consuming applications should be able to choose which metadata extensions they care to support and ignore terms and extensions they don't support. Furthermore, they should not fail when a NERDm document leverages extensions they don't recognize, even when on-the-fly validation is required. To support this flexibility, the NERDm framework allows documents to declare what extensions are being used and where. We have developed an optional extension to the standard JSON Schema validation (see ejsonschema below) to support flexible validation: while a standard JSON Schema validator can validate a NERDm document against the NERDm core schema, our extension will validate a NERDm document against any recognized extensions and ignore those that are not recognized.

The NERDm data model is based around the concept of a resource, semantically equivalent to a schema.org Resource, and as in schema.org, there can be different types of resources, such as data sets and software. A NERDm document indicates what types the resource qualifies as via the JSON-LD "@type" property. All NERDm Resources are described by metadata terms from the core NERDm schema; however, different resource types can be described by additional metadata properties (often drawing on particular NERDm extension schemas). A Resource contains Components of various types (including DCAT-defined Distributions) that are considered part of the Resource; specifically, these can include downloadable data files, hierarchical data collections, links to web sites (like software repositories), software tools, or other NERDm Resources. Through the NERDm extension system, domain-specific metadata can be included at either the resource or component level. The direct semantic and syntactic connections to the DCAT, POD, and schema.org schemas are intended to ensure unambiguous conversion of NERDm documents into those schemas.

As of this writing, the Core NERDm schema and its framework stand at version 0.7 and are compatible with the "draft-04" version of JSON Schema. Version 1.0 is projected to be released in 2025. In that release, the NERDm schemas will be updated to the "draft2020" version of JSON Schema. Other improvements will include stronger support for RDF and the Linked Data Platform through its support of JSON-LD.
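The validation workflow described above can be illustrated with plain JSON Schema tooling; a minimal sketch using the Python jsonschema package, with a made-up toy schema and record standing in for the actual NERDm core schema (this is not NERDm's ejsonschema extension):

```
import jsonschema

# Toy schema standing in for a NERDm-like core schema; illustrative only.
core_schema = {
    "type": "object",
    "required": ["@type", "title"],
    "properties": {
        "@type": {"type": "array", "items": {"type": "string"}},
        "title": {"type": "string"},
        "components": {"type": "array"},
    },
}

# Hypothetical record; the type names are placeholders, not the real vocabulary.
record = {
    "@type": ["example:DataPublication", "dcat:Dataset"],
    "title": "Example resource",
    "components": [],
}

# Raises jsonschema.ValidationError if the record does not conform.
jsonschema.validate(instance=record, schema=core_schema)
print("record is valid against the toy core schema")
```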
https://creativecommons.org/publicdomain/zero/1.0/
In this release you will find data about software distributed and/or crafted publicly on the Internet. You will find information about its development, its distribution and its relationship with other software included as a dependency. You will not find any information about the individuals who create and maintain these projects.
Libraries.io gathers data on open source software from 33 package managers and 3 source code repositories. We track over 2.4m unique open source projects, 25m repositories and 121m interdependencies between them. This gives Libraries.io a unique understanding of open source software.
This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — https://libraries.io/data — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
https://console.cloud.google.com/marketplace/details/libraries-io/librariesio
Banner photo by Caspar Rubin from Unsplash.
What are the repositories, avg project size, and avg # of stars?
What are the top dependencies per platform?
What are the top unmaintained or deprecated projects?
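A hedged sketch of answering questions like these with the BigQuery client library; the dataset and column names (bigquery-public-data.libraries_io.projects, platform, stars) are assumptions inferred from the Marketplace listing and may differ:

```
from google.cloud import bigquery

# Assumes application default credentials; table and column names are assumptions.
client = bigquery.Client()

query = """
    SELECT platform, COUNT(*) AS projects, AVG(stars) AS avg_stars
    FROM `bigquery-public-data.libraries_io.projects`
    GROUP BY platform
    ORDER BY projects DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.platform, row.projects, round(row.avg_stars or 0, 1))
```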
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How will you manage the data for your next big collaborative project? HydroShare is an online, collaborative system for open sharing of hydrologic data, analytical tools, and models. It supports the sharing of and collaboration around "hydrologic resources", which are data or models in formats commonly used in hydrology. HydroShare expands the data sharing capability of the CUAHSI Hydrologic Information System by broadening the classes of data accommodated to include geospatial and multidimensional space-time datasets commonly used in hydrology. HydroShare also includes new capability for sharing models, model components, and analytical tools. It can help you manage your data among collaborators and meet funding agency data management plan requirements, and it can publish your data using citable digital object identifiers (DOIs). In this seminar you will learn how to load files into HydroShare so that you can share them with colleagues and publish them. I will show how to manage access to the content that you share, how to easily add metadata, and how, in some cases, metadata is automatically completed for you. The capability to assign DOIs to HydroShare resources means that they are permanently citable, helping researchers who share their data get credit for the data published. Models and Model Instances (which in HydroShare are a model application to a specific site, with its input and output data) can also receive DOIs. Collections allow multiple resources from a study to be aggregated together, providing a comprehensive archival record of the research outcomes and supporting transparency and reproducibility, thereby enhancing trust in the research findings. Reuse to support additional research is also enabled. Files in HydroShare may be analyzed through web apps configured to access HydroShare resources. Apps support visualization and analysis of HydroShare resources in a platform-independent web environment. This presentation will demo some apps and describe ongoing development of functionality to support collaboration, modeling and data analysis in HydroShare.
https://www.technavio.com/content/privacy-notice
Data Catalog Market Size 2025-2029
The data catalog market is forecast to grow by USD 5.03 billion at a CAGR of 29.5% from 2024 to 2029. Rising demand for self-service analytics will drive the data catalog market.
Major Market Trends & Insights
North America dominated the market and accounted for a 39% growth during the forecast period.
By Component - Solutions segment was valued at USD 822.80 billion in 2023
By Deployment - Cloud segment accounted for the largest market revenue share in 2023
Market Size & Forecast
Market Opportunities: USD 554.30 million
Market Future Opportunities: USD 5031.50 million
CAGR: 29.5%
North America: Largest market in 2023
Market Summary
The market is a dynamic and evolving landscape, driven by the increasing demand for self-service analytics and the rise of data mesh architecture. Core technologies, such as metadata management and data discovery, play a crucial role in enabling organizations to effectively manage and utilize their data assets. Applications, including data governance and data integration, are also seeing significant growth as businesses seek to optimize their data management processes.
However, maintaining catalog accuracy over time poses a challenge, with concerns surrounding data lineage, data quality, and data security. According to recent estimates, the market is expected to account for over 30% of the overall data management market share by 2025, underscoring its growing importance in the digital transformation era.
What will be the Size of the Data Catalog Market during the forecast period?
Get Key Insights on Market Forecast (PDF) Request Free Sample
How is the Data Catalog Market Segmented and what are the key trends of market segmentation?
The data catalog industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Component
Solutions
Services
Deployment
Cloud
On-premises
Type
Technical metadata
Business metadata
Operational metadata
Geography
North America
US
Canada
Europe
France
Germany
Italy
Russia
UK
APAC
China
India
Japan
Rest of World (ROW)
By Component Insights
The solutions segment is estimated to witness significant growth during the forecast period.
Data catalog solutions have gained significant traction in today's data-driven business landscape, addressing complexities in data discovery, governance, collaboration, and data lifecycle management. These solutions enable users to search and discover relevant datasets for analytical or reporting purposes, thereby reducing the time spent locating data, promoting data reuse, and ensuring the usage of appropriate datasets for specific tasks. Centralized metadata storage is a key feature of data catalog solutions, offering detailed information about datasets, including source, schema, data quality, lineage, and other essential attributes. This metadata-centric approach enhances understanding of data assets, supports data governance initiatives, and provides users with the necessary context for effective data utilization.
Data catalog solutions also facilitate semantic enrichment, data versioning, data security protocols, data access control, and data model design. Semantic enrichment adds meaning and context to data, making it easier to understand and use. Data versioning ensures that different versions of datasets are managed effectively, while data access control restricts access to sensitive data. Data model design helps create an accurate representation of data structures and relationships. Moreover, data catalog solutions offer data discovery tools, data lineage tracking, data governance policies, schema management, data lake management, ETL process optimization, and data quality monitoring. Data discovery tools help users locate relevant data quickly and efficiently.
Data lineage tracking enables users to trace the origin and movement of data throughout its lifecycle. Data governance policies ensure compliance with regulatory requirements and organizational standards. Schema management maintains the structure and consistency of data, while data lake management simplifies the management of large volumes of data. ETL process optimization improves the efficiency of data integration, and data quality monitoring ensures that data is accurate and reliable. Businesses across various sectors, including healthcare, finance, retail, and manufacturing, are increasingly adopting data catalog solutions to streamline their data management and analytics processes. According to recent studies, the adoption of data catalog solutions has grown by approximately 25%, with an estimated 30% of organizations planning to implement t
== Quick facts ==
- The most up-to-date and comprehensive podcast database available
- All languages & all countries
- Includes over 3,500,000 podcasts
- Features 35+ data fields, such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more
- Delivered in SQLite format

Learn how we build a high quality podcast database: https://www.listennotes.help/article/105-high-quality-podcast-database-from-listen-notes
== Use Cases ==
- AI training, including speech recognition, generative AI, voice cloning / synthesis, and news analysis
- Alternative data for investment research, such as sentiment analysis of executive interviews, market research and tracking investment themes
- PR and marketing, including social monitoring, content research, outreach, and guest booking
...
== Data Attributes ==
See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only
How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.
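A hedged sketch of pulling episode audio URLs from one of the included RSS feeds with the feedparser library (the feed URL below is a placeholder; in practice you would read the RSS URLs from the SQLite database):

```
import feedparser

# Placeholder feed URL; substitute an RSS URL taken from the dataset.
feed = feedparser.parse("https://example.com/podcast/rss")

for entry in feed.entries[:10]:
    # Podcast episodes expose their audio files as RSS enclosures.
    for enclosure in entry.get("enclosures", []):
        print(entry.get("title"), "->", enclosure.get("href"))
```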
== Custom Offers ==
We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.
We also provide a RESTful API at PodcastAPI.com
Contact us: hello@listennotes.com
== Need Help? ==
If you have any questions about our products, feel free to reach out to hello@listennotes.com
== About Listen Notes, Inc. ==
Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.
This raster dataset of Core Mapper Moving Window Averages is an intermediary modeling product that was produced by the Core Mapper tool (Shirk and McRae 2013) in the process of developing habitat cores for use in our coastal marten connectivity model. It is derived from another dataset (HabitatSurface) and was produced using the Core Mapper parameters defined in the Lineage section of the accompanying geospatial metadata record. More specifically, it is a calculated dataset in which a 977 m moving window was used on the habitat surface to calculate the average habitat value within a 977 m radius around each pixel (this moving window size was derived from the estimated average size of a female marten's home range of 300 hectares). Of note, the set of habitat cores that came from this Core Mapper tool received additional modifications; see the report or the metadata record for PrimaryModel_HabitatCores for details. Refer to the HabitatSurface and PrimaryModel_HabitatCores metadata records for additional context.

We derived the habitat cores using a tool within Gnarly Landscape Utilities called Core Mapper (Shirk and McRae 2015). To develop a Habitat Surface for input into Core Mapper, we started by assigning each 30 m pixel on the modeled landscape a habitat value equal to its GNN OGSI value (range = 0-100). In areas with serpentine soils that support habitat potentially suitable for coastal marten, we assigned a minimum habitat value of 31, which is equivalent to the 33rd percentile of OGSI 80 pixels in the marten's historical range (for general details on our incorporation of serpentine soils, see the report section titled "Data Layers - Serpentine Soils"; for specific details on the development of this serpentine dataset, see the metadata record for the ResistancePostProcessing_Serpentine data layer, which was used to make these modifications to the habitat surface). Pixels with an OGSI value >31.0 retained their normal habitat value. Our intention was to allow the modified serpentine pixels to be more easily incorporated into habitat cores if there were higher-value OGSI pixels in the vicinity, but not to have them form the entire basis of a core. As a parameter of the Core Mapper tool, we also excluded pixels with a habitat value <1.0 from inclusion in habitat cores. We then used Core Mapper to define a moving window and calculate the average habitat value within a 977 m radius around each pixel (derived from the estimated average size of a female marten's home range of 300 ha). Pixels with an average habitat value ≥36.0 were then incorporated into habitat cores.

This is an abbreviated and incomplete description of the dataset. Please refer to the spatial metadata for a more thorough description of the methods used to produce this dataset and a discussion of any assumptions or caveats that should be taken into consideration. Additional data for this project (including the Habitat Surface referenced above and the Habitat Cores used in our connectivity model) can be found at: https://www.fws.gov/arcata/shc/marten
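As an illustration of the moving-window step described above (a sketch of a circular focal mean, not the Core Mapper tool itself), assuming a 30 m grid and a synthetic habitat array:

```
import numpy as np
from scipy import ndimage

# A 977 m radius on a 30 m grid is roughly 33 pixels.
radius_px = int(round(977 / 30))
y, x = np.ogrid[-radius_px:radius_px + 1, -radius_px:radius_px + 1]
disk = (x**2 + y**2) <= radius_px**2
kernel = disk / disk.sum()

# Synthetic stand-in for the HabitatSurface raster (values 0-100).
habitat = np.random.uniform(0, 100, size=(500, 500))
window_avg = ndimage.convolve(habitat, kernel, mode="nearest")

# Pixels whose neighborhood average meets the 36.0 threshold used for habitat cores.
core_candidates = window_avg >= 36.0
print(core_candidates.sum(), "pixels meet the moving-window threshold")
```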
https://creativecommons.org/publicdomain/zero/1.0/
For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.
In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.
Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.
The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!
ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.
The release of this dataset was featured further in a Kaggle blog post here.
This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing:
- id: ArXiv ID (can be used to access the paper, see below)
- submitter: Who submitted the paper
- authors: Authors of the paper
- title: Title of the paper
- comments: Additional info, such as number of pages and figures
- journal-ref: Information about the journal the paper was published in
- doi: https://www.doi.org
- abstract: The abstract of the paper
- categories: Categories / tags in the ArXiv system
- versions: A version history
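The metadata file is typically distributed as newline-delimited JSON (one record per line); assuming that layout and a placeholder file name, a minimal Python sketch for reading the fields listed above:

```
import json

# Placeholder file name; the Kaggle download provides one JSON record per line.
with open("arxiv-metadata.json") as f:
    for i, line in enumerate(f):
        paper = json.loads(line)
        print(paper["id"], paper["title"].strip(), paper["categories"])
        if i >= 4:  # peek at the first few records only
            break
```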
You can access each paper directly on ArXiv using these links:
- https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links
- https://arxiv.org/pdf/{id}: Direct link to download the PDF
The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).
You can use, for example, gsutil to download the data to your local machine:

```
# Copy from the top-level prefix (add options and a destination as needed):
gsutil cp gs://arxiv-dataset/arxiv/
# Copy the PDFs for a single month (here 2003) into a local directory:
gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/
# Copy everything recursively:
gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/
```
We're automatically updating the metadata as well as the GCS bucket on a weekly basis.
Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.
The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.
We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.
Unfortunately, the README for datopian/ckanext-birmingham is missing, so developing a truly comprehensive and detailed description is challenging. However, based on the repository name, we can infer a few potential functions and create a plausible description, acknowledging its speculative nature. This CKAN extension likely provides customizations and features tailored for a CKAN instance used by or related to the City of Birmingham (likely UK), or an organization based there, to better serve community data needs. It potentially allows the city to efficiently manage and publish local datasets relevant to citizens and local stakeholders.

Potential Key Features (Speculative):
- Theming and Branding: A custom theme specifically designed to match Birmingham's (UK) branding guidelines, providing a consistent user experience and reflecting local identity. This could include specific color palettes, logos, and typography.
- Specific Metadata Schema: Could implement a custom metadata schema tailored to local datasets, such as council data, transportation information, or environmental data, ensuring that data is described using relevant and standardized local information.
- Integration with Local Services: Potential integration with local services or APIs such as transportation data APIs, planning application portals, or environmental monitoring systems, enabling display and linking of relevant external data through the CKAN interface. This may use the ILinkedPatterns plugin.
- Custom Dataset Views: Additional views implemented that are optimised for local data types.
- Location-Based Search and Filtering: Enhanced search and filtering capabilities based on location data specific to Birmingham's geographic regions, enabling users to easily find datasets relevant to specific neighborhoods or areas.
- Data Visualization Tools: Customized data visualization tools to aid citizens of Birmingham in extracting insights from local data, such as custom maps reflecting local demographics, transport routes and more.
- Custom Data Importers/Exporters: Custom data importers and exporters configured to support local government formats.

Possible Integration with CKAN: Assuming typical CKAN extension behavior, ckanext-birmingham likely integrates with CKAN through plugins and resource templates, adding custom components to the platform's user interface and backend. Customizations might require alterations to CKAN's configuration files to enable plugins, define metadata schemas, and set up API integrations. It could potentially also affect how the CKAN search index is populated, depending on the needs of the specific use case it is supposed to serve.

Potential Benefits (Speculative): If the extension functions as hypothesized, it can provide the City of Birmingham (UK) or associated organizations with a CKAN instance that is specifically tailored to their needs, enhancing data discoverability and accessibility. This could improve decision-making, increase transparency, promote citizen engagement, or reduce costs through greater operational efficiency. The enhanced accessibility of localized data would empower citizens and stakeholders, encouraging local action and a more data-driven community.
OSU_SnowCourse Summary: Manual snow course observations were collected over WY 2012-2014 from four paired forest-open sites chosen to span a broad elevation range. Study sites were located in the upper McKenzie (McK) River watershed, approximately 100 km east of Corvallis, Oregon, on the western slope of the Cascade Range, and in the Middle Fork Willamette (MFW) watershed, located to the south of the McKenzie. The sites were designated based on elevation, with a range of 1110-1480 m. Distributed snow depth and snow water equivalent (SWE) observations were collected via manual snow courses, monthly from 1 November through 1 April and bi-weekly thereafter. Snow courses spanned 500 m of forested terrain and 500 m of adjacent open terrain. Snow depth observations were collected approximately every 10 m, and SWE was measured every 100 m along the snow courses with a federal snow sampler. These data are raw observations and have not been quality controlled in any way. Distance along the transect was estimated in the field.

OSU_SnowDepth Summary: 10-minute snow depth observations collected at OSU met stations in the upper McKenzie River Watershed and the Middle Fork Willamette Watershed during Water Years 2012-2014. Each meteorological tower was deployed to represent either a forested or an open area at a particular site, and the locations were generally paired, with a meteorological station deployed in the forest and one in the open area at a single site. These data were collected in conjunction with manual snow course observations, and the meteorological stations were located in the approximate center of each forest or open snow course transect. These data have undergone basic quality control. See manufacturer specifications for individual instruments to determine sensor accuracy. This file was compiled from individual raw data files (named "RawData.txt" within each site and year directory) provided by OSU, along with metadata of site attributes. We converted the Excel-based timestamp (seconds since origin) to a date, changed the NaN flags for missing data to NA, and added site attributes such as site name and cover. First, because snow depth in the raw data is recorded with a negative sign (flipped, with a correction so that the height of the sensor is zero), positive raw values correspond to physically impossible negative depths and were replaced with NA. Second, the sign of the data was switched to make the depths positive. Third, the data were roughly smoothed with the MATLAB smooth.m function using a moving window of 50 points, and all values more than 10 above the smoothed values were replaced with NA; in some cases, further single-point outliers were removed. (A code sketch of these steps follows the OSU_Location summary below.)

OSU_Met Summary: Raw, 10-minute meteorological observations collected at OSU met stations in the upper McKenzie River Watershed and the Middle Fork Willamette Watershed during Water Years 2012-2014. Each meteorological tower was deployed to represent either a forested or an open area at a particular site, and the locations were generally paired, with a meteorological station deployed in the forest and one in the open area at a single site. These data were collected in conjunction with manual snow course observations, and the meteorological stations were located in the approximate center of each forest or open snow course transect. These stations collected numerous meteorological variables, of which snow depth and wind speed are included here. These data are raw datalogger output and have not been quality controlled in any way. See manufacturer specifications for individual instruments to determine sensor accuracy. This file was compiled from individual raw data files (named "RawData.txt" within each site and year directory) provided by OSU, along with metadata of site attributes. We converted the Excel-based timestamp (seconds since origin) to a date, changed the NaN and 7999 flags for missing data to NA, and added site attributes such as site name and cover.

OSU_Location Summary: Location metadata for the manual snow course observations and meteorological sensors. These data are compiled from GPS data for which the horizontal accuracy is unknown, and from processed hemispherical photographs. They have not been quality controlled in any way.
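For readers who want to reproduce the basic snow-depth quality control described in the OSU_SnowDepth summary above, here is a minimal sketch in Python. The function name, the use of pandas, and the centered rolling mean standing in for MATLAB's smooth.m are assumptions for illustration; this is not the original processing code.

    import numpy as np
    import pandas as pd

    def qc_snow_depth(raw_depth, window=50, outlier_margin=10.0):
        """Basic QC for a raw 10-minute snow-depth series, following the steps above.

        raw_depth: pandas Series of raw sensor output, where snow depth is
        recorded with a negative sign (the height of the sensor is zero).
        outlier_margin is in the same depth units as the data.
        """
        depth = raw_depth.astype(float).copy()
        # Positive raw values imply impossible negative depths; flag them as missing.
        depth[depth > 0] = np.nan
        # Flip the sign so snow depths are positive.
        depth = -depth
        # Rough smoothing with a 50-point centered moving average
        # (a stand-in for MATLAB's smooth.m).
        smoothed = depth.rolling(window, center=True, min_periods=1).mean()
        # Remove outliers: values more than 10 above the smoothed series become NA.
        depth[depth > smoothed + outlier_margin] = np.nan
        return depth

    # Timestamps: the raw files store an Excel-based timestamp in seconds since the
    # origin; assuming the Excel 1899-12-30 origin, it could be converted with:
    # times = pd.to_datetime(raw_seconds, unit="s", origin="1899-12-30")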
This dataset serves as the habitat surface that was used to derive coastal marten habitat cores for use in our connectivity model. Note that the set of habitat cores derived from this habitat surface received additional modifications; see the report or the metadata record for PrimaryModel_HabitatCores for details. The Old-growth Structure Index (OGSI) is the primary estimator of habitat quality and cost-weighted distance in the connectivity model. OGSI is a parameter derived by the Gradient Nearest Neighbor (GNN) model produced by the Landscape Ecology, Modeling, Mapping & Analysis laboratory in Corvallis, OR (LEMMA 2014a). The GNN model provides fine-scale, spatially explicit data on forest structure across a vast area of California, Oregon, and Washington, and is one of the very few datasets that provides such habitat information consistently across the CA/OR state border. GNN summarizes detailed data from thousands of forest survey points and then uses a multi-step process to interpolate them to unsurveyed areas of the landscape based on several explanatory datasets such as Landsat remote sensing imagery, elevation, climate, and geology (more information about this process can be found at https://lemma.forestry.oregonstate.edu/methods/methods; see also Ohmann & Gregory 2002). Like Landsat imagery, GNN has a spatial grain of 30 m x 30 m (900 m²). OGSI characterizes the suitability of forest habitat conditions for old-growth obligate species and processes. It is scaled to specific regions and ecotypes and is derived from a conceptual model that incorporates (1) the density of large trees, (2) the density of large snags, (3) the size class diversity of live trees, and (4) the amount of down woody material (Davis et al. 2015). These components align well with critically important features of forests inhabited by Pacific martens generally and Humboldt martens specifically, and a range of literature describes marten use of habitat types consistent with the presence of these features. We derived the habitat cores using a tool within Gnarly Landscape Utilities called Core Mapper (Shirk and McRae 2015). To develop a habitat surface for input into Core Mapper, we started by assigning each 30 m pixel on the modeled landscape a habitat value equal to its GNN OGSI value (range = 0-100). In areas with serpentine soils that support habitat potentially suitable for coastal marten, we assigned a minimum habitat value of 31, which is equivalent to the 33rd percentile of OGSI 80 pixels in the marten's historical range (for general details on our incorporation of serpentine soils, see the report section titled "Data Layers - Serpentine Soils"; for specific details on the development of this serpentine dataset, see the metadata record for the ResistancePostProcessing_Serpentine data layer, which was used to make these modifications to the habitat surface). Pixels with an OGSI value >31.0 retained their unmodified habitat value. Our intention was to allow the modified serpentine pixels to be more easily incorporated into habitat cores where higher-value OGSI pixels occur nearby, but not to have them form the entire basis of a core. As a parameter of the Core Mapper tool, we also excluded pixels with a habitat value <1.0 from inclusion in habitat cores. We then used Core Mapper to define a moving window and calculate the average habitat value within a 977 m radius around each pixel (a radius derived from the estimated average size of a female marten's home range, 300 ha).
Pixels with an average habitat value ≥36.0 were then incorporated into habitat cores. Additional data for this project (including the Habitat Cores referenced above and the Moving Window Averages used to derive the Habitat Cores) can be found at: https://www.fws.gov/arcata/shc/marten
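As an illustration only, the moving-window step described above could be approximated as in the following Python sketch. This is not the Gnarly Landscape Utilities Core Mapper tool: the array names, the use of scipy's generic_filter, and the omission of Core Mapper's additional core-building parameters are all simplifications.

    import numpy as np
    from scipy import ndimage

    CELL_SIZE = 30.0        # m, GNN/Landsat grid resolution
    RADIUS_M = 977.0        # m, moving-window radius (~300 ha female home range)
    SERPENTINE_FLOOR = 31.0 # minimum habitat value assigned to serpentine pixels
    MIN_HABITAT = 1.0       # pixels below this are excluded from cores
    CORE_THRESHOLD = 36.0   # mean habitat value required for core membership

    def habitat_core_mask(ogsi, serpentine_mask):
        """Approximate the habitat-surface and moving-window steps described above.

        ogsi: 2D array of OGSI values (0-100).
        serpentine_mask: boolean array of serpentine pixels that support
        potentially suitable habitat.
        """
        habitat = ogsi.astype(float)
        # Serpentine pixels receive a floor value of 31; higher OGSI values are kept.
        habitat = np.where(serpentine_mask & (habitat < SERPENTINE_FLOOR),
                           SERPENTINE_FLOOR, habitat)
        # Circular footprint with a 977 m radius on the 30 m grid.
        r_px = int(RADIUS_M // CELL_SIZE)
        yy, xx = np.mgrid[-r_px:r_px + 1, -r_px:r_px + 1]
        footprint = (xx ** 2 + yy ** 2) * CELL_SIZE ** 2 <= RADIUS_M ** 2
        # Average habitat value within the window around each pixel (slow but simple).
        window_mean = ndimage.generic_filter(habitat, np.nanmean,
                                             footprint=footprint,
                                             mode='constant', cval=np.nan)
        # Candidate core pixels: window mean >= 36, excluding pixels with value < 1.
        return (window_mean >= CORE_THRESHOLD) & (habitat >= MIN_HABITAT)

Core Mapper applies additional parameters beyond this simple thresholding that are not reproduced in the sketch.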
How do you manage, track, and share hydrologic data and models within your research group? Do you find it difficult to keep track of who has access to which data and who has the most recent version of a dataset or research product? Do you sometimes find it difficult to share data and models and collaborate with colleagues outside your home institution? Would it be easier if you had a simple way to share and collaborate around hydrologic datasets and models? HydroShare is a new, web-based system for sharing hydrologic data and models with specific functionality aimed at making collaboration easier. Within HydroShare, we have developed new functionality for creating datasets, describing them with metadata, and sharing them with collaborators. In HydroShare we cast hydrologic datasets and models as “social objects” that can be published, collaborated around, annotated, discovered, and accessed. In this presentation, we will discuss and demonstrate the collaborative and social features of HydroShare and how it can enable new, collaborative workflows for you, your research group, and your collaborators across institutions. HydroShare’s access control and sharing functionality enable both public and private sharing with individual users and collaborative user groups, giving you flexibility over who can access data and at what point in the research process. HydroShare can make it easier for collaborators to iterate on shared datasets and models, creating multiple versions along the way, and publishing them with a permanent landing page, metadata description, and citable Digital Object Identifier (DOI). Functionality for creating and sharing resources within collaborative groups can also make it easier to overcome barriers such as institutional firewalls that can make collaboration around large datasets difficult. Functionality for commenting on and rating resources supports community collaboration and quality evaluation of resources in HydroShare.
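For groups that prefer to script these workflows, HydroShare also exposes a REST API with Python client libraries. The following is a minimal sketch assuming the hs_restclient package; the resource type, title, file name, and credentials are placeholders, and method availability should be checked against the client documentation for your version.

    from hs_restclient import HydroShare, HydroShareAuthBasic

    # Authenticate with a HydroShare account (credentials are placeholders).
    auth = HydroShareAuthBasic(username='your_username', password='your_password')
    hs = HydroShare(auth=auth)

    # Create a resource with a title, abstract, keywords, and an initial data file.
    resource_id = hs.createResource(
        'GenericResource',                 # resource type (assumed)
        'Example watershed dataset',       # placeholder title
        resource_file='observations.csv',  # placeholder file
        abstract='Example abstract describing the dataset.',
        keywords=('hydrology', 'example'),
    )

    # Retrieve the resource's system metadata (owners, dates, sharing status).
    print(hs.getSystemMetadata(resource_id))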
This presentation was delivered as part of a Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) Cyberseminar in June 2016. Cyberseminars are recorded, and archived recordings are available via the CUAHSI website at http://www.cuahsi.org.