27 datasets found
  1. Pairwise sentence complexity comparison

    • kaggle.com
    zip
    Updated Jun 8, 2021
    Cite
    Douglas K.G. Araujo (2021). Pairwise sentence complexity comparison [Dataset]. https://www.kaggle.com/douglaskgaraujo/pairwise-sentence-complexity-comparison
    Explore at:
zip(1148361537 bytes)
Available download formats
    Dataset updated
    Jun 8, 2021
    Authors
    Douglas K.G. Araujo
    Description

    Dataset creation

    The dataset was created by this notebook: https://www.kaggle.com/douglaskgaraujo/sentence-complexity-comparison-dataset

    Context

This data is a pairwise comparison of sentences, together with information about their relative complexity. The original dataset is from the CommonLit Readability Prize competition, and interested readers are referred there (especially the competition's discussion forums) for more information on the data itself.

    Important notice! As per that competition's rules, the license is as follows:

    1. COMPETITION DATA. "Competition Data" means the data or datasets available from the Competition Website for the purpose of use in the Competition, including any prototype or executable code provided on the Competition Website. The Competition Data will contain private and public test sets. Which data belongs to which set will not be made available to participants.

A. Data Access and Use. Competition Use and Non-Commercial & Academic Research: You may access and use the Competition Data for non-commercial purposes only, including for participating in the Competition and on Kaggle.com forums, and for academic research and education. The Competition Sponsor reserves the right to disqualify any participant who uses the Competition Data other than as permitted by the Competition Website and these Rules.

    B. Data Security. You agree to use reasonable and suitable measures to prevent persons who have not formally agreed to these Rules from gaining access to the Competition Data. You agree not to transmit, duplicate, publish, redistribute or otherwise provide or make available the Competition Data to any party not participating in the Competition. You agree to notify Kaggle immediately upon learning of any possible unauthorized transmission of or unauthorized access to the Competition Data and agree to work with Kaggle to rectify any unauthorized transmission or access.

    C. External Data. You may use data other than the Competition Data (“External Data”) to develop and test your Submissions. However, you will ensure the External Data is publicly available and equally accessible to use by all participants of the Competition for purposes of the competition at no cost to the other participants. The ability to use External Data under this Section 7.C (External Data) does not limit your other obligations under these Competition Rules, including but not limited to Section 11 (Winners Obligations).

    Content

This dataset is a pairwise comparison of each sentence in the CommonLit competition with 500 other randomly matched sentences. Sentences are divided into training and validation datasets before being matched randomly. The relative complexity of each sentence is measured, and features are derived from it, such as the distance between the two sentences' scores and a column indicating whether or not the first sentence's readability score is greater than or equal to that of the second sentence.
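For readers who want to reproduce a comparable pairing, the following is a minimal sketch, under stated assumptions, of the construction described above: each sentence is paired with 500 randomly chosen partners, and each pair records the score distance and a greater-or-equal flag. The column names ("excerpt", "target") and the output schema are illustrative, not the dataset's exact layout.

```python
# Hedged sketch of the pairing scheme described above.
import numpy as np
import pandas as pd

def make_pairs(df: pd.DataFrame, n_partners: int = 500, seed: int = 0) -> pd.DataFrame:
    """df is assumed to have an 'excerpt' (text) and 'target' (readability score) column."""
    rng = np.random.default_rng(seed)
    records = []
    for i in df.index:
        # Sample partner sentences for sentence i (requires len(df) >= n_partners;
        # self-pairing is not excluded, for simplicity).
        partners = rng.choice(df.index.to_numpy(), size=n_partners, replace=False)
        for j in partners:
            s1, s2 = df.loc[i], df.loc[j]
            records.append({
                "sentence_1": s1["excerpt"],
                "sentence_2": s2["excerpt"],
                "score_distance": s1["target"] - s2["target"],
                "first_score_ge_second": int(s1["target"] >= s2["target"]),
            })
    return pd.DataFrame(records)

# Usage sketch: train = pd.read_csv("train.csv"); pairs = make_pairs(train)
```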

    Acknowledgements

Thank you to the organisers of this competition for providing this dataset.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  2. ARTS Datasets - ARTS94, ARTS300, ARTS3000, ARTS160

    • zenodo.org
    csv
    Updated Sep 27, 2024
    Cite
Björn Engelmann; Christin Katharina Kreutz; Fabian Haak; Philipp Schaer (2024). ARTS Datasets - ARTS94, ARTS300, ARTS3000, ARTS160 [Dataset]. http://doi.org/10.5281/zenodo.13847807
    Explore at:
csv
Available download formats
    Dataset updated
    Sep 27, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Björn Engelmann; Christin Katharina Kreutz; Fabian Haak; Philipp Schaer
    Description

Datasets for readability and text simplicity evaluation in four sizes: 94, 300, 3000, and 160 disjoint data entries. One data entry contains the following information:

    • Text_original: Text from a parallel corpus for text simplification
    • Text_formatted: Text_original where formatting issues have been resolved either manually (ARTS94) or automatically (ARTS300, ARTS3000, ARTS160)
    • Dataset: Parallel corpus for text simplification, from which the original text has been extracted
    • Label: information, if the text has been from the simplified (simp) or source (src) part of the corpus
    • ID: Unique ID
• Score: Simplicity/readability score of the formatted text, between 0 and 1; the higher the score, the more complex/less readable the text

    Licenses of the different datasets apply for the respective texts.
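A minimal loading sketch for one of these files, assuming the CSV columns match the field list above; the file name ARTS94.csv is an assumption for illustration.

```python
# Hedged sketch: load one ARTS CSV and inspect the most complex entries.
import pandas as pd

arts = pd.read_csv("ARTS94.csv")  # assumed file name
expected = {"Text_original", "Text_formatted", "Dataset", "Label", "ID", "Score"}
missing = expected - set(arts.columns)
if missing:
    raise ValueError(f"missing expected columns: {missing}")

# Per the description, a higher Score means more complex / less readable text.
print(arts.sort_values("Score", ascending=False)[["ID", "Label", "Score"]].head())
```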

  3. Niagara Open Data

    • catalog.civicdataecosystem.org
    Cite
    Niagara Open Data [Dataset]. https://catalog.civicdataecosystem.org/dataset/niagara-open-data
    Explore at:
    Description

    The Ontario government, generates and maintains thousands of datasets. Since 2012, we have shared data with Ontarians via a data catalogue. Open data is data that is shared with the public. Click here to learn more about open data and why Ontario releases it. Ontario’s Open Data Directive states that all data must be open, unless there is good reason for it to remain confidential. Ontario’s Chief Digital and Data Officer also has the authority to make certain datasets available publicly. Datasets listed in the catalogue that are not open will have one of the following labels: If you want to use data you find in the catalogue, that data must have a licence – a set of rules that describes how you can use it. A licence: Most of the data available in the catalogue is released under Ontario’s Open Government Licence. However, each dataset may be shared with the public under other kinds of licences or no licence at all. If a dataset doesn’t have a licence, you don’t have the right to use the data. If you have questions about how you can use a specific dataset, please contact us. The Ontario Data Catalogue endeavors to publish open data in a machine readable format. For machine readable datasets, you can simply retrieve the file you need using the file URL. The Ontario Data Catalogue is built on CKAN, which means the catalogue has the following features you can use when building applications. APIs (Application programming interfaces) let software applications communicate directly with each other. If you are using the catalogue in a software application, you might want to extract data from the catalogue through the catalogue API. Note: All Datastore API requests to the Ontario Data Catalogue must be made server-side. The catalogue's collection of dataset metadata (and dataset files) is searchable through the CKAN API. The Ontario Data Catalogue has more than just CKAN's documented search fields. You can also search these custom fields. You can also use the CKAN API to retrieve metadata about a particular dataset and check for updated files. Read the complete documentation for CKAN's API. Some of the open data in the Ontario Data Catalogue is available through the Datastore API. You can also search and access the machine-readable open data that is available in the catalogue. How to use the API feature: Read the complete documentation for CKAN's Datastore API. The Ontario Data Catalogue contains a record for each dataset that the Government of Ontario possesses. Some of these datasets will be available to you as open data. Others will not be available to you. This is because the Government of Ontario is unable to share data that would break the law or put someone's safety at risk. You can search for a dataset with a word that might describe a dataset or topic. Use words like “taxes” or “hospital locations” to discover what datasets the catalogue contains. You can search for a dataset from 3 spots on the catalogue: the homepage, the dataset search page, or the menu bar available across the catalogue. On the dataset search page, you can also filter your search results. You can select filters on the left hand side of the page to limit your search for datasets with your favourite file format, datasets that are updated weekly, datasets released by a particular organization, or datasets that are released under a specific licence. Go to the dataset search page to see the filters that are available to make your search easier. 
You can also do a quick search by selecting one of the catalogue’s categories on the homepage. These categories can help you see the types of data we have on key topic areas. When you find the dataset you are looking for, click on it to go to the dataset record. Each dataset record will tell you whether the data is available, and, if so, tell you about the data available. An open dataset might contain several data files. These files might represent different periods of time, different sub-sets of the dataset, different regions, language translations, or other breakdowns. You can select a file and either download it or preview it. Make sure to read the licence agreement to make sure you have permission to use it the way you want. Read more about previewing data. A non-open dataset may be not available for many reasons. Read more about non-open data. Read more about restricted data. Data that is non-open may still be subject to freedom of information requests. The catalogue has tools that enable all users to visualize the data in the catalogue without leaving the catalogue – no additional software needed. Have a look at our walk-through of how to make a chart in the catalogue. Get automatic notifications when datasets are updated. You can choose to get notifications for individual datasets, an organization’s datasets or the full catalogue. You don’t have to provide and personal information – just subscribe to our feeds using any feed reader you like using the corresponding notification web addresses. Copy those addresses and paste them into your reader. Your feed reader will let you know when the catalogue has been updated. The catalogue provides open data in several file formats (e.g., spreadsheets, geospatial data, etc). Learn about each format and how you can access and use the data each file contains. A file that has a list of items and values separated by commas without formatting (e.g. colours, italics, etc.) or extra visual features. This format provides just the data that you would display in a table. XLSX (Excel) files may be converted to CSV so they can be opened in a text editor. How to access the data: Open with any spreadsheet software application (e.g., Open Office Calc, Microsoft Excel) or text editor. Note: This format is considered machine-readable, it can be easily processed and used by a computer. Files that have visual formatting (e.g. bolded headers and colour-coded rows) can be hard for machines to understand, these elements make a file more human-readable and less machine-readable. A file that provides information without formatted text or extra visual features that may not follow a pattern of separated values like a CSV. How to access the data: Open with any word processor or text editor available on your device (e.g., Microsoft Word, Notepad). A spreadsheet file that may also include charts, graphs, and formatting. How to access the data: Open with a spreadsheet software application that supports this format (e.g., Open Office Calc, Microsoft Excel). Data can be converted to a CSV for a non-proprietary format of the same data without formatted text or extra visual features. A shapefile provides geographic information that can be used to create a map or perform geospatial analysis based on location, points/lines and other data about the shape and features of the area. It includes required files (.shp, .shx, .dbt) and might include corresponding files (e.g., .prj). How to access the data: Open with a geographic information system (GIS) software program (e.g., QGIS). 
A package of files and folders. The package can contain any number of different file types. How to access the data: Open with an unzipping software application (e.g., WinZIP, 7Zip). Note: If a ZIP file contains .shp, .shx, and .dbt file types, it is an ArcGIS ZIP: a package of shapefiles which provide information to create maps or perform geospatial analysis that can be opened with ArcGIS (a geographic information system software program). A file that provides information related to a geographic area (e.g., phone number, address, average rainfall, number of owl sightings in 2011 etc.) and its geospatial location (i.e., points/lines). How to access the data: Open using a GIS software application to create a map or do geospatial analysis. It can also be opened with a text editor to view raw information. Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. A text-based format for sharing data in a machine-readable way that can store data with more unconventional structures such as complex lists. How to access the data: Open with any text editor (e.g., Notepad) or access through a browser. Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. A text-based format to store and organize data in a machine-readable way that can store data with more unconventional structures (not just data organized in tables). How to access the data: Open with any text editor (e.g., Notepad). Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. A file that provides information related to an area (e.g., phone number, address, average rainfall, number of owl sightings in 2011 etc.) and its geospatial location (i.e., points/lines). How to access the data: Open with a geospatial software application that supports the KML format (e.g., Google Earth). Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. This format contains files with data from tables used for statistical analysis and data visualization of Statistics Canada census data. How to access the data: Open with the Beyond 20/20 application. A database which links and combines data from different files or applications (including HTML, XML, Excel, etc.). The database file can be converted to a CSV/TXT to make the data machine-readable, but human-readable formatting will be lost. How to access the data: Open with Microsoft Office Access (a database management system used to develop application software). A file that keeps the original layout and
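The description above says the catalogue is built on CKAN and that its dataset metadata is searchable through the CKAN API. Below is a minimal sketch of such a metadata query, assuming the standard CKAN action-API layout and an assumed base URL (the description does not give one).

```python
# Hedged sketch of a CKAN metadata search via the package_search action.
import json
import urllib.parse
import urllib.request

BASE_URL = "https://data.ontario.ca"  # assumed CKAN instance URL

def search_datasets(query: str, rows: int = 5) -> list:
    """Search dataset metadata via CKAN's package_search action."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    url = f"{BASE_URL}/api/3/action/package_search?{params}"
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    return payload["result"]["results"]

for ds in search_datasets("hospital locations"):
    print(ds["title"], "-", ds.get("license_title"))
```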

  4. FakeNewsNet

    • dataverse.harvard.edu
    • kaggle.com
    json, text/markdown +3
    Updated Jan 16, 2020
    Cite
    Harvard Dataverse (2020). FakeNewsNet [Dataset]. http://doi.org/10.7910/DVN/UEMMHS
    Explore at:
text/x-python(2201), txt(546), json(637), text/x-python(2018), text/markdown(11574), tsv(13172624), tsv(20973070), text/x-python(4760), text/x-python(2891), text/x-python(2384), text/x-python(8673), text/x-python(1825), text/x-python(0), text/x-python(3516), json(104), tsv(8701109), tsv(3454648), text/x-python(281), text/x-python(2829)
Available download formats
    Dataset updated
    Jan 16, 2020
    Dataset provided by
    Harvard Dataverse
    Description

FakeNewsNet is a multi-dimensional data repository that currently contains two datasets with news content, social context, and spatiotemporal information. The dataset is constructed using an end-to-end system, FakeNewsTracker. The constructed FakeNewsNet repository has the potential to boost the study of various open research problems related to fake news. Because of the Twitter data sharing policy, we only share the news articles and tweet ids as part of this dataset and provide code in the accompanying repository to download complete tweet details, social engagements, and social networks. We describe and compare FakeNewsNet with other existing datasets in FakeNewsNet: A Data Repository with News Content, Social Context and Spatialtemporal Information for Studying Fake News on Social Media (https://arxiv.org/abs/1809.01286). A more readable version of the dataset is available at https://github.com/KaiDMML/FakeNewsNet
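Because only news content and tweet ids are shared, a typical first step is to load the id lists and then hydrate them with the download code in the linked GitHub repository. The sketch below is illustrative only: the file name and the tweet_ids column layout are assumptions and may not match this deposit's tsv files exactly.

```python
# Hedged sketch: load a shared table of news items and collect the tweet ids
# that still need to be hydrated with the FakeNewsNet download code.
import pandas as pd

news = pd.read_csv("politifact_fake.tsv", sep="\t")  # assumed file name
print(news.columns.tolist())

# Assumed layout: a "tweet_ids" column holding whitespace-separated id strings.
tweet_ids = (
    news["tweet_ids"].dropna().astype(str).str.split().explode().unique()
)
print(f"{len(tweet_ids)} tweet ids to hydrate via the repository's scripts")
```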

  5. The dataset of the Global Collections survey of natural history collections

    • data.niaid.nih.gov
    Updated Jul 16, 2024
    Cite
    Woodburn, Matt; Corrigan, Robert J.; Drew, Nicholas; Meyer, Cailin; Smith, Vincent S.; Vincent, Sarah (2024). The dataset of the Global Collections survey of natural history collections [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6985398
    Explore at:
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Natural History Museum, London
    Smithsonian National Museum of Natural History
    Authors
    Woodburn, Matt; Corrigan, Robert J.; Drew, Nicholas; Meyer, Cailin; Smith, Vincent S.; Vincent, Sarah
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    From 2016 to 2018, we surveyed the world’s largest natural history museum collections to begin mapping this globally distributed scientific infrastructure. The resulting dataset includes 73 institutions across the globe. It has:

    Basic institution data for the 73 contributing institutions, including estimated total collection sizes, geographic locations (to the city) and latitude/longitude, and Research Organization Registry (ROR) identifiers where available.

    Resourcing information, covering the numbers of research, collections and volunteer staff in each institution.

    Indicators of the presence and size of collections within each institution broken down into a grid of 19 collection disciplines and 16 geographic regions.

    Measures of the depth and breadth of individual researcher experience across the same disciplines and geographic regions.

    This dataset contains the data (raw and processed) collected for the survey, and specifications for the schema used to store the data. It includes:

    A diagram of the MySQL database schema.

    A SQL dump of the MySQL database schema, excluding the data.

A SQL dump of the MySQL database schema with all data. This may be imported into an instance of MySQL Server to create a complete reconstruction of the database; a minimal import sketch follows this list.

    Raw data from each database table in CSV format.

    A set of more human-readable views of the data in CSV format. These correspond to the database tables, but foreign keys are substituted for values from the linked tables to make the data easier to read and analyse.

    A text file containing the definitions of the size categories used in the collection_unit table.
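As noted in the list above, the full SQL dump can be imported into a MySQL server to reconstruct the database. A minimal sketch of that import via the mysql client, with an assumed database name and dump file name (the deposit's actual file names may differ):

```python
# Hedged sketch: recreate the database locally from the full SQL dump.
import subprocess

DB_NAME = "global_collections"             # assumed database name
DUMP_FILE = "global_collections_full.sql"  # assumed dump file name

# Create an empty database, then stream the dump into it (mysql prompts for a password).
subprocess.run(
    ["mysql", "-u", "root", "-p", "-e", f"CREATE DATABASE IF NOT EXISTS {DB_NAME}"],
    check=True,
)
with open(DUMP_FILE, "rb") as dump:
    subprocess.run(["mysql", "-u", "root", "-p", DB_NAME], stdin=dump, check=True)
```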

    The global collections data may also be accessed at https://rebrand.ly/global-collections. This is a preliminary dashboard, constructed and published using Microsoft Power BI, that enables the exploration of the data through a set of visualisations and filters. The dashboard consists of three pages:

    Institutional profile: Enables the selection of a specific institution and provides summary information on the institution and its location, staffing, total collection size, collection breakdown and researcher expertise.

    Overall heatmap: Supports an interactive exploration of the global picture, including a heatmap of collection distribution across the discipline and geographic categories, and visualisations that demonstrate the relative breadth of collections across institutions and correlations between collection size and breadth. Various filters allow the focus to be refined to specific regions and collection sizes.

    Browse: Provides some alternative methods of filtering and visualising the global dataset to look at patterns in the distribution and size of different types of collections across the global view.

  6. Ontario Data Catalogue (Ontario Data Catalogue)

    • catalog.civicdataecosystem.org
    Updated Nov 24, 2025
    Cite
    (2025). Ontario Data Catalogue (Ontario Data Catalogue) [Dataset]. https://catalog.civicdataecosystem.org/dataset/ontario-data-catalogue-ontario-data-catalogue
    Explore at:
    Dataset updated
    Nov 24, 2025
    Area covered
    Ontario
    Description

    AI Generated Summary: The Ontario Data Catalogue is a data portal providing access to open datasets generated and maintained by the Ontario government. It allows users to search, access, visualize, and download data in various machine-readable formats, often through APIs, while also indicating licensing terms and data update frequencies. The catalogue also provides tools for data visualization and notifications for dataset updates. About: The Ontario government generates and maintains thousands of datasets. Since 2012, we have shared data with Ontarians via a data catalogue. Open data is data that is shared with the public. Click here to learn more about open data and why Ontario releases it. Ontario’s Digital and Data Directive states that all data must be open, unless there is good reason for it to remain confidential. Ontario’s Chief Digital and Data Officer also has the authority to make certain datasets available publicly. Datasets listed in the catalogue that are not open will have one of the following labels: If you want to use data you find in the catalogue, that data must have a licence – a set of rules that describes how you can use it. A licence: Most of the data available in the catalogue is released under Ontario’s Open Government Licence. However, each dataset may be shared with the public under other kinds of licences or no licence at all. If a dataset doesn’t have a licence, you don’t have the right to use the data. If you have questions about how you can use a specific dataset, please contact us. The Ontario Data Catalogue endeavors to publish open data in a machine readable format. For machine readable datasets, you can simply retrieve the file you need using the file URL. The Ontario Data Catalogue is built on CKAN, which means the catalogue has the following features you can use when building applications. APIs (Application programming interfaces) let software applications communicate directly with each other. If you are using the catalogue in a software application, you might want to extract data from the catalogue through the catalogue API. Note: All Datastore API requests to the Ontario Data Catalogue must be made server-side. The catalogue's collection of dataset metadata (and dataset files) is searchable through the CKAN API. The Ontario Data Catalogue has more than just CKAN's documented search fields. You can also search these custom fields. You can also use the CKAN API to retrieve metadata about a particular dataset and check for updated files. Read the complete documentation for CKAN's API. Some of the open data in the Ontario Data Catalogue is available through the Datastore API. You can also search and access the machine-readable open data that is available in the catalogue. How to use the API feature: Read the complete documentation for CKAN's Datastore API. The Ontario Data Catalogue contains a record for each dataset that the Government of Ontario possesses. Some of these datasets will be available to you as open data. Others will not be available to you. This is because the Government of Ontario is unable to share data that would break the law or put someone's safety at risk. You can search for a dataset with a word that might describe a dataset or topic. Use words like “taxes” or “hospital locations” to discover what datasets the catalogue contains. You can search for a dataset from 3 spots on the catalogue: the homepage, the dataset search page, or the menu bar available across the catalogue. 
On the dataset search page, you can also filter your search results. You can select filters on the left hand side of the page to limit your search for datasets with your favourite file format, datasets that are updated weekly, datasets released by a particular ministry, or datasets that are released under a specific licence. Go to the dataset search page to see the filters that are available to make your search easier. You can also do a quick search by selecting one of the catalogue’s categories on the homepage. These categories can help you see the types of data we have on key topic areas. When you find the dataset you are looking for, click on it to go to the dataset record. Each dataset record will tell you whether the data is available, and, if so, tell you about the data available. An open dataset might contain several data files. These files might represent different periods of time, different sub-sets of the dataset, different regions, language translations, or other breakdowns. You can select a file and either download it or preview it. Make sure to read the licence agreement to make sure you have permission to use it the way you want. A non-open dataset may be not available for many reasons. Read more about non-open data. Read more about restricted data. Data that is non-open may still be subject to freedom of information requests. The catalogue has tools that enable all users to visualize the data in the catalogue without leaving the catalogue – no additional software needed. Get automatic notifications when datasets are updated. You can choose to get notifications for individual datasets, an organization’s datasets or the full catalogue. You don’t have to provide and personal information – just subscribe to our feeds using any feed reader you like using the corresponding notification web addresses. Copy those addresses and paste them into your reader. Your feed reader will let you know when the catalogue has been updated. The catalogue provides open data in several file formats (e.g., spreadsheets, geospatial data, etc). Learn about each format and how you can access and use the data each file contains. A file that has a list of items and values separated by commas without formatting (e.g. colours, italics, etc.) or extra visual features. This format provides just the data that you would display in a table. XLSX (Excel) files may be converted to CSV so they can be opened in a text editor. How to access the data: Open with any spreadsheet software application (e.g., Open Office Calc, Microsoft Excel) or text editor. Note: This format is considered machine-readable, it can be easily processed and used by a computer. Files that have visual formatting (e.g. bolded headers and colour-coded rows) can be hard for machines to understand, these elements make a file more human-readable and less machine-readable. A file that provides information without formatted text or extra visual features that may not follow a pattern of separated values like a CSV. How to access the data: Open with any word processor or text editor available on your device (e.g., Microsoft Word, Notepad). A spreadsheet file that may also include charts, graphs, and formatting. How to access the data: Open with a spreadsheet software application that supports this format (e.g., Open Office Calc, Microsoft Excel). Data can be converted to a CSV for a non-proprietary format of the same data without formatted text or extra visual features. 
A shapefile provides geographic information that can be used to create a map or perform geospatial analysis based on location, points/lines and other data about the shape and features of the area. It includes required files (.shp, .shx, .dbt) and might include corresponding files (e.g., .prj). How to access the data: Open with a geographic information system (GIS) software program (e.g., QGIS). A package of files and folders. The package can contain any number of different file types. How to access the data: Open with an unzipping software application (e.g., WinZIP, 7Zip). Note: If a ZIP file contains .shp, .shx, and .dbt file types, it is an ArcGIS ZIP: a package of shapefiles which provide information to create maps or perform geospatial analysis that can be opened with ArcGIS (a geographic information system software program). A file that provides information related to a geographic area (e.g., phone number, address, average rainfall, number of owl sightings in 2011 etc.) and its geospatial location (i.e., points/lines). How to access the data: Open using a GIS software application to create a map or do geospatial analysis. It can also be opened with a text editor to view raw information. Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. A text-based format for sharing data in a machine-readable way that can store data with more unconventional structures such as complex lists. How to access the data: Open with any text editor (e.g., Notepad) or access through a browser. Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. A text-based format to store and organize data in a machine-readable way that can store data with more unconventional structures (not just data organized in tables). How to access the data: Open with any text editor (e.g., Notepad). Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. A file that provides information related to an area (e.g., phone number, address, average rainfall, number of owl sightings in 2011 etc.) and its geospatial location (i.e., points/lines). How to access the data: Open with a geospatial software application that supports the KML format (e.g., Google Earth). Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand. This format contains files with data from tables used for statistical analysis and data visualization of Statistics Canada census data. How to access the data: Open with the Beyond 20/20 application. A database which links and combines data from different files or
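The summary above also mentions the Datastore API and notes that Datastore requests must be made server-side. A minimal server-side sketch, assuming the standard CKAN datastore_search action, an assumed base URL, and a placeholder resource id:

```python
# Hedged sketch of reading records from a CKAN Datastore resource server-side.
import json
import urllib.parse
import urllib.request

BASE_URL = "https://data.ontario.ca"  # assumed CKAN instance URL
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder resource id

def datastore_records(resource_id: str, limit: int = 10) -> list:
    """Fetch up to `limit` records from a Datastore-enabled resource."""
    params = urllib.parse.urlencode({"resource_id": resource_id, "limit": limit})
    url = f"{BASE_URL}/api/3/action/datastore_search?{params}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["result"]["records"]

for record in datastore_records(RESOURCE_ID):
    print(record)
```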

  7. SNAMUTS - Route Segments (Polyline) 2016 - Dataset - AURIN

    • data.aurin.org.au
    Updated Mar 6, 2025
    + more versions
    Cite
    (2025). SNAMUTS - Route Segments (Polyline) 2016 - Dataset - AURIN [Dataset]. https://data.aurin.org.au/dataset/snamuts-snamuts-route-segments-2016-na
    Explore at:
    Dataset updated
    Mar 6, 2025
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset presents the Spatial Network Analysis for Multimodal Urban Transport Systems (SNAMUTS) route segments for the year 2016. A route segment is a public transport link between two adjacent activity nodes or other network nodes; a numbered public transport route is usually made up of a sequence of consecutive route segments. The SNAMUTS methodology has been developed as a planning and decision-making support tool. It determines accessibility performance from a user perspective, bearing in mind that different users sometimes have different needs: some may value speed more than anything else, some may require barrier-free access as their first priority, and others may be drawn primarily to services that are legible and have a high profile in the urban realm. Good accessibility is often the result of balancing and integrating these sometimes competing, sometimes complementary claims on the usability of the land use-transport system. The analysis includes a set of tasks and measurements that highlight the contribution of the public transport network and service development from a range of perspectives. These are known as the eight key SNAMUTS indicators; they include: Service Intensity

  8. Data from: U.S. Geological Survey - Gap Analysis Project Species Habitat...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). U.S. Geological Survey - Gap Analysis Project Species Habitat Maps CONUS_2001 [Dataset]. https://catalog.data.gov/dataset/u-s-geological-survey-gap-analysis-project-species-habitat-maps-conus-2001
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
    Description

    Gap Analysis Project (GAP) habitat maps are predictions of the spatial distribution of suitable environmental and land cover conditions within the United States for individual species. Mapped areas represent places where the environment is suitable for the species to occur (i.e. suitable to support one or more life history requirements for breeding, resting, or foraging), while areas not included in the map are those predicted to be unsuitable for the species. While the actual distributions of many species are likely to be habitat limited, suitable habitat will not always be occupied because of population dynamics and species interactions. Furthermore, these maps correspond to midscale characterizations of landscapes, but individual animals may deem areas to be unsuitable because of presence or absence of fine-scale features and characteristics that are not represented in our models (e.g. snags, vernal pools, shrubby undergrowth). These maps are intended to be used at a 1:100,000 or smaller map scale. These habitat maps are created by applying a deductive habitat model to remotely-sensed data layers within a species’ range. The deductive habitat models are built by compiling information on species’ habitat associations and entering it into a relational database. Information is compiled from the best available characterizations of species’ habitat, which included species accounts in books and databases, primary peer-reviewed literature. The literature references for each species are included in the "Species Habitat Model Report" and "Machine Readable Habitat Database Parameters" files attached to each habitat map item in the repository. For all species, the compiled habitat information is used by a biologist to determine which of the ecological systems and land use classes represented in the National Gap Analysis Project’s (GAP) Land Cover Map Ver. 1.0 that species is associated with. The name of the biologist who conducted the literature review and assembled the modeling parameters is shown as the "editor" type contact for each habitat map item in the repository. For many species, information on other mapped factors that define the environment that is suitable is also entered into the database. These factors included elevation (i.e. minimum, maximum), proximity to water features, proximity to wetlands, level of human development, forest ecotone width, and forest edge; and each of these factors corresponded to a data layer that is available during the map production. The individual datasets used in the modeling process with these parameters are also made available in the ScienceBase Repository (see the end of this Summary section for details). The "Machine Readable Habitat Database Parameters" JSON file attached to each species habitat map item has an "input_layers" object that contains the specific parameter names and references (via Digital Object Identifier) to the input data used with that parameter. The specific parameters for each species were output from the database used in the modeling and mapping process to the "Species Habitat Model Report" and "Machine Readable Habitat Database Parameters" files attached to each habitat map item in the repository. The maps are generated using a python script that queries the model parameters in the database; reclassifies the GAP Land Cover Ver 1.0 and ancillary data layers within the species’ range; and combines the reclassified layers to produce the final 30m resolution habitat map. 
Map output is, therefore, not only a reflection of the ecological systems that are selected in the habitat model, but also any other constraints in the model that are represented by the ancillary data layers. Modeling regions were used to stratify the conterminous U.S. into six regions (Northwest, Southwest, Great Plains, Upper Midwest, Southeast, and Northeast). These regions allowed for efficient processing of the species distribution models on smaller, ecologically homogenous extents. The 2008 start date for the models represents the shift in focus from state and regional project efforts to a national one. At that point all of the datasets needed to be standardized across the national extent and the species list derived based on the current understanding of the taxonomy. The end date for the individual models represents when the species model was considered complete, and therefore reflects the current knowledge related to that species concept and the habitat requirements for the species. Versioning, Naming Conventions and Codes: A composite version code is employed to allow the user to track the spatial extent, the date of the ground conditions, and the iteration of the data set for that extent/date. For example, CONUS_2001v1 represents the spatial extent of the conterminous US (CONUS), the ground condition year of 2001, and the first iteration (v1) for that extent/date. In many cases, a GAP species code is used in conjunction with the version code to identify specific data sets or files (i.e. Cooper’s Hawk Habitat Map named bCOHAx_CONUS_2001v1_HabMap). This collection represents the first complete compilation of terrestrial vertebrate species models for the conterminous U.S. based on 2001 ground conditions. The taxonomic concept for the species model being presented is identified through the Integrated Taxonomic Information System – Taxonomic Serial Number. To provide a link to the NatureServe species information the NatureServe Element Code is provided for each species. The identifiers included for each species habitat map item in the repository include references to a vocabulary system in ScienceBase where definitions can be found for each type of identifier. Source Datasets Uses in Species Habitat Modeling: Gap Analysis Project Species Range Maps - Species ranges were used as model delimiters in predicted distribution models. https://www.sciencebase.gov/catalog/item/5951527de4b062508e3b1e79 Hydrologic Units - Modified 12-digit hydrologic units were used as the spatial framework for species ranges. https://www.sciencebase.gov/catalog/item/56d496eee4b015c306f17a42 Modeling regions - Used to stratify the conterminous U.S. into six ecologically homogeneous regions to facilitate efficient processing. https://www.sciencebase.gov/catalog/item/58b9b8cee4b03b285c07ddef Land Cover - Species were linked to individual map units to document habitat affinity in two ways. Primary map units are those land cover types critical for nesting, rearing young, and/or optimal foraging. Secondary or auxiliary map units are those land cover types generally not critical for breeding, but are typically used in conjunction with primary map units for foraging, roosting, and/or sub-optimal nesting locations. These map units are selected only when located within a specified distance from primary map units. 
https://www.sciencebase.gov/catalog/item/5540e2d7e4b0a658d79395db Human Impact Avoidance - Buffers around urban areas and roads were used to identify areas that would be suitable for urban exploitative species and unsuitable for urban avoiding species. https://www.sciencebase.gov/catalog/item/5540e099e4b0a658d79395d6 Forest & Edge Habitats - The land cover map was used to derive datasets of forest interior and ecotones between forest and open habitats. Forest edge https://www.sciencebase.gov/catalog/item/5540e3fce4b0a658d79395fe Forest/Open Woodland/Shrubland https://www.sciencebase.gov/catalog/item/5540e48fe4b0a658d7939600 Elevation Derivatives - Slope and aspect were used to constrain some of the southwestern models where those variables are good indicators of microclimates (moist north facing slopes) and local topography (cliffs, flats). For species with a documented relationship to altitude the elevation data was used to constrain the mapped distribution. Aspect https://www.sciencebase.gov/catalog/item/5540ec40e4b0a658d7939628 Slope https://www.sciencebase.gov/catalog/item/5540ebe2e4b0a658d7939626 Elevation https://www.sciencebase.gov/catalog/item/5540e111e4b0a658d79395d9 Hydrology - https://www.sciencebase.gov/catalog/item/5540eb44e4b0a658d7939624: A number of water related data layers were used to refine the species distribution including: water type (i.e. flowing, open/standing), distance to and from water, and stream flow and underlying gradient. The source for this data was the USGS National Hydrography Dataset (NHD)(USGS 2007). Hydrographic features were divided into three types: flowing water, open/standing water, and wet vegetation. Canopy Cover - Some species are limited to open woodlands or dense forest, the National Land Cover’s Canopy Cover dataset was used to constrain the species models based on canopy density. https://www.sciencebase.gov/catalog/item/5540eca9e4b0a658d793962b
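The habitat-map generation described above boils down to reclassifying the land-cover map units a species is associated with and intersecting the result with ancillary constraint layers. The sketch below illustrates that general technique on a toy NumPy grid; it is not the project's actual script, and the map-unit codes and layer names are hypothetical.

```python
# Hedged sketch of the reclassify-and-combine step on a toy grid.
import numpy as np

def reclassify(land_cover: np.ndarray, suitable_codes: set) -> np.ndarray:
    """Return a binary suitability grid from a categorical land-cover grid."""
    return np.isin(land_cover, list(suitable_codes)).astype(np.uint8)

def combine(habitat: np.ndarray, *constraints: np.ndarray) -> np.ndarray:
    """Intersect the land-cover layer with ancillary 0/1 constraint layers
    (e.g. elevation limits, distance to water)."""
    out = habitat.copy()
    for layer in constraints:
        out &= layer
    return out

# Toy 3x3 example with hypothetical map-unit codes and an elevation constraint.
lc = np.array([[4301, 4302, 9999], [4301, 1201, 4302], [9999, 4301, 4301]])
elev_ok = np.array([[1, 1, 1], [0, 1, 1], [1, 1, 0]], dtype=np.uint8)
print(combine(reclassify(lc, {4301, 4302}), elev_ok))
```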

  9. New Aquitaine: Wind farms — Location

    • gimi9.com
    • data.europa.eu
    Cite
    New Aquitaine: Wind farms — Location [Dataset]. https://gimi9.com/dataset/eu_d541dcee-8ed7-4e8b-bc08-b26b7c6b7b9f/
    Explore at:
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Nouvelle-Aquitaine
    Description

This dataset contains the locations (surface objects) of wind farms in the Nouvelle-Aquitaine region. The Grenelle 2 Act brought onshore wind turbines into the scope of facilities classified for environmental protection (ICPE). This administrative development aims to ensure that wind energy develops safely in France while preserving the quality of life of local residents. The new regulatory framework was designed to make the process clearer for project developers and to shorten processing times, while clarifying the regulatory requirements needed to protect people and the environment. The new regulation also makes it possible to better guarantee compliance over time, and thus good control of the risks and nuisances related to this activity. The corresponding regulatory texts, a nomenclature decree, two ministerial decrees on general requirements, as well as a decree and a ministerial decree specific to financial guarantees, were published on 25, 26 and 27 August 2011 in the Official Journal. These regulations specify the administrative regimes now applicable to wind farms as well as the operating rules, set out the decommissioning obligations at the end of operation, and establish a system of financial guarantees to ensure such decommissioning in the event of a failure. From now on, the operation of a wind farm with one or more wind turbines is subject to: authorisation, where the installation includes at least one wind turbine (aerogenerator) with a height of more than 50 metres, or where the installation includes only wind turbines with a mast between 12 and 50 metres and an installed capacity exceeding 20 MW; or declaration, where the installation includes only wind turbines with a height of between 12 and 50 metres and an installed capacity of less than 20 MW.

  10. Overview of the Department of Veterans' Affairs Claiming Channels

    • data.wu.ac.at
    • researchdata.edu.au
    • +1more
    xlsx, zip
    Updated Sep 8, 2016
    Cite
    Department of Human Services (2016). Overview of the Department of Veterans' Affairs Claiming Channels [Dataset]. https://data.wu.ac.at/odso/data_gov_au/NThlYmViNGUtNjIzNS00MDdkLTk3NTYtNGE3ZDg5ZWM0NTAw
    Explore at:
zip, xlsx
Available download formats
    Dataset updated
    Sep 8, 2016
    Dataset provided by
    Department of Human Services
    License

Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    The Department of Human Services through Medicare assesses claims and makes payments to medical, hospital and allied health providers who treat eligible veterans, spouses and dependents, on behalf of the Department of Veterans' Affairs (DVA).

    The Department of Human Services, Medicare and DVA promote electronic claiming as the primary way of doing business with the government. For health professionals, electronic claiming means faster payment times, paperless lodgement of claims, faster reconciliation and more efficient confirmation of patient details. It also means lower administrative costs for the government.

Overview of the Department of Veterans' Affairs Claiming Channels dataset

This dataset provides information on the channels used by allied health, medical and hospital providers to lodge DVA claims for processing by Medicare. The dataset includes details on the volume of services processed via a particular channel and the value of the benefit paid. Further information on the dataset may be found in the metadata accompanying the dataset.

    Data is provided in the following formats:

• Excel/XLSX: The human-readable version of the dataset for the current financial year (2016-2017) will be provided in an individual Excel file and will be updated monthly. The human-readable files for the 2015-2016 financial year may be found in the zipped Excel files.

• CSV: The machine-readable version of the dataset may be found in the zipped CSV file. This contains both monthly and financial year summaries. Metadata and 'Item ranges' are contained in stand-alone CSV files within the zipped file.

    If you require statistics at a more detailed level, please contact statistics@humanservices.gov.au detailing your request. The Department of Human Services charges on a cost recovery basis for providing more detailed statistics and their provision is subject to privacy considerations.

    The Department of Veterans’ Affairs website contains statistical information regarding the veteran population that may be accessed by the public.

    Disclaimer: This data is provided by the Department of Human Services (Human Services) for general information purposes only. While Human Services has taken care to ensure the information is as correct and accurate as possible, we do not guarantee, or accept legal liability whatsoever arising from, or connected to its use. We recommend that users exercise their own skill and care with respect to the use of this data and that users carefully evaluate the accuracy, currency, completeness and relevance of the data for their needs.

  11. Paper2Fig100k dataset

    • zenodo.org
    application/gzip
    Updated Nov 8, 2022
    Cite
Juan A. Rodríguez; David Vázquez; Issam Laradji; Marco Pedersoli; Pau Rodríguez (2022). Paper2Fig100k dataset [Dataset]. http://doi.org/10.5281/zenodo.7299423
    Explore at:
application/gzip
Available download formats
    Dataset updated
    Nov 8, 2022
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Juan A. Rodríguez; David Vázquez; Issam Laradji; Marco Pedersoli; Pau Rodríguez
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Paper2Fig100k dataset

A dataset with over 100k images of figures and text captions from research papers. The figures show diagrams, methodologies, and architectures of research papers on arXiv.org. We also provide text captions for each figure, as well as OCR detections and recognitions on the figures (bounding boxes and texts).

The dataset structure consists of a directory called "figures" and two JSON files (train and test) that contain data for each figure (a loading sketch follows the field list). Each JSON object contains the following information about a figure:

• figure_id: Figure identification based on the arXiv identifier.
• captions: Text passages extracted from the paper that relate to the figure, for instance the actual caption of the figure or references to the figure in the manuscript.
    • ocr_result: Result of performing OCR text recognition over the image. We provide a list of triplets (bounding box, confidence, text) present in the image.
    • aspect: Aspect ratio of the image (H/W).
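A minimal loading sketch, assuming the train JSON file is a list of figure objects with exactly the fields above; the file name and the triplet layout of ocr_result are assumptions.

```python
# Hedged sketch: iterate over figure records in the train split.
import json

with open("paper2fig_train.json") as f:  # assumed file name
    figures = json.load(f)

fig = figures[0]
print(fig["figure_id"], "aspect ratio:", fig["aspect"])
print("number of caption texts:", len(fig["captions"]))

# Assumed layout for OCR results: a list of (bounding box, confidence, text) triplets.
for bbox, confidence, text in fig["ocr_result"][:3]:
    print(confidence, text)
```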

Take a look at the OCR-VQGAN GitHub repository, which uses the Paper2Fig100k dataset to train an image encoder for figures and diagrams that uses an OCR perceptual loss to render clear and readable text inside images.

    The dataset is explained in more detail in the paper OCR-VQGAN: Taming Text-within-Image Generation @WACV 2023

    Paper abstract

    Synthetic image generation has recently experienced significant improvements in domains such as natural image or art generation. However, the problem of figure and diagram generation remains unexplored. A challenging aspect of generating figures and diagrams is effectively rendering readable texts within the images. To alleviate this problem, we present OCR-VQGAN, an image encoder, and decoder that leverages OCR pre-trained features to optimize a text perceptual loss, encouraging the architecture to preserve high-fidelity text and diagram structure. To explore our approach, we introduce the Paper2Fig100k dataset, with over 100k images of figures and texts from research papers. The figures show architecture diagrams and methodologies of articles available at arXiv.org from fields like artificial intelligence and computer vision. Figures usually include text and discrete objects, e.g., boxes in a diagram, with lines and arrows that connect them. We demonstrate the superiority of our method by conducting several experiments on the task of figure reconstruction. Additionally, we explore the qualitative and quantitative impact of weighting different perceptual metrics in the overall loss function.

  12. Enhancing the ReaxFF DFT database

    • data.europa.eu
    unknown
    Updated Jun 23, 2023
    Cite
    Zenodo (2023). Enhancing the ReaxFF DFT database [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7959122?locale=cs
    Explore at:
unknown
Available download formats
    Dataset updated
    Jun 23, 2023
    Dataset authored and provided by
Zenodo (http://zenodo.org/)
    Description

This project contains the database used to re-parametrize the ReaxFF force field for LiF, an inorganic compound. The purpose of the database is to improve the accuracy and reliability of ReaxFF calculations for LiF. The results and method used were published in the article Enhancing ReaxFF for Lithium-ion battery simulations: An interactive reparameterization protocol. This database was built from the simulations obtained with the protocol published in the Enhancing ReaxFF repository.

Installation

To use the database and interact with it, ensure that you have the following Python requirements installed.

Minimum requirements:
• Python 3.9 or above
• Atomic Simulation Environment (ASE) library
• Jupyter Lab

Requirements for re-running or performing new simulations:
• SCM (Software for Chemistry & Materials) Amsterdam Modeling Suite
• PLAMS (Python Library for Automating Molecular Simulation) library

You can install the required Python packages using pip: pip install -r requirements.txt

Warning: Make sure to have the appropriate licenses and installations of the SCM Amsterdam Modeling Suite and any other software needed to run simulations.

Folder Structure

The project has the following folder structure:

.
├── CONTRIBUTING.md
├── CREDITS.md
├── LICENSE
├── README.md
├── requirements.txt
├── assets
├── data
│   ├── LiF.db
│   ├── LiF.json
│   └── LiF.yaml
├── notebooks
│   ├── browsing_db.ipynb
│   └── running_simulation.ipynb
└── tools
    ├── db
    ├── plams_experimental
    └── scripts

• CONTRIBUTING.md: Guidelines and instructions for contributing to the repository. It outlines the contribution process, coding conventions, and other relevant information for potential contributors.
• CREDITS.md: Acknowledges and credits the individuals or organizations that have contributed to the repository.
• LICENSE: The license information for the repository (CC BY 4.0). It specifies the terms and conditions under which the repository's contents are distributed and used.
• README.md: This file.
• requirements.txt: Lists the required Python packages and their versions (see the Installation section).
• assets: Additional assets, such as images or documentation, related to the repository.
• data: The data files used in the repository.
  • LiF.db: The SQLite database file with the DFT data used for the ReaxFF force field; specifically, data related to the inorganic compound LiF.
  • LiF.json: The database metadata in a human-readable format using JSON.
  • LiF.yaml: The database metadata in a more human-readable format using YAML.
• notebooks: Jupyter notebooks that provide demonstrations and examples of how to use and analyze the database.
  • browsing_db.ipynb: Demonstrates how to handle, select, read, and understand the data points in the LiF.db database using the ASE database Python interface. It serves as a guide for exploring and navigating the database effectively.
  • running_simulation.ipynb: An example of how to take a data point from the LiF.db database and use it to perform a new simulation. The notebook shows how to use either the PLAMS library or the AMSCalculator with the ASE Python library to run simulations based on the retrieved data and then store the result as a new data point in LiF.db. It provides step-by-step instructions and code snippets for a seamless simulation workflow.
• tools: A collection of Python modules and scripts for reading, analyzing, and re-running simulations stored in the database. These tools help the repository adhere to the Interoperability and Reusability principles of FAIR.
  • db: A Python module for handling, reading, and storing data in the database.
  • plams_experimental: A Python module with the components needed to use the AMSCalculator with PLAMS and the SCM software package via the ASE API. It facilitates running simulations and performing calculations.
  • scripts: Additional scripts for advanced usage scenarios of this repository.

Interacting with the Database

There are three ways to interact with the database: the ASE db command line, the web interface, and the ASE Python interface.

ASE db command line: Open a terminal, navigate to the directory containing the LiF.db file, and run the following command to start the ASE db terminal: ase db LiF.db. You can then use the available commands to query and manipulate the database. More information can be found in the ASE database documentation.

Web interface: Open a terminal and navigate to the directory
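To complement the browsing_db.ipynb notebook described above, here is a minimal sketch of opening LiF.db with the ASE database Python interface. The relative path follows the folder structure above; the stored property names (e.g. energy) are assumptions.

```python
# Hedged sketch of browsing the LiF.db ASE database from Python.
from ase.db import connect

db = connect("data/LiF.db")  # path per the folder structure above
print(db.count(), "entries in the database")

# Inspect a few rows: reconstruct the Atoms object and read stored properties.
for row in db.select(limit=5):
    atoms = row.toatoms()
    energy = getattr(row, "energy", None)  # present only if stored for this row
    print(row.id, atoms.get_chemical_formula(), energy)
```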

  13. Stormwater Pipes (Chatham)

    • data-sagis.opendata.arcgis.com
    • hub.arcgis.com
    Updated Jul 24, 2020
    + more versions
    Cite
    SAGIS ArcGIS Online (2020). Stormwater Pipes (Chatham) [Dataset]. https://data-sagis.opendata.arcgis.com/datasets/stormwater-pipes-chatham
    Explore at:
    Dataset updated
    Jul 24, 2020
    Dataset authored and provided by
    SAGIS ArcGIS Online
    Area covered
    Description

    This database is a conflation of the older database and the more current data collected by the Department of Engineering. Many fields have been combined and some removed to make the database more readable and easier to edit. A subtype has been put in place to render the features as either 1) Chatham County Maintained or 2) Not Chatham County Maintained. Non-County-maintained pipes are included to capture the complete drainage system connectivity and to make it easy to add or remove existing features from the MS4 should a mistake be discovered or annexations/de-annexations occur.

  14. MABe Structured Dataset

    • kaggle.com
    zip
    Updated Oct 17, 2025
    Cite
    KUSHAGRA MATHUR (2025). MABe Structured Dataset [Dataset]. https://www.kaggle.com/datasets/kushubhai/mabe-structured-dataset
    Explore at:
    zip (2886423112 bytes)
    Available download formats
    Dataset updated
    Oct 17, 2025
    Authors
    KUSHAGRA MATHUR
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset is a restructured version of the data from the MABe Challenge competition's train_tracking directory. It has been reformatted to improve human readability and to facilitate easier use with machine learning models.

    Disclaimer: I do not own the original data. All credit belongs to the MABe Challenge, and a link to the competition is provided. You are welcome to use this modified dataset in your models.

    Note: The 2022 data is not yet included but is scheduled for a future update.

    Competition link: https://www.kaggle.com/competitions/MABe-mouse-behavior-detection/overview
    Notebook link for dataset creation: https://www.kaggle.com/code/kushubhai/mabe-new-dataset-creation

    Feel free to check out my other notebook for visualizing the parquet files in the competition data:

    Notebook link for visualization: https://www.kaggle.com/code/kushubhai/mabe-visualization

  15. Data from: FAIR Science for Social Machines: Let’s Share Metadata Knowlets...

    • scidb.cn
    Updated Oct 15, 2020
    Cite
    Barend Mons (2020). FAIR Science for Social Machines: Let’s Share Metadata Knowlets in the Internet of FAIR Data and Services [Dataset]. http://doi.org/10.11922/sciencedb.j00104.00020
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 15, 2020
    Dataset provided by
    Science Data Bank
    Authors
    Barend Mons
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    11 figures of this paper.

    Figure 1: The hourglass model of the Internet architecture.
    Figure 2: The merge of three hourglasses (data-infrastructure, tools-infrastructure and compute-infrastructure) into the image of a propeller with three blades and the underlying infrastructure. The narrow waist of the hourglass (minimal essential standards and protocols) is comparable to the center of this picture.
    Figure 3: The simple Digital Object picture. The smallest conceivable Digital Object is a persistent identifier (PID), a digital symbol referring to a particular concept. Each digital object that contains “information” should be adorned with metadata asserting things about the nature of that information. Typical intrinsic metadata describe the factual information that is “indisputable” about the digital object itself. Intrinsic metadata containers, expanded metadata containers and the actual containers holding the data elements or the core (in the case of, for instance, a workflow) can also be treated as separate but permanently linked digital objects, each with their own unique, persistent and resolvable identifier (UPRI), and thus form a stack of related metadata containers holding (machine-readable, FAIR) metadata of different natures, all asserting relevant information about the data container.
    Figure 4: How, in the developing Internet of FAIR Data and Services, a linked-data-compliant query in a virtual machine format could automatically find the most relevant databases.
    Figure 5: The semiotic triangle, based on the concept of cancer.
    Figure 6: A single meaningful assertion in machine-readable format, called a nanopublication. The smallest conceivable assertion has the structure of a subject, a predicate and an object. To form a nanopublication, this “triple” needs to be published in machine-readable format with full provenance and publication information (also in machine-readable format); a minimal sketch of such a triple is given after this list.
    Figure 7: The Knowlet as a collection of cardinal assertions “about” a given subject. The objects effectively form the “conceptual context” of explicitly associated concepts. The predicates can range from very specific and explicit relationship descriptions such as “inhibits” or “is married to” to more generic and less explicit connections, such as “co-occurs in the same sentence as”.
    Figure 8: The Knowlet is a digital object and needs to be findable, accessible, interoperable and reusable (i.e., FAIR) in its own right. It may also change over time as more assertions are collected about the core concept. Therefore, each Knowlet in the Internet of FAIR Data and Services (IFDS) needs a unique, persistent and resolvable identifier (UPRI).
    Figure 9: The Knowlet can be seen as a metadata container for the concept it represents. It can represent many different things, from plain concepts like a gene or a person (ORCID record) to a data set, a database, a workflow or any other thing in the Internet of Things.
    Figure 10: Three ways in which Knowlets can be used to connect dispersed digital objects.
    Figure 11: A: Concepts, physical objects or things of different semantic types (and thus also intrinsically meaningless unique, persistent and resolvable identifiers, UPRIs) can cluster based on contextual similarity without ever being explicitly connected (drug might treat disease). B: Nearly identical concepts that should nevertheless be treated as distinct in certain circumstances will automatically cluster as one if the resolution of search or matching is lowered, and will separate out when the resolution is raised. C: Conceptual and semantic drift occur.
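    To make the subject-predicate-object idea concrete, here is a minimal sketch of a single machine-readable assertion of the kind a nanopublication is built around. All URIs are hypothetical placeholders, and a real nanopublication additionally wraps the assertion in named graphs carrying provenance and publication information; this sketch shows only the core triple.

    ```python
    # Minimal sketch: one machine-readable assertion ("triple").
    # All example.org URIs are hypothetical placeholders; a full
    # nanopublication would add provenance and publication-info graphs.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS

    EX = Namespace("http://example.org/")

    g = Graph()
    g.bind("ex", EX)

    # Subject - predicate - object: "drug might treat disease".
    g.add((EX.some_drug, EX.mightTreat, EX.some_disease))
    g.add((EX.some_drug, RDFS.label, Literal("example drug")))
    g.add((EX.some_disease, RDFS.label, Literal("example disease")))

    print(g.serialize(format="turtle"))
    ```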

  16. Checkouts By Title (Physical Items)

    • cos-data.seattle.gov
    • data.seattle.gov
    • +4more
    csv, xlsx, xml
    Updated Nov 23, 2025
    + more versions
    Cite
    City of Seattle (2025). Checkouts By Title (Physical Items) [Dataset]. https://cos-data.seattle.gov/Community-and-Culture/Checkouts-By-Title-Physical-Items-/5src-czff
    Explore at:
    xml, xlsx, csv
    Available download formats
    Dataset updated
    Nov 23, 2025
    Dataset authored and provided by
    City of Seattle
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    This dataset includes a log of all physical item checkouts from Seattle Public Library. The dataset begins with checkouts occurring in April 2005. Data from 2005 to 2016 in this dataset comes from the digital artwork "Making Visible the Invisible" by the studios of George Legrady. Learn more about this contribution. Renewals are not included. Have a question about this data? Ask us!

    Data Notes: There is a machine-readable data dictionary available to help you understand the collection and item codes. Access it here:

    https://data.seattle.gov/Community/Integrated-Library-System-ILS-Data-Dictionary/pbt3-ytbc

    Also:
    1. "CheckoutDateTime" (the timestamp field) is rounded to the nearest minute.
    2. "itemType" is a code from the catalog record that describes the type of item. Some of the more common codes are: acbk (adult book), acdvd (adult DVD), jcbk (children's book), accd (adult CD).
    3. "Collection" is a collection code from the catalog record which describes the item. Some common examples: nanf (adult non-fiction), nafic (adult fiction), ncpic (children's picture book), nycomic (young adult comic books).
    4. "Subjects" includes the subjects and subject subdivisions from the item record.
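    As a quick illustration of working with these codes, here is a minimal sketch using pandas. The file name "checkouts.csv" is a placeholder for the CSV export of this dataset, and the exact column capitalization (e.g. itemType vs. ItemType) is an assumption; check the export you actually download.

    ```python
    # Minimal sketch: tallying checkouts by item type code.
    # "checkouts.csv" is a placeholder; column names follow the description
    # above and may differ in capitalization in the real export.
    import pandas as pd

    df = pd.read_csv("checkouts.csv", parse_dates=["CheckoutDateTime"])

    # Map a few of the common item type codes to readable labels.
    item_labels = {"acbk": "adult book", "acdvd": "adult DVD",
                   "jcbk": "children's book", "accd": "adult CD"}
    df["ItemLabel"] = df["ItemType"].map(item_labels).fillna(df["ItemType"])

    print(df["ItemLabel"].value_counts().head(10))
    ```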

  17. Guide to fostering the readability of legislative texts

    • data.wu.ac.at
    • data.urbandatacentre.ca
    • +1more
    html
    Updated Aug 15, 2018
    + more versions
    Cite
    Department of Justice | Ministère de la Justice (2018). Guide to fostering the readability of legislative texts [Dataset]. https://data.wu.ac.at/schema/www_data_gc_ca/Yzg1NWUzN2ItMGJjZi00Y2VkLWI3YjItYzRjYjkxMTY1ZDQ3
    Explore at:
    html
    Available download formats
    Dataset updated
    Aug 15, 2018
    Dataset provided by
    Department of Justice | Ministère de la Justice
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Description

    The January 2012 report of the Red Tape Reduction Commission recommended "that the Department of Justice continue to develop tools to foster the intelligibility of legislative texts" to improve the clarity and predictability of regulation for business and improve understanding of regulatory requirements. This document does not intend to repeat the content of related textbooks, manuals, guides and articles. It provides instead a general approach to drafting legislative texts that are accessible to their readers; it is about viewing legislative texts through a particular lens to evaluate their readability.

    The primary focus of the guide is teaching those who write and develop legislative and regulatory texts to evaluate what can be done to make these texts more accessible, and to demonstrate the steps that should be taken. It provides examples of why language should be kept as simple as possible, and of preambles that would be considered readable.

  18. Data_Sheet_1_Better Writing in Scientific Publications Builds Reader...

    • frontiersin.figshare.com
    • figshare.com
    docx
    Updated Jun 3, 2023
    Cite
    Benjamin S. Freeling; Zoë A. Doubleday; Matthew J. Dry; Carolyn Semmler; Sean D. Connell (2023). Data_Sheet_1_Better Writing in Scientific Publications Builds Reader Confidence and Understanding.docx [Dataset]. http://doi.org/10.3389/fpsyg.2021.714321.s001
    Explore at:
    docx
    Available download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Benjamin S. Freeling; Zoë A. Doubleday; Matthew J. Dry; Carolyn Semmler; Sean D. Connell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scientific publications are the building blocks of discovery and collaboration, but their impact is limited by the style in which they are traditionally written. Recently, many authors have called for a switch to an engaging, accessible writing style. Here, we experimentally test how readers respond to such a style. We hypothesized that scientific abstracts written in a more accessible style would improve readers' reported readability and confidence as well as their understanding, assessed using multiple-choice questions on the content. We created a series of scientific abstracts, corresponding to real publications on three scientific topics, at four levels of difficulty, varying from the difficult, traditional style to an engaging, accessible style. We gave these abstracts to a team of readers consisting of 170 third-year undergraduate students. We then posed questions to measure the readers' reported readability, confidence, and understanding of the content. The scientific abstracts written in a more accessible style resulted in higher readability, understanding, and confidence. These findings demonstrate that rethinking the way we communicate our science may empower a more collaborative and diverse industry.

  19. Replication Data for: PRISM: Simple and Compact Identification and...

    • data.europa.eu
    • rdr.kuleuven.be
    Updated May 20, 2025
    + more versions
    Cite
    Spatial Applications Division Leuven, KU Leuven (2025). Replication Data for: PRISM: Simple and Compact Identification and Signatures from Large Prime Degree Isogenies [Dataset]. https://data.europa.eu/data/datasets/doi-10-48804-fzso6r?locale=en
    Explore at:
    Dataset updated
    May 20, 2025
    Dataset authored and provided by
    Spatial Applications Division Leuven, KU Leuven
    Description

    SageMath implementation of PRISM: PRime degree ISogeny Mechanism.

    We give a proof-of-concept implementation of PRISM. In some cases, cleaner and more readable code is preferred over a fully optimized implementation.

    The code, and in particular the ideal-to-isogeny translation algorithm, is based on the SQIsign2D-West SageMath implementation, which has been privately shared with us by the authors. This can be found in the folder sqisign2d_west. The code to compute (2,2)-isogenies using theta coordinates is based on ThetaIsogenies/two-isogenies. It can be found in theta_isogenies and theta_structures. The Kummer line code is based on FESTA-PKE/FESTA-SageMath. It can be found in montgomery_isogenies.

  20. Data from: arXiv Dataset

    • kaggle.com
    • huggingface.co
    • +1more
    zip
    Updated Nov 22, 2020
    + more versions
    Cite
    Cornell University (2020). arXiv Dataset [Dataset]. https://www.kaggle.com/Cornell-University/arxiv
    Explore at:
    zip (950178574 bytes)
    Available download formats
    Dataset updated
    Nov 22, 2020
    Dataset authored and provided by
    Cornell University
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About ArXiv

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

    Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

    The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

    ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

    The release of this dataset was featured further in a Kaggle blog post here.


    See here for more information.

    ArXiv On Kaggle

    Metadata

    This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. This file contains an entry for each paper, containing:
    - id: ArXiv ID (can be used to access the paper, see below)
    - submitter: Who submitted the paper
    - authors: Authors of the paper
    - title: Title of the paper
    - comments: Additional info, such as number of pages and figures
    - journal-ref: Information about the journal the paper was published in
    - doi: Digital Object Identifier (https://www.doi.org)
    - abstract: The abstract of the paper
    - categories: Categories / tags in the ArXiv system
    - versions: A version history

    You can access each paper directly on ArXiv using these links:
    - https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links
    - https://arxiv.org/pdf/{id}: Direct link to download the PDF
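    As a minimal sketch of reading the metadata in Python, the snippet below streams the file and builds the paper URLs above. The file name arxiv-metadata-oai-snapshot.json and its one-JSON-object-per-line layout are assumptions based on the Kaggle download; adjust them to the file you actually receive.

    ```python
    # Minimal sketch: stream the arXiv metadata file and build paper URLs.
    # The filename and the line-delimited JSON layout are assumptions.
    import json

    with open("arxiv-metadata-oai-snapshot.json") as f:
        for i, line in enumerate(f):
            paper = json.loads(line)
            abs_url = f"https://arxiv.org/abs/{paper['id']}"
            pdf_url = f"https://arxiv.org/pdf/{paper['id']}"
            print(paper["title"].strip(), "->", abs_url, pdf_url)
            if i >= 4:  # only show the first few entries
                break
    ```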

    Bulk access

    The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).

    You can use, for example, gsutil to download the data to your local machine:

    ```
    # List files:
    gsutil ls gs://arxiv-dataset/arxiv/

    # Download PDFs from March 2020:
    gsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/

    # Download all the source files:
    gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/
    ```

    Update Frequency

    We're automatically updating the metadata as well as the GCS bucket on a weekly basis.

    License

    Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.

    Acknowledgements

    The original data is maintained by ArXiv; huge thanks to the team for building and maintaining this dataset.

    We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.


Search
Clear search
Close search
Google apps
Main menu