100+ datasets found
  1. Overview Metadata for the Data used in te Conceptual and Numerical Model of...

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Nov 19, 2025
    Cite
    U.S. Geological Survey (2025). Overview Metadata for the Data used in te Conceptual and Numerical Model of the Colorado River (1990-2016) [Dataset]. https://catalog.data.gov/dataset/overview-metadata-for-the-data-used-in-te-conceptual-and-numerical-model-of-the-color-1990
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Colorado River
    Description

    This data release contains six different datasets that were used in the report SIR 2018-5108. These datasets contain discharge data, discrete dissolved-solids data, quality-control discrete dissolved-solids data, and computed mean dissolved-solids data collected at various locations between Hoover Dam and Imperial Dam.

    Study Sites:
    • Site 1: Colorado River below Hoover Dam
    • Site 2: Bill Williams River near Parker
    • Site 3: Colorado River below Parker Dam
    • Site 4: CRIR Main Canal
    • Site 5: Palo Verde Canal
    • Site 6: Colorado River at Palo Verde Dam
    • Site 7: CRIR Lower Main Drain
    • Site 8: CRIR Upper Levee Drain
    • Site 9: PVID Outfall Drain
    • Site 10: Colorado River above Imperial Dam

    Discrete Dissolved-solids Dataset and Replicate Samples for Discrete Dissolved-solids Dataset: The Bureau of Reclamation collected discrete water-quality samples for the parameter of dissolved solids (sum of constituents). Dissolved solids, measured in milligrams per liter, are the sum of the following constituents: bicarbonate, calcium, carbonate, chloride, fluoride, magnesium, nitrate, potassium, silicon dioxide, sodium, and sulfate. These samples were collected on a monthly to bimonthly basis at various time periods between 1990 and 2016 at Sites 1-5 and Sites 7-10. No data were collected for Site 6: Colorado River at Palo Verde Dam. The Bureau of Reclamation and the USGS collected discrete quality-control replicate samples for the parameter of dissolved solids (sum of constituents, measured in milligrams per liter). The USGS collected replicate samples in 2002 and 2003, and the Bureau of Reclamation collected replicate samples in 2016 and 2017. Listed below are the sites where these samples were collected and the agency that collected them.
    • Site 3: Colorado River below Parker Dam: USGS and Reclamation
    • Site 4: CRIR Main Canal: Reclamation
    • Site 5: Palo Verde Canal: Reclamation
    • Site 7: CRIR Lower Main Drain: Reclamation
    • Site 8: CRIR Upper Levee Drain: Reclamation
    • Site 9: PVID Outfall Drain: Reclamation
    • Site 10: Colorado River above Imperial Dam: USGS and Reclamation

    Monthly Mean Datasets and Mean Monthly Datasets: Monthly mean discharge (cfs), flow-weighted monthly mean dissolved-solids concentration (mg/L), and monthly mean dissolved-solids load from 1990 to 2016 were computed using raw data from the USGS and the Bureau of Reclamation. These data were computed for all 10 sites, except that flow-weighted monthly mean dissolved-solids concentration and monthly mean dissolved-solids load were not computed for Site 2: Bill Williams River near Parker. The monthly means calculated for each month between 1990 and 2016 were used to compute the mean monthly discharge and the mean monthly dissolved-solids load for each of the 12 months within a year: each monthly mean was weighted by the number of days in the month and then averaged for each of the twelve months. This was computed for all 10 sites, except that mean monthly dissolved-solids load was not computed at Site 2: Bill Williams River near Parker. Site 8a: Colorado River between Parker and Palo Verde Valleys was computed by summing the data from Sites 6, 7, and 8.

    Bill Williams Daily Mean Discharge, Instantaneous Dissolved-solids Concentration, and Daily Mean Dissolved-solids Load Dataset: Daily mean discharge (cfs), instantaneous dissolved-solids concentration (mg/L), and daily mean dissolved-solids load were calculated using raw data collected by the USGS and the Bureau of Reclamation. These data were calculated for Site 2: Bill Williams River near Parker for the period of January 1990 to February 2016.

    Palo Verde Irrigation District Outfall Drain Mean Daily Discharge Dataset: The Bureau of Reclamation collected mean daily discharge data for the period of 01/01/2005 to 09/30/2016 at the Palo Verde Irrigation District (PVID) outfall drain using a stage-discharge relationship.
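
    The day-weighted averaging of monthly means described above can be reproduced with a few lines of code. The sketch below is illustrative only and assumes a hypothetical input table with "date" and "discharge_cfs" columns (names not taken from the data release).

    import pandas as pd

    # Hypothetical input: one row per calendar month with its monthly mean value.
    monthly = pd.read_csv("monthly_means.csv", parse_dates=["date"])
    monthly["days"] = monthly["date"].dt.days_in_month
    monthly["month"] = monthly["date"].dt.month

    # Weight each monthly mean by the number of days in that month, then average
    # across 1990-2016 to get a mean monthly value for each of the 12 months.
    weighted = (
        monthly.assign(wx=monthly["discharge_cfs"] * monthly["days"])
        .groupby("month")[["wx", "days"]]
        .sum()
    )
    mean_monthly_discharge = weighted["wx"] / weighted["days"]
    print(mean_monthly_discharge)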

  2. Getting Real about Fake News

    • kaggle.com
    zip
    Updated Nov 25, 2016
    Cite
    Meg Risdal (2016). Getting Real about Fake News [Dataset]. https://www.kaggle.com/dsv/911
    Explore at:
    Available download formats: zip (20363882 bytes)
    Dataset updated
    Nov 25, 2016
    Authors
    Meg Risdal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The latest hot topic in the news is fake news and many are wondering what data scientists can do to detect it and stymie its viral spread. This dataset is only a first step in understanding and tackling this problem. It contains text and metadata scraped from 244 websites tagged as "bullshit" by the BS Detector Chrome Extension by Daniel Sieradski.

    Warning: I did not modify the list of news sources from the BS Detector so as not to introduce my (useless) layer of bias; I'm not an authority on fake news. There may be sources whose inclusion you disagree with. It's up to you to decide how to work with the data and how you might contribute to "improving it". The labels of "bs" and "junksci", etc. do not constitute capital "t" Truth. If there are other sources you would like to include, start a discussion. If there are sources you believe should not be included, start a discussion or write a kernel analyzing the data. Or take the data and do something else productive with it. Kaggle's choice to host this dataset is not meant to express any particular political affiliation or intent.

    Contents

    The dataset contains text and metadata from 244 websites and represents 12,999 posts in total from the past 30 days. The data was pulled using the webhose.io API; because it's coming from their crawler, not all websites identified by the BS Detector are present in this dataset. Each website was labeled according to the BS Detector as documented here. Data sources that were missing a label were simply assigned a label of "bs". There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read.

    Fake news in the news

    For inspiration, I've included some (presumably non-fake) recent stories covering fake news in the news. This is a sensitive, nuanced topic and if there are other resources you'd like to see included here, please leave a suggestion. From defining fake, biased, and misleading news in the first place to deciding how to take action (a blacklist is not a good answer), there's a lot of information to consider beyond what can be neatly arranged in a CSV file.

    Improvements

    If you have suggestions for improvements or would like to contribute, please let me know. The most obvious extensions are to include data from "real" news sites and to address the bias in the current list. I'd be happy to include any contributions in future versions of the dataset.

    Acknowledgements

    Thanks to Anthony for pointing me to Daniel Sieradski's BS Detector. Thank you to Daniel Nouri for encouraging me to add a disclaimer to the dataset's page.

  3. Resource Metadata Harvested from Government and Research Open Data Portals

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated Jan 24, 2020
    Cite
    Tom Brouwer (2020). Resource Metadata Harvested from Government and Research Open Data Portals [Dataset]. http://doi.org/10.5281/zenodo.1438083
    Explore at:
    Available download formats: application/gzip, bin, txt
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tom Brouwer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of resource metadata harvested from the APIs of hundreds of government and research data portals from all over the world. This dataset was harvested between the 13th and 15th of September 2018. The metadata harvested from these portals was translated to a single metadata format (see metadata_format.odt). An overview of all harvested domains is given in portal_list.txt.

    The harvested data is divided into five gzipped json-lines files, based on the ‘type’ of the resource that is derived from the data of the APIs:

    • dataset_metadata.jsonl.gz: Resources classified as a Dataset, or subsets of dataset (e.g. Dataset:Image and Dataset:Audio) [6 246 250 resources]
    • document_metadata.jsonl.gz: Resources classified as a Document, or subset of document (e.g. Document:Paper:Conference and Document:Book) [15 626 541 resources]
    • software_metadata.jsonl.gz: Resources classified as Software (including Software:Model) [42 036 resources]
    • service_metadata.jsonl.gz: Resources classified as a service (e.g. WMS, APIs) [1257 resources]
    • other_metadata.jsonl.gz: Resources of which the ‘type’ could not be determined from the data the API returned. This set still contains many datasets [1 502 979 resources]
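
    As a rough illustration (not part of the dataset documentation), each gzipped JSON-lines file can be streamed record by record; the field names inside each record are defined in metadata_format.odt and are not assumed here.

    import gzip
    import json

    # Stream records from one of the gzipped JSON-lines files without loading
    # the whole file into memory.
    with gzip.open("dataset_metadata.jsonl.gz", "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)  # one resource's metadata per line
            print(record)              # inspect a few records, then stop
            if i >= 2:
                break
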
  4. Google Play Store Latest Dataset

    • kaggle.com
    zip
    Updated Aug 3, 2021
    Cite
    Iqra Saher (2021). Google Play Store Latest Dataset [Dataset]. https://www.kaggle.com/iqrasaher/google-play-store-latest-dataset
    Explore at:
    Available download formats: zip (109579 bytes)
    Dataset updated
    Aug 3, 2021
    Authors
    Iqra Saher
    Description

    The Google Play Store is an emerging platform on which millions of applications are developed and released by developers. Users can install different categories of applications on their Android phones. In the beginning, we tried online scraping tools that are available on the internet for data extraction, but we were not satisfied with the extraction process of these tools because each tool has some limitations. We then used web scraping to collect data from the Google Play Store, gathering the metadata of various categories of applications over one month, from 25th March to 24th April 2020. We deleted all duplicated rows, identified the total null values in each column using a heat map, and filled missing values with the mean of each column for better analysis. We collected various attributes for more than 3,500 apps, and there are a total of 48 categories of apps in our dataset collected from the Google Play Store. When collecting data, we did not target any specific category; we tried to collect all categories of data to learn more about each category available in the Google Play Store. We performed some analysis of this dataset.
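
    A minimal sketch of the cleaning steps described above (drop duplicates, visualize nulls with a heat map, impute numeric columns with their means); the file name and column handling are placeholders, not the authors' actual pipeline.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    apps = pd.read_csv("play_store_apps.csv")  # placeholder file name
    apps = apps.drop_duplicates()              # remove duplicated rows

    # Visualize missing values per column with a heat map
    sns.heatmap(apps.isnull(), cbar=False)
    plt.show()

    # Fill missing values in numeric columns with the column mean
    numeric_cols = apps.select_dtypes(include="number").columns
    apps[numeric_cols] = apps[numeric_cols].fillna(apps[numeric_cols].mean())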

    If you want to know more about the analysis, you can read our article:

    "Analyzing App Releasing and Updating behavior of Android Apps Developers"

  5. Metadata Management Tool (MMT) records archive

    • datasets.ai
    • catalogue.arctic-sdi.org
    • +1more
    21
    Updated Sep 13, 2023
    + more versions
    Cite
    Government of Ontario | Gouvernement de l'Ontario (2023). Metadata Management Tool (MMT) records archive [Dataset]. https://datasets.ai/datasets/70de1e8a-b3c2-499f-942b-796408bd16fc
    Explore at:
    Available download formats: 21
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    Government of Ontario | Gouvernement de l'Ontario
    Description

    Information summarizing metadata records that were part of Land Information Ontario's (LIO's) Metadata Management Tool. This table represents metadata records which formerly existed on LIO’s Metadata Management Tool. Records representing data licensed for use under the Open Government Licence - Ontario have migrated to the Ontario GeoHub. The remaining records could not migrate for one of the following reasons:
    • The data is not spatial.
    • The metadata record is incomplete.
    • The metadata contact information is invalid.
    • The metadata references data that has not been made available to LIO.
    • LIO cannot confirm that the data has been reviewed to be released under the Open Government Licence - Ontario.

  6. Dataset metadata of known Dataverse installations

    • search.datacite.org
    • dataverse.harvard.edu
    • +1more
    Updated 2019
    + more versions
    Cite
    Julian Gautier (2019). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/dvn/dcdkzq
    Explore at:
    Dataset updated
    2019
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Harvard Dataverse
    Authors
    Julian Gautier
    Description

    This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation).csv
    │   ├── basic.csv
    │   ├── contributor(citation).csv
    │   ├── ...
    │   └── topic_classification(citation).csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2022.10.02_17.11.19.zip
    │   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
    │   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
    │   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
    │   │   │   └── ...
    │   │   └── metadatablocks_v5.6
    │   │       ├── astrophysics_v5.6.json
    │   │       ├── biomedical_v5.6.json
    │   │       ├── citation_v5.6.json
    │   │       ├── ...
    │   │       └── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
    │   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
    │   ├── Arca_Dados_2022.10.02_17.44.35.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
    ├── dataset_pids_from_most_known_dataverse_installations.csv
    ├── licenses_used_by_dataverse_installations.csv
    └── metadatablocks_from_most_known_dataverse_installations.csv

    This dataset contains two directories and three CSV files not in a directory.

    One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories:
    • The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in.
    • One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema.
    • The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

    The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files.

    The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected this data, 36 installations were running versions of the Dataverse software that allow depositors to choose a license or data use agreement from a dropdown menu in the dataset deposit form. For more information, see https://guides.dataverse.org/en/5.11.1/user/dataset-management.html#choosing-a-license.

    The metadatablocks_from_most_known_dataverse_installations.csv file contains the metadata block names, field names and child field names (if the field is a compound field) of the 77 Dataverse installations' metadata blocks. It is useful for comparing each installation's dataset metadata model (the metadata fields and the metadata blocks that each installation uses). The CSV file was created using a Python script at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_csv_file_with_metadata_block_fields_of_all_installations.py, which takes as inputs the directories and files created by the get_dataset_metadata_of_all_installations.py script.

    Known errors

    The metadata of two datasets from one of the known installations could not be downloaded because the datasets' pages and metadata could not be accessed with the Dataverse APIs.

    About metadata blocks

    Read about the Dataverse software's metadata blocks system at http://guides.dataverse.org/en/latest/admin/metadatacustomization.html
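
    As a hedged starting point for working with these files, the sketch below only reads the PID-level CSV named in the description and inspects it; the column names are intentionally not assumed.

    import pandas as pd

    # File name comes from the dataset description; inspect the real column
    # names before filtering or joining with the per-installation CSV files.
    pids = pd.read_csv("dataset_pids_from_most_known_dataverse_installations.csv")
    print(pids.shape)             # datasets across the 77 installations
    print(pids.columns.tolist())  # e.g., persistent ID and download-status columns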

  7. TMDB 5000 Movie Dataset

    • kaggle.com
    zip
    Updated Sep 28, 2017
    + more versions
    Cite
    The Movie Database (TMDb) (2017). TMDB 5000 Movie Dataset [Dataset]. https://www.kaggle.com/tmdb/tmdb-movie-metadata
    Explore at:
    Available download formats: zip (9317430 bytes)
    Dataset updated
    Sep 28, 2017
    Dataset provided by
    The Movie Database (http://themoviedb.org/)
    Authors
    The Movie Database (TMDb)
    Description

    Background

    What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?

    This is a great place to start digging in to those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.

    Data Source Transfer Summary

    We (Kaggle) have removed the original version of this dataset per a DMCA takedown request from IMDB. In order to minimize the impact, we're replacing it with a similar set of films and data fields from The Movie Database (TMDb) in accordance with their terms of use. The bad news is that kernels built on the old dataset will most likely no longer work.

    The good news is that:

    • You can port your existing kernels over with a bit of editing. This kernel offers functions and examples for doing so. You can also find a general introduction to the new format here.

    • The new dataset contains full credits for both the cast and the crew, rather than just the first three actors.

    • Actors and actresses are now listed in the order they appear in the credits. It's unclear what ordering the original dataset used; for the movies I spot-checked, it didn't line up with either the credits order or IMDB's stars order.

    • The revenues appear to be more current. For example, IMDB's figures for Avatar seem to be from 2010 and understate the film's global revenues by over $2 billion.

    • Some of the movies that we weren't able to port over (a couple of hundred) were just bad entries. For example, this IMDB entry has basically no accurate information at all. It lists Star Wars Episode VII as a documentary.

    Data Source Transfer Details

    • Several of the new columns contain JSON. You can save a bit of time by porting the load data functions from this kernel (see the parsing sketch after this list).

    • Even simple fields like runtime may not be consistent across versions. For example, the previous dataset shows the duration of Avatar's extended cut, while TMDb shows the runtime of the original version.

    • There's now a separate file containing the full credits for both the cast and crew.

    • All fields are filled out by users so don't expect them to agree on keywords, genres, ratings, or the like.

    • Your existing kernels will continue to render normally until they are re-run.

    • If you are curious about how this dataset was prepared, the code to access TMDb's API is posted here.
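
    As noted in the first item of the list above, several new columns store JSON. The sketch below shows one illustrative way to decode such a column with pandas; the file name and JSON layout are assumed from the commonly distributed TMDb 5000 files, not taken from this page.

    import json
    import pandas as pd

    # Assumed file name; adjust to the actual CSV you downloaded.
    movies = pd.read_csv("tmdb_5000_movies.csv")

    # Columns such as production_companies hold JSON-encoded lists of dicts;
    # decode them, then pull out just the company names.
    movies["production_companies"] = movies["production_companies"].apply(json.loads)
    movies["company_names"] = movies["production_companies"].apply(
        lambda companies: [c["name"] for c in companies]
    )
    print(movies[["original_title", "company_names"]].head())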

    New columns:

    • homepage

    • id

    • original_title

    • overview

    • popularity

    • production_companies

    • production_countries

    • release_date

    • spoken_languages

    • status

    • tagline

    • vote_average

    Lost columns:

    • actor_1_facebook_likes

    • actor_2_facebook_likes

    • actor_3_facebook_likes

    • aspect_ratio

    • cast_total_facebook_likes

    • color

    • content_rating

    • director_facebook_likes

    • facenumber_in_poster

    • movie_facebook_likes

    • movie_imdb_link

    • num_critic_for_reviews

    • num_user_for_reviews

    Open Questions About the Data

    There are some things we haven't had a chance to confirm about the new dataset. If you have any insights, please let us know in the forums!

    • Are the budgets and revenues all in US dollars? Do they consistently show the global revenues?

    • This dataset hasn't yet gone through a data quality analysis. Can you find any obvious corrections? For example, in the IMDb version it was necessary to treat values of zero in the budget field as missing. Similar findings would be very helpful to your fellow Kagglers! (It's probably a good idea to keep treating zeros as missing, with the caveat that missing budgets are much more likely to have come from small-budget films in the first place.)

    Inspiration

    • Can you categorize the films by type, such as animated or not? We don't have explicit labels for this, but it should be possible to build them from the crew's job titles.

    • How sharp is the divide between major film studios and the independents? Do those two groups fall naturally out of a clustering analysis or is something more complicated going on?

    Acknowledgements

    This dataset was generated from The Movie Database API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additiona...

  8. Hazardous Waste Portal Manifest Metadata

    • catalog.data.gov
    • data.ct.gov
    • +1more
    Updated Jan 26, 2024
    + more versions
    Cite
    data.ct.gov (2024). Hazardous Waste Portal Manifest Metadata [Dataset]. https://catalog.data.gov/dataset/hazardous-waste-portal-manifest-metadata
    Explore at:
    Dataset updated
    Jan 26, 2024
    Dataset provided by
    data.ct.gov
    Description

    Note: Please use the following view to be able to see the entire Dataset Description: https://data.ct.gov/Environment-and-Natural-Resources/Hazardous-Waste-Portal-Manifest-Metadata/x2z6-swxe

    Dataset Description Outline (5 sections)
    • INTRODUCTION
    • WHY USE THE CONNECTICUT OPEN DATA PORTAL MANIFEST METADATA DATASET INSTEAD OF THE DEEP DOCUMENT ONLINE SEARCH PORTAL ITSELF?
    • WHAT MANIFESTS ARE INCLUDED IN DEEP’S MANIFEST PERMANENT RECORDS AND ARE ALSO AVAILABLE VIA THE DEEP DOCUMENT SEARCH PORTAL AND CT OPEN DATA?
    • HOW DOES THE PORTAL MANIFEST METADATA DATASET RELATE TO THE OTHER TWO MANIFEST DATASETS PUBLISHED IN CT OPEN DATA?
    • IMPORTANT NOTES

    INTRODUCTION
    • All of DEEP’s paper hazardous waste manifest records were recently scanned and “indexed”.
    • Indexing consisted of 6 basic pieces of information or “metadata” taken from each manifest about the Generator and stored with the scanned image. The metadata enables searches by: Site Town, Site Address, Generator Name, Generator ID Number, Manifest ID Number and Date of Shipment.
    • All of the metadata and scanned images are available electronically via DEEP’s Document Online Search Portal at: https://filings.deep.ct.gov/DEEPDocumentSearchPortal/
    • Therefore, it is no longer necessary to visit the DEEP Records Center in Hartford for manifest records or information.
    • This CT Data dataset “Hazardous Waste Portal Manifest Metadata” (or “Portal Manifest Metadata”) was copied from the DEEP Document Online Search Portal, and includes only the metadata – no images.

    WHY USE THE CONNECTICUT OPEN DATA PORTAL MANIFEST METADATA DATASET INSTEAD OF THE DEEP DOCUMENT ONLINE SEARCH PORTAL ITSELF?
    The Portal Manifest Metadata is a good search tool to use along with the Portal. Searching the Portal Manifest Metadata can provide the following advantages over searching the Portal:
    • faster searches, especially for “large searches” - those with a large number of search returns;
    • unlimited number of search returns (the Portal is limited to 500);
    • larger display of search returns;
    • search returns can be sorted and filtered online in CT Data;
    • search returns and the entire dataset can be downloaded from CT Data and used offline (e.g. download to Excel format); and
    • metadata from searches can be copied from CT Data and pasted into the Portal search fields to quickly find single scanned images.
    The main advantages of the Portal are:
    • it provides access to scanned images of manifest documents (CT Data does not); and
    • images can be downloaded one or multiple at a time.

    WHAT MANIFESTS ARE INCLUDED IN DEEP’S MANIFEST PERMANENT RECORDS AND ARE ALSO AVAILABLE VIA THE DEEP DOCUMENT SEARCH PORTAL AND CT OPEN DATA?
    All hazardous waste manifest records received and maintained by the DEEP Manifest Program, including:
    • manifests originating from a Connecticut Generator or sent to a Connecticut Destination Facility, including manifests accompanying an exported shipment
    • manifests with RCRA hazardous waste listed on them (such manifests may also have non-RCRA hazardous waste listed)
    • manifests from a Generator with a Connecticut Generator ID number (permanent or temporary number)
    • manifests with sufficient quantities of RCRA hazardous waste listed for DEEP to consider the Generator to be a Small or Large Quantity Generator
    • manifests with PCBs listed on them from 2016 to 6-29-2018.
    Note: manifests sent to a CT Destination Facility were indexed by the Connecticut or Out of State Generator. Searches by CT Designated Facility are not possible unless such facility is the Generator for the purposes of manifesting.
    All other manifests were considered “non-hazardous” manifests and were not scanned. They were discarded after 2 years in accord with the DEEP records retention schedule. Non-hazardous manifests include:
    • Manifests with only non-RCRA hazardous waste listed
    • Manifests from generators that did not have a permanent or temporary Generator ID number
    • Sometimes non-hazardous manifests were considered “Hazar

  9. Niagara Open Data

    • catalog.civicdataecosystem.org
    Cite
    Niagara Open Data [Dataset]. https://catalog.civicdataecosystem.org/dataset/niagara-open-data
    Explore at:
    Description

    The Ontario government generates and maintains thousands of datasets. Since 2012, we have shared data with Ontarians via a data catalogue. Open data is data that is shared with the public. Click here to learn more about open data and why Ontario releases it. Ontario’s Open Data Directive states that all data must be open, unless there is good reason for it to remain confidential. Ontario’s Chief Digital and Data Officer also has the authority to make certain datasets available publicly. Datasets listed in the catalogue that are not open will have one of the following labels.

    If you want to use data you find in the catalogue, that data must have a licence – a set of rules that describes how you can use it. Most of the data available in the catalogue is released under Ontario’s Open Government Licence. However, each dataset may be shared with the public under other kinds of licences or no licence at all. If a dataset doesn’t have a licence, you don’t have the right to use the data. If you have questions about how you can use a specific dataset, please contact us.

    The Ontario Data Catalogue endeavors to publish open data in a machine-readable format. For machine-readable datasets, you can simply retrieve the file you need using the file URL. The Ontario Data Catalogue is built on CKAN, which means the catalogue has the following features you can use when building applications. APIs (Application programming interfaces) let software applications communicate directly with each other. If you are using the catalogue in a software application, you might want to extract data from the catalogue through the catalogue API. Note: All Datastore API requests to the Ontario Data Catalogue must be made server-side. The catalogue's collection of dataset metadata (and dataset files) is searchable through the CKAN API. The Ontario Data Catalogue has more than just CKAN's documented search fields; you can also search these custom fields. You can also use the CKAN API to retrieve metadata about a particular dataset and check for updated files. Read the complete documentation for CKAN's API. Some of the open data in the Ontario Data Catalogue is available through the Datastore API. You can also search and access the machine-readable open data that is available in the catalogue. How to use the API feature: read the complete documentation for CKAN's Datastore API.

    The Ontario Data Catalogue contains a record for each dataset that the Government of Ontario possesses. Some of these datasets will be available to you as open data. Others will not be available to you. This is because the Government of Ontario is unable to share data that would break the law or put someone's safety at risk. You can search for a dataset with a word that might describe a dataset or topic. Use words like “taxes” or “hospital locations” to discover what datasets the catalogue contains. You can search for a dataset from 3 spots on the catalogue: the homepage, the dataset search page, or the menu bar available across the catalogue. On the dataset search page, you can also filter your search results. You can select filters on the left hand side of the page to limit your search for datasets with your favourite file format, datasets that are updated weekly, datasets released by a particular organization, or datasets that are released under a specific licence. Go to the dataset search page to see the filters that are available to make your search easier. You can also do a quick search by selecting one of the catalogue’s categories on the homepage. These categories can help you see the types of data we have on key topic areas.

    When you find the dataset you are looking for, click on it to go to the dataset record. Each dataset record will tell you whether the data is available and, if so, tell you about the data available. An open dataset might contain several data files. These files might represent different periods of time, different sub-sets of the dataset, different regions, language translations, or other breakdowns. You can select a file and either download it or preview it. Make sure to read the licence agreement to make sure you have permission to use it the way you want. Read more about previewing data. A non-open dataset may be not available for many reasons. Read more about non-open data. Read more about restricted data. Data that is non-open may still be subject to freedom of information requests.

    The catalogue has tools that enable all users to visualize the data in the catalogue without leaving the catalogue – no additional software needed. Have a look at our walk-through of how to make a chart in the catalogue. Get automatic notifications when datasets are updated. You can choose to get notifications for individual datasets, an organization’s datasets or the full catalogue. You don’t have to provide any personal information – just subscribe to our feeds using any feed reader you like using the corresponding notification web addresses. Copy those addresses and paste them into your reader. Your feed reader will let you know when the catalogue has been updated.

    The catalogue provides open data in several file formats (e.g., spreadsheets, geospatial data, etc). Learn about each format and how you can access and use the data each file contains.
    • A file that has a list of items and values separated by commas without formatting (e.g. colours, italics, etc.) or extra visual features. This format provides just the data that you would display in a table. XLSX (Excel) files may be converted to CSV so they can be opened in a text editor. How to access the data: Open with any spreadsheet software application (e.g., Open Office Calc, Microsoft Excel) or text editor. Note: This format is considered machine-readable; it can be easily processed and used by a computer. Files that have visual formatting (e.g. bolded headers and colour-coded rows) can be hard for machines to understand; these elements make a file more human-readable and less machine-readable.
    • A file that provides information without formatted text or extra visual features and that may not follow a pattern of separated values like a CSV. How to access the data: Open with any word processor or text editor available on your device (e.g., Microsoft Word, Notepad).
    • A spreadsheet file that may also include charts, graphs, and formatting. How to access the data: Open with a spreadsheet software application that supports this format (e.g., Open Office Calc, Microsoft Excel). Data can be converted to a CSV for a non-proprietary format of the same data without formatted text or extra visual features.
    • A shapefile provides geographic information that can be used to create a map or perform geospatial analysis based on location, points/lines and other data about the shape and features of the area. It includes required files (.shp, .shx, .dbt) and might include corresponding files (e.g., .prj). How to access the data: Open with a geographic information system (GIS) software program (e.g., QGIS).
    • A package of files and folders. The package can contain any number of different file types. How to access the data: Open with an unzipping software application (e.g., WinZIP, 7Zip). Note: If a ZIP file contains .shp, .shx, and .dbt file types, it is an ArcGIS ZIP: a package of shapefiles which provide information to create maps or perform geospatial analysis that can be opened with ArcGIS (a geographic information system software program).
    • A file that provides information related to a geographic area (e.g., phone number, address, average rainfall, number of owl sightings in 2011 etc.) and its geospatial location (i.e., points/lines). How to access the data: Open using a GIS software application to create a map or do geospatial analysis. It can also be opened with a text editor to view raw information. Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.
    • A text-based format for sharing data in a machine-readable way that can store data with more unconventional structures such as complex lists. How to access the data: Open with any text editor (e.g., Notepad) or access through a browser. Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.
    • A text-based format to store and organize data in a machine-readable way that can store data with more unconventional structures (not just data organized in tables). How to access the data: Open with any text editor (e.g., Notepad). Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.
    • A file that provides information related to an area (e.g., phone number, address, average rainfall, number of owl sightings in 2011 etc.) and its geospatial location (i.e., points/lines). How to access the data: Open with a geospatial software application that supports the KML format (e.g., Google Earth). Note: This format is machine-readable, and it can be easily processed and used by a computer. Human-readable data (including visual formatting) is easy for users to read and understand.
    • This format contains files with data from tables used for statistical analysis and data visualization of Statistics Canada census data. How to access the data: Open with the Beyond 20/20 application.
    • A database which links and combines data from different files or applications (including HTML, XML, Excel, etc.). The database file can be converted to a CSV/TXT to make the data machine-readable, but human-readable formatting will be lost. How to access the data: Open with Microsoft Office Access (a database management system used to develop application software).
    • A file that keeps the original layout and
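
    Because the catalogue described above is built on CKAN, its metadata can be searched programmatically with CKAN's documented package_search action. The sketch below is illustrative only; the base URL is an assumption and should be replaced with the catalogue you are actually querying.

    import json
    import urllib.parse
    import urllib.request

    # Assumed base URL for illustration; CKAN's action API lives under /api/3/action.
    BASE = "https://data.ontario.ca/api/3/action"

    # Full-text search of dataset metadata via the package_search action.
    query = urllib.parse.urlencode({"q": "hospital locations", "rows": 5})
    with urllib.request.urlopen(f"{BASE}/package_search?{query}") as resp:
        result = json.load(resp)

    for pkg in result["result"]["results"]:
        print(pkg["title"])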

  10. the-stack-metadata

    • huggingface.co
    Updated Apr 16, 2023
    Cite
    BigCode (2023). the-stack-metadata [Dataset]. https://huggingface.co/datasets/bigcode/the-stack-metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 16, 2023
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack Metadata

      Changelog
    

    Release | Description
    v1.1 | This is the first release of the metadata. It is for The Stack v1.1
    v1.2 | Metadata dataset matching The Stack v1.2

      Dataset Summary
    

    This is a set of additional information for the repositories used for The Stack. It contains file paths and detected licenses, as well as some other information for the repositories.

      Supported Tasks and Leaderboards
    

    The main task is to recreate… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-metadata.

  11. Data from: Da-TACOS: A Dataset for Cover Song Identification and...

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). Da-TACOS: A Dataset for Cover Song Identification and Understanding [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3520368?locale=fi
    Explore at:
    Available download formats: unknown (3513878)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We present Da-TACOS: a dataset for cover song identification and understanding. It contains two subsets, namely the benchmark subset (for benchmarking cover song identification systems) and the cover analysis subset (for analyzing the links among cover songs), with pre-extracted features and metadata for 15,000 and 10,000 songs, respectively. The annotations included in the metadata are obtained with the API of SecondHandSongs.com. All audio files we use to extract features are encoded in MP3 format and their sample rate is 44.1 kHz. Da-TACOS does not contain any audio files. For the results of our analyses on modifiable musical characteristics using the cover analysis subset and our initial benchmarking of 7 state-of-the-art cover song identification algorithms on the benchmark subset, you can look at our publication.

    For organizing the data, we use the structure of SecondHandSongs where each song is called a ‘performance’, and each clique (cover group) is called a ‘work’. Based on this, the file names of the songs are their unique performance IDs (PID, e.g. P_22), and their labels with respect to their cliques are their work IDs (WID, e.g. W_14). Metadata for each song includes performance title, performance artist, work title, work artist, release year, SecondHandSongs.com performance ID, SecondHandSongs.com work ID, and whether the song is instrumental or not. In addition, we matched the original metadata with MusicBrainz to obtain the MusicBrainz ID (MBID), song length and genre/style tags. We would like to note that MusicBrainz-related information is not available for all the songs in Da-TACOS, and since we used just our metadata for matching, we include all possible MBIDs for a particular song.

    For facilitating reproducibility in cover song identification (CSI) research, we propose a framework for feature extraction and benchmarking in our supplementary repository: acoss. The feature extraction component is designed to help CSI researchers find the most commonly used features for CSI in a single address. The parameter values we used to extract the features in Da-TACOS are shared in the same repository. Moreover, the benchmarking component includes our implementations of 7 state-of-the-art CSI systems. We provide the performance results of an initial benchmarking of those 7 systems on the benchmark subset of Da-TACOS. We encourage other CSI researchers to contribute to acoss by implementing their favorite feature extraction algorithms and their CSI systems to build up a knowledge base where CSI research can reach larger audiences. The instructions for how to download and use the dataset are shared below. Please contact us if you have any questions or requests.

    1. Structure

    1.1. Metadata

    We provide two metadata files that contain information about the benchmark subset and the cover analysis subset. Both metadata files are stored as python dictionaries in .json format, and have the same hierarchical structure. An example to load the metadata files in python:

    import json
    with open('./da-tacos_metadata/da-tacos_benchmark_subset_metadata.json') as f:
        benchmark_metadata = json.load(f)

    The python dictionary obtained with the code above will have the respective WIDs as keys. Each key will provide the song dictionaries that contain the metadata regarding the songs that belong to their WIDs. An example can be seen below:

    "W_163992": {  # work id
        "P_547131": {  # performance id of the first song belonging to the clique 'W_163992'
            "work_title": "Trade Winds, Trade Winds",
            "work_artist": "Aki Aleong",
            "perf_title": "Trade Winds, Trade Winds",
            "perf_artist": "Aki Aleong",
            "release_year": "1961",
            "work_id": "W_163992",
            "perf_id": "P_547131",
            "instrumental": "No",
            "perf_artist_mbid": "9bfa011f-8331-4c9a-b49b-d05bc7916605",
            "mb_performances": {
                "4ce274b3-0979-4b39-b8a3-5ae1de388c4a": {
                    "length": "175000"
                },
                "7c10ba3b-6f1d-41ab-8b20-14b2567d384a": {
                    "length": "177653"
                }
            }
        },
        "P_547140": {  # performance id of the second song belonging to the clique 'W_163992'
            "work_title": "Trade Winds, Trade Winds",
            "work_artist": "Aki Aleong",
            "perf_title": "Trade Winds, Trade Winds",
            "perf_artist": "Dodie Stevens",
            "release_year": "1961",
            "work_id": "W_163992",
            "perf_id": "P_547140",
            "instrumental": "No"
        }
    }

    1.2. Pre-extracted features

    The list of features included in Da-TACOS can be seen below. All the features are extracted with the acoss repository, which uses open-source feature extraction libraries such as Essentia, LibROSA, and Madmom. To facilitate the use of the dataset, we provide two options regarding the file structure.

    1- In the da-tacos_benchmark_subset_single_files and da-tacos_coveranalysis_subset_single_files folders, we organize the data based on their respective cliques, and one file contains all the features for that particular song.

    {
        "chroma_cens": numpy.ndarray,
        "crema": numpy.ndarray,
        "hpcp": numpy.ndarray,
        "key_extractor": {
            "key": numpy.str_,
            "scale": numpy.str_,
            "strength": numpy.float64
        },
        "madmom_features": {
            "novfn":

  12. Data from: Russian Financial Statements Database: A firm-level collection of...

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Cite
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    European University at St. Petersburg
    Authors
    Bondarkov, Sergey; Ledenev, Victor; Skougarevskiy, Dmitriy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    • 🔓 First open data set with information on every active firm in Russia.

    • 🗂️ First open financial statements data set that includes non-filing firms.

    • 🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    • 📅 Covers 2011-2023 initially, will be continuously updated.

    • 🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

    The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only the variables of interest at scale.

    The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

    Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

    Importing The Data

    You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo or rely on 🤗 Hugging Face Datasets library.

    Python

    🤗 Hugging Face Datasets

    It is as easy as:

    from datasets import load_dataset
    import polars as pl

    # This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder
    RFSD = load_dataset('irlspbru/RFSD')

    # Alternatively, this will download ~540MB with all financial statements for 2023
    # to a Polars DataFrame (requires about 8GB of RAM)
    RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

    Please note that the data is not shuffled within year, meaning that streaming first n rows will not yield a random sample.

    Local File Import

    Importing in Python requires the pyarrow package to be installed.

    import pyarrow.dataset as ds
    import polars as pl

    # Read RFSD metadata from local file
    RFSD = ds.dataset("local/path/to/RFSD")

    # Use RFSD.schema to glimpse the data structure and columns' classes
    print(RFSD.schema)

    # Load full dataset into memory
    RFSD_full = pl.from_arrow(RFSD.to_table())

    # Load only 2019 data into memory
    RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

    # Load only revenue for firms in 2019, identified by taxpayer id
    RFSD_2019_revenue = pl.from_arrow(
        RFSD.to_table(
            filter=ds.field('year') == 2019,
            columns=['inn', 'line_2110']
        )
    )

    # Give suggested descriptive names to variables
    renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
    RFSD_full = RFSD_full.rename(
        {item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])}
    )

    R

    Local File Import

    Importing in R requires the arrow package to be installed.

    library(arrow)
    library(data.table)

    # Read RFSD metadata from local file
    RFSD <- open_dataset("local/path/to/RFSD")

    # Use schema() to glimpse into the data structure and column classes
    schema(RFSD)

    # Load full dataset into memory
    scanner <- Scanner$create(RFSD)
    RFSD_full <- as.data.table(scanner$ToTable())

    # Load only 2019 data into memory
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scanner <- scan_builder$Finish()
    RFSD_2019 <- as.data.table(scanner$ToTable())

    # Load only revenue for firms in 2019, identified by taxpayer id
    scan_builder <- RFSD$NewScan()
    scan_builder$Filter(Expression$field_ref("year") == 2019)
    scan_builder$Project(cols = c("inn", "line_2110"))
    scanner <- scan_builder$Finish()
    RFSD_2019_revenue <- as.data.table(scanner$ToTable())

    # Give suggested descriptive names to variables
    renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
    setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

    Use Cases

    🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

    🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

    🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

    FAQ

    Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

    To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

    What is the data period?

    We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

    Why are there no data for firm X in year Y?

    Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

    We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

    Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, and Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.

    A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), was not active in that year. We remove those filings.

    Why is the geolocation of firm X incorrect?

    We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

    Why is the data for firm X different from https://bo.nalog.ru/?

    Many firms submit correcting statements after the initial filing. Although we downloaded the data well past the April 2024 deadline for 2023 filings, firms may have kept submitting correcting statements. We will capture them in future releases.

    Why is the data for firm X unrealistic?

    We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

    Why is the data for groups of companies different from their IFRS statements?

    We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

    Why is the data not in CSV?

    The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

    Version and Update Policy

    Version (SemVer): 1.0.0.

    We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have filed their statements with the Federal Tax Service. The official deadline for filing previous-year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and prompt version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.

    Licence

    Creative Commons License Attribution 4.0 International (CC BY 4.0).

    Copyright © the respective contributors.

    Citation

    Please cite as:

    @unpublished{bondarkov2025rfsd, title={{R}ussian {F}inancial {S}tatements {D}atabase}, author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy}, note={arXiv preprint arXiv:2501.05841}, doi={https://doi.org/10.48550/arXiv.2501.05841}, year={2025}}

    Acknowledgments and Contacts

    Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

    Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,

  13. Surface Water - Chemistry Results

    • data.cnra.ca.gov
    • data.ca.gov
    • +1more
    csv, pdf, zip
    Updated Nov 6, 2025
    Cite
    California State Water Resources Control Board (2025). Surface Water - Chemistry Results [Dataset]. https://data.cnra.ca.gov/dataset/surface-water-chemistry-results
    Explore at:
    Available download formats: csv, zip, pdf
    Dataset updated
    Nov 6, 2025
    Dataset authored and provided by
    California State Water Resources Control Board
    Description

    This data provides results from the California Environmental Data Exchange Network (CEDEN) for field and lab chemistry analyses. The data set contains two provisionally assigned values (“DataQuality” and “DataQualityIndicator”) to help users interpret the data quality metadata provided with the associated result.

    Due to file size limitations, the data has been split into individual resources by year. The entire dataset can also be downloaded in bulk using the zip files on this page (in csv format or parquet format), and developers can also use the API associated with each year's dataset to access the data.
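
    As a minimal sketch of programmatic access (assuming the portal exposes the standard CKAN DataStore API; the resource_id below is a placeholder to be copied from a yearly resource page):

    import requests

    url = "https://data.cnra.ca.gov/api/3/action/datastore_search"
    params = {
        "resource_id": "REPLACE-WITH-RESOURCE-ID",  # placeholder: the ID of one yearly resource
        "limit": 100,                               # number of rows to return
    }
    response = requests.get(url, params=params, timeout=60)
    records = response.json()["result"]["records"]
    print(len(records), "rows returned")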

    Users who want to manually download more specific subsets of the data can also use the CEDEN Query Tool, which provides access to the same data presented here, but allows for interactive data filtering.

    NOTE: Some of the field and lab chemistry data that has been submitted to CEDEN since 2020 has not been loaded into the CEDEN database. That data is not included in this data set (and is also not available via the CEDEN query tool described above), but is available as a supplemental data set available here: Surface Water - Chemistry Results - CEDEN Augmentation. For consistency, many of the conditions applied to the data in this dataset and in the CEDEN query tool are also applied to that supplemental dataset (e.g., no rejected data or replicates are included), but that supplemental data is provisional and may not reflect all of the QA/QC controls applied to the regular CEDEN data available here.

  14. The Con Espressione Game Dataset

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Nov 5, 2020
    Cite
    Cancino-Chacón, Carlos Eduardo; Peter, Silvan; Chowdhury, Shreyan; Aljanaki, Anna; Widmer, Gerhard (2020). The Con Espressione Game Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3968827
    Explore at:
    Dataset updated
    Nov 5, 2020
    Dataset provided by
    Johannes Kepler University Linz
    Austrian Research Institute for Artificial Intelligence
    University of Tartu
    Authors
    Cancino-Chacón, Carlos Eduardo; Peter, Silvan; Chowdhury, Shreyan; Aljanaki, Anna; Widmer, Gerhard
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Con Espressione Game Dataset

    A piece of music can be expressively performed, or interpreted, in a variety of ways. With the help of an online questionnaire, the Con Espressione Game, we collected some 1,500 descriptions of expressive character relating to 45 performances of 9 excerpts from classical piano pieces, played by different famous pianists. More specifically, listeners were asked to describe, using freely chosen words (preferably: adjectives), how they perceive the expressive character of the different performances. The aim of this research is to find the dimensions of musical expression (in Western classical piano music) that can be attributed to a performance, as perceived and described in natural language by listeners.

    The Con Espressione Game was launched on the 3rd of April 2018.

    Dataset structure

    Listeners’ Descriptions of Expressive performance

    piece_performer_data.csv: A comma-separated file (CSV) containing information about the pieces in the dataset. Strings are delimited with double quotes ("). The columns in this file are:

    music_id: An integer ID for each performance in the dataset.

    performer_name: (Last) name of the performer.

    piece_name: (Short) name of the piece.

    performance_name: Name of the performance. All files in different modalities (alignments, MIDI, loudness features, etc.) corresponding to a single performance will have the same name (but possibly different extensions).

    composer: Name of the composer of the piece.

    piece: Full name of the piece.

    album: Name of the album.

    performer_name_full: Full name of the performer.

    year_of_CD_issue: Year of the issue of the CD.

    track_number: Number of the track in the CD.

    length_of_excerpt_seconds: Length of the excerpt in seconds.

    start_of_excerpt_seconds: Start of the excerpt in its corresponding track (in seconds).

    end_of_excerpt_seconds: End of the excerpt in its corresponding track (in seconds).

    con_espressione_game_answers.csv: This is the main file of the dataset which contains listener’s descriptions of expressive character. This CSV file contains the following columns:

    answer_id: An integer representing the ID of the answer. Each answer gets a unique ID.

    participant_id: An integer representing the ID of a participant. Answers with the same ID come from the same participant.

    music_id: An integer representing the ID of the performance. This is the same as the music_id in piece_performer_data.csv described above.

    answer: (cleaned/formatted) participant description. All answers were converted to lower case, typos were corrected, spaces were replaced by underscores (_), and individual terms are separated by commas. See cleanup_rules.txt for a more detailed description of how the answers were formatted.

    original_answer: Raw answers provided by the participants.

    timestamp: Timestamp of the answer.

    favorite: A boolean (0 or 1) indicating if this performance of the piece is the participant’s favorite.

    translated_to_english: Raw translation (from German, Russian, Spanish and Italian).

    performer: (Last) name of the performer. See piece_performer_data.csv described above.

    piece_name: (Short) name of the piece. See piece_performer_data.csv described above.

    performance_name: Name of the performance. See piece_performer_data.csv described above.

    participant_profiles.csv: A CSV file containing musical background information of the participants. Empty cells mean that the participant did not provide an answer. This file contains the following columns:

    participant_id: An integer representing the ID of a participant.

    music_education_years: (Self reported) number of years of musical education of the participants

    listening_to_classical_music: Answers to the question “How often do you listen to classical music?”. The possible answers are:

    1: Never

    2: Very rarely

    3: Rarely

    4: Occasionally

    5: Frequently

    6: Very frequently

    registration_date: Date and time of registration of the participant.

    playing_piano: Answer to the question “Do you play the piano?”. The possible answers are

    1: No

    2: A little bit

    3: Quite well

    4: Very well

    cleanup_rules.txt: Rules for cleaning/formatting the terms in the participant’s answers.

    translations_GERMAN.txt: How the translations from German to English were made.
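
    As a minimal sketch (pandas is assumed; file paths refer to the dataset's root folder), the three CSV files can be joined on music_id and participant_id:

    import pandas as pd

    pieces = pd.read_csv("piece_performer_data.csv", quotechar='"')
    answers = pd.read_csv("con_espressione_game_answers.csv", quotechar='"')
    profiles = pd.read_csv("participant_profiles.csv")

    # Attach piece/performer metadata to every answer, then add participant background.
    full = (answers
            .merge(pieces, on="music_id", how="left", suffixes=("", "_piece"))
            .merge(profiles, on="participant_id", how="left"))
    print(full[["performer_name", "piece_name", "answer"]].head())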

    Metadata

    Related meta data is stored in the MetaData folder.

    Alignments. This folder contains the manually corrected score-to-performance alignments for each of the pieces in the dataset. Each of these alignments is a text file.

    ApproximateMIDI. This folder contains reconstructed MIDI performances created from the alignments and the loudness curves. The onset and offset times of the notes were determined from the alignment times, and the MIDI velocity was computed from the loudness curves.

    Match. This folder contains score-to-performance alignments in Matchfile format.

    Scores_MuseScore. Manually encoded sheet music in MuseScore format (.mscz)

    Scores_MusicXML. Sheet music in MusicXML format.

    Scores_pdf. Images of the sheet music in pdf format.

    Audio Features

    Audio features computed from the audio files. These features are located in the AudioFeatures folder.

    Loudness: Text files containing loudness curves in dB of the audio files. These curves were computed using code provided by Olivier Lartillot. Each of these files contains the following columns:

    performance_time_(seconds): Performance time in seconds.

    loudness_(db): Loudness curve in dB.

    smooth_loudness_(db): Smoothed loudness curve.

    Spectrograms. Numpy files (.npy) containing magnitude spectrograms (as Numpy arrays). The shape of each array is (149 frequency bands, number of frames of the performance). The spectrograms were computed from the audio files with the following parameters:

    Sample rate (sr): 22050 samples per second

    Window length: 2048

    Frames per Second (fps): 31.3 fps

    Hop size: sample_rate // fps = 704

    Filterbank: log scaled filterbank with 24 bands per octave and min frequency 20 Hz
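
    As a minimal sketch of working with these files (the file name is a placeholder for any file in the Spectrograms folder):

    import numpy as np

    spec = np.load("Spectrograms/some_performance.npy")   # placeholder file name
    n_bands, n_frames = spec.shape                         # expected: (149, number of frames)

    fps = 31.3                                 # frames per second, as documented above
    duration_seconds = n_frames / fps          # approximate length of the excerpt
    frame_times = np.arange(n_frames) / fps    # time stamp of each frame in seconds
    print(n_bands, n_frames, round(duration_seconds, 1))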

    MIDI Performances

    Since the dataset consists of commercial recordings, we cannot include the audio files in the dataset. We can, however, share the 2 synthesized MIDI performances used in the Con Espressione game (for Bach’s Prelude in C and the second movement of Mozart’s Sonata in C K 545) in mp3 format. These performances can be found in the MIDIPerformances folder.

  15. GameOfLife Prediction Dataset

    • data.ncl.ac.uk
    txt
    Updated Sep 10, 2025
    Cite
    David Towers; Linus Ericsson; Amir Atapour-Abarghouei; Elliot J Crowley; Andrew Stephen McGough (2025). GameOfLife Prediction Dataset [Dataset]. http://doi.org/10.25405/data.ncl.30000835.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Sep 10, 2025
    Dataset provided by
    Newcastle University
    Authors
    David Towers; Linus Ericsson; Amir Atapour-Abarghouei; Elliot J Crowley; Andrew Stephen McGough
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The GameOfLife dataset is an algorithmically generated dataset based on John Horton Conway's Game of Life. Conway's Game of Life follows a strict set of rules at each "generation" (simulation step), where cells alternate between a dead and an alive state based on the number of surrounding alive cells. These rules can be found on the Game of Life's Wikipedia page. This dataset is one of the three hidden datasets used by the 2025 NAS Unseen-Data Challenge at AutoML.

    The goal of this dataset is to predict the number of cells alive in the next generation. This task is relatively simple, if a bit tedious, for a human, and should theoretically be simple for machine learning algorithms. Each cell's state is calculated from the number of alive neighbours in the previous step. Effectively, for every cell we only need to look at the surrounding eight cells (a 3x3 square minus the centre), which means all information for each cell can be captured by a 3x3 convolution, a very common kernel size. The dataset was used to make sure that participants' approaches could handle simple tasks alongside the more complicated ones and did not overcomplicate their submissions.

    There are 70,000 images in the dataset, where each image is a randomly generated starting configuration of the Game of Life with a random level of density (number of initially alive cells). The data is stored in a channels-first format with a shape of (n, 1, 10, 10), where n is the number of samples in the corresponding set (50,000 for training, 10,000 for validation, and 10,000 for testing).

    There are 25 classes in this dataset, where the label (0..24) represents the number of alive cells in the next generation, and images are evenly distributed by class across the dataset (2,800 each: 2,000, 400, and 400 for training, validation, and testing respectively). We limit the data to 25 classes despite a theoretical range of 0-100 because the higher counts are increasingly unlikely to occur and would take much longer to generate for a balanced dataset. Excluding 0, the lower counts also become increasingly unlikely (though more likely than the higher ones); we wanted to prevent gaps and therefore limited the labels to 25 contiguous classes.

    NumPy (.npy) files can be opened with the NumPy Python library, using the numpy.load() function with the path to the file as a parameter. The metadata file contains some basic information about the datasets and can be opened in many text editors such as vim, nano, notepad++, notepad, etc.
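
    As a minimal sketch of loading a split and checking shapes and label balance (the .npy file names are placeholders; use the names shipped with the download):

    import numpy as np

    x_train = np.load("train_x.npy")   # placeholder name; expected shape (50000, 1, 10, 10)
    y_train = np.load("train_y.npy")   # placeholder name; expected shape (50000,), labels 0..24

    print(x_train.shape, y_train.shape)
    labels, counts = np.unique(y_train, return_counts=True)
    print(dict(zip(labels.tolist(), counts.tolist())))   # roughly 2,000 samples per class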

  16. IMDb Movies Metadata Dataset – 4.5M Records (Global Coverage)

    • crawlfeeds.com
    csv, zip
    Updated Nov 9, 2025
    Cite
    Crawl Feeds (2025). IMDb Movies Metadata Dataset – 4.5M Records (Global Coverage) [Dataset]. https://crawlfeeds.com/datasets/imdb-movies-metadata-dataset-4-5m-records-global-coverage
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Nov 9, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Unlock one of the most comprehensive movie datasets available—4.5 million structured IMDb movie records, extracted and enriched for data science, machine learning, and entertainment research.

    This dataset includes a vast collection of global movie metadata, including details on title, release year, genre, country, language, runtime, cast, directors, IMDb ratings, reviews, and synopsis. Whether you're building a recommendation engine, benchmarking trends, or training AI models, this dataset is designed to give you deep and wide access to cinematic data across decades and continents.

    Perfect for use in film analytics, OTT platforms, review sentiment analysis, knowledge graphs, and LLM fine-tuning, the dataset is cleaned, normalized, and exportable in multiple formats.

    What’s Included:

    • Genres: Drama, Comedy, Horror, Action, Sci-Fi, Documentary, and more

    • Delivery: Direct download

    Use Cases:

    • Train LLMs or chatbots on cinematic language and metadata

    • Build or enrich movie recommendation engines

    • Run cross-lingual or multi-region film analytics

    • Benchmark genre popularity across time periods

    • Power academic studies or entertainment dashboards

    • Feed into knowledge graphs, search engines, or NLP pipelines

  17. Freestyle Battles

    • kaggle.com
    zip
    Updated Oct 7, 2025
    Cite
    Andrea Difino (2025). Freestyle Battles [Dataset]. https://www.kaggle.com/datasets/andreadifino/freestyle-battles
    Explore at:
    Available download formats: zip (11822966 bytes)
    Dataset updated
    Oct 7, 2025
    Authors
    Andrea Difino
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    📢Motivation

    I created this dataset out of personal necessity, since I couldn't find anything similar available online. My main goal was to build a resource that could be used to align freestyle rap lyrics with their corresponding audio. In addition, I am currently using this dataset as part of my public project on GitHub.

    🛠️Creation Process

    The dataset took around 20 days to build. During this time, I collected metadata from the King of the Dot (KOTD) YouTube channel and used OpenAI Whisper to automatically transcribe the publicly available battle videos into text.
    The resulting transcriptions were then aligned with timestamps and organized into a structured format suitable for training.

    📝Current Status

    At the moment, the dataset is still relatively small. However, it can already be used for fine-tuning larger models on tasks such as text-to-audio alignment or rhythm-based natural language processing. I am actively working on expanding it with more battles to increase its coverage and usefulness.

    🧮 Dataset Structure

    Each entry in the dataset contains both metadata about the battle and aligned lyrical content. Below is an explanation of the columns:

    • artist1: The first rapper participating in the battle.
    • artist2: The second rapper participating in the battle.
    • battle_name: The official title of the battle as published on YouTube.
    • timestamp: The corresponding timestamp in the audio where the lyric appears. This is particularly useful for aligning text with the beat.
    • bar: A single line of rap (each row corresponds to one bar).

    Note on UNK values: If the value for artist1 or artist2 is UNK, it means that the battle might have featured more than two rappers, or that the YouTube title did not clearly specify the artist names.
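
    As a minimal sketch of working with the table (pandas is assumed and the CSV file name is a placeholder for the file shipped in the archive):

    import pandas as pd

    bars = pd.read_csv("freestyle_battles.csv")   # placeholder file name

    # Keep rows where both artists were identified, then replay one battle in order.
    known = bars[(bars["artist1"] != "UNK") & (bars["artist2"] != "UNK")]
    first_battle = known["battle_name"].iloc[0]
    one_battle = known[known["battle_name"] == first_battle].sort_values("timestamp")
    for _, row in one_battle.iterrows():
        print(row["timestamp"], row["bar"])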

    ⚠️ Legal Disclaimer

    • The textual transcriptions included in this dataset are automatically generated using OpenAI Whisper from publicly available YouTube video content.
    • No audio or video files are distributed, nor any portion of the original multimedia content.
    • All video names, artists, and timestamps are factual information derived from the videos and are included solely for reference and research purposes.
    • The dataset is distributed under the Apache 2.0 license, but this license does not override or remove existing copyright protections on the original content (audio/video).
    • It is recommended to use this dataset only for research and educational purposes, such as ASR model development, text-to-audio alignment, or rhythm/flow-based NLP tasks. Any other use, especially commercial or redistribution of the original material, may require explicit authorization from the rights holders.
  18. Grid Loss Prediction Dataset

    • kaggle.com
    zip
    Updated Oct 9, 2020
    Cite
    TrønderEnergi Kraft (2020). Grid Loss Prediction Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1546931
    Explore at:
    Available download formats: zip (2913127 bytes)
    Dataset updated
    Oct 9, 2020
    Authors
    TrønderEnergi Kraft
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    A power grid transports electricity from power producers to consumers, but not all that is produced is delivered to the customers; some of it is lost in either transmission or distribution. In Norway, the grid companies are responsible for reporting this grid loss to the institutes responsible for the national transmission network. They have to nominate the expected loss to the market a day ahead so that the electricity price can be decided.

    The physics of grid losses is well understood, and losses can be calculated quite accurately given the grid configuration. Still, as the configuration is often not known or changes all the time, calculating grid losses is not straightforward.

    Content

    Grid loss is directly correlated with the total amount of power in the grid, which is also known as the grid load.

    We provide data for three different grids from Norway that are owned by Tensio (previously Trønderenergi Nett).

    Features: In this dataset, we provide the hourly values of all the features we found relevant for predicting the grid loss.

    For each of the grids, we have:

    1. Grid loss: historical measurements of grid loss in MWh
    2. Grid load: historical measurements of grid load in MWh
    3. Temperature forecast in Kelvin
    4. Predictions using the Prophet model in MWh
    5. Trend, daily, weekly and yearly components of the grid loss, also from the Prophet model.

    Other than these grid-specific features, we provide:

    1. Calendar features: year, season, month, week, weekday, hour, in cyclic form (see Note 1 and the sketch below), and whether it is a holiday or not.
    2. Incorrect data: whether the data was marked incorrect by the experts, in retrospect. We recommend removing this data before training your model.
    3. Estimated demand in Trondheim: predicted demand for electricity in Trondheim, a big city in the middle of Norway, in MWh (see Note 2).
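
    As a minimal sketch of the sine/cosine encoding described in Note 1 below (shown here for the weekday; the same pattern applies to month, hour, etc.):

    import numpy as np
    import pandas as pd

    timestamps = pd.date_range("2019-01-01", periods=14, freq="D")
    weekday = timestamps.dayofweek.to_numpy()      # 0 = Monday ... 6 = Sunday

    angle = 2 * np.pi * weekday / 7                # map the 7 values onto a circle
    weekday_sin = np.sin(angle)
    weekday_cos = np.cos(angle)
    # Sunday (6) and Monday (0) now lie next to each other in (sin, cos) space.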

    We have split the dataset into two parts: training and testing set.

    Training set: This file (train.csv) contains two years of data (December 2017 to November 2019). All the features mentioned above are provided for this duration.

    Test set: This file (test.csv) contains six months of data (December 2019 to May 2020). All the features from training data are provided for the test set as well. Occasionally, some of the features could be missing.

    Additionally, we provide a copy of test dataset (test_backfilled_missing_features.csv) where the missing features are backfilled.

    Notes:

    1. Calendar features are cyclic in nature. If we encode the weekdays (Monday to Sunday) as 0 to 6, Sunday and Monday are adjacent in time, but the distance between their encodings does not reflect it. To capture this cyclic nature, we created cyclic calendar features based on cosine and sine, which together place the highest and lowest values of a feature close to each other in the feature space.
    2. We don't have an estimate of demand for all the grids. We used the demand predictions for Trondheim, the biggest city closest to the three grids.
    3. The grid load is directly proportional to the grid loss. We don't have predictions for grid load, but since we have historical measurements for it, it makes sense to predict it and use the prediction as a feature for predicting the grid loss.
    4. While the Prophet model did not perform well as a prediction tool for our dataset, we found it useful to include its prediction and other components as features in our model.
    5. Grid 3 has less training data than grid 1 and grid 2.
    6. We published our solution. For more details, please refer to:

    Dalal, N., Mølnå, M., Herrem, M., Røen, M., & Gundersen, O. E. (2020). Day-Ahead Forecasting of Losses in the Distribution Network. In AAAI (pp. 13148-13155).

    Bibtex format for citation:

    @incollection{dalal2020a, author = {Dalal, N. and Mølnå, M. and Herrem, M. and Røen, M. and Gundersen, O.E.}, date = {2020}, title = {Day-Ahead Forecasting of Losses in the Distribution Network}, pages = {13148–13155}, language = {en}, booktitle = {AAAI} }

    Challenges

    Working with clean and processed data often hides the complexity of running the model in deployment. Some of the challenges we had while predicting grid loss in deployment are:

    1. Day-ahead predictions: We need to predict the grid loss for the next day before 10 am on the current day at an hourly resolution, i.e., at 10:00 on May 26, 2020, we need to predict the grid loss for May 27, 2020, from midnight to 23:00 on May 27, 2020, at an hourly resolution (24 values) for each grid.
    2. Delayed measurements: We don't receive the measured values of load and loss immediately; we receive them 5 days later. Sometimes there can be additional delays of a few more days. While grid loss and load are provided for the test data set as well, DO NOT USE them as features unless they are at least 6 days old, i.e., when predicting grid loss for 27th January 2020, you can use grid loss values up to 20th January 2020. Using grid loss or grid load data after that date is unfair and will be discarded (see the sketch after this list).
    3. Missing data: Sometimes we don't receive some of the features. For example, the weather client might be out of service. You should make sure that your model works even when some features are unavailable or missing.
    4. Incorrect data: There have been times when the measurements we received were incorrect, by a big margin. They have been marked in the dataset in the incorrect_data column. It is recommended to remove those data points before you start analysing the data.
    5. Less training data: For one of the grids, grid 3, we only have a few months of data.
    6. Changes in the grids: Grid structures can keep changing. Sometimes new big consumers are added, or small grids can be merged into big ones.
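
    As a minimal sketch of respecting the measurement delay from challenge 2 when building features (column names are placeholders, not necessarily those used in train.csv):

    import pandas as pd

    # Placeholder column names; substitute the actual ones from train.csv.
    df = pd.read_csv("train.csv", parse_dates=["timestamp"]).set_index("timestamp")

    # Only measurements at least 6 days old may be used, so lag the hourly series by 6 * 24 hours.
    df["grid1_loss_lag6d"] = df["grid1_loss"].shift(6 * 24)

    # Drop rows flagged as incorrect and the initial rows that have no lagged value yet.
    clean = df[df["incorrect_data"] == 0].dropna(subset=["grid1_loss_lag6d"])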

    Acknowledgements

    We wouldn't be here without the help of others. We would like to thank Tensio for allowing us to make their grid data public in the interest of open science and research. We would also like to thank the AI group in NTNU for strong collaborations and scientific discussions. If you use this dataset, please cite the following paper: Dalal, N., Mølnå, M., Herrem, M., Røen, M., & Gundersen, O. E. (2020). Day-Ahead Forecasting of Losses in the Distribution Network. In AAAI (pp. 13148-13155).

  19. Data from: AstroChat

    • kaggle.com
    • huggingface.co
    zip
    Updated Jun 9, 2024
    Cite
    astro_pat (2024). AstroChat [Dataset]. https://www.kaggle.com/datasets/patrickfleith/astrochat
    Explore at:
    Available download formats: zip (1214166 bytes)
    Dataset updated
    Jun 9, 2024
    Authors
    astro_pat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose and Scope

    The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.

    Intended Use

    The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the STEM area (Science, Technology, Engineering, and Math).

    Quickstart

    To be completed

    DATASET DESCRIPTION

    Access

    Structure

    901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):

    - id: a unique identifier for this specific conversation. Useful for traceability, especially for further processing tasks or merges with other datasets.
    - topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic or to create a topic-based split.
    - subtopic: a subtopic of the topic. For instance, within the topic of Propulsion there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
    - persona: description of the persona used to simulate a user.
    - opening_question: the first question asked by the user to start a conversation with the AI assistant.
    - messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library (see the sketch below). A list of messages where each message is a dictionary with the following fields:
      - role: the role of the speaker, either user or assistant.
      - content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
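
    As a minimal sketch of formatting one conversation for supervised fine-tuning (the Hugging Face dataset ID, split name, and tokenizer are assumptions, not prescribed by the author):

    from datasets import load_dataset
    from transformers import AutoTokenizer

    ds = load_dataset("patrickfleith/AstroChat", split="train")            # split name assumed
    tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")    # any chat-templated model

    example = ds[0]
    # The messages field is already a list of {"role": ..., "content": ...} dictionaries.
    text = tok.apply_chat_template(example["messages"], tokenize=False)
    print(text[:500])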

    Important: See the full list of topics and subtopics covered below.

    Metadata

    Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main

    Generation Method

    We used a method inspired by the UltraChat dataset. In particular, we implemented our own version of the Human-Model interaction from Sector I: Questions about the World of their paper:

    Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.

    Step-by-step description

    • Defined a set of user persona
    • Defined a set of topics/ disciplines within the domain of Astronautics / Space Mission Engineering
    • For each topic, we defined a set of subtopics to narrow the conversations down to more specific and niche exchanges (see the full list below)
    • For each subtopic we generate a set of opening questions that the user could ask to start a conversation (see below the full list)
    • We then distil the knowledge of a strong chat model (in our case, ChatGPT through the API with the gpt-4-turbo model) to generate the answers to the opening questions
    • We simulate follow-up questions from the user to the assistant, and the assistant's answers to these questions, which builds up the messages.

    Future work and contributions appreciated

    • Distil knowledge from more models (Anthropic, Mixtral, GPT-4o, etc...)
    • Implement more creativity in the opening questions and follow-up questions
    • Filter-out questions and conversations which are too similar
    • Ask topic and subtopic experts to validate the generated conversations to get a sense of how reliable the overall dataset is

    Languages

    All instances in the dataset are in English.

    Size

    901 synthetically generated dialogues

    USAGE AND GUIDELINES

    License

    AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International

    Restrictions

    No restriction. Please provide the correct attribution following the license terms.

    Citation

    Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579

    Update Frequency

    Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)

    Have a feedback or spot an error?

    Use the ...

  20. Data from: A Novel Curated Scholarly Graph Connecting Textual and Data...

    • zenodo.org
    • data.europa.eu
    zip
    Updated Dec 21, 2022
    Cite
    Ornella Irrera; Ornella Irrera; Andrea Mannocci; Andrea Mannocci; Paolo Manghi; Paolo Manghi; Gianmaria Silvello; Gianmaria Silvello (2022). A Novel Curated Scholarly Graph Connecting Textual and Data Publications [Dataset]. http://doi.org/10.5281/zenodo.7464120
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 21, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Ornella Irrera; Ornella Irrera; Andrea Mannocci; Andrea Mannocci; Paolo Manghi; Paolo Manghi; Gianmaria Silvello; Gianmaria Silvello
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains an open and curated scholarly graph we built as a training and test set for data discovery, data connection, author disambiguation, and link prediction tasks. The graph represents the European Marine Science community included in the OpenAIRE Graph. The nodes represent publications, datasets, software, and authors; edges interconnecting research products always have the publication as source and the dataset/software as target. In addition, edges are labeled with semantics that outline whether the publication is referencing, citing, documenting, or supplementing the related outcome. To curate and enrich node metadata and edge semantics, we relied on information extracted from the PDFs of the publications and from the dataset/software webpages, respectively. We curated the authors so as to remove duplicated nodes representing the same person.

    The resource we release counts 4,047 publications, 5,488 datasets, 22 software, 21,561 authors, and 9,692 edges connecting publications to datasets/software. This graph is in the curated_MES folder. We provide this resource as:

    1. a property graph: we provide the dump that can be imported in neo4j
    2. 5 jsonl files containing publications, datasets, software, authors, and relationships respectively. Each line of a jsonl file is a JSON object representing a node (or a relationship) together with its metadata.
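
    As a minimal sketch of streaming one of the jsonl files (the file name is a placeholder for any of the five files in the curated_MES folder):

    import json

    records = []
    with open("curated_MES/publications.jsonl", encoding="utf-8") as fh:   # placeholder file name
        for line in fh:
            records.append(json.loads(line))   # one JSON object (node or relationship) per line

    print(len(records), "records loaded")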

    We provide two additional scholarly graphs:

    • The curated MES graph with the removed edges. During the curation we removed some edges because they were labeled with inconsistent or imprecise semantics. This graph includes the same nodes and edges as the previous one and, in addition, contains the edges removed during the curation pipeline; these edges are marked as Removed. This graph is in the curated_MES_with_removed_semantics folder.
    • The original MES community of OpenAIRE. It represents the MES community extracted from the OpenAIRE Research Graph. This graph has not been curated, and the metadata and semantics are those of the OpenAIRE Research Graph. This graph is in the original_MES_community folder.
