CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
There are several Microsoft Word documents here detailing data creation methods, along with various dictionaries describing the included and derived variables. The Database Creation Description is meant to walk a user through some of the steps detailed in the SAS code with this project. The alphabetical list of variables is intended for users who find it easier to copy and paste variable names from this list than to retype them during coding. The NIS Data Dictionary contains a general description of the dataset as well as each variable's responses.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset consists of 42,052 English words and their corresponding definitions. It is a comprehensive collection of words ranging from common terms to more obscure vocabulary. The dataset is ideal for Natural Language Processing (NLP) tasks, educational tools, and various language-related applications.
This dataset is well-suited for a range of use cases, including:
This data includes the location of cooling towers registered with New York State. The data is self-reported by owners/property managers of cooling towers in service in New York State. In August 2015, the New York State Department of Health released emergency regulations requiring the owners of cooling towers to register them with New York State. In addition, the regulation includes requirements for regular inspection; annual certification; obtaining and implementing a maintenance plan; record keeping; reporting of certain information; and sample collection and culture testing. All cooling towers in New York State, including New York City, need to be registered in the NYS system. Registration is done through an electronic database found at www.ny.gov/services/register-cooling-tower-and-submit-reports. For more information, see http://www.health.ny.gov/diseases/communicable/legionellosis/, or go to the “About” tab.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
The Pesticide Data Program (PDP) is a national pesticide residue database program. Through cooperation with State agriculture departments and other Federal agencies, PDP manages the collection, analysis, data entry, and reporting of pesticide residues on agricultural commodities in the U.S. food supply, with an emphasis on those commodities highly consumed by infants and children.
This dataset provides information on where each tested sample was collected, where the product originated from, what type of product it was, and what residues were found on the product, for calendar years 1992 through 2023. The data can measure residues of individual compounds and classes of compounds, as well as provide information about the geographic distribution of the origin of samples, from growers, packers and distributors. The dataset also includes information on where the samples were taken, what laboratory was used to test them, and all testing procedures (by sample, so can be linked to the compound that is identified). The dataset also contains a reference variable for each compound that denotes the limit of detection for a pesticide/commodity pair (LOD variable). The metadata also includes EPA tolerance levels or action levels for each pesticide/commodity pair. The dataset will be updated on a continual basis, with a new resource data file added annually after the PDP calendar-year survey data is released.
Resources in this dataset:
Resource Title: CSV Data Dictionary for PDP. File Name: PDP_DataDictionary.csv. Resource Description: Machine-readable Comma Separated Values (CSV) format data dictionary for PDP Database Zip files. Defines variables for the sample identity and analytical results data tables/files. The ## characters in the Table and Text Data File name refer to the 2-digit year for the PDP survey, like 97 for 1997 or 01 for 2001. For details on table linking, see PDF. Resource Software Recommended: Microsoft Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
Resource Title: Data dictionary for Pesticide Data Program. File Name: PDP DataDictionary.pdf. Resource Description: Data dictionary for PDP Database Zip files. Resource Software Recommended: Adobe Acrobat, url: https://www.adobe.com
Resource Title: 2023 PDP Database Zip File. File Name: 2023PDPDatabase.zip
Resource Title: 2022 PDP Database Zip File. File Name: 2022PDPDatabase.zip
Resource Title: 2021 PDP Database Zip File. File Name: 2021PDPDatabase.zip
Resource Title: 2020 PDP Database Zip File. File Name: 2020PDPDatabase.zip
Resource Title: 2019 PDP Database Zip File. File Name: 2019PDPDatabase.zip
Resource Title: 2018 PDP Database Zip File. File Name: 2018PDPDatabase.zip
Resource Title: 2017 PDP Database Zip File. File Name: 2017PDPDatabase.zip
Resource Title: 2016 PDP Database Zip File. File Name: 2016PDPDatabase.zip
Resource Title: 2015 PDP Database Zip File. File Name: 2015PDPDatabase.zip
Resource Title: 2014 PDP Database Zip File. File Name: 2014PDPDatabase.zip
Resource Title: 2013 PDP Database Zip File. File Name: 2013PDPDatabase.zip
Resource Title: 2012 PDP Database Zip File. File Name: 2012PDPDatabase.zip
Resource Title: 2011 PDP Database Zip File. File Name: 2011PDPDatabase.zip
Resource Title: 2010 PDP Database Zip File. File Name: 2010PDPDatabase.zip
Resource Title: 2009 PDP Database Zip File. File Name: 2009PDPDatabase.zip
Resource Title: 2008 PDP Database Zip File. File Name: 2008PDPDatabase.zip
Resource Title: 2007 PDP Database Zip File. File Name: 2007PDPDatabase.zip
Resource Title: 2006 PDP Database Zip File. File Name: 2006PDPDatabase.zip
Resource Title: 2005 PDP Database Zip File. File Name: 2005PDPDatabase.zip
Resource Title: 2004 PDP Database Zip File. File Name: 2004PDPDatabase.zip
Resource Title: 2003 PDP Database Zip File. File Name: 2003PDPDatabase.zip
Resource Title: 2002 PDP Database Zip File. File Name: 2002PDPDatabase.zip
Resource Title: 2001 PDP Database Zip File. File Name: 2001PDPDatabase.zip
Resource Title: 2000 PDP Database Zip File. File Name: 2000PDPDatabase.zip
Resource Title: 1999 PDP Database Zip File. File Name: 1999PDPDatabase.zip
Resource Title: 1998 PDP Database Zip File. File Name: 1998PDPDatabase.zip
Resource Title: 1997 PDP Database Zip File. File Name: 1997PDPDatabase.zip
Resource Title: 1996 PDP Database Zip File. File Name: 1996PDPDatabase.zip
Resource Title: 1995 PDP Database Zip File. File Name: 1995PDPDatabase.zip
Resource Title: 1994 PDP Database Zip File. File Name: 1994PDPDatabase.zip
Resource Title: 1993 PDP Database Zip File. File Name: 1993PDPDatabase.zip
Resource Title: 1992 PDP Database Zip File. File Name: 1992PDPDatabase.zip
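The naming conventions described above (a four-digit year in each zip file name, a two-digit year in the table and text file names) can be handled with a small helper. The following is only a sketch: the function names are hypothetical, and the rule for expanding two-digit years (92 to 99 map to the 1990s, everything else to the 2000s) is an assumption inferred from the stated 1992 through 2023 coverage, not a rule published by PDP.

```python
FIRST_YEAR = 1992  # first survey year listed in the resources above
LAST_YEAR = 2023   # most recent survey year listed above


def pdp_zip_name(year: int) -> str:
    """Build the zip file name for a survey year, following the listed pattern."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        raise ValueError(f"no PDP zip file listed for {year}")
    return f"{year}PDPDatabase.zip"


def expand_two_digit_year(yy: str) -> int:
    """Expand the ## placeholder (e.g. '97', '01') to a four-digit year.

    Assumption: values 92-99 belong to the 1990s, all others to the 2000s.
    """
    value = int(yy)
    if not 0 <= value <= 99:
        raise ValueError(f"expected a two-digit year, got {yy!r}")
    year = 1900 + value if value >= 92 else 2000 + value
    if year > LAST_YEAR:
        raise ValueError(f"{yy!r} is outside the listed survey years")
    return year


if __name__ == "__main__":
    print(pdp_zip_name(expand_two_digit_year("97")))
    print(pdp_zip_name(expand_two_digit_year("01")))
```

The range checks are deliberate: a two-digit value that falls outside the listed survey years raises an error rather than silently producing a file name that does not exist.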
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
This is Oxford University Press's most comprehensive single-volume dictionary, with 170,000 entries covering all varieties of English worldwide. The NODE data set constitutes a fully integrated range of formal data types suitable for language engineering and NLP applications. It is available in XML or SGML.
- Source dictionary data. The NODE data set includes all the information present in the New Oxford Dictionary of English itself, such as definition text, example sentences, grammatical indicators, and encyclopaedic material.
- Morphological data. Each NODE lemma (both headwords and subentries) has a full listing of all possible syntactic forms (e.g. plurals for nouns, inflections for verbs, comparatives and superlatives for adjectives), tagged to show their syntactic relationships. Each form has an IPA pronunciation. Full morphological data is also given for spelling variants (e.g. typical American variants), and a system of links enables straightforward correlation of variant forms to standard forms. The data set thus provides robust support for all look-up routines, and is equally viable for applications dealing with American and British English.
- Phrases and idioms. The NODE data set provides a rich and flexible codification of over 10,000 phrasal verbs and other multi-word phrases. It features comprehensive lexical resources enabling applications to identify a phrase not only in the form listed in the dictionary but also in a range of real-world variations, including alternative wording, variable syntactic patterns, inflected verbs, optional determiners, etc.
- Subject classification. Using a categorization scheme of 200 key domains, over 80,000 words and senses have been associated with particular subject areas, from aeronautics to zoology. As well as facilitating the extraction of subject-specific sub-lexicons, this also provides an extensive resource for document categorization and information retrieval.
- Semantic relationships. The relationships between every noun and noun sense in the dictionary are being codified using an extensive semantic taxonomy on the model of the Princeton WordNet project. (Mapping to WordNet 1.7 is supported.) This structure allows elements of the basic lexical database to function as a formal knowledge database, enabling functionality such as sense disambiguation and logical inference.
Derived from the detailed and authoritative corpus-based research of Oxford University Press's lexicographic team, the NODE data set is a powerful asset for any task dealing with real-world contemporary English usage. By integrating a number of different data types into a single structure, it creates a coherent resource which can be queried along numerous axes, allowing open-ended exploitation by many kinds of language-related applications.
Each feature within this dataset is the authoritative representation of the location of a sample within the U.S. Department of Energy (DOE) Office of Legacy Management (LM) Environmental Database. The dataset includes sample locations from Puerto Rico to Alaska, with point features representing different types of sample locations such as boreholes, wells, geoprobes, etc. All sample locations are maintained within the LM Environmental Database, with feature attributes defined within the associated data dictionary.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We study the behaviour and cognition of wild apes and other species (elephants, corvids, dogs). Our video archive is called the Great Ape Dictionary; you can find out more at www.greatapedictionary.com, or about our lab group at www.wildminds.ac.uk. We consider these videos to be a data ark that we would like to make as accessible as possible. While we are unable to make the original video files open access at the present time, you can search this database to explore what is available, and then request access for collaborations of different kinds by contacting us directly or through our website. We label all videos in the Great Ape Dictionary video archive with basic metadata on the location, date, duration, individuals present, and behaviour present. Version 1.0.0 contains current data from the Budongo East African chimpanzee population (n=13806 videos). These datasets are being updated regularly and new data will be incorporated here with versioning. Alongside the database there is a second read.me file, which contains the ethograms used for each variable coded and a short summary of other datasets that are in preparation for subsequent version(s). If you are interested in these data please contact us. Please note that not all variables are labelled for all videos; the detailed Ethogram categories are only available for a subset of data. All videos are labelled with up to 5 Contexts (at least one, rarely 5). If you are interested in finding a good example video for a particular behaviour, search for 'Library' = Y; this indicates that the clip contains a very clear example of the behaviour.
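The 'Library' = Y search mentioned above can be sketched in a few lines of Python. This is an illustration only: it assumes the database has been exported as CSV, and apart from 'Library' every column name and every row below is a made-up placeholder rather than actual archive content.

```python
import csv
import io


def library_clips(rows):
    """Keep only rows flagged as clear example clips ('Library' = Y)."""
    return [
        row for row in rows
        if (row.get("Library") or "").strip().upper() == "Y"
    ]


# Placeholder rows for illustration; not real entries from the archive.
SAMPLE = """Video,Context1,Library
clip_001,Feeding,Y
clip_002,Travel,N
clip_003,Grooming,Y
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
examples = library_clips(rows)
print([row["Video"] for row in examples])
```

To run the same filter on a real export, replace the `io.StringIO(SAMPLE)` source with an opened file handle.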
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The first multi-modal Steam dataset with semantic search capabilities. 239,664 applications collected from official Steam Web APIs with PostgreSQL database architecture, vector embeddings for content discovery, and comprehensive review analytics.
Made by a lifelong gamer for the gamer in all of us. Enjoy!🎮
GitHub Repository https://github.com/vintagedon/steam-dataset-2025
1024-dimensional game embeddings projected to 2D via UMAP reveal natural genre clustering in semantic space
Unlike traditional flat-file Steam datasets, this is built as an analytically-native database optimized for advanced data science workflows:
☑️ Semantic Search Ready - 1024-dimensional BGE-M3 embeddings enable content-based game discovery beyond keyword matching
☑️ Multi-Modal Architecture - PostgreSQL + JSONB + pgvector in unified database structure
☑️ Production Scale - 239K applications vs typical 6K-27K in existing datasets
☑️ Complete Review Corpus - 1,048,148 user reviews with sentiment and metadata
☑️ 28-Year Coverage - Platform evolution from 1997-2025
☑️ Publisher Networks - Developer and publisher relationship data for graph analysis
☑️ Complete Methodology & Infrastructure - Full work logs document every technical decision and challenge encountered, while my API collection scripts, database schemas, and processing pipelines enable you to update the dataset, fork it for customized analysis, learn from real-world data engineering workflows, or critique and improve the methodology
Market segmentation and pricing strategy analysis across top 10 genres
Core Data (CSV Exports):
- 239,664 Steam applications with complete metadata
- 1,048,148 user reviews with scores and statistics
- 13 normalized relational tables for pandas/SQL workflows
- Genre classifications, pricing history, platform support
- Hardware requirements (min/recommended specs)
- Developer and publisher portfolios
Advanced Features (PostgreSQL):
- Full database dump with optimized indexes
- JSONB storage preserving complete API responses
- Materialized columns for sub-second query performance
- Vector embeddings table (pgvector-ready)
Documentation:
- Complete data dictionary with field specifications
- Database schema documentation
- Collection methodology and validation reports
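To make the idea of embedding-based discovery concrete, here is a minimal plain-Python sketch of ranking items by cosine similarity. It is not the dataset's own pipeline: in the database itself, similarity queries would presumably run inside PostgreSQL through pgvector, and the use of cosine similarity is an assumption. The three-dimensional vectors and game names are toy placeholders standing in for the 1024-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, catalog, k=2):
    """Return the k (name, score) pairs closest to the query vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy placeholder embeddings (hypothetical names, 3 dimensions instead of 1024).
catalog = {
    "game_a": [1.0, 0.0, 0.0],
    "game_b": [0.9, 0.1, 0.0],
    "game_c": [0.0, 0.0, 1.0],
}

top = most_similar([1.0, 0.0, 0.0], catalog, k=2)
print(top)
```

A linear scan like this is fine for a handful of vectors; at the scale of the full catalog, an indexed search such as the one pgvector provides is the more practical route.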
Three comprehensive analysis notebooks demonstrate dataset capabilities. All notebooks render directly on GitHub with full visualizations and output:
View on GitHub | PDF Export
28 years of Steam's growth, genre evolution, and pricing strategies.
View on GitHub | PDF Export
Content-based recommendations using vector embeddings across genre boundaries.
View on GitHub | PDF Export
Genre prediction from game descriptions - demonstrates text analysis capabilities.
Notebooks render with full output on GitHub. Kaggle-native versions planned for v1.1 release. CSV data exports included in dataset for immediate analysis.
*Steam platfor...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistics of the sample of 6567 words from the website database.
During hydrocarbon production, water is typically co-produced from the geologic formations producing oil and gas. Understanding the composition of these produced waters is important to help investigate the regional hydrogeology, the source of the water, the efficacy of water treatment and disposal plans, potential economic benefits of mineral commodities in the fluids, and the safety of potential sources of drinking or agricultural water. In addition to waters co-produced with hydrocarbons, geothermal development or exploration brings deep formation waters to the surface for possible sampling. This U.S. Geological Survey (USGS) Produced Waters Geochemical Database, which contains geochemical and other information for 114,943 produced water and other deep formation water samples of the United States, is a provisional, updated version of the 2002 USGS Produced Waters Database (Breit and others, 2002). In addition to the major element data presented in the original, the new database contains trace elements, isotopes, and time-series data, as well as nearly 100,000 additional samples that provide greater spatial coverage from both conventional and unconventional reservoir types, including geothermal. The database is a compilation of 40 individual databases, publications, or reports. The database was built to facilitate the addition of new data and the correction of any compilation errors, and is expected to be updated over time with new data as provided and needed. Table 1, USGSPWDBv2.3 Data Sources.csv, shows the abbreviated ID of each input database (IDDB), the number of samples from each, and its reference. Table 2, USGSPWDBv2.3 Data Dictionary.csv, defines the 190 variables contained in the database and their descriptions. The database variables are organized first with identification and location information, followed by well descriptions, dates, rock properties, physical properties of the water, and then chemistry.
The chemistry is organized alphabetically by elemental symbol. Each element is followed by any associated compounds (e.g. H2S is found after S). After Zr, molecules containing carbon, organic compounds, and dissolved gases follow. Isotopic data are found at the end of the dataset, just before the culling parameters.
https://www.nist.gov/open/license
The data here comprise and describe a portable, open-source structured query language (SQLite) database containing high resolution mass spectrometry data (MS1 and MS2) for per- and polyfluorinated alkyl substances (PFAS) and associated metadata regarding their measurement techniques, quality assurance metrics, and the samples from which they were produced. These data are stored in a format adhering to the Database Infrastructure for Mass Spectrometry (DIMSpec) project. That project produces and uses databases like this one, providing a complete toolkit for non-targeted analysis. See more information about the full DIMSpec code base, as well as these data for demonstration purposes, at GitHub (https://github.com/usnistgov/dimspec) or view the full User Guide for DIMSpec (https://pages.nist.gov/dimspec/docs). Files of most interest contained here include the database file itself (dimspec_nist_pfas.sqlite) as well as an entity relationship diagram (ERD.png) and data dictionary (DIMSpec for PFAS_1.0.1.20230615_data_dictionary.json) to elucidate the database structure and assist in interpretation and use.
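Because the database is a standard SQLite file, a first look at its structure needs nothing beyond Python's built-in `sqlite3` module. The sketch below lists table names from the `sqlite_master` catalog; it runs against a throwaway in-memory database with hypothetical table names, since the actual DIMSpec schema is documented in the ERD and data dictionary rather than reproduced here.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in a SQLite database, sorted by name."""
    cursor = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in cursor.fetchall()]


# Demonstration on an in-memory database; these table names are placeholders,
# not the real DIMSpec schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_samples (id INTEGER PRIMARY KEY, label TEXT)")
conn.execute("CREATE TABLE demo_peaks (id INTEGER PRIMARY KEY, sample_id INTEGER)")
tables = list_tables(conn)
conn.close()
print(tables)
```

To inspect the published file instead, open it with `sqlite3.connect("dimspec_nist_pfas.sqlite")` (assuming the file sits in the working directory) and pass that connection to `list_tables`.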
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScDC Word-Category RIG Matrix
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes
Getting Started
This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1] and the procedure to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) along with its construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word). Its value shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive is presented with two additional columns: the sum of RIGs in categories and the maximum of RIGs over categories (last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.
This matrix is created to be used in future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that the meaning can be estimated by information gains from word to categories. LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We order the words of the LScDC by the sum of their RIGs in categories. That is, words are arranged by their informativeness in the scientific corpus LSC. The meaningfulness of words is therefore evaluated by the words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus.
Words as a Vector of Frequencies in WoS Categories
Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category. It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus.
The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and presented in the published archive with this file. The value of each entry in the table shows how many times a word of the LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category.
Words as a Vector of Relative Information Gains Extracted for Categories
In this section, we introduce our approach to representing a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it provides about categories. For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider LSC as a probabilistic sample space (the space of equally probable elementary outcomes).
For the Boolean random variables, the joint probability distribution, the entropy and information gains are defined.
The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG), which provides a normalised measure of the Information Gain. This makes it possible to compare information gains for different categories. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the published archive.
Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains, and the dimension of the vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories, and a column vector represents the RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words in each category, words can be ordered by two criteria: sum and maximum of RIGs in categories. The top n words in this list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.
RIGs for each word of the LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC.
For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and added at the end of the matrix (last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.
Leicester Scientific Thesaurus (LScT)
The Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, the meaningfulness of words is evaluated by the words’ average informativeness in the categories, and the list of these words is considered as a ‘thesaurus’ for science. The LScT with the value of the sum can be found as a CSV file in the published archive.
The published archive contains the following files:
1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories, the sum (S) and the maximum (M) of RIGs in categories (last two columns of the matrix), and rows are words of LScDC. Each entry in the first 252 columns is RIG from the word to the category. Words are ordered as in the LScDC.
2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.
3) LScT.csv: List of words of LScT with sum (S) values.
4) Text_No_in_Cat.csv: The number of texts in categories.
5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.
6) README.txt: Description of Word-Category RIG Matrix, Word-Category Frequency Matrix and LScT and forming procedures.
7) README.pdf (same as 6 in PDF format)
References
[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858.
[6] Shannon, C. E. (1948). A mathematical theory of communication. Bell system technical journal, 27(3), 379-423.
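The RIG computation and the thesaurus selection described in this entry can be sketched from text counts. The exact formulas are in the README of the published archive and are not reproduced here, so this sketch assumes the standard definitions: information gain IG = H(C) - H(C|W) for the Boolean category variable C and word variable W, and RIG = IG / H(C), with logarithms to base 2. The function names and the toy numbers are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def relative_information_gain(n_total, n_cat, n_word, n_both):
    """RIG about a category from a word, computed from text counts.

    n_total: texts in the corpus; n_cat: texts in the category;
    n_word: texts containing the word; n_both: texts in the category
    that contain the word. Assumes RIG = (H(C) - H(C|W)) / H(C).
    """
    p_cat = n_cat / n_total
    h_cat = entropy([p_cat, 1 - p_cat])
    if h_cat == 0:
        return 0.0  # category is empty or universal: nothing to gain
    h_cond = 0.0
    # Condition on the word being present, then on it being absent.
    for n_w, n_cw in ((n_word, n_both), (n_total - n_word, n_cat - n_both)):
        if n_w == 0:
            continue
        p_c_given_w = n_cw / n_w
        h_cond += (n_w / n_total) * entropy([p_c_given_w, 1 - p_c_given_w])
    return (h_cat - h_cond) / h_cat


def top_words_by_sum(rig_rows, n):
    """Order words by the sum (S) of their RIGs, descending, and keep the top n."""
    sums = {word: sum(values) for word, values in rig_rows.items()}
    return sorted(sums, key=sums.get, reverse=True)[:n]


if __name__ == "__main__":
    # Toy counts: a word that appears in exactly the texts of the category.
    print(relative_information_gain(100, 50, 50, 50))
    # Toy RIG rows over two categories.
    print(top_words_by_sum({"alpha": [0.1, 0.2], "beta": [0.4, 0.0]}, 1))
```

With these definitions, a word that occurs in exactly the texts of a category has RIG 1 for that category, and a word whose presence is independent of the category has RIG 0, which is what makes values comparable across categories of different sizes.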
https://creativecommons.org/publicdomain/zero/1.0/
WordNet Thesaurus with the various files in CSV Format. You can use the definitions in any commercial project, but you must include the license below (link to the license page also provided). The CSV files I created for these WordNet entries contain each Word, Character Count, Definition, Part of Speech (where available) and Examples/Terms (where available).
This collection of the WordNet Thesaurus includes the following CSV files:
Here is the WordNet license, in their own words, as obtained on this page (SEE THE LICENSE PAGE HERE):
WordNet Release 3.0 This software and database is being provided to you, the LICENSEE, by Princeton University under the following license. By obtaining, using and/or copying this software and database, you agree that you have read, understood, and will comply with these terms and conditions.: Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty is hereby granted, provided that you agree to comply with the following copyright notice and statements, including the disclaimer, and that the same appear on ALL copies of the software, database and documentation, including modifications that you make for internal use or for distribution. WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved. THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS. The name of Princeton University or Princeton may not be used in advertising or publicity pertaining to distribution of the software and/or database. Title to copyright in this software, database and any associated documentation shall at all times remain with Princeton University and LICENSEE agrees to preserve same.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The SSURGO database contains information about soil as collected by the National Cooperative Soil Survey over the course of a century. The information can be displayed in tables or as maps and is available for most areas in the United States and the Territories, Commonwealths, and Island Nations served by the USDA-NRCS (Natural Resources Conservation Service). The information was gathered by walking over the land and observing the soil. Many soil samples were analyzed in laboratories. The maps outline areas called map units. The map units describe soils and other components that have unique properties, interpretations, and productivity. The information was collected at scales ranging from 1:12,000 to 1:63,360. More details were gathered at a scale of 1:12,000 than at a scale of 1:63,360. The mapping is intended for natural resource planning and management by landowners, townships, and counties. Some knowledge of soils data and map scale is necessary to avoid misunderstandings. The maps are linked in the database to information about the component soils and their properties for each map unit. Each map unit may contain one to three major components and some minor components. The map units are typically named for the major components. Examples of information available from the database include available water capacity, soil reaction, electrical conductivity, and frequency of flooding; yields for cropland, woodland, rangeland, and pastureland; and limitations affecting recreational development, building site development, and other engineering uses. SSURGO datasets consist of map data, tabular data, and information about how the maps and tables were created. The extent of a SSURGO dataset is a soil survey area, which may consist of a single county, multiple counties, or parts of multiple counties. SSURGO map data can be viewed in the Web Soil Survey or downloaded in ESRI® Shapefile format. The coordinate systems are Geographic. 
Attribute data can be downloaded in text format that can be imported into a Microsoft® Access® database. A complete SSURGO dataset consists of:
- GIS data (as ESRI® Shapefiles)
- attribute data (dbf files - a multitude of separate tables)
- database template (MS Access format - this helps with understanding the structure and linkages of the various tables)
- metadata
Resources in this dataset:

Resource Title: SSURGO Metadata - Tables and Columns Report.
File Name: SSURGO_Metadata_-_Tables_and_Columns.pdf
Resource Description: This report contains a complete listing of all columns in each database table. Please see SSURGO Metadata - Table Column Descriptions Report for more detailed descriptions of each column. Find the Soil Survey Geographic (SSURGO) web site at https://www.nrcs.usda.gov/wps/portal/nrcs/detail/vt/soils/?cid=nrcs142p2_010596#Datamart

Resource Title: SSURGO Metadata - Table Column Descriptions Report.
File Name: SSURGO_Metadata_-_Table_Column_Descriptions.pdf
Resource Description: This report contains the descriptions of all columns in each database table. Please see SSURGO Metadata - Tables and Columns Report for a complete listing of all columns in each database table. Find the Soil Survey Geographic (SSURGO) web site at https://www.nrcs.usda.gov/wps/portal/nrcs/detail/vt/soils/?cid=nrcs142p2_010596#Datamart

Resource Title: SSURGO Data Dictionary.
File Name: SSURGO 2.3.2 Data Dictionary.csv
Resource Description: CSV version of the data dictionary
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Data Description: This data set contains all City of Cincinnati expenses by object code and day. The object code is the descriptor explaining the nature of the expense (personnel, overtime, office supplies, etc.).
Data Creation: This data is pulled directly from the City's financial software, which centralizes all department financial transactions citywide.
Data Created By: The Cincinnati Financial System (CFS)
Refresh Frequency: Daily
CincyInsights: The City of Cincinnati maintains an interactive dashboard portal, CincyInsights, in addition to our Open Data, in an effort to increase access to and usage of city data. This data set has an associated dashboard available here: https://insights.cincinnati-oh.gov/stories/s/City-Spending/cuw9-nu34/
Data Dictionary: A data dictionary providing definitions of columns and attributes is available as an attachment to this dataset.
Processing: The City of Cincinnati is committed to providing the most granular and accurate data possible. In that pursuit, the Office of Performance and Data Analytics applies standard processing to most raw data prior to publication. Processing includes, but is not limited to, address verification, geocoding, decoding attributes, and the addition of administrative areas (e.g. Census, neighborhoods, police districts).
Data Usage: For directions on downloading and using open data please visit our How-to Guide: https://data.cincinnati-oh.gov/dataset/Open-Data-How-To-Guide/gdr9-g3ad
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications.
See the Splitgraph documentation for more information.
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
The GlobalPhone pronunciation dictionaries, created within the framework of the multilingual speech and language corpus GlobalPhone, were developed in collaboration with the Karlsruhe Institute of Technology (KIT).
The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The pronunciation dictionaries are currently available in 20 languages: Arabic (29230 entries/27059 words), Bulgarian (20193 entries), Chinese-Mandarin (73388 pronunciations), Croatian (23497 entries/20628 words), Czech (33049 entries/32942 words), French (36837 entries/20710 words), German (48979 entries/46035 words), Hausa (42662 entries/42079 words), Japanese (18094 entries), Korean (3500 entries syllable-based, 97493 entries/81602 words word-based), Polish (36484 entries), Portuguese (Brazilian) (58803 entries/58787 words), Russian (28818 entries/27667 words), Spanish (Latin American) (43264 entries/33960 words), Swahili (10664 entries), Swedish (25401 entries/25356 words), Thai (small set with 12420 entries and larger set with 25570 entries/22462 words), Turkish (31330 entries/31087 words), Ukrainian (7748 entries/7740 words), and Vietnamese (38504 entries/29974 words).
1) Dictionary Encoding: The GlobalPhone pronunciation dictionary entries consist of full word forms and are given either in the original script of the language, mostly in UTF-8 encoding (Bulgarian, Chinese-Mandarin, Croatian, Czech, French, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Swahili, Turkish, Thai, Ukrainian, Vietnamese), corresponding to the trl-files of the GlobalPhone transcriptions, or in Romanized versions in ASCII/ISO-8859 encoding to fit the rmn-files of the GlobalPhone transcriptions (Arabic, German, Hausa (simplified boko), Swedish). In some languages both versions exist. Romanization was performed by reversible mappings, which are documented in most cases. Furthermore, in several languages alternative versions are available, e.g. Chinese-Mandarin is provided both in UTF-8 as a Hanzi character-based dictionary (trl) and in ASCII as a Pinyin version (rmn); Korean is provided both in UTF-8 as an Eojeol- and Hangul-based dictionary (trl) and in ASCII as a Romanized version in which a data-driven algorithm was applied to merge syllable units into a reasonable set of word-like units (rmn).
2) Dictionary Phone set: The phone sets for each language were derived individually from the literature following best practices for automatic speech processing. Each phone set is explained and described in the documentation using the international standards of the International Phonetic Alphabet (IPA). A language independent GlobalPhone naming convention for the phone sets is used (indicated by “M_”) to support the sharing of phones across languages to build multilingual pronunciation dictionaries or acoustic models. For historical reasons, some dictionaries still use language dependent phone names. For most of those dictionaries, the documentation provides a mapping to the GlobalPhone phone names.
3) Dictionary Generation: Whenever the grapheme-to-phoneme relationship allowed, the dictionaries were created semi-automatically. In the first step, handcrafted grapheme-to-phoneme rules were applied to generate initial pronunciations from all word forms appearing in the GlobalPhone transcriptions. The number of rules depends strongly on the language. In the second step, the generated pronunciations were manually checked by native speakers, who corrected potential errors of the automatic pronunciation generation process. In the third step, most of the dictionaries were enriched with special entries such as acronyms, foreign words, pronunciation variants, numbers, or partial words, and cross-checked by the native speakers. Most of the dictionaries have been applied to large vocabulary speech recognition. In many cases the GlobalPhone dictionaries were compared to straightforward grapheme-based speech recognition and to alternative sources, such as Wiktionary, and were usually shown to be superior in terms of quality, coverage, and accuracy.
4) Format: The format of the dictionaries is the same across languages and is straightforward. Each line consists of one word form and its pronunciation, separated by a blank. The pronunciation consists of a concatenation of phone symbols separated by blanks. Both words and their pronunciations are given in tcl-script list format, i.e. enclosed in “{}”, since phones can carry tags indicating the tone, length or stress of a vowel, the palatalization of consonants, or the word boundary tag “WB”, which indicates the boundary of a dictionary unit. The WB tag can, for example, be included as a standard question in the decision tree questions for capturing crossword models in context-dependent modeling. Pronunciation variants are indicated by (
5) Documentation: The pronunciation dictionaries for each language are complemented by documentation that describes the format of the dictionary, the phone set including its mapping to the International Phonetic Alphabet (IPA), and the frequency distribution of the phones in the dictionary. Most of the pronunciation dictionaries have been successfully applied to large vocabulary speech recognition. Experimental results and general information about the GlobalPhone corpus have been published widely in conference and journal papers and are partially referenced in the documentation.
A good summary of the pronunciation dictionaries is provided in: Tanja Schultz and Tim Schlippe (2014) GlobalPhone: Pronunciation Dictionaries in 20 Languages, Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC), Reykjavik, Iceland, 2014.
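The format described under 4) Format can be read with a small brace-aware parser. The following Python sketch is only an illustration: the sample entry, the phone names in it, and the function names `parse_tcl_list` and `parse_entry` are assumptions made up for this example and are not taken from any GlobalPhone dictionary or tool. It does not handle the pronunciation-variant marker, whose exact form is not given above.

```python
def parse_tcl_list(text):
    """Parse a brace-delimited, blank-separated list into nested Python lists."""
    pos = 0

    def parse_items(depth):
        nonlocal pos
        items, token = [], ""
        while pos < len(text):
            ch = text[pos]
            pos += 1
            if ch in "{}" or ch.isspace():
                if token:  # any delimiter ends the current token
                    items.append(token)
                    token = ""
                if ch == "{":
                    items.append(parse_items(depth + 1))
                elif ch == "}":
                    if depth == 0:
                        raise ValueError("unbalanced closing brace")
                    return items
            else:
                token += ch
        if depth > 0:
            raise ValueError("unbalanced opening brace")
        if token:
            items.append(token)
        return items

    return parse_items(0)


def parse_entry(line):
    """Split one dictionary line into (word, [(phone, [tags, ...]), ...])."""
    word_part, pron_part = parse_tcl_list(line)
    word = word_part if isinstance(word_part, str) else " ".join(word_part)
    if isinstance(pron_part, str):  # a single unbraced phone
        pron_part = [pron_part]
    phones = []
    for item in pron_part:
        if isinstance(item, list):  # tagged phone such as {M_a WB}
            phones.append((item[0], item[1:]))
        else:  # plain phone symbol without tags
            phones.append((item, []))
    return word, phones


# Hypothetical entry for illustration only, not copied from a real dictionary.
sample = "{abend} {{M_a WB} M_b M_e M_n {M_t WB}}"
word, phones = parse_entry(sample)
print(word, phones)
```

Keeping the tags as a separate list per phone makes it easy to filter on the word boundary tag “WB” or to strip tags when only the bare phone sequence is needed.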
Many websites operate using the data in the IANA tz database. "What Is Daylight Saving Time" from timeanddate.com is a good place to start to find interesting information about time zones, such as the strange case of Lord Howe Island, Australia.
transitions.csv: Changes in the conversion of a given time zone to UTC (for example, for daylight savings or because the definition of the time zone changed).
| variable | class | description |
|---|---|---|
| zone | character | The name of the time zone. |
| begin | character | When this definition went into effect, in UTC. Tip: convert to a datetime using lubridate::as_datetime(). |
| end | character | When this definition ended (and the next definition went into effect), in UTC. Tip: convert to a datetime using lubridate::as_datetime(). |
| offset | double | The offset of this time zone from UTC, in seconds. |
| dst | logical | Whether daylight savings time is active within this definition. |
| abbreviation | character | The time zone abbreviation in use throughout this begin to end range. |
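As a rough illustration of how the columns above fit together, the Python sketch below converts `begin` and `end` to datetimes and applies `offset` (in seconds) to a UTC instant. The two sample rows, the assumed timestamp format (ISO 8601 with a trailing `Z`), the `TRUE`/`FALSE` spelling of `dst`, and the helper names are assumptions for this example; check them against the actual file before relying on them.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Illustrative rows only; values and formats are assumptions, not copied from transitions.csv.
SAMPLE = """zone,begin,end,offset,dst,abbreviation
Australia/Lord_Howe,2023-04-01T15:00:00Z,2023-09-30T15:30:00Z,37800,FALSE,+1030
Australia/Lord_Howe,2023-09-30T15:30:00Z,2024-04-06T15:00:00Z,39600,TRUE,+11
"""


def parse_utc(stamp):
    """Parse an ISO 8601 UTC timestamp with a trailing 'Z' (assumed format)."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def load_transitions(text):
    """Read transition rows and convert each column to a typed value."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "zone": row["zone"],
            "begin": parse_utc(row["begin"]),
            "end": parse_utc(row["end"]),
            "offset": timedelta(seconds=float(row["offset"])),
            "dst": row["dst"].strip().upper() == "TRUE",
            "abbreviation": row["abbreviation"],
        })
    return rows


def local_time(rows, zone, instant_utc):
    """Return the wall-clock time in `zone` at a UTC instant, or None if no row covers it."""
    for r in rows:
        if r["zone"] == zone and r["begin"] <= instant_utc < r["end"]:
            return (instant_utc + r["offset"]).replace(tzinfo=None)
    return None


rows = load_transitions(SAMPLE)
winter = local_time(rows, "Australia/Lord_Howe",
                    datetime(2023, 6, 1, 0, 0, tzinfo=timezone.utc))
print(winter)
```

The half-open comparison `begin <= instant < end` follows the table's description of `end` as the moment the next definition goes into effect.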
timezones.csv: Descriptions of time zones from the IANA time zone database.
| variable | class | description |
|---|---|---|
| zone | character | The name of the time zone. |
| latitude | double | Latitude of the time zone's "principal location." |
| longitude | double | Longitude of the time zone's "principal location." |
| comments | character | Comments from the tzdb definition file. |
timezone_countries.csv: Countries (or other place names) that overlap with each time zone.
| variable | class | description |
|---|---|---|
| zone | character | The name of the time zone. |
| country_code | character | The ISO 3166-1 alpha-2 2-character country code. |
countries.csv: Names of countries and other places.
| variable | class | description |
|---|---|---|
| country_code | character | The ISO 3166-1 alpha-2 2-character country code. |
| place_name | character | The usual English name for the coded region, chosen so that alphabetic sorting of subsets produces helpful lists. This is not the same as the English name in the ISO 3166 tables. |
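The `country_code` column links timezone_countries.csv to countries.csv. A minimal Python sketch of that lookup follows; the sample rows and the function name `places_by_zone` are assumptions for illustration, not contents of the real files.

```python
import csv
import io

# Illustrative rows only; not copied from the real files.
TIMEZONE_COUNTRIES = """zone,country_code
Australia/Lord_Howe,AU
Europe/Paris,FR
"""

COUNTRIES = """country_code,place_name
AU,Australia
FR,France
"""


def places_by_zone(tz_countries_text, countries_text):
    """Map each zone to the place names of the countries it overlaps."""
    names = {
        row["country_code"]: row["place_name"]
        for row in csv.DictReader(io.StringIO(countries_text))
    }
    result = {}
    for row in csv.DictReader(io.StringIO(tz_countries_text)):
        # Fall back to the raw code when a country is missing from countries.csv.
        place = names.get(row["country_code"], row["country_code"])
        result.setdefault(row["zone"], []).append(place)
    return result


lookup = places_by_zone(TIMEZONE_COUNTRIES, COUNTRIES)
print(lookup)
```

A list per zone is used because, as described above, more than one country can overlap a single time zone.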
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PETROG, AGSO's Petrography Database, is a relational computer database of petrographic data obtained from microscopic examination of thin sections of rock samples. The database is designed for petrographic descriptions of crystalline igneous and metamorphic rocks, and also for sedimentary petrography. A variety of attributes pertaining to thin sections can be recorded, as can the volume proportions of component minerals, clasts and matrix.
PETROG is one of a family of field and laboratory databases that include mineral deposits, regolith, rock chemistry, geochronology, stream-sediment geochemistry, geophysical rock properties and ground spectral properties for remote sensing. All these databases rely on a central Field Database for information on geographic location, outcrops and rock samples. PETROG depends, in particular, on the Field Database's SITES and ROCKS tables, as well as a number of lookup tables of standard terms. ROCKMINSITES, a flat view of PETROG's tables combined with the SITES and ROCKS tables, allows thin-section and mineral data to be accessed from geographic information systems and plotted on maps.
This guide presents an overview of PETROG's infrastructure and describes in detail the menus and screen forms used to input and view the data. In particular, the definitions of most fields in the database are given in some depth under descriptions of the screen forms - providing, in effect, a comprehensive data dictionary of the database. The database schema, with all definitions of tables, views and indexes is contained in an appendix to the guide.
This CD-ROM (Compact Disk - Read Only Memory) contains sidescan sonar, high-resolution seismic-reflection, bathymetric, textural, and bibliographic data and interpretations collected, compiled, and produced through the U.S. Geological Survey/State of Connecticut Cooperative and the Long Island Sound Environmental Studies Project of the Coastal and Marine Geology Program, U.S. Geological Survey, from October 1991 to August 1998. Cooperative research with the State of Connecticut was initiated in 1982. During the initial phase of this cooperative program, geologic framework studies in Long Island Sound were completed. The second and current phase of the program, which is the focus of this CD-ROM, emphasizes studies of sediment distribution, processes that control sediment distribution, near-shore environmental concerns, and the relationship of benthic communities to sea-floor geology. The study area covers all of Long Island Sound, which is bordered on the north by the rocky shoreline of Connecticut, on the east by Block Island Sound, on the south by the eroding sandy bluffs of Long Island, and on the west by the East River and the New York metropolitan area. Sidescan sonar data were variously collected with 100 kHz Klein, Datasonics, and Edgetech systems under two survey schemes. In the first scheme, the data were collected along closely spaced grids where the ship tracks were spaced 150 m apart and the sonar system was set to sweep 100 m to either side of the ship's track. This scheme produced the continuous-coverage acoustic images that are stored on the CD-ROM as TIF files. In the second scheme, the sidescan sonar data were collected along reconnaissance lines spaced about 2,400 m apart. Only selected portions of these data, those used for geologic interpretation, are stored on this CD-ROM. Under both survey schemes, the sidescan sonar data were processed according to procedures summarized by Danforth and others (1991) and Paskevich (1992a, 1992b, 1992c).
The seismic reflection data were variously collected with an Ocean Research Equipment 3.5-kHz profiler transmitting at a 0.25-s repetition rate and a Datasonics CHIRP system set to sweep between 2-7 kHz. Only selected seismic-reflection data, which are used as examples in geologic interpretations, are stored as GIF-formatted images on this CD-ROM. Navigation during this project was determined with a differential Global Positioning System (GPS); position data were logged at 10-second intervals. The bathymetric data were collected by means of a 200-kHz echo sounder and logged digitally. Surficial sediment (0-2 cm below the sediment-water interface) sampling completed as part of this project was conducted using a Van Veen grab sampler equipped with an Osprey video and still camera system. The photographic system was used to appraise bottom variability around stations, faunal communities, and sedimentary processes. It also documented bedrock outcrops and boulder fields where samples could not be collected. The fine fraction (less than 62 microns) was analyzed by Coulter Counter (Shideler, 1976); the coarse fraction was analyzed by sieving (gravel) and by rapid sediment analyzer (sand; Schlee, 1966). The data were corrected for the salt content of interstitial water. Size classifications are based on the method proposed by Wentworth (1929) and were calculated using the inclusive graphics statistical method (Folk, 1974), using the nomenclature proposed by Shepard (1954). A detailed discussion of the sedimentological methods employed is given in Poppe and others (1985); a detailed description of the methods used to perform the CHN analyses is given in Poppe and others (1996).
The database presented here contains over 14,000 records and 83 fields (see the Data Dictionary below). The specific fields and parameters have been chosen based on the data produced by the sedimentation laboratory of the Coastal and Marine Geology Program of the U.S. Geological Survey in Woods Hole, Mass., and the format of information typically found in the literature. Because the data have come from numerous sources, there are differing amounts and types of information. Most of the samples or sets of samples do not have data in all of the given fields. However, additional fields, qualifiers, and data can be added in virtually unlimited fashion to accommodate specific needs. The database itself is provided in four formats: Microsoft EXCEL, ver. 5, Quattro Pro for Windows, Dbase IV, and Tab-delimited ASCII text. Four bathymetric data sets are presented and include: 1) Interpretations of the bathymetry within the continuous-coverage sidescan sonar study areas; 2) The NOS database modified to remove extraneous data (i.e. buoys); 3) Contoured National Ocean Service (NOS) bathymetry digitized by Applied Geographics Inc., Boston, Massachusetts; and 4) a fly-by based on the modified NOS database. Data files are present in ASCII format with navigation and depth in meters. The bathymetric interpretations within the sidescan sonar study areas are based on mean sea level and stored as TIF images; the NOS data are based on mean low sea level; and the fly-by is configured to run in QuickTime or MPEG, which can be downloaded from this CD-ROM. The bibliographic database, which contains over 2,000 references, is stored as ASCII text, Microsoft Word, Corel WordPerfect, HTML, and Microsoft EXCEL files. This bibliography is largely a compilation of references from Lewis and Coffin (1985) and the GENCAT bibliographic database at the Long Island Sound Resource Center, Connecticut Department of Environmental Protection, Groton, Connecticut. These sources have been supplemented by citations from the BIOSIS, GEOREF, and FISH AND FISHERIES WORLDWIDE bibliographic databases.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CSV data was sourced from the existing Kaggle dataset titled "Adventure Works 2022" by Algorismus. This data was normalized and consisted of seven individual CSV files. The Sales table served as a fact table that connected to other dimensions. To consolidate all the data into a single table, it was loaded into a SQLite database and transformed accordingly. The final denormalized table was then exported as a single CSV file (delimited by | ), and the column names were updated to follow snake_case style.
doi.org/10.6084/m9.figshare.27899706
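The consolidation described above (load normalized tables into SQLite, join the fact table to its dimensions, export one pipe-delimited CSV with snake_case column names) might look roughly like the Python sketch below. It is not the author's actual script: the two source tables, their column names, and the sample rows are hypothetical stand-ins for the seven original CSV files.

```python
import csv
import io
import sqlite3

# Hypothetical source tables and rows; the real dataset has seven CSV files.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (ProductKey INTEGER PRIMARY KEY, Product TEXT);
CREATE TABLE sales (SalesOrderNumber TEXT, ProductKey INTEGER,
                    Quantity INTEGER, UnitPrice REAL);
INSERT INTO product VALUES (1, 'Sample Bike');
INSERT INTO sales VALUES ('SO-0001', 1, 2, 10.5);
""")

# Join the fact table to a dimension and rename columns to snake_case.
cursor = conn.execute("""
SELECT s.SalesOrderNumber         AS sales_order_number,
       s.Quantity                 AS quantity,
       s.UnitPrice                AS unit_price,
       s.Quantity * s.UnitPrice   AS total_sales,
       p.ProductKey               AS product_key,
       p.Product                  AS product_name
FROM sales AS s
JOIN product AS p ON p.ProductKey = s.ProductKey
""")

# Export the denormalized result as a single pipe-delimited CSV.
header = [column[0] for column in cursor.description]
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
writer.writerow(header)
writer.writerows(cursor)
denormalized = buffer.getvalue()
conn.close()
print(denormalized)
```

Renaming in the `SELECT` via `AS` keeps the snake_case step inside the query, so the exported header needs no post-processing.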
| Column Name | Description |
|---|---|
| sales_order_number | Unique identifier for each sales order. |
| sales_order_date | The date and time when the sales order was placed. (e.g., Friday, August 25, 2017) |
| sales_order_date_day_of_week | The day of the week when the sales order was placed (e.g., Monday, Tuesday). |
| sales_order_date_month | The month when the sales order was placed (e.g., January, February). |
| sales_order_date_day | The day of the month when the sales order was placed (1-31). |
| sales_order_date_year | The year when the sales order was placed (e.g., 2022). |
| quantity | The number of units sold in the sales order. |
| unit_price | The price per unit of the product sold. |
| total_sales | The total sales amount for the sales order (quantity * unit price). |
| cost | The total cost associated with the products sold in the sales order. |
| product_key | Unique identifier for the product sold. |
| product_name | The name of the product sold. |
| reseller_key | Unique identifier for the reseller. |
| reseller_name | The name of the reseller. |
| reseller_business_type | The type of business of the reseller (e.g., Warehouse, Value Reseller, Specialty Bike Shop). |
| reseller_city | The city where the reseller is located. |
| reseller_state | The state where the reseller is located. |
| reseller_country | The country where the reseller is located. |
| employee_key | Unique identifier for the employee associated with the sales order. |
| employee_id | The ID of the employee who processed the sales order. |
| salesperson_fullname | The full name of the salesperson associated with the sales order. |
| salesperson_title | The title of the salesperson (e.g., North American Sales Manager, Sales Representative). |
| email_address | The email address of the salesperson. |
| sales_territory_key | Unique identifier for the sales territory for the actual sale. (e.g. 3) |
| assigned_sales_territory | List of sales_territory_key separated by comma assigned to the salesperson. (e.g., 3,4) |
| sales_territory_region | The region of the sales territory. US territory broken down in regions. International regions listed as country name (e.g., Northeast, France). |
| sales_territory_country | The country associated with the sales territory. |
| sales_territory_group | The group classification of the sales territory. (e.g., Europe, North America, Pacific) |
| target | The ... |
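Because the file is delimited by `|` rather than commas, a reader has to set the delimiter explicitly. The Python sketch below reads such a file and checks the stated relationship `total_sales` = `quantity` * `unit_price`; the sample rows, the tolerance, and the function name `find_total_mismatches` are assumptions for illustration only.

```python
import csv
import io
import math

# Illustrative rows only; the second row is deliberately inconsistent.
SAMPLE = (
    "sales_order_number|quantity|unit_price|total_sales\n"
    "SO-0001|2|10.50|21.00\n"
    "SO-0002|3|4.00|11.00\n"
)


def find_total_mismatches(text, tolerance=0.005):
    """Return order numbers whose total_sales differs from quantity * unit_price."""
    mismatches = []
    for row in csv.DictReader(io.StringIO(text), delimiter="|"):
        expected = int(row["quantity"]) * float(row["unit_price"])
        actual = float(row["total_sales"])
        if not math.isclose(expected, actual, abs_tol=tolerance):
            mismatches.append(row["sales_order_number"])
    return mismatches


mismatches = find_total_mismatches(SAMPLE)
print(mismatches)
```

An absolute tolerance is used instead of exact equality so that rounding to cents in the exported file does not produce false alarms.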