100+ datasets found
  1. OpenFEMA Data Set Fields

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Jun 7, 2025
    Cite
    FEMA/Mission Support/Off of Chf Information Officer (2025). OpenFEMA Data Set Fields [Dataset]. https://catalog.data.gov/dataset/openfema-data-set-fields
    Explore at:
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    FEMA/Mission Support/Off of Chf Information Officer
    Description

    Metadata for the OpenFEMA API data set fields. It contains descriptions, data types, and other attributes for each field. If you have media inquiries about this dataset, please email the FEMA News Desk at FEMA-News-Desk@dhs.gov or call (202) 646-3272. For inquiries about FEMA's data and Open Government program, please contact the OpenFEMA team at OpenFEMA@fema.dhs.gov.
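    A minimal sketch of pulling this metadata programmatically with Python's requests library, assuming the standard OpenFEMA v1 entity name (DataSetFields) and generic record keys; check the OpenFEMA API documentation for the exact endpoint and field names.

    # Sketch only: endpoint entity name and record keys are assumptions.
    import requests

    url = "https://www.fema.gov/api/open/v1/DataSetFields"
    resp = requests.get(url, params={"$top": 10}, timeout=30)  # $top caps the number of records
    resp.raise_for_status()
    for field in resp.json().get("DataSetFields", []):
        # each record describes one field of an OpenFEMA data set
        print(field.get("name"), field.get("type"), field.get("description"))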

  2. Global Soil Types, 0.5-Degree Grid (Modified Zobler) - Dataset - NASA Open...

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). Global Soil Types, 0.5-Degree Grid (Modified Zobler) - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/global-soil-types-0-5-degree-grid-modified-zobler-f09ea
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    A global data set of soil types is available at 0.5-degree latitude by 0.5-degree longitude resolution. There are 106 soil units, based on Zobler's (1986) assessment of the FAO/UNESCO Soil Map of the World. This data set is a conversion of the Zobler 1-degree resolution version to a 0.5-degree resolution. The resolution of the data set was not actually increased. Rather, the 1-degree squares were divided into four 0.5-degree squares with the necessary adjustment of continental boundaries and islands. The computer code used to convert the original 1-degree data to 0.5-degree is provided as a companion file. A JPG image of the data is provided in this document.

    The Zobler data (1-degree resolution) as distributed by Webb et al. (1993) [http://www.ngdc.noaa.gov/seg/eco/cdroms/gedii_a/datasets/a12/wr.htm#top] contains two columns, one column for continent and one column for soil type. The Soil Map of the World consists of 9 maps that represent parts of the world. The texture data that Webb et al. (1993) provided allowed for the fact that a soil type in one part of the world may have different properties than the same soil in a different part of the world. This continent-specific information is retained in this 0.5-degree resolution data set, as well as the soil type information, which is the second column.

    A code was written (one2half.c) to take the file CONTIZOB.LER distributed by Webb et al. (1993) [http://www.ngdc.noaa.gov/seg/eco/cdroms/gedii_a/datasets/a12/wr.htm#top] and simply divide the 1-degree cells into quarters. This code also reads in a land/water file (land.wave) that specifies the cells that are land at 0.5 degrees. The code checks for consistency between the newly quartered map and the land/water map to which the quartered map is to be registered. If there is a discrepancy between the two, an attempt is made to make the two consistent using the following logic. If the cell is supposed to be water, it is forced to be water. If it is supposed to be land but was resolved to water at 1 degree, the code looks at the surrounding 8 cells, picks the most frequent soil type, and assigns it to the cell. If there are no surrounding land cells, the cell is kept as water in the hope that on the next pass one or more of the surrounding cells might be converted from water to a soil type. The whole map is iterated 5 times. The remaining cells that should be land but couldn't be determined from surrounding cells (mostly islands that are resolved at 0.5 degree but not at 1 degree) are printed out with coordinate information. A temporary map is output with -9 indicating where data is required. This is repeated for the continent code in CONTIZOB.LER as well, and a separate map of the temporary continent codes is produced with -9 indicating required data. A nearly identical code (one2half.c) does the same for the continent codes.

    The printout allows one to consult the printed versions of the soil map and look up the soil type with the largest coverage in the 0.5-degree cell. The program manfix.c will then go through the temporary map and prompt for input to correct both the soil codes and the continent codes for the map. This can be done manually or by preparing a file of changes (new_fix.dat) and redirecting stdin. A new complete version of the map is output, in the form of the original CONTIZOB.LER file (contizob.half) but four times larger. Original documentation and computer codes prepared by Post et al. (1996) are provided as companion files with this data set.

    An image of the 106 global soil types at 0.5-degree by 0.5-degree resolution is included. Additional documentation from Zobler's assessment of FAO soil units is available from the NASA Center for Scientific Information.
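    The quartering-and-repair logic described above (split each 1-degree cell into four 0.5-degree cells, force water where the land/water mask says water, and fill mask-land cells that the 1-degree map left as water from the most frequent soil type among the 8 neighbours, iterating a few times) can be sketched with NumPy. This is an illustrative reimplementation, not the original one2half.c; array names and the -9 fill value follow the description above.

    # Illustrative sketch of the 1-degree -> 0.5-degree soil-type conversion described above.
    import numpy as np
    from collections import Counter

    def quarter_and_repair(soil_1deg, land_half, n_iter=5, water=0, missing=-9):
        """soil_1deg: (180, 360) soil codes; land_half: (360, 720) boolean land mask."""
        soil = np.kron(soil_1deg, np.ones((2, 2), dtype=soil_1deg.dtype))  # quarter each 1-degree cell
        soil[~land_half] = water                       # force water where the 0.5-degree mask says water
        for _ in range(n_iter):                        # several passes, as in the original procedure
            fix = np.argwhere(land_half & (soil == water))
            for r, c in fix:
                neigh = soil[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2].ravel()
                neigh = neigh[(neigh != water) & (neigh != missing)]
                if neigh.size:                         # most frequent soil type among the neighbours
                    soil[r, c] = Counter(neigh.tolist()).most_common(1)[0][0]
        soil[land_half & (soil == water)] = missing    # unresolved land cells flagged for manual fixing
        return soil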

  3. St. Louis Brewery Data Set

    • kaggle.com
    zip
    Updated Dec 19, 2023
    + more versions
    Cite
    The Devastator (2023). St. Louis Brewery Data Set [Dataset]. https://www.kaggle.com/datasets/thedevastator/st-louis-brewery-data-set/versions/2
    Explore at:
    Available download formats: zip (23756 bytes)
    Dataset updated
    Dec 19, 2023
    Authors
    The Devastator
    Area covered
    St. Louis
    Description

    St. Louis Brewery Data Set

    St. Louis Brewery Data: Types, Locations, and Contact Details

    By Throwback Thursday [source]

    About this dataset

    The STL Viz Brew - Brewery Data Set is a comprehensive collection of information pertaining to breweries situated in the St. Louis area. This dataset provides valuable insights into various aspects of these breweries, including their name, type (such as microbreweries, regional breweries, or brewpubs), and contact details.

    For each brewery, the dataset includes their precise location information such as street address, additional address details (if applicable), city, state, county or province name, and postal code. Furthermore, latitude and longitude coordinates are also provided to pinpoint the exact geographical position of each brewery.

    To facilitate easy access for users interested in exploring more about these breweries online, this dataset also features website URLs associated with each establishment. Additionally included are contact phone numbers for direct communication.

    The dataset highlights tags associated with individual breweries which can offer further insights into their unique characteristics or specializations within the industry.

    With continuous updates being made to ensure accuracy and timeliness of data representation within this dataset's records, users can rely on its robustness for various research purposes or decision-making processes.

    In conclusion, the STL Viz Brew - Brewery Data Set serves as an extensive resource providing valuable information about breweries in the St. Louis area. It encompasses essential details such as brewery names; types; location addresses with additional address lines if relevant; city names along with state, county or province, and postal codes; website URLs facilitating online exploration; contact phone numbers enabling direct communication; precise geolocation coordinates denoting longitude and latitude values for positioning on maps; and tags offering additional insights into specific attributes or specialties associated with individual establishments.

    Please note that the dataset does not include specific dates related to creation or last update timestamps

    How to use the dataset

    Guide: How to Use the St. Louis Brewery Dataset

    Welcome to the St. Louis Brewery Dataset! This dataset contains valuable information about breweries in the St. Louis area, including their type, location, website, and contact details. Whether you are a beer enthusiast, a business owner looking for collaboration opportunities, or a data analyst looking for insights into the brewery industry in this region, this guide will help you navigate and extract useful information from this dataset.

    • Understanding the Columns:

      • name: The name of the brewery.
      • brewery_type: The type of brewery (e.g., micro, regional, brewpub).
      • street: The street address of the brewery.
      • address_2: Additional address information if applicable.
      • address_3: Additional address information if applicable.
      • city: The city where the brewery is located.
      • state: The state where the brewery is located.
      • county_province: The county or province where the brewery is located.
      • postal_code: The postal code of the brewery's location.
      • website_url: The website URL of the brewery.
      • phone: The contact phone number of the brewery.
    • Extracting Specific Data:

    You can extract specific data based on your requirements using filters or queries in your preferred programming language or software tool.

    Examples: - If you want to find all microbreweries in St. Louis: SELECT * FROM breweries WHERE city = 'St. Louis' AND brewery_type = 'micro';

    • If you want to filter breweries based on their location within a specific postal code range: SELECT * FROM breweries WHERE postal_code BETWEEN 63101 AND 63110;

    • If you need contact details for further communication with breweries: SELECT name, brewery_type, street, address_2, address_3, city, state, county_province, postal_code, website_url, phone FROM breweries;

    • Analyzing and Visualizing the Data:

    Once you have extracted the relevant data, you can perform various analyses or create visualizations to gain insights.

    Examples: - Analyzing the distribution of different brewery types in St. Louis. - Visualizing the geographical locations of breweries on a map using latitude and longitude coordinates. - Investigating correlations between brewery types and their website presence.
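    As a companion to the SQL snippets above, a short pandas sketch covering the first two analysis examples; the file name is an assumption, and the column values should be checked against the CSV as downloaded.

    # Sketch: explore the brewery CSV with pandas (file name and column values assumed).
    import pandas as pd

    df = pd.read_csv("st_louis_breweries.csv")             # hypothetical file name
    print(df["brewery_type"].value_counts())               # distribution of brewery types
    micro = df[(df["city"] == "St. Louis") & (df["brewery_type"] == "micro")]
    print(micro[["name", "street", "postal_code", "website_url", "phone"]])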

    • Additional Considerations:

    When using ...

  4. Ship Images Dataset

    • kaggle.com
    zip
    Updated Dec 19, 2020
    Cite
    Caner Baloglu (2020). Ship Images Dataset [Dataset]. https://www.kaggle.com/canerbaloglu/ship-images-dataset
    Explore at:
    Available download formats: zip (79570926 bytes)
    Dataset updated
    Dec 19, 2020
    Authors
    Caner Baloglu
    Description

    Context

    Images of different types of ships in 128x128 pixels.

    Content

    Ship types are distributed into their respective folders; test images are in a separate folder.

    Acknowledgements

    I used the dataset from Game of Deep Learning: Ship datasets. All of these images are converted to 128x128 pixels and separated into different folders according to the accompanying .csv file. We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

  5. New Oxford Dictionary of English, 2nd Edition

    • live.european-language-grid.eu
    • catalog.elra.info
    Updated Dec 6, 2005
    Cite
    (2005). New Oxford Dictionary of English, 2nd Edition [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/2276
    Explore at:
    Dataset updated
    Dec 6, 2005
    License

    http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Description

    This is Oxford University Press's most comprehensive single-volume dictionary, with 170,000 entries covering all varieties of English worldwide. The NODE data set constitutes a fully integrated range of formal data types suitable for language engineering and NLP applications, and is available in XML or SGML.

    - Source dictionary data. The NODE data set includes all the information present in the New Oxford Dictionary of English itself, such as definition text, example sentences, grammatical indicators, and encyclopaedic material.
    - Morphological data. Each NODE lemma (both headwords and subentries) has a full listing of all possible syntactic forms (e.g. plurals for nouns, inflections for verbs, comparatives and superlatives for adjectives), tagged to show their syntactic relationships. Each form has an IPA pronunciation. Full morphological data is also given for spelling variants (e.g. typical American variants), and a system of links enables straightforward correlation of variant forms to standard forms. The data set thus provides robust support for all look-up routines, and is equally viable for applications dealing with American and British English.
    - Phrases and idioms. The NODE data set provides a rich and flexible codification of over 10,000 phrasal verbs and other multi-word phrases. It features comprehensive lexical resources enabling applications to identify a phrase not only in the form listed in the dictionary but also in a range of real-world variations, including alternative wording, variable syntactic patterns, inflected verbs, optional determiners, etc.
    - Subject classification. Using a categorization scheme of 200 key domains, over 80,000 words and senses have been associated with particular subject areas, from aeronautics to zoology. As well as facilitating the extraction of subject-specific sub-lexicons, this also provides an extensive resource for document categorization and information retrieval.
    - Semantic relationships. The relationships between every noun and noun sense in the dictionary are being codified using an extensive semantic taxonomy on the model of the Princeton WordNet project. (Mapping to WordNet 1.7 is supported.) This structure allows elements of the basic lexical database to function as a formal knowledge database, enabling functionality such as sense disambiguation and logical inference.

    Derived from the detailed and authoritative corpus-based research of Oxford University Press's lexicographic team, the NODE data set is a powerful asset for any task dealing with real-world contemporary English usage. By integrating a number of different data types into a single structure, it creates a coherent resource which can be queried along numerous axes, allowing open-ended exploitation by many kinds of language-related applications.

  6. CH1ORB-L-C1XS-4-NPO-REFDR

    • esdcdoi.esac.esa.int
    Updated Nov 26, 2010
    + more versions
    Cite
    European Space Agency (2010). CH1ORB-L-C1XS-4-NPO-REFDR [Dataset]. http://doi.org/10.5270/esa-h43z0pz
    Explore at:
    Available download formats: https://www.iana.org/assignments/media-types/application/fits
    Dataset updated
    Nov 26, 2010
    Dataset provided by
    European Space Agency (http://www.esa.int/)
    Time period covered
    Oct 22, 2008 - Aug 28, 2009
    Description

    ESSENTIAL READING. The DOCUMENT directory contains 3 instrument specific documents: the Experimenter to Archive Interface Control Document (EAICD), which details the structure of this PDS dataset and the various data types produced; the data handling ICD, which gives more details of the telemetry packets received from C1XS and the command packet format; and the flight operations manual, which gives the normal operating command sequences. The user is advised to refer to the papers published about the instrument, specifically Grande_et_al_2009_1, Crawford_et_al_2009_1 and Howe_et_al_2009_1.

    Data Set Overview. Mission phase abbreviations are defined in CATALOG/MISSION.CAT. Data contained within this archive is CODMAC level 2. The DATA_SET_ID is a unique alphanumeric identifier for the data sets, with format DATA_SET_ID: XXYZZZUNNNVVVWWW.

    Acronym | Description | Example
    XX | Instrument Host ID | CH1ORB
    Y | Target ID | L
    ZZZ | Instrument ID | C1XS
    U | CODMAC Data level | 2
    NNN | Mission phase | NPO
    VVV | Data set type | EDR
    WWW | Version number | V1.0

    E.g. CH1ORBLC1XS2NPOEDRV1.0. The VOLUME_ID is a unique alphanumeric identifier for the volume. Descriptive files contain information to support the processing and analysis of data files. The following file types are defined as descriptive files: .LBL PDS label files, .AUX Auxiliary ... [truncated!, Please see actual data for full text]

  7. Data from: Login Data Set for Risk-Based Authentication

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jun 30, 2022
    Cite
    Stephan Wiefling; Stephan Wiefling; Paul René Jørgensen; Paul René Jørgensen; Sigurd Thunem; Sigurd Thunem; Luigi Lo Iacono; Luigi Lo Iacono (2022). Login Data Set for Risk-Based Authentication [Dataset]. http://doi.org/10.5281/zenodo.6782156
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stephan Wiefling; Stephan Wiefling; Paul René Jørgensen; Paul René Jørgensen; Sigurd Thunem; Sigurd Thunem; Luigi Lo Iacono; Luigi Lo Iacono
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Login Data Set for Risk-Based Authentication

    Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.

    This data set aims to foster research and development for Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.

    The users used this SSO to access sensitive data provided by the online service, e.g., a cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce these results made on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.

    WARNING: The feature values are plausible, but still totally artificial. Therefore, you should NOT use this data set in production systems, e.g., intrusion detection systems.

    Overview

    The data set contains the following features related to each login attempt on the SSO:

    Feature | Data Type | Description | Range or Example
    IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255
    Country | String | Country derived from the IP address | US
    Region | String | Region derived from the IP address | New York
    City | String | City derived from the IP address | Rochester
    ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000
    User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ...
    OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10
    Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538
    Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown) [1]
    User ID | Integer | Identification number related to the affected user account | [Random pseudonym]
    Login Timestamp | Integer | Timestamp related to the login attempt | [64 Bit timestamp]
    Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000
    Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false)
    Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false)
    Is Account Takeover | Boolean | Login attempt was identified as account takeover by incident response team of the online service | (true, false)

    Data Creation

    As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All the other data was randomly generated while maintaining logical relations and timely order between the features.

    The timestamps, however, are not identical and contain randomness. The feature values related to IP address and user agent string were randomly generated by publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in other positions in the original data set.

    • The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.

    • The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical as in the real data set.

    • The RTT was randomly drawn from the login success status and synthesized geolocation data. We did this to ensure that the RTTs are realistic ones.

    Regarding the Data Values

    Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effects on the risk scores generated by the Freeman et al. (2016) model.

    You can recognize them by the following values:

    • ASNs with values >= 500,000

    • IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
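    For example, a small pandas sketch that flags these synthetic entries; the column names are assumptions based on the feature table above, and the released CSV may use different headers.

    # Sketch: flag the synthetic ASN/IP artifacts described above (column names assumed).
    import ipaddress
    import pandas as pd

    df = pd.read_csv("rba-dataset.csv")                    # hypothetical file name
    synthetic_asn = df["ASN"] >= 500_000
    private_net = ipaddress.ip_network("10.0.0.0/8")
    synthetic_ip = df["IP Address"].map(
        lambda ip: ipaddress.ip_address(ip) in private_net
    )
    print("synthetic rows:", (synthetic_asn | synthetic_ip).sum())
    df_clean = df[~(synthetic_asn | synthetic_ip)]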

    Study Reproduction

    Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.

    The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the Median RTTs per country. This is due to the fact that the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.

    See RESULTS.md for more details.

    Ethics

    By using the SSO service, the users agreed to the data collection and evaluation for research purposes. For study reproduction and fostering RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.

    The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.

    Publication

    You can find more details on our conducted study in the following journal article:

    Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022)
    Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono.
    ACM Transactions on Privacy and Security

    Bibtex

    @article{Wiefling_Pump_2022,
     author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
     title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
     journal = {{ACM} {Transactions} on {Privacy} and {Security}},
     doi = {10.1145/3546069},
     publisher = {ACM},
     year  = {2022}
    }

    License

    This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:

    Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069

    [1] A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.

  8. h

    Classification of Types of Changes in Gully Environments Using Time Series...

    • heidata.uni-heidelberg.de
    csv, text/x-python +2
    Updated Jan 16, 2024
    Cite
    Miguel Vallejo Orti; Carlos Castillo; Vivien Zahs; Olaf Bubenzer; Bernhard Höfle; Miguel Vallejo Orti; Carlos Castillo; Vivien Zahs; Olaf Bubenzer; Bernhard Höfle (2024). Classification of Types of Changes in Gully Environments Using Time Series Forest Algorithm [data] [Dataset]. http://doi.org/10.11588/DATA/NSMM6P
    Explore at:
    Available download formats: csv(98093), csv(1833843), csv(8041823), txt(4164), text/x-python(6667), txt(3340), tsv(7978335), csv(3585970)
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    heiDATA
    Authors
    Miguel Vallejo Orti; Carlos Castillo; Vivien Zahs; Olaf Bubenzer; Bernhard Höfle; Miguel Vallejo Orti; Carlos Castillo; Vivien Zahs; Olaf Bubenzer; Bernhard Höfle
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This code implements the TimeSeriesForest algorithm to classify different types of changes in gully environments: i) gully topographical change, ii) no change outside gully, iii) no change inside gully, and iv) non-topographical change. The algorithm is specifically designed for time series classification tasks, where the input data represents the characteristics of gullies over time. The code follows a series of steps to prepare the data, train the classifier, calculate performance metrics, and generate predictions. The data preparation phase involves importing training and testing data from CSV files. The training data is then divided into classes based on their labels, and a subset of the top rows is selected for each class to create a balanced training dataset. Time series data and corresponding labels are extracted from the training data, while only the time series data is extracted from the testing data. Next, the code calculates various performance metrics to evaluate the trained classifier. It splits the training data into training and testing sets, initializes the TimeSeriesForest classifier, and trains it using the training set. The accuracy of the classifier is calculated on the testing set, and feature importances are determined. Predictions are generated for both the testing set and new data using the trained classifier. The code then computes a confusion matrix to analyze the classification results, visualizing it using Seaborn and Matplotlib. Performance metrics such as True Accuracy, Kappa, Producer's Accuracy, and User's Accuracy are calculated and printed to assess the classifier's effectiveness in classifying gully changes. Lastly, the code performs ensemble predictions by combining the testing data with the generated predictions. The results, including predictions and associated probabilities, are saved to an output file. Overall, this code provides a practical implementation of the TimeSeriesForest algorithm for classifying types of changes in gully environments, demonstrating its potential for environmental monitoring and management.
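    A condensed sketch of the evaluation loop described above (train/test split, accuracy, kappa, confusion matrix), using scikit-learn's RandomForestClassifier as a stand-in for the TimeSeriesForest classifier and an assumed CSV layout; consult the published code for the actual preprocessing and class balancing.

    # Sketch of the described workflow; RandomForest stands in for TimeSeriesForest here.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix
    from sklearn.model_selection import train_test_split

    train = pd.read_csv("training_data.csv")                # assumed: last column holds the class label
    X, y = train.iloc[:, :-1].values, train.iloc[:, -1].values
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print("accuracy:", accuracy_score(y_te, pred))
    print("kappa:", cohen_kappa_score(y_te, pred))
    print(confusion_matrix(y_te, pred))                     # rows: true classes, columns: predicted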

  9. f

    Data from: Data-Driven Approach Considering Imbalance in Data Sets and...

    • acs.figshare.com
    zip
    Updated Apr 10, 2025
    Cite
    Wataru Takahara; Ryuto Baba; Yosuke Harashima; Tomoaki Takayama; Shogo Takasuka; Yuichi Yamaguchi; Akihiko Kudo; Mikiya Fujii (2025). Data-Driven Approach Considering Imbalance in Data Sets and Experimental Conditions for Exploration of Photocatalysts [Dataset]. http://doi.org/10.1021/acsomega.4c06997.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    ACS Publications
    Authors
    Wataru Takahara; Ryuto Baba; Yosuke Harashima; Tomoaki Takayama; Shogo Takasuka; Yuichi Yamaguchi; Akihiko Kudo; Mikiya Fujii
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In the field of data-driven material development, an imbalance in data sets where data points are concentrated in certain regions often causes difficulties in building regression models when machine learning methods are applied. One example of inorganic functional materials facing such difficulties is photocatalysts. Therefore, advanced data-driven approaches are expected to help efficiently develop novel photocatalytic materials even if an imbalance exists in data sets. We propose a two-stage machine learning model aimed at handling imbalanced data sets without data thinning. In this study, we used two types of data sets that exhibit the imbalance: the Materials Project data set (openly shared due to its public domain data) and the in-house metal-sulfide photocatalyst data set (not openly shared due to the confidentiality of experimental data). This two-stage machine learning model consists of the following two parts: the first regression model, which predicts the target quantitatively, and the second classification model, which determines the reliability of the values predicted by the first regression model. We also propose a search scheme for variables related to the experimental conditions based on the proposed two-stage machine learning model. This scheme is designed for photocatalyst exploration, taking experimental conditions into account as the optimal set of variables for these conditions is unknown. The proposed two-stage machine learning model improves the prediction accuracy of the target compared with that of the one-stage model.
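    The two-stage idea (a regression model that predicts the target, plus a classification model that judges whether the regression value is reliable) can be sketched roughly as below. This is not the authors' code; the reliability label here is simply whether the regressor's training error falls under a threshold, which is one plausible reading of the description, and the data are synthetic placeholders.

    # Rough sketch of a two-stage model: regression first, then a reliability classifier.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))                          # placeholder descriptors
    y = X[:, 0] * 2 + rng.normal(scale=0.5, size=500)       # placeholder target

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    reg = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)      # stage 1: predict target
    err = np.abs(reg.predict(X_tr) - y_tr)
    reliable = (err < np.median(err)).astype(int)                        # assumed reliability label
    clf = RandomForestClassifier(random_state=0).fit(X_tr, reliable)     # stage 2: predict reliability

    pred = reg.predict(X_te)
    keep = clf.predict(X_te).astype(bool)                   # only trust predictions flagged as reliable
    print("kept", keep.sum(), "of", len(pred), "predictions")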

  10. w

    Data tables for customs declaration volumes for international trade in goods...

    • gov.uk
    Updated May 16, 2023
    + more versions
    Cite
    HM Revenue & Customs (2023). Data tables for customs declaration volumes for international trade in goods in 2022 [Dataset]. https://www.gov.uk/government/statistical-data-sets/data-tables-for-customs-declaration-volumes-for-international-trade-in-goods-in-2022
    Explore at:
    Dataset updated
    May 16, 2023
    Dataset provided by
    GOV.UK
    Authors
    HM Revenue & Customs
    Description

    The following data tables contain the number of customs declarations for international trade in goods in 2022, with breakdowns by trade flow, trade partner, calendar month, declarant representation, location of entry/exit and declaration type category.

    Customs declarations volumes for international trade in goods in 2022 dataset: https://assets.publishing.service.gov.uk/media/64622a74a09dfc000c3c178a/Customs_declarations_volumes_for_international_trade_in_goods_in_2022_dataset.ods

    ODS, 21.4 KB

    This file is in an OpenDocument format
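    The ODS workbook can be read directly with pandas (requires the odfpy package); the sheet names inside the workbook are not listed here, so the example below simply loads every sheet.

    # Sketch: load the ODS data tables with pandas (needs `pip install odfpy`).
    import pandas as pd

    sheets = pd.read_excel(
        "Customs_declarations_volumes_for_international_trade_in_goods_in_2022_dataset.ods",
        sheet_name=None,        # None returns a dict of all sheets; names are not documented here
        engine="odf",
    )
    for name, table in sheets.items():
        print(name, table.shape)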

  11. MIRIAM Resources

    • neuinfo.org
    • scicrunch.org
    • +1more
    Updated Jan 29, 2022
    Cite
    (2022). MIRIAM Resources [Dataset]. http://identifiers.org/RRID:SCR_006697
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    A set of online services created in support of MIRIAM, a set of guidelines for the annotation and curation of computational models. The core of MIRIAM Resources is a catalogue of data types (namespaces corresponding to controlled vocabularies or databases), their URIs and the corresponding physical URLs or resources. Access to this data is made available via exports (XML) and Web Services (SOAP). MIRIAM Resources are developed and maintained under the BioModels.net initiative, and are free for use by all. MIRIAM Resources are composed of four components: a database, some Web Services, a Java library and this web application.

    * Database: The core of the system is a MySQL database. It allows us to store the data types (which can be controlled vocabularies or databases), their URIs and the corresponding physical URLs, and other details such as documentation and resource identifier patterns. Each entry contains a diverse set of details about the data type: official name and synonyms, root URI, pattern of identifiers, documentation, etc. Moreover, each data type can be associated with several resources (or physical locations).
    * Web Services: Programmatic access to the data is available via Web Services (based on Apache Axis and SOAP messages). In addition, REST-based services are currently being developed. This API allows one to not only resolve model annotations, but also to generate appropriate URIs, based upon the provision of a resource name and accession number. A list of available web services, and a WSDL, are provided. A browser-based online demonstration of the Web Services is also available to try.
    * Java Library: A Java library is provided to access the Web Services. The documentation explains where to download it, its dependencies, and how to use it.
    * Web Application: A Web application, using an Apache Tomcat server, offers access to the whole data set via a Web browser. It is possible to browse by data type names as well as browse by tags. A search engine is also provided.

  12. A Predictive Framework for Integrating Disparate Genomic Data Types Using...

    • plos.figshare.com
    xlsx
    Updated Jun 2, 2023
    Cite
    Brian D. Bennett; Qing Xiong; Sayan Mukherjee; Terrence S. Furey (2023). A Predictive Framework for Integrating Disparate Genomic Data Types Using Sample-Specific Gene Set Enrichment Analysis and Multi-Task Learning [Dataset]. http://doi.org/10.1371/journal.pone.0044635
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Brian D. Bennett; Qing Xiong; Sayan Mukherjee; Terrence S. Furey
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Understanding the root molecular and genetic causes driving complex traits is a fundamental challenge in genomics and genetics. Numerous studies have used variation in gene expression to understand complex traits, but the underlying genomic variation that contributes to these expression changes is not well understood. In this study, we developed a framework to integrate gene expression and genotype data to identify biological differences between samples from opposing complex trait classes that are driven by expression changes and genotypic variation. This framework utilizes pathway analysis and multi-task learning to build a predictive model and discover pathways relevant to the complex trait of interest. We simulated expression and genotype data to test the predictive ability of our framework and to measure how well it uncovered pathways with genes both differentially expressed and genetically associated with a complex trait. We found that the predictive performance of the multi-task model was comparable to other similar methods. Also, methods like multi-task learning that considered enrichment analysis scores from both data sets found pathways with both genetic and expression differences related to the phenotype. We used our framework to analyze differences between estrogen receptor (ER) positive and negative breast cancer samples. An analysis of the top 15 gene sets from the multi-task model showed they were all related to estrogen, steroids, cell signaling, or the cell cycle. Although our study suggests that multi-task learning does not enhance predictive accuracy, the models generated by our framework do provide valuable biological pathway knowledge for complex traits.

  13. INSPIRE Priority Data Set (Compliant) - Habitat types distribution

    • inspire-geoportal.lt
    • inspire-geoportal.ec.europa.eu
    Updated Aug 26, 2020
    Cite
    Construction Sector Development Agency (2020). INSPIRE Priority Data Set (Compliant) - Habitat types distribution [Dataset]. https://www.inspire-geoportal.lt/geonetwork/srv/api/records/e83f80ae-d56f-4989-8ce9-5101d6940032
    Explore at:
    Available download formats: ogc:wms-1.3.0-http-get-map, www:link-1.0-http--link, www:download-1.0-http--download
    Dataset updated
    Aug 26, 2020
    Dataset provided by
    State Service for Protected Areas under the Ministry of Environment
    Construction Sector Development Agency
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    http://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess/noLimitations

    Area covered
    Description

    INSPIRE Priority Data Set (Compliant) - Habitat types distribution

  14. Presence and abundance data and models for four invasive plant species:...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Oct 30, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Presence and abundance data and models for four invasive plant species: merged data set to create the models [Dataset]. https://catalog.data.gov/dataset/presence-and-abundance-data-and-models-for-four-invasive-plant-species-merged-data-set-to-
    Explore at:
    Dataset updated
    Oct 30, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    We developed habitat suitability models for four invasive plant species of concern to Department of Interior land management agencies. We generally followed the modeling workflow developed in Young et al. 2020, but developed models for two data types: where species were present and where they were abundant. We developed models using five algorithms with VisTrails: Software for Assisted Habitat Modeling [SAHM 2.1.2]. We accounted for uncertainty related to sampling bias by using two alternative sources of background samples, and constructed model ensembles using the 10 models for each species (five algorithms by two background methods) for four different thresholds. This data bundle contains the presence and abundance merged data sets used to create models for medusahead rye, red brome, ventenata and bur buttercup, the eight raster files associated with each species/data type (presence or abundance), and tabular summaries by management unit (including each species/data type combination). The spatial data are organized in a separate folder for each species, each containing four rasters. Each of the rasters represents the following, with an occurrence (occ) and abundance (abund) version: 1) 1st - one percentile threshold; 2) 1st_masked - one percentile threshold with Restricted Environmental Conditions. This file specifically, 'mergedDataset.csv', contains the merged data set used to create the models, including location coordinates and associated environmental covariate data values. The bundle documentation files are: 1) 'AbundOccur.xml' contains FGDC project-level metadata; 2) 'mergedDataset.csv', which this metadata file specifically describes, contains the merged data set used to create the models, including location and environmental data; 3) XX.tif where XX is the raster type explained above (occ or abund; masked or not); 4) managementSummaries.csv is the tabular summaries by management unit.

  15. Global Soil Types, 1-Degree Grid (Zobler) - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). Global Soil Types, 1-Degree Grid (Zobler) - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/global-soil-types-1-degree-grid-zobler-5b9fd
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    A global data set of soil types is available at 1-degree latitude by 1-degree longitude resolution. There are 26 soil units based on Zobler's assessment of FAO Soil Units (Zobler, 1986). The data set was compiled as part of an effort to improve modeling of the hydrologic cycle portion of global climate models. A more extensive version of these data, including 106 soil units as well as soil texture and slope, is available from NCAR, Scientific Computing Division, Data Support Section; the more extensive data set is entitled "Staub and Rosenweig's GISS Soil & Sfc Slope, 1-Deg" [http://www.dss.ucar.edu/datasets/ds770.0/]. A help file prepared by Matthews and Fung (1987) (soil1x1.help) is provided as a companion file. An image of the 26 soil types at 1-degree by 1-degree resolution is included. Additional documentation from Zobler's assessment of FAO soil units is available from the NASA Center for Scientific Information.

  16. Inspire Download Service (predefined ATOM) for data set Biotope Types...

    • gimi9.com
    + more versions
    Cite
    Inspire Download Service (predefined ATOM) for data set Biotope Types (Lines) | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_50611a33-5650-0002-7a21-82229df61627/
    Explore at:
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Description of the INSPIRE Download Service (predefined Atom): biotope types (lines) of Rhineland-Palatinate. The links for downloading the records are generated dynamically via GetMap calls to the WMS interface.

  17. Data Use in Academia Dataset

    • datacatalog.worldbank.org
    csv, utf-8
    Updated Nov 27, 2023
    Cite
    Semantic Scholar Open Research Corpus (S2ORC) (2023). Data Use in Academia Dataset [Dataset]. https://datacatalog.worldbank.org/search/dataset/0065200/data_use_in_academia_dataset
    Explore at:
    Available download formats: utf-8, csv
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Semantic Scholar Open Research Corpus (S2ORC)
    Brian William Stacy
    License

    https://datacatalog.worldbank.org/public-licenses?fragment=cc

    Description

    This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.


    Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.


    We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and parsed PDF or latex file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and latex file are important for extracting important information like the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical system: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.


    Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.


    The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.


    To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.


    The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
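    A minimal sketch combining the two approaches described above: a regular-expression match against ISO 3166 country names plus spaCy's named entity recognizer (GPE entities). The pycountry lookup and the model name are assumptions for illustration; the project's actual country lists and matching rules are not reproduced here.

    # Sketch: tag countries in an abstract via regex name matching plus spaCy NER.
    import re
    import pycountry          # assumed source of ISO 3166 names; any fixed list would do
    import spacy

    nlp = spacy.load("en_core_web_sm")
    names = {c.name for c in pycountry.countries}

    def countries_in(text: str) -> set[str]:
        found = {n for n in names if re.search(rf"\b{re.escape(n)}\b", text)}
        found |= {ent.text for ent in nlp(text).ents if ent.label_ == "GPE" and ent.text in names}
        return found

    print(countries_in("We study maternal health outcomes using survey data from Kenya and Uganda."))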


    The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:


    Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.

    There are two classification tasks in this exercise:

    1. identifying whether an academic article is using data from any country

    2. Identifying from which country that data came.

    For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.

    After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]

    For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.

    We expect between 10 and 35 percent of all articles to use data.


    The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244, assuming a cost of $3 per article as was paid to MTurk workers.


    A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of "uses data" was assigned if the model predicted an article used data with at least 90% confidence.
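    A compressed sketch of that classification step using the Hugging Face transformers DistilBERT checkpoint, with the 90% confidence threshold described above. Fine-tuning, batching, and the dataset pipeline are omitted, and the label meaning is an assumption; this is not the authors' training script.

    # Sketch: DistilBERT-based "uses data" classifier with a 0.9 confidence threshold.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
    # ... fine-tune `model` on the labelled abstracts here (omitted) ...

    def uses_data(abstract: str, threshold: float = 0.9) -> bool:
        inputs = tokenizer(abstract, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        return probs[0, 1].item() >= threshold      # label 1 assumed to mean "uses data"

    print(uses_data("We estimate the effect of schooling using household survey data from 2015."))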


    The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model perform the same kind of errors, then the performance reported here will be overestimated.


    The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of

  18. 30 m resolution lake ice type data set of Qinghai Tibet Plateau, Siberia and...

    • poles.tpdc.ac.cn
    • tpdc.ac.cn
    • +1more
    zip
    Updated Aug 9, 2020
    Cite
    Bangsen Tian; Yubao QIU (2020). 30 m resolution lake ice type data set of Qinghai Tibet Plateau, Siberia and alaga river lake region, 2015-2019 [Dataset]. http://doi.org/10.11888/Glacio.tpdc.270806
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 9, 2020
    Dataset provided by
    TPDC
    Authors
    Bangsen Tian; Yubao QIU
    Area covered
    Siberia, Tibetan Plateau,
    Description

    Lake ice is an important parameter of the cryosphere. Its change is closely related to climate parameters such as temperature and precipitation and can directly reflect climate change, so lake ice is an important indicator of regional climate change. However, because of the harsh natural environment and sparse population of these regions, large-scale field observation is difficult. Observations with a spatial resolution of 10 m and a temporal resolution of better than 30 days were therefore used to monitor the changes of different types of lake ice, filling this observation gap. An HMRF algorithm is used to classify the different types of lake ice, and the distribution of different lake ice types in lakes with an area of more than 25 km2 in the three polar regions is analyzed as a time series to form the lake ice type data set. The data include the sequence number of the processed lake, the year and its serial number in the time series, and vector files; the data set also records the algorithm used, the Sentinel-1 satellite data, imaging time, polar region, lake ice type and other information. Users can determine the change of different types of lake ice over the time series from the vector files.

  19. e-Sbirka: Data set: CzechVOC code list - type linkage related term

    • data.gov.cz
    json, json-ld
    Updated Jan 1, 2024
    + more versions
    Cite
    Ministerstvo vnitra (2024). e-Sbirka: Data set: CzechVOC code list - type linkage related term [Dataset]. https://data.gov.cz/dataset?iri=https%3A%2F%2Fdata.gov.cz%2Fzdroj%2Fdatov%C3%A9-sady%2F00007064%2F1296385352
    Explore at:
    json, json-ldAvailable download formats
    Dataset updated
    Jan 1, 2024
    Dataset authored and provided by
    Ministerstvo vnitra
    Description

    Code list of linkage types for related terms in CzechVOC. The new arrangement, in which e-Sbírka is launched into full production separately while e-Legislativa is still being completed, verified in trial operation, and gradually put into practice, requires modifications to the e-Sbírka and e-Legislativa systems and therefore new deployments of the database between 1 January 2024 and 15 January 2025. One consequence of these modifications is that, until 15 January 2025, the identifiers of the individual fragments that make up the structured texts of e-Sbírka acts may change, and partial adjustments to the data structure may occur. We therefore recommend connecting external services that use the Open Data to production only after 15 January 2025.

  20. California Vegetation - WHR13 Types

    • data.ca.gov
    • data.cnra.ca.gov
    • +5more
    Updated Jul 25, 2025
    Cite
    CAL FIRE (2025). California Vegetation - WHR13 Types [Dataset]. https://data.ca.gov/dataset/california-vegetation-whr13-types
    Explore at:
    Available download formats: html, arcgis geoservices rest api, csv, kml, geojson, zip
    Dataset updated
    Jul 25, 2025
    Dataset provided by
    California Department of Forestry and Fire Protection (http://calfire.ca.gov/)
    Authors
    CAL FIRE
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    California
    Description