21 datasets found
  1. Open Data Dictionary Template Individual

    • catalog.data.gov
    • hub.arcgis.com
    Updated Feb 4, 2025
    Cite
    Office of the Chief Technology Officer (2025). Open Data Dictionary Template Individual [Dataset]. https://catalog.data.gov/dataset/open-data-dictionary-template-individual
    Dataset updated
    Feb 4, 2025
    Dataset provided by
    Office of the Chief Technology Officer
    Description

    This template covers section 2.5 Resource Fields: Entity and Attribute Information of the Data Discovery Form cited in the Open Data DC Handbook (2022). It completes documentation elements that are required for publication. Each field column (attribute) in the dataset needs a description clarifying the contents of the column. Data originators are encouraged to enter the code values (domains) of the column to help end-users translate the contents of the column where needed, especially when lookup tables do not exist.

  2. Ecological Concerns Data Dictionary - Ecological Concerns data dictionary

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated May 24, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). Ecological Concerns Data Dictionary - Ecological Concerns data dictionary [Dataset]. https://catalog.data.gov/dataset/ecological-concerns-data-dictionary-ecological-concerns-data-dictionary2
    Dataset updated
    May 24, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    Evaluating the status of threatened and endangered salmonid populations requires information on the current status of the threats (e.g., habitat, hatcheries, hydropower, and invasives) and the risk of extinction (e.g., status and trend in the Viable Salmonid Population criteria). For salmonids in the Pacific Northwest, threats generally result in changes to physical and biological characteristics of freshwater habitat. These changes are often described by terms like "limiting factors" or "habitat impairment." For example, the condition of freshwater habitat directly impacts salmonid abundance and population spatial structure by affecting carrying capacity and the variability and accessibility of rearing and spawning areas. Thus, one way to assess or quantify threats to ESUs and populations is to evaluate whether the ecological conditions on which fish depend are improving, becoming more degraded, or remaining unchanged.

    In the attached spreadsheets, we have attempted to consistently record limiting factors and threats across all populations and ESUs to enable comparison to other datasets (e.g., restoration projects) in a consistent way. Limiting factors and threats (LF/T) identified in salmon recovery plans were translated into a common language using an ecological concerns data dictionary (see the "Ecological Concerns" tab in the attached spreadsheets); a data dictionary defines the wording, meaning, and scope of categories. The ecological concerns data dictionary defines how different elements are related, such as the relationships between threats, ecological concerns, and life history stages. The data dictionary includes categories for ecological dynamics and population-level effects such as "reduced genetic fitness" and "behavioral changes." The data dictionary categories are meant to encompass the ecological conditions that directly impact salmonids and can be addressed directly or indirectly by management actions (habitat restoration, hatchery reform, etc.). Using the ecological concerns data dictionary enables us to more fully capture the range of effects of hydro, hatchery, and invasive threats as well as habitat threat categories. The organization and format of the data dictionary were also chosen so the information we record can be easily related to datasets we already possess (e.g., restoration data).

  3. APAC Data Suite | 4M+ Translations | 1.6M+ Words | Natural Language...

    • datarade.ai
    Updated Oct 1, 2025
    Cite
    Oxford Languages (2025). APAC Data Suite | 4M+ Translations | 1.6M+ Words | Natural Language Processing Data | Dictionary Display | Translations | APAC Coverage [Dataset]. https://datarade.ai/data-products/apac-data-suite-4m-translations-1-6m-words-natural-la-oxford-languages
    Available download formats: .json, .xml, .csv, .txt, .mp3, .wav
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Oxford Languages (https://lexico.com/es)
    Area covered
    Papua New Guinea, Vietnam, China, Kiribati, Australia, Marshall Islands, Thailand, Taiwan, Philippines, Fiji
    Description

    APAC Data Suite offers high-quality language datasets. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.

    Discover our expertly curated language datasets in the APAC Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:

    • Monolingual and Bilingual Dictionary Data
      Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

    • Semi-bilingual Dictionary Data
      Each entry features a headword with definitions and/or usage examples in Language 1, followed by a translation of the headword and/or definition in Language 2, enabling efficient cross-lingual mapping.

    • Sentence Corpora
      Curated examples of real-world usage with contextual annotations for training and evaluation.

    • Synonyms & Antonyms
      Lexical relations to support semantic search, paraphrasing, and language understanding.

    • Audio Data
      Native speaker recordings for speech recognition, TTS, and pronunciation modeling.

    • Word Lists
      Frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks. The word list data can cover one language or two, such as Tamil words with English translations.

    Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.

    If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.

    Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.

    1. Assamese Semi-bilingual Dictionary Data: 72,200 words | 83,700 senses | 83,800 translations.

    2. Bengali Bilingual Dictionary Data: 161,400 translations | 71,600 senses.

    3. Bengali Semi-bilingual Dictionary Data: 28,300 words | 37,700 senses | 62,300 translations.

    4. British English Monolingual Dictionary Data: 146,000 words | 230,000 senses | 149,000 example sentences.

    5. British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms.

    6. British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.

    7. French Monolingual Dictionary Data: 42,000 words | 56,000 senses | 43,000 example sentences.

    8. French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 example translations.

    9. Gujarati Monolingual Dictionary Data: 91,800 words | 131,500 senses.

    10. Gujarati Bilingual Dictionary Data: 171,800 translations | 158,200 senses.

    11. Hindi Monolingual Dictionary Data: 46,200 words | 112,700 senses.

    12. Hindi Bilingual Dictionary Data: 263,400 translations | 208,100 senses | 18,600 example translations.

    13. Hindi Synonyms and Antonyms Dictionary Data: 478,100 synonyms | 18,800 antonyms.

    14. Hindi Sentence Data: 216,000 sentences.

    15. Hindi Audio data: 68,000 audio files.

    16. Indonesian Bilingual Dictionary Data: 36,000 translations | 23,700 senses | 12,700 example translations.

    17. Indonesian Monolingual Dictionary Data: 120,000 words | 140,000 senses | 30,000 example sentences.

    Korean Monolingual Dictionary Data: 596,100 words | 386,600 senses | 91,700 example sentences.

    18. Korean Bilingual Dictionary Data: 952,500 translations | 449,700 senses | 227,800 example translations.

    19. Mandarin Chinese (simplified) Monolingual Dictionary Data: 81,300 words | 162,400 senses | 80,700 example sentences.

    20. Mandarin Chinese (traditional) Monolingual Dictionary Data: 60,100 words | 144,700 senses | 29,900 example sentences.

    21. Mandarin Chinese (simplified) Bilingual Dictionary Data: 367,600 translations | 204,500 senses | 150,900 example translations.

    22. Mandarin Chinese (traditional) Bilingual Dictionary Data: 215,600 translations | 202,800 senses | 149,700 example translations.

    23. Mandarin Chinese (simplified) Synonyms and Antonyms Data: 3,800 synonyms | 3,180 antonyms.

    24. Malay Bilingual Dictionary Data: 106,100 translations | 53,500 senses.

    25. Malay Monolingual Dictionary Data: 39,800 words | 40,600 senses | 21,100 example sentences.

    26. Malayalam Monolingual Dictionary Data: 91,300 words | 159,200 senses.

    27. Malayalam Bilingual Word List Data: 76,200 translation pairs.

    28. Marathi Bilingual Dictionary Data: 45,400 translations | 32,800 senses | 3,600 example translations.

    29. Nepali Bilingual Dictionary Data: 350,000 translations | 264,200 senses | 1,300 example translations.

    30. New Zealand English Monolingual Dictionary Data: 100,000 words

    31. Odia Semi-bilingual Dictionary Data: 30,700 words | 69,300 senses | 69,200 translations.

    32. Punjabi ...

  4. Data from "Obstacles to the Reuse of Study Metadata in ClinicalTrials.gov"

    • figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Laura Miron; Rafael Gonçalves; Mark A. Musen (2023). Data from "Obstacles to the Reuse of Study Metadata in ClinicalTrials.gov" [Dataset]. http://doi.org/10.6084/m9.figshare.12743939.v2
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Laura Miron; Rafael Gonçalves; Mark A. Musen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This fileset provides supporting data and corpora for the empirical study described in: Laura Miron, Rafael S. Goncalves and Mark A. Musen. Obstacles to the Reuse of Study Metadata in ClinicalTrials.gov.

    Description of files

    Original data files:
    - AllPublicXml.zip contains the set of all public XML records in ClinicalTrials.gov (protocols and summary results information), on which all remaining analyses are based. The set contains 302,091 records downloaded on April 3, 2019.
    - public.xsd is the XML schema downloaded from ClinicalTrials.gov on April 3, 2019, used to validate records in AllPublicXML.

    BioPortal API query results:
    - condition_matches.csv contains the results of querying the BioPortal API for all ontology terms that are an 'exact match' to each condition string scraped from the ClinicalTrials.gov XML. Columns = {filename, condition, url, bioportal term, cuis, tuis}.
    - intervention_matches.csv contains BioPortal API query results for all interventions scraped from the ClinicalTrials.gov XML. Columns = {filename, intervention, url, bioportal term, cuis, tuis}.

    Data element definitions:
    - supplementary_table_1.xlsx: mapping of element names, element types, and whether elements are required in ClinicalTrials.gov data dictionaries, the ClinicalTrials.gov XML schema declaration for records (public.XSD), the Protocol Registration System (PRS), FDAAA801, and the WHO required data elements for clinical trial registrations. Column and value definitions:
      - CT.gov Data Dictionary Section: section heading for a group of data elements in the ClinicalTrials.gov data dictionary (https://prsinfo.clinicaltrials.gov/definitions.html)
      - CT.gov Data Dictionary Element Name: name of an element/field according to the ClinicalTrials.gov data dictionaries (https://prsinfo.clinicaltrials.gov/definitions.html and https://prsinfo.clinicaltrials.gov/expanded_access_definitions.html)
      - CT.gov Data Dictionary Element Type: "Data" if the element is a field for which the user provides a value, "Group Heading" if the element is a group heading for several sub-fields but is not in itself associated with a user-provided value
      - Required for CT.gov for Interventional Records: "Required" if the element is required for interventional records according to the data dictionary, "CR" if the element is conditionally required, "Jan 2017" if the element is required for studies starting on or after January 18, 2017 (the effective date of the FDAAA801 Final Rule), "-" if this element is not applicable to interventional records (only observational or expanded access)
      - Required for CT.gov for Observational Records: same convention as above, applied to observational records; "-" if this element is not applicable to observational records (only interventional or expanded access)
      - Required in CT.gov for Expanded Access Records?: same convention as above, applied to expanded access records; "-" if this element is not applicable to expanded access records (only interventional or observational)
      - CT.gov XSD Element Definition: abbreviated xpath to the corresponding element in the ClinicalTrials.gov XSD (public.XSD). The full xpath includes 'clinical_study/' as a prefix to every element. (There is a single top-level element called "clinical_study" for all other elements.)
      - Required in XSD?: "Yes" if the element is required according to public.XSD, "No" if the element is optional, "-" if the element is not made public or included in the XSD
      - Type in XSD: "text" if the XSD type was "xs:string" or "textblock", the name of the enum if the type was an enum, "integer" if the type was "xs:integer" or "xs:integer" extended with the "type" attribute, "struct" if the type was a struct defined in the XSD
      - PRS Element Name: name of the corresponding entry field in the PRS system
      - PRS Entry Type: entry type in the PRS system. This column contains some free-text explanations/observations
      - FDAAA801 Final Rule Field Name: name of the corresponding required field in the FDAAA801 Final Rule (https://www.federalregister.gov/documents/2016/09/21/2016-22129/clinical-trials-registration-and-results-information-submission). This column contains many empty values where elements in ClinicalTrials.gov do not correspond to a field required by the FDA
      - WHO Field Name: name of the corresponding field required by the WHO Trial Registration Data Set (v 1.3.1) (https://prsinfo.clinicaltrials.gov/trainTrainer/WHO-ICMJE-ClinTrialsgov-Cross-Ref.pdf)

    Analytical results:
    - EC_human_review.csv contains the results of a manual review of a random sample of eligibility criteria from 400 CT.gov records. The table gives filename, criteria, and whether manual review determined the criteria to contain criteria for "multiple subgroups" of participants.
    - completeness.xlsx contains counts and percentages of interventional records missing fields required by FDAAA801 and its Final Rule.
    - industry_completeness.xlsx contains percentages of interventional records missing required fields, broken up by agency class of the trial's lead sponsor ("NIH", "US Fed", "Industry", or "Other"), and before and after the effective date of the Final Rule.
    - location_completeness.xlsx contains percentages of interventional records missing required fields, broken up by whether the record listed at least one location in the United States or only international locations (excluding trials with no listed location), and before and after the effective date of the Final Rule.

    Intermediate results:
    - cache.zip contains pickle and csv files of pandas dataframes with values scraped from the XML records in AllPublicXML. Downloading these files greatly speeds up running the analysis steps from the jupyter notebooks in our github repository.
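    As a rough illustration of how the BioPortal query results could be loaded, here is a minimal pandas sketch; the file names and column labels come from the description above, but reading them directly with pandas and the exact header spellings are assumptions:

    import pandas as pd

    # columns per the description: filename, condition, url, bioportal term, cuis, tuis
    # (exact header spellings in the released CSVs may differ)
    conditions = pd.read_csv("condition_matches.csv")
    interventions = pd.read_csv("intervention_matches.csv")

    # e.g. count how many distinct condition strings matched at least one BioPortal term
    print(conditions["condition"].nunique())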

  5. original : CIFAR 100

    • kaggle.com
    zip
    Updated Dec 28, 2024
    Cite
    Shashwat Pandey (2024). original : CIFAR 100 [Dataset]. https://www.kaggle.com/datasets/shashwat90/original-cifar-100
    Available download formats: zip (168517945 bytes)
    Dataset updated
    Dec 28, 2024
    Authors
    Shashwat Pandey
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

    The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.

    Baseline results You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.

    Other results Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website; click here to view.

    Dataset layout Python / Matlab versions I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.

    The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:

    def unpickle(file):
        import cPickle
        with open(file, 'rb') as fo:
            dict = cPickle.load(fo)
        return dict

    And a python3 version:

    def unpickle(file):
        import pickle
        with open(file, 'rb') as fo:
            dict = pickle.load(fo, encoding='bytes')
        return dict

    Loaded in this way, each of the batch files contains a dictionary with the following elements:

    data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.

    labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.

    The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:

    label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.

    Binary version

    The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:

    <1 x label><3072 x pixel>
    ...
    <1 x label><3072 x pixel>

    In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.

    Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
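    As an illustration (not part of the original distribution), a small numpy reader for one binary batch following the record layout just described; the local file path and the channel-first (3, 32, 32) reshape are assumptions based on that description:

    import numpy as np

    def load_cifar_binary_batch(path):
        # each record is 3073 bytes: 1 label byte followed by 3072 pixel bytes
        records = np.fromfile(path, dtype=np.uint8).reshape(-1, 3073)
        labels = records[:, 0]
        # pixels are stored channel by channel (1024 red, 1024 green, 1024 blue), each row-major
        images = records[:, 1:].reshape(-1, 3, 32, 32)
        return images, labels

    # hypothetical local path to one training batch
    images, labels = load_cifar_binary_batch("data_batch_1.bin")
    print(images.shape, labels.shape)  # expected: (10000, 3, 32, 32) (10000,)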

    There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.

    The CIFAR-100 dataset This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...

  6. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Cite
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Available download formats: json
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Abstract

    Motivation: creating a challenging dataset for testing Named-Entity Linking.

    The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks, containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

    Methods

    Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities with a name similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who passed a preliminary trial task. The only accepted tags are those assigned in agreement by no fewer than 5 annotators and then passed through reconciliation with an experienced reconciliator.

    The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).

    Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with intention to have mentions linked to the entities of the Entity dataset. The backlinks were filtered to leave only mentions in a good quality text; each text was cut 1000 characters after the last mention.

    Usage Notes

    Entities: File: Namesakes_entities.jsonl

    The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity), or "Other" (meaning that the mention is of some other entity, just having the same or similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:

    Key 'pagename': page name of the Wikipedia page.
    Key 'pageid': page id of the Wikipedia page.
    Key 'title': title of the Wikipedia page.
    Key 'url': URL of the Wikipedia page.
    Key 'text': the text chunk from the Wikipedia page.
    Key 'entities': list of the mentions in the page text; each entity is represented by a dictionary with the keys:
        Key 'text': the mention as a string from the page text.
        Key 'start': start character position of the entity in the text.
        Key 'end': end (one-past-last) character position of the entity in the text.
        Key 'tag': annotation tag given as a string - either 'Same' or 'Other'.
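    A short Python sketch of reading this file and recovering the tagged mentions; the local path is an assumption, and the keys are those listed above:

    import json

    # Namesakes_entities.jsonl: one JSON object per line, as described above
    with open("Namesakes_entities.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            text = item["text"]
            for entity in item["entities"]:
                mention = text[entity["start"]:entity["end"]]  # should equal entity["text"]
                print(item["pagename"], mention, entity["tag"])  # tag is "Same" or "Other"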

    News: File: Namesakes_news.jsonl

    The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:

    Key 'id_text': Id of the sample.
    Key 'text': the text chunk.
    Key 'urls': list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
    Key 'entity': a dictionary describing the annotated entity mention in the text:
        Key 'text': the mention as a string found by an NER model in the text.
        Key 'start': start character position of the mention in the text.
        Key 'end': end (one-past-last) character position of the mention in the text.
        Key 'tag': this key exists only if the mentioned entity is annotated as belonging to the Entities dataset - if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
            Key 'pageid': Wikipedia page id.
            Key 'pagetitle': page title.
            Key 'url': page URL.

    Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is identified by surrounded double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
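    One possible way to pull mentions out of that double-bracket markup; the regular expression below is an assumption, treating the text after "|" as the surface form when present:

    import re

    # [[Exact Entity Name]] or [[Exact Entity Name | surface form in the text]]
    MENTION_RE = re.compile(r"\[\[([^\]|]+?)(?:\s*\|\s*([^\]]+?))?\]\]")

    def extract_mentions(content):
        mentions = []
        for match in MENTION_RE.finditer(content):
            entity_name = match.group(1).strip()
            surface_form = (match.group(2) or entity_name).strip()
            mentions.append((entity_name, surface_form))
        return mentions

    text = "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]]."
    print(extract_mentions(text))  # [('Carleton E. Watkins', 'Carleton Watkins')]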

    The Entity-to-Backlinks is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl. Each item is a tuple:

    Entity name.
    Entity Wikipedia page id.
    Backlinks ids: a list of pageids of backlink documents.

    The Backlinks documents is a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl. Each item is a dictionary:

    Key 'pageid': Id of the Wikipedia page.
    Key 'title': Title of the Wikipedia page.
    Key 'content': text chunk from the Wikipedia page, with all mentions in the double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
    Key 'mentions': list of the mentions from the text, for convenience. Each mention is a tuple:
        Entity name.
        Entity Wikipedia page id.
        Sorted list of all character indexes at which the mention occurrences start in the text.

  7. 🏪 Warehouse and Retail Sales Montgomery County

    • kaggle.com
    zip
    Updated Oct 9, 2025
    Cite
    Saman Fatima (2025). 🏪 Warehouse and Retail Sales Montgomery County [Dataset]. https://www.kaggle.com/datasets/samanfatima7/warehouse-and-retail-sales-montgomery-county
    Available download formats: zip (6379254 bytes)
    Dataset updated
    Oct 9, 2025
    Authors
    Saman Fatima
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    📖 Overview

    (If it's helpful, kindly support by upvoting the dataset.)

    This dataset contains a detailed record of sales and movement data by item and department from Montgomery County, Maryland. It is updated monthly and includes information on warehouse and retail liquor sales.

    📑 Data Dictionary

    Column Name       | Description                                       | Example Value   | Type
    Year              | Year of record                                    | 2025            | Integer
    Month             | Month of record (numeric)                         | 9               | Integer
    Supplier          | Name of the supplier                              | "Jack Daniels"  | String
    Item_Code         | Unique product code                               | 12345           | String / Numeric
    Item_Description  | Product name or description                       | "Whiskey 750ml" | String
    Item_Type         | Category or type of product                       | "Liquor"        | String
    Retail_Sales      | Number of cases sold in retail                    | 450             | Integer
    Retail_Transfers  | Number of cases transferred internally            | 120             | Integer
    Warehouse_Sales   | Number of cases sold from warehouse to licensees  | 200             | Integer

    The dataset can be used for:

    📊 Time-series or trend analysis of product sales (see the sketch below)
    🧾 Retail forecasting and demand estimation
    🗺️ Regional economic and consumption studies
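    A minimal pandas sketch of a monthly trend aggregation using the columns from the data dictionary above; the CSV file name is a placeholder, and the column spellings follow the table but may differ in the actual export:

    import pandas as pd

    # hypothetical local export of the dataset
    df = pd.read_csv("warehouse_and_retail_sales.csv")

    # total retail and warehouse cases per year/month
    monthly = (
        df.groupby(["Year", "Month"])[["Retail_Sales", "Warehouse_Sales"]]
          .sum()
          .sort_index()
    )
    print(monthly.head())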

    🧩 Data Summary

    Source: Montgomery County Open Data Portal
    Publisher: Montgomery County of Maryland (data.montgomerycountymd.gov)
    Maintainer: svc dmesb (no-reply@data.montgomerycountymd.gov)
    Category: Community / Recreation
    Update Frequency: Monthly
    First Published: July 6, 2017
    Last Updated: September 5, 2025

    ⚖️ License & Usage

    This dataset is publicly accessible under the Montgomery County, Maryland Open Data Terms of Use. It is a non-federal dataset and may have different terms of use than Data.gov datasets. No explicit license information is provided by the source. Use responsibly and always cite the original source below when reusing the data.

    🙌 Credits

    Dataset originally published by: Montgomery County of Maryland 📍 https://data.montgomerycountymd.gov

    📄 Source Page: Warehouse and Retail Sales

  8. Smart Triage Jinja Data De-identification

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Mawji, Alishah (2023). Smart Triage Jinja Data De-identification [Dataset]. http://doi.org/10.5683/SP3/MSTH98
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Mawji, Alishah
    Description

    This dataset contains de-identified data with an accompanying data dictionary and the R script for de-identification procedures.

    Objective(s): To demonstrate application of a risk-based de-identification framework using the Smart Triage dataset as a clinical example.

    Data Description: This dataset contains the de-identified version of the Smart Triage Jinja dataset with the accompanying data dictionary and R script for de-identification procedures.

    Limitations: Utility of the de-identified dataset has only been evaluated with regard to use for the development of prediction models based on a need for hospital admission.

    Abbreviations: NA

    Ethics Declaration: The study was reviewed by the institutional review boards at the University of British Columbia in Canada (ID: H19-02398; H20-00484), the Makerere University School of Public Health in Uganda, and the Uganda National Council for Science and Technology.

  9. Parks Inspection Program – Element Tracking

    • catalog.data.gov
    • data.cityofnewyork.us
    • +1more
    Updated Nov 22, 2025
    Cite
    data.cityofnewyork.us (2025). Parks Inspection Program – Element Tracking [Dataset]. https://catalog.data.gov/dataset/parks-inspection-program-element-tracking
    Dataset updated
    Nov 22, 2025
    Dataset provided by
    data.cityofnewyork.us
    Description

    This dataset contains additional items that are counted, and in some cases evaluated, during a property inspection. Examples include signs, flags, and drinking fountains. Each row represents a single observation. The Data Dictionary and User Guide can be found here. A complete list of all datasets in the series can be found here.

  10. Checkouts By Title (Physical Items)

    • catalog.data.gov
    • cos-data.seattle.gov
    • +4more
    Updated Mar 21, 2017
    Cite
    data.seattle.gov (2017). Checkouts By Title (Physical Items) [Dataset]. https://catalog.data.gov/ro/dataset/checkouts-by-title-physical-items
    Dataset updated
    Mar 21, 2017
    Dataset provided by
    data.seattle.gov
    Description

    This dataset includes a log of all physical item checkouts from Seattle Public Library. The dataset begins with checkouts occurring in April 2005. Renewals are not included. Have a question about this data? Ask us!

    Data Notes: There is a machine-readable data dictionary available to help you understand the collection and item codes. Access it here: https://data.seattle.gov/Community/Integrated-Library-System-ILS-Data-Dictionary/pbt3-ytbc

    Also:
    1. "CheckoutDateTime" (the timestamp field) is rounded to the nearest minute.
    2. "itemType" is a code from the catalog record that describes the type of item. Some of the more common codes are: acbk (adult book), acdvd (adult DVD), jcbk (children's book), accd (adult CD).
    3. "Collection" is a collection code from the catalog record which describes the item. Here are some common examples: nanf (adult non-fiction), nafic (adult fiction), ncpic (children's picture book), nycomic (young adult comic books).
    4. "Subjects" includes the subjects and subject subdivisions from the item record.
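    As a rough sketch of how those codes might be joined back onto the checkout log with pandas; the local file names and the dictionary's column names ("Code", "Description") are assumptions, not taken from the source:

    import pandas as pd

    # hypothetical local exports of the checkouts log and the ILS data dictionary
    checkouts = pd.read_csv("checkouts_by_title_physical_items.csv")
    codes = pd.read_csv("ils_data_dictionary.csv")

    # translate item type codes (e.g. acbk -> adult book) via the dictionary's code/description columns
    checkouts = checkouts.merge(
        codes[["Code", "Description"]],
        left_on="itemType",
        right_on="Code",
        how="left",
    )
    print(checkouts[["itemType", "Description"]].drop_duplicates().head())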

  11. U.S. Geological Survey National Produced Waters Geochemical Database v2.3

    • catalog.data.gov
    • catalog.newmexicowaterdata.org
    • +1more
    Updated Nov 20, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). U.S. Geological Survey National Produced Waters Geochemical Database v2.3 [Dataset]. https://catalog.data.gov/dataset/u-s-geological-survey-national-produced-waters-geochemical-database-v2-3
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    During hydrocarbon production, water is typically co-produced from the geologic formations producing oil and gas. Understanding the composition of these produced waters is important to help investigate the regional hydrogeology, the source of the water, the efficacy of water treatment and disposal plans, potential economic benefits of mineral commodities in the fluids, and the safety of potential sources of drinking or agricultural water. In addition to waters co-produced with hydrocarbons, geothermal development or exploration brings deep formation waters to the surface for possible sampling. This U.S. Geological Survey (USGS) Produced Waters Geochemical Database, which contains geochemical and other information for 114,943 produced water and other deep formation water samples of the United States, is a provisional, updated version of the 2002 USGS Produced Waters Database (Breit and others, 2002). In addition to the major element data presented in the original, the new database contains trace elements, isotopes, and time-series data, as well as nearly 100,000 additional samples that provide greater spatial coverage from both conventional and unconventional reservoir types, including geothermal. The database is a compilation of 40 individual databases, publications, or reports. The database was created in a manner to facilitate addition of new data and correction of any compilation errors, and is expected to be updated over time with new data as provided and needed.

    Table 1, USGSPWDBv2.3 Data Sources.csv, shows the abbreviated ID of each input database (IDDB), the number of samples from each, and its reference. Table 2, USGSPWDBv2.3 Data Dictionary.csv, defines the 190 variables contained in the database and their descriptions. The database variables are organized first with identification and location information, followed by well descriptions, dates, rock properties, physical properties of the water, and then chemistry. The chemistry is organized alphabetically by elemental symbol. Each element is followed by any associated compounds (e.g. H2S is found after S). After Zr, molecules containing carbon, organic compounds, and dissolved gases follow. Isotopic data are found at the end of the dataset, just before the culling parameters.

  12. COVID-19 Patient Impact & Hospital Capacity Data

    • kaggle.com
    zip
    Updated Mar 10, 2021
    + more versions
    Cite
    aditirajagopal (2021). COVID-19 Patient Impact & Hospital Capacity Data [Dataset]. https://www.kaggle.com/aditirajagopal/covid19-patient-impact-hospital-capacity-data
    Available download formats: zip (8457331 bytes)
    Dataset updated
    Mar 10, 2021
    Authors
    aditirajagopal
    License

    Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)

    Description

    Source: https://healthdata.gov/dataset/covid-19-reported-patient-impact-and-hospital-capacity-facility

    The following dataset provides facility-level data for hospital utilization aggregated on a weekly basis (Friday to Thursday). These are derived from reports with facility-level granularity across two main sources: (1) HHS TeleTracking, and (2) reporting provided directly to HHS Protect by state/territorial health departments on behalf of their healthcare facilities.

    The hospital population includes all hospitals registered with Centers for Medicare & Medicaid Services (CMS) as of June 1, 2020. It includes non-CMS hospitals that have reported since July 15, 2020. It does not include psychiatric, rehabilitation, Indian Health Service (IHS) facilities, U.S. Department of Veterans Affairs (VA) facilities, Defense Health Agency (DHA) facilities, and religious non-medical facilities.

    For a given entry, the term “collection_week” signifies the start of the period that is aggregated. For example, a “collection_week” of 2020-11-20 means the average/sum/coverage of the elements captured from that given facility starting and including Friday, November 20, 2020, and ending and including reports for Thursday, November 26, 2020.

    Reported elements include an append of either “_coverage”, “_sum”, or “_avg”.

    A “_coverage” append denotes how many times the facility reported that element during that collection week.
    A “_sum” append denotes the sum of the reports provided for that facility for that element during that collection week.
    A “_avg” append is the average of the reports provided for that facility for that element during that collection week.
    

    The file will be updated weekly. No statistical analysis is applied to impute non-response. For averages, calculations are based on the number of values collected for a given hospital in that collection week. Suppression is applied to the file for sums and averages less than four (4). In these cases, the field will be replaced with “-999,999”.
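    One hedged way to handle the suppression marker before computing statistics, shown as a small pandas sketch; the file name and the example column name are placeholders, while the -999,999 marker comes from the description above:

    import pandas as pd
    import numpy as np

    df = pd.read_csv("covid_facility_weekly.csv")  # hypothetical local file name

    # suppressed sums/averages (values less than four) are reported as -999,999 (i.e. -999999)
    example_col = "inpatient_beds_used_7_day_avg"  # example field name, not verified against the file
    df[example_col] = df[example_col].replace(-999999, np.nan)
    print(df[example_col].describe())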

    This data is preliminary and subject to change as more data become available. Data is available starting on July 31, 2020.

    Sometimes, reports for a given facility will be provided to both HHS TeleTracking and HHS Protect. When this occurs, to ensure that there are not duplicate reports, deduplication is applied according to prioritization rules within HHS Protect.

    For the influenza fields listed in the file, the current HHS guidance marks these fields as optional. As a result, coverage of these elements varies.

  13. V2 Balloon Detection Dataset

    • kaggle.com
    zip
    Updated Jul 7, 2022
    Cite
    vbookshelf (2022). V2 Balloon Detection Dataset [Dataset]. https://www.kaggle.com/vbookshelf/v2-balloon-detection-dataset
    Available download formats: zip (49788043 bytes)
    Dataset updated
    Jul 7, 2022
    Authors
    vbookshelf
    Description

    Context

    I needed a simple image dataset that I could use when trying different object detection algorithms for the first time. It had to be something that could be quickly understood and easily loaded. I didn't want to spend a lot of time doing EDA or trying to remember how the data is structured. Moreover, I wanted to be able to clearly see when a model's prediction was correct or when it had made a mistake. When working with chest x-ray images, for example, it takes an expert to know if a model's predictions are correct.

    I found the Balloons dataset and simplified it. The original data is split into train and test sets and it has two json files that need to be parsed. In this new version, I copied all images into a single folder and replaced the json files with one csv file that can be easily loaded with Pandas.

    Content

    The dataset consists of 74 jpg images and one csv file. Each image contains one or more balloons.

    The csv file has five columns:

    fname - The image file name.
    height - The image height.
    width - The image width.
    num_balloons - The number of balloons on the image.
    bbox - The coordinates of each bounding box on the image.
    

    The coordinates of each bbox are stored in a dictionary. The format is as follows:

    {"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}
    
    Where xmin and ymin are the coordinates of the top left corner, and xmax and ymax are the coordinates of the bottom right corner.
    

    Each entry in the bbox column is a list of dictionaries. For example, if an image has two balloons and hence two bounding boxes, the entry will be as follows:

    [{"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}, {"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}]

    When loaded into a Pandas dataframe, all items in the bbox column are of type string. The strings can be converted to Python lists like this:

    import ast
    
    # convert each item in the bbox column from type str to type list
    df['bbox'] = df['bbox'].apply(ast.literal_eval)
    
    
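    Building on the parsing snippet above, here is a small sketch that draws the parsed boxes on one image; it assumes Pillow is available, that the csv file name shown is the dataset's single csv, and that fname points at files in the same folder:

    from PIL import Image, ImageDraw
    import ast
    import pandas as pd

    df = pd.read_csv("balloon_dataset.csv")  # hypothetical csv file name
    df['bbox'] = df['bbox'].apply(ast.literal_eval)

    row = df.iloc[0]
    image = Image.open(row['fname'])  # assumes the csv and the jpg images live together
    draw = ImageDraw.Draw(image)
    for box in row['bbox']:
        # each box is {"xmin": ..., "ymin": ..., "xmax": ..., "ymax": ...}
        draw.rectangle(
            [box['xmin'], box['ymin'], box['xmax'], box['ymax']],
            outline='red', width=3,
        )
    image.save('annotated_example.jpg')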

    Acknowledgements

    Many thanks to Waleed Abdulla who created this dataset.

    The original dataset can be downloaded and unzipped using this code:

    !wget https://github.com/matterport/Mask_RCNN/releases/download/v2.1/balloon_dataset.zip
    !unzip balloon_dataset.zip > /dev/null
    

    Inspiration

    Can you create an app that can look at an image and tell you:
    - how many balloons are on the image, and
    - what are the colours of those balloons?

    This is something that could help blind people. To help you get started, here's an example of a similar project.

    License

    In this blog post the dataset's creator mentions that the images were sourced from Flickr. All images have a "Commercial use & mods allowed" license.



    Header image by andremsantana on Pixabay.

  14. Asset database for the Clarence-Moreton bioregion on 24 February 2016 Public...

    • data.wu.ac.at
    • researchdata.edu.au
    • +1more
    zip
    Updated Jul 10, 2017
    + more versions
    Cite
    Bioregional Assessment Programme (2017). Asset database for the Clarence-Moreton bioregion on 24 February 2016 Public [Dataset]. https://data.wu.ac.at/schema/data_gov_au/NWI5NDBhNTYtNjFlMi00MmI5LWFlODktY2JmYzcwMWViYWM5
    Available download formats: zip
    Dataset updated
    Jul 10, 2017
    Dataset provided by
    Bioregional Assessment Programme
    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets including Natural Resource Management regions, and Australian and state and territory government databases. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.

    This data set holds the publicly-available version of the database of water-dependent assets that was compiled for the bioregional assessment (BA) of the Clarence-Moreton subregion as part of the Bioregional Assessment Technical Programme. Though all life is dependent on water, for the purposes of a bioregional assessment, a water-dependent asset is an asset potentially impacted by changes in the groundwater and/or surface water regime due to coal resource development. The water must be other than local rainfall. Examples include wetlands, rivers, bores and groundwater dependent ecosystems.

    A single asset is represented spatially in the asset database by single or multiple spatial features (point, line or polygon). Individual points, lines or polygons are termed elements.

    This dataset contains the unrestricted publicly-available components of spatial and non-spatial (attribute) data of the (restricted) Asset database for the Clarence-Moreton bioregion on 24 February 2016 (6d11ffbc-ea57-49cb-8e00-f97761e0c5d6). The database is provided primarily as an ESRI File geodatabase (.gdb), which is able to be opened in readily available open source software such as QGIS. Other formats include the Microsoft Access database (.mdb in ESRI Personal Geodatabase format), industry-standard ESRI Shapefiles and tab-delimited text files of all the attribute tables.

    The restricted version of the Clarence-Moreton Asset database has a total count of 294961 Elements and 2708 Assets. In the public version of the Clarence-Moreton Asset database, 60074 spatial Element features (~19%) have been removed from the Element List and Element Layer(s) and 729 spatial Assets (~24%) have been removed from the spatial Asset Layer(s).

    The elements/assets removed from the restricted Asset Database are from the following data sources:

    1) Species Profile and Threats Database (SPRAT) - RESTRICTED - Metadata only (7276dd93-cc8c-4c01-8df0-cef743c72112)

    2) Australia, Register of the National Estate (RNE) - Spatial Database (RNESDB) (Internal 878f6780-be97-469b-8517-54bd12a407d0)

    3) Communities of National Environmental Significance Database - RESTRICTED - Metadata only (c01c4693-0a51-4dbc-bbbd-7a07952aa5f6)

    These important assets are included in the bioregional assessment, but are unable to be publicly distributed by the Bioregional Assessment Programme due to restrictions in their licensing conditions. Please note that many of these data sets are available directly from their custodian. For more precise details please see the associated explanatory Data Dictionary document enclosed with this dataset.

    Dataset History

    The public version of the asset database retains all of the unrestricted components of the Asset database for the Clarence-Moreton bioregion on 24 February 2016 - any material that is unable to be published or redistributed to a third party by the BA Programme has been removed from the database. The data presented corresponds to the assets published in the Clarence-Moreton bioregion product 1.3: Description of the water-dependent asset register and asset list for the Clarence-Moreton bioregion on 24 February 2016, and the associated Water-dependent asset register and asset list for the Clarence-Moreton bioregion on 24 February 2016.

    Individual spatial features or elements are initially included in the database if they are partly or wholly within the subregion's preliminary assessment extent (Materiality Test 1, M1). In accordance with BA submethodology M02: Compiling water-dependent assets, individual spatial elements are then grouped into assets, which are evaluated by project teams to determine whether they meet materiality test 2 (M2), that is, whether they are considered to be water dependent.

    Following delivery of the first pass asset list, project teams make a determination as to whether an asset (comprised of one or more elements) is water dependent, as assessed against the materiality tests detailed in the BA Methodology. These decisions are provided to ERIN by the assessment team and incorporated into the AssetList table in the Asset database.

    Development of the Asset Register from the Asset database:

    Decisions for M0 (fit for BA purpose), M1 (PAE) and M2 (water dependent) determine which assets are included in the "asset list" and "water-dependent asset register" which are published as Product 1.3.

    The rule sets are applied as follows:

    M0     | M1  | M2  | Result
    No     | n/a | n/a | Asset is not included in the asset list or the water-dependent asset register
    (≠ No) | No  | n/a | Asset is not included in the asset list or the water-dependent asset register
    (≠ No) | Yes | No  | Asset included in published asset list but not in water-dependent asset register
    (≠ No) | Yes | Yes | Asset included in both asset list and water-dependent asset register
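    A minimal sketch of this rule set as a Python function; the return strings paraphrase the table, and the function is illustrative only, not part of the BA methodology code:

    def asset_register_status(m0, m1, m2):
        # m0: fit for BA purpose, m1: within preliminary assessment extent, m2: water dependent
        if m0 == "No" or m1 == "No":
            return "not included in the asset list or the water-dependent asset register"
        if m1 == "Yes" and m2 == "No":
            return "included in the published asset list but not in the water-dependent asset register"
        if m1 == "Yes" and m2 == "Yes":
            return "included in both the asset list and the water-dependent asset register"
        return "undetermined"

    print(asset_register_status("Yes", "Yes", "No"))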

    Assessment teams are then able to use the database to assign receptors and impact variables to water-dependent assets and the development of a receptor register as detailed in BA submethodology M03: Assigning receptors to water-dependent assets and the receptor register is then incorporated into the asset database.

    At this stage of its development, the Asset database for the Clarence-Moreton bioregion on 24 February 2016, which this document describes, does contain receptor information, and the receptor information was removed from this public version.

    Dataset Citation

    Bioregional Assessment Programme (2014) Asset database for the Clarence-Moreton bioregion on 24 February 2016 Public. Bioregional Assessment Derived Dataset. Viewed 10 July 2017, http://data.bioregionalassessments.gov.au/dataset/ba1d4c6f-e657-4e42-bd3c-413c21c7b735.

    Dataset Ancestors

  15. Asset database for the Cooper subregion on 12 May 2016 Public

    • data.gov.au
    • researchdata.edu.au
    • +1more
    zip
    Updated Nov 19, 2019
    Cite
    Bioregional Assessment Program (2019). Asset database for the Cooper subregion on 12 May 2016 Public [Dataset]. https://data.gov.au/data/dataset/bffa0c44-c86f-4f81-8070-2f0b13e0b774
    Available download formats: zip (119227113 bytes)
    Dataset updated
    Nov 19, 2019
    Dataset provided by
    Bioregional Assessment Program
    Description

    Abstract

    This data set holds the publicly-available version of the database of water-dependent assets that was compiled for the bioregional assessment (BA) of the Cooper subregion as part of the Bioregional Assessment Technical Programme. Though all life is dependent on water, for the purposes of a bioregional assessment, a water-dependent asset is an asset potentially impacted by changes in the groundwater and/or surface water regime due to coal resource development. The water must be other than local rainfall. Examples include wetlands, rivers, bores and groundwater dependent ecosystems.

    The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets including Natural Resource Management regions, and Australian and state and territory government databases. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived. A single asset is represented spatially in the asset database by single or multiple spatial features (point, line or polygon). Individual points, lines or polygons are termed elements.

    This dataset contains the unrestricted publicly-available components of spatial and non-spatial (attribute) data of the (restricted) Asset database for the Cooper subregion on 12 May 2016 (90230311-b2e7-4d4d-a69a-03daab0d03cc). The database is provided primarily as an ESRI File geodatabase (.gdb), which is able to be opened in readily available open source software such as QGIS. Other formats include the Microsoft Access database (.mdb in ESRI Personal Geodatabase format), industry-standard ESRI Shapefiles and tab-delimited text files of all the attribute tables.

    The restricted version of the Cooper Asset database has a total count of 63910 Elements and 1611 Assets. In the public version of the Cooper Asset database, 6209 spatial Element features (~10%) have been removed from the Element List and Element Layer(s) and 47 spatial Assets (~3%) have been removed from the spatial Asset Layer(s).

    The elements/assets removed from the restricted Asset Database are from the following data sources:

    1) Species Profile and Threats Database (SPRAT) - Australia - Species of National Environmental Significance Database (BA subset - RESTRICTED - Metadata only) (7276dd93-cc8c-4c01-8df0-cef743c72112)

    2) Australia, Register of the National Estate (RNE) - Spatial Database (RNESDB) (Internal 878f6780-be97-469b-8517-54bd12a407d0)

    3) Lake Eyre Basin (LEB) Aquatic Ecosystems Mapping and Classification (9be10819-0e71-4d8d-aae5-f179012b6906)

    4) Communities of National Environmental Significance Database - RESTRICTED - Metadata only (c01c4693-0a51-4dbc-bbbd-7a07952aa5f6)

    These important assets are included in the bioregional assessment, but are unable to be publicly distributed by the Bioregional Assessment Programme due to restrictions in their licensing conditions. Please note that many of these data sets are available directly from their custodian. For more precise details please see the associated explanatory Data Dictionary document enclosed with this dataset

    Dataset History

    The public version of the asset database retains all of the unrestricted components of the Asset database for the Cooper subregion on 12 May 2016 - any material that is unable to be published or redistributed to a third party by the BA Programme has been removed from the database. The data presented corresponds to the assets published Cooper subregion product 1.3: Description of the water-dependent asset register and asset list for the Cooper subregion on 12 May 2016, and the associated Water-dependent asset register and asset list for the Cooper subregion on 12 May 2016.

    Individual spatial features or elements are initially included in the database if they are partly or wholly within the subregion's preliminary assessment extent (Materiality Test 1, M1). In accordance with BA submethodology M02: Compiling water-dependent assets, individual spatial elements are then grouped into assets, which are evaluated by project teams to determine whether they meet materiality test 2 (M2), that is, whether they are considered to be water dependent.

    Following delivery of the first pass asset list, project teams make a determination as to whether an asset (comprised of one or more elements) is water dependent, as assessed against the materiality tests detailed in the BA Methodology. These decisions are provided to ERIN by the assessment team and incorporated into the AssetList table in the Asset database.

    Development of the Asset Register from the Asset database:

    Decisions for M0 (fit for BA purpose), M1 (PAE) and M2 (water dependent) determine which assets are included in the "asset list" and "water-dependent asset register" which are published as Product 1.3.

    The rule sets are applied as follows:

    M0       M1    M2    Result
    No       n/a   n/a   Asset is not included in the asset list or the water-dependent asset register
    (≠ No)   No    n/a   Asset is not included in the asset list or the water-dependent asset register
    (≠ No)   Yes   No    Asset is included in the published asset list but not in the water-dependent asset register
    (≠ No)   Yes   Yes   Asset is included in both the asset list and the water-dependent asset register
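    As an illustration only (this is not code from the Bioregional Assessment Programme), the rule set above can be expressed as a small function; the "Yes"/"No" string values are an assumption about how the decisions are recorded.

    ```python
    def classify_asset(m0: str, m1: str, m2: str) -> tuple[bool, bool]:
        """Apply the M0/M1/M2 rule set to a single asset.

        Returns (in_asset_list, in_water_dependent_register).
        m0: fit for BA purpose, m1: within the preliminary assessment extent,
        m2: water dependent. "Yes"/"No" values are assumed for illustration.
        """
        if m0 == "No" or m1 != "Yes":
            # Rows 1 and 2 of the table: excluded from both published outputs.
            return (False, False)
        if m2 == "Yes":
            # Row 4: included in the asset list and the water-dependent asset register.
            return (True, True)
        # Row 3: included in the published asset list only.
        return (True, False)


    # Example: inside the preliminary assessment extent but not water dependent.
    print(classify_asset("Yes", "Yes", "No"))  # (True, False)
    ```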

    Assessment teams are then able to use the database to assign receptors and impact variables to water-dependent assets and to develop a receptor register, as detailed in BA submethodology M03: Assigning receptors to water-dependent assets; the receptor register is then incorporated into the asset database.

    At this stage of its development, the Asset database for the Cooper subregion on 12 May 2016, which this document describes, does not contain receptor information.

    Dataset Citation

    Bioregional Assessment Programme (2014) Asset database for the Cooper subregion on 12 May 2016 Public. Bioregional Assessment Derived Dataset. Viewed 07 February 2017, http://data.bioregionalassessments.gov.au/dataset/bffa0c44-c86f-4f81-8070-2f0b13e0b774.

    Dataset Ancestors

  16. Data from: Expressed Sequence Tags from the Ciliate Protozoan Parasite Ichthyophthirius multifiliis

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Expressed Sequence Tags from the Ciliate Protozoan Parasite Ichthyophthirius Multifiliis [Dataset]. https://catalog.data.gov/dataset/expressed-sequence-tags-from-the-ciliate-protozoan-parasite-ichthyophthirius-multifiliis-b99f0
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Researchers sequenced 10,368 expressed sequence tag (EST) clones using a normalized cDNA library made from pooled samples of the trophont, tomont, and theront life-cycle stages, and generated 9,769 sequences (94.2% success rate). Post-sequencing processing led to 8,432 high-quality sequences. Clustering analysis of these ESTs allowed identification of 4,706 unique sequences containing 976 contigs and 3,730 singletons. The ciliate protozoan Ichthyophthirius multifiliis (Ich) is an important parasite of freshwater fish that causes 'white spot disease', leading to significant losses. A genomic resource for large-scale studies of this parasite has been lacking. To study gene expression involved in Ich pathogenesis and virulence, our goal was to generate ESTs for the development of a powerful microarray platform for the analysis of global gene expression in this species. Here, we initiated a project to sequence and analyze over 10,000 ESTs.

    Resources in this dataset:

    • Resource Title: Data Dictionary - Supplemental Tables 1, 2, and 3. File Name: IchthyophthiriusESTs_DataDictionary.csv. Resource Description: Machine-readable comma-separated values (CSV) definitions for data elements of Supplemental Tables 1-3 concerning I. multifiliis unique EST sequences, BLAST searches of the Ich ESTs against the Tetrahymena thermophila and Plasmodium falciparum genomes, and the gene ontology (GO) profile.

    • Resource Title: Table 3. Table of gene ontology (GO) profiles. File Name: 12864_2006_889_MOESM3_ESM.xls. Resource Description: Supplemental Table 3 (Excel spreadsheet); table of gene ontology (GO) profiles. Provided information includes unique EST name, accession numbers, BLASTX top hit, GO identification numbers and enzyme commission (EC) numbers. Direct download: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM3_ESM.xls

    • Resource Title: Table 1. I. multifiliis unique EST sequences. File Name: 12864_2006_889_MOESM1_ESM.xls. Resource Description: Supplemental Table 1 (Excel spreadsheet) for the article "Generation and analysis of expressed sequence tags from the ciliate protozoan parasite Ichthyophthirius multifiliis"; table of I. multifiliis unique EST sequences. Provided information includes I. multifiliis BLASTX top hits to the non-redundant database in GenBank with unique EST name and accession numbers, as well as significant protein domain comparisons to the Swiss-Prot database. Putative secretory proteins are highlighted. Direct download: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM1_ESM.xls

    • Resource Title: Table 2. Summary of BLAST searches of the Ich ESTs against the Tetrahymena thermophila and Plasmodium falciparum genomes. File Name: 12864_2006_889_MOESM2_ESM.xls. Resource Description: Supplemental Table 2 (Excel spreadsheet) from the same article; summary of BLAST searches of the Ich ESTs against the Tetrahymena thermophila and Plasmodium falciparum genomes. Provided information includes I. multifiliis BLASTX top hits to the non-redundant database in GenBank with unique EST name, tBLASTx top hits to the T. thermophila genome, and BLASTX top hits to the P. falciparum genome sequences. This table correlates with the Venn diagram in Figure 1. Direct download: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM2_ESM.xls

    All three supplemental tables are also available on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176

  17. Data_Sheet_1_The Visual Dictionary of Antimicrobial Stewardship, Infection Control, and Institutional Surveillance Data.docx

    • frontiersin.figshare.com
    • figshare.com
    docx
    Updated Jun 8, 2023
    Cite
    Julia Keizer; Christian F. Luz; Bhanu Sinha; Lisette van Gemert-Pijnen; Casper Albers; Nienke Beerlage-de Jong; Corinna Glasner (2023). Data_Sheet_1_The Visual Dictionary of Antimicrobial Stewardship, Infection Control, and Institutional Surveillance Data.docx [Dataset]. http://doi.org/10.3389/fmicb.2021.743939.s001
    Explore at:
    docx. Available download formats
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    Frontiers
    Authors
    Julia Keizer; Christian F. Luz; Bhanu Sinha; Lisette van Gemert-Pijnen; Casper Albers; Nienke Beerlage-de Jong; Corinna Glasner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objectives: Data and data visualization are integral parts of (clinical) decision-making in general and stewardship (antimicrobial stewardship, infection control, and institutional surveillance) in particular. However, systematic research on the use of data visualization in stewardship is lacking. This study aimed to fill this gap by creating a visual dictionary of stewardship through an assessment of data visualization (i.e., graphical representation of quantitative information) in stewardship research.

    Methods: A random sample of 150 data visualizations from published research articles on stewardship was assessed (excluding geographical maps and flowcharts). The visualization vocabulary (content) and design space (design elements) were combined to create a visual dictionary. Additionally, visualization errors, chart junk, and quality were assessed to identify problems in current visualizations and to provide improvement recommendations.

    Results: Despite heterogeneous use of data visualization, distinct combinations of graphical elements to reflect stewardship data were identified. In general, bar charts (n = 54; 36.0%) and line charts (n = 42; 28.1%) were the preferred visualization types. Visualization problems comprised color-scheme mismatches, double y-axes, data points hidden by overlaps, and chart junk. Recommendations were derived that can help to clarify visual communication, improve color use for grouping/stratifying, improve the display of magnitude, and match visualizations to scientific standards.

    Conclusion: The results of this study can be used to guide creators of data visualizations in designing visualizations that fit the data and visual habits of the stewardship target audience. Additionally, the results can provide the basis to further expand the visual dictionary of stewardship toward more effective visualizations that improve data insights, knowledge, and clinical decision-making.

  18. Individual Edit Histories of All References in the English Wikipedia

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Feb 19, 2021
    Cite
    Zagovora, Olga; Ulloa, Roberto; Weller, Katrin; Flöck, Fabian (2021). Individual Edit Histories of All References in the English Wikipedia [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3964989
    Explore at:
    Dataset updated
    Feb 19, 2021
    Dataset provided by
    GESIS – Leibniz Institute for the Social Sciences
    Authors
    Zagovora, Olga; Ulloa, Roberto; Weller, Katrin; Flöck, Fabian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes the historical versions of all individual references per article in the English Wikipedia. Each reference object also contains information about its original creating editor, editors implementing changes to it, and timestamps of all actions (creations, modifications, deletions, and reinsertions) that were applied to the reference. Each historical version of a reference is represented as a list of tokens (≈ words), where each token has an individual creator and change history.

    The extraction process was meticulously vetted through crowdsourced evaluations, ensuring very high accuracy in contrast to standard textual difference algorithms. The dataset includes references that were created with the <ref> tag up to June 2019. It contains 55,503,998 references with 164,530,374 actions. These references were found in 4,690,046 Wikipedia articles.

    The dataset consists of JSON files where each article's page ID (here: article_id) is used as a file name. Each file is represented as a list of “References”. Each reference is a dictionary with the following keys:

    "first_rev_id" type: Integer, first revision where the reference was inserted (the same value is represented in “ins” as the first element of the list and in "rev_id" of the first element in the "change_sequence"),

    "first_hash_id" type: String, the hash value of the first version of token_id (from WikiWho1, see below) list of the reference (the same value is represented as "hash_id" of the first element in the "change_sequence"),

    "first_editor_id" type: String, user_id or IP address of the first revision where the reference was inserted (the same value is represented as "editor_id" of the first element in the "change_sequence",

    "deleted" type: Boolean, an indicator if the reference exists in the last available revision,

    "ins" type: List of Integers, list of revisions where the reference was inserted (includes the first revision mentioned as "first_rev_id"),

    "ins_editor" type: List of Strings, list of user_id or IP addresses of editors where the reference was inserted (includes the first user mentioned as "first_editor_id"),

    "del" type: List of Integers, list of revisions where the reference was deleted from the article or reference was modified in a way that less than 25% of tokens remained,

    "del_editor“ type: List of Strings, list of user_id or IP addresses of editors where the reference was deleted or reference was modified in a way that less than 25% of tokens remained,

    "modif" type: List of Integers, list of revisions where the reference was modified, or reinserted with modification,

    "hashes": type: List of Strings, list of hash values of all versions of references,

    "first_rev_time": type: DateTime, the timestamp when the reference was created (the same value is represented in "ins_time” as the first element of the list and in "time" of the first element in the "change_sequence"),

    "ins_time" type: List of DateTime, list of timestamps when the reference was inserted or reinserted,

    "del_time" type: List of DateTime, list of timestamps when the reference was deleted,

    "change_sequence" type: List of dictionaries, with information about tokens, editors and revisions where the reference was modified (the first element representing the first revision where the reference was inserted), where:

    "hash_id" type: String, the hash value of the token_id (WikiWho1) list of the reference version,

    "rev_id" type: Integer, the revision number of the particular version of the reference,

    "editor_id" type: String, user_id or IP address of the revision editor,

    "time" type: DateTime, the timestamp when of this particular version of the reference,

    "tokens" type: List of Strings, ordered list of tokens (created by WikiWho1) that represents the particular version of the reference (the list has the same length as "token_editors"),

    "token_editors" type: List of Strings, ordered list of user_ids or IP addresses of editors that were first who added the corresponding token (see "tokens") to Wikipedia article.

    [1] WikiWho is a text mining algorithm that extracts changes to tokens from Wikipedia revisions. Each token is assigned a unique ID. More information: https://www.wikiwho.net/#technical_details

    GitHub Repository with Python example code on how to process data and extract document identifiers: https://github.com/gesiscss/wikipedia_references

    To run the code at GESIS Notebook follow the link: https://notebooks.gesis.org/binder/v2/gh/gesiscss/wikipedia_references/master
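    To illustrate the structure described above, here is a minimal sketch (not part of the published repository) of reading one article's JSON file and summarising each reference; the file name is a hypothetical article_id, and the key names follow the list above.

    ```python
    import json

    def summarise_article(path: str) -> None:
        """Print a one-line summary for every reference in one article file."""
        with open(path, encoding="utf-8") as fh:
            references = json.load(fh)  # a list of reference dictionaries

        for ref in references:
            versions = ref["change_sequence"]        # oldest version first
            latest_tokens = versions[-1]["tokens"]   # token list of the newest version
            print(
                f"first rev {ref['first_rev_id']} by editor {ref['first_editor_id']}; "
                f"{len(ref['modif'])} modification(s); deleted={ref['deleted']}; "
                f"latest text: {' '.join(latest_tokens)[:80]}"
            )

    # Hypothetical article_id used as the file name:
    # summarise_article("12345.json")
    ```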

  19. Football Events

    • kaggle.com
    zip
    Updated Jan 25, 2017
    + more versions
    Cite
    Alin Secareanu (2017). Football Events [Dataset]. http://www.kaggle.com/secareanualin/football-events/home
    Explore at:
    zip (22142158 bytes). Available download formats
    Dataset updated
    Jan 25, 2017
    Authors
    Alin Secareanu
    Description

    Context

    Most publicly available football (soccer) statistics are limited to aggregated data such as Goals, Shots, Fouls, Cards. When assessing performance or building predictive models, this simple aggregation, without any context, can be misleading. For example, a team that produced 10 shots on target from long range has a lower chance of scoring than a club that produced the same number of shots from inside the box. However, metrics derived from this simple count of shots will assess the two teams similarly.

    A football game generates many more events, and it is both important and interesting to take into account the context in which those events were generated. This dataset should keep sports analytics enthusiasts awake for long hours, as the number of questions that can be asked is huge.

    Content

    This dataset is the result of a laborious effort of web scraping and integrating different data sources. The central element is the text commentary. All the events were derived by reverse engineering the text commentary using regex. Using this approach, I was able to derive 11 types of events, as well as the main player and secondary player involved in those events, and many other statistics. If I have missed extracting some useful information, you are welcome to do so and share your findings. The dataset provides a granular view of 9,074 games, totaling 941,009 events from the biggest 5 European football (soccer) leagues: England, Spain, Germany, Italy and France, from the 2011/2012 season to the 2016/2017 season as of 25.01.2017. There are games played during these seasons for which I could not collect detailed data. Overall, over 90% of the games played during these seasons have event data.

    The dataset is organized in 3 files:

    • events.csv - contains event data about each game. Text commentary was scraped from bbc.com, espn.com and onefootball.com
    • ginf.csv - contains metadata and market odds about each game. Odds were collected from oddsportal.com
    • dictionary.txt - contains a dictionary with the textual description of each categorical variable coded with integers
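    As a starting point, the files can be combined along the lines of the sketch below; the column names used here (game_id, time, is_goal) are assumptions for illustration, so consult dictionary.txt and the file headers for the actual coding.

    ```python
    import pandas as pd

    # Illustrative sketch only; column names are assumed, not guaranteed.
    events = pd.read_csv("events.csv")   # one row per in-game event
    games = pd.read_csv("ginf.csv")      # one row per game: metadata and market odds

    # Attach game metadata and odds to each event via the shared game identifier.
    merged = events.merge(games, on="game_id", how="left")

    # One of the questions listed under Inspiration: when are teams more likely to score?
    goals_per_minute = merged[merged["is_goal"] == 1].groupby("time").size()
    print(goals_per_minute.sort_values(ascending=False).head())
    ```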

    Past Research

    I have used this data to:

    • create predictive models for football games in order to bet on football outcomes.
    • make visualizations about upcoming games
    • build expected goals models and compare players

    Inspiration

    There are tons of interesting questions a sports enthusiast can answer with this dataset. For example:

    • What is the value of a shot? Or what is the probability of a shot being a goal given its location, shooter, league, assist method, game state, number of players on the pitch, and time - known as expected goals (xG) models
    • When are teams more likely to score?
    • Which teams are the best or sloppiest at holding the lead?
    • Which teams or players make the best use of set pieces?
    • In which leagues is the referee more likely to give a card?
    • How do players compare when they shoot with their weak foot versus their strong foot? Or which players are ambidextrous?
    • Identify different styles of plays (shooting from long range vs shooting from the box, crossing the ball vs passing the ball, use of headers)
    • Which teams have a bias for attacking on a particular flank?

    And many many more...

  20. Analyzing International Restaurant Orders Dataset

    • kaggle.com
    zip
    Updated Jan 9, 2024
    Cite
    Agung Pambudi (2024). Analyzing International Restaurant Orders Dataset [Dataset]. https://www.kaggle.com/datasets/agungpambudi/analyzing-restaurant-orders-international-dataset
    Explore at:
    zip (136277 bytes). Available download formats
    Dataset updated
    Jan 9, 2024
    Authors
    Agung Pambudi
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset is a quarterly compilation of orders from a hypothetical restaurant specializing in diverse international cuisines. Each entry includes the order date and time and the item ordered, along with the item's category, name, and price.

    Data Dictionary

    Table           Field               Description
    order_details   order_details_id    Unique ID of an item in an order
    order_details   order_id            ID of an order
    order_details   order_date          Date an order was put in (MM/DD/YY)
    order_details   order_time          Time an order was put in (HH:MM:SS AM/PM)
    order_details   item_id             Matches the menu_item_id in the menu_items table
    menu_items      menu_item_id        Unique ID of a menu item
    menu_items      item_name           Name of a menu item
    menu_items      category            Category or type of cuisine of the menu item
    menu_items      price               Price of the menu item (US dollars, $)
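    As a quick illustration of how the two tables relate (a sketch: the CSV file names are assumptions, while the join key follows the item_id to menu_item_id mapping in the data dictionary above):

    ```python
    import pandas as pd

    # Assumed file names for the two tables described in the data dictionary.
    order_details = pd.read_csv("order_details.csv")
    menu_items = pd.read_csv("menu_items.csv")

    # order_details.item_id matches menu_items.menu_item_id (see the table above).
    orders = order_details.merge(
        menu_items, left_on="item_id", right_on="menu_item_id", how="left"
    )

    # Quarterly revenue by cuisine category, in US dollars.
    revenue_by_category = (
        orders.groupby("category")["price"].sum().sort_values(ascending=False)
    )
    print(revenue_by_category)
    ```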



    Reference:

    Maven Analytics. (n.d.). Maven Analytics | Data analytics online training for Excel, Power BI, SQL, Tableau, Python and more. [online] Available at: https://mavenanalytics.io [Accessed 6 Dec. 2023].
