This template covers section 2.5 Resource Fields: Entity and Attribute Information of the Data Discovery Form cited in the Open Data DC Handbook (2022). It completes documentation elements that are required for publication. Each field column (attribute) in the dataset needs a description clarifying the contents of the column. Data originators are encouraged to enter the code values (domains) of the column to help end-users translate the contents of the column where needed, especially when lookup tables do not exist.
Evaluating the status of threatened and endangered salmonid populations requires information on the current status of the threats (e.g., habitat, hatcheries, hydropower, and invasives) and the risk of extinction (e.g., status and trend in the Viable Salmonid Population criteria). For salmonids in the Pacific Northwest, threats generally result in changes to the physical and biological characteristics of freshwater habitat. These changes are often described by terms like "limiting factors" or "habitat impairment." For example, the condition of freshwater habitat directly affects salmonid abundance and population spatial structure by determining carrying capacity and the variability and accessibility of rearing and spawning areas. Thus, one way to assess or quantify threats to ESUs and populations is to evaluate whether the ecological conditions on which fish depend are improving, degrading, or remaining unchanged. In the attached spreadsheets, we have attempted to record limiting factors and threats consistently across all populations and ESUs to enable comparison with other datasets (e.g., restoration projects). Limiting factors and threats (LF/T) identified in salmon recovery plans were translated into a common language using an ecological concerns data dictionary (see the "Ecological Concerns" tab in the attached spreadsheets); a data dictionary defines the wording, meaning, and scope of categories. The ecological concerns data dictionary defines how different elements are related, such as the relationships between threats, ecological concerns, and life history stages. The data dictionary includes categories for ecological dynamics and population-level effects such as "reduced genetic fitness" and "behavioral changes." The data dictionary categories are meant to encompass the ecological conditions that directly impact salmonids and can be addressed directly or indirectly by management actions (habitat restoration, hatchery reform, etc.). Using the ecological concerns data dictionary enables us to more fully capture the range of effects of hydropower, hatchery, and invasive-species threats as well as habitat threat categories. The organization and format of the data dictionary were also chosen so the information we record can be easily related to datasets we already possess (e.g., restoration data). Data Dictionary.
APAC Data Suite offers high-quality language datasets. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.
Discover our expertly curated language datasets in the APAC Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:
Monolingual and Bilingual Dictionary Data
Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.
Semi-bilingual Dictionary Data
Each entry features a headword with definitions and/or usage examples in Language 1, followed by a translation of the headword and/or definition in Language 2, enabling efficient cross-lingual mapping.
Sentence Corpora
Curated examples of real-world usage with contextual annotations for training and evaluation.
Synonyms & Antonyms
Lexical relations to support semantic search, paraphrasing, and language understanding.
Audio Data
Native speaker recordings for speech recognition, TTS, and pronunciation modeling.
Word Lists
Frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks. The word list data can cover one language or two, such as Tamil words with English translations.
Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.
If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.
Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.
Assamese Semi-bilingual Dictionary Data: 72,200 words | 83,700 senses | 83,800 translations.
Bengali Bilingual Dictionary Data: 161,400 translations | 71,600 senses.
Bengali Semi-bilingual Dictionary Data: 28,300 words | 37,700 senses | 62,300 translations.
British English Monolingual Dictionary Data: 146,000 words | 230,000 senses | 149,000 example sentences.
British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms.
British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.
French Monolingual Dictionary Data: 42,000 words | 56,000 senses | 43,000 example sentences.
French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 example translations.
Gujarati Monolingual Dictionary Data: 91,800 words | 131,500 senses.
Gujarati Bilingual Dictionary Data: 171,800 translations | 158,200 senses.
Hindi Monolingual Dictionary Data: 46,200 words | 112,700 senses.
Hindi Bilingual Dictionary Data: 263,400 translations | 208,100 senses | 18,600 example translations.
Hindi Synonyms and Antonyms Dictionary Data: 478,100 synonyms | 18,800 antonyms.
Hindi Sentence Data: 216,000 sentences.
Hindi Audio data: 68,000 audio files.
Indonesian Bilingual Dictionary Data: 36,000 translations | 23,700 senses | 12,700 example translations.
Indonesian Monolingual Dictionary Data: 120,000 words | 140,000 senses | 30,000 example sentences.
Korean Bilingual Dictionary Data: 952,500 translations | 449,700 senses | 227,800 example translations.
Mandarin Chinese (simplified) Monolingual Dictionary Data: 81,300 words | 162,400 senses | 80,700 example sentences.
Mandarin Chinese (traditional) Monolingual Dictionary Data: 60,100 words | 144,700 senses | 29,900 example sentences.
Mandarin Chinese (simplified) Bilingual Dictionary Data: 367,600 translations | 204,500 senses | 150,900 example translations.
Mandarin Chinese (traditional) Bilingual Dictionary Data: 215,600 translations | 202,800 senses | 149,700 example translations.
Mandarin Chinese (simplified) Synonyms and Antonyms Data: 3,800 synonyms | 3,180 antonyms.
Malay Bilingual Dictionary Data: 106,100 translations | 53,500 senses.
Malay Monolingual Dictionary Data: 39,800 words | 40,600 senses | 21,100 example sentences.
Malayalam Monolingual Dictionary Data: 91,300 words | 159,200 senses.
Malayalam Bilingual Word List Data: 76,200 translation pairs.
Marathi Bilingual Dictionary Data: 45,400 translations | 32,800 senses | 3,600 example translations.
Nepali Bilingual Dictionary Data: 350,000 translations | 264,200 senses | 1,300 example translations.
New Zealand English Monolingual Dictionary Data: 100,000 words.
Odia Semi-bilingual Dictionary Data: 30,700 words | 69,300 senses | 69,200 translations.
Punjabi ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This fileset provides supporting data and corpora for the empirical study described in: Laura Miron, Rafael S. Goncalves and Mark A. Musen. Obstacles to the Reuse of Metadata in ClinicalTrials.gov.
Description of files
Original data files:
- AllPublicXml.zip contains the set of all public XML records in ClinicalTrials.gov (protocols and summary results information), on which all remaining analyses are based. The set contains 302,091 records downloaded on April 3, 2019.
- public.xsd is the XML schema downloaded from ClinicalTrials.gov on April 3, 2019, used to validate records in AllPublicXML.
BioPortal API Query Results:
- condition_matches.csv contains the results of querying the BioPortal API for all ontology terms that are an 'exact match' to each condition string scraped from the ClinicalTrials.gov XML. Columns={filename, condition, url, bioportal term, cuis, tuis}.
- intervention_matches.csv contains BioPortal API query results for all interventions scraped from the ClinicalTrials.gov XML. Columns={filename, intervention, url, bioportal term, cuis, tuis}.
Data Element Definitions:
- supplementary_table_1.xlsx maps element names, element types, and whether elements are required in the ClinicalTrials.gov data dictionaries, the ClinicalTrials.gov XML schema declaration for records (public.XSD), the Protocol Registration System (PRS), FDAAA801, and the WHO required data elements for clinical trial registrations.
Column and value definitions:
- CT.gov Data Dictionary Section: Section heading for a group of data elements in the ClinicalTrials.gov data dictionary (https://prsinfo.clinicaltrials.gov/definitions.html)
- CT.gov Data Dictionary Element Name: Name of an element/field according to the ClinicalTrials.gov data dictionaries (https://prsinfo.clinicaltrials.gov/definitions.html and https://prsinfo.clinicaltrials.gov/expanded_access_definitions.html)
- CT.gov Data Dictionary Element Type: "Data" if the element is a field for which the user provides a value; "Group Heading" if the element is a group heading for several sub-fields but is not in itself associated with a user-provided value
- Required for CT.gov for Interventional Records: "Required" if the element is required for interventional records according to the data dictionary, "CR" if the element is conditionally required, "Jan 2017" if the element is required for studies starting on or after January 18, 2017 (the effective date of the FDAAA801 Final Rule), "-" if the element is not applicable to interventional records (only observational or expanded access)
- Required for CT.gov for Observational Records: "Required" if the element is required for observational records according to the data dictionary, "CR" if the element is conditionally required, "Jan 2017" if the element is required for studies starting on or after January 18, 2017 (the effective date of the FDAAA801 Final Rule), "-" if the element is not applicable to observational records (only interventional or expanded access)
- Required in CT.gov for Expanded Access Records?: "Required" if the element is required for expanded access records according to the data dictionary, "CR" if the element is conditionally required, "Jan 2017" if the element is required for studies starting on or after January 18, 2017 (the effective date of the FDAAA801 Final Rule), "-" if the element is not applicable to expanded access records (only interventional or observational)
- CT.gov XSD Element Definition: abbreviated xpath to the corresponding element in the ClinicalTrials.gov XSD (public.XSD). The full xpath includes 'clinical_study/' as a prefix to every element. (There is a single top-level element called "clinical_study" for all other elements.)
- Required in XSD?: "Yes" if the element is required according to public.XSD, "No" if the element is optional, "-" if the element is not made public or included in the XSD
- Type in XSD: "text" if the XSD type was "xs:string" or "textblock", the name of the enum if the type was an enum, "integer" if the type was "xs:integer" or "xs:integer" extended with the "type" attribute, "struct" if the type was a struct defined in the XSD
- PRS Element Name: Name of the corresponding entry field in the PRS system
- PRS Entry Type: Entry type in the PRS system. This column contains some free-text explanations/observations
- FDAAA801 Final Rule Field Name: Name of the corresponding required field in the FDAAA801 Final Rule (https://www.federalregister.gov/documents/2016/09/21/2016-22129/clinical-trials-registration-and-results-information-submission). This column contains many empty values where elements in ClinicalTrials.gov do not correspond to a field required by the FDA
- WHO Field Name: Name of the corresponding field required by the WHO Trial Registration Data Set (v 1.3.1) (https://prsinfo.clinicaltrials.gov/trainTrainer/WHO-ICMJE-ClinTrialsgov-Cross-Ref.pdf)
Analytical Results:
- EC_human_review.csv contains the results of a manual review of a random sample of eligibility criteria from 400 CT.gov records. The table gives the filename, the criteria, and whether manual review determined the criteria to contain criteria for "multiple subgroups" of participants.
- completeness.xlsx contains counts and percentages of interventional records missing fields required by FDAAA801 and its Final Rule.
- industry_completeness.xlsx contains percentages of interventional records missing required fields, broken up by agency class of the trial's lead sponsor ("NIH", "US Fed", "Industry", or "Other"), and before and after the effective date of the Final Rule.
- location_completeness.xlsx contains percentages of interventional records missing required fields, broken up by whether the record listed at least one location in the United States or only international locations (excluding trials with no listed location), and before and after the effective date of the Final Rule.
Intermediate Results:
- cache.zip contains pickle and csv files of pandas dataframes with values scraped from the XML records in AllPublicXML. Downloading these files greatly speeds up running the analysis steps from the Jupyter notebooks in our GitHub repository.
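As a rough illustration of how these files can be explored, the sketch below loads condition_matches.csv with pandas and counts how many scraped condition strings received at least one exact BioPortal match. The column names follow the description above; the local file path and the choice to treat an empty "bioportal term" as "no match" are assumptions.

```python
import pandas as pd

# Load the BioPortal query results described above (path is an assumption).
conditions = pd.read_csv("condition_matches.csv")

# Count distinct condition strings and those with at least one exact-match term.
n_conditions = conditions["condition"].nunique()
n_matched = conditions.dropna(subset=["bioportal term"])["condition"].nunique()
print(f"{n_matched} of {n_conditions} distinct condition strings had an exact BioPortal match")
```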
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results
You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results
Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website.
Dataset layout
Python / Matlab versions
I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:
```python
def unpickle(file):
    import cPickle
    with open(file, 'rb') as fo:
        dict = cPickle.load(fo)
    return dict
```
And a python3 version:
```python
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
```
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
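For readers who want to turn the raw rows back into images, here is a minimal sketch building on the python3 unpickle() above. It assumes a local copy of the Python version of the dataset (the path is illustrative), and the dictionary keys are bytes because of encoding='bytes'.

```python
import numpy as np

batch = unpickle("cifar-10-batches-py/data_batch_1")  # path is an assumption
data = batch[b"data"]      # 10000 x 3072 uint8 array
labels = batch[b"labels"]  # list of 10000 labels in 0-9

# Each row is R plane, G plane, B plane (1024 bytes each), row-major within a plane.
images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)  # N x 32 x 32 x 3
print(images.shape, labels[0])
```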
The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:
label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.
Binary version
The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:
<1 x label><3072 x pixel> ... <1 x label><3072 x pixel>
In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
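A hedged sketch for the binary version, following the 3073-byte record layout described above (1 label byte plus 3072 pixel bytes); the file path is an assumption.

```python
import numpy as np

raw = np.fromfile("cifar-10-batches-bin/data_batch_1.bin", dtype=np.uint8)
records = raw.reshape(-1, 3073)   # 10000 records of 1 label byte + 3072 pixel bytes
labels = records[:, 0]
images = records[:, 1:].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)  # N x 32 x 32 x 3
print(labels[:5], images.shape)
```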
The CIFAR-100 dataset
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News, and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.
Methods
Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities with a name similar to the chunk's Wikipedia page name are labeled. For labeling, these entities were suggested to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who passed a preliminary trial task. Only tags assigned in agreement by at least 5 annotators were accepted, and these were then passed through reconciliation with an experienced reconciliator.
The News were collected as random news text chunks, containing mentions which either belong to the Entities dataset or can be easily confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as the labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai), after the annotators passed a preliminary trial task. The results were reconciled by an experienced reconciliator. All the labeling was done using Lighttag (lighttag.io).
Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to leave only mentions in good quality text; each text was cut 1000 characters after the last mention.
Usage Notes
Entities:
File: Namesakes_entities.jsonl
The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (meaning that the mention is of this Wikipedia page entity) or "Other" (meaning that the mention is of some other entity, just having the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key ‘pagename’: page name of the Wikipedia page.
Key ‘pageid’: page id of the Wikipedia page.
Key ‘title’: title of the Wikipedia page.
Key ‘url’: URL of the Wikipedia page.
Key ‘text’: the text chunk from the Wikipedia page.
Key ‘entities’: list of the mentions in the page text; each entity is represented by a dictionary with the keys:
Key 'text': the mention as a string from the page text.
Key ‘start’: start character position of the entity in the text.
Key ‘end’: end (one-past-last) character position of the entity in the text.
Key ‘tag’: annotation tag given as a string - either ‘Same’ or ‘Other’.
News:
File: Namesakes_news.jsonl
The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:
Key ‘id_text’: Id of the sample.
Key ‘text’: The text chunk.
Key ‘urls’: List of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
Key ‘entity’: a dictionary describing the annotated entity mention in the text:
Key 'text': the mention as a string found by an NER model in the text.
Key ‘start’: start character position of the mention in the text.
Key ‘end’: end (one-past-last) character position of the mention in the text.
Key 'tag': This key exists only if the mentioned entity is annotated as belonging to the Entities dataset - if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
Key ‘pageid’: Wikipedia page id.
Key ‘pagetitle’: page title.
Key 'url': page URL.
Backlinks dataset: The Backlinks dataset consists of two parts: dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to backlinks for each entity of the Entity dataset (if any backlinks exist for the entity). The Backlinks documents are the backlinks Wikipedia text chunks with identified mentions of the entities from the Entities dataset.
Each mention is identified by surrounded double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". However, if the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by '|', the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
The Entity-to-Backlinks is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl Each item is a tuple: Entity name. Entity Wikipedia page id. Backlinks ids: a list of pageids of backlink documents.
The Backlinks documents file is a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl
Each item is a dictionary:
Key ‘pageid’: Id of the Wikipedia page.
Key ‘title’: Title of the Wikipedia page.
Key 'content': Text chunk from the Wikipedia page, with all mentions in the double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as '...[CUT]'.
Key 'mentions': List of the mentions from the text, for convenience. Each mention is a tuple: Entity name. Entity Wikipedia page id. Sorted list of all character indexes at which the mention occurrences start in the text.
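A minimal sketch for reading the Entities file, using the file name and keys described above (the local path is an assumption):

```python
import json
from collections import Counter

# Read one JSON object per line (jsonl).
with open("Namesakes_entities.jsonl", encoding="utf-8") as f:
    pages = [json.loads(line) for line in f]

# Count "Same" vs "Other" tags across all labeled mentions.
tags = Counter(mention["tag"] for page in pages for mention in page["entities"])
print(len(pages), "text chunks;", dict(tags))
```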
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
📖 Overview
This dataset contains a detailed record of sales and movement data by item and department from Montgomery County, Maryland. It is updated monthly and includes information on warehouse and retail liquor sales.
| Column Name | Description | Example Value | Type |
|---|---|---|---|
| Year | Year of record | 2025 | Integer |
| Month | Month of record (numeric) | 9 | Integer |
| Supplier | Name of the supplier | "Jack Daniels" | String |
| Item_Code | Unique product code | 12345 | String / Numeric |
| Item_Description | Product name or description | "Whiskey 750ml" | String |
| Item_Type | Category or type of product | "Liquor" | String |
| Retail_Sales | Number of cases sold in retail | 450 | Integer |
| Retail_Transfers | Number of cases transferred internally | 120 | Integer |
| Warehouse_Sales | Number of cases sold from warehouse to licensees | 200 | Integer |
The dataset can be used for:
📊 Time-series or trend analysis of product sales
🧾 Retail forecasting and demand estimation
🗺️ Regional economic and consumption studies
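As a starting point for such analyses, here is a hedged pandas sketch that aggregates monthly retail sales by item type. Column names follow the table above; the CSV file name is an assumption.

```python
import pandas as pd

df = pd.read_csv("warehouse_and_retail_sales.csv")  # file name is an assumption

# Total retail cases sold per month and item type.
monthly = (
    df.groupby(["Year", "Month", "Item_Type"], as_index=False)["Retail_Sales"]
      .sum()
      .sort_values(["Year", "Month"])
)
print(monthly.head())
```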
🧩 Data Summary
Source: Montgomery County Open Data Portal
Publisher: Montgomery County of Maryland — data.montgomerycountymd.gov
Maintainer: svc dmesb (no-reply@data.montgomerycountymd.gov)
Category: Community / Recreation
Update Frequency: Monthly
First Published: July 6, 2017
Last Updated: September 5, 2025
⚖️ License & Usage
This dataset is publicly accessible under the Montgomery County, Maryland Open Data Terms of Use. It is a non-federal dataset and may have different terms of use than Data.gov datasets. No explicit license information is provided by the source. Use responsibly and always cite the original source below when reusing the data.
🙌 Credits
Dataset originally published by: Montgomery County of Maryland 📍 https://data.montgomerycountymd.gov
📄 Source Page: Warehouse and Retail Sales
This dataset contains de-identified data with an accompanying data dictionary and the R script for de-identification procedures. Objective(s): To demonstrate application of a risk-based de-identification framework using the Smart Triage dataset as a clinical example. Data Description: This dataset contains the de-identified version of the Smart Triage Jinja dataset with the accompanying data dictionary and R script for de-identification procedures. Limitations: Utility of the de-identified dataset has only been evaluated with regard to use for the development of prediction models based on a need for hospital admission. Abbreviations: NA. Ethics Declaration: The study was reviewed by the institutional review boards at the University of British Columbia in Canada (ID: H19-02398; H20-00484), the Makerere University School of Public Health in Uganda, and the Uganda National Council for Science and Technology.
This dataset contains additional items that are counted, and in some cases evaluated during a property inspection. Examples include signs, flags and drinking fountains. Each row represents a single observation. Data Dictionary and User Guide can be found here. A complete list of all datasets in the series can be found here.
This dataset includes a log of all physical item checkouts from the Seattle Public Library. The dataset begins with checkouts occurring in April 2005. Renewals are not included. Have a question about this data? Ask us!
Data Notes: There is a machine-readable data dictionary available to help you understand the collection and item codes. Access it here: https://data.seattle.gov/Community/Integrated-Library-System-ILS-Data-Dictionary/pbt3-ytbc
Also:
1. "CheckoutDateTime" (the timestamp field) is rounded to the nearest minute.
2. "itemType" is a code from the catalog record that describes the type of item. Some of the more common codes are: acbk (adult book), acdvd (adult DVD), jcbk (children's book), accd (adult CD).
3. "Collection" is a collection code from the catalog record which describes the item. Here are some common examples: nanf (adult non-fiction), nafic (adult fiction), ncpic (children's picture book), nycomic (young adult comic books).
4. "Subjects" includes the subjects and subject subdivisions from the item record.
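As an illustration of how the codes above can be decoded, here is a hedged pandas sketch; the CSV file name is an assumption, and only the example codes listed in the notes are mapped.

```python
import pandas as pd

item_types = {"acbk": "adult book", "acdvd": "adult DVD",
              "jcbk": "children's book", "accd": "adult CD"}
collections = {"nanf": "adult non-fiction", "nafic": "adult fiction",
               "ncpic": "children's picture book", "nycomic": "young adult comic books"}

checkouts = pd.read_csv("seattle_checkouts.csv", nrows=100_000)  # file name is an assumption
checkouts["itemTypeLabel"] = checkouts["itemType"].map(item_types)
checkouts["CollectionLabel"] = checkouts["Collection"].map(collections)
print(checkouts["itemTypeLabel"].value_counts(dropna=False).head())
```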
During hydrocarbon production, water is typically co-produced from the geologic formations producing oil and gas. Understanding the composition of these produced waters is important to help investigate the regional hydrogeology, the source of the water, the efficacy of water treatment and disposal plans, potential economic benefits of mineral commodities in the fluids, and the safety of potential sources of drinking or agricultural water. In addition to waters co-produced with hydrocarbons, geothermal development or exploration brings deep formation waters to the surface for possible sampling. This U.S. Geological Survey (USGS) Produced Waters Geochemical Database, which contains geochemical and other information for 114,943 produced water and other deep formation water samples of the United States, is a provisional, updated version of the 2002 USGS Produced Waters Database (Breit and others, 2002). In addition to the major element data presented in the original, the new database contains trace elements, isotopes, and time-series data, as well as nearly 100,000 additional samples that provide greater spatial coverage from both conventional and unconventional reservoir types, including geothermal. The database is a compilation of 40 individual databases, publications, or reports. The database was created in a manner to facilitate addition of new data and correct any compilation errors, and is expected to be updated over time with new data as provided and needed. Table 1, USGSPWDBv2.3 Data Sources.csv, shows the abbreviated ID of each input database (IDDB), the number of samples from each, and its reference. Table 2, USGSPWDBv2.3 Data Dictionary.csv, defines the 190 variables contained in the database and their descriptions. The database variables are organized first with identification and location information, followed by well descriptions, dates, rock properties, physical properties of the water, and then chemistry. The chemistry is organized alphabetically by elemental symbol. Each element is followed by any associated compounds (e.g. H2S is found after S). After Zr, molecules containing carbon, organic compounds and dissolved gases follow. Isotopic data are found at the end of the dataset, just before the culling parameters.
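A small sketch for browsing the variable definitions shipped with the database (Table 2 above); the local file path is an assumption.

```python
import pandas as pd

# The data dictionary defines the 190 variables contained in the database.
data_dict = pd.read_csv("USGSPWDBv2.3 Data Dictionary.csv")
print(len(data_dict), "variables defined")
print(data_dict.head())
```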
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
Source: https://healthdata.gov/dataset/covid-19-reported-patient-impact-and-hospital-capacity-facility
The following dataset provides facility-level data for hospital utilization aggregated on a weekly basis (Friday to Thursday). These are derived from reports with facility-level granularity across two main sources: (1) HHS TeleTracking, and (2) reporting provided directly to HHS Protect by state/territorial health departments on behalf of their healthcare facilities.
The hospital population includes all hospitals registered with Centers for Medicare & Medicaid Services (CMS) as of June 1, 2020. It includes non-CMS hospitals that have reported since July 15, 2020. It does not include psychiatric, rehabilitation, Indian Health Service (IHS) facilities, U.S. Department of Veterans Affairs (VA) facilities, Defense Health Agency (DHA) facilities, and religious non-medical facilities.
For a given entry, the term “collection_week” signifies the start of the period that is aggregated. For example, a “collection_week” of 2020-11-20 means the average/sum/coverage of the elements captured from that given facility starting and including Friday, November 20, 2020, and ending and including reports for Thursday, November 26, 2020.
Reported element names carry a suffix of either "_coverage", "_sum", or "_avg":
A "_coverage" suffix denotes how many times the facility reported that element during that collection week.
A "_sum" suffix denotes the sum of the reports provided for that facility for that element during that collection week.
A "_avg" suffix is the average of the reports provided for that facility for that element during that collection week.
The file will be updated weekly. No statistical analysis is applied to impute non-response. For averages, calculations are based on the number of values collected for a given hospital in that collection week. Suppression is applied to the file for sums and averages less than four (4). In these cases, the field will be replaced with “-999,999”.
This data is preliminary and subject to change as more data become available. Data is available starting on July 31, 2020.
Sometimes, reports for a given facility will be provided to both HHS TeleTracking and HHS Protect. When this occurs, to ensure that there are not duplicate reports, deduplication is applied according to prioritization rules within HHS Protect.
For influenza fields listed in the file, the current HHS guidance marks these fields as optional. As a result, coverage of these elements varies.
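A hedged sketch of working with these conventions: treat the suppression marker as missing data and recover a weekly average from the _sum and _coverage fields. The file name and the specific element name are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("covid19_reported_patient_impact_facility.csv")  # file name is an assumption
df = df.replace(-999999, np.nan)  # suppressed sums/averages ("-999,999") become missing values

element = "total_beds_7_day"  # hypothetical element name
df[element + "_avg_check"] = df[element + "_sum"] / df[element + "_coverage"]
print(df[[element + "_sum", element + "_coverage", element + "_avg_check"]].head())
```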
I needed a simple image dataset that I could use when trying different object detection algorithms for the first time. It had to be something that could be quickly understood and easily loaded. I didn't want to spend a lot of time doing EDA or trying to remember how the data is structured. Moreover, I wanted to be able to clearly see when a model's prediction was correct or when it had made a mistake. When working with chest x-ray images, for example, it takes an expert to know if a model's predictions are correct.
I found the Balloons dataset and simplified it. The original data is split into train and test sets and it has two json files that need to be parsed. In this new version, I copied all images into a single folder and replaced the json files with one csv file that can be easily loaded with Pandas.
The dataset consists of 74 jpg images and one csv file. Each image contains one or more balloons.
The csv file has five columns:
fname - The image file name.
height - The image height.
width - The image width.
num_balloons - The number of balloons on the image.
bbox - The coordinates of each bounding box on the image.
The coordinates of each bbox are stored in a dictionary. The format is as follows:
{"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}
Where xmin and ymin are the coordinates of the top left corner, and xmax and ymax are the coordinates of the bottom right corner.
Each entry in the bbox column is a list of dictionaries. For example, if an image has two balloons and hence two bounding boxes, the entry will be as follows:
[{"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}, {"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 300}]
When loaded into a Pandas dataframe, all items in the bbox column are of type string. The strings can be converted to Python lists like this:
import ast
# convert each item in the bbox column from type str to type list
df['bbox'] = df['bbox'].apply(ast.literal_eval)
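Once converted, each bbox entry is a list of dictionaries, so per-image bounding boxes can be inspected directly. A short usage sketch (column names follow the csv description above):
row = df.iloc[0]
print(row["fname"], "contains", row["num_balloons"], "balloon(s)")
for box in row["bbox"]:
    width = box["xmax"] - box["xmin"]
    height = box["ymax"] - box["ymin"]
    print(f"  box {width}x{height} px at ({box['xmin']}, {box['ymin']})")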
Many thanks to Waleed Abdulla who created this dataset.
The original dataset can be downloaded and unzipped using this code:
!wget https://github.com/matterport/Mask_RCNN/releases/download/v2.1/balloon_dataset.zip
!unzip balloon_dataset.zip > /dev/null
Can you create an app that can look at an image and tell you:
- how many balloons are on the image, and
- what the colours of those balloons are?
This is something that could help blind people. To help you get started, here's an example of a similar project.
In this blog post the dataset's creator mentions that the images were sourced from Flickr. All images have a "Commercial use & mods allowed" license.
Header image by andremsantana on Pixabay.
The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets including Natural Resource Management regions, and Australian and state and territory government databases. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.
This data set holds the publicly-available version of the database of water-dependent assets that was compiled for the bioregional assessment (BA) of the Clarence-Moreton subregion as part of the Bioregional Assessment Technical Programme. Though all life is dependent on water, for the purposes of a bioregional assessment, a water-dependent asset is an asset potentially impacted by changes in the groundwater and/or surface water regime due to coal resource development. The water must be other than local rainfall. Examples include wetlands, rivers, bores and groundwater dependent ecosystems.
A single asset is represented spatially in the asset database by single or multiple spatial features (point, line or polygon). Individual points, lines or polygons are termed elements.
This dataset contains the unrestricted publicly-available components of spatial and non-spatial (attribute) data of the (restricted) Asset database for the Clarence-Moreton bioregion on 24 February 2016 (6d11ffbc-ea57-49cb-8e00-f97761e0c5d6). The database is provided primarily as an ESRI File geodatabase (.gdb), which is able to be opened in readily available open source software such as QGIS. Other formats include the Microsoft Access database (.mdb in ESRI Personal Geodatabase format), industry-standard ESRI Shapefiles and tab-delimited text files of all the attribute tables.
The restricted version of the Clarence-Moreton Asset database has a total count of 294,961 Elements and 2,708 Assets. In the public version of the Clarence-Moreton Asset database, 60,074 spatial Element features (~19%) have been removed from the Element List and Element Layer(s) and 729 spatial Assets (~24%) have been removed from the spatial Asset Layer(s).
The elements/assets removed from the restricted Asset Database are from the following data sources:
1) Species Profile and Threats Database (SPRAT) - RESTRICTED - Metadata only (7276dd93-cc8c-4c01-8df0-cef743c72112)
2) Australia, Register of the National Estate (RNE) - Spatial Database (RNESDB) (Internal 878f6780-be97-469b-8517-54bd12a407d0)
3) Communities of National Environmental Significance Database - RESTRICTED - Metadata only (c01c4693-0a51-4dbc-bbbd-7a07952aa5f6)
These important assets are included in the bioregional assessment, but are unable to be publicly distributed by the Bioregional Assessment Programme due to restrictions in their licensing conditions. Please note that many of these data sets are available directly from their custodian. For more precise details please see the associated explanatory Data Dictionary document enclosed with this dataset.
The public version of the asset database retains all of the unrestricted components of the Asset database for the Clarence-Moreton bioregion on 24 February 2016 - any material that is unable to be published or redistributed to a third party by the BA Programme has been removed from the database. The data presented corresponds to the assets published in the Clarence-Moreton bioregion product 1.3: Description of the water-dependent asset register and asset list for the Clarence-Moreton bioregion on 24 February 2016, and the associated Water-dependent asset register and asset list for the Clarence-Moreton bioregion on 24 February 2016.
Individual spatial features or elements are initially included in the database if they are partly or wholly within the subregion's preliminary assessment extent (Materiality Test 1, M1). In accordance with BA submethodology M02: Compiling water-dependent assets, individual spatial elements are then grouped into assets, which are evaluated by project teams to determine whether they meet materiality test 2 (M2), i.e. whether they are considered to be water dependent.
Following delivery of the first pass asset list, project teams make a determination as to whether an asset (comprised of one or more elements) is water dependent, as assessed against the materiality tests detailed in the BA Methodology. These decisions are provided to ERIN by the assessment team and incorporated into the AssetList table in the Asset database.
Development of the Asset Register from the Asset database:
Decisions for M0 (fit for BA purpose), M1 (PAE) and M2 (water dependent) determine which assets are included in the "asset list" and "water-dependent asset register" which are published as Product 1.3.
The rule sets are applied as follows:
| M0 | M1 | M2 | Result |
|---|---|---|---|
| No | n/a | n/a | Asset is not included in the asset list or the water-dependent asset register |
| (≠ No) | No | n/a | Asset is not included in the asset list or the water-dependent asset register |
| (≠ No) | Yes | No | Asset included in published asset list but not in the water-dependent asset register |
| (≠ No) | Yes | Yes | Asset included in both the asset list and the water-dependent asset register |
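The rule set above can be read as a small decision function; here is a sketch of that logic in Python (a convenience illustration, not part of the published database).

```python
def asset_register_status(m0: str, m1: str, m2: str) -> str:
    """Apply the M0/M1/M2 rule set from the table above."""
    if m0 == "No" or m1 == "No":
        return "not included in the asset list or the water-dependent asset register"
    if m1 == "Yes" and m2 == "No":
        return "included in the published asset list but not in the water-dependent asset register"
    if m1 == "Yes" and m2 == "Yes":
        return "included in both the asset list and the water-dependent asset register"
    return "undetermined"

print(asset_register_status("Yes", "Yes", "No"))
```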
Assessment teams are then able to use the database to assign receptors and impact variables to water-dependent assets and the development of a receptor register as detailed in BA submethodology M03: Assigning receptors to water-dependent assets and the receptor register is then incorporated into the asset database.
At this stage of its development, the Asset database for the Clarence-Moreton bioregion on 24 February 2016, which this document describes, does contain receptor information, and the receptor information was removed from this public version.
Bioregional Assessment Programme (2014) Asset database for the Clarence-Moreton bioregion on 24 February 2016 Public. Bioregional Assessment Derived Dataset. Viewed 10 July 2017, http://data.bioregionalassessments.gov.au/dataset/ba1d4c6f-e657-4e42-bd3c-413c21c7b735.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From Combined Surface Waterbodies for the Clarence-Moreton bioregion
Derived From Queensland QLD - Regional - NRM - Water Asset Information Tool - WAIT - databases
Derived From Version 02 Asset list for Clarence Morton 8/8/2014 - ERIN ORIGINAL DATA
Derived From CLM16swo NSW Office of Water Surface Water Offtakes processed for Clarence Moreton v3 12032014
Derived From Asset database for the Clarence-Moreton bioregion on 11 December 2014, minor version v20150220
Derived From Matters of State environmental significance (version 4.1), Queensland
Derived From Geofabric Surface Network - V2.1
Derived From Communities of National Environmental Significance Database - RESTRICTED - Metadata only
Derived From Geofabric Surface Catchments - V2.1
Derived From National Groundwater Dependent Ecosystems (GDE) Atlas
Derived From CLM - 16swo NSW Office of Water Surface Water Offtakes - Clarence Moreton v1 24102013
Derived From Multi-resolution Valley Bottom Flatness MrVBF at three second resolution CSIRO 20000211
Derived From National Groundwater Information System (NGIS) v1.1
Derived From Mitchell Landscapes NSW OEH v3 2011
Derived From Asset database for the Clarence-Moreton bioregion on 24 February 2016
Derived From Geofabric Surface Network - V2.1.1
Derived From Birds Australia - Important Bird Areas (IBA) 2009
Derived From Australia - Species of National Environmental Significance Database
Derived From Multi-resolution Ridge Top Flatness at 3 second resolution CSIRO 20000211
Derived From South East Queensland GDE (draft)
Derived From Natural Resource Management (NRM) Regions 2010
Derived From Version 01 Asset list for Clarence Morton 10/3/2014 - ERIN ORIGINAL DATA
Derived From NSW Office of Water Surface Water Entitlements Locations v1_Oct2013
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores v3 03122014
Derived From [Queensland QLD Regional CMA
This data set holds the publicly-available version of the database of water-dependent assets that was compiled for the bioregional assessment (BA) of the Cooper subregion as part of the Bioregional Assessment Technical Programme. Though all life is dependent on water, for the purposes of a bioregional assessment, a water-dependent asset is an asset potentially impacted by changes in the groundwater and/or surface water regime due to coal resource development. The water must be other than local rainfall. Examples include wetlands, rivers, bores and groundwater dependent ecosystems.
The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from multiple datasets including Natural Resource Management regions, and Australian and state and territory government databases. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived. A single asset is represented spatially in the asset database by single or multiple spatial features (point, line or polygon). Individual points, lines or polygons are termed elements.
This dataset contains the unrestricted publicly-available components of spatial and non-spatial (attribute) data of the (restricted) Asset database for the Cooper subregion on 12 May 2016 (90230311-b2e7-4d4d-a69a-03daab0d03cc). The database is provided primarily as an ESRI File geodatabase (.gdb), which is able to be opened in readily available open source software such as QGIS. Other formats include the Microsoft Access database (.mdb in ESRI Personal Geodatabase format), industry-standard ESRI Shapefiles and tab-delimited text files of all the attribute tables.
The restricted version of the Cooper Asset database has a total count of 63,910 Elements and 1,611 Assets. In the public version of the Cooper Asset database, 6,209 spatial Element features (~10%) have been removed from the Element List and Element Layer(s) and 47 spatial Assets (~3%) have been removed from the spatial Asset Layer(s).
The elements/assets removed from the restricted Asset Database are from the following data sources:
1) Species Profile and Threats Database (SPRAT) - Australia - Species of National Environmental Significance Database (BA subset - RESTRICTED - Metadata only) (7276dd93-cc8c-4c01-8df0-cef743c72112)
2) Australia, Register of the National Estate (RNE) - Spatial Database (RNESDB) (Internal 878f6780-be97-469b-8517-54bd12a407d0)
3) Lake Eyre Basin (LEB) Aquatic Ecosystems Mapping and Classification (9be10819-0e71-4d8d-aae5-f179012b6906)
4) Communities of National Environmental Significance Database - RESTRICTED - Metadata only (c01c4693-0a51-4dbc-bbbd-7a07952aa5f6)
These important assets are included in the bioregional assessment, but are unable to be publicly distributed by the Bioregional Assessment Programme due to restrictions in their licensing conditions. Please note that many of these data sets are available directly from their custodian. For more precise details please see the associated explanatory Data Dictionary document enclosed with this dataset.
The public version of the asset database retains all of the unrestricted components of the Asset database for the Cooper subregion on 12 May 2016 - any material that is unable to be published or redistributed to a third party by the BA Programme has been removed from the database. The data presented corresponds to the assets published Cooper subregion product 1.3: Description of the water-dependent asset register and asset list for the Cooper subregion on 12 May 2016, and the associated Water-dependent asset register and asset list for the Cooper subregion on 12 May 2016.
Individual spatial features or elements are initially included in the database if they are partly or wholly within the subregion's preliminary assessment extent (Materiality Test 1, M1). In accordance with BA submethodology M02: Compiling water-dependent assets, individual spatial elements are then grouped into assets, which are evaluated by project teams to determine whether they meet materiality test 2 (M2), i.e. whether they are considered to be water dependent.
Following delivery of the first pass asset list, project teams make a determination as to whether an asset (comprised of one or more elements) is water dependent, as assessed against the materiality tests detailed in the BA Methodology. These decisions are provided to ERIN by the assessment team and incorporated into the AssetList table in the Asset database.
Development of the Asset Register from the Asset database:
Decisions for M0 (fit for BA purpose), M1 (PAE) and M2 (water dependent) determine which assets are included in the "asset list" and "water-dependent asset register" which are published as Product 1.3.
The rule sets are applied as follows:
| M0 | M1 | M2 | Result |
|---|---|---|---|
| No | n/a | n/a | Asset is not included in the asset list or the water-dependent asset register |
| (≠ No) | No | n/a | Asset is not included in the asset list or the water-dependent asset register |
| (≠ No) | Yes | No | Asset included in published asset list but not in the water-dependent asset register |
| (≠ No) | Yes | Yes | Asset included in both the asset list and the water-dependent asset register |
Assessment teams are then able to use the database to assign receptors and impact variables to water-dependent assets and the development of a receptor register as detailed in BA submethodology M03: Assigning receptors to water-dependent assets and the receptor register is then incorporated into the asset database.
At this stage of its development, the Asset database for the Cooper subregion on 12 May 2016, which this document describes, does not contain receptor information.
Bioregional Assessment Programme (2014) Asset database for the Cooper subregion on 12 May 2016 Public. Bioregional Assessment Derived Dataset. Viewed 07 February 2017, http://data.bioregionalassessments.gov.au/dataset/bffa0c44-c86f-4f81-8070-2f0b13e0b774.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From Queensland QLD - Regional - NRM - Water Asset Information Tool - WAIT - databases
Derived From Matters of State environmental significance (version 4.1), Queensland
Derived From Geofabric Surface Network - V2.1
Derived From Communities of National Environmental Significance Database - RESTRICTED - Metadata only
Derived From South Australia SA - Regional - NRM Board - Water Asset Information Tool - WAIT - databases
Derived From National Groundwater Dependent Ecosystems (GDE) Atlas
Derived From National Groundwater Information System (NGIS) v1.1
Derived From Birds Australia - Important Bird Areas (IBA) 2009
Derived From Queensland QLD Regional CMA Water Asset Information WAIT tool databases RESTRICTED Includes ALL Reports
Derived From Queensland wetland data version 3 - wetland areas.
Derived From SA Department of Environment, Water and Natural Resources (DEWNR) Water Management Areas 141007
Derived From South Australian Wetlands - Groundwater Dependent Ecosystems (GDE) Classification
Derived From National Groundwater Dependent Ecosystems (GDE) Atlas (including WA)
Derived From Asset database for the Cooper subregion on 14 August 2015
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores v3 03122014
Derived From Ramsar Wetlands of Australia
Derived From Permanent and Semi-Permanent Waterbodies of the Lake Eyre Basin (Queensland and South Australia) (DRAFT)
Derived From SA EconomicElements v1 20141201
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores and NGIS v4 28072014
Derived From National Heritage List Spatial Database (NHL) (v2.1)
Derived From Great Artesian Basin and Laura Basin groundwater recharge areas
Derived From [Asset database
Researchers sequenced 10,368 expressed sequence tag (EST) clones using a normalized cDNA library made from pooled samples of the trophont, tomont, and theront life-cycle stages, and generated 9,769 sequences (94.2% success rate). Post-sequencing processing led to 8,432 high quality sequences. Clustering analysis of these ESTs allowed identification of 4,706 unique sequences containing 976 contigs and 3,730 singletons. The ciliate protozoan Ichthyophthirius multifiliis (Ich) is an important parasite of freshwater fish that causes 'white spot disease' leading to significant losses. A genomic resource for large-scale studies of this parasite has been lacking. To study gene expression involved in Ich pathogenesis and virulence, our goal was to generate ESTs for the development of a powerful microarray platform for the analysis of global gene expression in this species. Here, we initiated a project to sequence and analyze over 10,000 ESTs.
Resources in this dataset:
Resource Title: Data Dictionary - Supplemental Tables 1, 2, and 3. File Name: IchthyophthiriusESTs_DataDictionary.csv. Resource Description: Machine-readable comma-separated values (CSV) definitions for data elements of Supplemental Tables 1-3 concerning I. multifiliis unique EST sequences, BLAST searches of the Ich ESTs against Tetrahymena thermophila and Plasmodium falciparum genomes, and gene ontology (GO) profile.
Resource Title: Table 3. Table of gene ontology (GO) profiles. File Name: 12864_2006_889_MOESM3_ESM.xls. Resource Description: Supplemental Table 3, Excel spreadsheet; table of gene ontology (GO) profiles. Provided information includes unique EST name, accession numbers, BLASTX top hit, GO identification numbers and enzyme commission (EC) numbers. Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176 Direct download for this resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM3_ESM.xls
Resource Title: Table 1. I. multifiliis unique EST sequences. File Name: 12864_2006_889_MOESM1_ESM.xls. Resource Description: Supplemental Table 1 for the article "Generation and analysis of expressed sequence tags from the ciliate protozoan parasite Ichthyophthirius multifiliis." Excel spreadsheet; table of I. multifiliis unique EST sequences. Provided information includes I. multifiliis BLASTX top hits to the non-redundant database in GenBank with unique EST name and accession numbers. Also included are significant protein domain comparisons to the Swiss-Prot database. Putative secretory proteins are highlighted. Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176 Direct download for this resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM1_ESM.xls
Resource Title: Table 2. Summary of BLAST searches of the Ich ESTs against Tetrahymena thermophila and Plasmodium falciparum genomes. File Name: 12864_2006_889_MOESM2_ESM.xls. Resource Description: Supplemental Table 2 from "Generation and analysis of expressed sequence tags from the ciliate protozoan parasite Ichthyophthirius multifiliis." Excel spreadsheet; summary of BLAST searches of the Ich ESTs against the Tetrahymena thermophila and Plasmodium falciparum genomes. Provided information includes I. multifiliis BLASTX top hits to the non-redundant database in GenBank with unique EST name, tBLASTx top hits to the T. thermophila genome, and BLASTX top hits to the P. falciparum genome sequences. This table correlates with the Venn diagram in figure 1. Data resources found on the main article page under the "Electronic supplementary material" section: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-8-176 Direct download link for this data resource: https://static-content.springer.com/esm/art:10.1186/1471-2164-8-176/MediaObjects/12864_2006_889_MOESM2_ESM.xls
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objectives: Data and data visualization are integral parts of (clinical) decision-making in general and stewardship (antimicrobial stewardship, infection control, and institutional surveillance) in particular. However, systematic research on the use of data visualization in stewardship is lacking. This study aimed to fill this gap by creating a visual dictionary of stewardship through an assessment of data visualization (i.e., the graphical representation of quantitative information) in stewardship research.

Methods: A random sample of 150 data visualizations from published research articles on stewardship was assessed (excluding geographical maps and flowcharts). The visualization vocabulary (content) and design space (design elements) were combined to create a visual dictionary. Additionally, visualization errors, chart junk, and quality were assessed to identify problems in current visualizations and to provide recommendations for improvement.

Results: Despite heterogeneous use of data visualization, distinct combinations of graphical elements to reflect stewardship data were identified. In general, bar charts (n = 54; 36.0%) and line charts (n = 42; 28.1%) were the preferred visualization types. Visualization problems comprised color-scheme mismatches, double y-axes, data points hidden by overlaps, and chart junk. Recommendations were derived that can help clarify visual communication, improve the use of color for grouping/stratifying, improve the display of magnitude, and match visualizations to scientific standards.

Conclusion: The results of this study can be used to guide creators of data visualizations in designing visualizations that fit the data and the visual habits of the stewardship target audience. Additionally, the results can provide the basis for further expanding the visual dictionary of stewardship toward more effective visualizations that improve data insights, knowledge, and clinical decision-making.
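To make the recommendations concrete, here is a small, purely illustrative sketch that is not drawn from the study's materials: a grouped bar chart with a single y-axis, a colorblind-friendly palette used only for grouping, and no decorative chart junk. All ward names and values are invented for demonstration.

```python
# Illustrative only: a plain grouped bar chart following the kinds of
# recommendations summarized above (single y-axis, color used for grouping,
# minimal non-data ink). All numbers and labels are invented.
import matplotlib.pyplot as plt
import numpy as np

wards = ["ICU", "Surgery", "Internal medicine"]  # hypothetical wards
use_2022 = [48.0, 31.5, 27.2]                    # hypothetical DDD per 100 patient-days
use_2023 = [41.3, 29.8, 25.6]

x = np.arange(len(wards))
width = 0.38

fig, ax = plt.subplots(figsize=(5, 3))
# Okabe-Ito colors: distinguishable under most color-vision deficiencies
ax.bar(x - width / 2, use_2022, width, label="2022", color="#0072B2")
ax.bar(x + width / 2, use_2023, width, label="2023", color="#E69F00")

ax.set_xticks(x)
ax.set_xticklabels(wards)
ax.set_ylabel("Antimicrobial use (DDD per 100 patient-days)")
ax.legend(frameon=False)
ax.spines["top"].set_visible(False)    # remove non-data ink
ax.spines["right"].set_visible(False)
fig.tight_layout()
plt.show()
```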
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes the historical versions of all individual references per article in the English Wikipedia. Each reference object also contains information about its original creating editor, editors implementing changes to it, and timestamps of all actions (creations, modifications, deletions, and reinsertions) that were applied to the reference. Each historical version of a reference is represented as a list of tokens (≈ words), where each token has an individual creator and change history.
The extraction process was meticulously vetted through crowdsourcing evaluations, ensuring very high accuracy in contrast to standard textual difference algorithms. The dataset includes references that were created with the <ref> tag until June 2019. It contains 55,503,998 references with 164,530,374 actions. These references were found in 4,690,046 Wikipedia articles.
The dataset consists of JSON files where each article's page ID (here: article_id) is used as the file name. Each file is represented as a list of references. Each reference is a dictionary with the following keys:
"first_rev_id" type: Integer, the first revision where the reference was inserted (the same value is represented in "ins" as the first element of the list and in "rev_id" of the first element in the "change_sequence"),
"first_hash_id" type: String, the hash value of the token_id list (from WikiWho1, see below) of the first version of the reference (the same value is represented as "hash_id" of the first element in the "change_sequence"),
"first_editor_id" type: String, user_id or IP address of the editor of the first revision where the reference was inserted (the same value is represented as "editor_id" of the first element in the "change_sequence"),
"deleted" type: Boolean, an indicator of whether the reference no longer exists in the last available revision,
"ins" type: List of Integers, list of revisions where the reference was inserted (includes the first revision given in "first_rev_id"),
"ins_editor" type: List of Strings, list of user_ids or IP addresses of the editors who inserted the reference (includes the first editor given in "first_editor_id"),
"del" type: List of Integers, list of revisions where the reference was deleted from the article or modified so that fewer than 25% of its tokens remained,
"del_editor" type: List of Strings, list of user_ids or IP addresses of the editors who deleted the reference or modified it so that fewer than 25% of its tokens remained,
"modif" type: List of Integers, list of revisions where the reference was modified, or reinserted with modification,
"hashes" type: List of Strings, list of the hash values of all versions of the reference,
"first_rev_time" type: DateTime, the timestamp when the reference was created (the same value is represented in "ins_time" as the first element of the list and in "time" of the first element in the "change_sequence"),
"ins_time" type: List of DateTime, list of timestamps when the reference was inserted or reinserted,
"del_time" type: List of DateTime, list of timestamps when the reference was deleted,
"change_sequence" type: List of dictionaries with information about the tokens, editors, and revisions where the reference was modified (the first element representing the first revision where the reference was inserted), where:
"hash_id" type: String, the hash value of the token_id (WikiWho1) list of this version of the reference,
"rev_id" type: Integer, the revision number of this particular version of the reference,
"editor_id" type: String, user_id or IP address of the revision's editor,
"time" type: DateTime, the timestamp of this particular version of the reference,
"tokens" type: List of Strings, ordered list of tokens (created by WikiWho1) that represents this particular version of the reference (the list has the same length as "token_editors"),
"token_editors" type: List of Strings, ordered list of user_ids or IP addresses of the editors who first added the corresponding token (see "tokens") to the Wikipedia article.
1 WikiWho is a text mining algorithm to extract changes to tokens from Wikipedia revisions. Each token is assigned a unique ID. More information: https://www.wikiwho.net/#technical_details
GitHub Repository with Python example code on how to process data and extract document identifiers: https://github.com/gesiscss/wikipedia_references
To run the code on GESIS Notebooks, follow this link: https://notebooks.gesis.org/binder/v2/gh/gesiscss/wikipedia_references/master
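For orientation, the snippet below is a minimal sketch, assuming the layout described above: one <article_id>.json file per article in a local directory (here hypothetically named wikipedia_references), each file containing a list of reference dictionaries with the documented keys. It is not the authors' code; see the linked GitHub repository for that.

```python
# A minimal sketch, assuming one <article_id>.json file per article, each
# containing a list of reference dictionaries with the keys documented above.
# This is not the authors' code; see the linked GitHub repository for that.
import json
from pathlib import Path

def iter_references(data_dir):
    """Yield (article_id, reference) pairs for every reference in every article file."""
    for path in Path(data_dir).glob("*.json"):
        with path.open(encoding="utf-8") as fh:
            for reference in json.load(fh):
                yield path.stem, reference

def latest_text(reference):
    """Rebuild the most recent text of a reference from its change sequence."""
    if reference["deleted"]:          # assumed: True means the reference is gone
        return None
    last_version = reference["change_sequence"][-1]
    return " ".join(last_version["tokens"])

if __name__ == "__main__":
    for article_id, ref in iter_references("wikipedia_references"):
        text = latest_text(ref)
        if text is not None:
            print(article_id, ref["first_rev_id"], text[:80])
```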
Most publicly available football (soccer) statistics are limited to aggregated data such as goals, shots, fouls, and cards. When assessing performance or building predictive models, this simple aggregation, without any context, can be misleading. For example, a team that produced 10 shots on target from long range has a lower chance of scoring than a club that produced the same number of shots from inside the box; however, metrics derived from a simple count of shots would assess the two teams identically.
A football game generates many more events, and it is both important and interesting to take into account the context in which those events were generated. This dataset should keep sports analytics enthusiasts awake for long hours, as the number of questions that can be asked is huge.
This dataset is the result of a painstaking effort of web scraping and integrating different data sources. The central element is the text commentary. All of the events were derived by reverse-engineering the text commentary using regular expressions. In this way I was able to derive 11 types of events, as well as the main and secondary player involved in each event and many other statistics. In case I've missed extracting some useful information, you are welcome to do so and share your findings. The dataset provides a granular view of 9,074 games, totaling 941,009 events, from the biggest five European football (soccer) leagues: England, Spain, Germany, Italy, and France, from the 2011/2012 season to the 2016/2017 season as of 25.01.2017. There are games played during these seasons for which I could not collect detailed data; overall, more than 90% of the games played during these seasons have event data.
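To give a flavour of that reverse-engineering step, the sketch below extracts one hypothetical event type (a goal, with its main and secondary player) from a commentary line using regular expressions. Both the commentary wording and the patterns are illustrative guesses, not the ones actually used to build the dataset.

```python
# Illustrative only: deriving a "goal" event (main player, team, assisting player)
# from a text-commentary line with regular expressions. Both the commentary
# wording and the patterns are hypothetical, not those used for the dataset.
import re

GOAL_RE = re.compile(r"Goal!.*?\.\s+(?P<player>[A-Z][A-Za-z' -]+)\s\((?P<team>[A-Za-z' -]+)\)")
ASSIST_RE = re.compile(r"Assisted by\s+(?P<assist>[A-Z][A-Za-z' -]+)")

commentary = (
    "Goal! Arsenal 1, Chelsea 0. Alexis Sanchez (Arsenal) right footed shot "
    "from the centre of the box. Assisted by Mesut Ozil."
)

goal = GOAL_RE.search(commentary)
assist = ASSIST_RE.search(commentary)

if goal:
    event = {
        "event_type": "goal",
        "player": goal.group("player"),                          # main player
        "team": goal.group("team"),
        "player2": assist.group("assist") if assist else None,   # secondary player
    }
    print(event)
    # {'event_type': 'goal', 'player': 'Alexis Sanchez', 'team': 'Arsenal', 'player2': 'Mesut Ozil'}
```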
The dataset is organized in 3 files:
I have used this data to:
There are tons of interesting questions a sports enthusiast can answer with this dataset. For example:
And many many more...
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
The dataset comprises a quarterly compilation of orders from a hypothetical restaurant specializing in diverse international cuisines. Each entry includes the order date and time and the item ordered, along with details of that item: its category (type of cuisine), name, and price.
| Table | Field | Description |
|---|---|---|
| order_details | order_details_id | Unique ID of an item in an order |
| order_details | order_id | ID of an order |
| order_details | order_date | Date an order was put in (MM/DD/YY) |
| order_details | order_time | Time an order was put in (HH:MM:SS AM/PM) |
| order_details | item_id | Matches the menu_item_id in the menu_items table |
| menu_items | menu_item_id | Unique ID of a menu item |
| menu_items | item_name | Name of a menu item |
| menu_items | category | Category or type of cuisine of the menu item |
| menu_items | price | Price of the menu item (US Dollars $) |
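As a quick usage sketch, the snippet below joins the two tables on item_id = menu_item_id and totals revenue by cuisine category. It assumes the tables are exported as order_details.csv and menu_items.csv; those file names are illustrative and not given by the dataset documentation.

```python
# A minimal sketch: join order_details to menu_items on item_id = menu_item_id
# and total revenue per cuisine category. File names are assumed, not given
# by the dataset documentation.
import pandas as pd

order_details = pd.read_csv("order_details.csv")
menu_items = pd.read_csv("menu_items.csv")

orders = order_details.merge(
    menu_items, left_on="item_id", right_on="menu_item_id", how="left"
)

revenue_by_category = (
    orders.groupby("category")["price"]
    .sum()
    .sort_values(ascending=False)
)
print(revenue_by_category)
```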
Reference:
Maven Analytics. (n.d.). Maven Analytics | Data analytics online training for Excel, Power BI, SQL, Tableau, Python and more. [online] Available at: https://mavenanalytics.io [Accessed 6 Dec. 2023].