License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided ZIP archive contains an XML file "main-database-description.xml" with the description of all tables (VIEWS) that are exposed publicly at the PLBD server (https://plbd.org/). In the XML file, all columns of the visible tables are described, specifying their SQL types, measurement units, semantics, calculation formulae, SQL statements that can be used to generate values in these columns, and publications of the formulae derivations.
The XML file conforms to the published XSD schema created for describing relational databases that hold specifications of scientific measurement data. The XSD schema ("relational-database_v2.0.0-rc.18.xsd") and all included sub-schemas are provided in the same archive for convenience. All XSD schemas are validated against the "XMLSchema.xsd" schema from the W3C consortium.
The ZIP file contains an excerpt of the files hosted at https://plbd.org/ at the moment of submission of the PLBD database to the Scientific Data journal, and is provided to conform to the journal's policies. The current data and schemas should be fetched from the published URIs:
https://plbd.org/
https://plbd.org/doc/db/schemas
https://plbd.org/doc/xml/schemas
The software used to generate the SQL schemas and RestfulDB metadata, as well as the RestfulDB middleware that allows publishing the databases generated from the XML description on the Web, is available in public Subversion repositories:
svn://www.crystallography.net/solsa-database-scripts
svn://saulius-grazulis.lt/restfuldb
Unpacking the ZIP file creates a "db/" directory with the tree layout given below. In addition to the database description file "main-database-description.xml", all XSD schemas necessary for validation of the XML file are provided. On a GNU/Linux operating system with the GNU Make package installed, the validity of the XML file can be checked by unpacking the ZIP file, entering the unpacked directory, and running 'make distclean; make'. For example, on a Linux Mint distribution, the following commands should work:
unzip main-database-description.zip
cd db/release/v0.10.0/tables/
sh -x dependencies/Linuxmint-20.1/install.sh
make distclean
make
If necessary, additional packages can be installed using the 'install.sh' script in the 'dependencies/' subdirectory corresponding to your operating system. At the time of writing, Debian 10 and Linux Mint 20.1 are supported out of the box; similar OSes might work with the same 'install.sh' scripts. The installation scripts need to run the package installation command with system administrator privileges, but they use only the standard system package manager, so they should not put your system at risk. For validation and syntax checking, the 'rxp' and 'xmllint' programs are used.
The log files provided in the "outputs/validation" subdirectory contain validation logs obtained on the system where the XML files were last checked and should indicate the validity of the provided XML file against the referenced schemas; a standalone Python validation sketch follows the directory tree below.
db/
└── release
└── v0.10.0
└── tables
├── Makeconfig-validate-xml
├── Makefile
├── Makelocal-validate-xml
├── dependencies
├── main-database-description.xml
├── outputs
└── schema
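For readers who prefer to check the file outside the provided Makefile, the following is a minimal Python sketch using lxml. It assumes it is run from the unpacked db/release/v0.10.0/tables/ directory and that the top-level schema sits directly under schema/ with the file name quoted above; the exact location inside schema/ is an assumption and may need adjusting.

```python
# Minimal sketch: validate the PLBD description XML against its XSD with lxml.
# Assumes the working directory is db/release/v0.10.0/tables/ and that the
# top-level schema is schema/relational-database_v2.0.0-rc.18.xsd (path
# assumed from the tree layout above; adjust if the layout differs).
from lxml import etree

schema_doc = etree.parse("schema/relational-database_v2.0.0-rc.18.xsd")
schema = etree.XMLSchema(schema_doc)

doc = etree.parse("main-database-description.xml")
if schema.validate(doc):
    print("main-database-description.xml is valid")
else:
    for error in schema.error_log:
        print(error)
```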
Data usage policy: https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
The Lung Image Database Consortium image collection (LIDC-IDRI) consists of diagnostic and lung cancer screening thoracic computed tomography (CT) scans with marked-up annotated lesions. It is a web-accessible international resource for development, training, and evaluation of computer-assisted diagnostic (CAD) methods for lung cancer detection and diagnosis. Initiated by the National Cancer Institute (NCI), further advanced by the Foundation for the National Institutes of Health (FNIH), and accompanied by the Food and Drug Administration (FDA) through active participation, this public-private partnership demonstrates the success of a consortium founded on a consensus-based process.
Seven academic centers and eight medical imaging companies collaborated to create this data set, which contains 1018 cases. Each subject includes images from a clinical thoracic CT scan and an associated XML file that records the results of a two-phase image annotation process performed by four experienced thoracic radiologists. In the initial blinded-read phase, each radiologist independently reviewed each CT scan and marked lesions belonging to one of three categories ("nodule ≥ 3 mm," "nodule < 3 mm," and "non-nodule ≥ 3 mm"). In the subsequent unblinded-read phase, each radiologist independently reviewed their own marks along with the anonymized marks of the three other radiologists to render a final opinion. The goal of this process was to identify as completely as possible all lung nodules in each CT scan without requiring forced consensus.
Note: The TCIA team strongly encourages users to review pylidc and the standardized DICOM representation of the TCIA LIDC-IDRI annotations/segmentations (DICOM-LIDC-IDRI-Nodules) included in this dataset before developing custom tools to analyze the XML version.
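For a quick look at the raw XML annotations themselves, here is a rough standard-library sketch. The element names (readingSession, unblindedReadNodule) are assumptions based on the published LIDC XML schema, and the file name is hypothetical; for serious work, pylidc or the DICOM representation mentioned above is the recommended route.

```python
# Rough sketch: count per-reader nodule marks in one LIDC-IDRI XML annotation
# file using only the standard library. Element names are assumptions based on
# the LIDC XML schema; verify them against an actual annotation file.
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}noduleID' -> 'noduleID'."""
    return tag.rsplit("}", 1)[-1]

tree = ET.parse("069.xml")  # hypothetical annotation file name
for session in tree.iter():
    if local_name(session.tag) != "readingSession":
        continue
    nodules = [el for el in session.iter()
               if local_name(el.tag) == "unblindedReadNodule"]
    print(f"reader session with {len(nodules)} unblinded nodule marks")
```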
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Read the XML data on Baltimore restaurants from here: https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml
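A hedged sketch of one way to read that file in Python follows; the nested <zipcode> elements are an assumption about the Socrata-style export and should be verified against the downloaded file.

```python
# Hedged sketch: fetch the Baltimore restaurants XML and count entries per ZIP
# code. The <zipcode> element name is an assumption about this export's layout;
# inspect the file first if the structure differs.
import urllib.request
import xml.etree.ElementTree as ET
from collections import Counter

URL = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"

with urllib.request.urlopen(URL) as response:
    root = ET.parse(response).getroot()

zip_counts = Counter(el.text for el in root.iter("zipcode"))
print(zip_counts.most_common(5))
```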
Terms and conditions: https://dataful.in/terms-and-conditions
The dataset contains year- and company-wise compiled data on the total paid-up capital, turnover, contribution of exports to total turnover, and net worth of the top 1000 companies listed by market capitalization. Additional details covered in the dataset include the Corporate Identification Number (CIN) of each company, whether it is listed on the National or Bombay Stock Exchange or not, and whether Corporate Social Responsibility (CSR) is applicable or not.
Note: We observed certain discrepancies in the units reported by a few companies in their BRSR submissions, specifically between the PDF files and the corresponding XML files. On Dataful, we have retrieved the units mentioned in the PDF files to correct the XML data wherever possible and are working to make the dataset as accurate as possible. In some cases, data from PDF files of two different years for the same company did not match. In such instances, the data from the latest year has been considered. However, we recommend verifying the information directly with SEBI and/or the respective companies if you notice any such discrepancies.
The DBLP computer science bibliography contains the metadata of over 1.8 million publications, written by over 1 million authors in several thousand journals or conference proceedings series.
Although DBLP started with a focus on database systems and logic programming (hence the acronym), it has grown to cover all disciplines of computer science.
The resources list the full dump of the DBLP XML records (see http://dblp.uni-trier.de/xml/); a simple DTD is available.
The paper "DBLP - Some Lessons Learned" documents technical details of this XML file. In the appendix "DBLP XML Requests" you may find the description of a primitive DBLP API.
As of 2011-12-09 this data is open (released under ODC-By). See the license information in the Readme.txt and the announcement post: http://openbiblio.net/2011/12/09/dblp-releases-its-1-8-million-bibliographic-records-as-open-data/
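Because the dump is several gigabytes of XML, it is usually processed as a stream. The following is a hedged sketch using lxml's iterparse; it assumes dblp.xml and its dblp.dtd have been downloaded into the working directory from the URL above (the DTD is needed to resolve the named character entities used in the dump).

```python
# Sketch for streaming the (multi-gigabyte) dblp.xml dump without loading it
# fully into memory. File names assume dblp.xml and dblp.dtd sit in the
# working directory, as distributed at http://dblp.uni-trier.de/xml/.
from collections import Counter
from lxml import etree

RECORD_TAGS = {"article", "inproceedings", "proceedings", "book",
               "incollection", "phdthesis", "mastersthesis", "www"}

counts = Counter()
for _, elem in etree.iterparse("dblp.xml", load_dtd=True, resolve_entities=True):
    if elem.tag in RECORD_TAGS:
        counts[elem.tag] += 1
        elem.clear()  # free memory for records that are already counted

print(counts.most_common())
```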
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Metaclusters obtained from the DPCfam clustering of UniRef50, v. 2017_07. Metaclusters represent putative protein families automatically derived using the DPCfam method, as described in "Unsupervised protein family classification by Density Peak clustering" (Russo ET, PhD thesis, 2020, http://hdl.handle.net/20.500.11767/116345; supervisors: Alessandro Laio, Marco Punta).
Visit also https://dpcfam.areasciencepark.it/ to easily navigate the data.
VERSION 1.1 changes:
Added the DPCfamB database, including all small metaclusters with 25<=N<50 seed sequences. DPCfamB files are named with the prefix B_.
Added an AlphaFold representative (based on AlphaFoldDB) for each MC.
FILES DESCRIPTION (a small archive-inspection sketch in Python follows this list):
1) Standard DPCfam database
metaclusters_xml.tar.gz: Metaclusters' seeds, unaligned, in an XML table. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported. Metacluster entries also include some statistical information about each MC (such as size, average length, low-complexity fraction, etc.) and a Pfam comparison (Dominant Architecture). A README file describing the data, a parser to transform the XML data into space-separated tables, and the XML schema are included.
metaclusters_msas.tar.gz: Metaclusters' multiple sequence alignments, in FASTA format. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported.
metaclusters_hmms.tar.gz: Metaclusters' profile HMMs, one ".hmm" file per metacluster. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported.
all_metaclusters_hmm.tar.gz: Collective metaclusters' profile HMM, a single .hmm file collecting all MCs' profile HMMs. Only MCs whose seeds have 1) more than 50 elements and 2) an average length larger than 50 amino acids are reported.
uniref50_annotated.xml.gz: UniRef50 v. 2017_07 database annotated with Pfam families and DPCfam metaclusters. A README file describing the data, a parser to transform the XML data into space-separated tables, and the XML schema are included. The XML schema is derived from UniProt's UniRef50 XML schema.
2) DPCfamB database
B_metaclusters_xml.tar.gz: Metaclusters' seeds, unaligned, in an XML table. All metaclusters are listed. Metacluster entries also include some statistical information about each MC (such as size, average length, low-complexity fraction, etc.) and a Pfam comparison (Dominant Architecture). A README file describing the data, a parser to transform the XML data into space-separated tables, and the XML schema are included.
B_metaclusters_msas.tar.gz: Metaclusters' multiple sequence alignments, in FASTA format. Only MCs with seeds with 1) 25<=N<50 elements and 2) an average length larger than 50 amino acids are reported.
B_metaclusters_hmms.tar.gz: Metaclusters' profile HMMs, one ".hmm" file per metacluster. Only MCs with seeds with 1) 25<=N<50 elements and 2) an average length larger than 50 amino acids are reported.
B_all_metaclusters_hmm.tar.gz: Collective metaclusters' profile HMM, a single .hmm file collecting all MCs' profile HMMs. Only MCs with seeds with 1) 25<=N<50 elements and 2) an average length larger than 50 amino acids are reported.
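As referenced in the FILES DESCRIPTION header, here is a small Python sketch for inspecting one of the gzip-compressed tar archives before unpacking it, for example to locate the bundled README, parser, and XML schema. Only the archive name comes from the list above; the member layout is not documented here.

```python
# Small sketch: list the members of metaclusters_xml.tar.gz without extracting,
# to locate the README, the XML schema and the bundled parser mentioned above.
import tarfile

with tarfile.open("metaclusters_xml.tar.gz", "r:gz") as archive:
    for member in archive.getmembers():
        print(f"{member.size:>12}  {member.name}")
```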
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
datafile.csv
datafile.json
datafile.ods
datafile.xls
The data contains the following features (a spot-check sketch in Python follows this list):
Year (Col.1)
Geographical Area (Col.2)
Reporting area for Land utilisation statistics (Col.3 = Col.4+Col.7+Col.11+Col.14+Col.15)
Forests (Col.4)
Not available for cultivation - Area under non-agricultural uses (Col.5)
Not available for cultivation - Barren and unculturable Land (Col.6)
Not available for cultivation - Total (Col.7 = Col.5+Col.6)
Other uncultivated Land excluding Fallow Land - Permanent pastures & other Grazing Lands (Col.8)
Other uncultivated Land excluding Fallow Land - Land under Misc. tree crops & groves (not incl. in net area sown) (Col.9)
Other uncultivated Land excluding Fallow Land - Culturable waste Land (Col.10)
Other uncultivated Land excluding Fallow Land - Total (Col.11 = Col.8 to Col.10)
Fallow Lands - Fallow Lands other than current fallows (Col.12)
Fallow Lands - Current fallows (Col.13)
Fallow Lands - Total (Col.14 = Col.12+Col.13)
Net area Sown (Col.15)
Total cropped area (Col.16)
Area sown more than once (Col.17 = Col.16-Col.15)
Agricultural Land/Cultivable Land/Culturable Land/Arable Land (Col.18 = Col.9+Col.10+Col.14+Col.15)
Cultivated Land (Col.19 = Col.13+Col.15)
Cropping Intensity (Col.20 = % of Col.16 over Col.15)
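A hedged pandas sketch for spot-checking two of the stated column relationships on datafile.csv; it assumes the CSV columns appear in the Col.1 to Col.20 order listed above, which should be verified against the actual header before use.

```python
# Hedged sketch: spot-check the stated column identities on datafile.csv.
# Column order is assumed to follow Col.1..Col.20 as listed above; check
# df.columns (and the numeric dtypes) before relying on this.
import pandas as pd

df = pd.read_csv("datafile.csv")
cols = df.columns

not_available_total = df[cols[4]] + df[cols[5]]              # Col.5 + Col.6
mismatch = (not_available_total - df[cols[6]]).abs() > 1e-6  # compare with Col.7
print(f"rows violating Col.7 = Col.5 + Col.6: {int(mismatch.sum())}")

cropping_intensity = 100 * df[cols[15]] / df[cols[14]]       # Col.16 over Col.15
print(cropping_intensity.describe())
```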
I am really thankful to the Indian government for publishing these valuable data. Source: https://data.gov.in/
I am inspired by everyone here on Kaggle for the level of their dedication and hard work.
Data usage policy: https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
This dataset consists of CT and PET-CT DICOM images of lung cancer subjects with XML annotation files that indicate tumor location with bounding boxes. The images were retrospectively acquired from patients with suspicion of lung cancer who underwent standard-of-care lung biopsy and PET/CT. Subjects were grouped according to a tissue histopathological diagnosis. Patients with Names/IDs containing the letter 'A' were diagnosed with Adenocarcinoma, 'B' with Small Cell Carcinoma, 'E' with Large Cell Carcinoma, and 'G' with Squamous Cell Carcinoma.
The images were analyzed on the mediastinum (window width, 350 HU; level, 40 HU) and lung (window width, 1,400 HU; level, –700 HU) settings. The reconstructions were made with a 2 mm slice thickness in lung settings. The CT slice interval varies from 0.625 mm to 5 mm. Scanning modes include plain, contrast, and 3D reconstruction.
Before the examination, the patient underwent fasting for at least 6 hours, and the blood glucose of each patient was less than 11 mmol/L. Whole-body emission scans were acquired 60 minutes after the intravenous injection of 18F-FDG (4.44MBq/kg, 0.12mCi/kg), with patients in the supine position in the PET scanner. FDG doses and uptake times were 168.72-468.79MBq (295.8±64.8MBq) and 27-171min (70.4±24.9 minutes), respectively. 18F-FDG with a radiochemical purity of 95% was provided. Patients were allowed to breathe normally during PET and CT acquisitions. Attenuation correction of PET images was performed using CT data with the hybrid segmentation method. Attenuation corrections were performed using a CT protocol (180mAs,120kV,1.0pitch). Each study comprised one CT volume, one PET volume and fused PET and CT images: the CT resolution was 512 × 512 pixels at 1mm × 1mm, the PET resolution was 200 × 200 pixels at 4.07mm × 4.07mm, with a slice thickness and an interslice distance of 1mm. Both volumes were reconstructed with the same number of slices. Three-dimensional (3D) emission and transmission scanning were acquired from the base of the skull to mid femur. The PET images were reconstructed via the TrueX TOF method with a slice thickness of 1mm.
The location of each tumor was annotated by five academic thoracic radiologists with expertise in lung cancer to make this dataset a useful tool and resource for developing algorithms for medical diagnosis. Two of the radiologists had more than 15 years of experience and the others had more than 5 years of experience. After one of the radiologists labeled each subject, the other four radiologists performed a verification, resulting in all five radiologists reviewing each annotation file in the dataset. Annotations were captured using LabelImg. The image annotations are saved as XML files in PASCAL VOC format, which can be parsed using the PASCAL Development Toolkit: https://pypi.org/project/pascal-voc-tools/. Python code to visualize the annotation boxes on top of the DICOM images can be downloaded here.
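A minimal sketch of reading one of these PASCAL VOC XML files with the Python standard library; the annotation file name is hypothetical, while the object/name and bndbox tags are the standard VOC layout referenced above.

```python
# Minimal sketch: read one PASCAL VOC annotation produced by LabelImg and
# print the class label and absolute-pixel bounding box of every object.
import xml.etree.ElementTree as ET

root = ET.parse("A0001_annotation.xml").getroot()  # hypothetical file name
print("image:", root.findtext("filename"))
for obj in root.findall("object"):
    box = obj.find("bndbox")
    xmin, ymin, xmax, ymax = (int(float(box.findtext(t)))
                              for t in ("xmin", "ymin", "xmax", "ymax"))
    print(obj.findtext("name"), (xmin, ymin, xmax, ymax))
```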
Two deep learning researchers used the images and the corresponding annotation files to train several well-known detection models, which resulted in a mean average precision (mAP) of around 0.87 on the validation set.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Pentaho Transformation to Extract Metadata from EPrints XML-Export:
License: https://data.gov.tw/license
The Tourism Bureau of the Ministry of Transportation and Communications collects spatial tourism information released by various government agencies, including data on tourist attractions, activities, dining and accommodations, tourist service stations, trails, and bike lanes, providing comprehensive tourism GIS base data for value-added applications by industry. For the XML field descriptions of each dataset, refer to the Tourism Data Standard V1.0 at https://media.taiwan.net.tw/Upload/TourismInformationStandardFormatV1.0.pdf for data in that standard, and to the Tourism Data Standard V2.0 at https://media.taiwan.net.tw/Upload/TourismDataStandardV2.0.pdf for V2.0 data.
Abstract: Research data, research results, and publications of the PhD thesis entitled "Validation Framework for RDF-based Constraint Languages", submitted to the Department of Economics and Management at the Karlsruhe Institute of Technology (KIT).
Technical remarks:
PhD Thesis
Title: Validation Framework for RDF-based Constraint Languages
Author: Thomas Hartmann
Examination Date: 08.07.2016
University: Karlsruhe Institute of Technology (KIT)
Chair: Institute of Applied Informatics and Formal Description Methods
Department: Department of Economics and Management
1. Advisor: Prof. Dr. York Sure-Vetter, Karlsruhe Institute of Technology
2. Advisor: Prof. Dr. Kai Eckert, Stuttgart Media University
PhD thesis download: http://dx.doi.org/10.5445/IR/1000056458
Publications: complete set of publications: publications
Research data, research results, and publications: link to the KIT research data repository: http://dx.doi.org/10.5445/BWDD/11
RDF Validation Requirements Database: http://purl.org/net/rdf-validation
Validation environment: demo: http://purl.org/net/rdfval-demo; source code: software/rdf-validator
Chapter 2: Foundations for RDF Validation. XML validation: chapter/chapter-2/xml-validation
Chapter 3: Vocabularies for Representing Research Data and Related Metadata. RDF vocabularies commonly used to represent different types of research data and related metadata: chapter/chapter-3/common-vocabularies; complete running example in RDF: chapter/chapter-3/common-vocabularies/running-example
Chapter 4: RDFication of XML Enabling to Use RDF Validation Technologies. Evaluation results: chapter/chapter-4/evaluation
Chapter 6: Consistent Validation across RDF-based Constraint Languages. Constraint language implementations: chapter/chapter-6/constraint-languages-implementations
Chapter 7: Validation Framework for RDF-based Constraint Languages. Formal specification, HTML documentation, and UML class diagram of the RDF Constraints Vocabulary (RDF-CV): chapter/chapter-7/rdf-constraints-vocabulary; generic SPIN mappings for constraint types: chapter/chapter-7/generic-SPIN-mappings/RDF-CV-2-SPIN.ttl
Chapter 8: The Role of Reasoning for RDF Validation. Implementations for all constraint types expressible by OWL 2 QL, OWL 2 DL, and DSP as well as for major constraint types representable by ReSh and ShEx: chapter/chapter-8/constraint-types-implementations; implementation of reasoning capabilities for all reasoning constraint types for which OWL 2 QL and OWL 2 DL reasoning may be performed: chapter/chapter-8/reasoning-constraint-types-implementations/OWL2-Reasoning-2-SPIN.ttl; validation and reasoning implementations of constraint types: chapter/chapter-8/constraint-types-implementations
Chapter 9: Evaluating the Usability of Constraint Types for Assessing RDF Data Quality. Implementations of all 115 constraints: chapter/chapter-9/constraints; evaluation results for each QB data set grouped by SPARQL endpoint: chapter/chapter-9/evaluation/data-sets/QB; vocabulary implementations: chapter/chapter-9/vocabularies/implementations
Appendix: http://dx.doi.org/10.5445/IR/1000054062
License: Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset includes two XML files and a ZIP folder:
* Arrets_Netex.xml, which describes data from the Île-de-France Mobilités stop repository in NeTEx format.
* Lignes_Netex.xml, which describes data from the Île-de-France Mobilités line repository in NeTEx format.
* offer_Netex.zip, which describes the data of the theoretical offer of Île-de-France Mobilités in NeTEx format.
Note that the offer_Netex.zip folder also contains files listing stops (arrets.xml) and lines (lignes.xml). The data source for these files is the same as for Arrets_Netex.xml and Lignes_Netex.xml (the Île-de-France Mobilités repositories). However, arrets.xml and lignes.xml have the following specificities:
* Only the objects used in the theoretical offer are present in these files.
* The structure of the files is slightly different from that of the files coming directly from the repositories.
The data available in this dataset is extracted from the web services of the repositories. The web services will be opened later via the PRIM portal of Île-de-France Mobilités.
Documentation
Documentation relating to the Île-de-France Mobilités transport repositories is available:
* documentation on the repositories
* documentation describing the structure of Arrets_Netex.xml
* documentation describing the structure of Lignes_Netex.xml
* documentation describing the structure of offer_Netex.zip
Focus NeTEx
NeTEx (Network Exchange) is a reference format, defined at the European level, for exchanging theoretical public transport offer data.
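A hedged Python sketch for opening offer_Netex.zip and listing a few stop names; the NeTEx namespace and the StopPlace/Name element names are assumptions based on the NeTEx standard rather than on the Île-de-France Mobilités files, and the "arrets" file-name match should be adjusted after inspecting the archive.

```python
# Hedged sketch: list the XML members of offer_Netex.zip and print a few stop
# names. Element names and namespace are assumptions based on the NeTEx
# standard; verify them against the actual files.
import zipfile
import xml.etree.ElementTree as ET

NS = {"netex": "http://www.netex.org.uk/netex"}

with zipfile.ZipFile("offer_Netex.zip") as archive:
    xml_members = [n for n in archive.namelist() if n.endswith(".xml")]
    print(xml_members[:10])

    stops_member = next(n for n in xml_members if "arrets" in n.lower())
    with archive.open(stops_member) as fh:
        root = ET.parse(fh).getroot()

    for i, stop in enumerate(root.iter("{http://www.netex.org.uk/netex}StopPlace")):
        if i >= 5:
            break
        print(stop.findtext("netex:Name", namespaces=NS))
```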
License: https://data.gov.tw/license
The Ministry of Transportation and Communications' Tourism Bureau collects spatial tourism information released by various government agencies, including data on tourist attractions, activities, food and lodging, tourist service stations, trails, and bike paths, providing comprehensive tourism GIS basic data for value-added applications by businesses. The XML field descriptions for each data set, version 1.0 tourism data standard, can be found at https://media.taiwan.net.tw/Upload/TourismInformationStandardFormatV1.0.pdf; and version 2.0 tourism data standard at https://media.taiwan.net.tw/Upload/TourismDataStandardV2.0.pdf.
License: ODbL 1.0, http://vvlibri.org/fr/licence/odbl-10/legalcode/unofficial
This dataset includes two XML files and a ZIP folder
The data available in this dataset is extracted from the web services of the repositories. The web services will be opened later via the PRIM portal of Île-de-France Mobilités.
Documentation
Documentation relating to the transport standards of Île-de-France Mobilités is available.
Focus NeTEx
NeTEx (Network Exchange) is a reference format for exchanging theoretical public transport offer data, defined at the European level.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The LIST corpus extraction tool is a Java program for extracting lists from text corpora on the levels of characters, word parts, words, and word sets. It supports VERT and TEI P5 XML formats and outputs .CSV files that can be imported into Microsoft Excel or similar statistical processing software.
Version 1.3 adds support for the KOST 2.0 Slovene Learner Corpus (http://hdl.handle.net/11356/1887) in XML format. It also allows program execution using the command line (see 00README.txt for details), and uses a later version of Java (tested using JDK 21). In addition, Windows users no longer need to have Java installed on their computers to run the program.
Sentences and citation contexts identified from the PubMed Central open access articles
The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019. The dataset is created as described in: Hsiao T. K., & Torvik V. I. (manuscript) OpCitance: Citation contexts identified from the PubMed Central open access articles.
Files:
• A_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with A.
• B_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with B.
• C_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with C.
• D_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with D.
• E_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with E.
• F_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with F.
• G_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with G.
• H_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with H.
• I_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with I.
• J_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with J.
• K_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with K.
• L_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with L.
• M_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with M.
• N_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with N.
• O_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with O.
• P_p1_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 1).
• P_p2_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 2).
• Q_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with Q.
• R_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with R.
• S_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with S.
• T_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with T.
• UV_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with U or V.
• W_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with W.
• XYZ_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with X, Y or Z.
Each row in a file is a sentence/citation context and contains the following columns:
• pmcid: PMCID of the article.
• pmid: PMID of the article. If an article does not have a PMID, the value is NONE.
• location: The article component (abstract, main text, table, figure, etc.) to which the citation context/sentence belongs.
• IMRaD: The type of IMRaD section associated with the citation context/sentence. I, M, R, and D represent introduction/background, method, results, and conclusion/discussion, respectively; NoIMRaD indicates that the section type is not identifiable.
• sentence_id: The ID of the citation context/sentence in the article component.
• total_sentences: The number of sentences in the article component.
• intxt_id: The ID of the citation.
• intxt_pmid: PMID of the citation (as tagged in the XML file). If a citation does not have a PMID tagged in the XML file, the value is "-".
• intxt_pmid_source: The sources where the intxt_pmid can be identified. "xml" represents that the PMID is only identified from the XML file; "xml,pmc" represents that the PMID is not only from the XML file, but also in the citation data collected from the NCBI Entrez Programming Utilities. If a citation does not have an intxt_pmid, the value is "-".
• intxt_mark: The citation marker associated with the inline citation.
• best_id: The best source link ID (e.g., PMID) of the citation.
• best_source: The sources that confirm the best ID.
• best_id_diff: The comparison result between the best_id column and the intxt_pmid column.
• citation: A citation context. If no citation is found in a sentence, the value is the sentence.
• progression: Text progression of the citation context/sentence.
Supplementary Files
• PMC-OA-patci.tsv.gz – This file contains the best source link IDs for the references (e.g., PMID). Patci [1] was used to identify the best source link IDs. The best source link IDs are mapped to the citation contexts and displayed in the *_journal_IntxtCit.tsv files as the best_id column. Each row in the PMC-OA-patci.tsv.gz file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
  • pmcid: PMCID of the citing article.
  • pos: The citation's position in the reference list.
  • fromPMID: PMID of the citing article.
  • toPMID: Source link ID (e.g., PMID) of the citation. This ID is identified by Patci.
  • SRC: The sources that confirm the toPMID.
  • MatchDB: The origin bibliographic database of the toPMID.
  • Probability: The match probability of the toPMID.
  • toPMID2: PMID of the citation (as tagged in the XML file).
  • SRC2: The sources that confirm the toPMID2.
  • intxt_id: The ID of the citation.
  • journal: The first letter of the journal title. This maps to the *_journal_IntxtCit.tsv files.
  • same_ref_string: Whether the citation string appears in the reference list more than once.
  • DIFF: The comparison result between the toPMID column and the toPMID2 column.
  • bestID: The best source link ID (e.g., PMID) of the citation.
  • bestSRC: The sources that confirm the best ID.
  • Match: Matching result produced by Patci.
[1] Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885
• Supplementary_File_1.zip – This file contains the code for generating the dataset.
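A hedged pandas sketch for streaming one of the *_journal_IntxtCit.tsv files; it assumes the files carry a header row with the column names listed above (if they do not, pass the names via the names= argument) and simply tabulates rows per IMRaD section.

```python
# Hedged sketch: stream one of the *_journal_IntxtCit.tsv files in chunks and
# count sentences per IMRaD section. Assumes a header row matching the column
# list above; otherwise supply names= explicitly.
import pandas as pd

counts = {}
for chunk in pd.read_csv("A_journal_IntxtCit.tsv", sep="\t", dtype=str,
                         chunksize=500_000):
    for section, n in chunk["IMRaD"].value_counts().items():
        counts[section] = counts.get(section, 0) + int(n)

print(counts)
```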
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 10,000 synthetic images and corresponding bounding box labels for training object detection models to detect Khmer words.
The dataset is generated using a custom tool designed to create diverse and realistic training data for computer vision tasks, especially where real annotated data is scarce.
/
├── synthetic_images/ # Synthetic images (.png)
├── synthetic_labels/ # YOLO format labels (.txt)
├── synthetic_xml_labels/ # Pascal VOC format labels (.xml)
Each image has corresponding .txt and .xml files with the same filename.
YOLO Format (.txt):
Each line represents a word, with format:
class_id center_x center_y width height
All values are normalized between 0 and 1.
Example:
0 0.235 0.051 0.144 0.081
Pascal VOC Format (.xml):
Standard XML structure containing image metadata and bounding box coordinates (absolute pixel values).
Example:
```xml
<!-- Illustrative structure only; actual file names, class names and values
     vary per image. -->
<annotation>
  <filename>synthetic_00001.png</filename>
  <size>
    <width>1280</width>
    <height>720</height>
    <depth>3</depth>
  </size>
  <object>
    <name>khmer_word</name>
    <bndbox>
      <xmin>215</xmin>
      <ymin>11</ymin>
      <xmax>399</xmax>
      <ymax>69</ymax>
    </bndbox>
  </object>
</annotation>
```
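To relate the two label formats, here is a short sketch that converts one YOLO line (using the example above) into the absolute-pixel box stored in the Pascal VOC XML; the image dimensions are illustrative only.

```python
# Small sketch linking the two label formats: convert a YOLO line
# (class_id, normalized center x/y, width, height) into the absolute-pixel
# (xmin, ymin, xmax, ymax) box used by the Pascal VOC XML.
def yolo_to_voc(line, img_width, img_height):
    class_id, cx, cy, w, h = line.split()
    cx, cy, w, h = (float(v) for v in (cx, cy, w, h))
    xmin = int((cx - w / 2) * img_width)
    ymin = int((cy - h / 2) * img_height)
    xmax = int((cx + w / 2) * img_width)
    ymax = int((cy + h / 2) * img_height)
    return int(class_id), (xmin, ymin, xmax, ymax)

# Example using the YOLO line shown above and illustrative image dimensions.
print(yolo_to_voc("0 0.235 0.051 0.144 0.081", 1280, 720))
```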
Each image contains random Khmer words placed naturally over backgrounds, with different font styles, sizes, and visual effects.
The dataset was carefully generated to simulate real-world challenges like:
We plan to release:
Stay tuned!
This project is licensed under the MIT license.
Please credit the original authors when using this data and provide a link to this dataset.
If you have any questions or want to collaborate, feel free to reach out:
Net Primary Production (NPP) is an important component of the carbon cycle and, among the pools and fluxes that make up the cycle, it is one of the steps that are most accessible to field measurement. Direct measurement of NPP is not practical for large areas and so models are generally used to study the carbon cycle at a global scale. This data set contains 2 *.zip files for above ground and total NPP data.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The INSPIRE WFS Download Service for the theme Hydrography (HY) is a service that allows registered users to repeatedly download data using WFS 2.0.0 technology. The Download Service provides harmonized data for the INSPIRE theme Hydrography (HY), application schema Hydro-Network, corresponding to the INSPIRE XML schema version 4.0. Data are provided in GML 3.2.1 format and in the coordinate reference system ETRS89 / TM33, designated for INSPIRE for displaying large-scale datasets. This dataset of hydrography of the Czech Republic therefore has a unified design with other data created for this INSPIRE theme across the whole of Europe. The basis of the dataset is the Fundamental Base of Geographic Data of the Czech Republic (ZABAGED®). The service meets the Technical Guidance for the implementation of INSPIRE Download Services, version 3.1, and also the OGC standard for WFS 2.0.0.
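A hedged sketch of a WFS 2.0.0 GetFeature request against such a download service; the endpoint URL and the typeNames value below are placeholders (take the real ones from the service's GetCapabilities response), while the key-value parameters themselves are standard WFS 2.0.0.

```python
# Hedged sketch of a WFS 2.0.0 GetFeature request. The endpoint and the
# feature type name are hypothetical placeholders; only the KVP parameter
# names follow the WFS 2.0.0 standard.
import urllib.parse
import urllib.request

ENDPOINT = "https://example.cuzk.cz/inspire-hy-wfs"  # hypothetical endpoint
params = {
    "service": "WFS",
    "version": "2.0.0",
    "request": "GetFeature",
    "typeNames": "hy-n:WatercourseLink",  # assumed INSPIRE Hydro-Network type
    "count": "10",
}

url = ENDPOINT + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as response:
    gml = response.read()
print(gml[:500])
```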
Various species have been tracked using ARGOS PTT trackers since the early 1990s. These include Emperor, King and Adelie penguins; Light-mantled Sooty, Grey-headed and Black-browed albatrosses; Antarctic and Australian fur seals; Southern Elephant Seals; and Blue and Humpback whales. Note that not all data for any species or location is, or will be, exposed to OBIS. Geographic coverage is from Heard Island in the west to Macquarie Island in the east, plus several islands near the southern end of Chile. The data has been filtered to remove most, but not all, erroneous positions.
DiGIR is an engine which takes XML requests for data and returns a data subset stored as XML data (as defined in a schema). For more DiGIR information, see http://digir.sourceforge.net/ , http://diveintodigir.ecoforge.net/draft/digirdive.html , and http://digir.net/prov/prov_manual.html . A list of Digir providers is at http://bigdig.ecoforge.net/wiki/SchemaStatus .
Darwin is the original schema for use with the DiGIR engine.
The Ocean Biogeographic Information System (OBIS) schema extends Darwin. For more OBIS info, see http://www.iobis.org . See the OBIS schema at http://www.iobis.org/tech/provider/questions .
Queries: Although OBIS datasets have many variables, most variables have few values. The only queries that are likely to succeed MUST include a constraint for Genus= and MAY include constraints for Species=, longitude, latitude, and time.
Most OBIS datasets return a maximum of 1000 rows of data per request. The limitation is imposed by the OBIS administrators.
Available Genera (and number of records): (error)
cdm_data_type=Point
citation=See the following metadata records: http://data.aad.gov.au/aadc/metadata/metadata_redirect.cfm?md=AMD/AU/DB_Argos_PTT_Tracking http://data.aad.gov.au/aadc/metadata/metadata_redirect.cfm?md=AMD/AU/HI_animaltracks_ARGOS http://data.aad.gov.au/aadc/metadata/metadata_redirect.cfm?md=AMD/AU/Tracking_BI http://data.aad.gov.au/aadc/metadata/metadata_redirect.cfm?md=AMD/AU/Tracking_Mag http://data.aad.gov.au/aadc/metadata/metadata_redirect.cfm?md=AMD/AU/STA_Bibliography http://data.aad.gov.au/aadc/metadata/metadata_redirect.cfm?md=AMD/AU/Tracking_SI http://data.aad.gov.au/aadc/metadata/metadata_redirect.cfm?md=AMD/AU/Tracking_EDP http://data.aad.gov.au/aadc/metadata/metadata_redirect.cfm?md=AMD/AU/Tracking_DD Contact the Data Centre for help on citation details.
Conventions=COARDS, CF-1.6, ACDD-1.3
Easternmost_Easting=180.0
featureType=Point
geospatial_lat_max=90.0
geospatial_lat_min=-90.0
geospatial_lat_units=degrees_north
geospatial_lon_max=180.0
geospatial_lon_min=-180.0
geospatial_lon_units=degrees_east
geospatial_vertical_positive=up
geospatial_vertical_units=m
infoUrl=http://data.aad.gov.au/
institution=AADC
Northernmost_Northing=90.0
sourceUrl=http://aadc-maps.aad.gov.au/digir/digir.php
Southernmost_Northing=-90.0
standard_name_vocabulary=CF Standard Name Table v55
Westernmost_Easting=-180.0