License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset contains all the citation data (in CSV format) included in POCI, released on 27 December 2022. In particular, each line of the CSV file defines a citation, and includes the following information:
[field "oci"] the Open Citation Identifier (OCI) for the citation; [field "citing"] the PMID of the citing entity; [field "cited"] the PMID of the cited entity; [field "creation"] the creation date of the citation (i.e. the publication date of the citing entity); [field "timespan"] the time span of the citation (i.e. the interval between the publication date of the cited entity and the publication date of the citing entity); [field "journal_sc"] it records whether the citation is a journal self-citations (i.e. the citing and the cited entities are published in the same journal); [field "author_sc"] it records whether the citation is an author self-citation (i.e. the citing and the cited entities have at least one author in common).
This version of the dataset contains:
717,654,703 citations; 26,024,862 bibliographic resources.
The size of the zipped archive is 9.6 GB, while the size of the unzipped CSV file is 50 GB. Additional information about POCI is available at the official webpage.
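As a quick illustration, the CSV can be processed line by line without loading 50 GB into memory. A minimal Python sketch follows; the file name is hypothetical, and the "yes"/"no" encoding of the self-citation flags is an assumption:

```python
import csv

# Stream the unzipped CSV row by row, counting journal self-citations.
# Column names follow the field list above; "yes"/"no" values are assumed.
journal_self_citations = 0
with open("poci_citations.csv", newline="") as fh:  # hypothetical file name
    for row in csv.DictReader(fh):
        if row["journal_sc"] == "yes":
            journal_self_citations += 1

print(journal_self_citations)
```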
License: https://creativecommons.org/publicdomain/zero/1.0/
Explore a rich research dataset with 5.2M papers and 36.6M citations! Unleash your data science skills for clustering, influence analysis, topic modeling, and more. Dive into the world of research networks.
| Field Name | Field Type | Description | Example |
|---|---|---|---|
| id | string | paper ID | 53e997ddb7602d9701fd3ad7 |
| title | string | paper title | Rewrite-Based Satisfiability Procedures for Recursive Data Structures |
| authors.name | string | author name | Maria Paola Bonacina |
| author.org | string | author affiliation | Dipartimento di Informatica |
| author.id | string | author ID | 53f47275dabfaee43ed25965 |
| venue.raw | string | paper venue name | Electronic Notes in Theoretical Computer Science(ENTCS) |
| year | int | published year | 2007 |
| keywords | list of strings | keywords | ["theorem-proving strategy", "rewrite-based approach", ...] |
| fos.name | string | paper fields of study | Data structure |
| fos.w | float | fields of study weight | 0.48341 |
| references | list of strings | paper references | ["53e9a31fb7602d9702c2c61e", "53e997f1b7602d9701fef4d1", ...] |
| n_citation | int | citation number | 19 |
| page_start | string | page start | 55 |
| page_end | string | page end | 70 |
| doc_type | string | paper type: journal, conference | Journal |
| lang | string | detected language | en |
| volume | string | volume | 174 |
| issue | string | issue | 8 |
| issn | string | issn | Electronic Notes in Theoretical Computer Science |
| isbn | string | isbn | |
| doi | string | doi | 10.1016/j.entcs.2006.11.039 |
| url | list | external links | [https: ...] |
| abstract | string | abstract | Our ability to generate ... |
| indexed_abstract | dict | indexed abstract | {"IndexLength": 116, "InvertedIndex": {"data": [49], ...} |
| v12_id | int | v12 paper id | 2027211529 |
| v12_authors.name | string | v12 author name | Maria Paola Bonacina |
| v12_authors.org | string | v12 author affiliation | Dipartimento di Informatica,Università degli Studi di Verona,Italy#TAB# |
| v12_authors.id | int | v12 author ID | 669130765 |
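The dotted field names (e.g. authors.name, fos.w) suggest nested JSON records. As a rough sketch of extracting the citation graph, assuming a line-delimited JSON dump and a hypothetical file name:

```python
import json

# Collect citation edges (citing paper id -> referenced paper id).
# The file name and one-record-per-line layout are assumptions.
edges = []
with open("papers.jsonl") as fh:
    for line in fh:
        paper = json.loads(line)
        for ref in paper.get("references", []):
            edges.append((paper["id"], ref))

print(len(edges), "citation edges")
```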
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Self-citation analysis data based on PubMed Central subset (2002-2005)

Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

## Introduction

This dataset was created as part of the publication: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:

* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection
* results_all_model.tar.gz - Model coefficient and result files in NumPy format used for plotting purposes; v4.reviewer contains models for analysis done after reviewer comments
* README.txt

## Dataset creation

Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should request this data directly from Clarivate Analytics. However, we do make available a similar but open dataset based on citations from PubMed Central, which can be used to obtain results similar to those reported in our analysis. Furthermore, we have freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of them, please make sure you cite both the dataset and the paper introducing it.

* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* Citation data from PubMed Central (the original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
  - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
  - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1-29. https://doi.org/10.1145/1552303.1552304
  - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140-158. https://doi.org/10.1002/asi.20105
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
  - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
  - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
  - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
* MapAffil for identifying article country of affiliation:
  - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
  - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine. 2015;21(11-12):10.1045/november2015-torvik
* IMPLICIT journal similarity:
  - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
* Novelty dataset for identifying article-level novelty:
  - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
  - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine. 2016;22(9-10):10.1045/september2016-mishra
  - Code: https://github.com/napsternxg/Novelty
* Expertise dataset for identifying author expertise on articles
* Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information on obtaining PubMed/MEDLINE, and NLM's data Terms and Conditions. Additional data-related updates can be found at the Torvik Research Group.

## Acknowledgments

This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License

Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
License: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/LXQXAO
Extracting and parsing reference strings from research articles is a challenging task. State-of-the-art tools like GROBID apply rather simple machine learning models such as conditional random fields (CRF). Recent research has shown a high potential of deep learning for reference string parsing. The challenge with deep learning, however, is that the training step requires enormous amounts of labeled data, which do not exist for reference string parsing. Creating such a large dataset manually, through human labor, seems hardly feasible. Therefore, we created GIANT, a large dataset with 991,411,100 XML-labeled reference strings. The strings were automatically created based on 677,000 entries from CrossRef, 1,500 citation styles in the Citation Style Language, and the citation processor citeproc-js. GIANT can be used to train machine learning models, particularly deep learning models, for citation parsing. While we have not yet tested GIANT for training such models, we hypothesise that the dataset will significantly improve the accuracy of citation parsing. The dataset and the code to create it are freely available at https://github.com/BeelGroup/.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
A CSV dataset containing the number of references of each bibliographic entity identified by an OMID in the OpenCitations Index (https://opencitations.net/index). The dataset is based on the last release of the OpenCitations Index (https://opencitations.net/download), November 2023. The size of the zipped archive is 0.35 GB, while the size of the unzipped CSV file is 1.7 GB. The CSV dataset contains the reference count of 71,805,806 bibliographic entities. The first column (omid) lists the entities, while the second column (references) indicates the corresponding number of references (outgoing citations).
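A minimal pandas sketch for inspecting the two columns described above (the file name is hypothetical):

```python
import pandas as pd

# Load the omid/references columns and inspect the distribution
# of reference counts.
df = pd.read_csv("reference_counts.csv")  # hypothetical file name
print(df["references"].describe())
print(df.nlargest(10, "references"))  # entities with the most references
```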
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Data file for the second release of the Data Citation Corpus, produced by DataCite and Make Data Count as part of an ongoing grant project funded by the Wellcome Trust. Read more about the project.
The data file includes 5,256,114 data citation records in JSON and CSV formats. The JSON file is the version of record.
For convenience, the data is provided in batches of approximately 1 million records each. The publication date and batch number are included in the file name, ex: 2024-08-23-data-citation-corpus-01-v2.0.json.
The data citations in the file originate from DataCite Event Data and from a project by the Chan Zuckerberg Initiative (CZI) to identify mentions of datasets in the full text of articles.
Each data citation record comprises:
A pair of identifiers: An identifier for the dataset (a DOI or an accession number) and the DOI of the publication (journal article or preprint) in which the dataset is cited
Metadata for the cited dataset and for the citing publication
The data file includes the following fields:
| Field | Description | Required? |
|---|---|---|
| id | Internal identifier for the citation | Yes |
| created | Date of item's incorporation into the corpus | Yes |
| updated | Date of item's most recent update in corpus | Yes |
| repository | Repository where cited data is stored | No |
| publisher | Publisher for the article citing the data | No |
| journal | Journal for the article citing the data | No |
| title | Title of cited data | No |
| publication | DOI of article where data is cited | Yes |
| dataset | DOI or accession number of cited data | Yes |
| publishedDate | Date when citing article was published | No |
| source | Source where citation was harvested | Yes |
| subjects | Subject information for cited data | No |
| affiliations | Affiliation information for creator of cited data | No |
| funders | Funding information for cited data | No |
Additional documentation about the citations and metadata in the file is available on the Make Data Count website.
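As an example, a record's cited dataset and citing publication can be read straight from one JSON batch. A minimal sketch, where the batch file name follows the naming convention above and the top-level JSON array layout is an assumption:

```python
import json

# Load one batch of the corpus and print citation pairs.
# Field names ("dataset", "publication") follow the table above;
# a top-level JSON array is assumed.
with open("2024-08-23-data-citation-corpus-01-v2.0.json") as fh:
    records = json.load(fh)

for record in records[:5]:
    print(record["dataset"], "is cited in", record["publication"])
```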
The second release of the Data Citation Corpus data file reflects several changes made to add new citations, remove some records deemed out of scope for the corpus, update and enhance citation metadata, and improve the overall usability of the file. These changes are as follows:
Add and update Event Data citations:
Add 179,885 new data citations created in DataCite Event Data from 01 June 2023 through 30 June 2024
Remove citation records deemed out of scope for the corpus:
273,567 records from DataCite Event Data with non-citation relationship types
28,334 citations to items in non-data repositories (clinical trials registries, stem cells, samples, and other non-data materials)
44,117 invalid citations where subj_id value was the same as the obj_id value or subj_id and obj_id are inverted, indicating a citation from a dataset to a publication
473,792 citations to invalid accession numbers from CZI data present in v1.1 as a result of false positives in the algorithm used to identify mentions
4,110,019 duplicate records from CZI data present in v1.1 where metadata is the same for obj_id, subj_id, repository_id, publisher_id, journal_id, accession_number, and source_id (the record with the most recent updated date was retained in all of these cases)
Metadata enhancements:
Apply Field of Science subject terms to citation records originating from CZI, based on disciplinary area of data repository
Initial cleanup of affiliation and funder organization names to remove personal email addresses and social media handles (additional cleanup and standardization in progress and will be included in future releases)
Data structure updates to improve usability and eliminate redundancies:
Rename subj_id and obj_id fields to “dataset” and “publication” for clarity
Remove accessionNumber and doi elements to eliminate redundancy with subj_id
Remove relationTypeId fields as these are specific to Event Data only
Full details of the above changes, including scripts used to perform the above tasks, are available in GitHub.
While v2 addresses a number of cleanup and enhancement tasks, additional data issues may remain, and additional enhancements are being explored. These will be addressed in the course of subsequent data file releases.
Feedback on the data file can be submitted via GitHub. For general questions, email info@makedatacount.org.
This is a searchable historical collection of standards referenced in regulations: voluntary consensus standards, government-unique standards, industry standards, and international standards referenced in the Code of Federal Regulations (CFR).
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
950 algorithmically restored underwater images, used for image enhancement, image generation, etc.
The application is responsible for maintaining a series of tables containing data for fiscal and processing years, located on the mainframe and data warehouse servers.
The dataset includes sensory data from the following sensors:
- Triaxial acceleration force (in m/s^2) from the mobile phone accelerometer (30 Hz) and the E4 accelerometer (32 Hz)
- Triaxial rate of rotation (in rad/s) and degrees of rotation (in degrees) from the mobile phone gyroscope (30 Hz)
- Triaxial geomagnetic field strength (in μT) from the mobile phone magnetometer (30 Hz)
- Latitude, longitude, and horizontal accuracy 1) (in meters) from the mobile phone GPS (every 5 seconds)
- Blood volume pulse (in nanowatts) from the E4 photoplethysmography (PPG) sensor (64 Hz)
- Electrodermal activity (skin conductance, in μS) from the E4 EDA sensor (4 Hz)
- Average heart rate 2) values (in bpm) computed over a 10-second span based on BVP analysis from the E4 (1 Hz)
- Peripheral skin temperature (in degrees Celsius) from the E4 infrared thermopile (4 Hz)
1) The estimated horizontal accuracy is defined as the radius of 68% confidence according to the API. Reference: https://developer.android.com/reference/android/location/Location#getAccuracy()
2) HR values are not derived from a real-time reading but are created after the data collection session. Reference: https://support.empatica.com/hc/en-us/articles/360029469772-E4-data-HR-csv-explanation
The dataset includes files in a structure shown below:
+----- USER-ID
|  +----- timestamp (DAY 1)
|  |  +----- e4Acc
|  |  |  timestamp (e4-accelerometer-data).csv
|  |  |  ...
|  |  +----- e4Bvp
|  |  |  timestamp (e4-blood-volume-pressure-data).csv
|  |  |  ...
|  |  +----- e4Eda
|  |  |  timestamp (e4-electrodermal-activity-data).csv
|  |  |  ...
|  |  +----- e4Hr
|  |  |  timestamp (e4-heart-rate-data).csv
|  |  |  ...
|  |  +----- e4Temp
|  |  |  timestamp (e4-skin-temperature-data).csv
|  |  |  ...
|  |  +----- mAcc
|  |  |  timestamp (mobile-accelerometer-data).csv
|  |  |  ...
|  |  +----- mGps
|  |  |  timestamp (mobile-gps-data).csv
|  |  |  ...
|  |  +----- mGyr
|  |  |  timestamp (mobile-gyroscope-data).csv
|  |  |  ...
|  |  +----- mMag
|  |  |  timestamp (mobile-magnetometer-data).csv
|  |  |  ...
|  |  timestamp-label.csv
|  +----- timestamp (DAY 2)
|  |  +----- ...
Directories (named by timestamps) under the USER-ID directory indicate when the user started the experiment each day. Each day directory contains sub-directories named by the corresponding sensors, each of which includes data files generated every minute. Each data file records raw sensor values at the designated sampling interval, together with a timestamp (represented in second.millisecond format).
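A small Python sketch of how this layout can be traversed (the user directory name is hypothetical):

```python
import os

# Walk one user's directory: one timestamp directory per day,
# one sub-directory per sensor, one CSV file per minute.
user_dir = "USER-ID"  # hypothetical path
for day in sorted(os.listdir(user_dir)):
    day_dir = os.path.join(user_dir, day)
    if not os.path.isdir(day_dir):
        continue
    for sensor in sorted(os.listdir(day_dir)):
        sensor_dir = os.path.join(day_dir, sensor)
        if os.path.isdir(sensor_dir):
            print(day, sensor, len(os.listdir(sensor_dir)), "files")
```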
User label files are composed of 12 columns representing the physical, emotional, and contextual states as follows:
- ts: timestamp
- action: sleep, personal_care, work, study, household, care_housemem (caregiving), recreation_media, entertainment, outdoor_act (sports), hobby, recreation_etc (free time), shop, communitiy_interaction (regular activity), travel (includes commute), meal (includes snack), socialising
- actionOption: Details of the selected action. See the description below.
- actionSub: meal-amount when action=meal or snack, move-method when action=travel
- actionSubOption: 1 (light), 2 (moderate), 3 (heavy) when actionSub=meal_amount, 1 (walk), 2 (driving), 3 (taxi, passenger), 4 (personal mobility), 5 (bus), 6 (train, subway), 7 (others) when actionSub=move-method
- condition: ALONE, WITH-ONE, WITH-MANY
- conditionSub1Option: 1 (with families), 2 (with friends), 3 (with colleagues), 4 (acquaintances), 5 (others)
- conditionSub2Option: 1 (passive in conversation), 2 (moderate participation in conversation), 3 (active in conversation)
- place: home, workplace, restaurant, outdoor, other-indoor
- emotionPositive : (negative) 1-2-3-4-5-6-7 (positive)
- emotionTension: (relaxed) 1-2-3-4-5-6-7 (aroused)
- activity 3): 0 (IN-VEHICLE), 1 (ON-BICYCLE), 2 (ON-FOOT), 3 (STILL), 4 (UNKNOWN), 5 (TILTING), 7 (WALKING), 8 (RUNNING)
3) Values in the activity column represent the detected activity of the mobile device using Google's Awareness API. Reference: https://developers.google.com/android/reference/com/google/android/gms/location/DetectedActivity?hl=en
Descriptions for the actionOption field are as follows (a label-loading sketch follows the list):
111 Sleep
112 Sleepless
121 Meal
122 Snack
131 Medical services, treatments, sick rest
132 Personal hygiene (bath)
133 Appearance management (makeup, change of clothes)
134 Beauty-related services
211 Main job
212 Side job
213 Rest during work
22 Job search
311 School class / seminar (listening)
312 Break between classes
313 School homework, self-study (individual)
314 Team project (in groups)
321 Private tutoring (offline)
322 Online courses
41 Preparing food and washing dishes
42 Laundry and ironing
43 Housing management and cleaning
44 Vehicle management
45 Pet and plant caring
46 Purchasing goods and services (grocery/take-out)
51 Caring for children under 10 who live together
52 Caring for elementary, middle, and high school students over 10 who ...
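To connect these labels with the sensor files, the per-day label file can be loaded as a table. A minimal pandas sketch; the comma-separated layout is an assumption, and column names follow the list above:

```python
import pandas as pd

# Load one day's label file (12 columns described above);
# a comma-separated layout is assumed.
labels = pd.read_csv("timestamp-label.csv")

# Distribution of reported actions and average emotion ratings.
print(labels["action"].value_counts())
print(labels[["emotionPositive", "emotionTension"]].mean())
```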
The ARPA-E funded TERRA-REF project generated open-access reference datasets for the study of plant sensing, genomics, and phenomics. Sensor data were generated by a field scanner sensing platform that captures color, thermal, hyperspectral, and active fluorescence imagery as well as three-dimensional structure and associated environmental measurements. This dataset is provided alongside data collected using traditional field methods in order to support calibration and validation of algorithms used to extract plot-level phenotypes from these datasets. Data were collected at the University of Arizona Maricopa Agricultural Center in Maricopa, Arizona. This site hosts a large field scanner with fifteen sensors, many of which are capable of capturing mm-scale images and point clouds at daily to weekly intervals. These data are intended to be reused and are accessible as a combination of files and databases linked by spatial, temporal, and genomic information. In addition to providing open access data, the entire computational pipeline is open source, and we enable users to access high-performance computing environments. The study has evaluated a sorghum diversity panel, biparental cross populations, and elite lines and hybrids from structured sorghum breeding populations. In addition, a durum wheat diversity panel was grown and evaluated over three winter seasons. The initial release includes derived data from two seasons in which the sorghum diversity panel was evaluated. Future releases will include data from additional seasons and locations. The TERRA-REF reference dataset can be used to characterize phenotype-to-genotype associations, on a genomic scale, that will enable knowledge-driven breeding and the development of higher-yielding cultivars of sorghum and wheat. The data is also being used to develop new algorithms for machine learning, image analysis, genomics, and optical sensor engineering.
Resources in this dataset: Link to dataset at Datadryad.org: https://datadryad.org/stash/dataset/doi:10.5061/dryad.4b8gtht99
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
We use journal articles published in Scientometrics in 2016-2020 as the data source. By analyzing records of dataset usage in scientometrics research, we rank the datasets by frequency of use, providing a reference for selecting datasets for scientometrics research.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.
The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in English Wikipedia, output as JSON files (compressed in tar.gz).
We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.
Noteworthy included fields:
- name - title of the article.
- identifier - ID of the article.
- image - main image representing the article's subject.
- description - one-sentence description of the article for quick reference.
- abstract - lead section, summarizing what the article is about.
- infoboxes - parsed information from the side panel (infobox) of the Wikipedia article.
- sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables, and references or similar non-prose sections.
The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
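As a rough sketch, the JSON files inside one compressed snapshot part can be streamed without unpacking to disk. The archive name and one-JSON-object-per-line layout are assumptions; the name and description fields follow the list above:

```python
import json
import tarfile

# Peek at the first article record in one compressed snapshot part.
# The file name and line-delimited JSON layout are assumptions.
with tarfile.open("enwiki-people-infoboxes.tar.gz", "r:gz") as tar:
    for member in tar:
        fh = tar.extractfile(member)
        if fh is None:  # skip directory entries
            continue
        article = json.loads(fh.readline())
        print(article.get("name"), "-", article.get("description"))
        break
```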
Infoboxes only - compressed: 2 GB; uncompressed: 11 GB
Infoboxes + sections + short description - compressed: 4.12 GB; uncompressed: 21.28 GB
Article analysis and filtering breakdown:
- Total # of articles analyzed: 6,940,949
- # of people found with QID: 1,778,226
- # of people found with Category: 158,996
- # of people found with Biography Project: 76,150
- Total # of people articles found: 2,013,372
- Total # of people articles with infoboxes: 1,559,985

End stats:
- Total number of people articles in this dataset: 1,559,985
- that have a short description: 1,416,701
- that have an infobox: 1,559,985
- that have article sections: 1,559,921
This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.
This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024. The information in this dataset may therefore be out of date. This dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.
The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).
Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia (https://en.wikipedia.org/), written by the community.
Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...
License: https://eidc.ac.uk/licences/ogl/plain
The Reference Observatory of Basins for INternational hydrological climate change detection (ROBIN) dataset is a global hydrological dataset containing publicly available daily flow data for 2,386 gauging stations across the globe which have natural or near-natural catchments. Metadata is also provided for the Full ROBIN Dataset, consisting of 3,060 gauging stations. Data were quality controlled by the central ROBIN team before being added to the dataset, and two levels of data quality are applied to guide users towards appropriate data usage. Most records have at least 40 years of data with minimal missing values, with records starting in the late 19th century for some sites and running through to 2022. ROBIN represents a significant advance in global-scale, accessible streamflow data. The project was funded by the UK Natural Environment Research Council Global Partnership Seedcorn Fund (NE/W004038/1) and the NC-International programme (NE/X006247/1) delivering National Capability.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset information
The arXiv HEP-PH (high energy physics phenomenology) citation graph is from the e-print arXiv and covers all the citations within a dataset of 34,546 papers with 421,578 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this.
The data covers papers in the period from January 1993 to April 2003 (124
months). It begins within a few months of the inception of the arXiv, and thus
represents essentially the complete history of its HEP-PH section.
The data was originally released as a part of 2003 KDD Cup.
Dataset statistics
Nodes 34546
Edges 421578
Nodes in largest WCC 34401 (0.996)
Edges in largest WCC 421485 (1.000)
Nodes in largest SCC 12711 (0.368)
Edges in largest SCC 139981 (0.332)
Average clustering coefficient 0.2962
Number of triangles 1276868
Fraction of closed triangles 0.1457
Diameter (longest shortest path) 12
90-percentile effective diameter 5
Source (citation)
J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.
J. Gehrke, P. Ginsparg, J. M. Kleinberg. Overview of the 2003 KDD Cup. SIGKDD
Explorations 5(2): 149-151, 2003.
Files
File Description
cit-HepPh.txt.gz Paper citation network of Arxiv High Energy Physics category
cit-HepPh-dates.txt.gz Time of nodes (paper submission time to Arxiv)
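As a quick-start sketch, the edge list can be loaded into a directed graph. SNAP edge lists are whitespace-separated "FromNodeId ToNodeId" pairs with '#' comment lines; networkx is one choice among many:

```python
import gzip

import networkx as nx

# Build the directed citation graph from the SNAP edge list.
G = nx.DiGraph()
with gzip.open("cit-HepPh.txt.gz", "rt") as fh:
    for line in fh:
        if line.startswith("#"):  # skip header/comment lines
            continue
        citing, cited = line.split()
        G.add_edge(int(citing), int(cited))

print(G.number_of_nodes(), G.number_of_edges())  # expected: 34546 421578
```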
Dataset information
Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print
arXiv and covers all the citations within a dataset of 27,770 papers with
352,807 edges. If a paper i cites paper j, the graph contains a directed edge
from i to j. If a paper cites, or is cited by, a paper outside the dataset, the
graph does not contain any information about this.
The data covers papers in the period from January 1993 to April 2003 (124
months). It begins within a few months of the inception of the arXiv, and thus
represents essentially the complete history of its HEP-TH section.
The data was originally released as a part of 2003 KDD Cup.
Dataset statistics
Nodes 27770
Edges 352807
Nodes in largest WCC 27400 (0.987) ...
The Mithilss/Per-Citation-Dataset dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
License: https://www.nist.gov/open/license
This portal is your gateway to documented digital forensic image datasets. These datasets can assist in a variety of tasks, including tool testing, developing familiarity with tool behavior for given tasks, general practitioner training, and other unforeseen uses that the user of the datasets can devise. Most datasets have a description of the type and locations of significant artifacts present in the dataset. There are descriptions and finding aids to help you locate datasets by the year produced, by author, or by attributes of the dataset.
This dataset provides the NIST Software Assurance Metrics And Tool Evaluation (SAMATE) Software Assurance Reference Dataset (SARD): a set of programs with known security flaws. It allows end users to evaluate tools, and tool developers to test their methods.
License: U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
The data are 475 thematic land cover rasters at 2 m resolution. Land cover was classified into the classes: Tree (1), Water (2), Barren (3), Other Vegetation (4), and Ice & Snow (8). Cloud cover and shadow were sometimes coded as Cloud (5) and Shadow (6); for any land cover application these should be considered NoData. Some rasters may have Cloud and Shadow pixels coded or recoded to NoData already. Commercial high-resolution satellite data were used to create the classifications. Usable image data for the target year (2010) were acquired for 475 of the 500 primary sample locations, with 90% of images acquired within ±2 years of the 2010 target. The remaining 25 of the 500 sample blocks had no usable data and so could not be mapped. Tabular data is included with the raster classifications indicating the specific high-resolution sensor and date of acquisition for source imagery, as well as the stratum to which each sample block belonged. Methods for this classifi ...
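Where Cloud (5) and Shadow (6) pixels are still coded, they can be recoded to NoData before analysis. A sketch using rasterio; the file name and the NoData value 255 are assumptions:

```python
import numpy as np
import rasterio

# Recode Cloud (5) and Shadow (6) pixels to NoData for land cover use.
with rasterio.open("sample_block_2010.tif") as src:  # hypothetical file name
    data = src.read(1)
    profile = src.profile

nodata = 255  # assumed NoData value
data = np.where(np.isin(data, [5, 6]), nodata, data)

profile.update(nodata=nodata)
with rasterio.open("sample_block_2010_clean.tif", "w", **profile) as dst:
    dst.write(data.astype(profile["dtype"]), 1)
```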
As a usage example, an image and its ground-truth annotation can be read with OpenCV:

```python
import cv2

# Read an original image
original_image = cv2.imread('Original image/IMG-001.png')
# Read the corresponding ground truth image
ground_truth_image = cv2.imread('Ground truth/GT-001.png', cv2.IMREAD_GRAYSCALE)
```

When performing model training with deep learning frameworks (such as TensorFlow or PyTorch), the dataset path can be configured in the framework's dataset-loading class, following its data loading mechanism, to ensure that the model correctly reads and processes the images and their annotation data.
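Building on that, a minimal PyTorch Dataset sketch; the directory layout follows the example above, and pairing images with masks by sorted file name is an assumption:

```python
import os

import cv2
from torch.utils.data import Dataset

class SegmentationDataset(Dataset):
    """Pairs each original image with its ground-truth mask."""

    def __init__(self, image_dir="Original image", mask_dir="Ground truth"):
        self.image_dir = image_dir
        self.mask_dir = mask_dir
        # Pairing by sorted file name is an assumption.
        self.images = sorted(os.listdir(image_dir))
        self.masks = sorted(os.listdir(mask_dir))

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = cv2.imread(os.path.join(self.image_dir, self.images[idx]))
        mask = cv2.imread(os.path.join(self.mask_dir, self.masks[idx]),
                          cv2.IMREAD_GRAYSCALE)
        return image, mask
```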
References: [1] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, Harald Kittler, Allan Halpern: “Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)”, 2018; https://arxiv.org/abs/1902.03368
[2] Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 doi:10.1038/sdata.2018.161 (2018).
[3] Classification of Brain Hemorrhage Using Deep Learning from CT Scan Images - https://link.springer.com/chapter/10.1007/978-981-19-7528-8_15