Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator (c-score). Separate data are shown for career-long and, separately, for single-recent-year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given, and data on retracted papers (based on the Retraction Watch database) as well as citations to/from retracted papers have been added. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2024, and single-recent-year data pertain to citations received during calendar year 2024. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (7) is based on the August 1, 2025 snapshot from Scopus, updated to the end of citation year 2024. This work uses Scopus data; calculations were performed using all Scopus author profiles as of August 1, 2025. If an author is not on the list, it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work. PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US.
They should be sent directly to Scopus, preferably by using the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/), so that the correct data can be used in any future annual updates of the citation indicator databases. The c-score focuses on impact (citations) rather than productivity (number of publications), and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, see the attached FREQUENTLY ASKED QUESTIONS file. Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For further reading, we refer to the Leiden Manifesto: https://www.nature.com/articles/520429a
Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
License information was derived automatically
Higher Education Institutions in Poland Dataset
This repository contains a dataset of higher education institutions in Poland. The dataset comprises 131 public higher education institutions and 216 private higher education institutions in Poland. The data was collected on 24/11/2022.
This dataset was compiled in response to a cybersecurity investigation of Poland's higher education institutions' websites [1]. The data is being made publicly available to promote open science principles [2].
Data
The data includes the following fields for each institution:
Methodology
The dataset was compiled using data from two primary sources:
For the international names in English, the following methodology was employed:
Both Polish and English names were retained for each institution, since some universities do not have English names available in official sources.
English names were primarily sourced from:
In instances where English names were not readily available from the aforementioned sources, the GPT-3.5 model was employed to propose suitable names. These proposed names are distinctly marked in blue within the dataset file (hei_poland_en.xls).
Usage
This data is available under the Creative Commons Zero (CC0) license and can be used for academic research purposes. We encourage the sharing of knowledge and the advancement of research in this field by adhering to open science principles [2].
If you use this data in your research, please cite the source and include a link to this repository. To properly attribute this data, please use the following DOI:
10.5281/zenodo.8333573
Contribution
If you have any updates or corrections to the data, please feel free to open a pull request or contact us directly. Let's work together to keep this data accurate and up-to-date.
Acknowledgment
We would like to express our gratitude to the Ministry of Education and Science of Poland and the RAD-on system for providing the information used in this dataset.
We would like to acknowledge the support of the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within the project "Cybers SeC IP" (NORTE-01-0145-FEDER-000044). This study was also developed as part of the Master in Cybersecurity Program at the Polytechnic University of Viana do Castelo, Portugal.
References
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains a dataset of 400 higher education institutions in Germany, including universities, universities of applied sciences, and higher institutes (such as higher institutes of engineering, higher institutes of biotechnology, and a few others). This dataset was compiled in response to a cybersecurity investigation of German higher education institutions' websites [1]. The data is being made publicly available to promote open science principles [2].
The data includes the following fields for each institution:
The methodology for creating the dataset involved obtaining data from two sources: the European Higher Education Sector Observatory (ETER) [3], from which data was collected on December 26, 2024, and Eurostat's NUTS (Nomenclature of Territorial Units for Statistics) classifications for 2013-2016 [4] and 2021 [5].
This section outlines the methodology used to create the dataset for Higher Education Institutions (HEIs) in Germany. The dataset consolidates information from various sources, processes the data, and enriches it to provide accurate and reliable insights.
Data Sources
eter-export-2021-DE.xlsx
NUTS2013-NUTS2016.xlsx
NUTS2021.xlsx
Data Cleaning and Preprocessing
Column Renaming: columns in the raw dataset were renamed for consistency and readability. Examples include:
ETER ID → ETER_ID
Institution Name → Name
Legal status → Category
Value Replacement: the Category column was cleaned, with government-dependent institutions classified as "public."
Handling Missing or Incorrect Data: website entries were corrected by ETER_ID. For instance:
DE0012 (updated to www.zeppelin-university.com)
FR0906 (updated to hmtm.de)
FR0104 (updated to www.dhfpg.de)
FR0466 (updated to fhf.brandenburg.de)
FR0907 (updated to hr-nord.niedersachsen.de)
FR0333 (updated to www.srh-university.de)
Regional Data Integration
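The renaming, value replacement, and website corrections described above can be sketched in pandas. The column names, the "public" reclassification, and the DE0012 fix come from this description; the input rows and the website column name are illustrative assumptions, not the actual ETER export.

```python
import pandas as pd

# Illustrative input standing in for eter-export-2021-DE.xlsx.
raw = pd.DataFrame({
    "ETER ID": ["DE0012", "DE0001"],
    "Institution Name": ["Zeppelin University", "Example University"],
    "Legal status": ["Government dependent private", "Public"],
    "Website": ["old.example.org", "www.example-uni.de"],
})

# Column renaming for consistency and readability.
df = raw.rename(columns={
    "ETER ID": "ETER_ID",
    "Institution Name": "Name",
    "Legal status": "Category",
})

# Government-dependent institutions are classified as "public".
df["Category"] = df["Category"].replace({"Government dependent private": "public"})

# Manual website corrections keyed by ETER_ID (example from the text).
fixes = {"DE0012": "www.zeppelin-university.com"}
df.loc[df["ETER_ID"].isin(fixes), "Website"] = df["ETER_ID"].map(fixes)

# Final dataset saved as a UTF-8 CSV.
df.to_csv("germany-heis.csv", index=False, encoding="utf-8")
```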
Final Dataset: the final dataset was saved as a CSV file, germany-heis.csv, encoded in UTF-8 for compatibility. It includes detailed information about HEIs in Germany, their categories, regional affiliations, and membership in European alliances.
Summary: this methodology ensures that the dataset is accurate, consistent, and enriched with valuable regional and institutional details. The final dataset is intended to serve as a reliable resource for analyzing German HEIs.
This data is available under the Creative Commons Zero (CC0) license and can be used for any purpose, including academic research purposes. We encourage the sharing of knowledge and the advancement of research in this field by adhering to open science principles [2].
If you use this data in your research, please cite the source and include a link to this repository. To properly attribute this data, please use the following DOI: 10.5281/zenodo.7614862
If you have any updates or corrections to the data, please feel free to open a pull request or contact us directly. Let's work together to keep this data accurate and up-to-date.
We would like to acknowledge the support of the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within the project "Cybers SeC IP" (NORTE-01-0145-FEDER-000044). This study was also developed as part of the Master in Cybersecurity Program at the Instituto Politécnico de Viana do Castelo, Portugal.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This repository contains a dataset of higher education institutions in the United States of America. This dataset was compiled in response to a cybersecurity investigation of American higher education institutions' websites [1]. The data is being made publicly available to promote open science principles [2].
The data includes the following fields for each institution:
The dataset was obtained from the Integrated Postsecondary Education Data System (IPEDS) website [3], which is administered by the National Center for Education Statistics (NCES). NCES serves as the primary federal entity for collecting and analyzing education-related data in the United States. The data was collected on February 2, 2023.
The initial list of institutions was derived from the IPEDS database using the following criteria: (1) US institutions only; (2) degree-granting institutions, primarily bachelor's or higher; and (3) institutional sector, which includes: public 4-year or above, private not-for-profit 4-year or above, private for-profit 4-year or above, public 2-year, private not-for-profit 2-year, private for-profit 2-year, public less-than-2-year, private not-for-profit less-than-2-year, and private for-profit less-than-2-year.
The following variables have been added to the list of institutions: Control of the institution, state abbreviation, degree-granting status, Status of the institution, and Institution's internet website address. This resulted in a report with 1,979 institutions.
The institution's status was labeled with the following values: A (Active), N (New), R (Restored), M (Closed in the current year), C (Combined with another institution), D (Deleted, out of business), I (Inactive due to hurricane-related issues), O (Outside IPEDS scope), P (Potential new/add institution), Q (Potential institution reestablishment), W (Potential addition outside IPEDS scope), X (Potential restoration outside the scope of IPEDS) and G (Perfect Children's Campus).
A filter was applied to the report to retain only institutions with an A, N, or R status, resulting in 1,978 institutions. Finally, a data cleaning process was applied, which involved removing the whitespace at the beginning and end of cell content and duplicate whitespace. The final data were compiled into the dataset included in this repository.
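The whitespace cleanup described above can be sketched as a small helper. This is a hedged illustration of the stated cleaning rule (strip leading/trailing whitespace, collapse duplicate whitespace), not the exact script used.

```python
import re

# Remove whitespace at the beginning and end of cell content
# and collapse runs of duplicate whitespace to a single space.
def clean_cell(value: str) -> str:
    return re.sub(r"\s+", " ", value).strip()

# clean_cell("  Example   University ") -> "Example University"
```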
This data is available under the Creative Commons Zero (CC0) license and can be used for any purpose, including academic research purposes. We encourage the sharing of knowledge and the advancement of research in this field by adhering to open science principles [2].
If you use this data in your research, please cite the source and include a link to this repository. To properly attribute this data, please use the following DOI: 10.5281/zenodo.7614862
If you have any updates or corrections to the data, please feel free to open a pull request or contact us directly. Let's work together to keep this data accurate and up-to-date.
We would like to acknowledge the support of the Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF), within the project "Cybers SeC IP" (NORTE-01-0145-FEDER-000044). This study was also developed as part of the Master in Cybersecurity Program at the Instituto Politécnico de Viana do Castelo, Portugal.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This submission includes publicly available data extracted in its original form. Please reference the Related Publication listed here for source and citation information: TRI basic plus data files guides. (2024, September 18). US EPA. https://www.epa.gov/toxics-release-inventory-tri-program/tri-basic-plus-data-files-guides
If you have questions about the underlying data stored here, please contact tri.help@epa.gov. If you have questions or recommendations related to this metadata entry and extracted data, please contact the CAFE Data Management team at: climatecafe@bu.edu.
"EPA has been collecting Toxics Release Inventory (TRI) data since 1987. The "Basic Plus" data files include ten file types that collectively contain all of the data fields from the TRI Reporting Form R and Form A. The files themselves are in tab-delimited .txt format and then compressed into a .zip file.
1a: Facility, chemical, releases and other waste management summary information
1b: Chemical activities and uses
2a: On- and off-site disposal, treatment, energy recovery, and recycling information; non-production-related waste managed quantities; production/activity ratio information; and source reduction activities
2b: Detailed on-site waste treatment methods and efficiency
3a: Transfers off site for disposal and further waste management
3b: Transfers to Publicly Owned Treatment Works (POTWs) (RY1987 - RY2010)
3c: Transfers to Publicly Owned Treatment Works (POTWs) (RY2011 - Present)
4: Facility information
5: Optional information on source reduction, recycling and pollution control (RY2005 - Present)
6: Additional miscellaneous and optional information (RY2010 - Present)
Quantities of dioxin and dioxin-like compounds are reported in grams, while all other chemicals are reported in pounds. This webpage contains the most recent versions of all TRI data files; facilities may revise previous years' TRI submissions if necessary, and any such changes will be reflected in these files. For this reason, data contained in these files may differ from data used to construct the TRI National Analysis." [Quote from https://www.epa.gov/toxics-release-inventory-tri-program/tri-basic-plus-data-files-calendar-years-1987-present]
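Since each release is tab-delimited text inside a .zip archive, one file can be loaded roughly as follows. The member and column names used here are illustrative assumptions, not the real TRI schema; consult the EPA file guides for the actual layout.

```python
import zipfile

import pandas as pd

# Load one tab-delimited .txt member from a TRI "Basic Plus" .zip file.
def read_tri_member(zip_path: str, member: str) -> pd.DataFrame:
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as fh:
            return pd.read_csv(fh, sep="\t")
```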
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BACKGROUND
An understanding of the resources which engineering students use to write their academic papers provides information about student behaviour as well as the effectiveness of information literacy programs designed for engineering students. One of the most informative sources of information which can be used to determine the nature of the material that students use is the bibliography at the end of the students’ papers. While reference list analysis has been utilised in other disciplines, few studies have focussed on engineering students or used the results to improve the effectiveness of information literacy programs. Gadd, Baldwin and Norris (2010) found that civil engineering students undertaking a final-year research project cited journal articles more than other types of material, followed by books and reports, with web sites ranked fourth. Several studies, however, have shown that in their first year at least, most students prefer to use Internet search engines (Ellis & Salisbury, 2004; Wilkes & Gurney, 2009).

PURPOSE
The aim of this study was to find out exactly what resources undergraduate students studying civil engineering at La Trobe University were using, and in particular, the extent to which students were utilising the scholarly resources paid for by the library. A secondary purpose of the research was to ascertain whether information literacy sessions delivered to those students had any influence on the resources used, and to investigate ways in which the information literacy component of the unit can be improved to encourage students to make better use of the resources purchased by the Library to support their research.

DESIGN/METHOD
The study examined student bibliographies for three civil engineering group projects at the Bendigo Campus of La Trobe University over a two-year period, including two first-year units (CIV1EP – Engineering Practice) and one second-year unit (CIV2GR – Engineering Group Research). All units included a mandatory library session at the start of the project where student groups were required to meet with the relevant faculty librarian for guidance. In each case, the Faculty Librarian highlighted specific resources relevant to the topic, including books, e-books, video recordings, websites and internet documents. The students were also shown tips for searching the Library catalogue, Google Scholar, LibSearch (the LTU Library’s research and discovery tool) and ProQuest Central. Subject-specific databases for civil engineering and science were also referred to. After the final reports for each project had been submitted and assessed, the Faculty Librarian contacted the lecturer responsible for the unit, requesting copies of the student bibliographies for each group. References for each bibliography were then entered into EndNote. The Faculty Librarian grouped them according to various facets, including the name of the unit and the group within the unit; the material type of the item being referenced; and whether the item required a Library subscription to access it. A total of 58 references were collated for the 2010 CIV1EP unit; 237 references for the 2010 CIV2GR unit; and 225 references for the 2011 CIV1EP unit.

INTERIM FINDINGS
The initial findings showed that student bibliographies for the three group projects were primarily made up of freely available internet resources which required no library subscription. For the 2010 CIV1EP unit, all 58 resources used were freely available on the Internet. For the 2011 CIV1EP unit, 28 of the 225 resources used (12.44%) required a Library subscription or purchase for access, while the second-year students (CIV2GR) used a greater variety of resources, with 71 of the 237 resources used (29.96%) requiring a Library subscription or purchase for access. The results suggest that the library sessions had little or no influence on the 2010 CIV1EP group, but the sessions may have assisted students in the 2011 CIV1EP and 2010 CIV2GR groups to find books, journal articles and conference papers, which were all represented in their bibliographies.

FURTHER RESEARCH
The next step in the research is to investigate ways to increase the representation of scholarly references (found by resources other than Google) in student bibliographies. It is anticipated that such a change would lead to an overall improvement in the quality of the student papers. One way of achieving this would be to make it mandatory for students to include a specified number of journal articles, conference papers, or scholarly books in their bibliographies. It is also anticipated that embedding La Trobe University’s Inquiry/Research Quiz (IRQ) using a constructively aligned approach will further enhance the students’ research skills and increase their ability to find suitable scholarly material which relates to their topic. This has already been done successfully (Salisbury, Yager, & Kirkman, 2012).

CONCLUSIONS & CHALLENGES
The study shows that most students rely heavily on the free Internet for information. Students don’t naturally use Library databases or scholarly resources such as Google Scholar to find information without encouragement from their teachers, tutors and/or librarians. It is acknowledged that the use of scholarly resources doesn’t automatically lead to a high-quality paper. Resources must be used appropriately, and students also need to have the skills to identify and synthesise key findings in the existing literature and relate these to their own paper. Ideally, students should be able to see the benefit of using scholarly resources in their papers, and continue to seek these out even when it’s not a specific assessment requirement, though it can’t be assumed that this will be the outcome.

REFERENCES
Ellis, J., & Salisbury, F. (2004). Information literacy milestones: building upon the prior knowledge of first-year students. Australian Library Journal, 53(4), 383-396.
Gadd, E., Baldwin, A., & Norris, M. (2010). The citation behaviour of civil engineering students. Journal of Information Literacy, 4(2), 37-49.
Salisbury, F., Yager, Z., & Kirkman, L. (2012). Embedding Inquiry/Research: Moving from a minimalist model to constructive alignment. Paper presented at the 15th International First Year in Higher Education Conference, Brisbane. Retrieved from http://www.fyhe.com.au/past_papers/papers12/Papers/11A.pdf
Wilkes, J., & Gurney, L. J. (2009). Perceptions and applications of information literacy by first year applied science students. Australian Academic & Research Libraries, 40(3), 159-171.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The database contains several datasets and files with NBA statistical data spanning four seasons (2015-2016 to 2018-2019). These datasets were procured from the Basketball Reference database (https://www.basketball-reference.com/), a publicly accessible source of NBA data.
The main file, `dat.cleaned.csv`, includes the Win/Loss records for all thirty NBA teams, along with box scores and advanced statistics. The data captured over the four seasons correspond to about 4,920 regular-season games. A distinguishing feature of this dataset is the repeated measurements per player within a team across the seasons. However, it's important to note that these repeated measurements are not independent, necessitating the use of hierarchical modelling to properly handle the data.
Two sets of additional text files (`per_2017.txt`, `per_2018.txt`, `rpm_2017.txt`, `rpm_2018.txt`) provide specific metrics for player performance. The 'PER' files contain the Player Efficiency Rating (PER) for the years 2017 and 2018. The 'RPM' files contain the ESPN-developed score called Real Plus-Minus (RPM) for the same years.
However, potential biases or limitations within the datasets should be acknowledged. For instance, the Basketball Reference website might not include data from some matches or may exclude certain variables, potentially affecting the quality and accuracy of the dataset.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Introduction
Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions.
We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity of correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia’s full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars are not included in the HTML.
For more details, please refer to the description below and to the dataset paper:
Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020.
https://arxiv.org/abs/2001.10256
When using the dataset, please cite the above paper.
Dataset summary
The dataset consists of three parts:
Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.
Getting the data
Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7 TB in size, too large for Zenodo, and is therefore hosted externally on the Internet Archive. For downloading part 1, you have multiple options:
Dataset details
Part 1: HTML revision history
The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):
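A directory of Part 1 can be streamed roughly as follows. This is a hedged sketch assuming each gzipped JSON file stores one revision object per line (matching "each row ... represents one article revision" above); the directory name is illustrative, and the field names inside each object follow the original wikitext dump and are not spelled out here.

```python
import gzip
import json
from pathlib import Path

# Yield revision objects from every gzipped JSON file in one
# WikiHist.html directory, in file order (pages and revisions
# are already sorted by id within the dump).
def iter_revisions(directory: str):
    for path in sorted(Path(directory).glob("*.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)
```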
Part 2: Page creation times (page_creation_times.json.gz)
This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:
Part 3: Redirect history (redirect_history.json.gz)
This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:
The repository also contains two additional files, metadata.zip and mysql_database.zip. These two files are not part of WikiHist.html per se, and most users will not need to download them manually. The file metadata.zip is required by the download script (and will be fetched by the script automatically), and mysql_database.zip is required by the code used to produce WikiHist.html. The code that uses these files is hosted at GitHub, but the files are too big for GitHub and are therefore hosted here.
WikiHist.html was produced by parsing the 1 March 2019 dump of https://dumps.wikimedia.org/enwiki/20190301 from wikitext to HTML. That old dump is not available anymore on Wikimedia's servers, so we make a copy available at https://archive.org/details/enwiki-20190301-original-full-history-dump_dlab .
Terms of use: https://www.iaea.org/about/terms-of-use
Transfer parameter data are essential inputs to models for radiological environmental impact assessment and are used to quantify the extent of movement of radionuclides from one environmental compartment to another, relevant for estimating the transfer of radionuclides through food chains to humans. International data compilations (i.e. transfer parameter data for temperate environments from the IAEA Technical Reports Series No. 472) have been frequently used by regulators and professionals in radiological impact assessment for dose estimations when site-specific data are not available.
This international compilation of radionuclide and stable isotope soil-plant concentration ratio values for tropical environments is an output of IAEA’s Modelling and Data for Radiological Impact Assessments II (MODARIA II) programme (2016–2019) and is based on the Köppen-Geiger climate classification (BECK et al. 2018). The IAEA’s MODARIA II tropical dataset is associated with IAEA’s TECDOC-1979: Soil-Plant Transfer of Radionuclides in Non-Temperate Environments (2021).
The dataset contains over 7000 records. Each record includes a concentration ratio value and/or plant and soil concentrations, provided in a consistent way, from which a concentration ratio value can be calculated. Where available, environmentally relevant information is included with each record to allow categorization of the plant and soil data into more refined subsets.
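Where a record reports both plant and soil concentrations, the concentration ratio follows as a simple quotient. This sketch assumes both values are given in matching units (e.g. Bq/kg), which is the usual convention but is not stated per record here.

```python
# Concentration ratio (CR) from plant and soil activity
# concentrations in the same units (e.g. Bq/kg dry mass).
def concentration_ratio(plant_conc: float, soil_conc: float) -> float:
    if soil_conc <= 0:
        raise ValueError("soil concentration must be positive")
    return plant_conc / soil_conc

# concentration_ratio(3.0, 6.0) -> 0.5
```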
The dataset includes information for over 100 plant species, including many that are common crops and staple foods in tropical environments. Data are included for all measured plant compartments, including both the edible and inedible parts of the plant.
Information in the dataset is organized into 41 fields, with individual lines in ascending order of their source reference. These headline fields are described in the associated ‘Explanatory Information’ file, while a description of the dataset content can be found in the ‘Dataset content‘ file.
The IAEA’s MODARIA II tropical dataset is freely available for all external users, without prejudice to the applicable IAEA Terms and Conditions.
Any use of the tropical dataset shall contain appropriate acknowledgement of the data source(s) and the IAEA’s Data Platform [online].
The preferred form of citation of IAEA’s MODARIA II tropical dataset is:
INTERNATIONAL ATOMIC ENERGY AGENCY, IAEA’s MODARIA II Soil-Plant Transfer Parameter Dataset for Tropical Environments. In: IAEA Data Platform [online], IAEA, Vienna (2021). https://ckan.iaea.production.datopian.com/dataset/modaria
The IAEA wishes to express its gratitude to C. Doering (Australia) for compiling this comprehensive dataset as part of the activities of Working Group 4 of the MODARIA II programme, led by B. Howard (UK). The IAEA also gratefully acknowledges the valuable contributions of J. Twining (Australia) and S. Rout (India).
If you would like to learn more about the IAEA’s MODARIA II tropical dataset, or have questions related to data compilation, get in touch with the IAEA’s team at the Terrestrial Environmental Radiochemistry Laboratory and at the Assessment and Management of Environmental Releases Unit via the ‘Contact dataset maintainer’ tab.
This submission includes publicly available data extracted in its original form. Please reference the Related Publication listed here for source and citation information.
"This page is intended to be a one stop shop for OpenFEMA—FEMA’s data delivery platform which provides datasets to the public in open, industry standard, machine-readable formats. Datasets are available in multiple formats, including downloadable files and through an easily digestible Application Programming Interface (API). Each page includes information about the specific dataset, links to downloadable files, a data dictionary describing each field, and an endpoint link (if applicable for those datasets available via the API)." [Quote from https://www.fema.gov/about/openfema/data-sets]
This dataset includes:
Annual NFIRS Public Data
Emergency Management Performance Grants
IPAWS Archived Alerts
National Household Survey
Non-Disaster and Assistance to Firefighter Grants
Sandy PMO: Disaster Relief Appropriations Act of 2013 (Sandy Supplemental Bill) Financial Data
Please review the updated PDF/HTML documentation for more details. (2025-01-31)
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The data collection contains 12 links to basic and other measurements of radiation at Neumayer Station from the Baseline Surface Radiation Network (BSRN). It covers all available measurements from the time period between 2013-01 and 2013-12. Any user who accepts the BSRN data release guidelines (http://bsrn.awi.de/data/conditions-of-data-release) may ask Amelie Driemel (mailto:Amelie.Driemel@awi.de) to obtain an account to download these datasets.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Overview
The Corpus of Decisions: International Court of Justice (CD-ICJ) collects and presents for the first time in human- and machine-readable form all published decisions of the International Court of Justice (ICJ). Among these are judgments, advisory opinions and orders, as well as their respective appended minority opinions (declarations, separate opinions and dissenting opinions). The International Court of Justice has kindly made available these documents on its website.
The International Court of Justice (ICJ) is the primary judicial organ of the United Nations and one of the most consequential courts in international law. Called the 'World Court' by many, it is the only international court with general thematic jurisdiction. While critics occasionally note the lack of compulsory jurisdiction and sharply limited access to the Court, its opinions continue to have an outsize influence on the modern interpretation, codification and wider development of international law. Every international legal textbook covers the workings and decisions of the Court in extenso and participation in international moot courts such as the Philip C. Jessup Moot Court without regular reference to and citation of the International Court of Justice's decisions is unthinkable.
This data set is designed to be complementary to and fully compatible with the Corpus of Decisions: Permanent Court of International Justice (CD-PCIJ), which is also available open access.
Citation
A peer-reviewed academic paper describing the construction and relevance of the data set entitled 'Introducing Twin Corpora of Decisions for the International Court of Justice (ICJ) and the Permanent Court of International Justice (PCIJ)' was published open access in the Journal of Empirical Legal Studies (JELS). It is also available in print at JELS 2022, Vol. 19, No. 2, pp. 491-524.
If you use the data set for academic work, please cite both the JELS paper and the precise version of the data set you used for your analysis.
New in Version 2023-10-22
Full recompilation of data set
Scope extended up to case number 190: Aerial Incident of 8 January 2020 (Canada, Sweden, Ukraine and United Kingdom v. Islamic Republic of Iran)
Added fix for lowercase components in URL basenames
Updated Python toolchain
Aligned Docker config with Debian as the host system
Updates
The CD-ICJ cannot be updated anymore, as the website of the Court is blocking automated access to its decisions. Updates will resume if this situation changes.
In case of serious errors an update will be provided at the earliest opportunity and a highlighted advisory issued on the Zenodo page of the current version. Minor errors will be documented in the GitHub issue tracker and fixed with the next scheduled release.
The CD-ICJ is versioned according to the day the data was acquired from the website of the Court, in the ISO format YYYY-MM-DD. Its initial release version was 2021-11-23.
Notifications regarding new and updated data sets will be published on my academic website at www.seanfobbe.com or via Mastodon at @seanfobbe@fediscience.org
Recommended Variants
Practitioners: PDF_BEST_MajorityOpinions
Traditional Scholars: PDF_BEST_FULL
Quantitative Analysts: CSV_BEST_FULL
Please refer to the Codebook regarding the relative merits of each variant. All variants are available in either English or French. Unless you have very specific needs you should only use the variants denoted 'BEST' for serious work.
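For quantitative analysts, a minimal sketch of the kind of tabular analysis the CSV variants support, here counting decisions per document type. The column names (doc_type, year, text) and rows are hypothetical placeholders; the actual 27 variables are defined in the Codebook and should be consulted before adapting this to the real CSV_BEST_FULL files.

```python
import csv
import io
from collections import Counter

# Inline stand-in for a CD-ICJ CSV variant; column names are hypothetical.
sample = io.StringIO(
    "doc_type,year,text\n"
    "judgment,1949,Corfu Channel merits text\n"
    "order,1949,Provisional measures text\n"
    "judgment,1986,Nicaragua merits text\n"
)

# Count how many documents of each decision type the corpus contains.
counts = Counter(row["doc_type"] for row in csv.DictReader(sample))
print(counts)
```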
Features
Fully compatible with the Corpus of Decisions: Permanent Court of International Justice (CD-PCIJ)
27 variables
Public Domain (CC-Zero 1.0)
Open and platform independent file formats (PDF, TXT, CSV)
Extensive Codebook
Compilation Report explains construction and validation of the data set in detail
Large number of diagrams for all purposes (see the 'ANALYSIS' archive)
Diagrams are available as PDF (for printing) and PNG (for web display), tables are available as CSV for easy readability by humans and machines
Secure cryptographic signatures
Publication of full source code (Open Source)
Key Metrics
Version: 2023-10-22
Temporal Coverage: 31 July 1947 – 16 October 2023
Documents: 2289 (English) / 2276 (French)
Tokens: 15,767,521 (English) / 16,239,787 (French)
File Formats: PDF, TXT, CSV
Source Code and Compilation Report
With every compilation of the full data set, an extensive Compilation Report is created as a professionally typeset PDF (comparable to the Codebook). The Compilation Report includes the Source Code, comments and explanations of design decisions, relevant computational results, exact timestamps, and a table of contents with clickable internal hyperlinks to each section. The Compilation Report and Source Code are published under the same DOI.
For details of the construction and validation of the data set please refer to the Compilation Report.
Disclaimer
This data set has been created by Mr Seán Fobbe using documents available on the website of the International Court of Justice (https://www.icj-cij.org). It is a personal academic initiative and is not associated with or endorsed by the International Court of Justice or the United Nations.
The Court accepts no responsibility or liability arising out of my use, or that of third parties, of the documents and information produced, used or published on the Zenodo website. Neither the Court nor its staff members nor its contractors may be held responsible or liable for the consequences, financial or otherwise, resulting from the use of these documents and information.
Academic Publications (Fobbe)
Website — www.seanfobbe.com
Open Data — zenodo.org/communities/sean-fobbe-data
Code Repository — zenodo.org/communities/sean-fobbe-code
Regular Publications — zenodo.org/communities/sean-fobbe-publications
Contact
Did you discover any errors? Do you have suggestions on how to improve the data set? You can either post these to the Issue Tracker on GitHub or write me an e-mail at fobbe-data@posteo.de
Facebook received 73,390 user data requests from federal agencies and courts in the United States during the second half of 2023. The social network produced some user data in 88.84 percent of requests from U.S. federal authorities. The United States accounts for the largest share of Facebook user data requests worldwide.
As of October 2025, 6.04 billion individuals worldwide were internet users, which amounted to 73.2 percent of the global population. Of this total, 5.66 billion, or 68.7 percent of the world's population, were social media users.
Global internet usage
Connecting billions of people worldwide, the internet is a core pillar of the modern information society. Northern Europe ranked first among worldwide regions by the share of the population using the internet in 2025. In the Netherlands, Norway, and Saudi Arabia, 99 percent of the population used the internet as of February 2025. North Korea was at the opposite end of the spectrum, with virtually no internet usage among the general population, ranking last worldwide. Eastern Asia was home to the largest number of online users worldwide, over 1.34 billion at the latest count. Southern Asia ranked second, with around 1.2 billion internet users. China, India, and the United States rank ahead of other countries worldwide by number of internet users.
Worldwide internet user demographics
As of 2024, the share of female internet users worldwide was 65 percent, five percentage points lower than that of men. The gender disparity in internet usage was larger in African countries, at around a 10-percentage-point difference. Other regions, such as the Commonwealth of Independent States and Europe, showed a smaller usage gap between the two genders. As of 2024, internet usage was higher among individuals between 15 and 24 years old across all regions, with young people in Europe representing the highest usage penetration at 98 percent; the worldwide average for the 15-to-24 age group was 79 percent. Country income level was also an important factor in internet access: 93 percent of the population in high-income countries reportedly used the internet, as opposed to only 27 percent in low-income markets.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset contains articles scraped from the Massachusetts Institute of Technology (MIT) News website, specifically focusing on topics related to Artificial Intelligence, Machine Learning, Robotics, and Emerging Technologies.
The data was collected from the MIT News topic page:
👉 https://news.mit.edu/topic/artificial-intelligence2
Each entry includes:
- Title of the article
- Author(s)
- Publication date
- Summary (dek)
- Full article body text
- URL to the original article
- Link to related research paper (e.g., Nature, Science) when available
The dataset spans multiple research domains, including:
- AI for drug discovery & healthcare
- Protein language models
- Sustainable AI and eco-driving
- Robotics and embodied intelligence
- Chemistry and materials science
- Climate and clean energy
This dataset is ideal for:
- Natural Language Processing (NLP) tasks (summarization, topic modeling, sentiment analysis)
- Trend analysis in AI and scientific research
- Text classification and information retrieval
- Educational projects and AI literacy
- Knowledge graph construction of AI research
The data was collected in accordance with robots.txt and ethical web scraping practices.

| Column | Description |
|---|---|
| title | Article headline |
| author | Author(s) of the article |
| publication_date | Human-readable publication date |
| datetime | ISO-formatted publication timestamp |
| summary | Article summary (lead paragraph) |
| body | Full article text |
| paper_link | URL to the related research paper (e.g., Nature) |
| url | Direct link to the MIT News article |
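As a sketch of working with this schema, the snippet below loads rows with exactly these columns and keeps only articles that link to a related research paper. The sample rows (titles, authors, DOI, URLs) are invented placeholders standing in for the real scraped data.

```python
import csv
import io

# Inline stand-in for the MIT News CSV; columns match the table above,
# but every value here is an invented placeholder.
sample = io.StringIO(
    "title,author,publication_date,datetime,summary,body,paper_link,url\n"
    "AI finds antibiotic,Jane Doe,May 1 2024,2024-05-01T09:00:00,Lead,Body,"
    "https://doi.org/10.0000/example,https://news.mit.edu/example-1\n"
    "Campus event,John Roe,May 2 2024,2024-05-02T09:00:00,Lead,Body,,"
    "https://news.mit.edu/example-2\n"
)

# Keep only articles that reference a published research paper.
with_papers = [row for row in csv.DictReader(sample) if row["paper_link"]]
print(len(with_papers))  # → 1
```

The same filter is a natural first step for the use cases above, e.g. building a knowledge graph that links news coverage to the underlying papers.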
Use this dataset to:
- Track how AI is being applied across scientific disciplines
- Build a news aggregator for AI research
- Train a model to predict research trends
- Create a search engine for MIT’s AI breakthroughs
This dataset is shared under Kaggle’s Terms of Service for non-commercial, educational, and research purposes.
The original content remains the property of MIT News and should be properly attributed.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Any user who accepts the BSRN data release guidelines (http://bsrn.awi.de/data/conditions-of-data-release) may ask Amelie Driemel (mailto:Amelie.Driemel@awi.de) to obtain an account to download these datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains thousands of vocal imitations of a large set of diverse sounds. These imitations were collected from hundreds of contributors via Amazon's Mechanical Turk website. The data set also contains data on hundreds of people's ability to correctly label these vocal imitations, also collected via Amazon's Mechanical Turk. This data set will help the research community understand which audio concepts can be effectively communicated with this approach. We have released this data so the community can study the related issues and build systems that leverage vocal imitation as an interaction modality, such as search engines that can be queried by vocally imitating the desired sound.
This data set is a supplement to a paper. Please cite the following paper to reference this data set in a publication:
Cartwright, M., Pardo, B. VocalSketch: Vocally Imitating Audio Concepts. In Proceedings of ACM Conference on Human Factors in Computing Systems (2015). http://dx.doi.org/10.1145/2702123.2702387
See https://github.com/interactiveaudiolab/VocalSketchDataSet for the latest updates to this data set.
Interactive Audio Lab: http://music.eecs.northwestern.edu
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The data collection contains 156 links to continuous meteorological observations at Neumayer Station from the Baseline Surface Radiation Network (BSRN). It covers all available measurements from the time period between 2002-01 and 2014-12. Any user who accepts the BSRN data release guidelines (http://bsrn.awi.de/data/conditions-of-data-release) may ask Amelie Driemel (mailto:Amelie.Driemel@awi.de) to obtain an account to download these datasets.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This submission includes publicly available data extracted in its original form. Please reference the Related Publication listed here for source and citation information: US Army Corps of Engineers (Corps) Pre-2015 Regulatory Regime Approved Jurisdictional Determination in Light of Sackett v. EPA, 143 S. Ct. 1322 (2023), NWW-2023-00554, MFR 1 of 1 Clean Water Act Approved Jurisdictional Determinations. This upload includes data and screenshots of the landing page and FAQs. "This website presents information on approved jurisdictional determinations (JDs) made by the U.S. Army Corps of Engineers (Corps) and the U.S. Environmental Protection Agency (EPA) under the Clean Water Act since August 28, 2015. Users are able to search, sort, map, view, and download approved JDs from both agencies using different search parameters (e.g., by year, State, watershed). An approved JD is an official Corps determination that jurisdictional waters of the United States are either present or absent on a particular site." [Quote from https://watersgeo.epa.gov/cwa/CWA-JDs/]
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Any user who accepts the BSRN data release guidelines (http://bsrn.awi.de/data/conditions-of-data-release) may ask Amelie Driemel (mailto:Amelie.Driemel@awi.de) to obtain an account to download these datasets.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator (c-score). Separate data are shown for career-long impact and for single recent year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given, and data on retracted papers (based on the Retraction Watch database) as well as citations to/from retracted papers have been added. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2024 and single recent year data pertain to citations received during calendar year 2024. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (7) is based on the August 1, 2025 snapshot from Scopus, updated to the end of citation year 2024. This work uses Scopus data. Calculations were performed using all Scopus author profiles as of August 1, 2025. If an author is not on the list, it is simply because the composite indicator value was not high enough to appear on the list. It does not mean that the author does not do good work. PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US.
They should be sent directly to Scopus, preferably via the Scopus-to-ORCID feedback wizard (https://orcid.scopusfeedback.com/), so that the correct data can be used in any future annual updates of the citation indicator databases. The c-score focuses on impact (citations) rather than productivity (number of publications), and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, see the attached file of FREQUENTLY ASKED QUESTIONS. Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For further reading, we refer to the Leiden Manifesto: https://www.nature.com/articles/520429a
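Of the indicators the database reports, the h-index is the simplest to state: the largest h such that the author has at least h papers with h or more citations each. A minimal reference implementation of that definition (this is not the study's actual code, and the citation counts in the example are invented):

```python
def h_index(citations):
    """Largest h such that the author has h papers with at least h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank  # the paper at this rank still has enough citations
        else:
            break  # citations are sorted descending, so no later rank can qualify
    return h

# An author with papers cited 10, 8, 5, 4 and 3 times has four papers
# with at least 4 citations each, but not five with at least 5.
print(h_index([10, 8, 5, 4, 3]))  # → 4
```

The co-authorship-adjusted hm-index mentioned above follows the same scheme but weights each paper fractionally by its number of authors, which is one reason the database reports both.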