Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🔬 Research Hypothesis This research hypothesizes that light propagation in biological tissues follows predictable scattering patterns governed by Monte Carlo-based photon transport models. By simulating photon interactions within a biological medium, a synthetic dataset can be generated to support AI-driven medical imaging applications, such as optical coherence tomography (OCT) and diffuse optical tomography. The dataset explores how different optical properties—scattering coefficient, anisotropy, and absorption—affect photon propagation and whether AI models can learn to differentiate between scattering profiles.
📊 Key Findings The dataset consists of 2000 grayscale images, each representing the intensity distribution of photons in a simulated biological medium. Photon propagation follows expected physical principles, with intensity patterns matching theoretical models of light transport in tissues.
Key observations include the effect of anisotropic scattering, where high anisotropy values lead to forward-directed light propagation, while increased scattering coefficients result in greater lateral diffusion. Shorter wavelengths (450 nm) show stronger scattering and shallower penetration, whereas longer wavelengths (780 nm) experience lower scattering and deeper penetration, consistent with real tissue optics.
Absorption plays a critical role, reducing photon intensity as light penetrates deeper into the simulated medium. The structured variation in light intensity ensures that AI models can be trained to predict tissue properties from scattering patterns.
📖 How to Interpret and Use This Dataset Each image represents a Monte Carlo photon transport simulation, where photons undergo multiple scattering and absorption events. The brightness of each pixel corresponds to photon concentration, with brighter areas indicating regions of higher intensity.
Researchers can analyze images by filtering them based on optical properties, such as wavelength or scattering coefficient. AI models can use the dataset for training in tissue classification and reconstruction of subsurface structures. The dataset is also valuable for validating Monte Carlo-based light transport models by comparing the generated images to experimental optical imaging data.
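As an illustration of the simulation principle described above, the sketch below runs a minimal 2D Monte Carlo photon walk with Henyey-Greenstein-style forward scattering and weight-based absorption, accumulating a photon-concentration image. The optical coefficients, grid size, and photon count are arbitrary placeholder values and this is not the code used to produce the dataset.

```python
import numpy as np

# Illustrative optical properties (arbitrary values, not the dataset's settings)
MU_S = 10.0   # scattering coefficient [1/mm]
MU_A = 0.5    # absorption coefficient [1/mm]
G = 0.9       # anisotropy factor (g close to 1 -> forward-directed scattering)

def sample_scatter_angle(g, rng):
    """Sample a deflection angle from the Henyey-Greenstein phase function."""
    if g == 0:
        return np.arccos(2 * rng.random() - 1)
    tmp = (1 - g**2) / (1 - g + 2 * g * rng.random())
    return np.arccos((1 + g**2 - tmp**2) / (2 * g))

def simulate(n_photons=5000, grid=256, extent_mm=10.0, seed=0):
    """Accumulate photon visits on a 2D grid as a rough stand-in for intensity."""
    rng = np.random.default_rng(seed)
    image = np.zeros((grid, grid))
    mu_t = MU_S + MU_A
    for _ in range(n_photons):
        x, z = 0.0, 0.0     # launch at the surface, centre of the x-axis
        theta = 0.0         # initial direction: straight into the tissue
        weight = 1.0
        while weight > 1e-3:
            step = -np.log(rng.random() + 1e-12) / mu_t   # free path length [mm]
            x += step * np.sin(theta)
            z += step * np.cos(theta)
            if z < 0 or abs(x) > extent_mm / 2 or z > extent_mm:
                break                                     # photon left the medium
            weight *= MU_S / mu_t                         # absorb part of the weight
            ix = int((x / extent_mm + 0.5) * (grid - 1))
            iz = int(z / extent_mm * (grid - 1))
            image[iz, ix] += weight                       # deposit at current cell
            theta += sample_scatter_angle(G, rng) * rng.choice([-1, 1])
    return image / image.max()                            # normalise to [0, 1]

if __name__ == "__main__":
    img = simulate()
    print(img.shape, img.max())
```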
📌 Applications and Use Cases This dataset is applicable to biomedical optics, AI-driven medical imaging, and laser-tissue interaction studies. It provides training data for AI models used in OCT and laser scanning microscopy. In physics and engineering, it supports the study of light transport in scattering media.
For AI research, the dataset enables training deep learning models to classify tissue structures based on scattering properties. It is also useful for developing AI algorithms that reconstruct subsurface tissue features from optical measurements.
MIT License - https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is a copy of the AI-GA (Artificial Intelligence Generated Abstracts) dataset, originally published by the researchers credited below. I have not made any changes.
The AI-GA dataset consists of 28,662 samples. Each sample includes:
1. An abstract
2. A title
3. A label (0 = original abstract, 1 = AI-generated)
The dataset is balanced — half of the abstracts are written by humans, while the other half were generated using GPT-3, one of the most advanced language models available at the time.
All data is provided in CSV format, with the following columns: abstract, title, label.
The original abstracts come from a collection of COVID-19 research papers, while the AI-generated ones were created using GPT-3 for experimental and benchmarking purposes.
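For quick inspection, a minimal pandas sketch is shown below; the file name ai_ga.csv is an assumed placeholder, so substitute the actual file name from the download.

```python
import pandas as pd

# Hypothetical file name; use the actual CSV name from the download.
df = pd.read_csv("ai_ga.csv")

print(df.shape)                    # expected: (28662, 3)
print(df.columns.tolist())         # ['abstract', 'title', 'label']
print(df["label"].value_counts())  # balanced: ~14,331 human (0) vs AI-generated (1)

# Example split for a detection experiment
human = df[df["label"] == 0]
generated = df[df["label"] == 1]
```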
🧠 Citation If you use the AI-GA dataset in your research or publications, please acknowledge its use by citing this repository.

@INPROCEEDINGS{10233982,
  author={Theocharopoulos, Panagiotis C. and Anagnostou, Panagiotis and Tsoukala, Anastasia and Georgakopoulos, Spiros V. and Tasoulis, Sotiris K. and Plagianakos, Vassilis P.},
  booktitle={2023 IEEE Ninth International Conference on Big Data Computing Service and Applications (BigDataService)},
  title={Detection of Fake Generated Scientific Abstracts},
  year={2023},
  pages={33-39},
  doi={10.1109/BigDataService58306.2023.00011}
}
📄 License The AI-GA dataset is released under the MIT License.
🙏 Note for the Authors All credit goes to the original authors. If you're one of them and prefer this dataset to be removed or credited differently, please don’t hesitate to reach out. I’m only sharing it here to help make this resource more accessible to the Kaggle community.
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AH&AITD is a comprehensive benchmark dataset designed to support the evaluation of AI-generated text detection tools. The dataset contains 11,580 samples spanning both human-written and AI-generated content across multiple domains. It was developed to address limitations in previous datasets, particularly in terms of diversity, scale, and real-world applicability.
Purpose: To facilitate research in the detection of AI-generated text by providing a diverse, multi-domain dataset. This dataset enables fair benchmarking of detection tools across various writing styles and content categories.
Composition:
1. Human-Written Samples (Total: 5,790), collected from:
- Open Web Text (2,343 samples)
- Blogs (196 samples)
- Web Text (397 samples)
- Q&A Platforms (670 samples)
- News Articles (430 samples)
- Opinion Statements (1,549 samples)
- Scientific Research Abstracts (205 samples)
2. AI-Generated Samples (Total: 5,790), generated using:
- ChatGPT (1,130 samples)
- GPT-4 (744 samples)
- Paraphrase Models (1,694 samples)
- GPT-2 (328 samples)
- GPT-3 (296 samples)
- DaVinci (GPT-3.5 variant) (433 samples)
- GPT-3.5 (364 samples)
- OPT-IML (406 samples)
- Flan-T5 (395 samples)
Citation: Akram, A. (2023). AH&AITD: Arslan’s Human and AI Text Database. [Dataset]. Associated with the article: An Empirical Study of AI-Generated Text Detection Tools. Advances in Machine Learning & Artificial Intelligence, 4(2), 44–55.
A desk-based study was undertaken to address the increasing need for a strategic approach to industry-led data collection in the face of reducing resources and growing need for evidence in fisheries management. The aim was to establish partnerships and action plans that will support the inclusion of fishing industry knowledge and data in the evidence-base for managing UK fisheries. Three main activities were used to progress the study:
• A review of existing initiatives that have incorporated industry generated data into fisheries management advice and their degree of success.
• A rapid appraisal of data deficient UK targeted stocks, assessment of data needs/gaps that may be augmented by industry generated data, and selection of case study stocks.
• Development of action plans on how data gaps for case study stocks of commercial importance to UK fleets could be filled by the industry.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) - https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was created by collecting posts from Reddit using the official Reddit API via the PRAW library in Python. It contains posts from various mental health–related subreddits, focused on discussions, experiences, and support around mental well-being.
The dataset has been fully cleaned and preprocessed, making it suitable for Natural Language Processing (NLP) tasks such as:
Sentiment analysis
Text classification
Topic modeling
Mental health research
Key Features:
Source: Reddit (via API, PRAW)
Domain: Mental health conversations and discussions
Format: Cleaned text data (ready-to-use)
Usability: NLP model training and experimentation
Potential Use Cases:
Training sentiment analysis models
Building text classification systems for mental health support
Research in psychology, linguistics, and AI ethics
Benchmarking NLP pipelines on real-world user-generated text
Note: This dataset was built for research and educational purposes only. Personal information and identifiers were not included to protect user privacy.
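For readers who want to reproduce or extend the collection step, a minimal PRAW sketch is shown below. The credentials, subreddit names, and post limit are placeholders, and the exact subreddits and cleaning pipeline used for this dataset are not specified here.

```python
import praw

# Placeholder credentials; create an app at https://www.reddit.com/prefs/apps
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="mental-health-dataset-script",
)

# Hypothetical subreddit list; the dataset's actual sources are not listed here.
subreddits = ["mentalhealth", "depression", "Anxiety"]

rows = []
for name in subreddits:
    for post in reddit.subreddit(name).new(limit=500):
        rows.append(
            {
                "subreddit": name,
                "title": post.title,
                "text": post.selftext,
                "created_utc": post.created_utc,
            }
        )

# Basic cleaning: drop empty or removed posts (further NLP preprocessing omitted).
rows = [r for r in rows if r["text"] and r["text"] not in ("[removed]", "[deleted]")]
print(len(rows), "posts collected")
```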
The Advanced Microwave Scanning Radiometer 2 (AMSR2) instrument on the Global Change Observation Mission - Water 1 (GCOM-W1) provides global passive microwave measurements of terrestrial, oceanic, and atmospheric parameters for the investigation of global water and energy cycles. Near real-time (NRT) products are generated within 3 hours of the last observations in the file, by the Land Atmosphere Near real-time Capability for EOS (LANCE) at the AMSR Science Investigator-led Processing System (AMSR SIPS), which is collocated with the Global Hydrology Resource Center (GHRC) DAAC. The GCOM-W1 NRT AMSR2 Unified L2B Global Swath Ocean Products is a swath product containing global sea surface temperature over ocean, wind speed over ocean, water vapor over ocean and cloud liquid water over ocean, using resampled NRT Level-1R data provided by JAXA. This is the same algorithm that generates the corresponding standard science products in the AMSR SIPS. The NRT products are generated in HDF-EOS-5 augmented with netCDF-4/CF metadata and are available via HTTPS from the EOSDIS LANCE system at https://lance.nsstc.nasa.gov/amsr2-science/data/level2/ocean/. If data latency is not a primary concern, please consider using science quality products. Science products are created using the best available ancillary, calibration and ephemeris information. Science quality products are an internally consistent, well-calibrated record of the Earth's geophysical properties to support science. The AMSR SIPS produces AMSR2 standard science quality data products, and they are available at the NSIDC DAAC.
https://datacatalog.worldbank.org/public-licenses?fragment=cc
This dataset contains metadata (title, abstract, date of publication, field, etc.) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.
Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.
We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and LaTeX file are important for extracting important information like the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.
The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.
To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
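The snippet below illustrates the combined approach on a single abstract: a regular-expression check against a country-name list plus spaCy NER for geopolitical entities. The country list, model name, and lack of mapping to ISO 3166 codes are simplifications rather than the project's exact implementation.

```python
import re
import spacy

# Tiny illustrative country list; the project used the full ISO 3166 name set.
COUNTRIES = ["Kenya", "Brazil", "Viet Nam", "United Kingdom"]
pattern = re.compile(r"\b(" + "|".join(map(re.escape, COUNTRIES)) + r")\b", re.IGNORECASE)

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def countries_of_study(title: str, abstract: str) -> set[str]:
    text = f"{title}. {abstract}"
    found = {m.group(1).title() for m in pattern.finditer(text)}    # regex pass
    doc = nlp(text)
    # GPE also covers cities and states; a full pipeline would map these to countries.
    found |= {ent.text for ent in doc.ents if ent.label_ == "GPE"}  # NER pass
    return found

print(countries_of_study(
    "Maternal health outcomes in Kenya",
    "We analyse survey data collected in Kenya and Brazil between 2010 and 2015.",
))
```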
The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:
Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.
There are two classification tasks in this exercise:
1. Identifying whether an academic article is using data from any country
2. Identifying from which country that data came.
For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data, uses data. Some clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.
After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]
For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.
We expect between 10 and 35 percent of all articles to use data.
The median amount of time that a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming a cost of $3 per article, as was paid to MTurk workers).
A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al., 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf, 2019). We use PyTorch to produce a model that classifies articles based on the labeled data. Of the 3,500 articles that were hand-coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.
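A condensed sketch of this kind of pipeline, using the Hugging Face transformers implementation of DistilBERT, is shown below. The file name labelled_abstracts.csv, the column names abstract and uses_data, and the training settings are illustrative assumptions, not the project's actual configuration.

```python
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import (DistilBertForSequenceClassification,
                          DistilBertTokenizerFast, Trainer, TrainingArguments)

# Hypothetical labelled file with columns 'abstract' and 'uses_data' (0/1).
df = pd.read_csv("labelled_abstracts.csv")
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["abstract"].tolist(), df["uses_data"].tolist(), test_size=0.2, random_state=42)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

class AbstractDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=AbstractDataset(train_texts, train_labels),
    eval_dataset=AbstractDataset(val_texts, val_labels),
)
trainer.train()

# Label "uses data" only when the predicted probability exceeds 0.9,
# mirroring the 90% confidence rule described above.
logits = trainer.predict(AbstractDataset(val_texts, val_labels)).predictions
probs = torch.softmax(torch.tensor(logits), dim=1)[:, 1]
uses_data = (probs >= 0.9).int()
```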
The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both the humans and the model make the same kinds of errors, then the performance reported here will be overestimated.
The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of
CC0 1.0 Universal Public Domain Dedication - https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling.
The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly.
From October 24 to November 8, 2016 we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover the data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate the response to a data management expert in their unit, to delegate it to all members of their unit, or to collate responses from their unit themselves before reporting in the survey.
Larger storage ranges cover vastly different amounts of data so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond.
We defined active data as data that would be used within the next six months. All other data would be considered inactive, or archival.
To calculate per person storage needs we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values.
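A minimal sketch of that per-person calculation (the function and variable names are illustrative, not from the report):

```python
def per_person_tb(range_high_tb: float, group_size: int = 1) -> float:
    """High end of the reported storage range divided by the number of people
    covered by the response (1 for an individual response, G for a group)."""
    return range_high_tb / group_size

# Example: a group of 4 reporting the '10 to 100 TB' range -> 25 TB per person.
print(per_person_tb(100, 4))
```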
Resources in this dataset:
- Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdf. Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop down not shown here. Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/
- Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csv. Resource Description: CSV file includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This is the same data as in the Excel spreadsheet (also provided).
- Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsx. Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
Attribution-ShareAlike 2.0 (CC BY-SA 2.0) - https://creativecommons.org/licenses/by-sa/2.0/
License information was derived automatically
Database File Name: Okada et al (2025) CONNECT science with AI transversal skills
Database Description: Students' self-report perceptions after completing AI activities, reflecting on science connection, science skills, science capital and scientific literacies in the context of open schooling
Database Citation: Okada A.; Sherborne T.; Panselinas, G. (2025) CONNECT science with AI transversal skills - 330 students in Brazil, The UK, and Greece. CC BY-SA
Contact email: ale.okada@open.ac.uk
Database URL: https://doi.org/10.21954/ou.rd.26115151
Information: This database provides the views of 330 students who participated in the CONNECT project in the UK, Brazil and Greece. https://ordo.open.ac.uk/projects/CONNECT_-_Inclusive_open_schooling_with_future_oriented_science/125821
Datasheets in this file:
- NOTES: Provides information about this dataset
- DATA: Contains data including responses of 330 students and also analysis about the data
- EFA: Contains data processed from SPSS about Exploratory Factorial Analysis
- TOTAL: Contains cross-national comparative analysis including graphs
- Graphs: Contains graphs using the TOTAL sheet, providing an overview
Methodology used to generate the data:
- Questionnaire design: Semi-structured questionnaire including a combination of open-ended and closed-ended questions.
- Platform used: Qualtrics
- Multilanguage support: Target languages (English, Greek, Portuguese) to ensure that respondents can understand and respond to the questions in their preferred language.
- Questionnaire implementation: Logic for score, feedback and open badge implemented
- Language selection: Qualtrics allows respondents to select their preferred language before starting the survey.
- Data generation: The questionnaire was distributed to the target audience (school students) through teachers who are members of the CONNECT project and agreed to contribute to this research. They were supported by the authors.
- Data storage: As respondents submit their responses, Qualtrics stores the data securely in its database infrastructure. Each response is associated with the respondent's unique identifier and includes the language in which the survey was completed.
- Data analysis: Exploratory factorial analysis, descriptive analyses and thematic analysis to support mixed methods.
Extra Information:
- Creator of the instrument used to generate this database: Okada, A. CONNECT-Science self-report instrument
- This database refers to the CONNECT project: https://www.connect-science.net/
- Project description: CONNECT inclusive open schooling with engaging and future oriented science
- Funder: European Commission No. 872814
- Questionnaire and database location: https://doi.org/10.21954/ou.rd.23566662
- Questionnaire citation: Okada A. (2024) CONNECT-science to sustainability with inclusive open schooling with engaging and future oriented science. CC BY-SA
- Journal article using data presented in this database: https://oro.open.ac.uk/54963/
- Article citation: Okada, Alexandra; Sherborne, Tony; Panselinas, Giorgos and Kolionis, George (2025). Fostering Transversal Skills through Open Schooling supported by the CARE-KNOW-DO Pedagogical Model and the UNESCO AI Competencies Framework. International Journal of Artificial Intelligence in Education (In Press).
- License: CC BY-SA
- REC/3825
Data-driven models help mobile app designers understand best practices and trends, and can be used to make predictions about design performance and support the creation of adaptive UIs. This paper presents Rico, the largest repository of mobile app designs to date, created to support five classes of data-driven applications: design search, UI layout generation, UI code generation, user interaction modeling, and user perception prediction. To create Rico, we built a system that combines crowdsourcing and automation to scalably mine design and interaction data from Android apps at runtime. The Rico dataset contains design data from more than 9.3k Android apps spanning 27 categories. It exposes visual, textual, structural, and interactive design properties of more than 66k unique UI screens. To demonstrate the kinds of applications that Rico enables, we present results from training an autoencoder for UI layout similarity, which supports query-by-example search over UIs.
Rico was built by mining Android apps at runtime via human-powered and programmatic exploration. Like its predecessor ERICA, Rico’s app mining infrastructure requires no access to — or modification of — an app’s source code. Apps are downloaded from the Google Play Store and served to crowd workers through a web interface. When crowd workers use an app, the system records a user interaction trace that captures the UIs visited and the interactions performed on them. Then, an automated agent replays the trace to warm up a new copy of the app and continues the exploration programmatically, leveraging a content-agnostic similarity heuristic to efficiently discover new UI states. By combining crowdsourcing and automation, Rico can achieve higher coverage over an app’s UI states than either crawling strategy alone. In total, 13 workers recruited on UpWork spent 2,450 hours using apps on the platform over five months, producing 10,811 user interaction traces. After collecting a user trace for an app, we ran the automated crawler on the app for one hour.
University of Illinois at Urbana-Champaign - https://interactionmining.org/rico
The Rico dataset is large enough to support deep learning applications. We trained an autoencoder to learn an embedding for UI layouts, and used it to annotate each UI with a 64-dimensional vector representation encoding visual layout. This vector representation can be used to compute structurally — and often semantically — similar UIs, supporting example-based search over the dataset. To create training inputs for the autoencoder that embed layout information, we constructed a new image for each UI capturing the bounding box regions of all leaf elements in its view hierarchy, differentiating between text and non-text elements. Rico’s view hierarchies obviate the need for noisy image processing or OCR techniques to create these inputs.
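To illustrate the layout-image construction, the sketch below rasterises leaf bounding boxes from a simplified, hypothetical view hierarchy JSON into a two-channel text/non-text image. Rico's actual hierarchy schema, screen dimensions, and autoencoder input size are not reproduced exactly here.

```python
import json
import numpy as np

# Hypothetical, simplified hierarchy: each node has 'bounds' [x1, y1, x2, y2],
# an 'is_text' flag, and optional 'children'. Rico's real schema is richer.
def leaf_nodes(node):
    children = node.get("children") or []
    if not children:
        yield node
    for child in children:
        yield from leaf_nodes(child)

def layout_image(hierarchy, screen_w=1440, screen_h=2560, out_w=90, out_h=160):
    """Rasterise leaf element bounds into a 2-channel (text / non-text) image."""
    img = np.zeros((2, out_h, out_w), dtype=np.float32)
    for leaf in leaf_nodes(hierarchy):
        x1, y1, x2, y2 = leaf["bounds"]
        c = 0 if leaf.get("is_text") else 1
        xs, xe = int(x1 / screen_w * out_w), int(x2 / screen_w * out_w)
        ys, ye = int(y1 / screen_h * out_h), int(y2 / screen_h * out_h)
        img[c, ys:ye, xs:xe] = 1.0
    return img

hierarchy = json.loads("""{"bounds": [0, 0, 1440, 2560], "is_text": false,
  "children": [{"bounds": [100, 200, 1340, 300], "is_text": true}]}""")
print(layout_image(hierarchy).shape)   # (2, 160, 90)
```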
Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: With data becoming a centerpiece of modern scientific discovery, data sharing by scientists is now a crucial element of scientific progress. This article aims to provide an in-depth examination of the practices and perceptions of data management, including data storage, data sharing, and data use and reuse by scientists around the world.
Methods: The Usability and Assessment Working Group of DataONE, an NSF-funded environmental cyberinfrastructure project, distributed a survey to a multinational and multidisciplinary sample of scientific researchers in a two-wave approach in 2017–2018. We focused our analysis on examining the differences across age groups, sub-disciplines of science, and sectors of employment.
Findings: Most respondents displayed what we describe as high and mediocre risk data practices by storing their data on their personal computer, departmental servers or USB drives. Respondents appeared to be satisfied with short-term storage solutions; however, only half of them are satisfied with available mechanisms for storing data beyond the life of the project. Data sharing and data reuse were viewed positively: over 85% of respondents admitted they would be willing to share their data with others and said they would use data collected by others if it could be easily accessed. A vast majority of respondents felt that the lack of access to data generated by other researchers or institutions was a major impediment to progress in science at large, yet only about a half thought that it restricted their own ability to answer scientific questions. Although attitudes towards data sharing and data use and reuse are mostly positive, practice does not always support data storage, sharing, and future reuse. Assistance through data managers or data librarians, readily available data repositories for both long-term and short-term storage, and educational programs for both awareness and to help engender good data practices are clearly needed.
ccPDB (Compilation and Creation of datasets from PDB) is designed to provide a service to the scientific community working in the field of function or structure annotation of proteins. This database of datasets is based on the Protein Data Bank (PDB), and all datasets were derived from PDB. ccPDB has four modules: i) compilation of datasets, ii) creation of datasets, iii) web services and iv) important links.
* Compilation of Datasets: Datasets at ccPDB can be classified into two categories: i) datasets collected from the literature and ii) datasets compiled from PDB. We are in the process of collecting PDB datasets from the literature and maintaining them at ccPDB, and we also ask the community to suggest datasets. In addition, we generate datasets from PDB; these datasets were generated using commonly used standard protocols, such as non-redundant chains and structures solved at high resolution.
* Creation of Datasets: This module was developed for creating customized datasets, where users can create a dataset from PDB using their own conditions. It will be useful for users who wish to create a new dataset according to their own requirements. The module has six steps, which are described in the help page.
* Web Services: The following web services are integrated in ccPDB: i) the Analyze PDB ID service allows users to submit their PDB ID to around 40 servers from a single point, ii) BLAST search allows users to perform a BLAST search of their protein against PDB, iii) the Structural information service is designed for annotating a protein structure from a PDB ID, iv) Search in PDB helps users search for structures in PDB, v) the Generate patterns service generates different types of patterns required for machine learning techniques, and vi) Download useful information allows users to download various types of information for a given set of proteins (PDB IDs).
* Important Links: One of the major objectives of this web site is to provide links to web servers related to functional annotation of proteins. In the first phase we have collected and compiled these links in different categories. In the future, an attempt will be made to collect as many links as possible.
The Advanced Microwave Scanning Radiometer 2 (AMSR2) instrument on the Global Change Observation Mission - Water 1 (GCOM-W1) provides global passive microwave measurements of terrestrial, oceanic, and atmospheric parameters for the investigation of global water and energy cycles. Near real-time (NRT) products are generated within 3 hours of the last observations in the file, by the Land Atmosphere Near real-time Capability for EOS (LANCE) at the AMSR Science Investigator-led Processing System (AMSR SIPS), which is collocated with the Global Hydrology Resource Center (GHRC) DAAC. The GCOM-W1 NRT AMSR2 Unified Global Swath Surface Precipitation GSFC Profiling Algorithm is a swath product containing global rain rate and type, calculated by the GPROF 2017 V2R rainfall retrieval algorithm using resampled NRT Level-1R data provided by JAXA. This is the same algorithm that generates the corresponding standard science products in the AMSR SIPS. The NRT products are generated in HDF-EOS-5 augmented with netCDF-4/CF metadata and are available via HTTPS from the EOSDIS LANCE system at https://lance.nsstc.nasa.gov/amsr2-science/data/level2/rain/. If data latency is not a primary concern, please consider using science quality products. Science products are created using the best available ancillary, calibration and ephemeris information. Science quality products are an internally consistent, well-calibrated record of the Earth's geophysical properties to support science. The AMSR SIPS produces AMSR2 standard science quality data products, and they are available at the NSIDC DAAC.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Citizen Science Platform market size reached USD 1.29 billion in 2024, and is set to expand at a robust CAGR of 10.8% over the forecast period, culminating in a projected value of USD 3.06 billion by 2033. This remarkable growth is primarily fueled by increasing public engagement in scientific research, the proliferation of digital platforms, and rising governmental and academic support for participatory science initiatives worldwide.
One of the primary growth factors for the Citizen Science Platform market is the exponential rise in public awareness and enthusiasm for participatory science. As individuals become more conscious of environmental, health, and societal issues, they are increasingly motivated to contribute data and observations through digital platforms. The democratization of science, enabled by easy-to-use mobile applications and web-based portals, allows people from diverse backgrounds to participate in research projects. This surge in participation not only expands the volume of data available for scientific studies but also enhances the quality and geographical reach of research. The integration of advanced technologies such as IoT sensors, cloud computing, and artificial intelligence further streamlines data collection, validation, and analysis, making citizen science more accessible and impactful than ever before.
Another significant driver is the strong support from governmental bodies, academic institutions, and non-profit organizations, which recognize the value of citizen-generated data in addressing complex societal challenges. Governments across North America, Europe, and Asia Pacific are increasingly allocating funds and resources to citizen science programs for environmental monitoring, biodiversity conservation, and public health surveillance. Academic and research institutions are leveraging citizen science platforms to supplement their research capabilities, while non-profit organizations use these platforms to mobilize communities around pressing issues such as climate change, pollution, and disease outbreaks. This collaborative ecosystem not only accelerates scientific discovery but also fosters a culture of open science and knowledge sharing.
The evolution of digital infrastructure and widespread adoption of smartphones and high-speed internet have further catalyzed the growth of the Citizen Science Platform market. Cloud-based deployment models, in particular, offer scalable, secure, and cost-effective solutions for managing large volumes of data generated by citizen scientists. These platforms provide real-time data visualization, analytics, and reporting tools that empower both participants and researchers. The availability of multilingual interfaces and user-friendly designs ensures inclusivity, enabling participation from rural and underserved communities. Moreover, the integration of gamification and social features enhances user engagement, retention, and motivation, resulting in sustained participation and richer datasets.
From a regional perspective, North America continues to dominate the Citizen Science Platform market, followed closely by Europe and Asia Pacific. The United States, Canada, the United Kingdom, Germany, France, and Japan are at the forefront of adopting and developing citizen science initiatives. These regions benefit from strong institutional support, advanced technological infrastructure, and a highly engaged populace. Meanwhile, emerging markets in Latin America, the Middle East, and Africa are witnessing growing interest in citizen science, driven by the need to address environmental and public health challenges unique to these regions. As digital connectivity improves and awareness spreads, these regions are expected to contribute significantly to the global market growth in the coming years.
The Citizen Science Platform market by component is primarily segmented into software and services. The software segment encompasses web-based portals, mobile applications, data management systems, and analytical tools that facilitate citizen participation and data-driven research. The growing sophistication of platform software, featuring intuitive user interfaces, real-time data synchronization, and advanced analytics, has been pivotal in attracting a broad user base. Cloud-native software solutions are particularly favored for their scalability.
This collection comprises interview and focus group data gathered in 2024-2025 as part of a project aimed at investigating how synthetic data can support secure data access and improve research workflows, particularly from the perspective of data-owning organisations.
The interviews included 4 case studies of UK-based organisations that had piloted work generating and disseminating synthetic datasets, including the Ministry of Justice, NHS England, the project team working in partnership with the Department for Education, and the Office for National Statistics. It also includes 2 focus groups with Trusted Research Environment (TRE) representatives who had published or were considering publishing synthetic data.
The motivation for this collection stemmed from the growing interest in synthetic data as a tool to enhance access to sensitive data and reduce pressure on Trusted Research Environments (TREs). The study explored organisational engagement with two types of synthetic data: synthetic data generated from real data, and “data-free” synthetic data created using metadata only.
The aims of the case studies and focus groups were to assess current practices, explore motivations and barriers to adoption, understand cost and governance models, and gather perspectives on scaling and outsourcing synthetic data production. Conditional logic was used to tailor the survey to organisations actively producing, planning, or not engaging with synthetic data.
The interviews covered 5 key themes: organisational background; Infrastructure, operational costs, and resourcing; challenges of sharing synthetic data; benefits and use cases of synthetic data; and organisational policy and procedures.
The data offers exploratory insights into how UK organisations are approaching synthetic data in practice and can inform future research, infrastructure development, and policy guidance in this evolving area.
The findings have informed recommendations to support the responsible and efficient scaling of synthetic data production across sectors.
The growing discourse around synthetic data underscores its potential not only for addressing data challenges in a fast-changing landscape but also for fostering innovation and accelerating advancements in data analytics and artificial intelligence. From optimising data sharing and utility (James et al., 2021), to sustaining and promoting reproducibility (Burgard et al., 2017), to mitigating disclosure (Nikolenko, 2021), synthetic data has emerged as a solution to various complexities of the data ecosystem.
The project proposes a mixed-methods approach and seeks to explore the operational, economic, and efficiency aspects of using low-fidelity synthetic data from the perspectives of data owners and Trusted Research Environments (TREs).
The essence of the challenge is in understanding the tangible and intangible costs associated with creating and sharing low-fidelity synthetic data, alongside measuring its utility and acceptance among data producers, data owners and TREs. The broader aim of the project is to foster a nuanced understanding that could potentially catalyse a shift towards a more efficient and publicly acceptable model of synthetic data dissemination.
This project is centred around three primary goals: 1. to evaluate the comprehensive costs incurred by data owners and TREs in the creation and ongoing maintenance of low-fidelity synthetic data, including the initial production of synthetic data and subsequent costs; 2. to assess the various models of synthetic data sharing, evaluating the implications and efficiencies for data owners and TREs, covering all aspects from pre-ingest to curation procedures, metadata sharing, and data discoverability; and 3. to measure the efficiency improvements for data owners and TREs when synthetic data is available, analysing impacts on resources, secure environment usage load, and the uptake dynamics between synthetic and real datasets by researchers.
Commencing in March 2024, the project will begin with stakeholder engagement, forming an expert panel and aligning collaborative efforts with parallel projects. Following a robust literature review, the project will embark on a methodical data collection journey through a targeted survey with data creators, case studies with data owners and providers of synthetic data, and a focus group with TRE representatives. The insights collected from these activities will be analysed and synthesised to draft a comprehensive report delineating the findings and sensible recommendations for scaling up the production and dissemination of low-fidelity synthetic data as applicable.
The potential applications and benefits of the proposed work are diverse. The project aims to provide a solid foundation for data owners and TREs to make informed decisions regarding synthetic data production and sharing. Furthermore, the findings could significantly influence future policy concerning data privacy thereby having a broader impact on the research community and public perception. By fostering a deeper understanding and establishing a dialogue among key stakeholders, this project strives to bridge the existing knowledge gap and push the domain of synthetic data into a new era of informed and efficient usage. Through meticulous data collection and analysis, the project aims to unravel the intricacies of low-fidelity synthetic data, aiming to pave the way for an efficient, cost-effective, and publicly acceptable framework of synthetic data production and dissemination.
Coastal resources are increasingly impacted by erosion, extreme weather events, sea-level rise, tidal flooding, and other potential hazards related to climate change. These hazards have varying impacts on coastal landscapes due to the numerous geologic, oceanographic, ecological, and socioeconomic factors that exist at a given location. Here, an assessment framework is introduced that synthesizes existing datasets describing the variability of the landscape and the hazards that may act on it to evaluate the likelihood of coastal change along the U.S. coastline within the coming decade. The pilot study, conducted in the Northeastern U.S. (Maine to Virginia), is comprised of datasets derived from a variety of federal, state, and local sources. First, a decision tree-based dataset is built that describes the fabric or integrity of the coastal landscape and includes landcover, elevation, slope, long-term (>150 years) shoreline change trends, dune height, and marsh stability data. A second database was generated from coastal hazards, which are divided into event hazards (e.g., flooding, wave power, and probability of storm overwash) and persistent hazards (e.g., relative sea-level rise rate, short-term (about 30 years) shoreline erosion rate, and storm recurrence interval). The fabric dataset is then merged with the coastal hazards databases, and a training dataset made up of hundreds of polygons is generated from the merged dataset to support a supervised learning classification. Results from this pilot study are location-specific at 10-meter resolution and are made up of four raster datasets that include (1) quantitative and qualitative information used to determine the resistance of the landscape to change, (2 & 3) the potential coastal hazards that act on it, (4) the machine learning output, or Coastal Change Likelihood (CCL), based on the cumulative effects of both fabric and hazards, and (5) an estimate of the hazard type (event or persistent) that is most likely to influence coastal change. Final outcomes are intended to be used as a first-order planning tool to determine which areas of the coast may be more likely to change in response to future potential coastal hazards, and to examine elements and drivers that make change in a location more likely.
https://www.gnu.org/licenses/gpl-3.0.html
The work consists of tools for the interaction between Wikidata and OBO Foundry and source codes for the use of MeSH keywords of PubMed publications for the enrichment of biomedical knowledge in Wikidata. This work is funded by the Adapting Wikidata to support clinical practice using Data Science, Semantic Web and Machine Learning Project within the framework of the Wikimedia Foundation Research Fund.
To cite the work: Turki, H., Chebil, K., Dossou, B. F. P., Emezue, C. C., Owodunni, A. T., Hadj Taieb, M. A., & Ben Aouicha, M. (2024). A framework for integrating biomedical knowledge in Wikidata with open biomedical ontologies and MeSH keywords. Heliyon, 10(19), e38488. doi:10.1016/j.heliyon.2024.e38448.
Wikidata-OBO
- tool1.py: A tool for the verification of the semantic alignment between Wikidata and OBO ontologies.
- frame.py: The layout of Tool 1.
- tool2.py: A tool for extracting Wikidata relations between OBO ontology items.
- frame2.py: The layout of Tool 2.
- tool3.py: A tool for extracting multilingual language data for OBO ontology items from Wikidata.
- frame4.py: The layout of Tool 3.
Wikidata-MeSH
- correct_mesh2matrix_dataset.py: A source code for turning MeSH2Matrix into a smaller dataset for the biomedical relation classification based on the MeSH keywords of PubMed publications, named MiniMeSH2Matrix.
- build_numpy_dataset.py: A source code for building the numpy files for MiniMeSH2Matrix (relation type-based classification).
- label_encoded.csv: A table for the conversion of Wikidata Property IDs into MeSH2Matrix Class IDs.
- new_encoding.csv: A table for the conversion of Wikidata Property IDs into MiniMeSH2Matrix Class IDs.
- super_classes_new_dataset_labels.npy: The NumPy file of the labels for the superclass-based classification.
- new_dataset_labels.npy: The NumPy file of the labels for the relation type-based classification.
- new_dataset_matrices.npy: The NumPy file of the MiniMeSH2Matrix matrices for biomedical relation classification.
- first_level_new_data.json: The JSON file for the conversion of relation types to superclasses.
- build_super_classes.py: A source code for building the numpy files for MiniMeSH2Matrix (superclass-based classification).
- FC_MeSH_Model_57_New_Data.ipynb: A Jupyter Notebook for training a dense model to perform the relation type-based classification.
- FC_MeSH_Model_57_New_Data_SuperClasses.ipynb: A Jupyter Notebook for training a dense model to perform the superclass-based classification.
- new_data_best_model_1: A stored edition of the best model for the relation type-based classification.
- new_data_super_classes_best_model_1: A stored edition of the best model for the superclass-based classification.
- MiniMeSH2Matrix_SuperClasses_Confusion_Matrix.ipynb: A Jupyter Notebook for generating the confusion matrix for the superclass-based supervised classification.
- MiniMeSH2Matrix_Supervised_Classification_Agreement.ipynb: A Jupyter Notebook for generating the matrix of agreement between the accurate predictions for superclass-based classification and the ones for relation type-based classification.
- Adding_References_to_Wikidata.ipynb: A Jupyter Notebook to identify the PubMed ID of relevant references to unsupported Wikidata statements between MeSH terms.
- MeSH_Statistics.xlsx: Statistical data about MeSH-based items and relations in Wikidata.
- ref_for_unsupported_statements.csv: Retrieved relevant PubMed references for 1k unsupported Wikidata statements.
- evaluate_pubmed_ref_assignment.ipynb: A Jupyter Notebook that generates statistics about reference assignment for a sample of 1k unsupported statements.
- MeSH_Verification.xlsx: A list of inaccurate or duplicated MeSH IDs in Wikidata, as of August 8th, 2023.
- WikiRelationsPMI.csv: A list of PMI values for the semantic relations between MeSH terms, as available in Wikidata.
- WikiRelationsPMIDistribution.xlsx: Distribution of PMI values for all Wikidata relations and for specific Wikidata relation types.
- WikiRelationsToVerify.xlsx: Wikidata relations needing attention because they involve Wikidata items with inaccurate MeSH IDs, they cannot be found in PubMed, or their PMI values are below the threshold of 2.
- Mesh_part1.py: A Python code that verifies the accuracy of the MeSH IDs for the Wikidata items.
- MeshWikiPart.py: A Python code that computes the pointwise mutual information values for Wikidata relations between MeSH keywords based on PubMed.
- Demo.ipynb: A demo of the MeSH-based biomedical relation validation and classification in French.
- Id_Term.json: A dict of Medical Subject Headings labels corresponding to MeSH Descriptor IDs.
- dict_mesh.json: Number of the occurrences of MeSH keywords in PubMed.
- finalmatrix.xlsx: Matrix of PMI values between the 5k most common MeSH keywords.
- finalmatrixrev.pkl: Pickle file edition of the PMI matrix.
- pmi2.xlsx: List of significant PMI associations between the 5k most common MeSH keywords reaching a threshold of 2.
- Generate5kMatrix.py: A Python code that generates the PMI matrix.
- clean_pmi2.py: A Python code to remove the relations already available in Wikidata from pmi.xlsx.
- missing_rels.xlsx: The final list of the significant PMI associations that do not exist in Wikidata.
- item_category.json: A dict of MeSH tree categories corresponding to MeSH items.
- item_categorization.py: A Python code that generates a dict of MeSH tree categories corresponding to MeSH items.
- classification.py: A Python code for classifying PMI-generated semantic relations between the most common MeSH keywords.
- results.xlsx: The output of the classification of the PMI-generated semantic relations between the most common MeSH keywords.
- ClassificationStats.ipynb: A Jupyter Notebook for generating statistical data about the classification.
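For readers unfamiliar with the measure used in MeshWikiPart.py and Generate5kMatrix.py, the sketch below shows pointwise mutual information computed from simple occurrence counts. The counts are toy placeholders and the snippet is not code from the repository; only the threshold of 2 reflects the description above.

```python
import math

def pmi(co_count: int, count_x: int, count_y: int, n_docs: int) -> float:
    """Pointwise mutual information of two MeSH keywords from PubMed counts:
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )."""
    p_xy = co_count / n_docs
    p_x = count_x / n_docs
    p_y = count_y / n_docs
    return math.log2(p_xy / (p_x * p_y))

# Toy counts (not real PubMed statistics): the pair is kept as a candidate
# relation only when its PMI reaches the threshold of 2 mentioned above.
score = pmi(co_count=800, count_x=5_000, count_y=4_000, n_docs=1_000_000)
print(score, score >= 2)
```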
Open Database License (ODbL) v1.0 - https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
To evaluate land use and land cover (LULC) maps an independent and representative test dataset is required. Here, a test dataset was generated via a stratified random sampling approach across all areas in Fiji not used to generate training data (i.e. all Tikinas which did not contain a training data point were valid for sampling to generate the test dataset). Following equation 13 in Olofsson et al. (2014), the sample size of the test dataset was 834. This was based on a desired standard error of the overall accuracy score of 0.01 and a user's accuracy of 0.75 for all classes. The strata for sampling test samples were the eight LULC classes: water, mangrove, bare soil, urban, agriculture, grassland, shrubland, and trees.
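For reference, equation 13 of Olofsson et al. (2014) has the following general form (shown here as a reminder from the cited paper; the exact strata weights used to arrive at 834 samples are not restated in this description):

```latex
% n = sample size, W_i = areal proportion (weight) of stratum i,
% U_i = anticipated user's accuracy of class i,
% S(\hat{O}) = target standard error of the estimated overall accuracy.
n \approx \left( \frac{\sum_{i} W_{i} S_{i}}{S(\hat{O})} \right)^{2},
\qquad S_{i} = \sqrt{U_{i}\,(1 - U_{i})}
```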
There are different strategies for allocating samples to strata for evaluating LULC maps, as discussed by Olofsson et al. (2014). Equal allocation of samples to strata ensures coverage of rarely occurring classes and minimise the standard error of estimators of user's accuracy. However, equal allocation does not optimise the standard error of the estimator of overall accuracy. Proportional allocation of samples to strata, based on the proportion of the strata in the overall dataset, can result in rarely occurring classes being underrepresented in the test dataset. Optimal allocation of samples to strata is challenging to implement when there are multiple evaluation objectives. Olofsson et al. (2014) recommend a "simple" allocation procedure where 50 to 100 samples are allocated to rare classes and proportional allocation is used to allocate samples to the remaining majority classes. The number of samples to allocate to rare classes can be determined by iterating over different allocations and computing estimated standard errors for performance metrics. Here, the 2021 all-Fiji LULC map, minus the Tikinas used for generating training samples, was used to estimate the proportional areal coverage of each LULC class. The LULC map from 2021 was used to permit comparison with other LULC products with a 2021 layer, notably the ESA WorldCover 10m v200 2021 product.
The 2021 LULC map was dominated by the tree class (74% of the area classified) and the remaining classes had less than 10% coverage each. Therefore, a "simple" allocation of 100 samples to the seven minority classes and an allocation of 133 samples to the tree class was used. This ensured all the minority classes had sufficient coverage in the test set while balancing the requirement to minimise standard errors for the estimate of overall accuracy. The allocated number of test dataset points were randomly sampled within each strata and were manually labelled using 2021 annual median RGB composites from Sentinel-2 and Planet NICFI and high-resolution Google Satellite Basemaps.
The Fiji LULC test data is available in GeoJSON format in the file fiji-lulc-test-data.geojson. Each point feature has two attributes: ref_class (the LULC class manually labelled and quality checked) and strata (the strata the sampled point belongs to derived from the 2021 all-Fiji LULC map). The following integers correspond to the ref_class and strata labels:
When evaluating LULC maps using test data derived from a stratified sample, the nature of the stratified sampling needs to be accounted for when estimating performance metrics such as overall accuracy, user's accuracy, and producer's accuracy. This is particularly so if the strata do not match the map classes (i.e. when comparing different LULC products). Stehman (2014) provides formulas for estimating performance metrics and their standard errors when using test data with a stratified sampling structure.
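Independently of the package introduced below, the sketch that follows illustrates the idea behind Stehman's (2014) stratified estimator of overall accuracy: each test point is weighted by the number of map pixels in its stratum divided by the number of test samples drawn from that stratum. The strata pixel totals in the example are made-up placeholders, not the values for this dataset.

```python
from collections import Counter

def stratified_overall_accuracy(samples, strata_totals):
    """Stehman (2014)-style estimate of overall accuracy from a stratified sample.

    samples: list of (stratum, map_class, ref_class) for each labelled test point.
    strata_totals: dict mapping stratum -> total number of pixels in that stratum.
    """
    n_h = Counter(s for s, _, _ in samples)             # test points per stratum
    total_pixels = sum(strata_totals.values())
    estimate = 0.0
    for stratum, map_class, ref_class in samples:
        weight = strata_totals[stratum] / n_h[stratum]  # pixels represented by this point
        estimate += weight * (map_class == ref_class)
    return estimate / total_pixels

# Toy example with made-up strata sizes (not the Fiji values).
samples = [("tree", "tree", "tree"), ("tree", "tree", "grassland"),
           ("water", "water", "water"), ("urban", "urban", "bare soil")]
strata_totals = {"tree": 900_000, "water": 50_000, "urban": 50_000}
print(stratified_overall_accuracy(samples, strata_totals))  # 0.5
```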
To support LULC accuracy assessment a Python package has been developed which provides implementations of Stehman's (2014) formulas. The package can be installed via:
pip install lulc-validation
with documentation and examples here.
In order to compute performance metrics accounting for the stratified nature of the sample the total number of points / pixels available to be sampled in each strata must be known. For this dataset that is:
This dataset was generated with support from a Climate Change AI Innovation Grant.
The Space-based Imaging Spectroscopy and Thermal pathfindER (SISTER) activity originated in support of the NASA Earth System Observatory's Surface Biology and Geology (SBG) mission to develop prototype workflows with community algorithms and generate prototype data products envisioned for SBG. SISTER focused on developing a data system that is open, portable, scalable, standards-compliant, and reproducible. This collection contains EXPERIMENTAL workflows and sample data products, including (a) the Common Workflow Language (CWL) process file and a Jupyter Notebook that run the entire SISTER workflow capable of generating experimental sample data products spanning terrestrial ecosystems, inland and coastal aquatic ecosystems, and snow, (b) the archived algorithm steps (as OGC Application Packages) used to generate products at each step of the workflow, (c) a small number of experimental sample data products produced by the workflow which are based on the Airborne Visible/Infrared Imaging Spectrometer-Classic (AVIRIS or AVIRIS-CL) instrument, and (d) instructions for reproducing the sample products included in this dataset. DISCLAIMER: This collection contains experimental workflows, experimental community algorithms, and experimental sample data products to demonstrate the capabilities of an end-to-end processing system. The experimental sample data products provided have not been fully validated and are not intended for scientific use. The community algorithms provided are placeholders which can be replaced by any user's algorithms for their own science and application interests. These algorithms should not in any capacity be considered the algorithms that will be implemented in the upcoming Surface Biology and Geology mission.
Attribution 3.0 (CC BY 3.0) - https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The Horizon 2020 programme supports access to and reuse of research data generated by Horizon 2020 projects through the Open Research Data Pilot (ORDP). To support the validation of scientific results, the pilot focuses on providing access to the data needed to validate those results. There are several types of such data, e.g. machine learning data sets, models, measurements, statistical results of experiments, survey outcomes, etc. This deliverable summarizes the data that are expected to be collected in the course of the project and where and how they are stored. The aspect of providing open access to research data (as required by the European Commission's Open Research Data Pilot, https://www.openaire.eu/what-is-the-open-research-data-pilot) is addressed in Section 3. Finally, in Section 4 we describe the data sets that were or are expected to be generated within the TRINITY project and made freely available.